
Topics in Non-Life Insurance Mathematics

lecture notes to Fm2


Ragnar Norberg
September 2, 2001

Contents
1 Bayes methods
  1.1 Decisions under uncertainty
  1.2 Estimation by square loss
  1.3 Hilbert space methods

2 Empirical Bayes methods
  2.1 Introduction
  2.2 The decision problem
  2.3 Empirical Bayes decision functions

3 Hierarchical credibility

4 Parameter estimation in credibility
  4.1 Empirical linear Bayes estimation
  4.2 An example from group life insurance
  4.3 Estimation in the compound Poisson model

5 Group life insurance

6 Bonus systems
  6.1 Basic definitions
  6.2 Optimal design of bonus systems

7 Claims reserving
  7.1 Introduction
  7.2 The claims process
  7.3 Applications
  7.4 Prediction of outstanding claims
  7.5 Predicting the outstanding part of reported claims
  7.6 Parameter estimation

8 Utility Theory
  8.1 The expected utility hypothesis
  8.2 The zero utility premium
  8.3 Optimal insurance
  8.4 The position of the insurer
  8.5 Pareto-optimal risk exchanges

A Hilbert spaces
  A.1 Metric spaces
  A.2 Vector spaces
  A.3 Hilbert spaces
  A.4 Special Hilbert spaces

B Matrix algebra

C Concave and convex functions

Chapter 1

Bayes methods
1.1 Decisions under uncertainty

A. Decision problems without observations. In virtually every aspect of human activity (science, business, and everyday life) we encounter situations where a decision has to be made and the consequences of the decision depend on circumstances that are uncontrollable and not fully known. For instance, you are to decide whether or not to buy a ticket for tomorrow's soccer match without knowing how the weather will turn out: if nice, you will enjoy the game at full value; if nasty, you will stay at home and lose the value of the ticket. Or you are to set an automobile insurance premium without knowing the skill and temper of the driver: if he is accident-prone, the premium ought to be high; if he is a safe driver, the premium ought to be low.
The theory of decisions under uncertainty aims at formulating problems of the mentioned type as well-posed mathematical optimization problems and at solving these. It sets out with the following formal definitions: The basic constituents of a decision problem are a set D of possible decisions, a set T of possible circumstances or states of the nature, and a loss function L : D × T → R specifying the loss L(d, θ) incurred by taking the decision d ∈ D when the true state of the nature is θ ∈ T.
The true state of the nature is unknown. The only thing we know for sure is that it is in T. Therefore, each individual decision d must be judged by its entire set of associated losses, that is, by the function L(d, ·) : T → R.
In a first attempt to narrow down the set of useful decisions, we could easily agree on discarding a decision d if there exists another decision d′ such that L(d′, θ) ≤ L(d, θ) for all θ ∈ T and L(d′, θ) < L(d, θ) for some θ. If no such d′ exists, then d is called admissible. In most situations there are many admissible decisions and, in particular, there will be no uniformly optimal decision (best against all θ). Thus, requiring admissibility is not enough to elect an uncontested winner amongst all decisions.
We need to supply a criterion that assigns to each decision a single numerical value that summarizes its loss performance, thus introducing an order relation that would enable us to tell which is the better of any two decisions and, in particular, to find a best decision. More precisely, we need to define a suitable real-valued functional on the space of functions {L(d, ·); d ∈ D}. We shall consider two candidate criteria, the maximum loss and the average loss, and anticipate that the latter will be chosen to carry the theory further.
B. The minimax principle. One possibility is to judge a decision d by its maximum loss, sup_{θ ∈ T} L(d, θ). The best decision by this criterion, if it exists, is accordingly called the minimax decision. There are situations where the minimax principle does not deliver a solution, e.g. when L(d, ·) is unbounded for all d.
The maximum loss criterion judges a decision by its loss in the worst case
only and does not pay any attention to its performance against other possible
cases. It reflects a pessimistic prepare-for-the-worst attitude on the part of the
decision maker.
C. The Bayes principle. Another criterion, which in a way expresses a more nuanced view, is the so-called Bayes criterion, which measures the performance of a decision by a weighted average of its associated losses. More precisely, we place a measure G on T (or rather on a suitable sigma-algebra in T, which we need not visualize here), and define the risk of d against G as ∫_T L(d, θ) dG(θ). It is convenient to introduce the density g of G with respect to some basic measure μ and write the risk as

    ρ_g(d) = ∫_T L(d, θ) g(θ) dμ(θ) ,                             (1.1)

thus interpreting g(θ) as the weight attached to the state θ in the weighted sum of losses.
The Bayes risk against g is defined as

    ρ_g = inf_{d ∈ D} ρ_g(d) .                                     (1.2)

A decision d_g such that ρ_g(d_g) = ρ_g is called a Bayes decision against g. Usually it exists and is unique, and it is then the optimal decision.
Suppose G is a finite measure. Then, without loss of generality, we can take it to be a probability measure and thus view the true state of the nature as a r.v. θ with distribution G. A possible interpretation would be that G expresses our beliefs concerning the uncertain state of the nature. We shall call G a prior distribution since, under the present interpretation, it represents an opinion formed prior to the decision and the experiences that may be obtained from it. It is common usage to call G a prior (distribution) even if it is not a probability distribution. By inspection of (1.1) we realize that there is an element of arbitrariness in the specification of the loss function and the prior since the integrand depends only on the product of the two. Thus, we could always redefine the loss and the prior so as to make the latter a probability distribution. This observation reflects the fact that both the loss function and the prior are subjective elements in the set-up, and they cannot always be separated from one another. For the Bayes principle to work, the main thing is that the risk (1.1) should be finite for at least one decision.
We shall illustrate the concepts with some examples.
D. Testing hypotheses. Suppose we are interested only in deciding whether the true state of the nature is in a subset H0 ⊂ T or not. The notation H0 suggests that this is the null hypothesis we want to test against the alternative H1 = T \ H0, but such notions will be meaningful only when there is some empirical evidence present. We shall come back to that later on.
There are two decisions, D = {d0, d1}, where d0 means accept H0 and d1 means accept H1 (and reject H0). Let the loss function be the simple one that expresses our yes-or-no attitude to the situation: L(d_i, θ) = a_i 1_{T \ H_i}(θ), i = 0, 1, where the a_i are strictly positive.
The risk by decision d_i is ρ_g(d_i) = a_i (1 − P_g[θ ∈ H_i]), and so the Bayes decision against g is

    d_g = d_i   if   P_g[θ ∈ H_i] ≥ a_i / (a_0 + a_1) ,   i = 0, 1.            (1.3)

In the special case of simple hypotheses, H_i = {θ_i}, i = 0, 1, we take T = {θ_0, θ_1}, G{θ_i} = g_i, i = 0, 1, and μ{θ_i} = 1, i = 0, 1. Then (1.3) specializes to

    d_g = d_0   if   g_0 / g_1 ≥ a_0 / a_1 ,                                   (1.4)

and d_g = d_1 otherwise.
E. Point estimation. Just to ease notation, assume T ⊂ R, and suppose we want to estimate θ. The natural set of decisions is D = T. As loss function we could reasonably take L(d, θ) = ℓ(d − θ) for some function ℓ : R → R_+ that is convex and assumes its minimum (0, say) at 0.
A much used candidate is the square loss, L(d, θ) = (d − θ)². The risk of a decision d is

    ρ_g(d) = E[(d − θ)²] = ∫ (d − θ)² dG(θ) ,

which is finite if G is a probability distribution with finite second moment. We easily find that the Bayes solution is given by

    d_g = E[θ] ,   ρ_g = V[θ] .                                                (1.5)

An alternative is the absolute loss, L(d, θ) = |d − θ|. The risk of a decision d is

    ρ_g(d) = ∫_{θ ≤ d} (d − θ) dG(θ) + ∫_{θ > d} (θ − d) dG(θ) ,

which is finite if G is a probability distribution with finite first moment. It is easily verified that the Bayes estimator is the median of the prior distribution:

    d_g = inf{θ ; G(θ) ≥ 1/2} .                                                (1.6)

F. Decision problems with observations. Suppose we have at our disposal an observation X (real-valued, vector-valued, or more general) that contains some partial information about the state of the nature. This notion is made precise by letting X be a random element whose distribution depends on θ and is denoted accordingly by F(x|θ). We shall assume that F(x|θ) has a density f(x|θ) w.r.t. some basic measure φ on (a suitable sigma-algebra in) X. The decision should be based on the available information and thus be a function of X. Formally, a decision function is a (measurable) function δ : X → D. The decision delivered by the rule is δ(X), an r.v., and the associated loss is L(δ(X), θ), also an r.v.
In statistical decision theory a decision function is judged by its risk function

    ρ(δ; θ) = E_θ[L(δ(X), θ)] = ∫_X L(δ(x), θ) f(x|θ) dφ(x) ,                  (1.7)

which summarizes the performance of δ for each fixed θ.
We are in the same situation as in the case without observations, now with ρ(δ; θ) in the role of L(d, θ). Thus, a decision function δ is said to be admissible if there exists no other decision function δ′ such that ρ(δ′; θ) ≤ ρ(δ; θ) for all θ ∈ T and ρ(δ′; θ) < ρ(δ; θ) for some θ. Again, in general, there are many admissible rules, and there exists no rule that is uniformly optimal. In some simple situations one can successfully construct restricted optimal solutions, e.g. a uniformly minimum variance unbiased estimator, a uniformly most powerful unbiased or invariant test, or the like. An alternative approach, which in principle always will enable us to find a best rule, is to measure the overall performance by some functional of the risk function. The concepts of minimax and Bayes defined for simple decisions carry over straightforwardly to decision functions and, in particular, the Bayes criterion judges a decision function by its (overall) risk

    ρ_g(δ) = ∫_T ρ(δ; θ) g(θ) dμ(θ) .                                          (1.8)

If it exists, the Bayes decision function against g is the one that minimizes this overall risk. It will be denoted δ_g, and the minimum overall risk will be called the Bayes risk and will be denoted by ρ_g = ρ_g(δ_g).
In this set-up θ is considered as the outcome of a random variable with density g, and the likelihood f(x|θ) is considered as the conditional density of X, given θ. The joint density of (X, θ) is f(x|θ) g(θ), the marginal density of X is

    f(x) = ∫_T f(x|θ) g(θ) dμ(θ) ,

and the conditional density of θ, given X = x, is

    g(θ|x) = f(x|θ) g(θ) / f(x) .                                              (1.9)

The prior density g expresses our judgement prior to observation. The conditional density g(·|x) is called the posterior density since it represents our judgement post (after) observation.
Construction of the Bayes solution goes as follows. Insert the expression (1.7) into (1.8) and change the order of integration to obtain

    ρ_g(δ) = ∫_X { ∫_T L(δ(x), θ) g(θ|x) dμ(θ) } f(x) dφ(x) .

It is seen that minimizing ρ_g(δ) amounts to minimizing the inner integral for each x, which is formally the same as finding the Bayes solution against g(·|x). Thus, we are in the same situation as before, now with the prior replaced by the posterior, which means that our beliefs about the nature have been updated in view of the empirical evidence.
We conclude that the Bayes solution, when it exists, is given by

    δ_g(x) = d_{g(·|x)} ,                                                      (1.10)
    ρ_g = ∫_X ρ_{g(·|x)} f(x) dφ(x) = E[ρ_{g(·|X)}] .                          (1.11)

When the prior is taken as fixed, we shall not always display it in the notation.
G. Testing hypotheses (continued). We consider again the testing problems in Paragraph D above. In the presence of observations the prior distribution is just to be replaced with the posterior. In particular, in (1.4) we should replace g_i with g_i(x) = g_i f(x|θ_i)/f(x) and obtain that the optimal decision rule is given by

    δ_g(x) = d_0   if   f(x|θ_0) / f(x|θ_1) ≥ (g_1 a_0) / (g_0 a_1) ,

and δ_g(x) = d_1 otherwise. This is the well-known Neyman-Pearson result from classical test theory.
H. Point estimation (continued). The problem of estimation by square loss, considered in Paragraph E above, has the following solution in the presence of observations: the Bayes estimator is θ̃ = θ̃(X) = E[θ|X], and the Bayes risk is

    ρ̃ = E V[θ|X] = V[θ] − V[θ̃] .

I. Poisson distributed observations. Let X_j, j = 1, . . . , n, be independent, with each X_j distributed according to the Poisson law with parameter θ p_j. Here the p_j are fixed, known numbers, whereas θ is an unknown parameter. Put X_• = Σ_j X_j and p_• = Σ_j p_j. The joint density (w.r.t. counting measure) of X = (X_1, . . . , X_n)′ is

    f(x|θ) = ( ∏_{j=1}^n p_j^{x_j} / x_j! ) θ^{x_•} e^{−θ p_•} ,               (1.12)

x_j ∈ N_+, j = 1, . . . , n, θ ∈ (0, ∞).
Consider the problem of estimating θ. A traditional solution would be the ML (maximum likelihood) estimator,

    θ̂ = X_• / p_• ,                                                           (1.13)

which is also UMVU (uniformly minimum variance unbiased).
This estimator is poor if p_• is small; its variance is θ/p_•. The problem becomes acute when there is no experience X. Invoking the Bayes principle, we would view the true state of the nature as an r.v. θ. Suppose we take the prior to be Ga(α, β), the gamma distribution with density

    g_{α,β}(θ) = β^α θ^{α−1} e^{−βθ} / Γ(α) ,   θ > 0.                         (1.14)

The conditional density of θ, given X, is proportional to the product of the expressions in (1.12) and (1.14), hence

    g_{α,β}(θ|X) = g_{X_•+α, p_•+β}(θ) ,

the density of Ga(X_• + α, p_• + β).
Using the easy result

    ∫ θ^x e^{−θp} g_{α,β}(θ) dθ = Γ(x + α) β^α / ( Γ(α) (p + β)^{x+α} ) ,

we find

    E[θ|X] = (X_• + α) / (p_• + β) ,                                           (1.15)
    V[θ|X] = (X_• + α) / (p_• + β)² .                                          (1.16)

The conditional mean in (1.15) is the Bayes estimator θ̃ (say) of θ with respect to square loss. It can be cast as

    θ̃ = ζ θ̂ + (1 − ζ) α/β ,                                                  (1.17)

where θ̂ is the maximum likelihood estimator (1.13) in the conditional model, given θ, and

    ζ = p_• / (p_• + β) .                                                      (1.18)

The expression in (1.17) is a credibility weighted mean of the sample estimator θ̂ and the unconditional mean, E[θ] = α/β. The credibility ζ is an increasing function of the total exposure p_• and of β^{−1} = V[θ]/E[θ].
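As a small numerical illustration (not part of the original notes), the following Python sketch evaluates the ML estimator (1.13), the credibility weight (1.18), the Bayes estimate (1.17), and the posterior variance (1.16) for made-up exposures, claim counts, and gamma prior parameters α, β:

    import numpy as np

    # Hypothetical data for one risk: exposures p_j and Poisson claim counts X_j.
    p = np.array([0.8, 1.0, 1.2, 0.5])          # exposure per year
    X = np.array([1, 0, 2, 0])                  # observed claim counts
    alpha, beta = 2.0, 20.0                     # assumed Ga(alpha, beta) prior, E[theta] = 0.1

    p_tot, X_tot = p.sum(), X.sum()
    theta_ml = X_tot / p_tot                    # ML / UMVU estimator (1.13)
    zeta = p_tot / (p_tot + beta)               # credibility weight (1.18)
    theta_bayes = zeta * theta_ml + (1 - zeta) * alpha / beta   # Bayes estimator (1.17)
    post_var = (X_tot + alpha) / (p_tot + beta) ** 2            # posterior variance (1.16)
    print(theta_ml, zeta, theta_bayes, post_var)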


J. Binomially distributed observations. Let X_j, j = 1, . . . , n, be i.i.d. binomial variates with P[X_j = 1] = 1 − P[X_j = 0] = θ ∈ [0, 1]. The joint density (w.r.t. counting measure) of X = (X_1, . . . , X_n)′ is

    f(x|θ) = ∏_{j=1}^n θ^{x_j} (1 − θ)^{1−x_j} = θ^{x_•} (1 − θ)^{n−x_•} .     (1.19)

The ML estimator of θ is

    θ̂ = X_• / n ,                                                             (1.20)

which is also UMVU. This estimator is poor if n is small; its variance is θ(1 − θ)/n.
Let us use the Bayes principle, taking the prior to be Be(α, β), the beta distribution with density

    g_{α,β}(θ) = ( Γ(α + β) / (Γ(α)Γ(β)) ) θ^{α−1} (1 − θ)^{β−1} ,   θ ∈ (0, 1) .   (1.21)

The conditional density of θ, given X, is proportional to the product of the expressions in (1.19) and (1.21), hence

    g_{α,β}(θ|X) = g_{X_•+α, n−X_•+β}(θ) ,

the density of Be(X_• + α, n − X_• + β).
It is left as an exercise to the reader to complete this example along the lines of the previous paragraph.
K. Conjugate priors. In the previous two paragraphs a lucky circumstance was that the posterior belonged to the same well structured family as the prior. Thus, posterior formulas could be obtained from prior formulas by just updating the parameters. Generally, we say that a family G of prior densities is closed under sampling from the family of likelihoods F = {f(·|θ) ; θ ∈ T} if g(·|x) ∈ G for all x ∈ X and all g ∈ G.
Inspection of the previous two examples suggests the following general construction. Switch the roles of observation and parameter in the likelihood and, for each x_0 ∈ X, consider g_{x_0}(θ) = f(x_0|θ) / ∫_T f(x_0|θ′) dμ(θ′) as a candidate prior density (assuming the integral in the denominator is finite). The densities thus constructed constitute what is called the class of priors induced by F. The induced class of priors is closed under sampling from F since g_{x_0}(θ|x) ∝ f(x|θ) g_{x_0}(θ) ∝ f(x|θ) f(x_0|θ), which is in F since it is the density of the combined sample of independent observations x and x_0. Finally, if necessary in order to obtain a mathematically tractable class of priors, extend the induced class to its natural parameter space and check if closedness under sampling is preserved under the extension; if yes, the result is what we call the natural conjugate class of priors for F.

1.2 Estimation by square loss

A. Some definitions and notational conventions. In the previous section r.v.-s were denoted by upper case letters, and values assumed by them were denoted by corresponding lower case letters. This distinction is useful in contexts where we are working explicitly with the probability distributions. It will not be important in the rest of this section since we shall be working mainly with the expectation operator, seeing only the r.v.-s. Moreover, we shall be working mainly in Euclidean spaces where it is important to distinguish between scalars, vectors, and matrices because they are subject to different algebraic rules. Thus, we adopt the notation of Appendix B, letting the context and the symbols speak for themselves under the following clues: unobservable quantities are usually denoted by Greek letters; observable r.v.-s are typically denoted by x.
At the base of everything is a probability space (Ω, F, P). The expectation operator and the covariance operator are denoted by E and C, respectively, and are defined for random vectors x (m × 1) = (x_1, . . . , x_m)′ and y (n × 1) = (y_1, . . . , y_n)′, say, by E[x] = (E[x_1], . . . , E[x_m])′ = ∫ x(ω) dP(ω) (integration performed entrywise) and C[x, y′] = (C[x_i, y_j]) = E[xy′] − E[x] E[y′]. We abbreviate variance matrices by V[x] = C[x, x′]. Recall the general rules

    E[x] = E{E[x | θ]}                                                         (1.22)

(iterated expectation) and

    C[x, y′] = C{E[x | θ], E[y′ | θ]} + E{C[x, y′ | θ]} ,                      (1.23)

a direct consequence of (1.22).


B. Stating the estimation problem. Let m (s × 1) be a square integrable random vector representing some unobservable quantity we want to estimate, and let x (n × 1) be a random vector representing the available observations. An s-dimensional function m̂ = m̂(x) that is used as a data-based approximation to the unknown m is called an estimator of m. We adopt the generalized square loss function

    L(m̂, m) = (m − m̂)′ A (m − m̂) = Σ_{i,k} a_{ik} (m_i − m̂_i)(m_k − m̂_k) ,  (1.24)

where A (s × s) = (a_{ik}) is some p.d.s. non-random matrix. Viewing the scalar expression in (1.24) as a 1 × 1 matrix, which is the same as its trace, we can rewrite it as

    L(m̂, m) = tr( A (m − m̂)(m − m̂)′ ) .                                      (1.25)

The overall risk of the estimator is the (generalized) mean square error (MSE),

    ρ(m̂) = E[(m − m̂)′ A (m − m̂)] = Σ_{i,k} a_{ik} E[(m_i − m̂_i)(m_k − m̂_k)] .   (1.26)

In view of (1.25), we can rewrite (1.26) as

    ρ(m̂) = tr( A R(m̂) ) ,                                                     (1.27)

where

    R(m̂) = E[(m − m̂)(m − m̂)′]                                                (1.28)

is called the risk matrix of m̂.

C. The unrestricted Bayes estimator. We first seek the optimal estimator in the class of all square integrable estimators. Upon subtracting and adding E[m|x] in each factor (m − m̂) in (1.26), and then multiplying out, we get

    ρ(m̂) = E[(m − E[m|x])′ A (m − E[m|x])]
          + 2 E[(m − E[m|x])′ A (E[m|x] − m̂)]
          + E[(E[m|x] − m̂)′ A (E[m|x] − m̂)] .                                 (1.29)

Forming iterated expectation E[·] = E E[· | x], we find that the second term on the right of (1.29) vanishes. Thus, m̂ appears only in the last term, which obviously is minimized and becomes 0 by taking m̂ = m̃ defined by

    m̃ = E[m|x] .                                                              (1.30)

Then, what remains is the first term, which is the minimum risk. By iterated expectation and by virtue of (1.27) it can be cast as

    ρ̃ = tr( A R̃ ) ,                                                          (1.31)

where

    R̃ = E V[m|x] .                                                            (1.32)

We call m̃, ρ̃, and R̃ the Bayes estimator, the Bayes risk, and the Bayes risk matrix, respectively, and together they constitute the (unrestricted) Bayes solution to the estimation problem.
The solution is meaningful also if we have no observations, and it then reduces to m̃ = E[m] and R̃ = V[m].
By the general rule (1.23) we have V[m] = V E[m|x] + E V[m|x], hence (1.32) can be rewritten as

    R̃ = V[m] − V[m̃] .

We see that the Bayes risk matrix is the risk matrix by no observations less the variance of the Bayes estimator.
It is noteworthy that the weight matrix A does not appear in the Bayes estimator and risk matrix. The weighting affects only the Bayes risk given by (1.31).
D. Linear estimators. Consider the class of estimators of the inhomogeneous linear form

    m̂ = g + G x ,                                                             (1.33)

where g (s × 1) = (g_{10}, . . . , g_{s0})′ and G (s × n) = (g_{ij}) are constants. Inserting m̂_i = g_{i0} + Σ_{j=1}^n g_{ij} x_j into (1.26), we get

    ρ(m̂) = Σ_{i,k} a_{ik} E[ (m_i − g_{i0} − Σ_{j=1}^n g_{ij} x_j)(m_k − g_{k0} − Σ_{ℓ=1}^n g_{kℓ} x_ℓ) ] .   (1.34)

This is a positive definite quadratic form in the coefficients g_{ij}, which is easily minimized. First, form the derivatives (recall that A is symmetric so that a_{ij} = a_{ji})

    ∂ρ(m̂)/∂g_{i0} = 2 Σ_k a_{ik} E[ (−1)(m_k − g_{k0} − Σ_{ℓ=1}^n g_{kℓ} x_ℓ) ] ,   i = 1, . . . , s,

    ∂ρ(m̂)/∂g_{ij} = 2 Σ_k a_{ik} E[ (−x_j)(m_k − g_{k0} − Σ_{ℓ=1}^n g_{kℓ} x_ℓ) ] ,   i = 1, . . . , s, j = 1, . . . , n.

The optimal coefficients are the solution to the equations obtained by equating all the derivatives to zero. In matrix form the equations become

    A E[m − g − G x] = 0 (s × 1) ,
    A E[(m − g − G x) x′] = 0 (s × n) .

Since A is of full rank, we may cancel it in both equations. Then, upon post-multiplying the first equation with E[x′] and subtracting the result from the second equation, we obtain the equivalent equations

    γ = E[m] − Γ E[x] ,   Γ = C[m, x′] V[x]^{−1} .

We arrive at the following linear Bayes (LB) solution: The LB estimator is

    m̂ = γ + Γ x = E[m] + C[m, x′] V[x]^{−1} (x − E[x]) .                      (1.35)

The LB risk is easily calculated and is

    ρ̂ = tr( A R̂ ) ,                                                          (1.36)

where R̂ is the LB risk matrix defined by

    R̂ = V[m] − C[m, x′] V[x]^{−1} C[x, m′] .                                  (1.37)

We see that the LB risk matrix is the risk matrix by no observations less the variance of the linear Bayes estimator, confer the similar remark at the end of Paragraph C above.
The LB solution depends on the joint distribution of m and x only through their first and second order unconditional moments. The LB estimator is the sum of the prior Bayes estimate E[m] based on no observations and an adjustment term which depends on the deviation of the observations from their mean. The magnitude of the adjustment depends on the variation and covariation of the estimand and the observations: the stronger the covariation, the greater the adjustment; the larger the variance of the observations, the smaller the adjustment.
We observe again that the weight matrix A does not appear in the LB estimator and the LB risk matrix, but it appears in the LB risk (1.36).
Obviously, the estimator and the risk matrix are the basic entities of any Bayes solution. Thus, for the mere purpose of constructing m̃ or m̂, one could as well put A = I, whereby the risk reduces to Σ_{k=1}^s E[(m_k − m̂_k)²]. Consequently, the Bayes estimator can be constructed componentwise, using ordinary quadratic loss for the scalar-valued estimands.
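As a small sketch (not in the original notes), the following Python snippet evaluates the LB estimator (1.35) and the LB risk matrix (1.37) from given first and second order moments; the numerical moments below are purely illustrative assumptions:

    import numpy as np

    def linear_bayes(E_m, V_m, E_x, V_x, C_mx, x):
        # LB estimator (1.35) and LB risk matrix (1.37) from first/second moments.
        G = C_mx @ np.linalg.inv(V_x)          # Gamma = C[m, x'] V[x]^{-1}
        m_hat = E_m + G @ (x - E_x)            # (1.35)
        R_hat = V_m - G @ C_mx.T               # (1.37)
        return m_hat, R_hat

    # Hypothetical moments for a 1-dimensional estimand and 2 observations.
    E_m = np.array([0.10]); V_m = np.array([[0.04]])
    E_x = np.array([0.10, 0.10])
    V_x = np.array([[0.09, 0.04], [0.04, 0.09]])
    C_mx = np.array([[0.04, 0.04]])
    x = np.array([0.20, 0.05])
    print(linear_bayes(E_m, V_m, E_x, V_x, C_mx, x))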
E. The regression model. Assume that x (n × 1) is square integrable with conditional first and second order moments of the form

    E[x|θ] = Y b(θ) ,                                                          (1.38)
    V[x|θ] = V(θ) ,                                                            (1.39)

with n and Y known and non-stochastic, and b (q × 1) and V (n × n) some functions of an unobservable r.v. θ.
Introduce

    β = E[b] ,   Λ = V[b] ,   Φ = E[V] .                                       (1.40)

The moments required in the LB estimator of b are β, Λ, C[b, x′] = Λ Y′, E[x] = Y β, and V[x] = Y Λ Y′ + Φ. Inserting these expressions into (1.35) and (1.37), one obtains the LB solution

    b̂ = Λ Y′ (Y Λ Y′ + Φ)^{−1} x + ( I − Λ Y′ (Y Λ Y′ + Φ)^{−1} Y ) β ,       (1.41)
    R̂ = Λ − Λ Y′ (Y Λ Y′ + Φ)^{−1} Y Λ .                                      (1.42)

By use of Lemma 1 in Appendix B we easily obtain the equivalent expressions

    b̂ = (I − Z) ( Λ Y′ Φ^{−1} x + β ) ,                                       (1.43)
    R̂ = (I − Z) Λ ,                                                           (1.44)

where Z is the so-called credibility matrix defined as

    Z = Λ Y′ Φ^{−1} Y ( Λ Y′ Φ^{−1} Y + I )^{−1}                               (1.45)
      = ( Λ Y′ Φ^{−1} Y + I )^{−1} Λ Y′ Φ^{−1} Y                               (1.46)
      = I − ( Λ Y′ Φ^{−1} Y + I )^{−1} .                                       (1.47)

Finally, if Y is of full rank q, we arrive at the appealing formula

    b̂ = Z b* + (I − Z) β ,                                                    (1.48)

where

    b* = (Y′ Φ^{−1} Y)^{−1} Y′ Φ^{−1} x ,                                      (1.49)

which expresses the LB estimator as a credibility weighted average of the sample estimator b* and the prior estimate β.
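A short Python sketch (not from the original text) of the credibility matrix (1.45) and the credibility weighted estimator (1.48), with hypothetical values for the design Y and the moments β, Λ, Φ:

    import numpy as np

    def credibility_regression(Y, Phi, Lam, beta, x):
        # LB estimator (1.48) via the credibility matrix Z of (1.45).
        Phi_inv = np.linalg.inv(Phi)
        W = Lam @ Y.T @ Phi_inv @ Y
        Z = W @ np.linalg.inv(W + np.eye(W.shape[0]))                  # (1.45)
        b_sample = np.linalg.solve(Y.T @ Phi_inv @ Y, Y.T @ Phi_inv @ x)   # (1.49)
        return Z @ b_sample + (np.eye(len(beta)) - Z) @ beta           # (1.48)

    # Hypothetical one-covariate regression with intercept, n = 4 observations.
    Y = np.column_stack([np.ones(4), np.arange(1.0, 5.0)])
    Phi = 0.5 * np.eye(4)                 # assumed E[V]
    Lam = np.diag([0.2, 0.05])            # assumed V[b]
    beta = np.array([1.0, 0.3])           # assumed E[b]
    x = np.array([1.2, 1.9, 2.1, 2.6])
    print(credibility_regression(Y, Phi, Lam, beta, x))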

1.3 Hilbert space methods

A. Motivation. The essential features of the estimation problems considered above are the linearity of the space of admitted estimators and the distance measure defined by expected squared loss. Obviously, Hilbert space theory is the right framework for a general treatment of these problems. Taking the basic probability space as granted, the appropriate Hilbert space is the space L² of square integrable r.v.-s with the inner product ⟨x, y⟩ = E[xy], confer Appendix A. We choose to discuss estimation of real-valued random variables for notational reasons only; extension to vector-valued estimands is straightforward by use of the inner product ⟨x, y⟩_A = E[x′Ay], where A is some positive definite matrix.
B. Formulation of the estimation problem. Let m ∈ L² be an r.v. representing an unknown quantity. The problem of estimating m under quadratic loss amounts to minimizing the risk

    ρ(m̂) = E[(m − m̂)²] = ‖m − m̂‖²

as m̂ ranges in some set L of admitted estimators. If L is a closed subspace of L², then the optimal estimator, called the L-Bayes estimator, is the projection m_L = pro(m|L). This is the unique element m_L ∈ L satisfying the normal equations

    E[(m − m_L) m̂] = 0 ,   m̂ ∈ L.

The minimum risk, called the L-Bayes risk, is

    ρ_L = ‖m − m_L‖² = ‖m‖² − ‖m_L‖² = E[m²] − E[m_L²] .

If L′ is a closed linear subspace of L, then the iterated projection theorem yields

    m_{L′} = pro(m_L | L′) ,

and

    ρ_{L′} = ‖m − m_{L′}‖² = ρ_L + ‖m_L − m_{L′}‖² = ρ_L + ‖m_L‖² − ‖m_{L′}‖² .

More generally, let L_0 ⊂ ··· ⊂ L_n be a nested family of closed linear subspaces of L², where the smallest space L_0 is the trivial 0-dimensional space {0}, and L_n = L². For each j < k we have m_{L_j} = pro(m_{L_k} | L_j), and ρ_{L_j} = ρ_{L_k} + ‖m_{L_k}‖² − ‖m_{L_j}‖². The estimand decomposes into mutually orthogonal components as

    m = Σ_{j=1}^n ( m_{L_j} − m_{L_{j−1}} ) ,

and the total squared length of m decomposes accordingly into

    ‖m‖² = Σ_{j=1}^n ( ‖m_{L_j}‖² − ‖m_{L_{j−1}}‖² ) = Σ_{j=1}^n ( ρ_{L_{j−1}} − ρ_{L_j} ) ,

showing how the total mean square of m is built up by successive increases of the mean square error as we step down from the best Bayes estimator (in L_n, which is m itself), via Bayes estimators at all levels, to the poorest one (in L_0, which is just 0).
We shall look at some special cases.
C. Estimates based on a given set of observations. Let x be some observations and let M̃ be the closed infinite-dimensional linear subspace of square integrable estimators that are functions of x (i.e. are σ(x)-measurable). The M̃-Bayes estimator m̃ = m_{M̃} is determined by the normal equations

    E[(m − m̃) m̂] = 0 ,   m̂ ∈ M̃ .

Write the expression on the left as

    E E[(m − m̃) m̂ | x] = E[(E[m|x] − m̃) m̂] .

In particular, m̂ = E[m|x] − m̃ gives E[(E[m|x] − m̃)²] = 0, hence

    m̃ = E[m|x] .

The M̃-Bayes risk is

    ρ̃ = E[m²] − E[m̃²] = V[m] − V[m̃] .

Let M be a closed linear subspace consisting of those r.v.-s in M̃ that are measurable w.r.t. some sub-sigma-algebra of σ(x) (i.e. use only part of the information carried by x). Then the iterated projection theorem applies in an obvious way. In particular, for M equal to the one-dimensional space of constants, the M-Bayes solution is given by m̃ = E[m] and ρ̃ = V[m].
D. Linear estimation. Suppose x = (x_1, . . . , x_n)′ is a vector of real-valued random variables, and consider the closed linear (n + 1)-dimensional subspace M̂ of non-homogeneous estimators of the form m̂ = g_0 + g_1 x_1 + ··· + g_n x_n, where the g_j are constants. Let the M̂-Bayes estimator, which is just the LB estimator, be m̂ = γ_0 + γ_1 x_1 + ··· + γ_n x_n. The normal equations now assume the form

    E[ (m − γ_0 − γ_1 x_1 − ··· − γ_n x_n)(g_0 + g_1 x_1 + ··· + g_n x_n) ] = 0

for all g_0, . . . , g_n. Putting g_j = 1 and g_k = 0, k ≠ j, for each j = 0, . . . , n, we arrive at the result in Paragraph 1.2.D.
E. Semilinear estimators. (De Vylder, [15].) Let m = m(θ) be a function of some unobservable r.v. θ, and assume that x_1, . . . , x_n are conditionally i.i.d., given θ. Consider the infinite-dimensional linear space M′ of semilinear estimators of the form

    m̂ = Σ_{j=1}^n f(x_j) ,

with f such that f(x_1) ∈ L². This space contains the space of non-homogeneous linear estimators of the form g_0 + g n^{−1} Σ_{j=1}^n x_j (put f(x) = n^{−1}(g_0 + gx)). The reader should try to establish that M′ is closed. The optimal semilinear estimator Σ_{j=1}^n f̃(x_j) is determined by the normal equations

    E[ ( m − Σ_{j=1}^n f̃(x_j) ) Σ_{j=1}^n f(x_j) ] = 0

for all f such that f(x_1) ∈ L². By the conditional i.i.d. property, the expected value on the left is

    n E[m f(x_1)] − n E[f̃(x_1) f(x_1)] − n(n − 1) E[f̃(x_2) f(x_1)] ,

hence the normal equations reduce to

    E{ ( E[m|x_1] − f̃(x_1) − (n − 1) E[f̃(x_2)|x_1] ) f(x_1) } = 0 .

Taking in particular

    f(x_1) = E[m|x_1] − f̃(x_1) − (n − 1) E[f̃(x_2)|x_1] ,

we obtain that f̃ is the solution to the integral equation

    E[m|x_1] = f̃(x_1) + (n − 1) E[f̃(x_2)|x_1] .

The solution must in general be obtained by numerical methods.
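One way to carry out the numerical solution, sketched below in Python for a hypothetical discrete model (two possible values of θ, observations on {0, 1}; all numbers are assumptions), is to note that when x takes finitely many values the integral equation becomes a linear system in the values f̃(a):

    import numpy as np

    # Hypothetical discrete example: theta takes two values, x_j | theta i.i.d. on {0, 1}, m = theta.
    theta = np.array([0.1, 0.3])          # possible values of theta
    prior = np.array([0.6, 0.4])          # P(theta)
    p_x = np.array([[0.9, 0.1],           # P(x = 0 | theta), P(x = 1 | theta) for theta[0]
                    [0.7, 0.3]])          # ... for theta[1]
    n = 5                                 # number of observations per unit

    joint = prior[:, None] * p_x                      # P(theta, x1)
    marg = joint.sum(axis=0)                          # P(x1 = a)
    post = joint / marg                               # P(theta | x1 = a), columns indexed by a
    E_m_given_x = theta @ post                        # E[m | x1 = a]
    P_x2_given_x1 = post.T @ p_x                      # P(x2 = b | x1 = a), by conditional i.i.d.

    # Solve f(a) + (n - 1) * sum_b P(x2 = b | x1 = a) f(b) = E[m | x1 = a].
    A = np.eye(2) + (n - 1) * P_x2_given_x1
    f = np.linalg.solve(A, E_m_given_x)
    print(f, "semilinear estimate for sample (0,1,1,0,0):", f[[0, 1, 1, 0, 0]].sum())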

Bibliography
[9], [14], [15], [16], [17], [20], [21], [25], [26], [32], [34], [42].


Chapter 2

Empirical Bayes methods


2.1 Introduction

A. The framework model. Consider a sequence of observational units labeled by i = 1, . . . , I, . . .. The I first units are observable at the time of consideration and constitute the current sample, whereas the remainder of the sequence represents future observations.
The model framework will be built in steps, starting from a parsimonious fixed-effects model specified as follows. With each unit i there are associated an observable output quantity X_i, an unknown parameter θ_i representing some hidden characteristics of the unit, and an observable input or design quantity c_i comprising certain explanatory covariates and the observational design. The range of X_i depends on the design and is denoted by X_{c_i}. Typically the observation is vector-valued so that X_{c_i} ⊂ R^{n_i}, the Euclidean space of dimension n_i = n(c_i). Denote the range of the θ_i by T and that of the c_i by C. The basic assumption is the following.
(i) The observational units are stochastically independent, and each output X_i has a p.d.f. of the form f_{c_i}(·|θ_i) (depending on i only through θ_i and c_i) with respect to a basic measure φ_{c_i}. Introduce

    F = {f_c(·|θ) ; θ ∈ T, c ∈ C} ,                                            (2.1)

the family of all likelihood functions generated by varying the design.
Apart from the common shape of the p.d.f.s, the assumption (i) establishes no relationship between the observational units; the outputs are stochastically independent, and the θ_i assume their values in T independently. In this framework inferences about the θ_i must be based on the likelihood

    ∏_{i=1}^I f_{c_i}(x_i|θ_i) ,


and, due to the multiplicative form of this likelihood, each individual θ_i is to be assessed from the corresponding observation x_i alone. A much stronger model is obtained by assuming that all θ_i are equal to θ, say, so that X_1, . . . , X_I can be pooled into one sample with joint p.d.f.

    ∏_{i=1}^I f_{c_i}(x_i|θ) .

Then all observations x_i will be directly related to the common parameter θ.
Robbins' [44] so-called empirical Bayes set-up places itself between these two extreme solutions by envisaging the parameters as outcomes of independent selections from a common distribution. Thereby a certain similarity is introduced between them, and at the same time they are allowed to vary among the units. More specifically, the extension to an empirical Bayes model goes as follows.
(ii) Assumption (i) is the conditional model for the X_i, given the θ_i, i = 1, 2, . . .. The θ_i are i.i.d. selections from one and the same distribution G with p.d.f. g with respect to a measure μ, and

    g ∈ G .                                                                    (2.2)
The bulk of early empirical Bayes theory rests on the assumption that all c_i are equal to c, say, which will be referred to as the case with balanced design. A key point in the balanced design assumption is that the pairs (X_i, θ_i), i = 1, 2, . . ., are i.i.d., so that standard large sample theory based on the laws of large numbers and the central limit theorem can be invoked. In the unbalanced case with varying designs it is necessary to impose certain regularity conditions on the c_i to ensure that the statistical procedures possess desired asymptotic properties.
When only one unit is under consideration, it is not necessary to drag along with the subscript i. Therefore, to facilitate the presentation, the sequence (X_i, θ_i, c_i), i = 1, . . . , I, is extended with an unindexed current unit, (X, θ, c), which will be frequently in focus in what follows.
In the full model (i)-(ii) the joint p.d.f. of (X, θ) is

    f_c(x|θ) g(θ) .                                                            (2.3)

The marginal p.d.f. of X is

    f_c(x) = ∫_T f_c(x|θ) g(θ) dμ(θ) .                                         (2.4)

The conditional p.d.f. of θ, given X = x, is the ratio of the expressions in (2.3) and (2.4),

    g_c(θ|x) = f_c(x|θ) g(θ) / f_c(x) .                                        (2.5)


The function f_c(·|θ) will be referred to as the kernel density in the following. It is often called the likelihood (function), but this term is unfortunate in the present context where θ is the outcome of a random variable stemming from a distribution, which has a frequency interpretation and can (in principle) be estimated. Then the marginal density of the observables X_i, given by (2.4), can appropriately be termed the likelihood since it forms the basis of statistical inferences.
The marginal p.d.f. g will be called the prior density in spite of its frequency interpretation. This convention is in accordance with tradition. Accordingly, the conditional p.d.f. in (2.5) is called the posterior density since it represents the knowledge after observation of X = x (and c).

B. Example: Poisson kernels. In some lines of insurance it is customary to let the premium of each individual risk depend on its current claims record. One example of such individual experience rating, familiar to most people, is merit rating in automobile insurance.
Most of the merit rating schemes used in practice take only the numbers of claims into consideration. Consider a class comprising I independent risks. Let X_ij, j = 1, . . . , n_i, be the numbers of claims reported in n_i years by risk No. i. They are assumed to be independent, with each X_ij distributed according to the Poisson law with parameter θ_i p_ij. Here p_ij is an observable measure of risk exposure in year j, e.g. the time exposed to risk as insured or the mileage, and θ_i is the individual claim intensity per unit risk exposure. In this case c_i = (n_i, p_i), with p_i = (p_i1, . . . , p_in_i)′. The p.d.f. of X_i = (X_i1, . . . , X_in_i)′ is

    f_{n_i,p_i}(x_i|θ_i) = ( ∏_{j=1}^{n_i} p_ij^{x_ij} / x_ij! ) θ_i^{Σ_j x_ij} e^{−θ_i Σ_j p_ij} ,

x_i ∈ X_{c_i} = N^{n_i}, θ_i ∈ T = (0, ∞), the basic measure φ_{c_i} being the counting measure on X_{c_i}. This completes the specification of the model at stage (i) of the general outline.
Consider the problem of assessing θ_i for each individual risk i = 1, . . . , I. So far the model establishes no relationship between the risks. The θ_i are completely disconnected; any specific value of the sequence (θ_1, θ_2, . . .) is just as likely at the outset as any other, and the samples X_i are stochastically independent. Thus, each θ_i has to be estimated from the experience X_i of risk i alone. Referring to 1.1.I, the ML (and UMVU) estimator

    θ̂_i = Σ_j X_ij / Σ_j p_ij

is poor if Σ_j p_ij is small. The problem becomes acute when the insurer is faced with a new risk with no experience of its own (Σ_j p_ij = 0). The fixed effects model at stage (i) renders no possibility of assessing θ_i in a rational manner in the absence of observations. However, an insurer who wants to remain in business has to fix a premium somehow. Common practice is to use, as an initial estimate, Σ_i Σ_j X_ij / Σ_i Σ_j p_ij, the observed mean risk premium of the present portfolio. This solution is approved of by practical insurance people and customers, but has no support whatever in the stage (i) model. Obviously, the model fails to reflect the essential circumstance that automobile risks have something in common, which justifies the use of portfolio-wide statistics in an assessment of each single risk. The mathematical way of accounting for this notion of similarity is assumption (ii) in the general outline, whereby the risks are viewed as random selections from a population of risks: different, but of a similar origin.
C. Example: Binomial kernels. A certain type of items can be attributed either of two quality characteristics, defect or intact. A factory delivers items in large batches. Denote by θ_i the proportion of intacts in batch No. i: it is unknown and represents the quality of the batch. To prevent poor batches from being supplied to the trade, a quality control is arranged. From each batch i a random sample of n_i items is controlled, and the outcome is recorded as X_i = (X_i1, . . . , X_in_i)′, where X_ij is 0 or 1 according as the j-th item in the sample is defect or intact. The design is simply the sample size, c_i = n_i, X_{n_i} = {0, 1}^{n_i}, and θ_i ∈ (0, 1) = T. Suppose the n_i are small compared to the size of the batches. Then each X_i can reasonably be viewed as a sequence of Bernoulli trials with p.d.f.

    f_{n_i}(x_i|θ_i) = θ_i^{Σ_j x_ij} (1 − θ_i)^{n_i − Σ_j x_ij} ,             (2.6)

the basic measure φ_{n_i} being the counting measure on X_{n_i}. This completes the specification of item (i) in the general outline above.
The model establishes no relationship between the batches and the samples drawn from them. For each batch i the quality θ_i has to be assessed from X_i alone. The ML and UMVU estimator is

    θ̂_i = Σ_{j=1}^{n_i} X_ij / n_i ,                                          (2.7)

which is unreliable if n_i is small.
Suppose I batches have already been subjected to control and that, on the whole, the values of θ̂_1, . . . , θ̂_I are close to 1, indicating that the batches are of high quality. It may be felt that this piece of information ought to be taken into account in the assessment of future batches. The batches stem from the same manufacturing process, and the quality of the process itself must be decisive of the qualities of the single batches. This motivates assumption (ii) in the general outline. The p.d.f. g represents the quality of the manufacturing process.
D. Example: Linear regression. Assume that the observational units form a sequence of related regression problems. For the time being, concentrate on the current unit. At stage (i) of the general model it is assumed that the vector of outputs X (n × 1) is distributed according to N_n(Yb, vP^{−1}), where Y (n × q) and P (n × n) are known and b (q × 1) and v (1 × 1) are unobservable. The matrix P is taken to be p.d.s., implying that v > 0. In this set-up the design is c = (n, Y, P) and the latent quantity is θ = (b, v). The kernel density is

    f_{n,Y,P}(x|b, v) = (2π)^{−n/2} v^{−n/2} |P|^{1/2} exp( −(1/(2v)) |x − Yb|²_P ) .   (2.8)

(Recall that |·|_P is the norm induced by the inner product ⟨x, y⟩_P = x′Py.)
Consider the problem of estimating (b, v) by the ML method. For each fixed v > 0 the expression on the right of (2.8) is maximized at any b̂ solving

    Y′ P (x − Y b̂) = 0 (q × 1) ,                                              (2.9)

the normal equations determining the ⟨·, ·⟩_P-projection of x onto the column space of Y. Then, insert b = b̂ into (2.8) and maximize with respect to v. It is easily shown that the maximum is attained at ((n − r)/n) v̂, where r = rank Y (≤ min(q, n)) and

    v̂ = (1/(n − r)) |x − Y b̂|²_P .                                           (2.10)

The nature of the solution depends on r. If r = q, then (2.9) possesses a unique solution and delivers the ML estimator

    b̂ = (Y′PY)^{−1} Y′Px .                                                    (2.11)

If r < q, there is a (q − r)-dimensional space of solutions, and the ML estimation of b is indeterminate. If r < n, then v̂ > 0 a.s. If r = n, then v̂ = 0 a.s., reflecting the fact that the observations contain no information about the erratic terms. (In the special case r = q = n (2.11) reduces to b̂ = Y^{−1}x, a perfect fit of the empirical regression to the observations.) To sum up, the ML principle delivers a meaningful solution only if r = q < n. In this case the ML estimator is (b̂, ((n − r)/n) v̂) given by (2.10) and (2.11). The estimator (b̂, v̂) is unbiased and, being based on a minimal sufficient statistic, it is also UMVU. Note that v̂ is well defined whenever r < n.
Like in the previous examples, it can be concluded that the fixed effect model considered here renders no possibility of assessing the parameters when the data are scanty. One is then urged to consider the possibility of utilizing knowledge about other similar observations by adding a type (ii) assumption from the general set-up.

2.2 The decision problem

A. The empirical Bayes decision problem. For the current unit a decision is to be selected in a space D of possible decisions. Let a loss function L : D × T → R be given. Only trivial modifications are required in the following if L depends also on the design c.
The observations constitute the available information. The primary information is the observation (X, c) from the current unit, and the secondary information is the observations

    (X, c)_I = {(X_i, c_i) ; i = 1, . . . , I}

from the collateral units i = 1, . . . , I. The decision problem consists in determining a decision function,

    δ{(X, c) ; (X, c)_I} ,                                                     (2.1)

with values in D such that the overall risk,

    ρ_g(δ) = E[L(δ, θ)] ,                                                      (2.2)

becomes small, preferably minimum.


B. Sketch of a two-stage procedure. Matters would be greatly simplified if g were known, since then the decision function could be permitted to depend on g. Thus, as a first step, look for an optimal choice in the extended class of decision functions that are allowed to depend both on the observations and the prior. Write (2.2) as

    ρ_g(δ) = E E[L(δ, θ)|(X, c), (X, c)_I] .                                   (2.3)

The minimum is obtained by minimizing the innermost integral for fixed values of the observations. Due to the mutual independence of the observational units and the special form of the integrand, the problem reduces to minimizing

    E[L(δ, θ)|X, c]                                                            (2.4)

with respect to δ varying in the class of decision functions depending only on (X, c) and g. The solution, when it exists, is the Bayes rule against g,

    δ_g(X, c) .                                                                (2.5)

The collateral data (X, c)_I, which were argued to be of relevance, have dropped out of the analysis and do not appear in the solution (2.5). This is so because g was held fixed (assumed known). In the full model it is not, however, and this is where the collateral data come into play. The second step in the two-stage procedure consists in estimating the Bayes decision in (2.5) from the observations (X, c)_I to obtain a genuine decision function depending only on the available data. We shall briefly outline this part of the problem in the next section.
Finally, the resulting decision rule ought to be assessed by the performance criterion (2.2) to ascertain that the two-stage procedure serves the proclaimed purpose. This problem shall not be treated in this short account of the theory, and we refer to [38].

2.3 Empirical Bayes decision functions

A. Constructing the decision functions. The Bayes decision function (2.5) against a specific g cannot be expected to perform well for all priors in the entire set G of possibilities. It is merely a first auxiliary step in the two-stage procedure described in Paragraph 2.2.B.
The second step consists in estimating the Bayes decision function from the available data (X, c)_I to arrive at a function δ*, say, of the form (2.1). This function is typically constructed by insertion of an estimator ĝ of g in (2.5) or, if a restricted Bayes solution like the LB estimator is arranged at the first stage, by inserting estimators of those parameters that are involved in the restricted Bayes decision function. Hopefully, the resulting decision function performs almost as well as the Bayes rule at each point g ∈ G. That question can only be settled by an eventual evaluation of δ* by its risk.
As a provisional minimum requirement, δ* should converge to the Bayes decision function in a suitable sense. Following Robbins [44], the decision rule δ* is said to be empirical Bayes (against G), abbreviated EB, if

    δ* →_p δ_g ,   g ∈ G .                                                     (2.1)

Here → signifies convergence as I → ∞, and →_p denotes convergence in probability (with respect to some appropriate metric in D). The feasibility of the EB or any restricted empirical Bayes procedure depends on the specification of the basic model entities F and G. The following paragraphs treat briefly the major cases, ordered by decreasing specificity of the families of distributions.
B. The parametric case. Assume that both F and G are parametric families, that is, T is a finite-dimensional Euclidean set, and

    G = {g_α ; α ∈ A}                                                          (2.2)

with A ⊂ R^k for some k. Then the Bayes solution in (2.5) is parametrized by α ∈ A and may be denoted by

    δ_α .                                                                      (2.3)

In this case a natural approach is to estimate α by some standard parametric method and replace α in (2.3) by the resulting estimator α* to obtain the empirical decision rule

    δ* = δ_{α*} .                                                              (2.4)

Clearly, if α* →_p α for all α ∈ A and the Bayes rule in (2.3) is a continuous function of α, then (2.4) is EB.
In many situations the ML construction applies. The unconditional likelihood function for the observation (X, c) is given by (2.4) with g = g_α. For simplicity, replace the index g_α by α. The ML estimator α* is obtained by maximizing the likelihood of the entire sample,

    Λ(α) = ∏_{i=1}^I f_{α,c_i}(X_i) ,                                          (2.5)

or its logarithm

    log Λ(α) = Σ_{i=1}^I log f_{α,c_i}(X_i) .                                  (2.6)

In most situations the ML estimator is obtained as the unique solution of the necessary condition of an extremum,

    ∂/∂a log Λ(a)|_{a=α*} = Σ_{i=1}^I ∂/∂a log f_{a,c_i}(X_i)|_{a=α*} = 0 (k × 1) .   (2.7)

Under mild regularity conditions on the sequence (c_1, c_2, . . .), the ML estimator α* for α is asymptotically normally distributed with mean α and variance matrix

    ( − Σ_{i=1}^I ∂²/(∂a ∂a′) log f_{a,c_i}(X_i) |_{a=α*} )^{−1} .             (2.8)

The variance matrix in (2.8) depends on the realized sequence (c_1, c_2, . . .), which is thus decisive of the amount of information in the sample.
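As an illustration of the parametric ML construction (a sketch, not from the original text), assume Poisson kernels with exposures p_i and a Ga(α, β) prior as in Paragraph 1.1.I; marginally each total claim count X_i is then negative binomial, and (2.6) can be maximized numerically. All data below are made up:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln

    # Hypothetical portfolio: total exposure p_i and total claim count X_i per unit.
    p = np.array([2.1, 0.8, 5.0, 3.3, 1.2, 4.4])
    X = np.array([0, 1, 2, 1, 0, 3])

    def neg_log_lik(log_ab):
        # Minus the marginal log-likelihood (2.6): X_i is negative binomial with
        # parameters alpha and beta/(beta + p_i) under the Poisson/gamma assumption.
        alpha, beta = np.exp(log_ab)          # log scale keeps alpha, beta > 0
        ll = (gammaln(X + alpha) - gammaln(alpha) - gammaln(X + 1)
              + alpha * np.log(beta / (beta + p)) + X * np.log(p / (beta + p)))
        return -ll.sum()

    res = minimize(neg_log_lik, x0=np.log([1.0, 10.0]), method="Nelder-Mead")
    alpha_star, beta_star = np.exp(res.x)
    print(alpha_star, beta_star)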
C. The semi- and non-parametric cases. When G and, possibly, also F
are nonparametric, the maximum likelihood procedure does not apply, and it
is usually impossible to arrange an unrestricted EB procedure. This fact is,
perhaps, the most important reason for studying restricted Bayes procedures,
like LB estimation: They typically depend on the underlying distributions only
through a finite set of parameter functions, e.g. certain first and second order
moments, which can be estimated even if the model is nonparametric and the
design is unbalanced. In this respect the LB approach is representative of a more
general methodology that consists in restricting the space of decision functions
sufficiently to obtain restricted Bayes solutions which can be reliably estimated
even if the model itself is of high complexity. We refer to Chapter 4 for an
account of parameter estimation in the empirical linear Bayes situation.

Bibliography
[5], [13], [31], [33], [38], [44], [49].

Chapter 3

Hierarchical credibility
Norberg R.: Hierarchical credibility: analysis of a random effect linear model
with nested classification. Scand. Actuarial J., 1986, 204-222. Sections 1-3.


Chapter 4

Parameter estimation in credibility

4.1 Empirical linear Bayes estimation

As a fairly general framework for our discussion we take the well-known nonparametric linear regression model which specifies only that the vector of observations x (n × 1) is square integrable with conditional first and second order moments of the form

    E[x|θ] = Y b ,                                                             (4.1)
    V[x|θ] = v P^{−1} ,                                                        (4.2)

with n, Y (n × q), P (n × n) observable and non-stochastic, and b (q × 1) and v (scalar) some functions of the unobservable random variable θ. We recall that the Linear Bayes (LB) estimator of b is

    b̃ = β + Λ Y′ (Y Λ Y′ + φ P^{−1})^{−1} (x − Y β) ,                         (4.3)

where

    β = E[b] ,   Λ = V[b] ,   φ = E[v] .                                       (4.4)

An empirical LB estimator is obtained from (4.3) upon replacing the parameters in (4.4) by estimators based on independent replicates of the situation. Thus, suppose that, for each i = 1, . . . , I, we have observed (n_i, Y_i, P_i, x_i) associated with an unobservable θ_i, where x_i fits into the regression model (4.1)-(4.2) (with all entities equipped with subscript i) and the θ_i are i.i.d. selections from the distribution of the current θ.
Since nothing is assumed as to the shape of the distribution of θ (and possibly also the conditional distribution of x for fixed θ), inferences about the unconditional first and second order moments in (4.4) must be based on the observations and their cross products

    x_i ,   x_i x_i′ ,                                                         (4.5)

or some summary linear functions of these. In the full rank case (rank(Y_i) = q) we may choose to base the estimation on the statistics

    b̂_i = (Y_i′ P_i Y_i)^{−1} Y_i′ P_i x_i ,                                  (4.6)

    b̂_i b̂_i′ ,                                                               (4.7)

    v̂_i = (n_i − q)^{−1} (x_i − Y_i b̂_i)′ P_i (x_i − Y_i b̂_i)
        = (n_i − q)^{−1} ( x_i′ P_i x_i − b̂_i′ Y_i′ P_i Y_i b̂_i ) .          (4.8)

Introduce

    Ψ = Λ + β β′ ,                                                             (4.9)

and, instead of estimating the parameters in (4.4), consider the equivalent problem of estimating

    (β, Ψ, φ) .                                                                (4.10)

The point is that the empirical first and second order moments of the observations, which form the natural basis for estimation of the first and second order moments, have expectations that are linear in the components of (4.10):

    E[x_i] = Y_i β ,                                                           (4.11)
    E[x_i x_i′] = Y_i Ψ Y_i′ + φ P_i^{−1} ,                                    (4.12)

or, in the full rank case,

    E[b̂_i] = β ,                                                              (4.13)
    E[b̂_i b̂_i′] = Ψ + φ (Y_i′ P_i Y_i)^{−1} ,                                (4.14)
    E[v̂_i] = φ .                                                              (4.15)

This way the situation is made accessible to linear estimation methods. The relations (4.11)-(4.12) or (4.13)-(4.15), whichever are chosen, can be written in compact form as

    E[s_i] = A_i π ,                                                           (4.16)

where s_i is a vector-valued linear function of the statistics in (4.5), and A_i is a matrix of coefficients which can be compiled from the expressions on the right of (4.11)-(4.12) or (4.13)-(4.15), and π is a vector-valued parameter function made up of the different entries in β, Ψ, and φ. Put

    D_i = V[s_i] .                                                             (4.17)


(Here D signifies dispersion.) The variance matrix D_i is a function of the design c_i = (n_i, Y_i, P_i) and certain parameters ν occurring in moments up to order four:

    D_i = D(ν, c_i) .                                                          (4.18)

Concatenate the statistics from the I units and introduce

    s = (s_1′, . . . , s_I′)′ ,   A = (A_1′, . . . , A_I′)′ ,   D = diag(D_1, . . . , D_I) .

From (4.16) and (4.17) and the independence of the units one gathers

    E[s] = A π ,   V[s] = D .

If the variances D_i were known, then the best unbiased linear (in s) estimator of π would be the so-called GLS (generalized least squares) estimator,

    π* = (A′ D^{−1} A)^{−1} A′ D^{−1} s = ( Σ_i A_i′ D_i^{−1} A_i )^{−1} Σ_i A_i′ D_i^{−1} s_i .   (4.19)

Since the D_i are unknown, the GLS estimator is only an auxiliary construction. It motivates a class of estimators of the form

    π̂_W = ( Σ_i A_i′ W_i A_i )^{−1} Σ_i A_i′ W_i s_i ,                        (4.20)

with W_i some positive definite symmetric (p.d.s.) matrices.
From (4.16)-(4.17) we immediately obtain that π̂_W is unbiased,

    E[π̂_W] = π ,                                                              (4.21)

and that its variance matrix is

    V[π̂_W] = ( Σ_i A_i′ W_i A_i )^{−1} ( Σ_i A_i′ W_i D_i W_i A_i ) ( Σ_i A_i′ W_i A_i )^{−1} .   (4.22)

The minimum variance matrix, which is obtained with the optimal (luckiest possible choice of) weights W_i = D_i^{−1}, is

    V[π*] = ( Σ_i A_i′ D_i^{−1} A_i )^{−1} .                                   (4.23)

The optimality of the GLS approximation (which is not a genuine estimator) motivates the choice

    W_i = D(ν*, c_i)^{−1} ,                                                    (4.24)

where ν* is some judiciously guessed/estimated value of the unknown ν, e.g. the corresponding moments in some candidate distribution of θ. Clearly, since the optimal weights are known to be of the form (4.24), there is no reason to look for weights outside of this class. We shall illustrate this idea by an example.
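A minimal sketch (not from the original text) of the weighted estimator (4.20), with made-up matrices A_i, weights W_i, and statistics s_i:

    import numpy as np

    def weighted_estimator(A_list, W_list, s_list):
        # pi_hat_W = (sum_i A_i' W_i A_i)^{-1} sum_i A_i' W_i s_i, formula (4.20)
        lhs = sum(A.T @ W @ A for A, W in zip(A_list, W_list))
        rhs = sum(A.T @ W @ s for A, W, s in zip(A_list, W_list, s_list))
        return np.linalg.solve(lhs, rhs)

    # Three units, each reporting a 2-dimensional statistic s_i with E[s_i] = A_i pi,
    # where pi is a 2-dimensional parameter function; the W_i are guessed p.d.s. weights.
    A_list = [np.array([[1.0, 0.0], [1.0, 2.0]]),
              np.array([[1.0, 0.0], [1.0, 0.5]]),
              np.array([[1.0, 0.0], [1.0, 1.0]])]
    W_list = [np.diag([2.0, 1.0]), np.diag([0.5, 0.8]), np.diag([1.5, 1.2])]
    s_list = [np.array([0.9, 2.8]), np.array([1.1, 1.3]), np.array([1.0, 2.1])]
    print(weighted_estimator(A_list, W_list, s_list))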


Table 4.1: Group life insurance data. For each risk class i = 1, . . . , 72 is shown the exposure p_i and number of deaths M_i.

     i     p_i      M_i     i     p_i      M_i     i     p_i      M_i
     1   3349.02     16    25   2080.38     10    49   4497.34     15
     2   1394.00      5    26  26762.37     49    50   3959.89     18
     3    479.81      0    27     24.98      0    51   5518.23     14
     4     23.98      0    28     16.46      0    52   2518.79     11
     5     11.30      0    29    210.12      0    53     86.21      2
     6   2273.55     11    30    274.28      3    54  19485.59     26
     7    179.85      0    31   4106.71     12    55      4.45      0
     8   3947.44     24    32   2463.90      5    56   3259.21      7
     9   4332.56      9    33   1224.26      6    57     24.44      0
    10    109.84      1    34   8439.94     21    58   9037.35     20
    11    647.77      2    35   2751.09     14    59     30.91      0
    12   1308.36      1    36   2053.67     12    60     44.24      0
    13    154.81      3    37   1648.87      3    61   1551.30      3
    14   7525.86     13    38     99.74      0    62    113.49      0
    15    428.34      2    39     28.47      0    63    147.44      0
    16   1049.42      3    40    762.12      2    64    826.40      2
    17   2394.95     10    41    742.22      4    65    775.19      1
    18   2406.47      5    42    104.84      0    66    634.16      2
    19    212.79      0    43   2857.05      5    67   2295.45      6
    20   5334.82     15    44   1127.65      3    68    942.03      4
    21   1235.84      5    45     96.64      0    69    232.31      2
    22   6575.43     57    46    137.22      1    70     37.08      0
    23    270.60      2    47     77.11      0    71    233.25      0
    24   1137.39      4    48    650.88      0    72     39.52      0

4.2 An example from group life insurance

Table 4.1, which is taken from Norberg (1989), contains summary data from an authentic portfolio of workmen's group life insurance policies. The portfolio is divided into I = 72 risk classes representing different occupational categories (mining, forestry, etc.). For each class i (= 1, . . . , I) there is a record specifying the total number of years exposed to risk of death, p_i, and the number of deaths, M_i, during the period of observation. The exposure p_i is defined precisely as ∫ p_i(t) dt, where p_i(t) is the number of individuals insured in class i at calendar time t, and the integral ranges over the observation period.
Presumably, due to differences with respect to age composition and occupation-specific mortality, the risk conditions vary among the classes. The problem is to set, for each individual class, an appropriate premium rate based on the present summary statistics, the point being that the scheme should be low-cost and not require current maintenance of individual records on all persons covered under the scheme.
Norberg (1989) proposes the following model for the situation. There is stochastic independence across risk classes, and each M_i has a Poisson distribution with parameter p_i θ_i. Here θ_i is the unobservable mortality rate per year at risk in class i.
For each k = 1, 2, . . . set

    s_ik = M_i^{(k)} / p_i^k ,

where M_i^{(k)} = M_i (M_i − 1) ··· (M_i − k + 1). One easily verifies that

    E[s_ik|θ_i] = θ_i^k .                                                      (4.25)

By minimal sufficiency it follows that s_ik is the uniformly minimum variance unbiased (UMVU) estimator of θ_i^k.
In particular, the UMVU estimator of θ_i is its empirical counterpart

    s_i1 = M_i / p_i ,

whose variance is θ_i/p_i. Now, this estimator is poor for classes with small exposure, and for new risk classes with no previous exposure it is not even defined. To remedy this deficiency and also accommodate the notion of a common origin of the classes, the empirical Bayes device is employed. Thus, the θ_i are considered as independent and identically distributed random variables, and the above assumptions are taken as the conditional model, given the θ_i.
Since nothing is assumed concerning the shape of the distribution of the θ_i, one resorts to linear methods. The linear Bayes estimator of θ_i based on s_i1 is

    θ̃_i = E[θ_i] + ( C[θ_i, s_i1] / V[s_i1] ) ( s_i1 − E[s_i1] ) .            (4.26)

Introducing

    ν_k = E[θ_i^k] ,                                                           (4.27)

and using the rule of iterated expectations together with (4.25), we find that the moments involved in (4.26) are E[θ_i] = ν_1 and

    E[s_i1] = ν_1 ,
    V[s_i1] = ν_2 − ν_1² + ν_1/p_i ,
    C[θ_i, s_i1] = ν_2 − ν_1² .

Empirical linear Bayes estimators are obtained from (4.26) upon replacing the parameters ν_1 and ν_2 by estimators based on the data. Thus we consider the problem of estimating

    ν = (ν_1, ν_2)′ .


From (4.25) and (4.27) we get

    E[s_ik] = ν_k .                                                            (4.28)

This suggests using a linear method based on

    s_i = (s_i1, s_i2)′ ,   i = 1, . . . , I.

Each s_i is an unbiased estimator of ν and, by use of (4.28),

    V[s_i] = ( ν_2 − ν_1² + ν_1/p_i              ν_3 − ν_1 ν_2 + 2ν_2/p_i
               ν_3 − ν_1 ν_2 + 2ν_2/p_i          ν_4 − ν_2² + 4ν_3/p_i + 2ν_2/p_i² )
           = D(ν_1, . . . , ν_4; p_i) ,

a function of the first four moments ν_1, . . . , ν_4 and the exposure p_i.
In the present case A_i = I (the identity matrix) for all i, and the estimator π̂_W in (4.20) reduces to

    ν̂ = ( Σ_i W_i )^{−1} Σ_i W_i s_i .                                        (4.29)

A natural estimator of

    λ = V[θ_i] = ν_2 − ν_1²

is

    λ̂ = ν̂_2 − (ν̂_1)² .

It involves empirical moments up to 4th order, and so its variance is finite if ν_8 < ∞. By first order Taylor expansion we obtain the easy approximation

    V[λ̂] ≈ V[ν̂_2] − 4 ν_1 C[ν̂_1, ν̂_2] + 4 ν_1² V[ν̂_1] .                   (4.30)

Applying now the device in (4.24), we take

    W_i = D(ν*_1, . . . , ν*_4; p_i)^{−1} ,

where ν*_1, . . . , ν*_4 are the first four moments of some candidate distribution of θ_i. For instance the gamma distribution with shape parameter α and inverse scale parameter β generates

    ν_k = α (α + 1) ··· (α + k − 1) / β^k .                                    (4.31)

What remains now is to select a reasonable pair (α*, β*). The perhaps easiest way starts from the mean,

    E*[θ] = α* / β* ,
CHAPTER 4. PARAMETER ESTIMATION IN CREDIBILITY

31

and the coefficient of variation,

CV () =

V []
1
= .
E []

Having specified these, we solve


=
=

1
,
(CV ())2
1

E [] (CV ())2

and then calculate the entries of by the formula (4.31) with ( , ) in the
place of (, ).
In the present situation one could imagine e.g. that the mean should be
some 0.003 and that the coefficient of variation could be some 0.5. This means
that (1 , ) should be in the vicinity of (3 103 , 2 106 ).
Table 4.2 displays point estimates ( , 1 ) obtained for various a priori specifications of the first two moments, ( , 1 ). The upper part of the table contents
shows dependence of estimates on for fixed 1 , the middle part shows dependence on 1 for fixed , and the lower part shows dependence on the size of
both a priori values.
An iterated estimate, obtained as stationary values in repeated estimations,
each time using the estimate from the previous round as new prior values, came
out as (2.10 106 , 3.32 103 ).
Finally, Table 4.3 shows mean squared errors for various choices of a priori
values and 1 when = 2.08 106 and 1 = 3.26 103 .

CHAPTER 4. PARAMETER ESTIMATION IN CREDIBILITY

32

Table 4.2: Real data estimates for various choices of 1 and .


A priori
values
106 1 103 1
0
3
2
3
4
3
10
3
40
3
60
3
100
3
16
1
16
2
16
3
16
4
16
10
16
20
16
40
2
1
4
2
6
3
10
5
20
10
40
20
60
30
80
40
100
50

Parameter
estimates
106 1 103 1
***
***
2.03
3.32
1.89
3.35
2.21
3.39
2.46
3.41
1.45
3.39
-1.13
3.36
2.28
3.40
2.73
3.41
2.55
3.40
2.43
3.40
2.71
3.36
4.25
3.33
10.01
3.30
1.72
3.37
1.84
3.37
1.95
3.37
2.18
3.37
2.75
3.37
3.90
3.37
5.04
3.37
6.18
3.37
7.32
3.37

Table 4.3: Mean squared errors for various choices of and 1 when =
2.08 106 and 1 = 3.26 103 . Each cell contains the pair (MSE(106 ),
MSE(103 1 )).

106 :

0.50
1.00
2.08
4.00
16.00

1.08
0.91
0.92
1.39
1.91

2
.078
.079
.084
.095
.150

1.13
0.89
0.82
0.92
5.13

.081
.078
.079
.084
.115

103 1 :
3.26
1.05 .086
0.84 .080
0.78 .078
0.84 .081
2.77 .100

4
1.05
0.86
0.80
0.86
2.40

.088
.081
.078
.079
.095

16
64.24
39.15
20.15
11.68
7.82

.111
.098
.087
.081
.079

CHAPTER 4. PARAMETER ESTIMATION IN CREDIBILITY

4.3

33

Estimation in the compound Poisson model

Let {Nt }t0 be a Poisson process with intensity {t }t0 , that is, the process
has independent
increments and Nt Ns , s t, is Poisson distributed with
Rt
parameter s d = t s , where
t =

d .
0

is the cumulative intensity. Let Yn , n = 1, 2, . . ., be i.i.d. selections from some


distribution G, and assume they are independent of the process N . The process
{St }t0 defined by
St =

Nt
X

Yn ,

n=1

(interpreted as 0 if Nt = 0) is called the non-homogeneous compound Poisson


process with intensity {t }t0 and jump distribution G. Since S is purely
discrete, its value at any time t is the sum of the jumps made in the time
interval (0, t], and we may write
St =

0< t

(S S ) =

Z tZ

y N (d, dy) ,

where N (d, dy) is the number of claims in the time interval [, + d ) with
claim size in the interval [y, y+dy). This representation, here kept at an informal
level, helps establish readily some useful results about moments of functionals
of the compound Poisson process. Proceeding informally, N (d, dy) is 0 or 1
(for small d ), it is independent of the past behaviour of the process in (0, ),
and it has expected value G(dy) d . Using these properties together with the
R t (k1)
identity k 0
d = kt , k > 1, we establish the following results:
E[St ] = E[Y ]t ,

E[St2 ] = E

(S2 S2 )

0< t

Z tZ

(S + y)2 S2 N (d, dy)
= E
0
Z tZ

=
E 2S y + y 2 G(dy) d
0
Z t
Z t
2
2
= 2E [Y ]
d + E[Y ]
d
0

= E2 [Y ]2t + E[Y 2 ]t ,

CHAPTER 4. PARAMETER ESTIMATION IN CREDIBILITY


X

E[St3 ] = E

34

(S3 S3 )

0< t

Z tZ



E (S + y)3 S3 N (d, dy)
0
Z tZ

= E
3S2 y + 3S y 2 + y 3 G(dy) d
0
Z t


3 E2 [Y ]2 + E[Y 2 ] E[Y ] + 3E[Y ] E[Y 2 ] + E[Y 3 ] d
=
=

= 3E [Y ]

2 d

+3E[Y ]E[Y 2 ]

+ 3E[Y ]E[Y ]

d + E[Y 3 ]

d
0

= E3 [Y ]3t + 3E[Y ]E[Y 2 ]2t + E[Y 3 ]t ,


E[St4 ] = E

(S4 S4 )

0< t

Z tZ


E (S + y)4 S4 N (d, dy)
0
Z tZ

4S3 y + 6S2 y 2 + 4 S y 3 + y 4 G(dy) d
= E
0
Z t

= 4 E[Y ]
E3 [Y ]3 + 3E[Y ]E[Y 2 ]2 + E[Y 3 ] d
=


E2 [Y ]2 + E[Y 2 ] d
0
Z t
Z t
3
4
+4E[Y ]
E[Y ] d + E[Y ]
d
0
0
Z t
Z t
4
3
2
2
= 4E [Y ]
d + 4 3 E [Y ]E[Y ]
2 d
0
0
Z t
Z t
+4 E[Y ]E[Y 3 ]
d + 6 E[Y 2 ]E2 [Y ]
2 d
0
0
Z t
Z t
+6 E2 [Y 2 ]
d + 4 E[Y 3 ]E[Y ]
d + E[Y 4 ]t
0
0

= E4 [Y ]4t + 6E2 [Y ]E[Y 2 ]3t + 4E[Y ]E[Y 3 ] + 3E2 [Y 2 ] 2t + E[Y 4 ]t .
+6 E[Y ]

Suppose the insurer does not keep a continuous record of the claims, but
only observes total claim amounts on an annual basis. Then the observable
claims data are xj = Sj Sj1 , j = 1, 2, . . . Assume the intensity is of the
multiplicative form
t = pt a(),

CHAPTER 4. PARAMETER ESTIMATION IN CREDIBILITY

35

where pt is a measure of the size of the risk at time t and a() is the claim
intensity per unit amount of risk. The parameter represents risk characteristics
that are not observable. Also the distribution of the individual claim sizes may
depend on these hidden risk characteristics, and we write G(|). We assume
that this distribution has moments of order 4 and denote its q-th noncentral
moments by
Z
mq () = y q G(dy|).
It is just a matter of change of notation and simple algebra to obtain the
following formulas from those above:
E [xj ] = pj m1 ()a() ,
E [x2j ] = p2j m21 ()a2 () + pj m2 ()a() ,
E [xj xk ] = pj pk m21 ()a2 () , j 6= k ,
E [x3j ] = p3j m31 ()a3 () + 3p2j m2 ()m1 ()a2 () + pj m3 ()a() ,
E [x2j xk ] = p2j pk m31 ()a3 () + pj pk m1 ()m2 ()a2 () , j 6= k ,

E [xj xk xl ] = pj pk pl m31 ()a3 () , j 6= k 6= l ,

E [x4j ] = p4j m41 ()a4 () + 6p3j m2 ()m21 ()a3 ()


+4 p2j m1 ()m3 ()a2 () + 3p2j m22 ()a2 () + pj m4 ()a() ,
E [x3j xk ] = p3j pk m41 ()a4 () + 3p2j pk m2 ()m21 ()a3 ()
+pj pk m3 ()m1 ()a2 () , j 6= k ,

E [x2j x2k ] = p2j p2k m41 ()a4 () + pj p2k + p2j pk m21 ()m2 ()a3 ()
+pj pk m22 ()a2 () , j 6= k ,

E [x2j xk xl ] = p2j pk pl m41 ()a4 () + pj pk pl m2 ()m21 ()a3 () , j 6= k 6= l ,

E [xj xk xl xm ] = pj pk pl pm m41 ()a4 () , j 6= k 6= l 6= m .

Uncertainty about the risk conditions is represented by random risk characteristics , and we take the model described above as the conditional model,
given . Introduce the parameters

1
1
1
4

= E [m1 ()a()] ,


= E [m2 ()a()] , 2 = E m21 ()a2 () ,




= E [m3 ()a()] , 2 = E m2 ()m1 ()a2 () , 3 = E m31 ()a3 () ,




= E [m4 ()a()] , 2 = E m3 ()m1 ()a2 () , 3 = E m22 ()a2 () ,




= E m2 ()m21 ()a3 () , 5 = E m41 ()a4 () .

We easily find the following formulas for the unconditional moments of the xj :
E[xj ] = pj ,

CHAPTER 4. PARAMETER ESTIMATION IN CREDIBILITY

36

E [xj xk ] = pj pk 2 + jk pj 1 ,
E [xj xk xl ] = pj pk pl 3 + (jk pj pl + jl pk pl + kl pj pk ) 2 + jkl pj 1 ,
E [xj xk xl xm ] = pj pk pl pm 5
+ (jk pj pl pm + jl pj pk pm + jm pj pk pl
+ kl pk pj pm + km pk pj pl + lm pl pj pk ) 4
+ (jk lm pj pl + jl km pj pk + jm kl pj pk ) 3
+ (jkl pj pm + jkm pj pl + jlm pj pk + klm pj pk ) 2
+jklm pj 1 .
Consider a portfolio of I independent risks each of which complies with the
model above; to each risk i are associated observable exposures pij and total
claim amounts xij in years j = 1, . . . , Ji , and an unobservable risk characteristic
i . The i , i = 1, . . . , I are independent selections from the same distribution
G. From the available data we are to estimate the interest parameters
= (, 1 , 2 )0 ,
which occur in the linear Bayes estimators of the latent risk premiums bi =
m1 (i )a(i ) (expected claim amount per unit of risk exposure for risk i). It is
convenient to work with the annual loss ratios,
bij = xij /pij ,
which correspond one-to-one with the data xij . The natural estimator of bi
based solely on information from risk i is the loss ratio for the entire claims
record of the risk,
bi =

Ji
X
i=1

xij /

Ji
X

pij =

i=1

Ji
X

pij bij /pi ,

i=1

j = 1, . . . , Ji , where
pi =

Ji
X

pij .

i=1

Suppose the estimation is to be based on the vectors


si = (bi , ..., bij bik , ...)0 ,
where all cross products with 1 j k Ji are included. The expected value
of si is of the form
E[si ] = Ai ,

CHAPTER 4. PARAMETER ESTIMATION IN CREDIBILITY

37

a linear regression in the interest parameter. The coefficient matrix is seen to


be

1
0
0

.
Ai =

0 jk 1/pij 1

The parameter vector appearing in the variance matrices Di = V[si ] is


= (, 1 , 2 , 1 , 2 , 3 , 1 , 2 , 3 , 4 , 5 )0 .
The entries in Di = D(, pi ) are
1
1 2 ,
V[bi ] = E[b2i ] E2 [bi ] = 2 +
pi
h
i
C bi , bik bil

1 X h i
C bij , bik bil
pi j
i
i
h
 h
1 X
pij E bij bik bil E[bij ]E bik bil
=
pi j



1
1 X
1
1
=
pij 3 + jk
2
+ jl
+ kl
pi j
pik
pij
pil

!
1
1
+jkl 2 1 2 + kl
1
pij
pik




1
1
1
2
2 + kl
+ kl
1 2 + kl
1 .
= 3 +
pi
pik
pi pik
pik
=

h
i
C bij bik , bilbim

h
i
h
i h
i
= E bij bik bilbim E bij bik E bilbim

1
1
1
= 5 + jk
+ jl
+ jm
pij
pij
pij

1
1
1
+ km
+ lm
4
+ kl
pik
pik
pil


1
1
1
+ jk lm
+ jl km
+ jm kl
3
pij pil
pij pik
pij pik
!
1
1
1
1
+ jkl 2 + jkm 2 + jlm 2 + klm 2 2
pij
pij
pij
pik



1
1
1
+jklm 3 1 2 + jk
2 + lm 1
1
pij
pij
pil

CHAPTER 4. PARAMETER ESTIMATION IN CREDIBILITY

38

Concerning the choice of in (4.24): A computationally convenient candidate would be to take a() = Ga( , ) (gamma distributed with shape
parameter and inverse scale parameter ) and the claim sizes Y identically
distributed according to Ga( , ), independent of . Under this hypothetical
specification of the model we would have
E[iq ] =

( + q 1)(q)
( + q 1)(q)
q
,
E[Y
]
=
q
q

Thus, all moments are generated by the four basic parameters , , , . To


choose these, one could start by specifying the means and the coefficients of
variation along the lines of the previous section.
Finally, we mention that a simpler choice of the basic statistics would be
si = (bi , b2i , vi )0 ,
where vi is the empirical residual variance (4.8), which in the present case with
q = 1 reduces to
X
pij b2ij pib2i ) .
vi = (ni 1)1 (
j

As an exercise, find the corresponding matrices Ai and Di .

Bibliography
[13], [33], [33], [39].
Additional references:
Bunke, H. and Gladitz, J. (1974). Empirical linear Bayes decision rules
for a sequence of linear models with different regressor matrices. Mathematische Operationsforschung und Statistik 5, 235244.
HUMAK, K.M.S. (1984). Statistische Methoden der Modellbildung III.
AkademieVerlag, Berlin.
Norberg, R. (1982): On optimal parameter estimation in credibiility.
Insurance: Math. & Econ. 1, 73-89.

Chapter 5

Group life insurance


Norberg R.: Experience rating in group life insurance. Scand. Actuarial J.,
1989, 194-224.
Here are some introductory considerations that serve to motivate the Poisson
models encountered in the article:
Consider a group of individuals covered under the same group life insurance
treaty. The basic risk characteristics of the group are its mortality law and its
age distribution, which we assemble in = (, A), where (y) is the mortality
intensity of a y years old member of the group and A(y) is the proportion of
group members aged y or less. These characteristics are assumed to be constant
over time. The size of the group may vary, and we denote the number of insured
at time t by p(t).
At time t there are p(t)dA(y) members in the age interval [y, y + dy), each of
whom dies with probability (y)dt before time t + dt. Therefore, the probability
of a death in the age interval [y, y + dy) during the short time interval [t, t + dt)
is
p(t)dA(y)(y)dt.
(5.1)
Summing over the ages gives the total probability of a death in the time interval
[t, t + dt):
Z )
p(t) dt
(y)dA(y) .
(5.2)
0

The ratio between the probabilities in (5.1) and (5.2),


dG(y|) = R

(y)dA(y)
,
(y 0 )dA(y 0 )

(5.3)

is the conditional probability, given there is a death, that the age of the diseased
is in [y, y + dy). Thus, G(|) so defined is the distribution of the age at death
for members of the group. Note that it is independent of time under the present
assumptions of permanent risk conditions .
39

CHAPTER 5. GROUP LIFE INSURANCE

40

For a large group, where the death or survival of any single individual does
not affect significantly the size and the composition of the group as a whole,
we may adopt the compound Poisson process as an approximation model. In
accordance with the considerations above, we assume that instances of death
incur with intensity p(t)a() at time t, where
Z
(y)dA(y) ,
a() =
0

and that the ages at death are independent replicates of a random variable Y
with distribution G given by (5.3) and independent of the numbers of death.
Suppose the sum insured depends of the age y at death, and call it S(y).
Then, under our assumptions, the process of claims is a compund Poisson
process with claims intensity p(t)a() and individual claim size distribution
GS 1 (s) = P[S(Y ) s]. The total claims in year j (that is, the time period
[j 1, j)) is
Nj
X
Xj =
S(Yjk )
k=1

Rj

where Nj Po(pj a()) with pj = j1 p(t) dt, and the Yjk are independent
selections from G. It follows that the loss ratios
bj = Xj /pj
have expected values

E[bj |] = pj b() ,

where
b() = a()
and variances

S(y)dG(y) ,

Var[bj |] = v()/pj ,

where
v() = a()

S 2 (y)dG(y) .

(5.4)
(5.5)

The risk conditions may vary among groups. We adopt the usual heterogenity model of credibility theory and take to be the outcome of a random
variable , whose distribution represents the collective of groups from which
the current one is selected at random.
The hierarchical extension goes in the usual way. We shall consider hierarchical credibility analyses of 1125 authentic group life contracts that are
classified into occupational classes in a hierarchical manner. There are H = 72
different occupations, and within occupational class h there are Ih individual
groups. To each group (h, i) (group No. i in class h) there are observed exposures phij and loss ratios bhij in years j = 1, . . . , Jhi , say. These are assumed
to obey a hierarchical model with conditional Poisson distributions as above at

CHAPTER 5. GROUP LIFE INSURANCE

41

the upper level. The total number of claims in group


P (h, i) is sufficient and so
we need only look at the total exposures phi = j phij and total loss ratios
bhi = P Xhij /phi = P phij bhij / P phij , These loss ratios possess the same
j
j
j
conditional moment structure as in (5.4) and (5.5), with phi , bhi , and hi in the
roles of pj , bj , and , respectively.

Bibliography
Norberg, R (1987). A note on experience rating of large group life contracts.
Mitteil. Verein. Schweiz. Vers.math. 87, 17-34.

Chapter 6

Bonus systems
6.1

Basic definitions

A. The framework model. We adopt the standard theoretical framework


for individual experience rating in a heterogeneous insurance portfolio, whereby
the unobservable risk characteristics of a randomly chosen generic policy are
represented by a latent random element . The distribution of is denoted by
G, and the density of G with respect to a suitable measure is denoted by g.
Let N (t) be the number of claims covered under the policy up to and including
time t, and let Y1 , Y2 , . . . be the individual claim amounts listed in chronological
order. The total claim amount by time t is
N (t)

X(t) =

Yi .

i=1

For the time being we make no specific assumptions concerning the distribution
of X and , apart from assuming that it is unaffected by the premium system
(absence of bonus hunger).
Coverage and premium payments are on an annual basis (say). Introduce,
for each year j = 1, 2, . . ., the number of claims
Nj = N (j) N (j 1)
and the total claim amount
Xj = X(j) X(j 1) .
The annual premium, which is payable in advance, is currently adjusted and
is made dependent on the past claims experience of the policy. The purpose is
to charge, in each year j, an annual premium that approximates the unknown
individual risk premium,
mj () = E[Xj | ] .
42

(6.1)

CHAPTER 6. BONUS SYSTEMS

43

B. What is bonus? This word is Latin and means good. In the context of
insurance it is used for various forms of dividends that are paid to the policyholders if the insurance scheme has created a systematic surplus. In automobile
insurance it denotes the premium deductible earned on an individual basis by
drivers who report few claims. A widely used class of schemes for calculating
such bonuses are the so-called automobile bonus systems. They take various
forms, but we will only treat what we may call standard bonus systems, which
are commonly used in practice.
C. An example: The Norwegian bonus system. As a typical example of
an automobile bonus system we describe the one used by Norwegian automobile
insurance companies in the 1970-ies:
Each new policy is charged with the same initial premium. Thereafter the
premium is adjusted annually as follows: After a claim-free year the premium is
reduced by 10% of the initial premium, and after a year with claims the premium
is increased by 20% of the initial premium for each claim, but the premium is
kept within the range from 30% (elite bonus) to 150%. There is one exception
from the general pattern: From the premium levels 140% and 150% the policy
advances directly to 120% after a claim-free year.
In Figure 6.1 the unbroken line shows the development of the premium for
a policyholder who files three claims during the period considered, one in each
of the first, third, and sixth year. The broken line shows the same for another
policyholder, who also files three claims, only at later and, obviously, more
favourable points of time. The area between the curves represents the difference
between the total premiums paid by the two policyholders. This is a considerable
amount of money, and to the extent that the times of occurrence are purely
random and only the total number of claims matters, it must be concluded that
there is a substantial element of randomness in the premium charged under the
bonus system. In particular, it is clear that random fluctuation of the premium
will prevail no matter how long the policy has been in force.
D. Formal definition of bonus systems. By a standard bonus system we
will mean an individual experience rating system with the following features:
The policies covered under the scheme are divided into a finite number of
bonus classes numbered from 1 to K, say. A policy stays in one and the same
class throughout a year. There is an initial class into which each new policy
is placed in its first year. Thereafter the policy is reclassified annually in accordance with transition rules that determine the bonus class in any year as a
function of the bonus class and the number of claims in the previous year. These
rules are independent of the age of the policy, and we denote by Tk` the set of
numbers of claims that will carry a policy from class k to class `. The annual
premium is the same for each policy in one and the same class, regardless of
the age of the policy. The class k premium is denoted by (k), and the vector
= ((1), . . . , (K)) is called the bonus scale.
For the transition rules to be consistent, we must require that K
`=1 Tk` =

44

CHAPTER 6. BONUS SYSTEMS


150
140
130
120

110
100

10 11 12 13 14 15 time

90
80
70
60
50

40

30

Figure 6.1: The Norwegian bonus system 1975. Two individual premium paths
corresponding to two different claims histories. Occurrences of claims are indicated with bullets ().
{0, 1, 2, . . .} and that Tk` Tk`0 = for ` 6= `0 . The transition rules can suitably
be represented as a K K matrix T = (Tk` ). The pair
R = (, T)

(6.2)

constitutes the bonus rules. The bonus system is completely specified by


S = (R, ) .

(6.3)

Obviously, the year j class of our generic policy depends on its past claim
numbers in a way that is determined by the rules R, and we denote it by ZR,j ,
j = 1, 2, . . . More specifically,
ZR,1 = ,

(6.4)

and, for j = 2, 3, . . .,
ZR,j = `

if

ZR,j1 = k

and

Nj1 Tk` .

(6.5)

E. Practical design of a bonus system. The purpose of a bonus system,


like any other system for individual experience rating, is to assess the hidden

45

CHAPTER 6. BONUS SYSTEMS

risk characteristics of each individual policy on the basis of its claims experience
and to currently adjust its premium accordingly. Thus, a bonus system should
serve to separate good risks from bad risks by placing them in different classes.
On intuitive grounds this is best achieved by progressively promoting policies
with few claims to low-premium classes and demoting policies with many claims
to high-premium classes. A bonus system is usually designed such that (k) >
(k + 1), and one speaks of class k + 1 as a better class than class k, class K
being the top class. Typically, n > n0 if n Tk` and n0 Tk,`+1 .
In terms of the formal definitions, the Norwegian bonus system described in
Paragraph C above has K = 13 classes, the initial class is = 6, the premium
scale is given by
(k) = (6)(1 +
and the transition rules T are

{1, 2, . . .}
{0}
{1, 2, . . .}
{0}

{1, 2, . . .}
{0}

{2, 3, . . .} {1}

{2, 3, . . .}
{1}

{3, 4, . . .} {2}
{1}

{3, 4, . . .}
{2}

{4, 5, . . .} {3}
{2}

{4, 5, . . .}
{3}

{5, 6, . . .} {4}
{3}

{5, 6, . . .}
{4}

{6, 7, . . .} {5}
{4}
{6, 7, . . .}
{5}

6.2

6k
),
10

(6.6)

{0}
{1}
{2}
{3}
{4}

{0}
{1}
{2}
{3}

{0}

{0}

{0}

{1}
{0}

{1}
{0}

{2}
{1}
{0}

{2}
{1}
{0}
{3}
{2}
{1}
{0}

Optimal design of bonus systems

A. Some preparatory general results. Before turning to discussions of


bonus systems in particular, let us revisit the general problem of estimating
a real-valued latent variable m on the basis of some statistic X. Adopting
quadratic loss, the problem is to find a square integrable function m(X)

that
minimizes the mean squared error (MSE)


(m)
= E (m m)
2 .
X of all such functions is
The Bayes estimator in the class M
m
X = E[m | X] ,

(6.1)

and its MSE (the Bayes risk) is


X

= E V[m | X]

= V[m] V[m
X]
= E[m2 ] E[m
2X ] .

(6.2)
(6.3)
(6.4)

CHAPTER 6. BONUS SYSTEMS

46

These results have been established in Chapter 1 by elementary methods and


X is a closed subspace of the
also by Hilbert space methods using the fact that M
Hilbert space of square integrable r.v.-s. The expression (6.4) follows directly
from the projection theorem, while (6.2) and (6.3) rest on the special structure
of the spaces involved here and add further insight into the nature of the Bayes
solution.
Overall unbiasedness of the estimator (6.1) follows by the rule of iterated
expectations (or the rule of iterated projections):
E[m
X ] = E[m] .
Suppose we want to base the estimation on some summary statistic Z =
Z of square integrable funcf (X), that is, we confine ourselves to the space M
tions of Z. The restricted Bayes solution is, of course, given by (6.1)-(6.4), with
Z in the role of X. The restricted solution is poorer than the unrestricted one,
and the MSE increase incurred by restricting to estimators based on Z is
Z X

= E V[m
X | Z]
2
= E[m
X ] E[m
2Z ]

(6.5)
(6.6)

(Add details using partly general results from the Hilbert space approach and
partly the rule of iterated expectation.) The expression (6.5) shows that X adds
to the information provided by Z to the extent that there is any variation left
in m
X when Z is given.
From (6.3) we see that the quality of the Bayes estimator increases with
its variance. This may seem puzzling to those who are used to thinking that
a good estimator should have small variance. There is no paradox here: An
estimator should be as close as possible to the estimand. If the estimand is a
fixed parameter, then a small variance is good. If the estimand is a random
variable, then the perfect estimator would of course be equal to and, hence,
have the same variance as that random variable. (Needless to say, big variance
is not a good property as such. We can make the variance of an estimator as
big as we want by just adding to it a lot of irrelevant randomness from e.g.
coin-tossing. It is for Bayes estimators of the form (6.1) that a big variance is
X , the
good. In the Hilbert setting we would say that, the bigger the space M
longer the projection m
X .)
From (6.4) we see that the quality of the Bayes estimator (or the amount of
information on m carried by X) could equally well be measured by the squared
norm of the Bayes estimator,
eX = E[m
2X ] ,

(6.7)

which we will call the efficiency of X.


We now turn to discussions of how to design a bonus system, that is, choose
efficient elements in (6.3). For this purpose we need a criterion that ascribes to
each bonus system a real-valued measure of its performance, and it is convenient
to adopt some variation of the MSE criterion, which is mathematically tractable
and well understood.

47

CHAPTER 6. BONUS SYSTEMS

B. Minimizing the MSE in year j. Let us first focus on the performance in


a fixed year j and judge any given bonus system S = (R, ) by its year j MSE,
j (S) = E[mj () (ZR,j )]2 ,

(6.8)

which measures the distance between the year j risk premium defined in (6.1)
and the year j premium delivered by S.
In general it is not possible to find an optimal bonus system that minimizes
j (S). The problem is that the space of estimators in consideration is very big
and lacks nice mathematical structure. Of course, one could find a bonus system
that is almost optimal as the following idea explains: On the one hand, since
ZR,j is a function of N1 , . . . , Nj1 , a bonus premium cannot outperform the estimator m
j = E[mj () | N1 , . . . , Nj1 ], confer Paragraph A above. On the other
hand, by taking the number of states big enough and designing the transition
rules in such a manner that ZR,j tells almost everything about N1 , . . . , Nj1 ,
we could make the bonus premium perform almost as well as m
j . This idea is,
however, of limited interest since a practical bonus system need to be simple
and, in particular, should not have too many classes (with its 13 classes the
Norwegian system is one of the most complicated systems known in practice).
Since the bonus rules must be constrained somehow, let us, as a first step,
keep the rules R fixed and consider the simpler problem of finding the optimal
year j premium scale, R,j . The solution comes directly out of Paragraph A
applied to (6.8), and is the Bayes estimator of mj () based on ZR,j :
R,j (ZR,j ) = E[mj () | ZR,j ] .

(6.9)

Its efficiency is
2
eR,j = E[R,j
(ZR,j )] .

(6.10)

To obtain explicit formulas, we introduce the conditional distribution of


ZR,j , given = ,
pR,j (k|) = P[ZR,j = k | = ] ,

from which we first find the joint distribution of (ZR,j , ),

P[ZR,j = k, d] = pR,j (k|) g() d() ,


then the marginal distribution of ZR,j ,
Z
pR,j (k) = P[ZR,j = k] = pR,j (k|) g() d() ,

(6.11)

and, finally, the conditional density of , given ZR,j ,


g(|ZR,j = k) =

pR,j (k|) g()


.
pR,j (k)

We can now express (6.9) and (6.10) as


R,j (k) = E[mj () | ZR,j = k] =

mj () pR,j (k|) g() d()


,
pR,j (k)

(6.12)

48

CHAPTER 6. BONUS SYSTEMS


and
eR,j =

2
R,j
(k) pR,j (k) .

(6.13)

The next step, which is to find good rules R, cannot so easily be guided by
theoretical clues. What we can do, is to compute the efficiency (6.13) of those
bonus rules we would like to compare and to choose the best. In some special
situations it is possible to say which of two given systems R and R 0 is better on
purely theoretical grounds, e.g. if the class process ZR is a function of the class
process ZR0 as in the following example:


{2, 3, . . .} {1, 0}
= 1,
T=
,
{2, 3, . . .} {1, 0}

{3, . . .}
{2}
{1, 0}
{3, . . .}
{2}
{1} {0}
.
0 = 1,
T0 =
{3, . . .}
{2}
{1} {0}
{2, 3, . . .} {1} {0}
We will be content to state that, usually, one must rely on intuition in the search
for good rules.

C. Minimizing the average MSE over years. By judicious choice of a


typical year j, one could hope that a bonus system that is good in terms the
year j MSE (6.8) is good also on the average over the entire duration of a policy.
This concern could be worked into a more refined optimality criterion, and an
obvious candidate would be a weighted average of the annual MSE-s,
(S) =

wj j (S) ,

(6.14)

j=1

where the weights wj are non-negative numbers summing to 1. Formally, this


criterion can be recast as
(S) = E[mJ () (ZR,J )]2 ,
where J is an integer-valued random variable, which is independent of the claims
process and the latent variable, and is distributed as P[J = j] = wj , j = 1, 2, . . ..
A natural choice of the wj would be the age distribution of the policies in the
portfolio.
The optimal premium scale R is the Bayes estimator of mJ () based on
ZR,J ,
R (ZR,J ) = E[mJ () | ZR,J ] ,

(6.15)

2
eR = E[R,J
(ZR,J )] .

(6.16)

and its efficiency is

49

CHAPTER 6. BONUS SYSTEMS


To obtain explicit formulas, we start from the joint distribution
P[J = j, ZR,J = k, d] = wj pR,j (k|) g() d() ,
and find
P[ZR,J = k] =

wj

pR,j (k|) g() d() =

wj pR,j (k) ,

hence
P[J = j, d | ZR,J = k] =

wj pR,j (k|) g() d()


P
.
j wj pR,j (k)

It follows that the premium in (6.15) is given by


R
P
mj () pR,j (k|) g() d()
j wj
P
R (k) = E[mJ () | ZR,J = k] =
,
j wj pR,j (k)

or, inserting (6.11),

R (k) =

wj pR,j (k) R,j (k)


P
.
j wj pR,j (k)

The efficiency in (6.16) is given by


X X
2
pR,j (k) R,j
(k) .
wj
eR =
j

(6.17)

(6.18)

D. The Markov chain case. To calculate the optimal premiums and their
efficiencies, we need to find the conditional distribution of ZR,j , given = ,
pR,j| = (pR,j (1|), . . . , pR,j (K|)) .
Assume that N1 , N2 , . . . are conditionally independent, given . Then, for
fixed = , the bonus class process {ZR,j }j=1,2,... is a Markov chain with j-th
step direct transition probability matrix


(j)
(j)
PT| = pT| (k, `) ,
(6.19)

where

(j)

pT| (k, `) = P[ZR,j = ` | ZR,j1 = k, = ] = P[Nj1 Tkl | = ] .


(Obviously, the direct transition probabilities depend on the bonus system only
through the transition rules.) The conditional distribution of ZR,j , given = ,
satisfies the recursive relation (Chapman-Kolmogorov)
(j)

pR,j| = pR,j1| PT| ,

(6.20)

50

CHAPTER 6. BONUS SYSTEMS


starting from
pR,1| = 1 ,

the K-vector with 1 in the -th entry and 0 elsewhere.


We cannot get much farther than this without adding more structure to the
model.
E. The homogeneous Markov chain case the asymptotic MSE. Let us
consider the situation where the Nj are i.i.d., given . Then the Markov chain
is homogeneous, that is, the direct transition probabilities are independent of j,
and (6.19) simplifies to

PT| = pT| (k, `) ,
(6.21)
where

pT| (k, `) = P[N1 Tkl | = ] .

Assuming, furthermore, that all states intercommunicate and that the Markov
chain is aperiodic, there exists for each fixed a stationary distribution
pT| = (pT (1|), . . . , pT (K|)) = lim pR,j| .
j

(6.22)

By (6.20) it must be the solution to the equations


pT| = pT| PT| ,
X
pT (k|) = 1 .
k

(Obviously, it is independent of the initial state .)


Suppose, furthermore, that also the claim amounts Yj are i.i.d for given
so that mj () = m(), independent of j. Then a sensible variation of the MSE
criterion is obtained by letting j tend to infinity in Paragraph B above, which
means that we want the bonus system to perform well for old policies that have
reached their stationary state. Thus, our criterion is
(S) = E [m() (ZR, )]2 ,

(6.23)

where the conditional distribution of ZR, for given = is the stationary


distribution in (6.22). We find the optimal scale for fixed transition rules
R
m() pT (k|) g() d()
T (k) =
,
(6.24)
pT (k)
where
pT (k) =
The corresponding efficiency is,
eT =

X
k

pT (k|) g() d() .

2
T
(k) pT (k) .

(6.25)

51

CHAPTER 6. BONUS SYSTEMS

F. Optimal linear bonus scale. The premium scale (6.6) of the Norwegian
bonus is a linear function of the bonus class. If this property is desirable, we
should restrict to bonus scales of the form
(k) = a + bk ,

(6.26)

where a and b are constants. The theory of linear Bayes estimation in Paragraph
1.2.D tells us that, in terms of e.g. year j MSE, the best choice of coefficients
in (6.26) for given bonus rules R is (confer (1.35))
bR,j
aR,j

C[mj (), ZR,j ]


,
V[ZR,j ]
= E[mj ()] bE[ZR,j ] ,
=

(6.27)
(6.28)

and the corresponding efficiency is (confer (1.36) and (1.37))


eR,j =

C2 [mj (), ZR,j ]


+ E2 [mj ()] .
V[ZR,j ]

(6.29)

The moments involved in these formulas are calculated from


Z
E[mj ()] =
mj () g() d() ,
X
k pR,j (k) ,
E[ZR,j ] =
k

E[mj () ZR,j ] =

X Z
k mj () pR,j| (k) g() d() ,
k

2
E[ZR,j
]

k 2 pR,j (k) .

Bibliography
The Bayes decision theory approach to analysis of automobile bonus systems was
first taken in [37], where the asymptotic criterion (6.23) was used. The weighted
annual MSE criterion (6.14) was introduced in [1] and applied to situations
where the conditional distribution of Nj , given , depends on the duration j.
The linear scale was proposed in Gilde and Sundt (1989): On bonus systems
with credibility scales. Scand. Actuarial J..
The reader is encouraged to contemplate an extension of the present theory
to situations where the same bonus rules are applied to policies with different
observable risk characteristics (e.g. mileage) or to several sub-portfolios with
different distributions of claims processes and latent risk characteristics.

Chapter 7

Claims reserving
7.1

Introduction

A. The break-up point of view in solvency control. Consider an insurer


who has been in business since time 0, say, and who is subjected to solvency
control at some subsequent time , henceforth referred to as the present moment. The solvency assessment is based on the break-up scenario, by which
underwriting of new business is assumed to cease at time . Thus, only outstandings for which liability has been assumed prior to time are relevant, and,
since the premiums for this part of the business have already been collected,
the reserve provided at time must be adequate to meet these claims. Possible
excess of premiums over claims in the future cannot be counted as available
for covering currently assumed liabilities (no counting of chickens before they
are hatched). In this perspective it is henceforth understood that all quantities
introduced are related to coverages secured by contracts in force prior to or at
time . They could properly be equipped with an index , but this is omitted
to save notation.
B. Data from policy records. For the time being focus will be on one single
line of business, a mass branch like non-industrial fire or automobile insurance.
The portfolio is made up of many small risk units so that a single claim has no
significant impact on the size and the composition of the portfolio as a whole
and, therefore, need not be related to the individual risk from which it stems.
Thus, the macro viewpoint of the collective theory of risk is adopted, and it
is assumed that the information provided by the policy records at time t is
adequately summarized by
w(t), the risk exposure per time unit at time t.
The exposure rate w(t) may be a thought of as a simple measure of volume or
size of the business at time t, but it could be allowed to depend on covariates
describing the composition of the portfolio. In the break-up context the function
52

53

CHAPTER 7. CLAIMS RESERVING

w will typically look as in Figure 7.1. Future exposure after time pertains to
contracts that are currently in force. In practice they will expire in finite time
so that w(t) = 0 for t large enough.
6

- t

Figure 7.1: Exposure of business written up to time

6
Y

Y (v 0 )

T +U

T + U + v0

- t

T +U +V

Figure 7.2: Occurrence and development of a claim

C. Data from claims records. The claims statistics is a file of records, one
for each individual claim. Formally, a claim is a pair C = (T, Z), where T is
the time of occurrence of the claim and Z is the so-called mark describing its
development from the time of occurrence until the time of final settlement. The
mark is taken to be of the form Z = (U, V, Y, {Y 0 (v 0 ); 0 v 0 V }), where
U is the waiting time from occurrence until notification, V is the waiting time
from notification until final settlement, Y is the eventual total claim amount,
and Y 0 (v 0 ) is the amount paid within v 0 time units after the notification, hence
Y = Y 0 (V ). Henceforth we write Y 0 = {Y 0 (v 0 ); 0 v 0 V } in short. A typical
claims history, as described by these quantities, is depicted in Figure 7.2.
We will primarily have the situation above in mind, but note that other descriptions of the claim history are possible and that the mark might be a complex
entity comprising any piece of information appearing in the claim record. The

54

CHAPTER 7. CLAIMS RESERVING

space of all possible claim outcomes is C = T Z, where T = [0, ) is the time


axis and Z is the assembly of all possible developments.

7.2

The claims process.

A. The probability model. The claims process of an insurance business is


a random collection of points in the claim space, {(Ti , Zi )}i=1,...,N , N ,
the index i indicating chronological order so that 0 < T1 < T2 < . . . It is
assumed that the times Ti are generated by an inhomogeneous Poisson process
with intensity w(t) at time t > 0 and that the marks are of the form Zi =
ZTi , where {Zt }t>0 is a family of random elements in Z that are mutually
independent and independent of the Poisson process, and Zt PZ|t . We then
speak about a marked Poisson process with intensity {w(t)}t>0 and positiondependent marking by {PZ|t }t>0 , and write
{(Ti , Zi )}i=1,...,N Po(w(t), PZ|t ; t > 0).
Introduce the total risk exposure
W =

w(t)dt .

(7.1)

We will assume throughout that W < , having in mind the liabilities of an


insurance company in respectRof (the finite) business written up to the present.
s
The case with W = and 0 w(t) dt < for all finite s is treated by just
chaining together independent models for disjoint time intervals with finite exposure.
We will use the standard notation Pf 1 for the probability distribution
induced by a mapping f , that is, Pf 1 {B} = P{; f () B} = P{f 1 (B)}.
B. The compound Poisson distribution. In the following Po(W ) denotes
the Poisson distribution with parameter W and Po(W, PY ) denotes the compound Poisson distribution with frequency parameter W and claim size distribution PY . Thus, by N Po(W ) is meant that the random variable N has
elementary probability function given by
P[N = n] =

W n W
e
,
n!

(7.2)

PN
n = 0, 1, ..., and by X Po(W, PY ) is meant that X is of the form X = i=1 Yi ,
where N Po(W ) and N is independent of Y1 , Y2 , . . ., which are independent
selections from the distribution PY . We have
P[X x] =

X
W n W n
e
PY (x) ,
n!
n=0

(7.3)

55

CHAPTER 7. CLAIMS RESERVING

where the topscript n signifies n-th convolution. The first three central moments of X are
Z
(k)
mX = W y k PY (dy) , k = 1, 2, 3,
(7.4)
provided that the first three moments of PY exist.
There are at least two reasons why the generalized Poisson distribution plays
an important role in risk theory. First, it is widely held to be a reasonable
description of claims generated by a large and fairly homogeneous portfolio of
risks and, second, it is computationally feasible. In particular, its moments
are given by simple explicit formulas, confer (7.4), and there exist a number of
techniques for computing tail probabilities and fractiles in such distributions,
see e.g. [7].
C. Alternative construction of the process. We set out by proving a basic
representation result.
Theorem 1. The marked Poisson process {(Ti , Zi )}i=1,...,N can be constructed
in two steps: First, generate
N Po(W )

(7.5)

and, second, select a random sample of N pairs from the distribution P T Z on C


given by
PT Z (dt, dz) =

w(t)dt
PZ|t (dz) ,
W

(7.6)

(t, z) C, and order them by the chronology of the occurrences.


Remark: The result has intuitive appeal: The total exposure W determines the
number of claims. The (unordered) occurrence times are independent selections
from the distribution obtained by norming the intensity to a probability density.
For given times of occurrence, the corresponding marks are independent, and
each individual mark follows the mark distribution at the time of its occurrence.
In particular, if w(t) = w (a positive constant) for t [0, ] and w(t) = 0 for
t > , then the occurrence times are uniformly distributed over [0, ].
Proof: The proof amounts to inspecting the joint probability distribution of the
claims, recast as
P {N = n, (Ti , Zi ) (dti , dzi ), i = 1, . . . , n}
=e

= e
=

t1
0

w(s)ds

w(t1 ) dt1 PZ|t1 (dz1 ) e


n
Y

w(t)dt
0
w(ti )dti PZ|ti (dzi )
i=1
n
Y

W n W
e
n!
n!

i=1

PT Z (dti , dzi ) ,

tn
0

(7.7)
w(s)ds

w(tn ) dtn PZ|tn (dzn )e

tn

w(s)ds

(7.8)
(7.9)

56

CHAPTER 7. CLAIMS RESERVING


and recalling (7.2). 
We easily obtain a useful consequence:

Corollary 1 to Theorem 1. Let f be a real-valued function defined on C and


define the random variable
Xf =

N
X

f (Ti , Zi ) .

(7.10)

i=1

Then Xf P o(W, PT Z f 1 ), and (provided they exist) the first three central
moments of Xf are
Z
Z
(k)
mXf = w(t) f (t, z)k PZ|t (dz) dt , k = 1, 2, 3.
(7.11)
Proof: The distribution result follows from the fact that the sum on the right of
(7.10) does not depend on the chronological order of the claims (f is independent of i) and therefore is distributed as the sum of N replicates of f (T, Z) that
are mutually independent and independent of N . Then (7.11) is just a standard
result about the compound Poisson law applied to (7.5) and (7.6). 
The probability distribution of Xf in (7.10) may be computed by standard
methods for numerical evaluation of total claims distributions.
Note the linearity property
Xf1 + + Xfk = Xf1 +...+fk .

(7.12)

Corollary 2 to Theorem 1. Let f 0 and f 00 be real-valued functions on C and Xf 0


and Xf 00 the corresponding compound Poisson variates defined in accordance
with (7.10). Then
Z
Z
(7.13)
Cov(Xf 0 , Xf 00 ) = w(t) f 0 (t, z)f 00 (t, z) PZ|t (dz) dt .
Proof: Write
Cov(Xf 0 , Xf 00 ) =

1
(Var(Xf 0 + Xf 00 ) Var(Xf 0 Xf 00 )) ,
4

and use the linearity property (7.12) together with (7.11) and the identity

1
(f 0 + f 00 )2 (f 0 f 00 )2 = f 0 f 00 . 
4
D. A general decomposition result and some complements. Let C g ,
g = 1, 2, . . . , h ( ), be a partition of the claim space, that is, hg=1 C g = C

57

CHAPTER 7. CLAIMS RESERVING


0

00

and C g C g = if g 0 6= g 00 . Introduce
Ztg = {z Z; (t, z) C g } ,
the set of developments that make a claim occurred at time t a g-claim (belonging to C g ), and


T g = t T ; PZ|t {Ztg } > 0 ,

the time period (or more general era) where such claims can occur. The process
of g-claims is denoted {(Tig , Zig )}1iN g , g = 1, . . . , h, where the times Tig are
listed in chronological order.

Theorem 2. Given a partition C g , g = 1, 2, . . . of the claim space, the corresponding component g-claims processes are independent, and
g
{(Tig , Zig )}i=1,...,N g P o(wg (t), PZ|t
; t > 0) ,

with
wg (t) = w(t) PZ|t {Ztg } ,

(7.14)

PZ|t (dz)
1 g (z) .
PZ|t {Ztg } Zt

g
PZ|t
(dz) =

(7.15)

Proof: Consider first the case with finite h, and look back at the proof of
Theorem 1. First, state the event appearing in (7.7) in terms of the component
processes to rewrite the probability as
(
)
\
g
g
g
g
g
g
g
P
{N = n , (Ti , Zi ) (dti , dzi ) , i = 1, . . . , n } .
g

Second, use the fact that


of (7.14) and (7.15) as
h
Y

PZ|t {Ztg } = 1 for each t to rewrite (7.8) in terms

w g (t)dt

g=1

n
Y

(tgi ) dtgi

i=1

g
g
PZ|t
g (dzi )
i

The product form of this expression together with (7.8) already proves the result.
We will, however, explicate the argument a little bit further. Recast each factor
in the product above in the same way as in (7.9) to arrive at
!
ng
Y (W g )ng
Y
g
g
g
W g g
e
n !
PT Z (dti , dzi ) ,
(7.16)
ng !
g
i=1
where
Wg

wg (t) dt ,

(7.17)

CHAPTER 7. CLAIMS RESERVING

58

with wg (t) defined by (7.14), and


wg (t) dt g
PZ|t (dz) , z Ztg .
(7.18)
Wg
Now the result follows by Theorem 1.
For h = , consider any finite set of categories, g1 , . . . , gq , and lump all the
remaining categories into one category g0 , say. The result for the finite case
applies to these q + 1 categories, and it follows that the q component processes
are independent marked Poisson processes as specified in (7.14) (7.15). Since
the probability measure is determined by the finitedimensional distributions,
the result follows. 
PTg Z (dt, dz) =

The result says that g-claims occur with an intensity which is the claim
intensity times the probability that the claim belongs to the category g, and
that the development of the claim is governed by the conditional distribution
of the mark, given that it is a g-claim. Accordingly, the quantity W g in (7.17)
may be termed the total exposure in respect of claims of category g or just the
g-exposure. Observe that PTg Z in (7.18) is the conditional distribution of (T, Z),
given that it is a g-claim:
PTg Z (dt, dz) =

PT Z (dt, dz)
1C g (t, z) .
W g /W

Theorem 2 may be seen as a general result on so-called thinning of Poisson


processes, which in its simplest form amounts to throwing out a certain proportion of the occurrences by some coin-tossing mechanism independent of the
process itself. See e.g. [30].
The following result is a direct consequence of Theorem 2 and previous
results:
Corollary 1 to Theorem 2. Let C g , g = 1, 2, . . ., be a partition of C and, for
each g = 1, 2, . . . let f g be a real-valued function on C g . Then the variates
g

X =

N
X

f g (Tig , Zig ) ,

i=1

g = 1, 2, . . ., are mutually independent and each X g is a compound Poisson


variate; X g Po(W g , P g (f g )1 ).
From (7.11) we obtain that the first three central moments of X g are
Z
Z
(k)
g
g
mX g =
w (t) f g (t, z)k PZ|t
(dz) dt
Z
Z
f g (t, z)k PZ|t (dz) dt ,
(7.19)
=
w(t)
Tg

Ztg

k = 1, 2, 3. The second equality here follows by substitution of the expressions


in (7.14) (7.15) or,Palternatively, directly from (7.11) for the original process
upon writing X g = i f g (Ti , Zi )1C g (Ti , Zi ).

59

CHAPTER 7. CLAIMS RESERVING

The following reformulation of Corollary 1 presents an interest of its own:


Corollary 2 to Theorem 2. Let f g , g = 1, 2, . . . be a sequence of real-valued
0
00
functions on C satisfying f g (t, z)f g (t, z) = 0 for g 0 6= g 00 . Then the corresponding compound Poisson variates Xf g , g = 1, 2, . . ., are mutually independent.
Remark: By Corollary 2 to Theorem 1, we knew beforehand that the Xf g are
uncorrelated.
Proof: Define C g = {(t, z); f g (t, z) 6= 0}, g = 1, 2, . . ., and note that these sets
together with C 0 = {(t, z); f g (t, z) = 0, g = 1, 2, . . .} form a partition of C. The
result follows upon rewriting each Xf g as
g

Xf g =

N
X

f g (Tig , Zig )

i=1

and invoking Theorem 2 and Corollary 1 to Theorem 1. 


Before proceeding, we present a small auxiliary lemma whose proof is obvious.
Lemma. Suppose {(Ti , Zi )}i=1,...,N P o(w(t), PZ |t ; t > 0), a marked Poisson process on T Z . Let be a function defined on Z and with values in
Z, and denote the transformed marks by Zi = (Zi ). Then
{(Ti , Zi )}i=1,...,N P o(w(t), PZ|t ; t > 0) ,
a marked Poisson process on T Z, with
PZ|t = PZ |t 1 .

(7.20)

A standard result, known as the amalgamation theorem for compound Poisson claims processes, goes as follows: Let X g , g = 1, . . . , h (< ), be independent compound Poisson processes, that is, each X g is of the form X g (t) =
PN g (t) g
g
j=1 Yj , where N is a homogeneous Poisson process with claim intensity
g
w , and the individual claims amounts Yjg are independent selections from a
claimP
size distribution P g and, moreover, independent of N g . Then thePprocess
h
h
X = g=1 X g is a compound Poisson process with claim intensity w = g=1 wg
P
h
and claim size distribution P = w 1 g=1 wg P g . This generalizes to the following, which appears as a partial converse of the decomposition Theorem 2,
but really is implied by it:
g
; t > 0), g = 1, . . . , h,
Theorem 3. Suppose {(Tig , Zig )}i=1,...,N g P o(wg (t), PZ|t
are a finite number of mutually independent marked Poisson processes on T Z.
Then the amalgamated process {(Ti , Zi )}i=1,...,N , obtained by assembling the
claims of the individual processes, is also a marked Poisson process on T Z,

60

CHAPTER 7. CLAIMS RESERVING


and {(Ti , Zi )}i=1,...,N Po(w(t), PZ|t ; t > 0), with
w(t) =

h
X

wg (t) ,

(7.21)

g=1

PZ|t (dz) =

h
1 X g
g
w (t)PZ|t
(dz) .
w(t) g=1

(7.22)

Remark: The claimed result is precisely what one would expect. The property
of independent partitions carries over from the individual processes to the
amalgamated one and suggests the Poisson property of the occurrences of the
latter. Furthermore, (7.21) says that the total probability of a claim occurrence
in a small time interval is the sum of the corresponding probabilities for the
individual processes, and (7.22) states that a claim occurred at time t is from
the g-th individual process with probability w g (t)/w(t), in which case the mark
is generated by the mark distribution of that process. 
Proof: Anticipating the result, start from a marked Poisson process
{(Ti , Zi )}i=1,...,N P o(w(t), PZ |t ; t > 0)

(7.23)

on C = T Z , where Z = {1, . . . , h} Z and


PZ |t (g, dz) =

wg (t) g
P (dz) .
w(t) Z|t

(7.24)

The generic mark of this process is Z = (G, Z), the original mark augmented
with an index for type of claim. It is seen from (7.24) that a claim occurred at
time t is of type G = g with probability w g (t)/w(t) and, given this, the Z-part
g
of the mark is generated from PZ|t
.
Now, on the one hand, applying Theorem 2 to the decomposition of C
by claim type, C g = {(t0 , g 0 , z 0 ); g 0 = g}, g = 1, . . . , h, we readily find that the
component processes have the distribution properties of the individual processes
as specified in the assumptions of the present theorem, and so we can as well
let the latter be constructed as the component processes in the present model
(7.23) (7.24).
On the other hand, it is realized that in the present model the amalgamated
process is obtained from {(Ti , Zi )}i=1,...,N upon leaving the type G unobserved
or, in the terms of the lemma above, considering the process with marks transformed by (g, z) = z. Under this simple mapping the probability distribution
in (7.20) is just the marginal distribution of Z in the distribution of (G, Z)
given by (7.24), which is precisely the one defined in (7.22). Thus, the lemma
completes the proof. 

CHAPTER 7. CLAIMS RESERVING

61

We round off this paragraph with an alternative proof of Corollary 2 to Theorem 1. It makes use of the decomposition theorem and, moreover, serves to
demonstrate a useful technique:
Second Proof of Corollary 2 to Theorem 1: Suppose the result holds for indicator functions f 0 and f 00 . Then, by the bilinearity of the covariance operator,
it also holds for linear combinations of indicator functions. Since every (measurable) non-negative function is the monotone limit of linear combinations of
simple functions, the result extends to non-negative functions f 0 and f 00 by the
monotone convergence theorem. Finally, since any function is the difference of
its positive and negative parts, the result extends to general functions.
Thus, it suffices to prove the result for f 0 = 1C 0 and f 00 = 1C 00 , where C 0 and
00
C are subsets of C. Since f 0 and f 00 are binary, the functions f 0 f 00 , f 0 (1 f 00 ),
and f 00 (1 f 0 ) satisfy the orthogonality condition in Corollary 2 to Theorem
2, hence the corresponding compound Poisson variates are independent. By
the linearity property (7.12), Xf 0 = Xf 0 f 00 + Xf 0 (1f 00 ) and Xf 00 = Xf 0 f 00 +
Xf 00 (1f 0 ) . These things together imply that
Z
Z
Cov(Xf 0 , Xf 00 ) = Var(Xf 0 f 00 ) = w(t) (f 0 (t, z)f 00 (t, z))2 PZ|t (dz) dt ,
where we have made use of (7.11). Now, since f 0 f 00 is binary, it equals its square,
and we arrive at (7.13). 
Results like Corollary 2 to Theorem 1 are valid for more general marked
pointed processes, see e.g. [30]. The present proofs are worth reporting since
they are simple thanks to the fact that everything is independent of everything
else in the Poisson scenario.

7.3

Applications

A. Notation. In the following we will frequently use notation pertaining to


the situation where (U, V, Y ) belonging to a claim occurred at time t has a joint
density pU V Y |t (u, v, y) with respect to Lebesgue measure. In a selfexplaining
way we denote marginal densities by e.g. pY |t (y) and conditional densities by
e.g. pU |t,y (u). If PZ|t is independent of t, we will speak about time-independent
marks and drop t from the subscript.
B. Decomposition by claim amount; franchise and reinsurance. Fix
0 = m0 < m1 < . . . and, for each g = 1, 2, . . ., let
C g = {(t, z); mg1 < y mg }
be the set of claims with amount in the interval (mg1 , mg ]. The corresponding
component processes are independent, with intensities and mark distributions

62

CHAPTER 7. CLAIMS RESERVING


given by
wg (t) = w(t)

mg
mg1

pY |t (y) dy ,

PZ|t (dz)
g
PZ|t
(dz) = R mg
1(mg1 ,mg ] (y) .
p (y) dy
mg1 Y |t

In particular, for a fixed m, let the two sets C s = {(t, z); y m} and
C = {(t, z); y > m} decompose the business into small and large claims. We
may interpret m as the deductible part by minimum franchise or first risk in
the context of direct insurance or as the retention level in the context of excess
of loss reinsurance.
Pursuing the latter interpretation, consider a reinsurance treaty under which
the cedant and the reinsurer cover f 0 (t, z) and f 00 (t, z), respectively, of a claim
occurred at time t and with mark z. The covariance between their total losses
is given by (7.13), and their means and variances are given in Corollary 1 to
Theorem 1.
For instance, for quota share reinsurance we have f 0 (t, z) = ky and f 0 (t, z) =
(1 k)y so that, with time-independent marks, the covariance is simply
`

W k(1 k) E[Y 2 ] .
For excess of loss reinsurance we have f 0 (t, z) = min(y, m) and f 00 (t, z) =
max(y m, 0) and, since the product of these functions is 1(m,) m (y m),
the covariance is
Z
(1 PY (y)) dy .
W m E[1(m,) (Y m)] = W m
m

The results carry over to business in respect of limited periods of exposure


by just letting the integral with respect to t range over a suitable period of time.
This aspect comes up next.
C. Decomposition by year of occurrence. As accounts are typically kept
on an annual basis, we will now decompose by year, and take calendar year j to
mean the time interval (j 1, j]. The cohort of claims occurred in year j is
C j = {(t, z); j 1 < t j} .
The total claim amount in respect of such claims is a compound Poisson variate
with frequency parameter
Z j
W j =
w(t) dt
j1

and claim size density


pj
Y (y) =

1
W j

j
j1

w(t)pY |t (y) dt .

63

CHAPTER 7. CLAIMS RESERVING

In particular, in the homogeneous case with time-independent marks and constant Poisson intensity, w(t) = w, we have
W j = w, pj
Y = pY .

(7.25)

Decomposition by cohort pertains to reinsurance on the basis of underwriting


year. Under a contract specifying that the reinsurer covers f (y) of any claim
of size y occurring in year j, the reinsured part of the total claim amount is
distributed in accordance with Po(W j , PYj f 1 ).
D. Decomposition by year of notification. Next, consider claims reported
in year j,
C j = {(t, z); 0 < t j, j 1 t < u j t] .
If claims are settled immediately upon notification, this decomposition pertains
to reinsurance on the basis of accounting year.
The total claim amount in respect of these claims is a compound Poisson
variate with frequency parameter
W j =

w(t)
0

jt
j1t

pU |t (u) du dt

and claim size density


1
W j

pj
Y (y) =

w(t)
0

jt
j1t

pU Y |t (u, y) du dt .

Interchanging the order of the integrations in the expression for W j above,


we find
Z ju
Z j
j
W =
pU |t (u)
w(t) dt du .
0

Similarly we recast

pj
Y

pj
Y (y) =

max(0,j1u)

as

1
W j

j
0

pU Y |t (u, y)

ju

w(t) dt du .
max(0,j1u)

Consider again the homogeneous case with time-independent marks and constant Poisson intensity, w. Letting j increase, the expressions above tend to
W j = w,

pj
Y (y) = pY (y) .

Comparing with (7.25) we conclude, loosely speaking, that for a stationary


insurance business the liability in respect of occurrence year is the same as
the liability in respect of accounting year. This conclusion carries over to the
reinsurance businesses that motivated the two types of decomposition.


E. Decomposition by year of occurrence and year of notification. The
set of claims occurred in year j and reported in year j + d is
\mathcal{C}^{jd} = \{(t, z);\ j - 1 < t \le j,\ j + d - 1 - t < u \le j + d - t\} .
The total claim amount in respect of such claims is a compound Poisson variate
with frequency parameter
W^{jd} = \int_{j-1}^{j} w(t) \int_{j+d-1-t}^{j+d-t} p_{U|t}(u)\, du\, dt
and claim size density
p^{jd}_Y(y) = \frac{1}{W^{jd}} \int_{j-1}^{j} w(t) \int_{j+d-1-t}^{j+d-t} p_{UY|t}(u, y)\, du\, dt .

Note that, even if P_{Z|t} should be independent of t, p^{jd}_Y may vary with j for
fixed d due to possible variations in the shape of the intensity w(t) from one
year to another. This effect has been studied in [23].
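To make the run-off decomposition concrete, the sketch below (illustrative only) computes the expected claim counts W^{jd} by numerical double integration for an assumed seasonal intensity w(t) and an assumed exponential delay density; both parametric forms are assumptions, not taken from the notes.

```python
# Sketch: run-off frequency parameters W^{jd} by numerical double integration,
# under assumed parametric forms for w(t) and the delay density p_U.
import numpy as np
from scipy import integrate

def w(t):                       # assumed portfolio claim intensity (seasonal)
    return 100.0 * (1.0 + 0.2 * np.sin(2 * np.pi * t))

def p_U(u, rate=1.5):           # assumed reporting-delay density (exponential)
    return rate * np.exp(-rate * u)

def W_jd(j, d):
    """Expected number of claims occurred in year j and reported in year j+d."""
    inner = lambda t: integrate.quad(p_U, j + d - 1 - t, j + d - t)[0]
    val, _ = integrate.quad(lambda t: w(t) * inner(t), j - 1, j)
    return val

triangle = np.array([[W_jd(j, d) for d in range(4)] for j in range(1, 4)])
print(np.round(triangle, 1))    # rows: occurrence year j, columns: delay d
```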
F. Connection to the discrete time model. The model in [35] is a discrete
time rudiment of the present one. It assumes that claims are settled immediately
upon notification (or rather in the same year). An issue in that set-up was how
to specify the distribution of the size of a claim that occurs in year j and is
reported d years later. Leaving the possible dependence on j aside, we need to
specify claim size distributions P^d_Y for delay times d = 1, 2, \ldots. It appears that
in the discrete time set-up we would have to specify these distributions directly,
possibly starting from some standard parametric claim size distribution and
letting the parameters be some parametric functions of d.
The continuous time model creates another and, from an aesthetic viewpoint,
more pleasing possibility: A parametric specification of the continuous time
model, which may be supported by physical reasoning, will automatically induce
a parametrization of the discrete time model. We will illustrate this by a simple
example.
Let Ga(\alpha, \beta) denote the gamma distribution with shape parameter \alpha and
inverse scale parameter \beta, both positive, which has density
ga(y; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-\beta y}\, 1_{(0,\infty)}(y) .
Assume that the joint distribution of (U, Y) is such that
p_Y(y) = ga(y; \alpha, \beta)
and
p_{U|y}(u) = ga(u; 1, y) = y\, e^{-yu}\, 1_{(0,\infty)}(u)                     (7.26)


(an exponential distribution), implying that large claims tend to be reported
more promptly than small claims. We easily find that the marginal density of
U is
p_U(u) = \frac{\alpha \beta^\alpha}{(u + \beta)^{\alpha+1}}\, 1_{(0,\infty)}(u) ,
that is, U + \beta is a Pareto variate. (It is seen that E[U^k] < \infty for -1 < k < \alpha.)
Some easy calculations lead to the following expression for the distribution
of Y for a jd-claim as defined in the previous paragraph (by assumption it does
not depend on j):
p^d_Y(y) = \frac{2\delta_d\, \varphi_d(y) - \delta_{d+1}\, \varphi_{d+1}(y) - \delta_{d-1}\, \varphi_{d-1}(y)}{2\delta_d - \delta_{d+1} - \delta_{d-1}} ,
where
\delta_d = (d + \beta)^{-(\alpha-1)} ,\qquad \varphi_d(y) = ga(y; \alpha - 1, d + \beta) .
Thus, we end up with a mixture of gamma distributions, which is mathematically
tractable.
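As a quick numerical check of this tractability, the sketch below (illustrative only; it assumes the parametrization delta_d = (d + beta)^{-(alpha-1)} and phi_d = ga(.; alpha-1, d+beta) reconstructed above, and arbitrary parameter values) evaluates the delay-d claim size density and verifies that it integrates to one.

```python
# Sketch: evaluate the gamma-mixture delay-d claim size density and check total mass.
import numpy as np
from scipy import stats, integrate

alpha, beta = 2.5, 1.0           # illustrative parameter values

def delta(d):
    return (d + beta) ** (-(alpha - 1.0))

def phi(y, d):                   # ga(y; alpha-1, rate d+beta)
    return stats.gamma.pdf(y, a=alpha - 1.0, scale=1.0 / (d + beta))

def p_dY(y, d):
    num = 2 * delta(d) * phi(y, d) - delta(d + 1) * phi(y, d + 1) - delta(d - 1) * phi(y, d - 1)
    den = 2 * delta(d) - delta(d + 1) - delta(d - 1)
    return num / den

for d in (1, 2, 5):
    mass, _ = integrate.quad(p_dY, 0, np.inf, args=(d,))
    print(d, round(mass, 6))     # should print 1.0 for each delay d
```

Note that the total mass equals one by construction, whatever the delta_d, since the mixture weights in the numerator are normalized by the same combination in the denominator.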


G. Inflation and discounting. As a final example of the applicability of the
general theory, suppose the insurer currently invests (or borrows) at a fixed rate
of interest \delta. Then, taking our stand at a given time \tau, it may be relevant to
consider the value of the claims payments in [0, \tau] accumulated with compound
interest,
X^a = \sum_i f(T_i, Z_i) ,
where
f(T, Z) = 1_{[0,\tau]}(T + U) \int_0^{\tau - T - U} \exp\{\delta(\tau - T - U - v')\}\, dY'(v') .
Again we can conclude that X^a is a compound Poisson variate, which in principle
is simple. The claim size distribution may in this case be a bit complicated,
though, but it could be simulated in any case.
Inflation at rate \gamma can be accommodated in the model e.g. by letting
P_{Y'|t,u,v,y} be the distribution of Y'(v') = \int_{[0,v']} \exp\{\gamma(t + u + v'')\}\, d\tilde{Y}(v''),
0 \le v' \le v, where \tilde{Y} is a process with some distribution P_{\tilde{Y}|u,v,y}, independent
of t.

7.4 Prediction of outstanding claims

A. Four categories of claims. Taking our stand at the present time \tau, the
claims may be categorized as settled (s), reported-not-settled (rns), incurred-not-reported
(inr), or covered-not-incurred (cni), defined precisely by
\mathcal{C}^s    = \{(t, z);\ t + u + v \le \tau\} ,                                 (7.27)
\mathcal{C}^{rns} = \{(t, z);\ t + u \le \tau < t + u + v\} ,                        (7.28)
\mathcal{C}^{inr} = \{(t, z);\ t \le \tau < t + u\} ,                                (7.29)
\mathcal{C}^{cni} = \{(t, z);\ t > \tau\} .                                          (7.30)

The acronyms inr and rns are shorthand for the commonly used ibnr and
rbns, the redundant "but" being dropped in "incurred but not reported" and
"reported but not settled". The term cni represents claims related to what is
usually called the unearned premium reserve.
In accordance with the partition (7.27)-(7.30) the claims process decomposes
into the component processes \{(T^g_i, Z^g_i)\}_{1 \le i \le N^g}, g = s, rns, inr, cni,
which, by Theorem 2, are independent marked Poisson processes.
The total liability X of the company decomposes accordingly into
X = X^s + X^{rns} + X^{inr} + X^{cni} ,                                           (7.31)
where
X^g = \sum_{1 \le i \le N^g} Y^g_i = \sum_i 1_{\{(T_i, Z_i) \in \mathcal{C}^g\}}\, Y_i ,   (7.32)
g = s, rns, inr, cni.


We are to assess the total liability of the company. The s-part is paid and
done with. The rns-liability splits into a paid part,
X^{prns} = \sum_i Y'_i(\tau - T_i - U_i)\, 1_{\{T_i + U_i \le \tau\}} ,
and an outstanding part,
X^{orns} = \sum_i \bigl(Y_i - Y'_i(\tau - T_i - U_i)\bigr)\, 1_{\{T_i + U_i \le \tau\}} .      (7.33)
Finally, the inr- and cni-parts are outstanding. These are conveniently lumped
into
X^{nr} = X^{inr} + X^{cni} = \sum_i Y_i\, 1_{\{T_i + U_i > \tau\}} ,                           (7.34)
the liability in respect of not reported (nr) claims defined by
\mathcal{C}^{nr} = \mathcal{C}^{inr} \cup \mathcal{C}^{cni} = \{(t, z);\ t + u > \tau\} .       (7.35)

The total outstanding liability is
X^o = X^{orns} + X^{nr} .                                                          (7.36)
A reserve must currently be provided to meet this liability. Thus, a
major issue is to predict (7.36) from available data.
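To fix ideas, the small simulation sketch below (all distributions and parameter values are illustrative assumptions) classifies claims by (7.27)-(7.30) at a valuation time tau and splits the total liability as in (7.31)-(7.32).

```python
# Sketch: classify simulated claims into s / rns / inr / cni at valuation time tau
# and split the total liability accordingly.  All inputs are illustrative.
import numpy as np

rng = np.random.default_rng(2)
tau, n = 5.0, 10_000
t = rng.uniform(0.0, 7.0, n)          # occurrence times (t > tau gives cni claims)
u = rng.exponential(0.5, n)           # reporting delays
v = rng.exponential(1.0, n)           # settlement delays after reporting
y = rng.lognormal(0.0, 1.0, n)        # ultimate claim amounts

cat = np.where(t > tau, "cni",
      np.where(t + u > tau, "inr",
      np.where(t + u + v > tau, "rns", "s")))

for g in ("s", "rns", "inr", "cni"):
    sel = cat == g
    print(f"{g:4s}  N = {sel.sum():5d}   X^{g} = {y[sel].sum():10.1f}")
```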


B. The prediction problem. Let \mathcal{F}_\tau denote the statistical information
available by time \tau. It consists of the histories up to time \tau of all claims that
have been reported (r) by that time, that is, are in
\mathcal{C}^r = \mathcal{C}^s \cup \mathcal{C}^{rns} = \{(t, z);\ t + u \le \tau\} .           (7.37)
The problem of providing an adequate reserve amounts to predicting X^o in
(7.36) on the basis of its conditional distribution, given \mathcal{F}_\tau.
C. Predicting the orns-liability. Consider first prediction of the term X^{orns}
on the right of (7.36). Since the rns-claims are conditionally independent, given
\mathcal{F}_\tau, the predictive distribution of X^{orns} is simply the convolution of the
conditional distributions of the individual terms in (7.33).
Thus, let us consider one individual rns-claim (T, Z), T + U \le \tau < T + U + V,
dropping now the subscript i to save notation. The relevant information consists
of the observed past history of the claim, comprising the time of occurrence, T,
the waiting time until report, U, the fact that the claim is not yet settled,
V > \tau - T - U, and the partial payment process up to now, \{Y'(v);\ 0 \le
v \le \tau - T - U\}. We are to predict, on the basis of this information, the
outstanding liability Y - Y'(\tau - T - U) or, equivalently (and conveniently),
the total liability Y. There may be reasons to utilize the available information
only in part, and typically we would include only a finite set of partial payments
Y'(v_1), \ldots, Y'(v_k), 0 \le v_1 < \cdots < v_k = \tau - T - U, thus allowing ourselves to
work only with finite-dimensional distributions. Denote by \mathcal{F}_{\tau,i} the piece of
information we choose to use; the subscript i survives here to avoid confusion
with \mathcal{F}_\tau above, and is to be read as "individual". The natural predictor of Y is
E[Y \,|\, \mathcal{F}_{\tau,i}] .                                                    (7.38)
It is unbiased in the sense that E\{E[Y \,|\, \mathcal{F}_{\tau,i}]\} = E[Y].
To keep things simple, we will for the present work with the summary prediction
basis \mathcal{F}_{\tau,i} = \{T, U, V > \tau - T - U, Y'(\tau - T - U)\}. Note
that, if the process Y'(v) is Markov, then no information is sacrificed by using
this \mathcal{F}_{\tau,i}.
Assume, just for notational convenience, that the distribution P_{U V Y'(\tau-t-U) Y | t}
has a density p_t(u, v, y', y) with respect to Lebesgue measure. The predictive
density of Y, given \mathcal{F}_{\tau,i} = \{T = t, U = u, V > \tau - t - u, Y'(\tau - t - u) = y'\}, is
p(y \,|\, \mathcal{F}_{\tau,i}) = \frac{\int_{\tau-t-u}^{\infty} p_t(u, v, y', y)\, dv}
{\int_{\tau-t-u}^{\infty} \int_{y'}^{\infty} p_t(u, v, y', y'')\, dy''\, dv} ,\qquad y > y' .      (7.39)
Prediction uncertainty may be expressed in terms of the variance and, possibly,
higher order central moments built from the noncentral moments
E[Y^k \,|\, \mathcal{F}_{\tau,i}] = \frac{\int_{\tau-t-u}^{\infty} \int_{y'}^{\infty} y^k\, p_t(u, v, y', y)\, dy\, dv}
{\int_{\tau-t-u}^{\infty} \int_{y'}^{\infty} p_t(u, v, y', y)\, dy\, dv} .                         (7.40)

In Section 7.5 we will treat prediction of the orns-liability more completely


within the framework of a more specific model for the claim developments.
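In practice the moments (7.40) are computed numerically. The sketch below is purely illustrative: it assumes a toy joint density p_t (exponential settlement delay, gamma claim size depending on the delay) and evaluates the first two predictive moments by numerical double integration.

```python
# Sketch: first two predictive moments (7.40) by numerical double integration,
# for an assumed toy density p_t(u, v, y', y); names and parameters are assumptions.
import numpy as np
from scipy import stats, integrate

def p_t(u, v, y_paid, y):
    # assumed model: V ~ Exp(1), Y | V = v ~ Gamma(shape 2, scale 1+v);
    # u and y_paid enter only through the region of integration here
    return stats.expon.pdf(v) * stats.gamma.pdf(y, a=2.0, scale=1.0 + v)

def predictive_moment(k, elapsed, u, y_paid):
    num = integrate.dblquad(lambda y, v: y**k * p_t(u, v, y_paid, y),
                            elapsed, np.inf, lambda v: y_paid, lambda v: np.inf)[0]
    den = integrate.dblquad(lambda y, v: p_t(u, v, y_paid, y),
                            elapsed, np.inf, lambda v: y_paid, lambda v: np.inf)[0]
    return num / den

m1 = predictive_moment(1, elapsed=2.0, u=0.5, y_paid=1.0)
m2 = predictive_moment(2, elapsed=2.0, u=0.5, y_paid=1.0)
print("predictive mean", round(m1, 3), "  sd", round(np.sqrt(m2 - m1**2), 3))
```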
D. Predicting the nr-liability. The term X^{nr} on the right of (7.36) is
independent of \mathcal{F}_\tau and is, therefore, to be predicted on the basis of its marginal
distribution. Applying Theorem 2 to the process of nr-claims, we find
w^{nr}(t) = w(t)\,(1 - P_{U|t}(\tau - t)) ,                                        (7.41)
P^{nr}_{Y|t}(dy) = \frac{\int_{u > \tau - t} P_{UY|t}(du, dy)}{1 - P_{U|t}(\tau - t)}            (7.42)
and, by (7.19),
m^{(k)}_{X^{nr}} = \int_{t>0} w^{nr}(t) \int_{y>0} y^k\, P^{nr}_{Y|t}(dy)\, dt
= \int_{t>0} w(t) \int_{y>0} \int_{u > \tau - t} y^k\, P_{UY|t}(du, dy)\, dt ,\qquad k = 1, 2, 3.      (7.43)

E. Predicting the total outstanding liability. By Theorem 2, the liability
components X^{orns} and X^{nr} are conditionally independent, given \mathcal{F}_\tau. Thus, the
first three predictive moments of the outstanding claims X^o in (7.36) are
m^{(k)}_{X^o | \mathcal{F}_\tau} = m^{(k)}_{X^{orns} | \mathcal{F}_\tau} + m^{(k)}_{X^{nr}} ,\qquad k = 1, 2, 3.        (7.44)
An appropriate reserve is the first moment given by (7.44) with k = 1. A fluctuation
loading may be provided by adding a multiple of the standard deviation.
As an alternative to this ad hoc method, one may take the upper \epsilon-fractile (e.g.
\epsilon = 0.01) of the predictive distribution, or some approximation of it based on
the first three moments in (7.44).
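The following sketch illustrates the reserve-plus-loading recipe for the nr-part only, using the compound Poisson relation m^{(k)} = W^{nr} E[Y^k] in the homogeneous case; the intensity, delay and severity distributions are illustrative assumptions, and only the inr contribution (t up to tau) is included.

```python
# Sketch: nr-reserve as mean plus a fluctuation loading (three standard deviations),
# homogeneous case with assumed intensity, delay and severity distributions.
import numpy as np
from scipy import stats, integrate

w, tau = 200.0, 5.0                        # assumed claim intensity and valuation time
delay = stats.expon(scale=0.5)             # assumed reporting-delay distribution
sev = stats.lognorm(s=1.0, scale=1.0)      # assumed claim size distribution

# W^{nr} = w * integral_0^tau (1 - P_U(tau - t)) dt   (inr part only)
W_nr, _ = integrate.quad(lambda t: w * delay.sf(tau - t), 0.0, tau)

m1 = W_nr * sev.moment(1)                  # first moment: the reserve proper
m2 = W_nr * sev.moment(2)                  # compound Poisson variance W^{nr} E[Y^2]
reserve = m1 + 3.0 * np.sqrt(m2)           # ad hoc fluctuation loading
print(f"W^nr = {W_nr:.1f}, mean = {m1:.1f}, reserve = {reserve:.1f}")
```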

7.5 Predicting the outstanding part of reported claims

A. Modelling the claim development; general considerations. We now
turn to the issue of modelling the mark distribution P_{Z|t}. To avoid blurring the
picture, let us assume independence of t and denote by P_Z the distribution of the
generic mark Z = (U, V, Y, Y'). (Various forms of time dependence due to trends
in risk conditions and inflation can be obtained by trivial reparametrization and
scaling.)
Presumably, it will be felt that (U, V, Y ) are the primary characteristics of
the claim (they tell us what kind of claim it is) and that the partial payments
Y' are secondary, more or less explained by (U, V, Y). Then it is natural to


construct P_Z in two steps, specifying first the marginal distribution of (U, V, Y)
and, second, the conditional distribution of the process Y', given (U, V, Y).
One convenient choice of P_{UVY} is the trivariate lognormal distribution. It
has 9 parameters (3 means and 6 variances or covariances) and may be viewed
as a fit model based on moments up to second order. If experience and physical
reasoning would dictate a more sophisticated model, one would typically regard
Y as the basic entity and specify first the marginal distribution P_Y and, second,
the conditional distribution P_{UV|Y}. The candidate models are countless and,
having no particular application in mind, it does not make any sense to list
some dozens of them here.
We will focus on modelling the conditional distribution of Y', given (U, V, Y).
One possible way of building this model is to put
Y'(v') = Q(v'/V)\, Y ,                                                             (7.45)
where \{Q(s);\ 0 \le s \le 1\} is some stochastic distribution function on [0, 1],
stochastically independent of (U, V, Y). This kind of model is suitable if the
shape of the partial payments process is independent of other claim characteristics,
roughly speaking. Again there are many candidates; any stochastic process
X that is non-decreasing, right-continuous, and such that 0 = X(-\infty) <
X(\infty) < \infty, produces a stochastic distribution function Q on the real line R
defined by
Q(s) = X(s)/X(\infty) .                                                            (7.46)
B. The Dirichlet process. A convenient choice of X in (7.46) is the gamma
process defined as follows. Let \alpha be a scaled distribution function on R (i.e.
\alpha(s)/\alpha(\infty) is a distribution function), and let X have independent increments
such that
X(s) - X(r) \sim Ga(\alpha(s) - \alpha(r), \beta)
for r \le s (confer (7.26)). That X is well-defined this way follows from Kolmogorov's
consistency condition and the convolution property of the gamma
distribution (to be described below). The inverse scale parameter \beta is immaterial
in the construction of Q by (7.46), of course, and could be set to 1.
Now, let -\infty = s_0 < s_1 < \cdots < s_k = \infty be a finite partition of R, and
abbreviate \alpha_i = \alpha(s_i) - \alpha(s_{i-1}), i = 1, \ldots, k. Starting from the independent
gamma variates X_i = X(s_i) - X(s_{i-1}) \sim Ga(\alpha_i, \beta), i = 1, \ldots, k, one easily
finds that the fractions
Q_i = Q(s_i) - Q(s_{i-1}) = \frac{X(s_i) - X(s_{i-1})}{X(\infty)} ,
i = 1, \ldots, k, are independent of X(\infty), that X(\infty) \sim Ga(\alpha(\infty), \beta) (of course),
and that (Q_1, \ldots, Q_k) \sim Dir(\alpha_1, \ldots, \alpha_k), the Dirichlet distribution with density
dir(q_1, \ldots, q_k; \alpha_1, \ldots, \alpha_k) = \frac{\Gamma(\sum_{j=1}^k \alpha_j)}{\prod_{j=1}^k \Gamma(\alpha_j)} \prod_{j=1}^k q_j^{\alpha_j - 1} ,
q_j > 0, j = 1, \ldots, k, q_1 + \cdots + q_k = 1. In particular (taking k = 2), Q(s) \sim
Be(\alpha(s), \alpha(\infty) - \alpha(s)), where Be(\alpha, \beta) is the beta distribution with density
be(q; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, q^{\alpha-1}(1 - q)^{\beta-1} ,
0 < q < 1. The stochastic process Q thus defined is called the Dirichlet process
with parameter \alpha = \{\alpha(s);\ s \in R\}, and we write Q \sim Dir(\alpha). The Dirichlet
process plays an important role in nonparametric Bayesian analysis, see [18].
The moments of Q(s) are easily calculated. In particular,
E[Q(s)] = \alpha(s)/\alpha(\infty) ,
showing that the expected value of Q is just \alpha normed to a probability distribution,
and
Var[Q(s)] = E[Q(s)]\,(1 - E[Q(s)])/(\alpha(\infty) + 1) ,
showing that the total mass of \alpha is a measure of the precision of the process Q;
a large value of \alpha(\infty) means little randomness in Q.
The conditional Q-distribution on an interval (a, b] is
Q(s \,|\, (a, b]) = \frac{Q(s) - Q(a)}{Q(b) - Q(a)} = \frac{X(s) - X(a)}{X(b) - X(a)} ,\qquad a < s \le b .
Putting X(s) - X(a) and X(b) - X(a) in the roles of X(s) and X(\infty), respectively,
in the construction above, the whole story repeats itself. We find that
Q(\cdot \,|\, (a, b]) \sim Dir(\alpha_{(a,b]}), where \alpha_{(a,b]} is the restriction of \alpha to (a, b], and that
it is independent of X(b) - X(a) and of X(r) for r \notin (a, b]. Thus, conditional
Q-distributions on disjoint intervals are independent Dirichlet processes.
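The gamma-increment construction lends itself directly to simulation. The sketch below (illustrative assumptions: a uniform shape measure alpha(s) = alpha_total * s on [0,1], and arbitrary values of V and Y) generates one Dirichlet payment pattern Y'(v) = Q(v/V) Y.

```python
# Sketch: simulate a partial payment pattern Y'(v) = Q(v/V) * Y with Q a Dirichlet
# process built from independent gamma increments, as described above.
import numpy as np

rng = np.random.default_rng(3)

def simulate_Q(grid, alpha_total=4.0):
    """Dirichlet process on [0,1] with alpha(s) = alpha_total * s (uniform shape)."""
    alpha_incr = alpha_total * np.diff(np.concatenate(([0.0], grid)))
    x_incr = rng.gamma(shape=alpha_incr)       # independent Ga(alpha_i, 1) increments
    x = np.cumsum(x_incr)
    return x / x[-1]                           # Q(s) = X(s)/X(1)

V, Y = 4.0, 100.0                              # assumed settlement delay and ultimate amount
v = np.linspace(0.1, V, 40)                    # development times with observed payments
Q = simulate_Q(v / V)
Y_partial = Q * Y                              # cumulative partial payments Y'(v)
print(np.round(Y_partial[:5], 2), "...", round(Y_partial[-1], 2))   # ends at Y
```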
C. Predicting the outstanding part of Dirichlet type payments. We
adopt the general model in Paragraph A above with partial payments Y' of the
form (7.45), where Q \sim Dir(\alpha). Of course, in this context \alpha is concentrated on
the unit interval [0, 1], i.e. 0 = \alpha(0) < \alpha(1) = \alpha(\infty).
Let \tau denote the present time and consider a reported but not settled claim
occurred at time t < \tau, notified with a delay U = u < \tau - t, hence V > v' =
\tau - t - u, and for which we have observed the partial cumulative payments Y'(v_j)
at development times 0 \le v_1 < \cdots < v_k = v'. Denote all this information by
\mathcal{F}'. The natural predictor of the outstanding payments on the claim is
E[Y \,|\, \mathcal{F}'] - Y'(v') .                                                   (7.47)
It is unbiased per definition. More generally, by the law of iterated expectation,
any predictor of the form E[Y \,|\, \mathcal{F}''] - Y'(v'), with \mathcal{F}'' \subseteq \mathcal{F}', is unbiased.
To obtain an expression for (7.47), let us derive the joint distribution of the
random variables involved. The quantities
U, V, Y, \{Y'(v_j);\ j = 1, \ldots, k\}
correspond one-to-one with
U, V, Y, \{Q_j;\ j = 1, \ldots, k\}, Y'(v') ,                                      (7.48)
where
Q_j = \frac{Y'(v_j) - Y'(v_{j-1})}{Y'(v')} = \frac{Q(v_j/V) - Q(v_{j-1}/V)}{Q(v'/V)}
(recall (7.45)), with the interpretation v_0 = 0. By use of the results in the
previous paragraph, we find that the joint density of the variates in (7.48) is
p_{v_1 \ldots v_k}(u, v, y, q_1, \ldots, q_k, y') = p(u, v, y)
\times dir\Bigl(q_1, \ldots, q_k;\ \alpha\bigl(\tfrac{v_1}{v}\bigr) - \alpha\bigl(\tfrac{v_0}{v}\bigr), \ldots, \alpha\bigl(\tfrac{v_k}{v}\bigr) - \alpha\bigl(\tfrac{v_{k-1}}{v}\bigr)\Bigr)
\times be\Bigl(\tfrac{y'}{y};\ \alpha\bigl(\tfrac{v_k}{v}\bigr),\ \alpha(1) - \alpha\bigl(\tfrac{v_k}{v}\bigr)\Bigr)\, \frac{1}{y} ,      (7.49)
0 < u, 0 < v, 0 < y' \le y, q_j > 0, j = 1, \ldots, k and q_1 + \cdots + q_k = 1, for
0 \le v_1 < \cdots < v_k = v'. Thus, we obtain the following expression for the first
term in (7.47):
E[Y \,|\, \mathcal{F}'] = \frac{\int_{y > y'} \int_{v > v'} y\, p_{v_1 \ldots v_k}(u, v, y, q_1, \ldots, q_k, y')\, dv\, dy}
{\int_{y > y'} \int_{v > v'} p_{v_1 \ldots v_k}(u, v, y, q_1, \ldots, q_k, y')\, dv\, dy} .

Numerical techniques are required to compute this fairly complex expression.
It is the double integrals that represent the hard part of the problem, and
so it does not bring any great computational relief to skip the information
contained in the fractions Q_j, j = 1, \ldots, k. In fact, since the Q_j are expected to
reproduce their conditional means (\alpha(v_j/V) - \alpha(v_{j-1}/V))/\alpha(v'/V), they may
provide valuable information on V and, thereby, also on Y if the shape of \alpha
differs significantly from the uniform distribution.
Pursuing these considerations, we note that also the remaining time until
settlement may be predicted on the basis of \mathcal{F}'. The predictive distribution of
V has density
\frac{\int_{y > y'} p_{v_1 \ldots v_k}(u, v, y, q_1, \ldots, q_k, y')\, dy}
{\int_{y > y'} \int_{v'' > v'} p_{v_1 \ldots v_k}(u, v'', y, q_1, \ldots, q_k, y')\, dv''\, dy} .
We round off this paragraph with a few words about the aptness of the
Dirichlet process as a description of the partial payment process. The Dirichlet process is purely discrete and has infinitely many jumps in every interval
where the continuous part of has strictly positive mass, see Ferguson (1972).
Admittedly, such path properties do not comply with the behaviour of real life
payment streams, which certainly also are purely discrete, but have isolated
jumps. However, such myopic considerations may be subordinate to the important fact that the Dirichlet process is able to depict virtually any conceivable
pattern of payments by suitable choice of .

7.6 Parameter estimation

A. The purpose of this section. Practical implementation of the predictions


we have derived requires that the probability law of the claims process be known.
The present section addresses the problem of drawing inferences about this
probability law from available data on reported claims. The scope is limited to
estimation in parametric models.
B. Imposing parametric structure. We will assume that the probability law
of the marked Poisson process is parametric, that is, there exists a finite-dimensional
parameter \theta such that the intensity is w(t; \theta) and the mark distribution
is P_Z(\cdot\,; \theta), independent of t. Marginal and conditional distributions are
denoted accordingly as e.g. P_U(\cdot\,; \theta) and P_{V|U}(\cdot\,; \theta), and their densities (with
respect to Lebesgue measure, say) are denoted correspondingly with lower case
p in the place of capital P. The joint density of (U, V, Y) can be written as
p_{UVY}(u, v, y; \theta) = p_U(u; \theta)\, p_{V|u}(v; \theta)\, p_{Y|uv}(y; \theta) .          (7.50)

C. General considerations. Statistical analyses of the claims process may be
complicated for several reasons. One inherent difficulty is that the observations
are censored: inferences must be based on observations from claims that are
reported by the present time \tau, and we must therefore work with the appropriate
conditional distributions. Moreover, the mark Z is of infinite dimension.
Typically, a large number of parameters will appear in a model proposed for
real life applications, where requirements to realism put limits to the amount of
structure that can be imposed on the distributions.
We will focus on point estimation and confine ourselves to parametric models
for the claim intensity and the trivariate distribution PU V Y ( ; ). This problem
can be treated fairly generally. The problem of drawing inferences about the
mechanism governing the process Y 0 (v) is far more dependent on the model.
Let it suffice here to say that one can usually work with standard techniques in
the conditional model, given (U, V, Y ), when only s-claims are utilized.
The available data at time \tau is a file of records of s-claims (settled claims)
containing
T^s_i, U^s_i, V^s_i, Y^s_i ,\qquad i = 1, \ldots, N^s ,                           (7.51)
and a file of records of rns-claims (reported-not-settled claims) containing the
information
T^{rns}_i, U^{rns}_i, V^{rns}_i > \tau - T^{rns}_i - U^{rns}_i ,\qquad i = 1, \ldots, N^{rns} ,      (7.52)
implying that we discard all information about partial payments for the rns-claims.
D. The likelihood of the observations. By Theorems 1 and 2, in particular
relation (7.8), and by virtue of (7.50), the likelihood of the observations is
L = \exp(-W^s(\theta)) \prod_{i=1}^{N^s} w(T^s_i; \theta)\, p_U(U^s_i; \theta)\, p_{V|U^s_i}(V^s_i; \theta)\, p_{Y|U^s_i V^s_i}(Y^s_i; \theta)
\times \exp(-W^{rns}(\theta)) \prod_{i=1}^{N^{rns}} w(T^{rns}_i; \theta)\, p_U(U^{rns}_i; \theta)\, \bigl(1 - P_{V|U^{rns}_i}(\tau - T^{rns}_i - U^{rns}_i; \theta)\bigr) ,
where
W^s(\theta) = \int_0^\tau w(t; \theta)\, P_{U+V}(\tau - t; \theta)\, dt ,
W^{rns}(\theta) = \int_0^\tau w(t; \theta)\, \bigl(P_U(\tau - t; \theta) - P_{U+V}(\tau - t; \theta)\bigr)\, dt .

Inspection of the likelihood suggests that it be reshaped in terms of the
process of r-claims by time \tau, \{(T^r_i, Z^r_i)\}_{i=1,\ldots,N^r}. Note that
N^r = N^s + N^{rns} \sim Po(W^r(\theta))
with
W^r(\theta) = W^s(\theta) + W^{rns}(\theta) = \int_0^\tau w(t; \theta)\, P_U(\tau - t; \theta)\, dt .       (7.53)
We can now rewrite the likelihood as
L = W^r(\theta)^{N^r} \exp(-W^r(\theta))\; \frac{\prod_{i=1}^{N^r} w(T^r_i; \theta)\, p_U(U^r_i; \theta)}{W^r(\theta)^{N^r}}
\times \prod_{i=1}^{N^s} p_{V|U^s_i}(V^s_i; \theta)\, p_{Y|U^s_i V^s_i}(Y^s_i; \theta)
\times \prod_{i=1}^{N^{rns}} \int_{\tau - T^{rns}_i - U^{rns}_i}^{\infty} p_{V|U^{rns}_i}(v; \theta)\, dv .

E. Maximum likelihood estimation. To find the maximum likelihood estimator
(MLE), \hat\theta, we need to maximize the likelihood or, equivalently, its
logarithm. Thus, \hat\theta is found as the solution to the k-dimensional system of
equations
\frac{\partial}{\partial\theta} \ln L \Big|_{\theta = \hat\theta} = 0 .             (7.54)
Under certain regularity conditions, which we assume are satisfied, the MLE is
consistent,
\hat\theta \xrightarrow{\ p\ } \theta ,                                              (7.55)
and asymptotically normally distributed,
\hat\theta \xrightarrow{\ d\ } N(\theta, \Sigma(\theta)) ,                           (7.56)
where \Sigma(\theta) = I(\theta)^{-1} and I(\theta) is the information matrix defined by
I(\theta) = -E\Bigl[\frac{\partial^2}{\partial\theta\, \partial\theta'} \ln L\Bigr] .       (7.57)
F. Multiplicative intensity. There is no universal recipe for how to efficiently
maximize the likelihood. It has to be examined in each case. Let us proceed with
a description of the procedure under the following fairly general assumptions:
The claim intensity is of the multiplicative form
w(t; \theta) = \lambda\, w_0(t) ,                                                    (7.58)
where w_0(t) is an observable measure of the size of the portfolio at time t and
\lambda is an unknown intensity per unit of risk exposed. Furthermore, let \alpha, \beta, and \gamma be
the parameter functions appearing in the constituents of the mark distribution;
p_U(u; \alpha), p_{V|U}(v; \beta) and p_{Y|UV}(y; \gamma). It is convenient to rewrite (7.53) as
W^r(\theta) = \lambda\, W^r_0(\alpha) ,                                               (7.59)
with
W^r_0(\alpha) = \int_0^\tau w_0(t)\, P_U(\tau - t; \alpha)\, dt = \int_0^\tau W_0(\tau - u)\, p_U(u; \alpha)\, du ,      (7.60)
where the latter equality is obtained upon putting P_U(\tau - t; \alpha) = \int_0^{\tau - t} p_U(u; \alpha)\, du,
changing the order of integration, and introducing
W_0(t) = \int_0^t w_0(t')\, dt' ,                                                    (7.61)
the risk exposure up to time t.


Then the (essential part of the) likelihood takes the following form:
L = \bigl(\lambda\, W^r_0(\alpha)\bigr)^{N^r} \exp\bigl(-\lambda\, W^r_0(\alpha)\bigr)                        (7.62)
\times \prod_{i=1}^{N^r} \frac{p_U(U^r_i; \alpha)}{\int_0^\tau W_0(\tau - u)\, p_U(u; \alpha)\, du}            (7.63)
\times \prod_{i=1}^{N^s} p_{V|U^s_i}(V^s_i; \beta)                                                             (7.64)
\times \prod_{i=1}^{N^s} p_{Y|U^s_i V^s_i}(Y^s_i; \gamma)                                                      (7.65)
\times \prod_{i=1}^{N^{rns}} \int_{\tau - T^{rns}_i - U^{rns}_i}^{\infty} p_{V|U^{rns}_i}(v; \beta)\, dv .      (7.66)


Assume that the components of \theta = (\lambda, \alpha, \beta, \gamma) are functionally unrelated.
The MLE is constructed by determining \hat\alpha maximizing (7.63), \hat\beta maximizing
the product of (7.64) and (7.66), \hat\gamma maximizing (7.65) and, finally, the maximum
of (7.62) is attained for
\hat\lambda = N^r / W^r_0(\hat\alpha) .                                              (7.67)
In the special case where claims are settled immediately upon notification,
we have N^r = N^s and the cumbersome factor (7.66) vanishes.
G. An example. Suppose that claims are settled immediately upon notification
so that the likelihood is made up of (7.62), (7.63), and (7.65) with
p_{Y|U^s_i}(Y^s_i; \gamma) in the place of p_{Y|U^s_i V^s_i}(Y^s_i; \gamma).
Assume that (U, Y) follows a bivariate lognormal distribution,
\begin{pmatrix} \ln U \\ \ln Y \end{pmatrix} \sim
N\!\left( \begin{pmatrix} \mu_u \\ \mu_y \end{pmatrix},
\begin{pmatrix} \sigma_{uu} & \sigma_{uy} \\ \sigma_{uy} & \sigma_{yy} \end{pmatrix} \right).
Then
\ln Y \,|\, \ln U \sim N(\beta_0 + \beta_1 \ln U, \sigma) ,
\beta_0 = \mu_y - \beta_1 \mu_u ,                                                    (7.68)
\beta_1 = \sigma_{uy}/\sigma_{uu} ,                                                  (7.69)
\sigma = \sigma_{yy} - \sigma_{uy}^2/\sigma_{uu} .                                   (7.70)

The MLE of the parameters in (7.68)-(7.70), which now constitute \gamma, is found
by simply regressing the \ln Y^r_i linearly on the \ln U^r_i.
The MLE of \mu_u and \sigma_{uu}, which now constitute \alpha, is obtained by maximization
of (7.63), which is
\prod_{i=1}^{N^r} \frac{\frac{1}{\sqrt{2\pi\sigma_{uu}}\, U^r_i} \exp\bigl\{-\frac{1}{2\sigma_{uu}}(\ln U^r_i - \mu_u)^2\bigr\}}
{\int_0^\tau W_0(\tau - u)\, \frac{1}{\sqrt{2\pi\sigma_{uu}}\, u} \exp\bigl\{-\frac{1}{2\sigma_{uu}}(\ln u - \mu_u)^2\bigr\}\, du} .
Upon cancelling equal factors in the numerator and denominator and rearranging
a bit, we find that the problem is to minimize
\int_0^\tau W_0(\tau - u)\, \frac{1}{u} \exp\Bigl\{ \theta\bigl(\overline{(\ln U)^2} - (\ln u)^2\bigr) + \kappa\bigl(\ln u - \overline{\ln U}\bigr) \Bigr\}\, du ,      (7.71)
where
\theta = \frac{1}{2\sigma_{uu}} ,\qquad \kappa = \frac{\mu_u}{\sigma_{uu}} ,
and
\overline{\ln U} = \frac{1}{N^r} \sum_{i=1}^{N^r} \ln U^r_i ,\qquad \overline{(\ln U)^2} = \frac{1}{N^r} \sum_{i=1}^{N^r} (\ln U^r_i)^2 .


The minimum must be determined numerically, e.g. by the Newton-Raphson algorithm.
Finally, \hat\lambda is given by (7.67). The MLE of the basic parameters is obtained
by inversion of the functional relations.
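The whole recipe of this example can be carried out in a few lines. The sketch below is illustrative only: the data are simulated, the exposure function W_0 is an assumed ramp, and the objective implements the form of (7.71) reconstructed above (so it should be read as an outline, not as the original computation).

```python
# Sketch of the estimation recipe: (i) regress ln Y on ln U for (7.68)-(7.70),
# (ii) minimize (7.71) numerically for (mu_u, sigma_uu), (iii) lambda-hat from (7.67).
import numpy as np
from scipy import optimize, integrate

rng = np.random.default_rng(5)
tau, n = 5.0, 800
lnU = rng.normal(-2.4, 1.0, n)                       # simulated log reporting delays
lnY = 7.5 + 0.4 * lnU + rng.normal(0.0, 2.6, n)      # simulated log claim sizes
W0 = lambda t: 80_000.0 * np.clip(t, 0.0, tau)       # assumed cumulative exposure

# (i) conditional parameters by linear regression
b1, b0 = np.polyfit(lnU, lnY, 1)
sigma = np.mean((lnY - b0 - b1 * lnU) ** 2)

# (ii) marginal parameters of ln U by minimizing (7.71)
m1, m2 = lnU.mean(), (lnU ** 2).mean()
def objective(par):
    mu_u, s_uu = par[0], np.exp(par[1])              # enforce sigma_uu > 0 via log
    theta, kappa = 1.0 / (2.0 * s_uu), mu_u / s_uu
    f = lambda u: W0(tau - u) / u * np.exp(theta * (m2 - np.log(u) ** 2)
                                           + kappa * (np.log(u) - m1))
    return integrate.quad(f, 1e-8, tau)[0]

res = optimize.minimize(objective, x0=[lnU.mean(), np.log(lnU.var())], method="Nelder-Mead")
mu_u, s_uu = res.x[0], np.exp(res.x[1])

# (iii) claim intensity per unit of exposure, cf. (7.67)
lnu_pdf = lambda u: np.exp(-(np.log(u) - mu_u) ** 2 / (2 * s_uu)) / (u * np.sqrt(2 * np.pi * s_uu))
W0r = integrate.quad(lambda u: W0(tau - u) * lnu_pdf(u), 1e-8, tau)[0]
print("beta0", round(b0, 3), "beta1", round(b1, 3), "mu_u", round(mu_u, 3),
      "sigma_uu", round(s_uu, 3), "lambda-hat", round(n / W0r, 6))
```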
H. An application to real data. Data in the form described in Paragraph C
above has been provided by a Danish insurance company. The line of business is
accident insurance, and observations are from 5 consecutive time periods sufficiently far back in history that practically all claims were settled by the time the
data extract was produced. Thus, we can work with the simplifying assumption
that claims are settled immediately upon notification. Only a certain kind of
bodily injury claims are included, and amounts have been rescaled.
Two data files were created: one containing (essentially) the number of
persons exposed to risk at time t, w_0(t), t \in [0, 5], and the other with N^r =
1394 claim records \{(T^r_i, U^r_i, Y^r_i)\}_{1 \le i \le N^r}. The total exposures in the 5 periods
were 55400, 63200, 75300, 94000, 112000, which sum up to a total of 399900.
Adopting the model and the methods sketched in Paragraph G above, the MLE
estimates came out as follows:
Linear regression of the \ln Y_i on the \ln U_i gave \hat\beta_0 = 7.4941, \hat\beta_1 = 0.38725,
\hat\sigma = 7.0461. The parameters of the joint distribution of (\ln U, \ln Y) were estimated
by \hat\mu_u = -2.443 (greater than the observed mean \overline{\ln U} = -2.4890, as one should
expect), \hat\mu_y = 6.5480, \hat\sigma_{uu} = 1.0727, \hat\sigma_{uy} = 0.4154, \hat\sigma_{yy} = 7.2069. The estimated
coefficient of correlation of \ln U and \ln Y is 0.1494, which means that large claims
tend to be reported with a longer delay than small claims. The claim intensity
was estimated by \hat\lambda = 0.003639.

Bibliography.
[35], [24], [29], [4], [27], [43], [23].

Chapter 8

Utility Theory
This chapter is a very sketchy draft, still sufficient to replace Sundt's Chapter 12.

8.1 The expected utility hypothesis

A. Risk in a simple one-period scenario.
Consider an agent who is carrying a certain insurable risk. He will henceforth
be referred to as the insured (although he might decide not to purchase any
of the insurance products he is offered). We adopt a simple one-period model,
whereby the insured's wealth at the beginning of the period is w and the loss
incurred due to the insurable risk during the period is X, so that his wealth
at the end of the period is w - X. The loss X is uncertain and is therefore
represented by a non-negative random variable with some distribution F.
B. Insurance.
An insurance policy offered by an insurer specifies a compensation function
r : R_+ \to R_+, such that the insurer covers the amount r(x) if X = x, and a
premium \pi that is payable by the insured to the insurer in advance. Since the
Insurance Treaty Act forbids gains on insurance, r must satisfy 0 \le r(x) \le x.
The selfinsurance function s : R_+ \to R_+, defined by s(x) = x - r(x), specifies
the amount that has to be covered by the insured if X = x. Also s satisfies
0 \le s(x) \le x. The selfinsurance amount is also called franchise. We list some
commonly used forms of insurance coverage:
The fixed amount deductible is given by
r(x) = (x - b)_+ ,\qquad s(x) = x \wedge b ,                                       (8.1)
whereby the insurer covers the excess of the loss over some deductible amount
b > 0.
The proportional deductible or proportional insurance is given by
r(x) = (1 - k)\, x ,\qquad s(x) = k\, x ,                                          (8.2)
whereby the insurer covers the fraction 1 - k of the loss, 0 \le k \le 1.
The franchise deductible is given by
r(x) = 1_{[b,\infty)}(x)\, x ,\qquad s(x) = 1_{[0,b)}(x)\, x ,                     (8.3)
whereby the insurer covers everything if the loss exceeds b and nothing otherwise.
The first risk deductible is given by
r(x) = x \wedge b ,\qquad s(x) = (x - b)_+ ,                                       (8.4)
whereby the insurer covers everything up to a maximum of b.
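The four compensation forms are easily coded; the following tiny sketch (arbitrary example values of b and k) evaluates them side by side, together with the corresponding selfinsurance amounts s(x) = x - r(x).

```python
# Illustrative sketch of the compensation functions (8.1)-(8.4).
import numpy as np

b, k = 5.0, 0.25

def fixed_amount(x):   return np.maximum(x - b, 0.0)      # r(x) = (x - b)_+
def proportional(x):   return (1.0 - k) * x               # r(x) = (1 - k) x
def franchise(x):      return np.where(x >= b, x, 0.0)    # r(x) = 1_[b,inf)(x) x
def first_risk(x):     return np.minimum(x, b)            # r(x) = x ^ b

x = np.array([2.0, 5.0, 8.0])
for name, r in [("fixed amount", fixed_amount), ("proportional", proportional),
                ("franchise", franchise), ("first risk", first_risk)]:
    print(f"{name:12s}  r(x) = {r(x)}   s(x) = {x - r(x)}")
```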


C. The expected utility hypothesis.
It may be felt that a fixed amount deductible would be favourable to the insured
since it cuts away the top risk and, correspondingly, that first risk deductible
would be bad since it does precisely the opposite. The question is, how could
the various forms of compensation be rationally compared and, in particular,
is it possible to find a best form from the point of view of the insured? More
precisely, we need a yardstick for measuring the goodness of any given policy
(, r) or, rather, the corresponding random wealth the insured is left with at the
end of the insurance period, Y = w s(X). A naive idea would be to just
look at the expected value E[Y ]. Obviously, this would not reflect the typical
insurance customers attitude to risk since people regularly purchase insurance
at prices that are sufficient to cover, not only the losses on the average, but also
administration expenses. Insurance is useful because it relieves the individual
of risk, that is, replaces w X with an Y = w s(X) that is less uncertain.
Suppose we want to compare the ultimate wealths, Y1 and Y2 , under two policies
such that E[Y1 ] = E[Y2 ]. Actuaries would certainly approve to the idea of
comparing them by their variances and recommend Y1 if V[Y1 ] < V[Y2 ] or,
equivalently, E[Y12 ] < E[Y22 ]. This leads to the more general utility approach,
which amounts to measuring the goodness of a given ultimate wealth Y by a
criterion of the form
E[u(Y )] .

(8.5)

Certain axioms, which are carefully discussed in the eminent book of De Groot
[14], lead to the criterion (8.5), that is, existence of a function u such that Y1
is preferred to Y2 if and only if E[u(Y1 )] > E[u(Y2 )]. The axioms do not imply
anything as to the form of the function u, however, and here we have to add
assumptions based on other considerations.
D. Basic properties of utility functions.
Important clues are given by the fact that the criterion, when applied to degenerate
random variables (constants), generates preferences between different
amounts of certain wealth. We can confidently postulate that there must be a
limit to the satisfaction that our agent can get from a finite amount of money
and that he is able to assess every amount within a certain range of possible
wealth:
u is finite-valued and defined on some open interval I .                            (8.6)
We assume, furthermore, that he is just an ordinary human being who is happier
the richer he is:
u is strictly increasing .                                                           (8.7)
We finally assume that the marginal utility of a given amount of additional
money decreases with his initial wealth:
For any h > 0, u(y + h) - u(y) is strictly decreasing in y .                         (8.8)
This means that our insured is not the sort of person whose greed increases with
the size of his purse. (A word about usage: In the following we will let "increasing"
and "decreasing" mean non-decreasing and non-increasing, respectively,
and add a qualifying "strictly" when needed.)
Note that the preferences generated by the expected utility hypothesis are
invariant under a linear transform of the utility function u to a + bu, where a
and b > 0 are constants. Here is a list of some commonly used utility functions
specified in minimalistic form:
Quadratic utility:
u(y) = y - \frac{y^2}{2c} ,\qquad y < c .                                            (8.9)
Exponential utility:
u(y) = -\exp(-cy) ,\qquad y \in R ,\ c > 0 .                                         (8.10)
Logarithmic utility:
u(y) = \log(c + y) ,\qquad y \in R ,\ y > -c .                                       (8.11)
From the expected utility hypothesis and the assumptions (8.6)-(8.8) we
can deduce a variety of results that serve to explain why people purchase insurance, even at prices that are (sometimes way) above the expected value of the
loss, and how insurance treaties should be designed to serve the purpose of risk
mitigation. The results are valid for all utility functions and, thus, deal with
qualitative rather than quantitative aspects of risk preferences. So, by way of
warning, the theory aims at creating general understanding rather than producing practically useful numbers (e.g. what premium to charge). We now go to
work and commence with a crucial qualitative property of utility functions.
Theorem 1. A utility function is strictly concave.


Proof: Let u be a utility function. We will need to know that it is continuous.
First, note that existence of left and right limits u(y-) and u(y+) at each
y \in I follows from monotonicity. Second, preparing for an ad absurdum proof,
suppose u is discontinuous at some point y. Then, by (8.8), u increases by no
less than u(y+) - u(y-) > 0 in every interval, however small, to the left of
y. But this means that the increment of u over a given interval (x, y] \subseteq I is
infinite since (x, y] can be divided into arbitrarily many subintervals. Since u is
finite-valued, we conclude that discontinuities cannot occur.
Next, we observe that (8.8) implies that, for x < z,
u\Bigl(x + \frac{1}{2}(z - x)\Bigr) - u(x) > u(z) - u\Bigl(x + \frac{1}{2}(z - x)\Bigr) ,
which is the same as
u\Bigl(\frac{x + z}{2}\Bigr) > \frac{u(x) + u(z)}{2} .                               (8.12)

Thus, a utility function satisfies the definition of strict concavity,
u((1 - \lambda)x + \lambda z) > (1 - \lambda)u(x) + \lambda u(z) ,\qquad \lambda \in (0, 1),        (8.13)
for the special choice \lambda = 1/2. This is already a lot of structure since (8.12) is
true for all x < z, and we will now prove that it implies (8.13).
Fix x and z and, to save space, introduce the function
v(\lambda) = \frac{u((1 - \lambda)x + \lambda z) - u(x)}{u(z) - u(x)} ,
\lambda \in [0, 1], which satisfies v(0) = 0, v(1) = 1, and inherits all qualitative properties
of u. We want to prove (8.13), which is the same as
v(\lambda) > \lambda ,\qquad \lambda \in (0, 1) .                                      (8.14)

We will first prove that (8.14) holds for all \lambda of the form
\lambda_{n,j} = j/2^n ,\qquad j = 1, \ldots, 2^n - 1 ,                                (8.15)
and n = 1, 2, \ldots. The proof goes by induction on n: Since v inherits the property
(8.12), we have v(1/2) > (v(0) + v(1))/2 = 1/2, and so the induction hypothesis
(8.15) holds for n = 1. Assuming that it holds for a given n, we must show
that it holds also for n + 1. It trivially holds for j = 2k, an even number, since
\lambda_{n+1,2k} = \lambda_{n,k}. For j = 2k + 1, an odd number, write
\lambda_{n+1,2k+1} = \frac{1}{2}\bigl(\lambda_{n,k} + \lambda_{n,k+1}\bigr) ,
and use (8.12) and the induction hypothesis for n to obtain
v(\lambda_{n+1,2k+1}) > \frac{1}{2}\bigl(v(\lambda_{n,k}) + v(\lambda_{n,k+1})\bigr) > \frac{1}{2}\bigl(\lambda_{n,k} + \lambda_{n,k+1}\bigr) = \lambda_{n+1,2k+1} .
(For j = 1 and j = 2^{n+1} - 1 we need to introduce \lambda_{n,0} = 0 and \lambda_{n,2^n} = 1.)
This completes the induction argument, and we conclude that (8.14)
holds for all \lambda of the form (8.15).
Next, let \lambda be an arbitrary number in (0, 1). Pick a sequence of numbers \lambda_n
of the form (8.15) such that \lambda_n \to \lambda. Since v is continuous, we have v(\lambda) - \lambda =
\lim(v(\lambda_n) - \lambda_n) \ge 0, hence
v(\lambda) \ge \lambda .
We have now proved that (8.13) holds with > replaced by \ge, that is, u is
concave. It remains to prove that the concavity is strict. First, consider the
case \lambda < 1/2. Applying first the concavity property (note that 2\lambda \in (0, 1)) and
then (8.12), we have
u((1 - \lambda)x + \lambda z) = u\Bigl((1 - 2\lambda)x + 2\lambda\, \frac{x + z}{2}\Bigr)
\ge (1 - 2\lambda)u(x) + 2\lambda\, u\Bigl(\frac{x + z}{2}\Bigr)
> (1 - 2\lambda)u(x) + 2\lambda\, \frac{u(x) + u(z)}{2}
= (1 - \lambda)u(x) + \lambda u(z) .
A similar argument works for \lambda > 1/2. This completes the proof.
E. Risk aversion.
Let u be a twice differentiable utility function and denote its first and second
derivatives by u' and u'', respectively. The function a defined by
a(y) = -\frac{u''(y)}{u'(y)} = -\frac{d}{dy} \ln u'(y)                               (8.16)
is called the risk aversion function of u (we do not visualize its dependence on
u in the notation). The risk aversion function is strictly positive and is a local
measure of the relative change of the marginal utility, that is, how much the
marginal utility changes in units of itself by an infinitesimally small increase of
wealth.
As we will see in the next section, there are good reasons to require that the
risk aversion function be decreasing. For the time being let us just suppose that
the function a really deserves its name, and state that the aversion against risk
ought to decrease with the wealth; a rich person should be more able to carry
a certain risk than a poor person. This will be made precise later.
We observe that the quadratic utility function (8.9) has an increasing risk
aversion function, and therefore does not reflect the attitudes to risk that we just
said are the typical ones. The exponential utility function (8.10) has constant
risk aversion function, and is thus a benchmark case, just barely compatible with
typical risk attitudes. The logarithmic utility function (8.10) has decreasing risk
aversion function, and is thus OK.
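These three qualitative behaviours are easy to verify numerically. The sketch below (an arbitrary parameter value c; the exponential utility is written with aversion 1/c rather than c just to keep the numbers comparable) approximates a(y) = -u''(y)/u'(y) by finite differences for the three utilities.

```python
# Sketch: numerical risk aversion a(y) = -u''(y)/u'(y) for the utilities (8.9)-(8.11),
# illustrating increasing / constant / decreasing behaviour.
import numpy as np

c = 10.0
utilities = {
    "quadratic":   lambda y: y - y**2 / (2 * c),      # valid for y < c
    "exponential": lambda y: -np.exp(-y / c),         # written here with aversion 1/c
    "logarithmic": lambda y: np.log(c + y),           # valid for y > -c
}

def risk_aversion(u, y, h=1e-4):
    d1 = (u(y + h) - u(y - h)) / (2 * h)              # numerical u'
    d2 = (u(y + h) - 2 * u(y) + u(y - h)) / h**2      # numerical u''
    return -d2 / d1

for name, u in utilities.items():
    print(name, [round(risk_aversion(u, y), 4) for y in (1.0, 3.0, 5.0)])
```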

8.2 The zero utility premium

A. Definition of the zero utility premium.
Equipped with Theorem 1 above and our knowledge about concave functions,
we can deduce a number of general results about the insured's preferences. It is
henceforth understood that we consider only non-trivial insurance treaties for
which P[r(X) > 0] > 0. In this section we shall discuss what conditions an
insurance treaty must satisfy in order to be of any interest to the insured and,
more specifically, at what price the insured is willing to buy a given coverage.
An insurance treaty (\pi, r), which leaves the insured with the final wealth
w - \pi - s(X), will be preferred to full selfinsurance, which leaves the insured with
final wealth w - X, if and only if
E[u(w - \pi - s(X))] \ge E[u(w - X)] .                                               (8.1)
("Preferred" includes the case with equality, when the insured is indifferent.)
By monotone convergence and continuity of u, the expression on the left is a
continuous function of \pi. Obviously, it is also strictly decreasing in \pi and ranges
between the upper bound E[u(w - s(X))], which is \ge E[u(w - X)], and a lower
bound, which we assume is \le E[u(w - X)] (if e.g. I is a half-interval of the
type (-\infty, a), then obviously the lower bound is \lim_{y \to -\infty} u(y) \le E[u(w - X)]).
Therefore, there exists a \pi_r(w) such that
E[u(w - \pi_r(w) - s(X))] = E[u(w - X)] .                                            (8.2)
(8.2)

This \pi_r(w) is called the zero utility (increase) premium. Obviously, the insured
will buy the insurance if and only if \pi \le \pi_r(w). The following result says that,
for reasonably designed compensations, the insured is willing to pay more than
the pure net premium E[r(X)]. This is good news because the insurer needs to
collect a premium that covers the expected claims (to avoid systematic losses
and certain ruin in the long run) and also covers administration expenses and
allows for a profit.
Theorem 2. If s and r are both increasing, then \pi_r(w) > E[r(X)] .
Proof: The proof rests on the Corollary to Theorem 3 in Appendix C. Consider
the functions g_1(x) = E[r(X)] + s(x) and g_2(x) = x. Obviously they fulfill
the condition (C.20), and since g_2(x) - g_1(x) = r(x) - E[r(X)] is increasing by
assumption, they also fulfill the condition (C.21). Since v(y) = u(w - y) is a
strictly concave function, we conclude that
E[u(w - E[r(X)] - s(X))] > E[u(w - X)] .
The result now follows by comparison with (8.2).
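The defining equation (8.2) is easily solved numerically. The following sketch (exponential utility, a fixed amount deductible and a gamma loss distribution, all of them illustrative assumptions) finds the zero utility premium by root finding and compares it with the pure net premium, in line with Theorem 2.

```python
# Sketch: numerical zero utility premium pi_r(w) from (8.2), exponential utility
# and fixed amount deductible; the loss distribution and parameters are assumed.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(6)
w, c, b = 100.0, 0.05, 5.0                            # wealth, risk aversion, deductible
X = rng.gamma(shape=2.0, scale=2.0, size=200_000)     # simulated losses
s = np.minimum(X, b)                                  # selfinsured part under (8.1)
u = lambda y: -np.exp(-c * y)                         # exponential utility (8.10)

target = u(w - X).mean()                              # expected utility of full selfinsurance
premium = optimize.brentq(lambda p: u(w - p - s).mean() - target, 0.0, 50.0)

print("zero utility premium  :", round(premium, 3))
print("pure net premium E[r] :", round(np.maximum(X - b, 0.0).mean(), 3))   # smaller, cf. Theorem 2
```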
B. How the zero utility premium depends on risk aversion.
We can now give precise content to the previously anticipated results about
risk aversion.


Theorem 3. If the risk aversion function is decreasing (increasing), then the
zero utility premium for full insurance r(x) = x is decreasing (increasing).
Proof: The zero utility premium \pi(w) for full insurance is given by
u(w - \pi(w)) = E[u(w - X)] .
Solve this with respect to \pi(w) and differentiate the resulting expression with
respect to w. The rest is left as an interesting exercise to the reader, who will
realize that convexity (concavity) of the function v(y) = u'(u^{-1}(y)) is a crucial
point. Study v'(y) using
\frac{d}{dy} u^{-1}(y) = \frac{1}{u'(u^{-1}(y))} .

8.3 Optimal insurance

A. Local versus global treaties.
The next result says that, from the point of view of the insured, a combined
insurance covering several risks should be designed such that the compensation
depends only on the total claim amount. More specifically, suppose the
insured carries n risks X_1, \ldots, X_n totalling to X = X_1 + \cdots + X_n, and that he
is offered an insurance treaty (\pi, r) with compensation r(X_1, \ldots, X_n) and selfinsurance
function s(X_1, \ldots, X_n) = X - r(X_1, \ldots, X_n). A treaty is said to be
global if r (and s) depends only on the total loss X. Otherwise it is said to be
local.
Theorem 4 (Pesonen-Ohlin). If the premium depends only on the pure net
premium, then to every local treaty there exists a global treaty, which has the
same premium and which is at least as good from the point of view of the insured.
Proof: Let (\pi, r(X_1, \ldots, X_n)) be a local treaty. Consider the global compensation
function R(X) = E[r(X_1, \ldots, X_n) \,|\, X] with corresponding selfinsurance
function S(X) = E[s(X_1, \ldots, X_n) \,|\, X]. By iterated expectation,
E[R(X)] = E[r(X_1, \ldots, X_n)] ,
so that the premium charged for the compensation R is \pi, the same as the
premium charged for r. Applying Jensen's inequality to the conditional expectation
for fixed X, we obtain
E[u(w - \pi - s(X_1, \ldots, X_n))]
= E\bigl\{ E[u(w - \pi - s(X_1, \ldots, X_n)) \,|\, X] \bigr\}
\le E\bigl[u(w - \pi - E[s(X_1, \ldots, X_n) \,|\, X])\bigr]
= E[u(w - \pi - S(X))] ,
which proves that the global treaty (\pi, R) is at least as good as the local (\pi, r).


The result does not say that every global treaty is good, and one easily
constructs examples of poor global treaties. The result states only that, for the
purpose of maximizing the expected utility of the insured, one can restrict to
global treaties. Moreover, the proof is constructive and shows how to design a
global treaty that outperforms a given local treaty.
Note that the proof does not make any particular use of the definition of
X and would work equally well for any random variable X. In fact, one could
obtain a treaty that is better than the global one constructed above by letting
X be something that is not related to the risks under consideration, e.g. the
turnover of cheese in the Irma shop last month or just a constant. The reason
why the total loss is of particular interest cannot be explained endogenously in
the present theory. It rests on other circumstances, e.g. that the insurer will
only offer treaties with a certain element of selfinsurance in them (selfinsurance
may serve to save costs by eliminating small claims, and it may give the insured
incentives to prevent losses).
B. Fixed amount deductible is optimal.
Henceforth X may represent an individual risk or, in view of the result in the
previous paragraph, the total risk related to a combined policy.
Theorem 5 (Arrow-Ohlin). If the premium depends only on the pure net premium,
then to every treaty there exists a fixed amount deductible treaty, which
has the same premium and which is at least as good from the point of view of
the insured.
Proof: Let (\pi, r) be any given treaty. The fixed amount deductible compensation
r_b(X) = (X - b)_+ is a continuous and decreasing function of b and, by monotone
convergence, also the corresponding pure premium E[r_b(X)] is a continuous
function of b with values ranging from E[X] to 0. Therefore, there exists a b
such that E[r_b(X)] = E[r(X)] so that the premium is the same for the two
treaties. We can now invoke Ohlin's lemma noting that, for y < b,
P[s_b(X) \le y] = P[X \le y] \le P[s(X) \le y] ,
and, for y \ge b,
P[s_b(X) \le y] = 1 \ge P[s(X) \le y] .

8.4 The position of the insurer

A. The expected utility of the insurer.
Let us assume that also the insurer has preferences that comply with the expected
utility hypothesis. More precisely, the insurer has a utility function \bar{u}
and an initial wealth \bar{w}, and he judges a given treaty (\pi, r) by his expected
utility at the end of the insurance period:
E[\bar{u}(\bar{w} + \pi - r(X))] .
The zero utility premium \bar{\pi}_r(\bar{w}) of the insurer is given by
E[\bar{u}(\bar{w} + \bar{\pi}_r(\bar{w}) - r(X))] = \bar{u}(\bar{w}) .
One easily proves that \bar{\pi}_r(\bar{w}) > E[r(X)]. An insurance treaty (\pi, r) can be signed
by both parties only if \bar{\pi}_r(\bar{w}) \le \pi \le \pi_r(w).
Reproducing arguments above, we easily show that global treaties are preferred
by the insurer (Pesonen-Ohlin), so in this respect the two parties have
common interests. However, analogous to Arrow-Ohlin, we show that first risk
is optimal to the insurer, and here the two parties have contradicting interests.
is optimal to the insurer, and here the two parties have contradicting interests.
Now, first risk is a somewhat extreme form of compensation since it leaves
the top risk to the insured. It may be reasonable to restrict attention to compensation functions that satisfy the Vajda condition:
r(x)/x is an increasing function of x .
Thus, the bigger the loss, the bigger the fraction covered by the Vajda compensation function.
Theorem 6. If the premium depends only on the pure net premium, then
to every Vajda treaty there exists a proportional insurance treaty, which has the
same premium and which is at least as good from the point of view of the insurer.
Proof: Simple. For a given treaty (, r) we find a proportional treaty with
compensation function rk (x) = (1 k)x and the same premium . Being Vajda,
r is increasing and so is rk of course. The difference


r(x)
(1 k)
r(x) rk (x) = r(x) (1 k)x = x
x
is ()0 according as x ()x0 for some x0 . Use the corollary to Ohlins
theorem. 
B. Conflicting interests.
We easily prove that, among Vajda treaties, proportional insurance is the worst
solution from the point of view of the insured. Not surprisingly, the interests of
the insured and the insurer are not the same, and they are even conflicting when
we take the position of only one party at a time and compare compensation
functions with the same pure premium under the assumption that the premium
then is fixed. The possibilities of coming to terms may be improved if the parties
were allowed to sit down and search for a compromise solution, letting both the
form of the compensation and the size of the premium be negotiable. One could then
imagine that the insurer, despite the fact that he would prefer first risk among all
compensation functions and proportional among Vajda ones when the premium
is fixed, still could consider offering the insured's favourite compensation (fixed
amount deductible) if he could do it against a premium that is acceptable to
him. We will now formalize these loose ideas.


8.5 Pareto-optimal risk exchanges

A. Risk exchanges.
Consider n economically independent agents (e.g. individuals, insurance companies, or reinsurance companies) who are carrying economic risk. In attempts
to get rid of, or at least mitigate, the riskiness of their individual businesses, the
agents enter into negotiations of treaties for mutual exchange of risk amongst
themselves for a certain period.
We assume that each agent i has a differentiable utility function ui and an
initial wealth wi , which will be reduced by an uncertain amount Xi during the
period of consideration. The unknown loss Xi is the risky part of the business
of agent i, and so we take the vector X = (X1 , . . . , Xn )0 to be random. We
speak of the Xi as losses because we primarily have insurable risks in mind.
However, the theory we are going to develop does not rest on this or any other
particular interpretation of the situation, and it does not require that the Xi be
non-negative. For instance, the agents could be investors who seek to reduce the
uncertainty associated with their individual investment portfolios through some
pool arrangement. In that situation an Xi would typically be either positive (a
loss) or negative (a gain), and hopefully the latter case would be the more
likely.
A treaty for exchange of risk must specify how much each individual agent
is to contribute to the coverage of the losses during the period, and these contributions
must fulfill the obvious budget constraint that the total loss must be
covered. Formally, a risk exchange is a function
f = (f_1, \ldots, f_n)' : R^n \to R^n
such that (almost surely)
\sum_{i=1}^{n} f_i(X) = \sum_{i=1}^{n} X_i .                                          (8.1)
Under the risk exchange treaty f each agent i will pay f_i = f_i(X) instead of his
original loss X_i.
As usual we adopt the expected utility hypothesis, assuming that each agent
i will judge any f by his expected terminal utility,
V_i(f) = E\,u_i(w_i - f_i) .
B. Pareto-optimality.
It is, of course, impossible to find a risk exchange that is optimal from the isolated
point of view of each individual agent. After all, the total loss is to be
shared between the agents somehow, and any reduction of one agent's share
must be compensated by increases of the shares of (some of) the others. In
such a situation, where the interests are partly conflicting, the fruitful approach
is to try and find out what all parties can agree on. Thus, they should first
rule out those treaties that all agents find uninteresting so as to remain with a
set of negotiable treaties. This vague notion is made precise by the concept of
Pareto-optimality, which we now define:
Definition. A risk exchange f is called Pareto-optimal if there exists no other
risk exchange \tilde{f} that is at least as good as f for all agents and strictly better for
some agents, that is,
V_i(\tilde{f}) \ge V_i(f)\ \forall i \ \Longrightarrow\ V_i(\tilde{f}) = V_i(f)\ \forall i.       (8.2)
Equivalently, we could say that f is not Pareto-optimal if there exists a
risk exchange \tilde{f} such that V_i(\tilde{f}) \ge V_i(f) for all i and V_i(\tilde{f}) > V_i(f) for some
i. Yet another way of putting it is to say that if f is Pareto-optimal and \tilde{f} is
a risk exchange such that V_i(\tilde{f}) > V_i(f) for some i, then we must also have
V_i(\tilde{f}) < V_i(f) for some other i.
Obviously, a risk exchange that is not Pareto-optimal can be excluded since
it can be replaced by some other risk exchange that improves the position of at
least one agent without impairing the position of any other agent. Thus, the
set of negotiable treaties is precisely the Pareto-optimal ones. Which Pareto-optimal
risk exchange to choose is a real issue of negotiation since replacing one
Pareto-optimal risk exchange with some other Pareto-optimal risk exchange
would improve the position of some agents and impair the position of some others.
C. The first Borch/du Mouchel theorem.
Pareto-optimality seems to be an absolute minimum requirement to place on a
negotiable treaty. It may therefore come as a pleasant surprise that the Pareto-optimal
solutions can be explicitly constructed and typically turn out to constitute
a (comparatively) small set of parametrized functions.
Theorem 7. A risk exchange f is Pareto-optimal if and only if the ratios between
the marginal utilities u'_i(w_i - f_i) are almost surely constant, that is, there exist
constants \kappa_i, i = 1, \ldots, n, such that
u'_i(w_i - f_i) = \kappa_i\, u'_1(w_1 - f_1) .                                         (8.3)
Remark: Necessarily, the \kappa_i must be strictly positive, and \kappa_1 = 1.
Proof: Sufficiency of the condition (8.3) is easy to prove. Suppose f satisfies the
condition, and let \tilde{f} be any other risk exchange. By the concavity of the utility
functions, we have (see Figure 8.1)
u_i(w_i - \tilde{f}_i) \le u_i(w_i - f_i) + u'_i(w_i - f_i)(f_i - \tilde{f}_i) .
First, insert (8.3) on the right here and then divide by \kappa_i to get
\frac{u_i(w_i - \tilde{f}_i) - u_i(w_i - f_i)}{\kappa_i} \le u'_1(w_1 - f_1)(f_i - \tilde{f}_i) .      (8.4)
Next, sum this expression over all i and use (8.1), which applies to both f and
\tilde{f}, to obtain
\sum_i \frac{u_i(w_i - \tilde{f}_i) - u_i(w_i - f_i)}{\kappa_i} \le 0 .
Finally, take expectation to arrive at
\sum_i \frac{V_i(\tilde{f}) - V_i(f)}{\kappa_i} \le 0 .
Since all \kappa_i are strictly positive, it is now clear that (8.2) must hold true.
For a proof of the necessity of the condition (8.3), we refer to [12].






[Figure 8.1: Illustration of the inequality (8.4) - the concave utility curve u_i lies below its tangent line u_i(w_i - f_i) + u'_i(w_i - f_i)(f_i - \tilde{f}_i) drawn at the point w_i - f_i.]


For a given set of utility functions u_i and initial wealths w_i we construct
the Pareto-optimal risk exchanges by solving the equations (8.3) with respect
to f, subject to the budget constraint (8.1). The entire class of Pareto-optimal
risk exchanges is obtained by varying the \kappa_i in the set of positive constants that
admit a solution.
The quantity
p_i = f_i(0, \ldots, 0)                                                                (8.5)
can naturally be interpreted as the premium payable by agent no. i.
It should be noted that the shape of the distribution of X does not play any
role in equation (8.3). Only its support is important since (8.3) is to hold almost
surely. For instance, if (X_1, \ldots, X_n) has a joint density that is strictly positive
in all of R^n_+, then (8.3) is to be solved for all non-negative X_i, i = 1, \ldots, n, and
this is all that matters. Dependencies and other features of the joint distribution
of the X_i's are irrelevant in this respect.
The solution to (8.3), if there is any, must be a function of \sum_i X_i, the w_i,
the \kappa_i, and the utility functions u_i. If the utility functions are well structured
parametric functions, the solution will usually be an explicit parametric expression.
In the next paragraph we shall outline the calculations and investigate
some general properties of the solution.
D. The second Borch/du Mouchel theorem.
Since the derivatives u'_i are strictly decreasing functions, they are invertible,
and we can recast (8.3) as
w_i - f_i = (u'_i)^{-1}\bigl(\kappa_i\, u'_1(w_1 - f_1)\bigr) .
Summing over i and using (8.1), we get
\sum_i w_i - \sum_i X_i = \sum_i (u'_i)^{-1}\bigl(\kappa_i\, u'_1(w_1 - f_1)\bigr) .            (8.6)
The function
g_1(y) = \sum_i (u'_i)^{-1}\bigl(\kappa_i\, u'_1(y)\bigr)
is strictly increasing, hence invertible, and so we can solve f_1 from (8.6):
f_1 = w_1 - g_1^{-1}\Bigl( \sum_i w_i - \sum_i X_i \Bigr) .                            (8.7)
The labeling of the agents is, of course, quite arbitrary, so a similar expression
holds for each f_j (just replace the index 1 by j).
Obviously, the expression on the right of (8.7) is a strictly increasing function
of \sum_i X_i. This fact deserves to be highlighted as a theorem:
Theorem 8. A Pareto-optimal risk exchange is a pool, that is, each individual
agent contributes a share that depends only on the total loss of the group. Moreover,
each agent must cover a genuine part of any increase of the total loss of the
group. If all utility functions are continuously differentiable, then the individual
shares are continuous functions of the total loss.
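The construction in (8.3) and (8.6)-(8.7) is easy to carry out numerically. The sketch below (exponential utilities, wealths, weights kappa_i and realized losses are all illustrative assumptions; the letter kappa itself is only the editor's notation for the weights) solves the budget equation for f_1 with a root finder and recovers the other contributions from (8.3).

```python
# Sketch: compute a Pareto-optimal risk exchange numerically from (8.3) and (8.1),
# exponential utilities u_i(y) = -exp(-c_i y); all inputs are illustrative.
import numpy as np
from scipy import optimize

w = np.array([100.0, 150.0, 80.0])          # initial wealths
c = np.array([0.05, 0.02, 0.08])            # exponential utility parameters
kappa = np.array([1.0, 0.9, 1.3])           # positive weights, kappa_1 = 1
x = np.array([10.0, 30.0, 5.0])             # realized losses X_i

u1_prime = lambda y: c[0] * np.exp(-c[0] * y)                  # u'_1
inv_ui_prime = lambda z, i: -np.log(z / c[i]) / c[i]           # (u'_i)^{-1}

def budget_gap(f1):
    # budget constraint (8.1): sum_i f_i(X) - sum_i X_i, with f_i taken from (8.3)
    z = kappa * u1_prime(w[0] - f1)
    contributions = [w[i] - inv_ui_prime(z[i], i) for i in range(len(w))]
    return sum(contributions) - x.sum()

f1 = optimize.brentq(budget_gap, -500.0, 500.0)
z = kappa * u1_prime(w[0] - f1)
f = np.array([w[i] - inv_ui_prime(z[i], i) for i in range(len(w))])
print("contributions f_i:", np.round(f, 3), "  sum =", round(f.sum(), 3),
      "  total loss =", x.sum())
```

As Theorem 8 predicts, rerunning the sketch with different loss vectors having the same total gives the same contributions, since only the sum of the X_i enters the budget equation.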

Appendix A

Hilbert spaces
A.1 Metric spaces

A. Metric spaces. Let \mathcal{X} be a set, whose elements are called points, and
suppose we assign to each pair of points x and y a non-negative number d(x, y)
measuring the distance between them. The distance function d(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to
R_+ should possess the following properties:
d(x, y) > 0 if x \ne y ;\qquad d(x, y) = 0 if x = y ,                                  (A.1)
d(x, y) = d(y, x) ,                                                                    (A.2)
d(x, z) \le d(x, y) + d(y, z) .                                                        (A.3)
These axioms are motivated by geometric notions. For instance, Figure A.1 gives a
planar illustration of the so-called triangle inequality (A.3). If d satisfies (A.1)-(A.3),
it is called a metric on \mathcal{X} and the pair (\mathcal{X}, d) is called a metric space.


[Figure A.1: The triangle inequality - a triangle with vertices x, y, z and sides of lengths d(x, y), d(y, z), d(x, z).]

Let (\mathcal{X}_i, d_i), i = 1, \ldots, q, be metric spaces. One easily verifies that each
of the following functions is a metric on the product space \mathcal{X}_1 \times \cdots \times \mathcal{X}_q with
generic point x = (x_1, \ldots, x_q):
d(x, y) = \max_i d_i(x_i, y_i) ,\qquad d(x, y) = \sum_i d_i(x_i, y_i) ,\qquad d(x, y) = \sqrt{\sum_i d_i(x_i, y_i)^2} .

B. Topological concepts. Let (\mathcal{X}, d) be a metric space. Equipped with
the yardstick d, we can speak about such things as openness and closedness of
subsets, convergence of sequences of points, and continuity of functions.
Given an x \in \mathcal{X} and an \epsilon > 0, the set of points within distance \epsilon from x is
called the \epsilon-neighbourhood of x. It is defined precisely as \{y \in \mathcal{X};\ d(x, y) < \epsilon\},
the "interior" of the "ball" with center x and "radius" \epsilon. When speaking of a
neighbourhood, we mean just some \epsilon-neighbourhood.
A sequence \{x_n\}_{n=1}^{\infty} \subseteq \mathcal{X}, or just x_n in short, is said to be convergent if
there exists an x \in \mathcal{X} such that d(x_n, x) \to 0 as n \to \infty. In other words,
each neighbourhood of x contains all but at most a finite number of the x_n.
We say that x_n converges to the limit x and write x_n \to x and x = \lim x_n.
The limit of a convergent sequence is unique; if x and y are both limits of
x_n, then d(x, y) \le d(x, x_n) + d(x_n, y), which can be made arbitrarily small by
taking n large enough, hence d(x, y) = 0. A similar argument shows that every
convergent sequence is a Cauchy sequence, which means that d(x_m, x_n) \to 0 as
m, n \to \infty or, in ceremonial language: \forall \epsilon > 0, \exists n_\epsilon such that d(x_m, x_n) < \epsilon
for all n, m \ge n_\epsilon. If the converse is true, so that every Cauchy sequence converges,
then the metric space is said to be complete. Examples of incomplete metric
spaces are plentiful, and the reader should provide some simple ones.
A subset \mathcal{Y} of \mathcal{X} is called open if each point in \mathcal{Y} has a neighbourhood
contained in \mathcal{Y}. A subset \mathcal{Y} of \mathcal{X} is called closed if it contains the limit of every
convergent sequence in \mathcal{Y}. Note that "closed" is not the antonym of "open"; a set
may be both open and closed, and it may be neither open nor closed. A union
of open sets is itself open. A finite union of closed sets is closed.
Let (\mathcal{X}, d) and (\mathcal{X}', d') be metric spaces. A function f : \mathcal{X} \to \mathcal{X}' is continuous
at x \in \mathcal{X} in case x_n \to x implies f(x_n) \to f(x). (Spell out the formal \epsilon-\delta
formulation of the definition.) If f is continuous at each x, we just say that f
is continuous.
A metric is continuous in the following sense: If x_n \to x and y_n \to y, then
d(x_n, y_n) \to d(x, y). To prove this, we need the following equivalent to the
triangle inequality (A.3):
|d(x, z) - d(y, z)| \le d(x, y) .                                                       (A.4)
(Follows from d(x, z) \le d(x, y) + d(y, z) and d(y, z) \le d(y, x) + d(x, z).) Now,
using first the triangle inequality for reals and then (A.4), we obtain
|d(x_n, y_n) - d(x, y)| \le |d(x_n, y_n) - d(x, y_n)| + |d(x, y_n) - d(x, y)|
\le d(x_n, x) + d(y_n, y) ,
which proves the asserted result. (To be precise, we have shown that the metric
is continuous when considered as a function on \mathcal{X} \times \mathcal{X} equipped with any of the
metrics listed at the end of the previous paragraph.)

A.2 Vector spaces

A. Properties of vector spaces. Let $X$ be a set with elements denoted by $x, y, z$, etc. We call $X$ a vector space or linear space if it is equipped with an addition function $a: X \times X \to X$, conveniently written $a(x, y) = x + y$, and a scalar multiplication function $s: \mathbb{R} \times X \to X$, written $s(c, y) = c\,y$, with the following properties:
1. Addition is commutative,
$$x + y = y + x , \qquad (A.5)$$
and associative,
$$x + (y + z) = (x + y) + z . \qquad (A.6)$$
2. Scalar multiplication is distributive,
$$c\,(x + y) = c\,x + c\,y , \qquad (A.7)$$
$$(c + d)\,x = c\,x + d\,x , \qquad (A.8)$$
and associative,
$$(cd)\,x = c\,(d\,x) , \qquad (A.9)$$
and, for all $x$,
$$1\,x = x .$$
3. There exists a unique null element $0$ in $X$ such that, for all $x$,
$$x + 0 = x \qquad (A.10)$$
and
$$0\,x = 0 . \qquad (A.11)$$

Putting $c = 1$ and $d = -1$ in (A.8) and using (A.11), we get $x + (-1)x = 0$. Thus, $(-1)x$ plays the role of a unique negative of $x$, and we shall denote it simply by $-x$. Accordingly, we shall also use the shorthand $x + (-y) = x - y$. The elements of a vector space are called vectors. To be precise, what we have defined here is a real vector space since the scalar multiplication is defined for real scalars; complex vector spaces are defined analogously, but are not needed in our context.
B. Linear independence, bases, and dimension. The vectors $x_1, \ldots, x_r$ are said to be linearly dependent if (at least) one of them can be expressed as a linear combination of the others or, equivalently, if there exist scalars $c_1, \ldots, c_r$, not all 0, such that
$$c_1 x_1 + \cdots + c_r x_r = 0 . \qquad (A.12)$$


If the only solution to (A.12) is $c_1 = \cdots = c_r = 0$, then the vectors $x_1, \ldots, x_r$ are said to be linearly independent. They are said to be a basis of $X$ if they are a maximal linearly independent set, that is, for each vector $y$ the set $y, x_1, \ldots, x_r$ is linearly dependent so that $y = c_1 x_1 + \cdots + c_r x_r$ for some scalars $c_1, \ldots, c_r$. We then also say that $X$ is spanned by $x_1, \ldots, x_r$. It is easily shown that any two bases must have the same number of elements, and this number is called the dimension of $X$. These definitions make sense if $r < \infty$, in which case we say that the space is of dimension $r$. If there is no finite basis, then the space is said to be of infinite dimension.
A subset $X'$ that is itself a linear space is called a linear subspace of $X$. Note that the only linear subspace with dimension 0 is the one-point set $\{0\}$.
Let $X$ and $Y$ be linear spaces. A function $f: X \to Y$ is said to be linear if $f(c_1 x_1 + c_2 x_2) = c_1 f(x_1) + c_2 f(x_2)$ for all $c_1, c_2 \in \mathbb{R}$ and $x_1, x_2 \in X$. This implies that the image $f(X) = \{f(x);\ x \in X\}$ is a linear subspace of $Y$. The dimension of $f(X)$ is called the rank of $f$ and is denoted by $\mathrm{rank}(f)$.
Suppose, moreover, that $g$ is a linear function from $Y$ to some linear space $Z$. Then the composed function $h = g \circ f: X \to Z$ defined by $h(x) = g(f(x))$ is linear, and $\mathrm{rank}(h) \le \min(\mathrm{rank}(g), \mathrm{rank}(f))$.
C. Inner products. Let $X$ be a real vector space. An inner product on $X$ is a function $\langle\cdot,\cdot\rangle: X \times X \to \mathbb{R}$ obeying the following rules:
Symmetry,
$$\langle x, y\rangle = \langle y, x\rangle . \qquad (A.13)$$
Linearity,
$$\langle \alpha x + \beta y, z\rangle = \alpha\langle x, z\rangle + \beta\langle y, z\rangle . \qquad (A.14)$$
Positive definiteness,
$$\langle x, x\rangle > 0 \ \text{if } x \neq 0 ; \qquad \langle x, x\rangle = 0 \ \text{if } x = 0 . \qquad (A.15)$$
(The latter part of (A.15) is redundant as it follows by linearity.) From (A.13)–(A.14) we deduce that the inner product is bilinear. The pair $(X, \langle\cdot,\cdot\rangle)$ is called an inner product space.
The norm or length of a vector $x$, denoted by $\|x\|$, is defined as
$$\|x\| = \sqrt{\langle x, x\rangle} .$$
Obviously, $\|\lambda x\| = |\lambda|\, \|x\|$.
By straightforward calculation,
$$\|x + y\|^2 = \|x\|^2 + 2\langle x, y\rangle + \|y\|^2 . \qquad (A.16)$$

The Cauchy-Schwarz inequality,
$$|\langle x, y\rangle| \le \|x\|\, \|y\| , \qquad (A.17)$$
is obtained upon replacing $y$ in (A.16) by $\lambda y$, and examining the resulting non-negative quadratic function of $\lambda \in \mathbb{R}$,
$$\|x + \lambda y\|^2 = \|x\|^2 + 2\lambda\langle x, y\rangle + \lambda^2 \|y\|^2 .$$
By differentiation we find that its minimum value is $\|x\|^2 - \langle x, y\rangle^2/\|y\|^2$, and since this must be non-negative, (A.17) follows.
Upon combining (A.16) and (A.17), we obtain the triangle inequality,
$$\|x + y\| \le \|x\| + \|y\| . \qquad (A.18)$$

The following identity is known as the parallelogram law:
$$\|x + y\|^2 + \|x - y\|^2 = 2\|x\|^2 + 2\|y\|^2 . \qquad (A.19)$$

D. Metrics induced by inner products. The distance between two vectors $x$ and $y$ can naturally be defined as the length of the difference between them,
$$d(x, y) = \|x - y\| .$$
This distance function obeys the rules of a metric: (A.1) follows from (A.15), (A.2) is trivially fulfilled, and (A.3) follows from (A.18) applied to
$$\|x - z\| = \|(x - y) + (y - z)\| \le \|x - y\| + \|y - z\| .$$

A.3 Hilbert spaces

A. Definition of Hilbert space. We have seen that an inner product space is a metric space. If it is complete, then it is called a Hilbert space.
B. Convexity. A subset $Y$ of a linear space is said to be convex if it contains the straight line segment joining any two of its elements. More precisely, if $x$ and $y$ are in $Y$, then so is $\lambda x + (1 - \lambda)y$ for each $\lambda \in (0, 1)$. A linear space is convex.
A real-valued function $f$ defined on a convex set $Y$ is said to be convex if $f(\lambda x + (1 - \lambda)y) \le \lambda f(x) + (1 - \lambda)f(y)$ for each pair $x, y \in Y$ and each $\lambda \in (0, 1)$. A linear function is convex. Observe that, by the triangle inequality, a norm is a convex function.
C. The distance from a vector to a convex subset. Consider a Hilbert space $(X, \langle\cdot,\cdot\rangle)$ with induced metric $d$. Given a point $x$ and a subset $Y$ in $X$, the distance from $x$ to $Y$ is, quite naturally, defined as
$$d(x; Y) = \inf_{y \in Y} d(x, y) .$$
If $Y$ is closed and convex, then there exists a unique point $y_0 \in Y$ such that $d(x, y_0) = d(x; Y)$. We shall call $y_0$ the closest point to $x$ in $Y$. The proof of this result goes as follows: Choose a sequence $y_n$ in $Y$ such that $d(x, y_n) \to d(x; Y)$. We first show that $y_n$ is a Cauchy sequence. By the parallelogram law (A.19),
$$\|(y_n - x) - (y_m - x)\|^2 + \|(y_n - x) + (y_m - x)\|^2 = 2\,\|y_n - x\|^2 + 2\,\|y_m - x\|^2 ,$$
hence
$$\|y_n - y_m\|^2 = 2\,\|y_n - x\|^2 + 2\,\|y_m - x\|^2 - 4\,\bigl\|\tfrac{1}{2}(y_n + y_m) - x\bigr\|^2 .$$
Since $Y$ is convex, it contains $\tfrac{1}{2}y_n + \tfrac{1}{2}y_m$, and so $\|\tfrac{1}{2}(y_n + y_m) - x\|^2 \ge d^2(x; Y)$. It follows that
$$\|y_n - y_m\|^2 \le 2\,\|y_n - x\|^2 + 2\,\|y_m - x\|^2 - 4\, d^2(x; Y) .$$
As $n$ and $m$ tend to $\infty$, the expression on the right tends to 0, showing that $y_n$ is Cauchy.
Then, since $X$ is Hilbert, $y_0 = \lim y_n$ must exist. Moreover, since $Y$ is closed, $y_0 \in Y$. Since $\|x - y_0\| = \lim \|x - y_n\| = d(x; Y)$, we conclude that $y_0$ is a closest point to $x$ in $Y$. It remains only to show that it is unique.
Thus, suppose $z_0 \in Y$ is such that $\|x - z_0\| = d(x; Y)$. On the one hand, since $Y$ is convex, $\lambda y_0 + (1 - \lambda)z_0 \in Y$ for $\lambda \in (0, 1)$, and so
$$\|x - \lambda y_0 - (1 - \lambda)z_0\| \ge d(x; Y) .$$
On the other hand, by the triangle inequality,
$$\|x - \lambda y_0 - (1 - \lambda)z_0\| \le \lambda\|x - y_0\| + (1 - \lambda)\|x - z_0\| = d(x; Y) .$$
It follows that $\|x - \lambda y_0 - (1 - \lambda)z_0\| = d(x; Y)$. Thus, the expression
$$\|x - \lambda y_0 - (1 - \lambda)z_0\|^2 = \|x - z_0\|^2 + 2\lambda\langle x - z_0, z_0 - y_0\rangle + \lambda^2\|z_0 - y_0\|^2 ,$$
considered as a function of $\lambda \in (0, 1)$, is constant. This is possible only if the coefficient of the square term is 0, that is, $z_0 = y_0$.
D. Orthogonality. Consider an inner product space $(X, \langle\cdot,\cdot\rangle)$. Two vectors $x$ and $y$ are said to be orthogonal, written $x \perp y$, if $\langle x, y\rangle = 0$. In this case (A.16) becomes
$$\|x + y\|^2 = \|x\|^2 + \|y\|^2 , \qquad (A.20)$$
which is known as the Pythagoras equality.
Let $Y$ be a subset of $X$. A vector $x$ is said to be orthogonal to $Y$, written $x \perp Y$, if $x \perp y$ for all $y \in Y$. The set of vectors $x$ with this property is called the annihilator of $Y$ and is denoted $Y^\perp$. Obviously, $Y^\perp$ is a linear subspace of $X$.
Two subsets $Y$ and $Z$ of $X$ are said to be orthogonal, written $Y \perp Z$, if $y \perp z$ whenever $y \in Y$ and $z \in Z$. By definition, $Y \perp Y^\perp$.


E. Projections. Let $(X, \langle\cdot,\cdot\rangle)$ be an inner product space. Let $Y$ be a linear subspace and $x$ a given vector.
From solid geometry we know that the distance from a given point to a given plane is the length of the perpendicular dropped from the point onto the plane. With this motivation, suppose we can determine an orthogonal projection of $x$ onto $Y$, denoted $\mathrm{pro}(x|Y)$, such that
$$\mathrm{pro}(x|Y) \in Y , \qquad (A.21)$$
$$x - \mathrm{pro}(x|Y) \perp Y . \qquad (A.22)$$
Then $\mathrm{pro}(x|Y)$ is a minimum distance vector to $x$ in $Y$, and it is unique. To see this, take any $y \in Y$, and apply Pythagoras (A.20) to $\mathrm{pro}(x|Y) - y \in Y$ and $x - \mathrm{pro}(x|Y) \perp Y$:
$$\|x - y\|^2 = \|(x - \mathrm{pro}(x|Y)) + (\mathrm{pro}(x|Y) - y)\|^2 = \|x - \mathrm{pro}(x|Y)\|^2 + \|\mathrm{pro}(x|Y) - y\|^2 .$$
The assertion follows immediately, and we have
$$d^2(x; Y) = \|x - \mathrm{pro}(x|Y)\|^2 = \|x\|^2 - \|\mathrm{pro}(x|Y)\|^2 , \qquad (A.23)$$
where the latter equality follows by Pythagoras.
It also follows that the projection of $x$ onto $Y^\perp$ (linear) exists and is nothing but
$$\mathrm{pro}(x|Y^\perp) = x - \mathrm{pro}(x|Y) , \qquad (A.24)$$
since $x - \mathrm{pro}(x|Y) \in Y^\perp$ and $x - (x - \mathrm{pro}(x|Y)) = \mathrm{pro}(x|Y) \in Y \perp Y^\perp$.
Another way of putting the result is that $x$ has a unique decomposition
$$x = x_Y + x_{Y^\perp} ,$$
where $x_Y \in Y$ and $x_{Y^\perp} \in Y^\perp$ are the projections of $x$ onto the two spaces.
If the linear subspace $Y$ is complete, it follows by Paragraph C above that the projection of any $x$ onto $Y$ exists. However, the existence issue can often be dispensed with since, in fact, it is settled once we find a solution to (A.21)–(A.22). The latter may be spelled out as
$$x - \mathrm{pro}(x|Y) \perp y , \qquad y \in Y , \qquad (A.25)$$
the normal equation(s) for the projection problem.
One readily verifies that
$$\mathrm{pro}\Bigl(\sum_{j=1}^{k} \lambda_j x_j \,\Big|\, Y\Bigr) = \sum_{j=1}^{k} \lambda_j\, \mathrm{pro}(x_j|Y) . \qquad (A.26)$$
(Check that the expression on the right fits into the definition of the projection on the left.) Thus, viewed as a mapping of $X$ onto $Y$, $\mathrm{pro}(\cdot|Y)$ is linear.
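To make the normal equations (A.25) concrete, here is a small Python sketch (an added illustration, not part of the original notes) that projects a vector onto the span of the columns of a matrix in Euclidean space; all variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))        # columns of A span the subspace Y
x = rng.normal(size=6)

# Normal equations: <x - A b, a_j> = 0 for every column a_j, i.e. A'A b = A'x.
b = np.linalg.solve(A.T @ A, A.T @ x)
proj = A @ b                        # pro(x | Y)

assert np.allclose(A.T @ (x - proj), 0.0)                          # residual orthogonal to Y, cf. (A.22)
assert np.isclose(x @ x, proj @ proj + (x - proj) @ (x - proj))    # Pythagoras, cf. (A.23)
```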



One also finds that the projection mapping is idempotent, that is,
pro(x|Y) = pro(pro(x|Y) | Y) .

This result is a special case of a very useful result on iterated projections, coming up next.
F. Iterated projections. Let $Y$ and $Z$ be closed linear subspaces of $X$ such that $Y \subset Z$. Then, for any given vector $x$,
$$\mathrm{pro}(x|Y) = \mathrm{pro}(\mathrm{pro}(x|Z)|Y) . \qquad (A.27)$$
To prove this, we need only check that the iterated projection on the right of (A.27) satisfies (A.21) and (A.22). The first part is trivial, and the second part follows by writing $x - \mathrm{pro}(\mathrm{pro}(x|Z)|Y) = (x - \mathrm{pro}(x|Z)) + (\mathrm{pro}(x|Z) - \mathrm{pro}(\mathrm{pro}(x|Z)|Y))$ and observing that the first term here is orthogonal to $Z$, hence to $Y \subset Z$, and the second term is trivially orthogonal to $Y$.
[Figure A.2: Iterated projections. The vector $x$ is first projected onto $Z$ and then onto $Y \subset Z$, landing at the same point $\mathrm{pro}(x|Y)$ as the direct projection.]


By repeated use of the iterated projection theorem we obtain the following
general result: Let {0} = X0 X1 Xn = X be nested closed subspaces
of X . Then any vector x decomposes into
x = x1 + + x n ,
where

xj = pro(x|Xj ) pro(x|Xj1 ) Xj Xj1


,

hence the xj are orthogonal, and

kxk2 = kx1 k2 + + kxn k2 .

98

APPENDIX A. HILBERT SPACES


Put differently,
pro(x|Xj ) = x1 + + xj ,

and

kpro(x|Xj )k2 = kx1 k2 + + kxj k2 ,

j = 1, . . . , n.
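The iterated projection theorem (A.27) is easy to check numerically. The following Python sketch, added here for illustration with ad hoc names, projects a vector onto a subspace directly and via a larger subspace containing it.

```python
import numpy as np

def pro(x, B):
    """Orthogonal projection of x onto the column span of B (Euclidean inner product)."""
    return B @ np.linalg.solve(B.T @ B, B.T @ x)

rng = np.random.default_rng(2)
Z = rng.normal(size=(8, 4))   # a subspace Z of R^8 (column span)
Y = Z[:, :2]                  # a subspace Y contained in Z (first two columns)
x = rng.normal(size=8)

# (A.27): projecting first onto Z and then onto Y gives the direct projection onto Y.
assert np.allclose(pro(x, Y), pro(pro(x, Z), Y))
```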

A.4 Special Hilbert spaces

A. The space of sequences of reals. The space of infinite sequences $x = (x_1, x_2, \ldots)$ is denoted by $\mathbb{R}^\infty$. It becomes a vector space when addition and scalar multiplication are performed entrywise, hence $0 = (0, 0, \ldots)$. It is of infinite dimension, and a basis (among many others) is $e_1 = (1, 0, 0, \ldots)$, $e_2 = (0, 1, 0, \ldots), \ldots$ An inner product is defined by $\langle x, y\rangle = \sum_{j=1}^{\infty} p_j x_j y_j$, where the $p_j$ form a sequence of strictly positive numbers. The subspace $X$ of all $x$ such that $\sum_{j=1}^{\infty} p_j x_j^2 < \infty$ is complete in the metric of this inner product, that is, $(X, \langle\cdot,\cdot\rangle)$ is a Hilbert space.
The Euclidean $n$-space $\mathbb{R}^n$ of column $n$-vectors may be viewed as an $n$-dimensional subspace of $\mathbb{R}^\infty$. This case is treated more extensively in Appendix B.
B. The space of square integrable random variables. Let $(\Omega, P, \mathcal{F})$ be some probability space. Denote by $L^2$ the set of all random variables $X$ satisfying $\mathrm{E}X^2 < \infty$, which is indeed a vector space. Equip it with the inner product defined by
$$\langle X, Y\rangle = \mathrm{E}[XY] .$$
One easily checks that $\langle\cdot,\cdot\rangle$ satisfies (A.13)–(A.15) under the convention that $X = 0$ means $P[X = 0] = 1$, that is, the elements of this space are classes of equivalent random variables.
We shall prove that the inner product space $(L^2, \langle\cdot,\cdot\rangle)$ is Hilbert, that is, we shall establish that it is complete. Thus, given a Cauchy sequence $X_n$, we must construct a square integrable r.v. $X$ such that $X_n \to X$. Specify a decreasing sequence of strictly positive numbers $\epsilon_i$, $i = 1, 2, \ldots$, such that $\sum_{i=1}^{\infty} \epsilon_i < \infty$. For each $i = 1, 2, \ldots$ let $n_i$ be the $n_\epsilon$ corresponding to $\epsilon = \epsilon_i$ in the Cauchy criterion. Consider the subsequence $X_{n_i}$ and form a new sequence with elements $Y_i = X_{n_{i+1}} - X_{n_i}$, $i = 1, 2, \ldots$. By the triangle inequality and the fact that $\|Y_i\| \le \epsilon_i$, we have, for each $k = 1, 2, \ldots$,
$$\Bigl\| \sum_{i=1}^{k} |Y_i| \Bigr\| \le \sum_{i=1}^{k} \|Y_i\| \le \sum_{i=1}^{\infty} \|Y_i\| \le \sum_{i=1}^{\infty} \epsilon_i < \infty .$$
Applying here the monotone convergence theorem to $\sum_{i=1}^{k} |Y_i| \nearrow \sum_{i=1}^{\infty} |Y_i|$, we get
$$\Bigl\| \sum_{i=1}^{\infty} |Y_i| \Bigr\| < \infty . \qquad (A.28)$$
It follows that, almost surely, $\sum_{i=1}^{\infty} |Y_i|$ is finite, hence, by absolute summability, $\sum_{i=1}^{\infty} Y_i$ exists and is finite. Now, our candidate limit of $X_n$ is $X = X_{n_1} + \sum_{i=1}^{\infty} Y_i$, which is certainly square integrable. We have
$$\|X_{n_i} - X\| = \Bigl\| \sum_{j=i}^{\infty} Y_j \Bigr\| \le \sum_{j=i}^{\infty} \|Y_j\| ,$$
where the last inequality follows by a replay of previous arguments. By (A.28) it follows that $\|X_{n_i} - X\| \to 0$ as $i \to \infty$, that is, the subsequence $X_{n_i}$ converges to $X$. Convergence of the original sequence $X_n$ to $X$ now follows from the triangle inequality, $\|X_n - X\| \le \|X_n - X_{n_i}\| + \|X_{n_i} - X\|$, and the fact that the terms on the right can be made arbitrarily small by choosing $n$ and $n_i$ large enough ($X_n$ is Cauchy and $X_{n_i} \to X$).
We remark that we could replace the probability measure $P$ with a more general measure $\mu$ and, accordingly, replace $\mathrm{E}Y$ with $\int Y(\omega)\, d\mu(\omega)$ and 'almost surely' with 'almost everywhere' throughout.

Bibliography
Excellent expositions on general Hilbert space theory are [8] and [45].
Hilbert space methods were introduced in credibility theory by De Vylder
in two seminal papers [16] and [15]. From a mathematical point of view, the
essential features of a large class of estimation problems are that the set of
admitted estimators form a linear space and the performance of an estimator is
measured by its distance from the estimand in some suitable sense. This makes
Hilbert spaces the appropriate framework for a general treatment.
Credibility estimation in continuous time models, launched in [42], provides
an example where Hilbert space methods are indispensable.

Appendix B

Matrix algebra
A. Definition of matrices and vectors. An $m \times n$ ($m$ by $n$) matrix $A$ is a rectangular scheme of numbers organised in $m$ horizontal rows and $n$ vertical columns;
$$A = (a_{ij}) = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} .$$
The number $a_{ij}$ in row $i$ and column $j$ is called the $(i, j)$-entry of $A$. The space of $m \times n$ matrices is denoted by $\mathbb{R}^{m \times n}$. When emphasis of dimension is needed, we shall sometimes write $A_{m \times n}$ to show that $A$ is in $\mathbb{R}^{m \times n}$.
The algebraic operations of addition and scalar multiplication are defined for matrices by performing them entry-wise. Thus, for $A_{m \times n} = (a_{ij})$ and $B_{m \times n} = (b_{ij})$ the $(i, j)$-entry of $A + B$ is $a_{ij} + b_{ij}$ and, for $c$ scalar, the $(i, j)$-entry of $c\,A$ is $c\,a_{ij}$. This way $\mathbb{R}^{m \times n}$ becomes a linear space of dimension $mn$, with null element $0_{m \times n}$ whose entries are all zero.
The transpose of an $m \times n$ matrix $A$ is an $n \times m$ matrix, denoted by $A'$, whose $(i, j)$-entry is $a_{ji}$, that is, $A'$ is obtained by turning the rows of $A$ into columns (or vice versa). A square matrix $A$ is said to be symmetric if $A' = A$.
An $n \times 1$ matrix is called a column vector or, more specifically, an $n$-vector. We shall denote column vectors by lower case bold letters and with single subscript on the entries, e.g.
$$\mathbf{a} = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} .$$
The linear space $\mathbb{R}^{n \times 1}$ of such vectors, often referred to as the Euclidean space of dimension $n$, is written $\mathbb{R}^n$ in short. A $1 \times n$ matrix is called a row vector and shall invariably be viewed as a transposed column vector, e.g.
$$\mathbf{a}' = (a_1, \ldots, a_n) .$$


To the extent that we are interested in matrices merely as lists of their entries, we could as well reorganise them into vectors by concatenating their columns. More precisely, we introduce the vec function from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^{mn}$ by
$$\mathrm{vec}(A) = (a_{11}, \ldots, a_{m1}, a_{12}, \ldots, a_{m2}, \ldots, a_{1n}, \ldots, a_{mn})' .$$
B. Partitioned matrices. It is sometimes useful to partition a matrix into blocks of submatrices as
$$A_{m \times n} = (A_{ij}) = \begin{pmatrix} A_{11} & \cdots & A_{1q} \\ \vdots & & \vdots \\ A_{p1} & \cdots & A_{pq} \end{pmatrix} , \qquad (B.1)$$
where each $A_{ij}$ is an $m_i \times n_j$ matrix, with $\sum_{i=1}^{p} m_i = m$ and $\sum_{j=1}^{q} n_j = n$. In particular, $A$ may be viewed as a column of row vectors,
$$A = \begin{pmatrix} \mathbf{a}_1' \\ \vdots \\ \mathbf{a}_m' \end{pmatrix} , \qquad \text{where } \mathbf{a}_i' = (a_{i1}, \ldots, a_{in}) , \qquad (B.2)$$
or as a row of column vectors,
$$A = (\mathbf{a}_{\cdot 1}, \ldots, \mathbf{a}_{\cdot n}) , \qquad \text{where } \mathbf{a}_{\cdot j} = \begin{pmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{pmatrix} . \qquad (B.3)$$

C. Special patterns. An $n \times n$ matrix is said to be a square matrix. By $\mathrm{diag}(a_1, \ldots, a_n)$, sometimes written $\mathrm{diag}_{j=1,\ldots,n}(a_j)$, is meant the $n \times n$ matrix $A$ with entries $a_i$ down the principal (north-west to south-east) diagonal and zeros elsewhere, that is, $a_{ii} = a_i$ and $a_{ij} = 0$ for $i \neq j$. We then speak about a diagonal matrix. More generally, a block diagonal matrix is written $\mathrm{diag}(A_{11}, \ldots, A_{pp})$ with an obvious interpretation. The matrix $I_{n \times n} = \mathrm{diag}(1, \ldots, 1)$ is called the identity matrix of dimension $n$. This is an example of a permutation matrix, which is any square matrix such that in each row and in each column there is exactly one entry that is equal to 1 and the rest are 0. A triangular matrix is a square matrix with only zeros below the principal diagonal.
D. Euclidean geometry. The Euclidean space $\mathbb{R}^n$ becomes an inner product space if we equip it with the scalar product $\langle \mathbf{x}, \mathbf{y}\rangle = \sum_j x_j y_j$. One easily checks that the product is bilinear, symmetric, and that $\langle \mathbf{x}, \mathbf{x}\rangle \ge 0$ with equality if and only if $\mathbf{x} = \mathbf{0}$. The corresponding norm $\|\cdot\|$ is given by
$$\|\mathbf{x}\|^2 = \langle \mathbf{x}, \mathbf{x}\rangle = \sum_j x_j^2 .$$
In a geometric interpretation $\|\mathbf{x}\|$ is the length of $\mathbf{x}$ and $\langle \mathbf{x}, \mathbf{y}\rangle/(\|\mathbf{x}\|\, \|\mathbf{y}\|)$ is the cosine of the angle between $\mathbf{x}$ and $\mathbf{y}$. (Draw pictures in $\mathbb{R}^2$!)


Two vectors $\mathbf{x}$ and $\mathbf{y}$ are said to be orthogonal, and we write $\mathbf{x} \perp \mathbf{y}$, if $\langle \mathbf{x}, \mathbf{y}\rangle = 0$.
The norm $\|A\|$ of a matrix $A = (a_{ij})$ can quite naturally be defined by
$$\|A\|^2 = \sum_{i,j} a_{ij}^2 = \|\mathrm{vec}(A)\|^2 = \mathrm{tr}(A'A) = \mathrm{tr}(AA') ,$$
which is the sum of the squared norms of its columns (or its rows).
E. The matrix product. If $A$ and $B$ are matrices of dimensions $m \times n$ and $n \times \ell$, respectively, we define the matrix product $AB$ as the $m \times \ell$ matrix whose $(i, k)$-entry is the scalar product of the $i$-th row of $A$ and the $k$-th column of $B$; $\mathbf{a}_i' \mathbf{b}_{\cdot k} = \sum_{j=1}^{n} a_{ij} b_{jk}$.
One easily checks that the matrix product can be formed at the level of blocks; if $A_{m \times n}$ is partitioned as in (B.1) and $B_{n \times \ell}$ is partitioned into $B = (B_{jk})$, $j = 1, \ldots, q$, $k = 1, \ldots, r$, such that $B_{jk}$ has dimension $n_j \times \ell_k$ with $\sum_{k=1}^{r} \ell_k = \ell$, then
$$AB = \Bigl(\sum_j A_{ij} B_{jk}\Bigr) . \qquad (B.4)$$
In particular, with the notational device (B.2)–(B.3),
$$AB = \sum_j \mathbf{a}_{\cdot j}\, \mathbf{b}_j' .$$
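For illustration (not part of the original notes), the last representation of the product as a sum of outer products can be verified numerically with a short Python sketch; the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))

# AB as the sum over j of the outer products of column j of A with row j of B.
outer_sum = sum(np.outer(A[:, j], B[j, :]) for j in range(4))
assert np.allclose(outer_sum, A @ B)
```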

F. Linear subspaces of the Euclidean space. A set $\mathbf{a}_1, \ldots, \mathbf{a}_r$ of vectors in $\mathbb{R}^n$ is linearly dependent if there exists an $r$-vector $\mathbf{c} \neq \mathbf{0}$ such that
$$A\mathbf{c} = \mathbf{0} , \qquad (B.5)$$
where $A_{n \times r} = (\mathbf{a}_1, \ldots, \mathbf{a}_r)$. If the only solution to (B.5) is $\mathbf{c} = \mathbf{0}$, then the vectors are linearly independent.
We denote by $R(A)$ the linear subspace of $\mathbb{R}^n$ spanned by the columns of $A$, that is, $R(A) = \{\mathbf{b};\ \mathbf{b} = A\mathbf{c} \text{ for some } \mathbf{c}_{r \times 1}\}$. If $\mathbf{b}_1, \ldots, \mathbf{b}_q$ form a basis of $R(A)$, then there exists a $q \times r$ matrix $C$ such that $A = BC$, where $B = (\mathbf{b}_1, \ldots, \mathbf{b}_q)$. The dimension of $R(A)$ is $q$, and this is also called the rank of $A$ and is denoted by $\mathrm{rank}(A)$. We have $\mathrm{rank}(A) = \mathrm{rank}(A')$. If $A$ is of dimension $m \times n$, then $\mathrm{rank}(A) \le \min(m, n)$. If $\mathrm{rank}(A) = \min(m, n)$, then $A$ is said to be of full rank. If $AB$ is well defined, then $\mathrm{rank}(AB) \le \min(\mathrm{rank}(A), \mathrm{rank}(B))$.
If $A$ is a square matrix of full rank, then $A$ is invertible in the sense that there exists a unique $n \times n$ matrix $A^{-1}$, called the inverse of $A$, such that
$$A^{-1} A = A A^{-1} = I . \qquad (B.6)$$
A non-invertible square matrix is also called singular and, correspondingly, one also speaks of an invertible matrix as non-singular. Obviously, $A$ is singular if and only if the equation $A\mathbf{x} = \mathbf{0}$ has a solution $\mathbf{x} \neq \mathbf{0}$.

G. Matrices as functions. So far $m \times n$ matrices have been viewed as elements in the space $\mathbb{R}^{m \times n}$. Alternatively they can be viewed as linear functions from $\mathbb{R}^n$ to $\mathbb{R}^m$; $A_{m \times n}$ maps an $n$-vector $\mathbf{x}$ to the $m$-vector $A\mathbf{x}$. In this perspective the notions of rank, inverse, and identity are well motivated. Also the term permutation matrix pertains to this idea; operating on a vector with a permutation matrix amounts to permuting its entries.
If $A_{n \times n}$ and $B_{n \times n}$ are invertible, then $AB$ is invertible and
$$(AB)^{-1} = B^{-1} A^{-1} . \qquad (B.7)$$
This follows by observing that $B^{-1} A^{-1} AB = I$ and the fact that the inverse is unique.
H. Some useful matrix identities. Let $A$ be an invertible matrix partitioned as
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} .$$
Then the inverse of $A$, partitioned correspondingly as
$$A^{-1} = \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix} ,$$
is given by
$$A^{11} = (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} , \qquad (B.8)$$
$$A^{12} = -A^{11} A_{12} A_{22}^{-1} \qquad (B.9)$$
$$\phantom{A^{12}} = -A_{11}^{-1} A_{12} A^{22} , \qquad (B.10)$$
and $A^{21}$ and $A^{22}$ defined by symmetry (just interchange the roles of 1 and 2 in sub- and superscripts in (B.8)–(B.10)). The result is straightforwardly verified by inserting the partitioned forms of the matrices into the defining relation (B.6). For instance, write $A^{-1}A = I$ as
$$\begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix} \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix} ,$$
and perform the multiplication on the left by the rule (B.4) to obtain
$$A^{11} A_{11} + A^{12} A_{21} = I ,$$
$$A^{11} A_{12} + A^{12} A_{22} = 0 ,$$
(plus two similar equations for $A^{21}$ and $A^{22}$). The solution to these equations is (B.8)–(B.9). Starting instead from $AA^{-1} = I$, we arrive at (B.8) and (B.10).
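As a sanity check, added here for illustration with arbitrary test data, the following Python sketch verifies the block formulas (B.8) and (B.9) against a direct inversion.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)            # a well-conditioned symmetric matrix
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

Ainv = np.linalg.inv(A)
A11_sup = np.linalg.inv(A11 - A12 @ np.linalg.inv(A22) @ A21)            # (B.8)
assert np.allclose(Ainv[:2, :2], A11_sup)
assert np.allclose(Ainv[:2, 2:], -A11_sup @ A12 @ np.linalg.inv(A22))    # (B.9)
```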


Lemma 1 The following identity holds true for $A_{n \times n}$ invertible, $B_{n \times m}$ arbitrary, and $C_{m \times m}$ invertible, such that the inverses indicated exist:
$$(A + BCB')^{-1} = A^{-1} - A^{-1}B(C^{-1} + B'A^{-1}B)^{-1}B'A^{-1} . \qquad (B.11)$$
In particular, for any vector $\mathbf{b}_{n \times 1}$,
$$(A + \mathbf{b}\mathbf{b}')^{-1} = A^{-1} - \frac{1}{1 + \mathbf{b}'A^{-1}\mathbf{b}}\, A^{-1}\mathbf{b}\mathbf{b}'A^{-1} . \qquad (B.12)$$
Remark: The result is useful in multivariate analysis where inversion of matrices of the form $A + BCB'$ is frequently encountered, typically with $A$ and $C$ some covariance matrices. Apart from producing certain nice formulas, the result may also reduce computational work. Suppose $A$ is easy to invert (e.g. a diagonal matrix) and that $m < n$. Then the $n \times n$ inversion on the left of (B.11) reduces to the $m \times m$ inversion on the right.
Proof: Let us denote the matrix in question by $D = (A + BCB')^{-1}$. By definition
$$DA + DBCB' = I . \qquad (B.13)$$
Postmultiplication with $A^{-1}B$ gives
$$DB + DBCB'A^{-1}B = A^{-1}B ,$$
from which we solve
$$DB = A^{-1}B(I + CB'A^{-1}B)^{-1} .$$
Upon substituting this expression for $DB$ back in the second term on the left in (B.13) and isolating $D$ in the first term, we obtain
$$D = A^{-1} - A^{-1}B(I + CB'A^{-1}B)^{-1}CB'A^{-1} ,$$
which, by virtue of (B.7), is the same as (B.11). □
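The identity (B.11) is easy to verify numerically. The Python sketch below, added as an illustration with ad hoc data, also reflects the computational remark: with a diagonal A, only an m-by-m inversion is needed on the right-hand side.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))     # an easy-to-invert diagonal matrix
B = rng.normal(size=(n, m))
C = np.eye(m)

lhs = np.linalg.inv(A + B @ C @ B.T)            # the n-by-n inversion on the left of (B.11)
Ainv = np.diag(1.0 / np.diag(A))
rhs = Ainv - Ainv @ B @ np.linalg.inv(np.linalg.inv(C) + B.T @ Ainv @ B) @ B.T @ Ainv
assert np.allclose(lhs, rhs)
```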
I. The trace operator. The trace of a square matrix $A$ is defined as the sum of its entries on the principal diagonal and is denoted by $\mathrm{tr}(A)$. Thus,
$$\mathrm{tr}(A_{n \times n}) = \sum_{i=1}^{n} a_{ii} .$$
The trace is a linear operator. If $A$ and $B$ are matrices of dimensions $n \times m$ and $m \times n$, respectively, then
$$\mathrm{tr}(AB) = \mathrm{tr}(BA) , \qquad (B.14)$$
that is, the trace is invariant under cyclical permutations of the factors in a (square) matrix product.
J. Determinants. The determinant of a square matrix $A_{n \times n} = (a_{ij})$ is defined as
$$\det(A) = \sum_{j_1, \ldots, j_n} \mathrm{sign}(j_1, \ldots, j_n)\, a_{1 j_1} \cdots a_{n j_n} , \qquad (B.15)$$
where the summation extends over all $n!$ permutations of $(1, \ldots, n)$ and $\mathrm{sign}(j_1, \ldots, j_n)$ is the so-called sign of the permutation, which is $+1$ or $-1$ according as the permutation is even or odd: A permutation is even/odd if it is obtained by an even/odd number of interchanges of positions of entries, two at a time. (There are, of course, infinitely many ways of obtaining a given permutation through such interchanges, but it can be shown that either they are all even or they are all odd, so that these concepts are well defined.) It can be shown that
$$\det(AB) = \det(A)\det(B)$$
and that
$$\det(A') = \det(A) .$$
Obviously, for an invertible square matrix $A$, we have $\det(A^{-1}) = (\det(A))^{-1}$.
Let $A^{ji}$ denote the $(n-1) \times (n-1)$ matrix obtained by crossing out the $i$-th row and the $j$-th column from $A$. Determinants can be calculated recursively by the rule
$$\det(A) = \sum_{i=1}^{n} a_{ij}\, \mathrm{Cof}_{ij} ,$$
where
$$\mathrm{Cof}_{ij} = (-1)^{i+j} \det(A^{ji})$$
is the so-called $(i, j)$-cofactor of $A$.
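For illustration (not part of the original notes), the cofactor recursion can be coded directly. The sketch below expands along the first column; it only mirrors the formula, since its cost grows like n!, and one would use a standard routine in practice.

```python
import numpy as np

def det_cofactor(A):
    """Determinant by recursive cofactor expansion along the first column (illustrative only)."""
    n = A.shape[0]
    if n == 1:
        return A[0, 0]
    total = 0.0
    for i in range(n):
        minor = np.delete(np.delete(A, i, axis=0), 0, axis=1)   # cross out row i and column 1
        total += (-1) ** i * A[i, 0] * det_cofactor(minor)
    return total

A = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
assert np.isclose(det_cofactor(A), np.linalg.det(A))
```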


It can be proved that $A$ is invertible if and only if $\det(A) \neq 0$, and that in this case
$$A^{-1} = \frac{1}{\det(A)}\, (\mathrm{Cof}_{ij})' .$$
For certain nicely structured matrices it is easy to find the determinant. For instance, if $A$ is triangular, then $\det(A)$ is the product of the diagonal elements. (Explain why!) In particular this goes for a diagonal matrix. More generally, $\det(\mathrm{diag}(A_{11}, \ldots, A_{pp})) = \det(A_{11}) \cdots \det(A_{pp})$.
A permutation matrix has determinant $\pm 1$; it is realized that one and only one of the terms on the right in (B.15) is different from 0 and that this term is the sign of the permutation.
K. A basic representation result. Let $\mathbf{c}_1, \ldots, \mathbf{c}_n$ be some orthonormal basis in $\mathbb{R}^n$, that is, $\mathbf{c}_i' \mathbf{c}_j = \delta_{ij}$ (the Kronecker delta, which is 1 if $i = j$ and 0 otherwise). The $n \times n$ matrix $C = (\mathbf{c}_1, \ldots, \mathbf{c}_n)$ is then said to be orthogonal. (This is common usage even if orthonormal would be more appropriate.) In matrix algebraic terms the definition of orthogonality is just
$$C'C = I \qquad (B.16)$$
or
$$C' = C^{-1} .$$
It follows that also $C'$ is orthogonal; $CC' = I$. Observe that a finite product of orthogonal matrices is orthogonal. The determinant of an orthogonal matrix $C$ is $\pm 1$ since $\det(C)^2 = \det(C')\det(C) = \det(C'C) = \det(I) = 1$.
Viewing an orthogonal matrix $C$ as a linear function, we can say that it just rotates $\mathbb{R}^n$ since it preserves all distances:
$$\|C\mathbf{x} - C\mathbf{y}\|^2 = (\mathbf{x} - \mathbf{y})' C'C (\mathbf{x} - \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2 .$$
Lemma 2 If $A_{n \times n}$ is symmetric, then there exists a diagonal matrix $\Lambda_{n \times n} = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and an orthogonal matrix $C_{n \times n} = (\mathbf{c}_1, \ldots, \mathbf{c}_n)$ such that
$$A = C\Lambda C' = \sum_{j=1}^{n} \lambda_j \mathbf{c}_j \mathbf{c}_j' . \qquad (B.17)$$

For a matrix $A$ with the representation (B.17) we have $A\mathbf{c}_j = \lambda_j \mathbf{c}_j$ and $\mathbf{c}_j'A = \lambda_j \mathbf{c}_j'$, that is, the columns and rows of $C$ are right and left eigenvectors of $A$, respectively, and the $\lambda_j$ are the corresponding eigenvalues. It follows that the matrices $A - \lambda_j I$, $j = 1, \ldots, n$, must be singular, and so the eigenvalues must be the solutions to
$$\det(A - \lambda I) = 0 .$$
In other words, the eigenvalues are the roots of the polynomial of degree $n$ on the left hand side.
Observe also that, for $k = 1, 2, \ldots$,
$$A^k = C\Lambda^k C' = \sum_{j=1}^{n} \lambda_j^k \mathbf{c}_j \mathbf{c}_j' , \qquad (B.18)$$
which is useful for computational purposes.
Finally, note that
$$\det(A) = \det(C)\det(\Lambda)\det(C') = \det(\Lambda) = \lambda_1 \cdots \lambda_n ,$$
hence $A$ is invertible if and only if all eigenvalues are different from 0.
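A numerical illustration of Lemma 2 and of the power formula (B.18), added here as a sketch with ad hoc data (numpy's eigh returns exactly a decomposition of the form (B.17) for a symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                                  # a symmetric matrix

lam, C = np.linalg.eigh(A)                         # A = C diag(lam) C', with C orthogonal
assert np.allclose(C @ np.diag(lam) @ C.T, A)      # representation (B.17)
assert np.allclose(C @ np.diag(lam ** 3) @ C.T,
                   np.linalg.matrix_power(A, 3))   # powers via (B.18)
assert np.isclose(np.prod(lam), np.linalg.det(A))  # det(A) = product of the eigenvalues
```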
L. Quadratic forms. Let $A$ be an $n \times n$ matrix and $\mathbf{x}$ an $n$-vector, and define
$$q(\mathbf{x}) = \mathbf{x}'A\mathbf{x} = \sum_{i,j} a_{ij} x_i x_j .$$
For fixed $A$ this is a quadratic function of (the entries of) $\mathbf{x}$, and as such it is called a quadratic form. Without loss of generality $A$ can be taken to be symmetric, and we shall henceforth assume it is. Of course, $q(\mathbf{0}) = 0$. If $q(\mathbf{x}) \ge 0$ for all $\mathbf{x}$, then $q$ is said to be positive semidefinite (p.s.d.), and the same terminology goes also for the matrix $A$ itself. If, moreover, $q(\mathbf{x}) > 0$ for all $\mathbf{x} \neq \mathbf{0}$, then $q$ and $A$ are said to be positive definite (p.d.).
By the representation (B.17) it follows that $A$ is p.s.d. if and only if all $\lambda_i$ are non-negative. In that case there exists a symmetric $n \times n$ matrix $A^{1/2}$, called the square root of $A$, such that $A = A^{1/2}A^{1/2}$. This follows by using the orthogonality of $C$ to write
$$A = C\Lambda C' = C\Lambda^{1/2}C'\, C\Lambda^{1/2}C' ,$$
where $\Lambda^{1/2} = \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_n^{1/2})$.
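A short sketch of the square root construction, added for illustration with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.normal(size=(4, 4))
A = B @ B.T                                            # a p.s.d. matrix

lam, C = np.linalg.eigh(A)
A_half = C @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ C.T   # symmetric square root of A
assert np.allclose(A_half @ A_half, A)
```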


M. A more general representation result. The following result is an easy consequence of Lemma 2.
Lemma 3 If $A_{n \times n}$ is p.d. and $B_{n \times n}$ is symmetric, then there exists a non-singular matrix $E_{n \times n}$ such that
$$EAE' = I , \qquad EBE' = \mathrm{diag}(\kappa_1, \ldots, \kappa_n) , \qquad (B.19)$$
where the $\kappa_j$ are the solutions to the equation
$$\det(B - \kappa A) = 0 .$$
Proof: Represent $A$ as in (B.17), and then represent $\Lambda^{-1/2}C'BC\Lambda^{-1/2}$ as $DMD'$ with $M = \mathrm{diag}(\kappa_1, \ldots, \kappa_n)$ and $D$ orthogonal. One easily verifies that $E = D'\Lambda^{-1/2}C'$ and the $\kappa_j$ have the properties stated in the lemma. □
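The construction of E in the proof of Lemma 3 translates directly into code. The following sketch (added for illustration, with ad hoc data) follows that recipe and checks (B.19).

```python
import numpy as np

rng = np.random.default_rng(8)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)                    # a positive definite matrix
S = rng.normal(size=(4, 4))
B = (S + S.T) / 2                          # a symmetric matrix

lam, C = np.linalg.eigh(A)                 # A = C diag(lam) C'
W = np.diag(lam ** -0.5) @ C.T             # Lambda^{-1/2} C'
kappa, D = np.linalg.eigh(W @ B @ W.T)     # W B W' = D diag(kappa) D'
E = D.T @ W                                # the matrix E of Lemma 3

assert np.allclose(E @ A @ E.T, np.eye(4))         # E A E' = I
assert np.allclose(E @ B @ E.T, np.diag(kappa))    # E B E' diagonal, as in (B.19)
```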

Bibliography
A comprehensive introduction to matrix algebra is found in [3]. Recommended is also [28], which puts emphasis on representation theorems.

Appendix C

Concave and convex functions
A. Affine functions.
Let $I$ be an open interval in $\mathbb{R}$, finite or infinite. A function $\ell: I \to \mathbb{R}$ is said to be affine (constant plus linear) with intercept $h$ and slope $k$ if it is given by
$$\ell(y) = h + ky . \qquad (C.1)$$
Its graph is a straight line.


Let $u: I \to \mathbb{R}$ be some function. In the following it is tacitly understood that we consider only function arguments in $I$. An affine function $\ell$ coincides with $u$ at $x$ if and only if it is of the form
$$\ell(y) = u(x) + k\,(y - x) \qquad (C.2)$$
for some $k \in \mathbb{R}$. Such an $\ell$ is called an upper supporting line or upper tangent to $u$ at $x$ if its graph lies above that of $u$, that is,
$$u(y) \le u(x) + k\,(y - x) \qquad (C.3)$$
for all $y$. If the inequality in (C.3) is strict for $y \neq x$, then $\ell$ is called a strictly upper supporting line or strict upper tangent.
The unique affine $\ell$ coinciding with $u$ at two distinct points $x$ and $z$ is
$$\ell(y) = u(x) + k(x, z)(y - x) , \qquad (C.4)$$
with slope
$$k(x, z) = \frac{u(z) - u(x)}{z - x} .$$

B. Concave functions.
A function $u$ is said to be concave if any segment of the graph of $u$ lies above the straight line connecting its endpoints. More precisely, by (C.4),
$$u(y) \ge u(x) + k(x, z)(y - x) , \qquad x < y < z . \qquad (C.5)$$

An equivalent way of putting it is
$$u((1 - \lambda)x + \lambda z) \ge (1 - \lambda)u(x) + \lambda u(z) , \qquad x < z,\ \lambda \in (0, 1) . \qquad (C.6)$$
If the inequality in (C.5) (or (C.6)) is strict, then we say that $u$ is strictly concave.
A function $u$ is (strictly) convex if $-u$ is (strictly) concave.
Theorem 1. A concave function $u$ is continuous. Moreover, it is continuously differentiable almost everywhere, and its derivative $u'$ (where it exists) is decreasing. If $u$ is strictly concave, then $u'$ is strictly decreasing.
Proof: Some simple algebra, or just inspection of Figure C.1, leads to the following equivalent versions of the defining relationship (C.5):
$$k(x, y) \ge k(x, z) ; \qquad (C.7)$$
$$k(x, y) \ge k(y, z) ; \qquad (C.8)$$
$$k(x, z) \ge k(y, z) . \qquad (C.9)$$

[Figure C.1: Illustration of the inequalities (C.5)–(C.9). The graph of $u$ through the points $(x, u(x))$, $(y, u(y))$, $(z, u(z))$ with $x < y < z$, together with the chords connecting them.]


The relationships (C.7) and (C.9) say that $k$ is a decreasing function of both arguments. Now, fix $x$ and $y$ in (C.8) and let $z$ tend to $y$ from above. Then, since $k(y, z)$ increases and is bounded from above (by $k(x, y)$), we conclude that the right derivative
$$u'_+(y) = \lim_{z \searrow y} k(y, z)$$
exists and is finite. A similar argument shows that the left derivative
$$u'_-(y) = \lim_{x \nearrow y} k(x, y)$$
exists and is finite. From (C.8) it follows that
$$u'_- \ge u'_+ . \qquad (C.10)$$

We interpose that the asserted continuity of $u$ is now proved since the existence of the right (left) derivative implies continuity from the right (left).
The inequality (C.8) (or the figure) implies, furthermore, that
$$u'_+(x) \ge u'_-(z) , \qquad x < z . \qquad (C.11)$$
Combining this with (C.10), we conclude that $u'_-$ and $u'_+$ are decreasing functions. Now, a decreasing function has at most a countable number of discontinuities, so $u'_-$ and $u'_+$ are continuous almost everywhere. Let $x$ be a continuity point of $u'_-$. Then, by (C.10) and (C.11), $u'_-(x) \ge u'_+(x) \ge \lim_{z \searrow x} u'_-(z) = u'_-(x)$, hence $u'_-(x) = u'_+(x)$, showing that $u'(x)$ exists. (Similarly, $u'(x)$ exists if $x$ is a continuity point of $u'_+$. Thus, possible discontinuity points of the functions $u'_-$, $u'_+$, and $u'$ must coincide.)
Gathering the pieces, we have now proved the results stated for a concave $u$. The last statement about a strictly concave $u$ is easily added. □
The proof of the following result is left as an easy exercise to the diligent
reader.
Lemma 1. A function is (strictly) concave if and only if it possesses a (strict)
upper tangent at each point.
Corollary to Lemma 1. A twice continuously differentiable function is (strictly)
concave if and only if its second order derivative is (strictly) negative.
Proof: For a fixed $y_0$, Taylor's formula says that
$$u(y) = u(y_0) + u'(y_0)(y - y_0) + \tfrac{1}{2} u''(y^*)(y - y_0)^2 ,$$
where $y^*$ is some point between $y_0$ and $y$. Since the last term on the right is strictly negative for $y \neq y_0$, it follows that $\ell(y) = u(y_0) + u'(y_0)(y - y_0)$ is a strict upper tangent in $(y_0, u(y_0))$. Strict concavity follows by Lemma 1. □
C. Jensen & Co.
One of the first results we encounter in elementary probability is the relationship $\mathrm{V}[Y] = \mathrm{E}[Y^2] - \mathrm{E}^2[Y]$, from which it follows that $\mathrm{E}[Y^2] \ge \mathrm{E}^2[Y]$. This is a special case of a celebrated result, due to the Danish mathematician Jensen, which states that the inequality is true, not only for the square, but for all convex functions.
Theorem 2 (Jensen's inequality). The function $u$ is (strictly) concave if and only if
$$\mathrm{E}[u(Y)] \le\ (<)\ u(\mathrm{E}[Y]) \qquad (C.12)$$
for every non-degenerate r.v. $Y$ with finite mean and values (only) in $I$.


Proof: For the 'if' part, apply (C.12) to the simple random variable $Y$ with $P[Y = x] = 1 - \lambda$ and $P[Y = z] = \lambda$ to obtain the defining relation (C.6).
For the 'only if' part, suppose $u$ is concave and let $Y$ be an r.v. as specified in the theorem. By Lemma 1, $u$ possesses an upper tangent at $\mathrm{E}[Y]$, that is, there exists a $k$ such that
$$u(y) \le u(\mathrm{E}[Y]) + k\,(y - \mathrm{E}[Y]) .$$
Inserting $Y$ in the role of $y$ and forming expectations, we arrive at (C.12).
One easily adds to the result that strict concavity of $u$ is equivalent to strict inequality in (C.12). □
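As an informal illustration of Theorem 2 (not part of the original notes), the following Python sketch compares a Monte Carlo estimate of E[u(Y)] with u(E[Y]) for a strictly concave u and a non-degenerate Y; the chosen distribution is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(9)
Y = rng.gamma(shape=2.0, scale=1.5, size=100_000)   # a non-degenerate positive r.v.

u = np.log                                          # a strictly concave function on (0, inf)
lhs = u(Y).mean()                                   # Monte Carlo estimate of E[u(Y)]
rhs = u(Y.mean())                                   # u(E[Y]), with E[Y] estimated by the sample mean
assert lhs < rhs                                    # Jensen: E[u(Y)] < u(E[Y])
```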

[Figure C.2: Illustration of the Ohlin condition (C.14). Two distribution functions $F_1$ and $F_2$ crossing at the point $y_0$.]

Theorem 3 (Ohlin's lemma). Let $F_1$ and $F_2$ be distribution functions defined on $\mathbb{R}$. If their means are finite and coincide,
$$\int y\, dF_1(y) = \int y\, dF_2(y) , \qquad (C.13)$$
and if there exists a $y_0$ such that
$$F_1(y) \le F_2(y) ,\ y < y_0 , \qquad F_1(y) \ge F_2(y) ,\ y > y_0 , \qquad (C.14)$$
then
$$\int u(y)\, dF_1(y) \ge \int u(y)\, dF_2(y) , \qquad (C.15)$$
for each concave function $u$ such that these integrals are well defined. If $u$ is strictly concave and $F_1 \neq F_2$, then the inequality in (C.15) is strict.
Proof: The conditions imply that $F_1$ and $F_2$ place all of their masses on the interval $I$ where $u$ is defined. Consider first the case when $I = (-\infty, \infty)$ and the integrals in (C.15) are finite. In view of Lemma 1, let $\ell_0(y)$ be an upper tangent to $u$ at $y_0$ and introduce the difference
$$v(y) = \ell_0(y) - u(y) .$$
The function $v$ is non-negative, continuous, differentiable almost everywhere, and
$$dv(y) \le 0 ,\ y < y_0 , \qquad dv(y) \ge 0 ,\ y > y_0 . \qquad (C.16)$$
Using first (C.13) and the trivial fact that $\int dF_1(y) = \int dF_2(y)$, and then integrating by parts, we find
$$\int u(y)\, dF_1(y) - \int u(y)\, dF_2(y) = \int v(y)\, d(F_2 - F_1)(y)$$
$$= v(\infty)(F_2 - F_1)(\infty) - v(-\infty)(F_2 - F_1)(-\infty) \qquad (C.17)$$
$$\quad - \int (F_2 - F_1)(y)\, dv(y) . \qquad (C.18)$$

The first term in (C.17) vanishes since, for $z > y_0$,
$$0 \le v(z)\,(1 - F_i(z)) \le \int_z^{\infty} v(y)\, dF_i(y) \to 0$$
as $z \to \infty$. Similarly, the second term in (C.17) vanishes since, for $x < y_0$,
$$0 \le v(x)\, F_i(x) \le \int_{-\infty}^{x} v(y)\, dF_i(y) \to 0$$
as $x \to -\infty$. The term in (C.18) is non-negative due to (C.14) and (C.16), hence (C.15) holds true.
Finally, we consider the general case where the interval $I$ may be a genuine subinterval of $\mathbb{R}$ and/or the integrals in (C.15) may be infinite. Let $x < z$ be points in the interior of $I$, let $\ell_x$ and $\ell_z$ be upper tangents to $u$ at $x$ and $z$, respectively, and define
$$u_{x,z}(y) = \begin{cases} \ell_x(y) , & y \le x , \\ u(y) , & x < y < z , \\ \ell_z(y) , & y \ge z . \end{cases}$$
Obviously $u_{x,z}$ is a concave function defined on $\mathbb{R}$, $\int u_{x,z}(y)\, dF_i(y)$ is finite for $i = 1, 2$, and so the first part of the proof yields
$$\int u_{x,z}(y)\, dF_1(y) \ge \int u_{x,z}(y)\, dF_2(y) . \qquad (C.19)$$
Letting $x$ decrease towards the left endpoint of $I$ and $z$ increase towards the right endpoint of $I$, $u_{x,z}(y)$ decreases monotonically to $u(y)$ for each $y$, and so $\int u_{x,z}(y)\, dF_i(y) \to \int u(y)\, dF_i(y)$ for $i = 1, 2$ by monotone convergence. It follows that the inequality (C.19) carries over to the limits, yielding (C.15).
The last statement in the theorem follows by noting that, firstly, if $u$ is strictly concave, then the inequalities in (C.16) are strict, and, secondly, if $F_1 \neq F_2$, then (at least one of) the inequalities in (C.14) are strict on some non-degenerate interval due to the right-continuity of distribution functions. □
Corollary to Theorem 3. Let $X$ be a real r.v. assuming values in some open interval $I$, and let $g_i: I \to \mathbb{R}$, $i = 1, 2$, be increasing functions. If
$$\mathrm{E}[g_1(X)] = \mathrm{E}[g_2(X)] \qquad (C.20)$$
and is finite, and if there exists an $x_0$ such that
$$g_1(x) \ge g_2(x) ,\ x < x_0 , \qquad g_1(x) \le g_2(x) ,\ x > x_0 , \qquad (C.21)$$
then
$$\mathrm{E}[u(g_1(X))] \ge \mathrm{E}[u(g_2(X))] , \qquad (C.22)$$
for each concave function $u: I \to \mathbb{R}$ such that these expected values are well defined. If $u$ is strictly concave and $P[g_1(X) \neq g_2(X)] > 0$, then the inequality in (C.22) is strict.
Proof: The result follows by application of Theorem 3 to the cumulative distribution functions $F_i(y) = P[g_i(X) \le y]$, $i = 1, 2$. We need only to check the condition (C.14): For $y < g_2(x_0)$ we have
$$P[g_2(X) \le y] = P[g_1(X) \le y] + P[g_2(X) \le y < g_1(X)] , \qquad (C.23)$$
and, for $y > g_2(x_0)$, we have
$$P[g_2(X) > y] = P[g_1(X) > y] + P[g_2(X) > y \ge g_1(X)] . \qquad (C.24)$$
The final assertion in the corollary follows since $P[g_1(X) \neq g_2(X)] > 0$ implies that the last term must be strictly positive either in (C.23) or in (C.24) or in both for some $y$. □

Bibliography
[1]
[2] Andersen, P.K., Borgan, Ø., Gill, R.D., Keiding, N. Statistical Models Based on Counting Processes. Springer-Verlag, 1993.
[3] T.W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 1958.
[4] E. Arjas. The claims reserving problem in non-life insurance: some structural ideas. ASTIN Bull., pages 139–152, 1989.
[5] A.L. Bailey. A generalized theory of credibility. Proceedings of the Casualty Actuarial Society, pages 13–20, 1945.
[6] A.L. Bailey. Credibility procedures, Laplace's generalization of Bayes' rule, and the combination of collateral knowledge with observed data. Proceedings of the Casualty Actuarial Society, pages 7–23, 1950.
[7] Beard, R., Pentikäinen, T., Pesonen, E. Risk Theory. Chapman and Hall, 1984.
[8] S.K. Berberian. Introduction to Hilbert Space. Oxford University Press, 1961.
[9] J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer, 1985.
[10] H. Bühlmann. Experience rating and credibility. ASTIN Bull., pages 199–207, 1967.
[11] H. Bühlmann. Experience rating and credibility. ASTIN Bull., pages 157–165, 1969.
[12] H. Bühlmann. Mathematical Methods in Risk Theory. Springer-Verlag, 1970.
[13] Bühlmann, H., Straub, E. Glaubwürdigkeit für Schadensätze. Mitteil. Ver. Schweiz. Vers.math., pages 111–133, 1970.
[14] M. De Groot. Optimal Statistical Decisions. McGraw-Hill, 1970.
[15] F. De Vylder. Geometrical credibility. Scand. Actuarial J., 1976:121–149.
[16] F. De Vylder. Optimal semilinear credibility. Mitteil. Ver. Schweiz. Vers.math., pages 27–40, 1976.
[17] P. Diaconis and D. Ylvisaker. Conjugate priors for exponential families. Annals of Statistics, pages 269–281, 1979.
[18] T.S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, pages 209–230, 1972.
[19] H.U. Gerber. An Introduction to Mathematical Risk Theory. Huebner Foundation Monograph 8, R.D. Irwin, Homewood, Illinois, 1979.
[20] C. Hachemeister. Credibility for regression models with application to trend. Credibility: Theory and Applications (ed. P.M. Kahn), pages 129–163, 1975.
[21] J.A. Hartigan. Linear Bayesian methods. J. Royal Statist. Soc., pages 446–454, 1969.
[22] O. Hesselager. On the asymptotic distribution of weighted least squares estimators. Scand. Actuarial J., 1988:69–76.
[23] O. Hesselager. A Markov model for loss reserving. ASTIN Bull., pages 183–194, 1994.
[24] Hesselager, O., Witting, T. A credibility model with random fluctuations in delay probabilities for the prediction of IBNR claims. ASTIN Bull., pages 79–90, 1988.
[25] W.S. Jewell. Exact multidimensional credibility. Mitteil. Ver. Schweiz. Vers.math., pages 193–214, 1974.
[26] W.S. Jewell. Regularity conditions for exact credibility. ASTIN Bull., pages 336–341, 1974.
[27] W.S. Jewell. Predicting IBNYR events and delays I. Continuous time. ASTIN Bull., pages 25–55, 1989.
[28] Karlin, S., Taylor, H. A First Course in Stochastic Processes. Academic Press, second edition, 1975.
[29] J.E. Karlsson. The expected value of IBNR-claims. Scand. Actuarial J., 1976:108–110.
[30] Alan F. Karr. Point Processes and their Statistical Inference. Marcel Dekker, second edition, 1991.
[31] Maritz, J.S., Lwin, T. Empirical Bayes Methods. Chapman and Hall, 1989.
[32] W. Neuhaus. Choice of statistics in linear Bayes estimation. Scand. Actuarial J., 1985:1–26.
[33] W. Neuhaus. Inference about parameters in empirical linear Bayes estimation problems. Scand. Actuarial J., 1984:131–142.
[34] R. Norberg. A class of conjugate hierarchical priors for gammoid likelihoods. Scand. Actuarial J., 1989:177–193.
[35] R. Norberg. A contribution to modelling of IBNR claims. Scand. Actuarial J., 1986:155–203.
[36] R. Norberg. Credibility premium plans which make allowance for bonus hunger. Scand. Actuarial J., 1975:73–86.
[37] R. Norberg. A credibility theory for automobile bonus systems. Scand. Actuarial J., 1976:92–107.
[38] R. Norberg. Empirical Bayes credibility. Scand. Actuarial J., 1980:177–194.
[39] R. Norberg. Experience rating in group life insurance. Scand. Actuarial J., 1989:194–224.
[40] R. Norberg. Hierarchical credibility: analysis of a random effect linear model with nested classification. Scand. Actuarial J., 1986:204–222.
[41] R. Norberg. A note on experience rating of large group life insurance contracts. Mitteil. Ver. Schweiz. Vers.math., pages 17–34, 1987.
[42] R. Norberg. Linear estimation and credibility in continuous time. ASTIN Bull., pages 149–165, 1992.
[43] R. Norberg. Prediction of outstanding liabilities in non-life insurance. ASTIN Bull., pages 95–115, 1993.
[44] H. Robbins. The empirical Bayes approach to statistical problems. Ann. Math. Statist., pages 1–20, 1964.
[45] H.L. Royden. Real Analysis. Macmillan, New York, 1963.
[46] B. Sundt. On greatest accuracy credibility with limited fluctuation. Scand. Actuarial J., 109–119.
[47] B. Sundt. Recursive credibility estimation. Scand. Actuarial J., 1981:3–21.
[48] B. Sundt. An Introduction to Non-Life Insurance Mathematics. Verlag Versicherungswissenschaft e.V., Karlsruhe, third edition, 1993.
[49] S. Wind. An empirical Bayes approach to multiple linear regression. Annals of Statistics, 1973:93–103.
