Contents

1 Bayes methods  1
   1.1 Decisions under uncertainty  1
   1.2 Estimation by square loss  8
   1.3 Hilbert space methods  12

6 Bonus systems  42
   6.1 Basic definitions  42
   6.2 Optimal design of bonus systems  45

7 Claims reserving  52
   7.1 Introduction  52
   7.2 The claims process  54
   7.3 Applications  61
   7.4 Prediction of outstanding claims  65
   7.5 Predicting the outstanding part of reported claims  68
   7.6 Parameter estimation  72

8 Utility Theory  77
   8.1 The expected utility hypothesis  77
   8.2 The zero utility premium  82
   8.3 Optimal insurance  83
   8.4  84
   8.5  86

A Hilbert spaces  90
   A.1 Metric spaces  90
   A.2 Vector spaces  92
   A.3 Hilbert spaces  94
   A.4 Special Hilbert spaces  98

B Matrix algebra  100
Chapter 1
Bayes methods
1.1 Decisions under uncertainty
value that summarizes its loss performance, thus introducing an order relation that would enable us to tell which is the better of any two decisions and, in particular, find a best decision. More precisely, we need to define a suitable real-valued functional on the space of functions {L(d, ·); d ∈ D}. We shall consider two candidate criteria, the maximum loss and the average loss, and anticipate that the latter will be chosen to carry the theory further.
B. The minimax principle. One possibility is to judge a decision d by its maximum loss, sup_{θ∈T} L(d, θ). The best decision by this criterion, if it exists, is accordingly called the minimax decision. There are situations where the minimax principle does not deliver a solution, e.g. when L(d, ·) is unbounded for all d.
The maximum loss criterion judges a decision by its loss in the worst case
only and does not pay any attention to its performance against other possible
cases. It reflects a pessimistic prepare-for-the-worst attitude on the part of the
decision maker.
C. The Bayes principle. Another criterion, which in a way expresses a more nuanced view, is the so-called Bayes criterion, which measures the performance of a decision by a weighted average of its associated losses. More precisely, we place a measure G on T (or rather on a suitable sigma-algebra in T, which we need not visualize here), and define the risk of d against G as ∫_T L(d, θ) dG(θ). It is convenient to introduce the density g of G with respect to some basic measure μ and write the risk as

ρ_g(d) = ∫_T L(d, θ) g(θ) dμ(θ),   (1.1)

thus interpreting g(θ) as the weight attached to the state θ in the weighted sum of losses.

The Bayes risk against g is defined as

ρ_g = inf_{d∈D} ρ_g(d).   (1.2)
If the prior measure is finite, we can always redefine the loss and the prior so as to make the latter a probability distribution.
distribution. This observation reflects the fact that both the loss function and
the prior are subjective elements in the set-up, and they cannot always be
separated from one another. For the Bayes principle to work, the main thing is
that the risk (1.1) should be finite for at least one decision.
We shall illustrate the concepts with some examples.
D. Testing hypotheses. Suppose we are interested only in deciding whether the true state of nature is in a subset H_0 ⊂ T or not. The notation H_0 suggests that this is the null hypothesis we want to test against the alternative H_1 = T \ H_0, but such notions will be meaningful only when there is some empirical evidence present. We shall come back to that later on.

There are two decisions, D = {d_0, d_1}, where d_0 means accept H_0 and d_1 means accept H_1 (and reject H_0). Let the loss function be the simple one that expresses our yes-or-no attitude to the situation: L(d_i, θ) = a_i 1_{T\H_i}(θ), i = 0, 1, where the a_i are strictly positive.
The risk of decision d_i is ρ_g(d_i) = a_i (1 − P_g[θ ∈ H_i]), and so the Bayes decision against g is

d_g = d_i  if  P_g[θ ∈ H_i] ≥ a_i/(a_0 + a_1),  i = 0, 1.   (1.3)

Equivalently, writing g_i = P_g[θ ∈ H_i],

d_g = d_0  if  g_0/g_1 ≥ a_0/a_1,   (1.4)

and d_g = d_1 otherwise.
E. Point estimation. Just to ease notation, assume T ⊂ R, and suppose we want to estimate θ. The natural set of decisions is D = T. As loss function we could reasonably take L(d, θ) = ℓ(d − θ) for some function ℓ: R → R_+ that is convex and assumes its minimum (0, say) at 0.

A much used candidate is the square loss, L(d, θ) = (d − θ)². The risk of a decision d is

ρ_g(d) = E(d − θ)² = ∫_T (d − θ)² dG(θ),   (1.5)

which is minimized by d_g = E[θ], the corresponding Bayes risk being

ρ_g = V[θ].   (1.6)
If it exists, the Bayes decision function against g is the one that minimizes this overall risk. It will be denoted δ_g, and the minimum overall risk will be called the Bayes risk and will be denoted by ρ̂_g = ρ_g(δ_g).

In this set-up θ is considered as the outcome of a random variable with density g, and the likelihood f(x|θ) is considered as the conditional density of X, given θ. The joint density of (X, θ) is f(x|θ) g(θ), the marginal density of X is

f(x) = ∫_T f(x|θ) g(θ) dμ(θ),

and the conditional density of θ, given X = x, is

g(θ|x) = f(x|θ) g(θ) / f(x).   (1.9)
The prior density g expresses our judgement prior to observation. The conditional density g(θ|x) is called the posterior density since it represents our judgement post (after) observation.
Construction of the Bayes solution goes as follows. Insert the expression (1.7) into (1.8) and change the order of integration to obtain

ρ_g(δ) = ∫_X { ∫_T L(δ(x), θ) g(θ|x) dμ(θ) } f(x) dν(x).   (1.10)

This is minimized by minimizing the inner integral for each fixed x, that is, the Bayes decision function selects for each x the Bayes decision against the posterior g(·|x):

δ_g(x) = d_{g(·|x)}.   (1.11)
When the prior is taken as fixed, we shall not always display it in the notation.
G. Testing hypotheses (continued). We consider again the testing problem in Paragraph D above. In the presence of observations the prior distribution is just to be replaced with the posterior. In particular, in (1.4) we should replace g_i with g_i(x) = g_i f(x|θ_i)/f(x) and obtain that the optimal decision rule is given by

δ(x) = d_0  if  f(x|θ_0)/f(x|θ_1) ≥ (g_1 a_0)/(g_0 a_1),

and δ(x) = d_1 otherwise. This is the well-known Neyman-Pearson result from classical test theory.
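To make the rule concrete, here is a minimal numerical sketch of the Bayes test for two simple hypotheses with a Gaussian kernel; the priors g0, g1, the losses a0, a1, the parameter values and the observed x are all illustrative assumptions, not values from the text.

import math

# Bayes test of H0: theta = theta0 against H1: theta = theta1 (hypothetical values).
theta0, theta1 = 0.0, 1.0
g0, g1 = 0.5, 0.5          # prior probabilities of the two hypotheses
a0, a1 = 1.0, 2.0          # losses for the two kinds of wrong decision
sigma = 1.0                # known standard deviation of the Gaussian kernel

def likelihood(x, theta):
    """Gaussian density f(x|theta) with known sigma."""
    return math.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_decision(x):
    """Accept H0 iff the likelihood ratio exceeds the critical value g1*a0/(g0*a1)."""
    lr = likelihood(x, theta0) / likelihood(x, theta1)
    return "d0 (accept H0)" if lr >= g1 * a0 / (g0 * a1) else "d1 (accept H1)"

for x in (-0.5, 0.5, 1.5):
    print(x, bayes_decision(x))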
H. Point estimation (continued). The problem of estimation by square loss, considered in Paragraph E above, has the following solution in the presence of observations: The Bayes estimator is θ̂ = θ̂(X) = E[θ|X], and the Bayes risk is ρ̂ = E[V[θ|X]].
Here the p_j are fixed, known numbers, whereas λ is an unknown parameter. Put X_· = Σ_j X_j and p_· = Σ_j p_j. The joint density (w.r.t. counting measure) of X = (X_1, . . . , X_n)′ is

f(x|λ) = ( ∏_{j=1}^n p_j^{x_j}/x_j! ) λ^{x_·} e^{−λ p_·},   (1.12)

x_j ∈ {0, 1, . . .}, j = 1, . . . , n, λ ∈ (0, ∞).
Consider the problem of estimating λ. A traditional solution would be the ML (maximum likelihood) estimator,

λ̂ = X_·/p_·.   (1.13)

Now adopt a Bayes point of view and take as prior the gamma density

g(λ) = (β^α/Γ(α)) λ^{α−1} e^{−βλ},  λ > 0.   (1.14)
The posterior density is then again gamma,

g(λ|x) ∝ λ^{X_· + α − 1} e^{−(p_· + β)λ},   (1.15)

with shape parameter X_· + α and inverse scale p_· + β. Hence the Bayes estimator and the posterior variance are

E[λ|X] = (X_· + α)/(p_· + β),   (1.16)

V(λ|X) = (X_· + α)/(p_· + β)²,   (1.17)

and the Bayes risk is

ρ̂ = E[V(λ|X)] = α/(β(p_· + β)).   (1.18)
By the same pattern, let X_1, . . . , X_n be conditionally i.i.d. Bernoulli variables with success probability θ, so that the joint density is

f(x|θ) = ∏_{j=1}^n θ^{x_j} (1 − θ)^{1−x_j} = θ^{x_·} (1 − θ)^{n−x_·},  x_j ∈ {0, 1}.   (1.19)

The ML estimator of θ is

θ̂ = X_·/n,   (1.20)

and the natural conjugate prior is the beta density

g(θ) = (Γ(α + β)/(Γ(α)Γ(β))) θ^{α−1} (1 − θ)^{β−1},  θ ∈ (0, 1).   (1.21)
One extends the induced class of densities to its natural parameter space and checks whether closedness under sampling is preserved under the extension; if yes, the result is what we call the natural conjugate class of priors for F.
1.2 Estimation by square loss

Recall the general rules of iterated expectation and variance decomposition,

E[m] = E E[m|x],   (1.22)

V[m] = V E[m|x] + E V[m|x],   (1.23)

which will be applied repeatedly in what follows.
where Ass = (aik ) is some p.d.s. non-random matrix. Viewing the scalar
expression in (1.24) as a 1 1 matrix, which is the same as its trace, we can
rewrite it as
L(m,
m) = tr(A(m m)(m
m)
0) .
(1.25)
The overall risk of the estimator is the (generalized) mean square error
(MSE),
X
(m)
= E[(m m)
0 A(m m)]
=
aik E[(mi m
i )(mk m
k )] . (1.26)
i,k
By (1.25) the risk can be cast as

ρ(m̂) = tr(A R(m̂)),   (1.27)

where

R(m̂) = E[(m̂ − m)(m̂ − m)′]   (1.28)

is the risk matrix of m̂. Writing m̂ − m = (E[m|x] − m) + (m̂ − E[m|x]) and expanding, we find

R(m̂) = E V[m|x] + E[(E[m|x] − m)(m̂ − E[m|x])′ + (m̂ − E[m|x])(E[m|x] − m)′] + E[(m̂ − E[m|x])(m̂ − E[m|x])′].   (1.29)

Forming iterated expectation E[·] = E E[·|x], we find that the second term on the right of (1.29) vanishes. Thus, m̂ appears only in the last term, which obviously is minimized and becomes 0 by taking m̂ defined by

m̂ = E[m|x].   (1.30)
Then, what remains is the first term, which is the minimum risk. By iterated expectation and by virtue of (1.27) it can be cast as

ρ̂ = tr(A R̂),   (1.31)

where

R̂ = E V[m|x].   (1.32)

We call m̂, ρ̂, and R̂ the Bayes estimator, the Bayes risk, and the Bayes risk matrix, respectively, and together they constitute the (unrestricted) Bayes solution to the estimation problem.

The solution is meaningful also if we have no observations, and it then reduces to m̂ = E[m] and R̂ = V[m], say.

By the general rule (1.23) we have V[m] = V E[m|x] + E V[m|x], hence (1.32) can be rewritten as

R̂ = V[m] − V[m̂].
We see that the Bayes risk matrix is the risk matrix by no observations less the
variance of the Bayes estimator.
It is noteworthy that the weight matrix A does not appear in the Bayes estimator and the risk matrix. The weighting affects only the Bayes risk given by (1.31).
D. Linear estimators. Consider the class M̃ of estimators of the inhomogeneous linear form

m̂ = g + G x,   (1.33)

where the s × 1 vector g = (g_10, . . . , g_s0)′ and the s × n matrix G = (g_kℓ) are non-random. The risk (1.26) of such an estimator is

ρ(m̂) = Σ_{i,k} a_ik E[(m_i − g_i0 − Σ_{ℓ=1}^n g_iℓ x_ℓ)(m_k − g_k0 − Σ_{ℓ=1}^n g_kℓ x_ℓ)].   (1.34)
This is a positive definite quadratic form in the coefficients g_ij, which is easily minimized. First, form the derivatives (recall that A is symmetric so that a_ij = a_ji)

∂ρ(m̂)/∂g_i0 = 2 Σ_k a_ik E[(−1)(m_k − g_k0 − Σ_{ℓ=1}^n g_kℓ x_ℓ)],  i = 1, . . . , s,

∂ρ(m̂)/∂g_ij = 2 Σ_k a_ik E[(−x_j)(m_k − g_k0 − Σ_{ℓ=1}^n g_kℓ x_ℓ)],  i = 1, . . . , s, j = 1, . . . , n.

Setting them all equal to zero and solving, we arrive at

g = E[m] − G E[x],  G = C[m, x′] V[x]^{−1}.   (1.35)

The corresponding minimum risk, the LB (linear Bayes) risk, is

ρ̃ = tr(A R̃),   (1.36)

where

R̃ = V[m] − C[m, x′] V[x]^{−1} C[x, m′]   (1.37)

is the LB risk matrix.
We see that the LB risk matrix is the risk matrix by no observations less the
variance of the linear Bayes estimator, confer the remark at the end of Paragraph
B above.
The LB solution depends on the joint distribution of m and x only through
their first and second order unconditional moments. The LB estimator is the
sum of the prior Bayes estimate E[m] based on no observations and an adjustment term which depends on the deviation of the observations from their mean.
The magnitude of the adjustment depends on the variation and covariation of
the estimand and the observations: the stronger the covariation, the greater
the adjustment; the larger the variance of the observations, the smaller the
adjustment.
We observe again that the weight matrix A does not appear in the LB
estimator and the LB risk matrix, but it appears in the LB risk (1.36).
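The formulas (1.33) and (1.35)-(1.37) translate directly into a few lines of linear algebra. The sketch below assumes the joint first and second order moments of (m, x) are given; all numbers are invented for the illustration.

import numpy as np

# Moments of the estimand m (s-vector) and the observation x (n-vector); hypothetical values.
Em = np.array([1.0, 2.0])                       # E[m]
Ex = np.array([1.0, 1.0, 1.0])                  # E[x]
Vm = np.array([[0.5, 0.1], [0.1, 0.4]])         # V[m]
Vx = np.array([[1.0, 0.2, 0.0],
               [0.2, 1.0, 0.2],
               [0.0, 0.2, 1.0]])                # V[x]
Cmx = np.array([[0.3, 0.2, 0.1],
                [0.1, 0.2, 0.3]])               # C[m, x']

G = Cmx @ np.linalg.inv(Vx)                     # (1.35): G = C[m,x'] V[x]^{-1}
g = Em - G @ Ex                                 # (1.35): g = E[m] - G E[x]
R_tilde = Vm - G @ Cmx.T                        # (1.37): LB risk matrix

x_obs = np.array([1.5, 0.8, 1.2])               # an observed x (hypothetical)
m_tilde = g + G @ x_obs                         # the LB estimator (1.33)
print("LB estimate:", m_tilde)
print("LB risk with A = I:", np.trace(R_tilde)) # (1.36) with A = I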
Obviously, the estimator and the risk matrix are the basic entities of any Bayes solution. Thus, for the mere purpose of constructing m̂ or m̃, one could as well put A = I, whereby the risk reduces to Σ_{k=1}^s E(m_k − m̂_k)². Consequently, the Bayes estimator can be constructed componentwise, using ordinary quadratic loss for the scalar-valued estimands.
E. The regression model. Assume that x (n × 1) is square integrable with conditional first and second order moments of the form

E[x|θ] = Y b(θ),   (1.38)

V[x|θ] = V(θ),   (1.39)

with n and Y known and non-stochastic, and b(θ) (q × 1) and V(θ) (n × n) some functions of an unobservable r.v. θ. Introduce

β = E[b],  Λ = V[b],  Φ = E[V].   (1.40)
In particular, one is led to the GLS (generalized least squares) type statistic

b̂ = (Y′ Φ^{−1} Y)^{−1} Y′ Φ^{−1} x.   (1.49)
1.3 Hilbert space methods

The problem is now to minimize the risk ρ(m̂) = E(m − m̂)² as m̂ ranges in some set L of admitted estimators. If L is a closed subspace of L², then the optimal estimator, called the L-Bayes estimator, is the projection m_L = pro(m|L). This is the unique element m_L ∈ L satisfying the normal equations

E[(m − m_L) m̃] = 0,  m̃ ∈ L.
For a nested sequence of closed subspaces L_0 ⊂ L_1 ⊂ · · · ⊂ L_n of L², the successive risk reductions telescope:

Σ_{j=1}^n (‖m_{L_j}‖² − ‖m_{L_{j−1}}‖²) = Σ_{j=1}^n (ρ_{L_{j−1}} − ρ_{L_j}).
In particular, the M̃-Bayes estimator m̃ = m_{M̃} is determined by the normal equations

E[(m − m̃) m̂] = 0,  m̂ ∈ M̃,

and the M̃-Bayes risk is

ρ̃ = E[m²] − E[m̃²] = V[m] − V[m̃].
Let M be the closed linear subspace consisting of those r.v.-s in M̃ that are measurable w.r.t. some sub-sigma-algebra of σ(x) (i.e. use only part of the information carried by x). Then the iterated projection theorem applies. The reader should try to establish that M is closed. The optimal semilinear estimator m̃ = Σ_{j=1}^n f̃(x_j) is determined by the normal equations

E[(m − Σ_{j=1}^n f̃(x_j)) Σ_{j=1}^n f(x_j)] = 0

for all f such that f(x_1) ∈ L². By the conditional i.i.d. property, the expected value on the left is

n E[m f(x_1)] − n E[f̃(x_1) f(x_1)] − n(n − 1) E[f̃(x_2) f(x_1)],

hence the normal equations reduce to

E{ (E[m|x_1] − f̃(x_1) − (n − 1) E[f̃(x_2)|x_1]) f(x_1) } = 0.

Taking in particular

f(x_1) = E[m|x_1] − f̃(x_1) − (n − 1) E[f̃(x_2)|x_1],
Bibliography: [9], [14], [16], [15], [17], [20], [21], [25], [26], [32], [34], [42].
Chapter 2

Introduction

The observations X_1, . . . , X_I from the I units, with designs c_1, . . . , c_I and latent parameters θ_1, . . . , θ_I, have conditional joint density

∏_{i=1}^I f_{c_i}(x_i | θ_i),   (2.1)

which in the special case of a common parameter θ reduces to

∏_{i=1}^I f_{c_i}(x_i | θ).   (2.2)
The bulk of early empirical Bayes theory rests on the assumption that all c_i are equal to c, say, which will be referred to as the case with balanced design. A key point in the balanced design assumption is that the pairs (X_i, θ_i), i = 1, 2, . . ., are i.i.d., so that standard large sample theory based on the laws of large numbers and the central limit theorem can be invoked. In the unbalanced case with varying designs it is necessary to impose certain regularity conditions on the c_i to ensure that the statistical procedures possess the desired asymptotic properties.

When only one unit is under consideration, it is not necessary to drag along the subscript i. Therefore, to facilitate the presentation, the sequence (X_i, θ_i, c_i), i = 1, . . . , I, is extended with an unindexed current unit, (X, θ, c), which will frequently be in focus in what follows.
In the full model (i)-(ii) the joint p.d.f. of (X, θ) is

f_c(x|θ) g(θ).   (2.3)

The marginal p.d.f. of X is

f_c(x) = ∫_T f_c(x|θ) g(θ) dμ(θ),   (2.4)

and the conditional p.d.f. of θ, given X = x, is

g(θ|x) = f_c(x|θ) g(θ) / f_c(x).   (2.5)
The function f_c(·|θ) will be referred to as the kernel density in the following. It is often called the likelihood (function), but this term is unfortunate in the present context, where θ is the outcome of a random variable stemming from a distribution which has a frequency interpretation and can (in principle) be estimated. Then the marginal density of the observables X_i, given by (2.4), can appropriately be termed the likelihood since it forms the basis of statistical inferences.
The marginal p.d.f. g will be called the prior density in spite of its frequency
interpretation. This convention is in accordance with tradition. Accordingly,
the conditional p.d.f. in (2.5) is called the posterior density since it represents
the knowledge after observation of X = x (and c).
The obvious estimator based on data from risk i alone is θ̂_i = Σ_j x_ij / Σ_j p_ij, which is poor if Σ_j p_ij is small. The problem becomes acute when the insurer is faced with a new risk with no experience of its own (Σ_j p_ij = 0). The fixed effects model at stage (i) renders no possibility of assessing θ_i in a rational manner in the absence of observations. However, an insurer who wants to remain in business cannot refuse to quote a premium for such risks.
f(x_i|θ_i) = ∏_j θ_i^{x_ij} (1 − θ_i)^{1−x_ij} = θ_i^{x_i·} (1 − θ_i)^{n_i − x_i·},   (2.6)

with x_i· = Σ_j x_ij, the basic measure being the counting measure on {0, 1}^{n_i}. This completes the specification of item (i) in the general outline above.

The model establishes no relationship between the batches and the samples drawn from them. For each batch i the quality θ_i has to be assessed from X_i alone. The ML and UMVU estimator is

θ̂_i = Σ_{j=1}^{n_i} X_ij / n_i.   (2.7)
The conditional density of x, given θ = (b, v), is proportional (in (b, v)) to

v^{−n/2} exp( −(1/2v) |x − Yb|²_P ).   (2.8)

(Recall that |·|_P is the norm induced by the inner product ⟨x, y⟩_P = x′Py.) Consider the problem of estimating (b, v) by the ML method. For each fixed v > 0 the expression on the right of (2.8) is maximized at any b̂ solving

Y′P(x − Yb̂) = 0_{q×1},   (2.9)

and the corresponding estimator of v is

v̂ = (n − r)^{−1} |x − Yb̂|²_P,   (2.10)

where r = rank(Y).
2.2
A. The empirical Bayes decision problem. For the current unit a decision
is to be selected in a space D of possible decisions. Let a loss function L :
D T R be given. Only trivial modifications are required in the following if
L depends also on the design c.
The observations constitute the available information. The primary information is the observation (X, c) from the current unit, and the secondary information is the observations

(X, c)_I = {(X_i, c_i); i = 1, . . . , I}

from the collateral units i = 1, . . . , I. The decision problem consists in determining a decision function,

δ = δ{(X, c); (X, c)_I},   (2.1)

which for each possible outcome of the data selects a decision in D.
The collateral data (X, c)I , which were argued to be of relevance, have
dropped out of the analysis and do not appear in the solution (2.5). This is
so because g was held fixed (assumed known). In the full model it is not,
however, and this is where the collateral data come into play. The second step
in the two-stage procedure consists in estimating the Bayes decision in (2.5)
from the observations (X, c)I to obtain a genuine decision function depending
only on the available data. We shall briefly outline this part of the problem in
the next section.
Finally, the resulting decision rule ought to be assessed by the performance
criterion (2.2) to ascertain that the two-stage procedure serves the proclaimed
purpose. This problem shall not be treated in this short account of the theory,
and we refer to [38].
2.3

Here → signifies convergence as I → ∞, and →_p denotes convergence in probability (with respect to some appropriate metric in D). The feasibility of the EB or any restricted empirical Bayes procedure depends on the specification of the basic model entities F and G. The following paragraphs treat briefly the major cases, ordered by decreasing specificity of the families of distributions.
B. The parametric case. Assume that both F and G are parametric families, that is, T is a finite-dimensional Euclidean set, and

G = {g_α; α ∈ A},   (2.2)

with A some finite-dimensional parameter set. The marginal density of an observation with design c is then of known parametric form,

f_{α,c}(x) = ∫_T f_c(x|θ) g_α(θ) dμ(θ),   (2.3)

and the likelihood of the data is

Λ(α) = ∏_{i=1}^I f_{α,c_i}(X_i),   (2.5)

or its logarithm

log Λ(α) = Σ_{i=1}^I log f_{α,c_i}(X_i).   (2.6)

The ML estimator α̂ solves

(∂/∂a) log Λ(a)|_{a=α̂} = Σ_{i=1}^I (∂/∂a) log f_{a,c_i}(X_i)|_{a=α̂} = 0_{k×1}.   (2.7)

Under mild regularity conditions on the sequence (c_1, c_2, . . .), the ML estimator for α is asymptotically normally distributed with mean α and variance matrix

( − Σ_{i=1}^I (∂²/∂a∂a′) log f_{a,c_i}(X_i) |_{a=α̂} )^{−1}.   (2.8)
The variance matrix in (2.8) depends on the realized sequence (c1 , c2 , . . .), which
is thus decisive of the amount of information in the sample.
C. The semi- and non-parametric cases. When G and, possibly, also F
are nonparametric, the maximum likelihood procedure does not apply, and it
is usually impossible to arrange an unrestricted EB procedure. This fact is,
perhaps, the most important reason for studying restricted Bayes procedures,
like LB estimation: They typically depend on the underlying distributions only
through a finite set of parameter functions, e.g. certain first and second order
moments, which can be estimated even if the model is nonparametric and the
design is unbalanced. In this respect the LB approach is representative of a more
general methodology that consists in restricting the space of decision functions
sufficiently to obtain restricted Bayes solutions which can be reliably estimated
even if the model itself is of high complexity. We refer to Chapter 4 for an
account of parameter estimation in the empirical linear Bayes situation.
Bibliography: [5], [13], [31], [33], [38], [44], [49].
Chapter 3
Hierarchical credibility
Norberg, R. (1986). Hierarchical credibility: analysis of a random effect linear model with nested classification. Scand. Actuarial J. 1986, 204-222. Sections 1-3.
Chapter 4
Parameter estimation in credibility

4.1
As a fairly general framework for our discussion we take the well-known nonparametric linear regression model, which specifies only that the vector of observations x (n × 1) is square integrable with conditional first and second order moments of the form

E[x|θ] = Y b,   (4.1)

V[x|θ] = v P^{−1},   (4.2)

where b = b(θ) (q × 1) and v = v(θ) are functions of an unobservable random element θ, and Y (n × q) and P (n × n) are known and non-stochastic. The LB estimator of b based on x is, by (1.35),

b̃ = β + Λ Y′ (Y Λ Y′ + φ P^{−1})^{−1} (x − Y β),   (4.3)

where

β = E[b],  Λ = V[b],  φ = E[v].   (4.4)
where
An empirical LB estimator is obtained from (4.3) upon replacing the parameters in (4.4) by estimators based on independent replicates of the situation. Thus, suppose that, for each i = 1, . . . , I, we have observed (n_i, Y_i, P_i, x_i) associated with an unobservable θ_i, where x_i fits into the regression model (4.1)-(4.2) (with all entities equipped with subscript i) and the θ_i are i.i.d. selections from the distribution of the current θ.

Since nothing is assumed as to the shape of the distribution of θ (and possibly also the conditional distribution of x for fixed θ), inferences about the unconditional first and second order moments in (4.4) must be based on the observations and their cross products

x_i,  x_i x_i′,   (4.5)

or some summary linear functions of these. In the full rank case (rank(Y_i) = q) we may choose to base the estimation on the statistics

b̂_i = (Y_i′ P_i Y_i)^{−1} Y_i′ P_i x_i,   (4.6)

b̂_i b̂_i′,   (4.7)

v̂_i = (n_i − q)^{−1} |x_i − Y_i b̂_i|²_{P_i}.   (4.8)

Introduce

Ω = Λ + β β′,   (4.9)

and, instead of estimating the parameters in (4.4), consider the equivalent problem of estimating

(β, Ω, φ).   (4.10)

The point is that the empirical first and second order moments of the observations, which form the natural basis for estimation of the first and second order moments, have expectations that are linear in the components of (4.10):

E[x_i] = Y_i β,   (4.11)

E[x_i x_i′] = Y_i Ω Y_i′ + φ P_i^{−1},   (4.12)

E[b̂_i] = β,   (4.13)

E[b̂_i b̂_i′] = Ω + φ (Y_i′ P_i Y_i)^{−1},   (4.14)

E[v̂_i] = φ.   (4.15)
This way the situation is made accessible to linear estimation methods. The relations (4.11)-(4.12) or (4.13)-(4.15), whichever are chosen, can be written in compact form as

E[s_i] = A_i ψ,   (4.16)

V[s_i] = D_i,   (4.17)

where s_i is the vector of the chosen statistics, ψ is the vector of (distinct) unknown parameters, and A_i is a known design matrix. Stacking the units,

s = (s_1′, . . . , s_I′)′,  A = (A_1′, . . . , A_I′)′,  D = diag(D_1, . . . , D_I),   (4.18)

one gathers from (4.16) and (4.17) and the independence of the units

E[s] = A ψ,  V[s] = D.
If the variances D_i were known, then the best unbiased linear (in s) estimator of ψ would be the so-called GLS (generalized least squares) estimator,

ψ* = (A′ D^{−1} A)^{−1} A′ D^{−1} s = (Σ_i A_i′ D_i^{−1} A_i)^{−1} Σ_i A_i′ D_i^{−1} s_i.   (4.19)

Since the D_i are unknown, the GLS estimator is only an auxiliary construction. It motivates a class of estimators of the form

ψ̂_W = (Σ_i A_i′ W_i A_i)^{−1} Σ_i A_i′ W_i s_i,   (4.20)

with fixed weight matrices W_i. From (4.16)-(4.17) we immediately obtain that ψ̂_W is unbiased,

E[ψ̂_W] = ψ,   (4.21)

with variance matrix

V[ψ̂_W] = (Σ_i A_i′ W_i A_i)^{−1} (Σ_i A_i′ W_i D_i W_i A_i) (Σ_i A_i′ W_i A_i)^{−1}.   (4.22)

The minimum variance matrix, which is obtained with the optimal (luckiest possible choice of) weights W_i = D_i^{−1}, is

V[ψ*] = (Σ_i A_i′ D_i^{−1} A_i)^{−1}.   (4.23)

In practice one will insert hypothesized or preliminarily estimated parameter values in the D_i to obtain workable weights,

W_i = D_i*^{−1},   (4.24)

where D_i* denotes the variance matrix computed under the hypothesized values.
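The estimator (4.20) and its variance (4.22) amount to a few matrix sums. The sketch below assembles them with fabricated A_i, W_i, s_i, D_i, all invented for the illustration; it is a schematic sketch, not the estimator fitted to the data below.

import numpy as np

rng = np.random.default_rng(1)

# Fabricated ingredients for I = 5 units: design A_i, weight W_i, statistic s_i.
I, dim = 5, 3
A = [np.eye(dim) for _ in range(I)]
W = [np.diag(1.0 / (1.0 + rng.random(dim))) for _ in range(I)]
s = [rng.random(dim) for _ in range(I)]

M = sum(Ai.T @ Wi @ Ai for Ai, Wi in zip(A, W))
v = sum(Ai.T @ Wi @ si for Ai, Wi, si in zip(A, W, s))
psi_hat = np.linalg.solve(M, v)                 # (4.20)
print("weighted estimator:", psi_hat)

# Variance matrix (4.22), here pretending the true D_i are known:
D = [np.eye(dim) for _ in range(I)]
mid = sum(Ai.T @ Wi @ Di @ Wi @ Ai for Ai, Wi, Di in zip(A, W, D))
Minv = np.linalg.inv(M)
print("V[psi_hat]:\n", Minv @ mid @ Minv)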
Table 4.1: Group life insurance data. For each risk class i = 1, . . . , 72 is shown the exposure p_i and the number of deaths M_i.

 i      p_i   M_i |  i       p_i   M_i |  i       p_i   M_i
 1   3349.02   16 | 25   2080.38   10 | 49   4497.34   15
 2   1394.00    5 | 26  26762.37   49 | 50   3959.89   18
 3    479.81    0 | 27     24.98    0 | 51   5518.23   14
 4     23.98    0 | 28     16.46    0 | 52   2518.79   11
 5     11.30    0 | 29    210.12    0 | 53     86.21    2
 6   2273.55   11 | 30    274.28    3 | 54  19485.59   26
 7    179.85    0 | 31   4106.71   12 | 55      4.45    0
 8   3947.44   24 | 32   2463.90    5 | 56   3259.21    7
 9   4332.56    9 | 33   1224.26    6 | 57     24.44    0
10    109.84    1 | 34   8439.94   21 | 58   9037.35   20
11    647.77    2 | 35   2751.09   14 | 59     30.91    0
12   1308.36    1 | 36   2053.67   12 | 60     44.24    0
13    154.81    3 | 37   1648.87    3 | 61   1551.30    3
14   7525.86   13 | 38     99.74    0 | 62    113.49    0
15    428.34    2 | 39     28.47    0 | 63    147.44    0
16   1049.42    3 | 40    762.12    2 | 64    826.40    2
17   2394.95   10 | 41    742.22    4 | 65    775.19    1
18   2406.47    5 | 42    104.84    0 | 66    634.16    2
19    212.79    0 | 43   2857.05    5 | 67   2295.45    6
20   5334.82   15 | 44   1127.65    3 | 68    942.03    4
21   1235.84    5 | 45     96.64    0 | 69    232.31    2
22   6575.43   57 | 46    137.22    1 | 70     37.08    0
23    270.60    2 | 47     77.11    0 | 71    233.25    0
24   1137.39    4 | 48    650.88    0 | 72     39.52    0

4.2
Table 4.1, which is taken from Norberg (1989), contains summary data from an authentic portfolio of workmen's group life insurance policies. The portfolio is divided into I = 72 risk classes representing different occupational categories (mining, forestry, etc.). For each class i (= 1, . . . , I) there is a record specifying the total number of years exposed to risk of death, p_i, and the number of deaths, M_i, during the period of observation. The exposure p_i is defined precisely as ∫ p_i(t) dt, where p_i(t) is the number of individuals insured in class i at calendar time t, and the integral ranges over the observation period.

Presumably, due to differences with respect to age composition and occupation-specific mortality, the risk conditions vary among the classes. The problem is to set, for each individual class, an appropriate premium rate based on the present summary statistics, the point being that the scheme should be low-cost and not require current maintenance of individual records on all persons covered under the scheme.

Norberg (1989) proposes the following model for the situation. There is stochastic independence across risk classes, and each M_i has a Poisson distribution with parameter p_i θ_i. Here θ_i is the unobservable mortality rate per year at risk in class i.
For each k = 1, 2, . . . set

s_ik = M_i^{(k)}/p_i^k,   (4.25)

where M_i^{(k)} = M_i(M_i − 1)···(M_i − k + 1), so that E[s_ik | θ_i] = θ_i^k. The LB estimator of θ_i based on s_i1 is, confer (1.35),

θ̃_i = λ_1 + (C[θ_i, s_i1]/V[s_i1]) (s_i1 − λ_1).   (4.26)

Introducing

λ_k = E[θ_i^k]   (4.27)

and using the rule of iterated expectations together with (4.25), we find that the moments involved in (4.26) are E[θ_i] = λ_1 and

E[s_i1] = λ_1,  V[s_i1] = λ_2 − λ_1² + λ_1/p_i,  C[θ_i, s_i1] = λ_2 − λ_1².

Empirical linear Bayes estimators are obtained from (4.26) upon replacing the parameters λ_1 and λ_2 by estimators based on the data. Thus we consider the problem of estimating (λ_1, λ_2).   (4.28)

For this purpose we use the statistics s_i = (s_i1, . . . , s_i4)′, which by (4.25) satisfy E[s_i] = ψ with ψ = (λ_1, . . . , λ_4)′. In the present case A_i = I (the identity matrix) for all i, and the estimator ψ̂_W in (4.20) reduces to

ψ̂ = (Σ_i W_i)^{−1} Σ_i W_i s_i.   (4.29)
A natural estimator of

Λ = V[θ] = λ_2 − λ_1²

is

Λ̂ = λ̂_2 − (λ̂_1)².   (4.30)

To settle the choice of weights, we may hypothesize that θ is gamma distributed, θ ∼ Ga(γ, δ), in which case

λ_k = (γ + k − 1)^{(k)}/δ^k,   (4.31)

where (γ + k − 1)^{(k)} = (γ + k − 1)(γ + k − 2)···γ. The mean and the coefficient of variation are

E_{γδ}[θ] = γ/δ,  CV_{γδ}(θ) = √(V_{γδ}[θ])/E_{γδ}[θ] = 1/√γ.

Thus, guessing at the values of the mean and the coefficient of variation, we can solve for

γ* = 1/(CV(θ))²,  δ* = 1/(E[θ] (CV(θ))²),

and then calculate the entries of the D_i by the formula (4.31) with (γ*, δ*) in the place of (γ, δ).

In the present situation one could imagine e.g. that the mean should be some 0.003 and that the coefficient of variation could be some 0.5. This means that (λ_1, Λ) should be in the vicinity of (3 · 10⁻³, 2 · 10⁻⁶).
Table 4.2 displays point estimates (Λ̂, λ̂_1) obtained for various a priori specifications of the first two moments, (Λ*, λ*_1). The upper part of the table shows dependence of the estimates on Λ* for fixed λ*_1, the middle part shows dependence on λ*_1 for fixed Λ*, and the lower part shows dependence on the size of both a priori values.

An iterated estimate, obtained as the stationary values in repeated estimations, each time using the estimate from the previous round as new prior values, came out as (2.10 · 10⁻⁶, 3.32 · 10⁻³).

Finally, Table 4.3 shows mean squared errors for various choices of a priori values Λ* and λ*_1 when Λ = 2.08 · 10⁻⁶ and λ_1 = 3.26 · 10⁻³.
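The iteration just described is straightforward to script. The sketch below runs a simplified, scalar-weight version of the moment estimator (4.29) on the first two empirical moments s_i1, s_i2 for a handful of the risk classes in Table 4.1; it is a schematic illustration of the procedure, not a reproduction of the reported figures.

# Iterated moment estimation of (lambda1, Lambda) in the Poisson mixture model
# M_i ~ Po(p_i * theta_i); s_i1 = M_i/p_i, s_i2 = M_i(M_i - 1)/p_i**2, confer (4.25).
# Data: a subset of risk classes from Table 4.1.
p = [3349.02, 1394.00, 479.81, 2273.55, 3947.44, 4332.56, 7525.86, 5334.82, 6575.43]
M = [16, 5, 0, 11, 24, 9, 13, 15, 57]

lam1, Lam = 3e-3, 2e-6          # a priori guesses, as in the text
for _ in range(20):             # iterate until (practically) stationary
    w = [pi / (Lam * pi + lam1) for pi in p]        # weights ~ 1/V[s_i1]
    s1 = [Mi / pi for Mi, pi in zip(M, p)]
    s2 = [Mi * (Mi - 1) / pi**2 for Mi, pi in zip(M, p)]
    lam1 = sum(wi * si for wi, si in zip(w, s1)) / sum(w)
    lam2 = sum(wi * si for wi, si in zip(w, s2)) / sum(w)
    Lam = max(lam2 - lam1**2, 1e-12)                # (4.30), truncated at 0
print(f"lambda1 = {lam1:.4e}, Lambda = {Lam:.4e}")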
32
Parameter
estimates
106 1 103 1
***
***
2.03
3.32
1.89
3.35
2.21
3.39
2.46
3.41
1.45
3.39
-1.13
3.36
2.28
3.40
2.73
3.41
2.55
3.40
2.43
3.40
2.71
3.36
4.25
3.33
10.01
3.30
1.72
3.37
1.84
3.37
1.95
3.37
2.18
3.37
2.75
3.37
3.90
3.37
5.04
3.37
6.18
3.37
7.32
3.37
Table 4.3: Mean squared errors for various choices of Λ* and λ*_1 when Λ = 2.08 · 10⁻⁶ and λ_1 = 3.26 · 10⁻³. Each cell contains the pair (MSE(10⁶ Λ̂), MSE(10³ λ̂_1)).

              10³ λ*_1:
10⁶ Λ*:      2            3            3.26         4            16
  0.50   1.08 .078   1.13 .081   1.05 .086   1.05 .088   64.24 .111
  1.00   0.91 .079   0.89 .078   0.84 .080   0.86 .081   39.15 .098
  2.08   0.92 .084   0.82 .079   0.78 .078   0.80 .078   20.15 .087
  4.00   1.39 .095   0.92 .084   0.84 .081   0.86 .079   11.68 .081
 16.00   1.91 .150   5.13 .115   2.77 .100   2.40 .095    7.82 .079
4.3
Let {N_t}_{t≥0} be a Poisson process with intensity {λ_t}_{t≥0}, that is, the process has independent increments and N_t − N_s, s ≤ t, is Poisson distributed with parameter ∫_s^t λ_τ dτ = Λ_t − Λ_s, where

Λ_t = ∫_0^t λ_τ dτ.

The total claim amount up to time t is

S_t = Σ_{n=1}^{N_t} Y_n,

where the individual claim amounts Y_1, Y_2, . . . are i.i.d. with distribution G and independent of N. We have

S_t = Σ_{0<τ≤t} (S_τ − S_{τ−}) = ∫_0^t ∫ y N(dτ, dy),

where N(dτ, dy) is the number of claims in the time interval [τ, τ + dτ) with claim size in the interval [y, y + dy). This representation, here kept at an informal level, helps establish readily some useful results about moments of functionals of the compound Poisson process. Proceeding informally, N(dτ, dy) is 0 or 1 (for small dτ), it is independent of the past behaviour of the process in (0, τ), and it has expected value λ_τ G(dy) dτ. Using these properties together with the identity k ∫_0^t Λ_τ^{k−1} λ_τ dτ = Λ_t^k, k ≥ 1, we establish the following results:

E[S_t] = E[Y] Λ_t,

E[S_t²] = E[ Σ_{0<τ≤t} (S_τ² − S_{τ−}²) ]
  = E[ ∫_0^t ∫ ((S_{τ−} + y)² − S_{τ−}²) N(dτ, dy) ]
  = ∫_0^t ∫ E[2 S_τ y + y²] G(dy) λ_τ dτ
  = 2 E²[Y] ∫_0^t Λ_τ λ_τ dτ + E[Y²] ∫_0^t λ_τ dτ
  = E²[Y] Λ_t² + E[Y²] Λ_t,

E[S_t³] = E[ Σ_{0<τ≤t} (S_τ³ − S_{τ−}³) ]
  = E[ ∫_0^t ∫ ((S_{τ−} + y)³ − S_{τ−}³) N(dτ, dy) ]
  = ∫_0^t ∫ E[3 S_τ² y + 3 S_τ y² + y³] G(dy) λ_τ dτ
  = ∫_0^t ( 3 (E²[Y] Λ_τ² + E[Y²] Λ_τ) E[Y] + 3 E[Y] Λ_τ E[Y²] + E[Y³] ) λ_τ dτ
  = 3 E³[Y] ∫_0^t Λ_τ² λ_τ dτ + 6 E[Y] E[Y²] ∫_0^t Λ_τ λ_τ dτ + E[Y³] Λ_t
  = E³[Y] Λ_t³ + 3 E[Y] E[Y²] Λ_t² + E[Y³] Λ_t,

E[S_t⁴] = E[ Σ_{0<τ≤t} (S_τ⁴ − S_{τ−}⁴) ]
  = E[ ∫_0^t ∫ ((S_{τ−} + y)⁴ − S_{τ−}⁴) N(dτ, dy) ]
  = ∫_0^t ∫ E[4 S_τ³ y + 6 S_τ² y² + 4 S_τ y³ + y⁴] G(dy) λ_τ dτ
  = ∫_0^t ( 4 E[Y](E³[Y] Λ_τ³ + 3 E[Y] E[Y²] Λ_τ² + E[Y³] Λ_τ)
        + 6 E[Y²](E²[Y] Λ_τ² + E[Y²] Λ_τ) + 4 E[Y³] E[Y] Λ_τ + E[Y⁴] ) λ_τ dτ
  = E⁴[Y] Λ_t⁴ + 6 E²[Y] E[Y²] Λ_t³ + (4 E[Y] E[Y³] + 3 E²[Y²]) Λ_t² + E[Y⁴] Λ_t.
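The moment formulas can be checked by simulation. The sketch below simulates S_t for a homogeneous Poisson process (λ_τ = λ, so Λ_t = λt) with exponential claim sizes; the parameter values are chosen only for convenience of the illustration.

import numpy as np

rng = np.random.default_rng(0)
lam, t, mu = 2.0, 3.0, 1.5      # intensity, horizon, mean claim size (hypothetical)
Lambda = lam * t                # Lambda_t for constant intensity

n_sims = 100_000
N = rng.poisson(Lambda, size=n_sims)
S = np.array([rng.exponential(mu, size=n).sum() for n in N])

EY = [mu, 2 * mu**2, 6 * mu**3]              # E[Y], E[Y^2], E[Y^3] for Exp(mu)
m1 = EY[0] * Lambda
m2 = EY[0]**2 * Lambda**2 + EY[1] * Lambda
m3 = EY[0]**3 * Lambda**3 + 3 * EY[0] * EY[1] * Lambda**2 + EY[2] * Lambda

print("E[S_t]  :", S.mean(), "vs", m1)
print("E[S_t^2]:", (S**2).mean(), "vs", m2)
print("E[S_t^3]:", (S**3).mean(), "vs", m3)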
Suppose the insurer does not keep a continuous record of the claims, but only observes total claim amounts on an annual basis. Then the observable claims data are x_j = S_j − S_{j−1}, j = 1, 2, . . . Assume the intensity is of the multiplicative form

λ_t = p_t a(θ),

where p_t is a measure of the size of the risk at time t and a(θ) is the claim intensity per unit amount of risk. The parameter θ represents risk characteristics that are not observable. Also the distribution of the individual claim sizes may depend on these hidden risk characteristics, and we write G(·|θ). We assume that this distribution has moments of order 4 and denote its q-th noncentral moments by

m_q(θ) = ∫ y^q G(dy|θ).

It is just a matter of change of notation and simple algebra to obtain the following formulas from those above (with p_j = ∫_{j−1}^j p_t dt):

E_θ[x_j] = p_j m_1(θ) a(θ),
E_θ[x_j²] = p_j² m_1²(θ) a²(θ) + p_j m_2(θ) a(θ),
E_θ[x_j x_k] = p_j p_k m_1²(θ) a²(θ),  j ≠ k,
E_θ[x_j³] = p_j³ m_1³(θ) a³(θ) + 3 p_j² m_2(θ) m_1(θ) a²(θ) + p_j m_3(θ) a(θ),
E_θ[x_j² x_k] = p_j² p_k m_1³(θ) a³(θ) + p_j p_k m_1(θ) m_2(θ) a²(θ),  j ≠ k,

and similarly for the fourth order moments.
Uncertainty about the risk conditions is represented by a random risk characteristic θ, and we take the model described above as the conditional model, given θ. Introduce the parameters

β = E[m_1(θ) a(θ)],
λ_1 = E[m_2(θ) a(θ)],  λ_2 = E[m_1²(θ) a²(θ)],
ξ_1 = E[m_3(θ) a(θ)],  ξ_2 = E[m_2(θ) m_1(θ) a²(θ)],  ξ_3 = E[m_1³(θ) a³(θ)],
ζ_1 = E[m_4(θ) a(θ)],  ζ_2 = E[m_3(θ) m_1(θ) a²(θ)],  ζ_3 = E[m_2²(θ) a²(θ)],
ζ_4 = E[m_2(θ) m_1²(θ) a³(θ)],  ζ_5 = E[m_1⁴(θ) a⁴(θ)].

We easily find the following formulas for the unconditional moments of the x_j, where δ with several indices equals 1 if all the indicated indices coincide and 0 otherwise:

E[x_j] = p_j β,

E[x_j x_k] = p_j p_k λ_2 + δ_jk p_j λ_1,

E[x_j x_k x_l] = p_j p_k p_l ξ_3 + (δ_jk p_j p_l + δ_jl p_k p_l + δ_kl p_j p_k) ξ_2 + δ_jkl p_j ξ_1,

E[x_j x_k x_l x_m] = p_j p_k p_l p_m ζ_5
  + (δ_jk p_j p_l p_m + δ_jl p_j p_k p_m + δ_jm p_j p_k p_l + δ_kl p_k p_j p_m + δ_km p_k p_j p_l + δ_lm p_l p_j p_k) ζ_4
  + (δ_jk δ_lm p_j p_l + δ_jl δ_km p_j p_k + δ_jm δ_kl p_j p_k) ζ_3
  + (δ_jkl p_j p_m + δ_jkm p_j p_l + δ_jlm p_j p_k + δ_klm p_j p_k) ζ_2
  + δ_jklm p_j ζ_1.
Consider a portfolio of I independent risks each of which complies with the model above; to each risk i are associated observable exposures p_ij and total claim amounts x_ij in years j = 1, . . . , J_i, and an unobservable risk characteristic θ_i. The θ_i, i = 1, . . . , I, are independent selections from the same distribution. From the available data we are to estimate the interest parameters

ψ = (β, λ_1, λ_2)′,

which occur in the linear Bayes estimators of the latent risk premiums b_i = m_1(θ_i) a(θ_i) (expected claim amount per unit of risk exposure for risk i). It is convenient to work with the annual loss ratios,

b̂_ij = x_ij/p_ij,  j = 1, . . . , J_i,

which correspond one-to-one with the data x_ij. The natural estimator of b_i based solely on information from risk i is the loss ratio for the entire claims record of the risk,

b̂_i = Σ_{j=1}^{J_i} x_ij / Σ_{j=1}^{J_i} p_ij = Σ_{j=1}^{J_i} (p_ij/p_i·) b̂_ij,

where

p_i· = Σ_{j=1}^{J_i} p_ij.
The moment formulas above give E[b̂_ij] = β and E[b̂_ij b̂_ik] = λ_2 + δ_jk λ_1/p_ij, so that the design matrix corresponding to the loss ratios and their pairwise products is, schematically,

A_i = ( 1   0          0 ;
        0   δ_jk/p_ij  1 ).

For the variance matrices D_i we need mixed moments of orders three and four. Dividing the unconditional moment formulas by the appropriate products of exposures, straightforward calculations give

(1/p_i·) Σ_j p_ij C[b̂_ij, b̂_ik b̂_il]
  = (1/p_i·) Σ_j p_ij ( E[b̂_ij b̂_ik b̂_il] − E[b̂_ij] E[b̂_ik b̂_il] )
  = (1/p_i·) Σ_j p_ij ( ξ_3 + (δ_jk (1/p_ij) + δ_jl (1/p_ij) + δ_kl (1/p_ik)) ξ_2 + δ_jkl (1/p_ij²) ξ_1
        − β λ_2 − δ_kl (1/p_ik) β λ_1 )
  = ξ_3 + (2/p_i·) ξ_2 + δ_kl (1/p_ik) ξ_2 + δ_kl (1/(p_i· p_ik)) ξ_1 − β λ_2 − δ_kl (1/p_ik) β λ_1,

and

C[b̂_ij b̂_ik, b̂_il b̂_im]
  = E[b̂_ij b̂_ik b̂_il b̂_im] − E[b̂_ij b̂_ik] E[b̂_il b̂_im]
  = ζ_5 + ( δ_jk (1/p_ij) + δ_jl (1/p_ij) + δ_jm (1/p_ij) + δ_kl (1/p_ik) + δ_km (1/p_ik) + δ_lm (1/p_il) ) ζ_4
  + ( δ_jk δ_lm (1/(p_ij p_il)) + δ_jl δ_km (1/(p_ij p_ik)) + δ_jm δ_kl (1/(p_ij p_ik)) ) ζ_3
  + ( δ_jkl (1/p_ij²) + δ_jkm (1/p_ij²) + δ_jlm (1/p_ij²) + δ_klm (1/p_ik²) ) ζ_2
  + δ_jklm (1/p_ij³) ζ_1
  − λ_2² − δ_jk (1/p_ij) λ_1 λ_2 − δ_lm (1/p_il) λ_1 λ_2 − δ_jk δ_lm (1/(p_ij p_il)) λ_1².
Concerning the choice of the weights in (4.24): A computationally convenient candidate would be to take a(θ) ∼ Ga(γ, δ) (gamma distributed with shape parameter γ and inverse scale parameter δ) and the claim sizes Y identically distributed according to Ga(γ′, δ′), independent of θ. Under this hypothetical specification of the model we would have

E[a^q(θ)] = (γ + q − 1)^{(q)}/δ^q,  E[Y^q] = (γ′ + q − 1)^{(q)}/δ′^q,

where x^{(q)} = x(x − 1)···(x − q + 1).
Bibliography: [13], [33], [39].

Additional references:

Bunke, H. and Gladitz, J. (1974). Empirical linear Bayes decision rules for a sequence of linear models with different regressor matrices. Mathematische Operationsforschung und Statistik 5, 235-244.

Humak, K.M.S. (1984). Statistische Methoden der Modellbildung III. Akademie-Verlag, Berlin.

Norberg, R. (1982). On optimal parameter estimation in credibility. Insurance: Mathematics & Economics 1, 73-89.
Chapter 5

The ratio

G(dy|θ) = μ(y) dA(y) / ∫ μ(y′) dA(y′)   (5.3)

is the conditional probability, given there is a death, that the age of the deceased is in [y, y + dy). Thus, G(·|θ) so defined is the distribution of the age at death for members of the group. Note that it is independent of time under the present assumptions of permanent risk conditions θ.
For a large group, where the death or survival of any single individual does not affect significantly the size and the composition of the group as a whole, we may adopt the compound Poisson process as an approximation model. In accordance with the considerations above, we assume that instances of death occur with intensity p(t) a(θ) at time t, where

a(θ) = ∫_0^∞ μ(y) dA(y),

and that the ages at death are independent replicates of a random variable Y with distribution G given by (5.3) and independent of the numbers of deaths.

Suppose the sum insured depends on the age y at death, and call it S(y). Then, under our assumptions, the process of claims is a compound Poisson process with claims intensity p(t) a(θ) and individual claim size distribution G∘S^{−1}(s) = P[S(Y) ≤ s]. The total claims in year j (that is, the time period [j − 1, j)) is

X_j = Σ_{k=1}^{N_j} S(Y_jk),

where N_j ∼ Po(p_j a(θ)) with p_j = ∫_{j−1}^j p(t) dt, and the Y_jk are independent selections from G. It follows that the loss ratios

b̂_j = X_j/p_j

have expected values E[b̂_j|θ] = b(θ), where

b(θ) = a(θ) ∫ S(y) dG(y|θ),   (5.4)

and variances Var[b̂_j|θ] = v(θ)/p_j, where

v(θ) = a(θ) ∫ S²(y) dG(y|θ).   (5.5)
The risk conditions may vary among groups. We adopt the usual heterogeneity model of credibility theory and take θ to be the outcome of a random variable whose distribution represents the collective of groups from which the current one is selected at random.

The hierarchical extension goes in the usual way. We shall consider hierarchical credibility analyses of 1125 authentic group life contracts that are classified into occupational classes in a hierarchical manner. There are H = 72 different occupations, and within occupational class h there are I_h individual groups. To each group (h, i) (group No. i in class h) there are observed exposures p_hij and loss ratios b̂_hij in years j = 1, . . . , J_hi, say. These are assumed to obey a hierarchical model with conditional Poisson distributions as above.

Bibliography

Norberg, R. (1987). A note on experience rating of large group life contracts. Mitteil. Verein. Schweiz. Vers.math. 87, 17-34.
Chapter 6
Bonus systems
6.1 Basic definitions
Consider a policy whose claims occur according to a counting process {N(t)}_{t≥0}, the i-th claim amount being Y_i, so that the total claim amount in [0, t] is

X(t) = Σ_{i=1}^{N(t)} Y_i.
For the time being we make no specific assumptions concerning the distribution
of X and , apart from assuming that it is unaffected by the premium system
(absence of bonus hunger).
Coverage and premium payments are on an annual basis (say). Introduce,
for each year j = 1, 2, . . ., the number of claims
Nj = N (j) N (j 1)
and the total claim amount
Xj = X(j) X(j 1) .
The annual premium, which is payable in advance, is currently adjusted and
is made dependent on the past claims experience of the policy. The purpose is
to charge, in each year j, an annual premium that approximates the unknown
individual risk premium,

m_j(θ) = E[X_j | θ].   (6.1)
B. What is bonus? This word is Latin and means good. In the context of
insurance it is used for various forms of dividends that are paid to the policyholders if the insurance scheme has created a systematic surplus. In automobile
insurance it denotes the premium deductible earned on an individual basis by
drivers who report few claims. A widely used class of schemes for calculating
such bonuses are the so-called automobile bonus systems. They take various
forms, but we will only treat what we may call standard bonus systems, which
are commonly used in practice.
C. An example: The Norwegian bonus system. As a typical example of
an automobile bonus system we describe the one used by Norwegian automobile
insurance companies in the 1970s:
Each new policy is charged with the same initial premium. Thereafter the
premium is adjusted annually as follows: After a claim-free year the premium is
reduced by 10% of the initial premium, and after a year with claims the premium
is increased by 20% of the initial premium for each claim, but the premium is
kept within the range from 30% (elite bonus) to 150%. There is one exception
from the general pattern: From the premium levels 140% and 150% the policy
advances directly to 120% after a claim-free year.
In Figure 6.1 the unbroken line shows the development of the premium for
a policyholder who files three claims during the period considered, one in each
of the first, third, and sixth year. The broken line shows the same for another
policyholder, who also files three claims, only at later and, obviously, more
favourable points of time. The area between the curves represents the difference
between the total premiums paid by the two policyholders. This is a considerable
amount of money, and to the extent that the times of occurrence are purely
random and only the total number of claims matters, it must be concluded that
there is a substantial element of randomness in the premium charged under the
bonus system. In particular, it is clear that random fluctuation of the premium
will prevail no matter how long the policy has been in force.
D. Formal definition of bonus systems. By a standard bonus system we
will mean an individual experience rating system with the following features:
The policies covered under the scheme are divided into a finite number of
bonus classes numbered from 1 to K, say. A policy stays in one and the same
class throughout a year. There is an initial class into which each new policy
is placed in its first year. Thereafter the policy is reclassified annually in accordance with transition rules that determine the bonus class in any year as a
function of the bonus class and the number of claims in the previous year. These
rules are independent of the age of the policy, and we denote by Tk` the set of
numbers of claims that will carry a policy from class k to class `. The annual
premium is the same for each policy in one and the same class, regardless of the age of the policy. The class k premium is denoted by π(k), and the vector π = (π(1), . . . , π(K)) is called the bonus scale.
Figure 6.1: The Norwegian bonus system 1975. Two individual premium paths corresponding to two different claims histories. Occurrences of claims are indicated with bullets (•).

For the transition rules to be consistent, we must require that ∪_{ℓ=1}^K T_{kℓ} = {0, 1, 2, . . .} and that T_{kℓ} ∩ T_{kℓ′} = ∅ for ℓ ≠ ℓ′. The transition rules can suitably be represented as a K × K matrix T = (T_{kℓ}). The pair

R = (ι, T),   (6.2)

where ι denotes the initial class, will be referred to as the bonus rules.
Obviously, the year j class of our generic policy depends on its past claim numbers in a way that is determined by the rules R, and we denote it by Z_{R,j}, j = 1, 2, . . . More specifically,

Z_{R,1} = ι,   (6.4)

and, for j = 2, 3, . . .,

Z_{R,j} = ℓ  if  Z_{R,j−1} = k  and  N_{j−1} ∈ T_{kℓ}.   (6.5)
risk characteristics of each individual policy on the basis of its claims experience and to currently adjust its premium accordingly. Thus, a bonus system should serve to separate good risks from bad risks by placing them in different classes. On intuitive grounds this is best achieved by progressively promoting policies with few claims to low-premium classes and demoting policies with many claims to high-premium classes. A bonus system is usually designed such that π(k) > π(k + 1), and one speaks of class k + 1 as a better class than class k, class K being the top class. Typically, n > n′ if n ∈ T_{kℓ} and n′ ∈ T_{k,ℓ+1}.
In terms of the formal definitions, the Norwegian bonus system described in Paragraph C above has K = 13 classes, the initial class is ι = 6, the premium scale is given by

π(k) = π(6)(1 + (6 − k)/10),   (6.6)

and the transition rules T are given by

T_{k,k+1} = {0} for k = 3, . . . , 12,  T_{13,13} = {0},  T_{1,4} = T_{2,4} = {0};
T_{k,k−2n} = {n} for n ≥ 1 with k − 2n ≥ 2;
T_{k,1} = {n; k − 2n ≤ 1};
all other T_{kℓ} = ∅.

(After a claim-free year the policy thus advances one class, except from classes 1 and 2, premium levels 150% and 140%, from which it advances directly to class 4, premium level 120%; each claim sets the policy back two classes, but never below class 1.)
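The reclassification map implied by T is easily mechanized. The sketch below encodes it together with the scale (6.6), with the initial premium normalized to 100; the claims history is the one of the unbroken path in Figure 6.1 (claims in years 1, 3 and 6).

K, INITIAL = 13, 6

def premium(k, base=100.0):
    """Premium scale (6.6): pi(k) = pi(6) * (1 + (6 - k)/10)."""
    return base * (1 + (6 - k) / 10)

def next_class(k, n):
    """Class next year, given class k and n claims this year (the rules T)."""
    if n == 0:
        return 4 if k in (1, 2) else min(k + 1, K)   # 140%/150% jump to 120%
    return max(k - 2 * n, 1)

k, claims = INITIAL, {1: 1, 3: 1, 6: 1}
for year in range(1, 16):
    print(year, k, premium(k))
    k = next_class(k, claims.get(year, 0))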
6.2 Optimal design of bonus systems

A. Bayes estimation. Consider estimating a square integrable random variable m by a function m̂ = m̂(X) of an observation X that minimizes the mean squared error (MSE)

ρ(m̂) = E(m̂ − m)².

The Bayes estimator in the class M̂_X of all such functions is

m̂_X = E[m | X],   (6.1)

and the corresponding minimum risk is

ρ̂_X = E V[m | X]   (6.2)
     = V[m] − V[m̂_X]   (6.3)
     = E[m²] − E[m̂_X²].   (6.4)
If Z is a function of X, carrying less information, then M̂_Z ⊂ M̂_X and the loss of efficiency is

ρ̂_Z − ρ̂_X = E V[m̂_X | Z]   (6.5)
           = E[m̂_X²] − E[m̂_Z²].   (6.6)
(Add details using partly general results from the Hilbert space approach and partly the rule of iterated expectation.) The expression (6.5) shows that X adds to the information provided by Z to the extent that there is any variation left in m̂_X when Z is given.
From (6.3) we see that the quality of the Bayes estimator increases with its variance. This may seem puzzling to those who are used to thinking that a good estimator should have small variance. There is no paradox here: An estimator should be as close as possible to the estimand. If the estimand is a fixed parameter, then a small variance is good. If the estimand is a random variable, then the perfect estimator would of course be equal to and, hence, have the same variance as that random variable. (Needless to say, big variance is not a good property as such. We can make the variance of an estimator as big as we want by just adding to it a lot of irrelevant randomness from e.g. coin-tossing. It is for Bayes estimators of the form (6.1) that a big variance is good. In the Hilbert space setting we would say that, the bigger the space M̂_X, the longer the projection m̂_X.)
From (6.4) we see that the quality of the Bayes estimator (or the amount of information on m carried by X) could equally well be measured by the squared norm of the Bayes estimator,

e_X = E[m̂_X²],   (6.7)
For a bonus system S with rules R and scale π, define

ρ_j(S) = E(m_j(θ) − π(Z_{R,j}))²,   (6.8)

which measures the distance between the year j risk premium defined in (6.1) and the year j premium delivered by S.
In general it is not possible to find an optimal bonus system that minimizes ρ_j(S). The problem is that the space of estimators in consideration is very big and lacks nice mathematical structure. Of course, one could find a bonus system that is almost optimal, as the following idea explains: On the one hand, since Z_{R,j} is a function of N_1, . . . , N_{j−1}, a bonus premium cannot outperform the estimator m̂_j = E[m_j(θ) | N_1, . . . , N_{j−1}], confer Paragraph A above. On the other hand, by taking the number of states big enough and designing the transition rules in such a manner that Z_{R,j} tells almost everything about N_1, . . . , N_{j−1}, we could make the bonus premium perform almost as well as m̂_j. This idea is, however, of limited interest since a practical bonus system needs to be simple and, in particular, should not have too many classes (with its 13 classes the Norwegian system is one of the most complicated systems known in practice).
Since the bonus rules must be constrained somehow, let us, as a first step, keep the rules R fixed and consider the simpler problem of finding the optimal year j premium scale, π_{R,j}. The solution comes directly out of Paragraph A applied to (6.8), and is the Bayes estimator of m_j(θ) based on Z_{R,j}:

π_{R,j}(Z_{R,j}) = E[m_j(θ) | Z_{R,j}].   (6.9)

Its efficiency is

e_{R,j} = E[π_{R,j}²(Z_{R,j})].   (6.10)

Introducing the conditional and unconditional class distributions

p_{R,j}(k|θ) = P[Z_{R,j} = k | θ],   (6.11)

p_{R,j}(k) = P[Z_{R,j} = k] = E[p_{R,j}(k|θ)],   (6.12)

we can cast the optimal scale as π_{R,j}(k) = E[m_j(θ) p_{R,j}(k|θ)]/p_{R,j}(k) and the efficiency as

e_{R,j} = Σ_k π_{R,j}²(k) p_{R,j}(k).   (6.13)
The next step, which is to find good rules R, cannot so easily be guided by theoretical clues. What we can do, is to compute the efficiency (6.13) of those bonus rules we would like to compare and to choose the best. In some special situations it is possible to say which of two given systems R and R′ is better on purely theoretical grounds, e.g. if the class process Z_R is a function of the class process Z_{R′}, as in the following example:

ι = 1,   T = ( {2, 3, . . .}  {0, 1} ;
               {2, 3, . . .}  {0, 1} ),

ι′ = 1,  T′ = ( {3, 4, . . .}  {2}  {0, 1} ;
                {3, 4, . . .}  {2}  {0, 1} ;
                {3, 4, . . .}  {2}  {0, 1} ).

Here Z_{R,j} = 2 exactly when Z_{R′,j} = 3, so Z_R is a function of Z_{R′}, and R′ cannot be worse than R.
We will be content to state that, usually, one must rely on intuition in the search for good rules.

Since the premium scale is to be the same in all years, consider the weighted criterion

ρ(S) = Σ_j w_j ρ_j(S),   (6.14)

where the w_j are fixed non-negative weights. Introduce a random time J, independent of everything else, with distribution P[J = j] = w_j/Σ_{j′} w_{j′}. Minimizing (6.14) then amounts to minimizing E(m_J(θ) − π(Z_{R,J}))², and the solution is the Bayes estimator

π_R(k) = E[m_J(θ) | Z_{R,J} = k],   (6.15)

with efficiency

e_R = E[π_R²(Z_{R,J})].   (6.16)

Since

P[J = j | Z_{R,J} = k] = w_j p_{R,j}(k) / Σ_{j′} w_{j′} p_{R,j′}(k),   (6.17)

the optimal scale can be cast as

π_R(k) = Σ_j w_j p_{R,j}(k) π_{R,j}(k) / Σ_j w_j p_{R,j}(k).   (6.18)
D. The Markov chain case. To calculate the optimal premiums and their efficiencies, we need to find the conditional distribution of Z_{R,j}, given θ,

p_{R,j|θ} = (p_{R,j}(1|θ), . . . , p_{R,j}(K|θ)).

Assume that N_1, N_2, . . . are conditionally independent, given θ. Then, for fixed θ, the bonus class process {Z_{R,j}}_{j=1,2,...} is a Markov chain with j-th step direct transition probability matrix

P^{(j)}_{T|θ} = ( p^{(j)}_{T|θ}(k, ℓ) ),   (6.19)

where

p^{(j)}_{T|θ}(k, ℓ) = P[N_j ∈ T_{kℓ} | θ].   (6.20)

Assuming, furthermore, that all states intercommunicate and that the Markov chain is aperiodic, there exists for each fixed θ a stationary distribution

p_{T|θ} = (p_T(1|θ), . . . , p_T(K|θ)) = lim_{j→∞} p_{R,j|θ}.   (6.22)

In the limit the optimal scale is the Bayes estimator based on the stationary class distribution,

π_T(k) = E[m(θ) p_T(k|θ)] / p_T(k),   (6.23)

where p_T(k) = E[p_T(k|θ)], and its efficiency is

e_T = Σ_k π_T²(k) p_T(k).   (6.25)
F. Optimal linear bonus scale. The premium scale (6.6) of the Norwegian bonus system is a linear function of the bonus class. If this property is desirable, we should restrict to bonus scales of the form

π(k) = a + bk,   (6.26)

where a and b are constants. The theory of linear Bayes estimation in Paragraph 1.2.D tells us that, in terms of e.g. year j MSE, the best choice of coefficients in (6.26) for given bonus rules R is (confer (1.35))

b_{R,j} = C[m_j(θ), Z_{R,j}] / V[Z_{R,j}],   (6.27)

a_{R,j} = E[m_j(θ)] − b_{R,j} E[Z_{R,j}].   (6.28)

The moments involved are computed from the class distributions, e.g.

E[m_j(θ) Z_{R,j}] = Σ_k ∫ k m_j(θ) p_{R,j|θ}(k) g(θ) dμ(θ),   (6.29)

E[Z_{R,j}²] = Σ_k k² p_{R,j}(k).
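For conditionally Poisson claim numbers the whole calculation in (6.19)-(6.25) can be scripted: build P_{T|θ}, propagate the class distribution, and average over a discretized mixing distribution. The sketch below does this for the Norwegian rules with a two-point mixing distribution; all numerical values are invented for the illustration, and unit mean claim sizes are assumed so that the risk premium is m(θ) = θ.

import numpy as np
from math import exp, factorial

K = 13
thetas, g = [0.05, 0.20], [0.7, 0.3]     # two-point mixing distribution (hypothetical)

def next_class(k, n):
    """Norwegian reclassification rule, as above."""
    if n == 0:
        return 4 if k in (1, 2) else min(k + 1, K)
    return max(k - 2 * n, 1)

def trans_matrix(theta, n_max=10):
    """P_{T|theta}(k, l) = P[N in T_kl | theta] for Poisson claim numbers."""
    P = np.zeros((K, K))
    for k in range(1, K + 1):
        for n in range(n_max + 1):
            prob = exp(-theta) * theta**n / factorial(n)
            P[k - 1, next_class(k, n) - 1] += prob
        P[k - 1, next_class(k, n_max + 1) - 1] += 1 - P[k - 1].sum()  # tail mass
    return P

j = 10                                    # year under consideration
p_cond = []                               # p_{R,j|theta} for each theta
for theta in thetas:
    p = np.zeros(K); p[6 - 1] = 1.0       # Z_{R,1} = initial class 6
    P = trans_matrix(theta)
    for _ in range(j - 1):
        p = p @ P
    p_cond.append(p)

num = sum(gi * ti * pi for gi, ti, pi in zip(g, thetas, p_cond))   # E[m(theta); Z = k]
den = sum(gi * pi for gi, pi in zip(g, p_cond))                    # p_{R,j}(k)
scale = np.divide(num, den, out=np.zeros(K), where=den > 1e-12)    # (6.9) componentwise
print("optimal year-10 scale:", np.round(scale, 4))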
Bibliography
The Bayes decision theory approach to analysis of automobile bonus systems was first taken in [37], where the asymptotic criterion (6.23) was used. The weighted annual MSE criterion (6.14) was introduced in [1] and applied to situations where the conditional distribution of N_j, given θ, depends on the duration j. The linear scale was proposed in Gilde and Sundt (1989): On bonus systems with credibility scales. Scand. Actuarial J.
The reader is encouraged to contemplate an extension of the present theory
to situations where the same bonus rules are applied to policies with different
observable risk characteristics (e.g. mileage) or to several sub-portfolios with
different distributions of claims processes and latent risk characteristics.
Chapter 7
Claims reserving
7.1 Introduction
w will typically look as in Figure 7.1. Future exposure after time τ pertains to contracts that are currently in force. In practice they will expire in finite time so that w(t) = 0 for t large enough.

Figure 7.1: the exposure rate w(t) as a function of time t.
C. Data from claims records. The claims statistics is a file of records, one for each individual claim. Formally, a claim is a pair C = (T, Z), where T is the time of occurrence of the claim and Z is the so-called mark describing its development from the time of occurrence until the time of final settlement. The mark is taken to be of the form Z = (U, V, Y, {Y′(v′); 0 ≤ v′ ≤ V}), where U is the waiting time from occurrence until notification, V is the waiting time from notification until final settlement, Y is the eventual total claim amount, and Y′(v′) is the amount paid within v′ time units after the notification, hence Y = Y′(V). Henceforth we write Y′ = {Y′(v′); 0 ≤ v′ ≤ V} in short. A typical claims history, as described by these quantities, is depicted in Figure 7.2.

Figure 7.2: a typical claims history: occurrence at T, notification at T + U, partial payments Y′(v′), and final settlement at T + U + V with total payment Y = Y′(V).

We will primarily have the situation above in mind, but note that other descriptions of the claim history are possible and that the mark might be a complex entity comprising any piece of information appearing in the claim record.
7.2 The claims process

The total exposure is

W = ∫ w(t) dt.   (7.1)

The number of claims N is Poisson distributed,

P[N = n] = (W^n/n!) e^{−W},   (7.2)

n = 0, 1, . . ., and by X ∼ Po(W, P_Y) is meant that X is of the form X = Σ_{i=1}^N Y_i, where N ∼ Po(W) and N is independent of Y_1, Y_2, . . ., which are independent selections from the distribution P_Y. We have

P[X ≤ x] = Σ_{n=0}^∞ (W^n/n!) e^{−W} P_Y^{*n}(x),   (7.3)

where the topscript *n signifies n-th convolution. The first three central moments of X are

m_X^{(k)} = W ∫ y^k P_Y(dy),  k = 1, 2, 3,   (7.4)

provided that the first three moments of P_Y exist.
There are at least two reasons why the generalized Poisson distribution plays
an important role in risk theory. First, it is widely held to be a reasonable
description of claims generated by a large and fairly homogeneous portfolio of
risks and, second, it is computationally feasible. In particular, its moments
are given by simple explicit formulas, confer (7.4), and there exist a number of
techniques for computing tail probabilities and fractiles in such distributions,
see e.g. [7].
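One such technique is the Panjer recursion, which in the Poisson case computes the distribution of X ∼ Po(W, P_Y) exactly when the claim sizes are arithmetic. A minimal sketch with an invented three-point severity distribution:

import math

# Panjer recursion for X ~ Po(W, P_Y) with severities on {1, 2, ...} (monetary units).
W = 3.0                                  # total exposure / expected claim number
pY = {1: 0.5, 2: 0.3, 3: 0.2}            # arithmetic claim size distribution (hypothetical)

s_max = 40
f = [0.0] * (s_max + 1)
f[0] = math.exp(-W)                      # P[X = 0] = e^{-W} (no mass of Y at 0)
for s in range(1, s_max + 1):
    f[s] = (W / s) * sum(y * pY[y] * f[s - y] for y in pY if y <= s)

cdf = 0.0
for s, prob in enumerate(f):
    cdf += prob
    if cdf >= 0.95:
        print("95% fractile:", s, " tail prob beyond it:", 1 - cdf)
        break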
C. Alternative construction of the process. We set out by proving a basic representation result.

Theorem 1. The marked Poisson process {(T_i, Z_i)}_{i=1,...,N} can be constructed in two steps: First, generate

N ∼ Po(W),   (7.5)

and then, given N = n, draw n independent pairs (T_i, Z_i) from the distribution

P_{TZ}(dt, dz) = (w(t) dt / W) P_{Z|t}(dz),   (7.6)

and list them in chronological order.

Proof: The event {N = n, (T_i, Z_i) ∈ (dt_i, dz_i), i = 1, . . . , n}, with 0 < t_1 < · · · < t_n, has probability

e^{−∫_0^{t_1} w(s)ds} w(t_1)dt_1 P_{Z|t_1}(dz_1) ··· e^{−∫_{t_{n−1}}^{t_n} w(s)ds} w(t_n)dt_n P_{Z|t_n}(dz_n) e^{−∫_{t_n}^∞ w(s)ds}   (7.7)

= e^{−W} ∏_{i=1}^n w(t_i) dt_i P_{Z|t_i}(dz_i)   (7.8)

= (W^n/n!) e^{−W} n! ∏_{i=1}^n P_{TZ}(dt_i, dz_i).   (7.9)

Here W^n e^{−W}/n! is the probability of N = n, and n! ∏_{i=1}^n P_{TZ}(dt_i, dz_i) is the probability that n i.i.d. draws from P_{TZ}, ordered chronologically, fall in (dt_i, dz_i), i = 1, . . . , n.
Corollary 1. Consider a functional of the form

X_f = Σ_{i=1}^N f(T_i, Z_i).   (7.10)

Then X_f ∼ Po(W, P_{TZ} f^{−1}), and (provided they exist) the first three central moments of X_f are

m_{X_f}^{(k)} = ∫ w(t) ∫ f(t, z)^k P_{Z|t}(dz) dt,  k = 1, 2, 3.   (7.11)

Proof: The distribution result follows from the fact that the sum on the right of (7.10) does not depend on the chronological order of the claims (f is independent of i) and therefore is distributed as the sum of N replicates of f(T, Z) that are mutually independent and independent of N. Then (7.11) is just a standard result about the compound Poisson law applied to (7.5) and (7.6).

The probability distribution of X_f in (7.10) may be computed by standard methods for numerical evaluation of total claims distributions. Note the linearity property

X_{f_1} + · · · + X_{f_k} = X_{f_1 + ··· + f_k}.   (7.12)

Corollary 2. For square integrable X_{f′} and X_{f″},

Cov(X_{f′}, X_{f″}) = ∫ w(t) ∫ f′(t, z) f″(t, z) P_{Z|t}(dz) dt.   (7.13)

Proof: Write

Cov(X_{f′}, X_{f″}) = ¼ ( Var(X_{f′} + X_{f″}) − Var(X_{f′} − X_{f″}) ),

and use the linearity property (7.12) together with (7.11) and the identity

¼ ( (f′ + f″)² − (f′ − f″)² ) = f′ f″.
D. A general decomposition result and some complements. Let C^g, g = 1, 2, . . . , h (≤ ∞), be a partition of the claim space, that is, ∪_{g=1}^h C^g = C and C^{g′} ∩ C^{g″} = ∅ if g′ ≠ g″. Introduce

Z_t^g = {z ∈ Z; (t, z) ∈ C^g},

the set of developments that make a claim occurred at time t a g-claim (belonging to C^g), and

T^g = {t ∈ T; P_{Z|t}{Z_t^g} > 0},

the time period (or more general era) where such claims can occur. The process of g-claims is denoted {(T_i^g, Z_i^g)}_{1≤i≤N^g}, g = 1, . . . , h, where the times T_i^g are listed in chronological order.

Theorem 2. Given a partition C^g, g = 1, 2, . . . of the claim space, the corresponding component g-claims processes are independent, and

{(T_i^g, Z_i^g)}_{i=1,...,N^g} ∼ Po(w^g(t), P^g_{Z|t}; t > 0),

with

w^g(t) = w(t) P_{Z|t}{Z_t^g},   (7.14)

P^g_{Z|t}(dz) = (P_{Z|t}(dz) / P_{Z|t}{Z_t^g}) 1_{Z_t^g}(z).   (7.15)
Proof: Consider first the case with finite h, and look back at the proof of Theorem 1. First, state the event appearing in (7.7) in terms of the component processes to rewrite the probability as

P[ ∩_{g=1}^h {N^g = n^g, (T_i^g, Z_i^g) ∈ (dt_i^g, dz_i^g), i = 1, . . . , n^g} ].

By virtue of (7.14)-(7.15) and W = Σ_g W^g, the expression (7.8) reorganizes by category into

∏_{g=1}^h e^{−W^g} ∏_{i=1}^{n^g} w^g(t_i^g) dt_i^g P^g_{Z|t_i^g}(dz_i^g).

The product form of this expression together with (7.8) already proves the result. We will, however, explicate the argument a little bit further. Recast each factor in the product above in the same way as in (7.9) to arrive at

∏_{g=1}^h ( ((W^g)^{n^g}/n^g!) e^{−W^g} n^g! ∏_{i=1}^{n^g} P^g_{TZ}(dt_i^g, dz_i^g) ),   (7.16)

where

W^g = ∫ w^g(t) dt,   (7.17)

P^g_{TZ}(dt, dz) = (w^g(t) dt / W^g) P^g_{Z|t}(dz).   (7.18)
The result says that g-claims occur with an intensity which is the claim intensity times the probability that the claim belongs to the category g, and that the development of the claim is governed by the conditional distribution of the mark, given that it is a g-claim. Accordingly, the quantity W^g in (7.17) may be termed the total exposure in respect of claims of category g or just the g-exposure. Observe that P^g_{TZ} in (7.18) is the conditional distribution of (T, Z), given that it is a g-claim:

P^g_{TZ}(dt, dz) = (P_{TZ}(dt, dz) / (W^g/W)) 1_{C^g}(t, z).
In particular, a functional X_f = Σ_{i=1}^N f(T_i, Z_i) decomposes into independent contributions from the component processes,

X_f = Σ_g X_{f,g},  X_{f,g} = Σ_{i=1}^{N^g} f(T_i^g, Z_i^g).

We shall also need the following simple lemma on transformed marks: if {(T_i, Z_i)}_{i=1,...,N} ∼ Po(w(t), P_{Z|t}; t > 0) and κ is a (measurable) transformation of the marks, then

{(T_i, κ(Z_i))}_{i=1,...,N} ∼ Po(w(t), P_{Z|t} κ^{−1}; t > 0).   (7.20)
A standard result, known as the amalgamation theorem for compound Poisson claims processes, goes as follows: Let X^g, g = 1, . . . , h (< ∞), be independent compound Poisson processes, that is, each X^g is of the form X^g(t) = Σ_{j=1}^{N^g(t)} Y_j^g, where N^g is a homogeneous Poisson process with claim intensity w^g, and the individual claim amounts Y_j^g are independent selections from a claim size distribution P^g and, moreover, independent of N^g. Then the process X = Σ_{g=1}^h X^g is a compound Poisson process with claim intensity w = Σ_{g=1}^h w^g and claim size distribution P = w^{−1} Σ_{g=1}^h w^g P^g. This generalizes to the following, which appears as a partial converse of the decomposition Theorem 2, but really is implied by it:

Theorem 3. Suppose {(T_i^g, Z_i^g)}_{i=1,...,N^g} ∼ Po(w^g(t), P^g_{Z|t}; t > 0), g = 1, . . . , h, are a finite number of mutually independent marked Poisson processes on T × Z. Then the amalgamated process {(T_i, Z_i)}_{i=1,...,N}, obtained by assembling the claims of the individual processes, is also a marked Poisson process on T × Z, with

w(t) = Σ_{g=1}^h w^g(t),   (7.21)

P_{Z|t}(dz) = (1/w(t)) Σ_{g=1}^h w^g(t) P^g_{Z|t}(dz).   (7.22)
Remark: The claimed result is precisely what one would expect. The independence properties carry over from the individual processes to the amalgamated one and suggest the Poisson property of the occurrences of the latter. Furthermore, (7.21) says that the total probability of a claim occurrence in a small time interval is the sum of the corresponding probabilities for the individual processes, and (7.22) states that a claim occurred at time t is from the g-th individual process with probability w^g(t)/w(t), in which case the mark is generated by the mark distribution of that process.

Proof: Anticipating the result, start from a marked Poisson process

{(T_i, Z̄_i)}_{i=1,...,N} ∼ Po(w(t), P_{Z̄|t}; t > 0)   (7.23)

with marks Z̄ = (G, Z) in {1, . . . , h} × Z and

P_{Z̄|t}({g} × dz) = (w^g(t)/w(t)) P^g_{Z|t}(dz).   (7.24)

The generic mark of this process is Z̄ = (G, Z), the original mark augmented with an index for type of claim. It is seen from (7.24) that a claim occurred at time t is of type G = g with probability w^g(t)/w(t) and, given this, the Z-part of the mark is generated from P^g_{Z|t}.

Now, on the one hand, applying Theorem 2 to the decomposition of C̄ by claim type, C̄^g = {(t′, g′, z′); g′ = g}, g = 1, . . . , h, we readily find that the component processes have the distribution properties of the individual processes as specified in the assumptions of the present theorem, and so we can as well let the latter be constructed as the component processes in the present model (7.23)-(7.24).

On the other hand, it is realized that in the present model the amalgamated process is obtained from {(T_i, Z̄_i)}_{i=1,...,N} upon leaving the type G unobserved or, in the terms of the lemma above, considering the process with marks transformed by κ(g, z) = z. Under this simple mapping the probability distribution in (7.20) is just the marginal distribution of Z in the distribution of (G, Z) given by (7.24), which is precisely the one defined in (7.22). Thus, the lemma completes the proof.
We round off this paragraph with an alternative proof of Corollary 2 to Theorem 1. It makes use of the decomposition theorem and, moreover, serves to demonstrate a useful technique.

Second Proof of Corollary 2 to Theorem 1: Suppose the result holds for indicator functions f′ and f″. Then, by the bilinearity of the covariance operator, it also holds for linear combinations of indicator functions. Since every (measurable) non-negative function is the monotone limit of linear combinations of simple functions, the result extends to non-negative functions f′ and f″ by the monotone convergence theorem. Finally, since any function is the difference of its positive and negative parts, the result extends to general functions.

Thus, it suffices to prove the result for f′ = 1_{C′} and f″ = 1_{C″}, where C′ and C″ are subsets of C. Since f′ and f″ are binary, the functions f′f″, f′(1 − f″), and f″(1 − f′) satisfy the orthogonality condition in Corollary 2 to Theorem 2, hence the corresponding compound Poisson variates are independent. By the linearity property (7.12), X_{f′} = X_{f′f″} + X_{f′(1−f″)} and X_{f″} = X_{f′f″} + X_{f″(1−f′)}. These things together imply that

Cov(X_{f′}, X_{f″}) = Var(X_{f′f″}) = ∫ w(t) ∫ (f′(t, z) f″(t, z))² P_{Z|t}(dz) dt,

where we have made use of (7.11). Now, since f′f″ is binary, it equals its square, and we arrive at (7.13).

Results like Corollary 2 to Theorem 1 are valid for more general marked point processes, see e.g. [30]. The present proofs are worth reporting since they are simple thanks to the fact that everything is independent of everything else in the Poisson scenario.
7.3
Applications
62
mg
mg1
pY |t (y) dy ,
PZ|t (dz)
g
PZ|t
(dz) = R mg
1(mg1 ,mg ] (y) .
p (y) dy
mg1 Y |t
In particular, for a fixed m, let the two sets C s = {(t, z); y m} and
C = {(t, z); y > m} decompose the business into small and large claims. We
may interpret m as the deductible part by minimum franchise or first risk in
the context of direct insurance or as the retention level in the context of excess
of loss reinsurance.
Pursuing the latter interpretation, consider a reinsurance treaty under which
the cedant and the reinsurer cover f 0 (t, z) and f 00 (t, z), respectively, of a claim
occurred at time t and with mark z. The covariance between their total losses
is given by (7.13), and their means and variances are given in Corollary 1 to
Theorem 1.
For instance, for quota share reinsurance we have f 0 (t, z) = ky and f 0 (t, z) =
(1 k)y so that, with time-independent marks, the covariance is simply
`
W k(1 k) E[Y 2 ] .
For excess of loss reinsurance we have f 0 (t, z) = min(y, m) and f 00 (t, z) =
max(y m, 0) and, since the product of these functions is 1(m,) m (y m),
the covariance is
Z
(1 PY (y)) dy .
W m E[1(m,) (Y m)] = W m
m
1
W j
j
j1
w(t)pY |t (y) dt .
63
In particular, in the homogeneous case with time-independent marks and constant Poisson intensity, w(t) = w, we have
W j = w, pj
Y = pY .
(7.25)
w(t)
0
jt
j1t
pU |t (u) du dt
pj
Y (y) =
w(t)
0
jt
j1t
pU Y |t (u, y) du dt .
Similarly we recast
pj
Y
pj
Y (y) =
max(0,j1u)
as
1
W j
j
0
pU Y |t (u, y)
ju
w(t) dt du .
max(0,j1u)
Consider again the homogeneous case with time-independent marks and constant Poisson intensity, w. Letting j increase, the expressions above tend to
W j = w,
pj
Y (y) = pY (y) .
64
w(t)
j1
j+dt
j+d1t
pU |t (u) du dt
1
W jd
w(t)
j1
j+dt
j+d1t
pU Y |t (u, y) du dt .
1 y
y
e
1(0,) (y) .
()
(7.26)
65
where
d = (d + )1 ,
d = ga(y; 1, d + ) .
where
T U
0
exp{( T U v 0 )}dY 0 (v 0 ) .
Again we can conclude that X is a compound Poisson variate, which in principle is simple. The claim size distribution may in this case be a bit complicated,
though, but it could be simulated in any case.
Inflation at rate can be accommodated
R in the model e.g. by letting
PY 0 |t,u,v,y be the distribution of Y 0 (v 0 ) = [0,v0 ] exp{(t + u + v 00 )}dY (v 00 ),
0 v 0 v, where Y is a process with some distribution PY |u,v,y , independent
of t.
7.4
A. Four categories of claims. Taking our stand at the present time , the
claims may be categorized as settled (s), reported-not-settled (rns), incurred-notreported (inr), or covered-not-incurred (cni), defined precisely by
Cs
= {(t, z); t + u + v } ,
(7.27)
66
inr
C
C cni
(7.28)
(7.29)
(7.30)
The acronyms inr and rns are shorthand for the commonly used ibnr and
rbns, the redundant but being dropped in incurred but not reported and reported but not settled. The term cni represents claims related to what is usually
called the unearned premium reserve.
In accordance with the partition (7.27) (7.30) the claims process decomposes into the component processes {(Tig , Zig )}1iN g , g = s, rns, inr, cni,
which, by Theorem 2, are independent marked Poisson process.
The total liability X of the company decomposes accordingly into
X = X s + X rns + X inr + X cni ,
(7.31)
where
X
Xg =
Yig =
1iN g
X
i
1{(Ti ,Zi )C g } Yi ,
(7.32)
(7.33)
Finally, the inr- and cni-parts are outstanding. These are conveniently lumped
into
X
X nr = X inr + X cni =
Yi 1{Ti +Ui > } ,
(7.34)
i
(7.35)
(7.36)
67
B. The prediction problem. Let F denote the statistical information available by time . It consists of the histories up to time of all claims that have
been reported (r) by that time, that is, are in
C r = C s C rns = {z; t + u } .
(7.37)
(7.38)
Prediction uncertainty may be expressed in terms of the variance and, possibly, higher order central moments built from the noncentral moments
R
R k
0
tu y 0 y pt (u, v, y , y) dv
k
R
E[Y | F,i ] = R
.
(7.40)
pt (u, v, y 0 , y) dy dv
tu y 0
68
PY |t (dy) =
u> t
PU Y |t (du, dy)
1 PU |t ( t)
(7.41)
(7.42)
and, by (7.19),
(k)
mX nr
=
=
wnr (t)
t>0
Z
Z
w(t)
t>0
y>0
u> t
y k PYnr|t (dy) dt
Z
y k PU Y |t (du, dy) dt , k = 1, 2, 3. (7.43)
y>0
(k)
(k)
mX o |F = mX orns |F + mX nr , k = 1, 2, 3.
(7.44)
An appropriate reserve is the first moment given by (7.44) with k = 1. A fluctuation loading may be provided by adding a multiple of the standard deviation.
As an alternative to this ad hoc method, one may take the upper -fractile (e.g.
= 0.01) of the predictive distribution, or some approximation of it based on
the first three moments in (7.44).
7.5
69
(7.45)
(7.46)
for r s (confer (7.26)). That X is well-defined this way follows from Kolmogorovs consistency condition and the convolution property of the gamma
distribution (to be described below). The inverse scale parameter is immaterial in the construction of Q by (7.46), of course, and could be set to 1.
Now, let = s0 < s1 < < sk = be a finite partition of R, and
abbreviate i = (si ) (si1 ), i = 1, . . . , k. Starting from the independent
gamma variates Xi = X(si ) X(si1 ) Ga(i , ), i = 1, . . . , k, one easily
finds that the fractions
Qi = Q(si ) Q(si1 ) =
X(si ) X(si1 )
,
X()
70
( + ) 1
q
(1 q)1 ,
()()
0 < q < 1. The stochastic process Q thus defined is called the Dirichlet process
with parameter = {(s); s R}, and we write Q Dir(). The Dirichlet
process plays an important role in nonparametric Bayesian analysis, see [18].
The moments of Q(s) are easily calculated. In particular,
E[Q(s)] = (s)/() ,
showing that the expected value of Q is just normed to a probability distribution, and
Var[Q(s)] = E[Q(s)](1 E[Q(s)])/(() + 1) ,
showing that the total mass of is a measure of the precision of the process Q;
a large value of () means little randomness in Q.
The conditional Q-distribution on an interval (a, b] is
Q(s|(a, b]) =
X(s) X(a)
Q(s) Q(a)
=
, a < s b.
Q(b) Q(a)
X(b) X(a)
Putting X(s) X(a) and X(b) X(a) in the roles of X(s) and X(), respectively, in the construction above, the whole story repeats itself. We find that
Q(s|(a, b]) Dir((a,b] ), where (a,b] is the restriction of to (a, b], and that
it is independent of X(b) X(a) and of X(r) for r
/ (a, b]. Thus, conditional
Q-distributions on disjoint intervals are independent Dirichlet processes.
C. Predicting the outstanding part of Dirichlet type payments. We
adopt the general model in Paragraph A above with partial payments Y 0 of the
form (7.45), where Q Dir(). Of course, in this context is concentrated on
the unit interval [0, 1], i.e. 0 = (0) < (1) = ().
Let denote the present time and consider a reported but not settled claim
occurred at time t < , notified with a delay U = u < t, hence V > v 0 =
tu, and for which we have observed the partial cumulative payments Y 0 (vj )
at development times 0 v1 < < vk = v 0 . Denote all this information by
F 0 . The natural predictor of the outstanding payments on the claim is
E[Y | F 0 ] Y 0 (v 0 ) .
(7.47)
71
(7.48)
Q(vj /V ) Q(vj1 /V )
Y 0 (vj ) Y 0 (vj1 )
=
0
0
Y (v )
Q(v 0 /V )
7.6
72
Parameter estimation
(7.50)
(7.52)
implying that we discard all information about partial payments for the rnsclaims.
D. The likelihood of the observations. By Theorems 1 and 2, in particular
73
L = exp(W ())
N
Y
i=1
rns
NY
i=1
where
s
W ()
W rns ()
=
=
Z0
0
w(t; )PU +V ( t; ) dt ,
w(t; ) PU ( t; ) PU +V ( t; ) dt .
= W s () + W rns ()
Z
=
w(t; )PU ( t; ) dt .
(7.53)
W r ()N r
s
N
Y
i=1
rns
NY
i=1
Tirns Uirns
pV |U rns (v; ) dv .
i
E. Maximum likelihood estimation. To find the maximum likelihood estimator (MLE), , we need to maximize the likelihood or, equivalently, its
logarithm. Thus, is found as the solution to the k-dimensional system of
equations
(7.54)
ln L|= = 0 .
Under certain regularity conditions, which we assume are satisfied, the MLE is
consistent,
p
(7.55)
74
where () = I()
N(, ()) ,
(7.56)
(7.57)
(7.58)
where w0 (t) is an observable measure of the size of the portfolio at time t and
is an unknown intensity per unit of risk exposed. Furthermore, let , , and be
the parameter functions appearing in the constituents of the mark distribution;
pU (u; ), pV |U (v; ) and pY |U V (v; ). It is convenient to rewrite (7.53) as
W r () = W0r () ,
(7.59)
with
W0r ()
Z0
w0 (t)PU ( t; ) dt
W0 ( u)pU (u; ) du ,
(7.60)
R t
0
pU (u; ) du,
(7.61)
(7.62)
(7.63)
Nr
Y
R
0
i=1
s
N
Y
i=1
pU (Uir ; )
W0 ( u)pU (u; ) du
pV |U s (Vis ; )
(7.64)
pY |U s V s (Yis ; )
(7.65)
N
Y
i=1
rns Z
NY
i=1
Tirns Uirns
pV |U rns (v; ) dv .
i
(7.66)
75
(7.67)
In the special case where claims are settled immediately upon notification,
we have N r = N s and the cumbersome factor (7.66) vanishes.
G. An example. Suppose that claims are settled immediately upon notification so that the likelihood is made up of (7.62), (7.63), and (7.65) with
pY |U s (Yis ; ) in the place of pY |U s V s (Yis ; ).
i
i i
Assume that (U, Y ) follows a bivariate lognormal distribution,
uu uy
u
ln U
).
,
N(
uy yy
y
ln Y
Then
ln Y |ln U N (0 + 1 ln U, ) ,
0
1
= y 1 u ,
uy
=
,
uu
2
uy
= yy
.
uu
(7.68)
(7.69)
(7.70)
The MLE of the parameters in (7.68) (7.70), which now constitute , is found
by simply regressing the ln Yir linearly on the ln Uir .
The MLE of u and uu , which now constitute , is obtained by maximization of (7.63), which is
1
1
r
2
Nr
exp
(ln
U
)
Y
u
r
i
2
2uu Ui
uu
.
R
1
1
2 du
Upon cancelling equal factors in the numerator and denominator and rearranging a bit, we find that the problem is to minimize
Z
1
W0 ( t) exp ((ln U )2 (ln u)2 ) + (ln u ln U) du ,
(7.71)
u
0
where
=
and
1
,
2uu
u
,
2uu
r
N
N
1 X
1 X
r
2
ln U = r
ln Ui ; (ln U ) = r
(ln Uir )2 .
N 1
N 1
76
expect), y = 6.5480, uu
= 1.0727, uy
= 0.4154, yy
= 7.2069. The estimated
coefficient of correlation of ln U and ln Y is 0.1494, which means that large claims
tend to be reported with a longer delay than small claims. The claim intensity
was estimated by = 0.003639.
Bibliography.
[35], [24], [29], [4], [27], [43], [23].
Chapter 8
Utility Theory
This chapter is a very sketchy draft, still sufficient to replace Sundts Chapter
12.
8.1
s(x) = (x b) ,
(8.1)
whereby the insurer covers the excess of the loss over some deductible amount
b > 0.
The proportional deductible or proportional insurance is given by
r(x) = (1 k) x ,
s(x) = k x ,
77
(8.2)
78
(8.3)
whereby the insurer covers everything if the loss exceeds b and nothing otherwise.
The first risk deductible is given by
r(x) = (x b) ,
s(x) = (x b)+ ,
(8.4)
(8.5)
Certain axioms, which are carefully discussed in the eminent book of De Groot
[14], lead to the criterion (8.5), that is, existence of a function u such that Y1
is preferred to Y2 if and only if E[u(Y1 )] > E[u(Y2 )]. The axioms do not imply
anything as to the form of the function u, however, and here we have to add
assumptions based on other considerations.
D. Basic properties of utility functions.
Important clues are given by the fact that the criterion, when applied to degenerate random variables (constants), generates preferences between different
amounts of certain wealth. We can confidently postulate that there must be a
limit to the satisfaction that our agent can get from a finite amount of money
79
and that he is able to assess every amount within a certain range of possible
wealth:
u is finite-valued and defined on some open interval I .
(8.6)
(8.7)
(8.8)
This means that our insured is not the sort of person whose greed increases with
the size of his purse. (A word about usage: In the following we will let increasing and decreasing mean non-decreasing and non-increasing, respectively,
and add a qualifying strictly when needed.)
Note that the preferences generated by the expected utility hypothesis are
invariant under a linear transform of the utility function u to a + bu, where a
and b > 0 are constants. Here is a list of some commonly used utility functions
specified in minimalistic form:
Quadratic utility:
u(y) = y
y2
,
2c
y < c.
(8.9)
Exponential utility:
u(y) = exp(cy) ,
y R,c > 0.
(8.10)
y R , y > c .
(8.11)
Logarithmic utility:
u(y) = log(c + y) ,
From the expected utility hypothesis and the assumptions (8.6) (8.8) we
can deduce a variety of results that serve to explain why people purchase insurance, even at prices that are (sometimes way) above the expected value of the
loss, and how insurance treaties should be designed to serve the purpose of risk
mitigation. The results are valid for all utility functions and, thus, deal with
qualitative rather than quantitative aspects of risk preferences. So, by way of
warning, the theory aims at creating general understanding rather than producing practically useful numbers (e.g. what premium to charge). We now go to
work and commence with a crucial qualitative property of utility functions.
Theorem 1. A utility function is strictly concave.
80
x+z
2
>
u(x) + u(z)
.
2
(8.12)
(0, 1),
(8.13)
for the special choice = 1/2. This is already a lot of structure since (8.12) is
true for all x < z, and we will now prove that it implies (8.13).
Fix x and z and, to save space, introduce the function
v() =
u((1 )x + z) u(x)
,
u(z) u(x)
[0, 1], which satisfies v(0) = 0, v(1) = 1, and inherits all qualitative properties of u. We want to prove (8.13), which is the same as
v() > ,
(0, 1)
(8.14)
We will first prove that (8.14) holds for all of the form
n,j = j/2n ,
j = 1, . . . , 2n 1 ,
(8.15)
1
(k,n + k+1,n ) ,
2
1
1
(v(k,n + v(k+1,n )) > (k,n + k+1,n ) = 2k+1,n+1 .
2
2
81
d
u00 (y)
= ln u0 (y)
u0 (y)
dy
(8.16)
is called the risk aversion function of u (we do not visualize its dependence on
u in the notation). The risk aversion function is strictly positive and is a local
measure of the relative change of the marginal utility, that is, how much the
marginal utility changes in units of itself by an infinitesimally small increase of
wealth.
As we will see in the next section, there are good reasons to require that the
risk aversion function be decreasing. For the time being let us just suppose that
the function a really deserves its name, and state that the aversion against risk
ought to decrease with the wealth; a rich person should be more able to carry
a certain risk than a poor person. This will be made precise later.
We observe that the quadratic utility function (8.9) has an increasing risk
aversion function, and therefore does not reflect the attitudes to risk that we just
said are the typical ones. The exponential utility function (8.10) has constant
risk aversion function, and is thus a benchmark case, just barely compatible with
typical risk attitudes. The logarithmic utility function (8.10) has decreasing risk
aversion function, and is thus OK.
8.2
82
(8.1)
(Preferred includes the case with equality, when the insured is indifferent.)
By monotone convergence and continuity of u, the expression on the left is a
continuous function of . Obviously, it is also strictly decreasing in and ranges
between the upper bound E[u(w s(X))], which is E[u(w X)] and a lower
bound, which we assume is E[u(w X)] (if e.g. I is a half-interval of the
type (, a), then obviously the lower bound is limy& u(y) E[u(w X)]).
Therefore, there exists a r (w) such that
E[u(w r (w) s(X))] = E[u(w X)] .
(8.2)
This r (w) is called the zero utility (increase) premium. Obviously, the insured
will buy the insurance if and only if r (w). The following result says that,
for reasonably designed compensations, the insured is willing to pay more than
the pure net premium E[r(X)]. This is good news because the insurer needs to
collect a premium that covers the expected claims (to avoid systematic losses
and certain ruin in the long run) and also covers administration expenses and
allows for a profit.
Theorem 2. If s and r are both increasing, then r (w) > E[r(X)] .
Proof: The proof rests on the Corollary to Theorem 3 in Appendix C. Consider
the functions g1 (x) = E[r(X)] + s(x) and g2 (x) = x. Obviously they fulfill
the condition (C.20), and since g2 (x) g1 (x) = r(x) E[r(X)] is increasing by
assumption, they also fulfill the condition (C.21). Since v(y) = u(w y) is a
strictly concave function, we conclude that
E[u(w E[r(X)] s(X))] > E[u(w X)] .
The result now follows by comparison with (8.2).
B. How the zero utility premium depends on risk aversion.
We can now give precise contents to the previously anticipated results about
risk aversion.
83
8.3
Optimal insurance
= E E[u(w s(X1 , . . . , Xn )) | X]
E[u(w E[s(X1 , . . . , Xn )]) | X])]
= E[u(w S(X))] ,
which proves that the global treaty (, R) is a better than the local (, r).
84
The result does not say that every global treaty is good, and one easily
constructs examples of poor global treaties. The result states only that, for the
purpose of maximizing the expected utility of the insured, one can restrict to
global treaties. Moreover, the proof is constructive and shows how to design a
global treaty that outperforms a given local treaty.
Note that the proof does not make any particular use of the definition of
X and would work equally well for any random variable X. In fact, one could
obtain a treaty that is better than the global one constructed above by letting
X be something that is not related to the risks under consideration, e.g. the
turnover of cheese in the Irma shop last month or just a constant. The reason
why the total loss is of particular interest cannot be explained endogenously in
the present theory. It rests on other circumstances, e.g. that the insurer will
only offer treaties with a certain element of selfinsurance in them (selfinsurance
may serve to save costs by eliminating small claims, and it may give the insured
incentives to prevent losses).
B. Fixed amount deductible is optimal.
Henceforth X may represent a individual risk or, in view of the result in the
previous paragraph, the total risk related to a combined policy.
Theorem 5 (Arrow-Ohlin). If the premium depends only on the pure net premium, then to every treaty there exists a fixed amount deductible treaty, which
has the same premium and which is at least as good from point of view of the
insured.
Proof: Let (, r) be any given treaty. The fixed amount deductible compensation
rb (X) = (X b)+ is a continuous and decreasing function of b and, by monotone
convergence, also the corresponding pure premium E[rb (X)] is a continuous
function of b with values ranging from E[X] to 0. Therefore, there exists a b
such that E[rb (X)] = E[r(X)] so that the premium is the same for the two
treaties. We can now invoke Ohlins lemma noting that, for y < b,
P[sb (X) y] = P[X y] P[s(X) y] ,
and, for y b,
8.4
85
Reproducing arguments above, we easily show that global treaties are preferred by the insurer (Pesonen-Ohlin), so in this respect the two parties have
common interests. However, analogous to Arrow-Ohlin, we show that first risk
is optimal to the insurer, and here the two parties have contradicting interests.
Now, first risk is a somewhat extreme form of compensation since it leaves
the top risk to the insured. It may be reasonable to restrict attention to compensation functions that satisfy the Vajda condition:
r(x)/x is an increasing function of x .
Thus, the bigger the loss, the bigger the fraction covered by the Vajda compensation function.
Theorem 6. If the premium depends only on the pure net premium, then
to every Vajda treaty there exists a proportional insurance treaty, which has the
same premium and which is at least as good from the point of view of the insurer.
Proof: Simple. For a given treaty (, r) we find a proportional treaty with
compensation function rk (x) = (1 k)x and the same premium . Being Vajda,
r is increasing and so is rk of course. The difference
r(x)
(1 k)
r(x) rk (x) = r(x) (1 k)x = x
x
is ()0 according as x ()x0 for some x0 . Use the corollary to Ohlins
theorem.
B. Conflicting interests
We easily prove that, among Vajda treaties, proportional insurance is the worst
solution from the point of view of the insured. Not surprisingly, the interests of
the insured and the insurer are not the same and they are even conflicting when
we take the position of only one party at a time and compare compensation
functions with the same pure premium under the assumption that the premium
then is fixed. The possibilities of coming to terms may be improved if the parties
were allowed to sit and search a compromise solution, letting both the form of
the compensation and the size of the premium be negotiable. One could then
imagine that the insurer, despite the fact that he would prefer first risk among all
compensation functions and proportional among Vajda ones when the premium
is fixed, still could consider offering the insureds favourite compensation (fixed
amount deductible) if he could do it against a premium that is acceptable to
him. We will now formalize these loose ideas.
86
8.5
A. Risk exchanges.
Consider n economically independent agents (e.g. individuals, insurance companies, or reinsurance companies) who are carrying economic risk. In attempts
to get rid of, or at least mitigate, the riskiness of their individual businesses, the
agents enter into negotiations of treaties for mutual exchange of risk amongst
themselves for a certain period.
We assume that each agent i has a differentiable utility function ui and an
initial wealth wi , which will be reduced by an uncertain amount Xi during the
period of consideration. The unknown loss Xi is the risky part of the business
of agent i, and so we take the vector X = (X1 , . . . , Xn )0 to be random. We
speak of the Xi as losses because we primarily have insurable risks in mind.
However, the theory we are going to develop does not rest on this or any other
particular interpretation of the situation, and it does not require that the Xi be
non-negative. For instance, the agents could be investors who seek to reduce the
uncertainty associated with their individual investment portfolios through some
pool arrangement. In that situation an Xi would typically be either positive (a
loss) or negative (a gain), and hopefully the latter case would be the more
likely.
A treaty for exchange of risk must specify how much each individual agent
is to contribute to the coverage of the losses during the period, and these contributions must fulfill the obvious budget constraint that the total loss must be
covered. Formally, a risk exchange is a function
f = (f1 , . . . , fn )0 : Rn 7 Rn
such that (almost surely)
n
X
i=1
fi (X) =
n
X
Xi .
(8.1)
i=1
Under the risk exchange treaty f each agent i will pay fi = fi (X) instead of his
original loss Xi .
As usual we adopt the expected utility hypothesis assuming that each agent
i will judge any f by his expected terminal utility,
Vi (f ) = E ui (wi fi ) .
B. Pareto-optimality.
It is, of course, impossible to find a risk exchange that is optimal from the isolated point of view of each individual agent. After all, the total loss is to be
shared between the agents somehow, and any reduction of one agents share
must be compensated by increases of the shares of (some of) the others. In
such a situation, where the interests are partly conflicting, the fruitful approach
is to try and find out what all parties can agree on. Thus, they should first
rule out those treaties that all agents find uninteresting so as to remain with a
87
set of negotiable treaties. This vague notion is made precise by the concept of
Pareto-optimality, which we now define:
Definition A risk exchange f is called Pareto-optimal if there exists no other
risk exchange f that is at least as good as f for all agents and strictly better for
some agents, that is,
Vi (f ) Vi (f ) i Vi (f ) = Vi (f ) i.
(8.2)
(8.3)
(8.4)
88
Next, sum this expression over all i and use (8.1), which applies to both f and
f , to obtain
X ui (wi fi ) ui (wi fi )
0.
i
i
Finally, take expectation to arrive at
X Vi (f ) Vi (f )
i
0.
Since all i are strictly positive, it is now clear that (8.2) must hold true.
For a proof of the necessity of the condition (8.3), we refer to [12].
u(wi fi ) + u0 (wi fi )(fi fi )
u(wi fi )
wi f i
wi fi
(8.5)
89
parametric functions, the solution will usually be an explicit parametric expression. In the next paragraph we shall outline the calculations and investigate
some general properties of the solution.
D. The second Borch/du Mouchel theorem.
Since the derivatives u0i are strictly decreasing functions, they are invertible,
and we can recast (8.3) as
wi fi = (u0i )1 (i u01 (w1 f1 )) .
Summing over i and using (8.1), we get
X
X
X
wi
Xi =
(u0i )1 (i u01 (w1 f1 )) .
i
(8.6)
The function
g1 (y) =
The labeling of the agents is, of course, quite arbitrary, so a similar expression
holds for each fj (just replace the index 1 by j).
Obviously,
the expression on the right of (8.7) is a strictly increasing funcP
tion of i Xi . This fact deserves to be highlighted as a theorem:
Theorem 8 A Pareto-optimal risk exchange is a pool, that is, each individual
agent contributes a share that depends only on the total loss of the group. Moreover, each agent must cover a genuine part of any increase of the total loss of the
group. If all utility functions are continuously differentiable, then the individual
shares are continuous functions of the total loss.
Appendix A
Hilbert spaces
A.1
Metric spaces
A. Metric spaces. Let X be a set, whose elements are called points, and
suppose we assign to each pair of points x and y a non-negative number d(x, y)
measuring the distance between them. The distance function d(, ) : X X 7
R+ should possess the following properties:
d(x, y) > 0 if x 6= y ;
d(x, y) = d(y, x) ,
d(x, y) = 0 if x = y,
(A.1)
(A.2)
(A.3)
These axioms are motivated by geometric notions. For instance, Fig A.1 gives a
planar illustration of the so-called triangle inequality (A.3). If d satisfies (A.1)(A.3), it is called a metric on X and the pair (X , d) is called a metric space.
x
d(x, y)
y
@
@ d(y, z)
@
@
@
@
d(x, z)
91
di (xi , yi ),
(A.4)
(Follows from d(x, z) d(x, y) + d(y, z) and d(y, z) d(y, x) + d(x, z).) Now,
using first the triangle inequality for reals and then (A.4), we obtain
|d(xn , yn ) d(x, y)|
which proves the asserted result. (To be precise, we have shown that the metric
is continuous when considered as a function on X X equipped with any of the
metrics listed at the end of the previous paragraph.)
A.2
92
Vector spaces
(A.5)
x + (y + z) = (x + y) + z .
(A.6)
and associative,
(A.7)
(A.8)
(cd) x = c (dx) ,
(A.9)
and associative,
(A.10)
0x = 0.
(A.11)
and
(A.12)
93
(A.13)
(A.14)
Linearity,
Positive definiteness,
hx, xi > 0 if x 6= 0;
hx, xi = 0 if x = 0.
(A.15)
(The latter part of (A.15) is redundant as it follows by linearity.) From (A.13)(A.14) we deduce that the inner product is bilinear. The pair (X , h, i) is called
an inner product space.
The norm or length of a vector x, denoted by kxk, is defined as
p
kxk = hx, xi .
Obviously, kxk = || kxk.
By straighforward calculation,
(A.16)
(A.17)
94
(A.18)
(A.19)
A.3
Hilbert spaces
If Y is closed and convex, then there exists a unique point y0 Y such that
d(x, y0 ) = d(x; Y). We shall call y0 the closest point to x in Y. The proof of this
95
result goes as follows: Choose a sequence yn in Y such that d(x, yn ) d(x; Y).
We first show that yn is a Cauchy sequence. By the parallellogram law (A.19),
k(yn x) (ym x)k2 + k(yn x) + (ym x)k2 = 2 kyn xk2 + 2 kym xk2 ,
hence
1
kyn ym k2 = 2 kyn xk2 + 2 kym xk2 4 k (yn + ym ) xk2 .
2
Since Y is convex, it contains 21 yn + 12 ym , and so k 21 (yn + ym ) xk2 d2 (x; Y).
It follows that
kyn ym k2 2 kyn xk2 + 2 kym xk2 4 d2 (x, Y) .
As n and m tend to , the expression on the right tends to 0, showing that yn
is Cauchy.
Then, since X is Hilbert, y0 = lim yn must exist. Moreover, since Y is
closed, y0 Y. Since kx y0 k = lim kx yn k = d(x; Y), we conclude that y0
is a closest point to x in Y. It remains only to show that it is unique.
Thus, suppose z0 Y is such that kx z0 k = d(x; Y). On the one hand,
since Y is convex, y0 + (1 )z0 Y for (0, 1), and so
kx y0 (1 )z0 k d(x; Y) .
On the other hand, by the triangle inequality,
kx y0 (1 )z0 k kx y0 k + (1 )kx z0 k = d(x; Y) .
It follows that kx y0 (1 )z0 k = d(x; Y). Thus, the expression
kx y0 (1 )z0 k2 = kx z0 k2 + 2hx z0 , z0 y0 i + 2 kz0 y0 k2 ,
considered as a function of (0, 1), is constant. This is possible only if the
coefficient of the square term is 0, that is, z0 = y0 .
D. Orthogonality. Consider an inner product space (X , h, i). Two vectors x
and y are said to be orthogonal, written x y, if hx, yi = 0. In this case (A.16)
becomes
kx + yk2 = kxk2 + kyk2 ,
(A.20)
96
x pro(x|Y) Y .
(A.21)
(A.22)
(A.23)
(A.24)
y Y ,
k
k
X
X
j pro(xj |Y) .
pro
j xj Y =
j=1
j=1
(A.25)
(A.26)
(Check that the expression on the right fits into the definition of the projection
on the left.) Thus, viewed as a mapping of X onto Y, pro(|Y) is linear.
97
(A.27)
To prove this, we need only check that the iterated projection on the right of
(A.27) satisfies (A.21) and (A.22). The first part is trivial, and the second
part follows by writing x pro(pro(x|Z)|Y) = (x pro(x|Z)) + (pro(x|Z)
pro(pro(x|Z)|Y)) and observing that the first term here is Z Y and the
second term is trivially Y .
X
PP
PP
PP
PP
0P
P
P
PP
P
*
pro(x|Z)
P
PP
P
P
Pq
P
P
pro(x|Y) PPP
PP
PP
PP
PPP
Z
Y
98
and
j = 1, . . . , n.
A.4
One easily checks that h, i satisfies (A.13)-(A.15) under the convention that
X = 0 means P[X = 0] = 1, that is, the elements of this space are classes of
equivalent random variables.
We shall prove that the inner product space (L2 , h, i) is Hilbert, that is,
we shall establish that it is complete. Thus, given a Cauchy sequence Xn , we
must construct a square integrable r.v. such that Xn X. Specify
P a decreasing
sequence of strictly positive numbers i , i = 1, 2, . . . such that i=1 i < . For
each i = 1, 2, . . . let ni be the n corresponding to i in the Cauchy criterion.
Consider the subsequence Xni and form a new sequence with elements Yi =
Xni+1 Xni , i = 1, 2, . . .. By the triangle inequality and the fact that kYi k i ,
we have, for each k = 1, 2, . . .,
k
k
X
X
X
X
|Yi |
kYi k
kYi k
i < .
i=1
i=1
i=1
i=1
X
|Yi |
< .
i=1
Pk
i=1
|Yi | %
i=1
|Yi |, we
(A.28)
99
P
It
that, almost surely,
i=1 |Yi | is finite, hence, by absolute summability,
Pfollows
Y
exists
and
is
finite.
Now,
our candidate limit of Xn is X = Xn1 +
i
Pi=1
Y
,
which
is
certainly
square
integrable.
We have
i
i=1
X
X
kXni Xk =
kYj k ,
Yj
j=i
j=i
Bibliography
Excellent expositions on general Hilbert space theory are [8] and [45].
Hilbert space methods were introduced in credibility theory by De Vylder
in two seminal papers [16] and [15]. From a mathematical point of view, the
essential features of a large class of estimation problems are that the set of
admitted estimators form a linear space and the performance of an estimator is
measured by its distance from the estimand in some suitable sense. This makes
Hilbert spaces the appropriate framework for a general treatment.
Credibility estimation in continuous time models, launched in [42], provides
an example where Hilbert space methods are indispensable.
Appendix B
Matrix algebra
A. Definition of matrices and vectors. An m n (m by n) matrix A is
a rectangular scheme of numbers organised in m horizontal rows and n vertical
columns;
a11 a1n
..
.. .
A = (aij ) = ...
.
.
am1
amn
The number aij in row i and column j is called the (i, j)-entry of A. The space
of m n matrices is denoted by Rmn . When emphasis of dimension is needed,
we shall sometimes write Amn to show that A is in Rmn .
The algebraic operations of addition and scalar multiplication are defined
for matrices by performing them entry-wise. Thus, for Amn = (aij ) and
Bmn = (bij ) the (i, j)-entry of A + B is aij + bij and, for c scalar, the (i, j)entry of c A is c aij . This way Rmn becomes a linear space of dimension mn,
with null element 0mn whose entries are all zero.
The transpose of an m n matrix A is an n m matrix, denoted by A0 ,
whose (i, j)-entry is aji , that is, A0 is obtained by turning the rows of A into
columns (or vice versa). A square matrix A is said to be symmetric if A0 = A.
An n 1 matrix is called a column vector or, more specifically, an n-vector.
We shall denote column vectors by lower case bold letters and with single subscript on the entries, e.g.
a1
a = ... .
an
n1
100
101
A11 A1q
..
.. ,
Amn = (Aij ) = ...
(B.1)
.
.
Apq
Pp
Pq
where each Aij is an mi nj matrix, with i=1 mi = m and j=1 nj = n. In
particular, A may be viewed as a column of row vectors,
0
a1
..
A = . ,
where a0i = (ai1 , . . . , ain ) ,
(B.2)
Ap1
a0m
A = (a 1 , , a n ),
where a j
a1j
= ... .
amj
(B.3)
In a geometric interpretation kxk is the length of x and hx, yi/(kxk kyk) is the
cosine of the angle between x and y. (Draw pictures in R2 !)
102
which is the sum of the squared norms of its columns (or its rows).
E. The matrix product. If A and B are matrices of dimensions m n and
n `, respectively, we define the matrix product AB as the m ` matrix whose
(i, k)-entry
Pnis the scalar product of the i-th row of A and the k-th column of B;
a0i b k = j=1 aij bjk .
One easily checks that the matrix product can be formed at the level of
blocks; if Amn is partitioned as in (B.1) and Bn` is partitioned into B =
(Bjk ), j = 1, . . . , q, k = 1, . . . , r, such that Bjk has dimension nj `k with
P
r
k=1 `k = `, then
X
AB = (
Aij Bjk ) .
(B.4)
j
(B.5)
(B.6)
103
G. Matrices as functions. So far m n matrices have been viewed as elements in the space Rmn . Alternatively they can be viewed as linear functions
from Rn to Rm ; Amn maps an n-vector x to the m-vector Ax. In this perspective the notions of rank, inverse, and identity are well motivated. Also the
term permutation matrix pertains to this idea; operating on a vector with a
permutation matrix amounts to permuting its entries.
If Ann and Bnn are invertible, then AB is invertible and
(AB)1 = B1 A1 .
(B.7)
This follows by observing that B1 A1 AB = I and the fact that the inverse
is unique.
H. Some useful matrix identities. Let A be an invertible matrix partitioned
as
A11 A12
A=
.
A21 A22
Then the inverse of A, partitioned correspondingly as
11
A
A12
,
A1 =
A21 A22
is given by
A11
A12
1
= (A11 A12 A1
,
22 A21 )
1
11
= A A12 A22
22
= A1
11 A12 A ,
(B.8)
(B.9)
(B.10)
and A21 and A22 defined by symmetry (just interchange the roles of 1 and 2 in
sub- and topscripts in (B.8) - (B.10)). The result is straightforwardly verified by
inserting the partitioned forms of the matrices into the defining relation (B.6).
For instance, write A1 A = I as
11
A11 A12
A
A12
I 0
=
,
A21 A22
A21 A22
0 I
and perform the multiplication on the left by the rule (B.4) to obtain
A11 A11 + A12 A21
11
12
A A12 + A A22
= I,
= 0,
(plus two similar equations for A21 and A22 ). The solution to these equations
is (B.8) - (B.9). Starting instead from AA1 = I, we arrive at (B.8) and (B.10).
104
Lemma 1 The following identity holds true for Ann ivertible, Bnm arbitrary,
and Cmm invertible, such that the inverses indicated exist:
(A + BCB0 )1 = A1 A1 B(C1 + B0 A1 B)1 B0 A1 .
(B.11)
1
A1 bb0 A1 .
1 + b0 A1 b
(B.12)
Remark: The result is useful in multivariate analysis where inversion of matrices of the form A + BCB0 is frequently encountered, typically with A and
C some covariance matrices. Apart from producing certain nice formulas, the
result may also reduce computational work. Suppose A is easy to invert (e.g.
a diagonal matrix) and that m < n. Then the n n inversion on the left of
(B.11) reduces to the m m inversion on the right.
Proof: Let us denote the matrix in question by D = (A + BCB0 )1 . By
definition
DA + DBCB0 = I .
(B.13)
n
X
aii .
i=1
(B.14)
105
that is, the trace is invariant under cyclical permutations of the factors in a
(square) matrix product.
J. Determinants. The determinant of a square matrix Ann = (aij ) is defined
as
X
det(A) =
sign(j1 , . . . , jn )a1j1 anjn ,
(B.15)
j1 ,...,jn
where the summation extends over all n! permutations of (1, . . . , n) and sign(j1 , . . . , jn )
is the so-called sign of the permutation, which is +1 or 1 according as the
permutation is even or odd: A permution is even/odd if it is obtained by an
even/odd number of interchanges of positions of entries, two at a time. (There
are, of course, infinitely many ways of obtaining a given permutation through
such interchanges, but it can be shown that either they are all even or they are
all odd, so that these concepts are well defined.) It can be shown that
det(AB) = det(A) det(B)
and that
det(A0 ) = det(A) .
Obviously, for an invertible square matrix A, we have det(A1 ) = (det(A))1 .
Let Aji denote the (n 1) (n 1) matrix obtained by crossing out the i-th
row and the j-th column from A. Determinants can be calculated recursively
by the rule
n
X
det(A) =
aij Cofij
i=1
where
106
(B.16)
or
C0 = C1 .
It follows that also C0 is orthogonal; CC0 = I. Observe that a finite product of
orthogonal matrices is orthogonal. The determinant of an orthogonal matrix C
is 1 since det(C)2 = det(C0 ) det(C) = det(C0 C) = det(I) = 1.
Viewing an orthogonal matrix C as a linear function, we can say that it just
rotates Rn since it preserves all distances:
kCx Cyk2 = (x y)0 C0 C(x y) = kx yk2 .
Lemma 2 If Ann is symmetric, then there exists a diagonal matrix nn =
diag(1 , . . . , n ) and an orthogonal matrix Cnn = (c1 , , cn ) such that
0
A = CC =
n
X
j cj c0j .
(B.17)
n
X
kj cj c0j ,
(B.18)
107
For fixed A this is a quadratic function of (the entries of) x, and as such it
is called a quadratic form. Without loss of generality A can be taken to be
symmetric, and we shall henceforth assume it is. Of course, q(0) = 0. If
q(x) 0 for all x, then q is said to be positive semidefinite (p.s.d.), and the
same terminology goes also for the matrix A itself. If, moreover, q(x) > 0 for
all x 6= 0, then q and A are said to be positive definite (p.d.).
By the representation (B.17) it follows that A is p.s.d. if and only if all
i are non-negative. In that case there exists a symmetric n n matrix A1/2 ,
called the square root of A, such that A = A1/2 A1/2 . This follows by using the
orthogonality of C to write
A = CC0 = C1/2 C0 C1/2 C0 ,
1/2
1/2
EBE0 = diag(1 , . . . , n ) ,
(B.19)
Bibliography
A comprehensive introduction to matrix algebra is found in [3]. Recommended
is also [28], which puts emphasize on representation theorems.
Appendix C
(C.1)
(C.2)
(C.3)
for all y. If the inequality in (C.3) is strict for y 6= x, then ` is called an strictly
upper supporting line or strict upper tangent.
The unique affine ` coinciding with u at two distinct points x and z is
`(y) = u(x) + k(x, z)(y x) ,
(C.4)
with slope
k(x, z) =
u(z) u(x)
.
zx
B. Concave functions.
A function u is said to be concave if any segment of the graph of u lies above
the straight line connecting its endpoints. More precisely, by (C.4),
u(y) u(x) + k(x, z)(y x) ,
108
x < y < z.
(C.5)
109
(C.6)
If the inequality in (C.5) (or (C.6)) is strict, then we say that u is strictly
concave.
A function u is (strictly) convex if u is (strictly) concave.
Theorem 1. A concave function u is continuous. Moreover, it is continuously differentiable almost everywhere, and its derivative u0 (where it exists) is
decreasing. If u is strictly concave, then u0 is strictly decreasing.
Proof: Some simple algebra, or just inspection of Figure C.1, leads to the following equivalent versions of the defining relationship (C.5):
k(x, y) k(x, z) ;
(C.7)
k(x, y) k(y, z) ;
(C.8)
k(x, z) k(y, z) .
(C.9)
(y, u(y))
(x, u(x))
(z, u(z))
exists and is finite. A similar argument shows that the left derivative
u (y) = lim k(x, y)
x%y
(C.10)
110
We interpose that the asserted continuity of u is now proved since the existence of the right (left) derivative implies continuity from the right (left).
The inequality (C.8) (or the figure) implies, furthermore, that
u+ (x) u (z) ,
x < z.
(C.11)
Combining this with (C.10), we conclude that u and u+ are decreasing functions. Now, a decreasing function has at most a countable number of discontinuities, so u and u+ are continuous almost everywhere. Let x be a continuity
point of u . Then, by (C.10) and (C.11), u (x) u+ (x) limz&x u (z) =
u (x), hence u (x) = u+ (x) showing that u0 (x) exists. (Similarly, u0 (x) exists if x is a continuity point of u+ . Thus, possible discontinuity points of the
functions u , u+ , and u0 must coincide.)
Gathering the pieces, we have now proved the results stated for a concave
u. The last statement about a strict concave u is easily added.
The proof of the following result is left as an easy exercise to the diligent
reader.
Lemma 1. A function is (strictly) concave if and only if it possesses a (strict)
upper tangent at each point.
Corollary to Lemma 1. A twice continuously differentiable function is (strictly)
concave if and only if its second order derivative is (strictly) negative.
Proof: For a fixed y0 , Taylors formula says that
1
u(y) = u(y0 ) + u0 (y0 )(y y0 ) + u00 (y )(y y0 )2 ,
2
where y is some point between y0 and y. Since the last term on the right is
strictly positive, it follows that `(y) = u(y0 ) + u0 (y0 )(y y0 ) is a strict upper
tangent in (y0 , u(y0 )). Strict concavity follows by the corollary to Lemma 1.
C. Jensen & Co.
One of the first results we encounter in elementary probability is the relationship
V[Y ] = E[Y 2 ] E2 [Y ], from which it follows that E[Y 2 ] E2 [Y ]. This is a
special case of a celebrated result, due to the Danish mathematician Jensen,
which states that the inequality is true, not only for the square, but for all
convex functions.
Theorem 2 (Jensens inequality). The function u is (strictly) concave if and
only if
E[u(Y )] (>) u(E[Y ])
for every non-degenerate r.v. Y with finite mean and values (only) in I.
(C.12)
111
Proof: For the if part, apply (C.12) to the simple random variable Y with
P[Y = x] = 1 and P[Y = z] = to obtain the defining relation (C.6).
For the only if part, suppose u is concave and let Y be an r.v. as specified
in the theorem. By Lemma 1, u possesses an upper tangent at E[Y ], that is,
there exists a k such that
u(y) u(E[Y ]) + k (y E[Y ]) .
Inserting Y in the role of y and forming expectation, we arrive at (C.12).
One easily adds to the result that strict concavity of u is equivalent to strict
inequality in (C.12).
y0
Figure C.2: Illustration of the Ohlin condition (C.14)
(C.14)
then
Z
(C.15)
for each concave function u such that these integrals are well defined. If u is
strictly concave and F1 6= F2 , then the inequality in (C.15) is strict.
Proof: The conditions imply that F1 and F2 place all of their masses on the
interval I where u is defined. Consider first the case when I = (, ) and
112
the integrals in (C.15) are finite. In view of Lemma 1, let `0 (y) be an upper
tangent to u at y0 and introduce the difference
v(y) = `0 (y) u(y) .
The function v is non-negative, continuous, differentiable almost everywhere,
and
dv(y) 0, y < y0 ,
(C.16)
dv(y) 0, y > y0 .
R
R
Using first (C.13) and the trivial fact that dF1 (y) = dF2 (y), and then integrating by parts, we find
Z
Z
Z
u(y) dF1 (y) u(y) dF2 (y) = v(y) d(F2 F1 )(y)
= v() (F2 F1 )() v() (F2 F1 )()
Z
(F2 F1 )(y) dv(y) .
(C.17)
(C.18)
`x (y) y x ,
u(y) x < y < z ,
ux,z (y) =
`z (y) y z .
R
Obviously ux,z is a concave function defined on R, ux,z (y) dFi (y) is finite for
i = 1, 2, and so the first part of the proof yields
Z
Z
ux,z (y) dF1 (y) ux,z (y) dF2 (y) .
(C.19)
Letting x decrease towards the left endpoint of I and z increase towards the
right endpoint of I, ux,z (y) decreases monotonically to u(y) for each y, and
113
R
R
so ux,z (y) dFi (y) u(y) dFi (y) for i = 1, 2 by monotone convergence. It
follows that the inequality (C.19) carries over to the limits, yielding (C.15).
The last statement in the theorem follows by noting that, firstly, if u is
strictly concave, then the inequalities in (C.16) are strict, and, secondly, if
F1 6= F2 , then (at least one of) the inequalities in (C.14) are strict on some
non-degenerate interval due to the right-continuity of distribution functions.
Corollary to Theorem 3. Let X be a real r.v. assuming values in some open
interval I, and let gi : I 7 R, i = 1, 2, be increasing functions. If
E[g1 (X)] = E[g2 (X)]
(C.20)
x < x0 ,
x > x0 ,
(C.21)
then
E[u(g1 (X))] E[u(g2 (X))] ,
(C.22)
for each concave function u : I 7 R such that these expected values are well
defined. If u is strictly concave and P [g1 (X) 6= g2 (X)] > 0, then the inequality
in (C.22) is strict.
Proof: The result follows by application of Theorem 3 to the cumulative distribution functions Fi (y) = P[gi (X) y], i = 1, 2. We need only to check the
condition (C.14): For y < g2 (x0 ) we have
P[g2 (X) y] = P[g1 (X) y] + P[g2 (X) y < g1 (X)] ,
(C.23)
(C.24)
The final assertion in the corollary follows since P [g1 (X) 6= g2 (X)] > 0 implies that the last term must be strictly positive either in (C.23) or in (C.24) or
in both for some y.
Bibliography
[1]
[2] Borgan . Gill R.D. Keiding N. Andersen, P.K. Statistical Models Based
on Counting Processes. Springer-Verlag, 1993.
[3] T.W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 1958.
[4] E. Arjas. The claims reserving problem in non-life insurance some structural ideas. ASTIN Bull, pages 139152, 1989.
[5] A.L. Bailey. A generalized theory of credibility. Proceedings of the Casualty
Actuarial Society, pages 1320, 1945.
[6] A.L. Bailey. Credibility procedures, la places generalization of bayes rule,
and the combination of collateral knowledge with observed data. Proceedings of the Casualty Actuarial Society, pages 723, 1950.
[7] Pentik
ainen T. Pesonen E. Beard, R. Risk Theory. Chapman and Hall,
1984.
[8] S.K. Berberian. Introduction to Hilbert Space. Oxford University Press,
1961.
[9] J.O. Berger. Statistical Decision Theory and Bayesian Analysis. Springer,
1985.
[10] H. B
uhlmann. Experience rating and credibility. AB, pages 199207, 1967.
[11] H. B
uhlmann. Experience rating and credibility. ASTIN Bull, pages 157
165, 1969.
[12] H. B
uhlmann. Mathematical Methods in Risk Theory. Springer-Verlag,
1970.
[13] Straub E. B
uhlmann, H. Glaubw
urdigkeit f
ur schadens
atze. Mitteil. Ver.
Schweiz. Vers.math., pages 111133, 1970.
[14] M. De Groot. Optimal Statistical Decisions. McGraw-Hill, 1970.
114
BIBLIOGRAPHY
115
BIBLIOGRAPHY
116
[32] W. Neuhaus. Choice of statistics in linear bayes estimation. Scand. Actuarial J., 1985:126.
[33] W. Neuhaus. Inference about parameters in empirical linear bayes estimation problems. Scand. Actuarial J., 1984:131142.
[34] R. Norberg. A class of conjugate hierarchical priors for gammoid likelihoods. Scand. Actuarial J., 1989:177193.
[35] R. Norberg. A contribution to modelling of ibnr claims. Scand. Actuarial
J., 1986:155203.
[36] R. Norberg. Credibility premium plans which make allowance for bonus
hunger. Scand. Actuarial J., 1975:7386.
[37] R. Norberg. A credibility theory for automobile bonus systems. Scand.
Actuarial J., 1976:92107.
[38] R. Norberg. Empirical bayes credibility. Scand. Actuarial J., 1980:177194.
[39] R. Norberg. Experience rating in group life insurance. Scand. Actuarial J.,
1989:194224.
[40] R. Norberg. Hierarchical credibility: analysis of a random effect linear
model with nested classification. Scand. Actuarial J., 1986:204222.
[41] R. Norberg. A note on experience rating of large group life insurance
contracts. Mitteil. Ver. Schweiz. Vers.math., pages 1734, 1987.
[42] R. Norberg. Linear estimation and credibility in continuous time. ASTIN
Bull, pages 149165, 1992.
[43] R. Norberg. Prediction of outstanding liabilities in non-life insurance.
ASTIN Bull, pages 95115, 1993.
[44] H. Robbins. The empirical bayes approach to statistical problems. Ann.
Math. Statist., pages 120, 1964.
[45] H.L. Royden. Real Analysis. Macmillan, New York, 1963.
[46] B. Sundt. On greatest accuracy credibility with limited fluctuation. Scand.
Actuarial J., 109-119.
[47] B. Sundt. Recursive credibility estimation. Scand. Actuarial J., 1981:321.
[48] B. Sundt. An Introduction to Non-Life Insurance Mathematics. Verlag
Versicherungswissenschaft e.V, Karlsruhe., third edition, 1993.
[49] S. Wind. An empirical bayes approach to multiple linear regression. Annals
of Statistics, 1973:93103.