
Insurance Analytics

Prof. Julien Trufin

Academic year 2020-2021

Neural networks

From biology to mathematics

In biology, neural cells (neurons) work in parallel and reorganize themselves
during the learning phase.
Neurons receive input signals through their dendritic trees.
Whether or not a neuron is excited into firing an impulse depends on the sum
of all the excitatory signals.
If the neuron does end up firing, the nerve impulse, or action potential, is
conducted down the axon.


From biology to mathematics

The signals ($p$-vectors $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{ip})$) interact with the dendrites
through synaptic weights ($\boldsymbol{\omega} = (\omega_0, \ldots, \omega_p)$). The dendrites carry the input
signals to the cell body, where they are all summed.
The output signal $\hat{y}_i$ is equal to:
$$
\hat{y}_i = \phi\left(\omega_0 + \sum_{j=1}^{p} \omega_j x_{ij}\right) = \phi\left(\boldsymbol{\omega}^{\top} \begin{pmatrix} 1 \\ \boldsymbol{x}_i \end{pmatrix}\right).
$$
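As a minimal illustration (not part of the course slides), this single-neuron computation can be coded in a few lines of R; the function name neuron and the numerical values are purely illustrative:

neuron <- function(x, w, phi = function(z) 1 / (1 + exp(-z))) {
  # w[1] is the bias omega_0; w[-1] are the synaptic weights omega_1, ..., omega_p
  phi(w[1] + sum(w[-1] * x))
}

# Example: p = 3 input signals with arbitrary weights
neuron(x = c(0.2, -1.0, 0.5), w = c(0.1, 0.4, -0.3, 0.8))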


From biology to mathematics

The most common activation function is the sigmoid or logistic function:
$$
\phi(z) = \frac{1}{1 + e^{-z}} \in [0, 1].
$$
Another activation function for the neurons is the hyperbolic tangent
function, $\tanh(\cdot)$:
$$
\phi(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \in [-1, 1].
$$
These activation functions are bounded. Other popular activation functions
are the rectifier
$$
\phi(z) = \max(z, 0)
$$
and the softplus function (a smooth approximation to the rectifier)
$$
\phi(z) = \log(1 + e^{z}).
$$
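These four activation functions are easy to code and compare; a short base-R sketch (illustrative, not from the slides):

sigmoid  <- function(z) 1 / (1 + exp(-z))   # logistic, values in [0, 1]
tanh_act <- function(z) tanh(z)             # hyperbolic tangent, values in [-1, 1]
relu     <- function(z) pmax(z, 0)          # rectifier
softplus <- function(z) log(1 + exp(z))     # smooth approximation of the rectifier

z <- seq(-4, 4, by = 0.01)
plot(z, sigmoid(z), type = "l", ylim = c(-1, 4.5), ylab = "phi(z)")
lines(z, tanh_act(z), lty = 2)
lines(z, relu(z),     lty = 3)
lines(z, softplus(z), lty = 4)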

Feed-forward neural networks


A neural network is a set of interconnected neurons, and there are infinitely many
possible layouts. Here is an example of a single-layer feed-forward network (or
perceptron):


Feed-forward neural networks

The number of neuronal layers is $n^{\text{net}}$. The number of neurons in the $j$th layer
is denoted $n_j^{\text{net}}$.
Activation function $\phi_{i,j}(\cdot)$: $i$ denotes the position within a layer and $j$ the layer.
The output of the $i$th neuron in hidden or output layer $j$, $\hat{y}_{i,j}$, is
$$
\hat{y}_{i,j} = \phi_{i,j}\left(\omega_{i0}^{j} + \sum_{k=1}^{n_{j-1}^{\text{net}}} \omega_{i,k}^{j}\, \hat{y}_{k,j-1}\right),
$$
where the $\omega_{i,k}^{j}$ are the weights for the $k$th input signal received by neuron
$(i, j)$.
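A minimal sketch of this forward pass in base R, assuming the weights of layer j are stored as a matrix W[[j]] whose first column contains the biases (the storage convention and the random weights are assumptions made for the example):

feed_forward <- function(x, W, phi = function(z) 1 / (1 + exp(-z))) {
  y <- x
  for (j in seq_along(W)) {
    # row i of W[[j]] holds (omega_i0^j, omega_i1^j, ..., omega_i,n_{j-1}^j)
    y <- phi(W[[j]][, 1] + W[[j]][, -1, drop = FALSE] %*% y)
  }
  as.vector(y)
}

# Example: 2 inputs -> 3 hidden neurons -> 1 output neuron, random weights
set.seed(1)
W <- list(matrix(rnorm(3 * 3), nrow = 3),   # hidden layer: 3 neurons x (bias + 2 inputs)
          matrix(rnorm(1 * 4), nrow = 1))   # output layer: 1 neuron  x (bias + 3 inputs)
feed_forward(c(0.5, -0.2), W)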


Feed-forward neural networks

Cybenko's theorem.
Let $\phi(\cdot)$ be a bounded, continuous and non-constant (e.g. sigmoidal) function. Let $I_m$ denote $[0,1]^m$. Given an $\varepsilon > 0$ and any
function $f \in C(I_m)$, there exist $N \in \mathbb{N}$, $v_i, \omega_{0,i} \in \mathbb{R}$ and vectors $\boldsymbol{\omega}_i \in \mathbb{R}^m$, where $i = 1, \ldots, N$,
such that
$$
F(\boldsymbol{x}) = \sum_{i=1}^{N} v_i\, \phi\!\left(\boldsymbol{\omega}_i^{\top} \boldsymbol{x} + \omega_{0,i}\right)
$$
approximates $f$: $|F(\boldsymbol{x}) - f(\boldsymbol{x})| < \varepsilon$ for all $\boldsymbol{x} \in I_m$.

In simple words: any continuous function may be approximated arbitrarily well by a single-layer neural
network.


Deep learning?

A shallow network has only one hidden layer. It is a universal approximator, but the number of
neurons may be high for approximating a given function.
A deep neural network has multiple hidden layers and possibly loops (recurrent networks).
It is a universal approximator with multiple layers, and needs fewer neurons for
approximating high-dimensional functions.


Feed-forward neural networks


Neural networks can be used for regression.

Observations: a sample of values $y_i \in \mathbb{R}$ for $i = 1, \ldots, n$.

Forecasts: estimators $\hat{y}_i \in \mathbb{R}$ for $i = 1, \ldots, n$...

... as a function of explanatory variables $\boldsymbol{x}_i \in \mathbb{R}^p$ for $i = 1, \ldots, n$:

Quantitative variables;
Categorical variables.


Feed-forward neural networks

Example: 100 occurrences of
$$
y = \exp(0.5x + \epsilon), \quad x \in [-1, 1], \quad \epsilon \sim \mathcal{N}(0, 0.15).
$$
A network with 4 hidden neurons estimates the trend well:

ypred(x) = 0.11182
           + 0.51839*phi( 1.29626 + 2.8344*x)
           + 0.62848*phi( 0.65442 - 0.01889*x)
           + 3.31112*phi(-3.91938 + 2.75793*x)
           + 0.03327*phi( 0.07774 + 0.94149*x),   where phi(z) = 1/(1+exp(-z)).
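A sketch of how such a fit could be reproduced with the neuralnet package (the seed and the simulated sample are illustrative, each run returns different fitted weights, and 0.15 is read here as the standard deviation of epsilon):

library(neuralnet)

set.seed(123)
n   <- 100
x   <- runif(n, -1, 1)
y   <- exp(0.5 * x + rnorm(n, 0, 0.15))
dat <- data.frame(x = x, y = y)

# Single hidden layer with 4 neurons, logistic activation, squared-error loss
fit <- neuralnet(y ~ x, data = dat, hidden = 4,
                 act.fct = "logistic", err.fct = "sse", linear.output = TRUE)

# Fitted trend on a grid of x values
grid <- data.frame(x = seq(-1, 1, by = 0.01))
pred <- compute(fit, grid)$net.result
plot(x, y)
lines(grid$x, pred, col = "red")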


LM, GLM, GAM vs neural networks


Comparison with existing approaches:
Input: vector of $d$ explanatory variables, $\boldsymbol{x}$.
Output: $y$, a realization of $Y$ with $\mathbb{E}(Y) = \mu$.
Linear regression: $Y \sim \mathcal{N}(\mu, \sigma^2)$ with
$$
\mu = \boldsymbol{\beta}^{\top} \boldsymbol{x}.
$$
GLM: $Y$ follows an exponential dispersion family distribution with
$$
g(\mu) = \boldsymbol{\beta}^{\top} \boldsymbol{x}.
$$
GAM: $Y$ follows an exponential dispersion family distribution with
$$
g(\mu) = \boldsymbol{\beta}^{\top} \boldsymbol{x} + f_k(x_k) + \cdots + f_j(x_j).
$$
Neural networks: $Y$ follows an exponential dispersion family distribution with
$$
g(\mu) = f(\boldsymbol{x}).
$$


Data preprocessing

Before any computations...

Activation functions quickly converge toward 0 (or -1) and 1 outside a small
interval centered around zero ⇒ scaling of the data!

Quantitative variables (e.g. ages) are scaled according to one of the three following
methods (see the sketch after the formulas):
$$
x_{ij}^{*} = \frac{x_{i,j} - \min_{i=1,\ldots,n}(x_{i,j})}{\max_{i=1,\ldots,n}(x_{i,j}) - \min_{i=1,\ldots,n}(x_{i,j})} \in [0, 1],
$$
$$
x_{ij}^{*} = 2\, \frac{x_{i,j} - \min_{i=1,\ldots,n}(x_{i,j})}{\max_{i=1,\ldots,n}(x_{i,j}) - \min_{i=1,\ldots,n}(x_{i,j})} - 1 \in [-1, 1],
$$
$$
x_{ij}^{*} = \frac{x_{ij} - \mu_j}{\sigma_j} \in \mathbb{R}.
$$
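A minimal base-R sketch of the three scalings (function names and the example ages are illustrative):

scale_01   <- function(x) (x - min(x)) / (max(x) - min(x))   # to [0, 1]
scale_m1p1 <- function(x) 2 * scale_01(x) - 1                # to [-1, 1]
scale_std  <- function(x) (x - mean(x)) / sd(x)              # standardization

age <- c(18, 25, 33, 47, 60, 72)
cbind(scale_01(age), scale_m1p1(age), scale_std(age))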


Data preprocessing
Categorical variables are coded as dummy variables: e.g. if a variable has
the three modalities 'Urban', 'Suburban' and 'Countryside', we create
two dummy input variables as follows (a coding sketch in R follows):
$$
x_{i,d} = \begin{cases} 1 & \text{if Urban} \\ 0 & \text{otherwise,} \end{cases}
\qquad
x_{i,d+1} = \begin{cases} 1 & \text{if Suburban} \\ 0 & \text{otherwise.} \end{cases}
$$
The treatment is similar to the preprocessing for a GLM.
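A sketch of this dummy coding in base R, where model.matrix drops one reference modality automatically (the variable name area and the example values are illustrative):

area <- factor(c("Urban", "Suburban", "Countryside", "Urban"),
               levels = c("Countryside", "Urban", "Suburban"))

# Two dummies, 'Countryside' being the reference modality
dummies <- model.matrix(~ area)[, -1]
dummies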


Training of supervised networks

Training...
Weights are found by minimizing the distance between the output signals of the
network and the observed outputs.

Input signals: $n$ vectors $(\boldsymbol{x}_i)_{i=1,\ldots,n}$ of dimension $p$.

We observe a vector of responses $\boldsymbol{y} = (y_i)_{i=1,\ldots,n}$.

The criterion used for calibration is a loss function $L : \mathbb{R}^2 \to \mathbb{R}$, continuous
with a first-order derivative.

The vector gathering the weights $\omega_i^j$ of the neurons $(i, j)$ is denoted $\Omega$ and is estimated by
$$
\hat{\Omega} = \arg\min_{\Omega} \frac{1}{n} \sum_{i=1}^{n} L\!\left(y_i, \hat{y}_{i,n^{\text{net}}}\right). \tag{1}
$$


Training of supervised networks

$L$ may be the negative log-likelihood, a deviance, a quadratic function
or any other well-behaved penalty function.

To lighten the notation, we denote the vector of estimators by
$$
\hat{\boldsymbol{y}} = (\hat{y}_i)_{i=1,\ldots,n} = (\hat{y}_{i,n^{\text{net}}})_{i=1,\ldots,n}
$$
and the total loss by $R(\Omega) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)$.

The optimal weights are found by minimizing $R(\Omega)$:

Gradient descent;
Back-propagation.


Gradient descent

Start from an initial vector of weights $\Omega_0$. A second-order Taylor expansion of the function
$R$ gives
$$
R(\Omega) \approx R(\Omega_0) + \nabla R(\Omega_0)^{\top} (\Omega - \Omega_0)
+ \frac{1}{2} (\Omega - \Omega_0)^{\top} H\!\left(R(\Omega_0)\right) (\Omega - \Omega_0) + O(3), \tag{2}
$$
where $\nabla R(\Omega_0)$ and $H(R(\Omega_0))$ are respectively the gradient vector and the
Hessian matrix:
$$
\nabla R(\Omega_0) = \left( \frac{\partial}{\partial \omega_{ik}^{j}} R(\Omega) \bigg|_{\Omega_0} \right)_{i,j,k},
\qquad
H\!\left(R(\Omega_0)\right) = \left( \frac{\partial^2}{\partial \omega_{ik}^{j}\, \partial \omega_{ut}^{s}} R(\Omega) \bigg|_{\Omega_0} \right)_{i,j,k,s,u,t}.
$$


Gradient descent

The vector of parameters $\Omega^{*}$ minimizing $R(\Omega)$ cancels the first-order
derivative of the expansion (2):
$$
0 = \nabla R(\Omega_0) + H\!\left(R(\Omega_0)\right) (\Omega^{*} - \Omega_0),
$$
or
$$
\Omega^{*} = \Omega_0 - H\!\left(R(\Omega_0)\right)^{-1} \nabla R(\Omega_0).
$$
Therefore, by iterating this Newton-type update, we can find the optimal weights.
Problems:
Risk of reaching a local minimum...
Inversion of the Hessian matrix is costly...


Back-propagation
Solution: adjust the vector of weights $\Omega_t$ by a small step in the opposite direction
of the gradient (a sketch of this loop in R follows the algorithm):

Algorithm
Main procedure:
For $t = 0$ to the maximum epoch $T$:
1. Calculate the gradient $\nabla R(\Omega_t)$.
2. Update the step size
$$
\rho_{t+1} = \rho_0\, e^{-\alpha t}.
$$
3. Modify the vector of weights:
$$
\Omega_{t+1} = \Omega_t - \rho_{t+1} \nabla R(\Omega_t). \tag{3}
$$
End loop on epochs.

The step size $\rho_{t+1}$ decreases with the epoch.
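A minimal base-R sketch of this loop with exponentially decaying step size, using a finite-difference gradient on a toy quadratic loss (the loss R_fun and the values of rho0 and alpha are illustrative choices, not values from the course):

# Toy loss R(Omega) for a 2-dimensional weight vector
R_fun <- function(Omega) sum((Omega - c(1, -2))^2)

# Finite-difference approximation of the gradient of f at Omega
num_grad <- function(f, Omega, h = 1e-6) {
  sapply(seq_along(Omega), function(k) {
    e <- replace(numeric(length(Omega)), k, h)
    (f(Omega + e) - f(Omega - e)) / (2 * h)
  })
}

Omega <- c(0, 0)                 # initial weights Omega_0
rho0  <- 0.1
alpha <- 0.01
for (t in 0:200) {
  rho   <- rho0 * exp(-alpha * t)                  # step size rho_{t+1} = rho_0 exp(-alpha t)
  Omega <- Omega - rho * num_grad(R_fun, Omega)    # update (3)
}
Omega                            # close to the minimizer (1, -2)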


Measure of goodness of fit

In most applications, the quality of the model is assessed by the total loss
$R(\Omega)$.
There is no rule for determining the best architecture for the neural network.
In practice, we test several models and choose the one with the lowest loss...
But overfitting must be checked.

Overfitting
Overfitting is the production of an analysis that corresponds too closely or exactly
to a particular set of data, and may therefore fail to fit additional data or predict
future observations reliably.

In practice: models that contain more parameters than can be justified by
the data.


Measure of goodness of fit

Solution to control overfitting: split the database into training (e.g. 85%)
and validation (e.g. 15%) samples (a splitting sketch is given below).
Train the neural network on the training set and monitor the training loss.
Evaluate the trained network on the validation set and monitor the validation loss.
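A minimal sketch of such a split in base R (the 85/15 proportions follow the slide; the simulated data frame is illustrative):

set.seed(42)
dat   <- data.frame(x = runif(200, -1, 1))
dat$y <- exp(0.5 * dat$x + rnorm(200, 0, 0.15))

idx      <- sample(seq_len(nrow(dat)), size = floor(0.85 * nrow(dat)))
train    <- dat[idx, ]       # 85% training sample
validate <- dat[-idx, ]      # 15% validation sample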


Measure of goodness of fit


Example: 100 occurrences of $y = \exp(0.5x + \epsilon)$ where $x \in [-1, 1]$, $\epsilon \sim \mathcal{N}(0, 0.15)$. A
network with 4 hidden neurons achieves a good fit on both the training and validation samples.


Measure of goodness of fit


For a (4-10-4) NN, the fit is good on the training sample but poor on the validation set.


Choice of the loss function

Deviance $D(y_i, \hat{y}_i)$:

Normal:
$$
D(y_i, \hat{y}_i) = \nu_i (y_i - \hat{y}_i)^2.
$$
Gamma:
$$
D(y_i, \hat{y}_i) = \begin{cases} 2\nu_i \left( \dfrac{y_i}{\hat{y}_i} - 1 - \log \dfrac{y_i}{\hat{y}_i} \right) & y_i > 0, \\ 0 & y_i = 0. \end{cases}
$$
Poisson:
$$
D(y_i, \hat{y}_i) = \begin{cases} 2\nu_i \left( y_i \log y_i - y_i \log \hat{y}_i - y_i + \hat{y}_i \right) & y_i > 0, \\ 2\nu_i\, \hat{y}_i & y_i = 0. \end{cases}
$$
Binomial:
$$
D(y_i, \hat{y}_i) = \begin{cases} 2\nu_i \left[ y_i \log \dfrac{y_i}{\hat{y}_i} + (1 - y_i) \log \dfrac{1 - y_i}{1 - \hat{y}_i} \right] & 0 < y_i < 1, \\ -2\nu_i \log(1 - \hat{y}_i) & y_i = 0, \\ -2\nu_i \log(\hat{y}_i) & y_i = 1. \end{cases}
$$
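As an illustration, the Poisson and Gamma unit deviances of this table can be coded directly in base R (a sketch; nu stands for the weights nu_i, set to 1 by default):

dev_poisson <- function(y, yhat, nu = 1) {
  2 * nu * ifelse(y > 0, y * log(y) - y * log(yhat) - y + yhat, yhat)
}

dev_gamma <- function(y, yhat, nu = 1) {
  ifelse(y > 0, 2 * nu * (y / yhat - 1 - log(y / yhat)), 0)
}

# Example: total Poisson deviance for a few observations
y    <- c(0, 1, 2, 0)
yhat <- c(0.1, 0.8, 1.5, 0.3)
sum(dev_poisson(y, yhat))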


Processing of the network output

Up to now, we have assumed that the estimator is the output of the last layer
of neurons: $(\hat{y}_i)_{i=1,\ldots,n} = (\hat{y}_{i,n^{\text{net}}})_{i=1,\ldots,n}$.
However, the domain of $Y_i$ depends on the distribution ($\mathbb{R}$ for the Gaussian,
$\mathbb{R}^{+}$ for the Gamma and Poisson).
For these last two distributions, we transform the output signal of the neural
network with a function $g(\cdot)$ to ensure that the estimator lies in the
domain of definition of $Y_i$:
$$
\hat{y}_i = g\!\left(\hat{y}_{i,n^{\text{net}}}\right), \quad i = 1, \ldots, n.
$$
The next table presents two standard transformations.

On the other hand, the gradient $\frac{\partial}{\partial \omega_{ik}^{j}} R(\Omega)$ is computed in two steps.


Processing of the network output

First, we calculate analytically the derivative of the deviance with respect to
the output signal $\hat{y}_{i,n^{\text{net}}}$. Next, we calculate numerically the derivative of the
output signal:
$$
\frac{\partial R(\Omega)}{\partial \omega_{ik}^{j}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{\partial D\!\left(y_i, \hat{y}_{i,n^{\text{net}}}\right)}{\partial \hat{y}_{i,n^{\text{net}}}}\,
\frac{\partial \hat{y}_{i,n^{\text{net}}}}{\partial \omega_{ik}^{j}}.
$$
Transform $g(\cdot)$ and derivative of the deviance, by distribution:

Normal: no transform, $\hat{y}_i := \hat{y}_{i,n^{\text{net}}}$; derivative $2\nu_i\left(\hat{y}_{i,n^{\text{net}}} - y_i\right)$.
Gamma: exponential transform, $\hat{y}_i := \exp(\hat{y}_{i,n^{\text{net}}})$; derivative $2\nu_i\left(1 - y_i\, e^{-\hat{y}_{i,n^{\text{net}}}}\right)$.
Poisson: exponential transform, $\hat{y}_i := \exp(\hat{y}_{i,n^{\text{net}}})$; derivative $2\nu_i\left(e^{\hat{y}_{i,n^{\text{net}}}} - y_i\right)$.


Measure of goodness of fit

1. Comparison of deviances on the training and validation samples (if possible).

2. Other criteria measuring the goodness of fit, AIC and BIC (a sketch follows the list): let $m$ be the
number of neural weights,
$$
\text{AIC} = 2m - 2\hat{l}(\hat{\boldsymbol{y}}),
$$
$$
\text{BIC} = \ln(n)\, m - 2\hat{l}(\hat{\boldsymbol{y}}).
$$
The preferred model is the one with the lowest AIC/BIC. The AIC/BIC
rewards goodness of fit but penalizes models with a large number of parameters.

3. K-fold cross-validation.
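A base-R sketch of the AIC/BIC computation for a Poisson model (the fitted values, the number of weights m and the log-likelihood helper are illustrative):

# Poisson log-likelihood of the fitted values yhat given observations y
loglik_poisson <- function(y, yhat) sum(dpois(y, lambda = yhat, log = TRUE))

y    <- c(0, 1, 2, 0, 1)
yhat <- c(0.2, 0.9, 1.4, 0.3, 0.7)
m    <- 69                      # e.g. number of weights of the NN(4) model
n    <- length(y)

AIC_nn <- 2 * m - 2 * loglik_poisson(y, yhat)
BIC_nn <- log(n) * m - 2 * loglik_poisson(y, yhat)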


Frequencies of claims

$n$ vectors $(\boldsymbol{x}_i)_{i=1,\ldots,n}$ of dimension $d$. We observe a vector of responses
$\boldsymbol{y} = (y_i)_{i=1,\ldots,n}$.
We assume that the $Y_i$ are distributed according to a Poisson distribution
with probability mass function of the form
$$
P(Y_i = k) = \exp(-\lambda_i d_i)\, \frac{(\lambda_i d_i)^k}{k!},
$$
with $d_i$ the exposure-to-risk.
The domain of $Y_i$ depends on the distribution: here $\mathbb{R}^{+}$ for the Poisson law, so
$$
\hat{y}_i = \exp\!\left(\hat{y}_{i,n^{\text{net}}}\right), \quad i = 1, \ldots, n.
$$
We work with logistic activation functions.

The deviance is minimized by resilient back-propagation (a kind of gradient
descent).


Estimation

Quantitative variables (e.g. ages) are scaled:
$$
x_{ij}^{*} = \frac{x_{i,j} - \min_{i=1,\ldots,n}(x_{i,j})}{\max_{i=1,\ldots,n}(x_{i,j}) - \min_{i=1,\ldots,n}(x_{i,j})}.
$$

Categorical variables are coded as dummy variables: e.g. if a variable has
the three modalities 'Urban', 'Suburban' and 'Countryside', we create
two dummy input variables as follows:
$$
x_{i,d} = \begin{cases} 1 & \text{if Urban} \\ 0 & \text{otherwise,} \end{cases}
\qquad
x_{i,d+1} = \begin{cases} 1 & \text{if Suburban} \\ 0 & \text{otherwise.} \end{cases}
$$


Application to WASA dataset

Database: 62,436 contracts but only 693 claims, so we fit the models on the
whole dataset:

Model     # of hidden neurons   # of weights   Deviance   AIC       BIC
NN(3)     3                     52             5664.36    7116.93   7587.11
NN(4)     4                     69             5564.86    7051.43   7675.32
NN(5)     5                     86             5546.91    7067.48   7845.08
NN(2,2)   2 × 2                 41             5684.57    7115.14   7485.86
NN(3,3)   3 × 3                 57             5499.72    7080.29   8129.15
GLM       -                     16             5781.66    7162.23   7306.90

The NN(4) model is chosen, as it has the lowest AIC.


Application to WASA dataset

Output of hidden neurons $i = 1, 2, 3, 4$:
$$
\hat{y}_{i,1} = \phi\left(\omega_{i0}^{1} + \sum_{k=1}^{15} \omega_{ik}^{1}\, x_k\right).
$$
Frequency estimates:
$$
\mu = g^{-1}(f_{\text{net}}(\boldsymbol{x}))
    = \exp\left(\omega_{10}^{2} + \sum_{k=1}^{4} \omega_{1k}^{2}\, \hat{y}_{k,1}\right).
$$


Application to WASA dataset

Forecast frequencies of claims for drivers of a 4-year-old vehicle, with the NN(4)
model.


Application to WASA dataset

Forecast frequencies of claims for drivers of a 4-year-old vehicle, with the NN(4)
model.


Package 'neuralnet' in R

Training of neural networks using backpropagation or resilient
backpropagation (Riedmiller, 1994).

neuralnet(formula, data, hidden = 1, threshold = 0.01,
          stepmax = 1e+05, rep = 1, learningrate = NULL,
          algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
          linear.output = TRUE, exclude = NULL, ...)


Package 'neuralnet' in R

Other functions:
compute(x, covariate, rep = 1)

Computes the outputs of all neurons for specific covariate vectors, given a trained
neural network x.
rep: an integer indicating which repetition of the neural network should be used.

gwplot(x, rep = NULL, ...)

gwplot: a method for plotting the generalized weights of objects produced by neuralnet.

See the R file 'neuralNetExample.R'.
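A short usage sketch combining these functions on the toy regression example of the earlier slides (the simulated data, the hidden-layer size and the plotting options are illustrative assumptions):

library(neuralnet)

set.seed(1)
dat   <- data.frame(x = runif(100, -1, 1))
dat$y <- exp(0.5 * dat$x + rnorm(100, 0, 0.15))
fit   <- neuralnet(y ~ x, data = dat, hidden = 4)

# Network output for new covariate values
new  <- data.frame(x = seq(-1, 1, by = 0.1))
pred <- compute(fit, new)$net.result

# Generalized weights of the first repetition
gwplot(fit, rep = 1)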
