
Insurance Analytics

Prof. Julien Trufin

Academic year 2020-2021

Neural networks

From biology to mathematics

In biology, neural cells (neurons) work in parallel and reorganize themselves
during the learning phase.
Neurons receive input signals through their dendritic trees.
Whether or not a neuron is excited into firing an impulse depends on the sum
of all the excitatory signals.
If the neuron does end up firing, the nerve impulse, or action potential, is
conducted down the axon.


From biology to mathematics

The signals ($p$-vectors $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{ip})$) interact with the dendrites
through synaptic weights ($\boldsymbol{\omega} = (\omega_0, \ldots, \omega_p)$). The dendrites carry the input
signals to the cell body, where they are all summed.
The output signal $\hat{y}_i$ is equal to:
$$
\hat{y}_i = \phi\left(\omega_0 + \sum_{j=1}^{p} \omega_j x_{ij}\right) = \phi\left(\boldsymbol{\omega}^{\top} \begin{pmatrix} 1 \\ \boldsymbol{x}_i \end{pmatrix}\right).
$$
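As a minimal illustration (not part of the course slides), this single-neuron computation can be coded in a few lines of R; the function name neuron and the numerical values are purely illustrative:

neuron <- function(x, w, phi = function(z) 1 / (1 + exp(-z))) {
  # w[1] is the bias omega_0; w[-1] are the synaptic weights omega_1, ..., omega_p
  phi(w[1] + sum(w[-1] * x))
}

# Example: p = 3 input signals with arbitrary weights
neuron(x = c(0.2, -1.0, 0.5), w = c(0.1, 0.4, -0.3, 0.8))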


From biology to mathematics

The most common activation function is the sigmoid or logistic function:
$$
\phi(z) = \frac{1}{1 + e^{-z}} \in [0, 1].
$$
Another activation function for the neurons is the hyperbolic tangent
function, $\tanh(\cdot)$:
$$
\phi(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} \in [-1, 1].
$$
These activation functions are bounded. Other popular activation functions
are the rectifier
$$
\phi(z) = \max(z, 0)
$$
and the softplus function (a smooth approximation to the rectifier)
$$
\phi(z) = \log(1 + e^{z}).
$$
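These four activation functions are easy to code and compare; a short base-R sketch (illustrative, not from the slides):

sigmoid  <- function(z) 1 / (1 + exp(-z))   # logistic, values in [0, 1]
tanh_act <- function(z) tanh(z)             # hyperbolic tangent, values in [-1, 1]
relu     <- function(z) pmax(z, 0)          # rectifier
softplus <- function(z) log(1 + exp(z))     # smooth approximation of the rectifier

z <- seq(-4, 4, by = 0.01)
plot(z, sigmoid(z), type = "l", ylim = c(-1, 4.5), ylab = "phi(z)")
lines(z, tanh_act(z), lty = 2)
lines(z, relu(z),     lty = 3)
lines(z, softplus(z), lty = 4)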

Feed-forward neural networks


A neural network is a set of interconnected neurons, and there are infinitely many
possible layouts. Here is an example of a single-layer feed-forward network (or
perceptron):


Feed-forward neural networks

The number of neuronal layers is $n^{\text{net}}$. The number of neurons in the $j$th layer
is denoted $n_j^{\text{net}}$.
Activation function $\phi_{i,j}(\cdot)$: $i$ denotes the position within a layer and $j$ the layer.
The output of the $i$th neuron in hidden or output layer $j$, $\hat{y}_{i,j}$, is
$$
\hat{y}_{i,j} = \phi_{i,j}\left(\omega_{i0}^{j} + \sum_{k=1}^{n_{j-1}^{\text{net}}} \omega_{i,k}^{j}\, \hat{y}_{k,j-1}\right),
$$
where the $\omega_{i,k}^{j}$ are the weights for the $k$th input signal received by neuron
$(i, j)$.
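A minimal sketch of this forward pass in base R, assuming the weights of layer j are stored as a matrix W[[j]] whose first column contains the biases (the storage convention and the random weights are assumptions made for the example):

feed_forward <- function(x, W, phi = function(z) 1 / (1 + exp(-z))) {
  y <- x
  for (j in seq_along(W)) {
    # row i of W[[j]] holds (omega_i0^j, omega_i1^j, ..., omega_i,n_{j-1}^j)
    y <- phi(W[[j]][, 1] + W[[j]][, -1, drop = FALSE] %*% y)
  }
  as.vector(y)
}

# Example: 2 inputs -> 3 hidden neurons -> 1 output neuron, random weights
set.seed(1)
W <- list(matrix(rnorm(3 * 3), nrow = 3),   # hidden layer: 3 neurons x (bias + 2 inputs)
          matrix(rnorm(1 * 4), nrow = 1))   # output layer: 1 neuron  x (bias + 3 inputs)
feed_forward(c(0.5, -0.2), W)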


Feed-forward neural networks

Cybenko's theorem.
Let $\phi(\cdot)$ be a bounded, continuous and non-constant (e.g. sigmoidal) function. Let $I_m$ denote $[0,1]^m$. Given an $\varepsilon > 0$ and any
function $f \in C(I_m)$, there exist $N \in \mathbb{N}$, $v_i, \omega_{0,i} \in \mathbb{R}$ and vectors $\boldsymbol{\omega}_i \in \mathbb{R}^m$, where $i = 1, \ldots, N$,
such that
$$
F(\boldsymbol{x}) = \sum_{i=1}^{N} v_i\, \phi\!\left(\boldsymbol{\omega}_i^{\top} \boldsymbol{x} + \omega_{0,i}\right)
$$
approximates $f$: $|F(\boldsymbol{x}) - f(\boldsymbol{x})| < \varepsilon$ for all $\boldsymbol{x} \in I_m$.

In simple words: any continuous function may be approximated arbitrarily well by a single-layer neural
network.


Deep learning?

A shallow network has only one hidden layer. It is a universal approximator, but the number of
neurons may be high for approximating a given function.
A deep neural network has multiple hidden layers and possibly loops (recurrent networks).
It is a universal approximator with multiple layers, and needs fewer neurons for
approximating high-dimensional functions.


Feed-forward neural networks


Neural networks can be used for regression.

Observations: a sample of values $y_i \in \mathbb{R}$ for $i = 1, \ldots, n$.

Forecasts: estimators $\hat{y}_i \in \mathbb{R}$ for $i = 1, \ldots, n$...

... as a function of explanatory variables $\boldsymbol{x}_i \in \mathbb{R}^p$ for $i = 1, \ldots, n$:

Quantitative variables;
Categorical variables.


Feed-forward neural networks

Example: 100 occurrences of
$$
y = \exp(0.5x + \epsilon), \quad x \in [-1, 1], \quad \epsilon \sim \mathcal{N}(0, 0.15).
$$
A network with 4 hidden neurons estimates the trend well:

ypred(x) = 0.11182
           + 0.51839*phi( 1.29626 + 2.8344*x)
           + 0.62848*phi( 0.65442 - 0.01889*x)
           + 3.31112*phi(-3.91938 + 2.75793*x)
           + 0.03327*phi( 0.07774 + 0.94149*x),   where phi(z) = 1/(1+exp(-z)).
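A sketch of how such a fit could be reproduced with the neuralnet package (the seed and the simulated sample are illustrative, each run returns different fitted weights, and 0.15 is read here as the standard deviation of epsilon):

library(neuralnet)

set.seed(123)
n   <- 100
x   <- runif(n, -1, 1)
y   <- exp(0.5 * x + rnorm(n, 0, 0.15))
dat <- data.frame(x = x, y = y)

# Single hidden layer with 4 neurons, logistic activation, squared-error loss
fit <- neuralnet(y ~ x, data = dat, hidden = 4,
                 act.fct = "logistic", err.fct = "sse", linear.output = TRUE)

# Fitted trend on a grid of x values
grid <- data.frame(x = seq(-1, 1, by = 0.01))
pred <- compute(fit, grid)$net.result
plot(x, y)
lines(grid$x, pred, col = "red")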


LM, GLM, GAM vs neural networks


Comparison with existing approaches:
Input: vector of $d$ explanatory variables, $\boldsymbol{x}$.
Output: $y$, a realization of $Y$ with $\mathbb{E}(Y) = \mu$.
Linear regression: $Y \sim \mathcal{N}(\mu, \sigma^2)$ with
$$
\mu = \boldsymbol{\beta}^{\top} \boldsymbol{x}.
$$
GLM: $Y$ follows an exponential dispersion family distribution with
$$
g(\mu) = \boldsymbol{\beta}^{\top} \boldsymbol{x}.
$$
GAM: $Y$ follows an exponential dispersion family distribution with
$$
g(\mu) = \boldsymbol{\beta}^{\top} \boldsymbol{x} + f_k(x_k) + \cdots + f_j(x_j).
$$
Neural networks: $Y$ follows an exponential dispersion family distribution with
$$
g(\mu) = f(\boldsymbol{x}).
$$


Data preprocessing

Before any computations...

Activation functions quickly converge toward 0 (or -1) and 1 outside a small
interval centered around zero ⇒ scaling of the data!

Quantitative variables (e.g. ages) are scaled according to one of the three following
methods (see the sketch after the formulas):
$$
x_{ij}^{*} = \frac{x_{i,j} - \min_{i=1,\ldots,n}(x_{i,j})}{\max_{i=1,\ldots,n}(x_{i,j}) - \min_{i=1,\ldots,n}(x_{i,j})} \in [0, 1],
$$
$$
x_{ij}^{*} = 2\, \frac{x_{i,j} - \min_{i=1,\ldots,n}(x_{i,j})}{\max_{i=1,\ldots,n}(x_{i,j}) - \min_{i=1,\ldots,n}(x_{i,j})} - 1 \in [-1, 1],
$$
$$
x_{ij}^{*} = \frac{x_{ij} - \mu_j}{\sigma_j} \in \mathbb{R}.
$$
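A minimal base-R sketch of the three scalings (function names and the example ages are illustrative):

scale_01   <- function(x) (x - min(x)) / (max(x) - min(x))   # to [0, 1]
scale_m1p1 <- function(x) 2 * scale_01(x) - 1                # to [-1, 1]
scale_std  <- function(x) (x - mean(x)) / sd(x)              # standardization

age <- c(18, 25, 33, 47, 60, 72)
cbind(scale_01(age), scale_m1p1(age), scale_std(age))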


Data preprocessing
Categorical variables are coded as dummy variables: e.g. if a variable has
the three modalities 'Urban', 'Suburban' and 'Countryside', we create
two dummy input variables as follows (a coding sketch in R follows):
$$
x_{i,d} = \begin{cases} 1 & \text{if Urban} \\ 0 & \text{otherwise,} \end{cases}
\qquad
x_{i,d+1} = \begin{cases} 1 & \text{if Suburban} \\ 0 & \text{otherwise.} \end{cases}
$$
The treatment is similar to the preprocessing for a GLM.
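A sketch of this dummy coding in base R, where model.matrix drops one reference modality automatically (the variable name area and the example values are illustrative):

area <- factor(c("Urban", "Suburban", "Countryside", "Urban"),
               levels = c("Countryside", "Urban", "Suburban"))

# Two dummies, 'Countryside' being the reference modality
dummies <- model.matrix(~ area)[, -1]
dummies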


Training of supervised networks

Training...
Weights are found by minimizing the distance between the output signals of the
network and the observed outputs.

Input signals: $n$ vectors $(\boldsymbol{x}_i)_{i=1,\ldots,n}$ of dimension $p$.

We observe a vector of responses $\boldsymbol{y} = (y_i)_{i=1,\ldots,n}$.

The criterion used for calibration is a loss function $L : \mathbb{R}^2 \to \mathbb{R}$, continuous
with a first-order derivative.

The vector gathering the weights $\omega_i^j$ of the neurons $(i, j)$ is denoted $\Omega$ and is estimated by
$$
\hat{\Omega} = \arg\min_{\Omega} \frac{1}{n} \sum_{i=1}^{n} L\!\left(y_i, \hat{y}_{i,n^{\text{net}}}\right). \tag{1}
$$


Training of supervised networks

$L$ may be the negative log-likelihood, a deviance, a quadratic function
or any other well-behaved penalty function.

To lighten the notation, we denote the vector of estimators by
$$
\hat{\boldsymbol{y}} = (\hat{y}_i)_{i=1,\ldots,n} = (\hat{y}_{i,n^{\text{net}}})_{i=1,\ldots,n}
$$
and the total loss by $R(\Omega) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)$.

The optimal weights are found by minimizing $R(\Omega)$:

Gradient descent;
Back-propagation.


Gradient descent

Start from an initial vector of weights $\Omega_0$. A second-order Taylor expansion of the function
$R$ gives
$$
R(\Omega) \approx R(\Omega_0) + \nabla R(\Omega_0)^{\top} (\Omega - \Omega_0)
+ \frac{1}{2} (\Omega - \Omega_0)^{\top} H\!\left(R(\Omega_0)\right) (\Omega - \Omega_0) + O(3), \tag{2}
$$
where $\nabla R(\Omega_0)$ and $H(R(\Omega_0))$ are respectively the gradient vector and the
Hessian matrix:
$$
\nabla R(\Omega_0) = \left( \frac{\partial}{\partial \omega_{ik}^{j}} R(\Omega) \bigg|_{\Omega_0} \right)_{i,j,k},
\qquad
H\!\left(R(\Omega_0)\right) = \left( \frac{\partial^2}{\partial \omega_{ik}^{j}\, \partial \omega_{ut}^{s}} R(\Omega) \bigg|_{\Omega_0} \right)_{i,j,k,s,u,t}.
$$


Gradient descent

The vector of parameters $\Omega^{*}$ minimizing $R(\Omega)$ cancels the first-order
derivative of the expansion (2):
$$
0 = \nabla R(\Omega_0) + H\!\left(R(\Omega_0)\right) (\Omega^{*} - \Omega_0),
$$
or
$$
\Omega^{*} = \Omega_0 - H\!\left(R(\Omega_0)\right)^{-1} \nabla R(\Omega_0).
$$
Therefore, by iterating this Newton-type update, we can find the optimal weights.
Problems:
Risk of reaching a local minimum...
Inversion of the Hessian matrix is costly...


Back-propagation
Solution: adjust the vector of weights $\Omega_t$ by a small step in the opposite direction
of the gradient (a sketch of this loop in R follows the algorithm):

Algorithm
Main procedure:
For $t = 0$ to the maximum epoch $T$:
1. Calculate the gradient $\nabla R(\Omega_t)$.
2. Update the step size
$$
\rho_{t+1} = \rho_0\, e^{-\alpha t}.
$$
3. Modify the vector of weights:
$$
\Omega_{t+1} = \Omega_t - \rho_{t+1} \nabla R(\Omega_t). \tag{3}
$$
End loop on epochs.

The step size $\rho_{t+1}$ decreases with the epoch.
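A minimal base-R sketch of this loop with exponentially decaying step size, using a finite-difference gradient on a toy quadratic loss (the loss R_fun and the values of rho0 and alpha are illustrative choices, not values from the course):

# Toy loss R(Omega) for a 2-dimensional weight vector
R_fun <- function(Omega) sum((Omega - c(1, -2))^2)

# Finite-difference approximation of the gradient of f at Omega
num_grad <- function(f, Omega, h = 1e-6) {
  sapply(seq_along(Omega), function(k) {
    e <- replace(numeric(length(Omega)), k, h)
    (f(Omega + e) - f(Omega - e)) / (2 * h)
  })
}

Omega <- c(0, 0)                 # initial weights Omega_0
rho0  <- 0.1
alpha <- 0.01
for (t in 0:200) {
  rho   <- rho0 * exp(-alpha * t)                  # step size rho_{t+1} = rho_0 exp(-alpha t)
  Omega <- Omega - rho * num_grad(R_fun, Omega)    # update (3)
}
Omega                            # close to the minimizer (1, -2)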


Measure of goodness of fit

In most applications, the quality of the model is assessed by the total loss
$R(\Omega)$.
There is no rule for determining the best architecture for the neural network.
In practice, we test several models and choose the one with the lowest loss...
But overfitting must be checked.

Overfitting
Overfitting is the production of an analysis that corresponds too closely or exactly
to a particular set of data, and may therefore fail to fit additional data or predict
future observations reliably.

In practice: models that contain more parameters than can be justified by
the data.


Measure of goodness of fit

Solution to control overfitting: split the database into training (e.g. 85%)
and validation (e.g. 15%) samples (a splitting sketch is given below).
Train the neural network on the training set and monitor the training loss.
Evaluate the trained network on the validation set and monitor the validation loss.
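A minimal sketch of such a split in base R (the 85/15 proportions follow the slide; the simulated data frame is illustrative):

set.seed(42)
dat   <- data.frame(x = runif(200, -1, 1))
dat$y <- exp(0.5 * dat$x + rnorm(200, 0, 0.15))

idx      <- sample(seq_len(nrow(dat)), size = floor(0.85 * nrow(dat)))
train    <- dat[idx, ]       # 85% training sample
validate <- dat[-idx, ]      # 15% validation sample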


Measure of goodness of fit


Example: 100 occurrences of $y = \exp(0.5x + \epsilon)$ where $x \in [-1, 1]$, $\epsilon \sim \mathcal{N}(0, 0.15)$. A
network with 4 hidden neurons achieves a good fit on both the training and validation samples.


Measure of goodness of fit


For a (4-10-4) NN, the fit is good on the training sample but poor on the validation set.


Choice of the loss function

Deviance $D(y_i, \hat{y}_i)$:

Normal:
$$
D(y_i, \hat{y}_i) = \nu_i (y_i - \hat{y}_i)^2.
$$
Gamma:
$$
D(y_i, \hat{y}_i) = \begin{cases} 2\nu_i \left( \dfrac{y_i}{\hat{y}_i} - 1 - \log \dfrac{y_i}{\hat{y}_i} \right) & y_i > 0, \\ 0 & y_i = 0. \end{cases}
$$
Poisson:
$$
D(y_i, \hat{y}_i) = \begin{cases} 2\nu_i \left( y_i \log y_i - y_i \log \hat{y}_i - y_i + \hat{y}_i \right) & y_i > 0, \\ 2\nu_i\, \hat{y}_i & y_i = 0. \end{cases}
$$
Binomial:
$$
D(y_i, \hat{y}_i) = \begin{cases} 2\nu_i \left[ y_i \log \dfrac{y_i}{\hat{y}_i} + (1 - y_i) \log \dfrac{1 - y_i}{1 - \hat{y}_i} \right] & 0 < y_i < 1, \\ -2\nu_i \log(1 - \hat{y}_i) & y_i = 0, \\ -2\nu_i \log(\hat{y}_i) & y_i = 1. \end{cases}
$$
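As an illustration, the Poisson and Gamma unit deviances of this table can be coded directly in base R (a sketch; nu stands for the weights nu_i, set to 1 by default):

dev_poisson <- function(y, yhat, nu = 1) {
  2 * nu * ifelse(y > 0, y * log(y) - y * log(yhat) - y + yhat, yhat)
}

dev_gamma <- function(y, yhat, nu = 1) {
  ifelse(y > 0, 2 * nu * (y / yhat - 1 - log(y / yhat)), 0)
}

# Example: total Poisson deviance for a few observations
y    <- c(0, 1, 2, 0)
yhat <- c(0.1, 0.8, 1.5, 0.3)
sum(dev_poisson(y, yhat))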


Processing of the network output

Up to now, we have assumed that the estimator is the output of the last layer
of neurons: $(\hat{y}_i)_{i=1,\ldots,n} = (\hat{y}_{i,n^{\text{net}}})_{i=1,\ldots,n}$.
However, the domain of $Y_i$ depends on the distribution ($\mathbb{R}$ for the Gaussian,
$\mathbb{R}^{+}$ for the Gamma and Poisson).
For these last two distributions, we transform the output signal of the neural
network with a function $g(\cdot)$ to ensure that the estimator lies in the
domain of definition of $Y_i$:
$$
\hat{y}_i = g\!\left(\hat{y}_{i,n^{\text{net}}}\right), \quad i = 1, \ldots, n.
$$
The next table presents two standard transformations.

On the other hand, the gradient $\frac{\partial}{\partial \omega_{ik}^{j}} R(\Omega)$ is computed in two steps.


Processing of the network output

First, we calculate analytically the derivative of the deviance with respect to
the output signal $\hat{y}_{i,n^{\text{net}}}$. Next, we calculate numerically the derivative of the
output signal:
$$
\frac{\partial R(\Omega)}{\partial \omega_{ik}^{j}}
= \frac{1}{n} \sum_{i=1}^{n} \frac{\partial D\!\left(y_i, \hat{y}_{i,n^{\text{net}}}\right)}{\partial \hat{y}_{i,n^{\text{net}}}}\,
\frac{\partial \hat{y}_{i,n^{\text{net}}}}{\partial \omega_{ik}^{j}}.
$$
Transform $g(\cdot)$ and derivative of the deviance, by distribution:

Normal: no transform, $\hat{y}_i := \hat{y}_{i,n^{\text{net}}}$; derivative $2\nu_i\left(\hat{y}_{i,n^{\text{net}}} - y_i\right)$.
Gamma: exponential transform, $\hat{y}_i := \exp(\hat{y}_{i,n^{\text{net}}})$; derivative $2\nu_i\left(1 - y_i\, e^{-\hat{y}_{i,n^{\text{net}}}}\right)$.
Poisson: exponential transform, $\hat{y}_i := \exp(\hat{y}_{i,n^{\text{net}}})$; derivative $2\nu_i\left(e^{\hat{y}_{i,n^{\text{net}}}} - y_i\right)$.


Measure of goodness of fit

1. Comparison of deviances on the training and validation samples (if possible).

2. Other criteria measuring the goodness of fit, AIC and BIC (a sketch follows the list): let $m$ be the
number of neural weights,
$$
\text{AIC} = 2m - 2\hat{l}(\hat{\boldsymbol{y}}),
$$
$$
\text{BIC} = \ln(n)\, m - 2\hat{l}(\hat{\boldsymbol{y}}).
$$
The preferred model is the one with the lowest AIC/BIC. The AIC/BIC
rewards goodness of fit but penalizes models with a large number of parameters.

3. K-fold cross-validation.
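A base-R sketch of the AIC/BIC computation for a Poisson model (the fitted values, the number of weights m and the log-likelihood helper are illustrative):

# Poisson log-likelihood of the fitted values yhat given observations y
loglik_poisson <- function(y, yhat) sum(dpois(y, lambda = yhat, log = TRUE))

y    <- c(0, 1, 2, 0, 1)
yhat <- c(0.2, 0.9, 1.4, 0.3, 0.7)
m    <- 69                      # e.g. number of weights of the NN(4) model
n    <- length(y)

AIC_nn <- 2 * m - 2 * loglik_poisson(y, yhat)
BIC_nn <- log(n) * m - 2 * loglik_poisson(y, yhat)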


Frequencies of claims

$n$ vectors $(\boldsymbol{x}_i)_{i=1,\ldots,n}$ of dimension $d$. We observe a vector of responses
$\boldsymbol{y} = (y_i)_{i=1,\ldots,n}$.
We assume that the $Y_i$ are distributed according to a Poisson distribution
with probability mass function of the form
$$
P(Y_i = k) = \exp(-\lambda_i d_i)\, \frac{(\lambda_i d_i)^k}{k!},
$$
with $d_i$ the exposure-to-risk.
The domain of $Y_i$ depends on the distribution: here $\mathbb{R}^{+}$ for the Poisson law, so
$$
\hat{y}_i = \exp\!\left(\hat{y}_{i,n^{\text{net}}}\right), \quad i = 1, \ldots, n.
$$
We work with logistic activation functions.

The deviance is minimized by resilient back-propagation (a kind of gradient
descent).


Estimation

Quantitative variables (e.g. ages) are scaled:
$$
x_{ij}^{*} = \frac{x_{i,j} - \min_{i=1,\ldots,n}(x_{i,j})}{\max_{i=1,\ldots,n}(x_{i,j}) - \min_{i=1,\ldots,n}(x_{i,j})}.
$$

Categorical variables are coded as dummy variables: e.g. if a variable has
the three modalities 'Urban', 'Suburban' and 'Countryside', we create
two dummy input variables as follows:
$$
x_{i,d} = \begin{cases} 1 & \text{if Urban} \\ 0 & \text{otherwise,} \end{cases}
\qquad
x_{i,d+1} = \begin{cases} 1 & \text{if Suburban} \\ 0 & \text{otherwise.} \end{cases}
$$


Application to WASA dataset

Database: 62,436 contracts but only 693 claims, so we fit the models on the
whole dataset:

Model     # of hidden neurons   # of weights   Deviance   AIC       BIC
NN(3)     3                     52             5664.36    7116.93   7587.11
NN(4)     4                     69             5564.86    7051.43   7675.32
NN(5)     5                     86             5546.91    7067.48   7845.08
NN(2,2)   2 × 2                 41             5684.57    7115.14   7485.86
NN(3,3)   3 × 3                 57             5499.72    7080.29   8129.15
GLM       -                     16             5781.66    7162.23   7306.90

The NN(4) model is chosen, as it has the lowest AIC.


Application to WASA dataset

Output of hidden neurons $i = 1, 2, 3, 4$:
$$
\hat{y}_{i,1} = \phi\left(\omega_{i0}^{1} + \sum_{k=1}^{15} \omega_{ik}^{1}\, x_k\right).
$$
Frequency estimates:
$$
\mu = g^{-1}(f_{\text{net}}(\boldsymbol{x}))
    = \exp\left(\omega_{10}^{2} + \sum_{k=1}^{4} \omega_{1k}^{2}\, \hat{y}_{k,1}\right).
$$


Application to WASA dataset

Forecast frequencies of claims for drivers of a 4-year-old vehicle, with the NN(4)
model.


Application to WASA dataset

Forecast frequencies of claims for drivers of a 4-year-old vehicle, with the NN(4)
model.


Package 'neuralnet' in R

Training of neural networks using backpropagation or resilient
backpropagation (Riedmiller, 1994).

neuralnet(formula, data, hidden = 1, threshold = 0.01,
          stepmax = 1e+05, rep = 1, learningrate = NULL,
          algorithm = "rprop+", err.fct = "sse", act.fct = "logistic",
          linear.output = TRUE, exclude = NULL, ...)


Package 'neuralnet' in R

Other functions:
compute(x, covariate, rep = 1)

Computes the outputs of all neurons for specific covariate vectors, given a trained
neural network x.
rep: an integer indicating which repetition of the neural network should be used.

gwplot(x, rep = NULL, ...)

gwplot: a method for plotting the generalized weights of objects produced by neuralnet.

See the R file 'neuralNetExample.R'.
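A short usage sketch combining these functions on the toy regression example of the earlier slides (the simulated data, the hidden-layer size and the plotting options are illustrative assumptions):

library(neuralnet)

set.seed(1)
dat   <- data.frame(x = runif(100, -1, 1))
dat$y <- exp(0.5 * dat$x + rnorm(100, 0, 0.15))
fit   <- neuralnet(y ~ x, data = dat, hidden = 4)

# Network output for new covariate values
new  <- data.frame(x = seq(-1, 1, by = 0.1))
pred <- compute(fit, new)$net.result

# Generalized weights of the first repetition
gwplot(fit, rep = 1)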
