@freakonometrics 1
Arthur CHARPENTIER - Big Data and Machine Learning with an Actuarial Perspective - IA|BE
Agenda
All of these topics involve computational issues, so code will be shown throughout.
PHILOSOPHY MASTER: No doubt. Is it verse that you wish to write to her?
MONSIEUR JOURDAIN: No, no, no verse.
PHILOSOPHY MASTER: You want only prose?
MONSIEUR JOURDAIN: No, I want neither prose nor verse.
PHILOSOPHY MASTER: It must be one or the other.
MONSIEUR JOURDAIN: Why?
PHILOSOPHY MASTER: For the reason, Sir, that there is nothing to express oneself with but prose or verse.
MONSIEUR JOURDAIN: There is nothing but prose or verse?
PHILOSOPHY MASTER: No, Sir: everything that is not prose is verse, and everything that is not verse is prose.
MONSIEUR JOURDAIN: And the way one speaks, what is that then?
PHILOSOPHY MASTER: Prose.
MONSIEUR JOURDAIN: What? When I say, "Nicole, bring me my slippers, and give me my nightcap," that is prose?
PHILOSOPHY MASTER: Yes, Sir.
MONSIEUR JOURDAIN: By my faith! For more than forty years I have been speaking prose without knowing anything about it, and I am most obliged to you for having taught me that. I would therefore like to put in a note to her: Beautiful Marquise, your beautiful eyes make me die of love; but I would like it to be put in a gallant manner, nicely turned.
Part 1.
Statistical/Machine Learning
The decomposition writes

    x_i ∼ Σ_{j=1}^d ω_{i,j} z_j,  or  X ∼ ZΩ,

where Ω is a k × d matrix, with d < k.

[Figure: two panels — log-mortality rates against Age, and the years projected on the first two principal components (PC score 1 vs. PC score 2); the war years (1915–1919, 1940–1944) stand out.]
[Figure: hierarchical clustering dendrogram, hclust(d, "complete"), with Height on the vertical axis.]
i.e. we would like to get the distribution of the response variable Y conditional on one (or more) predictors X.
Consider a regression model, y_i = m(x_i) + ε_i, where the ε_i's are i.i.d. N(0, σ²); possibly linear, y_i = x_iᵀβ + ε_i, where the ε_i's are (somehow) unpredictable.
> library(np)
> nw <- npreg(y ~ x, data = db, bws = h,
+   ckertype = "gaussian")
(hard to get a simple rule of thumb... up to a constant, h ∼ n^{−1/5})
Use bootstrap or cross-validation to get an optimal h.
The bootstrap estimate is based on bootstrap samples: set θ̂⁽ᵇ⁾ = θ(x⁽ᵇ⁾), and

    θ̃ = (1/B) Σ_{b=1}^B θ̂⁽ᵇ⁾,

where x⁽ᵇ⁾ is a vector of size n whose values are drawn from {x₁, ..., x_n}, with replacement. And then use the law of large numbers...
See Efron (1979).
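As a minimal illustration of the resampling scheme above (the sample x and the statistic — here the median — are assumed for the example):

```r
# Bootstrap sketch: theta is the median, x an assumed sample
set.seed(1)
x <- rnorm(100)
B <- 1000
theta_b <- replicate(B, median(sample(x, size = length(x), replace = TRUE)))
theta_tilde <- mean(theta_b)   # aggregate of the B bootstrap estimates
sd(theta_b)                    # bootstrap standard error of the median
```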
Loss Functions
Fitting criteria are based on loss functions (also called cost functions). For a quantitative response, a popular one is the quadratic loss, ℓ(y, m(x)) = [y − m(x)]².
Recall that

    E(Y) = argmin_{m∈ℝ} { ‖Y − m‖²_{ℓ²} } = argmin_{m∈ℝ} { E([Y − m]²) }

    Var(Y) = min_{m∈ℝ} { E([Y − m]²) } = E([Y − E(Y)]²)
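A quick numerical check of the first identity, on simulated (assumed) data:

```r
# The expectation minimizes the expected quadratic loss E[(Y - m)^2]
set.seed(1)
y <- rexp(1e4)
m_grid <- seq(0, 3, by = 0.01)
risk <- sapply(m_grid, function(m) mean((y - m)^2))
m_grid[which.min(risk)]   # close to mean(y)
```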
Loss Functions
Robust estimation is based on a different loss function, ℓ(y, m(x)) = |y − m(x)|.
In the context of classification, we can use a misclassification indicator, ℓ(y, m(x)) = 𝟙(y ≠ m(x)).
Note that those loss functions have symmetric weighting.
Linear Predictors
In the linear model, the least squares estimator yields

    ŷ = Xβ̂ = X[XᵀX]⁻¹Xᵀ y

where H = X[XᵀX]⁻¹Xᵀ is the hat matrix.
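The hat matrix H can be computed directly; a small sketch on assumed data:

```r
# H projects y onto the column space of X (hence "hat": y_hat = H y)
X <- cbind(1, c(1, 2, 3, 4, 5))
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
H <- X %*% solve(t(X) %*% X) %*% t(X)
y_hat <- H %*% y
all.equal(as.vector(y_hat), unname(fitted(lm(y ~ X[, 2]))))  # same fitted values
sum(diag(H))   # trace of H = number of parameters, here 2
```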
Model Evaluation
In linear models, the R² is defined as the proportion of the variance of the response y that can be obtained using the predictors.
But maximizing the R² usually yields overfit (or "unjustified optimism" in Berk (2008)).
In linear models, consider the adjusted R²,

    R̄² = 1 − [1 − R²] (n − 1)/(n − p − 1)
Model Evaluation
Alternatives are based on the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), based on a penalty imposed on some criterion (the logarithm of the variance of the residuals),

    AIC = log( (1/n) Σ_{i=1}^n [y_i − ŷ_i]² ) + 2p/n

    BIC = log( (1/n) Σ_{i=1}^n [y_i − ŷ_i]² ) + log(n) p/n
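In R, AIC() and BIC() implement (up to an affine transformation of the formulas above) the same tradeoff; a sketch on assumed data:

```r
# BIC penalizes the extra, irrelevant covariate more heavily than AIC
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)        # x2 plays no role in the true model
AIC(lm(y ~ x1), lm(y ~ x1 + x2))  # smaller value = preferred model
BIC(lm(y ~ x1), lm(y ~ x1 + x2))
```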
Model Evaluation
One can also consider the expected prediction error (with a probabilistic model)

    E[ℓ(Y, m̂(X))]

since m̂ depends on the (y_i, x_i)'s.
A natural option: use two (random) samples, a training one and a validation one.
Alternatively, use cross-validation: leave-one-out or k-fold.
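A minimal k-fold cross-validation sketch (assumed data; k = 5):

```r
# 5-fold cross-validated quadratic loss for a (cubic) regression model
set.seed(1)
n <- 100
db <- data.frame(x = runif(n))
db$y <- sin(2 * pi * db$x) + rnorm(n, sd = 0.3)
fold <- sample(rep(1:5, length.out = n))   # random assignment to 5 folds
cv_err <- sapply(1:5, function(k) {
  fit <- lm(y ~ poly(x, 3), data = db[fold != k, ])
  mean((db$y[fold == k] - predict(fit, newdata = db[fold == k, ]))^2)
})
mean(cv_err)   # cross-validated estimate of the expected quadratic loss
```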
Scalability Issues
Dealing with big (or massive) datasets means a large number of observations (n) and/or a large number of predictors (features or covariates, k).
The ability to parallelize algorithms can be important (map-reduce).
→ Feature Engineering
Part 2.
Classification, y ∈ {0, 1}
Classification?
Examples: fraud detection, automatic reading (classifying handwritten symbols), face recognition, accident occurrence, death, purchase of optional insurance cover, etc.
Here y_i ∈ {0, 1}, or y_i ∈ {−1, +1}, or y_i ∈ {•, •}.
There will be two steps,
• the score function, s(x) = P(Y = 1 | X = x) ∈ [0, 1]
• the classification function s(x) → Ŷ ∈ {0, 1}.

[Figure: scores s(x_i) ∈ [0, 1] for a small sample of points.]
> myocarde = read.table("http://freakonometrics.free.fr/myocarde.csv",
+   head = TRUE, sep = ";")
Logistic Regression
Assume that P(Y_i = 1) = π_i, with

    logit(π_i) = x_iᵀβ, where logit(π_i) = log( π_i / (1 − π_i) ),

or

    π_i = logit⁻¹(x_iᵀβ) = exp[x_iᵀβ] / (1 + exp[x_iᵀβ]).

The log-likelihood is

    log L(β) = Σ_{i=1}^n y_i log(π_i) + (1 − y_i) log(1 − π_i)
             = Σ_{i=1}^n y_i log(π_i(β)) + (1 − y_i) log(1 − π_i(β))
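The model is fitted by maximum likelihood with glm(); the call below is a sketch of the presumable setup behind the output that follows (binomial family, logit link):

```r
# Logistic regression on the myocarde data (loaded as on the earlier slide)
myocarde <- read.table("http://freakonometrics.free.fr/myocarde.csv",
                       head = TRUE, sep = ";")
fit_glm <- glm(PRONO ~ ., data = myocarde, family = binomial(link = "logit"))
summary(fit_glm)   # coefficient table with estimates and standard errors
```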
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.187642  11.895227  -0.856    0.392
FRCAR         0.138178   0.114112   1.211    0.226
INCAR        -5.862429   6.748785  -0.869    0.385
INSYS         0.717084   0.561445   1.277    0.202
PRDIA        -0.073668   0.291636  -0.253    0.801
PAPUL         0.016757   0.341942   0.049    0.961
PVENT        -0.106776   0.110550  -0.966    0.334
REPUL        -0.003154   0.004891  -0.645    0.519
Coefficients:
               Estimate Std. Error   z value
(Intercept)  10.1876411 11.8941581  0.856525
FRCAR        -0.1381781  0.1141056 -1.210967
INCAR         5.8624289  6.7484319  0.868710
INSYS        -0.7170840  0.5613961 -1.277323
PRDIA         0.0736682  0.2916276  0.252610
PAPUL        -0.0167565  0.3419255 -0.049006
PVENT         0.1067760  0.1105456  0.965901
REPUL         0.0031542  0.0048907  0.644939
    P(Y = 1) = e^{xᵀβ} / (1 + e^{xᵀβ}) = p₁ / (p₀ + p₁) ∝ p₁   and   P(Y = 0) = 1 / (1 + e^{xᵀβ}) = p₀ / (p₀ + p₁) ∝ p₀

and, with three classes,

    P(Y = A) = p_A / (p_A + p_B + p_C) ∝ p_A,  i.e.  P(Y = A) = e^{xᵀβ_A} / (e^{xᵀβ_A} + e^{xᵀβ_B} + 1)

    P(Y = B) = p_B / (p_A + p_B + p_C) ∝ p_B,  i.e.  P(Y = B) = e^{xᵀβ_B} / (e^{xᵀβ_A} + e^{xᵀβ_B} + 1)

    P(Y = C) = p_C / (p_A + p_B + p_C) ∝ p_C,  i.e.  P(Y = C) = 1 / (e^{xᵀβ_A} + e^{xᵀβ_B} + 1)
where ∇ log L(β) is the gradient (also called Fisher's score), and H(β) the Hessian matrix.
The generic term of the Hessian is

    ∂² log L(β) / ∂β_k ∂β_ℓ = − Σ_{i=1}^n X_{k,i} X_{ℓ,i} π_i(β)[1 − π_i(β)]

Define Ω = [ω_{i,j}] = diag(π̂_i (1 − π̂_i)), so that the gradient is written

    ∇ log L(β) = ∂ log L(β) / ∂β = Xᵀ(y − π)
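The Newton step β ← β + [XᵀΩX]⁻¹Xᵀ(y − π) can be iterated by hand; a sketch on assumed simulated data:

```r
# Newton-Raphson for logistic regression, compared with glm()
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))
y <- rbinom(n, size = 1, prob = 1 / (1 + exp(-(0.5 + X[, 2]))))
beta <- rep(0, ncol(X))
for (it in 1:25) {
  pi_hat <- as.vector(1 / (1 + exp(-X %*% beta)))
  Omega  <- diag(pi_hat * (1 - pi_hat))          # the matrix Omega above
  beta   <- beta + solve(t(X) %*% Omega %*% X, t(X) %*% (y - pi_hat))
}
cbind(newton = beta, glm = coef(glm(y ~ X[, 2], family = binomial)))
```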
Predicted Probability
Let m(x) = E(Y | X = x). With a logistic regression, we can get a prediction

    m̂(x) = exp[xᵀβ̂] / (1 + exp[xᵀβ̂])
Predicted Probability

    m̂(x) = exp[xᵀβ̂] / (1 + exp[xᵀβ̂]) = exp[β̂₀ + β̂₁x₁ + ⋯ + β̂_k x_k] / (1 + exp[β̂₀ + β̂₁x₁ + ⋯ + β̂_k x_k])

use

> predict(fit_glm, newdata = data, type = "response")
e.g.

[Figure: predicted probabilities over the (PVENT, REPUL) plane, with the observations.]
Predictive Classifier
To go from a score to a class:
if s(x) > s, then Ŷ(x) = 1, and if s(x) ≤ s, then Ŷ(x) = 0.
Plot TP(s) = P[Ŷ = 1 | Y = 1] against FP(s) = P[Ŷ = 1 | Y = 0].
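The resulting ROC curve can be traced by hand from the scores; a sketch (fit_glm from the earlier fit, and the coding of Y = 1 as the second factor level, are assumptions):

```r
# ROC curve: TP(s) against FP(s) over a grid of thresholds s
S <- predict(fit_glm, type = "response")
Y <- (myocarde$PRONO == levels(myocarde$PRONO)[2]) + 0  # assumed coding of "1"
roc <- t(sapply(seq(0, 1, by = 0.01), function(s)
  c(FP = mean(S[Y == 0] > s), TP = mean(S[Y == 1] > s))))
plot(roc[, "FP"], roc[, "TP"], type = "s",
     xlab = "false positive rate", ylab = "true positive rate")
```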
Predictive Classifier
With a threshold (e.g. s = 50%) and the predicted probabilities, one can get a classifier and the confusion matrix

> probabilities <- predict(logistic, myocarde, type = "response")
> predictions <- levels(myocarde$PRONO)[(probabilities > .5) + 1]
> table(predictions, myocarde$PRONO)
[Figure: two panels of the individuals on the first two principal components (Dim 2: 18.64%), with observations labelled Death or Survival.]
y*_i = x_iᵀβ + ε_i is a non-observable score, and

    P(Y = 1 | X = x) / P(Y ≠ 1 | X = x) = exp[xᵀβ]

    P(Y = 1 | X = x) = H(xᵀβ)  where  H(·) = exp[·] / (1 + exp[·])

which is the c.d.f. of the logistic variable, see Verhulst (1845).
For k-Nearest Neighbors, the class is usually the majority vote of the k closest
neighbors of x.
> library(caret)
> knn <- knn3(PRONO ~ PVENT + REPUL,   # call partly reconstructed from context
+   data = myocarde, k = 15)
> pred_knn <- function(p, r) { predict(knn, newdata =
+   data.frame(PVENT = p, REPUL = r), type = "prob")[, 2] }

[Figure: k-NN (k = 15) predicted probabilities over the (PVENT, REPUL) plane.]
k-Nearest Neighbors
Distance d(·, ·) should not be sensitive to units: normalize by standard deviation
> sP <- sd(myocarde$PVENT); sR <- sd(myocarde$REPUL)
> knn_std <- knn3(PRONO ~ I(PVENT/sP) + I(REPUL/sR),   # call reconstructed: fit on standardized covariates
+   data = myocarde, k = 15)
> pred_knn <- function(p, r) { predict(knn_std, newdata =
+   data.frame(PVENT = p, REPUL = r), type = "prob")[, 2] }

[Figure: predicted probabilities with standardized distances, over the (PVENT, REPUL) plane.]
[Figure: boxplots across dimensions dim1–dim5, for samples of size n = 10 and n = 100.]
or

> fancyRpartPlot(cart, sub = "")
e.g. the Gini index (used originally in CART, see Breiman et al. (1984))

    gini(N_L, N_R) = − Σ_{x∈{L,R}} (n_x/n) Σ_{y∈{0,1}} (n_{x,y}/n_x) (1 − n_{x,y}/n_x)
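The criterion above can be coded directly; a sketch for two candidate daughter nodes with 0/1 labels:

```r
# gini(NL, NR): minus the weighted sum of the daughters' impurities p(1-p)
gini <- function(y_left, y_right) {
  impurity <- function(y) {
    p <- mean(y == 1)
    p * (1 - p) + (1 - p) * p   # sum over y in {0,1} of p_y (1 - p_y)
  }
  n <- length(y_left) + length(y_right)
  -(length(y_left) / n) * impurity(y_left) -
    (length(y_right) / n) * impurity(y_right)
}
gini(c(0, 0, 1), c(1, 1, 1))   # purer daughter nodes give a value closer to 0
```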
solve

    max_{j∈{1,…,k}, s} { I(N_L, N_R) }

[Figure: the split criterion as a function of the candidate threshold, for each covariate; the arrow marks the first split.]
Pruning Trees
One can grow a big tree, until leaves have a (preset) small number of observations, and then possibly go back and prune branches (or leaves) that do not sufficiently improve classification.
Or we can decide, at each node, whether to split or not.
Pruning Trees
In trees, overfitting increases with the number of steps, and leaves. The drop in impurity at node N is defined as

    ∆I(N_L, N_R) = I(N) − I(N_L, N_R) = I(N) − ( (n_L/n) I(N_L) + (n_R/n) I(N_R) )
> library(rpart)
> cart <- rpart(PRONO ~ PVENT + REPUL,   # call partly reconstructed from context
+   data = myocarde)
> pred_cart <- function(p, r) { predict(cart, newdata =
+   data.frame(PVENT = p, REPUL = r))[, "Survival"] }

[Figure: CART predictions over the (PVENT, REPUL) plane.]
Pruning Trees

> library(rpart)
> cart <- rpart(PRONO ~ PVENT + REPUL,   # call partly reconstructed from context
+   data = myocarde, minsplit = 5)
> pred_cart <- function(p, r) { predict(cart, newdata =
+   data.frame(PVENT = p, REPUL = r))[, "Survival"] }

[Figure: CART predictions with minsplit = 5 over the (PVENT, REPUL) plane.]
See also

> library(mvpart)
> ?prune
Pruning Trees
Given a cost-complexity parameter cp (see the tuning parameter in Ridge-Lasso), define a penalized risk R(·).
If cp is small the optimal tree is large; if cp is large the optimal tree has no leaf, see Breiman et al. (1984).

> cart <- rpart(PRONO ~ ., data = myocarde, minsplit = ...)   # value lost in extraction
> plotcp(cart)
> prune(cart, cp = 0.06)

[Figure: plotcp(cart) output — cross-validated relative error against cp (Inf, 0.27, 0.06, 0.024, 0.013), with tree sizes 1, 2, 3, 7, 9.]
Bagging
Bootstrap Aggregation (Bagging) "is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification" (source: Wikipedia).
It is an ensemble method that creates multiple models of the same type from different sub-samples of the same dataset [bootstrap]. The predictions from each separate model are combined to provide a superior result [aggregation].
→ can be used on any kind of model, but it is especially interesting for trees, see Breiman (1996).
Bootstrap can be used to define the concept of margin,

    margin_i = (1/B) Σ_{b=1}^B 𝟙(ŷ_i = y_i) − (1/B) Σ_{b=1}^B 𝟙(ŷ_i ≠ y_i)

Remark: the probability that the i-th row is not selected is (1 − n⁻¹)ⁿ → e⁻¹ ≈ 36.8%, cf. training/validation samples (2/3–1/3).
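A hand-rolled bagging sketch on assumed simulated data (B bootstrap samples, majority-vote aggregation):

```r
# Bagging classification trees by hand
library(rpart)
set.seed(1)
n <- 150
db <- data.frame(x1 = runif(n), x2 = runif(n))
db$y <- factor(ifelse(db$x1 + db$x2 + rnorm(n, sd = 0.3) > 1, "A", "B"))
B <- 100
votes <- replicate(B, {
  idx <- sample(1:n, size = n, replace = TRUE)      # bootstrap resample
  fit <- rpart(y ~ x1 + x2, data = db[idx, ], minsplit = 5)
  predict(fit, newdata = db, type = "class") == "A"
})
y_hat <- ifelse(rowMeans(votes) > 0.5, "A", "B")    # aggregation: majority vote
mean(y_hat == db$y)                                 # in-sample accuracy
```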
Bagging Trees

> n <- nrow(myocarde); B <- 1e4       # loop header partly reconstructed
> margin <- matrix(NA, B, n)
> for (j in 1:B) {
+   idx = sample(1:n, size = n, replace = TRUE)
+   cart <- rpart(PRONO ~ PVENT + REPUL,
+     data = myocarde[idx, ], minsplit = 5)
+   margin[j, ] <- (predict(cart, newdata =
+     myocarde, type = "prob")[, "Survival"] > .5) !=
+     (myocarde$PRONO == "Survival")
+ }
> apply(margin, 2, mean)

[Figure: the observations in the (PVENT, REPUL) plane.]
Bagging
Bagging Trees
Interesting because of instability in CARTs (in terms of tree structure, not
necessarily prediction)
    E[(Y − m̂(x))²] = σ²  +  ( E[m̂(x)] − m(x) )²  +  E[( m̂(x) − E[m̂(x)] )²]
                     (1)          (2)                      (3)

where (1) is the noise, (2) the squared bias, and (3) the variance.
> library(ipred)
> fit_bag <- bagging(PRONO ~ PVENT + REPUL,   # call partly reconstructed from context
+   data = myocarde)
> pred_bag <- function(p, r) { predict(fit_bag, newdata =
+   data.frame(PVENT = p, REPUL = r), type = "prob")[, 2] }

[Figure: bagged predictions over the (PVENT, REPUL) plane.]
Random Forests
Strictly speaking, when bootstrapping among observations and aggregating, we use a bagging algorithm.
In the random forest algorithm, we combine Breiman's bagging idea and the random selection of features, introduced independently by Ho (1995) and Amit & Geman (1997).

> library(randomForest)
> fit_rf <- randomForest(PRONO ~ PVENT + REPUL,   # call partly reconstructed from context
+   data = myocarde)
> pred_rf <- function(p, r) { predict(fit_rf, newdata =
+   data.frame(PVENT = p, REPUL = r), type = "prob")[, 2] }

[Figure: random-forest predictions over the (PVENT, REPUL) plane.]
Random Forest
At each node, select √k covariates out of k (randomly).
Random Forest
Random forests can deal with small-n, large-k problems.
They are used not only for prediction, but also to assess variable importance (see the last section).
The distance from x₀ to the hyperplane H_{ω,b} is

    d(x₀, H_{ω,b}) = (ωᵀx₀ + b) / ‖ω‖
|ωᵀx_i + b| = 1

The distance from the support vectors to H_{ω,b} is ‖ω‖⁻¹, and the margin is then 2‖ω‖⁻¹.
−→ the algorithm is to minimize the inverse of the margin s.t. H_{ω,b} separates the ±1 points, i.e.

    min (1/2) ωᵀω  s.t.  Y_i(ωᵀx_i + b) ≥ 1, ∀i.
In the dual space, the problem becomes (hint: consider the Lagrangian)

    max_α { Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j Y_i Y_j x_iᵀx_j }  s.t.  Σ_{i=1}^n α_i Y_i = 0, α_i ≥ 0.
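In practice the optimization is delegated to a solver; a sketch with e1071 (the dataset and covariates follow the earlier slides, the linear kernel is an assumption):

```r
# Linear SVM on the myocarde data, with class probabilities
library(e1071)
fit_svm <- svm(PRONO ~ PVENT + REPUL, data = myocarde,
               kernel = "linear", probability = TRUE)
pr <- predict(fit_svm, myocarde, probability = TRUE)
head(attr(pr, "probabilities"))   # class probabilities for the first rows
```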
(code partly lost in extraction: an SVM is fitted on myocarde, and pred_SVM2 = function(p, r) {...} returns predicted probabilities over a (PVENT, REPUL) grid)

[Figure: SVM classification over the (PVENT, REPUL) plane.]
(same construction, with another SVM fit)

[Figure: another SVM classification over the (PVENT, REPUL) plane.]
Still Hungry?
There are still several (machine learning) techniques that can be used for classification.
Difference in Differences
Data: {(x_i, y_i)} with y_i ∈ {•, •}.
See clinical trials, treatment vs. control group.
E.g. a direct mail campaign in a bank:

                Control    Promotion
  No Purchase   85.17%     61.60%

overall uplift effect +23.57%, see Guelman et al. (2014) for more details.

[Figure: the two classes of observations.]
Part 3.
Regression
Regression?
In statistics, regression analysis is a statistical process for estimating the relationships among variables [...]. In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. (Source: Wikipedia)
Here regression is opposed to classification (as in the CART algorithm): y is either a continuous variable, y ∈ ℝ, or a counting variable, y ∈ ℕ.
Linear Model:
• (Y |X = x) ∼ N (θx , σ 2 )
• E[Y |X = x] = θx = xT β
• (Y |X = x) ∼ N (θx , σ 2 )
• E[Y |X = x] = θx = m(x)
• (Y |X = x) ∼ L(θx , ϕ)
Linear Model
Consider a linear regression model, y_i = x_iᵀβ + ε_i.
β is estimated using ordinary least squares, β̂ = [XᵀX]⁻¹Xᵀy
−→ best linear unbiased estimator
Unbiased estimators are important in statistics because they have nice mathematical properties (see the Cramér-Rao lower bound).
Looking for biased estimators (bias-variance tradeoff) becomes important in high dimension, see Burr & Fry (2005).
i.e.

    Σ_{i=1}^n ω(y_i − x_iᵀβ) · (y_i − x_iᵀβ) · x_i = 0,  with  ω(x) = R′(x)/x,

where the weights are ω_i = ω(y_i − x_iᵀβ).
Penalized Smoothing
We have mentioned in the introduction that usually we penalize a criterion (R² or log-likelihood), but it is also possible to penalize while fitting.
Heuristically, we have to minimize a penalized objective function.
The regression coefficients can be shrunk toward 0, making fitted values more homogeneous.
Consider a standard linear regression. The Ridge estimate is

    β̂ = argmin_β { Σ_{i=1}^n [y_i − β₀ − x_iᵀβ]² + λ ‖β‖²_{ℓ²} }

where ‖β‖²_{ℓ²} = 1ᵀβ², the sum of the squared coefficients.
Observe that β̂ = [XᵀX + λI]⁻¹Xᵀy.
We 'inflate' the XᵀX matrix by λI so that it is positive definite whatever k, including k > n.
There is a Bayesian interpretation: if β has a N(0, τ²I) prior and if the residuals are i.i.d. N(0, σ²), then the posterior mean (and median) β̂ is the Ridge estimator, with λ = σ²/τ².
The Lasso estimate is

    β̂ = argmin_β { Σ_{i=1}^n [y_i − β₀ − x_iᵀβ]² + λ ‖β‖_{ℓ¹} }

where ‖β‖_{ℓ¹} = 1ᵀ|β|, the sum of the absolute coefficients.
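Both estimators are available in glmnet (alpha = 0 for Ridge, alpha = 1 for Lasso); a sketch on assumed simulated data:

```r
# Ridge and Lasso fits, with lambda chosen by cross-validation
library(glmnet)
set.seed(1)
n <- 100; k <- 10
X <- matrix(rnorm(n * k), n, k)
y <- 1 + 2 * X[, 1] - X[, 2] + rnorm(n)
fit_ridge <- glmnet(X, y, alpha = 0)   # coefficients shrunk toward 0
fit_lasso <- glmnet(X, y, alpha = 1)   # some coefficients exactly 0
cv <- cv.glmnet(X, y, alpha = 1)       # cross-validated choice of lambda
coef(cv, s = "lambda.min")
```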
for some tuning parameter λ. Consider some cubic spline basis, so that

    m(x) = Σ_{j=1}^J θ_j N_j(x)

then

    θ̂ = [NᵀN + λΩ]⁻¹Nᵀy
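In base R, smooth.spline() fits exactly this kind of penalized cubic-spline problem, choosing λ by (generalized) cross-validation; a sketch on assumed data:

```r
# Penalized cubic-spline smoothing of a noisy sine curve
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
fit <- smooth.spline(x, y)
grid <- seq(0, 1, length.out = 101)
plot(x, y, col = "grey")
lines(grid, predict(fit, grid)$y, lwd = 2)   # fitted smooth curve
```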
Regression Trees
The partitioning is sequential, one covariate at a time (see adaptive neighbor estimation).

Start with Q = Σ_{i=1}^n [y_i − ȳ]².

For covariate k and threshold t, split the data according to {x_{i,k} ≤ t} (L) or {x_{i,k} > t} (R). Compute

    ȳ_L = Σ_{i: x_{i,k} ≤ t} y_i / Σ_{i: x_{i,k} ≤ t} 1   and   ȳ_R = Σ_{i: x_{i,k} > t} y_i / Σ_{i: x_{i,k} > t} 1

and let

    m_i^{(k,t)} = ȳ_L if x_{i,k} ≤ t,  ȳ_R if x_{i,k} > t.
Regression Trees
Then compute

    (k*, t*) = argmin { Σ_{i=1}^n [y_i − m_i^{(k,t)}]² },

and partition the space into two subspaces, according to whether x_{k*} ≤ t* or not.
Then repeat this procedure, and minimize

    Σ_{i=1}^n [y_i − m_i]² + λ · #{leaves},

(cf. LASSO).
One can also consider random forests with regression trees.
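The search for (k*, t*) can be written as an exhaustive scan; a sketch on assumed data with a single true split:

```r
# Best single split by exhaustive search over covariates and thresholds
set.seed(1)
n <- 100
X <- data.frame(x1 = runif(n), x2 = runif(n))
y <- ifelse(X$x1 <= 0.4, 1, 3) + rnorm(n, sd = 0.2)
best <- list(Q = Inf, k = NA, t = NA)
for (k in 1:2) for (t in sort(unique(X[[k]]))[-n]) {
  L <- X[[k]] <= t
  Q <- sum((y[L] - mean(y[L]))^2) + sum((y[!L] - mean(y[!L]))^2)
  if (Q < best$Q) best <- list(Q = Q, k = k, t = t)
}
best   # should recover covariate 1, with a threshold near 0.4
```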
Local Regression
> library(np)
> fit <- npreg(Y ~ X, data = db, bws = h,
+   ckertype = "gaussian")
> predict(fit, newdata = data.frame(X = x))
[Figure: marginplots of missing/observed values in Air.Temp and Humidity]
Neighbors algorithm for imputation

> library(VIM)
> tao_kNN <- kNN(tao, k = 5)

Imputation can be visualized using

> vars <- c("Air.Temp", "Humidity", "Air.Temp_imp", "Humidity_imp")
> marginplot(tao_kNN[, vars])

[Figure: marginplot of Humidity against Air.Temp, with imputed values highlighted]
where a(·), b(·) and c(·) are functions, θ is the natural (canonical) parameter and φ is a nuisance parameter.

The Gaussian distribution N(µ, σ²) belongs to this family, with

\underbrace{\theta = \mu}_{\theta \leftrightarrow \mathbb{E}(Y)}, \quad \underbrace{\phi = \sigma^2}_{\phi \leftrightarrow \operatorname{Var}(Y)}, \quad a(\phi) = \phi, \quad b(\theta) = \theta^2/2
where g⋆(·) is some link function (here the logistic transformation): the canonical link.

Canonical links are

binomial(link = "logit")
gaussian(link = "identity")
Gamma(link = "inverse")
inverse.gaussian(link = "1/mu^2")
poisson(link = "log")
quasi(link = "identity", variance = "constant")
quasibinomial(link = "logit")
quasipoisson(link = "log")
\mu = \mathbb{E}(Y) = b'(\theta) \quad\text{and}\quad \operatorname{Var}(Y) = b''(\theta)\cdot\phi = \underbrace{b''([b']^{-1}(\mu))}_{\text{variance function } V(\mu)} \cdot \phi
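As a worked example (not on the original slide), the Poisson case, with b(\theta) = e^{\theta} and \phi = 1, recovers the familiar variance function V(\mu) = \mu:

```latex
\mu = b'(\theta) = e^{\theta}
\quad\Longrightarrow\quad
\theta = [b']^{-1}(\mu) = \log \mu,
\qquad
V(\mu) = b''(\log \mu) = e^{\log \mu} = \mu .
```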
for some \omega_{i,j} (see e.g. Müller (2004)), which are simple when g⋆ = g.
z_k = \widehat{\theta}_k + (y - \widehat{\mu}_k)\, g'(\widehat{\mu}_k).
Under some technical conditions, we can prove that \widehat{\beta} \xrightarrow{P} \beta and

\sqrt{n}(\widehat{\beta} - \beta) \xrightarrow{L} \mathcal{N}(0, I(\beta)^{-1}).
GLM: Distribution?

From a computational point of view, the Poisson regression is not (really) related to the Poisson distribution.

Here we solve the first order conditions (or normal equations)

\sum_i [Y_i - \exp(X_i^\top \beta)]\, X_{i,j} = 0 \qquad \forall j
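These normal equations can be solved directly by Newton–Raphson (equivalently IRLS), without ever writing down the Poisson likelihood; a Python sketch, assuming a design matrix with an intercept column (`poisson_fit` is a hypothetical helper, not glm itself):

```python
import numpy as np

def poisson_fit(X, y, iters=25):
    """Solve sum_i [y_i - exp(x_i' b)] x_ij = 0 for all j by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)                # the normal equations
        hess = X.T @ (X * mu[:, None])       # Fisher information (log link)
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.poisson(np.exp(0.5 + 0.3 * X[:, 1]))
beta = poisson_fit(X, y)
```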
for

\widehat{\lambda} = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n E_i} = \sum_{i=1}^n \omega_i \frac{Y_i}{E_i} \quad\text{where}\quad \omega_i = \frac{E_i}{\sum_{i=1}^n E_i}
Boosting

Boosting is a machine learning ensemble meta-algorithm for reducing bias primarily and also variance in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones. (source: Wikipedia)

The heuristic is simple: we consider an iterative process where we keep modeling the errors.

Fit a model for y, m_1(·), from y and X, and compute the error, ε_1 = y − m_1(X).
Fit a model for ε_1, m_2(·), from ε_1 and X, and compute the error, ε_2 = ε_1 − m_2(X), etc. Then set

m(\cdot) = m_1(\cdot) + m_2(\cdot) + \cdots + m_k(\cdot)
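This iterate-on-the-residuals heuristic can be sketched as follows (Python, with a one-split stump as weak learner and a shrinkage factor ν; both helpers are hypothetical, not from the slides):

```python
import numpy as np

def stump(x, y):
    """Fit the best one-split regression stump (a weak learner)."""
    best = None
    for t in np.unique(x)[:-1]:
        l, r = y[x <= t].mean(), y[x > t].mean()
        err = np.sum((np.where(x <= t, l, r) - y) ** 2)
        if best is None or err < best[0]:
            best = (err, t, l, r)
    _, t, l, r = best
    return lambda z: np.where(z <= t, l, r)

def boost(x, y, steps=50, nu=0.3):
    """Keep modelling the errors: fit h_k to the current residuals eps_k,
    then update eps_{k+1} = eps_k - nu * h_k(x)."""
    models, resid = [], y.astype(float)
    for _ in range(steps):
        h = stump(x, resid)
        models.append(h)
        resid = resid - nu * h(x)
    return lambda z: nu * sum(h(z) for h in models)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)
m = boost(x, y)
```

The shrinkage ν plays the role of a learning rate: smaller values need more steps but overfit later.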
Boosting

With (very) general notations, we want to solve

m^\star = \operatorname{argmin}_m \{ \mathbb{E}[\ell(Y, m(X))] \}
Boosting

In a perfect world, h(X) = y − m_k(X), which can be interpreted as a residual.

Note that this residual is the gradient of \frac{1}{2}[y - m(x)]^2.

A gradient descent is based on a Taylor expansion
Boosting

Here, f_k is an \mathbb{R}^d \to \mathbb{R} function, so the gradient should be in such a (big) functional space → we want to approximate that function,

m_k(x) = m_{k-1}(x) + \operatorname{argmin}_{f \in \mathcal{F}} \left\{ \sum_{i=1}^n \ell(Y_i, m_{k-1}(x_i) + f(x_i)) \right\}
\mathbb{E}[Y|X = x] = m(x) \sim m_M(x) = \sum_{j=1}^M \nu_j h_j(x)
and at step j,
Deviance is \sum_{i=1}^n |y_i - m(x_i)|, with gradient \widehat{\varepsilon}_i = \operatorname{sign}(y_i - m(x_i))
• Poisson distribution

Deviance 2\sum_{i=1}^n \left[ y_i \log\frac{y_i}{m(x_i)} - [y_i - m(x_i)] \right] with gradient \widehat{\varepsilon}_i = \frac{y_i - m(x_i)}{\sqrt{m(x_i)}}
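A quick numerical sanity check of this deviance formula (Python, toy values not from the slides): it equals twice the log-likelihood gap between the saturated model and the fitted means.

```python
import numpy as np

y = np.array([1.0, 3.0, 0.5, 2.0])   # toy observations
m = np.array([1.2, 2.5, 0.8, 2.0])   # fitted means m(x_i)

# D = 2 * sum_i [ y_i log(y_i / m_i) - (y_i - m_i) ]
dev = 2 * np.sum(y * np.log(y / m) - (y - m))

# same quantity via log-likelihoods l(mu) = sum y log(mu) - mu (up to constants)
ll = lambda mu: np.sum(y * np.log(mu) - mu)
gap = 2 * (ll(y) - ll(m))

grad = (y - m) / np.sqrt(m)          # the gradient from the slide
```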
Regularized GLM

In regularized GLMs, we introduce a penalty in the loss function (the deviance), see e.g. \ell_1-regularized logistic regression

\max\left\{ \sum_{i=1}^n y_i[\beta_0 + x_i^\top \beta] - \log[1 + e^{\beta_0 + x_i^\top \beta}] - \lambda \sum_{j=1}^k |\beta_j| \right\}

> library(glmnet)
> y <- myocarde$PRONO
> x <- as.matrix(myocarde[, 1:7])
> glm_ridge <- glmnet(x, y, alpha = 0, lambda = seq(0, 2, by = .01),
+                     family = "binomial")
> plot(glm_ridge)

[Figure: coefficient paths against the L1 norm, for FRCAR, INCAR, INSYS, PRDIA, PAPUL, PVENT and REPUL]
• variance function power is p = \frac{\phi + 2}{\phi + 1}

• mean is \lambda_i \mu_i

• scale parameter is \frac{\lambda_i^{\frac{1}{\phi+1}-1}\,\phi}{\phi + 1}\, \mu_i^{\frac{\phi}{1+\phi}}
• mean is \mu_i = \exp[X_i^\top \beta_{\text{Tweedie}}]

• scale parameter is \phi
Part 4.
Model Choice, Feature Selection, etc.
AIC, BIC

AIC and BIC are both maximum-likelihood driven and penalize useless parameters (to avoid overfitting).

AIC focuses on overfit, while BIC depends on n, so it can also avoid underfit.

BIC penalizes complexity more than AIC does.

Minimizing AIC ⇔ minimizing the cross-validation value, Stone (1977).

Minimizing BIC ⇔ k-fold leave-out cross-validation, Shao (1997), with k = n[1 − (log n − 1)^{-1}]

−→ used in econometric stepwise procedures
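The two criteria are one-liners once the log-likelihood is in hand; a sketch (Python, with toy values for the log-likelihood, parameter count and sample size):

```python
import numpy as np

def aic_bic(loglik, k, n):
    """AIC = -2*loglik + 2*k ; BIC = -2*loglik + k*log(n)."""
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

aic, bic = aic_bic(loglik=-120.0, k=5, n=200)
# BIC penalises complexity more than AIC as soon as log(n) > 2, i.e. n >= 8
```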
Cross-Validation

Formally, the leave-one-out cross-validation is based on

\mathrm{CV} = \frac{1}{n}\sum_{i=1}^n \ell(y_i, \widehat{m}_{-i}(x_i))
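This formula translates almost literally into code; a sketch (Python, squared loss, with a deliberately trivial learner that predicts the training mean; `loocv_mse` is a hypothetical helper):

```python
import numpy as np

def loocv_mse(x, y, fit_predict):
    """CV = (1/n) * sum_i [y_i - m_hat_{-i}(x_i)]^2:
    refit on everything except i, predict at x_i, average the losses."""
    n, errs = len(y), []
    for i in range(n):
        mask = np.arange(n) != i
        yhat = fit_predict(x[mask], y[mask], x[i])
        errs.append((y[i] - yhat) ** 2)
    return float(np.mean(errs))

# Toy learner: predict the training mean (ignores x)
cv = loocv_mse(np.arange(5.0), np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
               lambda xt, yt, x0: yt.mean())
```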
Econometric approach

Define \widehat{m}(x) = \widehat{\beta}_0^{[x]} + \widehat{\beta}_1^{[x]} x with

(\widehat{\beta}_0^{[x]}, \widehat{\beta}_1^{[x]}) = \operatorname{argmin}_{(\beta_0, \beta_1)} \left\{ \sum_{i=1}^n \omega_h^{[x]} \left[y_i - (\beta_0 + \beta_1 x_i)\right]^2 \right\}

(see previous discussion).

[Figure: scatterplot with a local linear fit]
...
+          "binomial", type = "auc", nlambda = 100)
> cvfit$lambda.min
[1] 0.0408752
> plot(cvfit)
> cvfit <- cv.glmnet(x, y, alpha = 1, family =
...
[1] 0.03315514

[Figure: cross-validated binomial deviance against log(λ), for the two cv.glmnet fits]
INCAR 8.194572
PAPUL 2.341335
PVENT 3.313113
REPUL 7.078838

[Figure: variable importance dotchart, MeanDecreaseGini]
Feature Selection

Use Mallows' Cp, from Mallows (1973), on all subsets of predictors, in a regression

C_p = \frac{1}{S^2} \sum_{i=1}^n [Y_i - \widehat{Y}_i]^2 - n + 2p,
Feature Selection

Use the random forest algorithm, removing some features at each iteration (the least relevant ones).

The algorithm uses shadow attributes (obtained from existing features by shuffling the values).

> library(Boruta)
> B <- Boruta(PRONO ~ ., data = myocarde, ntree = 500)
> plot(B)

[Figure: Boruta importance boxplots for PVENT, PAPUL, PRDIA, REPUL, INCAR, INSYS]
Feature Selection

Use random forests, and variable importance plots

> library(varSelRF)
> VB <- varSelRFBoot(X, Y, usingCluster = FALSE)
> plot(VB)

[Figure: out-of-bag error against the number of variables]
with the H-measure (see hmeasure), Gini and AUC, as well as the area under the
convex hull (AUCH).
...
+              family = binomial)
> Y <- myocarde$PRONO
> S <- predict(logistic, type = "response")

For a standard ROC curve

> library(ROCR)
> pred <- prediction(S, Y)
> perf <- performance(pred, "tpr", "fpr")

[Figure: ROC curve, true positive rate against false positive rate]
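For the AUC itself, the ROC machinery reduces to the probability that a positive case outscores a negative one; a Python sketch (`roc_auc` is a hypothetical helper, with toy scores and labels):

```python
import numpy as np

def roc_auc(scores, labels):
    """Empirical AUC = P(S_pos > S_neg) + 0.5 * P(S_pos = S_neg),
    comparing every positive score with every negative score."""
    s1, s0 = scores[labels == 1], scores[labels == 0]
    diff = s1[:, None] - s0[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(s1) * len(s0))

labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc(scores, labels)   # 3 of the 4 positive/negative pairs are ordered
```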
... procedures)

> library(pROC)
> roc <- plot.roc(Y, S, main = "", percent = TRUE,
+                 ci = TRUE)
> roc.se <- ci.se(roc, specificities = seq(0, 100, 5))
> plot(roc.se, type = "shape", col = "light blue")

[Figure: ROC curve with confidence band, Sensitivity (%) against Specificity (%)]
            Y = 0    Y = 1
Ŷ = 0      TN       FN       TN+FN
Ŷ = 1      FP       TP       FP+TP
            TN+FP    FN+TP    n

            Y = 0    Y = 1                         Y = 0    Y = 1
Ŷ = 0      25       3        28        Ŷ = 0      11.44    16.56    28
Ŷ = 1      4        39       43        Ŷ = 1      17.56    25.44    43
            29       42       71                    29       42       71

\text{total accuracy} = \frac{TP + TN}{n} \sim 90.14\%

\text{random accuracy} = \frac{[TN + FN]\cdot[TN + FP] + [TP + FP]\cdot[TP + FN]}{n^2} \sim 51.93\%

\kappa = \frac{\text{total accuracy} - \text{random accuracy}}{1 - \text{random accuracy}} \sim 79.48\%
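The κ computation above, in a few lines of Python (using the slide's confusion matrix):

```python
# Confusion matrix from the slide: rows Yhat, columns Y
TN, FN, FP, TP = 25, 3, 4, 39
n = TN + FN + FP + TP          # 71

total_acc = (TP + TN) / n      # ~ 0.9014
# chance agreement: products of matching row and column margins
random_acc = ((TN + FN) * (TN + FP) + (FP + TP) * (FN + TP)) / n ** 2  # ~ 0.5193
kappa = (total_acc - random_acc) / (1 - random_acc)                    # ~ 0.7949
```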
[Figure: boxplot comparison of model performance: rf, gbm, boost, nn, bag, svm, knn, aic, glm, loess]
...
+ base = base[order(base[, vary], decreasing = FALSE), ]
+ base$cum = (1:nrow(base)) / nrow(base)
+ return(sum(base[base$cum <= u, vary]) /
+        sum(base[, vary])) }
> vu <- seq(0, 1, length = nrow(base) + 1)
...
> plot(vu, vv, type = "l")

[Figure: Lorenz curve, Income (%) against Proportion (%), poorest ← → richest]
...
+ base = base[order(base[, vary], decreasing = TRUE), ]
+ base$cum = (1:nrow(base)) / nrow(base)
+ return(sum(base[base$cum <= u, vary]) /
+        sum(base[, vary])) }
> vu <- seq(0, 1, length = nrow(base) + 1)
...
> plot(vu, vv, type = "l")

[Figure: Lorenz curve, Income (%) against Proportion (%)]
> L <- function(u, varx = "premium", vary = "losses") {
+ base = base[order(base[, varx], decreasing = TRUE), ]
+ base$cum = (1:nrow(base)) / nrow(base)
+ return(sum(base[base$cum <= u, vary]) /
+        sum(base[, vary])) }
> vu <- seq(0, 1, length = nrow(base) + 1)
...
> plot(vu, vv, type = "l")

[Figure: concentration curve, Income (%) against Proportion (%)]
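The Lorenz construction in the R snippets above can be mirrored in Python (`lorenz` is a hypothetical helper corresponding to the decreasing = FALSE version, ordering from poorest to richest):

```python
import numpy as np

def lorenz(values, u):
    """L(u): share of the total held by the fraction u with the smallest values."""
    v = np.sort(values)
    cum = np.arange(1, len(v) + 1) / len(v)
    return v[cum <= u].sum() / v.sum()

incomes = np.array([1.0, 1.0, 1.0, 1.0, 6.0])
share40 = lorenz(incomes, 0.4)   # poorest 40% hold 2 of 10 units, i.e. 0.2
```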
Brief Summary

k-nearest neighbors
+ very intuitive
− sensitive to irrelevant features, does not work in high dimension

trees and forests
+ a single tree is easy to interpret, tolerant to irrelevant features, works with big data
− cannot handle linear combinations of features

support vector machine
+ good predictive power
− looks like a black box

boosting
+ intuitive and efficient
− looks like a black box
Mohri, Rostamizadeh & Talwalker (2012) Foundations of Machine Learning. MIT Press.