
Exam SRM

You have what it takes to pass. Updated 12/17/20.

STATISTICAL LEARNING

Types of Variables
• Response: A variable of primary interest
• Explanatory: A variable used to study the response variable
• Count: A quantitative variable usually valid on non-negative integers
• Continuous: A real-valued quantitative variable
• Nominal: A categorical/qualitative variable having categories without a meaningful or logical order
• Ordinal: A categorical/qualitative variable having categories with a meaningful or logical order

Notation
$y$, $Y$: Response variable
$x$, $X$: Explanatory variable
Subscript $i$: Index for observations
$n$: No. of observations
Subscript $j$: Index for variables except response
$p$: No. of variables except response
$\mathbf{A}^T$: Transpose of matrix $\mathbf{A}$
$\mathbf{A}^{-1}$: Inverse of matrix $\mathbf{A}$
$\varepsilon$: Error term
$\hat{y}$, $\hat{Y}$, $\hat{f}(x)$: Estimate/estimator of $f(x)$

Contrasting Statistical Learning Elements
• Modeling problems: Supervised (has a response variable) vs. Unsupervised (no response variable)
• Statistical learning problems: Regression (quantitative response variable) vs. Classification (categorical response variable)
• Functional form: Parametric (functional form of $f$ specified) vs. Non-parametric (functional form of $f$ not specified)
• Method purpose: Prediction (output of $\hat{f}$) vs. Inference (properties/comprehension of $f$)
• Flexibility ($\hat{f}$'s ability to follow the data) vs. Interpretability ($\hat{f}$'s ability to be understood)
• Data: Training (observations used to train/obtain $\hat{f}$) vs. Test (observations not used to train/obtain $\hat{f}$)



Regression Problems
$Y = f(x_1, \dots, x_p) + \varepsilon$ where $\mathrm{E}[\varepsilon] = 0$, so $\mathrm{E}[Y] = f(x_1, \dots, x_p)$

Test MSE $= \mathrm{E}\big[(Y - \hat{Y})^2\big]$, which can be estimated using $\dfrac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$

For fixed inputs $x_1, \dots, x_p$, the test MSE is
$\underbrace{\mathrm{Var}\big[\hat{f}(x_1,\dots,x_p)\big] + \big(\mathrm{Bias}\big[\hat{f}(x_1,\dots,x_p)\big]\big)^2}_{\text{reducible error}} + \underbrace{\mathrm{Var}[\varepsilon]}_{\text{irreducible error}}$

Classification Problems
Test Error Rate $= \mathrm{E}\big[I(Y \neq \hat{Y})\big]$, which can be estimated using $\dfrac{1}{n}\sum_{i=1}^n I(y_i \neq \hat{y}_i)$

Bayes Classifier:
$f(x_1, \dots, x_p) = \arg\max_c \Pr(Y = c \mid X_1 = x_1, \dots, X_p = x_p)$

Key Ideas
• The disadvantage to parametric methods is the danger of choosing a form for $f$ that is not close to the truth.
• The disadvantage to non-parametric methods is the need for an abundance of observations.
• Flexibility and interpretability are typically at odds.
• As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern.
• Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.

Descriptive Data Analysis

Numerical Summaries
$\bar{x} = \dfrac{\sum_{i=1}^n x_i}{n}, \qquad s_x^2 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})^2}{n - 1}$
$cov_{x,y} = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$r_{x,y} = \dfrac{cov_{x,y}}{s_x \cdot s_y} = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \cdot \sum_{i=1}^n (y_i - \bar{y})^2}}, \qquad -1 \le r_{x,y} \le 1$

Scatterplots
Plots values of two variables to investigate their relationship.

Box Plots
Captures a variable's distribution using its median, 1st and 3rd quartiles, and distribution tails.
(Box plot diagram: smallest non-outlier, 1st quartile, median, 3rd quartile, largest non-outlier; the interquartile range spans the middle 50%, with each segment covering 25% of the data and outliers lying beyond the tails.)

qq Plots
Plots sample quantiles against theoretical quantiles to determine whether the sample and theoretical distributions have similar shapes.
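The test MSE and test error rate estimators above are simple averages over held-out observations. Below is a minimal numpy sketch of both estimators; the responses and predictions are made-up values used only for illustration.

```python
import numpy as np

# Held-out (test) observations and predictions from some fitted model (hypothetical values).
y_test = np.array([2.3, 1.9, 3.1, 2.8])          # quantitative response
y_hat = np.array([2.1, 2.2, 2.9, 3.0])           # corresponding predictions

test_mse = np.mean((y_test - y_hat) ** 2)         # estimates E[(Y - Yhat)^2]

# Categorical response and predicted classes from some classifier (hypothetical values).
c_test = np.array(["A", "B", "A", "A"])
c_hat = np.array(["A", "B", "B", "A"])

test_error_rate = np.mean(c_test != c_hat)        # estimates E[I(Y != Yhat)]
```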



LINEAR MODELS

Simple Linear Regression (SLR)
Special case of MLR where $p = 1$

Estimation
$b_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$

SLR Inferences
Standard Errors
$se_{b_0} = \sqrt{\mathrm{MSE}\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}$
$se_{b_1} = \sqrt{\dfrac{\mathrm{MSE}}{\sum_{i=1}^n (x_i - \bar{x})^2}}$
$se_{\hat{y}} = \sqrt{\mathrm{MSE}\left(\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}$
$se_{\hat{y}_{n+1}} = \sqrt{\mathrm{MSE}\left(1 + \dfrac{1}{n} + \dfrac{(x_{n+1} - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}$

Multiple Linear Regression (MLR)
$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$

Notation
$\beta_j$: The $j$th regression coefficient
$b_j$: Estimate of $\beta_j$
$\sigma^2$: Variance of response / irreducible error
MSE: Estimate of $\sigma^2$
$\mathbf{X}$: Design matrix
$\mathbf{H}$: Hat matrix
$e$: Residual
SST: Total sum of squares
SSR: Regression sum of squares
SSE: Error sum of squares

Assumptions
1. $Y_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p} + \varepsilon_i$
2. $x_i$'s are non-random
3. $\mathrm{E}[\varepsilon_i] = 0$
4. $\mathrm{Var}[\varepsilon_i] = \sigma^2$
5. $\varepsilon_i$'s are independent
6. $\varepsilon_i$'s are normally distributed
7. The predictor $x_j$ is not a linear combination of the other $p$ predictors, for $j = 0, 1, \dots, p$

Estimation – Ordinary Least Squares (OLS)
$\hat{y} = b_0 + b_1 x_1 + \cdots + b_p x_p$
$\mathbf{b} = \begin{pmatrix} b_0 \\ \vdots \\ b_p \end{pmatrix} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
$\mathrm{MSE} = \mathrm{SSE}/(n - p - 1)$
residual standard error $= \sqrt{\mathrm{MSE}}$
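As a concrete illustration of the OLS formulas above, here is a small numpy sketch that builds the design matrix, solves the normal equations for $\mathbf{b}$, and computes the MSE and residual standard error. The data are hypothetical; in practice a regression routine from a statistics library would be used.

```python
import numpy as np

# Hypothetical data: n = 6 observations, p = 2 predictors.
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 11.0, 11.8])

n, p = x.shape
X = np.column_stack([np.ones(n), x])      # design matrix with an intercept column

b = np.linalg.solve(X.T @ X, X.T @ y)     # b = (X'X)^{-1} X'y
y_hat = X @ b
e = y - y_hat                             # residuals

sse = np.sum(e ** 2)
mse = sse / (n - p - 1)                   # MSE = SSE / (n - p - 1)
rse = np.sqrt(mse)                        # residual standard error
```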
Other Numerical Results
$\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$
$\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$
$e = y - \hat{y}$
$\mathrm{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$ = total variability
$\mathrm{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$ = explained
$\mathrm{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ = unexplained
$\mathrm{SST} = \mathrm{SSR} + \mathrm{SSE}$
$R^2 = \mathrm{SSR}/\mathrm{SST}$
$R^2_{adj} = 1 - \dfrac{\mathrm{MSE}}{s_y^2} = 1 - (1 - R^2)\left(\dfrac{n-1}{n-p-1}\right)$

Key Ideas
• $R^2$ is a poor measure for model comparison because it will increase simply by adding more predictors to a model.
• Polynomials do not change consistently with unit increases of their variable, i.e. there is no constant slope.
• Only $w - 1$ dummy variables are needed to represent $w$ classes of a categorical predictor; one of the classes acts as a baseline.
• In effect, dummy variables define a distinct intercept for each class. Without an interaction between a dummy variable and a predictor, the dummy variable cannot additionally affect that predictor's regression coefficient.

MLR Inferences

Notation
$\hat{\beta}_j$: Estimator for $\beta_j$
$\hat{Y}$: Estimator for $\mathrm{E}[Y]$
$se$: Estimated standard error
$H_0$: Null hypothesis
$H_1$: Alternative hypothesis
df: Degrees of freedom
$t_{1-q,\,df}$: $q$ quantile of a $t$-distribution
$\alpha$: Significance level
$k$: Confidence level
ndf: Numerator degrees of freedom
ddf: Denominator degrees of freedom
$F_{1-q,\,ndf,\,ddf}$: $q$ quantile of an $F$-distribution
$Y_{n+1}$: Response of new observation
Subscript $r$: Reduced model
Subscript $f$: Full model

Standard Errors
$se_{b_j} = \sqrt{\widehat{\mathrm{Var}}\big[\hat{\beta}_j\big]}$

Variance-Covariance Matrix
$\widehat{\mathrm{Var}}\big[\hat{\boldsymbol{\beta}}\big] = \mathrm{MSE}\,(\mathbf{X}^T\mathbf{X})^{-1} =
\begin{pmatrix}
\widehat{\mathrm{Var}}[\hat{\beta}_0] & \widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_1] & \cdots & \widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_p] \\
\widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_1] & \widehat{\mathrm{Var}}[\hat{\beta}_1] & \cdots & \widehat{\mathrm{Cov}}[\hat{\beta}_1, \hat{\beta}_p] \\
\vdots & \vdots & \ddots & \vdots \\
\widehat{\mathrm{Cov}}[\hat{\beta}_0, \hat{\beta}_p] & \widehat{\mathrm{Cov}}[\hat{\beta}_1, \hat{\beta}_p] & \cdots & \widehat{\mathrm{Var}}[\hat{\beta}_p]
\end{pmatrix}$

$t$ Tests
$t \text{ statistic} = \dfrac{\text{estimate} - \text{hypothesized value}}{\text{standard error}}$

Test Type | Rejection Region
Two-tailed | $|t \text{ statistic}| \ge t_{\alpha/2,\,n-p-1}$
Left-tailed | $t \text{ statistic} \le -t_{\alpha,\,n-p-1}$
Right-tailed | $t \text{ statistic} \ge t_{\alpha,\,n-p-1}$

$F$ Tests
$F \text{ statistic} = \dfrac{\mathrm{MSR}}{\mathrm{MSE}} = \dfrac{\mathrm{SSR}/p}{\mathrm{SSE}/(n-p-1)}$
Reject $H_0$ if $F$ statistic $\ge F_{\alpha,\,ndf,\,ddf}$
• ndf $= p$
• ddf $= n - p - 1$
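Continuing in the same spirit, here is a short sketch of the inference quantities above: the estimated variance-covariance matrix $\mathrm{MSE}\,(\mathbf{X}^T\mathbf{X})^{-1}$, the standard errors and $t$ statistics, and the overall $F$ statistic. The data are the same hypothetical values as in the OLS sketch.

```python
import numpy as np

# Same hypothetical data as the OLS sketch above.
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 11.0, 11.8])
n, p = x.shape
X = np.column_stack([np.ones(n), x])

b = np.linalg.solve(X.T @ X, X.T @ y)
sse = np.sum((y - X @ b) ** 2)
mse = sse / (n - p - 1)

var_b = mse * np.linalg.inv(X.T @ X)      # estimated variance-covariance matrix of the b's
se_b = np.sqrt(np.diag(var_b))            # se(b_j)
t_stats = b / se_b                        # t statistics for H0: beta_j = 0

sst = np.sum((y - y.mean()) ** 2)
f_stat = ((sst - sse) / p) / (sse / (n - p - 1))   # F = (SSR/p) / (SSE/(n - p - 1))
```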



Partial $F$ Tests
$F \text{ statistic} = \dfrac{(\mathrm{SSE}_r - \mathrm{SSE}_f)/(p_f - p_r)}{\mathrm{SSE}_f/(n - p_f - 1)}$
Reject $H_0$ if $F$ statistic $\ge F_{\alpha,\,ndf,\,ddf}$
• ndf $= p_f - p_r$
• ddf $= n - p_f - 1$
For all hypothesis tests, reject $H_0$ if $p$-value $\le \alpha$.

Confidence and Prediction Intervals
estimate $\pm$ ($t$ quantile)(standard error)

Quantity | Interval Expression
$\beta_j$ | $b_j \pm t_{(1-k)/2,\,n-p-1} \cdot se_{b_j}$
$\mathrm{E}[Y]$ | $\hat{y} \pm t_{(1-k)/2,\,n-p-1} \cdot se_{\hat{y}}$
$Y_{n+1}$ | $\hat{y}_{n+1} \pm t_{(1-k)/2,\,n-p-1} \cdot se_{\hat{y}_{n+1}}$

Linear Model Assumptions

Leverage
$h_i = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i = \dfrac{se_{\hat{y}_i}^2}{\mathrm{MSE}}$
$h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{k=1}^n (x_k - \bar{x})^2}$ for SLR
• $1/n \le h_i \le 1$
• $\sum_{i=1}^n h_i = p + 1$

Cook's Distance
$D_i = \dfrac{\sum_{k=1}^n \big(\hat{y}_k - \hat{y}_{(i)k}\big)^2}{\mathrm{MSE}(p+1)} = \dfrac{e_i^2\, h_i}{\mathrm{MSE}(p+1)(1 - h_i)^2}$

Plots of Residuals
• $e$ versus $\hat{y}$: Residuals are well-behaved if points appear to be randomly scattered, residuals seem to average to 0, and the spread of residuals does not change.
• $e$ versus $i$: Detects dependence of error terms.
• $qq$ plot of $e$

Variance Inflation Factor
$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2} = \dfrac{s_{x_j}^2 (n-1)}{\mathrm{MSE}}\, se_{b_j}^2$
Tolerance is the reciprocal of VIF.

Key Ideas
• As realizations of a $t$-distribution, studentized residuals can help identify outliers.
• When residuals have a larger spread for larger predictions, one solution is to transform the response variable with a concave function.
• There is no universal approach to handling multicollinearity; it is even possible to accept it, such as when there is a suppressor variable. On the other hand, it can be eliminated by using a set of orthogonal predictors.
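The leverage and Cook's distance formulas above translate directly into a few lines of numpy; the sketch below, using the same hypothetical data as earlier, is only meant to show how the hat matrix ties the diagnostics together.

```python
import numpy as np

# Same hypothetical data as the OLS sketches above.
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 11.0, 11.8])
n, p = x.shape
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
h = np.diag(H)                            # leverages: 1/n <= h_i <= 1 and sum(h) = p + 1

e = y - H @ y                             # residuals
mse = np.sum(e ** 2) / (n - p - 1)

cooks_d = e ** 2 * h / (mse * (p + 1) * (1 - h) ** 2)   # Cook's distance for each observation
```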
Model Selection

Notation
$g$: Total no. of predictors in consideration
$p$: No. of predictors for a specific model
$\mathrm{MSE}_g$: MSE of the model that uses all $g$ predictors
$M_p$: The "best" model with $p$ predictors

Best Subset Selection
1. For $p = 0, 1, \dots, g$, fit all $\binom{g}{p}$ models with $p$ predictors. The model with the largest $R^2$ is $M_p$.
2. Choose the best model among $M_0, \dots, M_g$ using a selection criterion of choice.

Forward Stepwise Selection
1. Fit all $g$ simple linear regression models. The model with the largest $R^2$ is $M_1$.
2. For $p = 2, \dots, g$, fit the models that add one of the remaining predictors to $M_{p-1}$. The model with the largest $R^2$ is $M_p$.
3. Choose the best model among $M_0, \dots, M_g$ using a selection criterion of choice.

Backward Stepwise Selection
1. Fit the model with all $g$ predictors, $M_g$.
2. For $p = g-1, \dots, 1$, fit the models that drop one of the predictors from $M_{p+1}$. The model with the largest $R^2$ is $M_p$.
3. Choose the best model among $M_0, \dots, M_g$ using a selection criterion of choice.

Selection Criteria
• Mallows' $C_p$:
$C_p = \dfrac{\mathrm{SSE} + 2p \cdot \mathrm{MSE}_g}{n} \qquad \text{or} \qquad C_p = \dfrac{\mathrm{SSE}}{\mathrm{MSE}_g} - n + 2(p+1)$
• Akaike information criterion: $\mathrm{AIC} = \dfrac{\mathrm{SSE} + 2p \cdot \mathrm{MSE}_g}{n \cdot \mathrm{MSE}_g}$
• Bayesian information criterion: $\mathrm{BIC} = \dfrac{\mathrm{SSE} + \ln n \cdot p \cdot \mathrm{MSE}_g}{n \cdot \mathrm{MSE}_g}$
• Adjusted $R^2$
• Cross-validation error

Validation Set
• Randomly splits all available observations into two groups: the training set and the validation set.
• Only the observations in the training set are used to attain the fitted model, and those in the validation set are used to estimate the test MSE.

$k$-fold Cross-Validation
1. Randomly divide all available observations into $k$ folds.
2. For $v = 1, \dots, k$, obtain the $v$th fit by training with all observations except those in the $v$th fold.
3. For $v = 1, \dots, k$, use $\hat{y}$ from the $v$th fit to calculate a test MSE estimate with observations in the $v$th fold.
4. To calculate the CV error, average the $k$ test MSE estimates in the previous step.

Leave-one-out Cross-Validation (LOOCV)
• Calculate the LOOCV error as a special case of $k$-fold cross-validation where $k = n$.
• For MLR: $\text{LOOCV Error} = \dfrac{1}{n}\sum_{i=1}^n \left(\dfrac{y_i - \hat{y}_i}{1 - h_i}\right)^2$

Key Ideas on Cross-Validation
• The validation set approach has unstable results and will tend to overestimate the test MSE. The two other approaches mitigate these issues.
• With respect to bias, LOOCV < $k$-fold CV < Validation Set.
• With respect to variance, LOOCV > $k$-fold CV > Validation Set.
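A minimal sketch of the $k$-fold cross-validation steps above, using OLS as the model being validated. The data are simulated placeholders, and a library routine would normally handle the fold bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data for illustration; OLS is the model being cross-validated.
n, p, k = 60, 3, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.8, size=n)

folds = np.array_split(rng.permutation(n), k)        # step 1: randomly divide observations into k folds

test_mses = []
for v in range(k):                                   # steps 2 and 3: fit without fold v, score on fold v
    test_idx = folds[v]
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    b = np.linalg.solve(X[train_idx].T @ X[train_idx], X[train_idx].T @ y[train_idx])
    test_mses.append(np.mean((y[test_idx] - X[test_idx] @ b) ** 2))

cv_error = np.mean(test_mses)                        # step 4: average the k test MSE estimates
```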



Other Regression Approaches

Standardizing Variables
• A centered variable is the result of subtracting the sample mean from a variable.
• A scaled variable is the result of dividing a variable by its sample standard deviation.
• A standardized variable is the result of first centering a variable, then scaling it.

Ridge Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^p b_j^2 \le a$, or equivalently, by minimizing the expression $\mathrm{SSE} + \lambda \sum_{j=1}^p b_j^2$.

Lasso Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^p |b_j| \le a$, or equivalently, by minimizing the expression $\mathrm{SSE} + \lambda \sum_{j=1}^p |b_j|$.

Key Ideas on Ridge and Lasso
• $x_1, \dots, x_p$ are scaled predictors.
• $\lambda$ is inversely related to flexibility.
• With a finite $\lambda$, none of the ridge estimates will equal 0, but the lasso estimates could equal 0.

Partial Least Squares
• The first partial least squares direction, $z_1$, is a linear combination of the standardized predictors $x_1, \dots, x_p$, with coefficients based on the relation between $x_j$ and $y$.
• Every subsequent partial least squares direction is calculated iteratively as a linear combination of "updated predictors", which are the residuals of fits with the "previous predictors" explained by the previous direction.
• The directions $z_1, \dots, z_g$ are used as predictors in a multiple linear regression. The number of directions, $g$, is a measure of flexibility.

Weighted Least Squares
• $\mathrm{Var}[\varepsilon_i] = \sigma^2 / w_i$
• Equivalent to running OLS with $\sqrt{w}\,y$ as the response and $\sqrt{w}\,\mathbf{x}$ as the predictors, hence minimizing $\sum_{i=1}^n w_i (y_i - \hat{y}_i)^2$.
• $\mathbf{b} = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}\mathbf{y}$ where $\mathbf{W}$ is the diagonal matrix of the weights.

$k$-Nearest Neighbors (KNN)
1. Identify the "center of the neighborhood", i.e. the location of an observation with inputs $x_1, \dots, x_p$.
2. Starting from the "center of the neighborhood", identify the $k$ nearest training observations.
3. For classification, $\hat{y}$ is the most frequent category among the $k$ observations; for regression, $\hat{y}$ is the average of the response among the $k$ observations.
$k$ is inversely related to flexibility.
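The KNN steps above amount to sorting distances and aggregating the $k$ closest responses. A small sketch with hypothetical training data and Euclidean distance follows; `knn_predict` is an illustrative helper, not a library function.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k, classification=False):
    """Predict at x_new from the k nearest training observations (Euclidean distance)."""
    dist = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))       # distances to the neighborhood center
    nearest = np.argsort(dist)[:k]                               # indices of the k nearest observations
    if classification:
        return Counter(y_train[nearest]).most_common(1)[0][0]    # most frequent category
    return np.mean(y_train[nearest])                             # average response

# Hypothetical training data and a new observation.
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 3.5], [4.0, 4.0]])
y_train = np.array([10.0, 12.0, 20.0, 22.0])
print(knn_predict(X_train, y_train, np.array([2.5, 2.0]), k=2))
```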




Key Results for Distributions in the Exponential Family

Distribution | Probability Function | $\theta$ | $\phi$ | $b(\theta)$ | Canonical Link, $b'^{-1}(\mu)$
Normal | $\dfrac{1}{\sigma\sqrt{2\pi}}\exp\left(-\dfrac{(y-\mu)^2}{2\sigma^2}\right)$ | $\mu$ | $\sigma^2$ | $\dfrac{\theta^2}{2}$ | $\mu$
Binomial (fixed $n$) | $\dbinom{n}{y}\pi^y(1-\pi)^{n-y}$ | $\ln\left(\dfrac{\pi}{1-\pi}\right)$ | $1$ | $n\ln(1+e^\theta)$ | $\ln\left(\dfrac{\mu}{n-\mu}\right)$
Poisson | $\dfrac{\lambda^y}{y!}\exp(-\lambda)$ | $\ln\lambda$ | $1$ | $e^\theta$ | $\ln\mu$
Negative Binomial (fixed $r$) | $\dfrac{\Gamma(y+r)}{y!\,\Gamma(r)}\,p^r(1-p)^y$ | $\ln(1-p)$ | $1$ | $-r\ln(1-e^\theta)$ | $\ln\left(\dfrac{\mu}{r+\mu}\right)$
Gamma | $\dfrac{\gamma^\alpha}{\Gamma(\alpha)}\,y^{\alpha-1}\exp(-y\gamma)$ | $-\dfrac{\gamma}{\alpha}$ | $\dfrac{1}{\alpha}$ | $-\ln(-\theta)$ | $-\dfrac{1}{\mu}$
Inverse Gaussian | $\sqrt{\dfrac{\lambda}{2\pi y^3}}\exp\left(-\dfrac{\lambda(y-\mu)^2}{2\mu^2 y}\right)$ | $-\dfrac{1}{2\mu^2}$ | $\dfrac{1}{\lambda}$ | $-\sqrt{-2\theta}$ | $-\dfrac{1}{2\mu^2}$



NON-LINEAR MODELS

Generalized Linear Models

Notation
$\theta$, $\phi$: Linear exponential family parameters
$\mathrm{E}[Y]$, $\mu$: Mean response
$b'(\theta)$: Mean function
$v(\mu)$: Variance function
$h(\mu)$: Link function
$\mathbf{b}$: Maximum likelihood estimate of $\boldsymbol{\beta}$
$l(\mathbf{b})$: Maximized log-likelihood
$l_0$: Maximized log-likelihood for the null model
$l_{SAT}$: Maximized log-likelihood for the saturated model
$e$: Residual
$\mathbf{I}$: Information matrix
$\chi^2_{1-q,\,df}$: $q$ quantile of a chi-square distribution
$D^*$: Scaled deviance
$D$: Deviance statistic

Linear Exponential Family
Prob. fn. of $Y = \exp\left(\dfrac{y\theta - b(\theta)}{\phi} + a(y, \phi)\right)$
$\mathrm{E}[Y] = b'(\theta)$
$\mathrm{Var}[Y] = \phi \cdot b''(\theta) = \phi \cdot v(\mu)$

Model Framework
• $h(\mu) = \mathbf{x}^T\boldsymbol{\beta}$
• The canonical link is the link function where $h(\mu) = b'^{-1}(\mu)$.

Parameter Estimation
$l(\boldsymbol{\beta}) = \sum_{i=1}^n \left(\dfrac{y_i\theta_i - b(\theta_i)}{\phi} + a(y_i, \phi)\right)$ where $\theta_i = b'^{-1}\!\left(h^{-1}(\mathbf{x}_i^T\boldsymbol{\beta})\right)$
The score equations are the partial derivatives of $l(\boldsymbol{\beta})$ with respect to each $\beta_j$, all set equal to 0. The solution to the score equations is $\mathbf{b}$. Then, $\hat{\mu} = h^{-1}(\mathbf{x}^T\mathbf{b})$. (A worked sketch appears at the end of this section.)

Numerical Results
$D^* = 2[l_{SAT} - l(\mathbf{b})]$
$D = \phi D^*$
For MLR, $D = \mathrm{SSE}$
Max-scaled $R^2$: $\quad R^2_{ms} = \dfrac{1 - \exp\{2[l_0 - l(\mathbf{b})]/n\}}{1 - \exp\{2l_0/n\}}$
Pseudo-$R^2$: $\quad R^2_{pseudo} = \dfrac{l(\mathbf{b}) - l_0}{l_{SAT} - l_0}$
$\mathrm{AIC}^* = -2 \cdot l(\mathbf{b}) + 2 \cdot (p+1)$
$\mathrm{BIC}^* = -2 \cdot l(\mathbf{b}) + \ln n \cdot (p+1)$
*Assumes only $\boldsymbol{\beta}$ needs to be estimated. If estimating $\phi$ is required, replace $p+1$ with $p+2$.

Residuals
Raw residual: $e_i = y_i - \hat{\mu}_i$
Pearson residual: $e_i = \dfrac{y_i - \hat{\mu}_i}{\sqrt{\phi \cdot v(\hat{\mu}_i)}}$. The Pearson chi-square statistic is $\sum_{i=1}^n e_i^2$.
Deviance residual: $e_i = \pm\sqrt{D_i^*}$, whose sign follows the $i$th raw residual.
Anscombe residual: $e_i = \dfrac{t(y_i) - \hat{\mathrm{E}}[t(Y_i)]}{\sqrt{\widehat{\mathrm{Var}}[t(Y_i)]}}$

Inference
• Maximum likelihood estimators $\hat{\boldsymbol{\beta}}$ asymptotically have a multivariate normal distribution with mean $\boldsymbol{\beta}$ and asymptotic variance-covariance matrix $\mathbf{I}^{-1}$.
• To address overdispersion, change the variance to $\mathrm{Var}[Y_i] = \delta \cdot \phi_i \cdot b''(\theta_i)$ and estimate $\delta$ as the Pearson chi-square statistic divided by $n - p - 1$.

Likelihood Ratio Tests
$\chi^2 \text{ statistic} = 2[l(\mathbf{b}_f) - l(\mathbf{b}_r)]$
Reject $H_0$ if $\chi^2$ statistic $\ge \chi^2_{\alpha,\,p_f - p_r}$

Goodness-of-Fit Tests
$Y$ follows a distribution of choice with $g$ free parameters, whose domain is split into $w$ mutually exclusive intervals.
$\chi^2 \text{ statistic} = \sum_{c=1}^{w} \dfrac{(n_c - n q_c)^2}{n q_c}$
Reject $H_0$ if $\chi^2$ statistic $\ge \chi^2_{\alpha,\,w-g-1}$

Tweedie Distribution
$\mathrm{E}[Y] = \mu$, $\quad \mathrm{Var}[Y] = \phi \cdot \mu^d$

Distribution | $d$
Normal | 0
Poisson | 1
Tweedie | (1, 2)
Gamma | 2
Inverse Gaussian | 3
equations is 𝐛𝐛. Then, 𝜇𝜇̂ = ℎ"#(𝐱𝐱 ! 𝐛𝐛).



Logistic and Probit Regression
• The odds of an event are the ratio of the probability that the event will occur to the probability that the event will not occur.
• The odds ratio is the ratio of the odds of an event with the presence of a characteristic to the odds of the same event without the presence of that characteristic.

Binary Response

Function Name | $h(\mu)$
Logit | $\ln\left(\dfrac{\mu}{1-\mu}\right)$
Probit | $\Phi^{-1}(\mu)$
Complementary log-log | $\ln(-\ln(1-\mu))$

$l(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i \ln\mu_i + (1 - y_i)\ln(1 - \mu_i)\right]$
$\dfrac{\partial}{\partial\boldsymbol{\beta}} l(\boldsymbol{\beta}) = \sum_{i=1}^n \mathbf{x}_i (y_i - \mu_i)\dfrac{\mu_i'}{\mu_i(1-\mu_i)} = \mathbf{0}$
$D = 2\sum_{i=1}^n \left[y_i \ln\left(\dfrac{y_i}{\hat{\mu}_i}\right) + (1 - y_i)\ln\left(\dfrac{1-y_i}{1-\hat{\mu}_i}\right)\right]$
Pearson residual: $e_i = \dfrac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i(1-\hat{\mu}_i)}}$
Pearson chi-square statistic $= \sum_{i=1}^n \dfrac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i(1-\hat{\mu}_i)}$
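A short sketch of the binary-response quantities above for the logit link: fitted means, odds, the log-likelihood, the deviance, and Pearson residuals. The responses and coefficient values are hypothetical; they stand in for the output of an actual logistic fit.

```python
import numpy as np

# Hypothetical binary responses, design matrix, and coefficient estimates from a logistic fit.
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
X = np.column_stack([np.ones(8), np.array([0.5, -1.2, 1.0, 2.3, -0.4, -2.0, 1.8, 0.2])])
b = np.array([-0.1, 1.4])

mu = 1 / (1 + np.exp(-(X @ b)))                  # logit link: mu = e^{x'b} / (1 + e^{x'b})
odds = mu / (1 - mu)                             # odds of the event for each observation

loglik = np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# For 0/1 responses the deviance reduces to -2 ln(mu) when y = 1 and -2 ln(1 - mu) when y = 0.
deviance = 2 * np.sum(np.where(y == 1, -np.log(mu), -np.log(1 - mu)))

pearson_resid = (y - mu) / np.sqrt(mu * (1 - mu))
pearson_chi_sq = np.sum(pearson_resid ** 2)
```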
Nominal Response – Generalized Logit
Let $\pi_{i,c}$ be the probability that the $i$th observation is classified as category $c$, and let $k$ be the reference category.
$\ln\left(\dfrac{\pi_{i,c}}{\pi_{i,k}}\right) = \mathbf{x}_i^T\boldsymbol{\beta}_c$
$\pi_{i,c} = \begin{cases} \dfrac{\exp(\mathbf{x}_i^T\boldsymbol{\beta}_c)}{1 + \sum_{u \neq k}\exp(\mathbf{x}_i^T\boldsymbol{\beta}_u)}, & c \neq k \\[2ex] \dfrac{1}{1 + \sum_{u \neq k}\exp(\mathbf{x}_i^T\boldsymbol{\beta}_u)}, & c = k \end{cases}$
$l(\boldsymbol{\beta}) = \sum_{i=1}^n \sum_{c} I(y_i = c)\ln\pi_{i,c}$

Ordinal Response – Proportional Odds Cumulative
$h(\Pi_c) = \alpha_c + \mathbf{x}_i^T\boldsymbol{\beta}$ where
• $\Pi_c = \pi_1 + \cdots + \pi_c$
• $\mathbf{x}_i = (x_{i,1}, \dots, x_{i,p})^T$, $\quad \boldsymbol{\beta} = (\beta_1, \dots, \beta_p)^T$

Poisson Count Regression
$\ln\mu = \mathbf{x}^T\boldsymbol{\beta}$
$l(\boldsymbol{\beta}) = \sum_{i=1}^n \left[y_i\ln\mu_i - \mu_i - \ln(y_i!)\right]$
$\dfrac{\partial}{\partial\boldsymbol{\beta}} l(\boldsymbol{\beta}) = \sum_{i=1}^n \mathbf{x}_i(y_i - \mu_i) = \mathbf{0}$
$\mathbf{I} = \sum_{i=1}^n \mu_i \mathbf{x}_i\mathbf{x}_i^T$
$D = 2\sum_{i=1}^n \left[y_i\left(\ln\left(\dfrac{y_i}{\hat{\mu}_i}\right) - 1\right) + \hat{\mu}_i\right]$
Pearson residual: $e_i = \dfrac{y_i - \hat{\mu}_i}{\sqrt{\hat{\mu}_i}}$
Pearson chi-square statistic $= \sum_{i=1}^n \dfrac{(y_i - \hat{\mu}_i)^2}{\hat{\mu}_i}$

Poisson Regression with Exposures
$\ln\mu = \ln w + \mathbf{x}^T\boldsymbol{\beta}$

Alternative Count Models
These models can incorporate a Poisson distribution while letting the mean of the response differ from the variance of the response:

Model | Mean < Variance | Mean > Variance
Negative binomial | Yes | No
Zero-inflated | Yes | No
Hurdle | Yes | Yes
Heterogeneity | Yes | No





TIME SERIES

Trend Models

Notation
Subscript $t$: Index for observations
$T_t$: Trends in time
$S_t$: Seasonal trends
$\varepsilon_t$: Random patterns
$\hat{y}_{n+l}$: $l$-step ahead forecast
$se$: Estimated standard error
$t_{1-q,\,df}$: $q$ quantile of a $t$-distribution
$n_1$: Training sample size
$n_2$: Test sample size

Trends
Additive: $Y_t = T_t + S_t + \varepsilon_t$
Multiplicative: $Y_t = T_t \times S_t + \varepsilon_t$

Stationarity
Stationarity describes how something does not vary with respect to time. Control charts can be used to identify stationarity.

White Noise
$\hat{y}_{n+l} = \bar{y}$
$se_{\hat{y}_{n+l}} = s_y\sqrt{1 + 1/n}$
The $100k\%$ prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm t_{(1-k)/2,\,n-1} \cdot se_{\hat{y}_{n+l}}$

Random Walk
$w_t = y_t - y_{t-1}$
$\hat{y}_{n+l} = y_n + l\bar{w}$
$se_{\hat{y}_{n+l}} = s_w\sqrt{l}$
An approximate 95% prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm 2 \cdot se_{\hat{y}_{n+l}}$

Model Comparison
$\mathrm{ME} = \dfrac{1}{n_2}\sum_{t=n_1+1}^{n_1+n_2} e_t$
$\mathrm{MPE} = 100 \cdot \dfrac{1}{n_2}\sum_{t=n_1+1}^{n_1+n_2} \dfrac{e_t}{y_t}$
$\mathrm{MSE} = \dfrac{1}{n_2}\sum_{t=n_1+1}^{n_1+n_2} e_t^2$
$\mathrm{MAE} = \dfrac{1}{n_2}\sum_{t=n_1+1}^{n_1+n_2} |e_t|$
$\mathrm{MAPE} = 100 \cdot \dfrac{1}{n_2}\sum_{t=n_1+1}^{n_1+n_2} \left|\dfrac{e_t}{y_t}\right|$

Autoregressive Models

Notation
$\rho_k$: Lag $k$ autocorrelation
$r_k$: Lag $k$ sample autocorrelation
$\sigma^2$: Variance of white noise
$s^2$: Estimate of $\sigma^2$
$b_0$: Estimate of $\beta_0$
$b_1$: Estimate of $\beta_1$
$\bar{y}_-$: Sample mean of the first $n-1$ observations
$\bar{y}_+$: Sample mean of the last $n-1$ observations

Autocorrelation
$r_k = \dfrac{\sum_{t=k+1}^n (y_{t-k} - \bar{y})(y_t - \bar{y})}{\sum_{t=1}^n (y_t - \bar{y})^2}$
To test $H_0: \rho_k = 0$ against $H_1: \rho_k \neq 0$:
• $se_{r_k} = 1/\sqrt{n}$
• test statistic $= r_k / se_{r_k}$

AR(1) Model
$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t$

Assumptions
1. $\mathrm{E}[\varepsilon_t] = 0$
2. $\mathrm{Var}[\varepsilon_t] = \sigma^2$
3. $\mathrm{Cov}[\varepsilon_{t+k}, Y_t] = 0$ for $k > 0$

• If $\beta_1 = 0$, $Y_t$ follows a white noise process.
• If $\beta_1 = 1$, $Y_t$ follows a random walk process.
• If $-1 < \beta_1 < 1$, $Y_t$ is stationary.

Properties of the Stationary AR(1) Model
$\mathrm{E}[Y_t] = \dfrac{\beta_0}{1 - \beta_1}$
$\mathrm{Var}[Y_t] = \dfrac{\sigma^2}{1 - \beta_1^2}$
$\rho_k = \beta_1^k$

Estimation
$b_1 = \dfrac{\sum_{t=2}^n (y_{t-1} - \bar{y}_-)(y_t - \bar{y}_+)}{\sum_{t=2}^n (y_{t-1} - \bar{y}_-)^2}$
$b_0 = \bar{y}_+ - b_1\bar{y}_-$
$s^2 = \dfrac{\sum_{t=2}^n e_t^2}{n - 3}$
$\widehat{\mathrm{Var}}[Y_t] = \dfrac{s^2}{1 - b_1^2}$

Smoothing and Predictions
$\hat{y}_t = b_0 + b_1 y_{t-1}, \quad 2 \le t \le n$
$\hat{y}_{n+l} = \begin{cases} b_0 + b_1 y_{n+l-1}, & l = 1 \\ b_0 + b_1\hat{y}_{n+l-1}, & l > 1 \end{cases}$
$se_{\hat{y}_{n+l}} = s\sqrt{1 + b_1^2 + b_1^4 + \cdots + b_1^{2(l-1)}}$
The $100k\%$ prediction interval for $y_{n+l}$ is $\hat{y}_{n+l} \pm t_{(1-k)/2,\,n-3} \cdot se_{\hat{y}_{n+l}}$

Other Time Series Models

Notation
$k$: Moving average length
$w$: Smoothing parameter
$g$: Seasonal base
$d$: No. of trigonometric functions

Smoothing with Moving Averages
$Y_t = \beta_0 + \varepsilon_t$

Smoothing
$\hat{s}_t = \dfrac{y_t + y_{t-1} + \cdots + y_{t-k+1}}{k}$
$\hat{s}_t = \hat{s}_{t-1} + \dfrac{y_t - y_{t-k}}{k}$

Predictions
$b_0 = \hat{s}_n$
$\hat{y}_{n+l} = b_0$

Double Smoothing with Moving Averages
$Y_t = \beta_0 + \beta_1 t + \varepsilon_t$

Smoothing
$\hat{s}_t^{(2)} = \dfrac{\hat{s}_t + \hat{s}_{t-1} + \cdots + \hat{s}_{t-k+1}}{k}$

Predictions
$b_0 = \hat{s}_n$
$b_1 = \dfrac{2\left(\hat{s}_n - \hat{s}_n^{(2)}\right)}{k - 1}$
$\hat{y}_{n+l} = b_0 + b_1 \cdot l$
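The AR(1) estimation and forecasting formulas above can be checked with a few lines of numpy. The series below is simulated purely for illustration; dedicated time series routines would normally be used.

```python
import numpy as np

# Simulate a hypothetical stationary AR(1) series.
rng = np.random.default_rng(3)
n = 100
y = np.zeros(n)
for t in range(1, n):
    y[t] = 2.0 + 0.6 * y[t - 1] + rng.normal(scale=0.5)

# Conditional least squares estimates of beta0 and beta1.
y_lag, y_cur = y[:-1], y[1:]
ybar_minus, ybar_plus = y_lag.mean(), y_cur.mean()    # means of the first / last n - 1 observations
b1 = np.sum((y_lag - ybar_minus) * (y_cur - ybar_plus)) / np.sum((y_lag - ybar_minus) ** 2)
b0 = ybar_plus - b1 * ybar_minus

e = y_cur - (b0 + b1 * y_lag)                         # one-step-ahead residuals
s2 = np.sum(e ** 2) / (n - 3)                         # s^2 on n - 3 degrees of freedom

# l-step-ahead forecasts chain the recursion y-hat_{n+l} = b0 + b1 * y-hat_{n+l-1}.
forecasts, se_l, last = [], [], y[-1]
for l in range(1, 6):
    last = b0 + b1 * last
    forecasts.append(last)
    se_l.append(np.sqrt(s2 * np.sum(b1 ** (2 * np.arange(l)))))  # s*sqrt(1 + b1^2 + ... + b1^(2(l-1)))
```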



Exponential Smoothing
$Y_t = \beta_0 + \varepsilon_t$

Smoothing
$\hat{s}_t = (1 - w)(y_t + wy_{t-1} + \cdots + w^t y_0)$
$\hat{s}_t = (1 - w)y_t + w\hat{s}_{t-1}, \quad 0 \le w < 1$
The value of $w$ is determined by minimizing $\sum_{t=1}^n (y_t - \hat{s}_{t-1})^2$.

Predictions
$b_0 = \hat{s}_n$
$\hat{y}_{n+l} = b_0$

Double Exponential Smoothing
$Y_t = \beta_0 + \beta_1 t + \varepsilon_t$

Smoothing
$\hat{s}_t^{(2)} = (1 - w)(\hat{s}_t + w\hat{s}_{t-1} + \cdots + w^t\hat{s}_0)$

Predictions
$b_0 = 2\hat{s}_n - \hat{s}_n^{(2)}$
$b_1 = \dfrac{1 - w}{w}\left(\hat{s}_n - \hat{s}_n^{(2)}\right)$
$\hat{y}_{n+l} = b_0 + b_1 \cdot l$

Key Ideas for Smoothing
• Single smoothing is only appropriate for time series data without a linear trend.
• It is related to weighted least squares.
• A double smoothing procedure can be used to forecast time series data with a linear trend.
• Holt-Winter double exponential smoothing is a generalization of double exponential smoothing.
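A small sketch of single and double exponential smoothing as defined above. The initialization $\hat{s}_0 = y_0$ and the simulated trending series are assumptions made for illustration.

```python
import numpy as np

def exp_smooth(series, w):
    """Single exponential smoothing: s_t = (1 - w) * y_t + w * s_{t-1}."""
    s = np.empty_like(series, dtype=float)
    s[0] = series[0]                      # initialization choice (an assumption)
    for t in range(1, len(series)):
        s[t] = (1 - w) * series[t] + w * s[t - 1]
    return s

# Hypothetical series with a linear trend, so double smoothing is the appropriate procedure.
rng = np.random.default_rng(0)
y = 5.0 + 0.4 * np.arange(40) + rng.normal(scale=0.6, size=40)
w = 0.7

s1 = exp_smooth(y, w)                     # singly smoothed series
s2 = exp_smooth(s1, w)                    # doubly smoothed series

b0 = 2 * s1[-1] - s2[-1]                  # b0 = 2 * s_n - s_n^(2)
b1 = (1 - w) / w * (s1[-1] - s2[-1])      # b1 = ((1 - w) / w) * (s_n - s_n^(2))
forecasts = [b0 + b1 * l for l in range(1, 6)]   # y-hat_{n+l} = b0 + b1 * l
```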
Seasonal Time Series Models

Fixed Seasonal Effects – Trigonometric Functions
$S_t = \sum_{i=1}^{d} \left[\beta_{1,i}\sin(f_i t) + \beta_{2,i}\cos(f_i t)\right]$
• $f_i = 2\pi i / g$
• $d \le g/2$

Seasonal Autoregressive Models, SAR($p$)
$Y_t = \beta_0 + \beta_1 Y_{t-g} + \cdots + \beta_p Y_{t-pg} + \varepsilon_t$

Holt-Winter Seasonal Additive Model
$Y_t = \beta_0 + \beta_1 t + S_t + \varepsilon_t$
• $S_t = S_{t-g}$
• $\sum_{t=1}^{g} S_t = 0$

Unit Root Test
• A unit root test is used to evaluate the fit of a random walk model.
• A random walk model is a good fit if the time series possesses a unit root.
• The Dickey-Fuller test and augmented Dickey-Fuller test are two examples of unit root tests.

Volatility Models

ARCH($p$) Model
$\sigma_t^2 = \theta + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2$

GARCH($p$, $q$) Model
$\sigma_t^2 = \theta + \gamma_1\varepsilon_{t-1}^2 + \cdots + \gamma_p\varepsilon_{t-p}^2 + \delta_1\sigma_{t-1}^2 + \cdots + \delta_q\sigma_{t-q}^2$
$\mathrm{Var}[\varepsilon_t] = \dfrac{\theta}{1 - \sum_{j=1}^p \gamma_j - \sum_{j=1}^q \delta_j}$

Assumptions
• $\theta > 0$
• $\gamma_j \ge 0$
• $\delta_j \ge 0$
• $\sum_{j=1}^p \gamma_j + \sum_{j=1}^q \delta_j < 1$
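The GARCH recursion above is easy to simulate directly, which also shows how the stationarity constraints keep the unconditional variance finite. The parameter values below are hypothetical choices for illustration.

```python
import numpy as np

# Illustrative GARCH(1, 1) parameters satisfying the constraints above.
theta, gamma1, delta1 = 0.05, 0.10, 0.85       # theta > 0; gamma1, delta1 >= 0; gamma1 + delta1 < 1

rng = np.random.default_rng(11)
n = 5000
eps = np.zeros(n)
sigma2 = np.zeros(n)
sigma2[0] = theta / (1 - gamma1 - delta1)      # start at the unconditional variance Var[eps_t]
eps[0] = rng.normal(scale=np.sqrt(sigma2[0]))

for t in range(1, n):
    # sigma_t^2 = theta + gamma1 * eps_{t-1}^2 + delta1 * sigma_{t-1}^2
    sigma2[t] = theta + gamma1 * eps[t - 1] ** 2 + delta1 * sigma2[t - 1]
    eps[t] = rng.normal(scale=np.sqrt(sigma2[t]))

print(eps.var(), theta / (1 - gamma1 - delta1))   # sample variance vs. unconditional variance
```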



DECISION TREES

Regression and Classification Trees

Notation
$R$: Region of predictor space
$n_m$: No. of observations in node $m$
$n_{m,c}$: No. of category $c$ observations in node $m$
$I$: Impurity
$E$: Classification error rate
$G$: Gini index
$D$: Cross entropy
$T$: Subtree
$|T|$: No. of terminal nodes in $T$
$\lambda$: Tuning parameter

Algorithm
1. Construct a large tree with $g$ terminal nodes using recursive binary splitting.
2. Obtain a sequence of best subtrees, as a function of $\lambda$, using cost complexity pruning.
3. Choose $\lambda$ by applying $k$-fold cross-validation. Select the $\lambda$ that results in the lowest cross-validation error.
4. The best subtree is the subtree created in step 2 with the selected $\lambda$ value.

Recursive Binary Splitting
Regression: Minimize $\displaystyle\sum_{m=1}^{g}\sum_{i:\,\mathbf{x}_i \in R_m} \big(y_i - \bar{y}_{R_m}\big)^2$
Classification: Minimize $\displaystyle\frac{1}{n}\sum_{m=1}^{g} n_m \cdot I_m$

More Under Classification
$\hat{p}_{m,c} = n_{m,c}/n_m$
$E_m = 1 - \max_c \hat{p}_{m,c}$
$G_m = \sum_c \hat{p}_{m,c}\big(1 - \hat{p}_{m,c}\big)$
$D_m = -\sum_c \hat{p}_{m,c}\ln\hat{p}_{m,c}$
deviance $= -2\sum_{m=1}^{g}\sum_c n_{m,c}\ln\hat{p}_{m,c}$
residual mean deviance $= \dfrac{\text{deviance}}{n - g}$

Cost Complexity Pruning
Regression: Minimize $\displaystyle\sum_{m=1}^{|T|}\sum_{i:\,\mathbf{x}_i \in R_m} \big(y_i - \bar{y}_{R_m}\big)^2 + \lambda|T|$
Classification: Minimize $\displaystyle\frac{1}{n}\sum_{m=1}^{|T|} n_m \cdot I_m + \lambda|T|$
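The three node impurity measures above differ only in how they penalize mixed nodes; the sketch below computes all three for a single hypothetical node.

```python
import numpy as np

def node_impurities(counts):
    """Classification error rate, Gini index, and cross entropy for one node."""
    counts = np.asarray(counts, dtype=float)      # n_{m,c}: category counts within the node
    p = counts / counts.sum()                     # p-hat_{m,c}
    error = 1 - p.max()                           # E_m
    gini = np.sum(p * (1 - p))                    # G_m
    entropy = -np.sum(p * np.log(np.where(p > 0, p, 1.0)))   # D_m, treating 0 * ln(0) as 0
    return error, gini, entropy

# Hypothetical node holding 30 observations across three categories.
print(node_impurities([20, 7, 3]))
```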
Key Ideas
• Terminal nodes or leaves represent the partitions of the predictor space.
• Internal nodes are points along the tree where splits occur.
• Terminal nodes do not have child nodes, but internal nodes do.
• Branches are lines that connect any two nodes.
• A decision tree with only one internal node is called a stump.

Advantages of Trees
• Easy to interpret and explain
• Can be presented visually
• Manage categorical variables without the need of dummy variables
• Mimic human decision-making

Disadvantages of Trees
• Not robust
• Do not have the same degree of predictive accuracy as other statistical methods

Multiple Trees

Bagging
1. Create $b$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $b$ trees.

Properties
• Increasing $b$ does not cause overfitting.
• Bagging reduces variance.
• Out-of-bag error is a valid estimate of test error.

Random Forests
1. Create $b$ bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting. At each split, a random subset of $k$ variables is considered.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all $b$ trees.

Properties
• Bagging is a special case of random forests.
• Increasing $b$ does not cause overfitting.
• Decreasing $k$ reduces the correlation between predictions.

Boosting
Let $z_1$ be the actual response variable, $y$.
1. For $k = 1, 2, \dots, b$:
   • Use recursive binary splitting to fit a tree with $d$ splits to the data with $z_k$ as the response.
   • Update $z_k$ by subtracting $\lambda \cdot \hat{f}_k(\mathbf{x})$, i.e. let $z_{k+1} = z_k - \lambda \cdot \hat{f}_k(\mathbf{x})$.
2. Calculate the boosted model prediction as $\hat{f}(\mathbf{x}) = \sum_{k=1}^{b} \lambda \cdot \hat{f}_k(\mathbf{x})$.

Properties
• Increasing $b$ can cause overfitting.
• Boosting reduces bias.
• $d$ controls the complexity of the boosted model.
• $\lambda$ controls the rate at which boosting learns.
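To show the boosting recursion concretely, here is a hedged sketch that uses a one-split stump as the base learner (so $d = 1$) and repeatedly fits the current residuals $z_k$. Everything here, including the `fit_stump` helper and the simulated data, is an illustrative assumption rather than a standard library interface.

```python
import numpy as np

def fit_stump(X, z):
    """One-split regression tree (d = 1): pick the feature/cutoff minimizing the SSE."""
    best = None
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j])[:-1]:             # candidate cutoffs (exclude the maximum)
            left = X[:, j] <= c
            pred = np.where(left, z[left].mean(), z[~left].mean())
            sse = np.sum((z - pred) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, c, z[left].mean(), z[~left].mean())
    _, j, c, mu_left, mu_right = best
    return lambda Xnew: np.where(Xnew[:, j] <= c, mu_left, mu_right)

# Hypothetical regression data.
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=200)

lam, B = 0.1, 100
z = y.copy()                      # z_1 is the actual response
stumps = []
for _ in range(B):
    f_k = fit_stump(X, z)         # fit a small tree to the current residuals z_k
    stumps.append(f_k)
    z = z - lam * f_k(X)          # z_{k+1} = z_k - lambda * f-hat_k(x)

f_hat = lambda Xnew: lam * sum(f(Xnew) for f in stumps)   # boosted prediction
```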



UNSUPERVISED LEARNING

Principal Components Analysis

Notation
$z$, $Z$: Principal component (score)
Subscript $m$: Index for principal components
$\phi$: Principal component loading
$x$, $X$: Centered explanatory variable

Principal Components
$z_m = \sum_{j=1}^p \phi_{j,m}\, x_j, \qquad z_{i,m} = \sum_{j=1}^p \phi_{j,m}\, x_{i,j}$
• $\sum_{j=1}^p \phi_{j,m}^2 = 1$
• $\sum_{j=1}^p \phi_{j,m} \cdot \phi_{j,u} = 0, \quad m \neq u$

Proportion of Variance Explained (PVE)
$\sum_{j=1}^p s_{x_j}^2 = \dfrac{1}{n-1}\sum_{j=1}^p \sum_{i=1}^n x_{i,j}^2$
$s_{z_m}^2 = \dfrac{1}{n-1}\sum_{i=1}^n z_{i,m}^2$
$\mathrm{PVE} = \dfrac{s_{z_m}^2}{\sum_{j=1}^p s_{x_j}^2}$

Key Ideas
• The variance explained by each subsequent principal component is always less than the variance explained by the previous principal component.
• All principal components are uncorrelated with one another.
• A dataset has $\min(n-1, p)$ distinct principal components.
• The first $k$ principal component scores and loadings approximate the original dataset, $x_{i,j} \approx \sum_{m=1}^k z_{i,m}\phi_{j,m}$.

Principal Components Regression
$Y = \theta_0 + \theta_1 z_1 + \cdots + \theta_k z_k + \varepsilon$
• If $k = p$, then $\beta_j = \sum_{m=1}^k \theta_m \phi_{j,m}$.
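A compact sketch of the loadings, scores, and PVE defined above, obtained from the eigendecomposition of the sample covariance matrix of centered data. The data matrix is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: n = 50 observations on p = 3 correlated variables, then centered.
X_raw = rng.normal(size=(50, 3)) @ np.array([[1.0, 0.4, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = X_raw - X_raw.mean(axis=0)

n, p = X.shape
S = X.T @ X / (n - 1)                          # sample covariance matrix
eigval, eigvec = np.linalg.eigh(S)             # eigh returns eigenvalues in ascending order
order = np.argsort(eigval)[::-1]
loadings = eigvec[:, order]                    # columns are the unit-length loadings phi_{., m}
scores = X @ loadings                          # z_{i,m} = sum_j phi_{j,m} * x_{i,j}

pve = scores.var(axis=0, ddof=1) / np.sum(X.var(axis=0, ddof=1))   # PVE of each component
```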
Cluster Analysis

Notation
$C$: Cluster containing indices
$W(C)$: Within-cluster variation of cluster
$|C|$: No. of observations in cluster

Euclidean Distance $= \sqrt{\sum_{j=1}^p \big(x_{i,j} - x_{m,j}\big)^2}$

$k$-Means Clustering
1. Randomly assign a cluster to each observation. This serves as the initial cluster assignments.
2. Calculate the centroid of each cluster.
3. For each observation, identify the closest centroid and reassign to that cluster.
4. Repeat steps 2 and 3 until the cluster assignments stop changing.

$W(C_u) = \dfrac{1}{|C_u|}\sum_{i,m \in C_u}\sum_{j=1}^p \big(x_{i,j} - x_{m,j}\big)^2 = 2\sum_{i \in C_u}\sum_{j=1}^p \big(x_{i,j} - \bar{x}_{u,j}\big)^2$

Hierarchical Clustering
1. Select the dissimilarity measure and linkage to be used. Treat each observation as its own cluster.
2. For $k = n, n-1, \dots, 2$:
   • Compute the inter-cluster dissimilarity between all $k$ clusters.
   • Examine all $\binom{k}{2}$ pairwise dissimilarities. The two clusters with the lowest inter-cluster dissimilarity are fused. The dissimilarity indicates the height in the dendrogram at which these two clusters join.

Linkage | Inter-cluster dissimilarity
Complete | The largest dissimilarity
Single | The smallest dissimilarity
Average | The arithmetic mean
Centroid | The dissimilarity between the cluster centroids

Key Ideas
• For $k$-means clustering, the algorithm needs to be repeated for each $k$.
• For hierarchical clustering, the algorithm only needs to be performed once for any number of clusters.
• The result of clustering depends on many parameters, such as:
   o Choice of $k$ in $k$-means clustering
   o Choice of number of clusters, linkage, and dissimilarity measure in hierarchical clustering
   o Choice to standardize variables
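Finally, a bare-bones sketch of the four $k$-means steps above; `k_means` is an illustrative helper on made-up data, not a substitute for a library implementation.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    clusters = rng.integers(k, size=len(X))                # step 1: random initial assignments
    for _ in range(n_iter):
        centroids = np.array([X[clusters == c].mean(axis=0) if np.any(clusters == c)
                              else X[rng.integers(len(X))] for c in range(k)])   # step 2
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_clusters = dist.argmin(axis=1)                 # step 3: reassign to the closest centroid
        if np.array_equal(new_clusters, clusters):         # step 4: stop once assignments settle
            break
        clusters = new_clusters
    return clusters, centroids

# Hypothetical data with two loose groups.
rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 2)), rng.normal(5.0, 1.0, size=(30, 2))])
labels, centers = k_means(X, k=2)
```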

www.coachingactuaries.com | Copyright © 2020 Coaching Actuaries. All Rights Reserved. Personal copies permitted. Resale or distribution is prohibited.