Exam SRM: You Have What It Takes To Pass
Training vs. Test
• Training: observations used to train/obtain f̂.
• Test: observations not used to train/obtain f̂.
[Figure: the error splits into reducible error and irreducible error.]
Key Ideas
• The disadvantage of parametric methods is the danger of choosing a form for f that is not close to the truth.
• The disadvantage of non-parametric methods is the need for an abundance of observations.
• Flexibility and interpretability are typically at odds.
• As flexibility increases, the training MSE (or error rate) decreases, but the test MSE (or error rate) follows a U-shaped pattern.
• Low flexibility leads to a method with low variance and high bias; high flexibility leads to a method with high variance and low bias.

[Boxplot figure: the box spans the 1st to 3rd quartiles with the median inside; whiskers extend to the smallest and largest non-outlier values; points beyond the whiskers are outliers; each of the four regions covers 25% of the data.]
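A quick way to see these patterns is to simulate them. The sketch below is illustrative only (made-up cubic truth, polynomial degree as the flexibility knob, not from the sheet): training MSE keeps falling as the degree grows, while test MSE is typically U-shaped.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(-2, 2, n)
y = x**3 - 2 * x + rng.normal(0, 1, n)   # true f is cubic; the noise is irreducible error
x_tr, y_tr, x_te, y_te = x[:100], y[:100], x[100:], y[100:]

for deg in [1, 3, 10, 15]:               # increasing flexibility
    coef = np.polyfit(x_tr, y_tr, deg)
    mse_tr = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coef, x_te)) ** 2)
    print(f"degree {deg:2d}: train MSE {mse_tr:.2f}, test MSE {mse_te:.2f}")
```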
qq Plots
Plots sample quantiles against theoretical quantiles to determine
whether the sample and theoretical distributions have
similar shapes.
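For example, a normal qq plot can be produced with scipy and matplotlib (both assumed available); points hugging the reference line suggest the sample and theoretical distributions have similar shapes.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
sample = rng.exponential(scale=2.0, size=300)   # a deliberately skewed sample

# Plot sample quantiles against standard normal quantiles;
# curvature away from the reference line signals different shapes.
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```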
Residuals are well-behaved if
o Points appear to be randomly scattered
o Residuals seem to average to 0
o Spread of residuals does not change

• e versus i: detects dependence of the error terms
• qq plot of e: detects non-normality of the error terms

Variance Inflation Factor

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} = \frac{s_{x_j}^2 (n-1)\,[se(b_j)]^2}{\mathrm{MSE}}$$

Tolerance is the reciprocal of VIF.
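A minimal numpy sketch of the first equality, on hypothetical data: VIF_j comes from regressing x_j on the remaining predictors and computing 1/(1 − R_j²).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)   # deliberately correlated with x1
X = np.column_stack([x1, x2, rng.normal(size=n)])

def vif(X, j):
    # Regress x_j on the other predictors (with intercept), then 1 / (1 - R_j^2).
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

print([round(vif(X, j), 2) for j in range(X.shape[1])])  # inflated for x1, x2
```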
Forward Stepwise Selection
1. Fit the null model, M_0, and the g models with one predictor each; the one-predictor model with the largest R² is M_1.
2. For p = 2, …, g, fit the models that add one of the remaining predictors to M_{p−1}. The model with the largest R² is M_p.
3. Choose the best model among M_0, …, M_g using a selection criterion of choice.

Backward Stepwise Selection
1. Fit the model with all g predictors, M_g.
2. For p = g − 1, …, 1, fit the models that drop one of the predictors from M_{p+1}. The model with the largest R² is M_p.
3. Choose the best model among M_0, …, M_g using a selection criterion of choice.
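A compact numpy sketch of forward selection (the forward_stepwise helper is hypothetical, and adjusted R² stands in for the "selection criterion of choice"):

```python
import numpy as np

def r2(X, y):
    # R^2 of an OLS fit of y on X (with intercept).
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_stepwise(X, y):
    n, g = X.shape
    chosen, models = [], [[]]            # models[p] holds the predictor set M_p
    while len(chosen) < g:
        rest = [j for j in range(g) if j not in chosen]
        best = max(rest, key=lambda j: r2(X[:, chosen + [j]], y))
        chosen = chosen + [best]
        models.append(list(chosen))
    # Step 3: pick among M_0, ..., M_g by a criterion (adjusted R^2 here).
    def adj_r2(cols):
        p = len(cols)
        r = r2(X[:, cols], y) if cols else 0.0
        return 1 - (1 - r) * (n - 1) / (n - p - 1)
    return max(models, key=adj_r2)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(size=100)
print(forward_stepwise(X, y))            # expect columns 0 and 2 to be selected
```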
Leave-one-out Cross-Validation (LOOCV)
• Calculate LOOCV error as a special case of k-fold cross-validation where k = n.
• For MLR:

$$\text{LOOCV Error} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_i} \right)^2$$

Key Ideas on Cross-Validation
• The validation set approach has unstable results and will tend to overestimate the test MSE. The two other approaches mitigate these issues.
• With respect to bias, LOOCV < k-fold CV < Validation Set.
• With respect to variance, LOOCV > k-fold CV > Validation Set.
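A numpy sketch (made-up data) verifying the leverage shortcut against explicit leave-one-out refits; for linear regression the two agree exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Shortcut: LOOCV error from a single fit, using leverages h_i.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
loocv_fast = np.mean(((y - X @ beta) / (1 - h)) ** 2)

# Direct: refit n times, each time leaving one observation out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)

print(loocv_fast, np.mean(errs))   # identical for OLS
```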
• A standardized variable is the result of first centering a variable, then scaling it.

Ridge Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^{p} b_j^2 \le a$, or equivalently, by minimizing the expression $\mathrm{SSE} + \lambda \sum_{j=1}^{p} b_j^2$.

Lasso Regression
Coefficients are estimated by minimizing the SSE while constrained by $\sum_{j=1}^{p} |b_j| \le a$, or equivalently, by minimizing the expression $\mathrm{SSE} + \lambda \sum_{j=1}^{p} |b_j|$.
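A short sklearn sketch (assumed installed). Note sklearn's alpha plays the role of λ, and its lasso objective scales the SSE by 1/(2n), so the two alphas are not directly comparable.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)   # standardize: center, then scale

ridge = Ridge(alpha=1.0).fit(Xs, y)      # shrinks all coefficients toward 0
lasso = Lasso(alpha=0.1).fit(Xs, y)      # can set some coefficients exactly to 0

print(ridge.coef_.round(3))
print(lasso.coef_.round(3))              # expect zeros on the noise predictors
```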
with the "previous predictors" explained category among the 𝑘𝑘 observations; for
Lasso Regression regression, 𝑦𝑦+ is the average of the
by the previous direction.
Coefficients are estimated by minimizing response among the 𝑘𝑘 observations.
• The directions 𝑧𝑧# , … , 𝑧𝑧J are used as
the SSE while constrained by ∑$;(#a𝑏𝑏; a ≤ 𝑎𝑎 𝑘𝑘 is inversely related to flexibility.
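A minimal sklearn sketch (assumed installed) of using g = 2 directions as regression predictors; larger n_components means a more flexible fit.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 6))
y = X[:, :2].sum(axis=1) + rng.normal(scale=0.5, size=200)

Xs = StandardScaler().fit_transform(X)   # PLS directions use standardized predictors

pls = PLSRegression(n_components=2).fit(Xs, y)
Z = pls.transform(Xs)                    # the directions z_1, z_2 for each observation
print(Z.shape)
print(pls.score(Xs, y))                  # R^2 of the regression on the directions
```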
k-Nearest Neighbors (KNN)
1. Identify the "center of the neighborhood", i.e. the location of an observation with inputs x_1, …, x_p.
2. Starting from the "center of the neighborhood", identify the k nearest training observations.
3. For classification, ŷ is the most frequent category among the k observations; for regression, ŷ is the average of the response among the k observations.

k is inversely related to flexibility.
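A minimal sklearn sketch of both cases on made-up data; a small k gives a flexible fit, a large k a smoother, less flexible one.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 2))
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
y_reg = X[:, 0] ** 2 + rng.normal(scale=0.1, size=150)

x_new = np.array([[0.5, -0.2]])   # the "center of the neighborhood"

print(KNeighborsClassifier(n_neighbors=5).fit(X, y_class).predict(x_new))
print(KNeighborsRegressor(n_neighbors=5).fit(X, y_reg).predict(x_new))
```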
Key Results for Distributions in the Exponential Family

| Distribution | Probability Function | $\theta$ | $\phi$ | $b(\theta)$ | $\theta(\mu)$ |
|---|---|---|---|---|---|
| Normal | $\frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$ | $\mu$ | $\sigma^2$ | $\frac{\theta^2}{2}$ | $\mu$ |
| Binomial (fixed $n$) | $\binom{n}{y} \pi^y (1-\pi)^{n-y}$ | $\ln\!\left(\frac{\pi}{1-\pi}\right)$ | $1$ | $n \ln\!\left(1 + e^{\theta}\right)$ | $\ln\!\left(\frac{\mu}{n-\mu}\right)$ |
| Poisson | $\frac{\lambda^y \exp(-\lambda)}{y!}$ | $\ln \lambda$ | $1$ | $e^{\theta}$ | $\ln \mu$ |
| Negative Binomial (fixed $r$) | $\frac{\Gamma(y+r)}{y!\,\Gamma(r)}\, p^r (1-p)^y$ | $\ln(1-p)$ | $1$ | $-r \ln\!\left(1 - e^{\theta}\right)$ | $\ln\!\left(\frac{\mu}{r+\mu}\right)$ |
| Gamma | $\frac{\gamma^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha-1} \exp(-y\gamma)$ | $-\frac{\gamma}{\alpha}$ | $\frac{1}{\alpha}$ | $-\ln(-\theta)$ | $-\frac{1}{\mu}$ |
| Inverse Gaussian | $\sqrt{\frac{\lambda}{2\pi y^3}} \exp\!\left(-\frac{\lambda(y-\mu)^2}{2\mu^2 y}\right)$ | $-\frac{1}{2\mu^2}$ | $\frac{1}{\lambda}$ | $-\sqrt{-2\theta}$ | $-\frac{1}{2\mu^2}$ |
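As a worked check, assuming the standard linear exponential family form $f(y) = \exp\!\left\{\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right\}$ (which these columns follow), the Poisson row is recovered from

$$\frac{\lambda^y e^{-\lambda}}{y!} = \exp\left( y \ln\lambda - \lambda - \ln y! \right), \qquad \theta = \ln\lambda,\quad b(\theta) = e^{\theta},\quad \phi = 1,$$

with mean $b'(\theta) = e^{\theta} = \lambda$ and variance $\phi\, b''(\theta) = \lambda$.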
Smoothing

$$\hat{s}_t = \frac{y_t + y_{t-1} + \cdots + y_{t-k+1}}{k}$$

$$\hat{s}_t = \hat{s}_{t-1} + \frac{y_t - y_{t-k}}{k}, \quad k = 1, 2, \ldots$$
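A tiny numpy illustration (made-up series) that the recursive update reproduces the k-term moving average:

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(size=20).cumsum()   # a made-up series
k = 4

# Direct k-term moving average, defined once k observations exist.
s_direct = np.array([y[t - k + 1 : t + 1].mean() for t in range(k - 1, len(y))])

# Recursive update: s_t = s_{t-1} + (y_t - y_{t-k}) / k.
s_rec = [y[:k].mean()]
for t in range(k, len(y)):
    s_rec.append(s_rec[-1] + (y[t] - y[t - k]) / k)

print(np.allclose(s_direct, np.array(s_rec)))   # True
```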
Stationarity
Stationarity describes how a series does not vary with respect to time. Control charts can be used to identify stationarity.
• Test statistic for an autocorrelation $r_k$: $r_k / se_{r_k}$

AR(1) Model

$$Y_t = \beta_0 + \beta_1 Y_{t-1} + \varepsilon_t$$

White Noise

$$b_0 = \hat{s}_n, \qquad \hat{y}_{n+l} = b_0$$

Seasonal Autoregressive Model

$$Y_t = \beta_0 + \beta_1 Y_{t-g} + \cdots + \beta_p Y_{t-pg} + \varepsilon_t$$

Volatility Models
ARCH(p) Model
Regression:
$$\text{Minimize } \sum_{m=1}^{g} \sum_{i:\, \mathbf{x}_i \in R_m} \left( y_i - \hat{y}_{R_m} \right)^2$$

Classification:
$$\text{Minimize } \frac{1}{n} \sum_{m=1}^{g} n_m \cdot I_m$$

More Under Classification:
$$\hat{p}_{m,c} = n_{m,c} / n_m$$
$$E_m = 1 - \max_c \hat{p}_{m,c}$$
$$G_m = \sum_{c=1}^{C} \hat{p}_{m,c} \left( 1 - \hat{p}_{m,c} \right)$$
$$D_m = -\sum_{c=1}^{C} \hat{p}_{m,c} \ln \hat{p}_{m,c}$$
$$\text{deviance} = -2 \sum_{m=1}^{g} \sum_{c=1}^{C} n_{m,c} \ln \hat{p}_{m,c}$$
$$\text{residual mean deviance} = \frac{\text{deviance}}{n - g}$$

Properties
• Trees do not have the same degree of predictive accuracy as other statistical methods.
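A numpy sketch of the node measures for one node m with hypothetical category counts:

```python
import numpy as np

n_mc = np.array([40, 15, 5])          # counts n_{m,c} for one node m
p = n_mc / n_mc.sum()                 # p_hat_{m,c}

E = 1 - p.max()                       # classification error rate E_m
G = np.sum(p * (1 - p))               # Gini index G_m
D = -np.sum(p * np.log(p))            # cross entropy D_m

print(round(E, 3), round(G, 3), round(D, 3))
```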
Multiple Trees

Bagging
1. Create b bootstrap samples from the original training dataset.
2. Construct a decision tree for each bootstrap sample using recursive binary splitting.
3. Predict the response of a new observation by averaging the predictions (regression trees) or by using the most frequent category (classification trees) across all b trees.
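A minimal sklearn sketch (assumed installed; the estimator= parameter name is from recent sklearn versions) of these three steps via BaggingRegressor, with out-of-bag scoring:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=300)

# b = 100 bootstrap samples, one tree per sample, predictions averaged.
bag = BaggingRegressor(estimator=DecisionTreeRegressor(),
                       n_estimators=100, oob_score=True).fit(X, y)

print(bag.oob_score_)            # out-of-bag R^2, an estimate of test performance
print(bag.predict(X[:3]))
```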
Properties
• Increasing b does not cause overfitting.
• Bagging reduces variance.
• Out-of-bag error is a valid estimate of test error.

Boosting

Properties
• Increasing b can cause overfitting.
• Boosting reduces bias.
• d controls the complexity of the boosted model.
• λ controls the rate at which boosting learns.
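These three tuning knobs map directly onto sklearn's gradient boosting (a sketch, not the sheet's algorithm): n_estimators is b, max_depth is d, and learning_rate is λ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=300)

boost = GradientBoostingRegressor(
    n_estimators=200,     # b: more trees can eventually overfit
    max_depth=2,          # d: complexity (interaction depth) of each tree
    learning_rate=0.05,   # lambda: how fast boosting learns
).fit(X, y)

print(boost.score(X, y))  # training R^2
```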