
Master in Data Science for Management


Statistical Learning

Prof. Salvatore Ingrassia

s.ingrassia@unict.it
http://www.dei.unict.it/docenti/salvatore.ingrassia
http://www.datasciencegroup.unict.it


4. Resampling Methods

[On the Gaussian curve, remarked to Poincaré:]


Experimentalists think that it is a mathematical
theorem while the mathematicians believe it to
be an experimental fact.
Gabriel Lippmann, Physics Nobel Prize 1908.


Outline

4.1. Training, Validation and Test Sets


4.1.1. The Validation Approach
4.1.2. The Training, Validation and Test Set Approach
4.1.3. Labo activity with R

4.2. Cross Validation


4.2.4. Leave-One-Out Cross-Validation (LOOCV)
4.2.5. k-fold Cross-Validation
4.2.6. Bias-Variance Trade-off for k-fold CV
4.2.7. Cross-Validation on Classification Problems
4.2.8. Labo activity with R

4.3. The Bootstrap


4.3.9. Labo activity with R


4.1 Training, Validation and Test Sets


Introduction - 1/2
Resampling methods are a key tool in modern statistics and machine
learning. Resampling methods involve:
1. Repeatedly drawing a sample from the training data.
2. Refitting the model of interest with each new sample.
3. Examining all of the refitted models and then drawing appropriate
conclusions.
In particular, we do so in order to obtain more information about the fitted model:
! Model Selection: estimating the performance of different models in
order to choose the best one;
! Model Assessment: having chosen a final model, estimating its
prediction error (generalization error) on new data.
For example, in order to estimate the variability of a linear regression fit, we
can repeatedly draw different samples from the training data, fit a linear
regression to each new sample, and then examine the extent to which the
resulting fits differ.
Such an approach may allow us to obtain information that would not be
available from fitting the model only once using the original training sample.

Introduction - 2/2

Resampling is computationally very expensive, but with the advent of modern
computing this is not a major drawback. There are two major resampling
techniques:
1. Cross-Validation: generally used to estimate the error (model
assessment) associated with a given learning model and/or to select
the appropriate learning model (model selection).
2. Bootstrap: most commonly used to provide a measure of accuracy of a
parameter estimate or of a given learning method.
Cross-validation and the bootstrap are easy to implement and very broadly
applicable.


The Training and Validation Set Approach 1/2

Suppose that we would like to find a set of variables that give the lowest test
(not training) error rate.
If we have a large data set, we can achieve this goal by randomly splitting the
data into training and validation (testing) parts.

We would then use the training part to build each possible model (i.e. the
different combinations of variables) and choose the model that gave the
lowest error rate when applied to the validation data.


The Training and Validation Set Approach 2/2

Advantages
! Simple
! Easy to implement

Disadvantages
! The validation MSE can be highly variable
! Only a subset of the observations is used to fit the model (the training
data), and statistical methods tend to perform worse when trained on
fewer observations.


Example: Auto data 1/4

Consider the Auto data containing N = 392 units.


Suppose that we want to predict mpg from horsepower.
Two models:
! mpg ∼ horsepower
! mpg ∼ horsepower + horsepower²
Which model gives a better fit?
! Randomly split Auto data set into training (N/2 = 196 units) and
validation data (N/2 = 196 units);
! Fit both models using the training data set;
! Then, evaluate both models using the validation data set;
! The model with the lowest validation (testing) MSE is the winner.
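
A minimal R sketch of this comparison (assuming the Auto data set from the ISLR package):

  library(ISLR)                       # provides the Auto data set
  set.seed(1)
  train <- sample(392, 196)           # indices of the random training half
  fit1  <- lm(mpg ~ horsepower, data = Auto, subset = train)
  fit2  <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
  # validation (testing) MSE of each model on the held-out half
  mean((Auto$mpg - predict(fit1, Auto))[-train]^2)
  mean((Auto$mpg - predict(fit2, Auto))[-train]^2)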


Example: Auto data (results) 2/4


[Figure: validation MSE versus degree of polynomial for the Auto data. Left
panel: the validation error curve for a single random split. Right panel: the
validation method repeated 10 times, each time with a different random split.]

There is a lot of variability among the MSE's.

Not good: we need more stable methods!



Training, Validation and Test Sets 1/2

If we are in a data-rich situation, the best approach for both model selection
and model assessment is to randomly divide the data set into three parts:
! training set: to be used to fit the models;
! validation set: to be used to estimate prediction error for model
selection;
! test set: to be used for assessment of the generalization error of the
final chosen model.
Why three subsets?
Ideally, the test set should be kept in a "vault", and be brought out only at the
end of the data analysis.
Suppose instead that we use the test-set repeatedly, choosing the model with
the smallest test-set error. Then the test set error of the final chosen model
will underestimate the true test error, sometimes substantially.
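
As a sketch, a random three-way split in R (dat stands for any data frame; the 50/25/25 proportions are an illustrative choice, not a prescription):

  set.seed(1)
  n   <- nrow(dat)
  idx <- sample(n)                          # random permutation of the row indices
  n_train <- floor(0.50 * n)
  n_valid <- floor(0.25 * n)
  train <- dat[idx[1:n_train], ]                          # fit the models
  valid <- dat[idx[(n_train + 1):(n_train + n_valid)], ]  # model selection
  test  <- dat[idx[(n_train + n_valid + 1):n], ]          # kept in the "vault"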


Training, Validation and Test Sets 2/2

Problem: data scarcity


When data is scarce, we cannot afford to set aside separate validation
(and/or test) sets when training a classifier or fitting a regression.

There are also some drawbacks to using a validation set in the training
procedure:
1. Performance on the validation set is an (often highly variable) random
variable that depends on the split of the data into training and validation sets.
2. The error on the validation set tends to overestimate the test error rate
of the model that is fitted to the entire data set.


Labo activity with R

Labo activity 4.R


4.2 Cross Validation


Leave-One-Out Cross-Validation (LOOCV) - 1/3


This method is similar to the Validation Set Approach, but it tries to address
its disadvantages.
For each suggested model, do:
1. Split the data set (x1, y1), (x2, y2), . . . , (xn, yn) of size n into
! Training data set (x2, y2), (x3, y3), . . . , (xn, yn) of size n − 1,
! Validation data set (x1, y1) of size 1.
2. Fit the model using the training data.
3. Validate the model using the validation data and compute the predicted
value ŷ1 corresponding to x1; afterwards compute the mean squared
test error
MSE1 = (y1 − ŷ1)².
4. Now split the data set (x1, y1), (x2, y2), . . . , (xn, yn) of size n into
! Training data set (x1, y1), (x3, y3), . . . , (xn, yn) of size n − 1,
! Validation data set (x2, y2) of size 1.
5. Fit the model using the training data.
6. Validate the model using the validation data and compute the predicted
value ŷ2 corresponding to x2; afterwards compute the mean squared test error
MSE2 = (y2 − ŷ2)².

Leave-One-Out Cross-Validation (LOOCV) - 2/3

7. Repeat the process for each unit (xi, yi), for i = 1, . . . , n:
! choose the unit (xi, yi), of size 1, as the validation data set,
! choose the remaining units as the training data set,
! fit the model using the training data,
! validate the model using the validation data and compute the
predicted value ŷi corresponding to xi; afterwards compute the mean
squared test error
MSEi = (yi − ŷi)².
8. Finally, we get the Leave-One-Out Cross-Validation (LOOCV) estimate
for the test MSE as the average of these n test error estimates:

CV(n) = (1/n) ∑_{i=1}^{n} MSEi = (1/n) ∑_{i=1}^{n} (yi − ŷi)².
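
A direct R sketch of these steps for the Auto data (assuming ISLR's Auto; the same quantity is computed by cv.glm from the boot package with K = n):

  library(ISLR)
  n   <- nrow(Auto)
  mse <- numeric(n)
  for (i in 1:n) {
    fit    <- lm(mpg ~ horsepower, data = Auto[-i, ])  # train on the remaining n - 1 units
    yhat   <- predict(fit, newdata = Auto[i, ])        # predict the held-out unit
    mse[i] <- (Auto$mpg[i] - yhat)^2                   # MSE_i
  }
  mean(mse)                                            # the LOOCV estimate CV(n)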


Leave-One-Out Cross-Validation (LOOCV) - 3/3


A schematic of the LOOCV approach:

A set of n data points is repeatedly split into a training set (shown in blue)
containing all but one observation, and a validation set that contains only that
observation (shown in beige).
The test error is then estimated by averaging the n resulting MSE’s.
The first training set contains all but observation 1, the second training set
contains all but observation 2, and so forth.


LOOCV vs. the Validation Set Approach

LOOCV has a couple of major advantages over the validation set approach:

1. LOOCV has less bias:


! We repeatedly fit the statistical learning method using training
data that contains n − 1 obs., i.e. almost all the data set is used.
2. LOOCV produces a less variable MSE:
! The validation approach produces different MSE when applied
repeatedly due to randomness in the splitting process, while
performing LOOCV multiple times will always yield the same
results, because we split based on 1 obs. each time.
and LOOCV has one main disadvantage over the validation set approach:
1. LOOCV is computationally intensive:
! We fit each model n times.


Example: Auto data: The LOOCV error curve - 3/4


[Figure: LOOCV error curve for the Auto data, mean squared error versus
degree of polynomial.]

Cross-validation was used on the Auto data set in order to estimate the test
error that results from predicting mpg using polynomial functions of
horsepower.

k-fold Cross-Validation - 1/3


LOOCV is computationally intensive, so an alternative is k-fold
Cross-Validation.
1. This approach involves randomly dividing the set of observations into k
groups (e.g. k = 5, or k = 10, etc.):
! The first fold is treated as a validation set,
! and the method is fit on the remaining k − 1 folds.
2. The mean squared error, MSE1 , is then computed on the observations
in the held-out fold.
3. This procedure is repeated k times; each time, a different group of
observations is treated as a validation set.
4. This process results in k estimates of the test error,
MSE1 , MSE2 , . . . , MSEk .
5. By averaging the k different MSE's we get an estimated validation (test)
error rate for new observations. In other words, the k-fold CV estimate is
computed by averaging these values:

CV(k) = (1/k) ∑_{i=1}^{k} MSEi.
LOOCV is a special case of k-fold CV in which k is set to equal n.
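
In R, k-fold CV is implemented by cv.glm in the boot package; a sketch of 10-fold CV for the quadratic model on the Auto data:

  library(ISLR); library(boot)
  set.seed(1)
  fit <- glm(mpg ~ poly(horsepower, 2), data = Auto)  # default Gaussian family: a least-squares fit
  cv.glm(Auto, fit, K = 10)$delta[1]                  # the 10-fold CV estimate of the test MSE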

k-fold Cross-Validation - 2/3


A schematic display of 5-fold Cross-Validation:

A set of n observations is randomly split into five non-overlapping groups.


Each of these fifths acts as a validation set (shown in beige), and the
remainder as a training set (shown in blue).
The test error is estimated by averaging the five resulting MSE estimates.


k-fold Cross-Validation - 3/3

What is the advantage of using k = 5 or k = 10 rather than k = n?


The most obvious advantage is computational. LOOCV requires fitting the
statistical learning method n times. This has the potential to be
computationally expensive (except for linear models fit by least squares).
But cross-validation is a very general approach that can be applied to almost
any statistical learning method. Some statistical learning methods have
computationally intensive fitting procedures, and so performing LOOCV may
pose computational problems, especially if n is extremely large.
In contrast, performing 10-fold CV requires fitting the learning procedure only
ten times, which may be much more feasible.


Example: Auto data: The 10-fold CV error curve - 4/4


[Figure: 10-fold CV error curves for the Auto data, mean squared error
versus degree of polynomial.]

10-fold CV was run nine separate times, each with a different random split of
the data into ten parts. The figure shows the nine slightly different CV error
curves.


Bias-Variance Trade-off for k-fold CV - 1/3


k-fold CV with k < n has a computational advantage over LOOCV. But putting
aside the fact that LOOCV is more computationally intensive than k-fold CV,
which is better: LOOCV or k-fold CV?

Remember that the validation set approach can lead to overestimates of the
test error rate, since in this approach the training set used to fit the statistical
learning method contains only half the observations of the entire data set.

! Hence, LOOCV will give approximately unbiased estimates of the test


error, since each training set contains n − 1 observations, which is
almost as many as the number of observations in the full data set.
! And performing k-fold CV for, say, k = 5 or k = 10 will lead to an
intermediate level of bias, since each training set contains (k − 1) · n/k
observations – fewer than in the LOOCV approach, but substantially
more than in the validation set approach.

Conclusion: LOOCV is less biased than k-fold CV (when k < n)


Therefore, from the perspective of bias reduction, it is clear that LOOCV is to
be preferred to k-fold CV.

Bias-Variance Trade-off for k-fold CV - 2/3


However, we know that bias is not the only source for concern in an
estimating procedure; we must also consider the procedure’s variance.
! When we perform LOOCV, we are in effect averaging the outputs of n
fitted models, each of which is trained on an almost identical set of
observations; therefore, these outputs are highly (positively) correlated
with each other.
! In contrast, when we perform k-fold CV with k < n, we are averaging the
outputs of k fitted models that are somewhat less correlated with each
other, since the overlap between the training sets in each model is
smaller.
! Since the mean of many highly correlated quantities has higher variance
than does the mean of many quantities that are not as highly correlated,
the test error estimate resulting from LOOCV tends to have higher
variance than does the test error estimate resulting from k-fold CV.
Conclusion: LOOCV has higher variance than k-fold CV (when k < n)
Therefore, from the perspective of variance reduction, it is clear that k-fold CV
is to be preferred to LOOCV.


Bias-Variance Trade-off for k-fold CV - 3/3


LOOCV is less biased than k-fold CV (when k < n)
Therefore, from the perspective of bias reduction, it is clear that LOOCV is to
be preferred to k-fold CV.

+
LOOCV has higher variance than k-fold CV (when k < n)
Therefore, from the perspective of variance reduction, it is clear that k-fold CV
is to be preferred to LOOCV.

There is a bias-variance trade-off associated with the choice of k in k-fold
cross-validation.
We tend to use k-fold CV with k = 5 or k = 10: it has been empirically shown
that these values yield test error rate estimates that suffer neither from
excessively high bias nor from very high variance.


Cross-Validation on Classification Problems

So far, we have been dealing with CV on regression problems.


We can use cross-validation in a classification situation in a similar manner:
1. Divide the data into k parts.
2. Hold out one part, fit the model using the remaining data and compute
the error rate on the held-out data.
3. Repeat k times.
4. The CV error rate is the average over the k errors we have computed.
For instance, in the classification setting, the LOOCV error rate takes the form

CV(n) = (1/n) ∑_{i=1}^{n} I(yi ≠ ŷi) = (1/n) ∑_{i=1}^{n} Erri,

where Erri = I(yi ≠ ŷi).
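
A sketch of this recipe in R for logistic regression (dat, its 0/1 response column y, and the predictors are placeholders for your own data):

  k <- 10
  set.seed(1)
  folds <- sample(rep(1:k, length.out = nrow(dat)))   # random fold label for each unit
  errs  <- numeric(k)
  for (j in 1:k) {
    fit     <- glm(y ~ ., data = dat[folds != j, ], family = binomial)
    p       <- predict(fit, newdata = dat[folds == j, ], type = "response")
    yhat    <- as.numeric(p > 0.5)
    errs[j] <- mean(yhat != dat$y[folds == j])        # error rate on the held-out fold
  }
  mean(errs)                                          # the k-fold CV error rate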


CV to Choose Order of Polynomial - 1/4


! The data set used is simulated
! The purple dashed line is the Bayes’ boundary

[Figure: scatterplot of the simulated two-class data in the (X1, X2) plane.]


CV to Choose Order of Polynomial - 2/4


Estimated decision boundaries: degree 1 and degree 2.

[Figure: estimated decision boundaries from linear (Degree = 1) and
quadratic (Degree = 2) logistic regression on the simulated data.]

Degree = 1: error rate = 0.201. Linear logistic regression is not able to fit the
Bayes' decision boundary.
Degree = 2: error rate = 0.197. Quadratic logistic regression does better than
linear.


CV to Choose Order of Polynomial - 3/4


Estimated decision boundaries: using cubic and quartic predictors, the
accuracy of the model improves.

[Figure: estimated decision boundaries from cubic (Degree = 3) and quartic
(Degree = 4) logistic regression on the simulated data.]

Degree = 3: error rate = 0.160. Degree = 4: error rate = 0.162.


CV to Choose Order of Polynomial - 4/4


We can decide between the four logistic regression models using
cross-validation.

[Figure: error rate versus order of polynomial used. Brown: test error;
blue: training error; black: 10-fold CV error.]

Labo activity with R

Labo activity 4.R


4.3 The Bootstrap


The bootstrap

The bootstrap is one of the most general and most widely used tools to
estimate measures of uncertainty associated with a given statistical method.
Some common bootstrap applications are: estimating the bias or variance of
a particular statistical estimator, or constructing approximate confidence
intervals for parameters of interest.
As a simple example, the bootstrap can be used to estimate the standard
errors of the coefficients from a linear regression fit. In the specific case of
linear regression, this is not particularly useful, since standard statistical
software such as R outputs such standard errors automatically.
However, the power of the bootstrap lies in the fact that it can be easily
applied to a wide range of statistical learning methods, including some for
which a measure of variability is otherwise difficult to obtain and is not
automatically output by statistical software.


The problem

Suppose that we have independent samples z1, . . . , zn ∼ Pθ. The subscript θ
emphasizes the fact that θ is some parameter of interest, defined at the
population level. E.g., this could be the mean of the distribution, the variance,
or something more complicated.
Let θ̂ be an estimate for θ that we compute from the samples z1, . . . , zn.
We may be interested in knowing Var(θ̂), the variance of our statistic θ̂.

If we had access to Pθ, then we could just draw n samples, recompute the
statistic, and repeat; after doing this, say, 1000 times, we would have
computed 1000 statistics, and could just take the sample variance of these
statistics.
However, of course, we generally don't have access to Pθ.


Bootstrap in pills

The idea behind the bootstrap is to use the observed samples z1, . . . , zn to
generate n "new" samples, as if they came from Pθ. In particular, denoting
the new samples by z̃1, . . . , z̃n, we draw these according to

z̃j ∼ Unif{z1, . . . , zn} i.i.d., for j = 1, . . . , n;

in other words, each z̃j is independent and drawn uniformly among z1, . . . , zn.
This is called sampling with replacement, because in our new sample
z̃1, . . . , z̃n we could very well have repeated observations.
In fact, we can think of z̃1, . . . , z̃n as coming from a distribution: an
independent sample of size n from the empirical distribution function of the
original sample z1, . . . , zn.
This is a discrete distribution, with probability mass 1/n at each of z1, . . . , zn.
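
In R, a bootstrap sample is a single call to sample with replace = TRUE; a sketch for a generic sample z:

  set.seed(1)
  z      <- rnorm(10)                                    # an illustrative original sample
  z_boot <- sample(z, size = length(z), replace = TRUE)  # n draws from Unif{z1, ..., zn}
  table(z_boot)                                          # repeated values reveal the replacement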


A running example - 1/4


Suppose that we wish to invest a fixed sum of money in two financial assets
that yield returns of X and Y , respectively, where X and Y are random
quantities.
We will invest a fraction α of our money in X, and will invest the remaining
1 − α in Y .
Since there is variability associated with the returns on these two assets, we
wish to choose α to minimize the total risk, or variance, of our investment.
In other words, we want to minimize Var(αX + (1 − α)Y).¹
Consider Var(αX + (1 − α)Y) as a function of α. Then we have:

f(α) = Var(αX + (1 − α)Y)
     = α² Var(X) + (1 − α)² Var(Y) + 2α(1 − α) Cov(X, Y)
     = α² Var(X) + (1 − α)² Var(Y) + (2α − 2α²) Cov(X, Y).

¹ Remember that in general Var(αX + βY) = α² Var(X) + β² Var(Y) + 2αβ Cov(X, Y).

A running example - 2/4


Then consider the first derivative f′(α):

f′(α) = 2α Var(X) − 2(1 − α) Var(Y) + (2 − 4α) Cov(X, Y)
      = 2α Var(X) − 2 Var(Y) + 2α Var(Y) + 2 Cov(X, Y) − 4α Cov(X, Y)
      = 2ασX² − 2σY² + 2ασY² + 2σXY − 4ασXY,

where σX² = Var(X), σY² = Var(Y) and σXY = Cov(X, Y).

The value that minimizes the risk is given by the solution of the equation
f′(α) = 0:

α = (σY² − σXY) / (σX² + σY² − 2σXY).

In practice, the quantities σX², σY² and σXY are unknown; we have to
estimate them using a data set that contains past measurements of X and Y,
and we get an estimate of α:

α̂ = (σ̂Y² − σ̂XY) / (σ̂X² + σ̂Y² − 2σ̂XY).
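
A sketch of this estimator as an R function, written in the form the boot package expects later on (data is assumed to have columns X and Y; index selects the rows of the current sample):

  alpha.fn <- function(data, index) {
    X <- data$X[index]
    Y <- data$Y[index]
    (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))  # alpha-hat
  }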


A running example - 3/4


In each panel, we simulated 100 pairs of returns for the investments X and Y.
We used these returns to estimate σX², σY², and σXY in order to obtain
estimates for α.

[Figure: four scatterplots of simulated returns (X, Y), 100 pairs each.]

Obviously, each sample yields a different value of α̂; see Labo activity 4.R.

A running example - 4/4


We wish to quantify the accuracy of our estimate of α. To estimate the
standard deviation of α̂, we repeated the process of simulating 100 paired
observations of X and Y 1000 times.
We thereby obtained 1000 estimates for α, say α̂1, α̂2, . . . , α̂1000.

Mean over the 1000 estimates of α:

ᾱ = (1/1000) ∑_{i=1}^{1000} α̂i

Standard deviation:

SE(α̂) = √[ (1/(1000 − 1)) ∑_{i=1}^{1000} (α̂i − ᾱ)² ]

[Figure: histogram of the 1000 estimates α̂i.]

Problem
In practice, the procedure for estimating SE(α̂) outlined above cannot be
applied, because for real data we cannot generate new samples from the
original population.

Estimating the standard errors 1/2


So we use the bootstrap to estimate the standard error of our estimator α̂
(recall that this is just its standard deviation; we often call the standard
deviation of an estimator its standard error).
Let z1 = (x1, y1), . . . , zn = (xn, yn). Then pick a large number B, say B = 1000,
and repeat for b = 1, . . . , B:
! draw a bootstrap sample z̃1*b, . . . , z̃n*b;
! recompute the statistic α̂*b on z̃1*b, . . . , z̃n*b, as in

α̂ = (σ̂Y² − σ̂XY) / (σ̂X² + σ̂Y² − 2σ̂XY).

Then

SE_B(α̂) ≈ √[ (1/(B − 1)) ∑_{b=1}^{B} ( α̂*b − (1/B) ∑_{r=1}^{B} α̂*r )² ],

which is just the sample standard deviation of the bootstrap statistics
α̂*1, . . . , α̂*B.
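
Given a statistic such as alpha.fn above, the boot function from the boot package automates these B replications; a sketch (Portfolio is the ISLR data set of returns with columns X and Y):

  library(ISLR); library(boot)
  set.seed(1)
  boot.out <- boot(Portfolio, alpha.fn, R = 1000)  # B = 1000 bootstrap data sets
  boot.out                                         # reports the original estimate, bias and std. error
  sd(boot.out$t)                                   # SE_B by hand: sd of the 1000 bootstrap alpha-hats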

Estimating the standard errors 2/2


Consider for example a simple data set Z that contains n = 3 observations.
We randomly select n observations from the data set in order to produce a
bootstrap data set Z*1.

Remark: sampling with replacement


The sampling is performed with replacement, which means that the same ob-
servation can occur more than once in the bootstrap data set.


The Bootstrap - 2/3

! Z*1 contains the third observation twice, the first observation once, and
no instances of the second observation.
! If an observation is contained in Z*1, then both its X and Y values are
included.
! We can use Z*1 to produce a new bootstrap estimate for α, which we
call α̂*1.
! This procedure is repeated B times for some large value of B, in order to
produce B different bootstrap data sets, Z*1, Z*2, . . . , Z*B, and B
corresponding estimates α̂*1, α̂*2, . . . , α̂*B.

The Bootstrap - 3/3

An example of results is the following:

[Figure: left, histogram of the bootstrap estimates of α (roughly between 0.3
and 0.9); right, boxplots comparing the estimates of α obtained from the true
population and from the bootstrap.]


Estimating bias

We can also use the bootstrap to estimate the bias of our estimator. The idea
is to make the approximation

E(θ̂) − θ ≈ E(θ̃*) − θ̂ ≈ (1/B) ∑_{b=1}^{B} θ̃*b − θ̂.

We remark that the approximation is valid as long as the distributions of θ̂ − θ
and θ̃* − θ̂ are close.
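
With the boot output from the previous sketch, this bias approximation is one line (boot.out$t holds the B bootstrap statistics, boot.out$t0 the original estimate):

  mean(boot.out$t) - boot.out$t0   # (1/B) * sum of the theta-tilde*b, minus theta-hat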


Labo activity with R

Labo activity 4.R

