Salvatore Ingrassia
s.ingrassia@unict.it
http://www.dei.unict.it/docenti/salvatore.ingrassia
http://www.datasciencegroup.unict.it
Data Analysis and Statistical Learning: 04 - Master in Data Science for Management - A.Y. 2019/2020 - Prof. S. Ingrassia
4. Resampling Methods
Introduction - 1/2
Resampling methods are a key tool in modern statistics and machine learning. They involve:
1. Repeatedly drawing a sample from the training data.
2. Refitting the model of interest with each new sample.
3. Examining all of the refitted models and then drawing appropriate conclusions.
In particular, they are used to obtain more information about the fitted model:
- Model Selection: estimating the performance of different models in order to choose the best one;
- Model Assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
For example, in order to estimate the variability of a linear regression fit, we
can repeatedly draw different samples from the training data, fit a linear
regression to each new sample, and then examine the extent to which the
resulting fits differ.
Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
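As a minimal sketch of this idea (the simulated data-generating model, sample size, and number of resamples below are illustrative assumptions, not from the lecture), we can repeatedly draw samples from the training data, refit a simple linear regression to each, and look at how much the fitted slopes vary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: y = 2 + 3x + noise.
n = 100
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(scale=1.0, size=n)

# Repeatedly draw samples (with replacement) from the training data,
# refit the linear regression, and collect the fitted slopes.
slopes = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)        # indices of one resample
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    slopes.append(slope)

# The spread of the refitted slopes measures the variability of the fit,
# information a single fit on the original sample would not provide.
print(np.mean(slopes), np.std(slopes))
```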
Introduction - 2/2
Suppose that we would like to find a set of variables that give the lowest test
(not training) error rate.
If we have a large data set, we can achieve this goal by randomly splitting the
data into training and validation (testing) parts.
We would then use the training part to build each possible model (i.e. the
different combinations of variables) and choose the model that gave the
lowest error rate when applied to the validation data.
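The procedure above can be sketched as follows (the quadratic data-generating model and the candidate polynomial degrees are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data set with a quadratic trend.
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1 + x - 2 * x**2 + rng.normal(scale=0.5, size=n)

# Randomly split the data into training and validation halves.
perm = rng.permutation(n)
train, val = perm[: n // 2], perm[n // 2 :]

# Build each candidate model on the training part and record its
# error rate (here, MSE) on the validation part.
val_mse = {}
for degree in range(1, 6):
    coefs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs, x[val])
    val_mse[degree] = np.mean((y[val] - pred) ** 2)

# Choose the model with the lowest validation error.
best = min(val_mse, key=val_mse.get)
print(best, val_mse[best])
```

Here the linear model should lose badly to any model containing the quadratic term, since it cannot capture the curvature at all.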
Advantages
- Simple.
- Easy to implement.

Disadvantages
- The validation MSE can be highly variable.
- Only a subset of the observations is used to fit the model (the training data), and statistical methods tend to perform worse when trained on fewer observations.
[Figure: two panels of validation-set Mean Squared Error (16-28) against polynomial degree (2-10).]
If we are in a data-rich situation, the best approach for both model selection and model assessment is to randomly divide the data set into three parts:
- training set: used to fit the models;
- validation set: used to estimate prediction error for model selection;
- test set: used for assessment of the generalization error of the final chosen model.
Why three subsets?
Ideally, the test set should be kept in a "vault", and be brought out only at the end of the data analysis.
Suppose instead that we use the test set repeatedly, choosing the model with the smallest test-set error. Then the test-set error of the final chosen model will underestimate the true test error, sometimes substantially.
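A sketch of the three-way protocol (the data, split sizes, and candidate models are illustrative assumptions): the validation set is consulted many times during model selection, but the test set is touched exactly once, at the very end.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data set with a quadratic trend.
n = 300
x = rng.uniform(-2, 2, size=n)
y = 1 + x - 2 * x**2 + rng.normal(scale=0.5, size=n)

# Randomly divide the data into three parts: train / validation / test.
perm = rng.permutation(n)
train, valid, test = perm[:150], perm[150:225], perm[225:]

# Model selection: fit on the training set, compare on the validation set.
val_mse = {}
for degree in range(1, 6):
    coefs = np.polyfit(x[train], y[train], degree)
    val_mse[degree] = np.mean((np.polyval(coefs, x[valid]) - y[valid]) ** 2)
chosen = min(val_mse, key=val_mse.get)

# Model assessment: the test set comes out of the "vault" only now,
# to estimate the generalization error of the final chosen model.
final = np.polyfit(x[train], y[train], chosen)
test_mse = np.mean((np.polyval(final, x[test]) - y[test]) ** 2)
print(chosen, test_mse)
```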
There are also some drawbacks to using a validation set in the training procedure:
1. Performance on the validation set is a random variable (often highly variable) that depends on the split of the data into training and validation sets.
2. The error on the validation set tends to over-estimate the test error rate of the model that is fitted to the entire data set.
A set of n data points is repeatedly split into a training set (shown in blue)
containing all but one observation, and a validation set that contains only that
observation (shown in beige).
The test error is then estimated by averaging the n resulting MSEs.
The first training set contains all but observation 1, the second training set
contains all but observation 2, and so forth.
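The LOOCV loop described above can be sketched directly (the simulated data set and the use of a simple linear regression are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Small hypothetical data set (LOOCV refits the model n times).
n = 40
x = rng.uniform(-2, 2, size=n)
y = 0.5 + 2 * x + rng.normal(scale=0.3, size=n)

# The i-th training set contains all observations except observation i;
# the held-out observation forms the validation set of size one.
squared_errors = []
for i in range(n):
    mask = np.arange(n) != i                 # all but observation i
    coefs = np.polyfit(x[mask], y[mask], 1)
    pred = np.polyval(coefs, x[i])
    squared_errors.append((y[i] - pred) ** 2)

# The LOOCV estimate of the test error averages the n resulting MSEs.
cv_estimate = np.mean(squared_errors)
print(cv_estimate)
```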
LOOCV has a couple of major advantages over the validation set approach: it has far less bias, since each training set contains n − 1 of the observations, and it always yields the same result, since there is no randomness in the training/validation splits.
[Figure: LOOCV estimate of the Mean Squared Error (16-28) against Degree of Polynomial (2-10).]
Cross-validation was used on the Auto data set in order to estimate the test
error that results from predicting mpg using polynomial functions of
horsepower.
[Figure: Mean Squared Error (16-28) against Degree of Polynomial (2-10).]
10-fold CV was run nine separate times, each with a different random split of
the data into ten parts. The figure shows the nine slightly different CV error
curves.
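This experiment can be sketched as follows (the simulated data set is a hypothetical stand-in for the Auto data, and the fold assignment via a seeded permutation is one of several reasonable implementations):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data with a quadratic trend.
n = 200
x = rng.uniform(-2, 2, size=n)
y = 1 - x + x**2 + rng.normal(scale=0.5, size=n)

def kfold_cv_mse(x, y, degree, k=10, seed=0):
    """Estimate the test MSE of a degree-`degree` polynomial by k-fold CV."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(x)), k)
    errors = []
    for fold in folds:
        mask = np.ones(len(x), dtype=bool)
        mask[fold] = False                    # hold this fold out
        coefs = np.polyfit(x[mask], y[mask], degree)
        errors.append(np.mean((np.polyval(coefs, x[fold]) - y[fold]) ** 2))
    return np.mean(errors)

# Running 10-fold CV nine times with different random splits gives
# slightly different error estimates for the same model.
estimates = [kfold_cv_mse(x, y, degree=2, k=10, seed=s) for s in range(9)]
print(min(estimates), max(estimates))
```

The nine estimates differ only because the assignment of observations to the ten folds differs; the model and the data are identical across runs.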
Remember that the validation set approach can lead to overestimates of the
test error rate, since in this approach the training set used to fit the statistical
learning method contains only half the observations of the entire data set.
LOOCV has higher variance than k-fold CV (when k < n). Therefore, from the perspective of variance reduction, k-fold CV is to be preferred to LOOCV.
[Figure: simulated two-class observations plotted in the (X1, X2) plane.]
[Figure: decision boundaries of linear and quadratic logistic regression on the simulated data.]
Error rate = 0.201 (linear); error rate = 0.197 (quadratic). Linear logistic regression is not able to fit the Bayes' decision boundary; quadratic logistic regression does better than linear.
[Figure: Degree=3 logistic regression decision boundaries on the simulated data.]
[Figure: Error Rate (0.12-0.20) against polynomial degree (2-10).]
The bootstrap
The bootstrap is one of the most general and most widely used tools for estimating measures of uncertainty associated with a given statistical method.
Some common bootstrap applications are: estimating the bias or variance of
a particular statistical estimator, or constructing approximate confidence
intervals for parameters of interest.
As a simple example, the bootstrap can be used to estimate the standard
errors of the coefficients from a linear regression fit. In the specific case of
linear regression, this is not particularly useful, since standard statistical
software such as R outputs such standard errors automatically.
However, the power of the bootstrap lies in the fact that it can be easily
applied to a wide range of statistical learning methods, including some for
which a measure of variability is otherwise difficult to obtain and is not
automatically output by statistical software.
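As a sketch of the simple example just mentioned (the simulated regression data and the number of bootstrap replicates are illustrative assumptions), we can bootstrap the standard error of a regression slope and compare it with the textbook formula that software such as R would report:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical regression data.
n = 100
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

# Bootstrap: resample (x, y) pairs with replacement, refit, and use the
# spread of the refitted slopes as a standard-error estimate.
B = 1000
slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    slopes[b], _ = np.polyfit(x[idx], y[idx], 1)
se_boot = slopes.std(ddof=1)

# For linear regression a closed-form standard error exists for comparison:
# SE(beta_hat) = sigma_hat / sqrt(sum((x - mean(x))^2)).
resid = y - np.polyval(np.polyfit(x, y, 1), x)
sigma_hat = np.sqrt(resid @ resid / (n - 2))
se_formula = sigma_hat / np.sqrt(((x - x.mean()) ** 2).sum())
print(se_boot, se_formula)
```

The two estimates agree closely here; the point of the bootstrap is that the first recipe still works when no closed-form standard error is available.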
The problem
Bootstrap in a nutshell
Remember that in general $\mathrm{Var}(\alpha X + \beta Y) = \alpha^2 \mathrm{Var}(X) + \beta^2 \mathrm{Var}(Y) + 2\alpha\beta \mathrm{Cov}(X, Y)$.
We invest a fraction α of our money in X and the remaining 1 − α in Y, choosing α to minimize the variance of the total return; the minimizing value, estimated from the data, is
$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}.$$
[Figure: four simulated data sets of returns (X, Y), each yielding a different estimate of α.]
Over 1000 simulated data sets we obtain estimates $\hat{\alpha}_1, \ldots, \hat{\alpha}_{1000}$, with mean
$$\bar{\alpha} = \frac{1}{1000} \sum_{i=1}^{1000} \hat{\alpha}_i$$
and standard deviation
$$SE(\hat{\alpha}) = \sqrt{\frac{1}{1000 - 1} \sum_{i=1}^{1000} (\hat{\alpha}_i - \bar{\alpha})^2}.$$
[Figure: histogram of the 1000 estimates of α.]
Problem: for real data we cannot generate new samples from the original population, so SE(α̂) cannot be estimated this way; the bootstrap mimics the procedure by sampling from the data set itself. Recall that
$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\hat{\sigma}_{XY}}.$$
Each bootstrap data set $Z^{*b}$ yields an estimate $\hat{\alpha}^{*b}$ based on $\tilde{z}_1^{(b)}, \ldots, \tilde{z}_n^{(b)}$.
Then
$$SE_B(\hat{\alpha}) \approx \sqrt{\frac{1}{B - 1} \sum_{b=1}^{B} \left( \hat{\alpha}^{*b} - \frac{1}{B} \sum_{r=1}^{B} \hat{\alpha}^{*r} \right)^2}.$$
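The bootstrap standard-error formula can be computed directly (the bivariate-normal returns below, their covariance matrix, and B = 1000 are hypothetical choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)

def alpha_hat(x, y):
    """alpha_hat = (s2_y - s_xy) / (s2_x + s2_y - 2 * s_xy)."""
    c = np.cov(x, y)
    return (c[1, 1] - c[0, 1]) / (c[0, 0] + c[1, 1] - 2 * c[0, 1])

# Hypothetical returns for the two investments X and Y.
n = 100
xy = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.25]], size=n)
x, y = xy[:, 0], xy[:, 1]

# B bootstrap data sets, each drawn with replacement from the n pairs
# (both the X and Y values of a chosen observation are kept together).
B = 1000
alphas = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    alphas[b] = alpha_hat(x[idx], y[idx])

# SE_B(alpha_hat): sample standard deviation of the B bootstrap estimates.
se_b = np.sqrt(((alphas - alphas.mean()) ** 2).sum() / (B - 1))
print(se_b)
```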
- $Z^{*1}$ contains the third observation twice, the first observation once, and no instances of the second observation.
- If an observation is contained in $Z^{*1}$, then both its X and Y values are included.
- We can use $Z^{*1}$ to produce a new bootstrap estimate for α, which we call $\hat{\alpha}^{*1}$.
- This procedure is repeated B times for some large value of B, in order to produce B different bootstrap data sets, $Z^{*1}, Z^{*2}, \ldots, Z^{*B}$, and B corresponding estimates $\hat{\alpha}^{*1}, \hat{\alpha}^{*2}, \ldots, \hat{\alpha}^{*B}$.
[Figure: histograms of the estimates of α obtained from the true population and from the bootstrap, together with boxplots of the two sets of estimates (True vs. Bootstrap).]
Estimating bias
We can also use the bootstrap to estimate the bias of our estimator. The idea is to make the approximation
$$E(\hat{\theta}) - \theta \approx E(\hat{\theta}^{*}) - \hat{\theta} \approx \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{*b} - \hat{\theta}.$$
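A quick sketch of this approximation (the choice of the sample maximum as the estimator, the Uniform(0, 1) data, and B = 2000 are hypothetical; the sample maximum is used because it is visibly biased):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setting: estimate the population maximum of Uniform(0, 1)
# by the sample maximum, a downward-biased estimator.
n = 50
z = rng.uniform(0, 1, size=n)
theta_hat = z.max()                        # plug-in estimate

# Bias approximation: E(theta_hat) - theta is approximated by the mean
# of the bootstrap statistics minus the original estimate.
B = 2000
boot_stats = np.empty(B)
for b in range(B):
    boot_stats[b] = z[rng.integers(0, n, size=n)].max()

bias_estimate = boot_stats.mean() - theta_hat
print(bias_estimate)
```

The bootstrap bias estimate comes out negative, matching the known direction of the bias: the sample maximum systematically understates the population maximum.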