
ST3189 Machine Learning

Block 3 - Sampling and Resampling Methods


Introduction video (transcript)
Click here for the link to the video.
[music]
Speaker: Welcome to the video introduction of Block three, titled Sampling and Resampling Methods.
In this block, we will enhance our toolkit of methods with three very popular techniques in probability,
statistics, and machine learning that are also encountered in many other disciplines. These techniques
are Monte Carlo simulation, sometimes termed Monte Carlo integration, cross-validation, and the bootstrap.
First, Monte Carlo is a very useful technique for calculating expected values of functions of random
variables. This is not done by calculus; instead, random variables are simulated on a computer.
Hence, Monte Carlo is very useful in cases where the expected values involve integrals or sums
that cannot be solved using calculus. A famous example of the use of Monte Carlo comes from finance:
the calculation of option prices.
The second tool is cross-validation, which is a more general version of the process of splitting the data
into training and test datasets to assess the out-of-sample performance of various models. This process was
described in the introduction of the first block. It's a very useful technique in the task of choosing
between competing models. Finally, the bootstrap method is used in several contexts, most commonly
to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.
By completing this block, you should be able to identify what kind of practical questions can be
addressed with Monte Carlo, cross-validation or bootstrap, understand the differences between them,
and how these techniques can be implemented in R. Finally, an important learning outcome of this block,
mostly related to cross-validation, is to develop the ability to distinguish between the concepts of
out-of-sample performance and goodness of fit.
[00:02:32] [END OF AUDIO]

Learning outcomes
• Understand and be able to implement Monte Carlo Integration, Cross-Validation and Bootstrap
• Identify what kind of practical questions can be addressed with the previously mentioned techniques
• Understand the difference between Monte Carlo and Bootstrap
• Familiarise yourself with the concept of out-of-sample performance and how it differs from goodness of fit


Examples of Monte Carlo, Cross-Validation and Bootstrap in action
The material covered in this block provides some useful techniques that are widely used in
Statistics, Machine Learning and related disciplines. These techniques are Monte Carlo integration,
Cross-Validation and Bootstrap. Among other things, they can be used to answer questions like the
following:
1. Assume that the earnings of a company each month follow the Gamma(50, 0.01) distribution.
The salary of an employee for a given month is $2,000 plus a bonus of 10% of the
company's earnings if these exceed $5,000 in that month. Without using any calculus, how
much does this employee make each month on average? What is the standard deviation of
his monthly earnings?
2. Recall Exercise 9 from Chapter 3 of the James et al book on the Auto dataset. There appears
to be a non-linear relationship between mpg and horsepower, and a model that predicts
mpg using horsepower and horsepower² gives better results than a model that uses only a
linear term. It is natural to wonder whether a cubic or higher-order fit might provide even
better results. Which degree of polynomial will predict future data best?
3. Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns
of X and Y, respectively, where X and Y are random variables. We will invest a fraction α of
our money in X, and will invest the remaining 1−α in Y. The amount of our earnings will then
be αX+(1−α)Y, and the variance of our earnings (the risk) is var[αX+(1−α)Y]. It can be shown that
the α that minimises our risk is:

\[ \alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}} \]

where \(\sigma_X^2 = \text{var}(X)\), \(\sigma_Y^2 = \text{var}(Y)\) and \(\sigma_{XY} = \text{cov}(X,Y)\). In the presence of past data on X and Y, we can
estimate \(\hat{\sigma}_X^2\), \(\hat{\sigma}_Y^2\) and \(\hat{\sigma}_{XY}\), and therefore \(\hat{\alpha}\). But how about the standard error of \(\hat{\alpha}\)?

(Note that we don't want to make any assumptions on the distributions of X and Y; only
that \(\sigma_X^2\), \(\sigma_Y^2\) and \(\sigma_{XY}\) are finite.)

The above questions can be addressed with Monte Carlo, Cross-Validation and Bootstrap
respectively.


Reading list
Essential Reading
James, Witten, Hastie and Tibshirani (James et al), An introduction to statistical learning: all
sections of Chapter 5 except 5.1.5.
Rogers and Girolami: Section 1.5

Bootstrap
Read the material in James et al book, section 5.2. You will also find the answer to the third question
in the Motivating Examples document.
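As a preview of the third question, here is a minimal sketch of how the bootstrap can approximate the standard error of \(\hat{\alpha}\) in R, assuming the Portfolio dataset from the ISLR package (this mirrors the code in section 5.3.4 of James et al):

library(ISLR)    # provides the Portfolio dataset, with columns X and Y
library(boot)

# Statistic of interest: the risk-minimising alpha computed on a bootstrap sample
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

set.seed(1)
boot(Portfolio, alpha.fn, R = 1000)   # reports alpha-hat and its bootstrap standard error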

Cross-Validation
The relevant parts of the recommended resources consist of the material in the James et al book, sections
5.1, 5.1.1, 5.1.2, 5.1.3 and 5.1.4. You will also find the answer to the second question in the
Motivating Examples document.
You can also check Section 1.5 in the Rogers and Girolami book, and the section in this block 'Simulating
Random Numbers from Distributions'.

Cross-Validation R Tutorial (Video)


Transcript
[music]
Male Speaker: Hello there. This is a video for the course Machine Learning, in particular Block 3 on
the topic of cross-validation. In this video we will essentially go through together the R lab corresponding
to cross-validation. The necessary code is contained in a file called RLab_Block 3, an R script file, which is
available on the Moodle section of this course. You can download it and do the same thing after
watching this video.
What you need to do first is load the ISLR package; from the ISLR package we'll be
needing the Auto dataset. This requires that you have the ISLR package installed on your computer. If
not, you need the command install.packages. Let's run those. Now, to introduce the cross-validation
approach, we first need to understand the main idea, which is the following.
We want to split the data into two parts, the training bit and the testing bit. What we'll be doing is we'll
be estimating the models only on the training part of the data. Based on those estimates we will be trying
to forecast, to predict, the observations in the test dataset. Once we have predictions for this dataset, we
can then check how good these predictions are by contrasting them with the true observations that we
have in the test dataset.
We'll be using different models. All of them will be fit to the training data. We'll get their predictions
and we'll assess those predictions on the test data.
Let's see. Well, we first fit a simple linear regression model. We'll use the command "lm" for regression;
the response variable is miles per gallon, "mpg," and the predictor is "horsepower."
It is good to also have "set.seed(1)." That means that we fix a known seed, here one, so if you are to do it
on your own and do that as well, you will get exactly the same answers as I'm getting. That's a way for
you to check that you're doing it correctly.
Let's run the "set.seed." Let's generate the training sample. Well, let's see what this does. This is the
command sample; it samples random integers between 1 and 392. There are 392 points in the Auto
dataset and each of these indices corresponds to an [unintelligible 00:03:05] data point, a pair of "mpg" and
"horsepower" values. What this command does is it takes a random sample from 1 to 392, and that random
sample defines the training dataset. That's what I got, and that's what you should be getting as well if
you had "set.seed(1)."
These are the first few values of the training vector here. By the way, this environment contains the
vectors. We can see what things are generated here. You can see some of the results and here you can
see some plots. Let's move on. The first model, as I said, is a simple linear regression model. Let's run
it. That's it. Now, let's explain what this command does. This is essentially calculating the mean squared
error of the predictions generated by the first model. This mean squared error is taken only on the test
dataset. Let's see various components of it. This is an important bit. This is the function predict. What
it does is take the estimates of the linear regression model that were stored in the object "lm.fit."
The predict function extracts those, and based on those, it generates predictions for all the data in the Auto
dataset. Now, this thing here, [-train]: what it does is it goes to the Auto dataset and takes out all the
points [unintelligible 00:04:46] contained in the training set. Essentially, this isolates the test dataset.
Finally, what we do is we have the real values of "mpg," we have the forecasts, the predictions from predict,
we subtract them and raise this to the square. These are the squared errors of the
predictions. We take the mean over all of them, so that's the mean squared error. In this case, the mean
squared error is equal to, let's run this, you see what we get, it is equal to 26.14. We can do the same for
a quadratic model. This polynomial function here: if you put two, that means you have a quadratic
polynomial, and you fit it back to the data.
If we do that, we get a mean squared error which is smaller; it's 19.82 actually. That means that perhaps
a quadratic model performs better, has better predictive performance, than the simple linear regression
model. Well, since we tried the quadratic, why not try a cubic polynomial as well? Just set this number
to three in the polynomial function. If we do that, we see that the mean squared error is actually
even a bit smaller.
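For reference, here is a minimal sketch of the validation-set commands described so far, in the spirit of section 5.3.1 of James et al (the exact RLab_Block 3 script may differ slightly):

library(ISLR)   # provides the Auto dataset
set.seed(1)
train <- sample(392, 196)   # random half of the 392 observations

lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)    # test MSE: about 26.14

lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit2, Auto))[-train]^2)   # about 19.82

lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit3, Auto))[-train]^2)   # slightly smaller again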
Maybe a bit better predictive performance for the third-degree polynomial model. Of
course, these are all dependent on the choice that we used for which observations go to the training
dataset and which go to the test dataset. If we were to use another seed and we were to redraw these
indices, then we would get some different sets.
Let's do this first: set a different seed and redraw. If you look here, you see some different
indices; this thing changed. If we rerun these three models and calculate the mean squared errors,
we'll get some different numbers. For example, you may notice that the smallest mean squared error
corresponded to the third model in the previous case, whereas in this case, the smallest one is the
second model, the quadratic model. Although, both the second and the third are very, very close in both
cases.
Maybe someone wants to ask a good question: "Okay, how do we choose? I mean, how do we handle
this variability, or how do we select the training and the test dataset?" One way to do so is the
leave-one-out cross-validation approach. With it we can calculate another, more accurate estimate of
the test error, the prediction error, the mean squared error. To do so, we need to load the boot library
first. Do install it if you don't have it. Then we will be using the command "glm" rather than the command
"lm."
The reason is that "glm" has a nice extra function called "cv.glm," which is very convenient for cross-
validation. Now, you may want to ask the question, "This is a linear regression model. Why do we want
to use "glm," which is essentially for generalized linear models?" Well, the answer is that if
you don't specify a family in this "glm" function and leave it at the default, that's essentially the linear
model. So, you will get exactly the same results using "glm" and "lm" if you don't put
a different family here. To see that, I will just fit the two simple linear regression models and you'll see
that we get identical coefficients with "glm" and with "lm."
"glm.fit" is what we will use, and then we'll estimate "cv.err," the cross-validation error, with
"cv.glm." If we do that and run it, well, you will see that with cross-validation you get the number
24.23. This is, in effect, an average over many possible splits between train and test datasets;
that's what it is. 24.23 is a more accurate number than the 26 that we had earlier.
The next bit of the code does the same thing, actually the same thing, for various polynomial models.
So we let i run from one, which corresponds to simple linear regression, up to five: degrees of polynomial
up to five. This is the [unintelligible 00:09:27] loop; if we run this we'll get-- Well, it takes a
while to run. Let's wait for it. It will fit the different polynomials and calculate the mean squared errors.
You'll see that for the simple linear regression we have 24.23, which is exactly this guy. Then, this thing
drops but stays very close to 19. What this tells us is that perhaps we do not need to go beyond the
second-order polynomial; as long as we get there, the mean squared error is around the area of 19.24.
Actually, this is the smallest one in this case. This was the leave-one-out cross-validation.
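The loop just described might look like this:

cv.error <- rep(0, 5)
for (i in 1:5) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]   # LOOCV error for polynomial degree i
}
cv.error   # starts at about 24.23, then drops and stays close to 19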
If you want to do the same for k-fold cross-validation, the most popular choice is tenfold cross-validation.
[unintelligible 00:10:17] is here. It does a very similar thing. Of course, this also uses
polynomials up to 10. If we do this, we see very similar results here, and maybe here you get some
values that are below 18, but pretty much the message is the same. As long as you get to a quadratic
polynomial model, that's fine; the mean squared error is around 19. Perhaps the quadratic model
has the best predictive performance in this case.
[00:11:04] [END OF AUDIO]
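A sketch of the tenfold loop described at the end of the tutorial (following James et al, section 5.3.3; note the set.seed, since the fold assignment is random):

set.seed(17)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]   # 10-fold CV error for degree i
}
cv.error.10   # the message matches LOOCV: little gain beyond degree 2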

Simulating Random Numbers from Distributions


Note: The material in this section (Simulating Random Numbers from Distributions) is non-
examinable and is only here for those who are curious how R (and other software) simulates
'random' numbers from distributions.
If you find the material challenging, feel free to skip to the next topic, 'Monte Carlo Simulation'.

Pseudo-random variables from a Uniform


The following (the Linear Congruential Generator) is a deterministic algorithm that provides sequences of
numbers that resemble remarkably well random samples from the Uniform distribution
between 0 and 1. While there is nothing random in this algorithm, it produces sets of numbers
that satisfy all the properties that a random sample from the Uniform distribution has! It is the
cornerstone of doing simulations on computers.

Linear Congruential Generator:


Define a sequence \(\{s_i\}\) and set (for some \(a\), \(b\), \(s_0\), \(M\))

\[ s_{i+1} = (a s_i + b) \bmod M. \]

Then \( U_i = \frac{s_i}{M} \sim U(0,1) \).
The number \(s_0\) is called the seed, and is usually taken by the computer to be the current time,
whereas \(a\), \(b\) and \(M\) can be chosen accordingly. See Wikipedia for some good values.
Since the above numbers come from a deterministic algorithm, yet they behave as random numbers
do, they are called pseudo-random numbers.
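A minimal sketch of a linear congruential generator in R; the constants a = 16807, b = 0 and M = 2^31 − 1 below are the classic 'minimal standard' choices (real software uses more sophisticated generators):

lcg <- function(n, seed, a = 16807, b = 0, M = 2^31 - 1) {
  s <- numeric(n)
  s[1] <- seed
  for (i in seq_len(n - 1)) {
    s[i + 1] <- (a * s[i] + b) %% M   # the recursion s_{i+1} = (a s_i + b) mod M
  }
  s / M                               # scale to (0, 1)
}

u <- lcg(1000, seed = 42)
hist(u)   # should look roughly flat, like a Uniform(0,1) sample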


Random numbers from Bernoulli/Binomial


Let X be a Bernoulli (π) random variable.
In other words P(X=1) = π and P(X=0)=1− π.
Pseudo-random numbers of X may be generated by:
1. Generate 𝑢𝑖 ∼ U(0,1)
2. If 𝑢𝑖 ≤ π set 𝑋𝑖 = 1, otherwise set 𝑋𝑖 = 0
Let Y be a Binomial(n, π) random variable. Note that if \(X_i\), i = 1, …, n, are independent Bernoulli(π),
then \(Y = \sum_{i=1}^{n} X_i\).

So to draw Y, we can draw each \(X_i\) as above and sum.
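A sketch of both recipes in R (the values of π and n below are illustrative):

p <- 0.3                     # the Bernoulli probability pi
u <- runif(10000)            # step 1: uniforms
x <- ifelse(u <= p, 1, 0)    # step 2: Bernoulli(pi) draws
mean(x)                      # should be close to 0.3

n <- 20
y <- sum(runif(n) <= p)      # one Binomial(n, pi) draw as a sum of n Bernoullis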

Random numbers from discrete distributions


Let X take values \(a_i\), i = 1, …, p, with \(P(X = a_i) = \pi_i\), where \(\sum_{i=1}^{p} \pi_i = 1\).

A random number of X can be drawn using the following steps:


1. Define \(c_k = \sum_{i=1}^{k} \pi_i\)

2. Generate \(u_i \sim U(0,1)\)

3. Find the smallest k such that \(u_i \le c_k\). Set \(X_i = a_k\).
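A sketch in R (the support and probabilities below are illustrative):

a  <- c(1, 5, 10)            # support a_1, ..., a_p
pr <- c(0.5, 0.3, 0.2)       # probabilities pi_i, summing to 1
ck <- cumsum(pr)             # step 1: cumulative sums c_k

u <- runif(1)                # step 2
k <- min(which(u <= ck))     # step 3: smallest k with u <= c_k
x <- a[k]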

Random numbers from continuous distributions

Theorem


If X is a continuous random variable with cdf F, then \(U = F(X) \sim \text{Unif}(0,1)\); that is, \(F_U(u) = u\) for \(0 \le u \le 1\).

Proof:

\[ F_U(u) = P(U \le u) = P(F(X) \le u) = P(X \le F^{-1}(u)) = F(F^{-1}(u)) = u. \]

Corollary (Inversion method)


The random variable \(X = F^{-1}(U)\), where \(U \sim U(0,1)\), has distribution F.

Proof:

\[ P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x). \]


Random numbers from Exponential distribution


We want to draw from an Exponential(λ) distribution, for which \(F(x) = 1 - \exp(-\lambda x)\).

Setting \(u = 1 - \exp(-\lambda x)\) and solving for x: \(1 - u = \exp(-\lambda x)\), so \(\log(1 - u) = -\lambda x\), and hence \(x = -\frac{1}{\lambda}\log(1 - u)\).


The inversion method contains the following steps:

1. Generate \(U_i \sim U(0,1)\)
2. Set \(X_i = -\frac{1}{\lambda}\log(1 - U_i)\).
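A sketch in R, checked against the theoretical mean 1/λ (λ = 2 is an illustrative value):

lambda <- 2
u <- runif(10000)             # step 1
x <- -log(1 - u) / lambda     # step 2: the inversion formula
c(mean(x), 1 / lambda)        # sample mean vs theoretical mean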

Random numbers from other distributions


The sum of n independent Exponential(λ) random variables is a Gamma(n, λ) random variable.
The inverse of a Gamma(n, λ) random variable is an IGamma(n, λ) (inverse Gamma) random variable.
If \(u_1\), \(u_2\) are independent U(0,1) random variables, then the random variables

\[ x_1 = \sqrt{-2\log(u_1)}\cos(2\pi u_2), \qquad x_2 = \sqrt{-2\log(u_1)}\sin(2\pi u_2) \]

are a pair of independent N(0,1) random variables (the Box-Muller transform).

• Properties of Poisson processes and their connection with the exponential distribution can be
used to efficiently generate Poisson random variables.
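As an example, a sketch of the Box-Muller transform above in R:

u1 <- runif(5000)
u2 <- runif(5000)
x1 <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)   # first N(0,1) variable
x2 <- sqrt(-2 * log(u1)) * sin(2 * pi * u2)   # second, independent N(0,1) variable
qqnorm(x1)   # points should lie close to a straight line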

Calculating Expectations with Monte Carlo Simulation

Monte Carlo Integration


Let X be a random variable with cumulative distribution function F(x), and let h(x) be a function such that
\(E_X(h(X)) < \infty\). If X is a continuous random variable we let f(x) denote the probability density function,
whereas in the case of X being a discrete random variable f(x) denotes the probability mass function
P(X = x). Then \(E_X(h(X))\) is defined as the following integral or sum:

\[ E_X(h(X)) = \int_{\mathcal{X}} h(x) f(x)\, dx \quad \text{or} \quad \sum_{x \in \mathcal{X}} h(x) f(x) \tag{3.1} \]

where \(\mathcal{X}\) is the set of values X can take.

Monte Carlo Calculation


Let \(x_1, \dots, x_n\) be a sample from the distribution F.
The Monte Carlo approximation to equation 3.1 is then given by
\[ E_X(h(X)) \approx \frac{1}{n} \sum_{i=1}^{n} h(x_i) \]


Implementation
Draw \(x_1, \dots, x_n\) from F in suitable computer software such as R and calculate the integral using the
above estimator. The error can be made arbitrarily small by simply drawing more numbers and thus
increasing n.

Proof
Direct application of the Law of Large Numbers:

\[ I_n = \frac{1}{n} \sum_{i=1}^{n} h(x_i) \to \int_{\mathcal{X}} h(x)\, dF(x) = I \]

Monte Carlo in R
Below we provide some standard and simple functions in R to calculate expectations using Monte
Carlo.
Generating random numbers from distributions
runif(n, min=0, max=1)
rnorm(n, mean = 0, sd = 1)
rbeta(n, shape1, shape2)
rgamma(n, shape, rate = 1) : Note scale=1/rate
rpois(n, lambda)
rbinom(n, size, prob)
rnbinom(n, size, prob)
Probability density/mass functions, cumulative distribution functions and quantile functions are also
available (e.g. dnorm, pnorm and qnorm for the Normal).
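As a quick illustration, a sketch that estimates \(E(X^2)\) for \(X \sim N(0,1)\), whose true value is 1:

n <- 100000
x <- rnorm(n, mean = 0, sd = 1)   # draw x_1, ..., x_n from F
mean(x^2)                         # Monte Carlo estimate of E(h(X)) with h(x) = x^2; close to 1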

Self-test Activities
1. Exercise 1 in James et al
Attempt Exercise 1 in Chapter 5 of James et al.

Solution
Note that

\[ \text{var}[\alpha X + (1-\alpha)Y] = \alpha^2 \sigma_X^2 + (1-\alpha)^2 \sigma_Y^2 + 2\alpha(1-\alpha)\sigma_{XY} \]


The above expression can be minimised with respect to α by taking the first derivative (with respect to α)
and setting it equal to 0. You can also confirm that this is a minimum (rather than a maximum) by
checking that the second derivative is positive.
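For completeness, a sketch of the algebra:

\[ \frac{d}{d\alpha}\,\text{var}[\alpha X + (1-\alpha)Y] = 2\alpha\sigma_X^2 - 2(1-\alpha)\sigma_Y^2 + 2(1-2\alpha)\sigma_{XY} = 0 \]

\[ \Rightarrow\; \alpha(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}) = \sigma_Y^2 - \sigma_{XY} \;\Rightarrow\; \alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}. \]

The second derivative is \(2(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}) = 2\,\text{var}(X - Y) \ge 0\), confirming a minimum.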

Exercise 3 in James et al
Attempt Exercise 3 in Chapter 5 of James et al.

Solution
a. k-fold cross-validation is implemented by taking the set of n observations and randomly
splitting them into k non-overlapping groups. Each of these groups in turn acts as a validation set, with the
remainder as a training set. The test error is estimated by averaging the k resulting MSE
estimates.
b.
i. The validation set approach is conceptually simple and easily implemented, as you are
simply partitioning the existing training data into two sets. However, there are two
drawbacks: (1) the estimate of the test error rate can be highly variable, depending on
which observations are included in the training and validation sets; (2) the validation
set error rate may tend to overestimate the test error rate for the model fit on the entire
data set.
ii. LOOCV is a special case of k-fold cross-validation with k = n. Thus, LOOCV is the
most computationally intensive method, since the model must be fit n times. Also,
LOOCV has higher variance, but lower bias, than k-fold CV.

Exercise 4 in James et al
Attempt Exercise 4 in Chapter 5 of James et al.

Solution
Suppose we use some statistical learning method to make a prediction for the response Y for a
particular value of the predictor X. We might estimate the standard deviation of our prediction by using
the bootstrap approach.
The bootstrap approach works by repeatedly sampling observations (with replacement) from the
original data set B times, for some large value of B, each time fitting a new model and obtaining a
prediction, and then taking the standard deviation of the resulting B predictions.
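A sketch of this in R with the boot package; the model, dataset and prediction point below are illustrative (a linear fit of mpg on horsepower from the Auto data, predicting at horsepower = 100):

library(ISLR)
library(boot)

# Statistic: the prediction at horsepower = 100 from a model
# fitted to a bootstrap sample of the Auto data
pred.fn <- function(data, index) {
  fit <- lm(mpg ~ horsepower, data = data, subset = index)
  predict(fit, newdata = data.frame(horsepower = 100))
}

set.seed(1)
boot(Auto, pred.fn, R = 1000)   # the reported std. error estimates the sd of the prediction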

Monte Carlo Quiz


Attempt the quiz on Monte Carlo in R here. You can also find a link to this quiz on the main page of
this block.
1. Provide the answer to the first question in the Examples of Monte Carlo, Cross-Validation
and Bootstrap in action section

Solution
You can use the following code in R:

x <- rgamma(10000, 50, 0.01)        # simulate monthly company earnings from Gamma(50, 0.01)

y <- ifelse(x > 5000, 0.1 * x, 0)   # bonus: 10% of earnings, paid only if earnings exceed $5,000

mean(y) + 2000                      # average monthly pay: fixed $2,000 plus the average bonus

sqrt(var(y))                        # standard deviation (the fixed $2,000 adds no variability)

2. Replicate the analysis of section 5.3 of James et al on your own computer.


3. Do Exercises (8), (9) of Chapter 5 of James et al.

Solutions to James et al Activities


Click below for the solutions:
Exercise 8
Exercise 9


Collaborative Activities
After you have attempted Exercise 8 (d) in James et al, provide your explanation of why, and share it
with others in the discussion forum.
After you have attempted Exercise 8 (h) in James et al, provide your explanation of why, and share it
with others in the discussion forum.

Learning outcomes checklist


• Understand and be able to implement Monte Carlo Integration, Cross-Validation and Bootstrap.
• Identify what kind of practical questions can be addressed with the previously mentioned techniques.
• Understand the difference between Monte Carlo and Bootstrap.
• Familiarise yourself with the concept of out-of-sample performance and how it differs from goodness of fit.

Version 1.0 Last updated 09/08/19
