Block 3 ST3189
Learning outcomes
Understand and be able to implement Monte Carlo Integration, Cross-Validation and the Bootstrap.
Identify what kind of practical questions can be addressed with the previously mentioned techniques.
Understand the difference between Monte Carlo and the Bootstrap.
Familiarise yourself with the concept of out-of-sample performance and how it differs from goodness of fit.
Suppose, as in Section 5.2 of James et al, that we invest a fraction $\alpha$ of our money in asset X and the remaining $1-\alpha$ in asset Y, choosing $\alpha$ to minimise the variance of the portfolio return. The variance-minimising choice is

$$\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}},$$

where $\sigma_X^2 = \mathrm{var}(X)$, $\sigma_Y^2 = \mathrm{var}(Y)$ and $\sigma_{XY} = \mathrm{cov}(X,Y)$. In the presence of past data on X and Y, we can estimate $\hat{\sigma}_X^2$, $\hat{\sigma}_Y^2$ and $\hat{\sigma}_{XY}$, and therefore $\hat{\alpha}$. But how about the standard error of $\hat{\alpha}$? (Note that we don't want to make any assumptions on the distribution of X and Y; only that $\sigma_X^2$, $\sigma_Y^2$ and $\sigma_{XY}$ are finite.)
The above questions, given in full in the Motivating Examples document, can be addressed with Monte Carlo, Cross-Validation and the Bootstrap respectively.
Reading list
Essential Reading
James, Witten, Hastie and Tibshirani (James et al), An Introduction to Statistical Learning: all sections of Chapter 5 except 5.1.5.
Rogers and Girolami: Section 1.5.
Bootstrap
Read the material in the James et al book, Section 5.2. You will also find the answer to the third question in the Motivating Examples document.
Cross-Validation
The relevant parts of the recommended resources consist of the material in the James et al book, Sections 5.1, 5.1.1, 5.1.2, 5.1.3 and 5.1.4. You will also find the answer to the second question in the Motivating Examples document.
You can also check Section 1.5 in the Rogers and Girolami book, and the section in this block, 'Simulating Random Numbers From Distributions'.
"horsepower" value. What this command does is it takes a random sample from 1 to 392 and that random
sample defines the training dataset. That's what I got and that's what you should be getting as well if
you had "set.seed(1)."
These are the first few values of the training vector here. By the way, this environment contains the variables; we can see what has been generated here. You can see some of the results, and here you can see some plots. Let's move on. The first model, as I said, is a simple linear regression model. Let's run it. That's it. Now, let's explain what this command does. It is essentially calculating the mean squared error of the predictions generated by the first model, and this mean squared error is taken only on the test dataset. Let's see its various components. This is an important bit: the function predict. What it does is take the estimates of the linear regression model that were stored in the object "lm.fit." The predict function extracts those and, based on them, generates predictions for all the data in the Auto dataset. Now, this thing here, [-train]: what it does is go to the Auto dataset and take out all the points [unintelligible 00:04:46] contained in the training set. Essentially, this isolates the test dataset.
Finally, we have the real values of "mpg" and we have the predictions from predict. We subtract them and raise the result to the square; these are the squared errors of the predictions. We take the mean over all of them, so that's the mean squared error. In this case, the mean squared error is equal to, let's run this and see what we get, 26.14. We can do the same for a quadratic model: in this polynomial function here, you put two, which means you have a quadratic polynomial, and you fit that to the data.
If we do that, we get a mean squared error which is smaller; it's 19.82 actually. That means that perhaps a quadratic model performs better, has better predictive performance, than the simple linear regression model. Well, since we tried the quadratic, why not try a cubic polynomial as well? Just set this number to three in the polynomial function. If we do that, we see that the mean squared error is actually even just a bit smaller.
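Putting the pieces together, a sketch of the three fits and their test mean squared errors under the split above (26.14 and 19.82 are the values quoted in the recording):

lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)    # linear: approx. 26.14 with seed 1
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit2, Auto))[-train]^2)   # quadratic: approx. 19.82
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
mean((Auto$mpg - predict(lm.fit3, Auto))[-train]^2)   # cubic: a bit smaller again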
Maybe a bit better predictive performance for the third-degree polynomial model. Of course, these results all depend on the choice we made of which observations go to the training dataset and which go to the test dataset. If we were to use another seed and redraw these indices, then we would get some different sets.
Let's do this first: set a different seed and redraw. If you look here, you see some different indices; this thing changed. If we rerun these three models and calculate the mean squared errors, we'll get some different numbers. For example, you may notice that the smallest mean squared error corresponded to the third model in the previous case, whereas in this case the smallest one is the second model, the quadratic model, although the second and the third are very, very close in both cases.
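In sketch form, the redraw amounts to nothing more than changing the seed; the particular seed value below is an arbitrary choice:

set.seed(2)                 # a different seed gives a different train/test split
train <- sample(392, 196)
head(train)                 # different indices from before
# refitting the three models and recomputing the test MSEs as above
# now gives different numbers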
Maybe someone wants to ask a good question: "Okay, how do we choose? I mean, how do we handle this variability, or how do we select the training and the test dataset?" One way to do so is by the leave-one-out cross-validation approach. We can calculate this; it gives another, more accurate estimate of the test error, the prediction mean squared error. To do so, we need to load the boot library first. Do install it if you don't have it. Then we will be using the command "glm" rather than the command "lm."
The reason is that "glm" has a nice extra function called "cv.glm," which is very convenient for cross-
validation. Now, you may want to ask the question, "This is a linear regression model. Why do we want
to use "glm," which is essentially for generalized linear regression models?" Well, the answer is that if
you don't specify a family in this "glm" function and leave it to the default, that's essentially the linear
model. So, you will get exactly the same results by fitting "glm," using "glm" and "lm" if you don't put
a different family here. To see that, I will just fit the two simple linear regression models and you'll see
that we get identical coefficients with "glm" and with "lm."
"glm.fit" is what we will use and then we'll estimate the "cv.err." This is the crossvalidation error with
the "cv.glm." If we do that and run it, well, you will see that you get with cross-validation, the number
24.23. This is enough for it.Overall, well, over many possible splits between train and test datasets,
that's what it is. 24.23 is a more accurate number than 26 that we had earlier.
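A sketch of the leave-one-out computation described here, using cv.glm from the boot package:

library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)   # default family is gaussian: a linear model
cv.err <- cv.glm(Auto, glm.fit)                 # leave-one-out CV is the default (K = n)
cv.err$delta[1]                                 # LOOCV estimate of the test MSE, approx. 24.23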
The next bit of the code does the same thing, actually, for various polynomial models. So we set i from one, which corresponds to simple linear regression, up to five: degrees of polynomial up to five. Doing so, this is the [unintelligible 00:09:27] loop; if we run this we'll get-- Well, it takes a while to run. Let's wait for it. It will fit the different polynomials and calculate the mean squared errors. You'll see that for the simple linear regression we have 24.23, which is exactly this guy. Then this thing drops but stays very close to 19. What this tells us is that perhaps we don't need to go beyond the second-order polynomial; as long as you get there, your mean squared error is in the area of 19.2. Actually, this is the smallest one in this case. This was the leave-one-out cross-validation.
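The loop just described, in sketch form:

cv.error <- rep(0, 5)
for (i in 1:5) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]   # LOOCV error for degree i
}
cv.error   # starts at about 24.23, then drops to around 19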
to do the same for "k-Fold" for cross validation. The most popular choice is tenfold cross-validation.
[unintelligible 00:10:17] is here. It does a very similar thing. Of course, this actually this also uses
polynomials up to 10. If we do this, we see the very similar results here and maybe here you get some
values that are below 18, but pretty much the message is the same. As long as you get to a quadratic
polynomial model, that's fine. That's fine. Misquote area is around 19. Perhaps quadratic model is-
performance has the best particular performance in this case.
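And the tenfold analogue, again as a sketch (the seed value is an arbitrary choice, needed because k-fold CV involves a random split):

set.seed(17)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]   # tenfold CV error
}
cv.error.10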
[00:11:04] [END OF AUDIO]
Simulating Random Numbers From Distributions
A standard way to produce uniform pseudo-random numbers is the linear congruential generator, defined by the recursion

$$s_{i+1} = (a\, s_i + b) \bmod M.$$

Then $U_i = s_i / M \sim U(0,1)$, approximately.

The number $s_0$ is called the seed and is usually taken by the computer to be the current time, whereas $a$, $b$ and $M$ can be chosen accordingly; see Wikipedia for some good values.
Since the above numbers come from a deterministic algorithm, yet they behave as random numbers do, they are called pseudo-random numbers.
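As an illustration, a minimal linear congruential generator in R; the constants a = 16807, b = 0 and M = 2^31 - 1 are one classic choice of "good values" (the Park-Miller generator):

lcg <- function(n, seed, a = 16807, b = 0, M = 2^31 - 1) {
  s <- numeric(n + 1)
  s[1] <- seed
  for (i in 1:n) s[i + 1] <- (a * s[i] + b) %% M   # s_{i+1} = (a s_i + b) mod M
  s[-1] / M                                        # U_i = s_i / M, approximately U(0,1)
}
u <- lcg(1000, seed = 42)
hist(u)   # should look roughly flat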
The inverse CDF method: if $F$ is the CDF of the target distribution and we can compute its inverse $F^{-1}$, then:
1. Generate $u_i \sim U(0,1)$.
2. Set $x_i = F^{-1}(u_i)$.
Proof: $P(X \le x) = P(F^{-1}(U) \le x) = P(U \le F(x)) = F(x)$, using the fact that $F(F^{-1}(u)) = u$.
Example: for the Exponential($\lambda$) distribution, $F(x) = 1 - e^{-\lambda x}$, so $F^{-1}(u) = -\frac{1}{\lambda}\log(1-u)$:
1. Generate $U_i \sim U(0,1)$.
2. Set $X_i = -\frac{1}{\lambda}\log(1-U_i)$.
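A quick check of the exponential example in R (the value lambda = 2 is an arbitrary choice):

lambda <- 2
u <- runif(10000)              # step 1: uniform draws
x <- -log(1 - u) / lambda      # step 2: apply the inverse exponential CDF
c(mean(x), 1 / lambda)         # sample mean should be close to 1/lambda
qqplot(x, rexp(10000, rate = lambda))   # compare with R's built-in generator
abline(0, 1)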
Properties of Poisson processes and their connection with the exponential distribution can be used to generate Poisson random variables efficiently.
Implementation
Draw $x_1, \ldots, x_n$ from $F$ in suitable software such as R and estimate the integral $I = \int h(x)\, dF(x)$ using the estimator $I_n$ below. The error can be made arbitrarily small by simply drawing more numbers, thus increasing $n$.
Proof
Direct application of the Law of Large Numbers:

$$I_n = \frac{1}{n} \sum_{i=1}^{n} h(x_i) \to \int h(x)\, dF(x) = I.$$
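For example, to estimate $I = E[h(X)]$ with $h(x) = x^2$ and $F$ the standard normal (so the true value is $I = 1$):

set.seed(1)
n <- 1e5
x <- rnorm(n)    # draw x_1, ..., x_n from F = N(0, 1)
mean(x^2)        # I_n, close to the true value I = 1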
Monte Carlo in R
Below we provide some standard and simple functions in R for calculating expectations using Monte Carlo.
Generating random numbers from distributions
runif(n, min=0, max=1)
rnorm(n, mean = 0, sd = 1)
rbeta(n, shape1, shape2)
rgamma(n, shape, rate = 1) : Note scale=1/rate
rpois(n, lambda)
rbinom(n, size, prob)
rnbinom(n, size, prob)
Probability density/mass functions, cumulative distribution functions and quantile functions are also available (via the d, p and q prefixes, e.g. dnorm, pnorm, qnorm).
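For instance, a quick check of the rate/scale note for rgamma; with the same seed, both calls below should produce identical draws from the same Gamma distribution:

set.seed(1); a <- rgamma(5, shape = 2, rate = 4)
set.seed(1); b <- rgamma(5, shape = 2, scale = 1/4)   # scale = 1/rate
all.equal(a, b)   # TRUE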
Self-test Activities
Exercise 1 in James et al
Attempt Exercise 1 in Chapter 5 of James et al.
Solution
Note that
$$\mathrm{var}(\alpha X + (1-\alpha)Y) = \alpha^2 \sigma_X^2 + (1-\alpha)^2 \sigma_Y^2 + 2\alpha(1-\alpha)\sigma_{XY}.$$
Differentiating with respect to $\alpha$ and setting the derivative to zero gives
$$2\alpha\sigma_X^2 - 2(1-\alpha)\sigma_Y^2 + 2(1-2\alpha)\sigma_{XY} = 0,$$
and solving for $\alpha$ yields the formula stated at the start of this block. The second derivative, $2(\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}) = 2\,\mathrm{var}(X - Y) \ge 0$, confirms this is a minimum.
Exercise 3 in James et al
Attempt Exercise 3 in Chapter 5 of James et al.
Solution
a. k-fold cross-validation is implemented by taking the set of n observations and randomly splitting it into k non-overlapping groups. Each of these groups in turn acts as a validation set, with the remainder as a training set. The test error is estimated by averaging the k resulting MSE estimates.
b.
i. The validation set approach is conceptually simple and easily implemented as you are
simply partitioning the existing training data into two sets. However, there are two
drawbacks: (1.) the estimate of the test error rate can be highly variable depending on
which observations are included in the training and validation sets. (2.) the validation
set error rate may tend to overestimate the test error rate for the model fit on the entire
data set.
ii. LOOCV is a special case of k-fold cross-validation with k = n. Thus, LOOCV is the
most computationally intense method since the model must be fit n times. Also,
LOOCV has higher variance, but lower bias, than k-fold CV.
Exercise 4 in James et al
Attempt Exercise 4 in Chapter 5 of James et al.
Solution
Suppose we use some statistical learning method to make a prediction for the response Y at a particular value of the predictor X. We might estimate the standard deviation of our prediction by using the bootstrap approach.
The bootstrap approach works by repeatedly sampling observations (with replacement) from the original data set B times, for some large value of B, each time fitting a new model and obtaining a prediction; the standard deviation of the B resulting predictions then estimates the standard deviation of our prediction.
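A sketch of this in R with the boot package; the simulated data frame, the variable names and the prediction point x = 5 are all hypothetical, since the exercise does not fix a data set:

library(boot)
set.seed(1)
df <- data.frame(x = runif(100, 0, 10))   # simulated predictor (hypothetical data)
df$y <- 2 + 3 * df$x + rnorm(100)         # simulated response
pred.fn <- function(data, index) {
  fit <- lm(y ~ x, data = data, subset = index)          # refit on the bootstrap sample
  as.numeric(predict(fit, newdata = data.frame(x = 5)))  # prediction at x = 5
}
boot.out <- boot(df, pred.fn, R = 1000)   # B = 1000 bootstrap resamples
sd(boot.out$t[, 1])                       # standard deviation of the B predictions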
Solution
You can use the following code in R:
x = rgamma(10000, 50, 0.01)       # 10,000 draws from Gamma(shape = 50, rate = 0.01)
y = ifelse(x > 5000, 0.1 * x, 0)  # h(x): pays 10% of x when x exceeds 5000, else 0
mean(y) + 2000                    # Monte Carlo estimate of the expectation, plus 2000
sqrt(var(y))                      # Monte Carlo standard deviation
Collaborative Activities
After you have attempted Exercise 8 (d) in Chapter 5 of James et al, provide your explanation of why and share it with others in the discussion forum.
After you have attempted Exercise 8 (h) in Chapter 5 of James et al, provide your explanation of why and share it with others in the discussion forum.