Peter Caya
Chapter 1
1.1 Basic Definitions
Define the following:
When fitting a model to a data set, there is a trade-off between how closely the model fits the training data and how accurately it generalizes to unseen data. This is the bias-variance tradeoff.
Definition 1: Bias: The bias of an estimator \hat{\theta}, denoted B(\hat{\theta}), is the difference between the expected value of the estimator and the true value of the parameter. Defined as:

B(\hat{\theta}) = E[\hat{\theta}] - \theta
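For a concrete sketch of this definition, consider the maximum-likelihood variance estimator, which divides by N rather than N - 1. For samples of size N = 5 from a standard normal, its expected value is (N-1)/N = 0.8, so its bias is -0.2:

> # Simulate the divide-by-N variance estimator on samples of size 5.
> set.seed(1)
> est <- replicate(10000, {x <- rnorm(5); mean((x - mean(x))^2)})
> mean(est) - 1   # estimated bias; should come out close to -0.2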
Note to self: This relates back to my thesis work on inverse problems, where I studied the inclusion of a regularization parameter in the OLS, MOLS and EE models in order to reduce the noise in a model. As a result, bias was introduced into the model (i.e., it could never be exactly dead on), but the jaggedness of the model was reduced. In the statistical context, I was studying ridge regression and the identification of the proper regularization parameter which balanced the bias-variance tradeoff to provide the best fit.
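A minimal ridge regression sketch along those lines, using MASS::lm.ridge() (the variables and lambda grid here are illustrative choices, not taken from the thesis work):

> # Ridge regression trades a small amount of bias for reduced variance.
> library(MASS)
> ridge_fit <- lm.ridge(mpg ~ wt + hp, data = mtcars, lambda = seq(0, 10, 0.5))
> select(ridge_fit)   # prints the HKB, LW and GCV choices of the regularization parameter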
The jackknife estimates the bias and variance of a statistic \hat{\theta} as follows:

1. For a set of N observations, construct a subsample omitting observation i. Call this subsample x_{(i)}.

2. Compute the estimate \hat{\theta}_{(i)} on each subsample, along with the estimate \hat{\theta} using all observations.

3. Use these estimates to estimate B and V (estimates denoted with the hat symbol):

\hat{B} = (N - 1)\,(\bar{\theta}_{(\cdot)} - \hat{\theta})

\hat{V} = \frac{N - 1}{N} \sum_{i=1}^{N} (\hat{\theta}_{(i)} - \bar{\theta}_{(\cdot)})^2

where \bar{\theta}_{(\cdot)} = \frac{1}{N} \sum_{i=1}^{N} \hat{\theta}_{(i)} is the mean of the leave-one-out estimates.
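A minimal hand-rolled sketch of these formulas (an illustration only; the bootstrap package used below has its own implementation):

> # Jackknife bias and SE for a statistic theta applied to a vector x.
> jack <- function(x, theta) {
+   n <- length(x)
+   theta_hat <- theta(x)                             # estimate from all observations
+   theta_i <- sapply(1:n, function(i) theta(x[-i]))  # leave-one-out estimates
+   theta_bar <- mean(theta_i)
+   list(bias = (n - 1) * (theta_bar - theta_hat),
+        se = sqrt(((n - 1) / n) * sum((theta_i - theta_bar)^2)))
+ }
> jack(mtcars$mpg, mean)  # should agree with jackknife() from the bootstrap package below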
> p_load(bootstrap)  # p_load() comes from the pacman package
> # Calculate the jackknife means, the SE and the bias for the mean.
> jack_mpg <- jackknife(mtcars$mpg, theta = mean)
> jack_mpg$jack.se
[1] 1.065424
> jack_mpg$jack.bias
[1] 0
Here's a more complex example: let's express mpg as a linear function of weight and jackknife the weight coefficient. One way to set this up is to jackknife over observation indices, refitting the regression each time (the theta_wt helper is an assumed reconstruction of the call that produced this output):

> # theta_wt receives a set of row indices and returns the fitted weight coefficient.
> theta_wt <- function(i) {coef(lm(mpg ~ wt, data = mtcars[i, ]))["wt"]}
> jack_reg <- jackknife(1:nrow(mtcars), theta = theta_wt)
> jack_reg$jack.se
[1] 0.7263368
> jack_reg$jack.bias
         wt
-0.08087151
[Figure: Weight Coefficient Based on Omitted Observations — the leave-one-out estimates of the weight coefficient plotted against observation number (Obs.).]
1.5 Bootstrapping
The bootstrap estimates the sampling distribution of a statistic by repeatedly resampling the data with replacement and recomputing the statistic on each resample. For this very simple method of bootstrapping, consider the example below, which bootstraps the mean of mtcars$mpg:
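A sketch of how the histogram below might be produced with the bootstrap package's bootstrap() function (the 1000-resample count is an assumption for this example):

> # Resample mpg with replacement and recompute the mean on each resample.
> boot_mpg <- bootstrap(mtcars$mpg, nboot = 1000, theta = mean)
> hist(boot_mpg$thetastar, main = "Bootstrap Results of MPG Mean", xlab = "Mean")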
[Figure: Bootstrap Results of MPG Mean — histogram of the bootstrapped means, with Mean on the horizontal axis (roughly 17 to 24) and Frequency on the vertical axis.]
The regression-coefficient example from the jackknife section is repeated, this time with 1000 bootstrap resamples:
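A sketch of the corresponding call, reusing the theta_wt helper from the jackknife example (again an assumed reconstruction, not the original command):

> # Resample observation indices with replacement and refit the regression each time.
> boot_reg <- bootstrap(1:nrow(mtcars), nboot = 1000, theta = theta_wt)
> hist(boot_reg$thetastar, xlab = "Weight coefficient")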
[Figure: histogram of the 1000 bootstrap estimates of the weight coefficient, with Frequency on the vertical axis.]
This is essentially the jackknife method with more than one omission: rather than leaving out a single observation, each bootstrap resample draws N observations with replacement, so several observations are omitted (and others repeated) in each refit.
1.6 Cross-Validation
In K-fold cross-validation, the data set is partitioned into K groups, with observations assigned to each group randomly. After partitioning the data, a model is trained on all but one of the partitions. The partition which was left out of model training is then used to estimate the prediction error of the model. This is done repeatedly, with each of the K partitions being left out in turn.
After this procedure is completed, the estimate of the testing error is

CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-i}(x_i))    (1.11)

where \hat{f}^{-i} is the model fit with the partition containing observation i left out, and L is the loss function.
1.6.3 Implementation
In the bootstrap package, the function which implements cross-validation is crossval(). Let's cross-validate the regression of mpg on weight using 5 folds:
> theta_func <- function(x, y) {lm(y ~ x)}
> theta_predict <- function(model_fit, x) {cbind(1, x) %*% model_fit$coef}
> crossval(x = mtcars$wt, y = mtcars$mpg, theta.fit = theta_func,
+     theta.predict = theta_predict, ngroup = 5)
$cv.fit
[1] 23.915987 21.651761 25.330239 19.986467 18.830932 18.728217 18.081384
[8] 19.625645 19.850435 19.107575 19.107575 15.512767 17.259427 17.610438
[15] 8.048937 8.084572 7.515059 25.119394 28.203608 27.240421 24.758539
[22] 17.771124 19.485795 16.885196 16.668645 26.678445 26.525180 28.727451
[29] 20.926287 22.830061 17.490136 22.220502
$ngroup
[1] 5
$leave.out
[1] 6
$groups
$groups[[1]]
[1] 32 5 4 19 28 6
$groups[[2]]
[1] 27 14 23 21 1 29
$groups[[3]]
[1] 10 30 3 24 16 11
$groups[[4]]
[1] 2 7 18 12 25 13
$groups[[5]]
[1] 31 8 15 20 9 17 26 22
$call
crossval(x = mtcars$wt, y = mtcars$mpg, theta.fit = theta_func,
theta.predict = theta_predict, ngroup = 5)
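Note that crossval() returns the cross-validated predictions (cv.fit) rather than the error itself. A short sketch of plugging them into equation (1.11) with squared-error loss (cv_result is an assumed name for a saved copy of the call above; crossval() assigns folds randomly, so the value varies from run to run):

> cv_result <- crossval(x = mtcars$wt, y = mtcars$mpg, theta.fit = theta_func,
+     theta.predict = theta_predict, ngroup = 5)
> # Mean squared prediction error over the held-out folds, per equation (1.11).
> mean((mtcars$mpg - cv_result$cv.fit)^2)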
1.7 Exercises
Program a jackknife algorithm which takes a function and a sample data set, and returns an estimate of the testing error.