
An Introduction to Statistical Learning

1. Introduction

In supervised learning we build a model to predict or estimate an output based on one or more inputs.
With unsupervised learning, there are inputs but no supervising outputs.

2. Statistical Learning

What is Statistical Learning ?

The input variables are called predictors, independent variables, or features. The output variable is called the response or dependent variable.

We assume that there is a relationship between Y and X and that can be written in the form
Y = f (X) + ϵ

Here f is some unknown but fixed function of X and ϵ is a random error term, which is independent of
X and has zero mean. Here f represents the systematic information that X provides about Y .

Why Estimate f ?

The two main reasons to estimate f are prediction and inference.

Prediction
Since the error term averages to zero, we can predict Y using Y^ = f^(X), where f^ is the estimate of f and Y^ is the resulting prediction of Y.

The accuracy of Y^ as a prediction of Y depends on two quantities, called the reducible error and the irreducible error. Even if we had a perfect estimate of f, so that Y^ = f(X), the prediction would still have some error in it because Y is also a function of ϵ, which by definition cannot be predicted using X. This is the irreducible error; the error arising because f^ is not a perfect estimate of f is the reducible error.

E(Y - \hat{Y})^2 = [f(X) - \hat{f}(X)]^2 + \text{Var}(\epsilon)

where the first term on the right is the reducible error and Var(ϵ) is the irreducible error.

Inference

We might be interested in understanding the way Y is affected as X changes. We still want to estimate f, but our goal is not to make predictions; rather, we want to understand the relationship between X and Y. Here f^ cannot be treated as a black box, because we need to know its exact form.


How Do We Estimate f ?

Parametric

1. We make an assumption about the functional form or shape of f .

2. After the model has been selected, we need a procedure that uses the training data to fit or train
the model.

The disadvantage is that the functional form we choose will usually not match the true unknown form of f. We can use more flexible models, but these can result in over-fitting the data: they follow the errors, or noise, too closely.

Non-Parametric

Non-parametric methods do not make explicit assumptions about the functional form of f. Since they do not reduce the problem to estimating a small number of parameters, a very large number of observations is required to obtain an accurate estimate of f.

There is a trade-off between the interpretability of the model and its accuracy. For making predictions it is not always best to use the most flexible model; we can often get more accurate predictions using a less flexible method, because highly flexible methods are prone to over-fitting.

Assessing Model Accuracy

Measuring the Quality of Fit

In the regression setting, the most commonly used measure is the mean squared error (MSE), given by

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}(x_i)\right)^2

We are ultimately interested in the test MSE. In the absence of test data, we should not simply select the model with the lowest training MSE, because a low training MSE does not guarantee a low test MSE.

We can use cross-validation for estimating test MSE using training data.
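As a rough illustration of this, the sketch below estimates the test MSE of polynomial fits of increasing flexibility with 5-fold cross-validation; the simulated data, the degrees tried, and the use of scikit-learn are assumptions made only for this example.

```python
# Minimal sketch: training MSE vs a cross-validated estimate of test MSE
# for polynomial fits of increasing flexibility (simulated data).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # true f is non-linear

for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # cross_val_score returns negative MSE, so flip the sign
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    train_mse = np.mean((y - model.fit(X, y).predict(X)) ** 2)
    print(f"degree={degree:2d}  training MSE={train_mse:.3f}  CV test MSE={cv_mse:.3f}")
```

Typically the training MSE keeps falling as the degree grows, while the cross-validated estimate of the test MSE stops improving.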

Bias-Variance Trade-Off
The expected test MSE for a given value of x0 can be decomposed into three fundamental quantities:

E(y_0 - \hat{f}(x_0))^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon)

Variance refers to the amount by which f^ would change if we estimated it using a different training data set; bias refers to the error introduced by approximating a real-life problem with a much simpler model. More flexible methods have higher variance and lower bias. The relative rate of change of these two quantities determines whether the test MSE increases or decreases as flexibility increases.

The Classification Setting

We can compute the training error rate:

\frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{y}_i)

The Bayes Classifier

The test error rate is minimized by assigning each observation to the most likely class given its predictor values, i.e. to the class j for which Pr(Y = j | X = x0) is largest.

The overall Bayes error rate is given by 1 − E(max_j Pr(Y = j | X)).

K-Nearest Neighbors

We do not know the conditional distribution of Y given X, so computing the Bayes classifier is impossible. KNN estimates this conditional distribution: it classifies a test observation x0 to the class with the largest estimated probability, where the probability of class j is estimated as the fraction of the K training observations nearest to x0 that belong to class j.
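A minimal sketch of this, assuming scikit-learn's KNeighborsClassifier and simulated two-class data (both are illustrative choices, not from the text):

```python
# Minimal sketch: a KNN classifier as an approximation to the Bayes classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
x0 = np.array([[0.2, -0.1]])
print(knn.predict_proba(x0))   # estimated Pr(Y = j | X = x0) for each class j
print(knn.predict(x0))         # class with the largest estimated probability
```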

3. Linear Regression

Simple Linear Regression


It assumes that there is approximately a linear relationship between X and Y, given by Y ≈ β0 + β1 X. Here, β0 and β1 are called the model coefficients or parameters.

Estimating the Coefficients


We estimate the coefficients using the training data. The most common approach is minimizing the least squares criterion. Let y^i = β^0 + β^1 xi be the prediction for Y based on the ith value of X. Then ei = yi − y^i represents the ith residual: the difference between the ith observed response value and the ith response value predicted by our linear model.

We define the residual sum of squares (RSS) as RSS = e1² + e2² + ⋯ + en². The least squares method chooses β^0 and β^1 to minimize the RSS.
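A minimal sketch of the least squares fit using the standard closed-form expressions for β^0 and β^1 (the simulated data and true coefficient values are assumptions for illustration):

```python
# Minimal sketch: closed-form least squares estimates for simple linear regression.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=100)   # true beta0 = 3, beta1 = 2

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

residuals = y - (beta0_hat + beta1_hat * x)
rss = np.sum(residuals ** 2)    # the quantity least squares minimizes
print(beta0_hat, beta1_hat, rss)
```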

Assessing the Accuracy of Coefficient Estimates

The true relationship between X and Y takes the form Y = β0 + β1 X + ϵ.

We assume that the error term is independent of X. The true relationship is not known, so the regression coefficients are estimated using least squares.

An unbiased estimator does not systematically over- or under-estimate the true parameter.

The standard error tells us the average amount by which an estimate μ^ differs from the true value of μ. The more observations we have, the smaller the standard error of μ^. It is given by

\text{Var}(\hat{\mu}) = \text{SE}(\hat{\mu})^2 = \frac{\sigma^2}{n}
The standard errors of the regression coefficients are:

\text{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right] \qquad \text{and} \qquad \text{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}

where σ² = Var(ϵ).

For these formulas to be strictly valid, we assume that the errors ϵi for each observation are uncorrelated with common variance σ².

σ² is not known and is estimated from the data. This estimate is known as the residual standard error and is given by RSE = \sqrt{RSS/(n-2)}.

We can also use standard errors to compute confidence intervals, such that with a given probability the range will contain the true unknown value of the parameter. They can also be used to conduct hypothesis tests on the coefficients.

We compute the t-statistic:

t = \frac{\hat{\beta}_1 - 0}{\text{SE}(\hat{\beta}_1)}

If there is no relationship between X and Y, we expect it to follow a t-distribution with n − 2 degrees of freedom.

A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response.
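A minimal sketch of these calculations, assuming the RSE is used as the plug-in estimate of σ and that scipy supplies the t-distribution tail probability (the data are simulated for illustration):

```python
# Minimal sketch: SE(beta1_hat), t-statistic and two-sided p-value for a simple regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
rss = np.sum((y - beta0 - beta1 * x) ** 2)

rse = np.sqrt(rss / (n - 2))                           # estimate of sigma
se_beta1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))  # SE(beta1_hat)
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)     # two-sided p-value
print(se_beta1, t_stat, p_value)
```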

Assessing the accuracy of the model

It is assessed using two quantities: the Residual Standard Error (RSE ) and the R2 statistic.

Residual Standard Error

The RSE is an estimate of the standard deviation of ϵ. It is roughly the average amount by which the response will deviate from the true regression line, and it is considered a measure of the lack of fit of the model to the data.

\text{RSE} = \sqrt{\frac{\text{RSS}}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-2}}

R2 Statistic
Since the RSE is measured in the units of Y, it is not always clear what constitutes a good RSE. The R² statistic takes the form of a proportion of variance explained and is independent of the scale of Y:

R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}

where TSS = ∑(yi − ȳ)² is the total sum of squares.

TSS measures the total variance in the response Y and can be thought of as the amount of variability inherent in the response before the regression is performed. RSS measures the amount of variability left unexplained after the regression is performed.

Correlation quantifies the association between a single pair of variables, whereas R² can quantify the association between the response and a larger number of predictors, so it fills this role in multiple regression.
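A minimal sketch computing TSS, RSS, RSE and R² for a simple least squares fit (simulated data; the closed-form coefficient estimates repeat the earlier sketch):

```python
# Minimal sketch: RSE and R^2 for a fitted simple linear regression.
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3.0 + 2.0 * x + rng.normal(scale=1.5, size=n)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (n - 2))       # typical deviation of the response from the fit
r2 = 1 - rss / tss                 # proportion of variance explained
print(rse, r2)
```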

Multiple Linear Regression

The multiple regression model takes the form:

Y = β0 + β1 X1 + β2 X2 + ⋯ + βp Xp + ϵ

We interpret βj as the average effect on Y of a one-unit increase in Xj, holding all other predictors fixed.

The simple and multiple linear coefficients can be quite different.

Some Important Questions

1. Relationship between Response and Predictor

In multiple regression setting with p predictors, we need to ask whether all of the regression coefficients
are zero. We test the null hypothesis

H0 : β1 = β2 = ⋯ = βp = 0

versus the alternative

H1 : at least one βj is non-zero.

It is performed by computing the F-statistic:

F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)}

If the linear model assumptions are correct, one can show that E(RSS/(n − p − 1)) = σ² and that, provided H0 is true, E((TSS − RSS)/p) = σ². So when there is no relationship between the response and the predictors, the F-statistic takes a value close to 1. If H1 is true, then E((TSS − RSS)/p) > σ², and we expect F to be greater than 1.

We can also modify the above formula to test for a subset of q coefficients.

The t-statistic for an individual variable is equivalent to the F-statistic that omits that single variable from the model, keeping all the others in. We cannot rely only on the individual t-statistics and p-values, because even if there is no association between the response and the predictors, some variables will appear related purely by chance. The F-statistic does not suffer from this problem because it adjusts for the number of predictors.
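A minimal sketch of the overall F-test for H0 (all coefficients zero), computed directly from TSS and RSS on simulated data with p = 3 predictors; scipy supplies the F-distribution tail probability:

```python
# Minimal sketch: the overall F-statistic for a multiple regression.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=1.0, size=n)

X1 = np.column_stack([np.ones(n), X])            # add intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
rss = np.sum((y - X1 @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)            # Pr(F_{p, n-p-1} > F)
print(F, p_value)
```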

2. Deciding on Important Variables

There are various statistics that can be used to judge the quality of a model, such as Mallow's Cp, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the adjusted R².

The following are methods for model building:

1. Forward Selection: We begin with the null model, which contains only an intercept, then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. We continue this approach until some stopping rule is satisfied (a minimal sketch follows this list).

2. Backward Selection: We start with a model containing all the variables and remove the one with the largest p-value. We then fit the new (p − 1)-variable model and again remove the variable with the largest p-value. This is continued until some stopping rule is reached.

3. Mixed Selection: We start with a model containing no variables and, as with forward selection, add the variables that provide the best fit. The p-values of variables already in the model can grow as new variables are added, so at any point we remove a variable whose p-value rises above some threshold. We continue these forward and backward steps until all variables in the model have sufficiently low p-values and all variables outside the model would have a large p-value if added.

Backward selection cannot be used when p > n, while forward selection can always be used.
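A minimal sketch of forward selection as described in item 1, greedily adding the predictor that most reduces the RSS at each step; the simulated data and the trivial stopping rule (add everything, then choose afterwards) are assumptions made for illustration:

```python
# Minimal sketch: forward selection by greedily adding the predictor that most reduces RSS.
import numpy as np

rng = np.random.default_rng(16)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 1 + 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=n)

def rss(cols):
    design = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

selected, remaining = [], list(range(p))
while remaining:
    best = min(remaining, key=lambda c: rss(selected + [c]))   # biggest RSS drop
    selected.append(best)
    remaining.remove(best)
    print(f"added X{best}, RSS = {rss(selected):.1f}")
# In practice a stopping rule (e.g. cross-validated error, Cp, BIC) chooses among
# the nested models produced along the way.
```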

3. Model Fit
In simple linear regression, R² is the square of the correlation between X and Y; in multiple linear regression, it equals Cor(Y, Y^)². It is a property of the fitted linear model that it maximizes this correlation among all possible linear models.

The R² will always increase when more variables are added to the model, even if those variables are only weakly associated with the response, because adding variables always allows us to fit the training data more accurately.

Models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p. In multiple regression, the RSE is defined as

\text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}}

It is also useful to plot the data, as graphical summaries can reveal problems with a model that numerical summaries cannot.

4. Predictions
The linear model f^(X) is only an approximation of reality, so there is an additional source of potentially reducible error, which we call model bias.

We use prediction intervals to estimate how much Y will vary from Y^. They are always wider than confidence intervals because they incorporate both the error in the estimate of f(X) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).

Other Considerations in Regression Model

Qualitative Predictors

If the qualitative variable has more than two levels, then we can create additional dummy variables to encode its levels.

One level is treated as the baseline, and all other levels are compared to it. There are many different ways of coding qualitative variables besides the dummy variable approach taken here. All of these approaches lead to equivalent model fits, but the coefficients are different, have different interpretations, and are designed to measure particular contrasts.

Extensions of Linear Model

The linear model assumes that the relationship between the predictors and the response is additive and linear.

The additive assumption means that the effect of a change in a predictor Xj on the response Y is independent of the values of the other predictors.

The linear assumption states that the change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj.

Removing the Additive Assumption

We allow for interaction effects by including a third predictor, called an interaction term, which is constructed by computing the product of X1 and X2.

The hierarchical principle states that if we include an interaction term in the model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

X1 × X2 is correlated with X1 and X2, so leaving them out tends to alter the meaning of the interaction.
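A minimal sketch of a model with an interaction term, assuming statsmodels' formula interface; the data frame and the column names tv, radio and sales are made up for the example:

```python
# Minimal sketch: fitting main effects plus an interaction term with a formula.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({"tv": rng.uniform(0, 100, n), "radio": rng.uniform(0, 50, n)})
df["sales"] = (2 + 0.05 * df["tv"] + 0.1 * df["radio"]
               + 0.002 * df["tv"] * df["radio"]
               + rng.normal(scale=0.5, size=n))

# "tv * radio" expands to tv + radio + tv:radio, respecting the hierarchical principle
fit = smf.ols("sales ~ tv * radio", data=df).fit()
print(fit.params)
```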

Non-Linear Relationships

The linear regression model assumes a linear relationship between the response and the predictors, but the true relationship may be non-linear. We can extend the linear model to account for this using polynomial regression: we include polynomial functions of the predictors in the linear model.

Potential Problems

1. Non-linearity of the response-predictor relationship

Residual plots are a useful graphical tool for identifying non-linearity.

For a simple regression model, we can plot the residuals ei = yi − y^i versus the predictor xi. For multiple regression, we plot the residuals versus the fitted values y^i. Ideally, the residual plot will show no discernible pattern; the presence of a pattern indicates a problem with some aspect of the linear model.

We can use non-linear transformations of the predictors, such as log X, √X and X², in the regression model.
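A minimal sketch of a residual plot used this way, comparing a linear fit with a fit that adds an X² term on data whose true relationship is assumed quadratic (matplotlib and the simulated data are illustrative choices):

```python
# Minimal sketch: residuals vs fitted values for a linear and a quadratic fit.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
x = rng.uniform(1, 10, size=200)
y = 1 + 0.5 * x ** 2 + rng.normal(scale=1.0, size=200)   # truth is non-linear in x

for design, label in [(np.column_stack([np.ones_like(x), x]), "linear fit"),
                      (np.column_stack([np.ones_like(x), x, x ** 2]), "quadratic fit")]:
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    plt.scatter(fitted, y - fitted, s=8, label=label)    # residuals vs fitted values

plt.axhline(0, color="black", linewidth=0.8)
plt.xlabel("fitted values"); plt.ylabel("residuals"); plt.legend()
plt.show()   # the linear fit shows a clear U-shaped pattern; the quadratic fit does not
```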

2. Correlation of Error Terms

An important assumption of the linear model is that the error terms ϵ1, ϵ2, …, ϵn are uncorrelated.

If the error terms are uncorrelated, then the fact that ϵi is positive provides little or no information about the sign of ϵi+1. The standard errors computed for the regression coefficients and the fitted values rely on this assumption of uncorrelated error terms.

If there is correlation, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence intervals and prediction intervals will be narrower than they should be, and the p-values associated with the model will be lower than they should be, causing us to falsely conclude that a parameter is statistically significant.

We can plot the residuals as a function of time: if the errors are uncorrelated there should be no discernible pattern, while if the error terms are positively correlated we may see tracking in the residuals, that is, adjacent residuals having similar values. This frequently occurs in time series data.

3. Non-constant Variance of Error Terms

The standard errors, confidence intervals and hypothesis tests associated with the linear model rely on the assumption that Var(ϵi) = σ².

We can identify non-constant variance in the error terms, or heteroscedasticity, from the presence of a funnel shape in the residual plot. We can transform the response Y using a concave function such as log Y or √Y. Such transformations result in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.

Read more about weighted least squares.

4. Outliers
An outlier is a point for which yi is far from the value predicted by the model. An outlier that does not have an unusual predictor value typically has little effect on the least squares fit, but it can still noticeably affect the RSE and R² values.

Residual plots can be used to identify outliers, but it can be difficult to decide how large a residual needs to be before we consider the point an outlier. Studentized residuals are computed by dividing each residual ei by its estimated standard error. Observations whose studentized residuals are greater than 3 in absolute value are possible outliers.

5. High Leverage Points


Observations with high leverage have an unusual value for xi. Such observations can have a sizable impact on the estimated regression line. In multiple regression, it is possible for an observation to be well within the range of each individual predictor's values but unusual in terms of the full set of predictors.

We compute the leverage statistic; a large value of this statistic indicates an observation with high leverage. For simple linear regression,

h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}

It is clear that hi increases with the distance of xi from the mean of the x values. The leverage statistic hi is always between 1/n and 1, and the average leverage over all the observations is always equal to (p + 1)/n. So if the leverage statistic for a given observation greatly exceeds (p + 1)/n, we may suspect that the point has high leverage.

We can also plot the studentized residual versus hi for the data.
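A minimal sketch computing leverages from the hat matrix and (internally) studentized residuals; the simulated data, the deliberately extreme x value, and the rough flagging thresholds are all assumptions made for illustration:

```python
# Minimal sketch: leverage statistics and studentized residuals via the hat matrix.
import numpy as np

rng = np.random.default_rng(8)
n = 100
x = np.append(rng.normal(size=n - 1), 6.0)           # one deliberately high-leverage point
y = 1 + 2 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])                 # n x (p+1) design matrix, p = 1
H = X @ np.linalg.inv(X.T @ X) @ X.T                 # hat matrix
h = np.diag(H)                                       # leverage statistics h_i

resid = y - H @ y
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))
studentized = resid / (sigma_hat * np.sqrt(1 - h))   # internally studentized residuals

p = 1
print("high leverage:", np.where(h > 3 * (p + 1) / n)[0])       # rough rule of thumb
print("possible outliers:", np.where(np.abs(studentized) > 3)[0])
```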

6. Collinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. It can be difficult to separate out the individual effects of collinear variables on the response.

Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for β^j to grow. It also reduces the t-statistic, so we may fail to reject H0 : βj = 0. The power of the hypothesis test, the probability of correctly detecting a non-zero coefficient, is reduced by collinearity.

It is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation; this situation is called multicollinearity. The VIF is the ratio of the variance of β^j when fitting the full model divided by the variance of β^j if fit on its own. The smallest possible value for the VIF is 1, which indicates the complete absence of collinearity. A VIF that exceeds 5 or 10 indicates a problematic amount of collinearity. The VIF for each variable can be computed using
\text{VIF}(\hat{\beta}_j) = \frac{1}{1 - R^2_{X_j | X_{-j}}}

where R²_{Xj|X−j} is the R² from a regression of Xj onto all of the other predictors. If this R² is close to one, then collinearity is present and the VIF will be large. We can either drop one of the collinear variables or combine them into a single predictor.
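A minimal sketch computing the VIF for each predictor directly from the definition above (simulated data in which one column is nearly a linear combination of the others):

```python
# Minimal sketch: VIF via a regression of each predictor on the remaining predictors.
import numpy as np

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.5 * x2 + rng.normal(scale=0.1, size=n)   # collinear with x1 and x2
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """R^2 from regressing column j on the other columns, turned into a VIF."""
    target = X[:, j]
    others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])   # values >> 5-10 flag collinearity
```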

Comparison of Linear Regression with K-Nearest Neighbors

Non-parametric methods do not explicitly assume a parametric form for f (X).

Given a value of K and a prediction point x0, KNN regression identifies the K training observations that are closest to x0, represented by N0. It then estimates f(x0) using the average of the training responses in N0:

\hat{f}(x_0) = \frac{1}{K} \sum_{x_i \in N_0} y_i

A small value of K will have low bias and high variance while a larger K provides a much smoother fit.
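A minimal sketch comparing KNN regression with linear regression on simulated data whose true f is non-linear; scikit-learn, the values of K, and the train/test split are illustrative assumptions:

```python
# Minimal sketch: KNN regression vs linear regression when the truth is non-linear.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * X).ravel() + rng.normal(scale=0.3, size=300)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

for K in (1, 10, 50):
    knn = KNeighborsRegressor(n_neighbors=K).fit(X_train, y_train)
    print(f"KNN K={K:2d}  test MSE={mean_squared_error(y_test, knn.predict(X_test)):.3f}")

lr = LinearRegression().fit(X_train, y_train)
print(f"linear reg  test MSE={mean_squared_error(y_test, lr.predict(X_test)):.3f}")
```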

left to add comparison of KNN vs LR

for binary classification, linear regression and LDA are equivalent.


simulate where true test error is 0.5 and apply CV.

4. Classification

Why not Linear Regression ?

If there is no natural ordering among the response categories, then different orderings of the coded values will lead to different results and imply totally different relationships among the categories.

Linear regression can also produce negative probabilities or probabilities greater than 1. Any time a straight line is fit to a binary response coded as 0 or 1, in principle we can predict p(X) < 0 for some values of X and p(X) > 1 for others (unless the range of X is limited).

The linear regression for a binary response will be the same as LDA.

Logistic Regression

It models the probability that Y belongs to a particular category.

The Logistic Model

We use a function that gives an output between 0 and 1 for all values of X:

p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}

We use maximum likelihood to fit the model. The logistic function will always produce an S-shaped curve. With a bit of manipulation:

\log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X

The quantity p(X)/(1 − p(X)) is called the odds and can take any value between 0 and ∞. Increasing X by one unit changes the log odds by β1, or equivalently multiplies the odds by e^{β1}. The amount that p(X) changes due to a one-unit change in X depends on the current value of X.
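A minimal sketch fitting a simple logistic regression by maximum likelihood with statsmodels and reading off the odds interpretation of β1; the simulated balance/default data and the true coefficients are assumptions for the example:

```python
# Minimal sketch: simple logistic regression and the odds multiplier exp(beta1).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 1000
balance = rng.uniform(0, 3000, size=n)
p = 1 / (1 + np.exp(-(-8 + 0.004 * balance)))        # true beta0 = -8, beta1 = 0.004
df = pd.DataFrame({"balance": balance,
                   "default": (rng.uniform(size=n) < p).astype(int)})

fit = smf.logit("default ~ balance", data=df).fit(disp=False)
beta1 = fit.params["balance"]
print(fit.params)
print("odds multiplier for a one-unit increase in balance:", np.exp(beta1))
```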

Estimating Regression Coefficients

Non-linear least squares methods could be used, but we use maximum likelihood because of its better statistical properties. We seek estimates of β0 and β1 such that the predicted probability p^(xi) corresponds closely to each individual's observed value. The estimated intercept mainly serves to adjust the average fitted probability to the proportion of ones in the data.

Multiple Logistic Regression

We can generalize the model as follows:

\log\!\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p

There is a phenomenon called confounding, where the results obtained using a single predictor can differ from those obtained using several predictors, particularly when the predictors are correlated.

Multi-class logistic regression is also possible and can be done in R, but LDA is more popular for such problems.

Linear Discriminant Analysis

Logistic regression directly models Pr(Y = k | X = x) using the logistic function. In LDA, we instead model the distribution of the predictors X conditional on the response Y.

We model the distribution of the predictors X separately in each of the response classes (i.e. given Y), and then use Bayes' theorem to flip these around into estimates for Pr(Y = k | X = x). When the distributions are assumed to be normal, the resulting model is very similar to logistic regression.

When the classes are well-separated, the parameter estimates for the logistic regression model are unstable. If n is small and the distribution of the predictors X is approximately normal in each of the classes, then LDA is more stable than logistic regression.
Using Bayes' Theorem for Classification

The qualitative variable Y can take on K possible distinct and unordered values. Let πk represent the prior probability that a randomly chosen observation comes from the kth class, and let fk(x) ≡ Pr(X = x | Y = k) denote the density function of X for an observation that comes from the kth class. Then, by Bayes' theorem:

\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}

LDA for p = 1

We assume that fk(x) is normal; the normal density takes the form

f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\right)

where μk and σk² are the mean and variance parameters of the kth class. Assuming that σ1² = σ2² = ⋯ = σK² = σ², taking logs and rearranging, we assign the observation to the class for which

\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)

is largest.

The estimates for πk, μk and σ² are calculated as follows:

\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i

\hat{\sigma}^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)^2

\hat{\pi}_k = \frac{n_k}{n}

LDA for p > 1

We assume that X = (X1, X2, …, Xp) is drawn from a multivariate Gaussian distribution with a class-specific mean vector and a common covariance matrix, X ∼ N(μ, Σ), with density

f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)

The LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution N(μk, Σ), where μk is a class-specific mean vector and Σ is a covariance matrix common to all K classes. The Bayes classifier assigns an observation X = x to the class for which

\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k

is largest.

A confusion matrix is used to display the types of errors made by the classifier. (read more from the book)

left to add more. Moving to next section. will read and come back

Quadratic Discriminant Analysis

The QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution and plugging estimates for the parameters into Bayes' theorem in order to perform prediction. Unlike LDA, QDA assumes that each class has its own covariance matrix: an observation from the kth class is of the form X ∼ N(μk, Σk), where Σk is the covariance matrix for the kth class.

Under this assumption, the Bayes classifier assigns an observation X = x to the class for which the following is largest:

\delta_k(x) = -\frac{1}{2} x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\log|\Sigma_k| + \log(\pi_k)

Why prefer LDA to QDA, or vice versa? The answer lies in the bias-variance trade-off. When there are p predictors, estimating a single covariance matrix requires estimating p(p + 1)/2 parameters, and QDA estimates a separate covariance matrix for each class. By assuming that the K classes share a common covariance matrix, the LDA model becomes linear in x, which means there are only Kp linear coefficients to estimate. LDA is therefore much less flexible than QDA and hence has lower variance.

LDA is preferred over QDA if there are relatively few training observations and it is crucial to reduce variance. QDA is recommended if the training set is large and the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
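A minimal sketch fitting LDA and QDA with scikit-learn on simulated data where the two classes have different covariance matrices, so QDA's extra flexibility should pay off (the data and the split are illustrative assumptions):

```python
# Minimal sketch: LDA vs QDA test error on two classes with different covariances.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(12)
n = 500
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n)
X1 = rng.multivariate_normal([1, 1], [[1.0, -0.6], [-0.6, 1.0]], size=n)  # different Sigma
X = np.vstack([X0, X1]); y = np.repeat([0, 1], n)

idx = rng.permutation(2 * n)
train, test = idx[:n], idx[n:]

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("QDA", QuadraticDiscriminantAnalysis())]:
    clf.fit(X[train], y[train])
    print(name, "test error:", np.mean(clf.predict(X[test]) != y[test]))
```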

Comparison of Classification Methods

read and add from book.

5. Resampling Methods

Resampling methods involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model.

Cross-validation can be used to estimate the test error associated with a given statistical learning
method in order to evaluate its performance, or to select the appropriate level of flexibility.

Bootstrap is used to provide a measure of accuracy of a parameter estimate or of a given statistical


learning method.
Cross-Validation

Cross-validation consists of a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process and then applying the statistical learning method to those held-out observations.

The Validation Set Approach

It involves randomly dividing the available set of observations into two parts: training set and a
validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict
the responses for the observations in the validation set.

It has two potential drawbacks:

1. The validation estimate of the test error rate can be highly variable, depending on precisely which
observations are included in the training set and which are included in validation set.

2. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate tends to overestimate the test error rate for the model fit on the entire data set.
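A minimal sketch of the validation set approach, repeated over a few random splits to show how variable the resulting error estimates can be (simulated data and a deliberately misspecified linear fit; both are illustrative assumptions):

```python
# Minimal sketch: the validation set approach under different random splits.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(13)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2 + X.ravel() ** 2 + rng.normal(scale=0.5, size=200)

for seed in range(3):
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"split {seed}: validation MSE = {mean_squared_error(y_va, model.predict(X_va)):.3f}")
```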

Leave-One-Out Cross-Validation

LOOCV involves splitting the observations so that a single observation (x1, y1) is used for the validation set and the remaining observations {(x2, y2), (x3, y3), …, (xn, yn)} make up the training set. The statistical learning method is fit on the n − 1 training observations, and a prediction y^1 is made for the excluded observation using its value x1.

Since (x1, y1) was not used in the fitting process, MSE1 = (y1 − y^1)² provides an approximately unbiased estimate of the test error, but it is highly variable because it is based on a single observation.

We repeat the procedure n − 1 more times, each time holding out a different observation and computing its squared error. The LOOCV estimate for the test MSE is the average of these n test error estimates:

\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{MSE}_i

LOOCV has far less bias than the validation set approach. It does not tend to overestimate the test error rate as much as the validation set approach does, because each fit uses nearly the entire data set. In contrast to the validation approach, which yields different results when applied repeatedly due to randomness in the training/validation splits, performing LOOCV multiple times will always yield the same result: there is no randomness in the splits.

LOOCV has the potential to be computationally expensive, since the model must be fit n times.

With least squares linear or polynomial regression, the following identity makes the cost of LOOCV the same as that of a single model fit:

\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2

where h_i is the leverage. The leverage lies between 1/n and 1 and reflects the amount that an observation influences its own fit; hence the residuals for high-leverage points are inflated by exactly the right amount in this formula. The identity does not hold in general, in which case the model has to be refit n times.

k -Fold Cross Validation


It involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set and the method is fit on the remaining k − 1 folds; the MSE is then computed on the held-out fold. This procedure is repeated k times, each time treating a different fold as the validation set. The k-fold CV estimate is computed by averaging these values:

\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i

One typically performs k-fold CV using k = 5 or k = 10.

We perform it to estimate the test error rate or to find the correct level of flexibility for the model.
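A minimal sketch computing 5-fold, 10-fold and leave-one-out CV estimates of the test MSE for a linear model with scikit-learn (simulated data; the fold counts follow the typical choices above):

```python
# Minimal sketch: k-fold CV and LOOCV estimates of the test MSE for a linear model.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(14)
X = rng.normal(size=(100, 2))
y = 1 + X @ np.array([1.5, -2.0]) + rng.normal(scale=1.0, size=100)

model = LinearRegression()
for name, cv in [("5-fold", KFold(5, shuffle=True, random_state=0)),
                 ("10-fold", KFold(10, shuffle=True, random_state=0)),
                 ("LOOCV", LeaveOneOut())]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"{name:7s} CV estimate of test MSE: {-scores.mean():.3f}")
```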

Bias-Variance Trade-Off for k-Fold Cross-Validation

k-fold CV often gives more accurate estimates of the test error rate than does LOOCV.

From the perspective of bias reduction, LOOCV is preferred over k-fold CV, since each LOOCV fit uses n − 1 training observations, whereas each k-fold CV fit uses roughly (k − 1)n/k observations.

When we perform LOOCV, we are in effect averaging the outputs of n fitted models, each of which is trained on an almost identical set of observations; therefore these outputs are highly positively correlated with each other. In contrast, in k-fold CV with k < n, we are averaging the outputs of k fitted models that are somewhat less correlated with each other, since the overlap between the training sets in each model is smaller. Since the mean of many highly correlated quantities has higher variance than the mean of many quantities that are not as highly correlated, the test error estimate from LOOCV tends to have higher variance than that from k-fold CV.

Cross-Validation on Classification Problems

We can also use cross-validation in the classification setting, where Y is a qualitative variable. The LOOCV error estimate takes the form

\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \text{Err}_i, \qquad \text{where } \text{Err}_i = I(y_i \neq \hat{y}_i)

The Bootstrap

The bootstrap is widely used to quantify the uncertainty associated with a given estimator or statistical learning method. For example, it can be used to estimate the standard errors of the coefficients from a linear regression fit.

We obtain distinct data sets by repeatedly sampling observations, with replacement, from the original data set.
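A minimal sketch of the bootstrap estimate of the standard error of a regression slope; the number of resamples B = 1000 and the simulated data are illustrative assumptions:

```python
# Minimal sketch: bootstrap standard error of the slope in a simple regression.
import numpy as np

rng = np.random.default_rng(15)
n = 100
x = rng.uniform(0, 10, size=n)
y = 3 + 2 * x + rng.normal(scale=2.0, size=n)

def slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

B = 1000
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)     # sample n observations with replacement
    boot_slopes[b] = slope(x[idx], y[idx])

print("bootstrap SE of beta1_hat:", boot_slopes.std(ddof=1))
```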

6. Linear Model Selection and Regularization.

We discuss ways in which the simple linear model can be improved by replacing plain least squares fitting with alternative fitting procedures. These alternative procedures can yield better prediction accuracy and model interpretability.

Prediction Accuracy: Provided that the true relationship between the response and the predictors is approximately linear, the least squares estimates will have low bias. If n ≫ p, the least squares estimates also tend to have low variance and will perform well on test observations. However, if n is not much larger than p, there can be a lot of variability in the least squares fit. By constraining or shrinking the estimated coefficients, we can often substantially reduce the variance at the cost of a negligible increase in bias.

Model Interpretability: It is useful to exclude variables that are not related to the response by setting their coefficients to zero, something least squares is extremely unlikely to do on its own. We will see methods that automatically perform feature selection, i.e. exclude irrelevant variables from a multiple regression model.

Subset Selection: We identify a subset of the p predictors that we believe are related to the response and then fit a model using least squares on the reduced set of variables.

Shrinkage: This involves fitting a model with all p predictors, but the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on the method, some of the coefficients may be estimated to be exactly zero, so shrinkage methods can also perform variable selection.

Dimension Reduction: This involves projecting the p predictors into an M-dimensional subspace, where M < p. This is done by computing M different linear combinations, or projections, of the variables. These M projections are then used as predictors to fit a linear regression model by least squares.

Subset Selection

Best Subset Selection

We fit a separate least squares regression to each possible combination of the p predictors.

Algorithm: Best Subset Selection

1. Let M0 denote the null model, which contains no predictors. It simply predicts the sample mean for each observation.

2. For k = 1, 2, …, p:

   a) Fit all \binom{p}{k} models that contain exactly k predictors.

   b) Pick the best among these \binom{p}{k} models and call it Mk. Here best is defined as having the smallest RSS, or equivalently the largest R².

3. Select the single best model among M0, …, Mp using cross-validated prediction error, Cp, AIC, BIC, or adjusted R².

Since RSS decreases monotonically and R² increases monotonically as the number of predictors increases, using them in step 3 would always lead us to select the model containing all of the predictors.

We can also use the deviance, a measure that plays the role of RSS for a broader class of models. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit.

Best subset selection suffers from computational limitations; branch-and-bound techniques can eliminate some choices, but they have their own limitations.

Stepwise Selection

The larger the search space, the higher the chance of finding models that look good on the training data even though they might not have any predictive power on future data. Thus an enormous search space can lead to overfitting and high variance of the coefficient estimates.

A) Forward Stepwise Selection

It begins with a model containing no predictors and then adds predictors to the model one at a time, until all of the predictors are in the model. At each step, the variable that gives the greatest additional improvement to the fit is added to the model.

Algorithm: Forward Stepwise Selection

1. Let M0 be the null model with no predictors.

2. For k = 0, 1, …, p − 1:

   a) Consider all p − k models that augment the predictors in Mk with one additional predictor.

   b) Choose the best among these p − k models, and call it Mk+1. Here best is defined as having the smallest RSS or the highest R².

3. Select the single best model among M0, M1, …, Mp using cross-validated prediction error, Cp, BIC, or adjusted R².

It searches through only 1 + p(p + 1)/2 models, but it is not guaranteed to find the best of the 2^p possible models.

It can be used in the high-dimensional setting where n < p, but then it is only possible to construct the sub-models M0, M1, …, Mn−1.

B) Backward Stepwise Selection

It begins with the full least squares model containing all p predictors and then iteratively removes the least useful predictor, one at a time.

Algorithm: Backward Stepwise Selection

1. Let Mp denote the full model, which contains all p predictors.

2. For k = p, p − 1, …, 1:

   a) Consider all k models that contain all but one of the predictors in Mk, for a total of k − 1 predictors.

   b) Choose the best among these k models, and call it Mk−1. Here best is defined as having the smallest RSS or the highest R².

3. Select the single best model from among M0, M1, …, Mp using cross-validated prediction error, Cp, BIC, or adjusted R².

It requires that the number of samples n is larger than the number of variables p.

Hybrid Approaches

Variables are added to the model sequentially, as in forward selection, but after adding each new variable we may also remove any variables that no longer provide an improvement in the model fit.

Choosing the Optimal Model


In order to select the best model with respect to the test error, we need to estimate the test error. There
are the following approaches:

1. We can indirectly estimate it by making an adjustment to the training error to account for the bias due to overfitting.
2. We can directly estimate it using either the validation set approach or cross-validation.

For a fitted least squares model containing d predictors, the Cp estimate of the test MSE is computed as

C_p = \frac{1}{n}\left(\text{RSS} + 2d\hat{\sigma}^2\right)

The Cp statistic adds a penalty of 2dσ^2 to the training RSS, where σ^2 is an estimate of the variance of the error ϵ, in order to adjust for the fact that the training error tends to underestimate the test error. The penalty increases as the number of predictors increases, to adjust for the corresponding decrease in training RSS. If σ^2 is an unbiased estimate of σ², then Cp is an unbiased estimate of the test MSE, and we choose the model with the lowest Cp value.

The AIC criterion is defined for a large class of models fit by maximum likelihood. With Gaussian errors, maximum likelihood and least squares are the same thing, and the AIC is given (up to irrelevant constants) by

\text{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\text{RSS} + 2d\hat{\sigma}^2\right)

BIC is derived from a Bayesian point of view. For least squares with d predictors, the BIC is, up to irrelevant constants, given by

\text{BIC} = \frac{1}{n}\left(\text{RSS} + \log(n)\, d\hat{\sigma}^2\right)

Since log n > 2 for any n > 7, the BIC places a heavier penalty on models with many variables and hence results in the selection of smaller models than Cp.

For a least squares model with d variables, the adjusted R² statistic is calculated as

\text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n - d - 1)}{\text{TSS}/(n - 1)}

Maximizing the adjusted R² is equivalent to minimizing RSS/(n − d − 1). The intuition behind the adjusted R² is that once all of the correct variables have been included in the model, adding additional noise variables leads to only a very small decrease in RSS. Since adding noise variables also increases d, such variables lead to an increase in RSS/(n − d − 1), and consequently to a decrease in the adjusted R².
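A minimal sketch computing Cp, BIC and the adjusted R² for a sequence of nested models using the formulas above, with σ² estimated from the full model (simulated data; the nesting order is an assumption made for simplicity):

```python
# Minimal sketch: Cp, BIC and adjusted R^2 for nested models with d = 1..p predictors.
import numpy as np

rng = np.random.default_rng(17)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 1 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)   # only 2 relevant predictors

def fit_rss(d):
    design = np.column_stack([np.ones(n), X[:, :d]])        # nested models X1..Xd
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ beta) ** 2)

tss = np.sum((y - y.mean()) ** 2)
sigma2 = fit_rss(p) / (n - p - 1)                           # sigma^2 from the full model

for d in range(1, p + 1):
    rss = fit_rss(d)
    cp = (rss + 2 * d * sigma2) / n
    bic = (rss + np.log(n) * d * sigma2) / n
    adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
    print(f"d={d}  Cp={cp:.3f}  BIC={bic:.3f}  adjR2={adj_r2:.3f}")
```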

Validation and Cross-Validation

It has advantages over the above methods as it directly estimates the test error and makes fewer
assumptions about the model.

If we repeated the validation set approach using a different split of the data into a training set and a validation set, or repeated cross-validation using a different set of folds, then the model with the lowest estimated test error would change. We can use the one-standard-error rule: first calculate the standard error of the estimated test MSE for each model size, then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve.

Shrinkage Methods
