
SiTE

AAiT
AAU

Course Title: Machine Learning


Credit Hour: 3
Instructor: Fantahun B. (PhD)  meetfantaai@gmail.com
Office: NB #

Ch-2 Linear Models for Regression

Nov-2022, AA
2. Linear Models for Regression



Linear Models for Regression
Contents
• Linear basis function models
• Maximum likelihood and least squares, Geometry of least squares,
Sequential learning, Regularized least squares
• The bias-variance decomposition
• Bayesian linear regression
• Parameter distribution, Predictive distribution, Equivalent kernel
• Bayesian model comparison
• The evidence approximation



Linear Basis Function Models
• The simplest linear model for regression is one that involves a
linear combination of the input variables:

y(x,w) = w0 + w1x1 + . . . + wDxD (3.1)

where x = (x1, . . . , xD)T.


• This is often simply known as linear regression.
• The key property of this model is that it is a linear function of the
parameters w0, . . . , wD. It is also, however, a linear function of the
input variables xi, and this imposes significant limitations on the model.
Linear Basis Function Models
• We therefore extend the class of models by considering linear
combinations of fixed nonlinear functions of the input variables,
of the form

y(x,w) = w0 + Σj=1…M−1 wj Фj(x)    (3.2)

where the Фj(x) are known as basis functions.



Linear Basis Function Models
• By denoting the maximum value of the index j by M−1, the
total number of parameters in this model will be M.
• The parameter w0 allows for any fixed offset in the data and is
sometimes called a bias parameter.
• It is often convenient to define an additional dummy ‘basis
function’ Ф0(x) = 1 so that

y(x,w) = Σj=0…M−1 wj Фj(x) = wᵀФ(x)    (3.3)

where
w = (w0, . . . , wM−1)ᵀ and Ф = (Ф0, . . . , ФM−1)ᵀ.
Linear Basis Function Models
Examples of Basis functions
• Polynomial: Фj(x) = x^j
• Gaussian: Фj(x) = exp( −(x − μj)² / (2s²) )    (3.4)
  where the μj govern the locations of the basis functions and s governs their spatial scale.
• Sigmoidal: Фj(x) = σ( (x − μj)/s ),  where σ(a) = 1/(1 + e^−a)    (3.5)



Figure 3.1 Examples of basis functions, showing polynomials on the left, Gaussians
of the form (3.4) in the centre, and sigmoidal of the form (3.5) on the right.
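
The following is a minimal sketch (Python/NumPy; the centres μj and scale s are illustrative choices not specified in the slides) of how the three families of basis functions in Figure 3.1 can be evaluated to build an N × M design matrix:

```python
import numpy as np

def polynomial_basis(x, M):
    # Ф_j(x) = x**j for j = 0 .. M-1 (Ф_0 = 1 plays the role of the bias term)
    return np.column_stack([x ** j for j in range(M)])

def gaussian_basis(x, centres, s):
    # Ф_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant bias column
    G = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), G])

def sigmoidal_basis(x, centres, s):
    # Ф_j(x) = sigma((x - mu_j) / s) with sigma the logistic sigmoid
    S = 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))
    return np.column_stack([np.ones_like(x), S])

x = np.linspace(-1, 1, 25)               # N = 25 inputs
centres = np.linspace(-1, 1, 9)          # illustrative centres mu_j
Phi = gaussian_basis(x, centres, s=0.2)  # N x M design matrix
```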



Linear Basis Function Models: Maximum likelihood and least squares

• We can use the machinery of MLE to estimate the parameters w and the
precision β:

w_ML = (ΦᵀΦ)⁻¹ Φᵀ t    (3.15)

1/β_ML = (1/N) Σn=1…N ( tn − w_MLᵀ Ф(xn) )²    (3.21)

where Φ is the N × M design matrix with elements Φnj = Фj(xn).
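A short sketch of the maximum likelihood solution (3.15) and the noise-precision estimate (3.21); the cubic polynomial basis and the synthetic sinusoidal targets are assumptions made purely for illustration:

```python
import numpy as np

def fit_ml(Phi, t):
    # w_ML = (Phi^T Phi)^{-1} Phi^T t   (3.15), computed via a
    # least-squares solver for numerical stability.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(residuals ** 2)   # 1/beta_ML   (3.21)
    return w_ml, beta_ml

# Illustrative data: a noisy sinusoid fitted with a cubic polynomial basis
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
Phi = np.column_stack([x ** j for j in range(4)])
w_ml, beta_ml = fit_ml(Phi, t)
```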


Linear Basis Function Models: Maximum likelihood and least squares
• Geometry of least squares

Figure 3.2 Geometrical interpretation of the least-squares solution in an
N-dimensional space whose axes are the values of t1, . . . , tN.
The least-squares regression function is obtained by finding the orthogonal
projection of the data vector t onto the subspace spanned by the basis
functions φj(x), in which each basis function is viewed as a vector ϕj of
length N with elements φj(xn).



Linear Basis Function Models: Sequential learning
• Batch techniques, such as the maximum likelihood solution (3.15),
which involve processing the entire training set in one go, can be
computationally costly for large data sets.
• If the data set is sufficiently large, it may be worthwhile to use
sequential algorithms, also known as on-line algorithms, in which the
data points are considered one at a time, and the model parameters
updated after each such presentation.
• We can obtain a sequential learning algorithm by applying the
technique of stochastic gradient descent, also known as sequential
gradient descent, as follows.



Linear Basis Function Models: Sequential learning
• Apply a technique known as stochastic gradient descent or
sequential gradient descent: if the error function is a sum over data
points, E = Σn En, then after presentation of pattern n we replace
the batch update based on ∇E with

w(τ+1) = w(τ) − η ∇En    (3.22)

where τ denotes the iteration number and η is a learning rate parameter.
• For the sum-of-squares error function this gives

w(τ+1) = w(τ) + η ( tn − w(τ)ᵀФ(xn) ) Ф(xn)    (3.23)

which is known as least-mean-squares (LMS).
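A minimal sketch of the sequential LMS update (3.23); the learning rate η, the number of passes, and the zero initialization are illustrative choices, not prescriptions from the slides:

```python
import numpy as np

def lms(Phi, t, eta=0.05, n_epochs=50):
    # Sequential (stochastic) gradient descent for the sum-of-squares error:
    # after presenting pattern n, apply  w <- w + eta * (t_n - w^T phi_n) * phi_n
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in range(N):
            phi_n = Phi[n]
            w += eta * (t[n] - w @ phi_n) * phi_n
    return w
```

In practice the learning rate must be chosen with care: too large a value makes the updates diverge, too small a value makes convergence very slow.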


Linear Basis Function Models: Regularized least squares
• Adding a regularization term to an error function is important to
control over-fitting, so that the total error function to be
minimized takes the form

ED(w) + λ EW(w)    (3.24)

where λ is the regularization coefficient that controls the relative
importance of the data-dependent error ED(w) and the
regularization term EW(w). One of the simplest forms of regularizer
is given by the sum-of-squares of the weight vector elements:

EW(w) = ½ wᵀw    (3.25)


Linear Basis Function Models: Regularized least squares
• If we consider the sum-of-squares error function given by

ED(w) = ½ Σn=1…N ( tn − wᵀФ(xn) )²    (3.26)

then the total error function becomes

½ Σn=1…N ( tn − wᵀФ(xn) )² + (λ/2) wᵀw    (3.27)


Linear Basis Function Models: Regularized least squares
• This particular choice of regularizer is known in the machine
learning literature as weight decay because in sequential
learning algorithms, it encourages weight values to decay
towards zero, unless supported by the data.
• It has the advantage that the error function remains a
quadratic function of w, and so its exact minimizer can be
found in closed form.
• Specifically, setting the gradient of (3.27) with respect to w to
zero, and solving for w, we obtain

w = ( λI + ΦᵀΦ )⁻¹ Φᵀ t    (3.28)
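A sketch of the closed-form regularized (ridge) solution (3.28), where lam plays the role of λ:

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    # w = (lambda * I + Phi^T Phi)^{-1} Phi^T t   (3.28)
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

Setting lam = 0 recovers the unregularized maximum likelihood solution (3.15); increasing lam shrinks the weights towards zero.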


Linear Basis Function Models: Regularized least squares
• A more general regularizer is sometimes used, for which the
regularized error takes the form

½ Σn=1…N ( tn − wᵀФ(xn) )² + (λ/2) Σj=1…M |wj|^q    (3.29)

where q = 2 corresponds to the quadratic regularizer (3.27).


The Bias-Variance Decomposition
• Over-fitting can occur when the number of basis functions is
large and the training data set is of limited size.
• Limiting the number of basis functions limits the flexibility of the
model.
• Regularization can control over-fitting but raises the question of
how to determine λ.
• The bias-variance tradeoff is a frequentist viewpoint of model
complexity.



The Bias-Variance Decomposition
• The squared bias represents the extent to which the average
prediction over all data sets differs from the desired regression
function.
• The variance measures the extent to which the solutions for
individual data sets vary around their average, and hence this
measures the extent to which the function y(x;D) is sensitive to
the particular choice of data set.



The Bias-Variance Decomposition
• The regression loss function: L(t, y(x)) = (y(x) − t)²
• The decision problem: minimize the expected loss

E[L] = ∫∫ (y(x) − t)² p(x, t) dx dt    (1.87)

• Solution:

y(x) = ∫ t p(t|x) dt = Et[t|x]    (1.89)

 – this is known as the regression function
 – the conditional average of t conditioned on x
• Another expression for the expectation of the loss function:

E[L] = ∫ (y(x) − E[t|x])² p(x) dx + ∫∫ (E[t|x] − t)² p(x, t) dx dt    (1.90)
The Bias-Variance Decomposition
• The optimal prediction is obtained by minimization of the expected
squared loss function, and is given by the conditional expectation

h(x) = E[t|x] = ∫ t p(t|x) dt    (3.36)

• The expected squared loss can be decomposed into two terms:

E[L] = ∫ (y(x) − h(x))² p(x) dx + ∫∫ (h(x) − t)² p(x, t) dx dt    (3.37)
• The theoretical minimum of the first term is zero for an appropriate
choice of the function y(x) (for unlimited data and unlimited
computing power).
• The second term arises from noise in the data and it represents the
minimum achievable value of the expected squared loss.
The Bias-Variance Decomposition: An ensemble of datasets
• For any given data set D we obtain a prediction function y(x,D).
• The performance of a particular algorithm is assessed by taking the
average over all these data sets, namely ED[L].
• This expands into the following terms:
expected loss = (bias)² + variance + noise    (Eqs. 3.42, 3.43, 3.44)
• There is a tradeoff between bias and variance:
 flexible models have low bias and high variance
 rigid models have high bias and low variance
• Although the bias-variance decomposition provides interesting insights
into model complexity, it is of limited practical value because several
data sets are needed (see the simulation sketch below).
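
A small simulation sketch in the spirit of Figure 3.5 (synthetic sinusoidal data, 24 Gaussian basis functions plus a bias, and illustrative values of λ and the noise level) that estimates the squared bias and the variance by averaging fits over an ensemble of L data sets:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 100, 25                        # number of data sets, points per set
centres, s = np.linspace(0, 1, 24), 0.1

def design(x):
    # bias column plus 24 Gaussian basis functions (M = 25 parameters)
    return np.column_stack([np.ones_like(x),
                            np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))])

x_test = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_test)        # true regression function
Phi_test = design(x_test)

def bias_variance(lam):
    preds = np.empty((L, x_test.size))
    for l in range(L):
        xn = rng.uniform(0, 1, N)
        tn = np.sin(2 * np.pi * xn) + rng.normal(scale=0.3, size=N)
        Phi = design(xn)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ tn)
        preds[l] = Phi_test @ w
    y_bar = preds.mean(axis=0)
    return np.mean((y_bar - h) ** 2), np.mean(preds.var(axis=0))  # (bias)^2, variance

for lam in (0.1, 1.0, 10.0):
    print(lam, bias_variance(lam))    # larger lam: lower variance, higher bias
```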
Figure 3.5 Illustration of the dependence of bias and variance on model
complexity, governed by a regularization parameter λ, using the sinusoidal
data set from Chapter 1. There are L = 100 data sets, each having N = 25
data points, and there are 24 Gaussian basis functions in the model, so the
total number of parameters is M = 25 including the bias parameter. The left
column shows the result of fitting the model to the data sets for various
values of ln λ (for clarity, only 20 of the 100 fits are shown). The right
column shows the corresponding average of the 100 fits (red) along with the
sinusoidal function from which the data sets were generated (green). Large λ
gives low variance but high bias; small λ gives high variance but low bias.



The Bias-Variance Decomposition: An ensemble of datasets
• Although the bias-variance decomposition may provide some
interesting insights into the model complexity issue from a frequentist
perspective, it is of limited practical value, because the bias-variance
decomposition is based on averages with respect to ensembles of
data sets, whereas in practice we have only the single observed data
set.
• If we had a large number of independent training sets of a given size,
we would be better off combining them into a single large training set,
which of course would reduce the level of over-fitting for a given
model complexity.
• Given these limitations, we turn to a Bayesian treatment of linear basis
function models, which not only provides powerful insights into the
issues of over-fitting but also leads to practical techniques for
addressing the question of model complexity.



Key benefits of linear regression
• Easy implementation
  – Computationally simple to implement, as it does not demand a lot of engineering
    overhead, neither before the model launch nor during its maintenance.
• Interpretability
  – Unlike deep learning models (neural networks), linear regression is relatively
    straightforward. As a result, it stands ahead of black-box models that fall short
    in justifying which input variable causes the output variable to change.
• Scalability
  – Not computationally heavy and therefore fits well in cases where scaling is
    essential, e.g., the model scales well as data volume increases.
• Optimal for online settings
  – The model can be trained and retrained with each new example to generate
    predictions in real time, unlike neural networks or support vector machines,
    which are computationally heavy and require plenty of computing resources and
    substantial waiting time.
Bayesian Linear Regression
• In the Bayesian viewpoint, we formulate linear regression using probability
distributions rather than point estimates.
• The response, y, is not estimated as a single value, but is assumed to be
drawn from a probability distribution.
• The model for Bayesian Linear Regression, with the response sampled from
a normal distribution, is:

y ∼ N( βᵀX, σ²I )

 – The output y is generated from a normal (Gaussian) distribution characterized by a
mean and a variance.
 – The mean for linear regression is the transpose of the weight matrix multiplied by the
predictor matrix.
 – The variance is the square of the standard deviation σ (multiplied by the identity
matrix because this is a multi-dimensional formulation of the model).
Bayesian Linear Regression
• The aim of Bayesian Linear Regression is not to find the single “best”
value of the model parameters, but rather to determine the
posterior distribution for the model parameters.
• Not only is the response generated from a probability distribution,
but the model parameters are assumed to come from a distribution
as well.



Bayesian Linear Regression
• The posterior probability of the model parameters is conditional
upon the training inputs and outputs:

P(β|y, X) = P(y|β, X) · P(β|X) / P(y|X)

• P(β|y, X) is the posterior probability distribution of the model parameters given
the inputs and outputs; P(y|β, X) is the likelihood of the data; P(β|X) is the prior
probability of the parameters; and P(y|X) is a normalization constant.
• This is a simple expression of Bayes’ Theorem, the fundamental
underpinning of Bayesian Inference:

Posterior = ( Likelihood × Prior ) / Normalization


Bayesian Linear Regression
• In contrast to OLS, we have a posterior distribution for the model
parameters that is proportional to the likelihood of the data
multiplied by the prior probability of the parameters.
• Here we can observe the two primary benefits of Bayesian Linear
Regression.
1. Priors: If we have domain knowledge, or a guess for what the model
parameters should be, we can include them in our model, unlike in the
frequentist approach which assumes everything there is to know about the
parameters comes from the data. If we don’t have any estimates ahead of
time, we can use non-informative priors for the parameters such as a
normal distribution.
2. Posterior: The result of performing Bayesian Linear Regression is a distribution
of possible model parameters based on the data and the prior. This allows
us to quantify our uncertainty about the model: if we have fewer data
points, the posterior distribution will be more spread out.
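
For a Gaussian likelihood with known noise precision β and a zero-mean isotropic Gaussian prior N(0, α⁻¹I) over the weights, the posterior is itself Gaussian and available in closed form. A minimal sketch, with α and β treated as known hyperparameters (an assumption made for illustration):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    # Prior:      w ~ N(0, alpha^{-1} I)
    # Posterior:  w ~ N(m_N, S_N) with
    #   S_N^{-1} = alpha * I + beta * Phi^T Phi
    #   m_N      = beta * S_N @ Phi^T @ t
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```

Drawing weight vectors from N(mN, SN) produces the spread of plausible regression curves discussed in the following slides.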



Bayesian Linear Regression
• As the number of data points increases, the likelihood washes out
the prior, and in the case of infinite data, the estimates of the
parameters converge to the values obtained from OLS.
• The formulation of model parameters as distributions encapsulates
the Bayesian worldview: we start out with an initial estimate, our
prior, and as we gather more evidence, our model becomes less
wrong.
• Bayesian reasoning is a natural extension of our intuition.
• Often, we have an initial hypothesis, and as we collect data that either
supports or disproves our ideas, we change our model of the world (ideally
this is how we would reason)!



Bayesian Linear Regression: Implementation Example
• In practice, evaluating the posterior distribution for the model
parameters is intractable for continuous variables, so we use
sampling methods to draw samples from the posterior in order to
approximate it.
• The technique of drawing random samples from a distribution to
approximate the distribution is one application of Monte Carlo
methods.
• There are a number of algorithms for Monte Carlo sampling, with
the most common being variants of Markov Chain Monte Carlo.
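
A minimal random-walk Metropolis sketch for drawing posterior samples of the weights of a linear model; the Gaussian likelihood and prior, step size, and number of steps are illustrative assumptions, and practical tools (e.g. PyMC) implement far more efficient MCMC variants:

```python
import numpy as np

def log_posterior(w, Phi, t, beta=25.0, alpha=1.0):
    # log p(w | t) up to a constant: Gaussian likelihood (precision beta)
    # plus a zero-mean Gaussian prior (precision alpha) on the weights.
    resid = t - Phi @ w
    return -0.5 * beta * resid @ resid - 0.5 * alpha * w @ w

def metropolis(Phi, t, n_steps=1000, step=0.05, rng=None):
    # Random-walk Metropolis: propose w' = w + noise, accept with
    # probability min(1, p(w'|t) / p(w|t)).
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(Phi.shape[1])
    logp = log_posterior(w, Phi, t)
    samples = []
    for _ in range(n_steps):
        w_prop = w + step * rng.normal(size=w.shape)
        logp_prop = log_posterior(w_prop, Phi, t)
        if np.log(rng.uniform()) < logp_prop - logp:
            w, logp = w_prop, logp_prop
        samples.append(w.copy())
    return np.array(samples)
```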



Bayesian Linear Regression: Implementation Example
• The approximations of the posterior distributions of model
parameters: these are the result of 1000 steps of MCMC, meaning
the algorithm drew 1000 samples from the posterior distribution.



Bayesian Linear Regression: Implementation Example

Bayesian Linear Regression Model Results with 500 (left) and 15000 observations (right)
Bayesian Linear Regression: Implementation Example
• When we want to show the linear fit from a Bayesian model, instead
of showing only a single estimate, we can draw a range of lines, each
one representing a different estimate of the model parameters.
• As the number of datapoints increases, the lines begin to overlap
because there is less uncertainty in the model parameters.
• There is much more variation in the fits when using fewer data
points, which represents a greater uncertainty in the model.
• With all of the data points, the OLS and Bayesian Fits are nearly
identical because the priors are washed out by the likelihoods from
the data.
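
A sketch of how such a range of lines can be produced: draw weight vectors from the posterior N(mN, SN) obtained earlier (assumed available here) and evaluate the corresponding curves at new inputs:

```python
import numpy as np

def sample_fits(Phi_new, m_N, S_N, n_lines=20, rng=None):
    # Draw weight vectors from the posterior N(m_N, S_N) and evaluate the
    # corresponding regression curves at the new inputs; the spread of the
    # curves visualizes the remaining uncertainty in the parameters.
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.multivariate_normal(m_N, S_N, size=n_lines)   # n_lines x M
    return Phi_new @ W.T                                   # one curve per column
```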



Bayesian Linear Regression: Implementation Example

• The probability density plot for the number of calories burned
exercising for 15.5 minutes.
• The red vertical line indicates the point estimate from OLS.



Bayesian Model Comparison
• The Bayesian formalism is one of the most common
approaches to statistical inference.
 For parameter inference, for a given model M, the posterior
probability distribution of the parameters of interest θ can be related to
the likelihood of the data y via Bayes’ theorem

p(θ | y, M) = p(y | θ, M) p(θ | M) / p(y | M)

where the prior distribution p(θ | M) encodes our prior knowledge about
the parameters before observation of the data.
Bayesian Model Comparison
• The Bayesian model evidence, given by the denominator of the
equation above, is irrelevant for parameter estimation since it is
independent of the parameters of interest and can simply be
considered as a normalising constant.
• However, for model comparison the Bayesian model evidence, also
called the marginal likelihood, plays a central role.
• For model selection we are interested in the model posterior
probability, which, by another application of Bayes’ theorem, can
be written as

p(M | y) = p(y | M) p(M) / p(y)


Bayesian Model Comparison
• To compare models we therefore need to compute Bayes factors,
which require computation of the model evidence of the models
under consideration. This is where things get computationally
challenging.
• The Bayesian model evidence is given by the integral of the
likelihood and prior over the parameter space:

p(y | M) = ∫ p(y | θ, M) p(θ | M) dθ

• Computation of the evidence therefore requires evaluation of a
multi-dimensional integral, which can be highly computationally
challenging.
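
A toy sketch of computing the log evidence by direct numerical integration for a model with a single parameter θ (the Gaussian likelihood and prior are illustrative assumptions); this brute-force approach is feasible only in very low dimensions, which is exactly why evidence computation is hard in general:

```python
import numpy as np

def log_likelihood(theta, x, t, beta=25.0):
    # Model M: t_n = theta * x_n + Gaussian noise with precision beta
    resid = t - theta * x
    return 0.5 * len(t) * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid @ resid

def log_evidence(x, t, prior_std=1.0, grid=None):
    # p(t | M) = integral over theta of likelihood(theta) * prior(theta),
    # approximated on a dense grid with a log-sum-exp trick for stability.
    if grid is None:
        grid = np.linspace(-10, 10, 20001)
    log_prior = -0.5 * (grid / prior_std) ** 2 - np.log(prior_std * np.sqrt(2 * np.pi))
    log_integrand = np.array([log_likelihood(th, x, t) for th in grid]) + log_prior
    m, dx = log_integrand.max(), grid[1] - grid[0]
    return m + np.log(np.sum(np.exp(log_integrand - m)) * dx)
```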
Bayesian Model Comparison
• It is insightful to note that the model evidence naturally
incorporates Occam’s razor, trading off model complexity and
goodness of fit, as illustrated in the following diagram.



The evidence approximation

• In a fully Bayesian treatment of the linear basis function model, we would introduce
prior distributions over the hyperparameters α and β and integrate them out along
with the parameters w.
• The evidence approximation (also known as empirical Bayes or type-2 maximum
likelihood) instead sets the hyperparameters to the values that maximize the marginal
likelihood (evidence) obtained by integrating out the parameters w.
