
SiTE

AAiT
AAU

Course Title: Machine Learning


Credit Hour: 3
Instructor: Fantahun B. (PhD)  meetfantaai@gmail.com
Office: NB #

Ch-2 Linear Models for Regression

Nov-2022, AA
2. Linear Models for Regression



Linear Models for Regression
Contents
• Linear basis function models
• Maximum likelihood and least squares, Geometry of least squares,
Sequential learning, Regularized least squares
• The bias-variance decomposition
• Bayesian linear regression
• Parameter distribution, Predictive distribution, Equivalent kernel
• Bayesian model comparison
• The evidence approximation



Linear Basis Function Models
• The simplest linear model for regression is one that involves a
linear combination of the input variables:

y(x,w) = w0 + w1x1 + . . . + wDxD (3.1)

where x = (x1, . . . , xD)T.


• This is often simply known as linear regression.
• The key property of this model is that it is a linear function of the
parameters w0, . . . , wD. It is also, however, a linear function of the
input variables xi, and this imposes significant limitations on the model.
Linear Basis Function Models
• We therefore extend the class of models by considering linear
combinations of fixed nonlinear functions of the input variables,
of the form

y(x,w) = w0 + Σj=1…M−1 wj Фj(x)    (3.2)

where the Фj(x) are known as basis functions.



Linear Basis Function Models
• By denoting the maximum value of the index j by M−1, the
total number of parameters in this model will be M.
• The parameter w0 allows for any fixed offset in the data and is
sometimes called a bias parameter.
• It is often convenient to define an additional dummy ‘basis
function’ Ф0(x) = 1 so that

y(x,w) = Σj=0…M−1 wj Фj(x) = wᵀФ(x)    (3.3)

where
w = (w0, . . . , wM−1)ᵀ and Ф = (Ф0, . . . , ФM−1)ᵀ.
Linear Basis Function Models
Examples of Basis functions
• Polynomial: Фj(x) = x^j
• Gaussian: Фj(x) = exp( −(x − μj)² / (2s²) )    (3.4)
  where the μj govern the locations of the basis functions and s governs their spatial scale.
• Sigmoidal: Фj(x) = σ( (x − μj)/s ),  where σ(a) = 1/(1 + e^−a)    (3.5)



Figure 3.1 Examples of basis functions, showing polynomials on the left, Gaussians
of the form (3.4) in the centre, and sigmoidal of the form (3.5) on the right.
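
The following is a minimal sketch (Python/NumPy; the centres μj and scale s are illustrative choices not specified in the slides) of how the three families of basis functions in Figure 3.1 can be evaluated to build an N × M design matrix:

```python
import numpy as np

def polynomial_basis(x, M):
    # Ф_j(x) = x**j for j = 0 .. M-1 (Ф_0 = 1 plays the role of the bias term)
    return np.column_stack([x ** j for j in range(M)])

def gaussian_basis(x, centres, s):
    # Ф_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant bias column
    G = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones_like(x), G])

def sigmoidal_basis(x, centres, s):
    # Ф_j(x) = sigma((x - mu_j) / s) with sigma the logistic sigmoid
    S = 1.0 / (1.0 + np.exp(-(x[:, None] - centres[None, :]) / s))
    return np.column_stack([np.ones_like(x), S])

x = np.linspace(-1, 1, 25)               # N = 25 inputs
centres = np.linspace(-1, 1, 9)          # illustrative centres mu_j
Phi = gaussian_basis(x, centres, s=0.2)  # N x M design matrix
```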



Linear Basis Function Models: Maximum likelihood and least squares

• We can use the machinery of MLE to estimate the parameters w and the
precision β:

w_ML = (ΦᵀΦ)⁻¹ Φᵀ t    (3.15)

1/β_ML = (1/N) Σn=1…N ( tn − w_MLᵀ Ф(xn) )²    (3.21)

where Φ is the N × M design matrix with elements Φnj = Фj(xn).
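A short sketch of the maximum likelihood solution (3.15) and the noise-precision estimate (3.21); the cubic polynomial basis and the synthetic sinusoidal targets are assumptions made purely for illustration:

```python
import numpy as np

def fit_ml(Phi, t):
    # w_ML = (Phi^T Phi)^{-1} Phi^T t   (3.15), computed via a
    # least-squares solver for numerical stability.
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    residuals = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(residuals ** 2)   # 1/beta_ML   (3.21)
    return w_ml, beta_ml

# Illustrative data: a noisy sinusoid fitted with a cubic polynomial basis
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
Phi = np.column_stack([x ** j for j in range(4)])
w_ml, beta_ml = fit_ml(Phi, t)
```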


Linear Basis Function Models: Maximum likelihood and least squares
• Geometry of least squares

Figure 3.2 Geometrical interpretation of the least-squares solution in an
N-dimensional space whose axes are the values of t1, . . . , tN.
The least-squares regression function is obtained by finding the orthogonal
projection of the data vector t onto the subspace spanned by the basis
functions φj(x), in which each basis function is viewed as a vector ϕj of
length N with elements φj(xn).



Linear Basis Function Models: Sequential learning
• Batch techniques, such as the maximum likelihood solution (3.15),
which involve processing the entire training set in one go, can be
computationally costly for large data sets.
• If the data set is sufficiently large, it may be worthwhile to use
sequential algorithms, also known as on-line algorithms, in which the
data points are considered one at a time, and the model parameters
updated after each such presentation.
• We can obtain a sequential learning algorithm by applying the
technique of stochastic gradient descent, also known as sequential
gradient descent, as follows.



Linear Basis Function Models: Sequential learning
• Apply a technique known as stochastic gradient descent or
sequential gradient descent: if the error function is a sum over data
points, E = Σn En, then after presentation of pattern n we replace
the batch update based on ∇E with

w(τ+1) = w(τ) − η ∇En    (3.22)

where τ denotes the iteration number and η is a learning rate parameter.
• For the sum-of-squares error function this gives

w(τ+1) = w(τ) + η ( tn − w(τ)ᵀФ(xn) ) Ф(xn)    (3.23)

which is known as least-mean-squares (LMS).
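A minimal sketch of the sequential LMS update (3.23); the learning rate η, the number of passes, and the zero initialization are illustrative choices, not prescriptions from the slides:

```python
import numpy as np

def lms(Phi, t, eta=0.05, n_epochs=50):
    # Sequential (stochastic) gradient descent for the sum-of-squares error:
    # after presenting pattern n, apply  w <- w + eta * (t_n - w^T phi_n) * phi_n
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_epochs):
        for n in range(N):
            phi_n = Phi[n]
            w += eta * (t[n] - w @ phi_n) * phi_n
    return w
```

In practice the learning rate must be chosen with care: too large a value makes the updates diverge, too small a value makes convergence very slow.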


Linear Basis Function Models: Regularized least squares
• Adding a regularization term to an error function is important to
control over-fitting, so that the total error function to be
minimized takes the form

ED(w) + λ EW(w)    (3.24)

where λ is the regularization coefficient that controls the relative
importance of the data-dependent error ED(w) and the
regularization term EW(w). One of the simplest forms of regularizer
is given by the sum-of-squares of the weight vector elements:

EW(w) = ½ wᵀw    (3.25)


Linear Basis Function Models: Regularized least squares
• If we consider the sum-of-squares error function given by

ED(w) = ½ Σn=1…N ( tn − wᵀФ(xn) )²    (3.26)

then the total error function becomes

½ Σn=1…N ( tn − wᵀФ(xn) )² + (λ/2) wᵀw    (3.27)


Linear Basis Function Models: Regularized least squares
• This particular choice of regularizer is known in the machine
learning literature as weight decay because in sequential
learning algorithms, it encourages weight values to decay
towards zero, unless supported by the data.
• It has the advantage that the error function remains a
quadratic function of w, and so its exact minimizer can be
found in closed form.
• Specifically, setting the gradient of (3.27) with respect to w to
zero, and solving for w, we obtain

w = ( λI + ΦᵀΦ )⁻¹ Φᵀ t    (3.28)
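A sketch of the closed-form regularized (ridge) solution (3.28), where lam plays the role of λ:

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    # w = (lambda * I + Phi^T Phi)^{-1} Phi^T t   (3.28)
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

Setting lam = 0 recovers the unregularized maximum likelihood solution (3.15); increasing lam shrinks the weights towards zero.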


Linear Basis Function Models: Regularized least squares
• A more general regularizer is sometimes used, for which the
regularized error takes the form

½ Σn=1…N ( tn − wᵀФ(xn) )² + (λ/2) Σj=1…M |wj|^q    (3.29)

where q = 2 corresponds to the quadratic regularizer (3.27).


The Bias-Variance Decomposition
• Over-fitting can occur when the number of basis functions is
large and the training data set is of limited size.
• Limiting the number of basis functions limits the flexibility of the
model.
• Regularization can control over-fitting but raises the question of
how to determine λ.
• The bias-variance tradeoff is a frequentist viewpoint of model
complexity.



The Bias-Variance Decomposition
• The squared bias represents the extent to which the average
prediction over all data sets differs from the desired regression
function.
• The variance measures the extent to which the solutions for
individual data sets vary around their average, and hence this
measures the extent to which the function y(x;D) is sensitive to
the particular choice of data set.



The Bias-Variance Decomposition
• The regression loss function: L(t, y(x)) = (y(x) − t)²
• The decision problem: minimize the expected loss

E[L] = ∫∫ (y(x) − t)² p(x, t) dx dt    (1.87)

• Solution:

y(x) = ∫ t p(t|x) dt = Et[t|x]    (1.89)

 – this is known as the regression function
 – the conditional average of t conditioned on x
• Another expression for the expectation of the loss function:

E[L] = ∫ (y(x) − E[t|x])² p(x) dx + ∫∫ (E[t|x] − t)² p(x, t) dx dt    (1.90)
The Bias-Variance Decomposition
• The optimal prediction is obtained by minimization of the expected
squared loss function, and is given by the conditional expectation

h(x) = E[t|x] = ∫ t p(t|x) dt    (3.36)

• The expected squared loss can be decomposed into two terms:

E[L] = ∫ (y(x) − h(x))² p(x) dx + ∫∫ (h(x) − t)² p(x, t) dx dt    (3.37)
• The theoretical minimum of the first term is zero for an appropriate
choice of the function y(x) (for unlimited data and unlimited
computing power).
• The second term arises from noise in the data and it represents the
minimum achievable value of the expected squared loss.
The Bias-Variance Decomposition: An ensemble of datasets
• For any given data set D we obtain a prediction function y(x,D).
• The performance of a particular algorithm is assessed by taking the
average over all these data sets, namely ED[L].
• This expands into the following terms:
expected loss = (bias)² + variance + noise    (Eqs. 3.42, 3.43, 3.44)
• There is a tradeoff between bias and variance:
 flexible models have low bias and high variance
 rigid models have high bias and low variance
• Although the bias-variance decomposition provides interesting insights
into model complexity, it is of limited practical value because several
data sets are needed (see the simulation sketch below).
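
A small simulation sketch in the spirit of Figure 3.5 (synthetic sinusoidal data, 24 Gaussian basis functions plus a bias, and illustrative values of λ and the noise level) that estimates the squared bias and the variance by averaging fits over an ensemble of L data sets:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 100, 25                        # number of data sets, points per set
centres, s = np.linspace(0, 1, 24), 0.1

def design(x):
    # bias column plus 24 Gaussian basis functions (M = 25 parameters)
    return np.column_stack([np.ones_like(x),
                            np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * s ** 2))])

x_test = np.linspace(0, 1, 200)
h = np.sin(2 * np.pi * x_test)        # true regression function
Phi_test = design(x_test)

def bias_variance(lam):
    preds = np.empty((L, x_test.size))
    for l in range(L):
        xn = rng.uniform(0, 1, N)
        tn = np.sin(2 * np.pi * xn) + rng.normal(scale=0.3, size=N)
        Phi = design(xn)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ tn)
        preds[l] = Phi_test @ w
    y_bar = preds.mean(axis=0)
    return np.mean((y_bar - h) ** 2), np.mean(preds.var(axis=0))  # (bias)^2, variance

for lam in (0.1, 1.0, 10.0):
    print(lam, bias_variance(lam))    # larger lam: lower variance, higher bias
```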
Figure 3.5 Illustration of the dependence of bias and variance on model
complexity, governed by a regularization parameter λ, using the sinusoidal
data set from Chapter 1. There are L = 100 data sets, each having N = 25
data points, and there are 24 Gaussian basis functions in the model, so the
total number of parameters is M = 25 including the bias parameter. The left
column shows the result of fitting the model to the data sets for various
values of ln λ (for clarity, only 20 of the 100 fits are shown). The right
column shows the corresponding average of the 100 fits (red) along with the
sinusoidal function from which the data sets were generated (green). Large λ
gives low variance but high bias; small λ gives high variance but low bias.



The Bias-Variance Decomposition: An ensemble of datasets
• Although the bias-variance decomposition may provide some
interesting insights into the model complexity issue from a frequentist
perspective, it is of limited practical value, because the bias-variance
decomposition is based on averages with respect to ensembles of
data sets, whereas in practice we have only the single observed data
set.
• If we had a large number of independent training sets of a given size,
we would be better off combining them into a single large training set,
which of course would reduce the level of over-fitting for a given
model complexity.
• Given these limitations, we turn to a Bayesian treatment of linear basis
function models, which not only provides powerful insights into the
issues of over-fitting but also leads to practical techniques for
addressing the question of model complexity.



Key benefits of linear regression
• Easy implementation
  – Computationally simple to implement, as it does not demand a lot of engineering
    overhead, neither before the model launch nor during its maintenance.
• Interpretability
  – Unlike deep learning models (neural networks), linear regression is relatively
    straightforward. As a result, it stands ahead of black-box models that fall short
    in justifying which input variable causes the output variable to change.
• Scalability
  – Not computationally heavy and therefore fits well in cases where scaling is
    essential, e.g., the model scales well as data volume increases.
• Optimal for online settings
  – The model can be trained and retrained with each new example to generate
    predictions in real time, unlike neural networks or support vector machines,
    which are computationally heavy and require plenty of computing resources and
    substantial waiting time.
Bayesian Linear Regression
• In the Bayesian viewpoint, we formulate linear regression using probability
distributions rather than point estimates.
• The response, y, is not estimated as a single value, but is assumed to be
drawn from a probability distribution.
• The model for Bayesian Linear Regression, with the response sampled from
a normal distribution, is:

y ∼ N( βᵀX, σ²I )

 – The output y is generated from a normal (Gaussian) distribution characterized by a
mean and a variance.
 – The mean for linear regression is the transpose of the weight matrix multiplied by the
predictor matrix.
 – The variance is the square of the standard deviation σ (multiplied by the identity
matrix because this is a multi-dimensional formulation of the model).
Bayesian Linear Regression
• The aim of Bayesian Linear Regression is not to find the single “best”
value of the model parameters, but rather to determine the
posterior distribution for the model parameters.
• Not only is the response generated from a probability distribution,
but the model parameters are assumed to come from a distribution
as well.



Bayesian Linear Regression
• The posterior probability of the model parameters is conditional
upon the training inputs and outputs:

P(β|y, X) = P(y|β, X) · P(β|X) / P(y|X)

• P(β|y, X) is the posterior probability distribution of the model parameters given
the inputs and outputs; P(y|β, X) is the likelihood of the data; P(β|X) is the prior
probability of the parameters; and P(y|X) is a normalization constant.
• This is a simple expression of Bayes’ Theorem, the fundamental
underpinning of Bayesian Inference:

Posterior = ( Likelihood × Prior ) / Normalization


Bayesian Linear Regression
• In contrast to OLS, we have a posterior distribution for the model
parameters that is proportional to the likelihood of the data
multiplied by the prior probability of the parameters.
• Here we can observe the two primary benefits of Bayesian Linear
Regression.
1. Priors: If we have domain knowledge, or a guess for what the model
parameters should be, we can include them in our model, unlike in the
frequentist approach which assumes everything there is to know about the
parameters comes from the data. If we don’t have any estimates ahead of
time, we can use non-informative priors for the parameters such as a
normal distribution.
2. Posterior: The result of performing Bayesian Linear Regression is a distribution
of possible model parameters based on the data and the prior. This allows
us to quantify our uncertainty about the model: if we have fewer data
points, the posterior distribution will be more spread out.
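
For a Gaussian likelihood with known noise precision β and a zero-mean isotropic Gaussian prior N(0, α⁻¹I) over the weights, the posterior is itself Gaussian and available in closed form. A minimal sketch, with α and β treated as known hyperparameters (an assumption made for illustration):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    # Prior:      w ~ N(0, alpha^{-1} I)
    # Posterior:  w ~ N(m_N, S_N) with
    #   S_N^{-1} = alpha * I + beta * Phi^T Phi
    #   m_N      = beta * S_N @ Phi^T @ t
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```

Drawing weight vectors from N(mN, SN) produces the spread of plausible regression curves discussed in the following slides.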



Bayesian Linear Regression
• As the number of data points increases, the likelihood washes out
the prior, and in the case of infinite data, the estimates of the
parameters converge to the values obtained from OLS.
• The formulation of model parameters as distributions encapsulates
the Bayesian worldview: we start out with an initial estimate, our
prior, and as we gather more evidence, our model becomes less
wrong.
• Bayesian reasoning is a natural extension of our intuition.
• Often, we have an initial hypothesis, and as we collect data that either
supports or disproves our ideas, we change our model of the world (ideally
this is how we would reason)!



Bayesian Linear Regression: Implementation Example
• In practice, evaluating the posterior distribution for the model
parameters is intractable for continuous variables, so we use
sampling methods to draw samples from the posterior in order to
approximate it.
• The technique of drawing random samples from a distribution to
approximate the distribution is one application of Monte Carlo
methods.
• There are a number of algorithms for Monte Carlo sampling, with
the most common being variants of Markov Chain Monte Carlo.
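
A minimal random-walk Metropolis sketch for drawing posterior samples of the weights of a linear model; the Gaussian likelihood and prior, step size, and number of steps are illustrative assumptions, and practical tools (e.g. PyMC) implement far more efficient MCMC variants:

```python
import numpy as np

def log_posterior(w, Phi, t, beta=25.0, alpha=1.0):
    # log p(w | t) up to a constant: Gaussian likelihood (precision beta)
    # plus a zero-mean Gaussian prior (precision alpha) on the weights.
    resid = t - Phi @ w
    return -0.5 * beta * resid @ resid - 0.5 * alpha * w @ w

def metropolis(Phi, t, n_steps=1000, step=0.05, rng=None):
    # Random-walk Metropolis: propose w' = w + noise, accept with
    # probability min(1, p(w'|t) / p(w|t)).
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(Phi.shape[1])
    logp = log_posterior(w, Phi, t)
    samples = []
    for _ in range(n_steps):
        w_prop = w + step * rng.normal(size=w.shape)
        logp_prop = log_posterior(w_prop, Phi, t)
        if np.log(rng.uniform()) < logp_prop - logp:
            w, logp = w_prop, logp_prop
        samples.append(w.copy())
    return np.array(samples)
```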



Bayesian Linear Regression: Implementation Example
• The approximations of the posterior distributions of model
parameters: these are the result of 1000 steps of MCMC, meaning
the algorithm drew 1000 samples from the posterior distribution.



Bayesian Linear Regression: Implementation Example

Bayesian Linear Regression Model Results with 500 (left) and 15000 observations (right)
Bayesian Linear Regression: Implementation Example
• When we want to show the linear fit from a Bayesian model, instead
of showing only a single estimate, we can draw a range of lines, each
one representing a different estimate of the model parameters.
• As the number of datapoints increases, the lines begin to overlap
because there is less uncertainty in the model parameters.
• There is much more variation in the fits when using fewer data
points, which represents a greater uncertainty in the model.
• With all of the data points, the OLS and Bayesian Fits are nearly
identical because the priors are washed out by the likelihoods from
the data.
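
A sketch of how such a range of lines can be produced: draw weight vectors from the posterior N(mN, SN) obtained earlier (assumed available here) and evaluate the corresponding curves at new inputs:

```python
import numpy as np

def sample_fits(Phi_new, m_N, S_N, n_lines=20, rng=None):
    # Draw weight vectors from the posterior N(m_N, S_N) and evaluate the
    # corresponding regression curves at the new inputs; the spread of the
    # curves visualizes the remaining uncertainty in the parameters.
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.multivariate_normal(m_N, S_N, size=n_lines)   # n_lines x M
    return Phi_new @ W.T                                   # one curve per column
```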



Bayesian Linear Regression: Implementation Example

• The probability density plot for the number of calories burned
exercising for 15.5 minutes.
• The red vertical line indicates the point estimate from OLS.



Bayesian Model Comparison
• The Bayesian formalism is one of the most common
approaches to statistical inference.
 For parameter inference, for a given model M, the posterior
probability distribution of the parameters of interest θ can be related to
the likelihood of the data y via Bayes’ theorem

p(θ | y, M) = p(y | θ, M) p(θ | M) / p(y | M)

where the prior distribution p(θ | M) encodes our prior knowledge about
the parameters before observation of the data.
Bayesian Model Comparison
• The Bayesian model evidence, given by the denominator of the
equation above, is irrelevant for parameter estimation since it is
independent of the parameters of interest and can simply be
considered as a normalising constant.
• However, for model comparison the Bayesian model evidence, also
called the marginal likelihood, plays a central role.
• For model selection we are interested in the model posterior
probability, which, by another application of Bayes’ theorem, can
be written as

p(M | y) = p(y | M) p(M) / p(y)


Bayesian Model Comparison
• To compare models we therefore need to compute Bayes factors,
which require computation of the model evidence of the models
under consideration. This is where things get computationally
challenging.
• The Bayesian model evidence is given by the integral of the
likelihood and prior over the parameter space:

p(y | M) = ∫ p(y | θ, M) p(θ | M) dθ

• Computation of the evidence therefore requires evaluation of a
multi-dimensional integral, which can be highly computationally
challenging.
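
A toy sketch of computing the log evidence by direct numerical integration for a model with a single parameter θ (the Gaussian likelihood and prior are illustrative assumptions); this brute-force approach is feasible only in very low dimensions, which is exactly why evidence computation is hard in general:

```python
import numpy as np

def log_likelihood(theta, x, t, beta=25.0):
    # Model M: t_n = theta * x_n + Gaussian noise with precision beta
    resid = t - theta * x
    return 0.5 * len(t) * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid @ resid

def log_evidence(x, t, prior_std=1.0, grid=None):
    # p(t | M) = integral over theta of likelihood(theta) * prior(theta),
    # approximated on a dense grid with a log-sum-exp trick for stability.
    if grid is None:
        grid = np.linspace(-10, 10, 20001)
    log_prior = -0.5 * (grid / prior_std) ** 2 - np.log(prior_std * np.sqrt(2 * np.pi))
    log_integrand = np.array([log_likelihood(th, x, t) for th in grid]) + log_prior
    m, dx = log_integrand.max(), grid[1] - grid[0]
    return m + np.log(np.sum(np.exp(log_integrand - m)) * dx)
```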
Bayesian Model Comparison
• It is insightful to note that the model evidence naturally
incorporates Occam’s razor, trading off model complexity and
goodness of fit, as illustrated in the following diagram.



The evidence approximation

• In a fully Bayesian treatment of the linear basis function model, we would introduce
prior distributions over the hyperparameters α and β and integrate them out along
with the parameters w.
• The evidence approximation (also known as empirical Bayes or type-2 maximum
likelihood) instead sets the hyperparameters to the values that maximize the marginal
likelihood (evidence) obtained by integrating out the parameters w.
