
Linear regression and classification

Lectures 3-4

Pekka Parviainen

University of Bergen

30.8.2021, 31.8.2021
Learning goals

I After completing this module, a student can


I apply linear models to regression and classification tasks
I apply linear models to non-linear data using basis functions
I explain bias-variance tradeoff and its relation to overfitting
I minimize objective functions using gradient descent
Agenda

I Linear regression
I Non-linear regression with basis functions
I Bias-variance tradeoff
I Logistic regression for classification
Linear regression
I Simple and widely studied models
I Often computationally convenient

Source: https://medium.com/pharos-production/machine-learning-linear-models-part-1-312757aab7bc
Regression

I Training data consist of n pairs of observations
  (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_i ∈ R^d is a d-dimensional
  feature vector (predictors, input variables, independent
  variables, covariates) and y_i ∈ R is a response variable (label,
  output variable, dependent variable)
I Note that we refer to observations with a subscript, i.e., x_i,
  and to features with a superscript in parentheses; that is, x^(j) is
  the jth element of the vector x
I In regression, we predict response values using a function f:

  y_i = f(x_i) + ε_i,

  where ε_i is the error (or residual) for the ith observation


Univariate linear regression

I In the univariate case, we can write

  y_i = w · x_i + b + ε_i,

Source: https://medium.com/all-of-us-are-belong-to-machines/

gentlest-intro-to-tensorflow-part-3-matrices-multi-feature-linear-regression-30a81ebaaa6c
Multivariate linear regression

Source: http://nbviewer.jupyter.org/urls/s3.amazonaws.com/datarobotblog/notebooks/multiple_

regression_in_python.ipynb
Multivariate linear regression

I Learn a linear function f : Rd → R for arbitrary d


I Instead of a line, we have a hyperplane
I Data matrix X ∈ R^{n×d} has the n instances x_i^T as its rows, and
  y ∈ R^n has the corresponding labels y_i
I X is often called the design matrix
I Weights are stored in w ∈ Rd
I Useful trick: x can include a constant term,
  x = (1, x^(1), x^(2), ..., x^(d))^T, so that the intercept is
  handled automatically:

  w^T x = w^(0) + w^(1) x^(1) + ... + w^(d) x^(d)

  Note that x^(j) denotes the jth feature


Univariate linear regression

I How to choose w and b?


I Loss function L(y , f (x)) measures “cost” of predicting f (x)
when the true label is y
I For example, quadratic loss L(y , f (x)) = (y − f (x))2
I Empirical risk minimization (ERM): minimize loss on training
data
I Minimize the sum of squared errors on the training set
  E(w, b) = \sum_{i=1}^n ε_i^2 = \sum_{i=1}^n (y_i - (w x_i + b))^2

  or, equivalently, minimize the mean squared error
  \frac{1}{n} \sum_{i=1}^n (y_i - (w x_i + b))^2
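
As a concrete illustration (not from the original slides), a minimal NumPy sketch of this objective; x and y are assumed to be 1-D arrays of equal length:

    import numpy as np

    def sum_squared_errors(w, b, x, y):
        """Sum of squared errors E(w, b) for univariate linear regression."""
        residuals = y - (w * x + b)
        return np.sum(residuals ** 2)

    def mean_squared_error(w, b, x, y):
        """The equivalent criterion, up to the constant factor 1/n."""
        return sum_squared_errors(w, b, x, y) / len(x)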
Univariate linear regression

I Estimates for the parameters can be found by solving

  (ŵ, b̂) = \arg\min_{w, b} E(w, b)

I This is called the ordinary least squares solution; it equals the


maximum likelihood solution assuming Gaussian noise
Univariate linear regression

I Goal: minimize E (w , b) by choosing values for parameters w


and b
I Solution: compute partial derivatives (gradient) and set them
zeros.
I Partial derivative w.r.t. b:

  \frac{∂E(w, b)}{∂b} = -2 \sum_{i=1}^n (y_i - w x_i - b)

I Set to zero and solve. We get

  b̂ = ȳ - w x̄,

  where ȳ = (1/n) \sum_i y_i and x̄ = (1/n) \sum_i x_i
Univariate linear regression

I Similarly, partial derivative w.r.t. w :


  \frac{∂E(w, b)}{∂w} = -2 \sum_{i=1}^n x_i (y_i - w x_i - b)

I Set to zero and plug in b̂, get


  \sum_{i=1}^n x_i (y_i - w x_i - ȳ + w x̄) = 0

I Solve and get

  ŵ = \frac{\sum_{i=1}^n x_i (y_i - ȳ)}{\sum_{i=1}^n x_i (x_i - x̄)}
Univariate linear regression

I After some manipulation, we get


  ŵ = \frac{\sum_{i=1}^n (x_i - x̄)(y_i - ȳ)}{\sum_{i=1}^n (x_i - x̄)^2}

I Notice that ŵ = σ_xy / σ_xx, where σ_pq is the sample covariance
  between p and q:

  σ_pq = \frac{1}{n-1} \sum_{i=1}^n (p_i - p̄)(q_i - q̄)
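
A minimal sketch of the closed-form univariate estimates above, with a small synthetic example (the toy data and true parameters are assumed for illustration):

    import numpy as np

    def fit_univariate_ols(x, y):
        """Closed-form least-squares estimates for y = w*x + b."""
        x_bar, y_bar = x.mean(), y.mean()
        w_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
        b_hat = y_bar - w_hat * x_bar
        return w_hat, b_hat

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)
    print(fit_univariate_ols(x, y))  # roughly (2.0, 1.0)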
Multivariate linear regression

I We have
  y_i = w^T x_i + ε_i
I Goal: choose weights w to minimize the sum of squared errors

  \sum_{i=1}^n ε_i^2 = \sum_{i=1}^n (y_i - w^T x_i)^2 = ||ε||_2^2
Multivariate linear regression

I Setting gradient to zero and solving the equations, we get

ŵ = (XT X)−1 XT y,

where A−1 denotes a matrix inverse of matrix A (AA−1 = I ).


I If columns of X are linearly independent then the matrix XT X
is of full rank and has an inverse
I If n < d, this is not true and a unique solution does not exist
I Regularization might help
I X^T X is a d × d matrix and inversion typically takes Θ(d^3)
  time. This can be too much for high-dimensional cases.
I Beware: matrix inversion can be numerically unstable
I Note: there are other ways to solve this problem (we practice
one of them in exercises)
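
A sketch of two ways to compute ŵ in NumPy, assuming X^T X is invertible: solving the normal equations directly, and using a least-squares solver (one of the "other ways" alluded to above):

    import numpy as np

    def fit_ols(X, y):
        """Solve the normal equations X^T X w = X^T y without forming an explicit inverse."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    def fit_ols_lstsq(X, y):
        """A numerically more robust alternative based on a least-squares solver."""
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w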
Loss functions
I So far we have used squared loss
I Computationally convenient → widely used
I Sensitive to outliers
I Another common option is the absolute loss: |y − f̂(x)|
I Corresponds to maximum likelihood estimate with Laplace
distributed residuals
I Less sensitive to outliers
I Computations become more complex
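
A small illustration (the example values are arbitrary) of why the squared loss is more sensitive to outliers than the absolute loss:

    import numpy as np

    def squared_loss(y, y_hat):
        return (y - y_hat) ** 2

    def absolute_loss(y, y_hat):
        return np.abs(y - y_hat)

    # A single large residual dominates the squared loss much more strongly:
    print(squared_loss(100.0, 0.0), absolute_loss(100.0, 0.0))  # 10000.0 vs 100.0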
Non-linearities
I Linear models are computationally convenient but, being simple,
  they cannot capture complex non-linear relations
I Solution: apply a non-linear transformation to the data and perform
  linear regression on the transformed data

Source: https://www.slideshare.net/PeriklisGogas/presentation-machine-learning-56402108
Non-linearities

I That is,

  f(x) = w^T φ(x),

  where φ: R^d → R^{d'} is a non-linear transform
I Typically, d' is larger than d
I Computations are done the same way as before but x is
replaced by φ(x) (and the dimension of w may change)
I For example, if φ(x) = (1, x, x^2) then

  X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}
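
A minimal sketch of building this polynomial design matrix with NumPy; the weights can then be fitted with ordinary least squares exactly as before:

    import numpy as np

    def polynomial_design_matrix(x, degree):
        """Rows (1, x_i, x_i^2, ..., x_i^degree) for a 1-D array x."""
        return np.vander(x, N=degree + 1, increasing=True)

    x = np.array([1.0, 2.0, 3.0])
    print(polynomial_design_matrix(x, degree=2))
    # [[1. 1. 1.]
    #  [1. 2. 4.]
    #  [1. 3. 9.]]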
Basis functions

I Examples:
I Polynomials of degree k: φ(x) = (1, x, x^2, ..., x^k)
I Interactions between two features:
  φ((x^(1), x^(2))) = (1, x^(1), x^(2), x^(1) x^(2))
I Radial basis functions, e.g., Gaussian: φ(x) = e^{-||x - c||^2},
  where c is called a center point
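
A sketch of Gaussian RBF features with unit width (the width is an assumption, as the slide leaves it implicit); X is assumed to have shape (n, d) and centers shape (k, d):

    import numpy as np

    def gaussian_rbf_features(X, centers):
        """exp(-||x - c||^2) for each row x of X and each center c; returns shape (n, k)."""
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists)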
Basis functions

I Pros
I Adds flexibility
I Cons
I As the dimensionality grows, the computational requirements
  grow as well
I How to choose appropriate basis functions?
Feature engineering

I Use domain knowledge to construct transformations of the


original features
I Can be achieved by basis functions
I Often critical to the performance of machine learning
algorithms

"Coming up with features is difficult, time-consuming, requires
expert knowledge. 'Applied machine learning' is basically
feature engineering."
                                               Andrew Ng
Bias-variance tradeoff

I How good is my model?


I What if we consider only the error on training data?
I A simple model will not be able to fit perfectly to the training
data
I A complex model typically fits better
I For example, in polynomial regression increasing the degree of
  the polynomial never worsens the fit on the training data
I A polynomial of degree n − 1 can fit n data points (with
  distinct x values) perfectly
I Best-fitting model is not necessarily best-predicting model
Overfitting

I Our goal is to find a model that predicts well unseen data


(generalization)
I Overfitting: ”the production of an analysis that corresponds
too closely or exactly to a particular set of data, and may
therefore fail to fit additional data or predict future
observations reliably”
I How do we know that a model is overfitting?
Bias-variance tradeoff
I Bias: Error due to modeling assumptions
I Variance: Error due to variations in the training set
I Simple models tend to have high bias and low variance
I Complex models tend to have low bias and high variance
I Goal: low bias and low variance

Source: Domingos
Bias-variance tradeoff of squared error

I Consider a training set D containing n points, each coming


from a distribution P(x, y )
I Assume the data are generated as y_i = f(x_i) + ε_i, for some
  function f and ε_i ∼ N(0, σ^2)
I Let fˆD : X → R be the model that our algorithm produces
from training data D
I Consider squared errors. Now, the expectation of error over all
possible training sets is

  E_D[(y - f̂_D(x))^2]


Bias-variance tradeoff of squared error

I After some computations, we get

  E_D[(y - f̂_D(x))^2] = (f(x) - E_D[f̂_D(x)])^2 + E_D[f̂_D(x)^2] - E_D[f̂_D(x)]^2 + σ^2
                      = Bias^2 + Variance + Irreducible error

I Bias: f(x) - E_D[f̂_D(x)]; error due to the modeling assumptions
I Variance: E_D[f̂_D(x)^2] - E_D[f̂_D(x)]^2; error due to variation
  between training sets
I Irreducible error: σ^2; error due to noise
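
The decomposition can be estimated empirically when the true f is known. A small simulation sketch (the true function, noise level, and polynomial model are assumptions for illustration): resample many training sets, fit a model to each, and measure bias^2 and variance of the predictions at one test point.

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.sin          # assumed "true" function
    sigma = 0.3         # noise standard deviation
    x0 = 0.5            # test point
    n, n_sets, degree = 30, 500, 3

    predictions = []
    for _ in range(n_sets):
        x = rng.uniform(-np.pi, np.pi, size=n)
        y = f(x) + rng.normal(scale=sigma, size=n)
        coeffs = np.polyfit(x, y, deg=degree)       # f_hat_D fitted to training set D
        predictions.append(np.polyval(coeffs, x0))
    predictions = np.array(predictions)

    bias_sq = (f(x0) - predictions.mean()) ** 2
    variance = predictions.var()
    print(bias_sq, variance, sigma ** 2)  # Bias^2, Variance, Irreducible error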
Bias-variance tradeoff
I Exact bias-variance decomposition not possible in practice
because the true function is not known
I Diagnosis using training and test errors
I High training error, high test error → underfitting
I Low training error, high test error → overfitting

Source: Hastie et al.


Simple vs. complex models

I Simple models cannot predict complex phenomena accurately


I Complex models can predict complex phenomena but they
require lots of training data
I The more training data you have, the more complex models
  you can learn
Regularization
I To reduce overfitting, one may want to penalize complexity
I Basically, regularization is any modification to the objective
  function (or, more generally, to the learning algorithm) that is
  intended to reduce generalization error but not training error
I For example, the objective function for linear regression with
  an L2-regularizer can be written as

  ||y - Xw||_2^2 + λ||w||_2^2

  where the hyperparameter λ ≥ 0 specifies the strength of
  regularization
I Regression with L2-regularizer is called ridge regression
I Another common choice is L1-regularizer, that is, lasso
I Difference between them is that L2 tends to push all values to
smaller ones, whereas L1 pushes some values to zeros
I That is, lasso does feature selection and is preferred when data
has lots of features
I How to choose λ?
I For example, use cross-validation (more next week)
Regularized linear regression

I With the L2-regularizer we get the objective

  ||y - Xw||_2^2 + λ||w||_2^2

I Note that the more data points you have, the smaller the
  effect of regularization (for a fixed λ)
I Setting gradient to zero and solving the equations we get

ŵ = (XT X + λI)−1 XT y
I L1-regularized regression does not have a closed-form solution
I The problem is convex (a local optimum is also a global optimum)
  and there exist efficient algorithms for solving it
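
A minimal sketch of the closed-form ridge solution above (assuming, for simplicity, that any constant column in X is included and regularized like the other columns):

    import numpy as np

    def fit_ridge(X, y, lam):
        """Ridge regression: w = (X^T X + lambda * I)^{-1} X^T y, via a linear solve."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)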
Linear models for classification

I Using standard linear regression is problematic in predicting


categorical responses
I We can use a variant called logistic regression instead

Source:

https://www.machinelearningplus.com/wp-content/uploads/2017/09/linear_vs_logistic_regression.jpg
Logistic regression

I Consider binary classification problem


I yi ∈ {0, 1}
I Let p denote our estimate of the probability P(y = 1|x)
I Logistic regression models

  log \frac{p}{1-p} = w^T x
I Or, equivalently,

  P(y = 1|x) = σ(w^T x),

  where σ(·) is the so-called logistic sigmoid

  σ(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}
Sigmoid function

  σ(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}

Source: Wikipedia
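
A sketch of the sigmoid in NumPy; the branching is an implementation choice (not from the slides) that avoids overflow for large |z|:

    import numpy as np

    def sigmoid(z):
        """Logistic sigmoid, using whichever of the two equivalent forms avoids overflow."""
        z = np.asarray(z, dtype=float)
        out = np.empty_like(z)
        pos = z >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))
        return out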
Logistic regression for classification

I When used in classification, the decision boundary is defined
  by P(y = 1|x) = P(y = 0|x) = 0.5. This corresponds to a
  hyperplane w^T x = 0
I Classification rule:

  w^T x > 0 → y = 1
  w^T x < 0 → y = 0
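
The classification rule as a short sketch, assuming X is the design matrix with rows x_i^T:

    import numpy as np

    def predict_class(w, X):
        """Predict 1 where w^T x > 0, otherwise 0."""
        return (X @ w > 0).astype(int)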
Learning parameters
I Conditional likelihood of data y given X is

  P(y|X) = \prod_{i=1}^n P(y_i = 1|x_i)^{y_i} (1 - P(y_i = 1|x_i))^{1-y_i}
         = \prod_{i=1}^n σ(w^T x_i)^{y_i} (1 - σ(w^T x_i))^{1-y_i}

I Maximizing the likelihood is equivalent to maximizing the


log-likelihood
  ℓ(w) = \sum_{i=1}^n [ y_i log σ(w^T x_i) + (1 - y_i) log(1 - σ(w^T x_i)) ]

I Equivalent to minimizing logarithmic loss (log-loss)


  L(w) = - \sum_{i=1}^n [ y_i log σ(w^T x_i) + (1 - y_i) log(1 - σ(w^T x_i)) ]
Learning parameters

I The log-loss does not have a closed-form solution


I However, it is convex and it is possible to use gradient-based
methods
Gradient descent
I An iterative method for (approximately) finding a local minimum
  of a multivariate function
I A maximum can be found using an analogous gradient ascent
  technique
I Let f(x) be a differentiable function. Then, in the
  neighborhood of a point a, f(x) decreases fastest if one moves in
  the direction of the negative gradient of f(x), that is, -∇f(a)
I Iteratively,

  a_{t+1} = a_t - γ_t ∇f(a_t)
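
A generic gradient descent sketch with a fixed step size (the quadratic test function in the example is assumed for illustration):

    import numpy as np

    def gradient_descent(grad, a0, learning_rate=0.1, n_iters=1000):
        """Iterate a_{t+1} = a_t - gamma * grad(a_t) for a fixed number of steps."""
        a = np.asarray(a0, dtype=float)
        for _ in range(n_iters):
            a = a - learning_rate * grad(a)
        return a

    # Example: minimize f(a) = ||a||^2, whose gradient is 2a; the minimum is at the origin.
    print(gradient_descent(lambda a: 2 * a, a0=[3.0, -2.0]))  # close to [0, 0]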
Learning rate

I Choosing a good value for the learning rate (step-size) γ is


important
I If γ is too small, convergence takes too long
I If γ is too large, the process may "jump over" the local minimum

Source: https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0
Gradient descent for logistic regression

I Find parameters w that minimize the loss function


I Derivative of the sigmoid function is

  σ'(z) = σ(z)(1 - σ(z))
I The gradient of the log-loss is

  ∇L(w) = \sum_{i=1}^n x_i (σ(w^T x_i) - y_i)

I Iteratively,

  w_{t+1} = w_t - γ_t \sum_{i=1}^n x_i (σ(w_t^T x_i) - y_i)
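
Putting the pieces together, a sketch of logistic regression trained with full-batch gradient descent; the fixed learning rate and iteration count are assumptions and may need tuning for a given dataset:

    import numpy as np

    def sigmoid(z):
        # the numerically more stable version sketched earlier could be used instead
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, learning_rate=0.01, n_iters=5000):
        """Gradient descent on the log-loss: w <- w - gamma * sum_i x_i (sigma(w^T x_i) - y_i)."""
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            gradient = X.T @ (sigmoid(X @ w) - y)
            w = w - learning_rate * gradient
        return w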
Summary

I Linear regression is a simple model for regression problems


I Non-linearities can be modeled using basis functions
I Logistic regression provides a way to use linear models in
classification tasks
I Prediction error consists of bias, variance, and irreducible error
I Simple models tend to have high bias and low variance;
complex models low bias and high variance
I We want to find a model with low bias and low variance
I Regularization is one way to control complexity of a model
Sources

I These slides are partially based on


I https://courses.helsinki.fi/sites/default/files/
course-material/4524945/chapter02.pdf
I https://mycourses.aalto.fi/pluginfile.php/427759/
mod_folder/content/0/lecture4.pdf?forcedownload=1
