
Linear regression and classification

Lectures 3-4

Pekka Parviainen

University of Bergen

30.8.2021, 31.8.2021
Learning goals

I After completing this module, a student can


I apply linear models to regression and classification tasks
I apply linear models to non-linear data using basis functions
I explain bias-variance tradeoff and its relation to overfitting
I minimize objective functions using gradient descent
Agenda

I Linear regression
I Non-linear regression with basis functions
I Bias-variance tradeoff
I Logistic regression for classification
Linear regression
I Simple and widely studied models
I Often computationally convenient

Source: https://medium.com/pharos-production/machine-learning-linear-models-part-1-312757aab7bc
Regression

I Training data consist of n pairs of observations
  (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x_i ∈ R^d is a d-dimensional
  feature vector (predictors, input variables, independent
  variables, covariates) and y_i ∈ R is a response variable (label,
  output variable, dependent variable)
I Note that we refer to observations with a subscript, i.e., x_i,
  and to features with a superscript in parentheses; that is, x^(j) is
  the jth element of the vector x
I In regression, we predict response values using a function f:

  y_i = f(x_i) + ε_i,

  where ε_i is the error (or residual) for the ith observation


Univariate linear regression

I In the univariate case, we can write

  y_i = w · x_i + b + ε_i,

Source: https://medium.com/all-of-us-are-belong-to-machines/

gentlest-intro-to-tensorflow-part-3-matrices-multi-feature-linear-regression-30a81ebaaa6c
Multivariate linear regression

Source: http://nbviewer.jupyter.org/urls/s3.amazonaws.com/datarobotblog/notebooks/multiple_

regression_in_python.ipynb
Multivariate linear regression

I Learn a linear function f : Rd → R for arbitrary d


I Instead of a line, we have a hyperplane
I Data matrix X ∈ R^{n×d} has the n instances x_i^T as its rows, and
  y ∈ R^n has the corresponding labels y_i
I X is often called the design matrix
I Weights are stored in w ∈ Rd
I Useful trick: x can include a constant term,
  x = (1, x^(1), x^(2), ..., x^(d))^T, so that the intercept is
  handled automatically:

  w^T x = w^(0) + w^(1) x^(1) + ... + w^(d) x^(d)

  Note that x^(j) denotes the jth feature


Univariate linear regression

I How to choose w and b?


I Loss function L(y , f (x)) measures “cost” of predicting f (x)
when the true label is y
I For example, quadratic loss L(y , f (x)) = (y − f (x))2
I Empirical risk minimization (ERM): minimize loss on training
data
I Minimize the sum of squared errors on the training set
  E(w, b) = \sum_{i=1}^n ε_i^2 = \sum_{i=1}^n (y_i - (w x_i + b))^2

  or, equivalently, minimize the mean squared error
  \frac{1}{n} \sum_{i=1}^n (y_i - (w x_i + b))^2
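
As a concrete illustration (not from the original slides), a minimal NumPy sketch of this objective; x and y are assumed to be 1-D arrays of equal length:

    import numpy as np

    def sum_squared_errors(w, b, x, y):
        """Sum of squared errors E(w, b) for univariate linear regression."""
        residuals = y - (w * x + b)
        return np.sum(residuals ** 2)

    def mean_squared_error(w, b, x, y):
        """The equivalent criterion, up to the constant factor 1/n."""
        return sum_squared_errors(w, b, x, y) / len(x)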
Univariate linear regression

I Estimates for the parameters can be found by solving

  (ŵ, b̂) = \arg\min_{w, b} E(w, b)

I This is called the ordinary least squares solution; it equals the


maximum likelihood solution assuming Gaussian noise
Univariate linear regression

I Goal: minimize E (w , b) by choosing values for parameters w


and b
I Solution: compute partial derivatives (gradient) and set them
zeros.
I Partial derivative w.r.t. b:

  \frac{∂E(w, b)}{∂b} = -2 \sum_{i=1}^n (y_i - w x_i - b)

I Set to zero and solve. We get

  b̂ = ȳ - w x̄,

  where ȳ = (1/n) \sum_i y_i and x̄ = (1/n) \sum_i x_i
Univariate linear regression

I Similarly, partial derivative w.r.t. w :


  \frac{∂E(w, b)}{∂w} = -2 \sum_{i=1}^n x_i (y_i - w x_i - b)

I Set to zero and plug in b̂, get


  \sum_{i=1}^n x_i (y_i - w x_i - ȳ + w x̄) = 0

I Solve and get

  ŵ = \frac{\sum_{i=1}^n x_i (y_i - ȳ)}{\sum_{i=1}^n x_i (x_i - x̄)}
Univariate linear regression

I After some manipulation, we get


  ŵ = \frac{\sum_{i=1}^n (x_i - x̄)(y_i - ȳ)}{\sum_{i=1}^n (x_i - x̄)^2}

I Notice that ŵ = σ_xy / σ_xx, where σ_pq is the sample covariance
  between p and q:

  σ_pq = \frac{1}{n-1} \sum_{i=1}^n (p_i - p̄)(q_i - q̄)
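
A minimal sketch of the closed-form univariate estimates above, with a small synthetic example (the toy data and true parameters are assumed for illustration):

    import numpy as np

    def fit_univariate_ols(x, y):
        """Closed-form least-squares estimates for y = w*x + b."""
        x_bar, y_bar = x.mean(), y.mean()
        w_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
        b_hat = y_bar - w_hat * x_bar
        return w_hat, b_hat

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)
    print(fit_univariate_ols(x, y))  # roughly (2.0, 1.0)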
Multivariate linear regression

I We have
  y_i = w^T x_i + ε_i
I Goal: choose weights w to minimize the sum of squared errors

  \sum_{i=1}^n ε_i^2 = \sum_{i=1}^n (y_i - w^T x_i)^2 = ||ε||_2^2
Multivariate linear regression

I Setting gradient to zero and solving the equations, we get

ŵ = (XT X)−1 XT y,

where A−1 denotes a matrix inverse of matrix A (AA−1 = I ).


I If columns of X are linearly independent then the matrix XT X
is of full rank and has an inverse
I If n < d, this is not true and a unique solution does not exist
I Regularization might help
I X^T X is a d × d matrix and inversion typically takes Θ(d^3)
  time. This can be too much for high-dimensional cases.
I Beware: matrix inversion can be numerically unstable
I Note: there are other ways to solve this problem (we practice
one of them in exercises)
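
A sketch of two ways to compute ŵ in NumPy, assuming X^T X is invertible: solving the normal equations directly, and using a least-squares solver (one of the "other ways" alluded to above):

    import numpy as np

    def fit_ols(X, y):
        """Solve the normal equations X^T X w = X^T y without forming an explicit inverse."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    def fit_ols_lstsq(X, y):
        """A numerically more robust alternative based on a least-squares solver."""
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        return w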
Loss functions
I So far we have used squared loss
I Computationally convenient → widely used
I Sensitive to outliers
I Another common option is the absolute loss: |y − f̂(x)|
I Corresponds to maximum likelihood estimate with Laplace
distributed residuals
I Less sensitive to outliers
I Computations become more complex
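
A small illustration (the example values are arbitrary) of why the squared loss is more sensitive to outliers than the absolute loss:

    import numpy as np

    def squared_loss(y, y_hat):
        return (y - y_hat) ** 2

    def absolute_loss(y, y_hat):
        return np.abs(y - y_hat)

    # A single large residual dominates the squared loss much more strongly:
    print(squared_loss(100.0, 0.0), absolute_loss(100.0, 0.0))  # 10000.0 vs 100.0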
Non-linearities
I Linear models are computationally convenient but, being simple,
  they cannot capture complex non-linear relations
I Solution: apply a non-linear transformation to the data and perform
  linear regression on the transformed data

Source: https://www.slideshare.net/PeriklisGogas/presentation-machine-learning-56402108
Non-linearities

I That is,

  f(x) = w^T φ(x),

  where φ: R^d → R^{d'} is a non-linear transform
I Typically, d' is larger than d
I Computations are done the same way as before but x is
replaced by φ(x) (and the dimension of w may change)
I For example, if φ(x) = (1, x, x^2) then

  X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}
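
A minimal sketch of building this polynomial design matrix with NumPy; the weights can then be fitted with ordinary least squares exactly as before:

    import numpy as np

    def polynomial_design_matrix(x, degree):
        """Rows (1, x_i, x_i^2, ..., x_i^degree) for a 1-D array x."""
        return np.vander(x, N=degree + 1, increasing=True)

    x = np.array([1.0, 2.0, 3.0])
    print(polynomial_design_matrix(x, degree=2))
    # [[1. 1. 1.]
    #  [1. 2. 4.]
    #  [1. 3. 9.]]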
Basis functions

I Examples:
I Polynomials of degree k: φ(x) = (1, x, x^2, ..., x^k)
I Interactions between two features:
  φ((x^(1), x^(2))) = (1, x^(1), x^(2), x^(1) x^(2))
I Radial basis functions, e.g., Gaussian: φ(x) = e^{-||x - c||^2},
  where c is called a center point
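
A sketch of Gaussian RBF features with unit width (the width is an assumption, as the slide leaves it implicit); X is assumed to have shape (n, d) and centers shape (k, d):

    import numpy as np

    def gaussian_rbf_features(X, centers):
        """exp(-||x - c||^2) for each row x of X and each center c; returns shape (n, k)."""
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists)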
Basis functions

I Pros
I Adds flexibility
I Cons
I As the dimensionality grows, the computational requirements
  grow as well
I How to choose appropriate basis functions?
Feature engineering

I Use domain knowledge to construct transformations of the


original features
I Can be achieved by basis functions
I Often critical to the performance of machine learning
algorithms

"Coming up with features is difficult, time-consuming, requires
expert knowledge. 'Applied machine learning' is basically
feature engineering."
                                               Andrew Ng
Bias-variance tradeoff

I How good is my model?


I What if we consider only the error on training data?
I A simple model will not be able to fit perfectly to the training
data
I A complex model typically fits better
I For example, in polynomial regression increasing the degree of
  the polynomial never worsens the fit on the training data
I A polynomial of degree n − 1 can fit n data points (with
  distinct x values) perfectly
I Best-fitting model is not necessarily best-predicting model
Overfitting

I Our goal is to find a model that predicts well unseen data


(generalization)
I Overfitting: ”the production of an analysis that corresponds
too closely or exactly to a particular set of data, and may
therefore fail to fit additional data or predict future
observations reliably”
I How do we know that a model is overfitting?
Bias-variance tradeoff
I Bias: Error due to modeling assumptions
I Variance: Error due to variations in the training set
I Simple models tend to have high bias and low variance
I Complex models tend to have low bias and high variance
I Goal: low bias and low variance

Source: Domingos
Bias-variance tradeoff of squared error

I Consider a training set D containing n points, each coming


from a distribution P(x, y )
I Assume the data are generated as y_i = f(x_i) + ε_i, for some
  function f and ε_i ∼ N(0, σ^2)
I Let fˆD : X → R be the model that our algorithm produces
from training data D
I Consider squared errors. Now, the expectation of error over all
possible training sets is

  E_D[(y - f̂_D(x))^2]


Bias-variance tradeoff of squared error

I After some computations, we get

  E_D[(y - f̂_D(x))^2] = (f(x) - E_D[f̂_D(x)])^2 + E_D[f̂_D(x)^2] - E_D[f̂_D(x)]^2 + σ^2
                      = Bias^2 + Variance + Irreducible error

I Bias: f(x) - E_D[f̂_D(x)]; error due to the modeling assumptions
I Variance: E_D[f̂_D(x)^2] - E_D[f̂_D(x)]^2; error due to variation
  between training sets
I Irreducible error: σ^2; error due to noise
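
The decomposition can be estimated empirically when the true f is known. A small simulation sketch (the true function, noise level, and polynomial model are assumptions for illustration): resample many training sets, fit a model to each, and measure bias^2 and variance of the predictions at one test point.

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.sin          # assumed "true" function
    sigma = 0.3         # noise standard deviation
    x0 = 0.5            # test point
    n, n_sets, degree = 30, 500, 3

    predictions = []
    for _ in range(n_sets):
        x = rng.uniform(-np.pi, np.pi, size=n)
        y = f(x) + rng.normal(scale=sigma, size=n)
        coeffs = np.polyfit(x, y, deg=degree)       # f_hat_D fitted to training set D
        predictions.append(np.polyval(coeffs, x0))
    predictions = np.array(predictions)

    bias_sq = (f(x0) - predictions.mean()) ** 2
    variance = predictions.var()
    print(bias_sq, variance, sigma ** 2)  # Bias^2, Variance, Irreducible error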
Bias-variance tradeoff
I Exact bias-variance decomposition not possible in practice
because the true function is not known
I Diagnosis using training and test errors
I High training error, high test error → underfitting
I Low training error, high test error → overfitting

Source: Hastie et al.


Simple vs. complex models

I Simple models cannot predict complex phenomena accurately


I Complex models can predict complex phenomena but they
require lots of training data
I The more training data you have, the more complex models
  you can learn
Regularization
I To reduce overfitting, one may want to penalize complexity
I Basically, regularization is any modification to the objective
  function (or, more generally, to the learning algorithm) that is
  intended to reduce generalization error but not training error
I For example, the objective function for linear regression with
  an L2-regularizer can be written as

  ||y - Xw||_2^2 + λ||w||_2^2

  where the hyperparameter λ ≥ 0 specifies the strength of
  regularization
I Regression with L2-regularizer is called ridge regression
I Another common choice is L1-regularizer, that is, lasso
I Difference between them is that L2 tends to push all values to
smaller ones, whereas L1 pushes some values to zeros
I That is, lasso does feature selection and is preferred when data
has lots of features
I How to choose λ?
I For example, use cross-validation (more next week)
Regularized linear regression

I With the L2-regularizer we get the objective

  ||y - Xw||_2^2 + λ||w||_2^2

I Note that the more data points you have, the smaller the
  effect of regularization (for a fixed λ)
I Setting gradient to zero and solving the equations we get

ŵ = (XT X + λI)−1 XT y
I L1-regularized regression does not have a closed-form solution
I The problem is convex (a local optimum is also a global optimum)
  and there exist efficient algorithms for solving it
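
A minimal sketch of the closed-form ridge solution above (assuming, for simplicity, that any constant column in X is included and regularized like the other columns):

    import numpy as np

    def fit_ridge(X, y, lam):
        """Ridge regression: w = (X^T X + lambda * I)^{-1} X^T y, via a linear solve."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)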
Linear models for classification

I Using standard linear regression is problematic in predicting


categorical responses
I We can use a variant called logistic regression instead

Source:

https://www.machinelearningplus.com/wp-content/uploads/2017/09/linear_vs_logistic_regression.jpg
Logistic regression

I Consider binary classification problem


I yi ∈ {0, 1}
I Let p denote our estimate of the probability P(y = 1|x)
I Logistic regression models

  log \frac{p}{1-p} = w^T x
I Or, equivalently,

  P(y = 1|x) = σ(w^T x),

  where σ(·) is the so-called logistic sigmoid

  σ(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}
Sigmoid function

  σ(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}

Source: Wikipedia
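
A sketch of the sigmoid in NumPy; the branching is an implementation choice (not from the slides) that avoids overflow for large |z|:

    import numpy as np

    def sigmoid(z):
        """Logistic sigmoid, using whichever of the two equivalent forms avoids overflow."""
        z = np.asarray(z, dtype=float)
        out = np.empty_like(z)
        pos = z >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        out[~pos] = np.exp(z[~pos]) / (1.0 + np.exp(z[~pos]))
        return out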
Logistic regression for classification

I When used in classification, the decision boundary is defined
  by P(y = 1|x) = P(y = 0|x) = 0.5. This corresponds to a
  hyperplane w^T x = 0
I Classification rule:

  w^T x > 0 → y = 1
  w^T x < 0 → y = 0
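
The classification rule as a short sketch, assuming X is the design matrix with rows x_i^T:

    import numpy as np

    def predict_class(w, X):
        """Predict 1 where w^T x > 0, otherwise 0."""
        return (X @ w > 0).astype(int)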
Learning parameters
I Conditional likelihood of data y given X is

  P(y|X) = \prod_{i=1}^n P(y_i = 1|x_i)^{y_i} (1 - P(y_i = 1|x_i))^{1-y_i}
         = \prod_{i=1}^n σ(w^T x_i)^{y_i} (1 - σ(w^T x_i))^{1-y_i}

I Maximizing the likelihood is equivalent to maximizing the


log-likelihood
  ℓ(w) = \sum_{i=1}^n [ y_i log σ(w^T x_i) + (1 - y_i) log(1 - σ(w^T x_i)) ]

I Equivalent to minimizing logarithmic loss (log-loss)


  L(w) = - \sum_{i=1}^n [ y_i log σ(w^T x_i) + (1 - y_i) log(1 - σ(w^T x_i)) ]
Learning parameters

I The log-loss does not have a closed-form solution


I However, it is convex and it is possible to use gradient-based
methods
Gradient descent
I An iterative method for (approximately) finding a local minimum
  of a multivariate function
I A maximum can be found using an analogous gradient ascent
  technique
I Let f(x) be a differentiable function. Then, in the
  neighborhood of a point a, f(x) decreases fastest if one moves in
  the direction of the negative gradient of f(x), that is, -∇f(a)
I Iteratively,

  a_{t+1} = a_t - γ_t ∇f(a_t)
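
A generic gradient descent sketch with a fixed step size (the quadratic test function in the example is assumed for illustration):

    import numpy as np

    def gradient_descent(grad, a0, learning_rate=0.1, n_iters=1000):
        """Iterate a_{t+1} = a_t - gamma * grad(a_t) for a fixed number of steps."""
        a = np.asarray(a0, dtype=float)
        for _ in range(n_iters):
            a = a - learning_rate * grad(a)
        return a

    # Example: minimize f(a) = ||a||^2, whose gradient is 2a; the minimum is at the origin.
    print(gradient_descent(lambda a: 2 * a, a0=[3.0, -2.0]))  # close to [0, 0]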
Learning rate

I Choosing a good value for the learning rate (step-size) γ is


important
I If γ is too small, convergence takes too long
I If γ is too large, the process may "jump over" the local minimum

Source: https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0
Gradient descent for logistic regression

I Find parameters w that minimize the loss function


I Derivative of the sigmoid function is

  σ'(z) = σ(z)(1 - σ(z))
I The gradient of the log-loss is

  ∇L(w) = \sum_{i=1}^n x_i (σ(w^T x_i) - y_i)

I Iteratively,

  w_{t+1} = w_t - γ_t \sum_{i=1}^n x_i (σ(w_t^T x_i) - y_i)
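
Putting the pieces together, a sketch of logistic regression trained with full-batch gradient descent; the fixed learning rate and iteration count are assumptions and may need tuning for a given dataset:

    import numpy as np

    def sigmoid(z):
        # the numerically more stable version sketched earlier could be used instead
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_regression(X, y, learning_rate=0.01, n_iters=5000):
        """Gradient descent on the log-loss: w <- w - gamma * sum_i x_i (sigma(w^T x_i) - y_i)."""
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            gradient = X.T @ (sigmoid(X @ w) - y)
            w = w - learning_rate * gradient
        return w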
Summary

I Linear regression is a simple model for regression problems


I Non-linearities can be modeled using basis functions
I Logistic regression provides a way to use linear models in
classification tasks
I Prediction error consists of bias, variance, and irreducible error
I Simple models tend to have high bias and low variance;
complex models low bias and high variance
I We want to find a model with low bias and low variance
I Regularization is one way to control complexity of a model
Sources

I These slides are partially based on


I https://courses.helsinki.fi/sites/default/files/
course-material/4524945/chapter02.pdf
I https://mycourses.aalto.fi/pluginfile.php/427759/
mod_folder/content/0/lecture4.pdf?forcedownload=1
