Lectures 3-4
Pekka Parviainen
University of Bergen
30.8.2021, 31.8.2021
Learning goals

- Linear regression
- Non-linear regression with basis functions
- Bias-variance tradeoff
- Logistic regression for classification
Linear regression

- Simple and widely studied models
- Often computationally convenient
Source: https://medium.com/pharos-production/machine-learning-linear-models-part-1-312757aab7bc
Regression

The outputs are modeled as a function of the inputs plus noise,

$y_i = f(x_i) + \epsilon_i,$

and linear regression assumes the function is linear:

$y_i = w \cdot x_i + b + \epsilon_i,$

where $\epsilon_i$ is a noise term.
Source: https://medium.com/all-of-us-are-belong-to-machines/gentlest-intro-to-tensorflow-part-3-matrices-multi-feature-linear-regression-30a81ebaaa6c
Multivariate linear regression
Source: http://nbviewer.jupyter.org/urls/s3.amazonaws.com/datarobotblog/notebooks/multiple_regression_in_python.ipynb
Univariate linear regression
Goal: choose $w$ and $b$ to minimize the sum of squared errors or, equivalently, the mean squared error

$\frac{1}{n} \sum_{i=1}^{n} \left( y_i - (w x_i + b) \right)^2$
Univariate linear regression
$\hat{b} = \bar{y} - w \bar{x},$

where $\bar{y} = (1/n) \sum_i y_i$ and $\bar{x} = (1/n) \sum_i x_i$
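A minimal NumPy sketch of the univariate fit; the synthetic data is made up for illustration, and the slope formula $\hat{w} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ is the standard least-squares estimate, assumed here since only the intercept formula is shown above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2x + 1 + noise (illustrative values)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=50)

x_bar, y_bar = x.mean(), y.mean()

# Standard least-squares slope (assumed; not shown on the slide)
w_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept as above: b_hat = y_bar - w_hat * x_bar
b_hat = y_bar - w_hat * x_bar

print(f"w_hat = {w_hat:.3f}, b_hat = {b_hat:.3f}")  # roughly 2 and 1
```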
Multivariate linear regression
- We have

  $y_i = w^T x_i + \epsilon_i$

- Goal: choose weights $w$ to minimize the sum of squared errors

  $\sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - w^T x_i)^2 = ||\epsilon||_2^2$
Multivariate linear regression
The least-squares solution is given by the normal equations:

$\hat{w} = (X^T X)^{-1} X^T y$
Source: https://www.slideshare.net/PeriklisGogas/presentation-machine-learning-56402108
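As a sketch (not from the slides), the normal equations can be evaluated directly with NumPy; solving the linear system is used here instead of forming the matrix inverse explicitly, which is the numerically preferable route:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 100 points, 3 features plus a constant column for the bias
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(0.0, 0.1, size=n)

# Normal equations: solve (X^T X) w = X^T y rather than inverting X^T X
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```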
Non-linearities
- That is,

  $f(x) = w^T \phi(x),$

  where $\phi: \mathbb{R}^d \to \mathbb{R}^{d'}$ is a non-linear transform
- Typically, $d'$ is larger than $d$
- Computations are done the same way as before, but $x$ is replaced by $\phi(x)$ (and the dimension of $w$ may change)
- For example, if $\phi(x) = (1, x, x^2)$ then

  $X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}$
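A short sketch of this construction (the data values are assumed for illustration); once the design matrix is built from $\phi(x)$, the fit proceeds exactly as in the linear case:

```python
import numpy as np

# phi(x) = (1, x, x^2): each scalar input becomes one row of X
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.vander(x, N=3, increasing=True)  # columns: x^0, x^1, x^2
print(X)

# Fitting is unchanged, with phi(x) in place of x, e.g.:
# w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```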
Basis functions
- Examples:
  - Polynomials of degree $k$: $\phi(x) = (1, x, x^2, \ldots, x^k)$
  - Interactions between two features: $\phi((x^{(1)}, x^{(2)})) = (1, x^{(1)}, x^{(2)}, x^{(1)} x^{(2)})$
  - Radial basis functions, e.g., Gaussian: $\phi(x) = e^{-||x - c||^2}$, where $c$ is called a center point
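A sketch of Gaussian radial basis features following the formula above; the width parameter `gamma` and the hand-picked centers are assumptions added for this example (`gamma = 1` recovers the formula as written):

```python
import numpy as np

def rbf_features(x, centers, gamma=1.0):
    """phi_j(x) = exp(-gamma * ||x - c_j||^2) for each center c_j."""
    # x: (n, d) inputs, centers: (k, d) -> (n, k) feature matrix
    sq_dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

x = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
centers = np.array([[0.25], [0.75]])  # hand-picked for this sketch
print(rbf_features(x, centers))  # 5 inputs -> 5 x 2 feature matrix
```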
Basis functions
- Pros
  - Adds flexibility
- Cons
  - If the dimensionality grows, the computational requirements grow as well
  - How to choose appropriate basis functions?
Feature engineering
Source: Domingos
Bias-variance tradeoff of squared error
- Regularization trades variance for bias: for example, ridge (L2) regression adds a penalty $\lambda ||w||_2^2$ to the sum of squared errors
- Note that the more data points you have, the smaller the effect of regularization
- Setting the gradient to zero and solving the equations, we get

  $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$

- L1-regularized regression (the lasso) does not have a closed-form solution
- The problem is convex (a local optimum is also a global optimum), and there exist efficient algorithms for solving it
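A minimal sketch of the closed-form ridge solution above (the synthetic data is assumed for illustration); note how larger $\lambda$ shrinks the weights toward zero:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed form: w_hat = (X^T X + lam * I)^{-1} X^T y, via a linear solve."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -2.0, 0.0, 3.0]) + rng.normal(0.0, 0.1, size=50)

for lam in [0.0, 1.0, 100.0]:
    print(lam, ridge_fit(X, y, lam))  # weights shrink as lam grows
```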
Linear models for classification
Source: https://www.machinelearningplus.com/wp-content/uploads/2017/09/linear_vs_logistic_regression.jpg
Logistic regression

Logistic regression models the probability of class $y = 1$ as $P(y = 1 \mid x) = \sigma(w^T x)$, where $\sigma$ is the logistic (sigmoid) function

$\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}$
Sigmoid function
$\sigma(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}$
Source: Wikipedia
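An implementation note (not on the slides): evaluated naively, $e^{-x}$ overflows for large negative $x$. A common trick, sketched below, is to pick between the two algebraically equal forms above depending on the sign of $x$:

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^{-x}), computed without overflow."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    # For x >= 0, e^{-x} <= 1, so the 1 / (1 + e^{-x}) form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, e^{x} <= 1, so the e^{x} / (1 + e^{x}) form is safe
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

print(sigmoid(np.array([-100.0, 0.0, 100.0])))  # ~0, 0.5, ~1
```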
Logistic regression for classification
$w^T x > 0 \Rightarrow y = 1$
$w^T x < 0 \Rightarrow y = 0$

Since $\sigma(0) = 1/2$, this is equivalent to thresholding the predicted probability $\sigma(w^T x)$ at $1/2$.
Learning parameters
- The conditional likelihood of the data $y$ given $X$ is

  $P(y \mid X) = \prod_{i=1}^{n} P(y_i = 1 \mid x_i)^{y_i} \left( 1 - P(y_i = 1 \mid x_i) \right)^{1 - y_i} = \prod_{i=1}^{n} \sigma(w^T x_i)^{y_i} \left( 1 - \sigma(w^T x_i) \right)^{1 - y_i}$

- The weights $w$ are learned by maximizing this likelihood (equivalently, its logarithm); there is no closed-form solution, so gradient-based optimization is used
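In practice one works with the logarithm of this product, which turns it into a sum. A minimal sketch (the clipping constant `eps` is an implementation detail added here to avoid `log(0)`):

```python
import numpy as np

def log_likelihood(w, X, y, eps=1e-12):
    """sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ] with p_i = sigma(w^T x_i)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0) at extreme probabilities
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```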
Source: https://towardsdatascience.com/gradient-descent-in-a-nutshell-eaf8c18212f0
Gradient descent for logistic regression
- Iteratively update

  $w_{t+1} = w_t - \gamma_t \sum_{i=1}^{n} x_i \left( \sigma(w_t^T x_i) - y_i \right),$

  where $\gamma_t$ is the step size (learning rate)
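A minimal sketch of this update on synthetic data; the constant step size and the $1/n$ scaling of the gradient are choices made for this example, not part of the update rule above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, step=0.5, n_iters=2000):
    """Batch gradient descent for logistic regression (constant step size)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)  # sum_i x_i (sigma(w^T x_i) - y_i)
        w -= step / X.shape[0] * grad      # 1/n scaling keeps the step stable
    return w

rng = np.random.default_rng(3)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)

w_hat = fit_logistic(X, y)
print(w_hat)  # roughly aligned with w_true
y_pred = (X @ w_hat > 0).astype(float)  # classify as on the earlier slide
```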
Summary