
Regression

What is regression?
Linear regression

Learning

• A supervised algorithm that learns from a set of training samples.


• Each training sample has one or more input values and a single output value.

• The algorithm learns the line, plane or hyper-plane that best fits the training samples.

Prediction

• Use learned line, plane or hyper-plane to predict output value for any input sample.
Y = f(X), where f is a linear function

Example: given a set of data points, fit a line; out of the different linear functions
possible, we want to find the one that optimizes a certain criterion.


Types

● Simple linear regression (univariate)


○ y=f(x)

● Multiple linear regression (multivariate)


○ y=f(x1,x2, ….)

● Polynomial regression (regression curve is not a line)


○ y=f(x) with degree > 1
Linear regression
● Input training samples: (x,y) pairs
● y = f(x); the function f is characterised by some parameters
● y = mx + c (take simple linear regression)
● Training samples are used to derive the values of m (slope) and c (intercept) that
minimise the error between actual and predicted values of y.
● y = mx + c + ε (ε is a random error with mean 0 and standard deviation σ)
● Assumption: the errors ε1, ε2, …, εn are independent of each other. We can also
assume that these errors are normally distributed with mean 0 and standard
deviation σ. This sort of noise is called Gaussian noise or white noise.
Error surface

y=2*x+2 (x=range(1,100))

Find parameters such that error is minimum!
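As a rough sketch of what this error surface looks like (not from the slides; the grid ranges and variable names are assumptions), we can evaluate the sum of squared errors over a grid of candidate (m, c) values for this example:

```python
import numpy as np

# Generate the example data: y = 2*x + 2 for x in 1..99
x = np.arange(1, 100)
y = 2 * x + 2

# Evaluate the sum-of-squared-errors surface over a grid of (m, c) candidates
m_grid = np.linspace(0, 4, 81)
c_grid = np.linspace(-10, 14, 81)
sse = np.zeros((len(m_grid), len(c_grid)))
for i, m in enumerate(m_grid):
    for j, c in enumerate(c_grid):
        sse[i, j] = np.sum((y - (m * x + c)) ** 2)

# The minimum of the surface sits at the true parameters (m = 2, c = 2)
i_min, j_min = np.unravel_index(np.argmin(sse), sse.shape)
print(m_grid[i_min], c_grid[j_min])
```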


Define the error function for our algorithm so we can minimise that error

● Mean square error (preferred)
● Sum of squared errors (preferred)
● Mean absolute error

The least-squares regression line is the unique line such that the sum of the squared vertical
distances (errors) between the data points and the line is the smallest possible.
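For concreteness, a minimal sketch of the three error functions listed above (the function names are my own, not from the slides):

```python
import numpy as np

def sse(y_true, y_pred):
    # Sum of squared errors
    return np.sum((y_true - y_pred) ** 2)

def mse(y_true, y_pred):
    # Mean squared error
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error
    return np.mean(np.abs(y_true - y_pred))
```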
For the 2-D problem,

Determine the values of the parameters that minimise the error E:

● Compute the partial derivatives of E with respect to the parameters (here ∂E/∂m and ∂E/∂c).

● Set them to 0 and solve.
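Setting the two derivatives to zero and solving gives the standard closed-form least-squares estimates (stated here for reference; $\bar{x}$ and $\bar{y}$ denote the sample means):

$m = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad c = \bar{y} - m\,\bar{x}$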


Gradient Descent (optimization algorithm)
The process of minimizing the cost function (error function) can be carried out by Gradient Descent.

The steps are:

1. Start with an initial guess of the coefficients

2. Keep changing the coefficients a little to try to reduce the cost function / error function

3. m = m - L * (gradient)   (L is the learning rate / step size)

4. Repeat

5. Stop when no further improvement is made.
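A minimal sketch of these steps for the simple linear model y = mx + c, using the mean squared error; the toy data and hyperparameter values below are assumptions for illustration:

```python
import numpy as np

# Assumed toy data: y = 2*x + 2 with a little Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 2 * x + 2 + rng.normal(0, 0.1, size=x.shape)

m, c = 0.0, 0.0      # step 1: initial guess of the coefficients
L = 0.5              # learning rate / step size
n = len(x)

for _ in range(10_000):          # steps 2-5: repeat until (roughly) no improvement
    y_pred = m * x + c
    # Gradients of the mean squared error with respect to m and c
    dm = (-2.0 / n) * np.sum(x * (y - y_pred))
    dc = (-2.0 / n) * np.sum(y - y_pred)
    m -= L * dm                  # step 3: parameter update
    c -= L * dc

print(m, c)          # should come out close to the true values (2, 2)
```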


Effect of learning rate (L)
Gradient descent: local minima
Local minimum: the smallest value within a local neighborhood
Multivariate regression
Height of a child can depend on height of the mother, height of the father, nutrition, and
environmental factors!

A multiple linear regression model with k predictor variables X1, X2, ..., Xk and a
response Y can be written as:
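The equation itself did not survive the export; the standard form, with coefficients β0, β1, …, βk and error term ε, is:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$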
Example

[Figure: the fitted model is a plane in 3-D, with the response y plotted against the predictors x1 and x2]
This extends to higher dimensions as well!
Parameter update
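The update equations on this slide are not in the export; in the same spirit as the single-variable case, the gradient-descent update for each coefficient would be (notation assumed):

$\beta_j \leftarrow \beta_j - L \,\dfrac{\partial E}{\partial \beta_j}, \qquad j = 0, 1, \ldots, k$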
Polynomial regression
Instead of finding a best fit “line” on the given data points, we can also try to find
the best fit “curve”.

This is the form of Polynomial regression.

The equation, in case of second-order polynomial will be:
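The slide equation is missing from the export; the standard second-order form (notation assumed) is:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$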

Third-order polynomial will be:
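Similarly, the standard third-order form (notation assumed):

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon$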


Polynomial Regression

[Figure: polynomial fits of different orders on the same data; the low-order fits are poor, a moderate order gives a good fit, and the highest order gives the best fit to the training data]
Overfitting
M=9: fitted curve passes through every data point; oscillates widely.

Model complexity: using higher-order powers in the regression model increases its complexity.

A more complex model will "fit" the training data better.

So, should we always choose a "complex" model with higher-order polynomials to fit the data set?

NO, such a model may give very wrong predictions on test data.

Though it fits well on the training data, it fails to estimate the real relationship among the
variables beyond the training set. This is known as "over-fitting".
Source: Bishop (Ch1)
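As a rough numerical illustration (not from the slides; the noisy sine-curve data follows Bishop's Ch. 1 example, and the chosen degrees are assumptions), fitting polynomials of increasing order to a small sample shows the training error shrinking while the error on held-out points grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: a noisy sine curve, as in Bishop's Ch. 1 example
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=x_test.shape)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)           # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 9 fits the 10 training points (almost) exactly but does worse on held-out points
    print(degree, round(train_err, 4), round(test_err, 4))
```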
Coefficients for polynomials of different order

What does this indicate?


So how do we decide the best model?
Bias and Variance

Bias: the amount by which the average of the estimate differs from the mean of the actual values

Variance: the variance of the estimate around its own mean


Bias and Variance

Bias occurs when the model has enough data to learn from but is not complex enough to
capture the underlying relationships (or patterns), leading to low accuracy in
prediction. This is known as underfitting.

Variance occurs when the model pays too much attention to the training data, giving high
error on the test set: the model is unable to generalize its predictions to the larger
population. High sensitivity to the training set is also known as overfitting, and generally
occurs when either the model is too complex or we do not have enough data to support it.
Bias variance decomposition
MSE = Bias² + Variance

$\mathrm{MSE} = E\big[(\hat{y} - y)^2\big]$

Let $\mu = E_D[\hat{y}]$. Adding and subtracting $\mu$,

$\mathrm{MSE} = E\big[(\hat{y} - \mu + \mu - y)^2\big]
             = E\big[(\hat{y} - \mu)^2\big] + (\mu - y)^2 + 2\,E[\hat{y} - \mu]\,(\mu - y)$

using $(a+b)^2 = a^2 + b^2 + 2ab$. Since $(\mu - y)$ is a constant and $E[\hat{y} - \mu] = 0$,
the cross term vanishes, leaving

$\mathrm{MSE} = \underbrace{E\big[(\hat{y} - \mu)^2\big]}_{\text{Variance}} + \underbrace{(\mu - y)^2}_{\text{Bias}^2}$
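A quick numerical check of this identity (a sketch with assumed data; not from the slides): repeatedly resample a training set, fit a line, and compare the empirical MSE at a fixed point with bias² + variance.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return 2 * x + 2          # assumed true relationship

x0 = 0.7                      # fixed test point
preds = []
for _ in range(5000):         # repeatedly sample training sets D
    x = rng.uniform(0, 1, 20)
    y = true_f(x) + rng.normal(0, 0.3, size=x.shape)
    m, c = np.polyfit(x, y, 1)            # fit a line to this training set
    preds.append(m * x0 + c)              # prediction y_hat at x0

preds = np.array(preds)
mse = np.mean((preds - true_f(x0)) ** 2)
bias_sq = (np.mean(preds) - true_f(x0)) ** 2
variance = np.var(preds)
print(mse, bias_sq + variance)            # the two numbers should match
```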
Regularization: control overfitting
● Useful when applying complex models to small datasets (where over-fitting is likely)
● Regularization: add penalty term to error function in order to discourage
coefficients from reaching larger values
● Penalty term: sum of squares of all coefficients
● The coefficient of the penalty term is the regularization parameter.
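The slide's equation is not in the export; a standard way to write this penalized error function, with λ as the regularization parameter (notation assumed), is:

$\widetilde{E} = \dfrac{1}{2}\sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2 + \dfrac{\lambda}{2}\sum_{j} \beta_j^2$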
Effect of size of data on model complexity

Source: Bishop (Ch1)
