
Regression

What is regression?
Linear regression

Learning

• A supervised algorithm that learns from a set of training samples.


• Each training sample has one or more input values and a single output value.

• The algorithm learns the line, plane or hyper-plane that best fits the training samples.

Prediction

• Use learned line, plane or hyper-plane to predict output value for any input sample.
Y = f(X), where f is a linear function

Example: given a set of data points, fit a line; out of the different linear functions
possible, we want to find the one that optimizes a certain criterion.


Types

● Simple linear regression (univariate)


○ y=f(x)

● Multiple linear regression (multivariate)


○ y=f(x1,x2, ….)

● Polynomial regression (regression curve is not a line)


○ y=f(x) with degree > 1
Linear regression
● Input training samples: (x,y) pairs
● y = f(x); the function f is characterised by some parameters
● y = mx + c (take simple linear regression)
● Training samples are used to derive the values of m (slope) and c (intercept) that
minimise the error between actual and predicted values of y.
● y = mx + c + ε (ε is a random error with mean 0 and standard deviation σ)
● Assumption: the errors ε1, ε2, …, εn are independent of each other. We can also
assume that these errors are normally distributed with mean 0 and standard
deviation σ. This sort of noise is called Gaussian noise or white noise.
Error surface

y=2*x+2 (x=range(1,100))

Find parameters such that error is minimum!
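As a rough sketch of what this error surface looks like (not from the slides; the grid ranges and variable names are assumptions), we can evaluate the sum of squared errors over a grid of candidate (m, c) values for this example:

```python
import numpy as np

# Generate the example data: y = 2*x + 2 for x in 1..99
x = np.arange(1, 100)
y = 2 * x + 2

# Evaluate the sum-of-squared-errors surface over a grid of (m, c) candidates
m_grid = np.linspace(0, 4, 81)
c_grid = np.linspace(-10, 14, 81)
sse = np.zeros((len(m_grid), len(c_grid)))
for i, m in enumerate(m_grid):
    for j, c in enumerate(c_grid):
        sse[i, j] = np.sum((y - (m * x + c)) ** 2)

# The minimum of the surface sits at the true parameters (m = 2, c = 2)
i_min, j_min = np.unravel_index(np.argmin(sse), sse.shape)
print(m_grid[i_min], c_grid[j_min])
```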


Define the error function for our algorithm so we can minimise that error

● Mean square error (preferred)
● Sum of squared errors (preferred)
● Mean absolute error

The least-squares regression line is the unique line such that the sum of the squared vertical
distances (errors) between the data points and the line is the smallest possible.
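For concreteness, a minimal sketch of the three error functions listed above (the function names are my own, not from the slides):

```python
import numpy as np

def sse(y_true, y_pred):
    # Sum of squared errors
    return np.sum((y_true - y_pred) ** 2)

def mse(y_true, y_pred):
    # Mean squared error
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Mean absolute error
    return np.mean(np.abs(y_true - y_pred))
```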
For the 2-D problem,

Determine the values of the parameters that minimise the error E:

● Compute the partial derivatives of E with respect to the parameters (here ∂E/∂m and ∂E/∂c).

● Set them to 0 and solve.
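Setting the two derivatives to zero and solving gives the standard closed-form least-squares estimates (stated here for reference; $\bar{x}$ and $\bar{y}$ denote the sample means):

$m = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad c = \bar{y} - m\,\bar{x}$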


Gradient Descent (optimization algorithm)
The process of minimizing the cost function (error function) can be carried out by Gradient Descent.

The steps are:

1. Start with an initial guess of the coefficients

2. Keep changing the coefficients a little to try to reduce the cost function / error function

3. m = m - L * (gradient)   (L is the learning rate / step size)

4. Repeat

5. Stop when no further improvement is made.
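A minimal sketch of these steps for the simple linear model y = mx + c, using the mean squared error; the toy data and hyperparameter values below are assumptions for illustration:

```python
import numpy as np

# Assumed toy data: y = 2*x + 2 with a little Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = 2 * x + 2 + rng.normal(0, 0.1, size=x.shape)

m, c = 0.0, 0.0      # step 1: initial guess of the coefficients
L = 0.5              # learning rate / step size
n = len(x)

for _ in range(10_000):          # steps 2-5: repeat until (roughly) no improvement
    y_pred = m * x + c
    # Gradients of the mean squared error with respect to m and c
    dm = (-2.0 / n) * np.sum(x * (y - y_pred))
    dc = (-2.0 / n) * np.sum(y - y_pred)
    m -= L * dm                  # step 3: parameter update
    c -= L * dc

print(m, c)          # should come out close to the true values (2, 2)
```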


Effect of learning rate (L)
Gradient descent: local minima
Local minimum: the smallest value within a local neighborhood
Multivariate regression
Height of a child can depend on height of the mother, height of the father, nutrition, and
environmental factors!

A multiple linear regression model with k predictor variables X1, X2, ..., Xk and a
response Y can be written as:
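The equation itself did not survive the export; the standard form, with coefficients β0, β1, …, βk and error term ε, is:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$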
Example

[Figure: the fitted model is a plane in 3-D, with the response y plotted against the predictors x1 and x2]
This extends to higher dimensions as well!
Parameter update
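The update equations on this slide are not in the export; in the same spirit as the single-variable case, the gradient-descent update for each coefficient would be (notation assumed):

$\beta_j \leftarrow \beta_j - L \,\dfrac{\partial E}{\partial \beta_j}, \qquad j = 0, 1, \ldots, k$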
Polynomial regression
Instead of finding a best fit “line” on the given data points, we can also try to find
the best fit “curve”.

This is the form of Polynomial regression.

The equation, in case of second-order polynomial will be:
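The slide equation is missing from the export; the standard second-order form (notation assumed) is:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$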

Third-order polynomial will be:
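Similarly, the standard third-order form (notation assumed):

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon$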


Polynomial Regression

[Figure: polynomial fits of different orders on the same data; the low-order fits are poor, a moderate order gives a good fit, and the highest order gives the best fit to the training data]
Overfitting
M=9: fitted curve passes through every data point; oscillates widely.

Model complexity: using higher-order powers in the regression model increases its complexity.

A more complex model will "fit" the training data better.

So, should we always choose a "complex" model with higher-order polynomials to fit the data set?

NO, such a model may give very wrong predictions on test data.

Though it fits well on the training data, it fails to estimate the real relationship among the
variables beyond the training set. This is known as "over-fitting".
Source: Bishop (Ch1)
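As a rough numerical illustration (not from the slides; the noisy sine-curve data follows Bishop's Ch. 1 example, and the chosen degrees are assumptions), fitting polynomials of increasing order to a small sample shows the training error shrinking while the error on held-out points grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: a noisy sine curve, as in Bishop's Ch. 1 example
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.shape)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=x_test.shape)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)           # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 9 fits the 10 training points (almost) exactly but does worse on held-out points
    print(degree, round(train_err, 4), round(test_err, 4))
```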
Coefficients for polynomials of different order

What does this indicate?


So how do we decide the best model?
Bias and Variance

Bias: the amount by which the average of the estimate differs from the mean of the actual values

Variance: the variance of the estimate around its own mean


Bias and Variance

Bias occurs when the model has enough data to learn from but is not complex enough to
capture the underlying relationships (or patterns), leading to low accuracy in
prediction. This is known as underfitting.

Variance occurs when the model pays too much attention to the training data, giving high
error on the test set: the model is unable to generalize its predictions to the larger
population. High sensitivity to the training set is also known as overfitting, and generally
occurs when either the model is too complex or we do not have enough data to support it.
Bias variance decomposition
MSE = Bias² + Variance

$\mathrm{MSE} = E\big[(\hat{y} - y)^2\big]$

Let $\mu = E_D[\hat{y}]$. Adding and subtracting $\mu$,

$\mathrm{MSE} = E\big[(\hat{y} - \mu + \mu - y)^2\big]
             = E\big[(\hat{y} - \mu)^2\big] + (\mu - y)^2 + 2\,E[\hat{y} - \mu]\,(\mu - y)$

using $(a+b)^2 = a^2 + b^2 + 2ab$. Since $(\mu - y)$ is a constant and $E[\hat{y} - \mu] = 0$,
the cross term vanishes, leaving

$\mathrm{MSE} = \underbrace{E\big[(\hat{y} - \mu)^2\big]}_{\text{Variance}} + \underbrace{(\mu - y)^2}_{\text{Bias}^2}$
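A quick numerical check of this identity (a sketch with assumed data; not from the slides): repeatedly resample a training set, fit a line, and compare the empirical MSE at a fixed point with bias² + variance.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return 2 * x + 2          # assumed true relationship

x0 = 0.7                      # fixed test point
preds = []
for _ in range(5000):         # repeatedly sample training sets D
    x = rng.uniform(0, 1, 20)
    y = true_f(x) + rng.normal(0, 0.3, size=x.shape)
    m, c = np.polyfit(x, y, 1)            # fit a line to this training set
    preds.append(m * x0 + c)              # prediction y_hat at x0

preds = np.array(preds)
mse = np.mean((preds - true_f(x0)) ** 2)
bias_sq = (np.mean(preds) - true_f(x0)) ** 2
variance = np.var(preds)
print(mse, bias_sq + variance)            # the two numbers should match
```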
Regularization: control overfitting
● Useful when applying complex models to small datasets (where over-fitting is likely)
● Regularization: add penalty term to error function in order to discourage
coefficients from reaching larger values
● Penalty term: sum of squares of all coefficients
● The coefficient of the penalty term is the regularization parameter.
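The slide's equation is not in the export; a standard way to write this penalized error function, with λ as the regularization parameter (notation assumed), is:

$\widetilde{E} = \dfrac{1}{2}\sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2 + \dfrac{\lambda}{2}\sum_{j} \beta_j^2$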
Effect of size of data on model complexity

Source: Bishop (Ch1)
