
Linear Regression

Anubha Gupta, PhD.


Professor
SBILab, Dept. of ECE,
IIIT-Delhi, India
Contact: anubha@iiitd.ac.in; Lab: http://sbilab.iiitd.edu.in
Machine Learning in Hindi


Learning Objectives
In this module, we will learn to
• solve a regression problem
  o using the statistical approach
  o using the data-driven approach
• understand regression errors and performance metrics
• work through an example

Linear Regression
(using the statistical approach)

Multiple Linear Regression


• A linear regression technique that uses more than one explanatory variable to predict the outcome of a response variable.

• System of linear equations


$$y_1 = \theta_0 + \theta_1 x_{1,1} + \theta_2 x_{1,2} + \dots + \theta_{p-1} x_{1,p-1} + \varepsilon_1$$
$$y_2 = \theta_0 + \theta_1 x_{2,1} + \theta_2 x_{2,2} + \dots + \theta_{p-1} x_{2,p-1} + \varepsilon_2$$
$$\vdots$$
$$y_n = \theta_0 + \theta_1 x_{n,1} + \theta_2 x_{n,2} + \dots + \theta_{p-1} x_{n,p-1} + \varepsilon_n$$

• Matrix Notation

$$\underbrace{\mathbf{y}}_{n\times 1} = \underbrace{\begin{bmatrix} 1 & x_{1,1} & x_{1,2} & \cdots & x_{1,p-1} \\ 1 & x_{2,1} & x_{2,2} & \cdots & x_{2,p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & x_{n,2} & \cdots & x_{n,p-1} \end{bmatrix}}_{n\times p} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_{p-1} \end{bmatrix}}_{p\times 1} + \underbrace{\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}}_{n\times 1}$$

Maximum Likelihood Estimation (MLE)

In vector form, stacking the rows $\mathbf{x}_i^T = [1,\; x_{i,1},\; \dots,\; x_{i,p-1}]$:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} \boldsymbol{\theta} + \boldsymbol{\varepsilon}$$

$$\mathbf{y} = \mathbf{X}\boldsymbol{\theta} + \boldsymbol{\varepsilon}$$
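To make the notation concrete, here is a minimal NumPy sketch (all names and values are illustrative, not from the slides) that builds a design matrix with a leading column of ones and generates y = Xθ + ε:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                                   # n observations, p parameters (incl. intercept)
features = rng.uniform(0, 10, size=(n, p - 1))  # x_{i,1}, ..., x_{i,p-1}

# Design matrix X: a leading column of ones multiplies the intercept theta_0
X = np.column_stack([np.ones(n), features])     # shape (n, p)

theta = np.array([2.0, 1.5, -0.7])              # illustrative parameter vector
eps = rng.normal(0.0, 0.5, size=n)              # zero-mean Gaussian noise

y = X @ theta + eps                             # y = X theta + eps, shape (n,)
```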

Maximum Likelihood Estimation (MLE)


Random variable: $y_i = \mathbf{x}_i^T \boldsymbol{\theta} + \varepsilon_i$

$$E[y_i] = E[\mathbf{x}_i^T \boldsymbol{\theta} + \varepsilon_i] = \mathbf{x}_i^T \boldsymbol{\theta} + E[\varepsilon_i] = \mathbf{x}_i^T \boldsymbol{\theta} + 0 = m_{y_i}$$

$$E\big[(y_i - m_{y_i})^2\big] = \sigma_{y_i}^2 = E\big[(y_i - \mathbf{x}_i^T \boldsymbol{\theta})^2\big] = E[\varepsilon_i^2] = \sigma^2$$

Maximum Likelihood estimator

$$L'(\boldsymbol{\theta}) = p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\theta})^2}{2\sigma^2}\right) \quad \text{[using the assumptions of linear regression]}$$

$$L'(\boldsymbol{\theta}) = \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) \quad \text{[maximize the likelihood function]}$$

MLE: $\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \, p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})$

$$\max_{\boldsymbol{\theta}} L'(\boldsymbol{\theta}) = \max_{\boldsymbol{\theta}} p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta})$$

$$p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)^2\right)$$

Maximum Likelihood estimator


$$L(\boldsymbol{\theta}) = \log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)^2 \quad \text{[log-likelihood function]}$$

Maximum likelihood estimator:

Maximizing $L(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ is equivalent to minimizing

$$\sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)^2$$

Maximum Likelihood estimator


MLE leads to the least-squares solution:

$$\hat{\boldsymbol{\theta}}_{mle} = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)^2$$

[Figure: the squared error is a convex function of $\boldsymbol{\theta}$, so it has a unique minimum.]

Linear Regression Solution


$$L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)^2$$

Let us consider a case with two features:

$$y_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + \varepsilon_i, \qquad \boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}$$

Linear Regression proof


$$y_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + \varepsilon_i$$

$$L(\boldsymbol{\theta}) = (y_1 - \theta_0 - x_{11}\theta_1 - x_{12}\theta_2)^2 + (y_2 - \theta_0 - x_{21}\theta_1 - x_{22}\theta_2)^2 + \dots + (y_n - \theta_0 - x_{n1}\theta_1 - x_{n2}\theta_2)^2$$

$$\frac{\partial L(\boldsymbol{\theta})}{\partial \theta_0} = 2(y_1 - \theta_0 - x_{11}\theta_1 - x_{12}\theta_2)(-1) + 2(y_2 - \theta_0 - x_{21}\theta_1 - x_{22}\theta_2)(-1) + \dots + 2(y_n - \theta_0 - x_{n1}\theta_1 - x_{n2}\theta_2)(-1)$$

$$\frac{\partial L(\boldsymbol{\theta})}{\partial \theta_0} = 2\sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)(-1)$$

Linear Regression proof


$$\frac{\partial L(\boldsymbol{\theta})}{\partial \theta_1} = 2(y_1 - \theta_0 - x_{11}\theta_1 - x_{12}\theta_2)(-x_{11}) + 2(y_2 - \theta_0 - x_{21}\theta_1 - x_{22}\theta_2)(-x_{21}) + \dots + 2(y_n - \theta_0 - x_{n1}\theta_1 - x_{n2}\theta_2)(-x_{n1})$$

$$\frac{\partial L(\boldsymbol{\theta})}{\partial \theta_1} = 2\sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)(-x_{i1}), \qquad \frac{\partial L(\boldsymbol{\theta})}{\partial \theta_2} = 2\sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)(-x_{i2})$$

Setting the gradient to zero:

$$\nabla L(\boldsymbol{\theta}) = \begin{bmatrix} \partial L(\boldsymbol{\theta})/\partial \theta_0 \\ \partial L(\boldsymbol{\theta})/\partial \theta_1 \\ \partial L(\boldsymbol{\theta})/\partial \theta_2 \end{bmatrix} = 2\sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big) \begin{bmatrix} -1 \\ -x_{i1} \\ -x_{i2} \end{bmatrix} = \mathbf{0}$$

$$-2\sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)\,\mathbf{x}_i = \mathbf{0}$$

Closed form solution


$$-2\sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)\,\mathbf{x}_i = \mathbf{0}$$

$$\sum_{i=1}^{n} y_i \mathbf{x}_i = \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^T \boldsymbol{\theta}$$

$$y_1 \mathbf{x}_1 + y_2 \mathbf{x}_2 + \dots + y_n \mathbf{x}_n = \mathbf{x}_1 \mathbf{x}_1^T \boldsymbol{\theta} + \mathbf{x}_2 \mathbf{x}_2^T \boldsymbol{\theta} + \dots + \mathbf{x}_n \mathbf{x}_n^T \boldsymbol{\theta}$$

$$\mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X}\, \boldsymbol{\theta}$$

$$\hat{\boldsymbol{\theta}} = \big(\mathbf{X}^T \mathbf{X}\big)^{-1} \mathbf{X}^T \mathbf{y}$$
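As a sanity check on the closed form, a short NumPy sketch (with illustrative synthetic data) that solves the normal equations X^T X θ = X^T y:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative synthetic data: y = 2 + 1.5*x1 - 0.7*x2 + noise
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 10, n)])
theta_true = np.array([2.0, 1.5, -0.7])
y = X @ theta_true + rng.normal(0.0, 0.5, n)

# Normal equations: solve (X^T X) theta = X^T y
# (numerically preferable to forming the inverse explicitly)
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # close to theta_true
```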

Closed form solution


$$\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)^2 \;\Longrightarrow\; \min_{\boldsymbol{\theta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) \;\Longrightarrow\; \min_{\boldsymbol{\theta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2^2$$

$[\mathbf{X}]_{n\times p}$ is a tall matrix ($n > p$), so we solve via the pseudo-inverse:

$$\mathbf{X}^{\dagger} = \big(\mathbf{X}^T \mathbf{X}\big)^{-1} \mathbf{X}^T, \qquad \hat{\boldsymbol{\theta}} = \mathbf{X}^{\dagger}\mathbf{y}$$

If $\mathbf{X}$ is a square invertible matrix, the pseudo-inverse reduces to the ordinary inverse:

$$\mathbf{X}^{\dagger} = \big(\mathbf{X}^T \mathbf{X}\big)^{-1} \mathbf{X}^T = \mathbf{X}^{-1}\big(\mathbf{X}^T\big)^{-1}\mathbf{X}^T = \mathbf{X}^{-1}, \qquad \hat{\boldsymbol{\theta}} = \mathbf{X}^{\dagger}\mathbf{y} = \mathbf{X}^{-1}\mathbf{y}$$
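In practice, forming (X^T X)^{-1} explicitly can be numerically unstable. A minimal sketch, assuming the same kind of illustrative synthetic data as above, showing that NumPy's pinv (pseudo-inverse) and lstsq (direct least-squares solver) return the same solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 10, n)])
y = X @ np.array([2.0, 1.5, -0.7]) + rng.normal(0.0, 0.5, n)

theta_pinv = np.linalg.pinv(X) @ y                   # theta = X^dagger y
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solver

print(np.allclose(theta_pinv, theta_lstsq))          # True: same solution
```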

Geometric Interpretation: MLE


$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_n^T \end{bmatrix} = \begin{bmatrix} 1 & \cdots & x_{1,p-1} \\ \vdots & \ddots & \vdots \\ 1 & \cdots & x_{n,p-1} \end{bmatrix} = \begin{bmatrix} \mathbf{z}_1 & \mathbf{z}_2 & \cdots & \mathbf{z}_p \end{bmatrix}$$

[Figure: $\mathbf{y}$, its orthogonal projection $\hat{\mathbf{y}}$ onto the column space of $\mathbf{X}$, and the residual $\boldsymbol{\varepsilon}$ perpendicular to that space.]

The residual at the optimum is orthogonal to the columns of $\mathbf{X}$:

$$\mathbf{X}^T \boldsymbol{\varepsilon} = \mathbf{0}$$
$$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{0}$$
$$\mathbf{X}^T \mathbf{y} - \mathbf{X}^T \mathbf{X}\boldsymbol{\theta} = \mathbf{0}$$
$$\big(\mathbf{X}^T \mathbf{X}\big)\boldsymbol{\theta} = \mathbf{X}^T \mathbf{y}$$
$$\hat{\boldsymbol{\theta}} = \big(\mathbf{X}^T \mathbf{X}\big)^{-1} \mathbf{X}^T \mathbf{y}, \qquad \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\theta}} = \underbrace{\mathbf{X}\big(\mathbf{X}^T \mathbf{X}\big)^{-1} \mathbf{X}^T}_{\mathbf{P}}\,\mathbf{y} = \mathbf{P}\mathbf{y}$$

$\mathbf{P}$ is the projection matrix that orthogonally projects $\mathbf{y}$ onto the data space (the span of the columns $\mathbf{z}_1, \dots, \mathbf{z}_p$).
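A quick numerical check of this interpretation (illustrative data): P = X(X^T X)^{-1} X^T is symmetric and idempotent, P y reproduces the fitted values X θ̂, and the residual is orthogonal to the columns of X.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, n)

P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto the column space of X
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(P, P.T))                    # P is symmetric
print(np.allclose(P @ P, P))                  # P is idempotent: projecting twice changes nothing
print(np.allclose(P @ y, X @ theta_hat))      # P y equals the fitted values y_hat
print(np.allclose(X.T @ (y - P @ y), 0))      # residual is orthogonal to the columns of X
```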

Drawbacks of MLE
1. Sensitivity to outliers: MLE assumes a specific probability distribution for the data, which can produce biased estimates if the data contains outliers (see the sketch after this list).
2. Lack of robustness: Small changes in the data can lead to large changes in the estimates of
the parameters. This can be a problem if the data is noisy or contains errors.
3. Overfitting: MLE can suffer from overfitting if the model is too complex or if the data is
limited. Overfitting occurs when the model fits the noise in the data rather than the
underlying pattern, which can lead to poor generalization performance on new data.
4. Computational complexity: MLE can be computationally complex, especially for high-
dimensional data or complex models. This can make it difficult to estimate the parameters or
to perform inference on the model.
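To illustrate points 1 and 2, a small sketch with illustrative data showing how a single corrupted observation can noticeably shift the least-squares fit:

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = 2.0 + 1.5 * x                              # noise-free linear data

X = np.column_stack([np.ones_like(x), x])
clean_fit = np.linalg.lstsq(X, y, rcond=None)[0]

y_out = y.copy()
y_out[-1] += 30.0                              # corrupt a single observation
outlier_fit = np.linalg.lstsq(X, y_out, rcond=None)[0]

print(clean_fit)    # [2.0, 1.5] recovered exactly
print(outlier_fit)  # both intercept and slope pulled toward the outlier
```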
Linear Regression
(Data-Driven Approach)

Linear Regression Solution


$$L(\boldsymbol{\theta}) = \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)^2$$

$$\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} \big(y_i - \mathbf{x}_i^T \boldsymbol{\theta}\big)^2 \;\Longrightarrow\; \min_{\boldsymbol{\theta}} (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) \;\Longrightarrow\; \min_{\boldsymbol{\theta}} \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|_2^2$$

$$\hat{\boldsymbol{\theta}} = \mathbf{X}^{\dagger}\mathbf{y}, \quad \text{where} \quad \mathbf{X}^{\dagger} = \big(\mathbf{X}^T \mathbf{X}\big)^{-1} \mathbf{X}^T$$
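In a data-driven workflow this solution is usually obtained from a library routine rather than coded by hand; a minimal scikit-learn sketch (illustrative data; the slides themselves do not prescribe a library):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
features = rng.uniform(0, 10, size=(100, 2))
y = 2.0 + 1.5 * features[:, 0] - 0.7 * features[:, 1] + rng.normal(0.0, 0.5, 100)

model = LinearRegression().fit(features, y)    # fits the intercept by default
print(model.intercept_, model.coef_)           # close to 2.0 and [1.5, -0.7]
```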
Linear Regression
(Types of Errors and Performance Metrics)

Types of Errors and Performance Metrics


• Regression errors:
  o Sum of Squares Total (SST) / Total Sum of Squares (TSS)
  o Sum of Squares due to Regression (SSR)
  o Sum of Squares Error (SSE)

• Performance metrics:
  o R-squared score
  o Mean Absolute Error
  o Mean Squared Error
  o Root Mean Squared Error

[Figure: scatter of Y vs. X with the regression line and the mean of the response variable (mean of the $y_i$'s), illustrating Sum of Squares Total (SST), Sum of Squares due to Regression (SSR), and Sum of Squares Error (SSE).]

Regression Errors
➢ Sum of Squares Total (SST) / Total Sum of Squares (TSS)
$$\text{SST} = \sum_{i=1}^{n} (y_i - m)^2$$

➢ Sum of Squares due to Regression (SSR) / Residual Sum of Squares (RSS)
$$\text{SSR} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

➢ Sum of Squares Error (SSE) / Explained Sum of Squares (ESS)
$$\text{SSE} = \sum_{i=1}^{n} (m - \hat{y}_i)^2$$

where $m = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean of the actual values. (Note: this deck uses SSR for the residual term and SSE for the explained term; some textbooks swap these two names.)
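A short NumPy sketch computing these three quantities, using this deck's naming convention, for the small dataset that appears in the "Let us Try!" exercise later in this module (x = 1..5, y = 2, 3, 5, 6, 8):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 3.0, 5.0, 6.0, 8.0])

# Least-squares line y_hat = theta0 + theta1 * x
X = np.column_stack([np.ones_like(x), x])
theta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ theta

m = y.mean()
sst = np.sum((y - m) ** 2)       # total:     22.8
ssr = np.sum((y - y_hat) ** 2)   # residual:   0.3  (SSR in this deck's naming)
sse = np.sum((m - y_hat) ** 2)   # explained: 22.5  (SSE in this deck's naming)
print(sst, ssr, sse)             # and SST = SSR + SSE
```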


Performance Metric: R-Squared ($R^2$) Score

• Also known as the Coefficient of Determination, it is given by
$$R^2 = 1 - \frac{\text{SSR}}{\text{SST}} = \frac{\text{SSE}}{\text{SST}}$$

• It measures the proportion of the variance in the dependent variable that can be explained by the independent variables in the model.
  o $R^2 = 1 \Rightarrow$ the model explains all the variance in the dependent variable
  o $R^2 = 0 \Rightarrow$ the model explains none of the variance in the dependent variable

• Measures the goodness of fit of the best-fit line

Performance Metric: Mean Absolute Error


• The average absolute difference between the predicted and actual values over a set of observations:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$$
where $y_i$ is the actual value of the $i$th observation, $\hat{y}_i$ is the predicted value of the $i$th observation, and $n$ is the total number of observations.

• A lower MAE indicates better model performance, because the predictions are closer to the actual values.

• Advantages:
  o Gives an idea of how far off the predictions are from the actual values, in the same units as the original data
  o Less sensitive to outliers than some other metrics, such as Mean Squared Error (MSE)
  o Treats all errors (positive or negative) equally

• Disadvantages:
  o Gives equal weight to all errors, regardless of their magnitude

Performance Metric: Mean Squared Error


• The average squared difference between the predicted and actual values over a set of observations:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2$$
where $y_i$ is the actual value of the $i$th observation, $\hat{y}_i$ is the predicted value of the $i$th observation, and $n$ is the total number of observations.

• Advantages:
  o Gives an idea of how far off the predictions are from the actual values
  o Penalizes large errors more heavily than small errors

• Disadvantages:
  o Sensitive to outliers (a few large errors can have a disproportionate impact on the metric)

Performance Metric: Root Mean Squared Error


• The square root of the average squared difference between the predicted and actual values over a set of observations:
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \big(y_i - \hat{y}_i\big)^2}$$
where $y_i$ is the actual value of the $i$th observation, $\hat{y}_i$ is the predicted value of the $i$th observation, and $n$ is the total number of observations.

• Advantages:
  o Unlike MSE, it is expressed in the same units as the target variable, and the square root keeps its magnitude comparatively lower.

• Disadvantages:
  o Similar to MSE (sensitive to outliers)
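The four metrics side by side, in a minimal sketch using scikit-learn's standard metric helpers. The predicted values assume the least-squares line ŷ = 0.3 + 1.5x for the module's example dataset (computed here, not stated on the slides):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.0, 3.0, 5.0, 6.0, 8.0])
y_pred = np.array([1.8, 3.3, 4.8, 6.3, 7.8])  # from the fitted line y_hat = 0.3 + 1.5x

print("R^2 :", r2_score(y_true, y_pred))             # ~0.9868
print("MAE :", mean_absolute_error(y_true, y_pred))  # 0.24
mse = mean_squared_error(y_true, y_pred)
print("MSE :", mse)                                  # 0.06
print("RMSE:", np.sqrt(mse))                         # ~0.245
```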

Let us Try!

Data:

Explanatory variable (feature):  1   2   3   4   5
Response variable (output):      2   3   5   6   8

• Regression errors:
  o Sum of Squares Total (SST):
  o Sum of Squares due to Regression (SSR):
  o Sum of Squares Error (SSE):

• Performance metrics:
  o R-squared score:
  o Mean Absolute Error:
  o Mean Squared Error:
  o Root Mean Squared Error:

Let us Try! (Solution)

Data:

Explanatory variable (feature):  1   2   3   4   5
Response variable (output):      2   3   5   6   8

• Regression errors:
  o Sum of Squares Total (SST): 22.8
  o Sum of Squares due to Regression (SSR): 0.3
  o Sum of Squares Error (SSE): 22.5

• Performance metrics:
  o R-squared score: 0.9869
  o Mean Absolute Error: 0.240
  o Mean Squared Error: 0.060
  o Root Mean Squared Error: 0.245

Try it Yourself!
Question: Let there be a dataset of 50 observations, where each data point consists of a single independent
variable (𝑥) and a single dependent variable (𝑦). You want to fit a linear regression model to this data to
predict 𝑦 based on 𝑥. After performing the regression analysis, you obtain the following regression
equation: $\hat{y} = 2.5 + 1.8x$

Using the above information, what would be the predicted value of 𝑦 for a value of 𝑥 equal to 3.2?

The solution to this question will not be provided.



To Summarize
In this module, we discussed
● What is linear regression?
● What are the different types of errors?
● How to solve the problem?
● We also looked at one example.
