
What is Linear Regression?

• Learning
• A supervised algorithm that learns from a set of training samples.
• Each training sample has one or more input values and a single output
value.
• The algorithm learns the line, plane or hyper-plane that best fits the
training samples.
• Prediction
• Use the learned line, plane or hyper-plane to predict the output value
for any input sample.
Roadmap

• Single Dimension Linear Regression


• Multi Dimension Linear Regression
• Gradient Descent
• Generalisation, Over-fitting & Regularisation
• Categorical Inputs
How do we predict with only one
variable?
• Consider the example of predicting the tip amount based on the bill amount
• Assume that you begin with only the tip amounts collected, without knowledge of the bill amount.
• Now how do you predict the future tip amount from this?
How do we predict with only one
variable?
• The best-fit line for this data is based only on the mean of the tip amounts

How do we predict with only one
variable?
• “Goodness of fit” is based on the variability of the tip amounts around the fitted line.

How do we predict with only one
variable?
• Measuring the deviation: Squared residuals or Sum of
Squared Errors (SSE)

Simple Linear Regression

• The goal of Simple Linear Regression is to create a linear model that minimises the sum of squares of the residuals / errors (SSE)
Into Single Dimension Least Square
Linear Regression
• Matching the pattern in the data to the best general regression model

• Measure the distance from the line to each data point, square the distances and add them up.
• The distance from the line to a data point is called a “residual”.
• Rotate the line, measure the residuals, square them and then sum up the squares.
• Since the slope is not 0, knowing a mouse's weight helps us make a guess about that mouse's size.
Single Dimension Linear Regression


Single Dimension
Least Square Linear Regression
• We want a line that minimises the error between the Y values in the training samples and the Y values that the line passes through.
• Or put another way, we want the line that “best fits” the training samples.
• So we define the error function for our algorithm so we can minimise that error. This is called the Sum-of-Squares Error (SSE)
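Written out for a line ŷ = a + bx over n training samples (one common way to state this error function):

```latex
E(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
```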
Single Dimension Least square Linear
Regression
• To determine the values of a and b that minimise the error E, we look for where the partial derivatives of E with respect to a and b are zero.
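Setting both partial derivatives to zero and solving gives the familiar closed-form estimates (a standard result, sketched here; it matches the numerical example that follows):

```latex
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```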
Single Dimension Least Square Linear
Regression - Numerical Example

∑(x - x̄) * (y - ȳ) = 611.36
∑(x - x̄)^2 = 94.18
b = 611.36 / 94.18 = 6.491
a = 64.45 - (6.491 * 4.72) = 30.18
ŷ = 30.18 + 6.49x
Measures of Variations
Single Dimension Least Square Linear
Regression
• We also define a function which we can use to score how well the derived line fits.
• R squared measure, built from three sums of squares (written out below):
• Total sum of squares (SST)
• Regression sum of squares (SSR)
• Sum of squares of residuals / errors (SSE)
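Written out (with ŷᵢ the fitted value and ȳ the mean of the observed y values):

```latex
SST = \sum_i (y_i - \bar{y})^2, \qquad
SSR = \sum_i (\hat{y}_i - \bar{y})^2, \qquad
SSE = \sum_i (y_i - \hat{y}_i)^2
```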
Single Dimension Least Square
Linear Regression
• Now, SSR + SSE = SST
• We can compute R², the ratio of the regression sum of squares to the total sum of squares, as follows:

R² = SSR / SST

• Or equivalently

R² = 1 - (SSE / SST)
Single Dimension Least Square
Linear Regression
• R squared
• A value of 1 indicates a perfect fit.
• A value of 0 indicates a fit that is no better than simply predicting the mean of the observed y values.
• A negative value indicates a fit that is even worse than just predicting the mean of the observed y values.

Gradient Descent

• The gradient descent algorithm is an iterative process that takes us to the minimum of a function (barring some caveats). Here that function is our loss function.
• The formula below sums up the entire gradient descent algorithm in a single line.
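In its usual form, with η the learning rate and L the loss, that one-line update is:

```latex
\theta_{\text{new}} = \theta_{\text{old}} - \eta \, \frac{\partial L}{\partial \theta}
```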
Working of Gradient Descent

• Say the derivative of the loss function is D; each iteration moves the parameter a step of size (learning rate * D) in the direction opposite to D, and the steps shrink as D approaches zero near the minimum.

General Linear Regression

• Optimise the intercept and the slope
• Height = Intercept + Slope * Weight
• How gradient descent can fit a line to data by finding optimal values for the intercept and the slope

Gradient Descent by Example

• Pick a random value for the intercept
• This initial guess gives gradient descent something to improve upon.
Gradient Descent by Example

• Evaluate how well this line fits the data with the sum of squared residuals.
• In ML, the sum of squared residuals is a type of loss function.
• Predicted Height = Intercept + 0.64 * Weight
• Use GD to find the optimal value for the intercept
• Initial guess for intercept = 0
• Predicted Height = 0 + 0.64 * Weight
• For data point (0.5, 1.4), the predicted height is 0.32
• Residual is 1.4 - 0.32 ≈ 1.1
• Squared residual is (1.1)^2
• For all data points, the sum of squared residuals is (1.1)^2 + (0.4)^2 + (1.3)^2 ≈ 3.1
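A minimal Python sketch of this calculation, assuming the three data points used in this example, (0.5, 1.4), (2.3, 1.9) and (2.9, 3.2), with the intercept at 0 and the slope fixed at 0.64:

```python
# (weight, height) data points from the worked example
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]

intercept, slope = 0.0, 0.64

# Sum of squared residuals for the current intercept
ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in data)
print(round(ssr, 2))  # ~3.16 with unrounded residuals; the slide rounds each residual first and gets ~3.1
```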


• Predicted Height = Intercept + 0.64 * Weight
• Plot the sum of squared residuals at intercept = 0, at intercept = 0.25, and at increasing values of the intercept.
• Is this the best we can do?
Gradient Descent by Example

• What if the best value for the intercept is somewhere between the highlighted values, or any other value?
• Gradient descent
Gradient Descent

• Gradient descent does only a few calculations far from the optimal solution and increases the number of calculations closer to the optimal value
• It takes baby steps when it is close
Gradient Descent Algorithm

• Take the derivative of the loss function for each parameter in it.
• Pick random values for the parameters.
• Plug the parameter values into the derivatives.
• Calculate the step size:
• Step size = Learning rate * slope
• Calculate the new parameter values (a sketch of the full loop appears below):
• New intercept = Old intercept - step size
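A generic sketch of these steps in Python; grad_fn, the starting values and the hyper-parameters are placeholders to be supplied for a specific loss:

```python
def gradient_descent(grad_fn, params, learning_rate=0.1,
                     min_step=0.001, max_iters=1000):
    """Generic gradient descent following the steps listed above.

    grad_fn(params) must return the derivative of the loss with
    respect to each parameter (same length as params).
    """
    for _ in range(max_iters):
        slopes = grad_fn(params)                           # plug parameter values into the derivatives
        steps = [learning_rate * s for s in slopes]        # step size = learning rate * slope
        params = [p - st for p, st in zip(params, steps)]  # new parameter = old parameter - step size
        if all(abs(st) < min_step for st in steps):        # stop once the steps are tiny
            break
    return params
```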
Gradient Descent by Example

• Use gradient descent to find the optimal value for the intercept, starting from a random value
• Sum of squared residuals =
(1.4 - (intercept + 0.64 * 0.5))^2 + (1.9 - (intercept + 0.64 * 2.3))^2 + (3.2 - (intercept + 0.64 * 2.9))^2
Gradient Descent by Example

• Sum of squared residuals =
(1.4 - (intercept + 0.64 * 0.5))^2 + (1.9 - (intercept + 0.64 * 2.3))^2 + (3.2 - (intercept + 0.64 * 2.9))^2
• Take the derivative of this function to determine the slope at any value of the intercept
• d(Sum of squared residuals) / d intercept =
d(1.4 - (intercept + 0.64 * 0.5))^2 / d intercept + d(1.9 - (intercept + 0.64 * 2.3))^2 / d intercept + d(3.2 - (intercept + 0.64 * 2.9))^2 / d intercept
Gradient Descent by Example

• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (intercept + 0.64 * 0.5)) + -2 * (1.9 - (intercept + 0.64 * 2.3)) + -2 * (3.2 - (intercept + 0.64 * 2.9))
Gradient Descent by Example

• When intercept = 0
• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (0 + 0.64 * 0.5)) + -2 * (1.9 - (0 + 0.64 * 2.3)) + -2 * (3.2 - (0 + 0.64 * 2.9))
• Slope is -5.7
• The size of the step should be related to the slope.
• It tells us whether we should take a baby step or a big step, but we must make sure the big step is not too big.
• What is the step size?
Gradient Descent by Example

• Gradient descent determines the step size by multiplying the slope by the learning rate
• Step size = Learning rate * -5.7 = 0.1 * -5.7 = -0.57
• For intercept 0 the step size is -0.57
• New intercept = Old intercept - step size
• New intercept = 0 - (-0.57) = 0.57
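The same single update, sketched in Python (slope fixed at 0.64, learning rate 0.1 as assumed above):

```python
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]
slope, learning_rate = 0.64, 0.1

intercept = 0.0
# Derivative of the sum of squared residuals with respect to the intercept
d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
step_size = learning_rate * d_intercept
intercept = intercept - step_size
print(round(d_intercept, 1), round(step_size, 2), round(intercept, 2))  # -5.7, -0.57, 0.57
```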

Gradient Descent by Example

• One big step, and we are closer to the optimal value of the intercept

Next iteration

• When intercept = 0.57
• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (0.57 + 0.64 * 0.5)) + -2 * (1.9 - (0.57 + 0.64 * 2.3)) + -2 * (3.2 - (0.57 + 0.64 * 2.9))
• Slope is -2.3
• Step size = Learning rate * -2.3 = 0.1 * -2.3 = -0.23
• For intercept 0.57 the step size is -0.23
• New intercept = 0.57 - (-0.23) = 0.8

Next iteration

• New intercept = 0.89
• Then,
• New intercept = 0.92
• Then,
• New intercept = 0.94
• Then,
• New intercept = 0.95
• After 6 iterations the GD estimate for the intercept is 0.95
• The least squares estimate is also 0.95
When to Stop

• In practice, gradient descent stops when:
• The step size is 0.001 or smaller (the step size is very close to zero when the slope is close to zero), or
• The number of iterations reaches 1000
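Putting the worked example together with these stopping rules, a minimal sketch (same data points, slope fixed at 0.64, learning rate 0.1):

```python
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]
slope, learning_rate = 0.64, 0.1

intercept = 0.0
for _ in range(1000):                      # at most 1000 iterations
    d = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
    step = learning_rate * d
    intercept -= step
    if abs(step) < 0.001:                  # stop once the step size is ~0
        break

print(round(intercept, 2))  # ~0.95, matching the least squares estimate
```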

Estimate both slope and
intercept
• Sum of squared residuals =
(1.4 - (intercept + slope * 0.5))^2 + (1.9 - (intercept + slope * 2.3))^2 + (3.2 - (intercept + slope * 2.9))^2

The derivatives with respect to both the intercept and the slope have to be taken now.

Contd…

• d(Sum of squared residuals) / d intercept =
d(1.4 - (intercept + slope * 0.5))^2 / d intercept + d(1.9 - (intercept + slope * 2.3))^2 / d intercept + d(3.2 - (intercept + slope * 2.9))^2 / d intercept
• d(Sum of squared residuals) / d slope =
d(1.4 - (intercept + slope * 0.5))^2 / d slope + d(1.9 - (intercept + slope * 2.3))^2 / d slope + d(3.2 - (intercept + slope * 2.9))^2 / d slope
Contd…

• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (intercept + slope * 0.5)) + -2 * (1.9 - (intercept + slope * 2.3)) + -2 * (3.2 - (intercept + slope * 2.9))
• d(Sum of squared residuals) / d slope =
-2 * 0.5 * (1.4 - (intercept + slope * 0.5)) + -2 * 2.3 * (1.9 - (intercept + slope * 2.3)) + -2 * 2.9 * (3.2 - (intercept + slope * 2.9))
Initial guess
Intercept = 0 and Slope=1
• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (0 + 1 * 0.5)) + -2 * (1.9 - (0 + 1 * 2.3)) + -2 * (3.2 - (0 + 1 * 2.9)) = -1.6
• d(Sum of squared residuals) / d slope =
-2 * 0.5 * (1.4 - (0 + 1 * 0.5)) + -2 * 2.3 * (1.9 - (0 + 1 * 2.3)) + -2 * 2.9 * (3.2 - (0 + 1 * 2.9)) = -0.8
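A sketch of the same gradient calculation in Python, evaluated at the initial guess (intercept = 0, slope = 1):

```python
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]

def gradients(intercept, slope):
    """Partial derivatives of the sum of squared residuals."""
    d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in data)
    return d_intercept, d_slope

print(gradients(0, 1))  # approximately (-1.6, -0.8)
```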

For intercept

• Step size for intercept = Learning rate * -1.6 = 0.01 * -1.6 = -0.016
• For intercept 0 the step size is -0.016
• New intercept = Old intercept - step size
• New intercept = 0 - (-0.016) = 0.016

For Slope

• Step size for slope = Learning rate * -0.8 = 0.01 * -0.8 = -0.008
• For slope 1 the step size is -0.008
• New slope = Old slope - step size
• New slope = 1 - (-0.008) = 1.008

Optimized Slope and Intercept

• Intercept = 0.95
• Slope = 0.64
• For all loss functions, the GD procedure is the same
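A compact sketch of the full two-parameter loop (same three data points; learning rate 0.01 as on the later slide), which settles near these values:

```python
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]
learning_rate = 0.01

intercept, slope = 0.0, 1.0
for _ in range(1000):
    d_int = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in data)
    intercept -= learning_rate * d_int
    slope -= learning_rate * d_slope

print(round(intercept, 2), round(slope, 2))  # ~0.95 and ~0.64
```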

Algorithm

• Take the derivative of the loss function for each parameter in it.
• Pick random values for the parameters.
• Plug the parameter values into the derivatives.
• Calculate the step size:
• Step size = Learning rate * slope
• Calculate the new parameter values:
• New intercept = Old intercept - step size
Stochastic Gradient Descent

• It uses a randomly selected subset of the data at every step rather than the full dataset.
• This reduces the time spent calculating the derivatives of the loss function.
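A minimal sketch of that idea in Python, sampling a random mini-batch at each step (the batch size and data layout are illustrative assumptions):

```python
import random

def sgd_step(data, intercept, slope, learning_rate=0.01, batch_size=1):
    """One stochastic gradient descent update using a random subset of the data."""
    batch = random.sample(data, batch_size)
    d_int = sum(-2 * (y - (intercept + slope * x)) for x, y in batch)
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in batch)
    return intercept - learning_rate * d_int, slope - learning_rate * d_slope
```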

Gradient Descent
• Let us generalize the cost function and fit both intercept and slope
• Consider the cost function as the Mean Squared Error (MSE):
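In its usual form (with ŷᵢ the model's prediction for sample i):

```latex
MSE = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - \hat{y}_i \bigr)^2
```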

Where N is the total number of instances/samples.


• Our model here can be described as ŷ = ax + b, where
• a is the slope (to change the steepness),
• b is the bias (to move the line up and down the graph),
• x is the explanatory variable, and
• y is the output.
Gradient Descent
Let us compute the gradients with respect to a and b
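For the MSE above, with the model ŷ = ax + b, these gradients are:

```latex
\frac{\partial\,MSE}{\partial a} = -\frac{2}{N} \sum_{i=1}^{N} x_i \bigl( y_i - (a x_i + b) \bigr),
\qquad
\frac{\partial\,MSE}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \bigl( y_i - (a x_i + b) \bigr)
```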
Gradient Descent
• Computing new intercept and slope
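The usual updates are:

```latex
a \leftarrow a - \eta \, \frac{\partial\,MSE}{\partial a},
\qquad
b \leftarrow b - \eta \, \frac{\partial\,MSE}{\partial b}
```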

where η is the learning rate and each gradient term gives the change in the associated parameter.


Gradient Descent

• Compute the SSE for every new intercept and slope


until SSE <= 0.001 or we reach 1000 iterations
Gradient Descent for the previous
example
Learning rate=0.01
Additional Content

• https://ml-
cheatsheet.readthedocs.io/en/latest/linear_regressio
n.html
Loss Functions

• https://heartbeat.fritz.ai/5-regression-loss-functions-
all-machine-learners-should-know-4fb140e9d4b0
Multi Dimension Linear Regression
• Each training sample has an x made up of multiple input values and
a corresponding y with a single value.
• The inputs can be represented as an X matrix in which each row is a sample and each column is a dimension.
• The outputs can be represented as a y matrix in which each row is a sample.
Multi Dimension Linear Regression

• Our predicted y values are calculated by multiplying the X matrix by a matrix of weights, w.
• If there are 2 dimensions, then this equation defines a plane. If there are more dimensions, then it defines a hyper-plane.
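In matrix form (a sketch of the convention described above, with m samples, D input dimensions and a single output per sample):

```latex
\hat{\mathbf{y}} = X \mathbf{w},
\qquad X \in \mathbb{R}^{m \times D}, \quad \mathbf{w} \in \mathbb{R}^{D}, \quad \hat{\mathbf{y}} \in \mathbb{R}^{m}
```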
Multi Dimension Linear Regression as Solving
System of Linear Equations

Generally, it is an over-determined system


Linear Models for Simple Regression

Multi Dimension Least Square Linear
Regression
• We want a plane or hyper-plane
that minimises the error between
the y values in training samples and
the y values that the plane or hyper-
plane passes through.
• Or put another way, we want the plane/hyper-plane that “best fits” the training samples.
• So we define the error function for
our algorithm so we can minimise
that error.

Use the following rules of
differentiation
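The matrix-calculus rules typically needed here, stated without proof (A symmetric):

```latex
\frac{\partial}{\partial \mathbf{w}} \bigl( \mathbf{a}^{\mathsf T} \mathbf{w} \bigr) = \mathbf{a},
\qquad
\frac{\partial}{\partial \mathbf{w}} \bigl( \mathbf{w}^{\mathsf T} A \mathbf{w} \bigr) = 2 A \mathbf{w}
```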
Multi Dimension Least Square Linear
Regression

• To determine the value of w that minimises the error


E, we look for where the differential of E with
respect to w is zero.
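Assuming E is the sum of squared errors ‖y - Xw‖², setting its derivative to zero gives the familiar normal-equation solution:

```latex
\frac{\partial E}{\partial \mathbf{w}} = -2 X^{\mathsf T}(\mathbf{y} - X\mathbf{w}) = 0
\;\;\Rightarrow\;\;
\mathbf{w} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} \mathbf{y}
```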

Multi Dimension Least Square Linear
Regression

• This closed-form solution is nice but has some issues
• The D × D matrix XᵀX may not be invertible
• Based solely on minimising the error on the training data
• Can overfit the training data
• Inversion is expensive for large D: we can use iterative optimisation techniques
• Hence we move to the iterative gradient descent technique
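A brief numpy sketch of both routes on illustrative toy data (the data, sizes and learning rate are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 samples, D = 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Closed form via least squares (avoids explicitly inverting X^T X)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Iterative gradient descent on the same sum-of-squares error
w = np.zeros(3)
learning_rate = 0.001
for _ in range(5000):
    grad = -2 * X.T @ (y - X @ w)
    w -= learning_rate * grad

print(w_closed, w)  # both should land close to [1.0, -2.0, 0.5]
```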
Multi Dimension Linear Regression

• In addition to using the X matrix to represent the basic features of our training data, we can also introduce additional dimensions (i.e. columns in our X matrix) that are derived from those basic feature values.
• If we introduce derived features whose values are powers of basic features, our multi-dimensional linear regression can then derive polynomial curves, planes and hyper-planes.
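A sketch of such derived features with numpy, fitting a quadratic by adding an x² column (the data here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 2.0 + 1.5 * x - 0.7 * x**2 + 0.2 * rng.normal(size=50)

# Design matrix: bias column plus powers of the basic feature x
X = np.column_stack([np.ones_like(x), x, x**2])

w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [2.0, 1.5, -0.7]
```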
Linear Models for Multi-output
Regression

References

• https://towardsdatascience.com/understanding-the-
mathematics-behind-gradient-descent-dde5dc9be06e
• https://www.youtube.com/watch?v=sDv4f4s2SB8

