
What is Linear Regression?

• Learning
• A supervised algorithm that learns from a set of training samples.
• Each training sample has one or more input values and a single output
value.
• The algorithm learns the line, plane or hyper-plane that best fits the
training samples.
• Prediction
• Use the learned line, plane or hyper-plane to predict the output value
for any input sample.
Roadmap

• Single Dimension Linear Regression


• Multi Dimension Linear Regression
• Gradient Descent
• Generalisation, Over-fitting & Regularisation
• Categorical Inputs
How do we predict with only one
variable?
• Consider the example of predicting the tip amount based on the bill amount
• Assume that you begin with only the tip amounts collected, without knowledge of the bill amount.
• Now how do you predict the future tip amount from this?
How do we predict with only one
variable?
• The best-fit line for this data is based only on the mean of the tip amounts

How do we predict with only one
variable?
• “Goodness of fit” is based on the variability of the tip amounts around the fitted line.

How do we predict with only one
variable?
• Measuring the deviation: Squared residuals or Sum of
Squared Errors (SSE)

Simple Linear Regression

• The goal of Simple Linear Regression is to create a linear model that minimises the sum of squares of the residuals / errors (SSE)
Into Single Dimension Least Square
Linear Regression
• Matching the pattern in the data to the best general regression model

• Measure the distance from the line to each data point, square the distances and add them up.
• The distance from the line to a data point is called a “residual”.
• Rotate the line, measure the residuals, square them and then sum up the squares.
• Since the slope is not 0, knowing a mouse's weight helps us make a guess about that mouse's size.
Single Dimension Linear Regression


Single Dimension
Least Square Linear Regression
• We want a line that minimises the error between the Y values in the training samples and the Y values that the line passes through.
• Or put another way, we want the line that “best fits” the training samples.
• So we define the error function for our algorithm so we can minimise that error. This is called the Sum-of-Squares Error (SSE)
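Written out for a line ŷ = a + bx over n training samples (one common way to state this error function):

```latex
E(a, b) = \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
```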
Single Dimension Least square Linear
Regression
• To determine the values of a and b that minimise the error E, we look for where the partial derivatives of E with respect to a and b are zero.
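Setting both partial derivatives to zero and solving gives the familiar closed-form estimates (a standard result, sketched here; it matches the numerical example that follows):

```latex
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```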
Single Dimension Least Square Linear
Regression - Numerical Example

∑(x - x̄) * (y - ȳ) = 611.36
∑(x - x̄)^2 = 94.18
b = 611.36 / 94.18 = 6.491
a = 64.45 - (6.491 * 4.72) = 30.18
ŷ = 30.18 + 6.49x
Measures of Variations
Single Dimension Least Square Linear
Regression
• We also define a function which we can use to score how well the derived line fits.
• R squared measure, built from three sums of squares (written out below):
• Total sum of squares (SST)
• Regression sum of squares (SSR)
• Sum of squares of residuals / errors (SSE)
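Written out (with ŷᵢ the fitted value and ȳ the mean of the observed y values):

```latex
SST = \sum_i (y_i - \bar{y})^2, \qquad
SSR = \sum_i (\hat{y}_i - \bar{y})^2, \qquad
SSE = \sum_i (y_i - \hat{y}_i)^2
```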
Single Dimension Least Square
Linear Regression
• Now, SSR + SSE = SST
• We can compute R², the ratio of the regression sum of squares to the total sum of squares, as follows:

R² = SSR / SST

• Or equivalently

R² = 1 - (SSE / SST)
Single Dimension Least Square
Linear Regression
• R squared
• A value of 1 indicates a perfect fit.
• A value of 0 indicates a fit that is no better than simply predicting the mean of the observed y values.
• A negative value indicates a fit that is even worse than just predicting the mean of the observed y values.

Gradient Descent

• The gradient descent algorithm is an iterative process that takes us to the minimum of a function (barring some caveats). Here that function is our loss function.
• The formula below sums up the entire gradient descent algorithm in a single line.
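In its usual form, with η the learning rate and L the loss, that one-line update is:

```latex
\theta_{\text{new}} = \theta_{\text{old}} - \eta \, \frac{\partial L}{\partial \theta}
```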
Working of Gradient Descent

• Say the derivative of the loss function is D; each iteration moves the parameter a step of size (learning rate * D) in the direction opposite to D, and the steps shrink as D approaches zero near the minimum.

General Linear Regression

• Optimise the intercept and the slope
• Height = Intercept + Slope * Weight
• How gradient descent can fit a line to data by finding optimal values for the intercept and the slope

Gradient Descent by Example

• Pick a random value for the intercept
• This initial guess gives gradient descent something to improve upon.
Gradient Descent by Example

• Evaluate how well this line fits the data with the sum of squared residuals.
• In ML, the sum of squared residuals is a type of loss function.
• Predicted Height = Intercept + 0.64 * Weight
• Use GD to find the optimal value for the intercept
• Initial guess for intercept = 0
• Predicted Height = 0 + 0.64 * Weight
• For data point (0.5, 1.4), the predicted height is 0.32
• Residual is 1.4 - 0.32 ≈ 1.1
• Squared residual is (1.1)^2
• For all data points, the sum of squared residuals is (1.1)^2 + (0.4)^2 + (1.3)^2 ≈ 3.1
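A minimal Python sketch of this calculation, assuming the three data points used in this example, (0.5, 1.4), (2.3, 1.9) and (2.9, 3.2), with the intercept at 0 and the slope fixed at 0.64:

```python
# (weight, height) data points from the worked example
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]

intercept, slope = 0.0, 0.64

# Sum of squared residuals for the current intercept
ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in data)
print(round(ssr, 2))  # ~3.16 with unrounded residuals; the slide rounds each residual first and gets ~3.1
```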


• Predicted Height = Intercept + 0.64 * Weight
• Plot the sum of squared residuals at intercept = 0, at intercept = 0.25, and at increasing values of the intercept.
• Is this the best we can do?
Gradient Descent by Example

• What if the best value for the intercept is somewhere between the highlighted values, or any other value?
• Gradient descent
Gradient Descent

• Gradient descent does only a few calculations far from the optimal solution and increases the number of calculations closer to the optimal value
• It takes baby steps when it is close
Gradient Descent Algorithm

• Take the derivative of the loss function for each parameter in it.
• Pick random values for the parameters.
• Plug the parameter values into the derivatives.
• Calculate the step size:
• Step size = Learning rate * slope
• Calculate the new parameter values (a sketch of the full loop appears below):
• New intercept = Old intercept - step size
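A generic sketch of these steps in Python; grad_fn, the starting values and the hyper-parameters are placeholders to be supplied for a specific loss:

```python
def gradient_descent(grad_fn, params, learning_rate=0.1,
                     min_step=0.001, max_iters=1000):
    """Generic gradient descent following the steps listed above.

    grad_fn(params) must return the derivative of the loss with
    respect to each parameter (same length as params).
    """
    for _ in range(max_iters):
        slopes = grad_fn(params)                           # plug parameter values into the derivatives
        steps = [learning_rate * s for s in slopes]        # step size = learning rate * slope
        params = [p - st for p, st in zip(params, steps)]  # new parameter = old parameter - step size
        if all(abs(st) < min_step for st in steps):        # stop once the steps are tiny
            break
    return params
```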
Gradient Descent by Example

• Use gradient descent to find the optimal value for the intercept, starting from a random value
• Sum of squared residuals =
(1.4 - (intercept + 0.64 * 0.5))^2 + (1.9 - (intercept + 0.64 * 2.3))^2 + (3.2 - (intercept + 0.64 * 2.9))^2
Gradient Descent by Example

• Sum of squared residuals =
(1.4 - (intercept + 0.64 * 0.5))^2 + (1.9 - (intercept + 0.64 * 2.3))^2 + (3.2 - (intercept + 0.64 * 2.9))^2
• Take the derivative of this function to determine the slope at any value of the intercept
• d(Sum of squared residuals) / d intercept =
d(1.4 - (intercept + 0.64 * 0.5))^2 / d intercept + d(1.9 - (intercept + 0.64 * 2.3))^2 / d intercept + d(3.2 - (intercept + 0.64 * 2.9))^2 / d intercept
Gradient Descent by Example

• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (intercept + 0.64 * 0.5)) + -2 * (1.9 - (intercept + 0.64 * 2.3)) + -2 * (3.2 - (intercept + 0.64 * 2.9))
Gradient Descent by Example

• When intercept = 0
• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (0 + 0.64 * 0.5)) + -2 * (1.9 - (0 + 0.64 * 2.3)) + -2 * (3.2 - (0 + 0.64 * 2.9))
• Slope is -5.7
• The size of the step should be related to the slope.
• It tells us whether we should take a baby step or a big step, but we must make sure the big step is not too big.
• What is the step size?
Gradient Descent by Example

• Gradient descent determines the step size by multiplying the slope by the learning rate
• Step size = Learning rate * -5.7 = 0.1 * -5.7 = -0.57
• For intercept 0 the step size is -0.57
• New intercept = Old intercept - step size
• New intercept = 0 - (-0.57) = 0.57
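The same single update, sketched in Python (slope fixed at 0.64, learning rate 0.1 as assumed above):

```python
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]
slope, learning_rate = 0.64, 0.1

intercept = 0.0
# Derivative of the sum of squared residuals with respect to the intercept
d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
step_size = learning_rate * d_intercept
intercept = intercept - step_size
print(round(d_intercept, 1), round(step_size, 2), round(intercept, 2))  # -5.7, -0.57, 0.57
```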

Gradient Descent by Example

• One big step, and we are closer to the optimal value of the intercept

Next iteration

• When intercept = 0.57
• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (0.57 + 0.64 * 0.5)) + -2 * (1.9 - (0.57 + 0.64 * 2.3)) + -2 * (3.2 - (0.57 + 0.64 * 2.9))
• Slope is -2.3
• Step size = Learning rate * -2.3 = 0.1 * -2.3 = -0.23
• For intercept 0.57 the step size is -0.23
• New intercept = 0.57 - (-0.23) = 0.8

Next iteration

• New intercept = 0.89
• Then,
• New intercept = 0.92
• Then,
• New intercept = 0.94
• Then,
• New intercept = 0.95
• After 6 iterations the GD estimate for the intercept is 0.95
• The least squares estimate is also 0.95
When to Stop

• In practice, gradient descent stops when:
• The step size is 0.001 or smaller (the step size is very close to zero when the slope is close to zero), or
• The number of iterations reaches 1000
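Putting the worked example together with these stopping rules, a minimal sketch (same data points, slope fixed at 0.64, learning rate 0.1):

```python
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]
slope, learning_rate = 0.64, 0.1

intercept = 0.0
for _ in range(1000):                      # at most 1000 iterations
    d = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
    step = learning_rate * d
    intercept -= step
    if abs(step) < 0.001:                  # stop once the step size is ~0
        break

print(round(intercept, 2))  # ~0.95, matching the least squares estimate
```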

Estimate both slope and
intercept
• Sum of squared residuals =
(1.4 - (intercept + slope * 0.5))^2 + (1.9 - (intercept + slope * 2.3))^2 + (3.2 - (intercept + slope * 2.9))^2

The derivatives with respect to both the intercept and the slope have to be taken now.

Contd…

• d(Sum of squared residuals) / d intercept =
d(1.4 - (intercept + slope * 0.5))^2 / d intercept + d(1.9 - (intercept + slope * 2.3))^2 / d intercept + d(3.2 - (intercept + slope * 2.9))^2 / d intercept
• d(Sum of squared residuals) / d slope =
d(1.4 - (intercept + slope * 0.5))^2 / d slope + d(1.9 - (intercept + slope * 2.3))^2 / d slope + d(3.2 - (intercept + slope * 2.9))^2 / d slope
Contd…

• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (intercept + slope * 0.5)) + -2 * (1.9 - (intercept + slope * 2.3)) + -2 * (3.2 - (intercept + slope * 2.9))
• d(Sum of squared residuals) / d slope =
-2 * 0.5 * (1.4 - (intercept + slope * 0.5)) + -2 * 2.3 * (1.9 - (intercept + slope * 2.3)) + -2 * 2.9 * (3.2 - (intercept + slope * 2.9))
Initial guess
Intercept = 0 and Slope=1
• d(Sum of squared residuals) / d intercept =
-2 * (1.4 - (0 + 1 * 0.5)) + -2 * (1.9 - (0 + 1 * 2.3)) + -2 * (3.2 - (0 + 1 * 2.9)) = -1.6
• d(Sum of squared residuals) / d slope =
-2 * 0.5 * (1.4 - (0 + 1 * 0.5)) + -2 * 2.3 * (1.9 - (0 + 1 * 2.3)) + -2 * 2.9 * (3.2 - (0 + 1 * 2.9)) = -0.8
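A sketch of the same gradient calculation in Python, evaluated at the initial guess (intercept = 0, slope = 1):

```python
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]

def gradients(intercept, slope):
    """Partial derivatives of the sum of squared residuals."""
    d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in data)
    return d_intercept, d_slope

print(gradients(0, 1))  # approximately (-1.6, -0.8)
```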

For intercept

• Step size for intercept = Learning rate * -1.6 = 0.01 * -1.6 = -0.016
• For intercept 0 the step size is -0.016
• New intercept = Old intercept - step size
• New intercept = 0 - (-0.016) = 0.016

For Slope

• Step size for slope = Learning rate * -0.8 = 0.01 * -0.8 = -0.008
• For slope 1 the step size is -0.008
• New slope = Old slope - step size
• New slope = 1 - (-0.008) = 1.008

Optimized Slope and Intercept

• Intercept = 0.95
• Slope = 0.64
• For all loss functions, the GD procedure is the same
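A compact sketch of the full two-parameter loop (same three data points; learning rate 0.01 as on the later slide), which settles near these values:

```python
data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]
learning_rate = 0.01

intercept, slope = 0.0, 1.0
for _ in range(1000):
    d_int = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in data)
    intercept -= learning_rate * d_int
    slope -= learning_rate * d_slope

print(round(intercept, 2), round(slope, 2))  # ~0.95 and ~0.64
```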

Algorithm

• Take the derivative of the loss function for each parameter in it.
• Pick random values for the parameters.
• Plug the parameter values into the derivatives.
• Calculate the step size:
• Step size = Learning rate * slope
• Calculate the new parameter values:
• New intercept = Old intercept - step size
Stochastic Gradient Descent

• It uses a randomly selected subset of the data at every step rather than the full dataset.
• This reduces the time spent calculating the derivatives of the loss function.
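A minimal sketch of that idea in Python, sampling a random mini-batch at each step (the batch size and data layout are illustrative assumptions):

```python
import random

def sgd_step(data, intercept, slope, learning_rate=0.01, batch_size=1):
    """One stochastic gradient descent update using a random subset of the data."""
    batch = random.sample(data, batch_size)
    d_int = sum(-2 * (y - (intercept + slope * x)) for x, y in batch)
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in batch)
    return intercept - learning_rate * d_int, slope - learning_rate * d_slope
```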

Gradient Descent
• Let us generalize the cost function and fit both intercept and slope
• Consider the cost function as the Mean Squared Error (MSE):
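In its usual form (with ŷᵢ the model's prediction for sample i):

```latex
MSE = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - \hat{y}_i \bigr)^2
```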

Where N is the total number of instances/samples.


• Our model here can be described as ŷ = ax + b, where
• a is the slope (to change the steepness),
• b is the bias (to move the line up and down the graph),
• x is the explanatory variable, and
• y is the output.
Gradient Descent
Let us compute the gradients with respect to a and b
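For the MSE above, with the model ŷ = ax + b, these gradients are:

```latex
\frac{\partial\,MSE}{\partial a} = -\frac{2}{N} \sum_{i=1}^{N} x_i \bigl( y_i - (a x_i + b) \bigr),
\qquad
\frac{\partial\,MSE}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \bigl( y_i - (a x_i + b) \bigr)
```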
Gradient Descent
• Computing new intercept and slope
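The usual updates are:

```latex
a \leftarrow a - \eta \, \frac{\partial\,MSE}{\partial a},
\qquad
b \leftarrow b - \eta \, \frac{\partial\,MSE}{\partial b}
```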

where η is the learning rate and each gradient term gives the change in the associated parameter.


Gradient Descent

• Compute the SSE for every new intercept and slope


until SSE <= 0.001 or we reach 1000 iterations
Gradient Descent for the previous
example
Learning rate=0.01
Additional Content

• https://ml-
cheatsheet.readthedocs.io/en/latest/linear_regressio
n.html
Loss Functions

• https://heartbeat.fritz.ai/5-regression-loss-functions-
all-machine-learners-should-know-4fb140e9d4b0
Multi Dimension Linear Regression
• Each training sample has an x made up of multiple input values and
a corresponding y with a single value.
• The inputs can be represented as an X matrix in which each row is a sample and each column is a dimension.
• The outputs can be represented as a y matrix in which each row is a sample.
Multi Dimension Linear Regression

• Our predicted y values are calculated by multiplying the X matrix by a matrix of weights, w.
• If there are 2 dimensions, then this equation defines a plane. If there are more dimensions, then it defines a hyper-plane.
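In matrix form (a sketch of the convention described above, with m samples, D input dimensions and a single output per sample):

```latex
\hat{\mathbf{y}} = X \mathbf{w},
\qquad X \in \mathbb{R}^{m \times D}, \quad \mathbf{w} \in \mathbb{R}^{D}, \quad \hat{\mathbf{y}} \in \mathbb{R}^{m}
```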
Multi Dimension Linear Regression as Solving
System of Linear Equations

Generally, it is an over-determined system


Linear Models for Simple Regression

Multi Dimension Least Square Linear
Regression
• We want a plane or hyper-plane
that minimises the error between
the y values in training samples and
the y values that the plane or hyper-
plane passes through.
• Or put another way, we want the plane/hyper-plane that “best fits” the training samples.
• So we define the error function for
our algorithm so we can minimise
that error.

Use the following rules of
differentiation
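The matrix-calculus rules typically needed here, stated without proof (A symmetric):

```latex
\frac{\partial}{\partial \mathbf{w}} \bigl( \mathbf{a}^{\mathsf T} \mathbf{w} \bigr) = \mathbf{a},
\qquad
\frac{\partial}{\partial \mathbf{w}} \bigl( \mathbf{w}^{\mathsf T} A \mathbf{w} \bigr) = 2 A \mathbf{w}
```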
Multi Dimension Least Square Linear
Regression

• To determine the value of w that minimises the error


E, we look for where the differential of E with
respect to w is zero.
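Assuming E is the sum of squared errors ‖y - Xw‖², setting its derivative to zero gives the familiar normal-equation solution:

```latex
\frac{\partial E}{\partial \mathbf{w}} = -2 X^{\mathsf T}(\mathbf{y} - X\mathbf{w}) = 0
\;\;\Rightarrow\;\;
\mathbf{w} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} \mathbf{y}
```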

Multi Dimension Least Square Linear
Regression

• This closed-form solution is nice but has some issues
• The D × D matrix XᵀX may not be invertible
• Based solely on minimising the error on the training data
• Can overfit the training data
• Inversion is expensive for large D: we can use iterative optimisation techniques
• Hence we move to the iterative gradient descent technique
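A brief numpy sketch of both routes on illustrative toy data (the data, sizes and learning rate are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 samples, D = 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

# Closed form via least squares (avoids explicitly inverting X^T X)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Iterative gradient descent on the same sum-of-squares error
w = np.zeros(3)
learning_rate = 0.001
for _ in range(5000):
    grad = -2 * X.T @ (y - X @ w)
    w -= learning_rate * grad

print(w_closed, w)  # both should land close to [1.0, -2.0, 0.5]
```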
Multi Dimension Linear Regression

• In addition to using the X matrix to represent the basic features of our training data, we can also introduce additional dimensions (i.e. columns in our X matrix) that are derived from those basic feature values.
• If we introduce derived features whose values are powers of basic features, our multi-dimensional linear regression can then derive polynomial curves, planes and hyper-planes.
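A sketch of such derived features with numpy, fitting a quadratic by adding an x² column (the data here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = 2.0 + 1.5 * x - 0.7 * x**2 + 0.2 * rng.normal(size=50)

# Design matrix: bias column plus powers of the basic feature x
X = np.column_stack([np.ones_like(x), x, x**2])

w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [2.0, 1.5, -0.7]
```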
Linear Models for Multi-output
Regression

References

• https://towardsdatascience.com/understanding-the-
mathematics-behind-gradient-descent-dde5dc9be06e
• https://www.youtube.com/watch?v=sDv4f4s2SB8

