• Learning
• A supervised algorithm that learns from a set of training samples.
• Each training sample has one or more input values and a single output
value.
• The algorithm learns the line, plane or hyper-plane that best fits the
training samples.
• Prediction
• Use the learned line, plane or hyper-plane to predict the output value for any input sample, as in the sketch below.
Roadmap
How do we predict with only one variable?
• "Goodness of fit" is based on the variability of the tip amount around the fitted line.
• Measuring the deviation: square each residual and add them up, giving the Sum of Squared Errors (SSE).
Simple Linear Regression
• Measure the distance from the line to each data point, square the distances, and add them up.
• The distance from the line to a data point is called a "residual".
• Rotate the line, measure the residuals, square them, and then sum up the squares (see the sketch below).
• Since the slope is not 0, knowing a mouse's weight helps us make a guess about that mouse's size.
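A minimal sketch of this procedure in code, reusing the weight/size points that appear later in the deck ("rotating the line" is approximated here by trying a few candidate slopes, as an illustration rather than the actual fitting algorithm):

```python
import numpy as np

# Hypothetical mouse data: weight (x) and size (y).
weight = np.array([0.5, 2.3, 2.9])
size = np.array([1.4, 1.9, 3.2])

def sse(intercept, slope):
    """Residuals are the vertical distances from the line to the data;
    square them and add them up."""
    residuals = size - (intercept + slope * weight)
    return np.sum(residuals ** 2)

# "Rotate the line": try several slopes and keep the one with the smallest SSE.
for slope in [0.0, 0.3, 0.64, 1.0]:
    print(f"slope={slope:.2f}  SSE={sse(intercept=0.95, slope=slope):.3f}")
```

Among these candidates, the slope 0.64 gives the smallest SSE, matching the fitted line used in the gradient descent example later.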
Single Dimension Linear Regression
Single Dimension Least Square Linear Regression
• We want a line that minimises the error between the y values in the training samples and the y values that the line passes through.
• Or, put another way, we want the line that "best fits" the training samples.
• So we define an error function for our algorithm so that we can minimise that error. This is called the Sum of Squared Errors (SSE), written out below.
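In symbols, with training samples $(x_i, y_i)$ and the line $y = a + bx$ (variable names chosen to match the next slide), the error to minimise is:

```latex
E = \sum_{i=1}^{N} \bigl( y_i - (a + b x_i) \bigr)^2
```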
Single Dimension Least Square Linear Regression
• To determine the values of a and b that minimise the error E, we look for where the partial derivatives of E with respect to a and b are zero, as solved below.
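Setting both partial derivatives to zero:

```latex
\frac{\partial E}{\partial a} = -2 \sum_{i=1}^{N} \bigl( y_i - (a + b x_i) \bigr) = 0,
\qquad
\frac{\partial E}{\partial b} = -2 \sum_{i=1}^{N} x_i \bigl( y_i - (a + b x_i) \bigr) = 0
```

and solving gives the standard least-squares estimates:

```latex
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}
```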
Single Dimension Least Square Linear Regression - Numerical Example
• ∑(xᵢ − x̄)² = 94.18
• b = ∑(xᵢ − x̄)(yᵢ − ȳ) / ∑(xᵢ − x̄)² = 611.36 / 94.18 = 6.491
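The same computation as a code sketch (the slide's raw data is not shown, so the function below takes arbitrary arrays and only the final arithmetic is reproduced):

```python
import numpy as np

def least_squares_slope(x, y):
    """b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2)."""
    xd, yd = x - x.mean(), y - y.mean()
    return np.sum(xd * yd) / np.sum(xd ** 2)

# Plugging in the slide's sums directly:
print(611.36 / 94.18)  # 6.491..., matching b above
```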
Single Dimension Least Square Linear Regression
• R squared, or more generally: R² = 1 − (SSE / SST)
• A value of 1 indicates a perfect fit.
• A value of 0 indicates a fit that is no better than simply predicting the mean of the y values.
• A negative value indicates a fit that is even worse than just predicting the mean of the y values. A code sketch follows below.
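A minimal sketch of the R² computation, reusing the hypothetical data from earlier (SST is the total sum of squares around the mean of y):

```python
import numpy as np

x = np.array([0.5, 2.3, 2.9])
y = np.array([1.4, 1.9, 3.2])

# Fitted line (the optimised values found later in the deck).
a, b = 0.95, 0.64
y_hat = a + b * x

sse = np.sum((y - y_hat) ** 2)     # error around the fitted line
sst = np.sum((y - y.mean()) ** 2)  # error around the mean of y
print(f"R^2 = {1 - sse / sst:.3f}")
```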
Gradient Descent
General Linear Regression
Gradient Descent by Example
• Predicted Height = Intercept + 0.64 × Weight
• For all data points, the sum of squared residuals is (1.1)² + (0.4)² + (1.3)² ≈ 3.1
• Evaluate the sum of squared residuals at intercept 0, at intercept 0.25, and at increasing values of the intercept.
• Use gradient descent to find the optimal value for the intercept, starting from a random value.
• Sum of squared residuals (sketched in code below) =
(1.4 − (intercept + 0.64 × 0.5))² + (1.9 − (intercept + 0.64 × 2.3))² + (3.2 − (intercept + 0.64 × 2.9))²
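A sketch of this search, holding the slope at 0.64 and evaluating the sum of squared residuals at a few intercepts (the grid of intercept values is an assumption for illustration):

```python
import numpy as np

weight = np.array([0.5, 2.3, 2.9])
height = np.array([1.4, 1.9, 3.2])
slope = 0.64  # held fixed while we search over the intercept

def ssr(intercept):
    """Sum of squared residuals for height = intercept + slope * weight."""
    residuals = height - (intercept + slope * weight)
    return np.sum(residuals ** 2)

for intercept in [0.0, 0.25, 0.5, 0.75, 0.95]:
    print(f"intercept={intercept:.2f}  SSR={ssr(intercept):.3f}")
```

At intercept 0 this reproduces the ≈ 3.1 figure above (the slide rounds each residual to one decimal place).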
Gradient Descent by Example
• When intercept = 0:
d(sum of squared residuals)/d(intercept) =
−2 × (1.4 − (0 + 0.64 × 0.5)) + −2 × (1.9 − (0 + 0.64 × 2.3)) + −2 × (3.2 − (0 + 0.64 × 2.9))
• The slope of this curve is −5.7.
• The size of the step should be related to the slope: it tells us when to take a baby step and when to take a big step, while making sure the big step is not too big.
• What is the step size? The slope multiplied by a small learning rate, as in the sketch below.
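A sketch of the complete loop for the intercept alone; the learning rate and stopping threshold are assumptions, since the slides do not show the values used:

```python
import numpy as np

weight = np.array([0.5, 2.3, 2.9])
height = np.array([1.4, 1.9, 3.2])
slope = 0.64

def d_ssr_d_intercept(intercept):
    """Derivative of the sum of squared residuals w.r.t. the intercept."""
    return np.sum(-2 * (height - (intercept + slope * weight)))

learning_rate = 0.1
intercept = 0.0  # starting value
for _ in range(100):
    gradient = d_ssr_d_intercept(intercept)  # -5.7 on the first pass
    step_size = learning_rate * gradient
    if abs(step_size) < 0.001:               # stop when the steps get tiny
        break
    intercept -= step_size

print(f"intercept ≈ {intercept:.2f}")        # ≈ 0.95
```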
Gradient Descent by Example: Next Iterations
• Repeat the update, recomputing the derivative at the new intercept each time, until the step size becomes very small.
Estimate Both Slope and Intercept
• Sum of squared residuals =
(1.4 − (intercept + slope × 0.5))² + (1.9 − (intercept + slope × 2.3))² + (3.2 − (intercept + slope × 2.9))²
Contd.
• For the intercept and for the slope, take the partial derivative of the sum of squared residuals with respect to that parameter:
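By the chain rule (with xᵢ the weights and yᵢ the heights), these are:

```latex
\frac{\partial\,\mathrm{SSR}}{\partial\,\text{intercept}}
  = \sum_{i} -2 \bigl( y_i - (\text{intercept} + \text{slope}\cdot x_i) \bigr),
\qquad
\frac{\partial\,\mathrm{SSR}}{\partial\,\text{slope}}
  = \sum_{i} -2\, x_i \bigl( y_i - (\text{intercept} + \text{slope}\cdot x_i) \bigr)
```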
Optimized Slope and Intercept
• Intercept = 0.95
• Slope = 0.64
• Gradient descent works the same way for any loss function; see the sketch below.
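A sketch of gradient descent over both parameters at once; the starting point, learning rate, and iteration count are assumptions, but it converges to the values above:

```python
import numpy as np

weight = np.array([0.5, 2.3, 2.9])
height = np.array([1.4, 1.9, 3.2])

intercept, slope = 0.0, 1.0  # arbitrary starting values
learning_rate = 0.01

for _ in range(10_000):
    residuals = height - (intercept + slope * weight)
    # Partial derivatives of the sum of squared residuals:
    d_intercept = np.sum(-2 * residuals)
    d_slope = np.sum(-2 * weight * residuals)
    intercept -= learning_rate * d_intercept
    slope -= learning_rate * d_slope

print(f"intercept ≈ {intercept:.2f}, slope ≈ {slope:.2f}")  # ≈ 0.95, 0.64
```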
Algorithm
Gradient Descent
• Let us generalise the cost function and fit both the intercept and the slope.
• Consider the cost function as the Mean Squared Error (MSE), shown below.
• https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html
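With N samples and the line y = mx + b, the MSE cost and its gradients (the formulation used in the linked cheat sheet) are:

```latex
J(m, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - (m x_i + b) \bigr)^2
```

```latex
\frac{\partial J}{\partial m} = \frac{1}{N} \sum_{i=1}^{N} -2\, x_i \bigl( y_i - (m x_i + b) \bigr),
\qquad
\frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} -2 \bigl( y_i - (m x_i + b) \bigr)
```

The only change from the sum of squared residuals used earlier is the 1/N factor, which makes the cost independent of the number of samples; the gradient descent updates are otherwise identical.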
Loss Functions
• https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0
Multi Dimension Linear Regression
• Each training sample has an x made up of multiple input values and a corresponding y with a single value.
• The inputs can be represented as an X matrix in which each row is a sample and each column is a dimension.
• The outputs can be represented as a y matrix (a single column) in which each row is a sample, as in the sketch below.
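A small sketch of this representation in NumPy (the numbers are placeholders):

```python
import numpy as np

# Three samples, two input dimensions: one row per sample, one column per dimension.
X = np.array([[0.5, 1.0],
              [2.3, 0.8],
              [2.9, 1.5]])

# One output value per sample, one row each.
y = np.array([[1.4],
              [1.9],
              [3.2]])

print(X.shape, y.shape)  # (3, 2) (3, 1)
```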
Multi Dimension Least Square Linear Regression
• We want a plane or hyper-plane that minimises the error between the y values in the training samples and the y values that the plane or hyper-plane passes through.
• Or, put another way, we want the plane/hyper-plane that "best fits" the training samples.
• So we define the error function for our algorithm so we can minimise that error, as written below.
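With a weight vector w, and the bias folded into X as a column of ones (a common convention, assumed here), the error function is:

```latex
E = \sum_{i=1}^{N} \bigl( y_i - x_i^{\top} w \bigr)^2 = (y - Xw)^{\top}(y - Xw)
```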
Use the following rules of differentiation:
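These are presumably the standard matrix-calculus rules, stated here for a constant vector v and matrix A:

```latex
\frac{\partial}{\partial x}\bigl( v^{\top} x \bigr) = v,
\qquad
\frac{\partial}{\partial x}\bigl( x^{\top} A x \bigr) = (A + A^{\top})\,x = 2Ax \ \text{ if } A \text{ is symmetric}
```

Applying them to E = yᵀy − 2wᵀXᵀy + wᵀXᵀXw and setting ∂E/∂w = 0 yields the closed-form solution w = (XᵀX)⁻¹Xᵀy discussed next.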
Multi Dimension Least Square Linear Regression
• This closed-form solution is nice but has some issues:
• The D × D matrix XᵀX may not be invertible.
• It is based solely on minimising the error on the training data, so it can overfit the training data.
• Inversion is expensive for large D; we can use iterative optimisation techniques instead.
• Hence we move to an iterative gradient descent technique; see the sketch below.
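A sketch contrasting the closed form with a more robust solver; `np.linalg.lstsq` avoids forming the explicit inverse and behaves sensibly even when XᵀX is ill-conditioned (the data is the placeholder from above, with a bias column added):

```python
import numpy as np

X = np.array([[0.5, 1.0],
              [2.3, 0.8],
              [2.9, 1.5]])
y = np.array([1.4, 1.9, 3.2])

# Fold the bias into X as a leading column of ones.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Closed form: w = (X^T X)^-1 X^T y -- fails if X^T X is not invertible.
w_closed = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

# Preferred in practice: a least-squares solver, no explicit inverse.
w_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(w_closed, w_lstsq)  # the two agree whenever X^T X is invertible
```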
Linear Models for Multi-output Regression