
Lecture 8: 2 February, 2023

Madhavan Mukund
https://www.cmi.ac.in/~madhavan

Data Mining and Machine Learning


January–April 2023
Linear regression

Training input is {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
Each input x_i is a vector (x_i1, ..., x_ik)
Add x_i0 = 1 by convention
y_i is the actual output

How far away is our prediction h_θ(x_i) from the true answer y_i?
Define a cost (loss) function

J(θ) = (1/2) Σ_{i=1}^{n} (h_θ(x_i) - y_i)²

Essentially, the sum squared error (SSE)
Justified via MLE
Divide by n to get the mean squared error (MSE)
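As a concrete illustration (not part of the slides), here is a minimal NumPy sketch of the SSE and MSE costs for a linear hypothesis h_θ(x_i) = θ · x_i, assuming the matrix X already contains the constant column x_i0 = 1; the names X, y, theta and the toy data are illustrative.

import numpy as np

def sse_cost(theta, X, y):
    # h_θ(x_i) = θ · x_i for every row of X (X includes the x_i0 = 1 column)
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)          # J(θ) = (1/2) Σ (h_θ(x_i) - y_i)²

def mse_cost(theta, X, y):
    return sse_cost(theta, X, y) / len(y)        # divide by n for the mean squared error

# Illustrative usage with made-up data
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # column of 1s, then one feature
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 1.0])
print(sse_cost(theta, X, y), mse_cost(theta, X, y))  # both 0.0 for this exact fit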
The non-linear case

What if the relationship is not linear?
Here the best possible explanation seems to be a quadratic
Non-linear: cross dependencies
Input x_i: (x_i1, x_i2)
Quadratic dependencies:

y = θ_0 + θ_1 x_i1 + θ_2 x_i2            (linear part)
      + θ_11 x_i1² + θ_22 x_i2² + θ_12 x_i1 x_i2   (quadratic part)
The non-linear case

Recall how we fit a line:

[ 1  x_i1 ] [ θ_0 ]
            [ θ_1 ]

For the quadratic case, add new coefficients and expand the parameter vector:

[ 1  x_i1  x_i1² ] [ θ_0 ]
                   [ θ_1 ]
                   [ θ_2 ]


The non-linear case

Input (x_i1, x_i2)
For the general quadratic case, we are adding new derived "features":

x_i3 = x_i1²
x_i4 = x_i2²
x_i5 = x_i1 x_i2


The non-linear case

Original input matrix:

[ 1  x_11  x_12 ]
[ 1  x_21  x_22 ]
[      ...      ]
[ 1  x_i1  x_i2 ]
[      ...      ]
[ 1  x_n1  x_n2 ]


The non-linear case

Expanded input matrix:

[ 1  x_11  x_12  x_11²  x_12²  x_11 x_12 ]
[ 1  x_21  x_22  x_21²  x_22²  x_21 x_22 ]
[                  ...                   ]
[ 1  x_i1  x_i2  x_i1²  x_i2²  x_i1 x_i2 ]
[                  ...                   ]
[ 1  x_n1  x_n2  x_n1²  x_n2²  x_n1 x_n2 ]

New columns are computed and filled in from the original inputs
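A minimal NumPy sketch of this expansion for two original features; the function name expand_quadratic is made up for the example (scikit-learn's PolynomialFeatures provides the same kind of expansion off the shelf).

import numpy as np

def expand_quadratic(X):
    # Turn each row (x_i1, x_i2) into (1, x_i1, x_i2, x_i1², x_i2², x_i1·x_i2)
    x1, x2 = X[:, 0], X[:, 1]
    ones = np.ones(len(X))
    return np.column_stack([ones, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(expand_quadratic(X))
# [[ 1.  1.  2.  1.  4.  2.]
#  [ 1.  3.  4.  9. 16. 12.]]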


Exponential parameter blow-up

For an input with three features (x_i1, x_i2, x_i3), the derived features are:

Cubic: x_i1³, x_i2³, x_i3³, x_i1² x_i2, x_i1² x_i3, x_i2² x_i1, x_i2² x_i3, x_i3² x_i1, x_i3² x_i2, x_i1 x_i2 x_i3

Quadratic: x_i1², x_i2², x_i3², x_i1 x_i2, x_i1 x_i3, x_i2 x_i3

Linear: x_i1, x_i2, x_i3
Higher degree polynomials

How complex a polynomial should we try?
Aim for the degree that minimizes SSE
As the degree increases, the number of derived features explodes exponentially


Overfitting

Need to be careful about adding higher degree terms
For n training points, we can always fit a polynomial of degree (n - 1) exactly
However, such a curve would not generalize well to new data points
Overfitting — model fits the training data well, but performs poorly on unseen data
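As an illustrative demonstration (not from the slides), the sketch below interpolates n = 6 noisy, roughly linear points with a degree n - 1 polynomial via np.polyfit and compares it to a plain linear fit at an unseen point; the data are made up.

import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 6)                    # n = 6 training points
y_train = 2 * x_train + rng.normal(0, 0.1, 6)     # roughly linear data with a little noise

exact = np.polyfit(x_train, y_train, deg=5)       # degree n - 1: fits the training points exactly
line = np.polyfit(x_train, y_train, deg=1)        # simple straight-line fit

x_new = 1.2                                       # an unseen input
print(np.polyval(exact, x_new))   # usually far from the underlying value 2 * 1.2 = 2.4
print(np.polyval(line, x_new))    # close to 2.4: the simpler model generalizes better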


Regularization

Need to trade off SSE against curve complexity
So far, the only cost has been SSE
Add a cost related to the parameters (θ_0, θ_1, ..., θ_k)
Minimize, for instance,

(1/2) Σ_{i=1}^{n} (z_i - y_i)²  +  Σ_{j=1}^{k} θ_j²

The first term is the SSE; the second term, over the coefficients, penalizes curve complexity
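A minimal NumPy sketch of this penalized objective, assuming z_i is the model's prediction X @ theta for row i; the function name and the toy numbers are illustrative, and the penalty is added with weight 1 exactly as written above.

import numpy as np

def regularized_cost(theta, X, y):
    z = X @ theta                            # predictions z_i
    sse = 0.5 * np.sum((z - y) ** 2)         # (1/2) Σ (z_i - y_i)²
    penalty = np.sum(theta[1:] ** 2)         # Σ_{j=1..k} θ_j² (θ_0 is not penalized, since j starts at 1)
    return sse + penalty

X = np.array([[1.0, 2.0, 4.0],
              [1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0]])              # (expanded) input matrix with a leading column of 1s
y = np.array([3.0, 0.5, 1.5])
theta = np.array([0.5, 1.0, 0.1])
print(regularized_cost(theta, X, y))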
Regularization

Variations on regularization: change the contribution of the coefficients to the loss function

Ridge regression: coefficients contribute Σ_{j=1}^{k} θ_j²

LASSO regression: coefficients contribute Σ_{j=1}^{k} |θ_j|

Elastic net regression: coefficients contribute Σ_{j=1}^{k} (λ_1 |θ_j| + λ_2 θ_j²)
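For completeness, a short sketch of these three penalties using scikit-learn's off-the-shelf estimators Ridge, Lasso and ElasticNet; the toy data and the regularization settings alpha and l1_ratio are illustrative choices.

import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.9, 4.2, 5.8])

ridge = Ridge(alpha=1.0).fit(X, y)                    # squared (θ_j²) penalty
lasso = Lasso(alpha=0.1).fit(X, y)                    # absolute value (|θ_j|) penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of the two penalties

print(ridge.coef_, lasso.coef_, enet.coef_)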
The non-polynomial case

Percentage of urban population as a function of per capita GDP
Not clear what polynomial would be reasonable
Take the log of GDP
The regression we are computing is

y = θ_0 + θ_1 log x_1


The non-polynomial case

Reverse the relationship: plot per capita GDP in terms of the percentage of urbanization
Now we take the log of the output variable:

log y = θ_0 + θ_1 x_1

This is a log-linear transformation; the earlier one was linear-log
Can also use a log-log transformation
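A minimal NumPy sketch of fitting the linear-log and log-linear forms by ordinary least squares on the transformed variables; the arrays gdp and urban_pct are made-up stand-ins for the data plotted on the slides.

import numpy as np

# Illustrative stand-in data: per capita GDP and percentage of urban population
gdp = np.array([500.0, 1500.0, 5000.0, 15000.0, 40000.0])
urban_pct = np.array([25.0, 38.0, 55.0, 70.0, 82.0])

# Linear-log: urban_pct ≈ θ_0 + θ_1 log(gdp)
theta1, theta0 = np.polyfit(np.log(gdp), urban_pct, deg=1)
print(theta0, theta1)

# Log-linear (reversed relationship): log(gdp) ≈ θ_0 + θ_1 urban_pct
phi1, phi0 = np.polyfit(urban_pct, np.log(gdp), deg=1)
print(phi0, phi1)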


Regression for classification

Fit a regression line and set a threshold
Classifier:
  Output below the threshold: 0 (No)
  Output above the threshold: 1 (Yes)

The classifier output is a step function
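An illustrative sketch of this thresholding step, assuming we already have a fitted regression line ŷ = θ_0 + θ_1 x; the coefficients and the threshold value are made up.

import numpy as np

theta0, theta1 = -0.8, 0.4         # assumed fitted regression coefficients
threshold = 0.5

x = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = theta0 + theta1 * x                   # regression output
labels = (y_hat >= threshold).astype(int)     # step function: 0 below the threshold, 1 above
print(y_hat, labels)                          # [-0.4  0.   0.4  0.8] [0 0 0 1]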


Smoothen the step

Sigmoid function:

σ(z) = 1 / (1 + e^(-z))

The input z is the output of our regression:

σ(z) = 1 / (1 + e^(-(θ_0 + θ_1 x_1 + ... + θ_k x_k)))

Adjust the parameters to fix the horizontal position and the steepness of the step
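A small NumPy sketch of the sigmoid applied to a one-feature regression output, showing how θ_0 and θ_1 control where the step sits and how steep it is; the parameter values are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-5, 5, 5)

# θ_1 sets the steepness; the step is centred where θ_0 + θ_1 x = 0
print(sigmoid(0.0 + 1.0 * x))    # gentle step centred at x = 0
print(sigmoid(-4.0 + 4.0 * x))   # much steeper step, centred at x = 1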


Logistic regression

How do we compute the coefficients? Solve by gradient descent
Gradient descent needs derivatives to exist, hence the smooth sigmoid rather than the step function
Check that

σ'(z) = σ(z)(1 - σ(z))

We still need a cost function to minimize
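A quick numerical check of this identity (not on the slides): compare a finite-difference estimate of σ'(z) with σ(z)(1 - σ(z)) at a few points.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numerical = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # finite-difference estimate of σ'(z)
identity = sigmoid(z) * (1 - sigmoid(z))                  # σ(z)(1 - σ(z))
print(np.max(np.abs(numerical - identity)))               # ≈ 0, up to rounding error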


MSE for logistic regression and gradient descent

Suppose we take mean squared error as the loss function:

C = (1/n) Σ_{i=1}^{n} (y_i - σ(z_i))²,  where z_i = θ_0 + θ_1 x_i1 + θ_2 x_i2

For gradient descent, we compute ∂C/∂θ_1, ∂C/∂θ_2, ∂C/∂θ_0

Consider two inputs x = (x_1, x_2). For j = 1, 2,

∂C/∂θ_j = -(2/n) Σ_{i=1}^{n} (y_i - σ(z_i)) · ∂σ(z_i)/∂θ_j
        = (2/n) Σ_{i=1}^{n} (σ(z_i) - y_i) · (∂σ(z_i)/∂z_i) · (∂z_i/∂θ_j)
        = (2/n) Σ_{i=1}^{n} (σ(z_i) - y_i) σ'(z_i) x_ij

∂C/∂θ_0 = (2/n) Σ_{i=1}^{n} (σ(z_i) - y_i) · (∂σ(z_i)/∂z_i) · (∂z_i/∂θ_0)
        = (2/n) Σ_{i=1}^{n} (σ(z_i) - y_i) σ'(z_i)
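A minimal NumPy sketch of these MSE gradients for the two-feature logistic model, assuming X has columns (1, x_i1, x_i2) so that the first column reproduces the ∂C/∂θ_0 case; the function name and toy data are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradient(theta, X, y):
    # X has columns (1, x_i1, x_i2), so z_i = θ_0 + θ_1 x_i1 + θ_2 x_i2
    s = sigmoid(X @ theta)
    # (2/n) Σ (σ(z_i) - y_i) σ'(z_i) x_ij, with x_i0 = 1 giving the θ_0 component
    return (2 / len(y)) * X.T @ ((s - y) * s * (1 - s))

X = np.array([[1.0, 0.5, 1.0],
              [1.0, -1.0, 2.0],
              [1.0, 2.0, -0.5]])
y = np.array([1.0, 0.0, 1.0])
print(mse_gradient(np.zeros(3), X, y))   # gradient at θ = (0, 0, 0)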
MSE for logistic regression and gradient descent ...

For j = 1, 2,  ∂C/∂θ_j = (2/n) Σ_{i=1}^{n} (σ(z_i) - y_i) σ'(z_i) x_ij,  and  ∂C/∂θ_0 = (2/n) Σ_{i=1}^{n} (σ(z_i) - y_i) σ'(z_i)

Each term in ∂C/∂θ_1, ∂C/∂θ_2, ∂C/∂θ_0 is proportional to σ'(z_i)
Ideally, gradient descent should take large steps when σ(z) - y is large
But σ(z) is flat at both extremes
If σ(z) is completely wrong, that is σ(z) ≈ (1 - y), we still have σ'(z) ≈ 0
So learning is slow even when the current model is far from optimal
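A tiny numerical illustration (not from the slides) of this flatness: for a confidently wrong prediction, σ'(z) is nearly zero, so the per-example MSE gradient term almost vanishes even though the error is large.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0            # true label
z = -6.0           # model is confidently wrong: σ(z) ≈ 0 while y = 1
s = sigmoid(z)
print(s)                        # ≈ 0.0025, so the error σ(z) - y ≈ -1 is as large as it gets
print(s * (1 - s))              # σ'(z) ≈ 0.0025: the derivative factor is tiny
print((s - y) * s * (1 - s))    # MSE gradient term ≈ -0.0025, so learning is very slow here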


Loss function for logistic regression

Goal is to maximize the log likelihood
Let h_θ(x_i) = σ(z_i). So P(y_i = 1 | x_i; θ) = h_θ(x_i) and P(y_i = 0 | x_i; θ) = 1 - h_θ(x_i)
Combine these as

P(y_i | x_i; θ) = h_θ(x_i)^(y_i) · (1 - h_θ(x_i))^(1 - y_i)

Likelihood:

L(θ) = Π_{i=1}^{n} h_θ(x_i)^(y_i) · (1 - h_θ(x_i))^(1 - y_i)

Log-likelihood:

ℓ(θ) = Σ_{i=1}^{n} [ y_i log h_θ(x_i) + (1 - y_i) log(1 - h_θ(x_i)) ]

Equivalently, minimize the cross entropy:

- Σ_{i=1}^{n} [ y_i log h_θ(x_i) + (1 - y_i) log(1 - h_θ(x_i)) ]
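A minimal NumPy sketch of this cross-entropy loss; the small clipping constant eps is an implementation detail (not on the slides) to keep log() away from 0.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y, eps=1e-12):
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)   # h_θ(x_i) = σ(z_i), kept strictly inside (0, 1)
    # - Σ [ y_i log h_θ(x_i) + (1 - y_i) log(1 - h_θ(x_i)) ]
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
print(cross_entropy(np.zeros(2), X, y))   # 3·log 2 ≈ 2.079 when every prediction is 0.5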
Cross entropy and gradient descent

C = -[ y ln(σ(z)) + (1 - y) ln(1 - σ(z)) ]

∂C/∂θ_j = (∂C/∂σ) · (∂σ/∂θ_j)
        = -[ y/σ(z) - (1 - y)/(1 - σ(z)) ] · ∂σ/∂θ_j
        = -[ y/σ(z) - (1 - y)/(1 - σ(z)) ] · (∂σ/∂z) · (∂z/∂θ_j)
        = -[ y/σ(z) - (1 - y)/(1 - σ(z)) ] · σ'(z) x_j
        = -[ (y(1 - σ(z)) - (1 - y)σ(z)) / (σ(z)(1 - σ(z))) ] · σ'(z) x_j


Cross entropy and gradient descent ...

∂C/∂θ_j = -[ (y(1 - σ(z)) - (1 - y)σ(z)) / (σ(z)(1 - σ(z))) ] · σ'(z) x_j

Recall that σ'(z) = σ(z)(1 - σ(z))

Therefore,

∂C/∂θ_j = -[ y(1 - σ(z)) - (1 - y)σ(z) ] x_j
        = -[ y - yσ(z) - σ(z) + yσ(z) ] x_j
        = (σ(z) - y) x_j

Similarly, ∂C/∂θ_0 = (σ(z) - y)

Thus, as we wanted, the gradient is proportional to σ(z) - y
The greater the error, the faster the learning
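To tie this together, a minimal gradient-descent sketch for logistic regression that uses exactly this cross-entropy gradient, (σ(z_i) - y_i) x_ij, summed over the training set; the toy data, learning rate and iteration count are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    # X has a leading column of 1s, so theta[0] plays the role of θ_0
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y)   # Σ_i (σ(z_i) - y_i) x_ij for every j
        theta -= lr * grad
    return theta

# Illustrative, linearly separable toy data
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
print(theta)
print((sigmoid(X @ theta) >= 0.5).astype(int))   # recovers the labels 0 0 1 1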
