Lecture 8: 2 February 2023 (DMML Jan–Apr 2023)
Madhavan Mukund
https://www.cmi.ac.in/~madhavan
Each input $x_i$ is a vector $(x_{i1}, \ldots, x_{ik})$
Add $x_{i0} = 1$ by convention
$y_i$ is the actual output
Derived polynomial features of the inputs:
Linear: $x_{i1}, x_{i2}, x_{i3}$
Quadratic: $x_{i1}^2, x_{i2}^2, x_{i3}^2,\; x_{i1}x_{i2}, x_{i1}x_{i3}, x_{i2}x_{i3}, \ldots$
Cubic: $x_{i1}^3, x_{i2}^3, x_{i3}^3, \ldots$
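For concreteness, here is a minimal sketch of building this feature expansion with scikit-learn's PolynomialFeatures; the array `X` and the choice of degree are illustrative and not from the lecture.

```python
# Minimal sketch (assumed data): expand 3 raw inputs into polynomial features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0],     # each row is one input (x_i1, x_i2, x_i3)
              [0.5, 1.5, 2.5]])

# degree=3 produces the linear, quadratic (squares and cross terms) and cubic
# features; include_bias=True adds the constant feature x_i0 = 1
poly = PolynomialFeatures(degree=3, include_bias=True)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())  # '1', 'x0', 'x1', ..., 'x0 x1', 'x0^3', ...
print(X_poly.shape)                  # (2, 20): 20 monomials of degree <= 3
```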
Higher degree polynomials
Ridge regression: coefficients contribute $\sum_{j=1}^{k} \theta_j^2$
LASSO regression: coefficients contribute $\sum_{j=1}^{k} |\theta_j|$
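As a rough illustration (the data and alpha values below are invented), scikit-learn's Ridge and Lasso estimators add exactly these two kinds of penalty, up to scaling conventions:

```python
# Sketch: same synthetic data, squared (L2) penalty vs absolute-value (L1) penalty.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # penalty: alpha * sum_j theta_j^2
lasso = Lasso(alpha=0.1).fit(X, y)   # penalty: alpha * sum_j |theta_j|

print("ridge:", np.round(ridge.coef_, 3))  # all coefficients shrunk a little
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant coefficients driven to 0
```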
The non-polynomial case
Percentage of urban population as a function of per capita GDP
Not clear what polynomial would be reasonable
Take log of GDP
The regression we are computing is $y = \theta_0 + \theta_1 \log x_1$
(Figure: data with fitted regression line)
Classifier: set a threshold
Output below threshold: 0 (No)
Output above threshold: 1 (Yes)
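A small end-to-end sketch of both steps, fitting $y = \theta_0 + \theta_1 \log x_1$ and then thresholding the output; the GDP figures and the threshold are made up for illustration.

```python
# Sketch (made-up numbers): regression on log(GDP), then a threshold classifier.
import numpy as np
from sklearn.linear_model import LinearRegression

gdp = np.array([500.0, 1500.0, 5000.0, 20000.0, 60000.0])  # per capita GDP
urban = np.array([20.0, 35.0, 55.0, 75.0, 85.0])           # % urban population

# y = theta_0 + theta_1 * log(x_1): fit on log-transformed GDP
model = LinearRegression().fit(np.log(gdp).reshape(-1, 1), urban)

threshold = 50.0
pred = model.predict(np.log([[3000.0], [40000.0]]))
labels = (pred > threshold).astype(int)  # 1 (Yes) above threshold, 0 (No) below
print(pred, labels)
```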
Sigmoid function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Consider two inputs $x = (x_1, x_2)$, so $z_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}$, with squared-error cost $C = \frac{1}{n}\sum_{i=1}^{n}(y_i - \sigma(z_i))^2$

For $j = 1, 2$,
$$\frac{\partial C}{\partial \theta_j} = -\frac{2}{n}\sum_{i=1}^{n} (y_i - \sigma(z_i)) \cdot \frac{\partial \sigma(z_i)}{\partial \theta_j} = \frac{2}{n}\sum_{i=1}^{n} (\sigma(z_i) - y_i)\,\frac{\partial \sigma(z_i)}{\partial z_i}\,\frac{\partial z_i}{\partial \theta_j}$$
Each term in $\frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \frac{\partial C}{\partial \theta_0}$ is proportional to $\sigma'(z_i)$
Ideally, gradient descent should take large steps when $\sigma(z) - y$ is large
$\sigma(z)$ is flat at both extremes
If $\sigma(z)$ is completely wrong, $\sigma(z) \approx (1 - y)$, we still have $\sigma'(z) \approx 0$
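A quick numerical check of this flatness problem (my own illustration, not from the slides): $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is essentially zero whenever $|z|$ is large, even if the prediction is completely wrong.

```python
# Sketch: the sigmoid's derivative vanishes at both extremes.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    s = sigma(z)
    print(f"z = {z:+5.1f}   sigma = {s:.5f}   sigma' = {s * (1 - s):.5f}")

# At z = -10 with true label y = 1 the prediction is badly wrong, yet the
# squared-error gradient (sigma(z) - y) * sigma'(z) * x_j is about 0.00005 * x_j,
# so gradient descent barely moves.
```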
Let $h_\theta(x_i) = \sigma(z_i)$. So, $P(y_i = 1 \mid x_i; \theta) = h_\theta(x_i)$, $P(y_i = 0 \mid x_i; \theta) = 1 - h_\theta(x_i)$

Likelihood: $L(\theta) = \prod_{i=1}^{n} h_\theta(x_i)^{y_i} \cdot (1 - h_\theta(x_i))^{1 - y_i}$

Log-likelihood: $\ell(\theta) = \sum_{i=1}^{n} y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i))$

Goal is to maximize the log-likelihood

Equivalently, minimize the cross entropy: $-\sum_{i=1}^{n} \bigl[\,y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i))\,\bigr]$
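A minimal sketch of this cost in numpy; `h` plays the role of the predicted probabilities $h_\theta(x_i)$ and the example values are made up.

```python
# Sketch: binary cross entropy between labels y and predicted probabilities h.
import numpy as np

def cross_entropy(y, h, eps=1e-12):
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.9, 0.2, 0.7, 0.4])
print(cross_entropy(y, h))  # lower when h_theta(x_i) agrees with y_i
```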
Cross entropy and gradient descent
For a single training example, $C = -\bigl[\,y \ln \sigma(z) + (1 - y) \ln(1 - \sigma(z))\,\bigr]$

$$\frac{\partial C}{\partial \theta_j} = \frac{\partial C}{\partial \sigma}\,\frac{\partial \sigma}{\partial \theta_j} = -\left[\frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)}\right]\frac{\partial \sigma}{\partial \theta_j}$$
Since $\frac{\partial \sigma}{\partial \theta_j} = \sigma'(z)\,x_j = \sigma(z)(1 - \sigma(z))\,x_j$, we get

$$\frac{\partial C}{\partial \theta_j} = -\bigl[\,y(1 - \sigma(z)) - (1 - y)\sigma(z)\,\bigr]x_j = -\bigl[\,y - y\sigma(z) - \sigma(z) + y\sigma(z)\,\bigr]x_j = (\sigma(z) - y)\,x_j$$

Similarly, $\dfrac{\partial C}{\partial \theta_0} = \sigma(z) - y$

Thus, as we wanted, the gradient is proportional to $\sigma(z) - y$
The greater the error, the faster the learning
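To tie the derivation together, here is a bare-bones gradient-descent logistic regression built directly on this gradient $(\sigma(z) - y)\,x_j$; the synthetic data, learning rate and epoch count are assumptions for illustration only.

```python
# Sketch: logistic regression by batch gradient descent, grad = (sigma(z) - y) x.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1 for theta_0
    theta = np.zeros(X1.shape[1])
    for _ in range(epochs):
        err = sigma(X1 @ theta) - y                # sigma(z_i) - y_i for each i
        theta -= lr * (X1.T @ err) / len(y)        # average of (sigma(z) - y) x_j
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)      # true boundary: x1 + 2 x2 = 0
theta = fit_logistic(X, y)
print(theta)  # roughly proportional to (0, 1, 2), the separating direction
```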