Lecture 8: 2 February 2023 (DMML Jan–Apr 2023)
Madhavan Mukund
https://www.cmi.ac.in/~madhavan
Each input $x_i$ is a vector $(x_{i1}, \ldots, x_{ik})$
Add $x_{i0} = 1$ by convention
$y_i$ is the actual output
Derived polynomial features of the inputs:
Linear: $x_{i1}, x_{i2}, x_{i3}$
Quadratic: $x_{i1}^2, x_{i2}^2, x_{i3}^2,\; x_{i1}x_{i2}, x_{i1}x_{i3}, x_{i2}x_{i3}, \ldots$
Cubic: $x_{i1}^3, x_{i2}^3, x_{i3}^3, \ldots$
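For concreteness, here is a minimal sketch of building this feature expansion with scikit-learn's PolynomialFeatures; the array `X` and the choice of degree are illustrative and not from the lecture.

```python
# Minimal sketch (assumed data): expand 3 raw inputs into polynomial features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0],     # each row is one input (x_i1, x_i2, x_i3)
              [0.5, 1.5, 2.5]])

# degree=3 produces the linear, quadratic (squares and cross terms) and cubic
# features; include_bias=True adds the constant feature x_i0 = 1
poly = PolynomialFeatures(degree=3, include_bias=True)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())  # '1', 'x0', 'x1', ..., 'x0 x1', 'x0^3', ...
print(X_poly.shape)                  # (2, 20): 20 monomials of degree <= 3
```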
Higher degree polynomials
Ridge regression: coefficients contribute $\sum_{j=1}^{k} \theta_j^2$
LASSO regression: coefficients contribute $\sum_{j=1}^{k} |\theta_j|$
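As a rough illustration (the data and alpha values below are invented), scikit-learn's Ridge and Lasso estimators add exactly these two kinds of penalty, up to scaling conventions:

```python
# Sketch: same synthetic data, squared (L2) penalty vs absolute-value (L1) penalty.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # penalty: alpha * sum_j theta_j^2
lasso = Lasso(alpha=0.1).fit(X, y)   # penalty: alpha * sum_j |theta_j|

print("ridge:", np.round(ridge.coef_, 3))  # all coefficients shrunk a little
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant coefficients driven to 0
```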
The non-polynomial case
Percentage of urban population as a function of per capita GDP
Not clear what polynomial would be reasonable
Take log of GDP
The regression we are computing is $y = \theta_0 + \theta_1 \log x_1$
(Figure: data with fitted regression line)
Classifier: set a threshold
Output below threshold: 0 (No)
Output above threshold: 1 (Yes)
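A small end-to-end sketch of both steps, fitting $y = \theta_0 + \theta_1 \log x_1$ and then thresholding the output; the GDP figures and the threshold are made up for illustration.

```python
# Sketch (made-up numbers): regression on log(GDP), then a threshold classifier.
import numpy as np
from sklearn.linear_model import LinearRegression

gdp = np.array([500.0, 1500.0, 5000.0, 20000.0, 60000.0])  # per capita GDP
urban = np.array([20.0, 35.0, 55.0, 75.0, 85.0])           # % urban population

# y = theta_0 + theta_1 * log(x_1): fit on log-transformed GDP
model = LinearRegression().fit(np.log(gdp).reshape(-1, 1), urban)

threshold = 50.0
pred = model.predict(np.log([[3000.0], [40000.0]]))
labels = (pred > threshold).astype(int)  # 1 (Yes) above threshold, 0 (No) below
print(pred, labels)
```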
Sigmoid function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Consider two inputs $x = (x_1, x_2)$, so $z_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2}$, with squared-error cost $C = \frac{1}{n}\sum_{i=1}^{n}(y_i - \sigma(z_i))^2$

For $j = 1, 2$,
$$\frac{\partial C}{\partial \theta_j} = -\frac{2}{n}\sum_{i=1}^{n} (y_i - \sigma(z_i)) \cdot \frac{\partial \sigma(z_i)}{\partial \theta_j} = \frac{2}{n}\sum_{i=1}^{n} (\sigma(z_i) - y_i)\,\frac{\partial \sigma(z_i)}{\partial z_i}\,\frac{\partial z_i}{\partial \theta_j}$$
Each term in $\frac{\partial C}{\partial \theta_1}, \frac{\partial C}{\partial \theta_2}, \frac{\partial C}{\partial \theta_0}$ is proportional to $\sigma'(z_i)$
Ideally, gradient descent should take large steps when $\sigma(z) - y$ is large
$\sigma(z)$ is flat at both extremes
If $\sigma(z)$ is completely wrong, $\sigma(z) \approx (1 - y)$, we still have $\sigma'(z) \approx 0$
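A quick numerical check of this flatness problem (my own illustration, not from the slides): $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ is essentially zero whenever $|z|$ is large, even if the prediction is completely wrong.

```python
# Sketch: the sigmoid's derivative vanishes at both extremes.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    s = sigma(z)
    print(f"z = {z:+5.1f}   sigma = {s:.5f}   sigma' = {s * (1 - s):.5f}")

# At z = -10 with true label y = 1 the prediction is badly wrong, yet the
# squared-error gradient (sigma(z) - y) * sigma'(z) * x_j is about 0.00005 * x_j,
# so gradient descent barely moves.
```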
Let $h_\theta(x_i) = \sigma(z_i)$. So, $P(y_i = 1 \mid x_i; \theta) = h_\theta(x_i)$, $P(y_i = 0 \mid x_i; \theta) = 1 - h_\theta(x_i)$

Likelihood: $L(\theta) = \prod_{i=1}^{n} h_\theta(x_i)^{y_i} \cdot (1 - h_\theta(x_i))^{1 - y_i}$

Log-likelihood: $\ell(\theta) = \sum_{i=1}^{n} y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i))$

Goal is to maximize the log-likelihood

Equivalently, minimize the cross entropy: $-\sum_{i=1}^{n} \bigl[\,y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i))\,\bigr]$
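A minimal sketch of this cost in numpy; `h` plays the role of the predicted probabilities $h_\theta(x_i)$ and the example values are made up.

```python
# Sketch: binary cross entropy between labels y and predicted probabilities h.
import numpy as np

def cross_entropy(y, h, eps=1e-12):
    h = np.clip(h, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.9, 0.2, 0.7, 0.4])
print(cross_entropy(y, h))  # lower when h_theta(x_i) agrees with y_i
```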
Cross entropy and gradient descent
For a single training example, $C = -\bigl[\,y \ln \sigma(z) + (1 - y) \ln(1 - \sigma(z))\,\bigr]$

$$\frac{\partial C}{\partial \theta_j} = \frac{\partial C}{\partial \sigma}\,\frac{\partial \sigma}{\partial \theta_j} = -\left[\frac{y}{\sigma(z)} - \frac{1 - y}{1 - \sigma(z)}\right]\frac{\partial \sigma}{\partial \theta_j}$$
Since $\frac{\partial \sigma}{\partial \theta_j} = \sigma'(z)\,x_j = \sigma(z)(1 - \sigma(z))\,x_j$, we get

$$\frac{\partial C}{\partial \theta_j} = -\bigl[\,y(1 - \sigma(z)) - (1 - y)\sigma(z)\,\bigr]x_j = -\bigl[\,y - y\sigma(z) - \sigma(z) + y\sigma(z)\,\bigr]x_j = (\sigma(z) - y)\,x_j$$

Similarly, $\dfrac{\partial C}{\partial \theta_0} = \sigma(z) - y$

Thus, as we wanted, the gradient is proportional to $\sigma(z) - y$
The greater the error, the faster the learning
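To tie the derivation together, here is a bare-bones gradient-descent logistic regression built directly on this gradient $(\sigma(z) - y)\,x_j$; the synthetic data, learning rate and epoch count are assumptions for illustration only.

```python
# Sketch: logistic regression by batch gradient descent, grad = (sigma(z) - y) x.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1 for theta_0
    theta = np.zeros(X1.shape[1])
    for _ in range(epochs):
        err = sigma(X1 @ theta) - y                # sigma(z_i) - y_i for each i
        theta -= lr * (X1.T @ err) / len(y)        # average of (sigma(z) - y) x_j
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)      # true boundary: x1 + 2 x2 = 0
theta = fit_logistic(X, y)
print(theta)  # roughly proportional to (0, 1, 2), the separating direction
```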