
Lecture 19: 23 March, 2023

Madhavan Mukund
https://www.cmi.ac.in/~madhavan

Data Mining and Machine Learning


January–April 2023

Linear separators and Perceptrons

Perceptrons define linear separators w · x + b


w · x + b > 0, classify Yes (+1)
w · x + b < 0, classify No (−1)

What if we cascade perceptrons?


Result is still a linear separator
f1 = w1 · x + b1, f2 = w2 · x + b2
f3 = w3 · ⟨f1, f2⟩ + b3
f3 = w3 · ⟨w1 · x + b1, w2 · x + b2⟩ + b3
f3 = Σ_{i=1}^{4} (w31 w1i + w32 w2i) xi + (w31 b1 + w32 b2 + b3)

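A quick numerical check of this algebra, as a sketch in Python/NumPy (the weights and the 4-dimensional input below are made up for illustration):

import numpy as np

# Two perceptrons (no activation) on a 4-dimensional input
w1 = np.array([0.5, -1.0, 2.0, 0.3]); b1 = 0.1
w2 = np.array([1.5, 0.2, -0.7, 1.0]); b2 = -0.4

# A third perceptron combining f1 and f2
w3 = np.array([2.0, -1.0]); b3 = 0.25

x = np.array([1.0, 2.0, -1.0, 0.5])

# Cascaded computation
f1 = w1 @ x + b1
f2 = w2 @ x + b2
f3 = w3 @ np.array([f1, f2]) + b3

# Equivalent single linear separator, as in the expansion above
w_eff = w3[0] * w1 + w3[1] * w2
b_eff = w3[0] * b1 + w3[1] * b2 + b3
print(np.isclose(f3, w_eff @ x + b_eff))   # True: the cascade is still linear in x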


Limits of linearity

Cannot compute exclusive-or (XOR)


XOR(x1, x2) is true if exactly one of x1, x2 is true (not both)
Suppose a linear function computed it: XOR(x1, x2) = ux1 + vx2 + b
x2 = 0: as x1 goes from 0 to 1, output goes from 0 to 1, so u > 0
x2 = 1: as x1 goes from 0 to 1, output goes from 1 to 0, so u < 0
These two requirements contradict each other, so no linear separator computes XOR
Observed by Minsky and Papert, 1969, leading to the first “AI Winter”

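A brute-force illustration of the same point, as a sketch (thresholding ux1 + vx2 + b at 0 and sweeping a coarse grid of coefficients, all chosen here for illustration):

import itertools
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

found = False
grid = np.linspace(-2.0, 2.0, 41)          # coarse sweep over u, v, b
for u, v, b in itertools.product(grid, repeat=3):
    pred = (u * X[:, 0] + v * X[:, 1] + b > 0).astype(int)
    if np.array_equal(pred, y):
        found = True
        break

print("linear threshold computing XOR found:", found)   # prints False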


Non-linear activation

Transform linear output z through a non-linear activation function
Sigmoid function: 1/(1 + e^(−z))




Structure of a neural network

Acyclic
Input layer, hidden layers, output layer
Assumptions
Hidden neurons are arranged in layers
Each layer is fully connected to the next
Set weight to zero to remove an edge

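A minimal sketch of this layered, fully connected structure (the layer sizes and weights below are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 3, 2]    # input layer, one hidden layer, output layer (illustrative)

# One weight matrix and bias vector per layer: each layer fully connected to the next
weights = [rng.standard_normal((m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal(m) for m in layer_sizes[1:]]

# "Removing" an edge is just forcing its weight to zero
weights[0][2, 1] = 0.0     # hidden node 2 now ignores input 1

x = rng.standard_normal(layer_sizes[0])
a = x
for W, b in zip(weights, biases):
    a = W @ a + b          # linear part only; activation functions come next
print(a)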


Non-linear activation

Transform linear output z through a non-linear activation function
Sigmoid function: 1/(1 + e^(−z))
Step is at z = 0
z = wx + b, so step is at x = −b/w
Increasing w makes step steeper

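A small numerical check of the last two points, as a sketch (the values of w and b are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 10.0, -5.0                          # step should sit at x = -b/w = 0.5
x = np.linspace(0.0, 1.0, 11)
print(np.round(sigmoid(w * x + b), 3))     # rises from near 0 to near 1 around x = 0.5

# Larger w, same -b/w: a steeper step in the same place
print(np.round(sigmoid(100.0 * x - 50.0), 3))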



Universality

Create a step at x = −b/w


Cascade steps
Subtract steps to create a box
Create many boxes
Approximate any function
Need only one hidden layer!

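A sketch of this construction on one arbitrary target function (the steepness, box widths and target below are my choices, not the lecture's):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))   # clipped to avoid overflow warnings

def step(x, at, steepness=2000.0):
    # a very steep sigmoid approximates a step located at `at`
    return sigmoid(steepness * (x - at))

def box(x, left, right, height):
    # subtracting two steps gives a box of the given height on [left, right]
    return height * (step(x, left) - step(x, right))

x = np.linspace(0.01, 0.99, 981)
target = np.sin(2 * np.pi * x) + 1.0       # an arbitrary function to approximate

edges = np.linspace(0.0, 1.0, 51)          # 50 narrow boxes
approx = sum(box(x, a, b, np.sin(np.pi * (a + b)) + 1.0)    # box height = target at its midpoint
             for a, b in zip(edges[:-1], edges[1:]))

print(np.max(np.abs(target - approx)))     # gap shrinks as the boxes get narrower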


Non-linear activation

With non-linear activation, network of neurons can approximate any function
Can build “rectangular” blocks
Combine blocks to capture any classification boundary

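As one concrete instance, XOR, which no single linear separator can compute, is captured by one hidden layer with a non-linear (here, threshold) activation; the weights below are hand-chosen for illustration:

import numpy as np

def threshold(z):
    return (z > 0).astype(float)           # hard step activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: first unit fires on OR(x1, x2), second on AND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
H = threshold(X @ W1.T + b1)

# Output unit: OR and not AND, i.e. XOR
w2 = np.array([1.0, -1.0])
b2 = -0.5
print(threshold(H @ w2 + b2))              # [0. 1. 1. 0.]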


Example: Recognizing handwritten digits

MNIST data set


1000 samples of 10 handwritten digits
Assume input has been segmented
Each digit is 28 × 28 pixels
Grayscale value, 0 to 1
784 pixels
Input x = (x1, x2, . . . , x784)

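Preparing one such input is just flattening and scaling, sketched here with a random stand-in for a real MNIST image:

import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))    # stand-in for one grayscale digit image

x = image.reshape(784) / 255.0                 # flatten to (x1, ..., x784), scaled to [0, 1]
print(x.shape, x.min(), x.max())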


Example: Network structure

Input layer (x1 , x2 , . . . , x784 )


Single hidden layer, 15 nodes
Output layer, 10 nodes
Decision aj for each digit j ∈ {0, 1, . . . , 9}
Final output is best aj
Naïvely, arg max_j aj
Softmax, arg max_j e^(aj) / Σ_j e^(aj)
“Smooth” version of arg max

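A sketch of the output-layer decision (the ten scores below are made up):

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))      # shift for numerical stability
    return e / e.sum()

a = np.array([1.2, -0.3, 0.5, 3.1, 0.0, -1.0, 2.2, 0.1, 0.4, -0.5])   # one score per digit 0..9

probs = softmax(a)
print(np.round(probs, 3))          # a "smooth" distribution over the ten digits
print(int(np.argmax(probs)))       # 3, the same digit the naive arg max picks here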


Example: Extracting features

Hidden layers extract features


For instance, patterns in different quadrants
Combination of features determines output
Claim: Automatic identification of features is a strength of the model
Counter argument: implicitly extracted features are impossible to interpret (explainability)



Neural networks

Without loss of generality,


Assume the network is layered
All paths from input to output have the same length
Each layer is fully connected to the previous one
Set weight to 0 if connection is not needed

Structure of an individual neuron


Input weights w1, . . . , wm, bias b, output z, activation value a

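A single neuron in this notation, as a sketch (sigmoid is one possible activation; the numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.4, -0.6, 1.2])     # input weights w1, ..., wm
b = -0.1                           # bias
x = np.array([1.0, 0.5, -2.0])     # values arriving on the input edges

z = w @ x + b                      # output z
a = sigmoid(z)                     # activation value a
print(z, a)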


Notation

Layers ℓ ∈ {1, 2, . . . , L}
Inputs are connected to the first hidden layer, layer 1
Layer L is the output layer
Layer ℓ has m_ℓ nodes 1, 2, . . . , m_ℓ
Node k in layer ℓ has bias b_k^ℓ, output z_k^ℓ and activation value a_k^ℓ
Weight on edge from node j in level ℓ−1 to node k in level ℓ is w_kj^ℓ



Notation

Why the inversion of indices in the subscript w_kj^ℓ?

z_k^ℓ = w_k1^ℓ a_1^{ℓ−1} + w_k2^ℓ a_2^{ℓ−1} + · · · + w_k,m_{ℓ−1}^ℓ a_{m_{ℓ−1}}^{ℓ−1}

Let w_k^ℓ = (w_k1^ℓ, w_k2^ℓ, . . . , w_k,m_{ℓ−1}^ℓ) and a^{ℓ−1} = (a_1^{ℓ−1}, a_2^{ℓ−1}, . . . , a_{m_{ℓ−1}}^{ℓ−1})
Then z_k^ℓ = w_k^ℓ · a^{ℓ−1}
Assume all layers have the same number of nodes: let m = max over ℓ ∈ {1, 2, . . . , L} of m_ℓ
For any layer ℓ, for k > m_ℓ, we set all of w_kj^ℓ, b_k^ℓ, z_k^ℓ, a_k^ℓ to 0
Matrix formulation: stacking the rows w_1^ℓ, w_2^ℓ, . . . , w_m^ℓ into an m × m matrix W^ℓ gives z^ℓ = W^ℓ a^{ℓ−1}

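The matrix formulation in code, as a sketch (sizes and weights are made up; biases are omitted, as in the formula above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, L = 5, 3                                            # padded layer width and number of layers
W = [rng.standard_normal((m, m)) for _ in range(L)]    # row k of W[l] plays the role of w_k^l

a = rng.standard_normal(m)                             # activations coming in from the input
for l in range(L):
    z = W[l] @ a                                       # z^l = W^l a^(l-1)
    a = sigmoid(z)                                     # a^l
print(a)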


Learning the parameters

Need to find optimum values for all weights w_kj^ℓ

Use gradient descent
Cost function C, partial derivatives ∂C/∂w_kj^ℓ and ∂C/∂b_k^ℓ
Assumptions about the cost function
1. For input x, C(x) is a function of only the output layer activation, a^L
   For instance, for training input (xi, yi), sum-squared error is (yi − a_i^L)²
   Note that xi, yi are fixed values, only a_i^L is a variable
2. Total cost is the average of individual input costs
   Each input xi incurs cost C(xi), total cost is (1/n) Σ_{i=1}^n C(xi)
   For instance, mean sum-squared error is (1/n) Σ_{i=1}^n (yi − a_i^L)²
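A sketch of these two assumptions for the sum-squared cost (the targets and network outputs below are made up):

import numpy as np

def cost_single(y, a_L):
    # Assumption 1: the per-input cost depends only on the output activation a^L
    return np.sum((y - a_L) ** 2)

# Toy one-hot targets and network outputs for n = 3 training inputs
Y   = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
A_L = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])

# Assumption 2: the total cost is the average of the individual costs
C = np.mean([cost_single(y, a) for y, a in zip(Y, A_L)])
print(C)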
Learning the parameters

Assumptions about the cost function


1. For input x, C(x) is a function of only the output layer activation, a^L
2. Total cost is the average of individual input costs

With these assumptions:

We can write ∂C/∂w_kj^ℓ and ∂C/∂b_k^ℓ in terms of the individual ∂a_i^L/∂w_kj^ℓ and ∂a_i^L/∂b_k^ℓ
Can extrapolate change in individual cost C(x) to change in overall cost C: stochastic gradient descent
Complex dependency of C on w_kj^ℓ, b_k^ℓ
   Many intermediate layers
   Many paths through these layers
Use chain rule to decompose into local dependencies
y = g(f(x)) ⇒ ∂g/∂x = (∂g/∂f) (∂f/∂x)
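A numerical check of this chain-rule decomposition, as a sketch with an arbitrary smooth f and g and a finite-difference comparison:

import numpy as np

def f(x):
    return x ** 2 + 1.0

def g(u):
    return np.sin(u)

x0, eps = 0.7, 1e-6

# Derivative of g(f(x)) at x0, estimated directly by finite differences
direct = (g(f(x0 + eps)) - g(f(x0 - eps))) / (2 * eps)

# Chain rule: derivative of g at f(x0), times derivative of f at x0
chained = np.cos(f(x0)) * (2 * x0)

print(direct, chained)             # the two values agree closely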
