
Lecture 19: 23 March, 2023

Madhavan Mukund
https://www.cmi.ac.in/~madhavan

Data Mining and Machine Learning


January–April 2023

Linear separators and Perceptrons

Perceptrons define linear separators w · x + b


w · x + b > 0, classify Yes (+1)
w · x + b < 0, classify No (−1)

What if we cascade perceptrons?


Result is still a linear separator
f1 = w1 · x + b1, f2 = w2 · x + b2
f3 = w3 · ⟨f1, f2⟩ + b3
f3 = w3 · ⟨w1 · x + b1, w2 · x + b2⟩ + b3
f3 = Σ_{i=1}^{4} (w31 w1i + w32 w2i) xi + (w31 b1 + w32 b2 + b3)

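A quick numerical check of this algebra, as a sketch in Python/NumPy (the weights and the 4-dimensional input below are made up for illustration):

import numpy as np

# Two perceptrons (no activation) on a 4-dimensional input
w1 = np.array([0.5, -1.0, 2.0, 0.3]); b1 = 0.1
w2 = np.array([1.5, 0.2, -0.7, 1.0]); b2 = -0.4

# A third perceptron combining f1 and f2
w3 = np.array([2.0, -1.0]); b3 = 0.25

x = np.array([1.0, 2.0, -1.0, 0.5])

# Cascaded computation
f1 = w1 @ x + b1
f2 = w2 @ x + b2
f3 = w3 @ np.array([f1, f2]) + b3

# Equivalent single linear separator, as in the expansion above
w_eff = w3[0] * w1 + w3[1] * w2
b_eff = w3[0] * b1 + w3[1] * b2 + b3
print(np.isclose(f3, w_eff @ x + b_eff))   # True: the cascade is still linear in x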


Limits of linearity

Cannot compute exclusive-or (XOR)


XOR(x1, x2) is true if exactly one of x1, x2 is true (not both)
Suppose a linear function computed it: XOR(x1, x2) = ux1 + vx2 + b
x2 = 0: as x1 goes from 0 to 1, output goes from 0 to 1, so u > 0
x2 = 1: as x1 goes from 0 to 1, output goes from 1 to 0, so u < 0
These two requirements contradict each other, so no linear separator computes XOR
Observed by Minsky and Papert, 1969, leading to the first “AI Winter”

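A brute-force illustration of the same point, as a sketch (thresholding ux1 + vx2 + b at 0 and sweeping a coarse grid of coefficients, all chosen here for illustration):

import itertools
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

found = False
grid = np.linspace(-2.0, 2.0, 41)          # coarse sweep over u, v, b
for u, v, b in itertools.product(grid, repeat=3):
    pred = (u * X[:, 0] + v * X[:, 1] + b > 0).astype(int)
    if np.array_equal(pred, y):
        found = True
        break

print("linear threshold computing XOR found:", found)   # prints False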


Non-linear activation

Transform linear output z through a non-linear activation function
Sigmoid function: 1/(1 + e^(−z))




Structure of a neural network

Acyclic
Input layer, hidden layers, output layer
Assumptions
Hidden neurons are arranged in layers
Each layer is fully connected to the next
Set weight to zero to remove an edge

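A minimal sketch of this layered, fully connected structure (the layer sizes and weights below are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 3, 2]    # input layer, one hidden layer, output layer (illustrative)

# One weight matrix and bias vector per layer: each layer fully connected to the next
weights = [rng.standard_normal((m, n)) for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal(m) for m in layer_sizes[1:]]

# "Removing" an edge is just forcing its weight to zero
weights[0][2, 1] = 0.0     # hidden node 2 now ignores input 1

x = rng.standard_normal(layer_sizes[0])
a = x
for W, b in zip(weights, biases):
    a = W @ a + b          # linear part only; activation functions come next
print(a)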


Non-linear activation

Transform linear output z through a non-linear activation function
Sigmoid function: 1/(1 + e^(−z))
Step is at z = 0
z = wx + b, so step is at x = −b/w
Increasing w makes step steeper

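A small numerical check of the last two points, as a sketch (the values of w and b are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 10.0, -5.0                          # step should sit at x = -b/w = 0.5
x = np.linspace(0.0, 1.0, 11)
print(np.round(sigmoid(w * x + b), 3))     # rises from near 0 to near 1 around x = 0.5

# Larger w, same -b/w: a steeper step in the same place
print(np.round(sigmoid(100.0 * x - 50.0), 3))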



Universality

Create a step at x = −b/w


Cascade steps
Subtract steps to create a box
Create many boxes
Approximate any function
Need only one hidden layer!

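A sketch of this construction on one arbitrary target function (the steepness, box widths and target below are my choices, not the lecture's):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))   # clipped to avoid overflow warnings

def step(x, at, steepness=2000.0):
    # a very steep sigmoid approximates a step located at `at`
    return sigmoid(steepness * (x - at))

def box(x, left, right, height):
    # subtracting two steps gives a box of the given height on [left, right]
    return height * (step(x, left) - step(x, right))

x = np.linspace(0.01, 0.99, 981)
target = np.sin(2 * np.pi * x) + 1.0       # an arbitrary function to approximate

edges = np.linspace(0.0, 1.0, 51)          # 50 narrow boxes
approx = sum(box(x, a, b, np.sin(np.pi * (a + b)) + 1.0)    # box height = target at its midpoint
             for a, b in zip(edges[:-1], edges[1:]))

print(np.max(np.abs(target - approx)))     # gap shrinks as the boxes get narrower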


Non-linear activation

With non-linear activation, network of neurons can approximate any function
Can build “rectangular” blocks
Combine blocks to capture any classification boundary

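As one concrete instance, XOR, which no single linear separator can compute, is captured by one hidden layer with a non-linear (here, threshold) activation; the weights below are hand-chosen for illustration:

import numpy as np

def threshold(z):
    return (z > 0).astype(float)           # hard step activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: first unit fires on OR(x1, x2), second on AND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
H = threshold(X @ W1.T + b1)

# Output unit: OR and not AND, i.e. XOR
w2 = np.array([1.0, -1.0])
b2 = -0.5
print(threshold(H @ w2 + b2))              # [0. 1. 1. 0.]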


Example: Recognizing handwritten digits

MNIST data set


1000 samples of 10 handwritten digits
Assume input has been segmented
Each digit is 28 × 28 pixels
Grayscale value, 0 to 1
784 pixels
Input x = (x1, x2, . . . , x784)

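Preparing one such input is just flattening and scaling, sketched here with a random stand-in for a real MNIST image:

import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))    # stand-in for one grayscale digit image

x = image.reshape(784) / 255.0                 # flatten to (x1, ..., x784), scaled to [0, 1]
print(x.shape, x.min(), x.max())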


Example: Network structure

Input layer (x1 , x2 , . . . , x784 )


Single hidden layer, 15 nodes
Output layer, 10 nodes
Decision aj for each digit j ∈ {0, 1, . . . , 9}
Final output is best aj
Naïvely, arg max_j aj
Softmax, arg max_j e^(aj) / Σ_j e^(aj)
“Smooth” version of arg max

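A sketch of the output-layer decision (the ten scores below are made up):

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))      # shift for numerical stability
    return e / e.sum()

a = np.array([1.2, -0.3, 0.5, 3.1, 0.0, -1.0, 2.2, 0.1, 0.4, -0.5])   # one score per digit 0..9

probs = softmax(a)
print(np.round(probs, 3))          # a "smooth" distribution over the ten digits
print(int(np.argmax(probs)))       # 3, the same digit the naive arg max picks here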


Example: Extracting features

Hidden layers extract features


For instance, patterns in different quadrants
Combination of features determines output
Claim: Automatic identification of features is a strength of the model
Counter argument: implicitly extracted features are impossible to interpret (explainability)



Neural networks

Without loss of generality,


Assume the network is layered
All paths from input to output have the same length
Each layer is fully connected to the previous one
Set weight to 0 if connection is not needed

Structure of an individual neuron


Input weights w1, . . . , wm, bias b, output z, activation value a

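A single neuron in this notation, as a sketch (sigmoid is one possible activation; the numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.4, -0.6, 1.2])     # input weights w1, ..., wm
b = -0.1                           # bias
x = np.array([1.0, 0.5, -2.0])     # values arriving on the input edges

z = w @ x + b                      # output z
a = sigmoid(z)                     # activation value a
print(z, a)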


Notation

Layers ℓ ∈ {1, 2, . . . , L}
Inputs are connected to the first hidden layer, layer 1
Layer L is the output layer
Layer ℓ has m_ℓ nodes 1, 2, . . . , m_ℓ
Node k in layer ℓ has bias b_k^ℓ, output z_k^ℓ and activation value a_k^ℓ
Weight on edge from node j in level ℓ−1 to node k in level ℓ is w_kj^ℓ



Notation

Why the inversion of indices in the subscript w_kj^ℓ?

z_k^ℓ = w_k1^ℓ a_1^{ℓ−1} + w_k2^ℓ a_2^{ℓ−1} + · · · + w_k,m_{ℓ−1}^ℓ a_{m_{ℓ−1}}^{ℓ−1}

Let w_k^ℓ = (w_k1^ℓ, w_k2^ℓ, . . . , w_k,m_{ℓ−1}^ℓ) and a^{ℓ−1} = (a_1^{ℓ−1}, a_2^{ℓ−1}, . . . , a_{m_{ℓ−1}}^{ℓ−1})
Then z_k^ℓ = w_k^ℓ · a^{ℓ−1}
Assume all layers have the same number of nodes: let m = max over ℓ ∈ {1, 2, . . . , L} of m_ℓ
For any layer ℓ, for k > m_ℓ, we set all of w_kj^ℓ, b_k^ℓ, z_k^ℓ, a_k^ℓ to 0
Matrix formulation: stacking the rows w_1^ℓ, w_2^ℓ, . . . , w_m^ℓ into an m × m matrix W^ℓ gives z^ℓ = W^ℓ a^{ℓ−1}

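The matrix formulation in code, as a sketch (sizes and weights are made up; biases are omitted, as in the formula above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
m, L = 5, 3                                            # padded layer width and number of layers
W = [rng.standard_normal((m, m)) for _ in range(L)]    # row k of W[l] plays the role of w_k^l

a = rng.standard_normal(m)                             # activations coming in from the input
for l in range(L):
    z = W[l] @ a                                       # z^l = W^l a^(l-1)
    a = sigmoid(z)                                     # a^l
print(a)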


Learning the parameters

Need to find optimum values for all weights w_kj^ℓ

Use gradient descent
Cost function C, partial derivatives ∂C/∂w_kj^ℓ and ∂C/∂b_k^ℓ
Assumptions about the cost function
1. For input x, C(x) is a function of only the output layer activation, a^L
   For instance, for training input (xi, yi), sum-squared error is (yi − a_i^L)²
   Note that xi, yi are fixed values, only a_i^L is a variable
2. Total cost is the average of individual input costs
   Each input xi incurs cost C(xi), total cost is (1/n) Σ_{i=1}^n C(xi)
   For instance, mean sum-squared error is (1/n) Σ_{i=1}^n (yi − a_i^L)²
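A sketch of these two assumptions for the sum-squared cost (the targets and network outputs below are made up):

import numpy as np

def cost_single(y, a_L):
    # Assumption 1: the per-input cost depends only on the output activation a^L
    return np.sum((y - a_L) ** 2)

# Toy one-hot targets and network outputs for n = 3 training inputs
Y   = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
A_L = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]])

# Assumption 2: the total cost is the average of the individual costs
C = np.mean([cost_single(y, a) for y, a in zip(Y, A_L)])
print(C)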
Learning the parameters

Assumptions about the cost function


1. For input x, C(x) is a function of only the output layer activation, a^L
2. Total cost is the average of individual input costs

With these assumptions:

We can write ∂C/∂w_kj^ℓ and ∂C/∂b_k^ℓ in terms of the individual ∂a_i^L/∂w_kj^ℓ and ∂a_i^L/∂b_k^ℓ
Can extrapolate change in individual cost C(x) to change in overall cost C: stochastic gradient descent
Complex dependency of C on w_kj^ℓ, b_k^ℓ
   Many intermediate layers
   Many paths through these layers
Use chain rule to decompose into local dependencies
y = g(f(x)) ⇒ ∂g/∂x = (∂g/∂f) (∂f/∂x)
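A numerical check of this chain-rule decomposition, as a sketch with an arbitrary smooth f and g and a finite-difference comparison:

import numpy as np

def f(x):
    return x ** 2 + 1.0

def g(u):
    return np.sin(u)

x0, eps = 0.7, 1e-6

# Derivative of g(f(x)) at x0, estimated directly by finite differences
direct = (g(f(x0 + eps)) - g(f(x0 - eps))) / (2 * eps)

# Chain rule: derivative of g at f(x0), times derivative of f at x0
chained = np.cos(f(x0)) * (2 * x0)

print(direct, chained)             # the two values agree closely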
