
CS7015 (Deep Learning) : Lecture 7

Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Module 7.1: Introduction to Autoencoders

Figure: an autoencoder — input xi, encoder weights W, hidden representation h, decoder weights W∗, reconstruction x̂i

An autoencoder is a special type of feedforward neural network which does the following:

Encodes its input xi into a hidden representation h
Decodes the input again from this hidden representation

h = g(W xi + b)
x̂i = f(W∗ h + c)

The model is trained to minimize a certain loss function which will ensure that x̂i is close to xi (we will see some such loss functions soon)
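As an aside (not part of the original slides), a minimal NumPy sketch of this encode–decode computation, assuming a sigmoid for both g and f; all sizes and names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# illustrative sizes: n-dimensional input, k-dimensional hidden representation
n, k = 8, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(k, n))       # encoder weights
b = np.zeros(k)                              # encoder bias
W_star = rng.normal(scale=0.1, size=(n, k))  # decoder weights W*
c = np.zeros(n)                              # decoder bias

x = rng.random(n)                            # one input x_i

h = sigmoid(W @ x + b)                       # h = g(W x_i + b)
x_hat = sigmoid(W_star @ h + c)              # x̂_i = f(W* h + c)
```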
Let us consider the case where dim(h) < dim(xi)

If we are still able to reconstruct x̂i perfectly from h, then what does it say about h?

h is a loss-free encoding of xi. It captures all the important characteristics of xi

Do you see an analogy with PCA?

An autoencoder where dim(h) < dim(xi) is called an undercomplete autoencoder
Let us consider the case when dim(h) ≥ dim(xi)

In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into x̂i

Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data

An autoencoder where dim(h) ≥ dim(xi) is called an overcomplete autoencoder
The Road Ahead

Choice of the functions f and g
Choice of loss function
Figure: autoencoder with a binary input vector 0 1 1 0 1, encoder h = g(W xi + b) and decoder x̂i = f(W∗ h + c)

Suppose all our inputs are binary (each xij ∈ {0, 1})

Which of the following functions would be most apt for the decoder?

x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

Logistic, as it naturally restricts all outputs to be between 0 and 1

g is typically chosen as the sigmoid function
Figure: autoencoder with a real-valued input vector 0.25 0.5 1.25 3.5 4.5, encoder h = g(W xi + b) and decoder x̂i = f(W∗ h + c)

Suppose all our inputs are real (each xij ∈ R)

Which of the following functions would be most apt for the decoder?

x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

What will logistic and tanh do? They will restrict the reconstructed x̂i to lie in [0, 1] or [−1, 1] whereas we want x̂i ∈ Rn, so the linear decoder x̂i = W∗ h + c is the apt choice here

Again, g is typically chosen as the sigmoid function
Consider the case when the inputs are real valued

The objective of the autoencoder is to reconstruct x̂i to be as close to xi as possible. This can be formalized using the following objective function:

\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

i.e., \min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^T (\hat{x}_i - x_i)

We can then train the autoencoder just like a regular feedforward network using backpropagation

All we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W, which we will see now
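A small sketch (NumPy; function and variable names are illustrative, not from the slides) of this objective computed over a mini-batch:

```python
import numpy as np

def squared_error_loss(X, X_hat):
    """(1/m) * sum_i sum_j (x̂_ij − x_ij)^2 for X, X_hat of shape (m, n)."""
    m = X.shape[0]
    return np.sum((X_hat - X) ** 2) / m
```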
L(θ) = (x̂i − xi)^T (x̂i − xi)

(Note that the loss function is shown for only one training example.)

Here h0 = xi is the input, a1 and a2 are the pre-activations, h1 is the hidden representation, and h2 = x̂i is the output, with encoder weights W and decoder weights W∗.

\frac{\partial \mathcal{L}(\theta)}{\partial W^*} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}

\frac{\partial \mathcal{L}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}

We have already seen how to calculate the other factors when we learnt backpropagation. The loss-specific factor is:

\frac{\partial \mathcal{L}(\theta)}{\partial h_2} = \frac{\partial \mathcal{L}(\theta)}{\partial \hat{x}_i} = \nabla_{\hat{x}_i} \{(\hat{x}_i - x_i)^T (\hat{x}_i - x_i)\} = 2(\hat{x}_i - x_i)
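As a quick sanity check (an illustrative NumPy sketch, not part of the slides), the last line can be verified numerically with finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
x, x_hat = rng.random(5), rng.random(5)

analytic = 2 * (x_hat - x)                 # ∇_x̂ (x̂ − x)ᵀ(x̂ − x) = 2(x̂ − x)

# central finite-difference estimate of the same gradient
eps, numeric = 1e-6, np.zeros_like(x_hat)
for j in range(len(x_hat)):
    e = np.zeros_like(x_hat); e[j] = eps
    f_plus = np.sum((x_hat + e - x) ** 2)
    f_minus = np.sum((x_hat - e - x) ** 2)
    numeric[j] = (f_plus - f_minus) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-4)
```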
Figure: autoencoder with binary inputs 0 1 1 0 1, encoder h = g(W xi + b) and decoder x̂i = f(W∗ h + c)

Consider the case when the inputs are binary

We use a sigmoid decoder which will produce outputs between 0 and 1, and these can be interpreted as probabilities.

For a single n-dimensional i-th input we can use the following (cross-entropy) loss function

\min \Big\{ -\sum_{j=1}^{n} \big( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \big) \Big\}

What value of x̂ij will minimize this function?
If xij = 1?
If xij = 0?
Indeed, the above function is minimized when x̂ij = xij!

Again, all we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W to use backpropagation
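A sketch of this loss for one example (NumPy; the eps clipping is an implementation detail added here for numerical stability, not part of the slide):

```python
import numpy as np

def binary_cross_entropy_loss(x, x_hat, eps=1e-12):
    """Reconstruction loss for one binary input x and sigmoid output x_hat."""
    x_hat = np.clip(x_hat, eps, 1 - eps)   # keep the logs finite
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
```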
\mathcal{L}(\theta) = -\sum_{j=1}^{n} \big( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \big)

\frac{\partial \mathcal{L}(\theta)}{\partial W^*} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}

\frac{\partial \mathcal{L}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}

We have already seen how to calculate the remaining factors when we learnt backpropagation. The first two terms on the RHS can be computed as:

\frac{\partial \mathcal{L}(\theta)}{\partial h_{2j}} = -\frac{x_{ij}}{\hat{x}_{ij}} + \frac{1 - x_{ij}}{1 - \hat{x}_{ij}}
\qquad
\frac{\partial h_{2j}}{\partial a_{2j}} = \sigma(a_{2j}) (1 - \sigma(a_{2j}))

where \frac{\partial \mathcal{L}(\theta)}{\partial h_2} = \Big[ \frac{\partial \mathcal{L}(\theta)}{\partial h_{21}}, \frac{\partial \mathcal{L}(\theta)}{\partial h_{22}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial h_{2n}} \Big]^T
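Numerically (an illustrative NumPy check, not from the slides), these two factors multiply to the well-known x̂i − xi term:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=4).astype(float)   # a binary x_i
a2 = rng.normal(size=4)                        # pre-activation of the output layer
h2 = sigmoid(a2)                               # x̂_i

dL_dh2 = -x / h2 + (1 - x) / (1 - h2)          # ∂L/∂h2j
dh2_da2 = h2 * (1 - h2)                        # σ(a2j)(1 − σ(a2j))

# chain rule: the product simplifies to x̂_i − x_i
assert np.allclose(dL_dh2 * dh2_da2, h2 - x)
```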
Module 7.2: Link between PCA and Autoencoders

Figure: the autoencoder xi → h → x̂i shown alongside PCA, which projects the data onto the principal directions u1, u2 (P^T X^T X P = D)

We will now see that the encoder part of an autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use squared error loss function
normalize the inputs to

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \Big( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \Big)
First let us consider the implication of normalizing the inputs to

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \Big( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \Big)

The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean)

Let X′ be this zero-mean data matrix; then what the above normalization gives us is X = (1/√m) X′

Now X^T X = (1/m) (X′)^T X′ is the covariance matrix (recall that the covariance matrix plays an important role in PCA)
First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function

\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2

is obtained when we use a linear encoder.
\min_{\theta} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2 \qquad (1)

This is equivalent to

\min_{W^*, H} \big( \lVert X - H W^* \rVert_F \big)^2, \qquad \text{where } \lVert A \rVert_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}

(just writing expression (1) in matrix form and using the definition of ||A||F; we are ignoring the biases)

From SVD we know that the optimal solution to the above problem is given by

H W^* = U_{\cdot, \leq k} \, \Sigma_{k,k} \, V_{\cdot, \leq k}^T

By matching variables one possible solution is

H = U_{\cdot, \leq k} \, \Sigma_{k,k}
W^* = V_{\cdot, \leq k}^T
We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U_{\cdot, \leq k} \, \Sigma_{k,k}
  = (X X^T)(X X^T)^{-1} U_{\cdot, \leq k} \, \Sigma_{k,k}    (pre-multiplying (X X^T)(X X^T)^{-1} = I)
  = (X V \Sigma^T U^T)(U \Sigma V^T V \Sigma^T U^T)^{-1} U_{\cdot, \leq k} \, \Sigma_{k,k}    (using X = U \Sigma V^T)
  = X V \Sigma^T U^T (U \Sigma \Sigma^T U^T)^{-1} U_{\cdot, \leq k} \, \Sigma_{k,k}    (V^T V = I)
  = X V \Sigma^T U^T U (\Sigma \Sigma^T)^{-1} U^T U_{\cdot, \leq k} \, \Sigma_{k,k}    ((ABC)^{-1} = C^{-1} B^{-1} A^{-1})
  = X V \Sigma^T (\Sigma \Sigma^T)^{-1} U^T U_{\cdot, \leq k} \, \Sigma_{k,k}    (U^T U = I)
  = X V \Sigma^T (\Sigma^T)^{-1} \Sigma^{-1} U^T U_{\cdot, \leq k} \, \Sigma_{k,k}    ((AB)^{-1} = B^{-1} A^{-1})
  = X V \Sigma^{-1} I_{\cdot, \leq k} \, \Sigma_{k,k}    (U^T U_{\cdot, \leq k} = I_{\cdot, \leq k})
  = X V I_{\cdot, \leq k}    (\Sigma^{-1} I_{\cdot, \leq k} = \Sigma^{-1}_{k,k})
H = X V_{\cdot, \leq k}

Thus H is a linear transformation of X and W = V_{\cdot, \leq k}
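This can be checked numerically; a short NumPy sketch on random data (purely illustrative, assuming np.linalg.svd) verifying that the SVD-based H and the linear encoding X V_{·,≤k} coincide:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 100, 10, 3

# zero-mean, 1/sqrt(m)-scaled data matrix, as in the normalization above
X0 = rng.normal(size=(m, n))
X = (X0 - X0.mean(axis=0)) / np.sqrt(m)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

H_svd = U[:, :k] * S[:k]          # U_{.,<=k} Σ_{k,k}
H_lin = X @ Vt[:k].T              # X V_{.,<=k}, i.e. a linear encoding with W = V_{.,<=k}

assert np.allclose(H_svd, H_lin)  # the two expressions for H coincide
```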
We have encoder W = V_{\cdot, \leq k}

From SVD, we know that V is the matrix of eigenvectors of X^T X

From PCA, we know that P is the matrix of eigenvectors of the covariance matrix

We saw earlier that, if the entries of X are normalized by

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \Big( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \Big)

then X^T X is indeed the covariance matrix

Thus, the encoder matrix for the linear autoencoder (W) and the projection matrix (P) for PCA could indeed be the same. Hence proved.
Remember
The encoder of a linear autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use a squared error loss function
and normalize the inputs to

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \Big( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \Big)
Module 7.3: Regularization in autoencoders
(Motivation)

While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders

Here, (as stated earlier) the model can simply learn to copy xi to h and then h to x̂i

To avoid poor generalization, we need to introduce regularization
The simplest solution is to add an L2-regularization term to the objective function

\min_{\theta = \{W, W^*, b, c\}} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 + \lambda \lVert \theta \rVert^2

This is very easy to implement and just adds a term λW to the gradient ∂L(θ)/∂W (and similarly for the other parameters)
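In code this is just one extra term in the weight gradient; a tiny sketch (NumPy, following the slide's λW convention; names are illustrative):

```python
import numpy as np

def l2_regularized_grad(grad_W, W, lam):
    """Add the L2 penalty's contribution to an existing gradient ∂L(θ)/∂W."""
    return grad_W + lam * W
```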
Another trick is to tie the weights of the encoder and decoder, i.e., set W∗ = W^T

This effectively reduces the capacity of the autoencoder and acts as a regularizer
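A sketch of a tied-weight forward pass (NumPy; a linear decoder is used here purely for illustration, and all names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tied_forward(x, W, b, c):
    """Forward pass of a weight-tied autoencoder: the decoder reuses W^T."""
    h = sigmoid(W @ x + b)    # encoder
    x_hat = W.T @ h + c       # decoder with W* = W^T
    return h, x_hat
```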
Module 7.4: Denoising Autoencoders

A denoising autoencoder simply corrupts the input data using a probabilistic process (P(x̃ij | xij)) before feeding it to the network

A simple P(x̃ij | xij) used in practice is the following

P(x̃ij = 0 | xij) = q
P(x̃ij = xij | xij) = 1 − q

In other words, with probability q the input is flipped to 0 and with probability (1 − q) it is retained as it is

Figure: the corrupted input x̃i ∼ P(x̃ij | xij) is fed to the encoder, which produces h and the reconstruction x̂i
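A sketch of this corruption process (NumPy; names are illustrative):

```python
import numpy as np

def corrupt(x, q, rng):
    """Masking noise: each component of x is set to 0 with probability q."""
    mask = rng.random(x.shape) >= q   # keep a component with probability 1 − q
    return x * mask

rng = np.random.default_rng(4)
x = rng.random(10)
x_tilde = corrupt(x, q=0.25, rng=rng)
# note: the loss is still computed against the clean x, not x_tilde
```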
How does this help?

This helps because the objective is still to reconstruct the original (uncorrupted) xi

\arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

It no longer makes sense for the model to copy the corrupted x̃i into h(x̃i) and then into x̂i (the objective function will not be minimized by doing so)

Instead the model will now have to capture the characteristics of the data correctly.

For example, it will have to learn to reconstruct a corrupted xij correctly by relying on its interactions with other elements of xi
We will now see a practical application in which AEs are used and then compare denoising autoencoders with regular autoencoders
Task: Hand-written digit recognition, |xi| = 784 = 28 × 28

Figure: MNIST Data

Figure: Basic approach — we use the raw 28 × 28 pixels as input features to a classifier over the digit classes 0, 1, 2, 3, . . . , 9
Figure: AE approach (first learn important characteristics of the data) — an autoencoder with input xi ∈ R784, hidden representation h ∈ Rd and reconstruction x̂i ∈ R784
Figure: AE approach (and then train a classifier over the digit classes 0, 1, 2, 3, . . . , 9 on top of this hidden representation h ∈ Rd)
We will now see a way of visualizing AEs and use this visualization to compare different AEs
We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration xi

For example,

h1 = σ(W1^T xi)   [ignoring bias b]

where W1 is the trained vector of weights connecting the input to the first hidden neuron

What values of xi will cause h1 to be maximum (or maximally activated)? Suppose we assume that our inputs are normalized so that ||xi|| = 1

\max_{x_i} \{ W_1^T x_i \} \quad \text{s.t.} \quad \lVert x_i \rVert^2 = x_i^T x_i = 1

Solution: x_i = \frac{W_1}{\sqrt{W_1^T W_1}}
Thus the inputs

x_i = \frac{W_1}{\sqrt{W_1^T W_1}}, \frac{W_2}{\sqrt{W_2^T W_2}}, \ldots, \frac{W_n}{\sqrt{W_n^T W_n}}

will respectively cause hidden neurons 1 to n to maximally fire

Let us plot these images (xi's) which maximally activate the first k neurons of the hidden representations learned by a vanilla autoencoder and by different denoising autoencoders

These xi's are computed by the above formula using the weights (W1, W2, . . . , Wk) learned by the respective autoencoders
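A sketch (NumPy, not from the slides) of how these maximally-activating inputs could be obtained from trained encoder weights; the shape convention is an assumption:

```python
import numpy as np

def maximally_activating_inputs(W):
    """Given encoder weights W of shape (k, n), where row l holds the weights
    of hidden neuron l, return the unit-norm inputs x = W_l / sqrt(W_l^T W_l)
    that maximally activate each neuron."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / norms   # row l is the input that maximally fires neuron l

# e.g. for MNIST-shaped filters: maximally_activating_inputs(W)[l].reshape(28, 28)
```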
Figure: Vanilla AE (no noise)    Figure: 25% Denoising AE (q = 0.25)    Figure: 50% Denoising AE (q = 0.5)

The vanilla AE does not learn many meaningful patterns

The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0' or a '2' or a '3' or an '8' or a '9')

As the noise increases, the filters become wider because the neuron has to rely on more adjacent pixels to feel confident about a stroke
We saw one form of P(x̃ij | xij) which flips a fraction q of the inputs to zero

Another way of corrupting the inputs is to add Gaussian noise to the input

x̃ij = xij + N(0, 1)

We will now use such a denoising AE on a different dataset and see its performance
Figure: Data    Figure: AE filters    Figure: Weight decay filters

The hidden neurons essentially behave like edge detectors

PCA does not give such edge detectors
Module 7.5: Sparse Autoencoders

A hidden neuron with sigmoid activation will have values between 0 and 1

We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.

A sparse autoencoder tries to ensure that the neuron is inactive most of the time.
The average value of the activation of a neuron l is given by

\hat{\rho}_l = \frac{1}{m} \sum_{i=1}^{m} h(x_i)_l

If the neuron l is sparse (i.e. mostly inactive) then ρ̂l → 0

A sparse autoencoder uses a sparsity parameter ρ (typically very close to 0, say, 0.005) and tries to enforce the constraint ρ̂l = ρ

One way of ensuring this is to add the following term to the objective function

\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}

When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
Figure: Ω(θ) plotted as a function of ρ̂l for ρ = 0.2

The function will reach its minimum value (which is 0) when ρ̂l = ρ.
Now,

L̂(θ) = L(θ) + Ω(θ)

L(θ) is the squared error loss or cross entropy loss and Ω(θ) is the sparsity constraint.

We already know how to calculate ∂L(θ)/∂W.

Let us see how to calculate ∂Ω(θ)/∂W.

Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ/ρ̂l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂l)) ]

This can be re-written as

Ω(θ) = Σ_{l=1}^{k} [ ρ log ρ − ρ log ρ̂l + (1 − ρ) log(1 − ρ) − (1 − ρ) log(1 − ρ̂l) ]

By the chain rule:

∂Ω(θ)/∂W = (∂Ω(θ)/∂ρ̂) · (∂ρ̂/∂W)

where

∂Ω(θ)/∂ρ̂ = [ ∂Ω(θ)/∂ρ̂1 , ∂Ω(θ)/∂ρ̂2 , . . . , ∂Ω(θ)/∂ρ̂k ]^T

For each neuron l ∈ 1 . . . k in the hidden layer, we have

∂Ω(θ)/∂ρ̂l = −ρ/ρ̂l + (1 − ρ)/(1 − ρ̂l)

and

∂ρ̂l/∂W = xi (g′(W^T xi + b))^T    (see next slide)

Finally,

∂L̂(θ)/∂W = ∂L(θ)/∂W + ∂Ω(θ)/∂W

(and we know how to calculate both terms on the R.H.S.)

44/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
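As a rough illustration of the two quantities above (our sketch, not the lecture's code): ρ̂ is obtained by averaging the hidden activations over the m training examples, and ∂Ω(θ)/∂ρ̂ is then a k-dimensional vector. Here g is assumed to be the logistic sigmoid and W is stored with shape (n inputs × k hidden units); both are implementation choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rho_hat(X, W, b):
    # average activation of each of the k hidden neurons over the m examples in X (m x n)
    return sigmoid(X @ W + b).mean(axis=0)          # shape (k,)

def d_omega_d_rho_hat(rho_hat_vec, rho=0.2):
    # dOmega/drho_hat_l = -rho/rho_hat_l + (1 - rho)/(1 - rho_hat_l), for each l
    return -rho / rho_hat_vec + (1 - rho) / (1 - rho_hat_vec)

rng = np.random.default_rng(0)
X, W, b = rng.normal(size=(10, 4)), rng.normal(size=(4, 3)), np.zeros(3)
print(d_omega_d_rho_hat(rho_hat(X, W, b)))          # one gradient entry per hidden neuron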
Derivation

∂ρ̂/∂W = [ ∂ρ̂1/∂W   ∂ρ̂2/∂W   . . .   ∂ρ̂k/∂W ]

For each element in the above equation we can calculate ∂ρ̂l/∂W (which is the partial derivative of a scalar w.r.t. a matrix, i.e. a matrix). For a single element Wjl of that matrix:

∂ρ̂l/∂Wjl = ∂[ (1/m) Σ_{i=1}^{m} g(W_{:,l}^T xi + bl) ] / ∂Wjl

          = (1/m) Σ_{i=1}^{m} ∂g(W_{:,l}^T xi + bl)/∂Wjl

          = (1/m) Σ_{i=1}^{m} g′(W_{:,l}^T xi + bl) xij

So in matrix notation we can write it as:

∂ρ̂l/∂W = xi (g′(W^T xi + b))^T
45/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
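Putting the pieces of the chain rule together, here is a hedged NumPy sketch of ∂Ω(θ)/∂W with a finite-difference check. It again assumes g is the logistic sigmoid, averages ρ̂l over the batch as in the derivation above, and stores W with shape (n × k); all of these are our implementation choices, not prescribed by the lecture:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def omega(W, b, X, rho=0.2):
    rho_hat = sigmoid(X @ W + b).mean(axis=0)                 # (k,)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def grad_omega_W(W, b, X, rho=0.2):
    H = sigmoid(X @ W + b)                                    # hidden activations, (m, k)
    rho_hat = H.mean(axis=0)                                  # (k,)
    d_rho = -rho / rho_hat + (1 - rho) / (1 - rho_hat)        # dOmega/drho_hat, (k,)
    dH = H * (1 - H)                                          # g'(pre-activation) for the sigmoid
    return X.T @ (dH * d_rho) / X.shape[0]                    # chain rule, shape (n, k)

# finite-difference check on one entry of W
rng = np.random.default_rng(0)
X, W, b = rng.normal(size=(5, 4)), rng.normal(size=(4, 3)), rng.normal(size=3)
eps = 1e-6
W_eps = W.copy(); W_eps[1, 2] += eps
numeric = (omega(W_eps, b, X) - omega(W, b, X)) / eps
print(numeric, grad_omega_W(W, b, X)[1, 2])                   # the two values should roughly agree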
Module 7.6: Contractive Autoencoders

46/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.

It does so by adding the following regularization term to the loss function

Ω(θ) = ‖Jx(h)‖²_F

where Jx(h) is the Jacobian of the encoder.

Let us see what it looks like.

[Figure: autoencoder with input x, hidden representation h and reconstruction x̂]

47/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
If the input has n dimensions and the hidden layer has k dimensions then

         [ ∂h1/∂x1  . . .  ∂h1/∂xn ]
         [ ∂h2/∂x1  . . .  ∂h2/∂xn ]
Jx(h) =  [    ⋮                ⋮   ]
         [ ∂hk/∂x1  . . .  ∂hk/∂xn ]

In other words, the (l, j) entry of the Jacobian captures the variation in the output of the lth neuron with a small variation in the jth input.

‖Jx(h)‖²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂hl/∂xj)²

48/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
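To make the Jacobian concrete, here is a small sketch (ours, not the lecture's) that builds Jx(h) explicitly for a single input and evaluates the penalty; it assumes a sigmoid encoder h = g(Wx + b) with encoder weights W of shape (k × n):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, x):
    # encoder: h = sigmoid(W x + b), with W of shape (k, n) and x of shape (n,)
    h = sigmoid(W @ x + b)                          # (k,)
    # for the sigmoid, dh_l/dx_j = h_l * (1 - h_l) * W_lj
    J = (h * (1 - h))[:, None] * W                  # Jacobian J_x(h), shape (k, n)
    return np.sum(J ** 2)                           # squared Frobenius norm

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=5)
print(contractive_penalty(W, b, x))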
What is the intuition behind this?

Consider ∂h1/∂x1. What does it mean if ∂h1/∂x1 = 0?

It means that this neuron is not very sensitive to variations in the input x1.

But doesn't this contradict our other goal of minimizing L(θ), which requires h to capture variations in the input?

49/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does, and that is the idea.

By pitting these two contradicting objectives against each other we ensure that h is sensitive only to the very important variations observed in the training data.

L(θ) - capture important variations in the data

Ω(θ) - do not capture variations in the data

Tradeoff - capture only the very important variations in the data

[Figure: autoencoder with input x, hidden representation h and reconstruction x̂]

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
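A compact way to see the tradeoff is to write the full objective, reconstruction loss plus λ times the contractive penalty, for a single example. This is a sketch under the same assumptions as before (sigmoid encoder, linear decoder x̂ = W*h + c); the name lam for the tradeoff weight is ours:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_ae_loss(W, b, W_star, c, x, lam=0.1):
    h = sigmoid(W @ x + b)                          # encoder, W of shape (k, n)
    x_hat = W_star @ h + c                          # linear decoder, W_star of shape (n, k)
    recon = np.sum((x_hat - x) ** 2)                # L(theta): capture variations in the data
    J = (h * (1 - h))[:, None] * W                  # encoder Jacobian for the sigmoid
    omega = np.sum(J ** 2)                          # Omega(theta): penalize sensitivity
    return recon + lam * omega                      # lam balances the two contradicting goals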
Let us try to understand this with the help of an illustration.

51/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
[Figure: data points in the x–y plane, with directions u1 and u2; most of the variance lies along u1]

Consider the variations in the data along directions u1 and u2.

It makes sense for a neuron to be maximally sensitive to variations along u1.

At the same time it makes sense to inhibit a neuron from being sensitive to variations along u2, as the variation along u2 appears to be small noise that is unimportant for reconstruction.

By doing so we can balance the contradicting goals of good reconstruction and low sensitivity.

What does this remind you of?

52/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Module 7.7 : Summary

53/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
[Figure: a linear autoencoder (x → h → x̂) shown alongside the PCA directions u1, u2 in the x–y plane; the hidden representation h corresponds to the principal directions]

PCA:  P^T X^T X P = D

Linear autoencoder:  min_θ ‖X − H W*‖²_F , with the optimum obtained from the SVD  U Σ V^T

54/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
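The equivalence in the summary can be checked numerically: for centred data, the reconstruction from the top-k principal directions coincides with the best rank-k approximation of X given by the SVD, which is what an optimal linear autoencoder with squared error loss achieves. A small sketch of ours:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)              # centre the data, as PCA assumes

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# PCA reconstruction using the top-k principal directions
X_pca = X @ Vt[:k].T @ Vt[:k]

# best rank-k approximation of X (what the optimal linear autoencoder achieves)
X_lin_ae = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

print(np.allclose(X_pca, X_lin_ae))  # True: the two reconstructions coincide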
[Figure: denoising autoencoder — the input xi is corrupted to x̃i using P(x̃ij | xij), then encoded to h and reconstructed as x̂i]

Regularization

Ω(θ) = λ‖θ‖²    (weight decay)

Ω(θ) = Σ_{l=1}^{k} ρ log(ρ/ρ̂l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂l))    (sparse)

Ω(θ) = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂hl/∂xj)²    (contractive)

55/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
