
CS7015 (Deep Learning) : Lecture 7

Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders

Mitesh M. Khapra
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Module 7.1: Introduction to Autoencoders

Figure: an autoencoder — input xi, encoder weights W, hidden representation h, decoder weights W∗, reconstruction x̂i

An autoencoder is a special type of feedforward neural network which does the following:

Encodes its input xi into a hidden representation h
Decodes the input again from this hidden representation

h = g(W xi + b)
x̂i = f(W∗ h + c)

The model is trained to minimize a certain loss function which will ensure that x̂i is close to xi (we will see some such loss functions soon)
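As an aside (not part of the original slides), a minimal NumPy sketch of this encode–decode computation, assuming a sigmoid for both g and f; all sizes and names are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# illustrative sizes: n-dimensional input, k-dimensional hidden representation
n, k = 8, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(k, n))       # encoder weights
b = np.zeros(k)                              # encoder bias
W_star = rng.normal(scale=0.1, size=(n, k))  # decoder weights W*
c = np.zeros(n)                              # decoder bias

x = rng.random(n)                            # one input x_i

h = sigmoid(W @ x + b)                       # h = g(W x_i + b)
x_hat = sigmoid(W_star @ h + c)              # x̂_i = f(W* h + c)
```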
Let us consider the case where dim(h) < dim(xi)

If we are still able to reconstruct x̂i perfectly from h, then what does it say about h?

h is a loss-free encoding of xi. It captures all the important characteristics of xi

Do you see an analogy with PCA?

An autoencoder where dim(h) < dim(xi) is called an undercomplete autoencoder
Let us consider the case when dim(h) ≥ dim(xi)

In such a case the autoencoder could learn a trivial encoding by simply copying xi into h and then copying h into x̂i

Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data

An autoencoder where dim(h) ≥ dim(xi) is called an overcomplete autoencoder
The Road Ahead

Choice of the functions f and g
Choice of loss function
Figure: autoencoder with a binary input vector 0 1 1 0 1, encoder h = g(W xi + b) and decoder x̂i = f(W∗ h + c)

Suppose all our inputs are binary (each xij ∈ {0, 1})

Which of the following functions would be most apt for the decoder?

x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

Logistic, as it naturally restricts all outputs to be between 0 and 1

g is typically chosen as the sigmoid function
Figure: autoencoder with a real-valued input vector 0.25 0.5 1.25 3.5 4.5, encoder h = g(W xi + b) and decoder x̂i = f(W∗ h + c)

Suppose all our inputs are real (each xij ∈ R)

Which of the following functions would be most apt for the decoder?

x̂i = tanh(W∗ h + c)
x̂i = W∗ h + c
x̂i = logistic(W∗ h + c)

What will logistic and tanh do? They will restrict the reconstructed x̂i to lie in [0, 1] or [−1, 1] whereas we want x̂i ∈ Rn, so the linear decoder x̂i = W∗ h + c is the apt choice here

Again, g is typically chosen as the sigmoid function
Consider the case when the inputs are real valued

The objective of the autoencoder is to reconstruct x̂i to be as close to xi as possible. This can be formalized using the following objective function:

\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

i.e., \min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^T (\hat{x}_i - x_i)

We can then train the autoencoder just like a regular feedforward network using backpropagation

All we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W, which we will see now
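A small sketch (NumPy; function and variable names are illustrative, not from the slides) of this objective computed over a mini-batch:

```python
import numpy as np

def squared_error_loss(X, X_hat):
    """(1/m) * sum_i sum_j (x̂_ij − x_ij)^2 for X, X_hat of shape (m, n)."""
    m = X.shape[0]
    return np.sum((X_hat - X) ** 2) / m
```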
L(θ) = (x̂i − xi)^T (x̂i − xi)

(Note that the loss function is shown for only one training example.)

Here h0 = xi is the input, a1 and a2 are the pre-activations, h1 is the hidden representation, and h2 = x̂i is the output, with encoder weights W and decoder weights W∗.

\frac{\partial \mathcal{L}(\theta)}{\partial W^*} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}

\frac{\partial \mathcal{L}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}

We have already seen how to calculate the other factors when we learnt backpropagation. The loss-specific factor is:

\frac{\partial \mathcal{L}(\theta)}{\partial h_2} = \frac{\partial \mathcal{L}(\theta)}{\partial \hat{x}_i} = \nabla_{\hat{x}_i} \{(\hat{x}_i - x_i)^T (\hat{x}_i - x_i)\} = 2(\hat{x}_i - x_i)
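As a quick sanity check (an illustrative NumPy sketch, not part of the slides), the last line can be verified numerically with finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
x, x_hat = rng.random(5), rng.random(5)

analytic = 2 * (x_hat - x)                 # ∇_x̂ (x̂ − x)ᵀ(x̂ − x) = 2(x̂ − x)

# central finite-difference estimate of the same gradient
eps, numeric = 1e-6, np.zeros_like(x_hat)
for j in range(len(x_hat)):
    e = np.zeros_like(x_hat); e[j] = eps
    f_plus = np.sum((x_hat + e - x) ** 2)
    f_minus = np.sum((x_hat - e - x) ** 2)
    numeric[j] = (f_plus - f_minus) / (2 * eps)

assert np.allclose(analytic, numeric, atol=1e-4)
```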
Figure: autoencoder with binary inputs 0 1 1 0 1, encoder h = g(W xi + b) and decoder x̂i = f(W∗ h + c)

Consider the case when the inputs are binary

We use a sigmoid decoder which will produce outputs between 0 and 1, and these can be interpreted as probabilities.

For a single n-dimensional i-th input we can use the following (cross-entropy) loss function

\min \Big\{ -\sum_{j=1}^{n} \big( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \big) \Big\}

What value of x̂ij will minimize this function?
If xij = 1?
If xij = 0?
Indeed, the above function is minimized when x̂ij = xij!

Again, all we need is a formula for ∂L(θ)/∂W∗ and ∂L(θ)/∂W to use backpropagation
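A sketch of this loss for one example (NumPy; the eps clipping is an implementation detail added here for numerical stability, not part of the slide):

```python
import numpy as np

def binary_cross_entropy_loss(x, x_hat, eps=1e-12):
    """Reconstruction loss for one binary input x and sigmoid output x_hat."""
    x_hat = np.clip(x_hat, eps, 1 - eps)   # keep the logs finite
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))
```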
\mathcal{L}(\theta) = -\sum_{j=1}^{n} \big( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \big)

\frac{\partial \mathcal{L}(\theta)}{\partial W^*} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}

\frac{\partial \mathcal{L}(\theta)}{\partial W} = \frac{\partial \mathcal{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}

We have already seen how to calculate the remaining factors when we learnt backpropagation. The first two terms on the RHS can be computed as:

\frac{\partial \mathcal{L}(\theta)}{\partial h_{2j}} = -\frac{x_{ij}}{\hat{x}_{ij}} + \frac{1 - x_{ij}}{1 - \hat{x}_{ij}}
\qquad
\frac{\partial h_{2j}}{\partial a_{2j}} = \sigma(a_{2j}) (1 - \sigma(a_{2j}))

where \frac{\partial \mathcal{L}(\theta)}{\partial h_2} = \Big[ \frac{\partial \mathcal{L}(\theta)}{\partial h_{21}}, \frac{\partial \mathcal{L}(\theta)}{\partial h_{22}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial h_{2n}} \Big]^T
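Numerically (an illustrative NumPy check, not from the slides), these two factors multiply to the well-known x̂i − xi term:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=4).astype(float)   # a binary x_i
a2 = rng.normal(size=4)                        # pre-activation of the output layer
h2 = sigmoid(a2)                               # x̂_i

dL_dh2 = -x / h2 + (1 - x) / (1 - h2)          # ∂L/∂h2j
dh2_da2 = h2 * (1 - h2)                        # σ(a2j)(1 − σ(a2j))

# chain rule: the product simplifies to x̂_i − x_i
assert np.allclose(dL_dh2 * dh2_da2, h2 - x)
```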
Module 7.2: Link between PCA and Autoencoders

Figure: the autoencoder xi → h → x̂i shown alongside PCA, which projects the data onto the principal directions u1, u2 (P^T X^T X P = D)

We will now see that the encoder part of an autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use squared error loss function
normalize the inputs to

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \Big( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \Big)
First let us consider the implication of normalizing the inputs to

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \Big( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \Big)

The operation in the bracket ensures that the data now has 0 mean along each dimension j (we are subtracting the mean)

Let X′ be this zero-mean data matrix; then what the above normalization gives us is X = (1/√m) X′

Now X^T X = (1/m) (X′)^T X′ is the covariance matrix (recall that the covariance matrix plays an important role in PCA)
First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function

\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2

is obtained when we use a linear encoder.
\min_{\theta} \sum_{i=1}^{m} \sum_{j=1}^{n} (x_{ij} - \hat{x}_{ij})^2 \qquad (1)

This is equivalent to

\min_{W^*, H} \big( \lVert X - H W^* \rVert_F \big)^2, \qquad \text{where } \lVert A \rVert_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2}

(just writing expression (1) in matrix form and using the definition of ||A||F; we are ignoring the biases)

From SVD we know that the optimal solution to the above problem is given by

H W^* = U_{\cdot, \leq k} \, \Sigma_{k,k} \, V_{\cdot, \leq k}^T

By matching variables one possible solution is

H = U_{\cdot, \leq k} \, \Sigma_{k,k}
W^* = V_{\cdot, \leq k}^T
We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U_{\cdot, \leq k} \, \Sigma_{k,k}
  = (X X^T)(X X^T)^{-1} U_{\cdot, \leq k} \, \Sigma_{k,k}    (pre-multiplying (X X^T)(X X^T)^{-1} = I)
  = (X V \Sigma^T U^T)(U \Sigma V^T V \Sigma^T U^T)^{-1} U_{\cdot, \leq k} \, \Sigma_{k,k}    (using X = U \Sigma V^T)
  = X V \Sigma^T U^T (U \Sigma \Sigma^T U^T)^{-1} U_{\cdot, \leq k} \, \Sigma_{k,k}    (V^T V = I)
  = X V \Sigma^T U^T U (\Sigma \Sigma^T)^{-1} U^T U_{\cdot, \leq k} \, \Sigma_{k,k}    ((ABC)^{-1} = C^{-1} B^{-1} A^{-1})
  = X V \Sigma^T (\Sigma \Sigma^T)^{-1} U^T U_{\cdot, \leq k} \, \Sigma_{k,k}    (U^T U = I)
  = X V \Sigma^T (\Sigma^T)^{-1} \Sigma^{-1} U^T U_{\cdot, \leq k} \, \Sigma_{k,k}    ((AB)^{-1} = B^{-1} A^{-1})
  = X V \Sigma^{-1} I_{\cdot, \leq k} \, \Sigma_{k,k}    (U^T U_{\cdot, \leq k} = I_{\cdot, \leq k})
  = X V I_{\cdot, \leq k}    (\Sigma^{-1} I_{\cdot, \leq k} = \Sigma^{-1}_{k,k})
H = X V_{\cdot, \leq k}

Thus H is a linear transformation of X and W = V_{\cdot, \leq k}
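This can be checked numerically; a short NumPy sketch on random data (purely illustrative, assuming np.linalg.svd) verifying that the SVD-based H and the linear encoding X V_{·,≤k} coincide:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 100, 10, 3

# zero-mean, 1/sqrt(m)-scaled data matrix, as in the normalization above
X0 = rng.normal(size=(m, n))
X = (X0 - X0.mean(axis=0)) / np.sqrt(m)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

H_svd = U[:, :k] * S[:k]          # U_{.,<=k} Σ_{k,k}
H_lin = X @ Vt[:k].T              # X V_{.,<=k}, i.e. a linear encoding with W = V_{.,<=k}

assert np.allclose(H_svd, H_lin)  # the two expressions for H coincide
```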
We have encoder W = V_{\cdot, \leq k}

From SVD, we know that V is the matrix of eigenvectors of X^T X

From PCA, we know that P is the matrix of eigenvectors of the covariance matrix

We saw earlier that, if the entries of X are normalized by

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \Big( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \Big)

then X^T X is indeed the covariance matrix

Thus, the encoder matrix for the linear autoencoder (W) and the projection matrix (P) for PCA could indeed be the same. Hence proved.
Remember
The encoder of a linear autoencoder is equivalent to PCA if we
use a linear encoder
use a linear decoder
use a squared error loss function
and normalize the inputs to

\hat{x}_{ij} = \frac{1}{\sqrt{m}} \Big( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \Big)
Module 7.3: Regularization in autoencoders
(Motivation)

While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders

Here, (as stated earlier) the model can simply learn to copy xi to h and then h to x̂i

To avoid poor generalization, we need to introduce regularization
The simplest solution is to add an L2-regularization term to the objective function

\min_{\theta = \{W, W^*, b, c\}} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 + \lambda \lVert \theta \rVert^2

This is very easy to implement and just adds a term λW to the gradient ∂L(θ)/∂W (and similarly for the other parameters)
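In code this is just one extra term in the weight gradient; a tiny sketch (NumPy, following the slide's λW convention; names are illustrative):

```python
import numpy as np

def l2_regularized_grad(grad_W, W, lam):
    """Add the L2 penalty's contribution to an existing gradient ∂L(θ)/∂W."""
    return grad_W + lam * W
```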
Another trick is to tie the weights of the encoder and decoder, i.e., set W∗ = W^T

This effectively reduces the capacity of the autoencoder and acts as a regularizer
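A sketch of a tied-weight forward pass (NumPy; a linear decoder is used here purely for illustration, and all names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tied_forward(x, W, b, c):
    """Forward pass of a weight-tied autoencoder: the decoder reuses W^T."""
    h = sigmoid(W @ x + b)    # encoder
    x_hat = W.T @ h + c       # decoder with W* = W^T
    return h, x_hat
```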
Module 7.4: Denoising Autoencoders

A denoising autoencoder simply corrupts the input data using a probabilistic process (P(x̃ij | xij)) before feeding it to the network

A simple P(x̃ij | xij) used in practice is the following

P(x̃ij = 0 | xij) = q
P(x̃ij = xij | xij) = 1 − q

In other words, with probability q the input is flipped to 0 and with probability (1 − q) it is retained as it is

Figure: the corrupted input x̃i ∼ P(x̃ij | xij) is fed to the encoder, which produces h and the reconstruction x̂i
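A sketch of this corruption process (NumPy; names are illustrative):

```python
import numpy as np

def corrupt(x, q, rng):
    """Masking noise: each component of x is set to 0 with probability q."""
    mask = rng.random(x.shape) >= q   # keep a component with probability 1 − q
    return x * mask

rng = np.random.default_rng(4)
x = rng.random(10)
x_tilde = corrupt(x, q=0.25, rng=rng)
# note: the loss is still computed against the clean x, not x_tilde
```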
How does this help?

This helps because the objective is still to reconstruct the original (uncorrupted) xi

\arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

It no longer makes sense for the model to copy the corrupted x̃i into h(x̃i) and then into x̂i (the objective function will not be minimized by doing so)

Instead the model will now have to capture the characteristics of the data correctly.

For example, it will have to learn to reconstruct a corrupted xij correctly by relying on its interactions with other elements of xi
We will now see a practical application in which AEs are used and then compare denoising autoencoders with regular autoencoders
Task: Hand-written digit recognition, |xi| = 784 = 28 × 28

Figure: MNIST Data

Figure: Basic approach — we use the raw 28 × 28 pixels as input features to a classifier over the digit classes 0, 1, 2, 3, . . . , 9
Figure: AE approach (first learn important characteristics of the data) — an autoencoder with input xi ∈ R784, hidden representation h ∈ Rd and reconstruction x̂i ∈ R784
Figure: AE approach (and then train a classifier over the digit classes 0, 1, 2, 3, . . . , 9 on top of this hidden representation h ∈ Rd)
We will now see a way of visualizing AEs and use this visualization to compare different AEs
We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration xi

For example,

h1 = σ(W1^T xi)   [ignoring bias b]

where W1 is the trained vector of weights connecting the input to the first hidden neuron

What values of xi will cause h1 to be maximum (or maximally activated)? Suppose we assume that our inputs are normalized so that ||xi|| = 1

\max_{x_i} \{ W_1^T x_i \} \quad \text{s.t.} \quad \lVert x_i \rVert^2 = x_i^T x_i = 1

Solution: x_i = \frac{W_1}{\sqrt{W_1^T W_1}}
Thus the inputs

x_i = \frac{W_1}{\sqrt{W_1^T W_1}}, \frac{W_2}{\sqrt{W_2^T W_2}}, \ldots, \frac{W_n}{\sqrt{W_n^T W_n}}

will respectively cause hidden neurons 1 to n to maximally fire

Let us plot these images (xi's) which maximally activate the first k neurons of the hidden representations learned by a vanilla autoencoder and by different denoising autoencoders

These xi's are computed by the above formula using the weights (W1, W2, . . . , Wk) learned by the respective autoencoders
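A sketch (NumPy, not from the slides) of how these maximally-activating inputs could be obtained from trained encoder weights; the shape convention is an assumption:

```python
import numpy as np

def maximally_activating_inputs(W):
    """Given encoder weights W of shape (k, n), where row l holds the weights
    of hidden neuron l, return the unit-norm inputs x = W_l / sqrt(W_l^T W_l)
    that maximally activate each neuron."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / norms   # row l is the input that maximally fires neuron l

# e.g. for MNIST-shaped filters: maximally_activating_inputs(W)[l].reshape(28, 28)
```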
Figure: Vanilla AE (no noise)    Figure: 25% Denoising AE (q = 0.25)    Figure: 50% Denoising AE (q = 0.5)

The vanilla AE does not learn many meaningful patterns

The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0' or a '2' or a '3' or an '8' or a '9')

As the noise increases, the filters become wider because the neuron has to rely on more adjacent pixels to feel confident about a stroke
We saw one form of P(x̃ij | xij) which flips a fraction q of the inputs to zero

Another way of corrupting the inputs is to add Gaussian noise to the input

x̃ij = xij + N(0, 1)

We will now use such a denoising AE on a different dataset and see its performance
Figure: Data    Figure: AE filters    Figure: Weight decay filters

The hidden neurons essentially behave like edge detectors

PCA does not give such edge detectors
Module 7.5: Sparse Autoencoders

A hidden neuron with sigmoid activation will have values between 0 and 1

We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.

A sparse autoencoder tries to ensure that the neuron is inactive most of the time.
The average value of the activation of a neuron l is given by

\hat{\rho}_l = \frac{1}{m} \sum_{i=1}^{m} h(x_i)_l

If the neuron l is sparse (i.e. mostly inactive) then ρ̂l → 0

A sparse autoencoder uses a sparsity parameter ρ (typically very close to 0, say, 0.005) and tries to enforce the constraint ρ̂l = ρ

One way of ensuring this is to add the following term to the objective function

\Omega(\theta) = \sum_{l=1}^{k} \rho \log \frac{\rho}{\hat{\rho}_l} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_l}

When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
Figure: Ω(θ) plotted as a function of ρ̂l for ρ = 0.2

The function will reach its minimum value (which is 0) when ρ̂l = ρ.
Now,

L̂(θ) = L(θ) + Ω(θ)

L(θ) is the squared error loss or cross entropy loss and Ω(θ) is the sparsity constraint.

We already know how to calculate ∂L(θ)/∂W.

Let us see how to calculate ∂Ω(θ)/∂W.

Ω(θ) = Σ_{l=1}^{k} [ ρ log(ρ/ρ̂l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂l)) ]

This can be re-written as

Ω(θ) = Σ_{l=1}^{k} [ ρ log ρ − ρ log ρ̂l + (1 − ρ) log(1 − ρ) − (1 − ρ) log(1 − ρ̂l) ]

By the chain rule:

∂Ω(θ)/∂W = (∂Ω(θ)/∂ρ̂) · (∂ρ̂/∂W)

where

∂Ω(θ)/∂ρ̂ = [ ∂Ω(θ)/∂ρ̂1 , ∂Ω(θ)/∂ρ̂2 , . . . , ∂Ω(θ)/∂ρ̂k ]^T

For each neuron l ∈ 1 . . . k in the hidden layer, we have

∂Ω(θ)/∂ρ̂l = −ρ/ρ̂l + (1 − ρ)/(1 − ρ̂l)

and

∂ρ̂l/∂W = xi (g′(W^T xi + b))^T    (see next slide)

Finally,

∂L̂(θ)/∂W = ∂L(θ)/∂W + ∂Ω(θ)/∂W

(and we know how to calculate both terms on the R.H.S.)

44/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
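As a rough illustration of the two quantities above (our sketch, not the lecture's code): ρ̂ is obtained by averaging the hidden activations over the m training examples, and ∂Ω(θ)/∂ρ̂ is then a k-dimensional vector. Here g is assumed to be the logistic sigmoid and W is stored with shape (n inputs × k hidden units); both are implementation choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rho_hat(X, W, b):
    # average activation of each of the k hidden neurons over the m examples in X (m x n)
    return sigmoid(X @ W + b).mean(axis=0)          # shape (k,)

def d_omega_d_rho_hat(rho_hat_vec, rho=0.2):
    # dOmega/drho_hat_l = -rho/rho_hat_l + (1 - rho)/(1 - rho_hat_l), for each l
    return -rho / rho_hat_vec + (1 - rho) / (1 - rho_hat_vec)

rng = np.random.default_rng(0)
X, W, b = rng.normal(size=(10, 4)), rng.normal(size=(4, 3)), np.zeros(3)
print(d_omega_d_rho_hat(rho_hat(X, W, b)))          # one gradient entry per hidden neuron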
Derivation

∂ρ̂/∂W = [ ∂ρ̂1/∂W   ∂ρ̂2/∂W   . . .   ∂ρ̂k/∂W ]

For each element in the above equation we can calculate ∂ρ̂l/∂W (which is the partial derivative of a scalar w.r.t. a matrix, i.e. a matrix). For a single element Wjl of that matrix:

∂ρ̂l/∂Wjl = ∂[ (1/m) Σ_{i=1}^{m} g(W_{:,l}^T xi + bl) ] / ∂Wjl

          = (1/m) Σ_{i=1}^{m} ∂g(W_{:,l}^T xi + bl)/∂Wjl

          = (1/m) Σ_{i=1}^{m} g′(W_{:,l}^T xi + bl) xij

So in matrix notation we can write it as:

∂ρ̂l/∂W = xi (g′(W^T xi + b))^T
45/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
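Putting the pieces of the chain rule together, here is a hedged NumPy sketch of ∂Ω(θ)/∂W with a finite-difference check. It again assumes g is the logistic sigmoid, averages ρ̂l over the batch as in the derivation above, and stores W with shape (n × k); all of these are our implementation choices, not prescribed by the lecture:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def omega(W, b, X, rho=0.2):
    rho_hat = sigmoid(X @ W + b).mean(axis=0)                 # (k,)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def grad_omega_W(W, b, X, rho=0.2):
    H = sigmoid(X @ W + b)                                    # hidden activations, (m, k)
    rho_hat = H.mean(axis=0)                                  # (k,)
    d_rho = -rho / rho_hat + (1 - rho) / (1 - rho_hat)        # dOmega/drho_hat, (k,)
    dH = H * (1 - H)                                          # g'(pre-activation) for the sigmoid
    return X.T @ (dH * d_rho) / X.shape[0]                    # chain rule, shape (n, k)

# finite-difference check on one entry of W
rng = np.random.default_rng(0)
X, W, b = rng.normal(size=(5, 4)), rng.normal(size=(4, 3)), rng.normal(size=3)
eps = 1e-6
W_eps = W.copy(); W_eps[1, 2] += eps
numeric = (omega(W_eps, b, X) - omega(W, b, X)) / eps
print(numeric, grad_omega_W(W, b, X)[1, 2])                   # the two values should roughly agree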
Module 7.6: Contractive Autoencoders

46/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.

It does so by adding the following regularization term to the loss function

Ω(θ) = ‖Jx(h)‖²_F

where Jx(h) is the Jacobian of the encoder.

Let us see what it looks like.

[Figure: autoencoder with input x, hidden representation h and reconstruction x̂]

47/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
If the input has n dimensions and the hidden layer has k dimensions then

         [ ∂h1/∂x1  . . .  ∂h1/∂xn ]
         [ ∂h2/∂x1  . . .  ∂h2/∂xn ]
Jx(h) =  [    ⋮                ⋮   ]
         [ ∂hk/∂x1  . . .  ∂hk/∂xn ]

In other words, the (l, j) entry of the Jacobian captures the variation in the output of the lth neuron with a small variation in the jth input.

‖Jx(h)‖²_F = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂hl/∂xj)²

48/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
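To make the Jacobian concrete, here is a small sketch (ours, not the lecture's) that builds Jx(h) explicitly for a single input and evaluates the penalty; it assumes a sigmoid encoder h = g(Wx + b) with encoder weights W of shape (k × n):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(W, b, x):
    # encoder: h = sigmoid(W x + b), with W of shape (k, n) and x of shape (n,)
    h = sigmoid(W @ x + b)                          # (k,)
    # for the sigmoid, dh_l/dx_j = h_l * (1 - h_l) * W_lj
    J = (h * (1 - h))[:, None] * W                  # Jacobian J_x(h), shape (k, n)
    return np.sum(J ** 2)                           # squared Frobenius norm

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 5)), rng.normal(size=3), rng.normal(size=5)
print(contractive_penalty(W, b, x))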
What is the intuition behind this?

Consider ∂h1/∂x1. What does it mean if ∂h1/∂x1 = 0?

It means that this neuron is not very sensitive to variations in the input x1.

But doesn't this contradict our other goal of minimizing L(θ), which requires h to capture variations in the input?

49/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Indeed it does, and that is the idea.

By pitting these two contradicting objectives against each other we ensure that h is sensitive only to the very important variations observed in the training data.

L(θ) - capture important variations in the data

Ω(θ) - do not capture variations in the data

Tradeoff - capture only the very important variations in the data

[Figure: autoencoder with input x, hidden representation h and reconstruction x̂]

50/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
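A compact way to see the tradeoff is to write the full objective, reconstruction loss plus λ times the contractive penalty, for a single example. This is a sketch under the same assumptions as before (sigmoid encoder, linear decoder x̂ = W*h + c); the name lam for the tradeoff weight is ours:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_ae_loss(W, b, W_star, c, x, lam=0.1):
    h = sigmoid(W @ x + b)                          # encoder, W of shape (k, n)
    x_hat = W_star @ h + c                          # linear decoder, W_star of shape (n, k)
    recon = np.sum((x_hat - x) ** 2)                # L(theta): capture variations in the data
    J = (h * (1 - h))[:, None] * W                  # encoder Jacobian for the sigmoid
    omega = np.sum(J ** 2)                          # Omega(theta): penalize sensitivity
    return recon + lam * omega                      # lam balances the two contradicting goals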
Let us try to understand this with the help of an illustration.

51/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
[Figure: data points in the x–y plane, with directions u1 and u2; most of the variance lies along u1]

Consider the variations in the data along directions u1 and u2.

It makes sense for a neuron to be maximally sensitive to variations along u1.

At the same time it makes sense to inhibit a neuron from being sensitive to variations along u2, as the variation along u2 appears to be small noise that is unimportant for reconstruction.

By doing so we can balance the contradicting goals of good reconstruction and low sensitivity.

What does this remind you of?

52/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
Module 7.7 : Summary

53/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
[Figure: a linear autoencoder (x → h → x̂) shown alongside the PCA directions u1, u2 in the x–y plane; the hidden representation h corresponds to the principal directions]

PCA:  P^T X^T X P = D

Linear autoencoder:  min_θ ‖X − H W*‖²_F , with the optimum obtained from the SVD  U Σ V^T

54/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
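The equivalence in the summary can be checked numerically: for centred data, the reconstruction from the top-k principal directions coincides with the best rank-k approximation of X given by the SVD, which is what an optimal linear autoencoder with squared error loss achieves. A small sketch of ours:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)              # centre the data, as PCA assumes

k = 2
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# PCA reconstruction using the top-k principal directions
X_pca = X @ Vt[:k].T @ Vt[:k]

# best rank-k approximation of X (what the optimal linear autoencoder achieves)
X_lin_ae = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

print(np.allclose(X_pca, X_lin_ae))  # True: the two reconstructions coincide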
[Figure: denoising autoencoder — the input xi is corrupted to x̃i using P(x̃ij | xij), then encoded to h and reconstructed as x̂i]

Regularization

Ω(θ) = λ‖θ‖²    (weight decay)

Ω(θ) = Σ_{l=1}^{k} ρ log(ρ/ρ̂l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂l))    (sparse)

Ω(θ) = Σ_{j=1}^{n} Σ_{l=1}^{k} (∂hl/∂xj)²    (contractive)

55/55
Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7
