
Machine Learning: Algorithms and Applications

Philip O. Ogunbona

Advanced Multimedia Research Lab


University of Wollongong

Artificial Neural Networks and Deep Learning: An Introduction (III)


Autumn 2020



Outline

1 Autoencoders

2 Generative Adversarial Networks (GAN)

3 References



Autoencoders

Conceptually an autoencoder is a feedforward network trained to copy its input to its output (albeit imperfectly)

Structure (see Figure 1) has a hidden layer h describing the code representing the input

Autoencoder has two parts: an encoder function h = f(x) that generates the representative code of the input and a decoder function r = g(h) that produces a reconstruction from the code

Generalization of autoencoder to stochastic mappings: pencoder(h|x) and pdecoder(x|h)

Typical training strategy is similar to that used for feedforward networks - minibatch gradient descent

Figure 1: General structure of an autoencoder; input x maps to an output r (reconstruction) through internal representation or code h (Goodfellow et al. 2016)
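A minimal PyTorch sketch (not from the slides) of this structure and of one minibatch gradient-descent step; the layer sizes, activations, optimizer and MSE loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # encoder h = f(x) and decoder r = g(h)
        self.encoder = nn.Sequential(nn.Linear(input_dim, code_dim), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code_dim, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One minibatch gradient-descent step: the target is the input itself.
x = torch.rand(64, 784)                     # random minibatch standing in for data
loss = nn.functional.mse_loss(model(x), x)  # L(x, g(f(x)))
opt.zero_grad()
loss.backward()
opt.step()
```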



Stacked autoencoder

Practical autoencoder is a stack

Architecture of a stacked autoencoder is typically symmetrical with respect to the central hidden layer (the coding layer) (see Figure 2)

Figure 2: Example of stacked autoencoder used for the MNIST dataset; notice the 784 (28 × 28) input neurons; 300 hidden neurons; 150 central hidden neurons; a mirroring in the top layer (Géron 2017)
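A sketch of the symmetric 784-300-150-300-784 architecture in Figure 2, assuming PyTorch; activations and the final sigmoid are illustrative choices, not specified on the slide.

```python
import torch.nn as nn

stacked_ae = nn.Sequential(
    nn.Linear(784, 300), nn.ReLU(),     # hidden layer
    nn.Linear(300, 150), nn.ReLU(),     # central coding layer
    nn.Linear(150, 300), nn.ReLU(),     # mirrored hidden layer
    nn.Linear(300, 784), nn.Sigmoid(),  # reconstruction of the 28 x 28 input
)
```

A common refinement (not shown) is to tie the decoder weights to the transpose of the encoder weights so that the two mirrored halves share parameters.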



Undercomplete/Overcomplete autoencoders

Constraining h to have smaller dimension than x results in an undercomplete autoencoder

h captures the most salient features of the input

Learning entails minimizing a loss function

L(x, g(f(x)))     (1)

L penalizes g(f(x)) for being dissimilar to x

If the dimension of the code is greater than that of the input, we have an overcomplete autoencoder

Any autoencoder architecture can be trained without the risk of excess capacity learning a trivial identity mapping, by using regularization

Regularization can impart properties to the loss function (see the sketch after this list):
sparsity of representation
smallness of derivative of representation
robustness to noise
robustness to missing data
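A hedged sketch of the regularized objective implied above: reconstruction error plus a penalty on the code. The names f, g, omega and lam are placeholders, and the L1 penalty is just one example of a sparsity-imparting regularizer.

```python
import torch.nn.functional as F

def regularized_loss(x, f, g, omega, lam=1e-3):
    h = f(x)                                  # code
    r = g(h)                                  # reconstruction
    return F.mse_loss(r, x) + lam * omega(h)  # L(x, g(f(x))) + lam * Omega(h)

# Example penalty encouraging a sparse representation (first property above).
l1_penalty = lambda h: h.abs().mean()
```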



Autoencoders and Principal Component Analysis
(PCA)

With a linear decoder g(h) and mean squared error loss, an undercomplete autoencoder learns the same subspace as PCA

With a nonlinear encoder and decoder (respectively, f(x) and g(h)), an autoencoder can learn a more powerful generalization of PCA
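A small numerical sketch of the PCA connection, assuming PyTorch and NumPy with synthetic data: a linear autoencoder trained with squared error recovers (approximately) the subspace spanned by the top principal components. Sizes and hyperparameters are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

X = np.random.randn(500, 10).astype(np.float32)
X -= X.mean(axis=0)                        # PCA assumes centred data

# Linear encoder/decoder, no activations, squared-error loss, 2-D code.
enc = nn.Linear(10, 2, bias=False)
dec = nn.Linear(2, 10, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
xt = torch.from_numpy(X)
for _ in range(2000):
    loss = nn.functional.mse_loss(dec(enc(xt)), xt)
    opt.zero_grad(); loss.backward(); opt.step()

# Top-2 principal directions from the SVD for comparison: the columns of
# dec.weight span approximately the same subspace as Vt[:2].T.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
```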



Sparse autoencoders
Sparse autoencoder has cost function used for training in the form of
reconstruction error and sparsity penalty on the code layer h:

L(x, g(f (x))) + Ω(h) (2)

where h is the encoder output; h = f (x) typically (see Figure 1)

Sparse autoencoders are useful in learning features that can be input for
other tasks, e.g. classification (think about semi-supervised classification)

Sparse autoencoders can be interpreted as approximating maximum likelihood training of a generative model that has latent variables (in this case h)

In this respect, it is maximizing

log pmodel (h, x) = log pmodel (h) + log pmodel (x|h) (3)

log pmodel (h) can be sparsity-inducing
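One common concrete choice of the sparsity penalty Ω(h) in Eq. (2), sketched as an assumption (it is not specified on the slide): a KL-divergence term pushing the mean activation of each code unit toward a small target rho.

```python
import torch

def kl_sparsity_penalty(h, rho=0.05, eps=1e-8):
    # h: (batch, code_dim) activations in (0, 1), e.g. after a sigmoid encoder
    q = h.mean(dim=0).clamp(eps, 1 - eps)   # mean activation of each code unit
    return (rho * torch.log(rho / q)
            + (1 - rho) * torch.log((1 - rho) / (1 - q))).sum()
```

The penalty is added to the reconstruction loss, scaled by a sparsity weight, exactly in the form of Eq. (2).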


Denoising autoencoders

Denoising aims to reduce the noise in signals

Denoising autoencoders minimize

L(x, g(f (x̃))) (4)

where x̃ is a copy of x corrupted by some form of noise

Training process forces f and g to implicitly learn the structure of pdata (x)

Another form of regularization, λ Σi ||∇x hi||2, forces the learning of a function that does not change much when x changes slightly:

L(x, g(f(x))) + λ Σi ||∇x hi||2     (5)
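A minimal denoising step implementing Eq. (4), assuming PyTorch and any encoder/decoder model (e.g. the Autoencoder sketch earlier); Gaussian corruption and its standard deviation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def denoising_step(model, x, opt, noise_std=0.3):
    x_tilde = x + noise_std * torch.randn_like(x)  # corrupted copy x~ of x
    loss = F.mse_loss(model(x_tilde), x)           # L(x, g(f(x~))): target is the clean x
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```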



Denoising autoencoders

Figure 3: Stacked convolutional denoising autoencoder

Figure 4: Comparison between output of stacked convolutional denoising autoencoder and median filter; Gaussian noise: µ = 0, σ = 1



More autoencoders - cost functions

Contractive autoencoder
Regularization is introduced on the code h = f(x) to encourage the derivatives of f to be as small as possible (see the autograd sketch below):

Ω(h) = λ ||∂f(x)/∂x||2F    (squared Frobenius norm of the encoder Jacobian)

Contractive autoencoder and denoising autoencoder are related when input noise is small and Gaussian (Goodfellow et al. 2016):
denoising autoencoders make the reconstruction function resist small but
finite-sized perturbations of the input;
contractive autoencoders make the feature extraction function resist
infinitesimal perturbations of the input
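A sketch (an assumption, not the slides' notation) of how the penalty Ω(h) above can be computed with automatic differentiation in PyTorch: one backward pass per code unit accumulates the squared Frobenius norm of the encoder Jacobian.

```python
import torch

def contractive_penalty(f, x):
    # Sum over the batch of the squared Frobenius norm of the Jacobian df/dx.
    x = x.clone().requires_grad_(True)
    h = f(x)                                   # (batch, code_dim)
    penalty = 0.0
    for i in range(h.shape[1]):                # one backward pass per code unit
        grad_i, = torch.autograd.grad(h[:, i].sum(), x, create_graph=True)
        penalty = penalty + (grad_i ** 2).sum()
    return penalty
```

The training cost is then the reconstruction error plus λ times this penalty.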



Generative adversarial networks (GAN)

Central problem addressed by GAN is density estimation; GAN implicitly captures the underlying data distribution

GAN can be used in both unsupervised and semi-supervised learning settings

Characterised by training two networks in competition:
There is a network, named the generator (G), trying to produce samples from a distribution that is learned from given data - mimicking or forging synthetic data
There is a second network, the discriminator (D), that is able to tell the synthetic samples from the real ones

Objective is to be able to generate synthetic signals that are no different from the real ones



Generative adversarial networks (GAN)

Figure 5: Two models learned while training a GAN: the Discriminator (D) and the Generator (G); models implemented using neural networks, but any differentiable system (mapping) can also be used (Creswell et al. 2018)



Generative adversarial networks (GAN)

In Figure 5, Generator network has no access to the real samples

Generator network is a mapping from some representation space (latent space) to the data sample space:

G : G(z) → R|x|

where z ∈ R|z| is a sample from the latent space, x ∈ R|x| is a data sample, and | · | denotes the number of dimensions

Discriminator network, D, maps a data sample to a probability that the sample is from the real data distribution and not the generator distribution:

D : D(x) → (0, 1)

pdata(x) represents the probability density function over the data samples (in R|x|) and pg(x) the distribution of the samples produced by the generator
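A sketch of the two mappings above, assuming PyTorch; the latent and data dimensions, layer widths and activations are illustrative. G maps a latent vector z to the data space R|x|; D maps a data sample to a probability in (0, 1).

```python
import torch.nn as nn

latent_dim, data_dim = 100, 784           # |z| and |x| are assumptions

G = nn.Sequential(                        # G : z -> R^|x|
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

D = nn.Sequential(                        # D : x -> (0, 1)
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)
```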



Generative adversarial networks (GAN)

During training we set objective functions for the generator (JG(ΘG; ΘD)) and the discriminator (JD(ΘD; ΘG))

Note that JG and JD are co-dependent on the network parameters ΘG and ΘD as the networks are iteratively trained



Generative adversarial networks (GAN)

Figure 6: During GAN training, the generator is encouraged to produce a distribution of samples, pg(x), to match that of the real data, pdata(x) (Creswell et al. 2018)



Generative adversarial networks (GAN)

Training GAN

We find parameters of a discriminator that maximize its classification accuracy and find the parameters of a generator that maximally confuses the discriminator

Cost of training is evaluated using a value function; solve the following mini-max problem:

maxD minG V(G, D)

where

V(G, D) = Epdata(x) [log D(x)] + Epg(x) [log(1 − D(x))]

Parameters of one model are updated while the parameters of the other are fixed

Optimal discriminator is unique (Goodfellow et al. 2014):

D∗(x) = pdata(x) / (pdata(x) + pg(x))

Generator is optimal when (Goodfellow et al. 2014)

pg(x) = pdata(x)
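A compact sketch of this alternating mini-max procedure, assuming PyTorch. G and D are generator and discriminator modules (e.g. the sketch earlier), real_batch is a hypothetical data-loading helper, and all hyperparameters (steps, k, batch, lr) are illustrative.

```python
import torch

def train_gan(G, D, real_batch, latent_dim=100, steps=10000, k=1, batch=64, lr=2e-4, eps=1e-8):
    opt_D = torch.optim.Adam(D.parameters(), lr=lr)
    opt_G = torch.optim.Adam(G.parameters(), lr=lr)
    for _ in range(steps):
        # Discriminator: ascend E[log D(x)] + E[log(1 - D(G(z)))] with G fixed,
        # repeated k times per generator update (see Figure 7).
        for _ in range(k):
            x = real_batch(batch)                       # hypothetical data helper
            z = torch.randn(batch, latent_dim)
            v = (torch.log(D(x) + eps).mean()
                 + torch.log(1 - D(G(z).detach()) + eps).mean())
            opt_D.zero_grad()
            (-v).backward()                             # negate: gradient ascent on V
            opt_D.step()
        # Generator: descend E[log(1 - D(G(z)))] with D fixed.
        z = torch.randn(batch, latent_dim)
        loss_G = torch.log(1 - D(G(z)) + eps).mean()
        opt_G.zero_grad()
        loss_G.backward()
        opt_G.step()
```

Goodfellow et al. (2014) also suggest having the generator maximize log D(G(z)) instead, which gives stronger gradients early in training.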



Generative adversarial networks (GAN)

Figure 7: The main loop of GAN training. Novel data samples, x′, may be drawn by passing random samples, z, through the generator network. The gradient of the discriminator may be updated k times before updating the generator (Creswell et al. 2018)



Generative adversarial networks (GAN)

Other GAN architectures

Initial GAN architecture used fully connected neural networks
Difficult to train; successful only with a subset of datasets - stability issues
Deep convolutional GAN provided more stability
Conditional GAN - both the generator and the discriminator networks are class-conditional (Figure 8; see the sketch after this list)
Conditional GANs can provide better representations for multimodal data generation
InfoGAN decomposes the noise source into an incompressible source and a “latent code”; it attempts to discover latent factors of variation by maximizing the mutual information between the latent code and the generator’s output (Figure 9)
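A sketch of the class-conditioning idea behind Figure 8, assuming PyTorch: both networks receive the class label, here one-hot encoded and concatenated with their usual inputs. Sizes and architectures are illustrative, and only the conditioning mechanism is shown, not the training loop.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, n_classes = 100, 784, 10

G_cond = nn.Sequential(                    # generator sees z and the label y
    nn.Linear(latent_dim + n_classes, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)
D_cond = nn.Sequential(                    # discriminator sees x and the label y
    nn.Linear(data_dim + n_classes, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(64, latent_dim)
y = nn.functional.one_hot(torch.randint(0, n_classes, (64,)), n_classes).float()
x_fake = G_cond(torch.cat([z, y], dim=1))      # generate conditioned on y
score = D_cond(torch.cat([x_fake, y], dim=1))  # discriminate conditioned on y
```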



Generative adversarial networks (GAN)

Other GAN architectures

Figure 8: Conditional GAN
Figure 9: InfoGAN



Bibliography

Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B. & Bharath, A. A. (2018), ‘Generative adversarial networks: An overview’, IEEE Signal Processing Magazine 35(1), 53–65.

Géron, A. (2017), Hands-on Machine Learning with Scikit-Learn and TensorFlow, O’Reilly Media, Inc., CA, USA.

Goodfellow, I., Bengio, Y. & Courville, A. (2016), Deep Learning, MIT Press.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. (2014), Generative adversarial nets, in ‘Proc. Advances in Neural Information Processing Systems Conf.’, Montreal, Quebec, Canada, pp. 2672–2680.

