
Advanced Design for AI

Algorithms
Lec.: 1
GAN
Outline
1. Introduction to Generative Models
2. Generative Adversarial Networks (GANs)
3. Improved GAN Techniques
4. Deep Convolutional GAN (DCGAN)
5. Conditional GAN (CGAN)
Generative vs Discriminative Model

• Discriminative Model
• Learn hypothesis function that map input data (x) to some desired
output class label (y). In probabilistic terms, learn the conditional
distribution P(y|x).

• Generative Model
• Learn the joint probability of the input data (x) and labels
simultaneously, i.e. P(x,y). This can be converted to P(y|x) for
classification via Bayes rule.
Supervised vs Unsupervised Learning
Supervised Learning Unsupervised Learning

Data: (𝑥, 𝑦) Data: 𝑥


𝑥 is data, 𝑦 is label 𝑥 is data, no label!
Goal: Learn a function to map 𝑥 → 𝑦 Goal: Learn hidden structures inside data

Examples: Examples:
▪ Classification & Regression ▪ Clustering
▪ Object Detection ▪ Dimensionality Reduction
▪ Semantic Segmentation ▪ Feature Learning
▪ Image Captioning ▪ Density Estimation (Core Problem in GANs)
Generative Models
Problem: Given training data, generate new samples from same distribution.

Training Data ~ 𝑝𝑑𝑎𝑡𝑎 (𝑥) Generated Samples ~ 𝑝𝑔 (𝑥)

Goal: We want to learn 𝒑𝒈 (𝒙) similar to 𝒑𝒅𝒂𝒕𝒂 (𝒙). Learn about data through generation.
Addresses density estimation, a core problem in unsupervised learning
▪ Explicit Density: Explicitly define and solve for 𝑝𝑔 (𝑥)
▪ Implicit Density: Learn a model that can sample from 𝑝𝑔 (𝑥) without explicitly defining it
Why Generative Models Important?
Advantages of Generative Models:
• Represent and manipulate high-dimensional probability distribution
• Can be incorporated into (inverse) reinforcement learning in several ways
• Can be trained with missing data on semi-supervised learning
• Can provide predictions on inputs that are missing data
• Enable machine learning to work with multi-modal outputs
• Many tasks intrinsically require realistic generation of samples from some distributions:
• Creating Art and Realistic Images (Data Augmentation)
• Single Image Super-Resolution
• Image-To-Image Translation What I cannot create, I do not understand!

-- Richard Feynman
Applications of Generative Models
Computer Vision Speech Recognition Natural Language
Realistic Image Creation Generative Speech Enhancement Realistic Text Generation
Image-to-Image Translation Speech Driven Animation Text Translation
Synthetic Image Creation Lips Talking and Reading Text Corpora Generation
Image and Shape In-Painting Synthetic Audio/Voice Generative Machine Translation
Object/Image Reconstruction Voice Conversion Conditional Sequence Generation
Image Super-Resolution Voice Separation Neural Dialog Generation
Face Emotion and Aging Voice Impersonation Generative Conversation Responses
Video Frame Prediction Speech and Speaker Emotion Text Style Transfer
Video Deblurring Postfilter for Synthesized speech Abstractive Summarization
Many More …. Many More …. Many More ….

Source: The GAN Zoo and Really Awesome GAN


Variety of GAN Models
Generative Adversarial Networks
Problems: We want to sample from complex, high-
dimensional training data distribution 𝑝𝑑𝑎𝑡𝑎 (𝑥). No
direct way to do this! Output: Generated sample
from training distribution

Key Ideas: Sample from a simple distribution, e.g.


random noise. Learn transformation to training data
distribution 𝑝𝑑𝑎𝑡𝑎 (𝑥)
Generator
(Neural Network)
p(z)
Question: What can we use to represent this complex
transformation?

Input: Random Noise 𝑧


Answer: MLP or Deep Feedforward Neural Networks
Generative Adversarial Networks
Output: Scalar
[Synthetic (0), Real (1)]
Problems: How to train generator network?
Key Ideas: Compete two NNs in two-player minimax game
Discriminator Network (D)
• Generator Network (G)
• Tries to mimic example from training dataset,
which sampled from “unknown” true data
distribution. It takes noise z as input and generate Synthetic Faces
synthetic samples. (generated)

• Discriminator Network (D)


Input: Real Images
• Receive samples from both G and training data (CelebA dataset)
(but it is not told where the sample comes from)
and predict whether it is a data sample or Generator Network (G)
synthetic.
• D trained to make accurate predictions, and G is trained to
output samples that fool the discriminator. Input: Random Noise 𝑧
Training MM-GANs: Algorithm

▪ Some find k=1 more stable, others use k > 1, no best rule.
▪ Recent work (e.g. Wasserstein GAN) alleviates this problem, better stability!
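A minimal PyTorch sketch of this alternating minimax training loop, assuming `G` and `D` are user-defined generator/discriminator modules (with D ending in a sigmoid), `opt_G`/`opt_D` their optimizers, and `real_batch` a minibatch of training data (all hypothetical names); the discriminator takes k steps per generator step, and the generator uses the common non-saturating loss.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real_batch, z_dim=100, k=1):
    """One iteration of the minimax GAN game: k discriminator steps, then one generator step."""
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size, 1)    # targets for real samples
    zeros = torch.zeros(batch_size, 1)  # targets for synthetic samples

    # --- k discriminator updates: maximize log D(x) + log(1 - D(G(z))) ---
    for _ in range(k):
        z = torch.randn(batch_size, z_dim)
        fake = G(z).detach()                       # stop gradients into G
        loss_D = F.binary_cross_entropy(D(real_batch), ones) + \
                 F.binary_cross_entropy(D(fake), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- one generator update: non-saturating loss, maximize log D(G(z)) ---
    z = torch.randn(batch_size, z_dim)
    loss_G = F.binary_cross_entropy(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```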
Problems in Training GANs
• Training GANs requires finding Nash equilibrium of a non-convex game with
continuous, high dimensional parameters.
• Gradient descent techniques are designed to find low value of cost function,
rather than to find the Nash equilibrium of a game.
• When used to seek a Nash equilibrium, gradient descent may fail to converge.
• Salimans, et al., @ NIPS 2016
• Several heuristic techniques to
encourage convergence of GAN
game. All codes and
hyperparameters available at
Github.
Improved GAN Techniques
• Training GANs consists in finding a Nash equilibrium to a two-player non-
cooperative game. A Nash equilibrium is a point such that cost functions of D and
G are both at minimum. Finding Nash equilibria is a very difficult problem.
Algorithms exist for specialized cases, but hard to apply to the GAN game, where
the cost functions are non-convex, the parameters are continuous, and the
parameter space is extremely high-dimensional.
• Common Heuristic Techniques Described in The Paper:
1. Feature Matching
2. Minibatch Discrimination
3. Historical Averaging
4. One-sided Label Smoothing
5. Virtual Batch Normalization
• Note: the contributions in this paper are heuristic; a more rigorous theoretical understanding needs to be developed in future work.
Feature Matching
• Addresses the instability of GANs by specifying a new objective for the generator
that prevents it from overtraining on the current discriminator.
• Instead of directly maximizing the output of the discriminator, the new objective
requires the generator to generate data that matches the statistics of the real
data.
• Specifically, train the generator to match the expected value of the features on an
intermediate layer of the discriminator.
• Empirical results indicate that feature matching is indeed effective in situations
where regular GAN becomes unstable.
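A minimal sketch of a feature-matching generator objective, assuming the discriminator exposes a hypothetical `features(x)` method returning an intermediate-layer activation; the generator is trained to match the expected value of those features on real data.

```python
import torch

def feature_matching_loss(D, real_batch, fake_batch):
    """L2 distance between expected intermediate discriminator features
    on real and generated minibatches (feature matching)."""
    f_real = D.features(real_batch).mean(dim=0)   # E_x~p_data [ f(x) ]
    f_fake = D.features(fake_batch).mean(dim=0)   # E_z~p_z    [ f(G(z)) ]
    return torch.sum((f_real.detach() - f_fake) ** 2)
```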
Minibatch Discrimination
• Generator can collapse to a parameter setting where it
always emits the same point. When collapse to a single
mode is imminent, the gradient of the discriminator may
point in similar directions for many similar points.
• After collapse has occurred, the discriminator learns that
this single point comes from the generator, but gradient
descent is unable to separate the identical outputs.
• An obvious strategy to avoid this type of failure is to allow
the discriminator to look at multiple data examples in
combination, and perform minibatch discrimination.
• The concept of minibatch discrimination is quite general: Features f from sample x are
any discriminator model that looks at multiple examples multiplied by tensor T and cross-
in combination, rather than in isolation, could potentially sample distance is computed
help avoid collapse of the generator.
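A hedged sketch of a minibatch-discrimination layer in the spirit of the figure note (features f are multiplied by a tensor T, cross-sample L1 distances are turned into similarity statistics and appended to the per-sample features); the exact dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MinibatchDiscrimination(nn.Module):
    """Append cross-sample similarity statistics to per-sample features,
    so the discriminator can detect a collapsed (low-diversity) minibatch."""
    def __init__(self, in_features, out_features, kernel_dim):
        super().__init__()
        # Tensor T maps each feature vector f(x_i) to an (out_features x kernel_dim) matrix M_i.
        self.T = nn.Parameter(torch.randn(in_features, out_features * kernel_dim) * 0.1)
        self.out_features, self.kernel_dim = out_features, kernel_dim

    def forward(self, f):                                   # f: (N, in_features)
        M = (f @ self.T).view(-1, self.out_features, self.kernel_dim)  # (N, B, C)
        diff = M.unsqueeze(0) - M.unsqueeze(1)              # pairwise row differences
        l1 = diff.abs().sum(dim=3)                          # (N, N, B) L1 distances
        o = torch.exp(-l1).sum(dim=1)                       # similarity to the rest of the batch
        return torch.cat([f, o], dim=1)                     # concat with original features
```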
Historical Averaging
• Modify each player’s cost function to include the following term:
  ‖𝜃 − (1/𝑡) Σ_{𝑖=1..𝑡} 𝜃[𝑖]‖²
• Where 𝜃[𝑖] is the value of the parameters at past time i.


• The historical average of the parameters can be updated in an online fashion so
this learning rule scales well to long time series.
• This approach is loosely inspired by the fictitious play algorithm that can find
equilibria in other kinds of games.
• This approach was able to find equilibria of low-dimensional, continuous non-
convex games, such as the minimax game.
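A small sketch of how the historical-averaging penalty might be maintained online for one player's parameters; the class name and update schedule are assumptions, not the paper's reference code.

```python
import torch

class HistoricalAverage:
    """Maintains a running average of parameters and returns the penalty
    || theta - (1/t) * sum_i theta[i] ||^2 added to a player's cost."""
    def __init__(self, params):
        self.params = list(params)
        self.avg = [p.detach().clone() for p in self.params]
        self.t = 1

    def penalty(self):
        return sum(((p - a) ** 2).sum() for p, a in zip(self.params, self.avg))

    def update(self):
        # online update of the historical average after each optimizer step
        self.t += 1
        for p, a in zip(self.params, self.avg):
            a.add_((p.detach() - a) / self.t)
```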
One Sided Label Smoothing
• Label smoothing, a technique from the 1980s recently independently re-
discovered by Szegedy et al., replaces the 0 and 1 targets for a classifier with
smoothed values, like .9 or .1, and was recently shown to reduce the vulnerability
of neural networks to adversarial examples.
• Replacing positive classification targets with α and negative targets with β, the
optimal discriminator becomes:
  D(x) = (α·p_data(x) + β·p_model(x)) / (p_data(x) + p_model(x))
• The presence of pmodel in the numerator is problematic because, in areas where


pdata is approximately zero and pmodel is large, erroneous samples from pmodel have
no incentive to move nearer to the data. We therefore smooth only the positive
labels to α, leaving negative labels set to 0.
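A minimal sketch of a discriminator loss with one-sided label smoothing: positive targets smoothed to α = 0.9, negative targets left at 0, assuming D outputs probabilities.

```python
import torch
import torch.nn.functional as F

def discriminator_loss_one_sided(D, real_batch, fake_batch, alpha=0.9):
    """Discriminator loss with one-sided label smoothing: positive targets
    are smoothed to alpha (e.g. 0.9), negative targets are kept at 0."""
    real_targets = torch.full((real_batch.size(0), 1), alpha)   # smoothed positives
    fake_targets = torch.zeros(fake_batch.size(0), 1)           # negatives stay 0
    return F.binary_cross_entropy(D(real_batch), real_targets) + \
           F.binary_cross_entropy(D(fake_batch), fake_targets)
```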
Virtual Batch Normalization
• Batch normalization greatly improves optimization of neural networks, and was
shown to be highly effective for DCGANs. However, it causes the output of a
neural network for an input example x to be highly dependent on several other
inputs x′ in the same minibatch.
• To avoid this problem Salimans et al. introduce virtual batch normalization (VBN),
in which each example x is normalized based on the statistics collected on a
reference batch of examples that are chosen once and fixed at the start of
training, and on x itself.
• The reference batch is normalized using only its own statistics. VBN is
computationally expensive because it requires running forward propagation on
two minibatches of data, so it is recommended to use it only in the generator
network.
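A rough sketch of virtual batch normalization for 1-D features, assuming the reference batch is registered once via a hypothetical `set_reference` call; the blending of reference and current-batch statistics follows the idea above rather than the exact reference implementation.

```python
import torch

class VirtualBatchNorm1d(torch.nn.Module):
    """Sketch of virtual batch normalization: each batch is normalized with
    statistics of a fixed reference batch blended with its own statistics."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.gamma = torch.nn.Parameter(torch.ones(1, num_features))
        self.beta = torch.nn.Parameter(torch.zeros(1, num_features))
        self.eps = eps
        self.ref_mean = None
        self.ref_mean_sq = None
        self.ref_size = 0

    def set_reference(self, ref_batch):
        # reference batch is chosen once and fixed at the start of training
        self.ref_size = ref_batch.size(0)
        self.ref_mean = ref_batch.mean(dim=0, keepdim=True).detach()
        self.ref_mean_sq = (ref_batch ** 2).mean(dim=0, keepdim=True).detach()

    def forward(self, x):
        # blend current-batch statistics with the fixed reference statistics
        new = x.size(0)
        c = new / float(new + self.ref_size)
        mean = c * x.mean(dim=0, keepdim=True) + (1 - c) * self.ref_mean
        mean_sq = c * (x ** 2).mean(dim=0, keepdim=True) + (1 - c) * self.ref_mean_sq
        var = mean_sq - mean ** 2
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```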
Deep Convolutional GANs
A convolutional neural network (CNN) is neural architecture that
performs well for visual image analysis. The hidden layers of a
CNN typically consist of a series of convolutional layers
that convolve with a multiplication or other dot product. The
activation function is commonly ReLU or LeakyReLU, and is
subsequently followed by additional convolutions such as
pooling layers, fully connected layers and normalization layers.

DCGAN Architecture (Radford et al., ICLR 2016)


1. Replace any pooling layers with strided convolutions in D and fractional-
strided convolutions in G.
2. Use batch normalization in both G and D
3. Remove fully connected hidden layers for deeper architectures.
4. Use ReLU activation in generator for all layers except for the output, which
uses Tanh.
5. Use LeakyReLU activation in the discriminator for all layers.
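A minimal PyTorch sketch of a 64×64 generator following these guidelines (fractional-strided convolutions, batch normalization, ReLU everywhere except the Tanh output, no pooling or fully connected layers); the channel widths are illustrative assumptions, and the input is expected as an (N, z_dim, 1, 1) noise tensor.

```python
import torch.nn as nn

def dcgan_generator(z_dim=100, ngf=64, out_channels=3):
    """64x64 DCGAN-style generator built from fractional-strided (transposed)
    convolutions, batch norm, ReLU, and a Tanh output layer."""
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),    # z (1x1) -> 4x4
        nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 8x8
        nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 16x16
        nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 32x32
        nn.BatchNorm2d(ngf), nn.ReLU(True),
        nn.ConvTranspose2d(ngf, out_channels, 4, 2, 1, bias=False), # 64x64
        nn.Tanh(),
    )
```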
DCGAN Approach
Generating Natural Images DCGAN Approach
• Generative image models: 1. Replace any pooling layers with strided
convolutions (discriminator) and
parametric and non- parametric. fractional-strided convolutions
(generator).
• The non-parametric models often
do matching from a database of 2. Use batch normalization in both G and D
existing images, often matching 3. Remove fully connected hidden layers for
patches of images ex. texture deeper architectures.
synthesis, super-resolution and in- 4. Use ReLU activation in generator for all
layers except for the output, which uses
painting. Tanh.
• Parametric models has been 5. Use LeakyReLU activation in the
explored extensively , had not much discriminator for all layers.
success till now.
DCGAN Architecture

DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution Z is
projected to a small spatial extent convolutional representation with many feature maps. A series of
four fractionally-strided convolutions (in some recent papers, these are wrongly called
deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no
fully connected or pooling layers are used.
DCGAN Results – MNIST Dataset

Side-by-side illustration of (from left-to-right) the MNIST dataset,


generations from a baseline GAN, and generations from our DCGAN
DCGAN Results – LSUN Bedrooms Dataset

Generated bedrooms after one training pass through the dataset. Theoretically, the model


could learn to memorize training examples, but this is experimentally unlikely as we train
with a small learning rate and minibatch SGD. We are aware of no prior empirical
evidence demonstrating memorization with SGD and a small learning rate.
Generated bedrooms after five epochs of training. There appears to be evidence of visual under-
fitting via repeated noise textures across multiple samples such as the base boards of
some of the beds.
Results – Vector Arithmetic

Vector arithmetic for visual concepts. For each column, the Z vectors of samples are
averaged. Arithmetic was then performed on the mean vectors creating a new vector Y .
The center sample on the right hand side is produced by feeding Y as input to the
generator. To demonstrate the interpolation capabilities of the generator, uniform noise
sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying
arithmetic in the input space (bottom two examples) results in noisy overlap due to
misalignment.
Why Conditional Generation is Important?
Key Issues:
1. It is challenging to scale deep neural networks to accommodate an extremely
large number of predicted output categories.
2. Much of the work to date has focused on learning one-to-one mappings from
input to output. However, many interesting problems are more naturally
thought of as a probabilistic one-to-many mapping.
Proposed Solutions:
1. One way to help address the first issue is to leverage additional information
from other modalities : for instance, by using natural language corpora to learn
a vector representation for labels in which geometric relations are semantically
meaningful.
2. One way to address the second problem is to use “a conditional probabilistic
generative model”, the input is taken to be the conditioning variable and the
one-to-many mapping is instantiated as a conditional predictive distribution.
CGAN Introduction
▪ Generative adversarial nets can be extended to a
conditional model if both the generator G and
discriminator D are conditioned on some extra information
y. In this case y could be any kind of auxiliary information,
such as class labels or data from other modalities. We can
perform the conditioning by feeding y into the both the
discriminator and generator as additional input layer.
▪ In the generator the prior input noise pz(z), and y are
combined in joint hidden representation, and the
adversarial training framework allows for considerable
flexibility in how this hidden representation is composed.

▪ MinMax Objective Function for CGAN:
  min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x|y)] + E_{z~p_z(z)}[log(1 − D(G(z|y)))]

Conditional GAN Architecture
Basic CGAN for Convolutional
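A minimal sketch of how the conditioning can be wired up in practice: y (here a one-hot label, an illustrative assumption) is concatenated with z in the generator and with x in the discriminator, matching the objective above.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Noise z and a one-hot label y are concatenated and mapped to a sample,
    so G models p(x | y)."""
    def __init__(self, z_dim=100, n_classes=10, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, x_dim), nn.Tanh(),
        )

    def forward(self, z, y_onehot):
        return self.net(torch.cat([z, y_onehot], dim=1))

class ConditionalDiscriminator(nn.Module):
    """The discriminator also receives y, scoring whether (x, y) is a real pair."""
    def __init__(self, n_classes=10, x_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + n_classes, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, x, y_onehot):
        return self.net(torch.cat([x, y_onehot], dim=1))
```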
CGAN Results
• An extended CGAN model on the face
image dataset, now using both the image
data x and the face attribute data y.
• In the experiments, particular subset of the
original face attributes being used and
attributes which do not have clear visual
effects in the cropped images in our dataset
were eliminated.
• In each row we begin with a conditional
data value y sampled from the empirical
distribution py. In each column we apply
random shifts to that value y, and sample
an output image. The shifts cause
noticeable differences in facial features,
though it is evident that the slightly shifted
values of y still lead to similar-looking faces.
• The outlined face images at the far right
show the nearest neighbor in the training
data to the sampled image in the second-
to-last column; this demonstrates that the
generator has not learned to simply overfit
the training data.
CGAN Results
• Results of constant additive shifts along particular axes of the conditional data y.

REF: Jon Gauthier, Conditional generative adversarial nets for convolutional face generation
Advanced Design for AI
Algorithms
Lec.: 2
Key Question in Evaluating GANs
▪ Evaluating GANs is very tricky
• Different metrics can lead to different trade-offs
• Different evaluations favor different models
▪ Key question: What is the task that you care about?
▪ Density Estimation
▪ Sampling/Generation
▪ Latent Representation Learning
▪ More than one task? E.g., Semisupervised learning, image translation, etc.
▪ Evaluation drives progress, but how do we evaluate generative models?
▪ An evaluation based on samples is biased towards models which overfit and is therefore a poor
indicator of a good density model in a log-likelihood sense, which favors models with large
entropy. Conversely, a high likelihood does not guarantee visually pleasing samples. Samples
can take on arbitrary form only a few bits from the optimum.
Problems in Evaluating GAN Samples
▪ Let’s take evaluation of generated image quality as an example.
▪ Visual quality of images is highly subjective; there is no single definitive way to formalize it.
▪ We need a more or less reliable function to evaluate GAN-generated images.
▪ Q: How to formulate such an evaluation function systematically?
▪ A: Apply statistics or information theory for saliency and diversity.

▪ Saliency vs Diversity
▪ Saliency (what is the object): the distribution of classes for any individual image should have
low entropy. One can think of it as a single high score and the rest very low.
▪ Diversity (how different from other objects): the overall distribution of classes across the
sampled data should have high entropy, which would mean the absence of dominating
classes and something closer to a well-balanced training set.
GAN Evaluation – Quality of Generation
▪ Which set “look” better?
▪ Human inspection is expensive,
biased, hard to reproduce.
▪ Generalization is hard to define and
assess: memorizing the training set
would give excellent samples but
clearly undesirable
▪ Quantitative evaluation of a
qualitative task can have many
answers
▪ Popular Metrics: Inception Scores
(IS), Frechet Inception Distance (FID),
Kernel Inception Distance (KID)
Common GAN Evaluation Techniques
▪ IS: Inception Score
▪ First method to evaluate quality of GANs generated samples.
▪ Higher IS score is better, corresponding to higher KL divergence between the two distributions.
▪ Higher IS values mean better image quality and diversity
▪ Based on evaluating generator capabilities to generate images with:
▪ meaningful objects: conditional label distribution 𝑝(𝑦|𝑥) has low entropy.
▪ diverse images: marginal label distribution 𝑝(𝑦) = ∫ 𝑝(𝑦|𝑥) 𝑝𝑔(𝑥) 𝑑𝑥 has high entropy.

▪ FID: Fréchet Inception Distance


▪ FID improves IS by actually comparing the statistics of generated samples to real samples,
instead of only evaluating generated samples in vacuum.
▪ Lower FID values mean better image quality and diversity.
IS vs FID Definitions
▪ IS is an intuitive formula based on the KL divergence. The KL divergence measures the information
lost when an approximation is used in place of the true empirical distribution.

▪ Note: Implementation is straightforward and shown in sample codes.

• FID uses inception network to extract features from an intermediate layer. Then it models the
data distribution of those features using a multivariate Gaussian distribution with mean µ and
covariance Σ.

▪ Note: FID is more robust to noise and more sensitive to mode collapse than IS.
Inception Score Explained
▪ Assumption 1: We evaluate sample quality for GANs trained on labelled datasets
▪ Assumption 2: We have a good probabilistic classifier 𝑐(𝑦|𝑥) for predicting the label 𝑦 for any 𝑥
▪ We want samples from a good generative model to satisfy two criteria: sharpness and diversity

▪ Sharpness (𝑺):
  S = exp( E_{x~p_g} [ ∫ c(y|x) log c(y|x) dy ] )
▪ High sharpness implies classifier is confident in making predictions for generated images
▪ That is, classifier’s predictive distribution 𝑐(𝑦|𝑥) has low entropy
Inception Score Explained
▪ Diversity (𝑫):
  D = exp( −E_{x~p_g} [ ∫ c(y|x) log c(y) dy ] )
▪ where 𝑐(𝑦) = 𝐸𝑥~𝑝 [𝑐(𝑦|𝑥)] is the classifier’s marginal predictive distribution. High diversity
implies c(y) has high entropy.
▪ Inception scores (IS) combine the two criteria of sharpness and diversity into a simple metric:

𝐼𝑛𝑐𝑒𝑝𝑡𝑖𝑜𝑛 𝑆𝑐𝑜𝑟𝑒 = 𝐷 × 𝑆
▪ Correlates well with human judgement in practice. If a task-specific classifier is not available, a
classifier trained on a large dataset is used, e.g., Inception Net trained on the ImageNet dataset.
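A small sketch of how IS can be computed from classifier outputs, assuming `probs` holds softmax predictions of a pretrained classifier (e.g., Inception Net) on generated samples; it uses the equivalent formulation IS = exp( E_x KL( c(y|x) || c(y) ) ).

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Compute IS from an array of classifier predictive distributions,
    shape (num_samples, num_classes)."""
    p_y = probs.mean(axis=0, keepdims=True)                   # marginal c(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))    # per-sample KL terms
    return float(np.exp(kl.sum(axis=1).mean()))
```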
Frechet Inception Distance Explained
▪ Inception Scores only require samples from 𝑝𝜃 and do not take into account the desired data
distribution 𝑝𝑑𝑎𝑡𝑎 directly (only implicitly via a classifier).
▪ FID measures similarities in the feature representations (e.g., those learned by a pretrained
classifier) for data points sampled from 𝑝𝜃 and the test dataset
▪ Computing FID:
1. Let 𝐺 denote the generated samples and 𝑇 denote the test dataset
2. Compute feature representations 𝐹𝐺 and 𝐹𝑇 for 𝐺 and 𝑇 respectively (e.g., prefinal layer of Inception Net)
3. Fit a multivariate Gaussian to each of 𝐹𝐺 and 𝐹𝑇 . Let 𝜇, Σ denote mean and covariance of the Gaussian
▪ FID Is Defined As:
  FID = ‖𝜇_T − 𝜇_G‖² + Tr( Σ_T + Σ_G − 2(Σ_T Σ_G)^{1/2} )
▪ Lower FID implies better sample quality
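A minimal NumPy/SciPy sketch of the FID computation from the three steps above, assuming `feats_real` and `feats_gen` are pre-extracted feature matrices (hypothetical names).

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """FID between two sets of feature vectors (e.g., Inception pre-final layer),
    each of shape (num_samples, feature_dim)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                              # drop tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```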


GILBO: Generative Information Lower Bound
▪ GILBO: Simple, tractable lower bound of mutual information contained in the joint generative
density of any latent variable generative model. A symmetric, non-negative, reparameterization
independent measure of the information shared between two random variables is given by the
mutual information:

▪ GILBO offers data independent measure of the complexity of the learned latent variable
description, giving the log of the effective descriptive length. The GILBO is entirely independent of
true data, being purely a function of the generative joint distribution.
▪ GILBO gives different information than is currently available in FID as well as being able to
distinguish between GANs with same FID scores.
KID Kernel Inception Distance
▪ Maximum Mean Discrepancy (MMD) is a two-sample test statistic that compares samples from
two distributions p and q by computing differences in their moments (mean, variances etc.)
▪ Key idea: Use a suitable kernel, e.g., Gaussian, to measure similarity between points
  MMD²(p, q) = E_{x,x′~p}[K(x, x′)] + E_{y,y′~q}[K(y, y′)] − 2 E_{x~p, y~q}[K(x, y)]
▪ Intuitively, MMD is comparing the “similarity” between samples within 𝑝 and 𝑞 individually to the
samples from the mixture of 𝑝 and 𝑞
▪ Kernel Inception Distance (KID): compute the MMD in the feature space of a classifier (e.g.,
Inception Network)
▪ FID vs. KID
▪ FID is biased (can only be positive), KID is unbiased
▪ FID can be evaluated in O(n) time, KID evaluation requires O(n²) time
KID Kernel Inception Distance
▪ FID is meaningful even for non-ImageNet datasets, but its estimator is extremely biased with tiny variance.
▪ KID defined as MMD between inception hidden layer activations.
▪ Use the default polynomial kernel: k(x, y) = ( (1/d) xᵀy + 1 )³
▪ Unbiased estimator, reasonable with few samples.
▪ KID can be used for automatic learning rate adaptation, e.g., via a three-sample MMD test.
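A hedged sketch of a KID computation with the polynomial kernel above, assuming `feats_real`/`feats_gen` are Inception-layer feature matrices (hypothetical names); the unbiased within-set terms exclude the diagonal.

```python
import numpy as np

def polynomial_kernel(x, y, degree=3):
    """Default KID kernel: k(a, b) = (a.b / d + 1)^degree."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** degree

def kid(feats_real, feats_gen):
    """Unbiased squared-MMD estimate between real and generated feature sets."""
    m, n = len(feats_real), len(feats_gen)
    k_rr = polynomial_kernel(feats_real, feats_real)
    k_gg = polynomial_kernel(feats_gen, feats_gen)
    k_rg = polynomial_kernel(feats_real, feats_gen)
    # exclude the diagonal for the unbiased within-set terms
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_gg = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    return float(term_rr + term_gg - 2.0 * k_rg.mean())
```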
Best Practices on Evaluating Sample Quality
1. Spend time tuning your
baselines (architecture, learning
rate, optimizer etc.). Be amazed
(rather than dejected) at how
well they can perform
2. Use fixed random seeds for
reproducibility
3. Report results averaged over
multiple random seeds along
with confidence intervals
Selected Papers for Reading
1. A Note on Evaluation of Generative Models, ICLR 2016
2. Evaluation of Generative Networks Through Their Data Augmentation Capacity, ICLR 2018
3. A Note on Inception Score, ICLR 2018
4. An empirical study on evaluation metrics of generative adversarial networks, ICLR 2018
5. Pros and Cons of GAN Evaluation Measures, Arxiv 2018
6. GILBO: One Metric to Measure Them All, NIPS 2018
7. Geometry Score: A Method For Comparing Generative Adversarial Networks, ICML 2018
Advanced Design for AI
Algorithms
Applications of GAN for Images
Outline
1. Application of GANs
2. Image Generation
3. Conditional GANs
4. Unsupervised Conditional GANs
5. GANS and Reinforcement Learning
How GANs Work

▪ A generator G is a network. The network defines a probability distribution 𝑃𝐺

Normal distribution → 𝑧 → Generator Network G → 𝑥 = 𝐺(𝑧); compare 𝑃𝐺(𝑥) with 𝑃𝑑𝑎𝑡𝑎(𝑥) via a distance or divergence.
𝑥: an image (a high-dimensional vector)

▪ How to compute the divergence? 𝐺* = arg min_𝐺 div(𝑃𝐺, 𝑃𝑑𝑎𝑡𝑎)
Why Generative Models Important?
Advantages of Generative Models:
• Represent and manipulate high-dimensional probability distribution
• Can be incorporated into (inverse) reinforcement learning in several ways
• Can be trained with missing data on semi-supervised learning
• Can provide predictions on inputs that are missing data
• Enable machine learning to work with multi-modal outputs
• Many tasks intrinsically require realistic generation of samples from some distributions:
• Creating Art and Realistic Images
• Single Image Super-Resolution
• Image-To-Image Translation What I cannot create, I do not understand!

-- Richard Feynman
Why This Course is Important: GAN Zoo
GAN
ACGAN
BGAN
CGAN
DCGAN
EBGAN
fGAN
GoGAN
……

Applications of Generative Models
Computer Vision Speech Recognition Natural Language
Realistic Image Creation Generative Speech Enhancement Realistic Text Generation
Image-to-Image Translation Speech Driven Animation Text Translation
Synthetic Image Creation Lips Talking and Reading Text Corpora Generation
Image and Shape In-Painting Synthetic Audio/Voice Generative Machine Translation
Object/Image Reconstruction Voice Conversion Conditional Sequence Generation
Image Super-Resolution Voice Separation Neural Dialog Generation
Face Emotion and Aging Voice Impersonation Generative Conversation Responses
Video Frame Prediction Speech and Speaker Emotion Text Style Transfer
Video Deblurring Postfilter for Synthesized speech Abstractive Summarization
Many More …. Many More …. Many More ….

Source: The GAN Zoo and Really Awesome GAN


Outline
1. Application of GANs
2. Image Generation
3. Conditional GANs
4. Unsupervised Conditional GANs
5. GANS and Reinforcement Learning
GANs Generated Realistic Images
▪ Input: Face anime images
▪ Output: Realistic face anime images
GANs Generated Realistic Images
▪ Input: CelebA Dataset
▪ Output: Realistic celebrity looking faces

All of those faces are fake!


GANs Generated Realistic Images
▪ Input: ImageNet Dataset
▪ Output: Realistic and natural objects/things
GANs for Image Super Resolution
▪ Input: Low Resolutions ImageNet Dataset
▪ Output: Super Resolution Realistic and natural objects/things
Outline
1. Application of GANs
2. Image Generation
3. Conditional GANs
4. Unsupervised Conditional GANs
5. GANS and Reinforcement Learning
How Conditional GANs Work

▪ A generator G is a network. The network defines a probability distribution 𝑃𝐺


Normal distribution → 𝑧, condition c → Generator Network G → 𝑥 = 𝐺(𝑧, 𝑐); compare 𝑃𝐺(𝑥|𝑐) with 𝑃𝑑𝑎𝑡𝑎(𝑥|𝑐) via a distance or divergence.
𝑥: an image (a high-dimensional vector) per condition c

▪ E.g. “Girl with red hair and red eyes”:


[Mehdi Mirza, et al., arXiv, 2014]
Text to Image
▪ Traditional Supervised Approach: an NN maps a text condition (e.g., c1: “a dog is running”) to an
image, trained so that the output is as close as possible to the target image. Because the same
text (e.g., “train”) corresponds to many different target images, the NN output averages over
them: a blurry image!
Conditional GANs
▪ Conditional GANs Approach [Scott Reed, et al, ICML, 2016]

c: train
Image
𝐺 𝑥 = 𝐺(𝑐, 𝑧)
Normal distribution 𝑧

𝑥 is real image or not?


Real images: 1
𝑥 D scalar
(original) Generated images: 0

Generator will learn to generate realistic images ….


But it will completely ignore the input conditions.
Conditional GANs
▪ Conditional GANs Approach [Scott Reed, et al, ICML, 2016]

c: train
Image
𝐺 𝑥 = 𝐺(𝑐, 𝑧)
Normal distribution 𝑧

𝑥 is real image or not +


𝑐 and 𝑥 are match or not?

𝑥 True text-image pairs: (train, , ) 1


D scalar
(original)
𝑐 (cat , ) 0 (train, ) 0
Conditional GANs - Discriminator
𝑥 is real image or not +
object 𝑥 Network 𝑐 and 𝑥 are match or not?
Network score
condition 𝑐 Network (almost every paper)
[Augustus Odena et al., ICML, 2017]
[Takeru Miyato, et al., ICLR, 2018]
[Han Zhang, et al., arXiv, 2017]

object 𝑥 Network x is realistic or not

condition 𝑐 Network 𝑐 and 𝑥 are matched or not


Conditional GANs
▪ The images are generated by Yen-Hao Chen, Po-Chun Chien, Jun-Chen Xie, Tsung-Han Wu.
▪ Paired data: collecting anime faces and the description of their characteristics
  (e.g., “blue eyes”, “red hair”, “short hair”, “red hair, green eyes”, “blue hair, red eyes”).
Conditional GANs – Image to Image
▪ Image to image translation or pix2pix [Phillip Isola, et al., CVPR, 2017]

𝑐
G 𝑥 = 𝐺(𝑐, 𝑧)
𝑧
Conditional GANs – Image to Image
▪ Traditional Supervised Approach

Testing: It is blurry.

NN Image
as close as
possible
Input L1
e.g. L1
Conditional GANs – Image to Image
▪ Conditional GANs Approach [Phillip Isola, et al., CVPR, 2017]

L1

Image
G D scalar
𝑧

Testing:

input L1 GAN GAN + L1


Conditional GANs – Video Generator
▪ Conditional GANs Approach [Michael Mathieu, et al., arXiv, 2015]

Generator

Discriminator thinks it is real

Last frame is real or


Discriminator generated
Conditional GANs – Sound to Image
▪ Conditional GANs Approach for Sound to Image
"a dog barking sound"
Image
c: sound G

Training Data Collection

video
Conditional GANs – Audio to Image
▪ Conditional GANs Approach for Audio to Image
Louder
Conditional GANs – Image to Label
▪ Multi Labels Image Classifier
Input condition

Generated output
Conditional GANs – Image to Label
▪ The classifiers can have different
architectures.
▪ The classifiers are trained as
conditional GAN.
▪ The classifiers can have different
architectures.
▪ The classifiers are trained as
conditional GAN.
▪ Conditional GAN outperforms
other models designed for multi-
label.
[Tsai, et al., submitted to ICASSP 2019]
Domain Adversarial Training
▪ Training and testing data are in different domains
Take digit
classification as example
Generator feature

Training data:

The same distribution

Testing data:

feature
Generator
Domain Adversarial Training
feature extractor (Generator)
Always output
zero vectors

Domain Classifier Fails


image
Which domain?

Discriminator
blue points
(Domain classifier)

red points
Domain Adversarial Training
feature extractor (Generator) Label predictor

Which digits?

image
Which domain?

Not only cheat the domain classifier, but


satisfying label predictor at the same time
Discriminator
(Domain classifier)

Successfully applied on image classification [Ganin et al, ICML, 2015][Ajakan et al. JMLR, 2016 ]
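A minimal sketch of the gradient reversal trick that implements this training scheme in one backward pass; the lambda weighting and the usage lines are illustrative assumptions.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the feature extractor is updated to fool the domain
    classifier while the domain classifier itself is trained normally."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage sketch: features -> label predictor (normal path),
#               features -> gradient reversal -> domain classifier
# label_logits  = label_predictor(features)
# domain_logits = domain_classifier(grad_reverse(features, lam=0.1))
```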
Outline
1. Application of GANs
2. Image Generation
3. Conditional GANs
4. Unsupervised Conditional GANs
5. GANS and Reinforcement Learning
Unsupervised Conditional GAN

Condition G Generated Object


Object in Domain X Object in Domain Y

Transform an object from one domain to another without paired data

Domain X Domain Y
Not Paired

photos Vincent van Gogh’s paintings

Use image style transfer as example here


Unsupervised Conditional Generation
▪ Approach 1: Direct Transformation

For texture or color


𝐺𝑋→𝑌 ? change

Domain X Domain Y

▪ Approach 2: Projection to Common Space

𝐸𝑁𝑋 𝐷𝐸𝑌

Domain X Encoder of domain X Face Decoder of domain Y Domain Y


Attribute
Larger change, only keep the semantics
Domain X

Direct Transformation
Domain Y
Become similar
Domain X to domain Y

𝐺𝑋→𝑌 ?

𝐷𝑌 scalar

Input image
belongs to
domain Y or not
Domain Y
Domain X

Direct Transformation
Domain Y
Become similar
Domain X to domain Y

𝐺𝑋→𝑌 Not what we want!

ignore input
𝐷𝑌 scalar

Input image
belongs to
domain Y or not
Domain Y
Domain X

Direct Transformation
Domain Y
Become similar
Domain X to domain Y

𝐺𝑋→𝑌 Not what we want!

ignore input
𝐷𝑌 scalar
The issue can be avoided by network design.
Input image
Simpler generator makes the input and belongs to
output more closely related. domain Y or not
[Tomer Galanti, et al. ICLR, 2018]
Domain X

Direct Transformation
Domain Y
Become similar
Domain X to domain Y

𝐺𝑋→𝑌

𝐷𝑌 scalar
Encoder Encoder
pre-trained
Network Network Input image
as close as belongs to
possible domain Y or not

Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]


Direct Transformation – Cycle GAN
as close as possible
Cycle consistency

𝐺𝑋→𝑌 𝐺Y→X

Lack of information
for reconstruction
𝐷𝑌 scalar

Input image
belongs to
domain Y or not
Domain Y [Jun-Yan Zhu, et al., ICCV, 2017]
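A minimal sketch of the generator-side CycleGAN objective: least-squares adversarial terms in both directions plus the L1 cycle-consistency term, matching the two-sided picture on the next slide; `G_xy`, `G_yx`, `D_x`, `D_y` are assumed generator/discriminator modules (hypothetical names).

```python
import torch

def cycle_gan_generator_loss(G_xy, G_yx, D_y, D_x, real_x, real_y, lam=10.0):
    """Adversarial terms in both directions plus the L1 cycle-consistency term
    |G_yx(G_xy(x)) - x| + |G_xy(G_yx(y)) - y|."""
    fake_y = G_xy(real_x)
    fake_x = G_yx(real_y)
    # least-squares adversarial losses for the generators
    adv = ((D_y(fake_y) - 1) ** 2).mean() + ((D_x(fake_x) - 1) ** 2).mean()
    # cycle consistency: translating there and back should reconstruct the input
    cyc = (G_yx(fake_y) - real_x).abs().mean() + (G_xy(fake_x) - real_y).abs().mean()
    return adv + lam * cyc
```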
Direct Transformation – Cycle GAN
as close as possible

𝐺𝑋→𝑌 𝐺Y→X

scalar: belongs to 𝐷𝑌 scalar: belongs to


domain X or not 𝐷𝑋 domain Y or not

𝐺Y→X 𝐺𝑋→𝑌

as close as possible
For multiple domains,
considering starGAN
Disco GAN
[Yunjey Choi, arXiv, 2017]

[Taeksoo Kim, et
al., ICML, 2017]
Dual GAN

[Zili Yi, et al., ICCV, 2017]

Cycle GAN

[Jun-Yan Zhu, et al., ICCV, 2017]


Issue of Cycle Consistency
• CycleGAN: a Master of Steganography
[Casey Chu, et al., NIPS workshop, 2017]

𝐺𝑋→𝑌 𝐺Y→X

The information is hidden.


Unsupervised Conditional Generation
▪ Approach 1: Direct Transformation
For texture or color
𝐺𝑋→𝑌 ? change

Domain X Domain Y

▪ Approach 2: Projection to Common Space

𝐸𝑁𝑋 𝐷𝐸𝑌

Domain X Encoder of domain X Face Decoder of domain Y Domain Y


Attribute
Larger change, only keep the semantics
Projection to Common Space
Target

image 𝐸𝑁𝑋 𝐷𝐸𝑋 image

image 𝐸𝑁𝑌 Face 𝐷𝐸𝑌 image

Attribute

Domain X Domain Y
Projection to Common Space
Training
Minimizing reconstruction error

image 𝐸𝑁𝑋 𝐷𝐸𝑋 image

image 𝐸𝑁𝑌 𝐷𝐸𝑌 image

Domain X Domain Y
Projection to Common Space
Training
Minimizing reconstruction error Discriminator
of X domain
image 𝐸𝑁𝑋 𝐷𝐸𝑋 image 𝐷𝑋

image 𝐸𝑁𝑌 𝐷𝐸𝑌 image 𝐷𝑌


Discriminator
Minimizing reconstruction error
of Y domain

Because we train two auto-encoders separately …


The images with the same attribute may not project
to the same position in the latent space.
Projection to Common Space
Training

𝐸𝑁𝑋 𝐷𝐸𝑋

𝐸𝑁𝑌 𝐷𝐸𝑌

Sharing the parameters of encoders and decoders


Couple GAN[Ming-Yu Liu, et al., NIPS, 2016]
UNIT[Ming-Yu Liu, et al., NIPS, 2017]
Projection to Common Space
Training
Minimizing reconstruction error Discriminator
of X domain
image 𝐸𝑁𝑋 𝐷𝐸𝑋 image 𝐷𝑋

image 𝐸𝑁𝑌 𝐷𝐸𝑌 image 𝐷𝑌


Discriminator
of Y domain
Domain
𝐸𝑁𝑋 and 𝐸𝑁𝑌 fool the From 𝐸𝑁𝑋 or 𝐸𝑁𝑌
Discriminator
domain discriminator
The domain discriminator forces the output of 𝐸𝑁𝑋 and
𝐸𝑁𝑌 have the same distribution. [Guillaume Lample, et al., NIPS, 2017]
Projection to Common Space
Training
Minimizing reconstruction error Discriminator
of X domain
image 𝐸𝑁𝑋 𝐷𝐸𝑋 image 𝐷𝑋

image 𝐸𝑁𝑌 𝐷𝐸𝑌 image 𝐷𝑌


Discriminator
of Y domain

Cycle Consistency:
Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017]
Projection to Common Space
Training
To the same Discriminator
latent space of X domain
image 𝐸𝑁𝑋 𝐷𝐸𝑋 image 𝐷𝑋

image 𝐸𝑁𝑌 𝐷𝐸𝑌 image 𝐷𝑌


Discriminator
of Y domain

Semantic Consistency:
Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and
XGAN [Amélie Royer, et al., arXiv, 2017]
Outline
1. Application of GANs
2. Image Generation
3. Conditional GANs
4. Unsupervised Conditional GANs
5. GANS and Reinforcement Learning
Basic Components
You cannot control

Reward
Actor Env Function

Video Game: get 20 scores when killing a monster
Go: the rule of GO
Neural network as Actor
• Input of neural network: the observation of machine represented as a vector or a matrix
• Output neural network : each action corresponds to a neuron in output layer

Take the action based


NN as actor on the probability.

Input: pixels → NN → score of each action, e.g., left 0.7, right 0.2, fire 0.1
Actor, Environment, Reward
𝑠1 𝑎1 𝑠2 𝑎2

Env Actor Env Actor Env ……

𝑠1 𝑎1 𝑠2 𝑎2 𝑠3

“right” “fire”

Trajectory
𝜏 = 𝑠1 , 𝑎1 , 𝑠2 , 𝑎2 , ⋯ , 𝑠𝑇 , 𝑎 𝑇
Reinforcement Learning v.s. GAN
updated updated
𝑠1 𝑎1 𝑠2 𝑎2

Env Actor Env Actor Env


……

𝑠1 𝑎1 𝑠2 𝑎2 𝑠3

Rewards 𝑟1, 𝑟2, …; total reward 𝑅(𝜏) = Σ_{𝑡=1..𝑇} 𝑟𝑡
The environment and reward are a “black box”: you cannot use backpropagation.
Actor → Generator
Fixed Reward Function → Discriminator
Imitation Learning
𝑠1 𝑎1 𝑠2 𝑎2

Env Actor Env Actor Env


……

𝑠1 𝑎1 𝑠2 𝑎2 𝑠3

reward function is not available

We have demonstration of the expert.


Self driving: record human drivers
Robot: grab the arm of robot
Expert demonstrations 𝜏̂1, 𝜏̂2, ⋯, 𝜏̂N; each 𝜏̂ is a trajectory of the expert.
Inverse Reinforcement Learning
Expert demonstrations 𝜏̂1, 𝜏̂2, ⋯, 𝜏̂N + Environment → Inverse Reinforcement Learning → Reward Function → Reinforcement Learning → Optimal Actor

➢ Using the reward function to find the optimal actor.

➢ Modeling reward can be easier. Simple reward function can lead to


complex policy.
Framework of IRL
The expert is always the best: Σ_{𝑛=1..𝑁} 𝑅(𝜏̂𝑛) > Σ_{𝑛=1..𝑁} 𝑅(𝜏𝑛)

Expert 𝜋̂ provides 𝜏̂1, 𝜏̂2, ⋯, 𝜏̂N; the actor 𝜋 provides 𝜏1, 𝜏2, ⋯, 𝜏N.
Obtain a Reward Function R under which the expert trajectories score higher.
Find an actor based on reward function R by reinforcement learning.
Actor → Generator; Reward function → Discriminator (high score for real, low score for generated).
GAN: D scores samples; find a G whose output obtains a large score from D.
IRL: the Reward Function gives larger reward to expert trajectories 𝜏̂1, ⋯, 𝜏̂N and lower reward to actor trajectories 𝜏1, ⋯, 𝜏N; find an Actor that obtains large reward.
Welcome to Data Science Center UI
What You Will Learn Today
▪ GANs Applications for Vision
▪ GANs Applications for NLP
▪ GANs Application for Speech
▪ Reproduction of Some Models

What You Will Practice Today


▪ Realistic Image Generation
▪ Image to Image Translation
▪ Password Cracking
▪ Text to Speech
Advanced Design for AI
Algorithms
Applications of GAN for Speech
Outline
1. Speech Enhancement
2. Postfilter, speech synthesis, voice conversion
3. Speech Signal Recognition
4. Conclusion
Speech Signal Generation
Paired

▪ Regression Task:
Objective function

G Output

▪ GAN Models:
Speech, Speaker, Emotion Recognitions
& Lip Reading Recognitions
▪ Classification Task: GAN Model
𝒚

Output
label

G ℎ(∙)

𝒛̃ = 𝑔(𝒙̃)
Emb.
Acoustic Mismatch
E 𝑔(∙)


𝒙 ෭
𝒙 ෥
𝒙
Channel Accented Noisy
Distortion Speech Data

𝒙
Speech Enhancement
▪ Speech Enhancement using GAN [Pascual et al., Interspeech 2017]

z
Speech Enhancement
Enhancing
▪ Neural network models for spectral mapping
▪ Model structures of G: DNN [Wang et al. NIPS
2012; Xu et al., SPL 2014], DDAE
[Lu et al., Interspeech 2013], RNN (LSTM) [Chen
et al., Interspeech 2015;
Weninger et al., LVA/ICA 2015], CNN [Fu et al., Objective function
Interspeech 2016].
• Typical objective function Output
▪ Mean square error (MSE) [Xu et al., TASLP 2015],
L1 [Pascual et al., Interspeech 2017], likelihood
[Chai et al., MLSP 2017], STOI [Fu et al., TASLP G
2018].
▪ GAN is used as a new objective function to
estimate the parameters in G.
Speech Enhancement (SEGAN)
▪ Experimental Result
Objective Evaluation Results Subjective Evaluation Results

Preference Test Results

SEGAN yields better speech enhancement results than Noisy and Wiener.
Speech Enhancement
▪ Pix2Pix [Michelsanti et al., Interpsech 2017]

Noisy Output Clean

Output
Clean

D Scalar
(Fake/Real)

Noisy Noisy
Speech Enhancement (Pix2Pix)
▪ Spectrogram comparison of Pix2Pix with baseline methods.

Noisy Clean NG-Pix2Pix

NG-DNN STAT-MMSE
Pix2Pix outperforms STAT-MMSE and is competitive to DNN SE.
Speech Enhancement (Pix2Pix)
▪ Objective evaluation and speaker verification test

From the PESQ and STOI evaluations, Pix2Pix outperforms Noisy and
MMSE and is competitive to DNN SE.
Speech Enhancement
▪ Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018]

Noisy Output Clean

Output
Clean

D Scalar
(Fake/Real)

Noisy Noisy
Speech Enhancement (FSEGAN)
▪ FSEGAN ASR Results. Spectrogram comparison of FSEGAN with L1-trained
method.

FSEGAN reduces both additive noise and reverberant smearing.


Speech Enhancement (FSEGAN)
▪ FSEGAN ASR Results
WER (%) of SEGAN and FSEGAN. WER (%) of FSEGAN with retrain.

1. FSEGAN improves recognition results for ASR-Clean.


2. FSEGAN outperforms SEGAN as front-ends.
3. Hybrid Retraining FSEGAN outperforms Baseline
Speech Enhancement
▪ Speech enhancement through a mask function

Noisy Output mask Enhanced

Point-wise multiplication
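A tiny sketch of mask-based enhancement, assuming a hypothetical `mask_net` that maps the noisy magnitude spectrogram to a tensor of the same shape; the enhanced spectrogram is the point-wise product of the predicted mask and the noisy input.

```python
import torch

def apply_spectral_mask(noisy_mag, mask_net):
    """Mask-based enhancement: a network predicts a [0, 1] mask over the noisy
    magnitude spectrogram; the enhanced spectrogram is the point-wise product."""
    mask = torch.sigmoid(mask_net(noisy_mag))    # bound the mask to [0, 1]
    return mask * noisy_mag                       # point-wise multiplication

# usage sketch (hypothetical shapes): noisy_mag is (batch, freq_bins, frames);
# mask_net is any network mapping that tensor to the same shape.
```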
Speech Enhancement
• GAN for spectral magnitude mask estimation (MMS-GAN) [Pandey, et.al, ICASSP 2018]
Noisy Output mask Ref. mask

Output Ref.
mask mask
D Scalar
(Fake/Real)
Noisy Noisy
Speech Enhancement (MMS-GAN)
• Speech enhancement results of MMS estimated by MSE (DNN) and by GAN (GAN) and for 2
unseen noises at 3 different SNR conditions.

▪ GAN yields better performance with random noise, Z, than without Z.


▪ GAN with Z outperforms DNN in terms of STOI scores, but no clear difference for PESQ scores
Speech Enhancement
▪ Adversarial training based mask estimation (ATME) [Higuchi et al., ASRU 2017]
True or Fake True or Fake

𝑫𝑺 𝑫𝑵

or or
True speech True noise
Estimated Estimated
speech noise
Noise data
Clean speech data
𝑮𝑴𝒂𝒔𝒌
𝑉_Mask = 𝐸_{𝒔_fake} log(1 − 𝑫_𝑺(𝒔_fake, 𝜃)) + 𝐸_{𝒏_fake} log(1 − 𝑫_𝑵(𝒏_fake, 𝜃))
Noisy speech / Noisy data
Speech Enhancement (ATME)
▪ Spectrogram comparison of (a) noisy; (b) MMSE with supervision; (c) ATME without supervision.

𝑛
Speech mask Noise mask 𝑀𝑓,𝑡

The proposed adversarial training mask estimation can capture


speech/noise signals without supervised data.
Speech Enhancement (ATME)
Mask-based beamformer for robust ASR
▪ The estimated mask parameters are used to compute spatial covariance
matrix for MVDR beamformer.
▪ 𝑠̂_{𝑓,𝑡} = 𝐰_𝑓^H 𝐲_{𝑓,𝑡}, where 𝑠̂_{𝑓,𝑡} is the enhanced signal, 𝐲_{𝑓,𝑡} denotes the observation of M
microphones, 𝑓 and 𝑡 are frequency and time indices, and 𝐰_𝑓 denotes the beamformer coefficient.
▪ The MVDR solves 𝐰_𝑓 by: 𝐰_𝑓 = (𝑅_𝑓^{𝑠+𝑛})^{−1} 𝐡_𝑓 / ( 𝐡_𝑓^H (𝑅_𝑓^{𝑠+𝑛})^{−1} 𝐡_𝑓 )
▪ To estimate 𝐡_𝑓, the spatial covariance matrix of the target signal, 𝑅_𝑓^𝑠, is computed by
𝑅_𝑓^𝑠 = 𝑅_𝑓^{𝑠+𝑛} − 𝑅_𝑓^𝑛, where 𝑅_𝑓^𝑛 = Σ_𝑡 𝑀_{𝑓,𝑡}^𝑛 𝐲_{𝑓,𝑡} 𝐲_{𝑓,𝑡}^H / Σ_𝑡 𝑀_{𝑓,𝑡}^𝑛, and 𝑀_{𝑓,𝑡}^𝑛 was computed by AT.
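A small NumPy sketch of these beamforming formulas, with the mask-weighted spatial covariance and the MVDR weights as defined above; the variable names and the steering-vector note are assumptions, not part of the cited system.

```python
import numpy as np

def masked_covariance(Y, mask):
    """Spatial covariance from mask-weighted observations for one frequency bin.
    Y: (mics, frames) complex STFT; mask: (frames,) mask values in [0, 1]."""
    weighted = Y * mask[None, :]
    return (weighted @ Y.conj().T) / (mask.sum() + 1e-8)

def mvdr_weights(R_noisy, h):
    """MVDR coefficients for one frequency bin: w = R^{-1} h / (h^H R^{-1} h),
    with R the (speech + noise) spatial covariance and h the steering vector."""
    r_inv_h = np.linalg.solve(R_noisy, h)
    return r_inv_h / (h.conj().T @ r_inv_h)

# sketch: R_noise from the noise mask, R_speech = R_total - R_noise,
# steering vector h e.g. from the principal eigenvector of R_speech,
# enhanced frame s_hat = w^H y.
```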
Speech Enhancement (ATME)
▪ WERs (%) for the development and evaluation sets.

1. ATME provides significant improvements over Unprocessed.


2. Unsupervised ATME slightly underperforms supervised MMSE.
Speech Enhancement (AFT)
• Cycle-GAN-based acoustic feature transformation (AFT) [Mimura et al., ASRU 2017]
as close as possible
Clean Syn. Noisy Clean

𝐺𝑆→𝑇 𝐺𝑇→𝑆

Scalar: belongs to Scalar: belongs to


𝐷𝑆 𝐷𝑇 domain T or not
domain S or not

𝐺𝑇→𝑆 𝐺𝑆→𝑇
Noisy Enhanced Noisy
as close as possible

𝑉_Full = 𝑉_GAN(𝐺_{𝑋→𝑌}, 𝐷_𝑌) + 𝑉_GAN(𝐺_{𝑌→𝑋}, 𝐷_𝑋) + 𝜆 𝑉_Cyc(𝐺_{𝑋→𝑌}, 𝐺_{𝑌→𝑋})


Speech Enhancement (AFT)
▪ ASR results on noise robustness and style adaptation
Noise robust ASR Speaker style adaptation.

S: Clean; 𝑇: Noisy JNAS: Read; CSJ-SPS: Spontaneous (relax);


CSJ-APS: Spontaneous (formal);

1. 𝐺𝑇→𝑆 can transform acoustic features and effectively improve


ASR results for both noisy and accented speech.
2. 𝐺𝑆→𝑇 can be used for model adaptation and effectively improve
ASR results for noisy speech.
Speech Enhancement
▪ Noise Adaptive Speech Enhancement (NA-SE)
[Chien-Feng Liao, Yu Tsao, Hung-Yi Lee, Hsin-Min Wang., arXiv 2018]

N5 N4 N5
N11 N12 N7 N10 N9
Unseen
𝑉𝑦

E G

𝜃_𝐺 ← 𝜃_𝐺 − ϵ ∂𝑉_𝑦/∂𝜃_𝐺 (min reconstruction error);   𝜃_𝐸 ← 𝜃_𝐸 − ϵ ∂𝑉_𝑦/∂𝜃_𝐸 (min reconstruction error)
Speech Enhancement (NA-SE)
▪ Domain adversarial training for NA-SE
N5 N4 N5
N11 N12 N7 N10 N9
Unseen
𝑉𝑦

E G

𝒛
D
Output 2
Speaker
𝜃_𝐺 ← 𝜃_𝐺 − ϵ ∂𝑉_𝑦/∂𝜃_𝐺 (min reconstruction error)
𝜃_𝐷 ← 𝜃_𝐷 − ϵ ∂𝑉_𝑧/∂𝜃_𝐷 (max domain accuracy)
𝜃_𝐸 ← 𝜃_𝐸 − ϵ (∂𝑉_𝑦/∂𝜃_𝐸 + 𝛼 ∂𝑉_𝑧/∂𝜃_𝐸) (min reconstruction error and min domain accuracy)
Speech Enhancement (NA-SE)
▪ Objective evaluations
PESQ at different SNR levels.

The DAT-based unsupervised adaptation can notably overcome the


mismatch issue of training and testing noise types.
Outline
1. Speech Enhancement
2. Postfilter, speech synthesis, voice conversion
3. Speech Signal Recognition
4. Conclusion
Postfilter
▪ Postfilter for synthesized or transformed speech Natural
spectral texture

Speech
synthesizer

Synthesized spectral Objective function


Voice texture
conversion

G Output
Speech
enhancement

▪ Conventional postfilter approaches for G estimation include global variance (GV) [Toda et al., IEICE 2007], variance scaling (VS)
[Sil’en et al., Interpseech 2012], modulation spectrum (MS) [Takamichi et al., ICASSP 2014],DNN with MSE criterion [Chen et
al., Interspeech 2014; Chen et al., TASLP 2015].
▪ GAN is used as a new objective function to estimate the parameters in G.
Postfilter
▪ GAN postfilter [Kaneko et al., ICASSP 2017]
Natural
Mel cepst. coef.

Synthesized Generated
Mel cepst. coef. Mel cepst. coef.
Nature
or
G D Generated

▪ Traditional MMSE criterion results in statistical averaging.


▪ GAN is used as a new objective function to estimate the parameters in G.
▪ The proposed work intends to further improve the naturalness of
synthesized speech or parameters from a synthesizer.
Postfilter (GAN-based Postfilter)
▪ Spectrograms of: (a) NAT (nature); (b) SYN (synthesized); (c) VS (variance scaling); (d) MS
(modulation spectrum); (e) MSE; (f) GAN postfilters.

GAN postfilter reconstructs spectral texture similar to the natural one.


Postfilter (GAN-based Postfilter)
▪ Objective evaluations

Mel-cepstral trajectories (GANv: GAN was applied in Averaging difference in modulation spectrum
voiced part). per Mel-cepstral coefficient.

GAN postfilter reconstructs spectral texture similar to the natural one.


Postfilter (GAN-based Postfilter)
▪ Subjective evaluations

Preference score (%). Bold font indicates the numbers over 30%.

1. GAN postfilter significantly improves the synthesized speech.


2. GAN postfilter is effective particularly in voiced segments.
3. GANv outperforms GAN and is comparable to NAT.
Postfilter (GAN-postfilter-SFTF)
▪ GAN post-filter for STFT spectrograms [Kaneko et al., Interspeech 2017]

1. GAN postfilter was applied on high-dimensional STFT spectrograms.


2. The spectrogram was partitioned into N bands (each band overlaps its neighboring bands).
3. The GAN-based postfilter was trained for each band.
4. The reconstructed spectrogram from each band was smoothly connected.
Postfilter (GAN-postfilter-SFTF)
▪ Spectrograms of: (1) SYN, (2) GAN, (3) Original (NAT)

GAN postfilter reconstructs spectral texture similar to the natural one.


Speech Synthesis
▪ Speech synthesis with anti-spoofing verification (ASV) [Saito et al., ICASSP 2017]
▪ Input: linguistic features; Output: speech parameters
Speech Synthesis (ASV)
▪ Objective and subjective evaluations

Averaged GVs of MCCs. Scores of speech quality.

1. The proposed algorithm generates MCCs similar to the natural ones.


2. The proposed algorithm outperforms conventional MGE training.
Speech Synthesis
▪ Speech synthesis with GAN (SS-GAN) [Saito et al., TASLP 2018]
Speech Synthesis (SS-GAN)
▪ Subjective Evaluation
Scores of speech quality (sp). Scores of speech quality (sp and F0).

The proposed algorithm works for both spectral parameters and F0.

.
Speech Synthesis
▪ Speech synthesis with GAN glottal waveform model (GlottGAN) [Bollepalli et al., Interspeech 2017]

Glottal waveform Glottal waveform

Generated Natural
Acoustic
G speech speech
features
parameters parameters

Gen.
𝑫
Nature
Speech Synthesis (GlottGAN)
▪ Objective Evaluation
Glottal pulses generated by GANs.

G, D: DNN

G, D: conditional DNN

G, D: Deep CNN

G, D: Deep CNN + LS loss

The proposed GAN-based approach can generate glottal waveforms similar to the natural ones.
Speech Synthesis
▪ Speech synthesis with GAN & multi-task learning (SS-GAN-MTL) [Yang et al., ASRU 2017]
Speech Synthesis (SS-GAN-MTL)
• Speech synthesis with GAN & multi-task learning (SS-GAN-MTL) [Yang et al., ASRU 2017]
Speech Synthesis (SS-GAN-MTL)
▪ Objective and subjective evaluations

Objective evaluation results. The preference score (%).

1. From objective evaluations, no remarkable difference is observed.


2. From subjective evaluations, GAN outperforms BLSTM and ASV, while GAN-PC
underperforms GAN.
Voice Conversion
▪ Convert (transform) speech from source to target
Target
speaker

Source
speaker Objective function

G Output

Conventional VC approaches include Gaussian mixture model (GMM) [Toda et al., TASLP 2007], non-negative matrix
factorization (NMF) [Wu et al., TASLP 2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al., Interspeech
2016], restricted Boltzmann machine (RBM) [Chen et al., TASLP 2014], feed forward NN [Desai et al., TASLP 2010],
recurrent NN (RNN) [Nakashika et al., Interspeech 2014].
Voice Conversion
Target
▪ VAW-GAN [Hsu et al., Interspeech 2017] speaker

Source
speaker
Real
G or
D
Fake

▪ Conventional MMSE approaches often encounter the “over-smoothing” issue.


▪ GAN is used as a new objective function to estimate G.
▪ The goal is to increase the naturalness, clarity, similarity of converted speech.

𝑉 𝐺, 𝐷 = 𝑉𝐺𝐴𝑁 𝐺, 𝐷 + 𝜆 𝑉𝑉𝐴𝐸 𝒙|𝒚


Voice Conversion (VAW-GAN)
▪ Objective and subjective evaluations
The spectral envelopes. MOS on naturalness.

VAW-GAN outperforms VAE in terms of objective and subjective evaluations with generating
more structured speech.
Voice Conversion
• Sequence-to-sequence VC with learned similarity metric (LSM) [Kaneko et al., Interspeech 2017]
Target
speaker
𝒚

Source
speaker
Real
𝑪 D or
Fake

𝒛
Noise 𝑮 Similarity metric
𝑉_SVC^{𝐷_𝑙}(𝐶, 𝐷) = (1/𝑀_𝑙) ‖𝐷_𝑙(𝒚) − 𝐷_𝑙(𝐶(𝒙))‖²

𝑉(𝐶, 𝐺, 𝐷) = 𝑉_SVC^{𝐷_𝑙}(𝐶, 𝐷) + 𝑉_GAN(𝐶, 𝐺, 𝐷)
Voice Conversion (LSM)
▪ Spectrogram Analysis
Comparison of MCCs (upper) and STFT spectrograms (lower).
Source Target FVC MSE(S2S) LSM(S2S)

The spectral textures of LSM are more similar to the target ones.
Voice Conversion (LSM)
▪ Subjective evaluations
Preference scores for naturalness. Similarity of TGT and SRC with VCs.

Table 14: Preference scores for clarity.

Target speaker Source speaker

LSM outperforms FVC and MSE in terms of subjective evaluations.


Voice Conversion (CycleGAN-VC)
▪ CycleGAN-VC [Kaneko et al., arXiv 2017]
as close as possible

Source Syn. Target Source

𝑮𝑺→𝑻 𝐺𝑇→𝑆

Scalar: belongs to domain Scalar: belongs to


S or not 𝑫𝑺 𝑫𝑻
domain T or not

𝐺𝑇→𝑆 𝑮𝑺→𝑻
Target Syn. Source Target

as close as possible
𝑉_Full = 𝑉_GAN(𝐺_{𝑋→𝑌}, 𝐷_𝑌) + 𝑉_GAN(𝐺_{𝑌→𝑋}, 𝐷_𝑋)
+ 𝜆 𝑉_Cyc(𝐺_{𝑋→𝑌}, 𝐺_{𝑌→𝑋})
Voice Conversion
• Subjective evaluations
Similarity of to source and to target
speakers. S: Source; T:Target; P:
MOS for naturalness.
Proposed; B:Baseline

Target Source
speaker speaker

1. The proposed method uses non-parallel data.


2. For naturalness, the proposed method outperforms baseline.
3. For similarity, the proposed method is comparable to the baseline.
Voice Conversion
▪ Multi-target VC [Chou et al., Interspeech 2018]
➢ Stage-1
𝒚
C

𝑬nc Dec ···


𝑒𝑛𝑐(𝒙)
𝒙 𝑑𝑒𝑐(𝑒𝑛𝑐 𝒙 , 𝒚) 𝑑𝑒𝑐(𝑒𝑛𝑐 𝒙 , 𝒚′)
𝒚 𝒚′····
➢ Stage-2
𝒚"

𝑬nc Dec
𝑒𝑛𝑐(𝒙) F/R
𝒙
D+C
𝑮 ID

Real
𝒚" data
Voice Conversion
▪ Controller-generator-discriminator VC on Impaired Speech [Li-Wei Chen, Yu Tsao,
Hung-Yi Lee, arXiv 2018]

Previous applications: hearing aids; murmur to normal speech; bone-


conductive microphone to air-conductive microphone.
Proposed: improving the speech intelligibility of surgical patients.
Target: oral cancer (top five cancer for male in Taiwan).

Before After Before After


Voice Conversion
▪ Controller-generator-discriminator VC (CGD VC) on impaired speech [Li-Wei
Chen, Yu Tsao, Hung-Yi Lee, arXiv 2018]
Voice Conversion (CGD VC)
▪ Detailed architectures of C, G, and D

▪ G and D are trained using standard conditional GAN procedure.


▪ An additional loss is designed for training C: 𝐿_𝑐 = 𝔼_{𝑥^𝑠 ~ 𝑆} [ 𝐿( 𝐺(𝐶(𝑥^𝑠)), 𝑥^𝑠 ) ].
▪ Instead of low-level details, we want to measure high-level difference.
▪ The perceptual loss is used: 𝐿(𝑥, 𝑥′) = Σ_𝑙 2^{−2𝑙} | 𝐷_𝑙(𝑥) − 𝐷_𝑙(𝑥′) |_1, where 𝑙 denotes the 𝑙-th
hidden layer output.
Voice Conversion (CGD VC)
▪ Spectrogram analysis

Spectrogram comparison of CGD with CycleGAN.


Voice Conversion (CGD VC)
▪ Subjective evaluations
MOS for content similarity, speaker similarity, and articulation.

The proposed method outperforms conditional GAN and CycleGAN in terms of content similarity,
speaker similarity, and articulation.
Outline
1. Speech Enhancement
2. Postfilter, speech synthesis, voice conversion
3. Speech Signal Recognition
4. Conclusion
Speech, Speaker, Emotion Recognitions
& Lip Reading Recognitions
▪ Classification Task: GAN Model:
𝒚

Output
label

G ℎ(∙)

𝒛̃ = 𝑔(𝒙̃)
Emb.
Acoustic Mismatch
E 𝑔(∙)


𝒙 ෭
𝒙 ෥
𝒙
Channel Accented Noisy
Distortion Speech Data

𝒙
Speech Recognition
▪ Adversarial multi-task learning (AMT) [Shinohara Interspeech 2016]
Input 𝒙 (acoustic feature) → shared encoder E (with GRL) → G: Output 1 (senone 𝒚) and D: Output 2 (domain 𝒛)
Objective function:
𝑉_𝑦 = −Σ_𝑖 log 𝑃(𝑦_𝑖 | 𝑥_𝑖; 𝜃_𝐸, 𝜃_𝐺)
𝑉_𝑧 = −Σ_𝑖 log 𝑃(𝑧_𝑖 | 𝑥_𝑖; 𝜃_𝐸, 𝜃_𝐷)
Model update:
𝜃_𝐺 ← 𝜃_𝐺 − ϵ ∂𝑉_𝑦/∂𝜃_𝐺 (max classification accuracy)
𝜃_𝐷 ← 𝜃_𝐷 − ϵ ∂𝑉_𝑧/∂𝜃_𝐷 (max domain accuracy)
𝜃_𝐸 ← 𝜃_𝐸 − ϵ (∂𝑉_𝑦/∂𝜃_𝐸 + 𝛼 ∂𝑉_𝑧/∂𝜃_𝐸) (max classification accuracy and min domain accuracy)
Speech Recognition (AMT)
▪ ASR results in known (k) and unknown (unk) noisy conditions

WER of DNNs with single-task learning (ST) and AMT.

The AMT-DNN outperforms ST-DNN with yielding lower WERs.


Speech Recognition
▪ Domain adversarial training for accented ASR (DAT) [Sun et al., ICASSP2018]
Input 𝒙 (acoustic feature) → shared encoder E (with GRL) → G: Output 1 (senone 𝒚) and D: Output 2 (domain 𝒛)
Objective function:
𝑉_𝑦 = −Σ_𝑖 log 𝑃(𝑦_𝑖 | 𝑥_𝑖; 𝜃_𝐸, 𝜃_𝐺)
𝑉_𝑧 = −Σ_𝑖 log 𝑃(𝑧_𝑖 | 𝑥_𝑖; 𝜃_𝐸, 𝜃_𝐷)
Model update:
𝜃_𝐺 ← 𝜃_𝐺 − ϵ ∂𝑉_𝑦/∂𝜃_𝐺 (max classification accuracy)
𝜃_𝐷 ← 𝜃_𝐷 − ϵ ∂𝑉_𝑧/∂𝜃_𝐷 (max domain accuracy)
𝜃_𝐸 ← 𝜃_𝐸 − ϵ (∂𝑉_𝑦/∂𝜃_𝐸 + 𝛼 ∂𝑉_𝑧/∂𝜃_𝐸) (max classification accuracy and min domain accuracy)
Speech Recognition (DAT)
▪ ASR results on accented speech
WER of the baseline and adapted model.

STD: standard speech


1. With labeled transcriptions, ASR performance notably improves.
2. DAT is effective in learning features invariant to domain differences with and without
labeled transcriptions.
Speech Recognition
▪ Robust ASR using GAN enhancer (GAN-Enhancer) [Sriram et al., arXiv 2017]
𝒚
Output 1
Senone output 𝒚 from classifier ℎ(∙); embeddings 𝒛 = 𝑔(𝒙) (clean data) and 𝒛̃ = 𝑔(𝒙̃) (noisy data) from encoder E; D discriminates between the two embeddings.
Cross entropy with L1 Enhancer:  𝐻(ℎ(𝒛̃), 𝐲) + λ ‖𝒛 − 𝒛̃‖₁ / ( ‖𝒛‖₁ + ‖𝒛̃‖₁ + 𝜖 )
Cross entropy with GAN Enhancer:  𝐻(ℎ(𝒛̃), 𝐲) + λ 𝑉_adv( 𝑔(𝒙), 𝑔(𝒙̃) )
Speech Recognition (GAN-Enhancer)
▪ ASR results on far-field speech:
WER of GAN enhancer and the baseline methods.

GAN Enhancer outperforms the Augmentation and L1-Enhancer


approaches on far-field speech.
Speaker Recognition
▪ Domain adversarial neural network (DANN) [Wang et al., ICASSP 2018]
𝒚
𝒛
Output 1 Output 2
Speaker ID Domain

𝑉𝑦 G D 𝑉𝑧

Enroll
GRL i-vector
DANN Pre-processing

E Scoring

DANN Pre-processing
Test
i-vector
𝒙 Input
Acoustic feature
Speaker Recognition (DANN)
▪ Recognition results of domain mismatched conditions
Performance of DAT and the state-of-the-art methods.

The DAT approach outperforms other methods with


achieving lowest EER and DCF scores.
Emotion Recognition
▪ Adversarial AE for emotion recognition (AAE-ER) [Sahu et al., Interspeech 2017]
𝒙
AE with GAN: 𝐻(ℎ(𝒛), 𝒙) + λ 𝑉_GAN(𝒒, 𝑔(𝒙))

D G ℎ(∙)

𝒒 Emb. 𝒛 = 𝑔(𝒙)

Syn. E 𝑔(∙)

𝒙 The distribution of code vectors


Emotion Recognition (AAE-ER)
▪ Recognition results of domain mismatched conditions:
Classification results on different systems.

Classification results on real and synthesized features.

Original
Training
data

1. AAE alone could not yield performance improvements.


2. Using synthetic data from AAE can yield higher UAR.
Lip-reading
▪ Domain adversarial training for lip-reading (DAT-LR) [Wand et al., arXiv 2017]
Input 𝒙 → shared encoder E (with GRL) → G: Output 1 (words 𝒚) and D: Output 2 (speaker 𝒛)
Objective function:
𝑉_𝑦 = −Σ_𝑖 log 𝑃(𝑦_𝑖 | 𝑥_𝑖; 𝜃_𝐸, 𝜃_𝐺)
𝑉_𝑧 = −Σ_𝑖 log 𝑃(𝑧_𝑖 | 𝑥_𝑖; 𝜃_𝐸, 𝜃_𝐷)
Model update:
𝜃_𝐺 ← 𝜃_𝐺 − ϵ ∂𝑉_𝑦/∂𝜃_𝐺 (max classification accuracy)
𝜃_𝐷 ← 𝜃_𝐷 − ϵ ∂𝑉_𝑧/∂𝜃_𝐷 (max domain accuracy)
𝜃_𝐸 ← 𝜃_𝐸 − ϵ (∂𝑉_𝑦/∂𝜃_𝐸 + 𝛼 ∂𝑉_𝑧/∂𝜃_𝐸) (max classification accuracy and min domain accuracy)
~80% WAC
Lip-reading (DAT-LR)
▪ Recognition results of speaker mismatched conditions
Performance of DAT and the baseline.

The DAT approach notably enhances the recognition


accuracies in different conditions.
Outline
1. Speech Enhancement
2. Postfilter, speech synthesis, voice conversion
3. Speech Signal Recognition
4. Conclusion
Speech Signal Generation
Paired

▪ Regression Task:
Objective function

G Output

▪ GAN Models:
Speech, Speaker, Emotion Recognitions
& Lip Reading Recognitions
▪ Classification Task: GAN Model
𝒚

Output
label

G ℎ(∙)

𝒛̃ = 𝑔(𝒙̃)
Emb.
Acoustic Mismatch
E 𝑔(∙)


𝒙 ෭
𝒙 ෥
𝒙
Channel Accented Noisy
Distortion Speech Data

𝒙
More GANs in Speech
Diagnosis of Autism Spectrum
▪ Deng, et.al., Speech-based Diagnosis of Autism Spectrum Condition by Generative Adversarial Network
Representations, ACM DH, 2017.

Emotion Recognition
▪ Chang, et.al., Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial
Networks, ICASSP, 2017.

Robust ASR
▪ Serdyuk, et.al., Invariant Representations for Noisy Speech Recognition, arXiv, 2016.

Speaker Verification
▪ Hong Yu, Zheng-Hua Tan, Zhanyu Ma, and Jun Guo, Adversarial Network Bottleneck Features for Noise Robust
Speaker Verification, arXiv, 2017.
Welcome to Data Science Center UI
What You Will Learn Today
▪ GANs Applications for Vision
▪ GANs Applications for NLP
▪ GANs Application for Speech
▪ Reproduction of Some Models

What You Will Practice Today


▪ Realistic Image Generation
▪ Image to Image Translation
▪ Password Cracking
▪ Text to Speech
Advanced Design for AI
Algorithms
Applications of GAN for Text
Welcome to Data Science Center UI
What You Will Learn Today
▪ GANs Applications for Vision
▪ GANs Applications for NLP
▪ GANs Application for Speech
▪ Reproduction of Some Models

What You Will Practice Today


▪ Realistic Image Generation
▪ Image to Image Translation
▪ Password Cracking
▪ Text to Speech
Outline
▪ Conditional Sequence Generation
▪ Unsupervised Conditional Sequence Generation
Conditional Sequence Generation
How are you Machine Learning I am fine.

Generator Generator Generator

機 器 學 習 How are you?


ASR Translation Chatbot
The generator is a typical seq2seq model.
With GAN, you can train seq2seq model in another way.
Review: Sequence-to-sequence
• Chat-bot as example: a seq2seq model (encoder + decoder) reads the input sentence c ("How are you?") and generates the output sentence x (e.g. "Not bad", "I'm John").
• Training data: human dialogues, e.g. A: "How are you?" B: "I'm good."
• Training criterion: maximize the likelihood of the human response ("I'm good."), i.e. push the model toward the better output sentence x.
Introduction
• Machine obtains feedback from the user, e.g.
  "How are you?" → "Bye bye ☺"  reward: −10
  "Hello" → "Hi ☺"  reward: 3
• Chat-bot learns to maximize the expected reward.
Maximizing Expected Reward
Learn to maximize the expected reward with policy gradient [Li, et al., EMNLP, 2016].
[Figure: the chatbot (encoder–decoder) takes the input sentence c and produces the response sentence x; a human evaluates the pair and returns the reward R(c, x)]
Policy Gradient - Implementation
θ^(t+1) ← θ^t + η ∇R̄_(θ^t)
∇R̄_(θ^t) ≈ (1/N) Σ_(i=1..N) R(c^i, x^i) ∇log P_(θ^t)(x^i | c^i)
Sample N pairs (c^1, x^1), …, (c^N, x^N) from the current policy and obtain their rewards:
• If R(c^i, x^i) is positive → update θ to increase P_θ(x^i | c^i)
• If R(c^i, x^i) is negative → update θ to decrease P_θ(x^i | c^i)
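A minimal REINFORCE-style sketch of this update in Python: sample responses, get a scalar reward R(c, x), and weight the log-probabilities by it. `chatbot.sample_response` and `human_reward` are placeholder names assumed for illustration.

```python
# Policy-gradient sketch: (1/N) sum_i R(c_i, x_i) * grad log P_theta(x_i | c_i).
# chatbot.sample_response and human_reward are assumed placeholders.
import torch

def policy_gradient_step(chatbot, optimizer, contexts, human_reward):
    losses = []
    for c in contexts:
        x, log_prob = chatbot.sample_response(c)   # log P_theta(x | c), kept on the graph
        r = human_reward(c, x)                     # R(c, x), a plain number
        losses.append(-r * log_prob)               # minimizing this ascends the expected reward
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```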
Comparison
Maximum Likelihood:
• Objective function: (1/N) Σ_(i=1..N) log P_θ(x̂^i | c^i)
• Gradient: (1/N) Σ_(i=1..N) ∇log P_θ(x̂^i | c^i)
• Training data: (c^1, x̂^1), …, (c^N, x̂^N) — reference responses, i.e. the special case R(c^i, x̂^i) = 1
Reinforcement Learning:
• Objective function: (1/N) Σ_(i=1..N) R(c^i, x^i) log P_θ(x^i | c^i)
• Gradient: (1/N) Σ_(i=1..N) R(c^i, x^i) ∇log P_θ(x^i | c^i)
• Training data: (c^1, x^1), …, (c^N, x^N) obtained from interaction, weighted by R(c^i, x^i)
Conditional GAN
[Figure: the chatbot (encoder–decoder) generates the response sentence x for the input sentence c; a discriminator, trained on human dialogues, outputs "real or fake" as the "reward"] [Li, et al., EMNLP, 2017]
Algorithm
• Training data: pairs of conditional input c and response x
• Initialize generator G (chatbot) and discriminator D
• In each iteration:
  • Sample input c and response 𝑥 from the training set
  • Sample input 𝑐′ from the training set, and generate response 𝑥̃ by G(𝑐′)
  • Update D to increase 𝐷(𝑐, 𝑥) and decrease 𝐷(𝑐′, 𝑥̃)
  • Update generator G (chatbot) such that the discriminator output 𝐷(𝑐′, 𝑥̃) increases
[Figure: the chatbot (En–De) generates a response, the discriminator assigns it a scalar score, and the chatbot is updated so that this score becomes large]
Can we use gradient ascent? NO!
Due to the sampling process at the decoder output (each output token is sampled from a distribution over words such as A, B, <BOS>, …), "discriminator + generator" is not differentiable.
Outline
▪ Conditional Sequence Generation
▪ Unsupervised Conditional Sequence Generation
[Figure: instead of sampled tokens, the word distributions produced by the chatbot decoder are fed directly into the discriminator]
Use the distribution as the input of the discriminator, to avoid the sampling process.
We can do backpropagation now, and update the chatbot so that the discriminator score becomes large.
What is the problem?
• A real sentence is a sequence of 1-of-N (one-hot) vectors, e.g.
  (1 0 0 0 0), (0 1 0 0 0), (0 0 1 0 0), (0 0 0 1 0), (0 0 0 0 1)
• A generated "sentence" is a sequence of soft word distributions, e.g.
  (0.9 0.1 0.1 0 0), (0.1 0.9 0.1 0 0), (0 0 0.7 0.1 0), (0 0 0.1 0.8 0.1), (0 0 0 0.1 0.9)
• The generated distributions can never be exactly 1-of-N, so the discriminator can immediately find the difference. WGAN is helpful here.
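A small sketch of the issue and a common workaround: instead of handing the discriminator raw (near) one-hot vectors, map both real and generated sentences into embedding space and use a WGAN-style critic. The vocabulary size, embedding size, and critic here are illustrative assumptions.

```python
# Hard one-hot (real) vs. soft distributions (generated), compared in embedding
# space with a WGAN-style critic. Sizes and modules are assumed for illustration.
import torch
import torch.nn as nn

vocab, emb_dim, seq_len = 5, 8, 4
embedding = nn.Embedding(vocab, emb_dim)

real_ids = torch.tensor([[0, 1, 2, 3]])                               # a real sentence: hard 1-of-N
real_onehot = nn.functional.one_hot(real_ids, vocab).float()
fake_probs = torch.softmax(torch.randn(1, seq_len, vocab), dim=-1)    # generator's soft output

# On raw vectors the discriminator only needs to check "is every row exactly 0/1?".
# Mapping both to expected embeddings removes that trivial cue.
real_emb = real_onehot @ embedding.weight    # equals embedding(real_ids)
fake_emb = fake_probs @ embedding.weight     # expected embedding under the distribution

critic = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * emb_dim, 1))
wgan_critic_loss = critic(fake_emb).mean() - critic(real_emb).mean()  # WGAN objective (gradient penalty omitted)
```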
Reinforcement Learning?
• Consider the output of the discriminator as the reward
  • Update the generator to increase the discriminator output, i.e. to get the maximum reward
  • Using the formulation of policy gradient, replace the reward 𝑅(𝑐, 𝑥) with the discriminator output 𝐷(𝑐, 𝑥)
• Different from typical RL: the discriminator itself is also updated during training
d-step: train the discriminator D to label real pairs (c, x) as real and generated pairs as fake.
g-step: update the chatbot with policy gradient, using the discriminator score as the reward:
θ^(t+1) ← θ^t + η ∇R̄_(θ^t)
∇R̄_(θ^t) ≈ (1/N) Σ_(i=1..N) 𝐷(c^i, x^i) ∇log P_(θ^t)(x^i | c^i)
• If 𝐷(c^i, x^i) is positive → update θ to increase P_θ(x^i | c^i)
• If 𝐷(c^i, x^i) is negative → update θ to decrease P_θ(x^i | c^i)
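The following is a compact d-step / g-step sketch of this loop in Python. The helpers `sample_pairs`, `chatbot.sample_response`, and `discriminator` are assumed placeholders; the key point is that the discriminator score replaces R(c, x) in the policy-gradient update.

```python
# Conditional GAN for a chatbot: discriminator update (d-step), then a policy
# gradient update using D(c, x) as the reward (g-step). Helpers are placeholders.
import torch

def train_iteration(chatbot, discriminator, opt_g, opt_d, sample_pairs,
                    bce=torch.nn.BCEWithLogitsLoss()):
    # d-step: real human pairs vs. pairs generated by the current chatbot
    c_real, x_real = sample_pairs()
    with torch.no_grad():
        x_fake, _ = chatbot.sample_response(c_real)
    d_loss = bce(discriminator(c_real, x_real), torch.ones(len(c_real))) + \
             bce(discriminator(c_real, x_fake), torch.zeros(len(c_real)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # g-step: policy gradient with D(c, x) as the reward
    c, _ = sample_pairs()
    x, log_prob = chatbot.sample_response(c)          # log P_theta(x | c)
    with torch.no_grad():
        reward = torch.sigmoid(discriminator(c, x))   # D(c, x) in [0, 1]
    g_loss = -(reward * log_prob).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```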
Reward for Every Generation Step
∇R̄_θ ≈ (1/N) Σ_(i=1..N) 𝐷(c^i, x^i) ∇log P_θ(x^i | c^i)
• c^i = "What is your name?", x^i = "I don't know": 𝐷(c^i, x^i) is negative, so θ is updated to decrease log P_θ(x^i | c^i) = log P(x_1^i | c^i) + log P(x_2^i | c^i, x_1^i) + log P(x_3^i | c^i, x_{1:2}^i), including P("I" | c^i).
• c^i = "What is your name?", x^i = "I am John": 𝐷(c^i, x^i) is positive, so θ is updated to increase log P_θ(x^i | c^i) = log P(x_1^i | c^i) + log P(x_2^i | c^i, x_1^i) + log P(x_3^i | c^i, x_{1:2}^i), including P("I" | c^i).
Reward for Every Generation Step
c^i = "What is your name?", x^i = "I don't know"
log P_θ(x^i | c^i) = log P(x_1^i | c^i) + log P(x_2^i | c^i, x_1^i) + log P(x_3^i | c^i, x_{1:2}^i)
                   = log P("I" | c^i) + log P("don't" | c^i, "I") + log P("know" | c^i, "I don't")
A sentence-level reward penalizes every term, including P("I" | c^i), even though "I" on its own is not a bad first word. Instead, give a reward to every generation step:
∇R̄_θ ≈ (1/N) Σ_(i=1..N) 𝐷(c^i, x^i) ∇log P_θ(x^i | c^i)
becomes
∇R̄_θ ≈ (1/N) Σ_(i=1..N) Σ_(t=1..T) (Q(c^i, x_{1:t}^i) − b) ∇log P_θ(x_t^i | c^i, x_{1:t−1}^i)
Method 1. Monte Carlo (MC) search [Yu, et al., AAAI, 2017] (a rough sketch follows below)
Method 2. Discriminator for partially decoded sequences [Li, et al., EMNLP, 2017]
Method 3. Step-wise evaluation [Tuan, et al., arXiv, 2018][Xu, et al., EMNLP, 2018][William Fedus, et al., ICLR, 2018]
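A rough sketch of the Monte Carlo search idea: for each prefix x_{1:t}, roll out several completions, average their discriminator scores to estimate Q(c, x_{1:t}), and weight the per-step log-probability by Q minus a baseline. `chatbot.rollout`, `discriminator`, and the baseline value are assumed placeholders.

```python
# Step-wise reward via Monte Carlo rollouts: Q(c, x_{1:t}) is estimated by
# completing the prefix several times and averaging D's scores. Helpers are placeholders.
import torch

def stepwise_policy_gradient(chatbot, discriminator, c, x, log_probs,
                             n_rollouts=4, baseline=0.5):
    # x: generated token sequence; log_probs: list of log P(x_t | c, x_{1:t-1}) kept on the graph
    loss = 0.0
    for t in range(1, len(x) + 1):
        with torch.no_grad():
            scores = [discriminator(c, chatbot.rollout(c, x[:t])) for _ in range(n_rollouts)]
            q = torch.stack(scores).mean()              # estimated Q(c, x_{1:t})
        loss = loss - (q - baseline) * log_probs[t - 1]
    return loss / len(x)                                # minimize this to ascend the step-wise reward
```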
Empirical Performance
• MLE frequently generates short, generic responses such as "I'm sorry" and "I don't know" (corresponding to fuzzy images?)
• GAN generates longer and more complex responses.
• Find more comparisons in the survey papers [Lu, et al., arXiv, 2018][Zhu, et al., arXiv, 2018].
• However, no strong evidence shows that GANs are better than MLE [Stanislau Semeniuta, et al., arXiv, 2018][Guy Tevet, et al., arXiv, 2018][Massimo Caccia, et al., arXiv, 2018].
More Applications
• Supervised machine translation [Wu, et al., arXiv 2017][Yang, et al., arXiv 2017]
• Supervised abstractive summarization [Liu, et al., AAAI 2018]
• Image/video caption generation [Rakshith Shetty, et al., ICCV 2017][Liang, et al., arXiv 2017]
• Speech recognition [Liu, et al., arXiv 2018]
If you are using seq2seq models, consider improving them with GAN.
Part 1: unsupervised conditional generation for images (e.g. male ↔ female).
Part 2: unsupervised conditional sequence generation:
▪ Text style transfer (positive sentences ↔ negative sentences)
▪ Unsupervised abstractive summarization (document → summary)
▪ Unsupervised translation (Language 1 ↔ Language 2)
▪ Unsupervised ASR (audio → text)
Direct Transformation
[Cycle GAN: 𝐺_X→Y maps domain X to domain Y and 𝐺_Y→X maps back; 𝐷_X and 𝐷_Y judge whether a sample belongs to domain X / domain Y; the reconstruction after X → Y → X (and Y → X → Y) should be as close as possible to the original (cycle consistency)]
Direct Transformation
Example for text style transfer:
• "It is bad." (negative) → 𝐺_X→Y → "It is good." (positive) → 𝐺_Y→X → "It is bad." (as close as possible to the input)
• "I love you." (positive) → 𝐺_Y→X → "I hate you." (negative) → 𝐺_X→Y → "I love you." (as close as possible to the input)
• 𝐷_X: is this a negative sentence?  𝐷_Y: is this a positive sentence?
Discrete words? Work on word embeddings instead.
Direct Transformation [Lee, et al., ICASSP, 2018]
(Same cycle-GAN setup as above, applied to negative ↔ positive sentences on word embeddings.)
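A rough sketch of the cycle-consistency idea for text, operating on word-embedding sequences so everything stays differentiable (sidestepping the discrete-word issue as on the slide). The generators, discriminators, and loss weights are illustrative assumptions, not the paper's implementation.

```python
# Cycle-GAN-style losses for text style transfer on embedded sentences.
# G_xy, G_yx, D_x, D_y are assumed placeholder modules; lam is an assumed weight.
import torch
import torch.nn as nn

def cycle_gan_text_losses(G_xy, G_yx, D_x, D_y, emb_x, emb_y, lam=10.0,
                          bce=nn.BCEWithLogitsLoss()):
    # emb_x: embedded negative sentences, emb_y: embedded positive sentences
    fake_y = G_xy(emb_x)                      # negative -> "positive"
    fake_x = G_yx(emb_y)                      # positive -> "negative"

    # adversarial terms: each generator tries to fool the target-domain discriminator
    adv = bce(D_y(fake_y), torch.ones(len(emb_x), 1)) + \
          bce(D_x(fake_x), torch.ones(len(emb_y), 1))

    # cycle terms: X -> Y -> X and Y -> X -> Y should come back "as close as possible"
    cyc = nn.functional.l1_loss(G_yx(fake_y), emb_x) + \
          nn.functional.l1_loss(G_xy(fake_x), emb_y)

    return adv + lam * cyc
```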
(Thanks to 王耀賢 for providing the experimental results.)
Cycle GAN
✘ Negative sentence to positive sentence:
it's a crappy day -> it's a great day
i wish you could be here -> you could be here
it's not a good idea -> it's good idea
i miss you -> i love you
i don't love you -> i love you
i can't do that -> i can do that
[Lee, et al., ICASSP, 2018]
i feel so sad -> i happy
it's a bad day -> it's a good day
it's a dummy day -> it's a great day
sorry for doing such a horrible thing -> thanks for doing a
great thing
my doggy is sick -> my doggy is my doggy
my little doggy is sick -> my little doggy is my little doggy
(Thanks to 張瓊之 for providing the experimental results.)
Cycle GAN
✘ Negative sentence to positive sentence:
胃疼, 沒睡醒, 各種不舒服 -> 生日快樂, 睡醒, 超級舒服
  ("Stomach ache, not awake yet, feeling all kinds of unwell" -> "Happy birthday, awake, feeling super comfortable")
我都想去上班了, 真夠賤的! -> 我都想去睡了, 真帥的!
  ("I even feel like going to work, how nasty!" -> "I even feel like going to sleep, how handsome!")
暈死了, 吃燒烤、竟然遇到個變態狂 -> 哈哈好~, 吃燒烤~ 竟然遇到帥狂
  ("So annoying: went for barbecue and ran into a pervert" -> "Haha great~, went for barbecue~ and ran into a handsome guy")
蕭哥, 我肚子痛的厲害 -> 蕭哥, 我生日快樂厲害
  ("Brother Hsiao, my stomach hurts badly" -> "Brother Hsiao, I happy-birthday badly")
感冒了, 難受的說不出話來了! -> 感冒了, 開心的說不出話來!
  ("Caught a cold, so miserable I can't speak!" -> "Caught a cold, so happy I can't speak!")
Projection to Common Space
Decoder hidden layer as discriminator input [Shen, et al., NIPS, 2017]
[Architecture: encoders 𝐸𝑁_X (positive sentences) and 𝐸𝑁_Y (negative sentences) project into a common code space; decoders 𝐷𝐸_X / 𝐷𝐸_Y generate positive / negative sentences from the code; discriminators 𝐷_X / 𝐷_Y judge the outputs of each domain]
𝐸𝑁_X and 𝐸𝑁_Y learn to fool a domain discriminator that tries to tell whether a code came from 𝐸𝑁_X or 𝐸𝑁_Y. [Zhao, et al., arXiv, 2017][Fu, et al., AAAI, 2018]
Part 1: unsupervised conditional generation for images (e.g. male ↔ female).
Part 2: unsupervised conditional sequence generation:
▪ Text style transfer (positive sentences ↔ negative sentences)
▪ Unsupervised abstractive summarization (document → summary)
▪ Unsupervised translation (Language 1 ↔ Language 2)
▪ Unsupervised ASR (audio → text)
Abstractive Summarization
• Machine can now write abstractive summaries with seq2seq (summaries in its own words).
• Supervised: we need lots of labelled training data (document–summary pairs).
Unsupervised Abstractive Summarization
• Can we learn summarization from unpaired data: a set of documents (domain X) and a set of summaries (domain Y) that do not correspond to each other? [Wang, et al., EMNLP, 2018]
Unsupervised Abstractive Summarization
[Figure: a seq2seq generator G maps the document to a word sequence (the candidate summary); a discriminator D, trained on human-written summaries, judges whether the word sequence is real or not]
Unsupervised Abstractive Summarization
[Figure: in addition, a reconstructor R (another seq2seq) maps the generated word sequence back to the document; G and R are trained to minimize the reconstruction error]
Unsupervised Abstractive Summarization
• Only a large collection of documents is needed to train the model.
• This is a seq2seq2seq auto-encoder, using a sequence of words as the latent representation.
• Without the discriminator, the latent word sequence is usually not readable as a summary.
Unsupervised Abstractive Summarization
• The REINFORCE algorithm is used to deal with the discrete issue (the summary is a sequence of sampled words).
• The generator is trained so that the discriminator D, which has seen human-written summaries, considers its output real; this pushes the latent word sequence toward a readable summary.
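A loose sketch of one training step of this seq2seq2seq setup in Python: G maps a document to a word-sequence "summary", R reconstructs the document from it, D pushes the summary to look like human-written titles, and the discrete summary is handled with REINFORCE. Every module and helper here is a placeholder assumption, not the paper's code.

```python
# Unsupervised abstractive summarization sketch: reconstruction + REINFORCE
# against a discriminator trained on human-written summaries. Helpers are placeholders.
import torch

def unsup_summarization_step(G, R, D, opt_gr, opt_d, documents, real_titles,
                             bce=torch.nn.BCEWithLogitsLoss()):
    # --- generator + reconstructor ---
    summary, log_prob = G.sample(documents)                  # discrete word sequence, log P(summary|doc)
    recon_loss = R.reconstruction_loss(summary, documents)   # can R rebuild the document from the summary?
    with torch.no_grad():
        reward = torch.sigmoid(D(summary))                   # does the summary look like a human title?
    loss_gr = recon_loss - (reward * log_prob).mean()        # REINFORCE term: make D call the summary real
    opt_gr.zero_grad(); loss_gr.backward(); opt_gr.step()

    # --- discriminator: human titles are real, generated summaries are fake ---
    d_loss = bce(D(real_titles), torch.ones(len(real_titles))) + \
             bce(D(summary), torch.zeros(len(documents)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
```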
Experimental results
English Gigaword (document title as summary):

Method                         | ROUGE-1 | ROUGE-2 | ROUGE-L
Supervised                     | 33.2    | 14.2    | 30.5
Trivial                        | 21.9    | 7.7     | 20.5
Unsupervised (matched data)    | 28.1    | 10.0    | 25.4
Unsupervised (no matched data) | 27.2    | 9.1     | 24.1

• Matched data: using the titles of English Gigaword to train the discriminator.
• No matched data: using the titles of CNN/Daily Mail to train the discriminator.
Semi-supervised Learning (using matched data)
[Figure: ROUGE-1 (25–34) vs. number of document–summary pairs used (0, 10k, 500k); the semi-supervised model approaches the fully supervised one with far fewer labelled pairs than the unsupervised baseline]
WGAN and REINFORCE are the two approaches used to deal with the discrete issue; the supervised model uses 3.8M pairs.
(Thanks to 王耀賢 for providing the experimental results.)
Unsupervised Abstractive Summarization
• Document: 澳大利亞今天與13個國家簽署了反興奮劑雙邊協議,旨在加強體育競賽之外的藥品檢查並共享研究成果 ……
  ("Australia today signed bilateral anti-doping agreements with 13 countries, aiming to strengthen drug testing outside sports competitions and to share research results …")
• Summary:
  • Human: 澳大利亞與13國簽署反興奮劑協議 ("Australia signs anti-doping agreements with 13 countries")
  • Unsupervised: 澳大利亞加強體育競賽之外的藥品檢查 ("Australia strengthens drug testing outside sports competitions")
• Document: 中華民國奧林匹克委員會今天接到一九九二年冬季奧運會邀請函,由於主席張豐緒目前正在中南美洲進行友好訪問,因此尚未決定是否派隊赴賽 ……
  ("The Republic of China Olympic Committee today received an invitation to the 1992 Winter Olympic Games; since chairman 張豐緒 is currently on a goodwill visit to Central and South America, it has not yet been decided whether to send a team …")
• Summary:
  • Human: 一九九二年冬季奧運會函邀我參加 ("We are invited by letter to the 1992 Winter Olympic Games")
  • Unsupervised: 奧委會接獲冬季奧運會邀請函 ("Olympic Committee receives Winter Olympics invitation letter")
(Thanks to 王耀賢 for providing the experimental results.)
Unsupervised Abstractive Summarization
• Document: 據此間媒體27日報道,印度尼西亞蘇門答臘島的兩個省近日來連降暴雨,洪水泛濫導致塌方,到26日為止至少已有60人喪生,100多人失蹤 ……
  ("Local media reported on the 27th that two provinces on Indonesia's Sumatra island have seen days of torrential rain; flooding caused landslides, and as of the 26th at least 60 people had died and more than 100 were missing …")
• Summary:
  • Human: 印尼水災造成60人死亡 ("Floods in Indonesia kill 60")
  • Unsupervised: 印尼門洪水泛濫導致塌雨 (ungrammatical; roughly "Indonesia-gate flooding causes collapse-rain")
• Document: 安徽省合肥市最近為領導幹部下基層做了新規定:一律輕車簡從,不準搞迎來送往、不準搞層層陪同 ……
  ("Hefei, Anhui Province, recently issued new rules for officials visiting the grassroots: travel light with small entourages, no welcome-and-send-off ceremonies, and no layers of accompanying staff …")
• Summary:
  • Human: 合肥規定領導幹部下基層活動從簡 ("Hefei requires officials' grassroots visits to be kept simple")
  • Unsupervised: 合肥領導幹部下基層做搞迎來送往規定:一律簡 (partly garbled; roughly "Hefei officials visiting grassroots make welcome-and-send-off rules: all simple")
Part 1: unsupervised conditional generation for images (e.g. male ↔ female).
Part 2: unsupervised conditional sequence generation:
▪ Text style transfer (positive sentences ↔ negative sentences)
▪ Unsupervised abstractive summarization (document → summary)
▪ Unsupervised translation (Language 1 ↔ Language 2)
▪ Unsupervised ASR (audio → text)
[Alexis Conneau, et al., ICLR, 2018][Guillaume Lample, et al., ICLR, 2018]
[Figure: translation quality of supervised vs. unsupervised training as the amount of data grows]
Unsupervised learning with 10M sentences ≈ supervised learning with 100K sentence pairs.
Part 1: unsupervised conditional generation for images (e.g. male ↔ female).
Part 2: unsupervised conditional sequence generation:
▪ Text style transfer (positive sentences ↔ negative sentences)
▪ Unsupervised abstractive summarization (document → summary)
▪ Unsupervised translation (Language 1 ↔ Language 2)
▪ Unsupervised ASR (audio → text)
Acoustic Token Discovery
Acoustic tokens can be discovered from an audio collection without text annotation.
Acoustic tokens: chunks of acoustically similar audio segments with token IDs [Zhang & Glass, ASRU 09][Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]
Acoustic Token Discovery
[Figure: utterances segmented into discovered token sequences, e.g. "Token 3 Token 2 Token 1", "Token 2 Token 3 Token 1", "Token 1 Token 4 Token 3"]
Acoustic tokens can be discovered from an audio collection without text annotation.
Acoustic tokens: chunks of acoustically similar audio segments with token IDs [Zhang & Glass, ASRU 09][Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]
Unsupervised Speech Recognition
[Figure: phone-level acoustic pattern discovery turns utterances into token sequences such as "p1 p2 p3 p4", "p1 p3 p2", "p1 p4 p3 p5 p5", "p1 p5 p4 p3"; a GAN maps them onto phoneme sequences from unpaired text such as "AY L AH V Y UW", "G UH D B AY", "HH AW AA R Y UW", "AY M F AY N", "T AY W AA N", e.g. learning p1 = "AY"]
[Liu, et al., INTERSPEECH, 2018][Chen, et al., arXiv, 2018]
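A very rough sketch of the idea: a generator turns discovered acoustic-token IDs into phoneme distributions, and a critic (trained WGAN-style, matching the WGAN-GP variant mentioned below) compares those distributions with one-hot phoneme sequences taken from unpaired text. All sizes and modules are illustrative assumptions, not the papers' implementation.

```python
# Unsupervised phoneme mapping sketch: acoustic tokens -> phoneme posteriors,
# judged by a critic against phoneme sequences from unpaired text. Assumed sizes/modules.
import torch
import torch.nn as nn

n_tokens, n_phones, seq_len = 50, 40, 20
generator = nn.Sequential(nn.Embedding(n_tokens, 64), nn.Linear(64, n_phones), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Conv1d(n_phones, 32, kernel_size=3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1))

token_ids = torch.randint(0, n_tokens, (8, seq_len))       # discovered acoustic tokens (unpaired audio)
text_phones = torch.randint(0, n_phones, (8, seq_len))     # phoneme sequences from unpaired text

fake = generator(token_ids)                                 # (8, seq_len, n_phones) soft phoneme posteriors
real = nn.functional.one_hot(text_phones, n_phones).float() # (8, seq_len, n_phones) one-hot phonemes

# WGAN critic loss: push real scores up and generated scores down (gradient penalty omitted)
critic_loss = critic(fake.transpose(1, 2)).mean() - critic(real.transpose(1, 2)).mean()
generator_loss = -critic(fake.transpose(1, 2)).mean()
```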
• Phoneme recognition [Liu, et al., INTERSPEECH, 2018] (using oracle phoneme boundaries)
  [Figure: error rates of the supervised system vs. the unsupervised systems (Gumbel-softmax and WGAN-GP); audio: TIMIT, text: WMT]
• Word recognition [Chung, et al., NIPS, 2018] achieved 23.7% accuracy (audio: Librispeech, text: SWC)
Welcome to Data Science Center UI
What You Will Learn Today
▪ GANs Applications for Vision
▪ GANs Applications for NLP
▪ GANs Applications for Speech
▪ Reproduction of Some Models
What You Will Practice Today
▪ Realistic Image Generation
▪ Image to Image Translation
▪ Password Cracking
▪ Text to Speech