Algorithms
Lec.: 1
GAN
Outline
1. Introduction to Generative Models
2. Generative Adversarial Networks (GANs)
3. Improved GAN Techniques
4. Deep Convolutional GAN (DCGAN)
5. Conditional GAN (CGAN)
Generative vs Discriminative Model
• Discriminative Model
• Learn a hypothesis function that maps input data (x) to a desired
output class label (y). In probabilistic terms, learn the conditional
distribution P(y|x).
• Generative Model
• Learn the joint probability of the input data (x) and the labels (y)
simultaneously, i.e., P(x, y). This can be converted to P(y|x) for
classification via Bayes' rule.
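As a concrete illustration of converting a learned joint P(x, y) into P(y|x) via Bayes' rule, here is a minimal NumPy sketch over a hypothetical discrete toy distribution (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical toy joint distribution P(x, y) over 3 discrete inputs and 2 labels.
# Rows index x, columns index y; entries sum to 1.
P_xy = np.array([
    [0.20, 0.05],
    [0.10, 0.25],
    [0.15, 0.25],
])

# A generative model learns P(x, y) directly; classification then uses Bayes' rule:
# P(y | x) = P(x, y) / P(x), where P(x) = sum_y P(x, y).
P_x = P_xy.sum(axis=1, keepdims=True)
P_y_given_x = P_xy / P_x

# A discriminative model would instead learn P(y | x) directly.
print(P_y_given_x[0])  # label distribution for x = 0
```
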
Supervised vs Unsupervised Learning
Supervised Learning examples:
▪ Classification & Regression
▪ Object Detection
▪ Semantic Segmentation
▪ Image Captioning
Unsupervised Learning examples:
▪ Clustering
▪ Dimensionality Reduction
▪ Feature Learning
▪ Density Estimation (Core Problem in GANs)
Generative Models
Problem: Given training data, generate new samples from the same distribution.
Goal: We want to learn p_g(x) similar to p_data(x), i.e., learn about the data through generation.
Addresses density estimation, a core problem in unsupervised learning:
▪ Explicit Density: explicitly define and solve for p_g(x)
▪ Implicit Density: learn a model that can sample from p_g(x) without explicitly defining it
Why Are Generative Models Important?
Advantages of Generative Models:
• Represent and manipulate high-dimensional probability distributions
• Can be incorporated into (inverse) reinforcement learning in several ways
• Can be trained with missing data, enabling semi-supervised learning
• Can provide predictions on inputs that are missing data
• Enable machine learning to work with multi-modal outputs
• Many tasks intrinsically require realistic generation of samples from some distributions:
• Creating Art and Realistic Images (Data Augmentation)
• Single Image Super-Resolution
• Image-to-Image Translation

"What I cannot create, I do not understand!"
-- Richard Feynman
Applications of Generative Models
Computer Vision: Realistic Image Creation; Image-to-Image Translation; Synthetic Image Creation; Image and Shape In-Painting; Object/Image Reconstruction; Image Super-Resolution; Face Emotion and Aging; Video Frame Prediction; Video Deblurring; many more.
Speech Recognition: Generative Speech Enhancement; Speech-Driven Animation; Lip Talking and Reading; Synthetic Audio/Voice; Voice Conversion; Voice Separation; Voice Impersonation; Speech and Speaker Emotion; Postfilter for Synthesized Speech; many more.
Natural Language: Realistic Text Generation; Text Translation; Text Corpora Generation; Generative Machine Translation; Conditional Sequence Generation; Neural Dialog Generation; Generative Conversation Responses; Text Style Transfer; Abstractive Summarization; many more.
▪ Some find k = 1 (one discriminator update per generator update) more stable, while others use k > 1; there is no universally best rule.
▪ Recent work (e.g., Wasserstein GAN) alleviates this problem and gives better stability!
Problems in Training GANs
• Training GANs requires finding Nash equilibrium of a non-convex game with
continuous, high dimensional parameters.
• Gradient descent techniques are designed to find a low value of a cost function,
rather than to find the Nash equilibrium of a game.
• When used to seek a Nash equilibrium, gradient descent may fail to converge.
• Salimans et al., NIPS 2016, propose several heuristic techniques to encourage
convergence of the GAN game. All code and hyperparameters are available on GitHub.
Improved GAN Techniques
• Training GANs consists of finding a Nash equilibrium of a two-player non-cooperative
game. A Nash equilibrium is a point at which the cost functions of D and G are both
at a minimum with respect to their own parameters. Finding Nash equilibria is a very
difficult problem. Algorithms exist for specialized cases, but they are hard to apply
to the GAN game, where the cost functions are non-convex, the parameters are
continuous, and the parameter space is extremely high-dimensional.
• Common Heuristic Techniques Described in the Paper:
1. Feature Matching
2. Minibatch Discrimination
3. Historical Averaging
4. One-sided Label Smoothing
5. Virtual Batch Normalization
• Note: the contributions of the paper are heuristic; a more rigorous theoretical
understanding needs to be developed in future work.
Feature Matching
• Addresses the instability of GANs by specifying a new objective for the generator
that prevents it from overtraining on the current discriminator.
• Instead of directly maximizing the output of the discriminator, the new objective
requires the generator to generate data that matches the statistics of the real
data.
• Specifically, train the generator to match the expected value of the features on an
intermediate layer of the discriminator.
• Empirical results indicate that feature matching is indeed effective in situations
where regular GAN becomes unstable.
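A minimal sketch of the feature-matching objective, using random NumPy arrays as hypothetical stand-ins for the discriminator's intermediate-layer features (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical intermediate-layer features f(x) of the discriminator:
# random stand-ins with shape (batch, feature_dim).
f_real = rng.normal(loc=1.0, size=(64, 16))   # features of real data
f_fake = rng.normal(loc=0.0, size=(64, 16))   # features of generated data

# Feature matching: instead of directly maximizing D's output, the generator
# minimizes the distance between the expected features of real and generated
# batches on an intermediate layer of D.
fm_loss = np.sum((f_real.mean(axis=0) - f_fake.mean(axis=0)) ** 2)
print(fm_loss)
```

In a real model, `fm_loss` would be backpropagated through the generator only, with the discriminator trained as usual.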
Minibatch Discrimination
• Generator can collapse to a parameter setting where it
always emits the same point. When collapse to a single
mode is imminent, the gradient of the discriminator may
point in similar directions for many similar points.
• After collapse has occurred, the discriminator learns that
this single point comes from the generator, but gradient
descent is unable to separate the identical outputs.
• An obvious strategy to avoid this type of failure is to allow
the discriminator to look at multiple data examples in
combination, and perform minibatch discrimination.
• The concept of minibatch discrimination is quite general: any discriminator model
that looks at multiple examples in combination, rather than in isolation, could
potentially help avoid collapse of the generator.
• In this construction, features f from sample x are multiplied by a tensor T, and
cross-sample distances are computed.
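The tensor-based construction above can be sketched as follows; the shapes, the tensor T, and the exponentiated-L1 cross-sample similarity follow the usual formulation, but all values here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

n, a, b, c = 8, 16, 4, 3          # batch size, feature dim, rows/cols of T
f = rng.normal(size=(n, a))       # discriminator features f(x_i)
T = rng.normal(size=(a, b, c))    # learned tensor T (random stand-in here)

# Project each feature vector through T to get a matrix M_i of shape (b, c).
M = np.einsum('na,abc->nbc', f, T)

# Cross-sample similarity: o(x_i)_b = sum_j exp(-||M_i[b] - M_j[b]||_1)
# (this sketch includes the j = i self term, which contributes exp(0) = 1).
diff = np.abs(M[:, None, :, :] - M[None, :, :, :]).sum(axis=3)  # (n, n, b)
o = np.exp(-diff).sum(axis=1)                                   # (n, b)

# o is concatenated with f and fed to the rest of the discriminator, letting D
# compare samples within a minibatch and detect a collapsing generator.
print(o.shape)
```
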
Historical Averaging
• Modify each player's cost function to include the term ‖θ − (1/t) Σ_{i=1}^{t} θ[i]‖², where θ[i] is the value of the parameters at past time i. This penalizes large excursions from the historical average of the parameters and helps damp oscillations.
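The historical averaging term ‖θ − (1/t) Σ_{i=1}^{t} θ[i]‖² can be sketched with a toy parameter history (illustrative values, not the paper's code; in practice the average is maintained as a running statistic updated every step):

```python
import numpy as np

# Toy parameter history theta[1..t].
history = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 2.0])]
theta = np.array([3.0, 3.0])

# Historical averaging term: || theta - (1/t) * sum_i theta[i] ||^2,
# added to each player's cost to damp oscillations around past parameters.
theta_bar = np.mean(history, axis=0)      # historical average = [1, 1]
penalty = np.sum((theta - theta_bar) ** 2)
print(penalty)  # 8.0
```
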
Deep Convolutional GAN (DCGAN)
DCGAN generator used for LSUN scene modeling. A 100-dimensional uniform distribution Z is
projected to a small-spatial-extent convolutional representation with many feature maps. A series of
four fractionally-strided convolutions (in some recent papers, these are wrongly called
deconvolutions) then converts this high-level representation into a 64 × 64 pixel image. Notably, no
fully connected or pooling layers are used.
DCGAN Results – MNIST Dataset
Vector arithmetic for visual concepts. For each column, the Z vectors of samples are
averaged. Arithmetic was then performed on the mean vectors, creating a new vector Y.
The center sample on the right-hand side is produced by feeding Y as input to the
generator. To demonstrate the interpolation capabilities of the generator, uniform noise
sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying
arithmetic in the input space (bottom two examples) results in noisy overlap due to
misalignment.
Why Conditional Generation is Important?
Key Issues:
1. It is challenging to scale deep neural networks to accommodate an extremely
large number of predicted output categories.
2. Much of the work to date has focused on learning one-to-one mappings from
input to output. However, many interesting problems are more naturally
thought of as a probabilistic one-to-many mapping.
Proposed Solutions:
1. One way to help address the first issue is to leverage additional information
from other modalities : for instance, by using natural language corpora to learn
a vector representation for labels in which geometric relations are semantically
meaningful.
2. One way to address the second problem is to use a conditional probabilistic
generative model: the input is taken to be the conditioning variable, and the
one-to-many mapping is instantiated as a conditional predictive distribution.
CGAN Introduction
▪ Generative adversarial nets can be extended to a
conditional model if both the generator G and the
discriminator D are conditioned on some extra information
y. Here y could be any kind of auxiliary information,
such as class labels or data from other modalities. We can
perform the conditioning by feeding y into both the
discriminator and the generator as an additional input layer.
▪ In the generator, the prior input noise p_z(z) and y are
combined in a joint hidden representation, and the
adversarial training framework allows considerable
flexibility in how this hidden representation is composed.
REF: Jon Gauthier, Conditional generative adversarial nets for convolutional face generation
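The conditioning described above (feeding y into both G and D as an extra input) can be sketched with NumPy; all shapes here are illustrative assumptions, not a particular model:

```python
import numpy as np

rng = np.random.default_rng(0)

# CGAN conditioning sketch: feed label y into both G and D as an extra input.
batch, z_dim, n_classes = 4, 8, 10
z = rng.normal(size=(batch, z_dim))            # prior noise p_z(z)
y = np.eye(n_classes)[[3, 1, 4, 1]]            # one-hot class labels

# Generator input: noise and condition combined into a joint representation.
g_input = np.concatenate([z, y], axis=1)       # shape (4, 18)

# Discriminator input: data (hypothetical "images" of dim 5) plus the same y.
fake_images = rng.normal(size=(batch, 5))
d_input = np.concatenate([fake_images, y], axis=1)   # shape (4, 15)
print(g_input.shape, d_input.shape)
```
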
Advanced Design for AI
Algorithms
Lec.: 2
Key Question in Evaluating GANs
▪ Evaluating GANs is very tricky
• Different metrics can lead to different trade-offs
• Different evaluations favor different models
▪ Key question: What is the task that you care about?
▪ Density Estimation
▪ Sampling/Generation
▪ Latent Representation Learning
▪ More than one task? E.g., Semisupervised learning, image translation, etc.
▪ Evaluation drives progress, but how do we evaluate generative models?
▪ An evaluation based on samples is biased towards models which overfit, and is therefore a poor
indicator of a good density model in the log-likelihood sense, which favors models with large
entropy. Conversely, a high likelihood does not guarantee visually pleasing samples: samples
can take on arbitrary forms while being only a few bits away from the optimum.
Problems in Evaluating GAN Samples
▪ Let's take evaluation of generated image quality as an example.
▪ Visual quality of images is highly subjective; there is no single definitive way to formalize it.
▪ We need a more or less reliable function to evaluate GAN-generated images.
▪ Q: How can such an evaluation function be formulated systematically?
▪ A: Apply statistics or information theory to measure saliency and diversity.
▪ Saliency vs Diversity
▪ Saliency (what is the object): the distribution of classes for any individual image should have
low entropy. One can think of it as a single high score with the rest very low.
▪ Diversity (how different from other objects): the overall distribution of classes across the
sampled data should have high entropy, which would mean the absence of dominating
classes and something closer to a well-balanced training set.
GAN Evaluation – Quality of Generation
▪ Which set “look” better?
▪ Human inspection is expensive, biased, and hard to reproduce.
▪ Generalization is hard to define and assess: memorizing the training set
would give excellent samples but is clearly undesirable.
▪ Quantitative evaluation of a qualitative task can have many answers.
▪ Popular metrics: Inception Score (IS), Fréchet Inception Distance (FID),
Kernel Inception Distance (KID).
Common GAN Evaluation Techniques
▪ IS: Inception Score
▪ The first method to evaluate the quality of GAN-generated samples.
▪ A higher IS score is better, corresponding to a higher KL divergence between the two distributions below.
▪ Higher IS values mean better image quality and diversity.
▪ Based on evaluating the generator's ability to produce images with:
▪ meaningful objects: the conditional label distribution p(y|x) has low entropy.
▪ diverse images: the marginal label distribution p(y) = ∫ p(y|x) p_g(x) dx has high entropy.
• FID uses an Inception network to extract features from an intermediate layer. It then models the
data distribution of those features using a multivariate Gaussian distribution with mean µ and
covariance Σ.
▪ Note: FID is more robust to noise and more sensitive to mode collapse compared to IS.
Inception Score Explained
▪ Assumption 1: We evaluate sample quality for GANs trained on labelled datasets
▪ Assumption 2: We have a good probabilistic classifier 𝑐(𝑦|𝑥) for predicting the label 𝑦 for any 𝑥
▪ We want samples from a good generative model to satisfy two criteria: sharpness and diversity
▪ Sharpness (𝑺):
▪ High sharpness implies classifier is confident in making predictions for generated images
▪ That is, classifier’s predictive distribution 𝑐(𝑦|𝑥) has low entropy
Inception Score Explained
▪ Diversity (𝑫):
▪ where 𝑐(𝑦) = 𝐸𝑥~𝑝 [𝑐(𝑦|𝑥)] is the classifier’s marginal predictive distribution. High diversity
implies c(y) has high entropy.
▪ Inception scores (IS) combine the two criteria of sharpness and diversity into a simple metric:
𝐼𝑛𝑐𝑒𝑝𝑡𝑖𝑜𝑛 𝑆𝑐𝑜𝑟𝑒 = 𝐷 × 𝑆
▪ Correlates well with human judgement in practice. If a classifier for the dataset is not available,
a classifier trained on a large dataset is used, e.g., Inception Net trained on the ImageNet dataset.
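Under the two assumptions above, the Inception Score can be sketched in a few lines of NumPy; the classifier outputs below are invented toy values, not real Inception predictions:

```python
import numpy as np

# Hypothetical classifier predictions c(y|x) for 4 generated images, 3 classes.
p_yx = np.array([
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.90, 0.05, 0.05],
])

# Marginal c(y): average predictive distribution over the generated samples.
p_y = p_yx.mean(axis=0)

# IS = exp( E_x [ KL( c(y|x) || c(y) ) ] ): sharp per-image predictions (low
# entropy) and a diverse marginal (high entropy) both push the score up.
kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
inception_score = np.exp(kl.mean())
print(inception_score)
```

The score is bounded between 1 and the number of classes.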
Frechet Inception Distance Explained
▪ Inception Scores only require samples from 𝑝𝜃 and do not take into account the desired data
distribution 𝑝𝑑𝑎𝑡𝑎 directly (only implicitly via a classifier).
▪ FID measures similarities in the feature representations (e.g., those learned by a pretrained
classifier) for data points sampled from 𝑝𝜃 and the test dataset
▪ Computing FID:
1. Let 𝐺 denote the generated samples and 𝑇 denote the test dataset
2. Compute feature representations 𝐹𝐺 and 𝐹𝑇 for 𝐺 and 𝑇 respectively (e.g., prefinal layer of Inception Net)
3. Fit a multivariate Gaussian to each of 𝐹𝐺 and 𝐹𝑇 . Let 𝜇, Σ denote mean and covariance of the Gaussian
▪ FID is defined as: FID = ‖μ_G − μ_T‖² + Tr(Σ_G + Σ_T − 2(Σ_G Σ_T)^{1/2})
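A minimal sketch of the FID computation under a simplifying assumption: only the diagonal of each covariance is kept, so the matrix square root reduces to an elementwise square root (the full FID uses the complete covariance matrices, e.g., with `scipy.linalg.sqrtm`). The features are random stand-ins for Inception activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature representations F_G and F_T (random stand-ins for Inception features).
F_G = rng.normal(loc=0.0, size=(1000, 8))
F_T = rng.normal(loc=0.5, size=(1000, 8))

mu_g, mu_t = F_G.mean(axis=0), F_T.mean(axis=0)
# Simplification: diagonal covariances only, so (S_G S_T)^{1/2} is elementwise.
var_g, var_t = F_G.var(axis=0), F_T.var(axis=0)

# FID = ||mu_G - mu_T||^2 + Tr(S_G + S_T - 2 (S_G S_T)^{1/2})
fid = np.sum((mu_g - mu_t) ** 2) + np.sum(var_g + var_t - 2.0 * np.sqrt(var_g * var_t))
print(fid)
```
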
▪ GILBO offers a data-independent measure of the complexity of the learned latent-variable
description, giving the log of the effective description length. The GILBO is entirely independent
of the true data, being purely a function of the generative joint distribution.
▪ GILBO gives different information than FID and can distinguish between GANs with the same
FID scores.
KID Kernel Inception Distance
▪ Maximum Mean Discrepancy (MMD) is a two-sample test statistic that compares samples from
two distributions p and q by computing differences in their moments (mean, variances etc.)
▪ Key idea: Use a suitable kernel e.g., Gaussian to measure similarity between points
▪ Intuitively, MMD is comparing the “similarity” between samples within 𝑝 and 𝑞 individually to the
samples from the mixture of 𝑝 and 𝑞
▪ Kernel Inception Distance (KID): compute the MMD in the feature space of a classifier (e.g.,
Inception Network)
▪ FID vs. KID
▪ FID is biased (can only be positive), KID is unbiased
▪ FID can be evaluated in O(n) time; KID evaluation requires O(n²) time
KID Kernel Inception Distance
▪ FID is meaningful even for non-ImageNet datasets, but its estimator is extremely biased (with tiny variance).
▪ KID is defined as the MMD between Inception hidden-layer activations.
▪ It uses the default polynomial kernel k(x, y) = ((1/d) xᵀy + 1)³, where d is the feature dimension.
▪ The estimator is unbiased and reasonable even with few samples.
▪ KID can be used for automatic learning-rate adaptation, e.g., via a three-sample MMD test.
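A sketch of KID as an unbiased MMD² estimate with the default polynomial kernel; the feature arrays are random stand-ins for Inception activations:

```python
import numpy as np

rng = np.random.default_rng(0)

def poly_kernel(X, Y):
    # Default KID kernel: k(x, y) = (x.y / d + 1)^3, d = feature dimension.
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** 3

def kid(X, Y):
    # Unbiased MMD^2 estimate between feature sets X and Y: within-set kernel
    # means (excluding the diagonal self terms) minus twice the cross term.
    m, n = len(X), len(Y)
    Kxx, Kyy, Kxy = poly_kernel(X, X), poly_kernel(Y, Y), poly_kernel(X, Y)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * Kxy.mean()

# Random stand-ins for Inception activations of generated vs. real images.
F_gen = rng.normal(loc=0.0, size=(200, 16))
F_real = rng.normal(loc=0.3, size=(200, 16))
print(kid(F_gen, F_real))
```

Note the O(n²) cost: every pairwise kernel value is computed, in contrast to FID's O(n) moment estimates.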
Best Practices on Evaluating Sample Quality
1. Spend time tuning your
baselines (architecture, learning
rate, optimizer etc.). Be amazed
(rather than dejected) at how
well they can perform
2. Fix random seeds for
reproducibility
3. Report results averaged over
multiple random seeds along
with confidence intervals
Selected Papers for Reading
1. A Note on Evaluation of Generative Models, ICLR 2016
2. Evaluation of Generative Networks Through Their Data Augmentation Capacity, ICLR 2018
3. A Note on Inception Score, ICLR 2018
4. An empirical study on evaluation metrics of generative adversarial networks, ICLR 2018
5. Pros and Cons of GAN Evaluation Measures, Arxiv 2018
6. GILBO: One Metric to Measure Them All, NIPS 2018
7. Geometry Score: A Method For Comparing Generative Adversarial Networks, ICML 2018
Advanced Design for AI
Algorithms
Applications of GAN for Images
Outline
1. Application of GANs
2. Image Generation
3. Conditional GANs
4. Unsupervised Conditional GANs
5. GANS and Reinforcement Learning
How GANs Work
[Figure: a generator network G maps z, drawn from a normal distribution, to x = G(z), where x is an image (a high-dimensional vector); training minimizes a distance or divergence between the generated distribution P_G(x) and the data distribution P_data(x).]
Why This Course is Important: GAN Zoo
GAN
ACGAN
BGAN
CGAN
DCGAN
EBGAN
fGAN
GoGAN
……
[Figure: a generator network G maps noise z and condition c to x = G(z, c), an image (a high-dimensional vector) per condition c, e.g., the text "a bird is flying"; with a plain regression target for the NN output, the result is a blurry image!]
Conditional GANs
▪ Conditional GANs Approach [Scott Reed, et al., ICML, 2016]
[Figure: with condition c (e.g., "train") and z drawn from a normal distribution, the generator G produces an image x = G(c, z).]
Conditional GANs – Image to Image
▪ Traditional Supervised Approach
[Figure: an NN maps the input image to an output trained to be as close as possible to the target, e.g., with an L1 loss; at testing time the result is blurry.]
Conditional GANs – Image to Image
▪ Conditional GANs Approach [Phillip Isola, et al., CVPR, 2017]
[Figure: G takes the input image and noise z and produces an output image; D scores the (input, output) pair with a scalar, combined with an L1 loss. Testing results are shown as a video.]
Conditional GANs – Audio to Image
▪ Conditional GANs Approach for Audio to Image
[Figure: generated images as the input audio becomes louder.]
Conditional GANs – Image to Label
▪ Multi-Label Image Classifier
[Figure: input condition and generated output.]
Conditional GANs – Image to Label
▪ The classifiers can have different architectures.
▪ The classifiers are trained as conditional GANs.
▪ The conditional GAN outperforms other models designed for multi-label classification.
[Tsai, et al., submitted to ICASSP 2019]
Domain Adversarial Training
▪ Training and testing data are in different domains; take digit classification as an example. A feature extractor (generator) maps both the training and testing data into a common feature space.
▪ A discriminator (domain classifier) tries to separate the two domains (blue points vs. red points). Without further constraints, the feature extractor could cheat by always outputting zero vectors, so a label predictor is added.
▪ The feature extractor (generator) therefore feeds both a label predictor ("which digit?") and a domain classifier ("which domain?"), given an input image.
▪ Successfully applied to image classification [Ganin et al., ICML, 2015][Ajakan et al., JMLR, 2016].
Unsupervised Conditional GAN
▪ Transform between Domain X and Domain Y without paired data. Two broad approaches: direct transformation with a generator G_{X→Y}, or projection to a common space with an encoder EN_X and a decoder DE_Y.
Direct Transformation
▪ G_{X→Y} makes images from Domain X become similar to Domain Y; a discriminator D_Y outputs a scalar indicating whether its input image belongs to Domain Y or not.
Direct Transformation
▪ Problem: G_{X→Y} could ignore its input and produce any image that fools D_Y.
▪ The issue can be avoided by network design: a simpler generator makes the input and output more closely related. [Tomer Galanti, et al., ICLR, 2018]
Direct Transformation
▪ Alternatively, a pre-trained encoder network embeds the input and the output, and the two embeddings are constrained to be as close as possible while D_Y judges domain membership.
▪ Or add a second generator G_{Y→X} to reconstruct the input from the output; without such a constraint there is a lack of information for reconstruction. [Jun-Yan Zhu, et al., ICCV, 2017]
Direct Transformation – Cycle GAN
▪ Cycle consistency: G_{Y→X}(G_{X→Y}(x)) should be as close as possible to x, and G_{X→Y}(G_{Y→X}(y)) as close as possible to y.
▪ Related models: Disco GAN [Taeksoo Kim, et al., ICML, 2017] and Dual GAN; for multiple domains, consider StarGAN [Yunjey Choi, arXiv, 2017].
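Cycle consistency can be sketched with toy 1-D "generators" (hypothetical invertible maps standing in for trained networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "domains": linear maps standing in for the two generators.
def G_xy(x): return 2.0 * x + 1.0          # G_{X->Y}
def G_yx(y): return (y - 1.0) / 2.0        # G_{Y->X}: here the exact inverse

x = rng.normal(size=(5,))
y = rng.normal(size=(5,))

# Cycle-consistency: x -> G_{X->Y}(x) -> G_{Y->X}(...) should return to x,
# and symmetrically for y. CycleGAN adds this L1 term to the two GAN losses.
cycle_loss = np.mean(np.abs(G_yx(G_xy(x)) - x)) + np.mean(np.abs(G_xy(G_yx(y)) - y))
print(cycle_loss)  # ~0 here, since the toy generators invert each other
```

For real generators the cycle term is non-zero and is weighted against the adversarial losses.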
Projection to Common Space
▪ Domain X and Domain Y are each encoded (EN_X, EN_Y) into a common latent space carrying the attribute, then decoded (DE_X, DE_Y).
▪ Training: minimize the reconstruction error of per-domain autoencoders (image → EN_X → DE_X → image), together with a discriminator D_X of the X domain.
▪ Cycle Consistency: used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017].
▪ Semantic Consistency: encode both domains to the same latent space; used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017].
Basic Components
▪ An actor interacts with an environment and a reward function; you cannot control the environment or the reward function (e.g., in Go, the rules of the game).
Neural Network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g., pixels).
• Output of the neural network: each action corresponds to a neuron in the output layer (e.g., left 0.7, …, fire 0.1).
Actor, Environment, Reward
▪ The actor observes state s_1, takes action a_1 (e.g., "right"), observes s_2, takes a_2 (e.g., "fire"), and so on, producing a trajectory
τ = {s_1, a_1, s_2, a_2, ⋯, s_T, a_T}
Reinforcement Learning vs. GAN
▪ The actor is the component being updated, like a generator; the reward function is fixed and plays the role of the discriminator. The total reward of a trajectory is R(τ) = Σ_{t=1}^{T} r_t.
Imitation Learning
▪ The actor interacts with the environment, but there is no explicit reward; instead we observe expert demonstrations τ̂_1, τ̂_2, ⋯, τ̂_N.
Inverse Reinforcement Learning
▪ From the expert trajectories, learn a reward function R such that R(τ̂_n) > R(τ) for the actor's own trajectories τ_1, τ_2, ⋯, τ_N; then find an optimal actor by reinforcement learning based on the learned reward function R.
▪ Analogy with GAN: Actor π ↔ Generator; Reward Function ↔ Discriminator D. The discriminator gives a high score to real and a low score to generated samples; in IRL, the reward function gives a larger reward for the expert's τ̂_n and a lower reward for the actor's τ.
▪ Regression Task: a generator G produces the output and is trained with an objective function.
▪ GAN Models for Speech, Speaker, and Emotion Recognition & Lip Reading
▪ Classification Task: an encoder E computes an embedding z = g(x), and a classifier h(·) outputs the label y. There is an acoustic mismatch between the training data x and test data affected by channel distortion, accented speech, or noisy data.
Speech Enhancement
▪ Speech Enhancement using GAN [Pascual et al., Interspeech 2017]
[Figure: the SEGAN generator takes noisy speech and noise z and outputs enhanced speech.]
▪ Neural network models for spectral mapping. Model structures of G: DNN [Wang et al., NIPS 2012; Xu et al., SPL 2014], DDAE [Lu et al., Interspeech 2013], RNN (LSTM) [Chen et al., Interspeech 2015; Weninger et al., LVA/ICA 2015], CNN [Fu et al., Interspeech 2016].
▪ Typical objective functions: mean square error (MSE) [Xu et al., TASLP 2015], L1 [Pascual et al., Interspeech 2017], likelihood [Chai et al., MLSP 2017], STOI [Fu et al., TASLP 2018].
▪ GAN is used as a new objective function to estimate the parameters in G.
Speech Enhancement (SEGAN)
▪ Experimental Result
Objective Evaluation Results Subjective Evaluation Results
SEGAN yields better speech enhancement results than Noisy and Wiener.
Speech Enhancement
▪ Pix2Pix [Michelsanti et al., Interspeech 2017]
[Figure: G maps the noisy spectrogram to an output; D compares the (clean, noisy) pair against the (output, noisy) pair and outputs a scalar (fake/real).]
Speech Enhancement (Pix2Pix)
▪ Spectrogram comparison of Pix2Pix with baseline methods (NG-DNN, STAT-MMSE).
▪ Pix2Pix outperforms STAT-MMSE and is competitive with DNN-SE.
Speech Enhancement (Pix2Pix)
▪ Objective evaluation and speaker verification test: from the PESQ and STOI evaluations, Pix2Pix outperforms Noisy and MMSE and is competitive with DNN-SE.
Speech Enhancement
▪ Frequency-domain SEGAN (FSEGAN) [Donahue et al., ICASSP 2018]
[Figure: G maps the noisy spectrum to an output; D compares the (clean, noisy) pair against the (output, noisy) pair and outputs a scalar (fake/real).]
Speech Enhancement (FSEGAN)
▪ FSEGAN ASR results; spectrogram comparison of FSEGAN with the L1-trained method.
Speech Enhancement
• GAN for spectral magnitude mask estimation (MMS-GAN) [Pandey et al., ICASSP 2018]
[Figure: G estimates an output mask from noisy input; D compares the (reference mask, noisy) pair against the (output mask, noisy) pair and outputs a scalar (fake/real).]
Speech Enhancement (MMS-GAN)
• Speech enhancement results of MMS estimated by MSE (DNN) and by GAN (GAN) and for 2
unseen noises at 3 different SNR conditions.
[Figure: a mask generator G_Mask separates noisy speech into estimated speech and estimated noise; a discriminator D_S judges estimated speech against true speech (from clean speech data) and D_N judges estimated noise against true noise (from noise data). The adversarial objective is
V_Mask = E_{s_fake}[log(1 − D_S(s_fake, θ))] + E_{n_fake}[log(1 − D_N(n_fake, θ))]]
Speech Enhancement (ATME)
▪ Spectrogram comparison of (a) noisy; (b) MMSE with supervision; (c) ATME without supervision.
[Figure: speech mask and noise mask M^n_{f,t}; cycle generators G_{S→T} and G_{T→S} map noisy → enhanced → noisy, with the reconstruction trained to be as close as possible to the input.]
[Figure: training covers noise types N4, N5, N7, N9, N10, N11, N12, with unseen noises at test time. Encoder E and generator G are trained to minimize the reconstruction error V_y:
θ_G ← θ_G − ε ∂V_y/∂θ_G,  θ_E ← θ_E − ε ∂V_y/∂θ_E]
Speech Enhancement (NA-SE)
▪ Domain adversarial training for NA-SE: on top of the reconstruction path (encoder E, generator G, trained on noise types N4, N5, N7, N9, N10, N11, N12 with unseen noises at test time), a discriminator D predicts the speaker/domain from the embedding z (output 2).
Model update:
θ_G ← θ_G − ε ∂V_y/∂θ_G (minimize reconstruction error)
θ_D ← θ_D − ε ∂V_z/∂θ_D (maximize domain accuracy)
θ_E ← θ_E − ε (∂V_y/∂θ_E + α ∂V_z/∂θ_E) (minimize reconstruction error and minimize domain accuracy)
Speech Enhancement (NA-SE)
▪ Objective evaluations
PESQ at different SNR levels.
Postfilter
▪ A postfilter G is applied to the output of a speech synthesizer or a speech enhancement system.
▪ Conventional postfilter approaches for G estimation include global variance (GV) [Toda et al., IEICE 2007], variance scaling (VS) [Sil'en et al., Interspeech 2012], modulation spectrum (MS) [Takamichi et al., ICASSP 2014], and DNN with MSE criterion [Chen et al., Interspeech 2014; Chen et al., TASLP 2015].
▪ GAN is used as a new objective function to estimate the parameters in G.
Postfilter
▪ GAN postfilter [Kaneko et al., ICASSP 2017]
[Figure: G maps synthesized Mel-cepstral coefficients to generated ones; D judges natural versus generated Mel-cepstral coefficients.]
▪ Results: Mel-cepstral trajectories (GANv: GAN applied in the voiced part); average difference in modulation spectrum per Mel-cepstral coefficient.
▪ Preference score (%), with bold font indicating numbers over 30%. The proposed algorithm works for both spectral parameters and F0.
Speech Synthesis
▪ Speech synthesis with a GAN glottal waveform model (GlottGAN) [Bollepalli et al., Interspeech 2017]
[Figure: G maps acoustic features to generated glottal waveform parameters; D judges generated versus natural speech parameters.]
Speech Synthesis (GlottGAN)
▪ Objective Evaluation
Glottal pulses generated by GANs.
G, D: DNN
G, D: conditional DNN
G, D: Deep CNN
The proposed GAN-based approach can generate glottal waveforms similar to the natural ones.
Speech Synthesis
▪ Speech synthesis with GAN & multi-task learning (SS-GAN-MTL) [Yang et al., ASRU 2017]
Speech Synthesis (SS-GAN-MTL)
Speech Synthesis (SS-GAN-MTL)
▪ Objective and subjective evaluations
Voice Conversion
▪ A generator G converts source-speaker speech and is trained with an objective function.
▪ Conventional VC approaches include Gaussian mixture model (GMM) [Toda et al., TASLP 2007], non-negative matrix factorization (NMF) [Wu et al., TASLP 2014; Fu et al., TBME 2017], locally linear embedding (LLE) [Wu et al., Interspeech 2016], restricted Boltzmann machine (RBM) [Chen et al., TASLP 2014], feed-forward NN [Desai et al., TASLP 2010], and recurrent NN (RNN) [Nakashika et al., Interspeech 2014].
Voice Conversion
▪ VAW-GAN [Hsu et al., Interspeech 2017]
[Figure: G converts source-speaker speech toward the target speaker; D judges real or fake.]
▪ VAW-GAN outperforms VAE in terms of objective and subjective evaluations, generating more structured speech.
Voice Conversion
• Sequence-to-sequence VC with a learned similarity metric (LSM) [Kaneko et al., Interspeech 2017]
[Figure: a converter C maps source-speaker speech x toward target-speaker speech y; a generator G with noise z and a discriminator D judge real or fake; the similarity metric is computed in D's feature space:
V^l_SVC(C, D) = (1/M_l) ‖D_l(y) − D_l(C(x))‖²_2
V(C, G, D) = V^l_SVC(C, D) + V_GAN(C, G, D)]
Voice Conversion (LSM)
▪ Spectrogram Analysis
Comparison of MCCs (upper) and STFT spectrograms (lower).
Source Target FVC MSE(S2S) LSM(S2S)
The spectral textures of LSM are more similar to the target ones.
Voice Conversion (LSM)
▪ Subjective evaluations
Preference scores for naturalness. Similarity of TGT and SRC with VCs.
[Figure: cycle generators G_{S→T} and G_{T→S} map between source and target speakers, with the cycle reconstructions (target → syn. source → target) as close as possible to the inputs. The full objective is
V_Full = V_GAN(G_{X→Y}, D_Y) + V_GAN(G_{Y→X}, D_X) + λ V_Cyc(G_{X→Y}, G_{Y→X})]
Voice Conversion
• Subjective evaluations: MOS for naturalness, and similarity to the source and target speakers (S: Source; T: Target; P: Proposed; B: Baseline).
[Figure: an encoder Enc and decoder Dec map between target-speaker and source-speaker speech x; a combined discriminator and classifier D+C judges fake/real (F/R) and speaker ID of the generator output against real data y.]
Voice Conversion
▪ Controller-generator-discriminator VC on Impaired Speech [Li-Wei Chen, Yu Tsao,
Hung-Yi Lee, arXiv 2018]
The proposed method outperforms conditional GAN and CycleGAN in terms of content similarity,
speaker similarity, and articulation.
Outline
1. Speech Enhancement
2. Postfilter, speech synthesis, voice conversion
3. Speech Signal Recognition
4. Conclusion
Speech, Speaker, Emotion Recognition & Lip Reading
▪ Classification Task GAN model: an encoder E computes an embedding z = g(x), and a classifier h(·) outputs the label y. There is an acoustic mismatch between the training data x and test data affected by channel distortion, accented speech, or noisy data.
Speech Recognition
▪ Adversarial multi-task learning (AMT) [Shinohara, Interspeech 2016]
[Figure: an encoder E maps the acoustic feature x to an embedding; a classifier G predicts the senone y (output 1) and a domain classifier D predicts the domain z (output 2), connected to E through a gradient reversal layer (GRL).]
Objective functions:
V_y = − Σ_i log P(y_i | x_i; θ_E, θ_G)
V_z = − Σ_i log P(z_i | x_i; θ_E, θ_D)
Model update:
θ_G ← θ_G − ε ∂V_y/∂θ_G (maximize classification accuracy)
θ_D ← θ_D − ε ∂V_z/∂θ_D (maximize domain accuracy)
θ_E ← θ_E − ε (∂V_y/∂θ_E + α ∂V_z/∂θ_E) (maximize classification accuracy and minimize domain accuracy; the GRL reverses the sign of the domain gradient)
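These updates can be sketched end-to-end with a toy linear model; here the gradient reversal is written explicitly as a minus sign on the domain term (the slide folds it into the GRL module), and all shapes, values, and the binary labels are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy AMT sketch: linear encoder E, class head G, domain head D on 2-D inputs.
W_E = rng.normal(size=(2, 2)) * 0.1
w_G = rng.normal(size=2) * 0.1    # class (senone) head
w_D = rng.normal(size=2) * 0.1    # domain head
eps, alpha = 0.1, 1.0

def sigmoid(t): return 1.0 / (1.0 + np.exp(-t))

x = rng.normal(size=2)
y, z = 1.0, 0.0                   # class label and domain label

# Forward: embedding, class probability, domain probability.
e = W_E @ x
p_y = sigmoid(w_G @ e)
p_z = sigmoid(w_D @ e)

# Gradients of V_y = -log P(y|x) and V_z = -log P(z|x) (binary cross-entropy).
dVy_dwG = (p_y - y) * e
dVz_dwD = (p_z - z) * e
dVy_dWE = np.outer((p_y - y) * w_G, x)
dVz_dWE = np.outer((p_z - z) * w_D, x)

# Updates: G and D descend their own losses; E descends V_y while the gradient
# reversal flips the domain term, making the embedding domain-invariant.
w_G -= eps * dVy_dwG
w_D -= eps * dVz_dwD
W_E -= eps * (dVy_dWE - alpha * dVz_dWE)   # GRL: minus sign on the domain term
print(p_y, p_z)
```
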
Speech Recognition (AMT)
▪ ASR results in known (k) and unknown (unk) noisy conditions.
[Figure: the same E/G/D model-update diagram as above, trained on noisy and clean data.]
Speech Recognition (GAN-Enhancer)
▪ ASR results on far-field speech: WER of the GAN enhancer and the baseline methods.
[Figure: DANN pre-processing is applied to both the enrollment i-vector and the test i-vector before scoring.]
Speaker Recognition (DANN)
▪ Recognition results under domain-mismatched conditions: performance of DAT and the state-of-the-art methods.
[Figure: a generator synthesizes (Syn.) training data from the original training data; an encoder E produces the embedding z = g(x) feeding the classifier h(·), with the same E/G/D model-update rules as in the AMT slide.]
More GANs in Speech
Diagnosis of Autism Spectrum
▪ Deng et al., Speech-based Diagnosis of Autism Spectrum Condition by Generative Adversarial Network Representations, ACM DH, 2017.
Emotion Recognition
▪ Chang et al., Learning Representations of Emotional Speech with Deep Convolutional Generative Adversarial Networks, ICASSP, 2017.
Robust ASR
▪ Serdyuk et al., Invariant Representations for Noisy Speech Recognition, arXiv, 2016.
Speaker Verification
▪ Hong Yu, Zheng-Hua Tan, Zhanyu Ma, and Jun Guo, Adversarial Network Bottleneck Features for Noise Robust Speaker Verification, arXiv, 2017.
Welcome to Data Science Center UI
What You Will Learn Today
▪ GANs Applications for Vision
▪ GANs Applications for NLP
▪ GANs Application for Speech
▪ Reproduction of Some Models
Training data: human dialogues, e.g., A: "How are you?" B: "I'm good.", modeled by a seq2seq encoder-decoder that maps input sentence c to a response.
Responses receive rewards, e.g., "Bye bye ☺" → −10, "Hi ☺" → 3.
• The chatbot learns to maximize the expected reward.
Maximizing Expected Reward
▪ Learn to maximize the expected reward by policy gradient: for input sentence c and response sentence x, a human assigns reward R(c, x).
θ^{t+1} ← θ^t + η ∇R̄_{θ^t}
∇R̄_{θ^t} ≈ (1/N) Σ_{i=1}^{N} R(c^i, x^i) ∇ log P_{θ^t}(x^i | c^i)
▪ If R(c^i, x^i) is positive, θ is updated to increase P_θ(x^i | c^i); if R(c^i, x^i) is negative, θ is updated to decrease P_θ(x^i | c^i).
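The policy-gradient update can be sketched with a toy categorical "chatbot" over three canned responses; the rewards and the model are invented stand-ins for a real seq2seq policy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy: logits theta over 3 canned responses; rewards R(c, x) per response.
theta = np.zeros(3)
rewards = np.array([-10.0, 3.0, 1.0])
eta = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):
    p = softmax(theta)
    x = rng.choice(3, p=p)              # sample a response from the policy
    # REINFORCE: theta <- theta + eta * R(c, x) * grad log P_theta(x)
    grad_log_p = -p
    grad_log_p[x] += 1.0
    theta += eta * rewards[x] * grad_log_p

# Probability mass should have moved away from the heavily penalized response.
print(softmax(theta))
```
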
Comparison
▪ Objective function. Maximum Likelihood: (1/N) Σ_{i=1}^{N} log P_θ(x̂^i | c^i). Reinforcement Learning: (1/N) Σ_{i=1}^{N} R(c^i, x^i) log P_θ(x^i | c^i).
▪ Gradient. Maximum Likelihood: (1/N) Σ_{i=1}^{N} ∇ log P_θ(x̂^i | c^i). Reinforcement Learning: (1/N) Σ_{i=1}^{N} R(c^i, x^i) ∇ log P_θ(x^i | c^i).
▪ Training data. Maximum Likelihood: {(c^1, x̂^1), …, (c^N, x̂^N)}. Reinforcement Learning: {(c^1, x^1), …, (c^N, x^N)}, obtained from interaction.
▪ Maximum likelihood is the special case where R(c^i, x̂^i) = 1; reinforcement learning weights each example by R(c^i, x^i).
Conditional GAN
[Figure: a discriminator judges whether the pair (input sentence c, response sentence x) is real or fake, i.e., whether it comes from human dialogues; its output serves as the "reward". [Li, et al., EMNLP, 2017]]
Algorithm
• Training data: Pairs of conditional input c and response x
• Initialize generator G (chatbot) and discriminator D
• In each iteration:
• Sample input c and response x from the training set
• Sample input c′ from the training set, and generate response x̃ = G(c′)
• Update D to increase D(c, x) and decrease D(c′, x̃)
• Update generator G (chatbot) to increase D(c′, x̃)
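The training schedule above can be sketched with toy stand-ins (integers for sentences, tables for G and D); this illustrates the update schedule only, not a working chatbot, and the G update is omitted since discrete text needs the RL machinery discussed next:

```python
import random

random.seed(0)

# Toy stand-ins: conditions and responses are ints; the "correct" response is c % 2.
train_set = [(c, c % 2) for c in range(10)]

def G(c):                       # hypothetical generator: random response
    return random.choice([0, 1])

D_scores = {}                   # D(c, x): running score per (condition, response)

def update_D(c, x, c2, x_fake):
    # Increase D(c, x) for real pairs, decrease D(c', x~) for generated pairs.
    D_scores[(c, x)] = D_scores.get((c, x), 0.0) + 1.0
    D_scores[(c2, x_fake)] = D_scores.get((c2, x_fake), 0.0) - 1.0

for _ in range(100):            # each iteration of the algorithm
    c, x = random.choice(train_set)          # sample a real (c, x) pair
    c2, _ = random.choice(train_set)         # sample another condition c'
    x_fake = G(c2)                           # generate a response with G
    update_D(c, x, c2, x_fake)

# Real pairs accumulate positive score; random pairs drift negative on average.
real_avg = sum(D_scores.get((c, x), 0.0) for c, x in train_set) / len(train_set)
print(real_avg)
```
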
[Figure: the chatbot (seq2seq encoder En, decoder De) generates a response to input c; the discriminator scores (c, response) with a scalar, and the chatbot is updated to raise that scalar. Because sampling discrete words A/B blocks backpropagation, the word distributions can be fed to the discriminator instead, avoiding the sampling process: we can do backpropagation now.]
What is the problem?
• A real sentence is a sequence of one-hot vectors, while the generated inputs are dense word distributions; the discriminator can immediately find the difference.
[Figure: the chatbot (En, De) is updated from the discriminator's scalar.]
▪ Using the discriminator output D(c, x) as the reward:
θ^{t+1} ← θ^t + η ∇R̄_{θ^t}
∇R̄_{θ^t} ≈ (1/N) Σ_{i=1}^{N} D(c^i, x^i) ∇ log P_{θ^t}(x^i | c^i)
▪ When D(c^i, x^i) is large, θ is updated to increase P_θ(x^i | c^i); when D(c^i, x^i) is small, θ is updated to decrease P_θ(x^i | c^i).
Reward for Every Generation Step
∇R̄_θ ≈ (1/N) Σ_{i=1}^{N} D(c^i, x^i) ∇ log P_θ(x^i | c^i)
▪ Example: c^i = "What is your name?", x^i = "I don't know".
log P_θ(x^i | c^i) = log P(x_1^i | c^i) + log P(x_2^i | c^i, x_1^i) + log P(x_3^i | c^i, x_{1:2}^i)
▪ A whole-sentence reward weights every step equally, including P("I" | c^i), even when only part of the sentence is bad; assigning a reward to every generation step addresses this.
• However, no strong evidence shows that GANs are better than MLE.
• [Stanislau Semeniuta, et al., arXiv, 2018] [Guy Tevet, et al., arXiv, 2018] [Massimo Caccia, et al., arXiv,
2018]
More Applications
• Supervised machine translation [Wu, et al., arXiv
2017][Yang, et al., arXiv 2017]
Direct Transformation [Lee, et al., ICASSP, 2018]
[Figure: a cycle GAN over text: G_{X→Y} and G_{Y→X}, with the cycle reconstructions as close as possible to the inputs; because sentences are discrete, word embeddings are used.]
(Thanks to 王耀賢 for providing the experimental results.)
Cycle GAN
✘ Negative sentence to positive sentence:
it's a crappy day -> it's a great day
i wish you could be here -> you could be here
it's not a good idea -> it's good idea
i miss you -> i love you
i don't love you -> i love you
[Lee, et al.,
i can't do that -> i can do that ICASSP, 2018]
i feel so sad -> i happy
it's a bad day -> it's a good day
it's a dummy day -> it's a great day
sorry for doing such a horrible thing -> thanks for doing a
great thing
my doggy is sick -> my doggy is my doggy
my little doggy is sick -> my little doggy is my little doggy
(Thanks to 張瓊之 for providing the experimental results.)
Training Data
▪ Supervised abstractive summarization maps a document to summaries (summary 1, summary 2, summary 3, in its own words) with a seq2seq model; we need lots of labelled training data.
Unsupervised Abstractive Summarization
• Now a machine can do abstractive summarization with seq2seq (writing summaries in its own words), mapping a document to summaries.
▪ Treat documents as Domain X and summaries as Domain Y [Wang, et al., EMNLP, 2018].
Unsupervised Abstractive Summarization
[Figure: a seq2seq generator G maps a document to a word sequence ("summary?"); a discriminator D judges whether the word sequence looks like a summary; a second seq2seq model R reconstructs the document from the word sequence, minimizing the reconstruction error.]
Unsupervised Abstractive Summarization
▪ Only a large collection of documents is needed to train the model.
▪ This is a seq2seq2seq auto-encoder that uses a sequence of words as the latent representation; without further constraints, that latent "summary" is not readable.
▪ A discriminator D sees human-written summaries as real, and G is trained so that D considers its output real, which makes the summary readable; the REINFORCE algorithm deals with the discrete issue.
Experimental Results
English Gigaword (document title as summary):

Method                          ROUGE-1   ROUGE-2   ROUGE-L
Supervised                      33.2      14.2      30.5
Trivial                         21.9      7.7       20.5
Unsupervised (matched data)     28.1      10.0      25.4
Unsupervised (no matched data)  27.2      9.1       24.1

• Matched data: using the titles of English Gigaword to train the discriminator.
• No matched data: using the titles of CNN/Daily Mail to train the discriminator.
Semi-supervised Learning
[Figure: ROUGE-1 (25-34) versus the number of document-summary pairs used (0, 10k, 500k), using matched data; the semi-supervised curve stays above the unsupervised baseline. WGAN and REINFORCE are the approaches used to deal with the discrete issue; 3.8M pairs are used for the supervised model.]
(Thanks to 王耀賢 for providing the experimental results.)
Unsupervised Speech Recognition
[Figure: unpaired phoneme-pattern sequences (e.g., p1 p3 p2; p1 p4 p3 p5 p5; p1 p5 p4 p3) are matched by a GAN to phoneme transcriptions (e.g., "G UH D B AY", "HH AW AA R Y UW", "AY M F AY N", "T AY W AA N"), in contrast to the supervised setting. Audio: TIMIT; Text: WMT. Results compare Supervised, Unsupervised (Gumbel-softmax), and Unsupervised (WGAN-GP).]