sure that the semantic characteristics are well presented and captured by the image generator, we use a captioning model to translate the generated images into text descriptions.

Although the Inception score [7] is widely used to measure image quality in multimodal datasets, it cannot show how diverse the generated image samples are. We therefore propose the image histogram, which takes image brightness and colour into account, to measure the diversity of the model distribution. Contrasting the training set with the generated samples effectively illustrates the diversity of the modes captured in the model distribution.

The main contributions of this paper are three-fold: (1) We propose SuperGAN, a novel image synthesis model with a cycle architecture that adds an image captioning component, which remedies a major problem of GANs: mode collapse. (2) Instead of the Inception score, we introduce a new metric that measures the variety of samples using image histograms. (3) Extensive experiments show that our framework not only yields an image generator that synthesizes diverse image representations but also preserves image quality.

Fig. 3. Image captioning model structure (a CNN feature encoder followed by an RNN language decoder). During training of our proposed SuperGAN, we take the synthetic images as input, extract image features via a convolutional neural network, and generate captions with an RNN language decoder.

2. RELATED WORK

2.1. Image synthesis

Existing text-to-image models further exploit manifold interpolation and attention mechanisms to achieve better performance.

We adopted StackGAN [4] as our image synthesis model, since multi-stage image generation has shown its superiority in producing high-resolution images while preserving the text semantics. Similar to StackGAN, the Laplacian pyramid framework (LAPGAN) [8] refines and synthesizes high-resolution image details in multiple stages. Inspired by the multi-stage approach, the cascaded refinement network [9] synthesizes high-resolution images from semantic maps.
2.2. Image captioning

The additional component in our proposed cycle architecture is an image captioning model. Before recurrent neural networks were used as decoders to generate captions for images [10, 11], image captioning was regarded as a retrieval task: given an image, fetch the related text description. There are Long Short-Term Memory (LSTM) based methods that condition on features extracted by convolutional neural networks to generate text descriptions, trained on a number of captioned-image datasets [12, 13].

[Fig. 2 (architecture diagram): stage 1 pairs Generator G1 with Discriminator D1, and stage 2 pairs Generator G2 with Discriminator D2. The Skip-thought embedding ϕC of a text description ("This flower has yellow petals as well as a pink pistil") is concatenated with noise z and up-sampled into fake images; each discriminator down-samples fake and real images and outputs a sigmoid probability, with residual blocks used in stage 2.]
of GANs which is able to learn meaningful disentangled representations. Instead of simply identifying the real and fake samples, the discriminator must also recognize which generator is incapable of producing realistic samples, forcing those generators to learn identifiable modes.

3. METHODOLOGY

Our approach trains a text-to-image-to-text closed loop and introduces a cycle-consistency loss that forces the text-to-image model to capture more of the diverse modes in the data distribution. As shown in Figure 1, the cycle-consistency loss compares the original text input with the generated captions. To stabilize the overall training procedure, we pretrain the captioning model so that the supervision it provides is accurate throughout the cycle training process. In this section, we describe each component of our cycle architecture and the image diversity metric.

3.1. Image synthesis model

Image synthesis is the primary task in our framework. We adopt an image synthesis model very similar to StackGAN [4], shown in Figure 2, which decomposes the text-to-image generation process into two phases.

In the first phase, the generator G1 produces a low-resolution image containing a rough sketch, and the discriminator D1 estimates the probability that a given image is real. The text and image data distributions are denoted C ∼ pdata(C) and I ∼ pdata(I) respectively. We use the Skip-thought embedding to embed the text description C into ϕC. Random noise z ∼ pz(z), sampled from a normal distribution, is then concatenated with the text embedding, and the generator takes the joint vector as input to synthesize images Î1 ← G1(ϕC, z). The discriminator D1 distinguishes the generated images Î1 from real ones; it also takes the original text description as input to check whether the text matches the generated images. Rather than considering images alone, the model thus benefits from the supervision carried by the corresponding text descriptions. In other words, G1 and D1 can be formulated as a minimax problem, min_G1 max_D1 Limage1(G1, D1), over the loss function

Limage1 = E_{I,C ∼ pdata(I,C)} [log D1(I, ϕC)] + E_{z ∼ pz(z), C ∼ pdata(C)} [log(1 − D1(G1(ϕC, z), ϕC))]. (2)

In order to enable the output images to contain more details corresponding to the given caption and to avoid shape distortion, we use the second stage to reconstruct and polish the sketches generated in the first stage. Another pair of generator and discriminator is introduced, denoted G2 and D2 respectively. The generator G2 conditions on the low-resolution images Î1 generated in the first phase and on the text description ϕC to generate more detailed images Î2. The discriminator D2 tries to determine how well the generated images Î2 match the corresponding captions ϕC. Let Î1 ∼ pG1(Î1) denote the model distribution of stage 1. As in the first stage, the generator G2 and the discriminator D2 alternately minimize and maximize the loss function Limage2:

Limage2 = E_{I,C ∼ pdata(I,C)} [log D2(I, ϕC)] + E_{Î1 ∼ pG1(Î1), C ∼ pdata(C)} [log(1 − D2(G2(ϕC, Î1), ϕC))]. (3)

In order to simplify the image synthesis process and to sharpen the contrast between the cycle model and the single image-generation model (the non-cycle model), we omit several techniques proposed in StackGAN, e.g. conditioning augmentation and sentence-embedding interpolation.
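For illustration, the following PyTorch sketch computes the two adversarial losses above. The tiny linear G1/D1/G2/D2 modules merely stand in for the actual up-/down-sampling networks of Figure 2, images are treated as flat vectors, and the embedding, noise and image dimensions are assumed values, not those used in our experiments.

import torch
import torch.nn as nn

EMB, Z, IMG1, IMG2 = 1024, 100, 64 * 64 * 3, 256 * 256 * 3  # assumed sizes

# Placeholder networks: text embedding (+ noise or a stage-1 image) in,
# flattened image or sigmoid probability out.
G1 = nn.Sequential(nn.Linear(EMB + Z, IMG1), nn.Tanh())
D1 = nn.Sequential(nn.Linear(IMG1 + EMB, 1), nn.Sigmoid())
G2 = nn.Sequential(nn.Linear(IMG1 + EMB, IMG2), nn.Tanh())
D2 = nn.Sequential(nn.Linear(IMG2 + EMB, 1), nn.Sigmoid())
bce = nn.BCELoss()

def stage_losses(phi_c, real_lo, real_hi):
    """One batch of the losses in Eqs. (2) and (3).
    phi_c: (B, EMB) text embeddings; real_lo/real_hi: flattened real images."""
    b = phi_c.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Stage 1: G1 synthesizes a low-resolution sketch from phi_C and z ~ N(0, I).
    z = torch.randn(b, Z)
    fake_lo = G1(torch.cat([phi_c, z], 1))
    loss_d1 = bce(D1(torch.cat([real_lo, phi_c], 1)), ones) + \
              bce(D1(torch.cat([fake_lo.detach(), phi_c], 1)), zeros)  # D1 maximizes Eq. (2)
    loss_g1 = bce(D1(torch.cat([fake_lo, phi_c], 1)), ones)            # G1 minimizes it

    # Stage 2: G2 refines the stage-1 sketch, still conditioned on phi_C.
    fake_hi = G2(torch.cat([fake_lo.detach(), phi_c], 1))
    loss_d2 = bce(D2(torch.cat([real_hi, phi_c], 1)), ones) + \
              bce(D2(torch.cat([fake_hi.detach(), phi_c], 1)), zeros)  # D2 maximizes Eq. (3)
    loss_g2 = bce(D2(torch.cat([fake_hi, phi_c], 1)), ones)
    return loss_d1, loss_g1, loss_d2, loss_g2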
3.2. Image captioning model

As a supplementary component to the image synthesis model in the cycle architecture, the performance of the image captioning part is extremely important. If the captions generated in the image-to-text translation are inaccurate, they will in turn harm the image synthesis model while training the cycle architecture. We therefore focus not only on the performance of the image synthesis model but on that of the captioning model as well.

The generated images Î from the image synthesis model are fed into the caption generator F(Î). F is pretrained to generate accurate captions from given images. We use a simple encoder-decoder structure as the captioning model. Recent image captioning methods have shown that, given an image, directly maximizing the probability of the correct word sequence achieves state-of-the-art results. This approach can be formulated as follows:

θ* = argmax_θ Σ_{(Î2, Ĉ)} log p(Ĉ | Î2; θ). (4)

This architecture is shown in Figure 3. The synthetic images generated by the image synthesis model above are fed into the CNN encoder to extract image features; specifically, we use AlexNet for feature extraction. Once the image features are collected, an LSTM model acting as the language-generation part takes these features and generates a sequence of words. During the training stage, the loss function for the image captioning component is minimized:

Lcaption(I, C) = − Σ_{w=0}^{N} log p_w(C_w). (6)
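A minimal sketch of such an encoder-decoder captioner is given below, using AlexNet features and an LSTM decoder trained with the per-word cross entropy of Eq. (6). The vocabulary size, hidden width and teacher-forcing setup are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision import models

VOCAB, HIDDEN = 5000, 512  # assumed vocabulary size and hidden width

class CaptionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = models.alexnet(weights=None).features  # CNN feature extractor
        self.project = nn.Linear(256 * 6 * 6, HIDDEN)  # AlexNet conv output -> LSTM size
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, images, captions):
        """images: (B, 3, 224, 224); captions: (B, T) word indices.
        Teacher forcing: the image feature is prepended to the shifted caption."""
        feats = self.project(self.encoder(images).flatten(1)).unsqueeze(1)  # (B, 1, H)
        words = self.embed(captions[:, :-1])                                # (B, T-1, H)
        hidden, _ = self.lstm(torch.cat([feats, words], 1))                 # (B, T, H)
        return self.out(hidden)                                             # word logits

def caption_loss(model, images, captions):
    """L_caption of Eq. (6): negative log-likelihood of the ground-truth words."""
    logits = model(images, captions)
    return nn.functional.cross_entropy(logits.reshape(-1, VOCAB), captions.reshape(-1))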
3.3. Cycle consistency loss

The adversarial training losses in the image synthesis model alone cannot guarantee that the image generator synthesizes the input text descriptions into the desired images. In theory, the mapping functions for text-to-image and image-to-text should be cycle-consistent [6]. As shown in Figure 1, if the captioning model is able to bring the synthetic images back to the original text descriptions, cycle consistency is achieved.

Once the fake caption Ĉ is produced during training of the cycle model, a cycle consistency loss can be calculated by comparing the ground-truth captions with the generated ones:

Lcycle = log H(C, Ĉ), (7)

where H(C, Ĉ) denotes the cross-entropy loss between the generated caption Ĉ and the ground truth C.

We express the objective function for training the cycle model as follows:

Lfinal = Limage1(G1, D1) + Limage2(G2, D2) + λ Lcycle. (8)

The value of λ determines the weight of the cycle-consistency loss; we set λ to 2 in our experiments.
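The sketch below spells out Eqs. (7)-(8), reading H(C, Ĉ) as the token-level cross entropy computed from the frozen captioner's word logits; this reading, and the tensor shapes, are our assumptions.

import torch
import torch.nn.functional as F

LAMBDA = 2.0  # weight of the cycle-consistency term, as in our experiments

def cycle_loss(caption_logits, gt_caption):
    """L_cycle = log H(C, C_hat), Eq. (7).
    caption_logits: (B, T, V) word scores from the frozen captioner F;
    gt_caption: (B, T) ground-truth word indices."""
    h = F.cross_entropy(caption_logits.flatten(0, 1), gt_caption.flatten())
    return torch.log(h)  # the log is applied exactly as written in Eq. (7)

def final_loss(loss_image1, loss_image2, caption_logits, gt_caption):
    """L_final of Eq. (8), the objective minimized when training the cycle model."""
    return loss_image1 + loss_image2 + LAMBDA * cycle_loss(caption_logits, gt_caption)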
4. EXPERIMENTS

We perform extensive experiments to assess the effectiveness of our cycle model under our proposed diversity metric. The image captioning component is pretrained and kept fixed while training the cycle model, so that it provides accurate supervision for the synthetic images.

It is worth mentioning that this paper aims to propose a cycle model that constrains the training of the image synthesis GAN; the model is therefore directly compared with the non-cycle model, i.e. the image synthesis model alone, as adopted in our approach. To keep the cycle model simple, avoid extra techniques that might cause unexpected effects and blur the comparison, and highlight the difference caused by the cycle-consistency loss, we abandoned complex attention mechanisms in our captioning model as well as several state-of-the-art techniques proposed for generating higher-quality images.

4.1. Evaluation metrics

Although the Inception score has proven able to evaluate the visual quality of images, it cannot reflect how diverse the generated images are. In the Inception score, let x denote one sample, p(y) the overall label distribution of all generated results, and p(y|x) the probability that sample x belongs to class y:

exp(E_x KL(p(y|x) || p(y))). (9)

The rationale for the Inception score is that high-quality images are normally assigned to a class with high confidence by a classifier. However, if the model collapses to a few modes while still generating high-quality images, the Inception score cannot diagnose the issue.
Therefore, to fairly evaluate the diversity of the generated images, we adopt image histograms to compare the diversity of the distributions produced by the cycle model and the non-cycle model. An image histogram is a graphical representation of the data distribution in a digital image: it plots the number of pixels on the Y-axis against each intensity value on the X-axis. In our metric, we concatenate a cluster of generated images into one vector and treat it as a single image. Specifically, we visualize the colour histogram and the brightness histogram of the ground truth and of the samples from the cycle model and the non-cycle model. By comparing the pixel counts of these histograms, we can clearly see which model generates more diverse results in terms of colour and brightness.
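A minimal NumPy sketch of this metric follows. It assumes 8-bit RGB images and uses 256 bins (one per intensity value); the bin count is our choice, since the text does not fix it.

import numpy as np

BINS = 256  # one bin per 8-bit intensity value (our choice)

def colour_histograms(images):
    """images: (N, H, W, 3) uint8 array. All images are pooled into one long
    pixel vector, so the whole set is histogrammed as if it were one image."""
    flat = images.reshape(-1, 3)
    return [np.histogram(flat[:, c], bins=BINS, range=(0, 256))[0] for c in range(3)]

def brightness_histogram(images):
    """Histogram of per-pixel brightness, taken as the mean of R, G and B."""
    return np.histogram(images.reshape(-1, 3).mean(axis=1), bins=BINS, range=(0, 256))[0]

# A generated-sample histogram as spread out as the test-set histogram indicates
# diverse colours and brightness; a narrow, spiky one suggests mode collapse.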
Fig. 4. Colour histogram comparison. (a): colour distribution of images from the test set. (b): samples generated by the cycle model given text descriptions from the test set. (c): samples generated by the non-cycle model given captions from the test set.

Fig. 5. Brightness histogram comparison between the test set, the cycle model and the non-cycle model.

Fig. 7. The left three columns show the visual expression of different combinations of words (colours Yellow, Red, Purple against flower parts Leaves, Petal, Stamen, Pistil) from the cycle model, and the right three show the results from the non-cycle model.
5. CONCLUSION

We have proposed a novel cycle model for synthesizing diverse images from natural language. The framework has shown its ability to mitigate the major GAN issue of mode collapse. Our intuition is to use the supervision information from the image captioning model to help the image synthesis model recover lost modes. In addition, the image histogram is introduced to measure the diversity of generated images.

Future work lies in the following aspects. Firstly, the SuperGAN model will be evaluated on larger multimodal datasets. Secondly, state-of-the-art captioning models, e.g. attention models, will be adopted for an ablation study.

6. REFERENCES

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[2] Mehdi Mirza and Simon Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.

[3] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee, "Generative adversarial text to image synthesis," arXiv preprint arXiv:1605.05396, 2016.

[4] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.

[5] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," arXiv preprint arXiv:1710.10916, 2017.

[6] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.

[7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.

[8] Emily L. Denton, Soumith Chintala, Rob Fergus, et al., "Deep generative image models using a Laplacian pyramid of adversarial networks," in Advances in Neural Information Processing Systems, 2015, pp. 1486–1494.

[9] Qifeng Chen and Vladlen Koltun, "Photographic image synthesis with cascaded refinement networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1511–1520.

[10] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.

[11] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," arXiv preprint arXiv:1412.6632, 2014.

[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[13] Ziwei Wang, Yadan Luo, Yang Li, Zi Huang, and Hongzhi Yin, "Look deeper see richer: Depth-aware image paragraph captioning," in MM 2018 - Proceedings of the 2018 ACM Multimedia Conference. Association for Computing Machinery, 2018, pp. 672–680.

[14] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2849–2857.

[15] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell, "CyCADA: Cycle-consistent adversarial domain adaptation," arXiv preprint arXiv:1711.03213, 2017.

[16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.

[17] Maria-Elena Nilsback and Andrew Zisserman, "Automated flower classification over a large number of classes," in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.