sure that the semantic characteristics are well presented and captured by the image generator, we use a captioning model to translate the generated images into text descriptions.

Although the Inception score [7] is widely used to measure image quality in multimodal datasets, it cannot show how diverse the generated image samples are. We therefore propose the image histogram, which takes image brightness and colour into account, to measure the diversity of the model distribution. Contrasting the training set with the generated samples effectively illustrates the diversity of the modes captured in the model distribution.

The main contributions of this paper are three-fold: (1) We propose SuperGAN, a novel image synthesis model with a cycle architecture that adds an image captioning component, which remedies a major problem of GANs: mode collapse. (2) Instead of the Inception score, we introduce a new metric that measures the variety of samples using image histograms. (3) Extensive experiments show that our framework not only yields an image generator that synthesizes diverse image representations but also preserves image quality.

Fig. 3. Image captioning model structure (a CNN feature encoder followed by an RNN language decoder). During training of our proposed SuperGAN, we take the synthetic images as input, extract image features via a convolutional neural network, and generate captions with an RNN language decoder.

2. RELATED WORK

2.1. Image synthesis

Existing text-to-image models further exploit manifold interpolation and attention mechanisms to achieve better performance.

We adopted StackGAN [4] as our image synthesis model, since multi-stage image generation has shown its superiority in producing high-resolution images while preserving the text semantics. Similar to StackGAN, the Laplacian pyramid framework (LAPGAN) [8] refines and synthesizes high-resolution image details in multiple stages. Inspired by the multi-stage approach, the cascaded refinement network [9] synthesizes high-resolution images from semantic maps.
2.2. Image captioning

The additional component in our proposed cycle architecture is an image captioning model. Before recurrent neural networks were used as decoders to generate captions for images [10, 11], image captioning was regarded as a retrieval task: given an image, fetch the related text description. There are Long Short-Term Memory (LSTM) based methods that condition on features extracted by convolutional neural networks to generate text descriptions, trained on a number of captioned-image datasets [12, 13].

[Fig. 2 (architecture diagram): stage 1 pairs Generator G1 with Discriminator D1, and stage 2 pairs Generator G2 with Discriminator D2. The Skip-thought embedding ϕC of a text description ("This flower has yellow petals as well as a pink pistil") is concatenated with noise z and up-sampled into fake images; each discriminator down-samples fake and real images and outputs a sigmoid probability, with residual blocks used in stage 2.]
of GANs which is able to learn meaningful disentangled representations. Instead of simply identifying the real and fake samples, the discriminator must also recognize which generator is incapable of producing realistic samples, forcing those generators to learn identifiable modes.

3. METHODOLOGY

Our approach trains a text-to-image-to-text closed loop and introduces a cycle-consistency loss that forces the text-to-image model to capture more of the diverse modes in the data distribution. As shown in Figure 1, the cycle-consistency loss compares the original text input with the generated captions. To stabilize the overall training procedure, we pretrain the captioning model so that the supervision it provides is accurate throughout the cycle training process. In this section, we describe each component of our cycle architecture and the image diversity metric.

3.1. Image synthesis model

Image synthesis is the primary task in our framework. We adopt an image synthesis model very similar to StackGAN [4], shown in Figure 2, which decomposes the text-to-image generation process into two phases.

In the first phase, the generator G1 produces a low-resolution image containing a rough sketch, and the discriminator D1 estimates the probability that a given image is real. The text and image data distributions are denoted C ∼ pdata(C) and I ∼ pdata(I) respectively. We use the Skip-thought embedding to embed the text description C into ϕC. Random noise z ∼ pz(z), sampled from a normal distribution, is then concatenated with the text embedding, and the generator takes the joint vector as input to synthesize images Î1 ← G1(ϕC, z). The discriminator D1 distinguishes the generated images Î1 from real ones; it also takes the original text description as input to check whether the text matches the generated images. Rather than considering images alone, the model thus benefits from the supervision carried by the corresponding text descriptions. In other words, G1 and D1 can be formulated as a minimax problem, min_G1 max_D1 Limage1(G1, D1), over the loss function

Limage1 = E_{I,C ∼ pdata(I,C)} [log D1(I, ϕC)] + E_{z ∼ pz(z), C ∼ pdata(C)} [log(1 − D1(G1(ϕC, z), ϕC))]. (2)

In order to enable the output images to contain more details corresponding to the given caption and to avoid shape distortion, we use the second stage to reconstruct and polish the sketches generated in the first stage. Another pair of generator and discriminator is introduced, denoted G2 and D2 respectively. The generator G2 conditions on the low-resolution images Î1 generated in the first phase and on the text description ϕC to generate more detailed images Î2. The discriminator D2 tries to determine how well the generated images Î2 match the corresponding captions ϕC. Let Î1 ∼ pG1(Î1) denote the model distribution of stage 1. As in the first stage, the generator G2 and the discriminator D2 alternately minimize and maximize the loss function Limage2:

Limage2 = E_{I,C ∼ pdata(I,C)} [log D2(I, ϕC)] + E_{Î1 ∼ pG1(Î1), C ∼ pdata(C)} [log(1 − D2(G2(ϕC, Î1), ϕC))]. (3)

In order to simplify the image synthesis process and to sharpen the contrast between the cycle model and the single image-generation model (the non-cycle model), we omit several techniques proposed in StackGAN, e.g. conditioning augmentation and sentence-embedding interpolation.
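For illustration, the following PyTorch sketch computes the two adversarial losses above. The tiny linear G1/D1/G2/D2 modules merely stand in for the actual up-/down-sampling networks of Figure 2, images are treated as flat vectors, and the embedding, noise and image dimensions are assumed values, not those used in our experiments.

import torch
import torch.nn as nn

EMB, Z, IMG1, IMG2 = 1024, 100, 64 * 64 * 3, 256 * 256 * 3  # assumed sizes

# Placeholder networks: text embedding (+ noise or a stage-1 image) in,
# flattened image or sigmoid probability out.
G1 = nn.Sequential(nn.Linear(EMB + Z, IMG1), nn.Tanh())
D1 = nn.Sequential(nn.Linear(IMG1 + EMB, 1), nn.Sigmoid())
G2 = nn.Sequential(nn.Linear(IMG1 + EMB, IMG2), nn.Tanh())
D2 = nn.Sequential(nn.Linear(IMG2 + EMB, 1), nn.Sigmoid())
bce = nn.BCELoss()

def stage_losses(phi_c, real_lo, real_hi):
    """One batch of the losses in Eqs. (2) and (3).
    phi_c: (B, EMB) text embeddings; real_lo/real_hi: flattened real images."""
    b = phi_c.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # Stage 1: G1 synthesizes a low-resolution sketch from phi_C and z ~ N(0, I).
    z = torch.randn(b, Z)
    fake_lo = G1(torch.cat([phi_c, z], 1))
    loss_d1 = bce(D1(torch.cat([real_lo, phi_c], 1)), ones) + \
              bce(D1(torch.cat([fake_lo.detach(), phi_c], 1)), zeros)  # D1 maximizes Eq. (2)
    loss_g1 = bce(D1(torch.cat([fake_lo, phi_c], 1)), ones)            # G1 minimizes it

    # Stage 2: G2 refines the stage-1 sketch, still conditioned on phi_C.
    fake_hi = G2(torch.cat([fake_lo.detach(), phi_c], 1))
    loss_d2 = bce(D2(torch.cat([real_hi, phi_c], 1)), ones) + \
              bce(D2(torch.cat([fake_hi.detach(), phi_c], 1)), zeros)  # D2 maximizes Eq. (3)
    loss_g2 = bce(D2(torch.cat([fake_hi, phi_c], 1)), ones)
    return loss_d1, loss_g1, loss_d2, loss_g2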
3.2. Image captioning model

As a supplementary component to the image synthesis model in the cycle architecture, the performance of the image captioning part is extremely important. If the captions generated in the image-to-text translation are inaccurate, they will in turn harm the image synthesis model while training the cycle architecture. We therefore focus not only on the performance of the image synthesis model but on that of the captioning model as well.

The generated images Î from the image synthesis model are fed into the caption generator F(Î). F is pretrained to generate accurate captions from given images. We use a simple encoder-decoder structure as the captioning model. Recent image captioning methods have shown that, given an image, directly maximizing the probability of the correct word sequence achieves state-of-the-art results. This approach can be formulated as follows:

θ* = argmax_θ Σ_{(Î2, Ĉ)} log p(Ĉ | Î2; θ). (4)

This architecture is shown in Figure 3. The synthetic images generated by the image synthesis model above are fed into the CNN encoder to extract image features; specifically, we use AlexNet for feature extraction. Once the image features are collected, an LSTM model acting as the language-generation part takes these features and generates a sequence of words. During the training stage, the loss function for the image captioning component is minimized:

Lcaption(I, C) = − Σ_{w=0}^{N} log p_w(C_w). (6)
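A minimal sketch of such an encoder-decoder captioner is given below, using AlexNet features and an LSTM decoder trained with the per-word cross entropy of Eq. (6). The vocabulary size, hidden width and teacher-forcing setup are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision import models

VOCAB, HIDDEN = 5000, 512  # assumed vocabulary size and hidden width

class CaptionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = models.alexnet(weights=None).features  # CNN feature extractor
        self.project = nn.Linear(256 * 6 * 6, HIDDEN)  # AlexNet conv output -> LSTM size
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, images, captions):
        """images: (B, 3, 224, 224); captions: (B, T) word indices.
        Teacher forcing: the image feature is prepended to the shifted caption."""
        feats = self.project(self.encoder(images).flatten(1)).unsqueeze(1)  # (B, 1, H)
        words = self.embed(captions[:, :-1])                                # (B, T-1, H)
        hidden, _ = self.lstm(torch.cat([feats, words], 1))                 # (B, T, H)
        return self.out(hidden)                                             # word logits

def caption_loss(model, images, captions):
    """L_caption of Eq. (6): negative log-likelihood of the ground-truth words."""
    logits = model(images, captions)
    return nn.functional.cross_entropy(logits.reshape(-1, VOCAB), captions.reshape(-1))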
3.3. Cycle consistency loss

The adversarial training losses in the image synthesis model alone cannot guarantee that the image generator synthesizes the input text descriptions into the desired images. In theory, the mapping functions for text-to-image and image-to-text should be cycle-consistent [6]. As shown in Figure 1, if the captioning model is able to bring the synthetic images back to the original text descriptions, cycle consistency is achieved.

Once the fake caption Ĉ is produced during training of the cycle model, a cycle consistency loss can be calculated by comparing the ground-truth captions with the generated ones:

Lcycle = log H(C, Ĉ), (7)

where H(C, Ĉ) denotes the cross-entropy loss between the generated caption Ĉ and the ground truth C.

We express the objective function for training the cycle model as follows:

Lfinal = Limage1(G1, D1) + Limage2(G2, D2) + λ Lcycle. (8)

The value of λ determines the weight of the cycle-consistency loss; we set λ to 2 in our experiments.
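The sketch below spells out Eqs. (7)-(8), reading H(C, Ĉ) as the token-level cross entropy computed from the frozen captioner's word logits; this reading, and the tensor shapes, are our assumptions.

import torch
import torch.nn.functional as F

LAMBDA = 2.0  # weight of the cycle-consistency term, as in our experiments

def cycle_loss(caption_logits, gt_caption):
    """L_cycle = log H(C, C_hat), Eq. (7).
    caption_logits: (B, T, V) word scores from the frozen captioner F;
    gt_caption: (B, T) ground-truth word indices."""
    h = F.cross_entropy(caption_logits.flatten(0, 1), gt_caption.flatten())
    return torch.log(h)  # the log is applied exactly as written in Eq. (7)

def final_loss(loss_image1, loss_image2, caption_logits, gt_caption):
    """L_final of Eq. (8), the objective minimized when training the cycle model."""
    return loss_image1 + loss_image2 + LAMBDA * cycle_loss(caption_logits, gt_caption)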
4. EXPERIMENTS

We perform extensive experiments to assess the effectiveness of our cycle model under our proposed diversity metric. The image captioning component is pretrained and kept fixed while training the cycle model, so that it provides accurate supervision for the synthetic images.

It is worth mentioning that this paper aims to propose a cycle model that constrains the training of the image synthesis GAN; the model is therefore directly compared with the non-cycle model, i.e. the image synthesis model alone, as adopted in our approach. To keep the cycle model simple, avoid extra techniques that might cause unexpected effects and blur the comparison, and highlight the difference caused by the cycle-consistency loss, we abandoned complex attention mechanisms in our captioning model as well as several state-of-the-art techniques proposed for generating higher-quality images.

4.1. Evaluation metrics

Although the Inception score has proven able to evaluate the visual quality of images, it cannot reflect how diverse the generated images are. In the Inception score, let x denote one sample, p(y) the overall label distribution of all generated results, and p(y|x) the probability that sample x belongs to class y:

exp(E_x KL(p(y|x) || p(y))). (9)

The rationale for the Inception score is that high-quality images are normally assigned to a class with high confidence by a classifier. However, if the model collapses to a few modes while still generating high-quality images, the Inception score cannot diagnose the issue.
Therefore, to fairly evaluate the diversity of the generated images, we adopt image histograms to compare the diversity of the distributions produced by the cycle model and the non-cycle model. An image histogram is a graphical representation of the data distribution in a digital image: it plots the number of pixels on the Y-axis against each intensity value on the X-axis. In our metric, we concatenate a cluster of generated images into one vector and treat it as a single image. Specifically, we visualize the colour histogram and the brightness histogram of the ground truth and of the samples from the cycle model and the non-cycle model. By comparing the pixel counts of these histograms, we can clearly see which model generates more diverse results in terms of colour and brightness.
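A minimal NumPy sketch of this metric follows. It assumes 8-bit RGB images and uses 256 bins (one per intensity value); the bin count is our choice, since the text does not fix it.

import numpy as np

BINS = 256  # one bin per 8-bit intensity value (our choice)

def colour_histograms(images):
    """images: (N, H, W, 3) uint8 array. All images are pooled into one long
    pixel vector, so the whole set is histogrammed as if it were one image."""
    flat = images.reshape(-1, 3)
    return [np.histogram(flat[:, c], bins=BINS, range=(0, 256))[0] for c in range(3)]

def brightness_histogram(images):
    """Histogram of per-pixel brightness, taken as the mean of R, G and B."""
    return np.histogram(images.reshape(-1, 3).mean(axis=1), bins=BINS, range=(0, 256))[0]

# A generated-sample histogram as spread out as the test-set histogram indicates
# diverse colours and brightness; a narrow, spiky one suggests mode collapse.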
Fig. 4. Colour histogram comparison. (a): colour distribution of images from the test set. (b): samples generated by the cycle model given text descriptions from the test set. (c): samples generated by the non-cycle model given captions from the test set.

Fig. 5. Brightness histogram comparison between the test set, the cycle model and the non-cycle model.

Fig. 7. The left three columns show the visual expression of different combinations of words (colours Yellow, Red, Purple against flower parts Leaves, Petal, Stamen, Pistil) from the cycle model, and the right three show the results from the non-cycle model.
5. CONCLUSION

We have proposed a novel cycle model for synthesizing diverse images from natural language. The framework has shown its ability to mitigate the major GAN issue of mode collapse. Our intuition is to use the supervision information from the image captioning model to help the image synthesis model recover lost modes. In addition, the image histogram is introduced to measure the diversity of generated images.

Future work lies in the following aspects. Firstly, the SuperGAN model will be evaluated on larger multimodal datasets. Secondly, state-of-the-art captioning models, e.g. attention models, will be adopted for an ablation study.

6. REFERENCES

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

[2] Mehdi Mirza and Simon Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.

[3] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee, "Generative adversarial text to image synthesis," arXiv preprint arXiv:1605.05396, 2016.

[4] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.

[5] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," arXiv preprint arXiv:1710.10916, 2017.

[6] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.

[7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.

[8] Emily L. Denton, Soumith Chintala, Rob Fergus, et al., "Deep generative image models using a Laplacian pyramid of adversarial networks," in Advances in Neural Information Processing Systems, 2015, pp. 1486–1494.

[9] Qifeng Chen and Vladlen Koltun, "Photographic image synthesis with cascaded refinement networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1511–1520.

[10] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.

[11] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," arXiv preprint arXiv:1412.6632, 2014.

[12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[13] Ziwei Wang, Yadan Luo, Yang Li, Zi Huang, and Hongzhi Yin, "Look deeper see richer: Depth-aware image paragraph captioning," in MM 2018 - Proceedings of the 2018 ACM Multimedia Conference. Association for Computing Machinery, 2018, pp. 672–680.

[14] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2849–2857.

[15] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei A. Efros, and Trevor Darrell, "CyCADA: Cycle-consistent adversarial domain adaptation," arXiv preprint arXiv:1711.03213, 2017.

[16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.

[17] Maria-Elena Nilsback and Andrew Zisserman, "Automated flower classification over a large number of classes," in 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing. IEEE, 2008, pp. 722–729.