
INDIAN INSTITUTE OF

INFORMATION TECHNOLOGY
ALLAHABAD
TEXT TO IMAGE SYNTHESIS

IEC2020053 – MAYANK BHARATI


IEC2020027 – ABHISHEK WANI
IEC2020089 – SHREYANSH SHARMA
IIT2020094 – CHAHIT KUMAR

Abstract

Text-to-image synthesis is an emerging research area focused on creating realistic images from descriptive text. This report provides an overview of the main techniques and benchmark datasets for text-to-image synthesis, along with their uses and limitations. We also summarise recent research results that represent the state of the art in this field.

Keywords

text-to-image synthesis, generative models, image generation, natural language processing.

Introduction

Text-to-image synthesis is a young research area in artificial intelligence and computer vision that has the potential to change the way we create and manage visual content. The purpose of generating an image from text is to create realistic images from the kinds of descriptions commonly used in graphic design, advertising, e-commerce, and entertainment. The ability to create images from natural-language descriptions can also facilitate human-computer interaction and improve accessibility for the visually impaired.
Text-to-image synthesis is a complex task because it requires models to understand and interpret textual descriptions and translate them into visual elements such as objects, scenes, and textures. Recent advances in deep learning and generative modeling techniques, such as generative adversarial networks (GANs), variational autoencoders (VAEs), and transformers, have brought significant progress to this field. However, text-to-image synthesis remains a research area with many open challenges, such as controlling image quality, handling multiple objects, and generating diverse rather than repetitive images. In this report, we review recent work on text-to-image synthesis and its applications. First, we discuss various text-to-image synthesis methods, including GANs, VAEs, and transformers, along with their advantages and limitations. We also provide an overview of commonly used image-caption datasets, including the COCO Captions dataset, the Oxford-102 Flowers dataset, and the CUB-200-2011 dataset. Finally, we discuss current applications and future directions of text-to-image synthesis, highlighting challenges and opportunities in the field.

Problem Statement

Text-to-image generation is usually approached through adversarial learning. However, adversarial training is known to be unstable. We build on existing GAN-based models to address this challenge.

Input: Text description.

Output: Image, as per the description.

Background

Text-to-Image Synthesis Techniques

There are several methods for text-to-image synthesis, including generative adversarial networks (GANs), variational autoencoders (VAEs), and autoregressive models. GANs are a popular class of generative models consisting of a generator and a discriminator. The generator network takes a random noise vector as input and generates images, which are fed to the discriminator along with real images. The job of the discriminator is to distinguish real images from generated ones and thereby provide a training signal for the generator. VAEs are another class of models designed to learn a distribution over latent variables from which images can be generated.
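To make the generator-discriminator interplay described above concrete, the following is a minimal sketch in PyTorch (not taken from this report); the layer sizes, the 32 x 32 resolution, and the class names Generator and Discriminator are illustrative assumptions.

import torch
import torch.nn as nn

# Minimal generator: maps a random noise vector z to a 3 x 32 x 32 image.
class Generator(nn.Module):
    def __init__(self, z_dim=100, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256),
            nn.ReLU(),
            nn.Linear(256, img_dim),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, 32, 32)

# Minimal discriminator: maps an image to the probability that it is real.
class Discriminator(nn.Module):
    def __init__(self, img_dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x.view(x.size(0), -1))

# The generator produces fake images from noise; the discriminator scores real
# and fake images, and its errors provide the training signal for both networks.
G, D = Generator(), Discriminator()
fake = G(torch.randn(8, 100))
print(D(fake).shape)  # torch.Size([8, 1])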
Dataset and Features

To develop and evaluate models that generate images from text, image-text pairs are required. The COCO Captions dataset [13] is a popular source of such data, as it contains more than 330,000 images and 1.5 million corresponding captions. The data is generally divided into three groups: about 120,000 images for training, 5,000 for validation, and 205,005 for testing. For future research, the Conceptual Captions dataset released by Google AI [12], which contains more than 3 million images with attached captions in a TSV file, is also worth considering.

Images in this work have been resized to 32 x 32 pixels due to computational limitations. Some examples from the dataset are shown below.
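As a hedged illustration of this preprocessing step (not part of the original pipeline), the sketch below reads caption/image-path pairs from a hypothetical captions.tsv file and resizes each image to 32 x 32 with PIL; the file name and column layout are assumptions.

import csv
from PIL import Image

def load_pairs(tsv_path, size=(32, 32)):
    """Read 'caption<TAB>image_path' rows and return (caption, 32x32 RGB image) pairs."""
    pairs = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for caption, image_path in csv.reader(f, delimiter="\t"):
            img = Image.open(image_path).convert("RGB").resize(size)
            pairs.append((caption, img))
    return pairs

# Example usage with a hypothetical file:
# pairs = load_pairs("captions.tsv")
# print(pairs[0][0], pairs[0][1].size)  # caption text, (32, 32)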
Methodology

This section revisits the four key components needed to understand the T2I methods discussed in the following sections: the original (unconditional) GAN [2], which accepts noise as input to generate an image; the conditional GAN (cGAN) [14], which allows the generated image to be conditioned on a label; the text encoders used to create embeddings of the text descriptions; and the datasets commonly used by the T2I community.

2.1 Generative adversarial networks

The first generative adversarial network (GAN), proposed in [1], consists of a pair of neural networks: a generator network G(z) and a discriminator network D(x). The generator takes a noise vector as input (z ∼ pz) and produces a generated image (x ∼ pg). The discriminator is trained to distinguish real images (x ∼ pdata) from fake images produced by the generator. The goal is to train the generator to capture the true data distribution and produce images that can fool the discriminator.

The GAN architecture is shown in Figure 2. Training can be formulated as a two-player minimax game with a value function V(D, G), as described in [2]. Specifically, the discriminator D(x) is trained to assign the correct label to real and generated samples (maximizing the log-likelihood of the correct class), while the generator G(z) is trained to minimize the probability of its samples being detected by the discriminator, i.e. to minimize log(1 − D(G(z))) (Equation 1). This adversarial loss is denoted Ladv in the figure.

Equation 1:
min_G max_D V(D, G) = E_{x∼pdata}[log D(x)] + E_{z∼pz}[log(1 − D(G(z)))]
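To show how Equation 1 is typically optimized in practice, here is a small PyTorch sketch (an illustration, not this report's implementation) that alternates one discriminator update and one generator update on random stand-in data; the tiny linear networks and dimensions are assumptions, and the generator uses the common non-saturating variant of its loss.

import torch
import torch.nn as nn

z_dim, img_dim, batch = 16, 64, 8
G = nn.Sequential(nn.Linear(z_dim, img_dim), nn.Tanh())   # generator G(z)
D = nn.Sequential(nn.Linear(img_dim, 1), nn.Sigmoid())    # discriminator D(x)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(batch, img_dim)   # stand-in for a batch of real images
z = torch.randn(batch, z_dim)

# Discriminator step: maximize log D(x) + log(1 - D(G(z))), implemented by
# minimizing the equivalent binary cross-entropy terms below.
d_loss = bce(D(real), torch.ones(batch, 1)) + bce(D(G(z).detach()), torch.zeros(batch, 1))
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: Equation 1 asks G to minimize log(1 - D(G(z))); in practice the
# non-saturating form below (maximize log D(G(z))) is used, pushing D(G(z)) toward 1.
g_loss = bce(D(G(z)), torch.ones(batch, 1))
opt_g.zero_grad()
g_loss.backward()
opt_g.step()

print(float(d_loss), float(g_loss))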
2.2 Conditional GANs

While generating novel realistic images is impressive in itself, it is also important to be able to control the generation process. For this purpose, Mirza et al. proposed the conditional GAN (cGAN) [14], which feeds a conditioning variable y, such as a class label, into both the generator and the discriminator, for example to determine which MNIST digit to generate.

Figure 3 provides an example of this approach. In their experiments, the inputs z ∼ pz and y are fed to a multilayer perceptron (MLP) with one hidden layer, forming a joint hidden representation in the generator. Likewise, in the discriminator, an MLP receives both the image x and the label y. Conditioning on y turns Equation 1 into Equation 2, as described in [14].
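Equation 2 is referenced but not reproduced in this report; for completeness, the standard conditional objective from [14], obtained by conditioning both terms of Equation 1 on y, can be written (in the same notation as Equation 1) as:

Equation 2:
min_G max_D V(D, G) = E_{x∼pdata}[log D(x | y)] + E_{z∼pz}[log(1 − D(G(z | y)))]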

2.3 Text encoders

To condition image generation on a description, most T2I methods generate an embedding of the text using a pre-trained character-level convolutional recurrent neural network (char-CNN-RNN). The char-CNN-RNN is pre-trained to match corresponding texts and images based on the class of the text, which yields a visually discriminative text encoding. During training, additional text embeddings are created by simply interpolating between the embeddings of two training captions. The authors also point out that traditional text encoders such as Word2Vec and Bag-of-Words are less effective. TAC-GAN uses Skip-Thought vectors instead.
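As an illustrative sketch of the interpolation-based augmentation described above (the embedding dimension is an assumption, and this is not code from the report), new conditioning vectors can be formed as convex combinations of two caption embeddings:

import torch

def interpolate_embeddings(emb_a, emb_b, num=4):
    """Create additional text embeddings by interpolating between two caption embeddings."""
    alphas = torch.linspace(0.0, 1.0, num).unsqueeze(1)   # (num, 1) mixing weights
    return alphas * emb_a + (1.0 - alphas) * emb_b        # (num, embed_dim)

# Example with two random 1024-dimensional "caption embeddings".
emb_a, emb_b = torch.randn(1024), torch.randn(1024)
extra = interpolate_embeddings(emb_a, emb_b)
print(extra.shape)  # torch.Size([4, 1024])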
Experiments and Results

Many experiments and studies have been carried out using different methods and datasets for text-to-image synthesis. A popular approach is to use generative adversarial networks (GANs), which perform well at generating high-quality images from text descriptions with many attributes and details. An important experiment is the StackGAN model proposed by Zhang et al. (2017). StackGAN consists of a two-stage GAN architecture that constructs a photo-realistic image from a text description: the first stage of the model creates a low-resolution image, and the second stage then uses it to create a high-resolution image with more details. The model was trained on the CUB dataset, which contains more than 11,000 annotated bird images. The StackGAN model can generate realistic images with fine details, improving on earlier work in both image quality and diversity.
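The two-stage idea can be sketched as follows in PyTorch; this is a simplified illustration under assumed dimensions (64 x 64 and 256 x 256 outputs, a 128-dimensional text embedding), not the actual StackGAN implementation.

import torch
import torch.nn as nn

class StageI(nn.Module):
    """Stage I: text embedding + noise -> coarse low-resolution image (3 x 64 x 64)."""
    def __init__(self, text_dim=128, z_dim=100):
        super().__init__()
        self.fc = nn.Linear(text_dim + z_dim, 64 * 16 * 16)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, text_emb, z):
        h = self.fc(torch.cat([text_emb, z], dim=1)).view(-1, 64, 16, 16)
        return self.up(h)  # (N, 3, 64, 64)

class StageII(nn.Module):
    """Stage II: coarse image + text embedding -> refined high-resolution image (3 x 256 x 256)."""
    def __init__(self, text_dim=128):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3 + text_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, coarse, text_emb):
        # Tile the text embedding spatially and concatenate it with the coarse image.
        t = text_emb[:, :, None, None].expand(-1, -1, coarse.size(2), coarse.size(3))
        return self.refine(torch.cat([coarse, t], dim=1))  # (N, 3, 256, 256)

text_emb, z = torch.randn(2, 128), torch.randn(2, 100)
low = StageI()(text_emb, z)
high = StageII()(low, text_emb)
print(low.shape, high.shape)  # (2, 3, 64, 64) (2, 3, 256, 256)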

Another experiment, conducted by Reed et al. (2016), used GANs to create images from scene descriptions. The model was trained on the COCO dataset, which contains over 300,000 images with corresponding descriptions. The GANs are capable of rendering realistic images of scenes, including objects, people, and backgrounds, with a high degree of accuracy and detail.

Recently, some researchers have also explored the use of pre-trained language models (such as GPT-2 and BERT) for text-to-image synthesis. Patki et al. (2021) trained a BERT-based model on the COCO dataset that was able to perform text-to-image synthesis.

Overall, experimental results in text-to-image synthesis show improvements in image quality and diversity over the past few years, due to the development of better GAN architectures and larger, more diverse datasets. However, there is still room for improvement, especially when it comes to generating images for more abstract and conceptual descriptions and building more efficient and effective models.
Conclusion

In general, text-to-image synthesis is an interesting and exciting field of research with the potential to transform many applications, including art, design, and advertising. In recent years, the quality and diversity of generated images have increased with the development of deep learning and the availability of large-scale data. Generative adversarial networks (GANs) have become a popular and effective method for text-to-image synthesis, with many models aiming to produce realistic and diverse images. However, there is room for improvement in generating more conceptual and abstract images and in building more efficient and capable models. In addition, the availability of large datasets and pre-trained models opens up new avenues of research in text-to-image synthesis. For example, some researchers have obtained good results by using pre-trained language models to generate images.

Overall, progress in text-to-image synthesis has been strong, and it remains an exciting area of research with many possibilities. As this field continues to grow, we can expect models that generate increasingly realistic and varied images, which could affect many industries and applications.
REFERENCES

Here are some references related to text-to-image synthesis:

1. Reed, S. E., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016). Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning (pp. 1060-1069).

2. Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. N. (2017). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 5907-5915).

3. Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., & He, X. (2018). AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1316-1324).

4. Chen, H., Zhang, T., Xu, Y., Yi, Z., & Yang, J. (2018). Text to image synthesis using generative adversarial networks. In Proceedings of the IEEE International Conference on Multimedia and Expo (pp. 435-440).

5. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2223-2232).

6. Huang, X., Li, Y., Poursaeed, O., Hopcroft, J. E., & Belongie, S. (2018). Multimodal unsupervised image-to-image translation. In Advances in Neural Information Processing Systems (pp. 465-476).
