
Face Generation using Textual Description

Achyut Raghavan
Computer Science Engineering
PES University
Bangalore, India
raghavanachyut@gmail.com

Akhilesh Harkude
Computer Science Engineering
PES University
Bangalore, India
akhileshharkude1@gmail.com

Udit Brahmadevara
Computer Science Engineering
PES University
Bangalore, India
udit.brahmadevara@gmail.com

Ashrita D
Computer Science Engineering
PES University
Bangalore, India
daraashritha0703@gmail.com



Abstract— This paper introduces two state-of-the-art Generative Adversarial Network (GAN) architectures, the Self-Attention Generative Adversarial Network (SA-GAN) and the Diverse Face Generative Adversarial Network (DF-GAN), with the aim of enhancing text-to-image generation, focusing in particular on generating high-quality facial images from textual descriptions. SA-GAN advances image synthesis through attention mechanisms, allowing comprehensive modeling of the long-range dependencies crucial for realistic detail across the entire image. By enabling the generator to draw on information from distant feature locations, SA-GAN produces intricate and consistent details across the whole face. DF-GAN, in contrast, targets text-to-face generation by emphasizing fine-grained facial details. Employing a two-stage stacked architecture, it generates low-resolution face images in Stage-I and refines them into high-resolution images in Stage-II. Leveraging the Wasserstein loss for stable training, DF-GAN utilizes multiple text embeddings during training epochs to effectively capture diverse facial features. Through exploring SA-GAN's attention-driven modeling and DF-GAN's focus on fine-grained facial synthesis, this paper contributes to advancing text-to-image generation and showcases the capabilities of these cutting-edge GAN architectures in this domain.

Keywords – GAN, Face Editing, Face Poses, MXNet, Audio to Text, CelebA, Criminal Detection, Deep Learning

I. INTRODUCTION

The main intent of this project is to develop efficient face-sketching software tailored for criminal investigation departments, addressing the imperative need for accuracy in suspect identification. Leveraging Generative Adversarial Networks (GANs), the software transforms textual descriptions into facial images. Crucially, it introduces a unique face-tuning feature based on AttnGAN to enhance sketch precision. In criminal investigations, where time is of the essence, this software aims to accelerate the often meticulous process of creating facial sketches. With the help of advanced technologies and machine learning algorithms, it not only streamlines the identification process but also contributes to the reliability of witness testimonies. The motivation stems from the challenges law enforcement agencies currently face in producing accurate facial sketches with different poses on a custom dataset. This project aspires to introduce a transformative tool that, with a user-friendly interface and innovative features, can significantly improve the efficiency of criminal identification processes, ultimately aiding law enforcement in solving cases more effectively and enhancing public safety.

II. RELATED WORK

In the domain of text-to-image synthesis, several approaches have been proposed, each contributing to the advancement of this field. Generative Adversarial Networks (GANs) serve as a fundamental framework for learning functions that generate samples resembling those from a given training distribution [2]. AttnGAN [1] introduces a multi-stage attentional refinement network, focusing on different image portions during generation for fine-grained image synthesis. DF-GAN [2] provides a simple yet effective baseline for text-to-image synthesis, emphasizing ease of use and scalability. DM-GAN [3] leverages a dynamic memory-based generator and a multi-scale discriminator, achieving superior results in terms of image coherence and discrimination. AnyFace [4] utilizes StyleGAN for free-style text-to-face synthesis, incorporating a style-mixing module for precise control. Suspect Face Generation [5] employs deep learning to generate diverse and realistic suspect face images, showcasing advancements in forensic applications. Text2FaceGAN [6] stands out for producing lifelike face images from detailed textual descriptions, demonstrating versatility for applications such as virtual reality and gaming. A Realistic Image Generation of Face [7] and Synthesizing Human Face Image [8] both contribute by presenting methods for realistic facial image generation from text descriptions, with considerations for fully trained GANs and attentional mechanisms. Text2Human [9] introduces a configurable attribute-embedding module for controlled GAN-based human image synthesis. Head Rotation in Denoising Diffusion Models [10] explores handling head rotation using Denoising Diffusion Implicit Models. Finally, AttGAN [11] focuses on facial attribute editing, emphasizing the separation of attributes in latent representations for precise and aesthetically pleasing modifications.

III. METHODOLOGY

A. Problem Formulation:
This section outlines our methodology for text-to-face generation and introduces a comparative analysis between two generative adversarial networks (GANs): DF-GAN and SA-GAN. The primary goal is to explore the capabilities of these models in generating facial images from textual descriptions.

B. Dataset Preparation:
The dataset comprises a fusion of the CelebA-HQ dataset and an additional 20,000 Indian faces obtained through web scraping. Using Octoparse, we extracted JPG URLs, cleaned the dataset, and labeled the images using the MXNet face attribute classifier. This resulted in a curated combined dataset of roughly 50,000 faces.
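
The cleaning step relies on Haar-cascade face detection and cropping (detailed in Section IV). The snippet below is a minimal sketch of that step using OpenCV; the directory layout, crop selection, and output size are illustrative assumptions rather than the exact pipeline used here.

```python
# Minimal sketch of Haar-cascade face detection and cropping with OpenCV.
# Directory paths and the 128x128 output size are illustrative assumptions.
import os
import cv2

CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def crop_faces(src_dir: str, dst_dir: str, size: int = 128) -> None:
    """Detect the largest face in each scraped image and save a square crop."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = cv2.imread(os.path.join(src_dir, name))
        if img is None:          # skip files that failed to download cleanly
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:      # no face found; drop the image
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep largest face
        crop = cv2.resize(img[y:y + h, x:x + w], (size, size))
        cv2.imwrite(os.path.join(dst_dir, name), crop)
```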

C. Model Selection:
For the text-to-face generation task, we selected two distinct models: DF-GAN and SA-GAN. The source code for these models is acquired from their respective repositories, facilitating a comparative analysis of their performance.

D. Text Embedding and Captioning:
To enrich the textual context of generated faces, we incorporate text embeddings through captioning. This step enhances the narrative aspects of the facial images, contributing to a more meaningful synthesis.
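
The text encoder is not specified above; as an illustrative stand-in, the following sketch turns captions into fixed-length embedding vectors with a pretrained sentence encoder (both the library and the model name are assumptions):

```python
# Illustrative stand-in for the (unspecified) text encoder: captions are
# mapped to fixed-length vectors the generators can be conditioned on.
from sentence_transformers import SentenceTransformer  # assumed dependency

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice

captions = [
    "A young woman with black hair and a slight smile.",
    "A middle-aged man with a beard and glasses.",
]
text_embeddings = encoder.encode(captions)  # ndarray of shape (2, 384)
print(text_embeddings.shape)
```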

E. Minor Variations with AttGAN:
To introduce subtle variations and enhance the diversity of the generated faces, we implement AttGAN. This auxiliary tool allows for modifications in specific attributes, providing individuality to each generated face.
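
AttGAN conditions decoding on a binary attribute vector, so attribute edits reduce to flipping entries of that vector and re-decoding. A minimal sketch, assuming a CelebA-style attribute list and a placeholder pretrained generator:

```python
# Minimal sketch of AttGAN-style attribute editing: flip entries of a 0/1
# attribute vector and re-decode the face. `attgan_generator` is a
# placeholder; the attribute names are a hypothetical CelebA-style subset.
import numpy as np

ATTRS = ["Black_Hair", "Eyeglasses", "Male", "Smiling", "Young"]

def edit_attributes(attr_vec: np.ndarray, changes: dict) -> np.ndarray:
    """Return a copy of the attribute vector with the requested flips."""
    edited = attr_vec.copy()
    for name, value in changes.items():
        edited[ATTRS.index(name)] = value
    return edited

base = np.array([1, 0, 0, 1, 1])             # original face attributes
variant = edit_attributes(base, {"Eyeglasses": 1, "Smiling": 0})
# face = attgan_generator(latent, variant)   # re-decode with new attributes
print(variant)                               # [1 1 0 0 1]
```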

F. Poses with Head Rotation:
Dynamism is introduced into the facial images using the head-rotation repository. This step diversifies the spatial orientation of the generated faces, adding realism and variability to the dataset.

G. Training the Generative Models:


The DF-GAN and SA-GAN models undergo training on
the augmented dataset. The training process involves two
stages: Stage-I focuses on learning basic face structures in
low resolution (64 × 64 × 3), while Stage-II refines these
structures into high-resolution outputs (128 × 128 × 3).
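
The training loop itself is not listed in the paper; the sketch below shows one per-stage adversarial update consistent with the Wasserstein loss mentioned in the abstract. The generator, discriminator, and data loader are placeholders, not the authors' implementation.

```python
# Schematic per-stage adversarial update with a WGAN-style critic loss;
# `gen`, `disc`, and `loader` are placeholders, not the authors' code.
import torch

def train_stage(gen, disc, loader, epochs, g_lr, d_lr, z_dim=100):
    g_opt = torch.optim.Adam(gen.parameters(), lr=g_lr, betas=(0.5, 0.5))
    d_opt = torch.optim.Adam(disc.parameters(), lr=d_lr, betas=(0.5, 0.5))
    for _ in range(epochs):
        for real, text_emb in loader:
            z = torch.randn(real.size(0), z_dim)
            # Critic step: push scores on real images up and on fakes down.
            fake = gen(z, text_emb).detach()
            d_loss = disc(fake, text_emb).mean() - disc(real, text_emb).mean()
            d_opt.zero_grad(); d_loss.backward(); d_opt.step()
            # Generator step: raise the critic's score on fresh fakes.
            g_loss = -disc(gen(z, text_emb), text_emb).mean()
            g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```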

H. Model Evaluation:
Quantitative evaluation metrics are employed to assess the performance of DF-GAN and SA-GAN in text-to-face generation. These metrics provide insights into the effectiveness of each model in capturing contextual and structural nuances.
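
The metrics are not named above; the Fréchet Inception Distance (FID) is a common quantitative choice for this task, sketched here with the torchmetrics package as an assumed, not confirmed, tooling choice:

```python
# FID sketch with torchmetrics (assumed tooling); images are uint8 tensors
# of shape (N, 3, H, W). Random tensors stand in for real/generated faces.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
real = torch.randint(0, 256, (64, 3, 128, 128), dtype=torch.uint8)
fake = torch.randint(0, 256, (64, 3, 128, 128), dtype=torch.uint8)
fid.update(real, real=True)
fid.update(fake, real=False)
print(float(fid.compute()))  # lower FID means fakes are closer to real data
```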

I. Integration of Stages:
The final generation process integrates both stages, with
Stage-I trained on 64 × 64 faces. The output is then passed to
Stage-II for the production of 128 × 128 faces. This
sequential process aims to refine and enhance the generated
facial images.
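
A minimal sketch of this sequential pipeline at inference time, with stage1 and stage2 standing in for the trained Stage-I and Stage-II generators:

```python
# Sketch of the two-stage pipeline at inference time; `stage1` and `stage2`
# are placeholders for the trained Stage-I and Stage-II generators.
import torch

def generate_face(stage1, stage2, text_emb, z_dim=100):
    z = torch.randn(1, z_dim)
    low_res = stage1(z, text_emb)         # (1, 3, 64, 64) coarse face
    high_res = stage2(low_res, text_emb)  # (1, 3, 128, 128) refined face
    return high_res
```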

J. Comparative Analysis:
Generated facial images from DF-GAN and SA-GAN are
compared, highlighting distinctions in their outputs. The
inclusion of AttGAN variations and head rotations further
enriches the comparative analysis.

IV. DATASET AND PREPROCESSING

To enhance the diversity of the CelebA-HQ dataset, which initially comprises around 30,000 faces predominantly featuring American celebrities, our focus is on incorporating broader representation, particularly Indian faces. To achieve this, we implemented a multi-step process. Initially, we sought images of Indian faces from various websites, employing Octoparse and web-scraping techniques for efficient data gathering. Once we had obtained these images, we applied face detection using Haar cascades and cropping methods to isolate the facial features. Manual evaluation and data cleaning were then carried out for each cropped image. The subsequent step involved labelling these faces, specifying attributes such as age, gender, and facial expression. For the task of labelling Indian faces, we chose the MXNet face model, leveraging its accuracy and reliability in capturing diverse facial attributes on the CelebA dataset.

After this meticulous labelling process, our dataset expanded significantly to include a total of 49,000 faces, with 19,000 of them specifically representing Indian faces. The inclusion of Indian faces in our dataset is particularly relevant because our project primarily targets service for police and forensics departments in India; this choice is well suited to the current focus and objectives of our initiative.

Fig. 1. Cultural Infusion into CelebA-HQ

Fig. 2. Annotated Images

V. EXPERIMENTS AND RESULTS

A. Dataset and Training Details:
The model was trained on a dataset comprising a fusion of the CelebA-HQ dataset and an additional 20,000 Indian faces obtained through web scraping, resulting in a curated combined dataset of roughly 50,000 faces. 10,000 random images from the CelebA dataset were used for training and evaluation purposes: the training set consisted of 7,500 images, while the testing set comprised 2,500 images. A batch size of 64 was employed during training. The generator and discriminator used learning rates of 0.0002 and 0.0001, respectively. The Adam optimizer was utilized with parameters β1 = 0.5 and β2 = 0.5 for both the generator and the discriminator.
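
Expressed as code, the reported configuration corresponds to the following sketch; the dataset and networks are stand-ins, and only the split sizes, batch size, learning rates, and betas come from the text above (note the reported β2 = 0.5, which is lower than the 0.999 commonly used for GANs):

```python
# Sketch of the reported configuration; the dataset and networks are
# stand-ins, and only the split sizes, batch size, learning rates, and
# Adam betas are taken from the text above.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

dataset = TensorDataset(torch.randn(10000, 100))  # stand-in for 10,000 images
generator, discriminator = nn.Linear(100, 1), nn.Linear(100, 1)  # stand-ins

train_set, test_set = random_split(dataset, [7500, 2500])  # 75/25 split
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

g_opt = torch.optim.Adam(generator.parameters(), lr=0.0002, betas=(0.5, 0.5))
d_opt = torch.optim.Adam(discriminator.parameters(), lr=0.0001, betas=(0.5, 0.5))
```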

B. Training Approach:
Training commenced with Stage-I, responsible for generating low-resolution outputs. Stage-II was subsequently trained using the outputs from Stage-I to generate high-resolution facial images. During the training process, efforts were made to maintain a uniform distribution of captions from each class, to prevent biased image generation towards specific classes.
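
One way to maintain the uniform caption distribution described above is inverse-frequency sampling; a minimal sketch with toy class labels (the actual label source is not specified in the paper):

```python
# Inverse-frequency sampling keeps caption classes uniformly represented;
# `labels` (one class id per training caption) is a toy placeholder here.
from collections import Counter
from torch.utils.data import WeightedRandomSampler

labels = [0, 0, 0, 1, 2, 2]                  # toy class ids per caption
counts = Counter(labels)
weights = [1.0 / counts[c] for c in labels]  # rarer classes drawn more often
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
# Pass to the loader: DataLoader(dataset, batch_size=64, sampler=sampler)
```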

VI. CONCLUSION AND FUTURE WORK

A. Findings and Conclusion:
The experimental findings demonstrated the model's potential in generating facial images from textual descriptions. The fusion of the CelebA-HQ dataset with Indian faces enabled diverse and realistic image synthesis based on the provided textual cues.

B. Implications and Future Work:
The successful generation of facial images from textual descriptions holds promise for various applications, including creative content generation and facial recognition. Future research directions may include further exploration of augmentation techniques, architecture refinements, and dataset expansion for improved diversity and model performance.

ACKNOWLEDGMENT

We would like to express our gratitude to Dr. Vinodha K, Department of Computer Science and Engineering, PES University, for her continuous guidance, assistance, and encouragement throughout the development of this Capstone Project Phase 2 (UE20CS461A).

We are very grateful to the Capstone Project Coordinators, Dr. Sarasvathi V, Professor, and Dr. Sudeepa Roy Dey, Associate Professor, for organizing, managing, and helping with the entire process.

We take this opportunity to thank Dr. Sandesh B J, Chairperson, Department of Computer Science and Engineering, PES University, for all the knowledge and support we have received from the department. We would also like to thank Dr. B. K. Keshavan, Dean of Faculty, PES University, for his help.

We are deeply grateful to Dr. M. R. Doreswamy, Chancellor, PES University; Prof. Jawahar Doreswamy, Pro Chancellor, PES University; Dr. Suryaprasad J, Vice-Chancellor, PES University; and Prof. Nagarjuna Sadineni, Pro Vice-Chancellor, PES University, for providing us with various opportunities and enlightenment at every step of the way. Finally, this project could not have been completed without the continual support and encouragement we have received from our family and friends.

REFERENCES

[1] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks,” 2017.
[2] M. Tao, H. Tang, F. Wu, X. Jing, B.-K. Bao, and C. Xu, “DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis,” 2022.
[3] M. Zhu, P. Pan, W. Chen, and Y. Yang, “DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis,” 2020.
[4] J. Sun, Q. Deng, Q. Li, M. Sun, M. Ren, and Z. Sun, “AnyFace: Free-style Text-to-Face Synthesis and Manipulation,” 2020.
[5] H. J. Jalan, G. Maurya, and C. Corda, “Suspect Face Generation,” 2020.
[6] O. R. Nasir, S. K. Jha, M. S. Grover, Y. Yu, A. Kumar, and R. R. Shah, “Text2FaceGAN: Face Generation from Fine Grained Textual Descriptions,” 2019.
[7] T. Qiao, J. Zhang, D. Xu, and D. Tao, “MirrorGAN: Learning Text-to-Image Generation by Redescription,” 2021.
[8] B. Li, X. Qi, T. Lukasiewicz, and P. H. S. Torr, “Controllable Text-to-Image Generation,” 2019.
[9] M. Z. Khan, S. Jabeen, M. U. G. Khan, T. Saba, A. Rehmat, A. Rehman, and U. Tariq, “A Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Network,” 2020.
[10] J. Tamrakar and B. K. Nyaupane, “Synthesizing Human Face Image from Textual Description of Facial Attributes Using Attentional Generative Adversarial Network,” 2021.
[11] Y. Jiang, S. Yang, and H. Qiu, “Text2Human: Text-Driven Controllable Human Image Generation,” 2022.
[12] Z. He, W. Zuo, M. Kan, S. Shan, and X. Chen, “AttGAN: Facial Attribute Editing by Only Changing What You Want,” 2019.
[13] A. Asperti, G. Colasuonno, and A. Guerra, “Head Rotation in Denoising Diffusion Models,” 2023.
[14] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-Attention Generative Adversarial Networks,” 2018.
[15] A. Kushwaha, Chanakya P, and K. P. Singh, “Text to Face Generation Using Wasserstein StackGAN,” 2022.
[16] H. Zhou, J. Liu, Z. Liu, Y. Liu, and X. Wang, “Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images,” 2020.
[17] R. Huang, S. Zhang, T. Li, and R. He, “Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis,” 2017.
[18] J. Ravikumar, N. Kumar D N, A. C. Ramachandra, and K. B. Raja, “Face Recognition Based on Frontalization of Multiple Poses Using G-GAN and DWT,” 2021.
[19] A. Melnik, M. Miasayedzenkau, D. Makarovets, D. Pirshtuk, E. Akbulut, D. Holzmann, T. Renusch, G. Reichert, and H. Ritter, “Face Generation and Editing with StyleGAN: A Survey,” 2022.
[20] S.-Y. Chen, F.-L. Liu, Y.-K. Lai, P. L. Rosin, C. Li, H. Fu, and L. Gao, “DeepFaceEditing: Deep Face Generation and Editing with Disentangled Geometry and Appearance Control,” 2021.
[21] X. Hou, L. Shen, O. Patashnik, D. Cohen-Or, and H. Huang, “FEAT: Face Editing with Attention,” 2022.
