
Prompt Evolution for Generative AI: A Classifier-Guided Approach
Melvin Wong1*, Yew-Soon Ong1,2, Abhishek Gupta3, Kavitesh Kumar Bali2, Caishun Chen2
1 School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore
2 Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore
3 Singapore Institute of Manufacturing Technology (SIMTech), Agency for Science, Technology and Research (A*STAR), Singapore

{wong1250, asysong}@ntu.edu.sg, abhishek_gupta@simtech.a-star.edu.sg, {balikk, chen_caishun}@cfar.a-star.edu.sg


arXiv:2305.16347v1 [cs.LG] 24 May 2023

Abstract—Synthesis of digital artifacts conditioned on user prompts has become an important paradigm facilitating an explosion of use cases with generative AI. However, such models often fail to connect the generated outputs and desired target concepts/preferences implied by the prompts. Current research addressing this limitation has largely focused on enhancing the prompts before output generation or improving the model's performance up front. In contrast, this paper conceptualizes prompt evolution, imparting evolutionary selection pressure and variation during the generative process to produce multiple outputs that better satisfy the target concepts/preferences. We propose a multi-objective instantiation of this broader idea that uses a multi-label image classifier-guided approach. The predicted labels from the classifiers serve as multiple objectives to optimize, with the aim of producing diversified images that meet user preferences. A novelty of our evolutionary algorithm is that the pre-trained generative model gives us implicit mutation operations, leveraging the model's stochastic generative capability to automate the creation of Pareto-optimized images more faithful to user preferences.

Index Terms—prompt evolution, generative model, user preference, single-objective, multi-X evolutionary computation

* Corresponding author

I. INTRODUCTION

Generative AI, such as text-to-image diffusion models, has become popular for producing novel artworks conditioned on user prompts [1], [2]. A prompt is usually text expressed in free-form natural language but could also take the form of reference images containing hints of target concepts/preferences. Given such input prompts, pre-trained generative systems infer the semantic concepts in the prompts and synthesize outputs that are (hopefully) representative of the target concepts.

However, a study conducted in [3] found that generative models conditioned using prompts often fail to establish a connection between the generated images and the target concepts. Research addressing this limitation has mainly focused on enhancing the prompts before output generation [4]–[7] or improving the model's performance beforehand [8]–[10]. In this paper, we conceptualize an alternative paradigm, namely, Prompt Evolution (which we shall define in Section II). We craft an instantiation of this broad idea showing that more qualified images can be obtained by re-generating images from the generative model multiple times, evolving the outputs at every generation under the guidance of multi-objective selection pressure imparted by multi-label classifiers evaluating the satisfaction of user preferences. While a powerful deep generative model alone could come up with near Pareto-optimal outputs, our experiments show that the proposed multiobjective evolutionary algorithm helps to diversify the outputs across the Pareto frontier.

II. INTRODUCING PROMPT EVOLUTION

Applications using evolutionary algorithms in digital art can be found since the 1970s [11]. Recent efforts to incorporate modern evolutionary algorithms with deep learning-based approaches for image generation have shown comparable performance to gradient-based methods [11], [12]. Given the generative model's inherent capability to produce natural images, we foresee a potential synergy with the power of evolutionary algorithms in producing diversified yet high-fidelity outputs that are simultaneously faithful to multi-dimensional user preferences. To this end, we conceptualize the idea of prompt evolution, defined as follows:

Definition 1 (Prompt Evolution). Let the outputs of a generative AI be conditioned on prompts. Prompt evolution then serves to impart evolutionary selection pressure and variation to the generative process of the AI, with the aim of producing multiple outputs that best satisfy target concepts/preferences implied by the prompts.

Note that prompt evolution is applicable to different settings including but not limited to single-objective as well as various multi-X [22] formulations (e.g., multi-objective, multi-task, multi-constrained) where a collection of multiple target solutions is sought.

A. Relationship to Background Work

There exist alternative techniques in the literature that aim to satisfy user preferences in generative AI, as described in what follows.

1) Fine Tuning [19]: updates pre-trained model weights at selective layers with respect to user preferences.

2) Prompt Tuning [18]: uses learnable "soft prompts" to condition a frozen pre-trained language model for the text-to-image generation task.
3) Prompt Engineering [5]: is a practice where users modify the prompts and generative model hyperparameters.

4) In-Context Learning [17]: is an emergent behavior in large language models where a frozen language model performs a new task by conditioning on a few examples of the task.

The aforementioned techniques often generate a single output that satisfies user preferences to some extent. However, such techniques involve implicit trade-offs between producing high-fidelity outputs, remaining faithful to the prompt, and generating diverse outputs. In contrast, prompt evolution generates multiple outputs in a single run to provide a well-balanced approach that addresses all these trade-offs simultaneously (as per Definition 1).

III. PROMPT EVOLUTION WITH CLASSIFIER-GUIDANCE

This section presents one instantiation of the broader idea of prompt evolution (Definition 1). In particular, it exemplifies how multi-objective evolutionary selection pressure is imparted to the generative process of AI, producing diverse images that best satisfy user preferences inferred from the prompts.

A. Preliminaries

Let $\mathcal{X} \subset \mathbb{R}_{\geq 0}^{(N \times K \times C)}$ denote a space of images of height $N$, width $K$ and $C$ channels, $\theta$ be the user prompt expressed in free-form natural language, $\mathcal{L} = \{\lambda_1, \ldots, \lambda_Q\}$ be a set of $Q$ user preference labels, and $\mathcal{Y} = [0, 1]^Q$ be a probability space where $y_i$ is the probability that user preference $\lambda_i$ is satisfied. User preferences are simple concepts, such as actions or desired emotion descriptors, that serve to augment the prompts.

Given any image $x \in \mathcal{X}$, we assume to have multi-label classifiers $F : \mathcal{X} \rightarrow \mathcal{Y}$ that can predict the probabilities corresponding to the $Q$ user preference labels. We also assume a conditional deep generative model $G$ from which new samples can be iteratively generated guided by the prompts as:

$$x_t^{[m]} \sim \begin{cases} P_G\left(x_t \mid \theta\right) & t = 0 \\ P_G\left(x_t \mid \theta, x_{t-1}^{[l]}\right) & \text{otherwise} \end{cases} \qquad (1)$$

where $t$ is the iteration count such that $0 \leq t \leq T_{\max}$. Notice the salient feature that the probability distribution of the $m$-th image sample in iteration $t$ is conditioned on the $l$-th image sample at iteration $t-1$. This feature shall be uniquely leveraged as a novel mutation operator in the proposed multiobjective evolutionary algorithm for prompt evolution.

B. Problem Formulation

The Q-objective optimization problem statement for prompt evolution is then:

$$\max_{x} \; F(x) = \left[y_1(x), y_2(x), \cdots, y_Q(x)\right] \quad \text{s.t.} \quad d(x, \theta) \leq b \qquad (2)$$

Here, $b$ is a predefined upper bound constraining the deviation $d(x, \theta)$ between generated outputs $x$ and the user prompt $\theta$. The deviation is computed using:

$$d(x, \theta) = \tau \left[1 - \cos\left(x_{\mathrm{CLIP}}, \theta_{\mathrm{CLIP}}\right)\right] \qquad (3)$$

where $\tau > 0$ is the temperature hyperparameter and $x_{\mathrm{CLIP}}$ and $\theta_{\mathrm{CLIP}}$ are the normalized CLIP [13] embeddings.

[Fig. 1: Workflow of prompt evolution with multi-label classifier-guidance]

Outputs from the multi-label classifiers serve as the fitness function of the evolving images, which imparts a selection pressure on images to better fit user preferences at every generation. The user preferences augment the prompts to reduce the disconnect between the generated images and the desired target concepts. Fig. 1 illustrates the workflow of the multiobjective evolutionary algorithm.

It is worth highlighting that standard evolutionary algorithms typically use simple probabilistic models (e.g., Gaussian distributions) as mutation operators. In contrast, the proposed algorithm in Fig. 1 makes use of the conditional generative model $G$ to implicitly serve as a kind of mutation. To our knowledge, this novelty has not been explored in the literature.

IV. EXPERIMENTAL OVERVIEW

We explored the performance of prompt evolution in generating diverse images from prompts derived from "inspirational proverbs". The chosen user preferences are entities, scenes, actions, or desired emotional responses that the images are intended to evoke. These preferences are implicitly embedded in the prompt. Furthermore, we selected the NSGA-II algorithm [21] and the pre-trained Stable Diffusion model [9] for our experiments. Lastly, we used pre-trained multi-label classifiers to predict the expected entities, concepts, and emotional responses [14]–[16].
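To make the evolutionary loop concrete, the Python sketch below pieces together Eq. (3), the constrained Pareto selection of Eq. (2), and the re-generation mutation of Eq. (1). It is an illustrative simplification under stated assumptions, not the authors' implementation: `mutate`, `classify`, and `embed` are hypothetical stand-ins for re-sampling from the conditional generative model $G$, the multi-label classifiers $F$, and a CLIP image encoder, and selection here keeps only the nondominated feasible images rather than running full NSGA-II with crowding-distance ranking. A two-objective hypervolume helper, the diversity metric of Section V, is sketched for the $Q = 2$ case.

```python
import numpy as np

def clip_deviation(x_emb, theta_emb, tau=1.0):
    """Deviation d(x, theta) = tau * [1 - cos(x_CLIP, theta_CLIP)], Eq. (3)."""
    x = x_emb / np.linalg.norm(x_emb)
    t = theta_emb / np.linalg.norm(theta_emb)
    return tau * (1.0 - float(np.dot(x, t)))

def dominates(a, b):
    """True if fitness vector a Pareto-dominates b (maximization)."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a >= b) and np.any(a > b))

def nondominated(population, fitnesses):
    """Return the subset of the population whose fitness is nondominated."""
    keep = []
    for i, fi in enumerate(fitnesses):
        if not any(dominates(fj, fi) for j, fj in enumerate(fitnesses) if j != i):
            keep.append(population[i])
    return keep

def evolve_one_generation(parents, mutate, classify, embed, theta_emb, b, tau=1.0):
    """One generation of classifier-guided prompt evolution: mutate each
    parent by re-sampling from the generative model (Eq. (1)), discard
    images violating the prompt-deviation constraint d(x, theta) <= b,
    then select the nondominated images under F(x) (Eq. (2))."""
    children = [mutate(p) for p in parents]
    pool = parents + children
    feasible = [x for x in pool if clip_deviation(embed(x), theta_emb, tau) <= b]
    fits = [classify(x) for x in feasible]  # [y_1(x), ..., y_Q(x)]
    return nondominated(feasible, fits)

def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by a 2-objective maximization front w.r.t. a
    reference point; a sketch of the metric used in the comparison study."""
    pts = sorted(front, key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv
```

Iterating `evolve_one_generation` corresponds to one pass through the workflow of Fig. 1; in the paper's instantiation, `mutate` would be an image-to-image re-generation call to Stable Diffusion conditioned on both the prompt and a parent image.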
V. EXPERIMENTAL RESULTS

For each set of text prompts describing a proverb, we identified 2 to 3 user preferences as the multiple objectives to optimize. We ran multiple runs using the same set of text prompts, with each run completing 20 generations of evolution. The pre-trained generative model parameters were frozen throughout all our experiments.

In our results, before prompt evolution, we observed that the pre-trained generative model often failed to depict the multiple target concepts and their dependencies. Hence, the model failed to generate images aligned with user preferences (see Table I). Through evolution, the method was able to fit these preferences increasingly well, producing better variations and spreading across the Pareto frontier with every new generation (see Figure 3). As a result, the evolved images could faithfully capture the expected user preferences.

[Table I: Examples of images generated before and after prompt evolution with multi-label classifier-guidance. (a) Proverb: Practice makes perfect. Prompt: A sketch of a person diligently studying a book under a lamp. Objectives: "person reading book", "lamp above person", "awe". (b) Proverb: A journey of a thousand miles begins with a single step. Prompt: A person standing at the edge of a cliff with a backpack and looking into the distance. Objectives: "person standing on cliff", "person carrying backpack", "excitement". Each example shows images before evolution and after prompt evolution.]

A. Comparison Study

We extend the experiment presented in Figure 3 to compare the diversity of generated images between prompt evolution and a brute-force approach, with and without prompt engineering, using hypervolume [20] as the performance metric. The results are presented in Figure 2. Note that the brute-force approach generates the same number of images as prompt evolution has sampled, from the same pre-trained generative model.

As shown in Figure 2, prompt evolution outperforms the brute-force approach as early as generation 6. This demonstrates that prompt evolution is indeed efficient in finding diversified and near Pareto-optimal images. Moreover, we observed that while prompt engineering in the brute-force approach generates some images more faithful to the prompt, improvements in generating diversified images were marginal, even for a considerable sample size of 1550 images.

[Fig. 2: Comparison of brute-force approach versus prompt evolution with multi-label classifier-guidance. Normalized hypervolume (higher is better) is compared across: brute force with the same prompt, brute force with preferences included, brute force with preferences included and emphasized, prompt evolution at generation 6, and prompt evolution at the last generation (20).]

VI. CONCLUSIONS AND FUTURE RESEARCH

This paper presented prompt evolution as a novel, automated approach to creating images faithful to user preferences. The method synergizes the inherent ability of generative models to produce natural images with the diversification power of evolutionary algorithms. Considerable improvements in the quality of images are achieved, alleviating shortcomings of today's deep models. As highlighted in Section II, prompt evolution need not be limited to multi-objective settings and can also apply to other settings such as single-objective or various multi-X formulations. Additionally, prompt evolution exerts evolutionary selection pressure and variation not only on the generated outputs but also on other forms of guidance and hidden states such as prompts, audio, and latent embeddings, to name a few. Such flexibility within the paradigm indeed presents abundant research opportunities in this emerging field of prompt evolution.

REFERENCES

[1] Crowson, Katherine, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. "VQGAN-CLIP: Open domain image generation and editing with natural language guidance." In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII, pp. 88-105. Cham: Springer Nature Switzerland, 2022.
[2] Rombach, Robin, Andreas Blattmann, and Björn Ommer. "Text-guided synthesis of artistic images with retrieval-augmented diffusion models." arXiv preprint arXiv:2207.13038 (2022).
[3] Chen, Wenhu, Hexiang Hu, Chitwan Saharia, and William W. Cohen. "Re-Imagen: Retrieval-augmented text-to-image generator." arXiv preprint arXiv:2209.14491 (2022).
[4] Wang, Yunlong, Shuyuan Shen, and Brian Y. Lim. "RePrompt: Automatic Prompt Editing to Refine AI-Generative Art Towards Precise Expressions." arXiv preprint arXiv:2302.09466 (2023).
[5] Oppenlaender, Jonas. "A Taxonomy of Prompt Modifiers for Text-to-Image Generation." arXiv preprint arXiv:2204.13988 (2022).
[Fig. 3 scatter plot: Objective 1 "person helping child" versus Objective 2 "child riding bicycle" (higher is better for both) for the prompt "A photo of family members teaching a child to ride a bicycle". Annotated nondominated points include final-evolution images (1) child riding bicycle=0.69, person helping child=0.94 and (2) child riding bicycle=0.74, person helping child=0.90, and initial-population images (3) child riding bicycle=0.93, person helping child=0.57 and (4) child riding bicycle=1.0, person helping child=0.57.]
Fig. 3: Prompt derived from the proverb: “The family that plays together stays together”. The nondominated population of
generated images at the initial and final stages of evolution is shown. Before evolution, most generated images did not satisfy
objective 1. After evolution, a diverse set of images is generated with trade-offs between objectives.

[6] Liu, Vivian, and Lydia B. Chilton. "Design guidelines for prompt engineering text-to-image generative models." In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1-23. 2022.
[7] Ruskov, Martin. "Grimm in Wonderland: Prompt Engineering with Midjourney to Illustrate Fairytales." arXiv preprint arXiv:2302.08961 (2023).
[8] Balaji, Yogesh, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala et al. "eDiffi: Text-to-image diffusion models with an ensemble of expert denoisers." arXiv preprint arXiv:2211.01324 (2022).
[9] Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. "High-resolution image synthesis with latent diffusion models." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684-10695. 2022.
[10] Ramesh, Aditya, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. "Hierarchical text-conditional image generation with CLIP latents." arXiv preprint arXiv:2204.06125 (2022).
[11] Tian, Yingtao, and David Ha. "Modern evolution strategies for creativity: Fitting concrete images and abstract concepts." In Artificial Intelligence in Music, Sound, Art and Design: 11th International Conference, EvoMUSART 2022, Held as Part of EvoStar 2022, Madrid, Spain, April 20–22, 2022, Proceedings, pp. 275-291. Cham: Springer International Publishing, 2022.
[12] Rakotonirina, Nathanaël Carraz, Andry Rasoanaivo, Laurent Najman, Petr Kungurtsev, Jeremy Rapin, Fabien Teytaud, Baptiste Roziere et al. "Many-Objective Optimization for Diverse Image Generation." (2021).
[13] Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry et al. "Learning transferable visual models from natural language supervision." In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.
[14] Achlioptas, Panos, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, and Leonidas J. Guibas. "ArtEmis: Affective language for visual art." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11569-11579. 2021.
[15] Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry et al. "Learning transferable visual models from natural language supervision." In International Conference on Machine Learning, pp. 8748-8763. PMLR, 2021.
[16] Ridnik, Tal, Hussam Lawen, Asaf Noy, Emanuel Ben Baruch, Gilad Sharir, and Itamar Friedman. "TResNet: High performance GPU-dedicated architecture." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1400-1409. 2021.
[17] Xie, Sang Michael, Aditi Raghunathan, Percy Liang, and Tengyu Ma. "An explanation of in-context learning as implicit Bayesian inference." arXiv preprint arXiv:2111.02080 (2021).
[18] Lester, Brian, Rami Al-Rfou, and Noah Constant. "The power of scale for parameter-efficient prompt tuning." arXiv preprint arXiv:2104.08691 (2021).
[19] Bakker, Michiel, Martin Chadwick, Hannah Sheahan, Michael Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese et al. "Fine-tuning language models to find agreement among humans with diverse preferences." Advances in Neural Information Processing Systems 35 (2022): 38176-38189.
[20] Yuan, Yuan, Yew-Soon Ong, Liang Feng, A. Kai Qin, Abhishek Gupta, Bingshui Da, Qingfu Zhang, Kay Chen Tan, Yaochu Jin, and Hisao Ishibuchi. "Evolutionary multitasking for multiobjective continuous optimization: Benchmark problems, performance metrics and baseline results." arXiv preprint arXiv:1706.02766 (2017).
[21] Deb, Kalyanmoy, Amrit Pratap, Sameer Agarwal, and T. A. M. T. Meyarivan. "A fast and elitist multiobjective genetic algorithm: NSGA-II." IEEE Transactions on Evolutionary Computation 6, no. 2 (2002): 182-197.
[22] Gupta, Abhishek, and Yew-Soon Ong. "Back to the roots: Multi-X evolutionary computation." Cognitive Computation 11 (2019): 1-17.
