
ClipFace: Text-guided Editing of Textured 3D Morphable Models

SHIVANGI ANEJA, Technical University of Munich, Germany


JUSTUS THIES, Max Planck Institute for Intelligent Systems, Tübingen, Germany
ANGELA DAI, Technical University of Munich, Germany
MATTHIAS NIESSNER, Technical University of Munich, Germany

[Figure 1: a 3D mesh, its UV texture map, and renderings from multiple viewpoints, with example prompts “Werewolf”, “Mona Lisa”, “Angry”, “Happy”, “Stoned”, “White Walker”, and “Gothic Makeup”; panels show (a) Texture Manipulation and (b) Expression Manipulation.]

Fig. 1. ClipFace learns a self-supervised generative model for jointly synthesizing geometry and texture, leveraging 3D morphable face models, which can be guided by text prompts. For a given 3D mesh with fixed topology, we can generate arbitrary face textures as UV maps (top). The textured mesh can then be manipulated with text guidance to generate a diverse set of textures and geometric expressions in 3D by altering (a) only the UV texture maps for Texture Manipulation and (b) both UV maps and mesh geometry for Expression Manipulation.

We propose ClipFace, a novel self-supervised approach for text-guided editing of textured 3D morphable models of faces. Specifically, we employ user-friendly language prompts to enable control of the expressions as well as the appearance of 3D faces. We leverage the geometric expressiveness of 3D morphable models, which inherently possess limited controllability and texture expressivity, and develop a self-supervised generative model to jointly synthesize expressive, textured, and articulated faces in 3D. We enable high-quality texture generation for 3D faces by adversarial self-supervised training, guided by differentiable rendering against collections of real RGB images. Controllable editing and manipulation are given by language prompts to adapt texture and expression of the 3D morphable model. To this end, we propose a neural network that predicts both texture and expression latent codes of the morphable model. Our model is trained in a self-supervised fashion by exploiting differentiable rendering and losses based on a pre-trained CLIP model. Once trained, our model jointly predicts face textures in UV-space, along with expression parameters to capture both geometry and texture changes in facial expressions in a single forward pass. We further show the applicability of our method to generate temporally changing textures for a given animation sequence.

CCS Concepts: • Computing methodologies → Texturing; Image-based rendering.

Additional Key Words and Phrases: 3D Facial Stylization/Animation, Texture Manipulation, Expression Manipulation

ACM Reference Format:
Shivangi Aneja, Justus Thies, Angela Dai, and Matthias Niessner. 2023. ClipFace: Text-guided Editing of Textured 3D Morphable Models. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Proceedings (SIGGRAPH '23 Conference Proceedings), August 6–10, 2023, Los Angeles, CA, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3588432.3591566

1 INTRODUCTION
Modeling 3D content is central to many applications in our modern digital age, including asset creation for video games and films, as well as mixed reality. In particular, modeling 3D human face avatars is a fundamental element towards digital expression. However, current content creation processes require extensive time from highly-skilled artists in creating compelling 3D face models.

In contrast to implicit representations for human faces [Chan et al. 2021] that do not follow a fixed mesh topology, 3D morphable models present a promising approach for modeling animatable avatars, with popular blendshape models used for human faces (e.g., FLAME [Li et al. 2017]) or bodies (e.g., SMPL [Loper et al. 2015]).


In particular, they offer a compact, parametric representation to model an object, while maintaining a mesh representation that fits the classical graphics pipelines for editing and animation. Additionally, the shared topology of the representation enables deformation and texture transfer capabilities.

Despite such morphable models' expressive capability in geometric modeling and potential practical applicability towards artist creation pipelines, they remain insufficient for augmenting artist workflows. This is due to limited controllability, as they rely on PCA models, and a lack of texture expressiveness, since the models have been built from very limited quantities of 3D-captured textures; both of these aspects are crucial for content creation and visual consumption. We thus address the challenging task of creating a generative model to enable synthesis of expressive, textured, and articulated human faces in 3D.

We propose ClipFace to enable controllable generation and editing of 3D faces. We leverage the geometric expressiveness of 3D morphable models, and introduce a self-supervised generative model to jointly synthesize textures and adapt expression parameters of the morphable model. To facilitate controllable editing and manipulation, we exploit the power of vision-language models [Radford et al. 2021] to enable user-friendly generation of diverse textures and expressions in 3D faces. This allows us to specify facial expressions as well as the appearance of the human via text while maintaining a clean 3D mesh representation that can be consumed by standard graphics applications. Such text-based editing enables intuitive control over the content creation process.

Our generative model is trained in a self-supervised fashion, leveraging the availability of large-scale face image datasets with differentiable rendering to produce a powerful texture generator that can be controlled along with the morphable model geometry by text prompts. Based on our texture generator, we learn a neural network that can edit the texture latent code as well as the expression parameters of the 3D morphable model with text prompt supervision and losses based on CLIP. Our approach further enables generating temporally varying textures for a given driving expression sequence.

The main focus of our work lies in enabling text-guided editing and control of 3D morphable face models. As recent 3D face models [Gerig et al. 2017; Li et al. 2017] are limited in their texture space, we propose a generative model to synthesize UV textures with a realistic appearance; this texture space is a pre-requisite for our text-guided editing. To summarize, we present the following contributions:

• We propose a novel approach to controllable editing of textured, parametric 3D morphable models through user-friendly text prompts, by exploiting CLIP-based supervision to jointly synthesize texture and expressions of a 3D face model.
• The controllable 3D face model is supported by our texture generator, trained in a self-supervised fashion on 2D images only.
• Our approach additionally enables generating temporally varying textures of an animated 3D face model from a driving video sequence.

2 RELATED WORK

2.1 Texture Generation
There is a large corpus of research works in the field of generative models for UV textures [Gecer et al. 2021a; Gecer et al. 2020; Gecer et al. 2019, 2021b; Lattas et al. 2020, 2021; Lee et al. 2020; Li et al. 2020; Luo et al. 2021; Wang et al. 2022b]. These methods achieve impressive results; however, the majority are fully supervised in nature, requiring ground-truth textures, which in turn necessitate collection in a controlled capture setting. Learning self-supervised texture generation is much more challenging, and only a handful of methods exist. For instance, Marriott et al. [2021] were among the first to leverage Progressive GANs [Karras et al. 2018] and 3D Morphable Models [Blanz and Vetter 1999] to generate textures for facial recognition; however, the generated textures remain relatively low resolution and are not suitable for language-driven texture and expression manipulations.

Textures generated by Slossberg et al. [2022] made a significant improvement in quality by using a pretrained StyleGAN [Karras et al. 2020b] and StyleRig [Tewari et al. 2020a]. The closest inspiration to our texture generator is StyleUV [Lee et al. 2020], which also operates on a mesh. Both methods achieve stunning results, but do not take the head and ears into account, which limits their practical applicability to directly use them as assets in games and movies. In our work, we propose a generative model to synthesize UV textures for the full-head topology to enable text-guided editing and control of 3D morphable face models.

2.2 Semantic Manipulation of Facial Attributes
Facial manipulation has also seen significant studies following the success of StyleGAN2 [Karras et al. 2020b] image generation. In particular, its disentangled latent space facilitates texture editing as well as enables a level of control over pose and expressions of generated images. Several methods [Abdal et al. 2021; Deng et al. 2020; Ghosh et al. 2020; Kowalski et al. 2020; Liu et al. 2022; Tewari et al. 2020a,b] have made significant progress to induce controllability to images by embedding 3D priors via conditioning StyleGAN on known facial attributes extracted from synthetic face renderings. However, these methods operate on 2D images, and although they achieve high-quality results on a per-frame basis, consistent and coherent rendering from varying poses and expressions remains challenging. Motivated by such impressive 2D generators, we propose to lift them to 3D and directly operate in the UV space of a 3D mesh, producing temporally-consistent results when rendering animation sequences.

2.3 Text-Guided Image Manipulation
Recent progress in 2D language models has opened up significant opportunities for text-guided image manipulation [Abdal et al. 2022; Avrahami et al. 2022; Bau et al. 2021; Crowson 2021; Dayma et al. 2021; Ramesh et al. 2022, 2021; Ruiz et al. 2022]. For instance, the contrastive language-image pre-training (CLIP) model [Radford et al. 2021] has been used for text-guided editing in a variety of 2D/3D applications [Canfes et al. 2022; Gal et al. 2021; Hong et al. 2022; Khalid et al. 2022; Kocasari et al. 2022; Michel et al. 2022; Patashnik et al. 2021; Petrovich et al. 2022; Wang et al. 2022a; Wei et al. 2022; Youwang et al. 2022].



Fig. 2. Texture Generation: We learn self-supervised texture generation from collections of 2D images. An RGB image is encoded by the pretrained encoder DECA [Feng et al. 2021] to extract shape 𝜷, pose 𝜽, and expression 𝝍 coefficients in FLAME's latent space, which are decoded by FLAME [Li et al. 2017] to deformed mesh vertices. The background and mouth interior are then masked out, generating the 'Real Image' for our adversarial formulation. In parallel, a latent code z ∈ R512 is sampled from a Gaussian distribution N(0, I) and input to the mapping network M to generate the intermediate latent w ∈ R512×18, which is used by the synthesis network G to generate the UV texture image. The predicted texture is differentiably rendered on a randomly deformed FLAME mesh to generate the 'Fake Image.' Two discriminators interpret the generated and masked real image, at full resolution and at patch size 64 × 64. Frozen models are denoted in blue, and learnable networks in green.

StyleClip [Patashnik et al. 2021] presented seminal advances in stylizing human face images by leveraging the expressive power of CLIP in combination with the generative power of StyleGAN to produce unique manipulations for faces. This was followed by StyleGAN-NADA [Gal et al. 2021], which enables adapting image generation to a remarkable diversity of styles from various domains, without requiring image examples from those domains. However, these manipulations are designed for the image space, and are not 3D consistent.

2.4 Text-Guided 3D Manipulation
Following the success of text-guided image manipulation, recent works have adopted powerful vision-language models to enable text guidance for 3D object manipulation. Text2Mesh [Michel et al. 2022] was one of the first pioneering methods to leverage a pre-trained 2D CLIP model as guidance to generate language-conditioned 3D mesh textures and geometric offsets. Here, edits are realized as part of a test-time optimization that aims to solve for the texture and mesh offsets in a neural field representation, such that their re-renderings minimize a 2D CLIP loss from different viewpoints. Similar to Text2Mesh, CLIP-Mesh [Khalid et al. 2022] produces textured meshes by jointly estimating texture and deformation of a template mesh, based on text inputs using CLIP. Recently, Canfes et al. [2022] adapted TB-GAN [Gecer et al. 2020], an expression-conditioned generative model that produces UV-texture maps, with a CLIP loss to produce facial expressions in 3D, although the quality of textures and expressions is limited, due to the reliance on 3D scan data for TB-GAN training. Our method is inspired by these lines of research; however, our focus lies on leveraging the parametric representations of 3D morphable face models with our StyleGAN-based texture generator, which can enable content creation for direct use in many applications such as games or movies.

3 METHOD
ClipFace targets text-guided synthesis of textured 3D face models. It consists of two fundamental components: (i) an expressive generative texture space for facial appearances (Sec. 3.1), and (ii) a text-guided prediction of the latent codes for the texture generator and the expression parameters of the underlying statistical morphable model (Sec. 3.2). In the following, we detail these contributions and further demonstrate how they enable producing temporally changing textures for a given animation sequence (Sec. 3.3).

3.1 Generative Synthesis of Face Appearance
Since there do not exist any large-scale datasets of UV textures, we propose a self-supervised method to learn the appearance manifold of human faces, as depicted in Fig. 2. Rather than learning from ground-truth UV textures, we instead leverage large-scale RGB image datasets of faces, which we use in an adversarial formulation through differentiable rendering. For our experiments, we use the FFHQ dataset [Karras et al. 2019], which consists of 70,000 diverse, high-quality images. As we focus on the textures of the human head, we remove images that contain headwear (caps, scarfs, etc.) and eyewear (sunglasses and spectacles) using face parsing [zllrunning 2018], resulting in a subset of 45,000 images. Based on this data, we train a StyleGAN-ADA [Karras et al. 2020a] generator to produce UV textures that, when rendered on top of the FLAME mesh [Li et al. 2017], result in realistic imagery.

More specifically, we use the FLAME model as our shape prior to produce different geometric shapes and facial expressions. It can be defined as:

F(𝜷, 𝜽, 𝝍): R|𝜷|×|𝜽|×|𝝍| → R3N,    (1)

where 𝜷 ∈ R100 are the shape parameters, 𝜽 ∈ R6 refers to the jaw and head pose, and 𝝍 ∈ R50 are the expression coefficients.
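To make this interface concrete, below is a minimal, self-contained PyTorch sketch of a FLAME-like forward pass. The class, its random linear bases, and the omission of pose-driven skinning are illustrative simplifications rather than the actual FLAME implementation; only the parameter dimensionalities (100 shape, 6 jaw/head pose, 50 expression coefficients) and the vertex count follow the paper.

```python
import torch

# Hypothetical stand-in for a FLAME-style morphable model (Eq. 1).
# Only the parameter dimensions follow the paper; a real FLAME layer
# uses learned PCA/blendshape bases and linear blend skinning.
class MorphableFaceModel(torch.nn.Module):
    def __init__(self, num_vertices: int = 5023):
        super().__init__()
        self.template = torch.nn.Parameter(torch.zeros(num_vertices, 3), requires_grad=False)
        self.shape_basis = torch.nn.Parameter(torch.randn(100, num_vertices, 3) * 1e-3, requires_grad=False)
        self.expr_basis = torch.nn.Parameter(torch.randn(50, num_vertices, 3) * 1e-3, requires_grad=False)

    def forward(self, beta: torch.Tensor, theta: torch.Tensor, psi: torch.Tensor) -> torch.Tensor:
        # beta: (B, 100), theta: (B, 6), psi: (B, 50) -> vertices: (B, N, 3)
        verts = self.template.unsqueeze(0)
        verts = verts + torch.einsum("bk,kvc->bvc", beta, self.shape_basis)
        verts = verts + torch.einsum("bk,kvc->bvc", psi, self.expr_basis)
        # A real FLAME layer would additionally apply skinning driven by the
        # jaw/head pose theta; omitted here for brevity.
        return verts

flame = MorphableFaceModel()
beta, theta, psi = torch.zeros(1, 100), torch.zeros(1, 6), torch.zeros(1, 50)
vertices = flame(beta, theta, psi)  # (1, N, 3) deformed mesh vertices
```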


Fig. 3. Text-guided Synthesis of Textured 3D Face Models: From a given textured mesh with texture code winit , we synthesize various styles by adapting
both texture and expression to the target text prompt. winit is input to the texture mappers T = [ T 1 , ..., T 18 ] to obtain texture offsets wdelta ∈ R512×18 for 18
different levels of winit . The expression mapper E takes mean latent code wmean = ∥winit + wdelta ∥ 2 as input, and predicts expression offset 𝝍 delta to obtain
deformed mesh geometry Mtgt . The generated UV map 𝑇tgt and deformed mesh Mtgt are differentiably rendered to generate styles that fit the text prompt.

To recover the distribution of face shapes and expressions from the training dataset, we employ DECA [Feng et al. 2021], a pretrained encoder that takes an RGB image as input and outputs the corresponding FLAME parameters 𝜷, 𝜽, 𝝍, including orthographic camera parameters c. We use the recovered parameters to remove the backgrounds from the original images, and only keep the image region that is covered by the corresponding face model. Using this distribution of face geometries and camera parameters D ∼ [𝜷, 𝜽, 𝝍, c], along with the masked real samples, we train the StyleGAN network using differentiable rendering [Laine et al. 2020]. We sample a latent code z ∈ R512 from a Gaussian distribution N(0, I) to generate the intermediate latent code w ∈ R512×18 using a mapping network M: w = M(z). This latent code w is passed to the synthesis network G to generate a UV texture map T ∈ R512×512×3: T = G(w). The predicted texture T is then rendered on a randomly sampled deformed mesh from our discrete distribution of face geometries D. We use an image resolution of 512 × 512.

Both the generated image and the masked real image are then passed to the discriminator during training. To further improve the texture quality, we use a patch discriminator alongside a full-image discriminator, which encourages high-fidelity details in local regions of the rendered images. We apply image augmentations (e.g., color jitter, image flipping, hue/saturation changes) to both full images and image patches before feeding them to the discriminator. The patch size is set to 64 × 64 for all of our experiments. Note that the patch discriminator is critical to producing high-frequency texture details; see Section 4.
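For illustration, the following PyTorch-style sketch outlines one adversarial training iteration under the setup described above. The mapping/synthesis networks, the two discriminators, the augmentation function, and the differentiable_render call are hypothetical placeholders standing in for the StyleGAN-ADA and NvDiffrast components, and the non-saturating GAN loss is a common default rather than the paper's exact objective; in practice the generator and discriminators are updated in alternating steps.

```python
import torch
import torch.nn.functional as F

def adversarial_texture_step(mapping_net, synthesis_net, d_full, d_patch,
                             differentiable_render, flame_params, camera,
                             real_masked, augment, patch=64):
    """One simplified generator/discriminator loss computation for texture learning.

    mapping_net / synthesis_net: StyleGAN-style M and G (placeholders).
    differentiable_render: placeholder for rendering a textured FLAME mesh.
    flame_params, camera: a sample from the recovered distribution D.
    real_masked: FFHQ image with background and mouth interior masked out.
    """
    B = real_masked.shape[0]
    z = torch.randn(B, 512, device=real_masked.device)               # z ~ N(0, I)
    w = mapping_net(z)                                               # (B, 18, 512)
    uv_texture = synthesis_net(w)                                    # (B, 3, 512, 512)
    fake = differentiable_render(flame_params, camera, uv_texture)   # (B, 3, 512, 512)

    def random_patch(img):
        y = torch.randint(0, img.shape[-2] - patch + 1, (1,)).item()
        x = torch.randint(0, img.shape[-1] - patch + 1, (1,)).item()
        return img[..., y:y + patch, x:x + patch]

    # Augmented full images and 64x64 crops for the two discriminators.
    real_full, fake_full = augment(real_masked), augment(fake)
    real_crop, fake_crop = augment(random_patch(real_masked)), augment(random_patch(fake))

    # Non-saturating GAN losses (one common choice).
    d_loss = (F.softplus(-d_full(real_full)) + F.softplus(d_full(fake_full.detach())) +
              F.softplus(-d_patch(real_crop)) + F.softplus(d_patch(fake_crop.detach()))).mean()
    g_loss = (F.softplus(-d_full(fake_full)) + F.softplus(-d_patch(fake_crop))).mean()
    return d_loss, g_loss
```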
3.2 Text-guided Synthesis of Textured 3D Models
For a given textured mesh with texture code winit = {w1init, ..., w18init} ∈ R512×18 in neutral pose 𝜽init and neutral expression 𝝍init, our goal is to learn optimal offsets wdelta, 𝝍delta for texture and expression, respectively, defined through text prompts. As a source of supervision, we use a pretrained CLIP [Radford et al. 2021] model due to its high expressiveness, and formulate the offsets as:

w*delta, 𝝍*delta = arg min_{wdelta, 𝝍delta} Ltotal,    (2)

where Ltotal formulates CLIP guidance and expression regularization, as defined in Eq. 9.

In order to optimize this loss, we learn a texture mapper T = [T1, ..., T18] and an expression mapper E. The texture mapper predicts the latent variable offsets across the different levels {1, 2, ..., 18} of the StyleGAN generator:

wdelta = [T1(w1init); T2(w2init); ...; T18(w18init)].    (3)

The expression mapper E learns the expression offsets; it takes as input wmean = ∥winit + wdelta∥2, where wmean ∈ R512 is the mean over the 18 different levels of the latent space, and outputs the expression offsets 𝝍delta:

𝝍delta = E(wmean).    (4)

We notice that it is critical to design separate texture and expression mappers to maintain disentangled texture and expression spaces. Conditioning the expression mapper on texture codes correlates them meaningfully for realistic expression generation, significantly improving generation quality. We show results for different input conditions in the supplemental. We use a 4-layer MLP architecture with LeakyReLU activations for the mappers. The method is shown in Fig. 3.
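A minimal sketch of the mapper networks described above: 18 per-level texture mappers and one expression mapper, each a 4-layer MLP with LeakyReLU activations as stated in the paper. The hidden width and the interpretation of wmean as the average over the 18 levels are assumptions.

```python
import torch
import torch.nn as nn

def make_mapper(in_dim: int, out_dim: int, hidden: int = 512) -> nn.Sequential:
    # 4-layer MLP with LeakyReLU activations, as used for all mappers.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
        nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
        nn.Linear(hidden, out_dim),
    )

# One texture mapper per StyleGAN level (Eq. 3) and one expression mapper (Eq. 4).
texture_mappers = nn.ModuleList([make_mapper(512, 512) for _ in range(18)])
expression_mapper = make_mapper(512, 50)   # outputs psi_delta (50 FLAME expression coeffs)

def predict_offsets(w_init: torch.Tensor):
    # w_init: (B, 18, 512) initial texture code in W+ space.
    w_delta = torch.stack(
        [texture_mappers[i](w_init[:, i]) for i in range(18)], dim=1)  # (B, 18, 512)
    w_mean = (w_init + w_delta).mean(dim=1)   # assumed: average over the 18 levels
    psi_delta = expression_mapper(w_mean)     # (B, 50)
    return w_delta, psi_delta
```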
Naively using a CLIP loss as in StyleClip [Patashnik et al. 2021] to train the mappers tends to result in unwanted identity and/or illumination changes in the texture. Thus, we draw inspiration from [Gal et al. 2021], and leverage the CLIP-space direction between the initial style and the to-be-performed manipulation in order to perform consistent and identity-preserving manipulation.


Fig. 4. Texture Manipulation for Animation Sequences: Given a video sequence V = [𝝍^1:T; 𝜽^1:T] of T frames, an initial texture code winit, and a text prompt, we synthesize a 3D textured animation to match the text. We concatenate V to winit across different timestamps to obtain e^t, which is input to the time-shared texture mappers T to obtain time-dependent texture offsets wdelta^1:T for all frames. The new texture codes wtgt^1:T, generated using importance weighting, are then passed to our texture generator G to obtain time-dependent UV textures Ttgt^1:T, which are then differentiably rendered to generate the final animation, guided by the CLIP loss across all frames.

We compute the 'text-delta' direction Δt in CLIP-space between the initial text prompt tinit and the target text prompt ttgt, indicating which attributes from the initial style should be changed:

Δt = ET(ttgt) − ET(tinit),    (5)

where ET refers to the CLIP text encoder. We use the same initial text tinit = 'A photo of a face' for all our experiments and alter the target text prompt ttgt, depending on the desired style change. For example, to generate a Mona Lisa style texture, we use the text prompt ttgt = 'A photo of a face that looks like Mona Lisa'. Guided via the CLIP-space image direction between the initial rendered image iinit and the target rendered image itgt generated using our texture generator, we train the mapping networks to predict the style specified by the target text prompt ttgt:

Δi = EI(itgt) − EI(iinit),    (6)

where iinit is the image rendered with the initial texture and initial FLAME parameters [winit, 𝜷, 𝜽, 𝝍init], itgt is the image with the target texture and target FLAME parameters [wtgt, 𝜷, 𝜽, 𝝍tgt], and EI refers to the CLIP image encoder.

Note that we do not alter the pose 𝜽 and shape code 𝜷 of the FLAME model. The CLIP loss Lclip is then computed as:

Lclip = 1 − (Δi · Δt) / (|Δi| |Δt|).    (7)

In order to prevent the mesh from taking unrealistic expressions, we further regularize the expressions using the Mahalanobis prior:

Lreg = 𝝍^T Σ𝜓^{-1} 𝝍,    (8)

where Σ𝜓 is the diagonal expression covariance matrix of the FLAME model. As we show in our results, this regularization is critical to prevent the 3D morphable model from taking an unrealistic shape. The full training loss can then be written as:

Ltotal = Lclip + λreg Lreg.    (9)

Note that we can also alter only the texture without changing expressions, by keeping the expression mapper frozen and not fine-tuning it. We pre-train the mapper networks to predict zero-offsets (details in the supplemental).
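For concreteness, here is a hedged sketch of how the directional CLIP objective (Eqs. 5–9) can be evaluated with the open-source CLIP package. The rendered images and the diagonal expression covariance are assumed to come from the rest of the pipeline; prompt strings beyond the paper's 'A photo of a face' example and the λreg weight are illustrative.

```python
import torch
import clip  # OpenAI CLIP package (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # variant used for supervision in the paper

def directional_clip_loss(img_init, img_tgt, text_init, text_tgt):
    # Text-delta direction (Eq. 5) and image-delta direction (Eq. 6).
    tokens = clip.tokenize([text_init, text_tgt]).to(device)
    t_feat = model.encode_text(tokens).float()
    delta_t = t_feat[1] - t_feat[0]

    # Images are assumed pre-processed to 224x224 with CLIP normalization.
    i_feat_init = model.encode_image(img_init).float()
    i_feat_tgt = model.encode_image(img_tgt).float()
    delta_i = i_feat_tgt - i_feat_init

    # Eq. 7: one minus cosine similarity between the two directions.
    return 1.0 - torch.nn.functional.cosine_similarity(delta_i, delta_t.unsqueeze(0)).mean()

def expression_prior(psi, sigma_psi_diag):
    # Eq. 8: Mahalanobis prior with a diagonal expression covariance (assumed given).
    return (psi ** 2 / sigma_psi_diag).sum(dim=-1).mean()

# Eq. 9: total loss; lambda_reg is a tunable weight (value not specified here).
# loss_total = directional_clip_loss(i_init, i_tgt, "A photo of a face",
#                                    "A photo of a face that looks like Mona Lisa") \
#              + lambda_reg * expression_prior(psi, sigma_psi_diag)
```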
3.3 Texture Manipulation for Video Sequences
Given an expression video sequence, we propose a novel technique to manipulate the textures for every frame of the video, guided by a CLIP loss (see Fig. 4). That is, for a given animation sequence V = [𝜽^1:T; 𝝍^1:T] of T frames, with expression codes 𝝍^1:T = [𝝍^1, 𝝍^2, ..., 𝝍^T], pose codes 𝜽^1:T = [𝜽^1, 𝜽^2, ..., 𝜽^T], and a given texture code winit, we use a multi-layer perceptron as our texture mapper T = [T1, ..., T18] to generate time-dependent texture offsets wdelta^1:T for different levels of the texture latent space. This mapper receives as input e^1:T, the concatenation of the initial texture code winit with the time-dependent expression and pose code [𝝍^t; 𝜽^t]. Mathematically, we have:

e^1:T = [e^1, ..., e^T],    (10)
e^t = [winit; 𝝍^t; 𝜽^t],    (11)

where 𝝍^t and 𝜽^t refer to the expression and pose code at timestamp t extracted from sequence V. Next, we pass e^1:T to the time-shared texture mapper T to obtain texture offsets wdelta^1:T. To ensure a coherent animation and smooth transitions across frames, we weight the predicted offsets wdelta^1:T using importance weights I = [i^1, ..., i^T] extracted from the video sequence V, before adding them to winit:

wtgt^t = winit + i^t · wdelta^t.    (12)

We compute the importance weights by measuring the deviation between the neutral shape [𝜽neutral; 𝝍neutral] and the per-frame face shape [𝜽^t; 𝝍^t], followed by min-max normalization:

i^t = (δ^t − min(𝜹^1:T)) / (max(𝜹^1:T) − min(𝜹^1:T)),    (13)


with δ^t = ∥[𝜽neutral; 𝝍neutral] − [𝜽^t; 𝝍^t]∥2. The importance weighting ensures that key frames with strong expressions are emphasized. The predicted target latent codes wtgt^1:T are then used to generate the UV maps Ttgt^1:T, which are differentiably rendered onto the given animation sequence V.

To train the texture mapper T, we minimize the CLIP loss (Eq. 7) for the given text prompt ttgt and the rendered frames itgt^t aggregated over T timesteps for all the frames of the video:

wdelta^1:T = arg min_{wdelta^1:T} Σ_{t=1}^{T} Lclip(ttgt, itgt^t).    (14)
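A small sketch of the importance weighting and re-weighted texture codes (Eqs. 12–13), assuming the pose and expression codes are stacked as tensors; the exact tensor layout is an assumption.

```python
import torch

def importance_weights(theta_seq, psi_seq, theta_neutral, psi_neutral):
    # theta_seq: (T, 6), psi_seq: (T, 50); neutral codes: (6,), (50,).
    shape_seq = torch.cat([theta_seq, psi_seq], dim=-1)               # (T, 56)
    shape_neutral = torch.cat([theta_neutral, psi_neutral], dim=-1)   # (56,)
    delta = torch.linalg.norm(shape_neutral - shape_seq, dim=-1)      # per-frame deviation delta^t
    return (delta - delta.min()) / (delta.max() - delta.min() + 1e-8) # min-max normalized i^t (Eq. 13)

def weighted_texture_codes(w_init, w_delta_seq, weights):
    # w_init: (18, 512), w_delta_seq: (T, 18, 512), weights: (T,)
    # Eq. 12: w_tgt^t = w_init + i^t * w_delta^t
    return w_init.unsqueeze(0) + weights[:, None, None] * w_delta_seq  # (T, 18, 512)

# Example with random stand-in data for a T = 30 frame sequence.
T = 30
theta_seq, psi_seq = torch.randn(T, 6), torch.randn(T, 50)
weights = importance_weights(theta_seq, psi_seq, torch.zeros(6), torch.zeros(50))
w_tgt = weighted_texture_codes(torch.randn(18, 512), torch.randn(T, 18, 512), weights)
```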

4 RESULTS
We evaluate ClipFace on the tasks of texture generation, text-guided synthesis of textured 3D face models, and text-guided manipulation of animation sequences. For texture generation, we evaluate on the standard GAN metrics FID and KID. We evaluate both of these metrics with respect to masked FFHQ images [Karras et al. 2019] (with background and mouth interior masked out) as ground truth, and generated textures rendered at 512 × 512 resolution, for ≈ 45K rendered textures and ground-truth images. For text-guided manipulation, we evaluate perceptual quality using FID & KID and similarity to the text prompt using the CLIP score, which is evaluated as the cosine similarity to the text prompt using pre-trained CLIP models. We use two different CLIP variants, 'ViT-B/16' and 'ViT-L/14', each on 224 × 224 pixels as input. We report average scores for these pre-trained variants.
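As an illustration, such a CLIP score can be computed with the open-source CLIP package roughly as follows; averaging over the 'ViT-B/16' and 'ViT-L/14' variants follows the text, while the file-based interface is illustrative.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

@torch.no_grad()
def clip_score(image_path: str, prompt: str, variants=("ViT-B/16", "ViT-L/14")) -> float:
    """Average cosine similarity between a rendered image and a text prompt."""
    scores = []
    for name in variants:
        model, preprocess = clip.load(name, device=device)
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)  # resized to 224x224
        text = clip.tokenize([prompt]).to(device)
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores.append((img_feat * txt_feat).sum().item())
    return sum(scores) / len(scores)

# Hypothetical usage on one rendered frame:
# print(clip_score("render.png", "A photo of a face that looks like Mona Lisa"))
```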

4.1 Implementation Details
For our texture generator, we produce 512 × 512 texture maps. We use an Adam optimizer with a learning rate of 2e-3, batch size 8, gradient penalty 10, and path length regularization 2 for all our experiments. We use learning rates of 0.005 and 0.0001 for the expression and texture mappers, respectively, also using Adam. For differentiable rendering, we use NvDiffrast [Laine et al. 2020]. For the patch discriminator, we use a patch size of 64 × 64. We train for 300,000 iterations until convergence. For the text-guided manipulation experiments, we use the same model architecture for the expression and texture mappers, a 4-layer MLP with LeakyReLU activations. For CLIP supervision, we use the pretrained 'ViT-B/32' variant. For text manipulation tasks, we train for 20,000 iterations.
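Assuming standard PyTorch optimizers, the reported hyperparameters translate roughly into the setup below; the stand-in modules and the Adam betas (StyleGAN2-ADA defaults) are assumptions, and the gradient-penalty and path-length terms enter as additional loss weights rather than optimizer options.

```python
import torch
import torch.nn as nn

# Small stand-in modules; the real networks are the StyleGAN-ADA generator
# and the texture/expression mapper MLPs described in Sec. 3.2.
generator = nn.Linear(512, 512)
texture_mappers = nn.ModuleList([nn.Linear(512, 512) for _ in range(18)])
expression_mapper = nn.Linear(512, 50)

# Texture generator: Adam, lr 2e-3, batch size 8; gradient penalty (weight 10) and
# path-length regularization (weight 2) are applied as extra loss terms.
g_optimizer = torch.optim.Adam(generator.parameters(), lr=2e-3, betas=(0.0, 0.99))

# Mappers for text-guided manipulation: Adam with lr 0.0001 (texture) and 0.005 (expression).
tex_optimizer = torch.optim.Adam(texture_mappers.parameters(), lr=1e-4)
exp_optimizer = torch.optim.Adam(expression_mapper.parameters(), lr=5e-3)
```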
4.2 Texture Generation
We evaluate the quality of our generated textures and compare with existing unsupervised texture generation methods in Tab. 1 and Fig. 5. ClipFace outperforms the other baselines in perceptual quality. Although Slossberg et al. [2022] can obtain good textures for the interior face region, it does not synthesize the head and ears.

[Figure 5 rows: FlameTex, Slossberg et al., Ours (w/o Patch), Ours (w/ Patch).]

Fig. 5. Comparison against unsupervised texturing methods. The left column shows the mesh geometry, followed by the textured mesh and corresponding UV map. FlameTex [Feng 2019] generates textures for the full head region; however, the texture quality is fairly limited. Slossberg et al. [2022] generates plausible textures, but does not take the head and ears into account, limiting its practical applicability. Our approach is able to synthesize diverse texture styles ranging across different skin colors, ethnicities, and demographics. Note that the patch discriminator helps not only to improve texture quality but also to correctly align the UV texture with the mesh geometry, especially around the mouth region.

Table 1. Quantitative evaluation of texture quality. Our approach significantly outperforms baselines in both FID and KID scores.

Method                     FID ↓    KID ↓
FlameTex [Feng 2019]       76.627   0.063
Slossberg et al. [2022]    32.794   0.021
Ours (w/o Patch)           16.640   0.013
Ours (w/ Patch)             9.559   0.006

4.3 Texture & Expression Manipulation
We compare with CLIP-based texturing techniques for texture manipulation in Fig. 6 and Tab. 2. Note that for comparisons with Text2Mesh [Michel et al. 2022], we follow the authors' suggestion to first perform remeshing to increase the vertices from 5023 to 60,000 before optimization.


[Figure 6 panels: rows show Latent3D, Clip-Matrix, FlameTex, Text2Mesh, and Ours; columns show the prompts “Original”, “Blue Eyes”, “Mona Lisa”, “Wrinkles”, “Red Lipstick”, “Mark Zuckerberg”, “Disney Princess”, and “Pixar Character”.]

Fig. 6. Qualitative comparison on texture manipulation: We compare our method against several 3D texturing methods: Latent3D [Canfes et al. 2022], Clip-Matrix [Jetchev 2021], FlameTex [Feng 2019], and Text2Mesh [Michel et al. 2022]. Our method obtains consistently superior textures in comparison to the baselines, and is even capable of deftly adapting identity when guided by the text prompt.

Our approach generates consistently high-quality textures for various prompts, in comparison to the baselines. In particular, our texture generator enables effective editing even in small face regions (e.g., lips and eyes). While Text2Mesh yields a high CLIP score, it produces semantically implausible results, as the specified text prompts highly match rendered colors irrespective of the global face context (i.e., which region should be edited). In contrast, our method generates perceptually high-quality face textures, which is evident in the perceptual KID metric.

We show additional ClipFace texturing results on a wide variety of prompts, including fictional characters, in Fig. 7, demonstrating our expressive power.

Furthermore, we show results for expression manipulation in Fig. 8. ClipFace faithfully deforms face geometry and texture to match a variety of text prompts, where the expression regularization maintains plausible geometry and the directional loss enables balanced adaptation of geometry and texture. We refer to the supplemental for more visuals.

Table 2. Evaluation of text manipulation. ClipFace effectively matches text prompts while maintaining high perceptual fidelity.

Method                           FID ↓    KID ↓    CLIP Score ↑
Latent3D [Canfes et al. 2022]    205.27   0.260    0.227 ±0.041
FlameTex [Feng 2019]              88.95   0.053    0.235 ±0.053
ClipMatrix [Jetchev 2021]        198.34   0.180    0.243 ±0.049
Text2Mesh [Michel et al. 2022]   219.59   0.185    0.264 ±0.044
Ours                              80.34   0.032    0.251 ±0.059

4.4 Texture Manipulation for Video Sequences
Finally, we show results for texture manipulation for given animation sequences in Fig. 9. ClipFace can produce more expressive animation compared to a constant texture, which looks monotonic. We show results for only two frames; however, we refer readers to the supplemental video for more detailed results.

5 LIMITATIONS
Although ClipFace can generate high-quality textures and expressions, it still has various limitations.


For instance, our method does not capture accessories like jewelry, headwear, or eyewear, due to our use of the FLAME [Li et al. 2017] model, which does not represent accessories or complex hair. We believe that this could be further improved by augmenting parametric 3D models with artist-designed assets for 3D hair, headwear, or eyewear.

6 CONCLUSION
In this paper, we have introduced ClipFace, a novel approach to enable text-guided editing of textured 3D morphable face models. We jointly synthesize high-quality textures and adapt geometry based on the expressions of the morphable model, in a self-supervised fashion. This enables compelling 3D face generation across a variety of textures, expressions, and styles, based on user-friendly text prompts. We further demonstrate the ability of ClipFace to synthesize animation sequences, driven by a guiding video sequence. We believe this is an important first step towards enabling controllable, realistic texture and expression modeling for 3D face models, dovetailing with conventional graphics pipelines, which will enable many new possibilities for content creation and digital avatars.

ACKNOWLEDGMENTS
This work was supported by the ERC Starting Grant Scan2CAD (804724), the Bavarian State Ministry of Science and the Arts and coordinated by the Bavarian Research Institute for Digital Transformation (bidt), the German Research Foundation (DFG) Grant “Making Machine Learning on Static and Dynamic 3D Data Practical,” the German Research Foundation (DFG) Research Unit “Learning and Simulation in Visual Computing,” and Sony Semiconductor Solutions Corporation. We would like to thank Yawar Siddiqui for help with the implementation of the differentiable rendering codebase, Artem Sevastopolsky and Guy Gafni for helpful discussions, and Alexey Bokhovkin, Chandan Yeshwanth, and Yueh-Cheng Liu for help during internal review.

REFERENCES
Rameen Abdal, Peihao Zhu, John Femiani, Niloy Mitra, and Peter Wonka. 2022. CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions. In ACM SIGGRAPH 2022 Conference Proceedings (Vancouver, BC, Canada) (SIGGRAPH '22). Association for Computing Machinery, New York, NY, USA, Article 48, 9 pages. https://doi.org/10.1145/3528233.3530747
Rameen Abdal, Peihao Zhu, Niloy J. Mitra, and Peter Wonka. 2021. StyleFlow: Attribute-Conditioned Exploration of StyleGAN-Generated Images Using Conditional Continuous Normalizing Flows. ACM Trans. Graph. (May 2021). https://doi.org/10.1145/3447648
Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18208–18218.
David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. 2021. Paint by Word. arXiv:2103.10951
Volker Blanz and Thomas Vetter. 1999. A Morphable Model for the Synthesis of 3D Faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99). ACM Press/Addison-Wesley Publishing Co., USA, 187–194. https://doi.org/10.1145/311535.311556
Zehranaz Canfes, M. Furkan Atasoy, Alara Dirik, and Pinar Yanardag. 2022. Text and Image Guided 3D Avatar Generation and Manipulation. https://doi.org/10.48550/ARXIV.2202.06079
Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, and Gordon Wetzstein. 2021. Efficient Geometry-aware 3D Generative Adversarial Networks. In arXiv.
Katherine Crowson. 2021. VQGAN-CLIP. https://github.com/nerdyrodent/VQGAN-CLIP
Boris Dayma, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham, Phuc Le Khac, Luke Melas, and Ritobrata Ghosh. 2021. DALL·E Mini. https://doi.org/10.5281/zenodo.5146400
Yu Deng, Jiaolong Yang, Dong Chen, Fang Wen, and Xin Tong. 2020. Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning. In IEEE Computer Vision and Pattern Recognition.
Haven Feng. 2019. Photometric FLAME Fitting. https://github.com/HavenFeng/photometric_optimization.
Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. 2021. Learning an Animatable Detailed 3D Face Model from In-the-Wild Images. ACM Transactions on Graphics (ToG), Proc. SIGGRAPH 40, 4 (Aug. 2021), 88:1–88:13.
Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. 2021. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators. arXiv:2108.00946 [cs.CV]
Baris Gecer, Jiankang Deng, and Stefanos Zafeiriou. 2021a. OSTeC: One-Shot Texture Completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7628–7638.
Baris Gecer, Alexander Lattas, Stylianos Ploumpis, Jiankang Deng, Athanasios Papaioannou, Stylianos Moschoglou, and Stefanos Zafeiriou. 2020. Synthesizing Coupled 3D Face Modalities by Trunk-Branch Generative Adversarial Networks. In Proceedings of the European Conference on Computer Vision (ECCV). Springer.
Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. 2019. GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos P Zafeiriou. 2021b. Fast-GANFIT: Generative Adversarial Network for High Fidelity 3D Face Reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
Thomas Gerig, Andreas Morel-Forster, Clemens Blumer, Bernhard Egger, Marcel Lüthi, Sandro Schönborn, and Thomas Vetter. 2017. Morphable Face Models - An Open Framework. https://doi.org/10.48550/ARXIV.1709.08398
Partha Ghosh, Pravir Singh Gupta, Roy Uziel, Anurag Ranjan, Michael J. Black, and Timo Bolkart. 2020. GIF: Generative Interpretable Faces. In International Conference on 3D Vision (3DV). 868–878. http://gif.is.tue.mpg.de/
Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. 2022. AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars. ACM Transactions on Graphics (TOG) 41, 4 (2022), 1–19.
Nikolay Jetchev. 2021. ClipMatrix: Text-controlled Creation of 3D Textured Meshes. https://doi.org/10.48550/ARXIV.2109.12922
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In International Conference on Learning Representations. https://openreview.net/forum?id=Hk99zCeAb
Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2020a. Training Generative Adversarial Networks with Limited Data. In Proc. NeurIPS.
Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 4396–4405. https://doi.org/10.1109/CVPR.2019.00453
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020b. Analyzing and Improving the Image Quality of StyleGAN. In Proc. CVPR.
Nasir Mohammad Khalid, Tianhao Xie, Eugene Belilovsky, and Tiberiu Popa. 2022. CLIP-Mesh: Generating textured meshes from text using pretrained image-text models. SIGGRAPH Asia 2022 Conference Papers (December 2022).
Umut Kocasari, Alara Dirik, Mert Tiftikci, and Pinar Yanardag. 2022. StyleMC: Multi-Channel Based Fast Text-Guided Image Generation and Manipulation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 895–904.
Marek Kowalski, Stephan J. Garbin, Virginia Estellers, Tadas Baltrušaitis, Matthew Johnson, and Jamie Shotton. 2020. CONFIG: Controllable Neural Face Image Generation. In European Conference on Computer Vision (ECCV).
Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. 2020. Modular Primitives for High-Performance Differentiable Rendering. ACM Transactions on Graphics 39, 6 (2020).
Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, Stylianos Ploumpis, Vasileios Triantafyllou, Abhijeet Ghosh, and Stefanos Zafeiriou. 2020. AvatarMe: Realistically Renderable 3D Facial Reconstruction "In-the-Wild". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Baris Gecer, Abhijeet Ghosh, and Stefanos P Zafeiriou. 2021. AvatarMe++: Facial Shape and BRDF Inference with Photorealistic Rendering-Aware GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
Myunggi Lee, Wonwoong Cho, Moonheum Kim, David I. Inouye, and Nojun Kwak. 2020. StyleUV: Diverse and High-fidelity UV Map Generative Model. ArXiv abs/2011.12893 (2020).
Ruilong Li, Karl Bladin, Yajie Zhao, Chinmay Chinara, Owen Ingraham, Pengda Xiang, Xinglei Ren, Pratusha Prasad, Bipin Kishore, Jun Xing, and Hao Li. 2020. Learning

8
ClipFace: Text-guided Editing of Textured 3D Morphable Models SIGGRAPH ’23 Conference Proceedings, August 6–10, 2023, Los Angeles, CA, USA

Formation of Physically-Based Face Attributes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Tianye Li, Timo Bolkart, Michael. J. Black, Hao Li, and Javier Romero. 2017. Learning a
model of facial shape and expression from 4D scans. ACM Transactions on Graphics,
(Proc. SIGGRAPH Asia) 36, 6 (2017), 194:1–194:17.
Yuchen Liu, Zhixin Shu, Yijun Li, Zhe Lin, Richard Zhang, and S. Y. Kung. 2022. 3D-FM
GAN: Towards 3D-Controllable Face Manipulation. ArXiv abs/2208.11257 (2022).
Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J.
Black. 2015. SMPL: A Skinned Multi-Person Linear Model. ACM Trans. Graphics
(Proc. SIGGRAPH Asia) 34, 6 (Oct. 2015), 248:1–248:16.
Huiwen Luo, Koki Nagano, Han-Wei Kung, Qingguo Xu, Zejian Wang, Lingyu Wei,
Liwen Hu, and Hao Li. 2021. Normalized Avatar Synthesis Using StyleGAN and
Perceptual Refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR). 11662–11672.
Richard T. Marriott, Sami Romdhani, and Liming Chen. 2021. A 3D GAN for Improved
Large-pose Facial Recognition. 2021 IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) (2021), 13440–13450.
Oscar Michel, Roi Bar-On, Richard Liu, Sagie Benaim, and Rana Hanocka. 2022.
Text2Mesh: Text-Driven Neural Stylization for Meshes. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 13492–
13502.
Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021.
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. In Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV). 2085–2094.
Mathis Petrovich, Michael J. Black, and Gül Varol. 2022. TEMOS: Generating diverse
human motions from textual descriptions. In European Conference on Computer
Vision (ECCV).
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini
Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen
Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From
Natural Language Supervision. In Proceedings of the 38th International Conference
on Machine Learning (Proceedings of Machine Learning Research, Vol. 139), Marina
Meila and Tong Zhang (Eds.). PMLR, 8748–8763. https://proceedings.mlr.press/
v139/radford21a.html
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022.
Hierarchical Text-Conditional Image Generation with CLIP Latents. https://doi.
org/10.48550/ARXIV.2204.06125
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford,
Mark Chen, and Ilya Sutskever. 2021. Zero-Shot Text-to-Image Generation. https:
//doi.org/10.48550/ARXIV.2102.12092
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir
Aberman. 2022. DreamBooth: Fine Tuning Text-to-image Diffusion Models for
Subject-Driven Generation. (2022).
Ron Slossberg, Ibrahim Jubran, and Ron Kimmel. 2022. Unsupervised High-Fidelity
Facial Texture Generation and Reconstruction. In Computer Vision – ECCV 2022:
17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIII
(Tel Aviv, Israel). Springer-Verlag, Berlin, Heidelberg, 212–229. https://doi.org/10.
1007/978-3-031-19778-9_13
Ayush Tewari, Mohamed Elgharib, Gaurav Bharaj, Florian Bernard, Hans-Peter Seidel,
Patrick Pérez, Michael Zöllhofer, and Christian Theobalt. 2020a. StyleRig: Rigging
StyleGAN for 3D Control over Portrait Images, CVPR 2020. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). IEEE.
Ayush Tewari, Mohamed Elgharib, Mallikarjun B R, Florian Bernard, Hans-Peter Seidel,
Patrick Pérez, Michael Zollhöfer, and Christian Theobalt. 2020b. PIE: Portrait Image
Embedding for Semantic Control. ACM Trans. Graph. (2020).
Can Wang, Menglei Chai, Mingming He, Dongdong Chen, and Jing Liao. 2022a. CLIP-
NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 3835–
3844.
Lizhen Wang, Zhiyuan Chen, Tao Yu, Chenguang Ma, Liang Li, and Yebin Liu. 2022b.
FaceVerse: A Fine-Grained and Detail-Controllable 3D Face Morphable Model From
a Hybrid Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). 20333–20342.
Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming
Zhang, and Nenghai Yu. 2022. HairCLIP: Design Your Hair by Text and Reference
Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR). 18072–18081.
Kim Youwang, Kim Ji-Yeon, and Tae-Hyun Oh. 2022. CLIP-Actor: Text-Driven Recom-
mendation and Stylization for Animating Human Meshes. In ECCV.
zllrunning. 2018. face-parsing.PyTorch. https://github.com/zllrunning/face-parsing.
PyTorch.


[Figure 7 panels: “Original”, “Dracula”, “Barbie Doll”, “Zombie”, “Party Makeup”, “Werewolf”, “Witcher”, “Ghost”.]
Fig. 7. Texture manipulations. ‘Original’ (first column) shows the initial textured mesh sampled from our ClipFace generator without any text prompt, followed
by ClipFace-generated textures for various text prompts.


[Figure 8 panels: rows show the prompts “Angry”, “Happy”, “Sad”, “Surprised”, and “Disgusted”; columns show Input, w/o Exp. Reg., w/o Clip Direction, and Ours.]

Fig. 8. ClipFace generates a large variety of expressions, faithfully deforming the mesh geometry and manipulating texture for more expressiveness. Our expression regularization is important for producing realistic geometric expressions, and the directional loss for balanced manipulation of both texture and geometry.

Fig. 9. Given a 3D face motion sequence (top row), we compare our dynamic texturing approach (bottom row) against static-only texturing (middle row). Geo
+ Tex shows textures overlaid on the animated mesh, and Tex Only shows texture in the neutral pose. Our proposed dynamic texture manipulation technique
generates more compelling animation, particularly in articulated expressions (e.g., t=23). Here, we show results for the text prompt "Laughing". We further
refer interested readers to our supplemental video.
