
IET Computer Vision

Special Section: Computer Vision for the Creative Industries

Interactive facial animation with deep neural networks

ISSN 1751-9632
Received on 6th October 2019
Revised 2nd April 2020
Accepted on 3rd June 2020
E-First on 13th August 2020
doi: 10.1049/iet-cvi.2019.0790
www.ietdl.org

Wolfgang Paier1, Anna Hilsmann1, Peter Eisert1,2

1 Vision & Imaging Technologies, Fraunhofer HHI, Berlin, Germany
2 Visual Computing, Humboldt University, Berlin, Germany
E-mail: wolfgang.paier@hhi.fraunhofer.de

Abstract: Creating realistic animations of human faces is still a challenging task in computer graphics. While computer graphics (CG) models capture much variability in a small parameter vector, they usually do not meet the necessary visual quality. This is due to the fact that geometry-based animation often does not allow fine-grained deformations and fails to produce realistic renderings in difficult areas (mouth, eyes). Image-based animation techniques avoid these problems by using dynamic textures that capture details and small movements that are not explained by geometry. This comes at the cost of high memory requirements and limited flexibility in terms of animation, because dynamic texture sequences need to be concatenated seamlessly, which is not always possible and is prone to visual artefacts. In this study, the authors present a new hybrid animation framework that exploits recent advances in deep learning to provide an interactive animation engine that can be used via a simple and intuitive visualisation for facial expression editing. The authors describe an automatic pipeline to generate training sequences that consist of dynamic textures plus sequences of consistent three-dimensional face models. Based on this data, they train a variational autoencoder to learn a low-dimensional latent space of facial expressions that is used for interactive facial animation.

1 Introduction

Modelling and rendering of virtual human faces is still a challenging task in computer graphics. Complex deformation and material properties and effects like occlusions and disocclusions make it difficult to develop efficient representations that are capable of generating high-quality renderings. Especially frameworks which try to capture eyes and the oral cavity in a holistic manner suffer from these problems. A geometry-based approach would have to model eyeballs and eyelids, as well as tongue, teeth and gums explicitly, which is prone to artefacts and low visual quality and likely to require heavy manual intervention. On the other hand, purely image-based approaches are becoming more and more popular with the rise of deep convolutional networks, which are capable of capturing the complex appearance changes in image space alone. However, this comes at the cost of long training times, high hardware requirements and large amounts of training data. Furthermore, simple operations like changing the three-dimensional (3D) pose of a model become more difficult again. A more sensible solution should make use of both approaches. Geometry-based representations can already compensate for a large amount of variance, which is usually caused by rigid motion and large-scale deformations, but they fail in face regions which exhibit complex as well as fine motions or strong changes in texture. Therefore, we propose a hybrid approach between classical geometry-based modelling and image-based rendering using dynamic textures (Fig. 1). In general, working in texture space provides several advantages, like a fixed resolution, less variance that is induced by motion and especially a fixed topology that even allows us to define regions of interest for generating local models that can be animated independently (e.g. mouth or eyes). This helps in reducing the model complexity by a large amount, as we can train smaller neural networks on selected regions. In our experiments, we focus on the eye and mouth area, which are represented by convolutional neural networks as local non-linear models that are able to capture the highly non-linear appearance changes. Our networks have a variational autoencoder (VAE) structure with two separate encoding and decoding paths for texture and geometry. Additionally, we use an adversarial loss to improve the visual quality of the generated textures such that even fine details are visible. We designed our system in a way such that it can generate a 2D expression manifold for each region of interest (i.e. mouth and eyes). This enables us to create a simple interface for interactive facial animation in real time (Fig. 2).

The remainder of this paper is organised as follows: in Section 2, we present an overview of existing technologies and methods that are related to facial animation and facial modelling. Section 3 gives an overview of the proposed framework. In Section 4, we describe the data acquisition process for our training and test data. Section 5 contains a detailed description of our hybrid model and in the last two sections, we present and discuss our results.

2 Related work

Animation and performance capture of human faces have been an active area of research for decades.

Classical approaches for modelling facial expressions, for example [1-3], have been used for many years. They are usually built upon a linear basis of blendshapes and even though their expressive power is limited, they became popular due to their robustness and simplicity of use.

For example in [4], Li et al. improve blendshape-based performance capture and animation systems by estimating corrective blendshapes on the fly. Cao et al. [5] present an automatic system for creating large-scale blendshape databases using captured RGB-D data and Garrido et al. [6] use blendshapes to create a personalised model by adapting it to a detailed 3D scan using sparse facial features and optical flow. The generated face model is then used to synthesise plausible mouth animations of an actor in a video according to a new target audio track. Since blendshape models cannot capture teeth and the mouth cavity, they use a textured 3D proxy to render teeth and the inner mouth. A more sophisticated facial retargeting system was proposed by Thies et al. [7] who implemented a linear model that can represent expressions, identity, texture and light. With their model, it is possible to capture a wide range of facial expressions of different people under varying light conditions. However, similar to Garrido et al. [6], they had to use a hand-crafted model to visualise the oral cavity, which also demonstrates the weakness of blendshape-based models.

While being useful to represent large-scale geometry deformations and rigid motion, they usually lack expressive power to capture complex deformations and occlusions/disocclusions that occur around eyes, mouth and in the oral cavity.

To circumvent these limitations, several image-based approaches like [8-17] have been developed. For example in [13], a low-cost system for facial performance transfer is presented. Using a multilinear face model, they are able to track an actor's facial geometry in the 2D video, which enables them to transfer facial performances between different persons via image-based rendering. Similar techniques were used in [12, 9, 7]. A different way of image-based animation was proposed by Casas et al. in [11] or Paier et al. in [8, 10]. They exploit the fact that low-quality 3D models with high-quality textures are still capable of producing high-quality renderings. Based on previously captured video and geometry, they create a database of atomic geometry and texture sequences that can be re-arranged to synthesise novel performances. While Casas and co-workers use pre-computed motion graphs to find suitable transition points, Paier and co-workers proposed a spatio-temporal blending to generate seamless transitions between concatenated sequences.

The big advantage of these approaches is that they make direct use of captured video data to synthesise new video sequences, which results in high-quality renderings. Geometry is only used as a proxy to explain rigid motion, large-scale deformation and perspective distortion. However, the high visual quality comes at the cost of high memory requirements during rendering, since textures have to be stored for each captured frame in the database.

With the advent of deep neural networks, more powerful generative models have been developed. Especially, VAEs [18] and generative adversarial networks (GANs) [19] gain popularity, since they are powerful enough to represent large distributions of high-dimensional data in high quality (e.g. face images or textures). For example in [20-22], GANs are used to learn latent representations of images that can be used to manipulate the content of images (e.g. facial expression) in a controlled and meaningful way. More recent methods like [23, 24] are even capable of generating believable facial animations from a single photo using GANs. For example, Pumarola et al. [23] present a GAN-based animation approach that can generate facial expressions based on an action unit coding scheme. The action unit weights define a continuous representation of the facial expressions. The loss function is defined by a critic network that evaluates the realism of generated images and the fulfilment of the expression conditioning. Another example is [25], where Huang and Khan propose a method to synthesise believable facial expressions using GANs that are conditioned on sketches of facial expressions.

While deep models were initially used to generate realistic images or modify them, GANs are also capable of performing image-to-image translation tasks, as demonstrated in [26, 27]. In these examples, the output images of GANs are conditioned on source images, which usually represent the 'content', and the task of the GAN is generating new images that show the same content (e.g. facial expressions) but with a different style or appearance (e.g. identity). After training, it can be used to animate, for example, a virtual character or a face by recording the desired facial expressions and using the GAN to translate the video.

A different method was presented by Lombardi et al. [28] and Slossberg et al. [29]. Instead of working in image space alone, they use a 3D model with texture. A deep neural network is trained to represent geometry and texture with a 128-dimensional vector. They use a high-quality face capture rig with 40 synchronised machine vision cameras to reconstruct consistent 3D face geometry for every frame. Based on the 3D geometry and the average texture, they train a VAE that is capable of synthesising geometry and a view-dependent texture, which allows rendering high-quality face sequences in 3D. They also proposed a facial expression editing method, where they allow artistic control over facial expressions by applying position constraints on vertices. While this seems to be a sensible solution at first glance, our system aims at providing high-level control over captured facial expressions to design new facial animation sequences, rather than editing facial expressions on vertex level. Furthermore, we want to provide a solution that can work with standard hardware.

Fig. 1 Examples of our results. The proxy head model is rendered with different facial expressions and from different viewpoints

3 System overview

This section gives a high-level overview of the proposed animation framework, see Fig. 3. Our system basically consists of three phases: capture/tracking (offline), training (offline) and animation/rendering (interactive/live).

In the first phase, we capture a set of multiview stereo image pairs in order to compute an approximate deformable geometry model of the face. We need not capture all fine details in geometry since most of the visual information will be represented by dynamic textures. This is especially true for complex areas like mouth and eyes, where accurate 3D modelling would need heavy manual intervention. Our deformable face model is only responsible for capturing rigid motion and large-scale deformation and for providing a consistent texture space parameterisation. The advantage of this approach is that we can do this in an automatic way without manual intervention. Since most visual information will be stored in texture space, we also need to create a texture model. We do this by recording a few sequences of multiview video, where our actress shows some general facial expressions and speaks a set of words in order to capture most visemes and transitions between them. Before we start extracting face textures, we estimate the 3D face pose and geometry deformation parameters for each captured image frame. In the second phase, we train a deep autoencoder network to learn a latent representation of face texture and geometry. In the last phase, we use this latent representation to interpolate realistically between different facial expressions (in geometry as well as in texture space), which allows us to create new performances by concatenating short sequences of captured geometry/texture in a motion-graph-like manner. Furthermore, we provide a simple user interface for direct high-level manipulation of facial expressions based on the latent expression features. Due to the fast evaluation of our neural model on the GPU, we can provide a graphical user interface for interactive animation and facial expression editing in real time.

4 Acquisition of data

Our proposed face representation is based on a deformable 3D geometry model with consistent topology and dynamic textures.
While the geometry acts as a 3D proxy to represent rigid motion and large-scale deformation of the face, our texture model aims at generating high-quality dynamic textures with fine details and realistic micromovements. The capture process for data acquisition consists of two parts: first, we capture 18 general facial expressions as well as 15 expressions that show visemes, with a D-SLR camera rig that consists of four stereo pairs (Fig. 4). Second, we record multiview face video with high-resolution machine vision cameras in a diffuse light setup. For this part, we capture video data for the dynamic texture model. Similar to before, we capture one video, where our actress displays general facial expressions, and two more videos, where our actress speaks a set of selected words to capture visemes and transitions between different visemes. In the last video, our actress displayed some general facial expressions and speech, which allows us to evaluate the performance on unseen data. The training data consists of ~2160 frames and the evaluation sequence has 250 frames.

Fig. 2 Screenshot of our implemented animation interface. There are two separate windows for mouth (bottom) and eye (top) control. We use the 2D parameter vector for mouth and eyes to display the learned manifold of facial expressions. By clicking on the displayed expression in the respective window it is possible to change the facial expression of our model

Fig. 3 Schematic overview of the proposed framework

4.1 Preprocessing

During preprocessing, we compute the 3D face geometry for each static facial expression that was captured by the D-SLR rig. This is achieved using a three-stage reconstruction pipeline: first, we extract and match SIFT [30] features using the matching approach described in [31]. Knowing corresponding points in all camera pairs, we compute an initial 3D point cloud (Fig. 5, leftmost), which serves as input for the screened Poisson mesh reconstruction approach [32] to generate an initial 3D mesh (Fig. 5, middle). We refine the initial 3D mesh by computing dense stereo reconstructions [33] based on optical flow and improve the accuracy of the initial mesh by computing vertex offsets that minimise the difference between a vertex on the initial mesh and the closest vertices on all dense stereo reconstructions. We regularise the deformation field by minimising the uniform Laplacian of the deformation field. The result of this first step is shown in Fig. 5 (right).

4.2 Deformable mesh registration

In order to create an animatable face model, we need registered face meshes with consistent topology. For this purpose, we use an existing template mesh [34] and deform it such that it matches the geometry of all reconstructed 3D face meshes. During mesh registration, we minimise three error terms simultaneously: the point-to-plane distance between template geometry and target geometry ℰgeo (2), the optical flow ℰimg (3) and the distance between automatically detected facial feature points [35] and the projected 2D location of corresponding 3D mesh landmarks on the template mesh ℰlm (4). We optimise the following parameters during mesh registration: the similarity transform S is represented by a 7D parameter vector $\mathbf{s} = [t_x, t_y, t_z, r_x, r_y, r_z, s]^T$ that contains three parameters for translation $t_x, t_y, t_z$, three parameters for rotation $r_x, r_y, r_z$ and $s$ for uniform scale. The blendshape weight vector $\mathbf{b} = [b_0, \ldots, b_9]^T$ is a 10D column vector and the vertex offsets $\mathbf{o} = [o_{x_1}, o_{y_1}, o_{z_1}, \ldots, o_{x_n}, o_{y_n}, o_{z_n}]^T$ represent a 3D offset for each vertex in the template mesh with $n$ vertices

$$\mathbf{x}_{s,b,v} = S\,(\mathbf{X}_0 + \mathcal{B}\,\mathbf{b} + \mathbf{o}) \tag{1}$$
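To make the structure of (1) concrete, the following minimal Python/NumPy sketch applies the deformable model to a set of template vertices. It is an illustration only: the Euler-angle rotation order and the array layouts are assumptions, not the exact parameterisation used in our pipeline.

```python
import numpy as np

def euler_to_matrix(rx, ry, rz):
    # simple XYZ Euler rotation, an assumed parameterisation of rx, ry, rz
    cx, sx, cy, sy, cz, sz = np.cos(rx), np.sin(rx), np.cos(ry), np.sin(ry), np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def deform(X0, B, b, o, s_params):
    """X0: (n, 3) mean shape, B: (k, n, 3) blendshape basis, b: (k,) weights,
    o: (n, 3) per-vertex offsets, s_params: (tx, ty, tz, rx, ry, rz, s)."""
    tx, ty, tz, rx, ry, rz, scale = s_params
    shaped = X0 + np.tensordot(b, B, axes=1) + o          # X0 + B b + o
    R = euler_to_matrix(rx, ry, rz)
    return scale * shaped @ R.T + np.array([tx, ty, tz])  # similarity transform S
```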

Fig. 4  Samples of static expressions captured by D-SLRs

Fig. 5  Results of the mesh reconstruction, from left to right: initial point
cloud, initial 3D mesh, refined and cleaned 3D mesh

The column vector $\mathbf{x}_{s,b,v} = [x_1, y_1, z_1, \ldots, x_n, y_n, z_n]^T$ contains the x, y and z coordinates of all vertices of the deformed mesh. $\mathbf{X}_0$ corresponds to the mean shape of the mesh template and $S$ represents the similarity transform that is applied in the last step on all deformed and refined vertices. $\mathbf{x}_i$ refers to the 3D position $[x_i, y_i, z_i]^T$ of the $i$th transformed vertex. $\mathcal{B}$ refers to the blendshape basis (column vectors) of the mesh template to adapt the overall shape

$$\mathcal{E}_{geo}(s, b, v) = \sum_i \big( (\mathbf{x}_i - \mathbf{v}_i)^T \mathbf{n}_{v_i} \big)^2 \tag{2}$$

ℰgeo refers to the squared point-to-plane distance, with $\mathbf{x}_i$ being the transformed position of vertex $i$, $\mathbf{v}_i$ being the corresponding point on the target mesh and $\mathbf{n}_{v_i}$ being the surface normal of $\mathbf{v}_i$. We select $\mathbf{v}_i$ by choosing the closest point on the target mesh using a kd-tree

$$\mathcal{E}_{img}(s, b, v) = \sum_{c}^{C} \sum_{p \in I} \big( J_c(p) - \mathcal{I}_{s,b,v}(p) \big)^2 \tag{3}$$

where ℰimg refers to the sum of all squared intensity differences at all rendered pixels $p$ between the synthesised image $\mathcal{I}_{s,b,v}$ and the captured image $J_c$, with $c$ being the camera index and $C$ being the number of all cameras

$$\mathcal{E}_{lm}(s, b, v) = \sum_{c}^{C} \sum_i \big\| \hat{\mathbf{m}}_{c,i} - \mathbf{m}_{c,i} \big\|^2 \tag{4}$$

The above equation refers to the sum of squared distances between the projected 2D position $\mathbf{m}_{c,i}$ of all mesh landmarks and the corresponding detected 2D landmark position $\hat{\mathbf{m}}_{c,i}$ in all input images. We further assume that the mesh template is annotated and the locations of distinct facial features (e.g. eyes, nose, mouth corners) on the mesh are available (i.e. barycentric coordinates and triangle id of landmarks on the triangle mesh).

Fig. 6 Examples of the adapted 3D face template with corresponding multiview 3D reconstructions

The mesh registration process itself consists of two steps: first, the mesh template is registered with the reconstructed neutral face geometry. For this step, we minimise the geometry error ℰgeo and the distance between corresponding landmark positions ℰlm by adjusting the similarity s, the blendshape weights b and the vertex offsets o. The blendshape weight b is related to the SMPL [34] template model, which provides ten blendshapes to adjust the general shape. Fine adjustments to the face shape are realised by computing per-vertex offsets o that are regularised via the cotangent Laplacian $\delta_i$ [36, 37]

$$\delta_i = \sum_{(i,j) \in E} w_{i,j}\,(\mathbf{x}_j - \mathbf{x}_i) \tag{5}$$

with $\mathbf{x}_i$ being the position of vertex $i$ and $\mathbf{x}_j$ being the position of one of its one-ring neighbours

$$w_{i,j} = \frac{\omega_{i,j}}{\sum_{(i,k) \in E} \omega_{i,k}} \tag{6}$$

$$\omega_{i,j} = \cot\alpha + \cot\beta \tag{7}$$

After the template mesh is adapted to the neutral expression, we start registering the template to all other facial expressions. We do this by deforming the adapted neutral template to match the reconstructed geometry of all other facial expressions. During this phase, we only adapt the rigid motion $t_x, t_y, t_z, r_x, r_y, r_z$ and the vertex offsets o, since we have no blendshapes that could simulate facial expressions. With the adapted neutral template, we additionally use the optical flow error (3) to improve the semantic consistency between the registered face meshes.
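The cotangent weights in (5)-(7) can be accumulated directly from the triangle list. The sketch below is a straightforward, unoptimised Python/NumPy version, assuming the usual convention that α and β are the two angles opposite the edge (i, j); it is meant to illustrate the regulariser, not to reproduce our implementation.

```python
import numpy as np

def cotangent_weights(verts, faces):
    """verts: (n, 3) vertex positions, faces: (m, 3) vertex indices."""
    w = {}
    for a, b, c in faces:
        for i, j, k in ((a, b, c), (b, c, a), (c, a, b)):
            # angle at vertex k, which lies opposite the edge (i, j)
            u = verts[i] - verts[k]
            v = verts[j] - verts[k]
            cot = np.dot(u, v) / (np.linalg.norm(np.cross(u, v)) + 1e-12)
            w[(i, j)] = w.get((i, j), 0.0) + cot   # omega_ij = cot(alpha) + cot(beta)
            w[(j, i)] = w.get((j, i), 0.0) + cot
    return w

def cotangent_laplacian(verts, weights):
    # delta_i = sum_j w_ij (x_j - x_i), with w_ij normalised per vertex (6)
    delta = np.zeros_like(verts)
    norm = np.zeros(len(verts))
    for (i, j), om in weights.items():
        delta[i] += om * (verts[j] - verts[i])
        norm[i] += om
    return delta / np.maximum(norm, 1e-12)[:, None]
```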

To regularise the strong deformations, we minimise the cotangent mesh Laplacian (5) on
vertex offsets as well as on the vertex positions. This ensures a
smooth deformation field as well as an artefact-free and smooth
mesh surface. The registration process is treated as a non-linear
optimisation task, which is solved using the iterative Gauss–
Newton method. With the 33 registered template meshes, we create
a linear face deformation model consisting of 20 basis vectors,
which are computed using principal component analysis (PCA).
This linear model is used in all subsequent steps for capturing face
motion, deformation and extracting textures from the captured
multiview video stream (Fig. 6).
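Building the linear deformation model from the 33 registered meshes amounts to a standard PCA on the stacked vertex coordinates. A minimal sketch (using plain SVD; the exact tooling is not specified in the paper):

```python
import numpy as np

def build_pca_model(meshes, n_basis=20):
    """meshes: list of (n, 3) registered vertex arrays with identical topology."""
    X = np.stack([m.reshape(-1) for m in meshes])     # (33, 3n) data matrix
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_basis]                         # mean shape and 20 basis vectors
```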

4.3 Model-based face tracking and texture extraction


The training data for our hybrid model is extracted from the
previously recorded multiview video stream and consists of a
deformable geometry proxy and high-resolution dynamic textures.
We process the multiview video data in two steps: first, we track
the 3D head pose and deformation based on face landmarks and
optical flow. Knowing the approximate 3D head geometry at every
timestep, we extract a sequence of textures using an approach [10]
that was adapted for use in our pipeline. To obtain the 3D
face geometry for the first frame of each sequence, we assume a
neutral expression and optimise rigid motion parameters
rx, ry, rz, tx, ty, tz to minimise the distance between all detected
landmarks and the projected 2D landmark positions (4) of the
annotated face model. All following frames are processed by
minimising landmark error ℰlm as well as the intensity difference
ℰimg between captured images and the rendered face model. To
evaluate the intensity difference, we use the first frame of each
camera as a texture for our face model and compute the difference
between the synthesised image (given the pose and blendshape
parameters) and the captured target image. As before, we treat the
tracking as an optimisation task that is solved iteratively using the
Gauss–Newton method.
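The sketch below illustrates the kind of Gauss–Newton loop used for this tracking step, with the residual stacking the landmark differences (4) and, for later frames, the photometric differences. It uses a numerical Jacobian for brevity; the residual function and parameter vector are placeholders, not our actual implementation.

```python
import numpy as np

def gauss_newton(residual_fn, params, n_iters=10, eps=1e-4):
    """residual_fn(p) returns the stacked residual vector for parameters p."""
    p = np.asarray(params, dtype=float)
    for _ in range(n_iters):
        r = residual_fn(p)
        J = np.zeros((r.size, p.size))
        for k in range(p.size):                        # numerical Jacobian, column by column
            dp = np.zeros_like(p); dp[k] = eps
            J[:, k] = (residual_fn(p + dp) - r) / eps
        # normal equations (J^T J) delta = -J^T r, small diagonal term for stability
        delta = np.linalg.solve(J.T @ J + 1e-9 * np.eye(p.size), -J.T @ r)
        p = p + delta
    return p
```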
Knowing the accurate pose and face shape for each frame in the
captured multiview videos allows us to extract a dynamic texture
sequence for each video. The extracted texture sequences capture
all information that is not represented by geometry, which includes,
for example, fine motions and deformations (e.g. micromovements,
wrinkles) as well as changes in texture and occlusions/
disocclusions (e.g. eyes and oral cavity). Together, geometry and
texture represent almost the full appearance of the captured face in
each frame. For the texture extraction, we rely on a graph-cut-
based approach with a temporal consistency constraint [10]. This
method simultaneously optimises three data terms (8) to generate a
visually pleasing (i.e. no spatial or temporal artefacts) sequence of
textures from each video
$$E_{tex}(C) = \sum_{t}^{T} \sum_{i}^{N} D(f_i^t, c_i^t) + \lambda \sum_{(i,j) \in N} V_{i,j}(c_i^t, c_j^t) + \eta\, T(c_i^t, c_i^{t-1}) \tag{8}$$

$C$ denotes the set of source camera ids for all triangles. The first term $D(f_i, c_i)$ is a measure for the visual quality of a triangle $f_i$ textured by the camera $c_i$. It uses the heuristic $W(f_i, c_i)$, which is the area of $f_i$ projected on the image plane of the camera $c_i$ relative to the sum of $\mathrm{area}(f_i, c_i)$ over all possible $c_i$, to ease the choice of the weighting factors $\eta$ and $\lambda$

$$D(f_i, c_i) = \begin{cases} 1 - W(f_i, c_i) & f_i \text{ is visible} \\ \infty & f_i \text{ is occluded} \end{cases} \tag{9}$$

$$W(f_i, c_i) = \frac{\mathrm{area}(f_i, c_i)}{\sum_{c_j} \mathrm{area}(f_i, c_j)} \tag{10}$$

$V_{i,j}(c_i, c_j)$ represents a spatial smoothness constraint, which relates the sum of colour differences along the common edge $e_{i,j}$ of two triangles $f_i$ and $f_j$ that are textured from different cameras $c_i$ and $c_j$

$$V_{i,j}(c_i, c_j) = \begin{cases} 0 & c_i = c_j \\ \Pi_{e_{i,j}} & c_i \neq c_j \end{cases} \tag{11}$$

$$\Pi_{e_{i,j}} = \int_{e_{i,j}} \big\| I_{c_i}(x) - I_{c_j}(x) \big\| \,\mathrm{d}x \tag{12}$$

The last term $T(c_i, c_j)$ in (8) ensures temporal smoothness by penalising the change of the source camera $c_i$ of a triangle $f_i$ between consecutive time steps.

Fig. 7 Upper half of this figure shows the rendered model with and without highlighted animation regions. The lower half of this picture shows that we can simply select the modified face region in texture space by defining a rectangular region of interest. Similarly, we select the corresponding vertices of interest by comparing the texture coordinate of each vertex with the region of interest in texture space
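For illustration, the data term (9)-(10) can be evaluated for all triangle/camera pairs at once, given precomputed projected areas and visibility flags (both assumptions about how the tracked geometry is preprocessed):

```python
import numpy as np

def data_term(proj_area, visible):
    """proj_area, visible: (n_triangles, n_cameras) arrays of projected areas
    and visibility flags for every triangle/camera pair."""
    # W: projected area relative to the sum over all cameras, as in (10)
    W = proj_area / np.maximum(proj_area.sum(axis=1, keepdims=True), 1e-12)
    # (9): quality cost 1 - W where the triangle is visible, infinite cost otherwise
    return np.where(visible, 1.0 - W, np.inf)
```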

Fig. 8 This figure shows the used network architecture. The network consists of roughly six parts: a convolutional texture encoder/decoder (blue); a geometry encoder/decoder (green); a fully connected bottleneck (yellow) that combines information of texture and geometry into a latent code vector (μ) and deviation (σ); and a texture discriminator network (orange) that tries to classify textures as real or synthetic

Table 1 Final errors after 100 epochs

Data             Training error   Validation error
eye geometry     0.000274         0.000280
mouth geometry   0.000676         0.000759
eye texture      0.015            0.0163
mouth texture    0.0116           0.0125

Without such a term, the extracted dynamic textures are not temporally consistent, i.e. the source camera of a certain triangle can change arbitrarily between two consecutive texture frames, which results in disturbing flickering in the resulting texture sequence

$$T(c_i^t, c_i^{t-1}) = \begin{cases} 0 & c_i^t = c_i^{t-1} \\ 1 & c_i^t \neq c_i^{t-1} \end{cases} \tag{13}$$

The selection of source cameras $C$ for all mesh triangles in all frames is solved simultaneously using the alpha expansion algorithm [38]. In the last step, we assemble the reconstructed textures by sampling texel colours from the optimal source cameras. Seams between triangles are concealed using Poisson image editing [39].

5 Neural face model

Our hybrid face model consists of two animatable regions (i.e. mouth and eyes), while the rest of the face model is deformed automatically in order to ensure a consistent facial geometry. Both animatable regions are represented with deep neural networks that are capable of simultaneously generating a detailed local texture as well as the vertex positions of the local geometry proxy. The remaining face geometry is modelled with a small set of blendshapes and a static texture. This approach has several advantages: first, most relevant information appears around mouth and eyes. Decoupling them gives our model a higher expressive power while keeping the overall complexity and the amount of necessary training data as low as possible. The complexity of deformation and material properties of these two regions is much higher, which makes the usage of deep neural networks a sensible choice, while the remaining facial regions can be represented easily with a linear model. Furthermore, training smaller networks requires less data, less hardware and less time, which makes this approach even more appealing in practice.

5.1 Network structure and training

Fig. 7 shows a schematic representation of the face model components. Both local non-linear models are represented as deep VAEs [18] (Fig. 8). Similar to [28], we use separate encoder/decoder paths for texture (Fig. 8, blue) and geometry (Fig. 8, green) that are joined at the bottleneck (Fig. 8, yellow) into a single fully connected layer that outputs the latent representation of an expression. In contrast to [28], we use an adversarial discriminator (Fig. 8, orange) on the reconstructed texture to improve the realism of the reconstructed textures (e.g. eye-lashes, wrinkles).

The encoder-decoder path for the local texture model consists of six convolutional blocks. Each texture encoding block consists of two consecutive convolution-batchnorm-LeakyReLU layers, where the first layer performs downscaling using a stride of two. The first convolution block outputs a filter map with 32 channels. In each following convolution block, the number of channels is doubled. Each texture decoding block consists of a transposed convolution-batchnorm-LeakyReLU followed by a convolution-batchnorm-LeakyReLU layer. The number of channels for each filter bank can be found in Fig. 8. The vertex encoding and decoding layers consist of a linear layer followed by batchnorm and a LeakyReLU activation. The fully connected layers merge texture and geometry information and output a 2D latent code that is regularised by the variational lower bound [18]. This guarantees a smooth latent space and latent features that are normally distributed (μ = 0 and σ = 1), which gives a well-defined parameter space that can be used for manual facial animation, interpolation and editing.
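The following PyTorch sketch mirrors the overall structure described above: two encoder paths (convolutional for texture, fully connected for geometry) joined into a 2D latent code with a reparameterised bottleneck, and two decoder paths. The channel counts, texture resolution and geometry encoder width are placeholders, and the adversarial discriminator and loss terms are omitted, so this is not the exact architecture of Fig. 8.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # two convolution-batchnorm-LeakyReLU layers; the first one downscales by 2
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
        nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2),
        nn.Conv2d(c_out, c_out, 3, stride=1, padding=1),
        nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

def deconv_block(c_in, c_out):
    # transposed convolution followed by a convolution, as described above
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
        nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2),
        nn.Conv2d(c_out, c_out, 3, stride=1, padding=1),
        nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

class RegionVAE(nn.Module):
    def __init__(self, n_verts, tex_res=128, z_dim=2):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 512, 512]               # six encoder blocks
        self.tex_enc = nn.Sequential(*[conv_block(a, b)
                                       for a, b in zip(chans[:-1], chans[1:])])
        self.feat = chans[-1] * (tex_res // 64) ** 2          # flattened texture features
        self.tex_res = tex_res
        self.geo_enc = nn.Sequential(nn.Linear(3 * n_verts, 256),
                                     nn.BatchNorm1d(256), nn.LeakyReLU(0.2))
        self.to_mu = nn.Linear(self.feat + 256, z_dim)        # joint 2D bottleneck
        self.to_logvar = nn.Linear(self.feat + 256, z_dim)
        self.from_z = nn.Linear(z_dim, self.feat + 256)
        dec_chans = chans[::-1]
        self.tex_dec = nn.Sequential(*[deconv_block(a, b)
                                       for a, b in zip(dec_chans[:-1], dec_chans[1:])])
        self.geo_dec = nn.Linear(256, 3 * n_verts)

    def forward(self, tex, verts):
        ht = self.tex_enc(tex).flatten(1)                     # texture features
        hg = self.geo_enc(verts.flatten(1))                   # geometry features
        h = torch.cat([ht, hg], dim=1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterise
        d = self.from_z(z)
        dt, dg = d[:, :self.feat], d[:, self.feat:]
        s = self.tex_res // 64
        tex_hat = self.tex_dec(dt.view(-1, 512, s, s))        # decoded texture
        verts_hat = self.geo_dec(dg).view_as(verts)           # decoded vertex positions
        return tex_hat, verts_hat, mu, logvar
```

In training, reconstruction losses on texture and geometry, the KL term on (μ, σ) and the adversarial loss from the discriminator would be combined on top of this forward pass.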

Fig. 9  Possible facial expressions, reconstructed by our model. We randomly combined different expressions of eye and mouth area

That means our face model can be animated by providing 2D parameters for the eye and mouth regions, which results in four parameters per facial expression. Our discriminator network is based on the PatchGAN structure as described in [40]. It consists of four convolutional blocks. Each block consists of a 2D convolution with stride 2, a batchnorm and a LeakyReLU layer. The final layer is a 1 × 1 convolution that outputs a single-channel feature map, where a value close to zero means that the discriminator believes that this image region is fake, while a value close to one signals that the image patch is real. The advantage of using smaller patches is that the discriminator classifies based on smaller features and details, which forces the texture generator to create a sharper texture with more details.
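A minimal PatchGAN-style discriminator with the block structure described above might look as follows; the channel widths and kernel size are assumptions, since the text only fixes the number of blocks, the stride and the final 1 × 1 convolution.

```python
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, c = [], in_ch
        for width in [base, base * 2, base * 4, base * 8]:    # four stride-2 blocks
            layers += [nn.Conv2d(c, width, 4, stride=2, padding=1),
                       nn.BatchNorm2d(width), nn.LeakyReLU(0.2)]
            c = width
        layers += [nn.Conv2d(c, 1, kernel_size=1)]            # one score per patch
        self.net = nn.Sequential(*layers)

    def forward(self, tex):
        # returns a map of per-patch realism scores; a sigmoid/BCE (or LSGAN)
        # loss on top maps them to the real/fake decision described above
        return self.net(tex)
```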

Fig. 10  Selected facial expressions that show the advantages of our hybrid approach. Below every facial expression, we show the cropped and enlarged
mouth and eye area. While the geometry model cannot capture details and occlusions/disocclusions, we show that the texture model is capable of capturing
and reproducing teeth, tongue, eye-lashes or wrinkles

We use a batch size of four, a leakiness of 0.2 for all ReLUs in our network and use the Adam optimiser with default parameters to train our network for 100 epochs with a fixed learning rate of 0.0001. Table 1 shows the final training errors after 100 epochs on training data and unseen data. The geometry error is the mean absolute deviation over all vertex coordinates given in metres. The texture error is given as the mean absolute error over all pixels, with normalised values that range from 0 to 1. While the texture error is quite similar for the eye and the mouth region, the geometry error for the eye region is almost three times lower than for the mouth region. This is most probably caused by the fact that the eye geometry does not vary as much as the mouth geometry and therefore expressions for the eye region are mainly captured in texture space. This would also explain why the texture error for the eye region is higher than for the mouth region.

5.2 Animation and rendering

The rendering of our face model can be performed with most available rendering engines since the output of our model is a textured triangle mesh. The animation procedure is performed in three steps. First, we evaluate both local face models with their corresponding animation parameters. These animation parameters can be generated either by using the encoder part of the trained networks or by a user that modifies the animation parameters manually by selecting the desired facial expressions on the implemented animation interface (Fig. 2). In order to show possible facial expressions in the animation control windows, we sample the expression parameter space at regular grid locations with coordinate values between -4 and +4. The textures of these samples are then arranged as a 2D grid and act as a rough preview for the animator. Second, we make sure that the reconstructed texture and the geometry fit seamlessly into the animated mesh. We perform this integration with Poisson image and mesh editing. Since the reconstructed texture lies already in GPU memory, we use the highly optimised 2D convolution operations of the machine learning framework to perform real-time Poisson image editing on the texture. This can be implemented by creating a single 2D convolution layer with a 3 × 3 × 3 × 3 filter tensor, where each (i, i) 3 × 3 sub-tensor contains the weights of a Laplacian filter kernel while all other elements are set to zero. Using this convolution operation, we compute the target response tensor Lt by convolving the generated texture. Then, we replace all border pixel values in the reconstructed texture with their values from the original texture and iteratively update the reconstructed texture by minimising the difference between Lt and L. We use 25 blending iterations in our experiments. The integration of both sub-meshes is performed on the CPU. Therefore, we adjust the blendshape weights of the underlying face model to minimise the geometry difference between the full face model and the two local geometry reconstructions. Similar to before, we integrate the reconstructed geometry by performing Laplacian blending on all unmodified vertices while the updated positions of the mouth and eye regions stay constant and serve as border constraints. Finally, we render the textured mesh with OpenGL.

6 Experimental results and discussion

In this section, we present more results that were generated with the proposed model. In Fig. 9, we show the rendered face model with 16 different facial expressions. We randomly combined different expressions of the eye area and mouth region that were seamlessly integrated in the head model. For our experiments, we captured our actress with eight synchronised video cameras (5120 × 3840 @ 25 fps) and eight D-SLR cameras (5184 × 3456). While we used high-quality hardware for our data capture, our method is not restricted to this exact hardware configuration. A minimal capture setup could, for example, consist of three video cameras alone to record the subject's face from the left side, from the right side and frontally. Instead of creating a personalised face model, it is also possible to employ an existing blendshape model (e.g. [5]) or to use a single 3D reconstruction of the subject's face (e.g. neutral expression) that is deformed based on the optical flow in the video sequences (e.g. [8, 41]).
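As an illustration of the GPU-based blending described in Section 5.2, the sketch below performs the iterative Laplacian matching with a depthwise 3 × 3 convolution in PyTorch. The Jacobi-style update rule and the fixed border handling are assumptions about implementation details; only the overall idea (match the Laplacian of the generated patch while keeping the border pixels of the original texture) follows the text.

```python
import torch
import torch.nn.functional as F

def poisson_blend(source, target, n_iters=25):
    """Blend a generated texture patch (source) into the original texture
    (target); both are (C, H, W) tensors of equal size. The outermost pixel
    ring of `target` acts as the fixed border constraint."""
    C = source.shape[0]
    lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
    kernel = lap.view(1, 1, 3, 3).repeat(C, 1, 1, 1)      # depthwise Laplacian filter

    def laplacian(img):
        return F.conv2d(img, kernel, padding=1, groups=C)

    Lt = laplacian(source.unsqueeze(0))                   # target response of generated texture

    x = source.clone().unsqueeze(0)
    tgt = target.unsqueeze(0)
    border = torch.zeros_like(x, dtype=torch.bool)
    border[:, :, 0, :] = True
    border[:, :, -1, :] = True
    border[:, :, :, 0] = True
    border[:, :, :, -1] = True
    x[border] = tgt[border]                               # impose border values

    for _ in range(n_iters):
        L = laplacian(x)
        x = x - 0.25 * (Lt - L)                           # Jacobi step towards laplacian(x) = Lt
        x[border] = tgt[border]                           # keep border fixed
    return x.squeeze(0)
```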

Fig. 11  Comparison of the achieved texture quality for PCA, GPLVM and VAE. Each column corresponds to an expression. The rows correspond to (top to
bottom): ground truth, VAE, GPLVM and PCA results

This is possible due to the fact that our method does not rely on accurate and detailed 3D geometry for animation. This is demonstrated in Fig. 10, where we present facial expressions with strong deformations and disocclusions (e.g. teeth and tongue appear in the opened mouth, which is not represented by geometry). While the data processing and training have to be done in an offline learning phase, we can animate and render interactively at a rate of 20 frames per second. Rendering and training were performed on a regular desktop computer with 64 GB RAM, a 2.6 GHz CPU (14 cores with hyperthreading) and one GeForce RTX 2070 graphics card. The animated eye region has a size of 360 × 100 pixels and needs ~2.5 GB of graphics memory during training. The mouth area is 470 × 380 pixels and needs ~4.5 GB of GPU memory during training, which means that both models can be trained simultaneously on one GPU. Our face model is based on the SMPL mesh template [34]. While we adapt the geometry and compute the personalised face geometry model automatically, we used a semiautomatic mesh unwrapping technique to add texture coordinates to the mesh and annotate face landmarks on the mesh. Since these are tasks that need to be done only once, we do not consider this a strong limitation. Currently, we use a neural texture model that does not account for view dependency of textures. While this is not a major limitation in most cases, it can cause erroneous perspective distortions if the geometry is not accurate enough and the rendering viewpoint differs much from the camera's viewpoint from which the texture was generated.

In order to assess the effectiveness of our VAE regarding the representation of dynamic textures, we compare the reconstruction quality of our model with two other well-known methods, namely PCA and the Gaussian process latent variable model (GPLVM) [42]. PCA-based representations are a popular choice in computer vision and computer graphics, especially when dealing with high-dimensional data that can be well explained by a linear combination of ortho-normal base vectors and a constant offset.

Fig. 12  This figure compares the achieved texture quality between PCA, GPLVM and VAE. Each column corresponds to an expression. The rows correspond
to (top to bottom): ground truth, VAE, GPLVM and PCA results

Fig. 13  Comparison of the interpolation capabilities of a PCA-, GPLVM- and VAE-based texture representation. We synthesise intermediate expressions to
simulate a transition from closed to open eyes. The two top rows show the expressions that are synthesised by our proposed model. The rows in the middle
show expressions that are synthesised by the GPLVM-based approach and the two bottom rows correspond to the PCA-based texture interpolation. While the
VAE-based approach shows realistic intermediate expressions, PCA and GPLVM textures exhibit ghosting and/or blur

While PCA-based models assume a linear relationship between data and its latent representation, GPLVMs overcome this limitation and are also capable of learning non-linear relations between data and latent variables. For our experiments, we trained PCA-based models as well as GPLVMs on the mouth and eye textures and compared the reconstructed textures with the ones generated by our VAE, see Figs. 11 and 12. For both non-linear models, we used a 2D latent space. For the linear PCA model, we chose the number of bases such that the number of model parameters is approximately the same as in the VAE. As the VAE has ~9.6 million parameters, we use a 20D PCA basis. Looking at the reconstructed mouth textures, the most obvious difference between the results of PCA/GPLVM and VAE is that the textures of the latter are less blurry and do not suffer from ghosting artefacts. The blurriness is caused by the fact that we capture small motions, deformations as well as occlusions/disocclusions in texture space. This strategy greatly simplifies the general process of facial performance capture, but in order to produce high-quality results we need a more sophisticated texture model that is actually capable of representing complex changes in texture space. Comparison of the mouth and eye textures shows that eye textures are in general more detailed and less blurry for all three models. This might be caused by the fact that eye movements and blinks are too fast to be captured by our cameras, which results in clusters of similar eye states rather than the actual motion. Therefore, even the simple linear PCA model is able to reconstruct artefact-free textures. However, as we want to use our model not only for the reconstruction of already captured data but also for generating intermediate expressions (e.g. when changing from one facial expression to another), we conducted another experiment where we use all three models to interpolate between closed and open eyes, see Fig. 13. To synthesise the intermediate expressions, we performed a linear interpolation in the latent space. The results show that even though all three models are able to faithfully reconstruct the original eye textures, only the VAE-based model is able to synthesise a texture sequence that shows a realistic transition from closed to open eyes. The texture sequence that is generated by the PCA model contains ghosting artefacts, because the interpolation in latent space essentially results in a linear interpolation of corresponding texel values, and the GPLVM synthesises a texture sequence that actually shows moving eyelids but the textures are much blurrier than the VAE textures.
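The interpolation experiment can be reproduced with a few lines once encode/decode functions are available for each model; the sketch below is schematic, where `encode`/`decode` stand for either the VAE's encoder/decoder or PCA's transform/inverse transform (for PCA the result reduces to a per-texel linear blend, which explains the ghosting).

```python
import numpy as np

def interpolate_latent(encode, decode, tex_a, tex_b, steps=8):
    # encode both key textures, blend linearly in latent space, decode back
    za, zb = encode(tex_a), encode(tex_b)
    return [decode((1.0 - t) * za + t * zb) for t in np.linspace(0.0, 1.0, steps)]

# e.g. frames = interpolate_latent(vae_encode, vae_decode, closed_eyes_tex, open_eyes_tex)
# where vae_encode/vae_decode and the two key textures are hypothetical handles
```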

7 Conclusion

We presented a new method for video-based animation of human faces. The key idea of the method is to allow animators to use deep neural networks in a simple way to control the facial expressions of a virtual human character. The simple representation on a 2D grid makes it easy to create believable animations, even for untrained users. The data-driven approach allows generating realistic facial animations and renderings since our neural face model is learned completely from captured data of our actress. Facial animation parameters can be easily generated from real data by encoding sequences of extracted texture and geometry, which enables a user to create a database of realistic animation sequences in a motion-capture-like way. However, unlike motion capture, the facial expressions can be represented holistically in a low-dimensional latent space that allows easy manipulation. The representation of texture and geometry data with autoencoders allows for smooth and artefact-free interpolation between facial expressions, which enables a user to perform interesting operations with captured video and geometry data, like concatenation or looping of atomic sequences, to create novel animation sequences that have not been captured in that way. Apart from that, we also compare the VAE-based texture representation with a PCA- and GPLVM-based approach and show that in contrast to PCA/GPLVM, our proposed method generates high-quality texture sequences that do not exhibit strong blur or ghosting artefacts.

8 Acknowledgments

This work was partly funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 762021 (Content4All) as well as the German Federal Ministry of Education and Research under grant agreement 03ZZ0468 (3DGIM2).

9 References

[1] Cootes, T.F., Edwards, G.J., Taylor, C.J.: 'Active appearance models', IEEE Trans. Pattern Anal. Mach. Intell., 1998, 23, pp. 484–498
[2] Blanz, V., Vetter, T.: 'A morphable model for the synthesis of 3D faces'. Proc. 26th Annual Conf. on Computer Graphics and Interactive Techniques (SIGGRAPH '99), New York, NY, USA, 1999, pp. 187–194. Available at: http://dx.doi.org/10.1145/311535.311556
[3] Vlasic, D., Brand, M., Pfister, H., et al.: 'Face transfer with multilinear models', ACM Trans. Graph., 2005, 24, (3), pp. 426–433. Available at: http://doi.acm.org/10.1145/1073204.1073209
[4] Li, H., Yu, J., Ye, Y., et al.: 'Realtime facial animation with on-the-fly correctives', ACM Trans. Graph., 2013, 32, (4), pp. 42:1–42:10. Available at: http://doi.acm.org/10.1145/2461912.2462019
[5] Cao, C., Weng, Y., Zhou, S., et al.: 'Facewarehouse: a 3D facial expression database for visual computing', IEEE Trans. Vis. Comput. Graphics, 2014, 20, (3), pp. 413–425. Available at: http://dx.doi.org/10.1109/TVCG.2013.249
[6] Garrido, P., Valgaert, L., Wu, C., et al.: 'Reconstructing detailed dynamic face geometry from monocular video', ACM Trans. Graph., 2013, 32, (6), pp. 158:1–158:10. Available at: http://doi.acm.org/10.1145/2508363.2508380
[7] Thies, J., Zollhöfer, M., Nießner, M., et al.: 'Real-time expression transfer for facial reenactment', ACM Trans. Graph. (TOG), 2015, 34, (6), pp. 1–14
[8] Paier, W., Kettern, M., Hilsmann, A., et al.: 'A hybrid approach for facial performance analysis and editing', IEEE Trans. Circuits Syst. Video Technol., 2017, 27, (4), pp. 784–797. Available at: https://doi.org/10.1109/TCSVT.2016.2610078
[9] Fechteler, P., Paier, W., Hilsmann, A., et al.: 'Real-time avatar animation with dynamic face texturing'. Proc. 23rd IEEE Int. Conf. on Image Processing (ICIP 2016), Phoenix, Arizona, USA, 2016
[10] Paier, W., Kettern, M., Hilsmann, A., et al.: 'Video-based facial re-animation'. Proc. 12th European Conf. on Visual Media Production, November 2015
[11] Casas, D., Volino, M., Collomosse, J., et al.: '4D video textures for interactive character appearance', Comput. Graph. Forum, 2014, 33, (2), pp. 371–380
[12] Fechteler, P., Paier, W., Eisert, P.: 'Articulated 3D model tracking with on-the-fly texturing'. Proc. 21st IEEE Int. Conf. on Image Processing (ICIP 2014), Paris, France, 2014, pp. 3998–4002
[13] Dale, K., Sunkavalli, K., Johnson, M.K., et al.: 'Video face replacement', ACM Trans. Graph. (Proc. SIGGRAPH Asia), Hong Kong, 2011, vol. 30
[14] Lipski, C., Klose, F., Ruhl, K., et al.: 'Making of "who cares?" HD stereoscopic free viewpoint video'. Proc. Eighth European Conf. on Visual Media Production, London, November 2011
[15] Kilner, J., Starck, J., Hilton, A.: 'A comparative study of free-viewpoint video techniques for sports events'. Proc. Third European Conf. on Visual Media Production, London, November 2006
[16] Borshukov, G., Montgomery, J., Werner, W., et al.: 'Playable universal capture'. ACM SIGGRAPH 2006 Sketches, Boston, August 2006
[17] Carranza, J., Theobalt, C., Magnor, M.A., et al.: 'Free-viewpoint video of human actors', ACM Trans. Graph., 2003, 22, (3), pp. 569–577
[18] Kingma, D.P., Welling, M.: 'Auto-encoding variational Bayes'. Banff, Canada, 2013
[19] Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., et al.: 'Generative adversarial nets'. Proc. 27th Int. Conf. on Neural Information Processing Systems - Volume 2 (NIPS'14), Cambridge, MA, USA, 2014, pp. 2672–2680. Available at: http://dl.acm.org/citation.cfm?id=2969033.2969125
[20] Radford, A., Metz, L., Chintala, S.: 'Unsupervised representation learning with deep convolutional generative adversarial networks', 2015
[21] Lample, G., Zeghidour, N., Usunier, N., et al.: 'Fader networks: manipulating images by sliding attributes', CoRR, Long Beach, CA, USA, 2017, abs/1706.00409. Available at: http://arxiv.org/abs/1706.00409
[22] Kulkarni, T.D., Whitney, W., Kohli, P., et al.: 'Deep convolutional inverse graphics network', CoRR, Montréal, Canada, 2015, abs/1503.03167. Available at: http://arxiv.org/abs/1503.03167
[23] Pumarola, A., Agudo, A., Martinez, A.M., et al.: 'GANimation: anatomically-aware facial animation from a single image'. Proc. European Conf. on Computer Vision (ECCV), Munich, Germany, 2018
[24] Geng, J., Shao, T., Zheng, Y., et al.: 'Warp-guided GANs for single-photo facial animation', ACM Trans. Graph., 2018, 37, (6), pp. 231:1–231:12. Available at: http://doi.acm.org/10.1145/3272127.3275043
[25] Huang, Y.M., Khan, S.: 'DyadGAN: generating facial expressions in dyadic interactions'. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops, Honolulu, Hawaii, July 2017
[26] Bansal, A., Ma, S., Ramanan, D., et al.: 'Recycle-GAN: unsupervised video retargeting'. ECCV, Munich, Germany, 2018
[27] Webber, S., Harrop, M., Downs, J., et al.: 'Everybody dance now: tensions between participation and performance in interactive public installations'. Proc. Annual Meeting of the Australian Special Interest Group for Computer Human Interaction (OzCHI '15), New York, NY, USA, 2015, pp. 284–288. Available at: http://doi.acm.org/10.1145/2838739.2838801
[28] Lombardi, S., Saragih, J.M., Simon, T., et al.: 'Deep appearance models for face rendering', CoRR, Montréal, Canada, 2018, abs/1808.00362. Available at: http://arxiv.org/abs/1808.00362
[29] Slossberg, R., Shamai, G., Kimmel, R.: 'High quality facial surface and texture synthesis via generative adversarial networks', CoRR, Munich, Germany, 2018, abs/1808.08281. Available at: http://arxiv.org/abs/1808.08281
[30] Lowe, D.G.: 'Distinctive image features from scale-invariant keypoints', Int. J. Comput. Vision, 2004, 60, (2), pp. 91–110. Available at: https://doi.org/10.1023/B:VISI.0000029664.99615.94
[31] Furch, J., Eisert, P.: 'An iterative method for improving feature matches'. 2013 Int. Conf. on 3D Vision - 3DV 2013, June 2013, pp. 406–413
[32] Kazhdan, M., Hoppe, H.: 'Screened Poisson surface reconstruction', ACM Trans. Graph., 2013, 32, (3), pp. 29:1–29:13. Available at: http://doi.acm.org/10.1145/2487228.2487237
[33] Blumenthal-Barby, D., Eisert, P.: 'High-resolution depth for binocular image-based modelling', Comput. Graph., 2014, 39, pp. 89–100
[34] Loper, M., Mahmood, N., Romero, J., et al.: 'SMPL: a skinned multi-person linear model', ACM Trans. Graph. (Proc. SIGGRAPH Asia), 2015, 34, (6), pp. 248:1–248:16
[35] Kazemi, V., Sullivan, J.: 'One millisecond face alignment with an ensemble of regression trees'. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Columbus, Ohio, USA, 2014, pp. 1867–1874
[36] Pinkall, U., Polthier, K.: 'Computing discrete minimal surfaces and their conjugates', Exp. Math., 1993, 2, (1), pp. 15–36
[37] Meyer, M., Desbrun, M., Schröder, P., et al.: 'Discrete differential-geometry operators for triangulated 2-manifolds' (Springer-Verlag, Germany, 2002), pp. 35–57
[38] Boykov, Y., Veksler, O., Zabih, R.: 'Fast approximate energy minimization via graph cuts', IEEE Trans. Pattern Anal. Mach. Intell., 2001, 23, (11), pp. 1222–1239. Available at: https://doi.org/10.1109/34.969114
[39] Pérez, P., Gangnet, M., Blake, A.: 'Poisson image editing', ACM Trans. Graph., 2003, 22, (3), pp. 313–318. Available at: http://doi.acm.org/10.1145/882262.882269
[40] Isola, P., Zhu, J.Y., Zhou, T., et al.: 'Image-to-image translation with conditional adversarial networks', arXiv, 2016
[41] Kettern, M., Hilsmann, A., Eisert, P.: 'Temporally consistent wide baseline facial performance capture via image warping'. Proc. Vision, Modeling, and Visualization Workshop 2015, Aachen, Germany, October 2015
[42] Lawrence, N.D.: 'Gaussian process latent variable models for visualisation of high dimensional data'. Proc. 16th Int. Conf. on Neural Information Processing Systems (NIPS'03), Cambridge, MA, USA, 2003, pp. 329–336

