
algorithms

Article
Generative Model for Skeletal Human Movements
Based on Conditional DC-GAN Applied
to Pseudo-Images
Wang Xi 1 , Guillaume Devineau 2 , Fabien Moutarde 2, * and Jie Yang 1
1 School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240,
China; luckyxi94@sjtu.edu.cn (W.X.); jieyang@sjtu.edu.cn (J.Y.)
2 Centre de Robotique, MINES ParisTech, Université PSL, 75006 Paris, France;
guillaume.devineau@mines-paristech.fr
* Correspondence: fabien.moutarde@mines-paristech.fr

Received: 4 November 2020; Accepted: 27 November 2020; Published: 3 December 2020 

Abstract: Generative models for images, audio, text, and other low-dimension data have achieved
great success in recent years. Generating artificial human movements can also be useful for many
applications, including improvement of data augmentation methods for human gesture recognition.
The objective of this research is to develop a generative model for skeletal human movement,
allowing control of the action type of generated motion while keeping the authenticity of the
result and the natural style variability of gesture execution. We propose to use a conditional Deep
Convolutional Generative Adversarial Network (DC-GAN) applied to pseudo-images representing
skeletal pose sequences using tree structure skeleton image format. We evaluate our approach on
the 3D skeletal data provided in the large NTU_RGB+D public dataset. Our generative model can
output qualitatively correct skeletal human movements for any of the 60 action classes. We also
quantitatively evaluate the performance of our model by computing Fréchet inception distances,
which show a strong correlation with human judgement. To the best of our knowledge, our work
is the first successful class-conditioned generative model for human skeletal motions based on
pseudo-image representation of skeletal pose sequences.

Keywords: generative model; human movement; conditional deep convolutional generative adversarial network; GAN; spatiotemporal pseudo-image; TSSI

1. Introduction
Human movement generation, although less developed than for instance text generation, is an
important applicative field for sequential data generation. It is difficult to obtain enough effective
human movement data because of its complexity: human movement is the result of both physical
limitations (torque exerted by muscles, gravity, and momentum conservation) and the intentions of subjects
(how to perform an intentional motion) [1]. Generating artificial human movement data enables data
augmentation, which should improve the performance of models in all fields of study on human
movement: classification, prediction, generation, etc. It will also have important applications in
related domains. Imagine, for example, computer games in which each character does not run and jump in
exactly the same way, but in its own style, making those actions in the game closer to reality.
The objective of this research is to develop a generative model for skeletal human movements,
allowing control of the action type of generated motion while keeping the authenticity of the result and
the natural style variability of gesture execution. Skeleton-based movement generation has recently
become an active topic in computer vision, owing to the potential advantages of skeletal representation
and to recent algorithms for automatically extracting skeletal pose sequences from videos. Previously,
most of the methods and researches on human gestures, including human movement generation,
used RGB image sequences, depth image sequences, videos, or some fusion of these modalities as
input data. Nowadays, it is easy to get relatively accurate 2D or 3D human skeletal pose data thanks to
state-of-the-art human pose estimation algorithms and high-performance sensors. Compared to dense
image approaches, skeletal data has the following advantages: (a) lower computational requirement
and (b) robustness under situations such as complicated background and changing body scales,
viewpoints, motion speed, etc. Besides these advantages, skeletal data has the three following main
characteristics [2]:

• spatial information: strong correlations between adjacent joints, which makes it possible to learn
about body structural information within a single frame (intra-frame);
• temporal information: makes it possible to learn about temporal correlation between frames
(inter-frame); and
• cooccurrence relationship between spatial and temporal domains when taking joints and bones
into account.

Skeletal-based human movement generation can be formalized as a continuous multivariate
time-series problem and is therefore often tackled using Recurrent Neural Networks (RNNs). In our work,
we instead decided to represent each skeletal pose sequence as a spatiotemporal pseudo-image,
with space and time as the two dimensions of the image. The advantage is that we
can directly apply state-of-the-art generative models for images, which are more mature and simpler
in terms of the network architecture than the ones tailored for continuous time-series generation.
Thus, it is important to find a good pseudo-image representation for 3D skeletal data. The related
works will be further developed in the next part. The key issues to be solved include how to generate
realistic human movements, how to control the output (action, style, etc.), and how to evaluate the
model effectively.
Our work proves that, as long as we apply a well-adapted pseudo-image representation for 3D
skeletal data, such as Tree Structure Skeleton Image (TSSI), we can generate human movement by a
simple standard image generative model such as Deep Convolutional Generative Adversarial Network
(DC-GAN [3]). Furthermore, by introducing a conditional label, we control the output action type.
We also evaluate our generative models by calculating Fréchet inception distance.
The paper is organized as follows: Section 2 presents published works related to human movement
generative models and to pseudo-image representations for skeletal pose sequences; Section 3 explains
the details of our approach; Section 4 analyzes the results we have obtained; and Section 5 presents our
conclusions and perspectives.

2. Related Works
In this section, we present some of the already published works most related to our work
concerning generative models for skeletal human movements and pseudo-image representation for
skeletal pose sequences.

2.1. Generative Models for Skeletal Human Movements


A family of methods based on deep Recurrent Neural Networks (RNNs) have shown good
performance on generative tasks for human skeletal movements. For example, Fragkiadaki et al. [4]
proposed in 2015 the Encoder-Recurrent-Decoder (ERD) model that uses curriculum learning and
incorporates representation learning in the architecture. In 2016, Jain et al. introduced the concept
of structural-RNN [5], which manually encodes the semantic similarity between different body parts
based on spatiotemporal graphs. The reason why RNN is a popular generative model for sequential
data is that RNNs typically possess both a richly distributed internal state representation and flexible
nonlinear transition functions. This expressive power and the ability to train via error back-propagation
are the key reasons why RNNs have gained popularity as generative models for highly structured
sequential data. Inspired by previous works on RNN, Martinez et al. [1] proposed in 2017 a human
motion prediction model using sequence-to-sequence (seq2seq) architecture with residual connections,
which turned out to be a simple but state-of-the-art method on prediction of short-term dynamics of
human movement. Chung et al. [6] explored the inclusion of the latent random variables into the
hidden state of an RNN by combining the elements of the variational autoencoder. They argued that,
through the use of high-level latent random variables, their variational RNN (VRNN) can model the
kind of variability observed in highly structured sequential data such as natural speech.
Besides the models based on RNNs and Variational Auto-Encoders (VAEs), Generative Adversarial
Networks (GANs) [7], which achieve great success in generative problems, are recently one of the
most active machine-learning research topics. GANs have proved their strength in the generation of
images and some types of temporal sequences. For example, Deep Convolutional GAN (DC-GAN) [3]
proposed a strided conv-transpose layer and introduced a conditional label, which allows for the
generation of impressive images. WaveNet [8], applying dilated causal convolutions that effectively
operate on a coarser scale, can generate raw audio waveforms and can transform text to speech which
human listeners can hardly distinguish from the true voice. Recently, several researches on human
movement prediction and generation models are based on GAN. In 2018, the Human Prediction
GAN (HP-GAN) [9] of Barsoum et al. also used a sequence-to-sequence model as its generator and
additionally designed a critic network to calculate a custom loss function inspired by WGAN-GP [10].
Kiasari et al. [11] took another approach, combining an Auto-Encoder (AE) and a conditional GAN
to generate sequences of human movements.

2.2. Pseudo-Image Representation for Skeletal Pose Sequences


In many skeleton-based human movement recognition studies, researchers use Convolutional
Neural Networks (CNN) because of their efficiency and excellent ability to extract high level information.
However, the most common and developed CNNs use 2D convolutions and therefore generally focus
on image-based tasks. Meanwhile, human movement problems, including recognition and, the main
task of this paper, generation, are heavily temporal. Thus, it is challenging to
balance and capture both spatial and temporal information in a CNN-based architecture [2].
In general, to meet the needs of the CNN input, 3D skeletal sequence data should be transformed
into a pseudo-image. Nevertheless, designing a decent representation that contains both spatial and temporal
information is not simple. Many researchers encode the skeletal joints into multiple 2D pseudo-images and
then feed them into the model. Wang et al. [12] introduced Joint Trajectory Maps (JTM), which represent
spatial configuration and the dynamics of joint trajectories into three texture images through color
encoding. Li et al. [13] proposed a translation-scale invariant image mapping. They pointed out that
the JTM encoding method is dataset dependent and not robust to the translation and scale variation
of human activity. To tackle this problem, they divided all human skeletal joints into five main parts
according to human physical structure and then mapped them to images. However, they merely
focused on the coordinates of isolated joints and therefore ignored the spatial relationships between
joints and only implicitly learned the movement representations. Take the action “walking” for example,
not only the legs and feet but also the arms and other body parts should be considered. In a word,
the human body needs to be treated as a whole. To solve this problem, Li et al. [14] proposed in 2019 an
effective method to learn comprehensive representations from skeletal sequences by using geometric
algebra. Firstly, a frontal orientation-based spatiotemporal model was constructed to represent the
spatial configuration and temporal dynamics of skeletal sequences, which is robust against view
variations. Then, shape-motion representations which mutually compensate one another were learned
to describe the skeletal actions comprehensively. Liu et al. [15] used enhanced skeletal visualization to
represent the skeletal data. SkeleMotion [16] directly encodes data by using orientation and magnitude
to provide information regarding the velocity of the movement in different temporal scales. The Tree
Structure Reference Joint Image (TSRJI) of Caetano et al. [17] combines the use of reference joints and
a tree structure skeleton: while the former incorporates different spatial relationships between the
joints, the latter preserves important spatial relations by traversing a skeletal tree with a depth-first
order algorithm.
Even if these skeletal sequence representations have made great efforts to encode as much
information as possible in one or a series of “images”, it is still hard to learn global cooccurrences.
When these pseudo-images are fed to CNN-based models, only the neighboring joints within the
convolutional kernel are considered, so only local cooccurrence features are learned. Thus, in 2018, Li et al. [18]
proposed an end-to-end cooccurrence feature-learning framework, where features are aggregated
gradually from point-level features to global cooccurrence features.

3. Materials and Methods


To generate human skeletal motions, we used a conditional Deep Convolutional Generative
Adversarial Network (DC-GAN) applied to pseudo-images representing skeletal pose sequences using
Tree Structure Skeleton Image (TSSI) format. We evaluated our approach on the 3D skeletal sequence
data provided in the large NTU_RGB+D [19] public dataset, and use Fréchet inception distances as
quantitative evaluation. The details of the database, pseudo-image representation, data preparation,
our generative model, and the process of training and evaluation will be provided in this section.

3.1. NTU_RGB+D Dataset


NTU_RGB+D [19] is a large-scale public dataset for 3D human activity analysis. It contains
60 different action classes including daily, mutual, and health-related actions. With 56,880 samples,
40 human subjects, and 80 camera views, this dataset is much larger than other available datasets for
RGB+D (RGB videos and depth sequence) action analysis. In our research, we only use the skeletal
data (3D locations of 25 major body joints) provided in the dataset.

3.2. Tree Structure Skeleton Image (TSSI)


In our approach, we transform the skeletal pose sequences into Tree Structure Skeleton Image
(TSSI) pseudo-images [20]. The principle of TSSI is to create one image for each sequence of skeletons.
In these spatiotemporal pseudo-images, the horizontal axis corresponds to joint positions and the
vertical axis corresponds to time. The joint positions are encoded as one channel for x, one for y, and one
for z, therefore obtaining a color image with red corresponding to x, green corresponding to y, and blue
corresponding to z. One of the important particularities of TSSI is reordering the skeletons using
depth-first tree traversal order so that neighboring columns in spatiotemporal images are spatially
related in the original skeletal graph. The spatial relations between connected skeletal joints can thus be
well preserved. TSSI pseudo-images can thus be used to train a normal generative model for images
in a semantically meaningful way, where the spatial and temporal features are meant to be learnt at the
same time. Also, compared with
other pseudo-image methods, TSSI is relatively simple. Figure 1 illustrates the skeleton structure and
order in NTU_RGB+D with traversal order and joint arrangements of TSSI.
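For illustration, the sketch below (Python/NumPy) shows how such a depth-first reordering turns a skeleton sequence into a TSSI pseudo-image; the traversal order used here is only a hypothetical example and not the exact 49-column order of Figure 1 / [20].

# Minimal sketch of building a TSSI pseudo-image (the traversal order below is
# illustrative only; the actual order for the 25-joint NTU_RGB+D skeleton is the
# one shown in Figure 1 / ref. [20]).
import numpy as np

# Hypothetical depth-first traversal: a joint index reappears each time the
# traversal returns through it.
TSSI_ORDER = [0, 1, 20, 2, 3, 2, 20, 4, 5, 6, 7, 6, 5, 4, 20, 8, 9, 10, 11,
              10, 9, 8, 20, 1, 0, 12, 13, 14, 15, 14, 13, 12, 0, 16, 17, 18,
              19, 18, 17, 16, 0]   # illustrative, not the exact 49-column order

def skeleton_sequence_to_tssi(seq):
    """seq: (T, 25, 3) array of joint coordinates -> (T, len(TSSI_ORDER), 3) pseudo-image."""
    seq = np.asarray(seq, dtype=np.float32)
    return seq[:, TSSI_ORDER, :]   # reorder joints along the spatial axis

# Example: a random 100-frame sequence becomes a (100, 41, 3) color pseudo-image
# (x -> red, y -> green, z -> blue after normalization).
pseudo_image = skeleton_sequence_to_tssi(np.random.randn(100, 25, 3))
print(pseudo_image.shape)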
Figure 1. Illustration (adapted from [20]) of Tree Structure Skeleton Image (TSSI): (a) skeleton structure and order in NTU_RGB+D, with traversal order that respects spatial relations, and (b) joint arrangements of TSSI. The shape is (T, J', d), where T is the sequence duration, J' the number of joints in traversal order (the J skeletal joints, with some repeated, for the TSSI), and d the dimension (d = 3 for (x, y, z) 3D positions).

3.3. Data Preparation

First, we cleaned up the samples with missing data. Next, since the time length of sequences in NTU_RGB+D varies around 100 frames, we linearly resampled all examples to the length of 100 time-steps. They were then transformed into TSSI pseudo-images of size 100 × 49 × 3 (time, joints in TSSI order, 3), where 3 stands for the three-dimensional coordinates of each joint. The input size of the DC-GAN was 64 × 64 × 3, so the spatiotemporal images were further resized to 64 × 64 × 3 using bilinear interpolation. Finally, these images were saved by class. The entire dataset was used for training.
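A minimal sketch of this preparation pipeline is given below, assuming the skeleton_sequence_to_tssi helper sketched in Section 3.2 and a simple min-max normalization (the exact normalization used is not specified in the text).

# Sketch of the data preparation described above (the normalization to [0, 255]
# is an assumption made for illustration).
import numpy as np
from PIL import Image

def resample_sequence(seq, target_len=100):
    """Linearly resample a (T, 25, 3) skeleton sequence to target_len time-steps."""
    seq = np.asarray(seq, dtype=np.float32)
    t_old = np.linspace(0.0, 1.0, num=seq.shape[0])
    t_new = np.linspace(0.0, 1.0, num=target_len)
    out = np.empty((target_len,) + seq.shape[1:], dtype=np.float32)
    for j in range(seq.shape[1]):
        for d in range(seq.shape[2]):
            out[:, j, d] = np.interp(t_new, t_old, seq[:, j, d])
    return out

def to_input_image(tssi, size=(64, 64)):
    """Rescale a (100, 49, 3) TSSI pseudo-image to the 64 x 64 x 3 DC-GAN input size."""
    lo, hi = tssi.min(), tssi.max()
    img = ((tssi - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
    return np.asarray(Image.fromarray(img).resize(size, Image.BILINEAR))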
3.4. Conditional Deep Convolution Generative Adversarial Network (Conditional DC-GAN)

In this paper, we used a conditional Deep Convolution Generative Adversarial Network (conditional DC-GAN) based on DC-GAN [3] and cGAN [21]. In the original DC-GAN architecture, the discriminator was made up of strided convolution layers, batch norm layers, and LeakyReLU activations. Its input was a 64 × 64 × 3 (Height × Width × Channel) image, and the output was a scalar probability that the input comes from the real-data distribution. The generator was comprised of convolutional-transpose layers, batch norm layers, and ReLU activations. Its input was a latent vector z drawn from a standard normal distribution, and the output was a 64 × 64 × 3 RGB image. The strided conv-transpose layers allowed the latent vector to be transformed into a volume with the same shape as an image.

Figure 2 shows the architecture of our conditional DC-GAN. In the generator, we used an upsample layer and a convolutional layer instead of the original convolutional-transpose layer in order to avoid checkerboard artifacts [22]. If a standard deconvolution (based on a convolutional-transpose layer) is used to upscale from a lower-resolution image to a higher one, it uses every point in the small image to "paint" a square in the larger one. This causes the problem of "uneven overlap", that is to say, when these "squares" in the larger image overlap. In particular, deconvolution has uneven overlap when the kernel size is not divisible by the stride. These overlapping pixels create an unwanted checkerboard pattern on generated images.

Figure 2. Architecture of our conditional Deep Convolutional Generative Adversarial Network (DC-GAN).
Thus, we instead used an upsample layer followed by convolutional layers to achieve the "deconvolution" operation. We resized the image using the upsample layer with bilinear interpolation (or nearest-neighbor interpolation) and then applied a convolutional layer. For the best result, before the convolutional layer, we added a ReflectionPad2d layer of size 1 to avoid boundary artifacts. Details are provided in Appendix A.
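A minimal PyTorch sketch of one such upscaling block is shown below; the 3 × 3 kernel and the channel sizes in the comment are assumptions for illustration, the exact values being those of Appendix A.

# One generator upscaling block as described above
# (Upsample + ReflectionPad2d + Conv2d + BatchNorm + ReLU).
import torch.nn as nn

def upscale_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.ReflectionPad2d(1),                 # pad by 1 to avoid boundary artifacts
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=0),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# e.g. 4x4x512 -> 8x8x256 -> 16x16x128 -> 32x32x64 -> final 64x64x3 output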
In order to control the output action, the model receives the class label at input for both the generator and the discriminator. The generator receives as input a random noise (1 × 1 × Z) and a one-hot conditional label (1 × 1 × C), where Z is the dimension of z and C is the number of classes. By default, we chose Z = 100 and C = 60. They were concatenated and passed to a convolutional layer with a 1 × 1 kernel. The output channel was 8192 (= 4 × 4 × 512). Then, the output data were resized to the shape of 4 × 4 × 512. Finally, we passed the data through several upscaling blocks (Upsample + Conv2D + BatchNorm, except for the output layer), which are exactly the same as the ones in the original DC-GAN. The reason for not directly resizing the input data (the concatenation of random noise and one-hot label) was that, when the data size is 1 × 1, it can only be upsampled to equal-valued data. Also, we wished to keep the input noise size at 1 × 1 × Z. Therefore, we decided to add an extra convolutional layer and the resizing step to make it work. In the discriminator, the label was encoded as an image with C channels (C for class number). The label was one-hot encoded along the channel dimension. To be more specific, the label of the nth action was an H × W × C sized tensor, with all 1s on the nth channel and 0 for the rest. At the input layer, the image and the label were concatenated: every pixel of the image was concatenated with a one-hot label along the channel direction. Thus, the input channel size was 3 + C. The architecture of the discriminator consisted of several convolutional layers and batch normalization layers (except for the output layer), as in the original architecture of DC-GAN.
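The following PyTorch sketch illustrates one possible reading of this conditional input handling; layer details other than the 1 × 1 convolution and the two concatenations are assumptions.

# Sketch of the conditional inputs described above (Z = 100, C = 60 by default).
import torch
import torch.nn as nn

Z, C = 100, 60
proj = nn.Conv2d(Z + C, 4 * 4 * 512, kernel_size=1)      # 1 x 1 conv after concatenation

def generator_input(z, labels):
    """z: (N, Z, 1, 1) noise; labels: (N,) class indices -> (N, 512, 4, 4) feature map."""
    onehot = nn.functional.one_hot(labels, C).float().view(-1, C, 1, 1)
    x = torch.cat([z, onehot], dim=1)                     # (N, Z + C, 1, 1)
    x = proj(x)                                           # (N, 8192, 1, 1)
    return x.view(-1, 512, 4, 4)                          # reshaped before the upscaling blocks

def discriminator_input(images, labels):
    """images: (N, 3, 64, 64) -> (N, 3 + C, 64, 64) with the label broadcast as C channels."""
    n, _, h, w = images.shape
    onehot = nn.functional.one_hot(labels, C).float().view(n, C, 1, 1).expand(n, C, h, w)
    return torch.cat([images, onehot], dim=1)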
3.5. Loss Function and Training Process

We used the standard loss function for GANs, which is defined in [7] as follows:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (1)
Since a GAN has a generator and a discriminator, we had two losses: Loss_D and Loss_G. We used the Binary Cross Entropy (BCE) as loss function. Loss_D was calculated as the sum of errD_real = −log(D(x)) and errD_fake = −log(1 − D(G(z))), representing the BCE of the prediction with respect to the ground truth when D receives real and fake inputs, respectively. Loss_G = −log(D(G(z))) calculates the BCE of the "false positive" prediction when D receives fake inputs.
The discriminator and the generator were trained and updated one after another, with the Adam optimizer at a learning rate of 0.0002 by default.
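As an illustration, a hedged PyTorch sketch of one training step with these BCE losses is given below; netG and netD stand for the (assumed) conditional generator and discriminator taking (noise, label) and (image, label) inputs, with a sigmoid output for D.

# One optimization step with the standard BCE GAN losses described above.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(netG, netD, optG, optD, real_images, labels, z_dim=100):
    """One update of D followed by one update of G, as described in Section 3.5."""
    n = real_images.size(0)
    z = torch.randn(n, z_dim, 1, 1, device=real_images.device)

    # Update D: Loss_D = errD_real + errD_fake = -log D(x) - log(1 - D(G(z)))
    optD.zero_grad()
    out_real = netD(real_images, labels)
    errD_real = bce(out_real, torch.ones_like(out_real))
    fake = netG(z, labels)
    out_fake = netD(fake.detach(), labels)
    errD_fake = bce(out_fake, torch.zeros_like(out_fake))
    (errD_real + errD_fake).backward()
    optD.step()

    # Update G: Loss_G = -log D(G(z)) (the non-saturating form of Equation (1))
    optG.zero_grad()
    out = netD(fake, labels)
    errG = bce(out, torch.ones_like(out))
    errG.backward()
    optG.step()
    return (errD_real + errD_fake).item(), errG.item()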

3.6. Transformation of Generated Pseudo-Images into Skeletal Sequences


The input and output data of our model were both pseudo-images. In Section 3.3, we explained
how we transform the original variable-length skeletal sequences into TSSI pseudo-images of identical
size, which are fed into the discriminator. The generator produces fake pseudo-images of the
same size. Therefore, in order to obtain a sequence of skeletons from a generated pseudo-image,
we need to invert the operation described in Section 3.3. First, the generated
64 × 64 × 3 pseudo-image was resized into 100 × 49 × 3 (time-step, joints in TSSI order, xyz) using
bilinear interpolation (we chose 100 time-steps as the unique length for generated sequences). Then,
for the joints which were repeated in TSSI order, we took the average value. For example, the third
joint (neck) was repeated twice in TSSI order (the third and the fourth). We calculated the average of
these two values and took it as the value of the third joint of skeleton. In this way, for each generated
pseudo-image, we obtained a corresponding generated skeletal sequence with a shape of 100 × 25 × 3
(time-steps, joints, xyz), which can be visualized as a sequence of skeletal poses.
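A minimal sketch of this inverse transformation is given below, assuming TSSI_ORDER is the 49-element traversal order of Figure 1 and that pixel values have already been mapped back to coordinates.

# Decode a generated 64 x 64 x 3 pseudo-image into a (100, 25, 3) skeleton sequence.
import numpy as np
from PIL import Image

def pseudo_image_to_skeleton(gen_image, tssi_order, n_joints=25, n_frames=100):
    """gen_image: (64, 64, 3) array -> (n_frames, n_joints, 3) skeleton sequence."""
    gen_image = np.asarray(gen_image, dtype=np.float32)
    cols = len(tssi_order)
    # 1. Resize back to (100, 49, 3) with bilinear interpolation, channel by channel.
    resized = np.stack(
        [np.asarray(Image.fromarray(gen_image[:, :, c]).resize((cols, n_frames),
                                                               Image.BILINEAR))
         for c in range(3)], axis=-1).astype(np.float32)
    # 2. Average the columns that correspond to the same (repeated) joint.
    skeleton = np.zeros((n_frames, n_joints, 3), dtype=np.float32)
    counts = np.zeros(n_joints, dtype=np.float32)
    for col, joint in enumerate(tssi_order):
        skeleton[:, joint, :] += resized[:, col, :]
        counts[joint] += 1
    return skeleton / counts[None, :, None]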

3.7. Model Evaluation—Fréchet Inception Distance (FID)


To evaluate the model, we used Fréchet Inception Distance (FID) [23]. FID is a measure of
similarity between two datasets of images. It was shown to correlate well with human judgement of
visual quality and is most often used to evaluate the quality of samples of GAN. FID was first proposed
and used in [24] to improve the evaluation, being more consistent than the inception score [25]. The FID
metric calculates the distance between two distributions of images on the level of features. As in [25],
we computed FID measures using the features extracted from a pretrained inception V3 model [26].
FID is calculated by the following function [23]:

\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)    (2)

where \mu_r, \mu_g are the mean values of the features of real and fake images, and \Sigma_r, \Sigma_g are the covariance matrices of the features of real and fake images.
The smaller the FID, the greater the similarity between the two distributions. In our study, the real
images were the TSSI pseudo-images obtained (as described in Section 3.3) from skeletal sequences of
the NTU_RGB+D dataset and the fake images were the pseudo-images generated by our generative
model. We calculated FID between these two distributions of data as a metric. The aim was to
minimize FID.
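For reference, a minimal sketch of Equation (2) is given below, given two sets of features already extracted with the pretrained Inception V3 model (the feature extraction itself is omitted).

# FID between real and generated feature sets, following Equation (2).
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    """feats_*: (N, D) arrays of Inception features for real and generated pseudo-images."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r.dot(sigma_g), disp=False)
    covmean = covmean.real                       # drop tiny imaginary parts from numerics
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))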
Since we would like to evaluate the performance both by class of action and by all actions in the
future, we prepared the statistics for each class of action and the whole dataset. When calculating FID,
we only need to generate a certain number of fake samples to obtain their statistics \mu_g and \Sigma_g. Notice that there is a tradeoff between the number of fake samples and the computational cost of evaluation: the more samples we generate, the better the fake-sample distribution is represented, but the evaluation also takes more time and computing power. Thus, it is important to choose the number of samples carefully. We chose to generate 250 samples for each class of action.
4. Results

4.1. Qualitative Result for Generated Skeletal Actions

We first analyze qualitatively the results of our generative model. Figure 3 shows four examples of actions generated by our model (with the default hyperparameter setting) trained after 200 epochs: (a) sitting down; (b) standing up; (c) hand waving; and (d) kicking something. For each action class, the figure also shows for comparison a typical real pose sequence from the training set. As can be seen, the generated skeletal sequences match qualitatively and visually with the typical training example of a real pose sequence for the action class corresponding to the value of the class-condition input fed into our generator. The respective characteristics of each action class, such as the movement of body parts, are correctly learnt by the model, and the generated movements are continuous and smooth.

Figure 3. Examples of generated actions compared with a typical real training sequence for 4 action classes: (a) sitting down, (b) standing up, (c) hand waving, and (d) kicking something.

4.2. Quantitative Model Evaluation Using Fréchet Inception Distance (FID)


At each evaluation, we record the FID score for each class of action, their average value, and the
FID score of the total dataset. Figure 4 shows the action “standing up” generated by our model at the
150th epoch and the 200th epoch. The FID score is much smaller at the 200th epoch, and the
corresponding generated sequence is obviously qualitatively (i.e., visually) better. This confirms that
Algorithms 2020, 13, 319 9 of 17

4.2. Quantitative Model Evaluation Using Fréchet Inception Distance (FID)


At each evaluation, we record the FID score for each class of action, their average value, and the
FID score of the total dataset. Figure 4 shows the action “standing up” generated by our model at
the 150th epoch and the 200th epoch. The FID score is much smaller at the 200th epoch, and the
corresponding generated sequence is obviously qualitatively (i.e., visually) better. This confirms that
the FID on
Algorithms pseudo-images
2020, is clearly a good measure of the quality of the corresponding generated
13, x FOR PEER REVIEW 9 of 17
pose sequences.

(a)

(b)

(c)
Figure 4.
Figure 4. Checking
Checking the
the consistency
consistency ofof the
the Fréchet
Fréchet Inception
Inception Distance
Distance (FID)
(FID) score
score with
with the
the qualitative
qualitative
appearance
appearance of
of generated
generated action:
action: the
the action
action “standing
“standing up”
up” generated
generated by
by the
the model
model after
after (a)
(a) 150
150 epoch
epoch
training (b)200
training and (b) 200epoch
epochtraining,
training,
andand (c) corresponding
(c) the the corresponding
curvecurve
of FIDofasFID as a function
a function of
of training
training
epochs. epochs.

We
Wenownowshow
show in inFigure
Figure55some
someFID FIDscores
scoresduring
duringthe thetraining
training ofof our
our model,
model, evaluating
evaluating every
every
50
50 epochs. The first two rows of curves stand for 8 different actions and the last two curves are the
epochs. The first two rows of curves stand for 8 different actions and the last two curves are the
class
classaverage
averageFID
FIDandandthethe FID
FID for
for the
the entire
entire dataset.
dataset. For
For most
most actions,
actions, the
the FID
FID score
score globally
globally decreases
decreases
as
as training
training goes
goes on.
on. It
It theoretically
theoretically means
means that,
that, as
as training
training goes
goes on,
on, the
the generated
generated pseudo-images
pseudo-images get get
closer to the real TSSI training pseudo-images in terms of realness and diversity.
closer to the real TSSI training pseudo-images in terms of realness and diversity. We however note We however note
that,
that,for
forsome
someaction
actionclasses,
classes,the
theFID
FIDincreases
increasesagain
againafter thethe
after 150th epoch.
150th epoch.TheThesame trend
same is visible
trend on
is visible
the average
on the FIDFID
average score. ThisThis
score. means that,that,
means in general, for one
in general, classclass
for one of action, the results
of action, are better
the results for the
are better for
the model trained after 150 epochs than after 200 epochs. When regarding all actions as a whole, we
can derive the same conclusion with the total FID score.
Algorithms 2020, 13, 319 10 of 17
Algorithms 2020, 13, x FOR PEER REVIEW 10 of 17

Algorithms
model 2020, 13,
trained x FOR
after 150PEER REVIEW
epochs 10can
than after 200 epochs. When regarding all actions as a whole, we of 17

derive the same conclusion with the total FID score.

Figure 5. FID curves evaluated on (0–7) 8 different actions, (avg) their average score, and (total) the
entire dataset.
Figure 5. FID curves evaluated on (0–7) 8 different actions, (avg) their average score, and (total) the
Figure
entire 5. FID curves evaluated on (0–7) 8 different actions, (avg) their average score, and (total) the
dataset.
4.3. Importance of the Modification of Generator Using Upsample + Convolutional Layers
entire dataset.
4.3. Importance of the Modification of Generator Using Upsample + Convolutional Layers
As explained in Section 3.4, we modified the deconvolution operation from the original DC-
4.3. Importance of the Modification of modified
Generator the
Using Upsample + Convolutional Layers
GANAsinexplained in Section
order to avoid 3.4, we
checkerboard deconvolution
artifacts that arise when using operation
simply from the original DC-GAN
convolutional-transpose
in order
layers. totackle
avoidthis
AsToexplained checkerboard
inproblem, weartifacts
Section 3.4, thatupsample
we modified
applied the arise when using simply
the deconvolution operation
+ convolutional convolutional-transpose
layer from
methodthe to
original
realize DC-
the
layers. To tackle
GAN in order to
deconvolution this problem, we
avoid checkerboard
operation. applied the
In Figure 6,artifacts upsample
that arise
we compare + convolutional
when using
the learning layer
simply
curves (Loss_D method
and Loss_G) of the
to realize
convolutional-transpose our
deconvolution
layers.(orange)
model To tackleoperation.
thisthe
with In Figure
problem,
originalwe 6, we compare
applied
conditional the themodel
upsample
DC-GAN learning curves
+ convolutional
(blue) (Loss_D
trained layer
withmethod
the Loss_G)
andsame of Our
toinput.
realizeour
the
model (orange)
deconvolution with the
operation. original
In Figureconditional
6, we DC-GAN
compare the model
learning (blue)
curvestrained
(Loss_D
model converges much faster than the original one, meaning that the new model is much easier to with the
and same
Loss_G) input.
of our
Our
model
train.model converges
(orange) much
with the fasterconditional
original than the original
DC-GAN one,model
meaning that
(blue) the new
trained model
with is much
the same easier
input. Our
tomodel
train. converges much faster than the original one, meaning that the new model is much easier to
train.

Figure
Figure 6.6. Learning
Learning curves
curves ofof our
our model
model (using
(using upsample
upsample and
and convolutional
convolutional layers)
layers) (orange)
(orange) and
and the
the
original method (using convolutional-transpose layers) (blue): Loss_D is discriminator loss
original method (using convolutional-transpose layers) (blue): Loss_D is discriminator loss calculated calculated
as
as the
the average
Figure of
oflosses
6. Learning
average for
forthe
curves
losses of all
allreal
theour realand
model and all
allfake
(using samples,
upsample
fake and
samples,and Loss_G
Loss_Gisisthe
andconvolutionaltheaverage
layers) generator
average(orange)
generator loss.
and the
loss.
original method (using convolutional-transpose layers) (blue): Loss_D is discriminator loss calculated
Furthermore, we visually inspect some generated pseudo-images in order to check the
Furthermore,
as the average ofwe visually
losses inspect
for the all some
real and generated
all fake pseudo-images
samples, and in order
Loss_G is the average to check
generator loss. the
improvement. On Figure 7, we compare a real TSSI input pseudo-image on Figure 7a; a fake
improvement. On Figure 7, we compare a real TSSI input pseudo-image on Figure 7a; a fake pseudo-
pseudo-image generatedvisually
by the original
inspectconditional DC-GAN model with convolutional-transpose
imageFurthermore,
generated bywe the original conditional some generated
DC-GAN modelpseudo-images in order to check
with convolutional-transpose the
layers,
layers, having
improvement. checkerboard
On Figure patterns
7, weon on
compare the entire
a real image
TSSI on on
input Figure 7b;
pseudo-imageand a fake pseudo-image
on Figure without
7a; a fake without
pseudo-
having checkerboard patterns the entire image Figure 7b; and a fake pseudo-image
image generated by the original conditional DC-GAN model with convolutional-transpose
checkerboard artifacts generated with our modified DC-GAN on Figure 7c. This confirms that, thanks layers,
having checkerboard patterns on the entire image on Figure 7b; and a fake pseudo-image without
checkerboard artifacts generated with our modified DC-GAN on Figure 7c. This confirms that, thanks
Algorithms 2020, 13, x319
FOR PEER REVIEW 11
11 of 17
Algorithms 2020, 13, x FOR PEER REVIEW 11 of 17
to our modification of upsampling in the generator, the checkerboard pattern artifact no longer
checkerboard artifacts of
to our modification
appears. generated with our
upsampling modified
in the DC-GAN
generator, on Figure 7c. This
the checkerboard confirms
pattern that,no
artifact thanks to
longer
our modification
appears. of upsampling in the generator, the checkerboard pattern artifact no longer appears.

(a) (b) (c)


(a) (b) (c)
Figure 7. Example of spatiotemporal images: (a) TSSI of a real sequence; (b) an image generated by
Figure 7. Example of spatiotemporal images: (a) TSSI of a real sequence; (b) an image generated by the
Figure
the 7. Example
original of spatiotemporal
conditional DC-GAN modelimages:
with(a)convolutional-transpose
TSSI of a real sequence; layers,
(b) an image generated by
with checkerboard
original conditional DC-GAN model with convolutional-transpose layers, with checkerboard artifacts;
the original
artifacts; and conditional
(c) an imageDC-GAN
generatedmodel
by ourwith
model,convolutional-transpose
without checkerboard layers,
artifacts.with checkerboard
and (c) an image generated by our model, without checkerboard artifacts.
artifacts; and (c) an image generated by our model, without checkerboard artifacts.
In
In order
order to to check
check how
how essential
essential it
it is
is for
for our
our approach
approach to to avoid
avoid checkerboard
checkerboard artifacts,
artifacts, wewe finally
finally
In
compare order
the to check how
corresponding essential
generatedit is for our
actions. approach
Figure 8 to
shows avoid
the checkerboard
action
compare the corresponding generated actions. Figure 8 shows the action “drinking” generated “drinking”artifacts,
generatedwe finally
by (a)
by
compare
new the corresponding
model-trained 69 epochs generated
and (b) actions.
original Figure
models 8 shows
after 349the action
epochs “drinking”
when
(a) new model-trained 69 epochs and (b) original models after 349 epochs when the learning curves the generated
learning curvesby (a)
of
of
new
two model-trained
models reach 69
the epochs
same and (b)
values. original
Comparing modelsthoseafter 349
two epochs
generatedwhen the learning
sequences
two models reach the same values. Comparing those two generated sequences of the action “drinking”, of thecurves
actionof
twoformer
the models
“drinking”, reach
isthe
obviouslytheis
former same
more values.
obviously more
continuous Comparing
continuous
and those
realistic. and two generated
realistic.
The human sequences
The human
skeleton skeleton
is closer of isthe
to reality asaction
closer to
well.
“drinking”,
reality as the
well. former
The body is obviously
shape more
deformations continuous
and actionand realistic.
unsmoothness The human
visible
The body shape deformations and action unsmoothness visible in Figure 8b are the consequences of the skeleton
in Figure is8b closer
are to
the
reality as
consequences well. The body shape
of the checkerboard
checkerboard artifacts deformations
artifacts
in the generated and action
in the generated
pseudo-image, unsmoothness
which pseudo-image, visible
translate into which in Figure
translate
very bad 8b are the
into very
pose sequences
consequences
bad pose of
sequences the checkerboard
after artifacts
reconverting
after reconverting from TSSI pseudo-image. from in the
TSSI generated
pseudo-image. pseudo-image, which translate into very
bad pose sequences after reconverting from TSSI pseudo-image.

(a)
(a)

(b)
(b)
Figure 8. Sequence of action “drinking” generated by (a) our model (using upsample + convolutional
Figure
layers) 8. Sequence
trained afterof
69action
epochs“drinking”
and (b) thegenerated by (a) (using
original model our model (using upsample + convolutional
convolutional-transpose layer) trained
Figure
layers) 8. Sequence
trained
after 349 epochs. of
after action
69 “drinking”
epochs and (b)generated
the by
original (a) our
model model
(using(using upsample + convolutional
convolutional-transpose layer)
layers) after
trained trained
349after 69 epochs and (b) the original model (using convolutional-transpose layer)
epochs.
In conclusion,
trained after 349 our analysis confirms that, for our work, using the upsample + convolutional
epochs.
layerIn conclusion,
structure our analysis
to realize confirmsinthat,
deconvolution for our work,
the generator using the
is efficient andupsample
essential +for
convolutional layer
obtaining realistic
In conclusion,
structure
generated to realizeour
sequences ofanalysis
deconvolution
skeletons. confirms
in thethat, for ouriswork,
generator usingand
efficient the essential
upsamplefor + convolutional layer
obtaining realistic
structure to
generated realize deconvolution
sequences of skeletons. in the generator is efficient and essential for obtaining realistic
4.4. Importance
generated of Reordering
sequences Joints in TSSI Spatiotemporal Images
of skeletons.
4.4. Importance of Reordering Joints in TSSI Spatiotemporal Images
In order to assess the importance of joint reordering in TSSI skeletal representation, we also train
4.4. Importance of Reordering Joints in TSSI Spatiotemporal Images
our generative
In order to model using
assess the importance of joint without
pseudo-images reorderingtreeintraversal orderrepresentation,
TSSI skeletal representation.we The
alsospatial
train
dimension
our is then
In order
generative simply
tomodel
assess the
usingthe joint order
importance of from
pseudo-images 1 to 25.tree
jointwithout
reordering Figure
in 9 compares
TSSI skeletal
traversal the actions generated
orderrepresentation,
representation. we also
The bytrain
the
spatial
our generative
two
dimension model
trainedismodels,
then using
without
simply pseudo-images
the or with
joint TSSIfrom
order without
1 to 25. tree
representation.Figuretraversal
Figure 9a order
9 compares representation.
is thethe
training
actionsresult byThe
generated spatial
non-TSSI
by the
dimension
images,
two andismodels,
trained then simply
Figure 9b
withouttheresult
is the joint
or order
with TSSIfrom
trained 1 to 25.
by TSSI. It Figure
representation.is obvious9 compares
Figure that,
9a the
after
is the actions
the
trainingsame generated
training
result by the
epochs,
by non-TSSI
twomodel
the trained
images, and models,
obtained
Figure 9b without
using
is the or with
non-TSSI
result TSSI representation.
pseudo-images
trained cannot
by TSSI. It Figure
output
is obvious 9aa is
that, the training
realistic
after human
the same result by non-TSSI
skeleton.
training epochs,
images,
the model and Figure 9b
obtained is the
using result trained
non-TSSI by TSSI. cannot
pseudo-images It is obvious
output that, after the
a realistic same skeleton.
human training epochs,
the model obtained using non-TSSI pseudo-images cannot output a realistic human skeleton.
In order to assess the importance of joint reordering in TSSI skeletal representation, we also train
our generative
Algorithms 2020, 13, xmodel using
FOR PEER pseudo-images without tree traversal order representation. The spatial
REVIEW 12 of 17
Algorithms
dimension 2020,
is 13,
thenx FOR PEERthe
simply REVIEW
joint order from 1 to 25. Figure 9 compares the actions generated by 12the
of 17

two trained models, without or with TSSI representation. Figure 9a is the training result by non-TSSI
images, and
Algorithms 2020,Figure
13, 319 9b is the result trained by TSSI. It is obvious that, after the same training epochs,
12 of 17
the model obtained using non-TSSI pseudo-images cannot output a realistic human skeleton.

(a)
(a)

(a)

(b)
(b)
(b)
Figure 9. Sequences of action “standing” generated by the models trained (a) with non-TSSI pseudo-
Figure 9. Sequences of action “standing” generated by the models trained (a) with non-TSSI pseudo-
Figure(no
images 9.
9.Sequences
reordering
Sequencesofof
action “standing”
ofjoints)
action and generated
(b) with
“standing” by thebymodels
TSSI-reordered
generated trainedtrained
(a) with(a)
pseudo-images.
the models non-TSSI pseudo-
with non-TSSI
images (no reordering of joints) and (b) with TSSI-reordered pseudo-images.
images (no reordering
pseudo-images of joints)ofand
(no reordering (b) with
joints) TSSI-reordered
and (b) pseudo-images.
with TSSI-reordered pseudo-images.
4.5. Hyperparameter Tuning
4.5.
4.5.Hyperparameter
HyperparameterTuning Tuning
4.5. Hyperparameter Tuning
After
After justifying the choice of our model architecture and data format, we we explore
explore thethe effect
effect ofof
Afterjustifying
After justifyingthe
justifying thechoice
the choiceof
choice
our
ofofrate,
our
model
ourmodel architecture
modelarchitecture
architectureand and data
anddata
data format,
format,
format, wewe explore
explore the the effect
effect ofBy
hyperparameter
hyperparameter values:
values: learning
learning batch
rate,rate,
batch size, and
size,size,
andand number
number of dimensions
of dimensions of latent
of latent space
space z.
z. By
of hyperparameter
hyperparameter values:
values: learning
learning rate,batch
batchbatch
size, andand numbernumber of dimensions
of dimensions of latent
ofFigure
latent space space
z. Byz.
default,
default, we
we set set learning
learning rate
raterateat 0.0002,
at 0.0002, batch size at
sizesize 64,
at 64, andand dimension
dimension of z at 100.
of zofatz 100. Figure 10
10 10shows
shows the
the
By default,
default, curves we set
we set learning learning at
rate attrained 0.0002,
0.0002, batch batch at 64,
size at 64,learning
and dimensiondimension at
of z at0.0005,100. Figure
100. Figure 10 shows showsthe
learning
learning curves for
for the
the model
model trained with
with different
different learning rates:
rates: 0.001,
0.001, 0.0005, 0.0002,
0.0002, and 0.0001.
and 0.0001.
the learning
learning to curvescurves for
for the the
model model trained
trained with different
with different learning
learning rates:
rates: 0.001,
0.001, 0.0005, 0.0002,
0.0005, 0.0002, and
and 0.0001.
0.0001.
Opposite
Opposite common sense, when the learning rate is smaller, Loss_D and Loss_G
Loss_G converge faster. For
Oppositeto
Opposite tocommon
to commonsense,
common sense,when
sense, whenthe
when the
the learning
learning
learning rate
rate is smaller,
rate
is smaller,
is smaller, Loss_D
Loss_D
Loss_D and
and and
Loss_G converge
Loss_G converge
converge faster.
faster. For
faster.
For
models lr = 0.001 and lr = 0.0005, the average FID for a single class reaches the minimum at the 150th
models
For models
models lr = 0.001
lrlr==0.001
0.001 and
and lr and
lr lr = 0.0005,
== 0.0005,
0.0005, the
the average FID for
the average
average FID for
FID aa single
single classreaches
for a single
class reaches
class theminimum
reaches
the minimum
the minimumatatthe
theat 150th
the
150th
epoch and then goes up at the 200th epoch. For model lr = 0.0002, FID scores increase slightly at the
epoch
150th and
and then
epochepoch andgoes
then thenup
goes goes
up at
at the
up at
the 200th epoch.epoch.
the 200th
200th epoch. For model
For model
For modellr lr = 0.0002,
lr == 0.0002,
0.0002, FIDscores
FID scores increase
FID scores
increase slightly
increase
slightly atatthe
slightlythe
200th
200th epoch.
at theepoch. For
For model lr = 0.0001, FID
lr = 0.0001, scores keep decreasing but converge slowly. For all four
200th 200th epoch.
epoch. For model
model lr
For modellr == 0.0001,
0.0001, FID scores
FID scores keep decreasing
FID scores
keep decreasing
keep but converge
decreasing
but converge
but converge slowly.
slowly. Forall
slowly.
For allfour
For four
all
models,
models, FID
FID
four models, scores
scores are
are
FID scores close
close to
to
are close each
each other
other
to each at
at the
the 200th
200th epoch.
epoch.
models, FID scores are close to each otherother
at theat200th
the 200thepoch. epoch.

Learning curvesfor
Figure 10.Learning
Figure for our model
model trained with different learning rates: 0.001, 0.0005, 0.0002,
Figure10.
Figure 10. Learningcurves
10. Learning curves for our
curves our model trained with
trained with
trained different
with different learning
different learning rates:0.001,
learningrates:
rates: 0.001,0.0005,
0.001, 0.0005,0.0002,
0.0005, 0.0002,
0.0002,
and
and 0.0001.
and0.0001.
and 0.0001.
0.0001.
We also train the model with different batch sizes (of training data): 16, 32, 64, and 128 (see Figure 11).
We
Wealso
We alsotrain
also trainthe
train the model
the model with
model with different
different batch sizes
batch sizes
batch (of
sizes (of training
(of training data):16,
training data):
data): 16,32,
16, 32,64,
32, 64,and
64, and128
and 128(see
128 (see
(see
When the batch size is small, the learning curves Loss_D and Loss_G are smoother and the discriminator
Figure
Figure11).
Figure 11).When
11). Whenthe
When thebatch
the batchsize
batch size is
is small,
small, the
the learning curves
learning curves
learning Loss_D
curvesLoss_D
Loss_DandandLoss_G
and Loss_Gare
Loss_G aresmoother
are smootherand
smoother andthe
and the
the
learns more easily However, FID score converges clearly best with batch size = 64.
discriminator
discriminatorlearns
discriminator learnsmore
learns moreeasily
more easily However,
However, FID score convergesclearly
score converges clearlybest
bestwith
best withbatch
with batchsize
batch size===64.
size 64.
64.

Figure 11. Learning curves for our model trained with different batch sizes (of input data): 16, 32, 64,
and 128.
Algorithms 2020, 13, x FOR PEER REVIEW 13 of 17

Figure 11. Learning curves for our model trained with different batch sizes (of input data): 16, 32, 64,
Algorithms 2020, 13, 319 13 of 17
and 128.

Finally,
Finally,we
weinvestigate
investigatethethe influence
influence ofof the dimension dim_z
the dimension dim_z ofof the
thelatent
latentspace
spacez.z.InInFigure
Figure12,
12,
Loss_D
Loss_DandandLoss_G
Loss_G are not affected
are not affectedby bythe
thechoice
choiceofof dim_z
dim_z while
while FIDFID scores
scores varyvary significantly.
significantly. WithWith
the
the increase of dim_z, FID scores tend to converge faster. For the model with dim_z
increase of dim_z, FID scores tend to converge faster. For the model with dim_z = 5, the FID score = 5, the FID score
shows
showsthethedecreasing
decreasingtrend
trend along
along 200
200 epochs.
epochs. For the model
For the model with dim_z==10
withdim_z 10and
and30,30,the
theFID
FIDscores
scores
both
bothreach
reachthe
theminimal
minimalvalue
value near
near the 150th epoch
epoch and
andthen
thengogoup.
up.The
Themodel
modelwith dim_z= =100
withdim_z 100gets
gets
the
thelowest
lowestpoint
pointofofFID
FIDatatthe
the130th
130th and
and 180th
180th epochs, much smaller
smaller than
than the
the results
resultsofofthe
theothers.
others.
However,
However,further
furthertraining
training epochs
epochs are needed to to figure
figureout
outwhether
whetherFIDFIDfor dim_z==55and
fordim_z and100
100would
would
continue
continuedecreasing.
decreasing.

Figure 12. Learning curves for the model trained on different numbers of dimensions of the latent space
Figure 12. Learning curves for the model trained on different numbers of dimensions of the latent
z (input of G): 5, 30, and 100. For the average FID, we plot the value at 50, 100, 110, 120, 200 epochs.
space z (input of G): 5, 30, and 100. For the average FID, we plot the value at 50, 100, 110, 120, 200
epochs.
To sum up, when training the model for 200 epochs, the original choice of learning rate and batch size is already a relatively good setup. Regarding the optimal dimension of the latent space, further study is needed to determine it robustly.

4.6. Analysis of Latent Space

We now analyze the influence of the latent variable z on the style of generated pose sequences. To this end, we move separately along each dimension of the latent space and try to understand the specific influence of each z coordinate on the appearance of the generated sequences. In Figure 13, we visualize the variation with z of the generated action “standing up”. The model was trained for 200 epochs with dim_z = 5. In each figure, we adopt linear interpolation on one dimension of z, with 5 points in the range [−1, 1]. Each row is the sequence generated by one interpolation “point” of the latent space z. By comparing the first frame of each sequence (the first column), it seems that the latent space z influences in particular the initial orientation of the skeleton. For instance, in Figure 13, the body rotates from the front to its right side (or the left side from the reader’s point of view). The evolution from top to bottom is continuous and smooth. At the same time, for each row of the sequence, the action “standing up” is performed well. This example shows that, by tuning the latent space z, our generative model is able to continuously control properties of the action, such as its orientation, and eventually to define the style of the action while generating the correct class of action indicated by the conditional label.
Figure 13. Variation of the generated action “standing up” with varying values of the latent vector z.
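As an illustration of the traversal behind Figure 13, the following sketch varies a single latent coordinate in [−1, 1] while keeping the one-hot class label fixed. It assumes a trained conditional generator shaped like the one in Appendix A (latent vector concatenated with the class label and fed as a 1 × 1 feature map); the function name, the downstream decoding step, and the class index for “standing up” are illustrative assumptions, not our exact code.

import torch

def latent_traversal(generator, class_idx, dim, dim_z=5, n_classes=60, n_points=5):
    """Generate one pseudo-image per interpolation point along a single latent dimension."""
    generator.eval()
    label = torch.zeros(1, n_classes)
    label[0, class_idx] = 1.0                      # one-hot conditional label
    outputs = []
    with torch.no_grad():
        for value in torch.linspace(-1.0, 1.0, n_points):
            z = torch.zeros(1, dim_z)
            z[0, dim] = value                      # move along one latent coordinate only
            cond = torch.cat([z, label], dim=1)    # (1, dim_z + n_classes)
            pseudo_image = generator(cond.view(1, -1, 1, 1))
            outputs.append(pseudo_image)           # each TSSI pseudo-image is decoded to a skeleton sequence downstream
    return outputs

# Hypothetical usage, with class_idx standing in for the "standing up" label:
# rows = latent_traversal(generator, class_idx=8, dim=0)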
5. Conclusions

In this work, we propose a new class-conditioned generative model for human skeletal motions. Our generative model is based on the TSSI (Tree Structure Skeleton Image) spatiotemporal pseudo-image representation of skeletal pose sequences, to which a conditional DC-GAN is applied in order to generate artificial pseudo-images that can be decoded into realistic artificial human skeletal movements. In our setup, each column of a pseudo-image corresponds approximately to one single skeletal joint, so it is essential to avoid artifacts in the generated pseudo-images; we therefore modified the original deconvolution operation of the standard DC-GAN in order to efficiently eliminate checkerboard artifacts.

We evaluated our approach on the large NTU_RGB+D public dataset of human actions, comprising 60 action classes. We showed that our generative model is able to generate artificial human movements that qualitatively and visually correspond to the correct action class used as the condition for the DC-GAN. We also used the Fréchet inception distance as an evaluation metric, which quantitatively confirms the qualitative results.

The code of our work will soon be made available on GitHub. As further work, we will continue to improve the quality of the generated human movements, for instance by adding the conditional label to every hidden layer of the network [27] and evaluating the resulting performance, and by trying or developing other representations of skeletal pose sequences. Moreover, evaluating the model on other human skeletal datasets is important to verify its universality.
Author Contributions: Conceptualization, G.D. and F.M.; Methodology, W.X., G.D. and F.M.; Software, W.X.;
Validation, W.X., G.D. and F.M.; Formal analysis, W.X.; Investigation, W.X.; Resources, W.X.; Data curation, W.X.
and G.D.; Writing—original draft preparation, W.X.; Writing—review and editing, F.M.; Visualization, W.X. and
G.D.; Supervision, G.D., F.M. and J.Y.; Project administration, F.M. and J.Y.; Funding acquisition, F.M. and J.Y.
All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A
Implementation details of our conditional Deep Convolutional Generative Adversarial Network (c_DCGAN), with hyperparameters in the default setting (dim_z = 100):
Generator (
(input_layer): Conv2d (160, 8192, kernel_size = (1, 1), stride = (1, 1), bias = False)
(fc_main): Sequential (
(0): UpscalingTwo (
(upscaler): Sequential (
(0): Upsample (scale_factor = 2.0, mode = bilinear)
(1): ReflectionPad2d ((1, 1, 1, 1))
(2): Conv2d (512, 256, kernel_size = (3, 3), stride = (1, 1), bias = False)
(3): BatchNorm2d (256, eps = 1 × 10^−5, momentum = 0.1, affine = True, track_running_stats = True)
(4): ReLU (inplace = True)
)
)
(1): UpscalingTwo (
(upscaler): Sequential (
(0): Upsample (scale_factor = 2.0, mode = bilinear)
(1): ReflectionPad2d ((1, 1, 1, 1))
(2): Conv2d (256, 128, kernel_size = (3, 3), stride = (1, 1), bias = False)
(3): BatchNorm2d (128, eps = 1 × 10^−5, momentum = 0.1, affine = True, track_running_stats = True)
(4): ReLU (inplace = True)
)
)
(2): UpscalingTwo (
(upscaler): Sequential (
(0): Upsample (scale_factor = 2.0, mode = bilinear)
(1): ReflectionPad2d ((1, 1, 1, 1))
(2): Conv2d (128, 64, kernel_size = (3, 3), stride = (1, 1), bias = False)
(3): BatchNorm2d (64, eps = 1 × 10^−5, momentum = 0.1, affine = True, track_running_stats = True)
(4): ReLU (inplace = True)
)
)
(3): UpscalingTwo (
(upscaler): Sequential (
(0): Upsample (scale_factor = 2.0, mode = bilinear)
(1): ReflectionPad2d ((1, 1, 1, 1))
(2): Conv2d (64, 3, kernel_size = (3, 3), stride = (1, 1), bias = False)
(3): Tanh ()
)
)
)
)
Discriminator (
(hidden_layer1): Sequential (
(0): Conv2d (63, 64, kernel_size = (4, 4), stride = (2, 2), padding = (1, 1), bias = False)
(1): LeakyReLU (negative_slope = 0.2, inplace = True)
(2): Conv2d (64, 128, kernel_size = (4, 4), stride = (2, 2), padding = (1, 1), bias = False)
(3): BatchNorm2d (128, eps = 1 × 10^−5, momentum = 0.1, affine = True, track_running_stats = True)
(4): LeakyReLU (negative_slope = 0.2, inplace = True)
(5): Conv2d (128, 256, kernel_size = (4, 4), stride = (2, 2), padding = (1, 1), bias = False)
(6): BatchNorm2d (256, eps = 1 × 10^−5, momentum = 0.1, affine = True, track_running_stats = True)
(7): LeakyReLU (negative_slope = 0.2, inplace = True)
(8): Conv2d (256, 512, kernel_size = (4, 4), stride = (2, 2), padding = (1, 1), bias = False)
(9): BatchNorm2d (512, eps = 1 × 10^−5, momentum = 0.1, affine = True, track_running_stats = True)
(10): LeakyReLU (negative_slope = 0.2, inplace = True)
(11): Conv2d (512, 1, kernel_size = (4, 4), stride = (1, 1), bias = False)
(12): Sigmoid ()
)
)
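For readers who wish to reproduce the structure printed above, the UpscalingTwo block (bilinear upsampling followed by a reflection-padded 3 × 3 convolution, used in place of a transposed convolution to suppress checkerboard artifacts) can be written as the following PyTorch sketch. It is a reconstruction from the printed modules, not a copy of our exact source code; the last flag and the channel-progression comment are assumptions based on the listing.

import torch.nn as nn

class UpscalingTwo(nn.Module):
    """Doubles the spatial resolution with bilinear upsampling + 3x3 convolution.

    Replacing ConvTranspose2d with upsample-then-convolve avoids the uneven
    kernel overlap that produces checkerboard artifacts in generated pseudo-images.
    """
    def __init__(self, in_channels, out_channels, last=False):
        super().__init__()
        layers = [
            nn.Upsample(scale_factor=2.0, mode="bilinear"),
            nn.ReflectionPad2d((1, 1, 1, 1)),
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, bias=False),
        ]
        if last:
            layers.append(nn.Tanh())   # final block outputs the pseudo-image in [-1, 1]
        else:
            layers += [nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True)]
        self.upscaler = nn.Sequential(*layers)

    def forward(self, x):
        return self.upscaler(x)

# In the generator above, a 1x1 convolution first maps the (100 + 60)-channel
# latent/label input to 8192 features (presumably reshaped to a 512 x 4 x 4 map),
# then four UpscalingTwo blocks reduce the channels 512 -> 256 -> 128 -> 64 -> 3.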
References
1. Martinez, J.; Black, M.J.; Romero, J. On human motion prediction using recurrent neural networks.
In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI,
USA, 21–26 July 2017; pp. 4674–4683.
2. Ren, B.; Liu, M.; Ding, R.; Liu, H. A survey on 3D skeleton-based action recognition using learning method.
arXiv 2020, arXiv:2002.05907.
3. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative
adversarial networks. arXiv 2016, arXiv:1511.06434.
4. Fragkiadaki, K.; Levine, S.; Felsen, P.; Malik, J. Recurrent network models for human dynamics. In Proceedings
of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015;
pp. 4346–4354.
5. Jain, A.; Zamir, A.R.; Savarese, S.; Saxena, A. Structural-RNN: Deep learning on spatio-temporal graphs.
In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV,
USA, 26 June–1 July 2016; pp. 5308–5317.
6. Chung, J.; Kastner, K.; Dinh, L.; Goel, K.; Courville, A.; Bengio, Y. A recurrent latent variable model for
sequential data. In Advances in Neural Information Processing Systems 28; Cortes, C., Lawrence, N.D., Lee, D.D.,
Sugiyama, M., Garnett, R., Eds.; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 2015.
7. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y.
Generative adversarial nets. In Advances in Neural Information Processing Systems 27; Ghahramani, Z.,
Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Morgan Kaufmann Publishers Inc.:
Burlington, MA, USA, 2014.
8. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.;
Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
9. Barsoum, E.; Kender, J.; Liu, Z. HP-GAN: Probabilistic 3D human motion prediction via GAN.
In Proceedings of the 2018 IEEE/cvf Conference on Computer Vision and Pattern Recognition Workshops,
Salt Lake City, UT, USA, 18–22 June 2018; pp. 1499–1508.
10. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A. Improved training of Wasserstein GANs.
In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H.,
Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 2017.
11. Kiasari, M.A.; Moirangthem, D.S.; Lee, M. Human action generation with generative adversarial networks.
arXiv 2018, arXiv:1805.10416.
12. Wang, P.; Li, W.; Li, C.; Hou, Y. Action recognition based on joint trajectory maps with convolutional neural
networks. Knowl. Based Syst. 2018, 158, 43–53. [CrossRef]
13. Li, B.; Dai, Y.; Cheng, X.; Chen, H.; Lin, Y.; He, M. Skeleton based action recognition using translation-scale
invariant image mapping and multi-scale deep CNN. In Proceedings of the 2017 IEEE International
Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 601–604.
14. Li, Y.; Xia, R.; Liu, X.; Huang, Q. Learning shape-motion representations from geometric algebra
spatio-temporal model for skeleton-based action recognition. In Proceedings of the 2019 IEEE International
Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 1066–1071.
15. Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition.
Pattern Recognit. 2017, 68, 346–362. [CrossRef]
16. Caetano, C.; Sena, J.; Bremond, F.; Dos Santos, J.A.; Schwartz, W.R. Skelemotion: A New Representation of
Skeleton Joint Sequences Based on Motion Information for 3D Action Recognition. In Proceedings of the 16th
IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan,
18–21 September 2019.
17. Caetano, C.; Bremond, F.; Schwartz, W.R. Skeleton image representation for 3D action recognition
based on tree structure and reference joints. In Proceedings of the 2019 32nd Sibgrapi Conference on Graphics,
Patterns and Images, Rio de Janeiro, Brazil, 28–30 October 2019; pp. 16–23.
18. Li, C.; Zhong, Q.; Xie, D.; Pu, S. Co-occurrence feature learning from skeleton data for action recognition and
detection with hierarchical aggregation. arXiv 2018, arXiv:1804.06055.
19. Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis.
In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV,
USA, 26 June–1 July 2016; pp. 1010–1019.
20. Yang, Z.; Li, Y.; Yang, J.; Luo, J. Action recognition with spatio-temporal visual attention on skeleton image
sequences. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2405–2415. [CrossRef]
21. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
22. Sugawara, Y.; Shiota, S.; Kiya, H. Checkerboard artifacts free convolutional neural networks. APSIPA Trans.
Signal Inf. Process. 2019, 8. [CrossRef]
23. Dowson, D.; Landau, B. The Fréchet distance between multivariate normal distributions. J. Multivar. Anal.
1982, 12, 450–455. [CrossRef]
24. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale
update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30;
Morgan Kaufmann Publishers Inc.: Burlington, MA, USA, 2017.
25. Salimans, T.; Goodfellow, I.J.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for
training GANs. Adv. Neural Inf. Process. Syst. 2016, 29, 2234–2242.
26. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer
vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2818–2826.
27. Giuffrida, M.; Scharr, H.; Tsaftaris, S. Arigan: Synthetic arabidopsis plants using generative adversarial
network. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW),
Venice, Italy, 21–26 July 2017; pp. 2064–2071.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional
affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).