
Neural Networks 132 (2020) 521–531


Human interaction behavior modeling using Generative Adversarial Networks

Yusuke Nishimura, Yutaka Nakamura∗, Hiroshi Ishiguro

Department of System Innovation, Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka, Japan

Article history: Received 21 May 2020; Received in revised form 1 September 2020; Accepted 25 September 2020; Available online 30 September 2020.

Keywords: Human robot interaction; Human motion modeling; Generative Adversarial Networks; Human behavior during dialog

Abstract: Recently, considerable research has focused on personal assistant robots, and robots capable of rich human-like communication are expected. Among humans, non-verbal elements contribute to effective and dynamic communication. However, people use a wide range of diverse gestures, and a robot capable of expressing various human gestures has not been realized. In this study, we address human behavior modeling during interaction using a deep generative model. In the proposed method, to consider interaction motion, three factors, i.e., interaction intensity, time evolution, and time resolution, are embedded in the network structure. Subjective evaluation results suggest that the proposed method can generate high-quality human motions.

1. Introduction

With the advancement of automated speech recognition technologies, information devices with a dialog interface, such as smart speakers, are becoming increasingly common in daily life. Among such devices, robots capable of rich human-like communication, including nonverbal communication, are expected. Therefore, a mechanism that enables robots to realize various situation-specific nonverbal communication behaviors is required.

During a conversation, people dynamically change their behavior, such as facial expressions and gestures, in response to the conversation partner's behavior and the environmental conditions. However, non-verbal behavior, particularly gestures, can be ambiguous: the same gesture motion may have different meanings in different situations. In addition, there can be more than one ‘‘suitable’’ motion for a given situation, i.e., in the same or similar situations, an individual could perform a different gesture. In other words, gestures can be multimodal. The purpose of this study is to model diverse human gestural behavior so that it may be applicable to motion generation for humanoid robots.

Existing methods for implementing communication robots' motion can be roughly classified as rule-based and data-driven (Heracleous, Sato, Ishi, Ishiguro, & Hagita, 2011; Sakai, Minato, Ishi, & Ishiguro, 2017). In rule-based methods, rules are manually designed by an expert. Rule-based methods have been successfully applied to several situated dialog systems, such as a first face-to-face conversation (Kahn et al., 2008), attentive listening (Kawahara, Uesato, Yoshino, & Takanashi, 2015), a job interview (Inoue et al., 2019) and a reception (Doering, Glas, & Ishiguro, 2019); however, such methods are difficult to scale to a large domain because, in practical applications, they require elaborate rules for each scenario.

Many researchers have attempted to apply data-driven methods to model a single motion from recorded motion data using a dynamic model, such as a hidden Markov model (Inamura & Nakamura, 2014; Takano & Nakamura, 2016). However, these methods aim to develop a model for a single motion, and multiple dynamic models with a switching mechanism are required to realize diverse motions (Pavlovic, Rehg, & MacCormick, 2001). They are not suitable for modeling human motion behavior during dialog because, under such conditions, motions cannot be definitively categorized and clear switching points cannot be determined.

Generative models approximate the probability distribution of the target data (or recorded data) and, in principle, such models are expected to generate diverse samples. Recently, several deep generative models, e.g., Variational AutoEncoders (VAE) (Kingma & Welling, 2013) and Generative Adversarial Networks (GAN) (Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, & Bengio, 2014), have been proposed. These models can generate remarkably high-quality natural images, human voices, and text. Furthermore, high-precision modeling of a motion has also been realized (Barsoum, Kender, & Liu, 2018; Hu, Yang, Liang, Salakhutdinov, & Xing, 2017; Karras, Aila, Laine, & Lehtinen, 2017). The benefit of these methods is that they generate diverse samples without a complicated, hand-coded hierarchical model, such as a combination of single motion models and a switching mechanism.

∗ Correspondence to: Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka, Japan.
E-mail addresses: nishimura.yusuke@irl.sys.es.osaka-u.ac.jp (Y. Nishimura), nakamura@irl.sys.es.osaka-u.ac.jp (Y. Nakamura), ishiguro@irl.sys.es.osaka-u.ac.jp (H. Ishiguro).

https://doi.org/10.1016/j.neunet.2020.09.019

Fig. 1. Video recording of a conversation; (a) environment (b) extraction of skeleton from images.

We have been working on a GAN framework for modeling human behavior during dialog (Nishimura, Nakamura, & Ishiguro, 2019). In this study, we implemented two deep generative neural networks and investigated their performance. The first is a naive fully-connected deep neural network where a fake sample is generated from an ‘‘unstructured’’ latent variable. The second is a convolutional neural network where the latent variable is designed by considering the interaction between two entities. For example, a person's behavior is affected by the conversation partner, whereas individual properties, such as height, are independent; such a notion must be considered in designing the model structure. The latent variable and the connections in the convolutional neural network model are designed based on interaction intensity, time resolution and time evolution (Section 3.1).

Subjective evaluations of the generated motion were conducted. The results make clear that a traditional generative model, i.e., a principal component analysis (PCA) model, cannot generate interaction motions including conversation turns, while the proposed deep generative models can. In addition, from a ‘‘biological motion’’ perspective, the quality of motion generated using the structured latent variable model is better than that generated using the unstructured latent variable model.

The structure of the paper is as follows. In Section 2, the data collection procedure for human–human interaction and its statistical properties are described. In Section 3, the network structure of our proposed deep generative model is explained. In Section 4, the results of the subjective evaluations and statistics of the generated motions are presented. Finally, Section 5 presents the conclusion of the research.

2. Motion during dialog

Human motion during dialog has two main characteristics. (1) There may be multiple appropriate actions in any given situation. For example, a person can wave either their right or left hand when the conversation partner waves their hand. (2) The same action can convey different meanings in different contexts. For example, depending on the situation, a simple nodding motion may mean ‘‘I agree with you’’ or ‘‘I'm listening to you’’.

For this aim, commonly used motion generation methods for interaction robots are not suitable. Rule-based methods require a great many manually constructed rules to cope with even a single scenario, resulting in what is known as a combinatorial explosion, where the complexity of the problem increases rapidly. With imitation learning, each motion can be created automatically from human instruction; however, imitation learning requires a typical motion trajectory for each motion category (Zhang, McCarthy, Jow, Lee, Chen, Goldberg, et al., 2018). Furthermore, the required number of categories is not known in advance. To address these problems, we propose an innovative motion generation model based on a deep generative model.

In this study, as raw data, we use video recordings of the movements of two people during dialog, as shown in Fig. 1. The people involved in the dialog were instructed to stay in one place, and a camera was placed such that neither person was occluded. A pose estimation system was employed to encode the posture of each person into a vector that consists of the positions of feature points, such as the head, torso, and right elbow (Fig. 1(b)). We refer to this vector as a ‘‘skeleton’’. We can then obtain a matrix representation, i.e., a multidimensional time series, of a human motion and construct a generative model based on this data representation.

2.1. Data collection using an omnidirectional camera

In this study, we used an omnidirectional camera to obtain frontal images of two people during communication. Each person sits on a fixed chair, as shown in Fig. 2. K feature points of a human posture (shown in Fig. 2, bottom center) are extracted from each captured image using the OpenPose library (Cao, Hidalgo, Simon, Wei, & Sheikh, 2018; Cao, Simon, Wei, & Sheikh, 2017). Here, the dimensionality of the skeleton is D (= K × 2). The skeletons of the people on the right and left are denoted $x^r_t \in \mathbb{R}^D$ and $x^l_t \in \mathbb{R}^D$, respectively. The motions of both people, $[x^r_{t-T+1}, x^r_{t-T+2}, \ldots, x^r_t]$ and $[x^l_{t-T+1}, x^l_{t-T+2}, \ldots, x^l_t]$, are combined to compose the ‘‘interaction motion data’’ $x \in \mathbb{R}^{D \times 2 \times T}$. Here, T is the length of the time series. In this paper, the interaction motion data is $x \in \mathbb{R}^{22 \times 2 \times 32}$ since we used 11 feature points (K = 11, D = 22) and 4-s long motions (8 fps, T = 32).

The obtained interaction motion data were preprocessed to reduce noise and rectify missing values. (i) Removing outliers: OpenPose estimates with low confidence scores (<5%) were removed and replaced by linear interpolation. (ii) Smoothing: each dimension of the time series was smoothed using a low-pass filter with a 4-Hz cutoff frequency. (iii) Resampling: to reduce redundant information from the interaction motion data, which were usually recorded at a standard video frame rate such as 24 or 30 fps, the time series was resampled to 8 fps.
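The following is a minimal sketch of these three preprocessing steps, assuming each recording is available as NumPy arrays of per-frame coordinates and OpenPose confidence scores; the function and variable names are illustrative and not from the original implementation.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample_poly

def preprocess(positions, confidences, src_fps=24, dst_fps=8):
    """positions: (N, D) skeleton time series (D = 2K coordinates);
    confidences: (N, K) OpenPose confidence scores in [0, 1]."""
    pos = positions.astype(float).copy()
    t = np.arange(len(pos))
    # (i) Removing outliers: drop estimates whose confidence is below
    # 5% and fill the gaps by linear interpolation (the x and y of
    # feature point k share the confidence score in column k).
    for d in range(pos.shape[1]):
        ok = confidences[:, d // 2] >= 0.05
        pos[:, d] = np.interp(t, t[ok], pos[ok, d])
    # (ii) Smoothing: low-pass filter with a 4-Hz cutoff frequency.
    b, a = butter(4, 4.0 / (src_fps / 2.0))
    pos = filtfilt(b, a, pos, axis=0)
    # (iii) Resampling: down to 8 fps (e.g., 24 fps -> factor 1/3).
    return resample_poly(pos, dst_fps, src_fps, axis=0)
```

Sliding a 32-frame (4-s) window over the two preprocessed skeletons and stacking them then yields the interaction motion data $x \in \mathbb{R}^{22 \times 2 \times 32}$.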
2.2. Interaction motion property

We analyzed the recorded data to investigate the properties of the interaction motion data for designing the latent variable structure. We applied principal component analysis (PCA) to the pair of skeletons and to a single skeleton. The cumulative contribution ratios of the principal components are shown in Fig. 3. The number of components required to approximate the skeletons in the former case is 24 with small mean squared errors (<1%). Since the number of components required to mimic each skeleton independently is 14 (the latter case), 28 components would be necessary to reproduce the pair of skeletons independently. This indicates that the number of required components can be reduced due to the interaction between the two people, and it suggests that the latent variable used to generate interaction motion data might have some components shared by the two skeletons.
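As a sketch of this analysis, assuming the skeleton vectors are stacked into arrays of shape (N, 22) per person, the comparison can be reproduced along the following lines; here a 99% cumulative-contribution threshold stands in for the <1% mean-squared-error criterion, which is our assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for(data, target=0.99):
    """Smallest number of principal components whose cumulative
    contribution ratio reaches the target variance fraction."""
    pca = PCA().fit(data)
    cum = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum, target) + 1)

# skel_r, skel_l: (N, 22) skeleton vectors of the right/left person.
# n_pair = n_components_for(np.hstack([skel_r, skel_l]))  # ~24 reported
# n_single = n_components_for(skel_r)                     # ~14 reported
```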

Fig. 2. Overview of data collection for interaction motion.

It is also observed that a person can perform a different action in a similar situation and that a single motion can have different meanings. An example of the former case is shown in Fig. 4, where a person waves their right hand (Fig. 4(a)) or left hand (Fig. 4(c)) as a response to the other person's wave. In other words, there can be multiple motions suited to a given situation.

Fig. 3. Comparison of the necessary number of single skeleton components and skeleton pair components.

Fig. 4. Multimodality of interaction motions.

Theoretically, each interaction motion data element corresponds to a single point in a high-dimensional trajectory space ($\mathbb{R}^{22 \times 2 \times 32}$). However, motions are governed by various conditions, such as physical constraints and communication protocols; thus, the samples distribute in a relatively low-dimensional manifold (Fig. 4). The probability distribution of interaction motion data is expected to have multiple modes (Fig. 4), and the relationship between motion and context is many-to-many.

It has been difficult to obtain an approximation of such a complicated and nonlinear probability distribution using traditional methods. The average of two typical samples suited to the situation may not be included in the manifold (Fig. 4(b)); however, recent advances in deep neural network-based systems allow us to model complicated data such as human face images (Karras et al., 2017) or natural text (Hu et al., 2017). In this study, we propose a generative model for interaction motion data based on GAN.

3. Generative model for interaction motions

In this study, the generative model is trained based on WGAN-GP because it has been suggested that WGAN-GP realizes stable learning while avoiding the mode collapse problem (Goodfellow et al., 2014; Gulrajani, Ahmed, Arjovsky, Dumoulin, & Courville, 2017). Two neural networks are used in the WGAN-GP method, i.e., the ‘‘generator’’ network, which generates fake samples from latent variables, and the ‘‘critic’’ network, which evaluates the quality of the samples, i.e., how likely the given sample is to be a real one. These networks are trained in an adversarial manner such that the generator attempts to cheat the critic, and the critic attempts to avoid being cheated by the generator.

The generator neural network outputs a fake sample $x_F$ from a randomly generated latent variable $z$. The latent variable, sampled from a (multidimensional) normal distribution, is projected into the observation space using a nonlinear transformation $G$ defined by a deep neural network. Thus, the generator can be expressed conceptually as follows:

$$x_F = G(z)$$

We employ a normal distribution to generate the latent variable; however, we can also use an arbitrary distribution if required. Here, a fake sample $x_F$ is in $\mathbb{R}^{22 \times 2 \times 32}$ because a real sample $x_R$ is in $\mathbb{R}^{22 \times 2 \times 32}$.

The critic neural network receives a sample $x$ and outputs the degree of likeliness of being a real sample; thus, it can distinguish fake and real samples. We denote the output of the critic as $C(x)$, and the loss function $L_c$ is calculated as follows:

Fig. 5. Latent space of interaction motion.

$$L_c = L_{wgan} + \alpha L_{gp} + \beta (\gamma L_1 + (1 - \gamma) L_2)$$

$$L_{wgan} = \mathbb{E}_{x_F \sim P_F}[C(x_F)] - \mathbb{E}_{x_R \sim P_R}[C(x_R)]$$

$$L_{gp} = \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}[(\|\nabla C(\hat{x})\|_2 - 1)^2]$$

where $P_F$ is the probability distribution of fake samples defined by $G(z)$ and $p(z)$, $P_R$ is the probability distribution of real samples, and $\mathbb{E}[\cdot]$ denotes the expectation. $\hat{x}$ is an interpolated point between a real sample $x_R$ and a fake sample $x_F$, and is generated as follows:

$$\hat{x} = \epsilon x_R + (1 - \epsilon) x_F$$

where $\epsilon \in [0, 1]$ is a random number sampled from the uniform distribution (Gulrajani et al., 2017), and $P_{\hat{x}}$ is its probability distribution.

The parameters of the critic $w_c$ are updated to minimize the loss function $L_c$; the Adam optimizer (Kingma & Ba, 2014) is used to update the weight parameters. $L_1$ and $L_2$ are the $l_1$ and $l_2$ norms of the critic's parameters, i.e., the connection weights $w_c$ in the deep neural network. These regularization terms help avoid overfitting by reducing the magnitude of the weight values. In addition, the hyperparameters $\alpha$, $\beta$ and $\gamma$ are used to balance the effect of each loss; they are set to 10.0, 0.00001 and 0.3, respectively. When the critic is trained appropriately, the output of the critic for real samples, i.e., $C(x_R)$, becomes large while that for fake samples, i.e., $C(x_F)$, becomes small.
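A minimal PyTorch sketch of this critic loss follows; the paper does not specify a framework, `critic` stands for any network returning one scalar per sample, and the tensor shape (batch, 22, 2, 32) is assumed.

```python
import torch

def critic_loss(critic, x_real, x_fake, alpha=10.0, beta=1e-5, gamma=0.3):
    """L_c = L_wgan + alpha * L_gp + beta * (gamma * L1 + (1 - gamma) * L2),
    with x_fake assumed already detached from the generator."""
    l_wgan = critic(x_fake).mean() - critic(x_real).mean()
    # Gradient penalty at interpolates x_hat = eps * x_R + (1 - eps) * x_F.
    eps = torch.rand(x_real.size(0), 1, 1, 1, device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    l_gp = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    # l1 norm and (squared) l2 norm of the critic's connection weights w_c.
    l1 = sum(p.abs().sum() for p in critic.parameters())
    l2 = sum(p.pow(2).sum() for p in critic.parameters())
    return l_wgan + alpha * l_gp + beta * (gamma * l1 + (1 - gamma) * l2)
```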
The generator is trained to generate fake samples such that the critic fails to distinguish them, and its parameters are updated to minimize the following loss function $L_g$ using the critic:

$$L_g = - \mathbb{E}_{x_F \sim P_F}[C(x_F)]$$

The Adam optimizer is also used to update the generator's parameters.

During learning, the two networks are trained alternately. The quality of the ‘‘fake’’ samples generated by the generator is improved by training such that the critic cannot distinguish fake samples, even as the critic's ability to distinguish such samples also improves. In other words, the generator is trained to increase the loss function of the critic, while the critic's parameters are updated to decrease it. This alternating learning process of both networks is referred to as ‘‘adversarial learning’’.

3.1. Latent structure of interaction motion data

Based on the knowledge obtained in Section 2.2, we designed the latent space in consideration of the following characteristics of interaction motion data.

Interaction intensity: The behavior of a person can be affected by the interlocutor's behavior. However, the physical properties or personality of the person may not be directly affected by the interlocutor. Thus, the latent variable comprises elements shared by the two persons and elements assigned exclusively to each individual.

Time evolution: Human posture changes continuously from moment to moment and has local dependence on time; thus, the latent variable comprises elements that each affect only a specific period.

Time resolution: The motion of a person includes changes with different time constants. For example, hand movement while waving is fast. In contrast, changes in posture induced by changes in emotion may be slow. Thus, it seems natural to include elements that correspond to different time scales in the latent variable.

The proposed network possesses a latent variable that can be represented as a three-dimensional array, where the axes correspond to interaction intensity, time evolution, and time resolution. The latent variable used in the proposed method is shown in Fig. 5 (upper left). The vertical direction represents interaction intensity: the upper and lower elements are assigned to each person individually, and the middle elements affect the motions of both individuals simultaneously. The horizontal direction represents time evolution: elements on the left affect behavior in the earlier period, and those on the right affect behavior in the later period. The depth direction represents time resolution: elements in near planes respond to fast changes, and those in far planes respond to slow changes. For simplicity, to realize a low time resolution, multiple elements in the far planes share the same value.
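To make this structure concrete, the following sketch samples such a latent array; all sizes (two rows per individual, four shared rows, four time slots, three resolution planes) are illustrative choices of ours, not the dimensions used in the paper.

```python
import numpy as np

def sample_structured_latent(n_indiv=2, n_shared=4, n_time=4, n_res=3):
    """Latent variable as a 3D array: axis 0 = interaction intensity
    (person A rows / shared rows / person B rows), axis 1 = time
    evolution, axis 2 = time resolution."""
    rows = 2 * n_indiv + n_shared
    z = np.random.randn(rows, n_time, n_res)
    # Far planes (larger r) have low time resolution: neighbouring
    # time slots are forced to share the same value.
    for r in range(1, n_res):
        step = 2 ** r  # blocks become coarser with depth
        for t0 in range(0, n_time, step):
            z[:, t0:t0 + step, r] = z[:, t0, r][:, None]
    return z
```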

Fig. 6. Information propagation by transposed convolution.

Fig. 7. Overview of transposed convolution.

3.2. Generator network

Table 1
Characteristic elements of interaction intensity and time resolution.

              Short term               Long term
Individual    Emotion and intention    Height and personality
Interactive   Turn-taking and topics   Relationship and context

The generator network includes multiple (three-dimensional) transposed convolution layers (Dosovitskiy, Tobias Springenberg, & Brox, 2015; Radford, Metz, & Chintala, 2015) to mix the information of the latent variable, as shown in Fig. 6 (upper left). As mentioned previously, due to the locality of the convolution structure, the total effect of each element in the latent variable is restricted such that each element contributes to a certain part of the matrix $x_F$.

Fig. 7 shows an example procedure of the transposed convolution layer. In the first step, the input matrix ($\in \mathbb{R}^{2 \times 2 \times 2}$) is extended to a larger matrix ($\in \mathbb{R}^{4 \times 4 \times 4}$) by duplicating each element. Then, an ordinary convolution with a $2 \times 2 \times 2$ kernel is applied. As a result, the output matrix ($\in \mathbb{R}^{3 \times 3 \times 3}$) is obtained. As shown by the brightness of each element in the figure, the input values are mixed locally.
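The duplicate-then-convolve procedure of Fig. 7 can be reproduced in a few lines of NumPy/SciPy; this sketch shows only the mechanics, with random values standing in for the latent elements and kernel weights.

```python
import numpy as np
from scipy.signal import convolve

x = np.random.randn(2, 2, 2)       # input block of the latent array
kernel = np.random.randn(2, 2, 2)  # 2 x 2 x 2 convolution kernel

# Step 1: extend to 4x4x4 by duplicating each element along every axis.
up = x.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)

# Step 2: ordinary (valid) convolution mixes neighbouring values locally.
y = convolve(up, kernel, mode='valid')
print(y.shape)  # -> (3, 3, 3)
```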
After applying such transposed convolution layers repeatedly, the resultant matrix is reshaped into a two-dimensional matrix with $T$ columns, where each column corresponds to the encoded postures of the two people in each frame, as shown in Fig. 6 (bottom center). A fully connected layer is used to decode the posture of each person in each frame from the encoded posture. Here, the weights of the last layer are shared across persons and time frames. As a result, a fake sample, a matrix of size $\mathbb{R}^{22 \times 2 \times 32}$ with the same dimensionality as the interaction motion data, is generated from a randomly generated latent variable.

This network structure may be advantageous for generating interaction motions:

• The role of each element is determined to some extent according to its position in the three-dimensional latent variable array, as shown in Table 1.
• The number of weight parameters of the neural network is reduced significantly.
• Continuous posture change is realized due to the network structure (Oord et al., 2016).


Fig. 8. Recorded motions (REAL).

Fig. 9. Generated motions (linear PPCA).

4. Experimental results

We evaluate the proposed method by comparing it to linear probabilistic principal component analysis (PPCA) as the baseline. The PPCA parameters were determined based on the maximum likelihood estimation method and were obtained analytically. The number of latent variables (the effective dimensionality of the observation data) was selected such that the cumulative contribution rate becomes greater than 99.0%. As a result, we used a PPCA model with a 315-dimensional latent variable. We also evaluated a GAN with a different generator network structure, where each layer was composed naively using fully connected layers (Nishimura et al., 2019). In the following, we refer to this model as ‘‘GAN-F’’, and the proposed method is called ‘‘GAN-C’’.
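As a sketch of this baseline, the standard analytic maximum-likelihood PPCA solution and its sampling step can be written as follows, assuming the interaction motion data are flattened to (N, D) vectors with D = 22 × 2 × 32 = 1408 and q = 315; the names are illustrative.

```python
import numpy as np

def fit_ppca(X, q):
    """Analytic ML estimate of the PPCA loading matrix and noise."""
    mu = X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    evals, evecs = evals[::-1], evecs[:, ::-1]      # descending order
    sigma2 = evals[q:].mean()                       # discarded variance
    W = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)  # loading matrix
    return mu, W, sigma2

def sample_ppca(mu, W, sigma2, n):
    """Generate n samples: x = W z + mu + noise, with z ~ N(0, I)."""
    z = np.random.randn(n, W.shape[1])
    noise = np.sqrt(sigma2) * np.random.randn(n, W.shape[0])
    return z @ W.T + mu + noise
```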
Figs. 8–11 show stroboscopic interaction motion patterns converted from the measurement data (REAL) or generated by each method (GAN-F, GAN-C or PPCA). The motions generated by GAN-F contained noisy high-frequency vibrations; thus, a lowpass filter with a cutoff frequency of 2 Hz was applied to avoid the negative impressions induced by such noise.

In the REAL condition, there were many cases in which one person moves a lot while the other is nearly still; here, the magnitudes of the two persons' movements were occasionally switched (Fig. 8). Such movements allow us to infer the turns during the conversation because the speaker may move a lot while the listener does not.

Unlike the REAL condition, there were many cases in which both people moved constantly and simultaneously under the PPCA condition (Fig. 9). In addition, the interaction motions generated by both GAN methods include many cases in which each person alternated between quick, active movements and resting motions (Figs. 10 and 11). Note that the questionnaire results in our previous study indicated that the motions generated by GAN-F were more diverse than those generated by PPCA.

4.1. Subjective evaluation

We conducted a subject experiment to evaluate the generated motions. Fig. 12 shows the questionnaire used in this subjective experiment. A skeleton movie generated according to one of the conditions was shown to the participant, and the participant answered two questions.

Fig. 10. Generated motions (fully-connected GAN, GAN-F).

Fig. 11. Generated motions (convolutional GAN, GAN-C).

For the first question, the participant judged whether the person in the video was a speaker or a listener in a 0.5-s period. Note that the evaluation period was indicated in the seek bar at the bottom of the video. The participant also rated the confidence of the judgment using a five-point Likert scale: (1) it looks like the person is listening; (2) the person is probably listening; (3) it is difficult to judge; (4) the person is probably speaking; and (5) it looks like the person is speaking at the moment.

For the second question (‘‘Do you feel that the shape of a person was always kept?’’), the participant evaluated the shape of the skeleton images on a seven-point Likert scale. The lowest item (1) indicates that the skeleton always maintained a human-like shape, and the highest item (7) indicates that the skeleton did not always look like a human form.

One reason we adopted videos in which a single skeleton was presented is that it appears to be difficult for humans to recognize what occurs in a video recorded by an omnidirectional camera when raster graphics are converted to a skeleton image using a 2D motion capture system, e.g., OpenPose: two people facing each other are presented in a single plane as two figures viewed from the front. We evaluated the consistency of the label pairs given to the video pairs generated from a single generated (fake) interaction motion data and annotated by a single person.

The subjective evaluation was performed as follows: (1) five interaction motion data were generated for each condition (total of 20 interaction motion data); (2) each interaction motion data was divided into two single-person motion data (total of 40 motion data); (3) a skeleton video generated from the motion data was shown to the subject; (4) the subject answered the questions for the presented video; and (5) steps (3) and (4) were repeated until the subject had evaluated all motion data. Note that the order was shuffled randomly. The above procedure was performed in a single session, and each subject participated in three sessions. The first session was considered a practice session, and the results of the remaining two sessions were used for analysis. The number of subjects was 12, and 120 answers were collected for each condition.

Each cell in Table 2 shows the number of each combination of labels annotated by a single person to the two skeleton videos generated from a single interaction motion data. We summarize the answers using frequency band graphs in Fig. 13, where the confidence levels are ignored and merged.

Table 2
Questionnaire results for motion role labels.

Fig. 12. Questionnaire.

As shown in Fig. 13, the ratios of the combination [Speaker, Speaker] for the REAL, GAN-F, GAN-C, and PPCA conditions are 17%, 9%, 8%, and 54%, respectively, and the ratios of the combination [Speaker, Listener] for the REAL, GAN-F, GAN-C, and PPCA conditions are 39%, 38%, 39%, and 19%, respectively.

Note that the [Speaker, Speaker] combination indicates that both people speak to each other simultaneously, which is not likely to occur in normal dialog. In fact, such scenes were rarely recorded in the training videos, except for the short overlaps during turn taking. In contrast, the [Speaker, Listener] and [Listener, Listener] combinations appear more natural. The interaction motion data generated by the GANs appear better because, although the composition ratio is not identical to that of REAL, unnatural scenes rarely occur.

Fig. 14 shows the histogram of the quality of the generated skeleton videos. Each skeleton video was evaluated independently of the other of the pair; thus, the total number of answers for each condition was 240 (= 120 × 2). As shown in Fig. 14, the qualities of REAL, GAN-C, and PPCA were much greater than that of GAN-F. Therefore, the proposed model (GAN-C) outperformed the other models relative to generating interaction motions because GAN-C has higher conversation turn reproducibility than PPCA and generates higher quality motions than GAN-F.

4.2. Motion characteristics

We also evaluated the generated interaction motions from two perspectives, i.e., the distribution of each feature point position and the distribution of each feature point velocity.

Fig. 15 shows the histograms of the nose position generated by PPCA, GAN-F, and GAN-C (or measured from the real video). Here, the size of the original image was 736 × 736 pixels, and the position of each feature point was defined as the difference in pixels divided by 368. The horizontal axis of the graph represents the horizontal position of the nose relative to the torso position, and the vertical axis represents the frequency, where the number of bins is 100. The results shown in the figure suggest that the distribution of PPCA differs significantly from that of REAL.

Fig. 16 shows the corresponding histograms of the velocity of the nose. To make the low-frequency area visible, we show an additional graph with an enlarged scale at the bottom of Fig. 16.

Fig. 13. Evaluation results of role label (Speaker or Listener) ratio for each condition.


Fig. 14. Evaluation result of human-likeness of skeleton shape for each condition.

Fig. 15. Distribution of nose horizontal position.

Fig. 16. Distribution of nose horizontal velocity.

Fig. 17. Comparison of differences of position distributions with REAL.

The velocity distribution was calculated based on the temporal difference between each position and its successive position in the next frame. Here, the frame rate was 8 fps; thus, we multiplied the difference by the frame rate, so the horizontal axis represents the velocity in units of [pixels/s].

Figs. 17 and 18 show the Kullback–Leibler divergence from the distribution of real samples to the distribution of the generated samples. Here, we generated the same number of samples as the real samples (44,608 samples) using each generative model and compared the differences between the histograms. The results shown in these figures suggest that the GAN-based methods generated samples with distributions closer to REAL.
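The paper does not detail the divergence estimator; one plausible histogram-based sketch, reusing the 100-bin discretization of the figures, is:

```python
import numpy as np

def kl_from_histograms(real, fake, bins=100, eps=1e-12):
    """Estimate D(P_real || P_fake) from 1-D histograms on common bins."""
    lo = min(real.min(), fake.min())
    hi = max(real.max(), fake.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(fake, bins=bins, range=(lo, hi))
    p = p.astype(float) + eps  # avoid log(0) in empty bins
    q = q.astype(float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```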
As shown in Fig. 13, the motion label ratios of the GAN conditions are close to REAL, which may be due to the closeness of their position and velocity distributions to those of the real samples.

Fig. 19 shows the histogram of the amount of whole body movement. Here, the absolute difference between two successive positions of the feature points was averaged over all feature points. This value $\bar{s}$ is calculated as follows:

$$\bar{s} = \frac{1}{11} \sum_{k=1}^{11} \sqrt{H_k^2 + V_k^2},$$

where $k$ is the index of each feature point and $H_k$ and $V_k$ denote the horizontal and vertical differences between two successive positions of the $k$th feature point, respectively. Note that 11 feature points were considered in this study. The horizontal axis represents the amount of whole body movement, and the vertical axis represents its frequency (the number of bins is 100).
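A direct NumPy transcription of this quantity, assuming one person's motion is stored as a (T, 11, 2) array of feature point positions, is shown below; multiplying by the 8-fps frame rate would convert the values to pixels/s.

```python
import numpy as np

def mean_speed(skeleton):
    """Per-frame s_bar: skeleton is a (T, 11, 2) array of
    (horizontal, vertical) feature point positions."""
    diff = np.diff(skeleton, axis=0)         # H_k and V_k per frame
    step = np.sqrt((diff ** 2).sum(axis=2))  # sqrt(H_k^2 + V_k^2)
    return step.mean(axis=1)                 # average over the 11 points
```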

The distribution of the REAL condition is concentrated at smaller movements, and it is heavily tailed. The distribution of the PPCA condition shows a peak at larger movements and immediately becomes small as the movement size increases. The distributions of GAN-C and GAN-F appear to have properties more similar to REAL because they demonstrate positive skewness. This seems consistent with the results shown in Fig. 13, where motions generated by PPCA tend to move constantly, while the others alternate between stopping and moving.

Fig. 18. Comparison of differences of velocity distributions with REAL.

Fig. 19. Distribution of average speed for each condition.

Thus, we conclude that the PPCA condition tends to generate motion that always moves slowly. In contrast, the GAN conditions tend to generate motion that switches between being stationary and moving.

5. Conclusion

In this paper, we have formulated the modeling of interaction behavior and proposed a deep generative model based on GAN, where the structure of the latent variable is designed based on three concepts, i.e., ‘‘interaction intensity’’, ‘‘time evolution’’ and ‘‘time resolution’’. Combined with a network connection pattern built from transposed convolutional layers, the proposed model can generate high-quality interaction-like motion.

The proposed GAN-C model has fewer parameters than the existing GAN-F method. It is appropriate to output similar results for the same state change occurring at different times, and this property is guaranteed by using convolutional layers. In addition, the convolutional structure can constrain the mixture of individual intrinsic characteristics (e.g., the height of a person) and the various phenomena caused by interactions. Consequently, the relationships between latent variables and generated motions are expected to be easier to interpret in the proposed model than in the GAN-F model. Interpretability in unsupervised learning has been investigated recently (Chen et al., 2016; Zhang, Nian Wu, & Zhu, 2018), and it will be interesting to assess the proposed method from this perspective.

In recent years, 3D pose estimation systems (Keselman, Iselin Woodfill, Grunnet-Jepsen, & Bhowmik, 2017; Zhang, 2012) and OpenPose 3D (Cao et al., 2018) that can be used with small image sensors have been developed. Such systems allow easy data acquisition without several of the restrictions of conventional systems, e.g., attaching markers, and are easy to use even under physical restrictions on mounting equipment, e.g., when mounting equipment on humanoid robots. In the future, we plan to build models using 3D motion data acquired from such systems.

In addition, humanoid motions can be generated using a behavior model with past motion data acquired online as conditions. To realize a humanoid motion generation system, a model must be able to generate motions based on a conversational partner's data (e.g., motion and voice data), environmental information, and conversation context.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was partially supported by JST ERATO Grant Number JPMJER1401, Japan, and Grant-in-Aid for Scientific Research on Innovative Areas, Grant Number 19H05693.

References

Barsoum, E., Kender, J., & Liu, Z. (2018). HP-GAN: Probabilistic 3D human motion prediction via GAN. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 1418–1427).

Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., & Sheikh, Y. (2018). OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008.

Cao, Z., Simon, T., Wei, S.-E., & Sheikh, Y. (2017). Realtime multi-person 2D pose estimation using part affinity fields. In CVPR.


Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems (pp. 2172–2180).

Doering, M., Glas, D. F., & Ishiguro, H. (2019). Modeling interaction structure for robot imitation learning of human social behavior. IEEE Transactions on Human-Machine Systems, 49(3), 219–231.

Dosovitskiy, A., Tobias Springenberg, J., & Brox, T. (2015). Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1538–1546).

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in neural information processing systems (pp. 5767–5777).

Heracleous, P., Sato, M., Ishi, C. T., Ishiguro, H., & Hagita, N. (2011). Speech production in noisy environments and the effect on automatic speech recognition. In ICPhS (pp. 855–858).

Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward controlled generation of text. In Proceedings of the 34th international conference on machine learning, vol. 70 (pp. 1587–1596). JMLR.org.

Inamura, T., & Nakamura, Y. (2014). Stochastic information processing that unifies recognition and generation of motion patterns: Toward symbolical understanding of the continuous world (pp. 79–102). http://dx.doi.org/10.1201/b17949-7.

Inoue, K., Hara, K., Lala, D., Nakamura, S., Takanashi, K., & Kawahara, T. (2019). A job interview dialogue system with autonomous android ERICA. In IWSDS.

Kahn, P. H., Freier, N. G., Kanda, T., Ishiguro, H., Ruckert, J. H., Severson, R. L., et al. (2008). Design patterns for sociality in human-robot interaction. In Proceedings of the 3rd ACM/IEEE international conference on human robot interaction (pp. 97–104). ACM.

Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

Kawahara, T., Uesato, M., Yoshino, K., & Takanashi, K. (2015). Toward adaptive generation of backchannels for attentive listening agents. In International workshop series on spoken dialogue systems technology (pp. 1–10).

Keselman, L., Iselin Woodfill, J., Grunnet-Jepsen, A., & Bhowmik, A. (2017). Intel RealSense stereoscopic depth cameras. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 1–10).

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

Nishimura, Y., Nakamura, Y., & Ishiguro, H. (2019). Human behavior modeling during dialogue by using generative adversarial networks. Journal of the Robotics Society of Japan, 37(7), 632–638. http://dx.doi.org/10.7210/jrsj.37.632.

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Pavlovic, V., Rehg, J. M., & MacCormick, J. (2001). Learning switching linear models of human motion. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, vol. 13 (pp. 981–987). MIT Press.

Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Sakai, K., Minato, T., Ishi, C. T., & Ishiguro, H. (2017). Novel speech motion generation by modeling dynamics of human speech production. Frontiers in Robotics and AI, 4, 49. http://dx.doi.org/10.3389/frobt.2017.00049.

Takano, W., & Nakamura, Y. (2016). Real-time unsupervised segmentation of human whole-body motion and its application to humanoid robot acquisition of motion symbols. Robotics and Autonomous Systems, 75, 260–272.

Zhang, Z. (2012). Microsoft Kinect sensor and its effect. IEEE Multimedia, 19(2), 4–10.

Zhang, T., McCarthy, Z., Jow, O., Lee, D., Chen, X., Goldberg, K., et al. (2018). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE international conference on robotics and automation (pp. 1–8). IEEE.

Zhang, Q., Nian Wu, Y., & Zhu, S.-C. (2018). Interpretable convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8827–8836).
