Yusuke Nishimura, Yutaka Nakamura, Hiroshi Ishiguro
Neural Networks
journal homepage: www.elsevier.com/locate/neunet
Article history:
Received 21 May 2020
Received in revised form 1 September 2020
Accepted 25 September 2020
Available online 30 September 2020

Keywords:
Human robot interaction
Human motion modeling
Generative Adversarial Networks
Human behavior during dialog

Abstract
Recently, considerable research has focused on personal assistant robots, and robots capable of rich human-like communication are expected. Among humans, non-verbal elements contribute to effective and dynamic communication. However, people use a wide range of diverse gestures, and a robot capable of expressing various human gestures has not been realized. In this study, we address human behavior modeling during interaction using a deep generative model. In the proposed method, to consider interaction motion, three factors, i.e., interaction intensity, time evolution, and time resolution, are embedded in the network structure. Subjective evaluation results suggest that the proposed method can generate high-quality human motions.

https://doi.org/10.1016/j.neunet.2020.09.019
0893-6080/© 2020 Elsevier Ltd. All rights reserved.
Y. Nishimura, Y. Nakamura and H. Ishiguro Neural Networks 132 (2020) 521–531
Fig. 1. Video recording of a conversation; (a) environment (b) extraction of skeleton from images.
We have been working on a GAN framework for modeling human behavior during dialog (Nishimura, Nakamura, & Ishiguro, 2019). In this study, we implemented two deep generative neural networks and investigated their performance. The first one is a naive fully-connected deep neural network where a fake sample is generated from an ''unstructured'' latent variable. The second is a convolutional neural network where the latent variable is designed by considering the interaction between two entities. For example, a person's behavior is affected by the conversation partner, whereas individual properties, such as height, are independent, and such a notion must be considered in designing the model structure. The latent variable and connections in the convolutional neural network model are designed based on interaction intensity, time resolution, and time evolution (Section 3.1).

Subjective evaluations of the generated motion were conducted. As a result, it became clear that a traditional generative model, i.e., a principal component analysis (PCA) model, cannot generate interaction motions including conversation turns, while the proposed deep generative models can. In addition, from a ''biological motion'' perspective, the quality of motion generated using the structured latent variable model is better than that generated using the unstructured latent variable model.

The structure of the paper is as follows. In Section 2, the data collection procedure for human–human interaction and its statistical properties are described. In Section 3, the network structure of our proposed deep generative model is explained. In Section 4, the results of the subjective evaluations and statistics of the generated motions are presented. Finally, Section 5 presents the conclusion of the research.

2. Motion during dialog

Human motion during dialog has two main characteristics. (1) There may be multiple appropriate actions in any given situation. For example, a person can wave either their right or left hand when the conversation partner waves their hand. (2) The same action can convey different meanings in different contexts. For example, depending on the situation, a simple nodding motion may mean ''I agree with you'' or ''I'm listening to you''.

For this aim, commonly used motion generation methods for interaction robots are not suitable. Rule-based methods require a great many manually constructed rules to cope with a single scenario. This results in what is known as a combinatorial explosion, where the complexity of a problem increases rapidly. With imitation learning, each motion can be created automatically from human instruction; however, imitation learning requires a typical motion trajectory for each motion category (Zhang, McCarthy, Jow, Lee, Chen, Goldberg, et al., 2018). Furthermore, the required number of categories is not known in advance. To address these problems, we propose an innovative motion generation model based on a deep generative model.

In this study, as raw data, we use video recordings of the movements of two people during dialog, as shown in Fig. 1. The people involved in the dialog were instructed to stay in one place, and a camera was placed such that neither person was occluded. A pose estimation system was employed to encode the posture of each person to a vector that consists of the positions of feature points, such as the head, torso, and right elbow (Fig. 1(b)). We refer to this vector as a ''skeleton''. Then, we can obtain a matrix representation, i.e., a multidimensional time series, of a human motion and construct a generative model based on this data representation.

2.1. Data collection using an omnidirectional camera

In this study, we used an omnidirectional camera to obtain frontal images of two people during communication. Each person sits on a fixed chair, as shown in Fig. 2. K feature points of a human posture (shown in Fig. 2, bottom center) are extracted from each captured image using the OpenPose library (Cao, Hidalgo, Simon, Wei, & Sheikh, 2018; Cao, Simon, Wei, & Sheikh, 2017). Here, the dimensionality of the skeleton is D (= K × 2). The skeletons of the people on the right and left are denoted x_t^r ∈ R^D and x_t^l ∈ R^D, respectively. The motions of both people, [x_{t−T+1}^r, x_{t−T+2}^r, ..., x_t^r] and [x_{t−T+1}^l, x_{t−T+2}^l, ..., x_t^l], are combined, and ''interaction motion data'' x ∈ R^{D×2×T} is composed. Here, T is the length of the time series. In this paper, the interaction motion data is x ∈ R^{22×2×32} since we used 11 feature points (K = 11, D = 22) and 4-s long motions (8 fps, T = 32).

The obtained interaction motion data were preprocessed to reduce noise and rectify missing values. (i) Removing outliers: OpenPose estimates with low confidence scores (<5%) were removed and replaced by linear interpolation. (ii) Smoothing: each dimension of the time series was smoothed using a low-pass filter with a 4-Hz cutoff frequency. (iii) Resampling: to reduce redundant information from the interaction motion data, which were usually recorded at a standard video frame rate, such as 24 or 30 fps, the time series was resampled to 8 fps.

2.2. Interaction motion property

We analyze the recorded data to investigate the properties of the interaction motion data for designing the latent variable structure. We applied principal component analysis (PCA) to the pair of skeletons and to a single skeleton. The cumulative contribution ratios of the principal components are shown in Fig. 3. The number of components required to approximate the skeletons in the former case is 24 with small mean squared errors (<1%). Since the number of components required to mimic each skeleton independently is 14 (the latter case), 28 components are necessary to reproduce the pair of skeletons. This indicates that the number of required components can be reduced due to the interaction between the two people, and it is suggested that the latent variable to generate interaction motion data might have some components shared by the two skeletons.

It is also observed that a person can perform a different action in a similar situation and that a single motion can have different
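The data representation and preprocessing described in Section 2.1 can be sketched as follows. This is a minimal illustration, not the authors' code: the array shapes follow the paper (D = 22, T = 32 at 8 fps), but a moving-average smoother stands in for the 4-Hz low-pass filter, the decimation-based resampler assumes a 24-fps source, and all function names are illustrative.

```python
import numpy as np

# skel: (T_raw, D) time series of one person's skeleton (D = K * 2 = 22).
# conf: (T_raw, K) OpenPose confidence scores, one per keypoint per frame.

CONF_THRESHOLD = 0.05  # estimates below 5% confidence are treated as missing

def interpolate_outliers(skel, conf):
    """(i) Replace low-confidence estimates by linear interpolation over time."""
    skel = skel.copy()
    t = np.arange(len(skel))
    for j in range(skel.shape[1]):
        good = conf[:, j // 2] >= CONF_THRESHOLD  # x and y share one keypoint
        if good.any():
            skel[:, j] = np.interp(t, t[good], skel[good, j])
    return skel

def smooth(skel, window=5):
    """(ii) Stand-in for the paper's 4-Hz low-pass filter: moving average."""
    kernel = np.ones(window) / window
    return np.stack([np.convolve(skel[:, j], kernel, mode="same")
                     for j in range(skel.shape[1])], axis=1)

def resample(skel, src_fps=24, dst_fps=8):
    """(iii) Downsample by decimation (24 fps -> 8 fps keeps every 3rd frame)."""
    return skel[::src_fps // dst_fps]

def interaction_tensor(skel_r, skel_l):
    """Stack both persons' motions into x in R^{D x 2 x T}."""
    return np.stack([skel_r.T, skel_l.T], axis=1)  # shape (D, 2, T)
```

For a 4-s clip recorded at 24 fps (96 frames), the pipeline yields a (32, 22) series per person, and stacking the two persons gives the D × 2 × T interaction tensor.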
L_gp = E_{x̂ ∼ P_x̂}[(∥∇C(x̂)∥₂ − 1)²]

where P_F is the probability distribution of fake samples defined by G(z) and p(z), and P_R is the probability distribution of real samples. E[·] denotes the expectation. x̂ is a subdivided point of a real sample x_R and a fake sample x_F, and is generated as follows:

x̂ = ϵ x_R + (1 − ϵ) x_F

where ϵ ∈ [0, 1] is a random number sampled from the uniform distribution (Gulrajani et al., 2017), and P_x̂ is its probability distribution.

The parameter of the critic, w_c, is updated to minimize the loss function L_c. Here, the Adam optimizer (Kingma & Ba, 2014) is used to update the weight parameter. L1 and L2 are the l1 and l2 norms of the parameter of the critic, i.e., the connection weights w_c in the deep neural network. These regularization terms help to avoid overfitting by reducing the magnitude of the weight values. In addition, the hyperparameters α, β, and γ are used to balance the effect of each loss. These parameters are set to 10.0, 0.00001, and 0.3, respectively. When the critic is trained appropriately, the output of the critic for real samples, i.e., C(x_R), becomes large, while that for fake samples, i.e., C(x_F), becomes small.

The generator is trained to generate fake samples such that the critic fails to distinguish them, and its parameter is updated to minimize the following loss function L_g using the critic.

L_g = −E_{x_F ∼ P_F}[C(x_F)]

The Adam optimizer is used to update the generator's parameters.

During learning, each network is trained alternately. The quality of the ''fake'' samples generated by the generator is improved by training such that the critic cannot distinguish fake samples, even though the critic also has an improved ability to distinguish such samples. In other words, the generator is trained to increase the loss function of the critic, while the critic's parameter is updated to decrease the loss function. This alternating learning process of both networks is referred to as ''adversarial learning''.

Based on the knowledge we obtained (Section 2.2), we designed the latent space in consideration of the following characteristics to model interaction motion data.

Interaction intensity The behavior of a person can be affected by the interlocutor's behavior. However, the physical properties or personality of the person may not be directly affected by the interlocutor. Thus, the latent variable comprises elements shared by the two persons and elements assigned exclusively to each individual.

Time evolution Human posture changes continuously from moment to moment and has local dependence on time; thus, the latent variable comprises elements that only affect different periods.

Time resolution The motion of a person includes changes with different time constants. For example, hand movement while waving is fast. In contrast, changes in posture induced by changes in emotion may be slow. Thus, it seems natural to include elements that correspond to different time periods in the latent variable.

The proposed network possesses a latent variable that can be represented as a three-dimensional array, where the axes correspond to interaction intensity, time evolution, and time resolution. The latent variable used in the proposed method is shown in Fig. 5 (upper left). Note that the vertical direction represents interaction intensity. The upper and lower elements are assigned to each person individually, and the middle elements affect the motions of both individuals simultaneously. The horizontal direction represents the time difference. Here, elements on the left affect the behavior of the earlier period, and those on the right affect the behavior of the later period. The depth direction represents the difference in time resolution, where elements in near planes respond to fast changes, and those in far planes respond to slow changes. For simplicity, to facilitate a low time resolution, multiple elements in far planes have the same value.
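The loss terms above can be checked numerically. The sketch below is not the authors' implementation: it assumes the standard WGAN-GP critic loss L_c = E[C(x_F)] − E[C(x_R)] + αL_gp + βL1 + γL2 (the exact composition of L_c is not shown in this excerpt), and it uses a toy linear critic C(x) = w·x so that the gradient in L_gp is available in closed form without autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, gamma = 10.0, 0.00001, 0.3  # hyperparameter values from the paper

w = rng.normal(size=8)             # critic weights w_c (toy dimensionality)
x_real = rng.normal(size=(64, 8))  # real samples x_R drawn from P_R
x_fake = rng.normal(size=(64, 8))  # fake samples x_F = G(z), distribution P_F

def critic(x):
    # Linear critic: C(x) = w . x, so grad_x C(x) = w for every input.
    return x @ w

# x_hat = eps * x_R + (1 - eps) * x_F with eps ~ U[0, 1], one per sample pair
eps = rng.uniform(size=(64, 1))
x_hat = eps * x_real + (1.0 - eps) * x_fake

# For the linear critic the gradient norm is ||w|| at every x_hat, so
# L_gp = E[(||grad C(x_hat)||_2 - 1)^2] reduces to (||w|| - 1)^2 here.
grad_norm = np.full(len(x_hat), np.linalg.norm(w))
L_gp = np.mean((grad_norm - 1.0) ** 2)

L1 = np.sum(np.abs(w))       # l1 norm of the critic parameters
L2 = np.sqrt(np.sum(w ** 2)) # l2 norm of the critic parameters

# Assumed standard WGAN-GP critic loss plus the regularizers described above
L_c = (critic(x_fake).mean() - critic(x_real).mean()
       + alpha * L_gp + beta * L1 + gamma * L2)
L_g = -critic(x_fake).mean()  # generator loss L_g = -E[C(x_F)]
```

In a real model the gradient in L_gp must be obtained by automatic differentiation through the critic network; the linear critic here only serves to make each term of the formulas concrete.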
The generator network includes multiple (three-dimensional) transposed convolution layers (Dosovitskiy, Tobias Springenberg, & Brox, 2015; Radford, Metz, & Chintala, 2015) to mix the in-

              Short term              Long term
Individual    Emotion and intention   Height and personality
Interactive   Turn-take and topics    Relationship and context
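The structured latent variable described in Section 3 can be sketched as a small array manipulation. The sizes below are illustrative, not taken from the paper; the point is that axis 0 separates individual and shared components, axis 1 spans time, and groups of time elements in "far" resolution planes share one value.

```python
import numpy as np

# Axis 0: interaction intensity (person A / shared / person B)
# Axis 1: time evolution (earlier -> later periods)
# Axis 2: time resolution (near = fast changes, far = slow changes)
N_INTENSITY, N_TIME, N_RES = 3, 8, 4  # illustrative sizes

def sample_latent(rng):
    z = rng.standard_normal((N_INTENSITY, N_TIME, N_RES))
    # Far planes model slow changes: adjacent time elements are tied
    # together, mimicking "multiple elements ... have the same value".
    for r in range(1, N_RES):
        group = 2 ** r  # coarser grouping the farther the plane
        for start in range(0, N_TIME, group):
            z[:, start:start + group, r] = z[:, start:start + 1, r]
    return z
```

Row 0 and row 2 of axis 0 would drive each person individually, while row 1 is shared by both, which is the "interaction intensity" structure described above.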
4. Experimental results

We evaluate the proposed method by comparing it to linear probabilistic principal component analysis (PPCA) as the baseline. The PPCA parameters were determined based on the maximum likelihood estimation method and were obtained analytically. The number of latent variables (the effective dimensionality of the observation data) was selected such that the cumulative contribution rate becomes greater than 99.0%. As a result, we used a PPCA model with a 315-dimensional latent variable. We also evaluated a GAN with a different generator network structure, where each layer was composed naively using a fully connected layer (Nishimura et al., 2019). In the following, we refer to this as ''GAN-F'', and the proposed method is called ''GAN-C''.

Figs. 8–11 show stroboscopic interaction motion patterns converted from the measurement data (REAL) or generated by each method (GAN-F, GAN-C, or PPCA). The motions generated by GAN-F contained noisy high-frequency vibrations; thus, a lowpass filter with a cutoff frequency of 2 Hz was applied to avoid negative impressions of the generated motions induced by such noise.

In the REAL condition, there were many cases in which a single person moves a lot and the other nearly stops. Here, the magnitudes of both persons' movements were occasionally switched (Fig. 8). Such movements allow us to imagine the turns during the conversation because the speaker may move a lot (while the listener does not).

Unlike the REAL condition, there were many cases in which both people moved constantly and simultaneously under the PPCA condition (Fig. 9). In addition, the interaction motions generated by both GAN methods include many cases in which each person alternately repeated quick-and-active movements and resting motions (Figs. 10 and 11). Note that the result of the questionnaire in our previous study indicated that the variation of the motion generated by GAN-F was more diverse than that of PPCA.

4.1. Subjective evaluation

We conducted a subjective experiment to evaluate the generated motions. Fig. 12 shows the questionnaire used in this subjective experiment. A skeleton movie generated according to one of the conditions was shown to the participant, and the participant
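The dimensionality-selection rule used for the PPCA baseline (the smallest number of components whose cumulative contribution rate exceeds 99.0%) can be sketched as follows; the data here is synthetic and the function name is illustrative.

```python
import numpy as np

def select_num_components(data, threshold=0.99):
    """Smallest number of principal components whose cumulative
    contribution ratio exceeds `threshold` of the total variance."""
    centered = data - data.mean(axis=0)
    # Component variances are the squared singular values of the centered data.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    ratio = np.cumsum(variances) / variances.sum()
    return int(np.searchsorted(ratio, threshold) + 1)
```

Applied to motion data whose variance is concentrated in a few directions, this returns a small latent dimensionality; applied to the interaction motion data of the paper, the same rule yielded 315 dimensions.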
answered two questions. For the first question, the participant judged whether the person in the video was a speaker or a listener in a 0.5-s period. Note that the evaluation period was indicated on the seek bar at the bottom of the video.

The participant also reported the confidence of the judgment using a five-point Likert scale: (1) it looks like the person is listening; (2) the person is probably listening; (3) it is difficult to judge; (4) the person is probably speaking; and (5) it looks like the person is speaking at the moment.

For the second question (''Do you feel that the shape of a person was always kept?''), the participant evaluated the shape of the skeleton images on a seven-point Likert scale. The lowest item (1) indicates that the skeleton always maintained a human-like shape, and the highest item (7) indicates that the skeleton did not always look like a human form.

One reason we adopted videos in which a single skeleton was presented is that it appears to be difficult for humans to recognize what occurs in a video recorded by an omnidirectional camera when raster graphics are converted to a skeleton image using a 2D motion capture system, e.g., OpenPose. Here, two people facing each other were presented in a single plane as two figures viewed from the front. We evaluated the consistency of label pairs of the video pairs generated from a single generated (fake) interaction motion data and annotated by a single person.

The subjective evaluation was performed as follows: (1) five interaction motion data were generated for each condition (total of 20 interaction motion data); (2) each interaction motion data was divided into two single-person motion data (total of 40 motion data); (3) a skeleton video generated from the motion data was shown to the subject; (4) the subject answered the questions for the presented video; and (5) steps (3) and (4) were repeated until the subject had evaluated all motion data. Note that the order was shuffled randomly. The above procedure was performed in a single session, and each subject participated in three sessions. The first session was considered a practice session, and the results of the remaining two sessions were used for analysis. The number of subjects was 12, and 120 answers were collected for each condition.

Each cell in Table 2 shows the number of each combination of labels annotated by a single person to the two skeleton videos generated from a single interaction motion data. We summarize the answers using frequency band graphs in Fig. 13, where the confidence levels are ignored and merged. As shown, the ratios of
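The evaluation protocol in steps (1)–(5) can be sketched as follows; the condition names follow the paper, while the data structures and function name are illustrative.

```python
import random

CONDITIONS = ["REAL", "GAN-F", "GAN-C", "PPCA"]

def build_session(rng):
    """One session: 5 motions per condition, split into single-person
    clips and presented in random order."""
    clips = []
    for cond in CONDITIONS:
        for motion_id in range(5):            # (1) five motions per condition
            for person in ("left", "right"):  # (2) split into two clips
                clips.append((cond, motion_id, person))
    rng.shuffle(clips)                        # presentation order randomized
    return clips                              # (3)-(5): shown one by one
```

Each of the 12 subjects would run through such a 40-clip list three times, with the first session discarded as practice.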
Table 2
Questionnaire results for motion role labels.

Fig. 12. Questionnaire.

the combination [Speaker, Speaker] for the REAL, GAN-F, GAN-C, and PPCA conditions are 17%, 9%, 8%, and 54%, respectively, and the ratios of the combination [Speaker, Listener] for the REAL, GAN-F, GAN-C, and PPCA conditions are 39%, 38%, 39%, and 19%, respectively.

Note that the [Speaker, Speaker] combination indicates that both people speak to each other simultaneously, which is not likely to occur in normal dialog. In fact, such scenes were rarely recorded in the training videos, except for the short overlaps during turn taking. In contrast, the [Speaker, Listener] and [Listener, Listener] combinations appear more natural. The interaction motion data generated by the GANs appear better because, although their composition ratio is not similar to that of REAL, unnatural scenes are likely to occur rarely.

Fig. 14 shows the histogram of the quality of the generated skeleton videos. Each skeleton video was evaluated regardless of the other of the pair; thus, the total number of answers for each condition was 240 (= 120 × 2). As shown in Fig. 14, the qualities of REAL, GAN-C, and PPCA were much greater than that of GAN-F. Therefore, the proposed model (GAN-C) outperformed the other models relative to generating interaction motions because GAN-C has higher conversation turn reproducibility than PPCA and generates higher quality motions than GAN-F.

4.2. Motion characteristics

We also evaluated the generated interaction motion from two perspectives, i.e., the distribution of each feature point position and the distribution of each feature point velocity.

Fig. 15 shows the histograms of the hand position generated by the PPCA, GAN-F, and GAN-C (or measured from the real video). Here, the size of the original image was 736 × 736 pixels, and the position of each feature point was defined as the difference of pixels divided by 368. The horizontal axis of the graph represents the horizontal position of the nose relative to the torso position, and the vertical axis represents the frequency, where the number of bins is 100. The results shown in the figure suggest that the distribution of PPCA differs significantly from that of REAL.

Fig. 16 shows the corresponding histograms for the velocity of the nose. To make the low-frequency area visible, we show an additional graph with

Fig. 13. Evaluation results of role label (Speaker or Listener) ratio for each condition.
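The normalization and binning used in Section 4.2 can be sketched as follows; the coordinates are synthetic and the variable names are illustrative. Positions are pixel differences divided by 368 (half of the 736-pixel image) and histogrammed with 100 bins.

```python
import numpy as np

IMAGE_SIZE = 736
SCALE = IMAGE_SIZE // 2  # 368, the normalization constant from the paper

rng = np.random.default_rng(0)
nose_x = rng.uniform(0, IMAGE_SIZE, size=1000)   # nose x-coordinate, pixels
torso_x = rng.uniform(0, IMAGE_SIZE, size=1000)  # torso x-coordinate, pixels

# Relative horizontal position of the nose with respect to the torso,
# expressed as a pixel difference divided by 368.
relative = (nose_x - torso_x) / SCALE

# Frequency histogram with 100 bins, as in Fig. 15.
counts, edges = np.histogram(relative, bins=100)
```

The same binning applied to frame-to-frame differences of the normalized positions would give the velocity histograms of Fig. 16.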
Fig. 14. Evaluation result of human-likeness of skeleton shape for each condition.
5. Conclusion
References

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems (pp. 2172–2180).
Doering, M., Glas, D. F., & Ishiguro, H. (2019). Modeling interaction structure for robot imitation learning of human social behavior. IEEE Transactions on Human-Machine Systems, 49(3), 219–231.
Dosovitskiy, A., Tobias Springenberg, J., & Brox, T. (2015). Learning to generate chairs with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1538–1546).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in neural information processing systems (pp. 5767–5777).
Heracleous, P., Sato, M., Ishi, C. T., Ishiguro, H., & Hagita, N. (2011). Speech production in noisy environments and the effect on automatic speech recognition. In ICPhS (pp. 855–858).
Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward controlled generation of text. In Proceedings of the 34th international conference on machine learning, vol. 70 (pp. 1587–1596). JMLR.org.
Inamura, T., & Nakamura, Y. (2014). Stochastic information processing that unifies recognition and generation of motion patterns: Toward symbolical understanding of the continuous world (pp. 79–102). http://dx.doi.org/10.1201/b17949-7.
Inoue, K., Hara, K., Lala, D., Nakamura, S., Takanashi, K., & Kawahara, T. (2019). A job interview dialogue system with autonomous android ERICA. In IWSDS.
Kahn, P. H., Freier, N. G., Kanda, T., Ishiguro, H., Ruckert, J. H., Severson, R. L., et al. (2008). Design patterns for sociality in human-robot interaction. In Proceedings of the 3rd ACM/IEEE international conference on human robot interaction (pp. 97–104). ACM.
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
Kawahara, T., Uesato, M., Yoshino, K., & Takanashi, K. (2015). Toward adaptive generation of backchannels for attentive listening agents. In International workshop series on spoken dialogue systems technology (pp. 1–10).
Keselman, L., Iselin Woodfill, J., Grunnet-Jepsen, A., & Bhowmik, A. (2017). Intel RealSense stereoscopic depth cameras. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 1–10).
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Nishimura, Y., Nakamura, Y., & Ishiguro, H. (2019). Human behavior modeling during dialogue by using generative adversarial networks. Journal of the Robotics Society of Japan, 37(7), 632–638. http://dx.doi.org/10.7210/jrsj.37.632.
Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., et al. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Pavlovic, V., Rehg, J. M., & MacCormick, J. (2001). Learning switching linear models of human motion. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, vol. 13 (pp. 981–987). MIT Press.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Sakai, K., Minato, T., Ishi, C. T., & Ishiguro, H. (2017). Novel speech motion generation by modeling dynamics of human speech production. 4, 49. http://dx.doi.org/10.3389/frobt.2017.00049.
Takano, W., & Nakamura, Y. (2016). Real-time unsupervised segmentation of human whole-body motion and its application to humanoid robot acquisition of motion symbols. Robotics and Autonomous Systems, 75, 260–272.
Zhang, Z. (2012). Microsoft Kinect sensor and its effect. IEEE Multimedia, 19(2), 4–10.
Zhang, T., McCarthy, Z., Jow, O., Lee, D., Chen, X., Goldberg, K., et al. (2018). Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE international conference on robotics and automation (pp. 1–8). IEEE.
Zhang, Q., Nian Wu, Y., & Zhu, S.-C. (2018). Interpretable convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 8827–8836).