convolutional layer. It makes the topology of the graph optimized together with the other parameters of the network in an end-to-end learning manner. The graph is unique for different layers and samples, which greatly increases the flexibility of the model. Meanwhile, it is designed as a residual branch.

Figure 5. Illustration of the overall architecture of the 2s-AGCN. The scores of the two streams are added to obtain the final prediction.
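Fig. 5 and the paragraph above describe the adaptive layer in terms of a fixed physical graph A_k, a freely learned graph B_k, a data-dependent graph C_k obtained with a softmax over embedded joint similarities, and a 1x1 residual branch. The PyTorch sketch below is our reconstruction under those assumptions, not the authors' released code; the class name, embedding width, and initialization are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    """Sketch of an adaptive graph convolution over skeleton joints.

    For each subset k, the effective adjacency is A_k + B_k + C_k:
    A_k is the fixed physical graph, B_k is a freely learned matrix,
    and C_k is a data-dependent similarity map produced by two 1x1
    embeddings followed by a softmax, as suggested by Fig. 5.
    """

    def __init__(self, in_ch, out_ch, A, embed_ch=16):
        super().__init__()
        k, n, _ = A.shape                          # K subsets, N joints
        self.register_buffer("A", A)               # fixed physical connections
        self.B = nn.Parameter(torch.zeros(k, n, n))  # learned graph offsets
        self.theta = nn.ModuleList([nn.Conv2d(in_ch, embed_ch, 1) for _ in range(k)])
        self.phi = nn.ModuleList([nn.Conv2d(in_ch, embed_ch, 1) for _ in range(k)])
        self.out = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1) for _ in range(k)])
        # residual branch: identity, or a 1x1 conv when the channel count changes
        self.res = nn.Identity() if in_ch == out_ch else nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):                          # x: (batch, C_in, T, N)
        b, _, t, n = x.shape
        y = 0
        for k in range(len(self.out)):
            q = self.theta[k](x).permute(0, 3, 1, 2).reshape(b, n, -1)  # (b, N, C_e*T)
            p = self.phi[k](x).reshape(b, -1, n)                        # (b, C_e*T, N)
            C_k = F.softmax(torch.bmm(q, p), dim=-1)                    # data-dependent graph
            adj = self.A[k] + self.B[k] + C_k                           # (b, N, N)
            # aggregate joint features along the combined graph
            y = y + self.out[k](torch.einsum("bctn,bnm->bctm", x, adj))
        return F.relu(y + self.res(x))             # residual connection, then ReLU

# toy usage: 3 subsets, 25 joints (NTU-RGBD), identity placeholder graphs
A = torch.eye(25).repeat(3, 1, 1)
layer = AdaptiveGraphConv(3, 64, A)
print(layer(torch.randn(4, 3, 20, 25)).shape)      # torch.Size([4, 64, 20, 25])
```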
Since the NTU-RGBD dataset is smaller than the Kinetics-Skeleton dataset, we perform exhaustive ablation studies on it to examine the contributions of the proposed model components based on the recognition performance. Then, the final model is evaluated on both datasets to verify its generality and is compared with other state-of-the-art approaches. The definitions of joints and their natural connections in the two datasets are shown in Fig. 6.
Figure 6. The left sketch shows the joint labels of the Kinetics-Skeleton dataset and the right sketch shows the joint labels of the NTU-RGBD dataset.

5.1. Datasets

NTU-RGBD: NTU-RGBD [27] is currently the largest and most widely used indoor-captured action recognition dataset, which contains 56,000 action clips in 60 action classes. The clips are performed by 40 volunteers in different age groups ranging from 10 to 35. Each action is captured by 3 cameras at the same height but from different horizontal angles: -45°, 0°, 45°. The dataset provides the 3D joint locations of each frame detected by Kinect depth sensors. There are 25 joints for each subject in the skeleton sequences, and each video has no more than 2 subjects. The original paper [27] of the dataset recommends two benchmarks. 1) Cross-subject (X-Sub): the dataset in this benchmark is divided into a training set (40,320 videos) and a validation set (16,560 videos), where the actors in the two subsets are different. 2) Cross-view (X-View): the training set in this benchmark contains 37,920 videos that are captured by cameras 2 and 3, and the validation set contains 18,960 videos that are captured by camera 1. We follow this convention and report the top-1 accuracy on both benchmarks.

Kinetics-Skeleton: Kinetics [12] is a large-scale human action dataset that contains 300,000 video clips in 400 classes. The clips are sourced from YouTube videos and show great variety. Kinetics provides only raw video clips without skeleton data. [32] estimate the locations of 18 joints on every frame of the clips using the publicly available OpenPose toolbox [4]; for multi-person clips, two people are selected based on the average joint confidence. We use their released data (Kinetics-Skeleton) to evaluate our model. The dataset is divided into a training set (240,000 clips) and a validation set (20,000 clips). Following the evaluation method in [32], we train the models on the training set and report the top-1 and top-5 accuracies on the validation set.
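The two benchmarks differ only in how samples are assigned to the training and validation sets: by performer for X-Sub and by camera for X-View. Below is a minimal sketch of that assignment, assuming the common NTU-RGBD file-name convention (e.g., S001C002P003R002A013, where C encodes the camera and P the performer); the X-Sub training-subject list is a hypothetical placeholder, and the real list is published with the dataset [27]:

```python
import re

# Hypothetical placeholder; the real X-Sub training-subject IDs come from [27].
XSUB_TRAIN_SUBJECTS = {1, 2, 4, 5, 8}
XVIEW_TRAIN_CAMERAS = {2, 3}               # cameras 2 and 3 train, camera 1 validates

def assign_split(sample_name: str, benchmark: str) -> str:
    """Return 'train' or 'val' for one NTU-RGBD sample name."""
    camera = int(re.search(r"C(\d{3})", sample_name).group(1))
    subject = int(re.search(r"P(\d{3})", sample_name).group(1))
    if benchmark == "xsub":
        return "train" if subject in XSUB_TRAIN_SUBJECTS else "val"
    return "train" if camera in XVIEW_TRAIN_CAMERAS else "val"

print(assign_split("S001C002P003R002A013", "xview"))   # -> 'train' (camera 2)
```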
5.2. Training details

All experiments are conducted on the PyTorch deep learning framework [26]. Stochastic gradient descent (SGD) with Nesterov momentum (0.9) is applied as the optimization strategy. The batch size is 64. Cross-entropy is selected as the loss function to backpropagate gradients. The weight decay is set to 0.0001.
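This optimization setup maps directly onto standard PyTorch APIs. A minimal sketch follows; the Linear module is only a stand-in for the actual AGCN model, and the MultiStepLR milestones anticipate the NTU-RGBD schedule given in the next paragraph:

```python
import torch
import torch.nn as nn

model = nn.Linear(75, 60)  # stand-in; the real model is the 2s-AGCN network

# SGD with Nesterov momentum (0.9) and weight decay 0.0001, as stated above
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()  # cross-entropy classification loss

# NTU-RGBD schedule (next paragraph): start at 0.1, divide by 10
# at epochs 30 and 40, and stop after epoch 50
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 40], gamma=0.1)
```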
For the NTU-RGBD dataset, there are at most two people in each sample. If the number of bodies in a sample is less than 2, we pad the second body with zeros. The maximum number of frames in each sample is 300; for samples with fewer than 300 frames, we repeat the sequence until it reaches 300 frames. The learning rate is set to 0.1 and is divided by 10 at the 30th and 40th epochs. The training process ends at the 50th epoch.
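A sketch of the NTU-RGBD sample preparation just described, assuming skeletons stored as (channels, frames, joints, bodies) arrays; the function names and the shape convention are ours:

```python
import numpy as np

MAX_FRAMES, MAX_BODIES = 300, 2

def pad_bodies(sample):
    """Pad a missing second body with zeros. sample: (C, T, V, M)."""
    c, t, v, m = sample.shape
    if m < MAX_BODIES:
        pad = np.zeros((c, t, v, MAX_BODIES - m), dtype=sample.dtype)
        sample = np.concatenate([sample, pad], axis=-1)
    return sample

def repeat_frames(sample):
    """Repeat the sequence until it reaches 300 frames, then truncate."""
    c, t, v, m = sample.shape
    reps = int(np.ceil(MAX_FRAMES / t))
    return np.tile(sample, (1, reps, 1, 1))[:, :MAX_FRAMES]

x = np.random.randn(3, 120, 25, 1)       # 120-frame, single-person clip
x = repeat_frames(pad_bodies(x))         # -> (3, 300, 25, 2)
```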
For the Kinetics-Skeleton dataset, the size of the input tensor is set the same as in [32], which contains 150 frames with 2 bodies in each frame. We perform the same data-augmentation methods as in [32]: in detail, we randomly choose 150 frames from the input skeleton sequence and slightly disturb the joint coordinates with randomly chosen rotations and translations. The learning rate
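A sketch of this augmentation: choose 150 frames at random and slightly disturb the coordinates with a random rotation and translation. The perturbation magnitudes and the contiguous-window choice are our assumptions; the paper does not specify them:

```python
import numpy as np

def augment(sample, out_frames=150, max_angle=0.3, max_shift=0.1):
    """Kinetics-Skeleton augmentation sketch. sample: (C, T, V, M),
    with the 2D x, y coordinates in the first two channels and T >= 150."""
    c, t, v, m = sample.shape
    # randomly choose a 150-frame window from the input sequence
    start = np.random.randint(0, t - out_frames + 1)
    clip = sample[:, start:start + out_frames].copy()
    # slightly disturb the coordinates: a random 2D rotation ...
    a = np.random.uniform(-max_angle, max_angle)
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    clip[:2] = (rot @ clip[:2].reshape(2, -1)).reshape(2, out_frames, v, m)
    # ... and a random translation shared by all joints
    clip[:2] += np.random.uniform(-max_shift, max_shift, size=(2, 1, 1, 1))
    return clip

x = np.random.randn(3, 300, 18, 2)   # padded Kinetics clip: 18 joints, 2 bodies
print(augment(x).shape)              # -> (3, 150, 18, 2)
```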
5.3. Ablation study

The performance of ST-GCN on the NTU-RGBD dataset is 88.3%. By using the rearranged learning-rate scheduler and the specially designed data-preprocessing methods, it is improved to 92.7%, which is used as the baseline in the experiments. The details are introduced in the supplementary material.

Fig. 7 shows the original adjacency matrix and the corresponding adaptive adjacency matrix learned by our model. It is clear that the learned structure of the graph is more flexible and not constrained to the physical connections of the human body.

Figure 7. Example of the learned adjacency matrix. The left matrix is the original adjacency matrix for the second subset in the NTU-RGBD dataset. The right matrix is an example of the corresponding adaptive adjacency matrix learned by our model.
Table 2. Comparisons of the validation accuracy with different input modalities.
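Table 2 concerns different input modalities (joint and bone features). As a hedged illustration of the second-order (bone) modality that the two-stream framework employs, the sketch below derives bone vectors as differences between each joint and its parent joint and fuses the two streams by adding their class scores, as in Fig. 5; the parent list is a made-up five-joint fragment, not the real skeleton hierarchy:

```python
import numpy as np

# Hypothetical parent of each joint (index into the joint list); -1 marks the root.
PARENTS = [-1, 0, 1, 2, 1]            # made-up 5-joint chain for illustration

def joints_to_bones(joints, parents=PARENTS):
    """Second-order (bone) features: each joint minus its parent joint.
    joints: (C, T, V, M) -> bones of the same shape (the root bone is zero)."""
    bones = np.zeros_like(joints)
    for v, p in enumerate(parents):
        if p >= 0:
            bones[:, :, v] = joints[:, :, v] - joints[:, :, p]
    return bones

def fuse(joint_scores, bone_scores):
    """Two-stream fusion as in Fig. 5: add the class scores of both streams."""
    return joint_scores + bone_scores
```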
5.4. Comparison with the state-of-the-art

We compare the final model with the state-of-the-art skeleton-based action recognition methods on both the NTU-RGBD and Kinetics-Skeleton datasets. The results of the two comparisons are shown in Tab. 3 and Tab. 4, respectively. The methods used for comparison include handcrafted-feature-based methods [31, 8], RNN-based methods [6, 27, 22, 29, 33, 19, 20], CNN-based methods [21, 14, 13, 23, 18, 17] and GCN-based methods [32, 30]. Our model achieves state-of-the-art performance by a large margin on both datasets, which verifies its superiority.

5.5. Conclusion

In this work, we propose a novel adaptive graph convolutional neural network (2s-AGCN) for skeleton-based action recognition. It parameterizes the graph structure of the skeleton data and embeds it into the network, so that the graph is jointly learned and updated with the model. This data-driven approach increases the flexibility of the graph convolutional network and makes it more suitable for the action recognition task. Furthermore, traditional methods always ignore or underestimate the importance of the second-order information of the skeleton data, i.e., the bone information. In this work, we propose a two-stream framework to explicitly employ this type of information, which further enhances the performance. The final model is evaluated on two large-scale action recognition datasets, NTU-RGBD and Kinetics, and achieves state-of-the-art performance on both.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China under Grants 61572500, 61876182 and 61872364, and in part by the State Grid Corporation Science and Technology Project.
References

[1] James Atwood and Don Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
[2] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral Networks and Locally Connected Networks on Graphs. In ICLR, 2014.
[3] C. Cao, C. Lan, Y. Zhang, W. Zeng, H. Lu, and Y. Zhang. Skeleton-Based Action Recognition with Gated Convolutional Neural Networks. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2018.
[4] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
[5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 3844–3852. Curran Associates, Inc., 2016.
[6] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1110–1118, 2015.
[7] David K. Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional Networks on Graphs for Learning Molecular Fingerprints. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2224–2232. Curran Associates, Inc., 2015.
[8] Basura Fernando, Efstratios Gavves, Jose M. Oramas, Amir Ghodrati, and Tinne Tuytelaars. Modeling video evolution for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5378–5387, 2015.
[9] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1025–1035, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[11] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
[12] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, and others. The Kinetics Human Action Video Dataset. arXiv preprint arXiv:1705.06950, 2017.
[13] Qiuhong Ke, Mohammed Bennamoun, Senjian An, Ferdous Ahmed Sohel, and Farid Boussaid. A New Representation of Skeleton Sequences for 3d Action Recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4570–4579, 2017.
[14] Tae Soo Kim and Austin Reiter. Interpretable 3d human action analysis with temporal convolutional networks. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1623–1631, 2017.
[15] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. Neural relational inference for interacting systems. In International Conference on Machine Learning (ICML), 2018.
[16] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907, Sept. 2016.
[17] Bo Li, Yuchao Dai, Xuelian Cheng, Huahui Chen, Yi Lin, and Mingyi He. Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN. In Multimedia & Expo Workshops (ICMEW), 2017 IEEE International Conference on, pages 601–604. IEEE, 2017.
[18] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Skeleton-based action recognition with convolutional neural networks. In Multimedia & Expo Workshops (ICMEW), 2017 IEEE International Conference on, pages 597–600. IEEE, 2017.
[19] Lin Li, Wu Zheng, Zhaoxiang Zhang, Yan Huang, and Liang Wang. Skeleton-Based Relational Modeling for Action Recognition. arXiv:1805.02556 [cs], 2018.
[20] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent neural network (IndRNN): Building a longer and deeper RNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466, 2018.
[21] Hong Liu, Juanhui Tu, and Mengyuan Liu. Two-Stream 3d Convolutional Neural Network for Skeleton-Based Action Recognition. arXiv:1705.08106 [cs], May 2017.
[22] Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. Spatio-Temporal LSTM with Trust Gates for 3d Human Action Recognition. In Computer Vision – ECCV 2016, volume 9907, pages 816–833. Springer International Publishing, Cham, 2016.
[23] Mengyuan Liu, Hong Liu, and Chen Chen. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68:346–362, 2017.
[24] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. CVPR, volume 1, page 3, 2017.
[25] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023, 2016.
[26] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[27] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. NTU RGB+D: A Large Scale Dataset for 3d Human Activity Analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[28] David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.
[29] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data. In AAAI, volume 1, pages 4263–4270, 2017.
[30] Yansong Tang, Yi Tian, Jiwen Lu, Peiyang Li, and Jie Zhou. Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[31] Raviteja Vemulapalli, Felipe Arrate, and Rama Chellappa. Human action recognition by representing 3d skeletons as points in a lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 588–595, 2014.
[32] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In AAAI, 2018.
[33] Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2126, 2017.
[34] Yifan Zhang, Congqi Cao, Jian Cheng, and Hanqing Lu. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Transactions on Multimedia, pages 1–1, 2018.