
Human Action Recognition on Raw Depth Maps

Jacek Trelinski, Bogdan Kwolek†


AGH University of Science and Technology, Mickiewicza 30 Av., 30-059 Kraków, Poland
Email: bkw@agh.edu.pl

This work was supported by the Polish National Science Center (NCN) under research grant 2017/27/B/ST6/01743.
† Corresponding author

Abstract—We propose an effective framework for human action recognition on raw depth maps. We leverage a convolutional autoencoder to extract, on sequences of depth maps, frame-features that are then fed to a 1D-CNN responsible for embedding action features. A Siamese neural network trained on a representative single depth map for each sequence extracts features, which are then processed by a shapelets algorithm to extract action features. These features are then concatenated with features extracted by a BiLSTM with a TimeDistributed wrapper. Given the individual models learned on such features, we perform a selection of a subset of models. We demonstrate experimentally that on the SYSU 3DHOI dataset the proposed algorithm considerably outperforms all recent algorithms, including skeleton-based ones.

Index Terms—3D action recognition, neural networks, shapelets, classifier committee.

I. INTRODUCTION

Human activity recognition (HAR) aims to deliver information on the physical activity of one or more users by detecting simple or complex actions in a real-world setting. Due to its considerable application potential, HAR is spreading into various fields, such as senior care, rehabilitation, surveillance, robotics, driver behavior analysis and smart homes [1]. In the recent decade, HAR based on video analysis using AI has attracted considerable attention [2]. Recognizing human activities from image sequences is a challenging task due to difficulties such as background clutter, changes in scale, occlusions, and changes in viewpoint, lighting and appearance [1].

Among the methods for action recognition based on depth data, methods using skeleton data are the most frequently investigated [3]. Compared with RGB-based or depth-based action sequences, skeleton-based action instances are more semantic. One of their limitations is that little appearance and scene information is provided in skeletal data. Given that general representations of human actions are of a spatio-temporal nature, both the appearance information and the motion dynamics should be taken into account in recognition to better leverage their complementarity. Two-stream models [4] extract the appearance and motion features of action sequences. Because of the relatively short measurement range of depth sensors, the 3D skeleton can only be reliably estimated if the user performs actions at distances of less than about four meters from the depth sensor. Another shortcoming of depth sensors delivering 3D skeleton data is their relatively narrow field of view. In addition, long-range stereovision systems have no support for real-time 3D skeleton estimation and deliver only RGB-D data streams. Despite the need for functionalities that skeleton-based approaches cannot provide, the number of depth map-based approaches to action recognition is very limited [1]. One of the reasons for this is far more difficult feature extraction, i.e. the extraction of features permitting results similar to or better than those attained by skeleton-based methods, in particular on raw depth maps. Despite tremendous progress in improving the quality and resolution of depth maps, present consumer depth cameras still suffer from heavy sensor noise.

Conventional algorithms for activity recognition on depth maps rely on handcrafted features [5]–[7]. This research focuses on designing discriminative hand-crafted features for action description. In a recent work [7], depth motion images (DMI) and Laplacian pyramids as structured multi-scale feature maps have been utilized for human action classification on depth maps. Deep learning is a kind of representation learning in which increasingly abstract feature representations, and compositions of features that reflect a hierarchy of structures in the data, are learned. The learned networks are able to discover high-level features directly from raw data [2]. However, training deep models often requires massive amounts of labeled data to establish a basis for reliable learning of patterns. In this context it is worth noting that most benchmark datasets for the evaluation of 3D human action recognition algorithms, except the recently introduced NTU-RGB+D dataset [8], are not big enough for end-to-end learning of deep models. A classifier committee is a general way of improving the accuracy of learning-based algorithms, especially their generalization ability on small datasets [9], [10].

Some research has highlighted that noteworthy gains in action recognition performance can be achieved by utilizing hand-crafted features in learned models. A method proposed in [11] is based on feature amplification, where an auxiliary, hand-crafted feature is utilized to carry out spatially varying soft-gating on intermediate feature maps. A stratified pooling introduced in [12] permits aggregating frame-level features into a fixed-length video-level descriptor.

This paper is devoted to human action recognition on raw depth maps. We train neural networks that consist of TimeDistributed and LSTM layers to extract class-specific features representing actions. They are concatenated with features common to all classes. Such common features are extracted by (i) a temporal CNN (1D-CNN) that operates on multivariate time-series of features determined by a convolutional autoencoder, and (ii) a shapelets algorithm that operates on Siamese features. On such class-specific features concatenated with the common features, logistic regression classifiers are trained to build a classifier committee. Finally, a classifier selection is executed in order to add to the classifier committee the models that maximize its performance.
II. THE ALGORITHM

Due to the limited number of depth map sequences in the most frequently used benchmark datasets for human action recognition, we employ a multiple classifier system. In our approach to the classification of human actions on raw depth maps, various features are learned in different domains, such as single depth maps, time-series of embedded features, and time-series represented by shapelet features, and the final decision is taken by a classifier committee with selected classifiers. The multi-stream features are processed to extract action features from sequences of depth maps. Action features are extracted by a shapelets algorithm operating on Siamese features, by TimeDistributed and LSTM layers (TD-LSTM), and by a convolutional autoencoder followed by a multi-channel, temporal CNN (1D-CNN). The TD-LSTM action features are class-specific, whereas the 1D-CNN and shapelet action features are common to all classes.
In order to cope with variability in observations as well as limited training data, and in particular to improve model uncertainty, the final decision is made using several models that are simpler but more robust to the specifics of the noisy data sequences. The multiple classifier system consists of softmax logistic regression classifiers. They are learned on the concatenated class-specific features and the features common to all classes. Given the learned individual models, we perform a selection of a subset of models and of the best strategy for aggregating them. The final decision is made by a classifier committee consisting of the selected classifiers.
A. Embedding Action Features Using CAE and 1D-CNN

In order to extract frame-features we implemented a convolutional autoencoder (CAE). After training the CAE, the decoding layers were excluded from the autoencoder. The network trained in such a way has been utilized to extract low-dimensional frame-features. The input depth maps were projected onto two orthogonal 2D Cartesian planes to represent the top as well as the side view of the depth maps. The convolutional autoencoder has been trained on depth maps of size 3×64×64. The size of the depth map embedding is equal to 100. On the training subsets we learned a single CAE for all classes.
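The exact layer configuration of the CAE is not spelled out above, so the following Keras sketch is only a minimal illustration of the described setup: a 64×64×3 input (the projections stacked as channels, the paper's 3×64×64 in channels-last layout), a 100-dimensional bottleneck, and a decoder that is discarded after training. All filter counts and kernel sizes are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolutional autoencoder for frame-features (Sec. II-A).
inp = keras.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2)(x)
code = layers.Dense(100, name="embedding")(layers.Flatten()(x))  # 100-dim frame embedding

# Decoder (discarded after training; it only defines the reconstruction loss).
y = layers.Dense(16 * 16 * 16, activation="relu")(code)
y = layers.Reshape((16, 16, 16))(y)
y = layers.Conv2DTranspose(16, 3, strides=2, activation="relu", padding="same")(y)
y = layers.Conv2DTranspose(32, 3, strides=2, activation="relu", padding="same")(y)
out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(y)

cae = keras.Model(inp, out)
cae.compile(optimizer="adam", loss="mse")
# After training, only the encoder is kept as the frame-feature extractor:
encoder = keras.Model(inp, code)
```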
The CAE-based feature extractor operating on depth map sequences representing human actions produces multivariate time-series. Since depth map sequences differ in length, such variable-length time-series have been interpolated to a common length, set to 64. In multi-channel, temporal CNNs the 1D convolutions are executed in the temporal domain. The time-series of frame-features extracted by the CAE-based feature extractor have been employed to train a multi-channel 1D-CNN. The number of channels at the input of the neural network is equal to 100, i.e. the size of the depth map embedding. After training the discussed neural network, the output of its dense layer has been used to embed the features, which are referred to as 1D-CNN features. The size of the 1D-CNN feature vector is equal to 100.
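Likewise, the internal configuration of the 1D-CNN is not given above; the sketch below is a minimal illustration assuming a plain Conv1D stack. Only the input shape (64 time-steps × 100 channels), the 100-dimensional dense embedding and the use of the dense-layer output as the feature vector follow the text; the resample helper illustrates the interpolation to the common length of 64.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def resample(ts, length=64):
    """Linearly interpolate a (T, 100) CAE time-series to a common length."""
    src = np.linspace(0.0, 1.0, num=ts.shape[0])
    dst = np.linspace(0.0, 1.0, num=length)
    return np.stack([np.interp(dst, src, ts[:, c]) for c in range(ts.shape[1])], axis=1)

num_classes = 12  # SYSU 3DHOI has 12 action classes

inp = keras.Input(shape=(64, 100))                  # 64 time-steps, 100 channels
x = layers.Conv1D(64, 5, activation="relu")(inp)    # temporal (1D) convolutions;
x = layers.MaxPooling1D(2)(x)                       # filter counts are assumptions
x = layers.Conv1D(64, 5, activation="relu")(x)
x = layers.GlobalAveragePooling1D()(x)
emb = layers.Dense(100, activation="relu", name="embedding")(x)  # 1D-CNN features
out = layers.Dense(num_classes, activation="softmax")(emb)

cnn1d = keras.Model(inp, out)
cnn1d.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# After training, the dense-layer output is used as the 100-dim action feature:
extractor = keras.Model(inp, cnn1d.get_layer("embedding").output)
```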
B. Shapelet-based Action Features

In order to compactly represent actions we also learn features on representative depth maps. A Siamese neural network is trained on single depth maps as a representation of whole depth map sequences. The network trained on such a compact representation of depth map sequences is used to extract frame-features on depth map sequences. In contrast to the CAE, it has been trained only on frontal depth maps. In the current implementation, for each sequence from the training subset a middle depth map has been selected as a representation of the whole depth map sequence and then included in a training subset for the Siamese neural network. The Siamese neural network operates on depth maps of size 1×64×128. It consists of 64 Conv2D filters of size 5×5 followed by max-pooling and 32 Conv2D filters of size 5×5 followed by max-pooling, which in turn are followed by a flattening layer and then a dense layer consisting of 128 neurons. It has been trained using the contrastive loss. The neural network trained in such a way was used to extract features on every depth map from a given input sequence. A human action represented by a number of depth maps is thus described by a multivariate time-series of length equal to the number of frames in the sequence and dimension equal to 128. The feature vector was reduced to 32 using recursive feature elimination (RFE) [13].
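The Siamese extractor can be sketched in Keras as follows, using the layer sizes quoted above (64 and 32 Conv2D filters of size 5×5 and a 128-unit dense layer) and a standard contrastive loss; the margin value and the optimizer are assumptions, and the RFE step is omitted.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def make_branch():
    # Shared encoder branch; the paper's 1x64x128 input in channels-last layout.
    inp = keras.Input(shape=(64, 128, 1))
    x = layers.Conv2D(64, 5, activation="relu")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 5, activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    out = layers.Dense(128)(layers.Flatten()(x))  # 128-dim frame-feature
    return keras.Model(inp, out)

branch = make_branch()
xa = keras.Input(shape=(64, 128, 1))
xb = keras.Input(shape=(64, 128, 1))
# Euclidean distance between the two embeddings.
dist = layers.Lambda(
    lambda e: tf.sqrt(tf.reduce_sum(tf.square(e[0] - e[1]), axis=1, keepdims=True))
)([branch(xa), branch(xb)])

def contrastive_loss(y_true, d, margin=1.0):
    # y_true = 1 for same-class pairs, 0 otherwise; the margin is an assumption.
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

siamese = keras.Model([xa, xb], dist)
siamese.compile(optimizer="adam", loss=contrastive_loss)
# After training, `branch` alone is applied to every depth map of a sequence,
# yielding a 128-dim multivariate time-series per action.
```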
A shapelet [14] is a time-series subsequence that is determined to represent local, phase-independent similarity in shape. Shapelets differ from DTW (Dynamic Time Warping) because they focus only on subsequences of the time series. Shapelets are phase independent and are therefore capable of learning representative patterns regardless of the exact time at which they happen. Algorithms based on shapelets have demonstrated competitive classification performance on common benchmarks [15]. Therefore they have been selected to model human actions instead of the DTW as in [16]. A shapelet is discriminant if it is present in most time-series of one class and absent in time-series of other classes. If the distance d(x, s) = min_t ||x_{t→t+L} − s||_2, where L denotes the length of the shapelet s and x_{t→t+L} is the subsequence of time-series x starting at t and ending at t + L, is small, then the shapelet s is supposed to occur in the time-series x. The result of passing a time series through a shapelet algorithm is the minimum distance between the shapelet and all subsequences of the time series. We utilized the implementation [17] in order to learn the shapelets as proposed in [18] and to determine the shapelet-transform representations of time-series. The number of shapelet-based action features was equal to 14.
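Shapelet learning in the sense of [18] and the shapelet transform are provided by tslearn [17]; a usage sketch under assumed hyper-parameters (the text above fixes only the size of the final representation) might look as follows. The data arrays are placeholders.

```python
import numpy as np
from tslearn.shapelets import LearningShapelets, grabocka_params_to_shapelet_size_dict

# X: RFE-reduced Siamese time-series, shape (n_sequences, n_frames, 32),
# assumed resampled to a common length beforehand; y: action labels.
X = np.random.rand(100, 64, 32)          # placeholder data
y = np.random.randint(0, 12, size=100)   # placeholder labels

# Shapelet sizes chosen with the heuristic of Grabocka et al. [18];
# l and r are hyper-parameter assumptions.
sizes = grabocka_params_to_shapelet_size_dict(
    n_ts=X.shape[0], ts_sz=X.shape[1], n_classes=12, l=0.1, r=1)

clf = LearningShapelets(n_shapelets_per_size=sizes, max_iter=200, random_state=0)
clf.fit(X, y)

# Shapelet transform: for every series, the minimal distance d(x, s)
# to each learned shapelet s (the paper keeps 14 such features).
features = clf.transform(X)
```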
C. Embedding Actions Using BiLSTM with TimeDistributed

The neural network consisting of TimeDistributed and BiLSTM layers operates on depth map sequences, where each sample is of size 64×64, across 30 time-steps. The frame batches have size equal to 30 and have been constructed by sampling with replacement. In the first three layers we utilize the TimeDistributed wrapper in order to apply the same Conv2D layer to each of the 30 time-steps, independently. In the last layers we employ 64 BiLSTM units and then 64 global average pooling filters. This means that the neural network delivers feature vectors of size 64. The models have been trained as one-vs-all and the resulting features are called TD-LSTM.
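A Keras sketch of the TD-LSTM network under stated assumptions: the Conv2D filter counts are assumptions, the BiLSTM is realized as a bidirectional LSTM with 2×32 units so that the pooled feature vector has the quoted size of 64, and the one-vs-all training head is shown as a single sigmoid unit.

```python
from tensorflow import keras
from tensorflow.keras import layers

# TD-LSTM extractor (Sec. II-C): 30 time-steps of 64x64 depth frames.
inp = keras.Input(shape=(30, 64, 64, 1))
x = layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu"))(inp)
x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
x = layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu"))(x)
x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
x = layers.TimeDistributed(layers.Conv2D(16, 3, activation="relu"))(x)  # three TD Conv2D layers
x = layers.TimeDistributed(layers.Flatten())(x)
# Bidirectional LSTM (2x32 units) followed by global average pooling over time.
x = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(x)
feat = layers.GlobalAveragePooling1D(name="td_lstm")(x)   # 64-dim TD-LSTM features
out = layers.Dense(1, activation="sigmoid")(feat)         # one-vs-all training head

model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```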
D. Multi-class Classifiers for Classifier Committee

The features outlined in Subsections II-A–II-C have been employed to train multi-class, logistic regression classifiers with softmax encoding. The final decision is made by a classifier committee consisting of selected classifiers. In essence, the final decision is taken using a committee of individual models. One advantage of this approach is its interpretability. Because each class is expressed by one classifier only, it is possible to gain knowledge about the discriminative power of individual classifiers. For each class, the action features that are common to all actions are concatenated with the class-specific features and then used to train the multi-class classifiers.

With the classifiers described previously, we implemented a basic classifier committee. We first built a classifier committee with hard voting. Afterwards, we extended the classifier committee with soft voting, which predicts the action class as the argmax of the sum of the predicted class probabilities. Next, we extended the classifier committee with differential evolution (DE), which was responsible for determining the weights for the voting. The objective function depended on the classification accuracy. The bounds have been set as an N-dimensional hypercube with values between 0.0 and 1.0, where N stands for the number of actions. The optimization has been performed using the implementation from the scipy package.
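The DE-based weighting can be sketched with scipy.optimize.differential_evolution, the implementation referred to above; the member predictions and validation labels below are placeholders.

```python
import numpy as np
from scipy.optimize import differential_evolution

# probs: per-member predicted class probabilities on the validation set,
# shape (n_members, n_samples, n_classes); y_val: ground-truth labels.
n_members, n_samples, n_classes = 12, 200, 12
probs = np.random.dirichlet(np.ones(n_classes), size=(n_members, n_samples))  # placeholder
y_val = np.random.randint(0, n_classes, size=n_samples)                        # placeholder

def neg_accuracy(w):
    # Weighted soft voting: argmax of the weighted sum of probabilities.
    fused = np.tensordot(w, probs, axes=1)           # (n_samples, n_classes)
    return -np.mean(fused.argmax(axis=1) == y_val)   # DE minimizes, so negate accuracy

bounds = [(0.0, 1.0)] * n_members                    # N-dimensional hypercube
result = differential_evolution(neg_accuracy, bounds, seed=0, maxiter=100)
weights = result.x
```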
The problem of selecting a globally optimal sub-committee of classifiers with a powerful generalization capability has been proven to be NP-hard. An effective solution to this problem has been proposed in [19], where each individual model is assigned a weight w_t which is adjusted according to the contribution of the corresponding model to the ensemble. In the discussed approach the weights should satisfy the following two constraints: 0 < w_t < 1 and Σ_{t=1}^{N} w_t = 1, where N is the number of models in the ensemble. The selection of a model depends on its weight, and if it is smaller than a threshold, the model is excluded from the classifier committee.

Let us denote by V the validation subset. Let us assume that the classifier committee consists of N classifiers trained in advance, with weights w = (w_1, ..., w_N). In our case, N is equal to the number of classes in the considered dataset. The correlation between the i-th and the j-th individual classifiers can be determined as follows: C_{ij}^V = (1/|V|) Σ_{x∈V} (f_i(x) − d(x))(f_j(x) − d(x)), where f_i(x) and f_j(x) represent the outputs on x obtained by the i-th and the j-th classifiers, respectively, and d(x) is the desired output. The generalization error of the N classifiers can be determined as follows: E_w^V = Σ_{t1=1}^{N} Σ_{t2=1}^{N} w_{t1} w_{t2} C_{t1 t2}^V. The fitness function assumes the following form: f(w) = 1/E_w^V. In this work it has been optimized using a genetic algorithm (GA).
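A small numpy sketch of the fitness computation defined by the formulas above, assuming the classifier outputs f_i(x) and desired outputs d(x) are given as arrays over the validation subset; the GA wrapper itself is omitted, and the example data are placeholders.

```python
import numpy as np

def ensemble_fitness(w, F, d):
    """Fitness f(w) = 1 / E_w^V for committee weights w.

    F: classifier outputs on the validation set, shape (N, n_samples);
    d: desired outputs, shape (n_samples,).
    """
    errors = F - d                      # f_i(x) - d(x) for every classifier i
    C = errors @ errors.T / F.shape[1]  # correlation matrix C_ij^V
    E = w @ C @ w                       # E_w^V = sum_t1 sum_t2 w_t1 w_t2 C_t1t2^V
    return 1.0 / E

# Example with placeholder outputs for N = 12 committee members:
rng = np.random.default_rng(0)
F = rng.random((12, 200))
d = rng.random(200)
w = np.full(12, 1.0 / 12)               # weights satisfy 0 < w_t < 1, sum w_t = 1
print(ensemble_fitness(w, F, d))
```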
III. EXPERIMENTAL RESULTS

The proposed algorithm has been evaluated on the publicly available SYSU 3D Human-Object Interaction Set (SYSU 3DHOI) [20]. The dataset comprises 480 sequences of 12 action classes, including playing with a cell phone, calling with a cell phone, drinking, pouring, sitting on a chair, moving a chair, wearing a backpack, packing a backpack, sweeping, mopping, taking something out from the wallet and taking out a wallet. Actions were performed by 40 subjects and each activity is a kind of human-object interaction. The dataset is challenging for human action recognition because a number of actions contain similar motions or an identical manipulated object at the early temporal stages. For evaluation, we utilized the cross-subject setting that is frequently used in RGB-D action recognition. In the discussed setting, there are thirty training/testing splits, which are recommended by [20]. The algorithm has also been evaluated in setting-1 [20], in which for each activity class we selected half of the samples for training and the rest for testing. Because in the SYSU 3DHOI dataset the performers are not extracted from the background, the subjects have been extracted by us. For each depth map we determined a window surrounding the performer, which has then been scaled to the required input shape.

Table I shows results that were achieved on the SYSU 3DHOI dataset in the setting-1 evaluation protocol. The first row contains results that were achieved by the committee consisting of classifiers operating on one-vs-all features (LSTM-based) concatenated with features embedded by the CAE and 1D-CNN, and concatenated with shapelet features. The discussed results have been achieved by the logistic regression classifiers. As we can observe, the soft voting algorithm that uses all classifiers gives worse results, cf. the results in the second row. The selection of a subset of the logistic regression-based classifiers leads to superior results. The DE-based results have been achieved using ten classifiers, whereas the best results have been obtained using only two classifiers selected from twelve classifiers.

TABLE I: Recognition performance on SYSU 3DHOI dataset (setting-1).

voting        Accuracy   Precision   Recall   F1-score
hard voting   0.9167     0.9217      0.9167   0.9171
soft voting   0.9079     0.9102      0.9079   0.9071
diff. evol.   0.9079     0.9110      0.9079   0.9073
class. sel.   0.9254     0.9271      0.9254   0.9246

Table II shows results that were achieved on the 3DHOI dataset in the cross-subject setting. The soft voting with all classifiers permits improving the base classification performance, cf. the results in the second row. The results achieved by soft voting with weights determined by DE are better than the results achieved by the soft voting. All weights were higher than zero, i.e. all classifiers were utilized, whereas the best results, which are shown in the last row, have been obtained using six classifiers.

TABLE II: Recognition performance on SYSU 3DHOI dataset (cross-subject setting).

voting        Accuracy   Precision   Recall   F1-score
hard voting   0.8991     0.9079      0.8991   0.8990
soft voting   0.9035     0.9098      0.9035   0.9036
diff. evol.   0.9123     0.9175      0.9123   0.9119
class. sel.   0.9211     0.9259      0.9211   0.9209

Table III presents results achieved by recent algorithms on the 3DHOI dataset in comparison to the results achieved by our algorithm. As we can observe, our algorithm achieves superior results on this challenging dataset. The results are better than the results achieved in our previous work [16], in which we performed evaluations according to setting-2. It is worth noting that the method of [21] relies on depth and skeleton modalities, whereas [22] additionally utilizes RGB images jointly with the skeleton data. To the best of our knowledge, on the SYSU 3DHOI dataset the best classification accuracy among skeleton-based algorithms is achieved by the recently published SGN algorithm [23]. The proposed algorithm outperforms it by a large margin in terms of classification accuracy.

TABLE III: Comparative recognition performance of the proposed method with recent algorithms on 3DHOI dataset.

Method                   Modality          Setting   Acc. [%]
LGN [24]                 skel.             II        83.33
SGN [23]                 skel.             II        86.90
MSRNN [22]               depth+RGB+skel.   II        79.58
LAFF [25]                depth+RGB         II        80.00
PTS [21]                 depth+skeleton    II        87.92
bidirect. rank p. [26]   depth             I         76.25
bidirect. rank p. [26]   depth             II        75.83
deep emb. feat. [16]     depth             II        90.35
Proposed method          depth             I         92.54
Proposed method          depth             II        92.11

We demonstrated experimentally that on the challenging SYSU 3DHOI dataset the proposed algorithm attains promising results and outperforms recent state-of-the-art depth-based algorithms [21]–[25]. It is worth noting that our algorithm outperforms the most recent skeleton-based methods, which usually achieve better results than methods based on depth maps only. The algorithm has been implemented in Python using Keras. The source code is freely available at the following URL: https://github.com/tjacek/VCIP.

IV. CONCLUSIONS

In this paper we presented an effective framework for human action recognition on raw depth maps. The selection of a subset of the logistic regression-based classifiers permits achieving improvements in classification accuracy. The base classifiers have been trained on shapelet features that have been concatenated with deep embeddings. We demonstrated experimentally that on the SYSU 3DHOI benchmark dataset the proposed classifier committee with selected models outperforms all recent algorithms by a large margin.

REFERENCES

[1] L. Wang, D. Q. Huynh, and P. Koniusz, "A comparative review of recent Kinect-based action recognition algorithms," IEEE Trans. Image Process., vol. 29, pp. 15–28, 2020.
[2] S. Majumder and N. Kehtarnavaz, "Vision and inertial sensing fusion for human action recognition: A review," IEEE Sensors J., vol. 21, no. 3, pp. 2454–2467, 2021.
[3] B. Ren, M. Liu, R. Ding, and H. Liu, "A survey on 3D skeleton-based action recognition using learning method," arXiv:2002.05907, 2020.
[4] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in CVPR, 2016, pp. 1933–1941.
[5] X. Yang, C. Zhang, and Y. L. Tian, "Recognizing actions using depth motion maps-based histograms of oriented gradients," in Proc. of the 20th ACM Int. Conf. on Multimedia, 2012, pp. 1057–1060.
[6] L. Xia and J. Aggarwal, "Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera," in CVPR, 2013, pp. 2834–2841.
[7] C. Li, Q. Huang, X. Li, and Q. Wu, "A multi-scale human action recognition method based on Laplacian pyramid depth motion images," in Proc. of the 2nd ACM Int. Conf. on Multimedia in Asia, 2021.
[8] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot, "NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding," PAMI, vol. 42, no. 10, pp. 2684–2701, 2020.
[9] R. Polikar, "Ensemble based systems in decision making," IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21–45, 2006.
[10] L. I. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy," Mach. Learn., vol. 51, no. 2, pp. 181–207, 2003.
[11] E. Park, X. Han, T. L. Berg, and A. C. Berg, "Combining multiple sources of knowledge in deep CNNs for action recognition," in IEEE Winter Conf. on Appl. of Comp. Vision (WACV), 2016, pp. 1–8.
[12] S. Yu, Y. Cheng, S. Su, G. Cai, and S. Li, "Stratified pooling based deep convolutional neural networks for human action recognition," Multimedia Tools and Appl., vol. 76, no. 11, pp. 13367–13382, 2016.
[13] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using Support Vector Machines," Mach. Learn., vol. 46, no. 1–3, pp. 389–422, 2002.
[14] L. Ye and E. Keogh, "Time series shapelets: A novel technique that allows accurate, interpretable and fast classification," Data Min. Knowl. Discov., vol. 22, no. 1–2, pp. 149–182, 2011.
[15] A. Dau, A. Bagnall, K. Kamgar, C.-C. M. Yeh, Y. Zhu, S. Gharghabi, C. Ratanamahatana, and E. Keogh, "The UCR time series archive," IEEE/CAA J. of Automatica Sinica, vol. 6, pp. 1293–1305, 2019.
[16] J. Trelinski and B. Kwolek, "Deep embedding features for action recognition on raw depth maps," in ICCS, 2021, pp. 95–108.
[17] R. Tavenard, J. Faouzi, G. Vandewiele, F. Divo, G. Androz, C. Holtz, M. Payne, R. Yurchak, M. Rußwurm, K. Kolar, and E. Woods, "Tslearn, a machine learning toolkit for time series data," J. of Machine Learning Research, vol. 21, no. 118, pp. 1–6, 2020.
[18] J. Grabocka, N. Schilling, M. Wistuba, and L. Schmidt-Thieme, "Learning time-series shapelets," in KDD, 2014, pp. 392–401.
[19] Z.-H. Zhou, J.-X. Wu, Y. Jiang, and S.-F. Chen, "Genetic algorithm based selective neural network ensemble," in IJCAI, 2001, pp. 797–802.
[20] J. Hu, W. Zheng, J. Lai, and J. Zhang, "Jointly learning heterogeneous features for RGB-D activity recognition," in CVPR, 2015, pp. 5344–5352.
[21] X. Wang, J.-F. Hu, J.-H. Lai, J. Zhang, and W.-S. Zheng, "Progressive teacher-student learning for early action prediction," in CVPR, 2019, pp. 3551–3560.
[22] J. Hu, W. Zheng, L. Ma, G. Wang, J. Lai, and J. Zhang, "Early action prediction by soft regression," PAMI, vol. 41, no. 11, pp. 2568–2583, 2019.
[23] P. Zhang, C. Lan, W. Zeng, J. Xing, J. Xue, and N. Zheng, "Semantics-guided neural networks for efficient skeleton-based human action recognition," in CVPR, 2020, pp. 1109–1118.
[24] Q. Ke, M. Bennamoun, H. Rahmani, S. An, F. Sohel, and F. Boussaid, "Learning latent global network for skeleton-based action prediction," IEEE Trans. Img. Proc., vol. 29, pp. 959–970, 2020.
[25] J.-F. Hu, W.-S. Zheng, L. Ma, G. Wang, and J. Lai, "Real-time RGB-D activity prediction by soft regression," in ECCV, 2016, pp. 280–296.
[26] Z. Ren, Q. Zhang, X. Gao, P. Hao, and J. Cheng, "Multi-modality learning for human action recognition," Multimedia Tools and Appl., vol. 80, no. 11, pp. 16185–16203, 2021.
