1 Introduction
The aim of Human Action Recognition (HAR) is to automatically recognize what kind of action is performed in an image or depth map sequence. This is a hard problem due to the many challenges involved in HAR, including variation in human shape, differences in human motion, self-occlusions, cluttered backgrounds, varying illumination conditions, and viewpoint variations [1]. There are also many different ways in which a given action can be performed. Research on action recognition has largely focused on processing conventional RGB image sequences and extracting handcrafted features from them. Compared to RGB image sequences, depth maps deliver range information and are less sensitive to varying illumination conditions. However, most current approaches to action
2 Relevant Work
3 The Method
Given that in currently available datasets for depth-based action recognition the amount of data (depth map sequences) is insufficient to train deep models with good generalization capabilities, we propose to use CNNs operating on single depth maps (pairs of consecutive frames) to extract informative features. The main difference between our approach and others is that instead of training CNNs/LSTMs on depth map sequences, we train a set of CNNs on single depth maps to extract a considerable number of features. In the proposed approach, a CNN is trained for each class to extract class-specific features. The class-specific models trained for action classification are used as feature extractors by removing the softmax output layer (which outputs class scores). A separate CNN is trained for each action class to distinguish between this class and all remaining classes, as in the one-vs-all approach to multi-class classification. This means that each action-specific CNN is trained to predict whether the considered depth map belongs to the class for which the CNN is trained or to one of the remaining classes. Each CNN is trained on single depth maps belonging to the considered class and on depth maps sampled from the remaining classes. Since the number of single depth maps across all depth map sequences is considerable, it is possible to train the CNNs without overfitting and with good generalization capabilities. In our approach the features are extracted using the outputs of the penultimate layer, which consists of one hundred neurons. This means that the size of the feature vector extracted by each action-specific CNN is equal to one hundred, and the number of 100-element vectors is equal to the number of actions. Since the number of frames differs across the depth map sequences representing the actions, the lengths of the resulting multivariate time-series are usually not identical.
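To make the scheme concrete, the following is a minimal sketch (not the authors' code) of the one-vs-all relabeling and per-frame feature extraction described above; `train`, `penultimate` and the data arrays are hypothetical placeholders for the CNN defined in the next paragraph.

```python
# Sketch of the one-vs-all setup: each action class gets its own binary CNN,
# and the 100-dim penultimate-layer outputs of all CNNs are concatenated.
import numpy as np

def one_vs_all_labels(y, positive_class):
    # 1 for depth maps of the considered class, 0 for depth maps
    # sampled from all remaining classes
    return (y == positive_class).astype(np.int64)

def extract_frame_features(cnns, frames):
    # frames: (n_frames, 4, 64, 64); result: (n_frames, 100 * n_classes),
    # assuming each cnn exposes its penultimate layer (see sketch below)
    return np.concatenate([cnn.penultimate(frames) for cnn in cnns], axis=1)
```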
In the proposed approach the input depth maps have size 64 × 64 pixels. The input of the convolutional neural network is a 4 × 64 × 64 tensor consisting of two consecutive depth maps and the orthogonal projections of the input depth map onto the xz and yz planes [16]. This means that aside from the frontal depth maps we also employ side-view and top projections of the depth maps. The convolutional layer C1 consists of sixteen 5 × 5 convolutional filters followed by a subsampling layer. The next convolutional layer C2 operates on sixteen feature maps of size 15 × 15. It consists of sixteen 5 × 5 convolutional filters followed by a subsampling layer, and outputs sixteen feature maps of size 2 × 2. The next, fully connected layer FC consists of one hundred neurons. At the learning stage, the output of the CNN is a softmax layer with two neurons. Every action-specific network is trained on depth maps from the training parts of the depth map sequences. After training, the layers before the softmax are employed to extract shape features.
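A PyTorch sketch of this architecture is given below. The text does not specify the activation functions or the type of subsampling, so ReLU and 4 × 4 max-pooling (which reproduce the stated 15 × 15 and 2 × 2 feature-map sizes) are our assumptions.

```python
# Sketch of one action-specific CNN, under the assumptions stated above.
import torch
import torch.nn as nn

class ActionSpecificCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5),   # C1: 4x64x64 -> 16x60x60
            nn.ReLU(),
            nn.MaxPool2d(4),                   # subsampling: -> 16x15x15
            nn.Conv2d(16, 16, kernel_size=5),  # C2: 16x15x15 -> 16x11x11
            nn.ReLU(),
            nn.MaxPool2d(4),                   # subsampling: -> 16x2x2
        )
        self.fc = nn.Linear(16 * 2 * 2, 100)   # FC: 100-neuron penultimate layer
        self.out = nn.Linear(100, 2)           # binary head: this class vs. rest

    def forward(self, x):
        h = torch.relu(self.fc(self.features(x).flatten(1)))
        return self.out(h)                     # softmax applied in the loss

    def penultimate(self, x):
        # output of the layer before the softmax: the 100-dim shape feature
        return self.fc(self.features(x).flatten(1))
```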
The actions are represented by multivariate time-series of CNN-based features. For each time-series composed of class-specific CNN-based features we calculate three statistical features: average, standard deviation and skewness. The motivation for using skewness is to include a parameter describing the asymmetry of a random variable's probability distribution with respect to a normal distribution. This means that each action is described by a 3 × 100 = 300-element vector that statistically describes the multivariate time-series of CNN-based features.
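As a sketch, the 300-element descriptor can be computed as follows, where ts is the (n_frames, 100) series of features produced by one class-specific CNN:

```python
# Sketch: statistical summary of a multivariate time-series of CNN features.
import numpy as np
from scipy.stats import skew

def summarize_cnn_series(ts):
    # ts: (n_frames, 100) -> 300-element vector
    # (mean, standard deviation, skewness per feature dimension)
    return np.concatenate([ts.mean(axis=0), ts.std(axis=0), skew(ts, axis=0)])
```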
Aside from the CNN-based features, for each frame we calculate handcrafted features. First, the depth map is projected onto two orthogonal planes to determine the side-view and top projections of the depth map. Given that pixels with non-zero values represent the performer on the depth maps, only pixels with non-zero values are used in the calculation of the handcrafted features. The following features are calculated on these three depth maps: area ratio (a single value for axes x, y, z), standard deviation (axes x, y, z), skewness (axes x, y, z), and correlation (xy, xz and zy axes). This means that the number of handcrafted features describing a single depth map is equal to ten. Every realization of an action is thus described by a multivariate time-series whose length equals the number of frames and whose dimension equals ten.
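A sketch of these ten per-frame features is shown below. The text does not define "area ratio" precisely, so the fraction of non-zero (performer) pixels is our assumption; here the x and y coordinates of the non-zero pixels and their depth values play the role of the three axes.

```python
# Sketch of the ten handcrafted features for one depth map, under the
# assumptions stated in the lead-in above.
import numpy as np
from scipy.stats import skew

def handcrafted_features(depth_map):
    ys, xs = np.nonzero(depth_map)            # pixel coordinates of the performer
    zs = depth_map[ys, xs].astype(float)      # depth values at those pixels
    area_ratio = len(zs) / depth_map.size     # assumed: share of non-zero pixels
    stds = [xs.std(), ys.std(), zs.std()]     # standard deviation, axes x, y, z
    skews = [skew(xs), skew(ys), skew(zs)]    # skewness, axes x, y, z
    corrs = [np.corrcoef(xs, ys)[0, 1],       # correlation xy
             np.corrcoef(xs, zs)[0, 1],       # correlation xz
             np.corrcoef(zs, ys)[0, 1]]       # correlation zy
    return np.array([area_ratio, *stds, *skews, *corrs])  # 10 values
```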
For each action realization, represented by a time-series of handcrafted features, the following statistical features are calculated (a sketch follows the list):
1. average
2. standard deviation
3. skewness
4. number of local minima in the time-series
5. number of local maxima in the time-series
6. location of the global maximum in the time-series (as a fraction of the time-series length)
7. location of the global minimum in the time-series (as a fraction of the time-series length)
8. mean change $\frac{1}{n-1}\sum_{i=1}^{n-1}\frac{|f_{i+1}-f_i|}{f_i}$, where $f_i$ denotes the feature value in frame $i$ and $n$ is the number of frames
9. difference between the average and the median
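The sketch below computes these nine statistics for one univariate series f (a numpy array of one feature's values over frames). The strict-inequality test for local extrema and the normalization of extrema locations by the series length are our assumptions.

```python
# Sketch of the nine per-series statistics, applied to one feature dimension.
import numpy as np
from scipy.stats import skew
from scipy.signal import argrelextrema

def series_statistics(f):
    n = len(f)
    n_minima = len(argrelextrema(f, np.less)[0])     # 4. local minima count
    n_maxima = len(argrelextrema(f, np.greater)[0])  # 5. local maxima count
    # 8. mean change: (1/(n-1)) * sum_i |f_{i+1} - f_i| / f_i
    # (assumes non-zero feature values)
    mean_change = np.mean(np.abs(np.diff(f)) / f[:-1])
    return np.array([
        f.mean(),                 # 1. average
        f.std(),                  # 2. standard deviation
        skew(f),                  # 3. skewness
        n_minima,                 # 4.
        n_maxima,                 # 5.
        np.argmax(f) / n,         # 6. global-maximum location (fraction of length)
        np.argmin(f) / n,         # 7. global-minimum location (fraction of length)
        mean_change,              # 8.
        f.mean() - np.median(f),  # 9. difference between average and median
    ])
```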
The features mentioned above were grouped into the following feature collections: I - (1, 2, 3), II - (4, 5), III - (6, 7), IV - (8), V - (9). In order to determine the discriminative power of the handcrafted features, we evaluated action classification using the following feature sets: I, I+II+III, I+II+III+IV, I+II+III+V and I+II+III+IV+V. For these feature sets the number of handcrafted features was equal to 30, 70, 80, 80 and 90, respectively. The CNN-based features, which are action-specific, and the handcrafted features, which are common to all actions, were concatenated, resulting in action feature vectors. This means that the class-specific feature vectors of size 300 extracted by the CNNs were concatenated with handcrafted feature vectors whose size depends on the feature set used. More details are given in Sect. 4.
For each action feature vector we train a multi-class classifier with one-hot encoding of the output labels. The action is predicted by a voting-based ensemble operating on these one-hot encoded outputs. Thanks to training a separate CNN for each class, the features extracted by the CNNs are uncorrelated and thus provide sufficient diversity for the ensemble. Moreover, the handcrafted features describing individual depth maps are uncorrelated with the CNN-based features and thus introduce further diversity into the ensemble.
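A minimal sketch of this voting ensemble is given below, assuming multi-class logistic regression base classifiers (as used in the experiments) and integer class labels 0..K-1; both choices are assumptions about details not fixed in this paragraph.

```python
# Sketch: one multi-class classifier per class-specific feature vector,
# with one-hot votes accumulated over the ensemble.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ensemble(feature_sets, y):
    # feature_sets: list of (n_samples, n_features) arrays, one per
    # action-specific CNN (each concatenated with handcrafted features)
    return [LogisticRegression(max_iter=1000).fit(X, y) for X in feature_sets]

def predict_ensemble(models, feature_sets):
    # Sum the one-hot (argmax) votes of the individual classifiers;
    # assumes labels are encoded as integers 0..K-1.
    votes = sum(np.eye(len(m.classes_))[m.predict(X)]
                for m, X in zip(models, feature_sets))
    return votes.argmax(axis=1)
```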
Table 2 shows the recognition performance obtained by the ensemble on the UTD-MHAD dataset. The class-specific CNN-based features were concatenated with the common handcrafted features and then used to train class-specific multi-class logistic regression classifiers. In a second variant, the Recursive Feature Elimination (RFE) algorithm was executed to select the most informative CNN-based features, which were then concatenated with the handcrafted features and used to train the class-specific multi-class logistic regression classifiers. The first row presents the results achieved by the ensemble using only CNN-based features. The next rows contain results achieved using CNN-based features concatenated with handcrafted features. The next two rows present results achieved using the I feature set concatenated with the CNN-based features. The following groups of two rows show the results obtained with the I+II+III, I+II+III+IV, I+II+III+V and I+II+III+IV+V feature sets, respectively. In each group of two rows, the first row shows results without selection of the most informative features, whereas the second row shows results achieved using the most informative features concatenated with the handcrafted features. The number of features utilized by the classifiers is shown in the third column. The values in the remaining columns are the accuracies, precisions, recalls and F1-scores obtained with the feature sets discussed above. As can be observed, the best results were obtained using both handcrafted and deep features together with feature selection by RFE. In general, the use of RFE leads to better results for a given feature set. The resulting feature vectors are about half the length of feature vectors not processed by RFE, which in turn leads to better generalization. Owing to the use of handcrafted features, the classification accuracy improved from 61.6% to 83%. On the other hand, the best classification accuracy achieved using only handcrafted features was 77.4%, see also Table 1.
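A sketch of this selection step with scikit-learn's RFE is shown below; the 0.5 keep ratio merely mirrors the observation above that the selected vectors are about half the original length, and is not a documented parameter of the method.

```python
# Sketch: RFE on the CNN-based features before concatenation with the
# handcrafted ones.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def select_cnn_features(X_cnn, y, keep_ratio=0.5):
    selector = RFE(LogisticRegression(max_iter=1000),
                   n_features_to_select=int(X_cnn.shape[1] * keep_ratio))
    return selector.fit(X_cnn, y)

# Hypothetical usage: keep the selected CNN features, then concatenate
# them with the handcrafted feature vectors.
# sel = select_cnn_features(X_cnn, y)
# X = np.hstack([sel.transform(X_cnn), X_handcrafted])
```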
Figure 1 depicts the confusion matrix determined from the results achieved by the ensemble on the UTD-MHAD dataset.
Table 3 compares the recognition performance of the proposed method with state-of-the-art methods. Most current methods for action recognition on the UTD-MHAD dataset are based on skeleton data. It is worth noting that methods based on the skeleton modality usually achieve better results than methods relying on depth data only. Although our method is based on the depth modality, we cite recent skeleton-based methods to show that it outperforms many of them. Methods based on depth data have a wide range of applications, since not all sensors or cameras delivering the depth modality support skeleton extraction. Our method is considerably better than the WHDMM+3DConvNets method, which employs weighted hierarchical depth motion maps (WHDMMs) and three 3D ConvNets. The WHDMMs are employed at several temporal scales to encode the spatiotemporal motion patterns of actions into 2D spatial structures. To provide a sufficient amount of training data, the 3D points are rotated and then used to synthesize new exemplars. In contrast, our algorithm operates on class-specific CNN features concatenated with handcrafted features. The improved performance of our method may suggest that it has better viewpoint tolerance than depth-based algorithms, including [15].
5 Conclusions
In this work a method for action recognition on depth map sequences has been proposed. Due to the considerable amount of noise in depth maps, which prevents applying local differential operators, the number of sequential approaches based on depth maps is limited, and most approaches are skeleton-based. We demonstrated experimentally that the proposed algorithm can achieve superior results in comparison to state-of-the-art algorithms, including recently proposed deep learning-based algorithms. The method has been evaluated on two widely employed benchmark datasets and compared with state-of-the-art methods. We demonstrated experimentally that on the challenging MSR-Action3D and UTD-MHAD datasets the proposed method achieves superior, or at least comparable, results with respect to recent methods.
References
1. Aggarwal, J., Ryoo, M.: Human activity analysis: a review. ACM Comput. Surv. 43(3), 16:1–16:43 (2011)
2. Liang, B., Zheng, L.: A survey on human action recognition using depth sensors. In: International Conference on Digital Image Computing: Techniques and Applications, pp. 1–8 (2015)
3. Aggarwal, J., Xia, L.: Human activity recognition from 3D data: a review. Pattern Recognit. Lett. 48, 70–80 (2014)
¹ https://github.com/tjacek/DeepActionLearning
4. Chen, L., Wei, H., Ferryman, J.: A survey of human motion analysis using depth imagery. Pattern Recognit. Lett. 34(15), 1995–2006 (2013)
5. Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., Gall, J.: A survey on human motion analysis from depth data. In: Grzegorzek, M., Theobalt, C., Koch, R., Kolb, A. (eds.) Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications. LNCS, vol. 8200, pp. 149–187. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-44964-2_8
6. Lo Presti, L., La Cascia, M.: 3D skeleton-based human action classification. Pattern Recognit. 53(C), 130–147 (2016)
7. Xia, L., Chen, C.C., Aggarwal, J.: View invariant human action recognition using histograms of 3D joints. In: CVPR Workshops, pp. 20–27 (2012)
8. Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: IEEE International Conference on Computer Vision and Pattern Recognition - Workshops, pp. 9–14 (2010)
9. Chen, C., Jafari, R., Kehtarnavaz, N.: Action recognition from depth sequences using depth motion maps-based local binary patterns. In: 2015 IEEE Winter Conference on Applications of Computer Vision, pp. 1092–1099 (2015)
10. Yang, X., Zhang, C., Tian, Y.L.: Recognizing actions using depth motion maps-based histograms of oriented gradients. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 1057–1060. ACM (2012)
11. Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y.: Robust 3D action recognition with random occupancy patterns. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, pp. 872–885. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33709-3_62
12. Vieira, A.W., Nascimento, E.R., Oliveira, G.L., Liu, Z., Campos, M.F.M.: STOP: space-time occupancy patterns for 3D action recognition from depth map sequences. In: Alvarez, L., Mejail, M., Gomez, L., Jacobo, J. (eds.) CIARP 2012. LNCS, vol. 7441, pp. 252–259. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33275-3_31
13. Xia, L., Aggarwal, J.: Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 2834–2841 (2013)
14. Oreifej, O., Liu, Z.: HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 716–723 (2013)
15. Wang, P., Li, W., Gao, Z., Zhang, J., Tang, C., Ogunbona, P.: Action recognition from depth maps using deep convolutional neural networks. IEEE Trans. Hum.-Mach. Syst. 46(4), 498–509 (2016)
16. Chen, C., Jafari, R., Kehtarnavaz, N.: UTD-MHAD: a multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 168–172, September 2015
17. Wang, P., Li, W., Li, C., Hou, Y.: Action recognition based on joint trajectory maps with convolutional neural networks. Knowl.-Based Syst. 158, 43–53 (2018)
18. Hou, Y., Li, Z., Wang, P., Li, W.: Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 28(3), 807–811 (2018)
19. Hussein, M.E., Torki, M., Gowayyed, M.A., El-Saban, M.: Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI 2013, pp. 2466–2472. AAAI Press (2013)
20. Zhou, L., Li, W., Zhang, Y., Ogunbona, P., Nguyen, D., Zhang, H.: Discriminative key pose extraction using extended LC-KSVD for action recognition. In: 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8 (2014)
21. Wang, P., Wang, S., Gao, Z., Hou, Y., Li, W.: Structured images for RGB-D action recognition. In: 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 1005–1014 (2017)
22. Wu, Y.: Mining actionlet ensemble for action recognition with depth cameras. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1290–1297 (2012)
23. Wang, P., Li, W., Gao, Z., Zhang, J., Tang, C., Ogunbona, P.O.: Action recognition from depth maps using deep convolutional neural networks. IEEE Trans. Hum.-Mach. Syst. 46(4), 498–509 (2016)
24. Lu, C., Jia, J., Tang, C.: Range-sample depth feature for action recognition. In: IEEE International Conference on Computer Vision and Pattern Recognition, pp. 772–779 (2014)