
Ensemble of Classifiers Using CNN and Hand-Crafted Features for Depth-Based Action Recognition

Jacek Trelinski and Bogdan Kwolek(B)

AGH University of Science and Technology, 30 Mickiewicza, 30-059 Krakow, Poland


{tjacek,bkw}@agh.edu.pl
http://home.agh.edu.pl/~bkw/contact.html

Abstract. In this paper, we present an algorithm for action recognition that uses only depth maps. At the beginning we extract features
describing the person shape in single depth maps. For each class we
train a separate one-against-all convolutional neural network to extract
class-specific features. The actions are represented by multivariate time-
series of such CNN-based frame features for which we calculate statistical
features. For the non-zero pixels representing the person shape in each
depth map we calculate handcrafted features. For time-series of such
handcrafted features we calculate the statistical features. Afterwards,
handcrafted features that are common for all actions and CNN-based
features that are action-specific are concatenated together resulting in
action feature vectors. For each action feature vector we train a multi-
class classifier with one-hot encoding of output labels. The prediction of
the action is done by a voting-based ensemble operating on such one-hot
encoding outputs. We demonstrate experimentally that on UTD-MHAD
dataset the proposed algorithm outperforms state-of-the-art depth-based
algorithms and achieves promising results on MSR-Action3D dataset.

Keywords: Convolutional neural networks · Ensemble · Action classification

1 Introduction
The aim of Human Action Recognition (HAR) is to automatically recognize
what kind of action is performed in an image or depth map sequence.
This is a hard problem due to the many challenges involved in HAR. These
challenges include: variation in human shape, differences in human motion, self-
occlusions, cluttered backgrounds, varying illumination conditions, and view-
point variations [1]. There are also many different ways in which an action can be performed.
Much research on action recognition has focused on processing conventional RGB image
sequences and extracting handcrafted features. Compared to RGB image
sequences, depth maps deliver range information and are less sensitive to
varying illumination conditions. However, most current approaches to action
recognition on data delivered by depth sensors are based on the skeleton modality
or on handcrafted features [2], which in many scenarios provide insufficient
discriminative power. In the last decade, local feature methods were the most
successful in action recognition. Those methods are based on feature detectors
and model the local neighborhood of detected points of interest. The advantage
of such approaches is that they do not require sophisticated techniques for joint
detection (i.e. skeleton detection) as a pre-processing step. One of the drawbacks
of methods relying on local features is that they usually ignore spatio-temporal
layout of detected features.
In a typical algorithm for action recognition we can distinguish three main
steps: feature extraction, quantization/dimension reduction and classification.
Designing both effective and efficient feature set for action recognition on depth
map sequences is not an easy task [3]. The main difficulty is that, in contrast to
color images, depth maps do not contain as much texture. Usually,
they are too noisy, both spatially and temporally, to apply gradient operators
in space and time. Last but not least, the recognition is typically realized
on depth maps acquired by a single sensor. Since body parts undergo occlusions,
the robustness of global feature-based methods can be poor [3]. In order to cope
with challenges mentioned above, various features that are semi-local, highly
discriminative and robust to occlusions have been developed [4].
Because the noisy character of depth maps limits the application of local differential
operators, the number of sequential depth-map approaches that achieve
results competitive with depth-map or space-time-volume approaches
is limited [5]. Since the skeleton data is one of the most
natural features for modeling action dynamics, the most successful approaches
utilize skeleton information [6]. A method called Histogram of 3D Joint Loca-
tions (HOJ3D) [7] that encodes spatial occupancy information with regard to
the skeleton root is a representative method for such approaches. The HOJ3D
features are computed on action depth sequences, projected by LDA and then
clustered into posture visual words to represent the prototypical poses of actions.
The temporal evolutions of such visual words are modeled by discrete HMMs.
In this work, we present an algorithm for action recognition that is based
on depth maps only. At the beginning we extract features describing the person
shape in single depth maps. For each action class we train a separate one-against-
all convolutional neural network in order to extract class-specific features. The
actions are represented by multivariate time-series of such CNN-based frame
features for which we determine statistical features. For non-zero pixels repre-
senting the person shape in each depth map we calculate handcrafted features.
For time-series of such handcrafted features we determine the statistical features.
Next, handcrafted features that are common for all classes and CNN-based fea-
tures that are class-specific are concatenated together resulting in action feature
vectors. For each action feature vector we train a multi-class logistic regression
classifier with one-hot encoding of output labels. Action classification is per-
formed by a voting-based ensemble operating on such one-hot encoding outputs.
We demonstrate experimentally that on the UTD-MHAD dataset the proposed
algorithm outperforms state-of-the-art algorithms and achieves promising results


on the MSR-Action3D dataset.

2 Relevant Work

As noticed in a recently published survey [2], among datasets utilized in evaluation of action recognition algorithms, the MSR-Action3D dataset is the most
popular and widely used benchmark data. For the MSR-Action3D dataset, most
of the studies follow the evaluation setting of Li et al. [8], in which twenty actions
are divided into three subsets AS1, AS2, AS3, each having eight actions. For each
subset, the tests T1, T2 and CST are usually performed. In most papers the clas-
sification accuracy better than 90% in the first two tests is reported. In the third
test, however, the recognition performance is generally far lower. It follows
that many of these methods do not generalize well
when the action is performed by a different performer, even under identical conditions and
in the same environment. As an example, the algorithm of Li et al. achieves 74.7% classification
accuracy in the CST test, whereas 91.6% and 94.2% accuracies were
obtained in tests T1 and T2, respectively.
As already mentioned, approaches based on joint features achieve far
better classification performance than methods relying on depth
maps or point clouds [6]. However, skeleton-based methods are not applicable
in settings where skeleton data is not accessible [9]. Since our method is
based on the depth modality only, below we discuss only depth-based approaches to
action recognition.
In [10], depth images were projected onto three orthogonal planes and then
accumulated in order to generate Depth Motion Maps (DMMs). Afterwards,
the histograms of oriented gradients (HOG) computed from DMMs were
employed as feature descriptors. A method proposed in [11] does not utilize
the skeletal joints information. Random occupancy pattern (ROP) features were
extracted from depth map sequences and a sparse coding was utilized to encode
such features. In a method proposed in [12], the depth map sequence is divided
into a spatiotemporal grid. Subsequently, a simple feature called global occu-
pancy pattern is determined, where the number of occupied pixels is stored
for each grid cell. In [13] depth cuboid similarity features (DCSF) are built
around the local spatio-temporal interest points (STIPs), which are extracted
from depth map sequences. A method proposed in [14] does not require a skele-
ton tracker and determines a histogram of oriented 4D surface normals (HON4D)
in order to capture complex joint shape-motion cues at the pixel level. In contrast
to the method proposed in [10], the temporal order of the events in the action
sequences is encoded and not ignored. A recently proposed method [9] employs
three projection views to capture motion cues and then uses LBPs to determine
a compact feature representation. In a more recent method [15], recognition
of human action from depth maps is done using weighted hierarchical depth
motion maps (WHDMM) and three-channel deep convolutional neural networks
(3DConvNets).

3 The Method
Given that in currently available datasets for depth-based action
recognition the amount of data (depth map sequences) is insufficient to train
deep models with good generalization capabilities, we propose to use CNNs
operating on single depth maps (pairs of consecutive frames) to extract informative
features. The main difference between our approach and other approaches
is that instead of training CNNs/LSTMs on depth map sequences we train a set
of CNNs on single depth maps to extract a considerable number of features. In
the proposed approach, a CNN is trained for each class to extract class-specific
features. The class-specific models that were trained for action classification are
used as feature extractors by removing the softmax output layer (which out-
puts class scores). A separate CNN is trained for each action class to distin-
guish between this class and all remaining classes, as in the one-vs-all approach to
multi-class classification. This means that each action-specific CNN is trained
to predict if the considered depth map belongs to the class for which the CNN
is trained or to one of the remaining classes. Each CNN is trained on single
depth maps belonging to the considered class and depth maps sampled from
the remaining classes. Since the number of single depth maps across all depth map
sequences is considerable, it is possible to train the CNNs without overfitting and
with good generalization capabilities. In our approach the features are extracted
using the outputs of the penultimate layer, which consists of one hundred neurons.
This means that the size of the feature vector extracted by each action-specific
CNN is equal to one hundred, and the number of 100-element vectors is equal to
the number of the actions. Since the number of frames in the depth map sequences
representing the actions differs, the lengths of the multivariate time-series are
usually not identical.
In the proposed approach the input depth maps have size 64 × 64 pixels.
The input of the convolutional neural network is a 4 × 64 × 64 tensor consisting of
two consecutive depth maps and orthogonal projections of the input depth map
onto the xz and yz planes [16]. This means that aside from the frontal depth maps we
also employ side-view and top projections of the depth maps. The convolutional
layer C1 consists of sixteen 5 × 5 convolutional filters that are followed by a
subsampling layer. The next convolutional layer C2 operates on sixteen feature
maps of size 15 × 15. It consists of sixteen 5 × 5 convolutional filters that are
followed by a subsampling layer. It outputs sixteen feature maps of size 2 × 2. The
next, fully connected layer FC consists of one hundred neurons. At the learning
stage, the output of the CNN is a softmax layer with two neurons. Every action-specific
network has been trained on depth maps from the training parts of the
depth map sequences. After training, the layers before the softmax
have been employed to extract shape features.
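To make the architecture concrete, a minimal sketch of a single action-specific network is given below. It is written in PyTorch for readability, while the original implementation used Theano and Lasagne; the ReLU nonlinearities and the 4 × 4 pooling factors are assumptions chosen so that the intermediate feature-map sizes roughly match the reported 15 × 15 and 2 × 2.

```python
import torch
import torch.nn as nn

class ActionSpecificCNN(nn.Module):
    """One-vs-all CNN trained per action class; after training, the 100-unit
    fully connected layer is used as the frame feature extractor.
    Pooling factors (4x4) are an assumption chosen so that the feature maps
    roughly match the sizes reported in the paper (15x15 and 2x2)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=5),   # C1: input 4x64x64 -> 16x60x60
            nn.ReLU(),
            nn.MaxPool2d(4),                   # subsampling -> 16x15x15
            nn.Conv2d(16, 16, kernel_size=5),  # C2: 16x15x15 -> 16x11x11
            nn.ReLU(),
            nn.MaxPool2d(4),                   # subsampling -> 16x2x2
        )
        self.fc = nn.Linear(16 * 2 * 2, 100)   # FC: 100-dim frame descriptor
        self.classifier = nn.Linear(100, 2)    # softmax head, used only for training

    def forward(self, x, extract_features=False):
        h = self.features(x).flatten(1)
        h = torch.relu(self.fc(h))
        if extract_features:
            return h                           # 100-dim class-specific frame features
        return self.classifier(h)              # logits for class-vs-rest training
```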
The actions are represented by multivariate time-series of CNN-based features.
For each time-series composed of class-specific CNN-based features we
calculate three statistical features: average, standard deviation and skewness.
The motivation for using skewness was to include a parameter describing the asymmetry
of a random variable's probability distribution with respect to the normal
distribution. This means that each action is described by a 3 × 100 = 300-element
vector describing statistically the multivariate time-series of CNN-based
features.
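A minimal sketch of this summarization step, assuming NumPy/SciPy (the function name is illustrative):

```python
import numpy as np
from scipy.stats import skew

def summarize_time_series(frame_features):
    """frame_features: (n_frames, 100) array of class-specific CNN features
    for one action sequence. Returns a 300-element descriptor consisting of
    the per-dimension average, standard deviation and skewness."""
    return np.concatenate([
        frame_features.mean(axis=0),
        frame_features.std(axis=0),
        skew(frame_features, axis=0),
    ])
```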
Aside from the CNN-based features, for each frame we calculate handcrafted
features. At the beginning the depth map is projected onto two orthogonal planes
to determine the side-view and top projections of the depth map. Given that pixels
with non-zero values represent the performer in the depth maps, only pixels with
non-zero values are utilized in the calculation of handcrafted features. The following
features are calculated on these three depth maps: area ratio (a single value for the
x, y, z axes), standard deviation (x, y, z axes), skewness (x, y, z axes), and correlation (xy,
xz and zy axes). This means that the number of handcrafted features describing
a single depth map is equal to ten. Every realization of an action is thus described by
a multivariate time-series of length equal to the number of frames and of dimension ten.
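The following sketch illustrates one plausible implementation of the per-frame handcrafted descriptor. The exact definition of the area ratio is not given in the text, so the ratio of non-zero pixels to the silhouette bounding-box area used below is an assumption; the coordinate-based standard deviations, skewness values and correlations follow the listing above.

```python
import numpy as np
from scipy.stats import skew

def handcrafted_frame_features(depth_map):
    """depth_map: (H, W) array; non-zero pixels are assumed to belong to the
    performer. Returns a 10-element vector: area ratio, per-axis std,
    per-axis skewness and pairwise correlations of the x, y, z coordinates.
    The area-ratio definition (non-zero pixels over the bounding-box area of
    the silhouette) is an assumption."""
    ys, xs = np.nonzero(depth_map)
    zs = depth_map[ys, xs].astype(float)
    height = ys.max() - ys.min() + 1
    width = xs.max() - xs.min() + 1
    area_ratio = len(zs) / float(height * width)
    coords = [xs.astype(float), ys.astype(float), zs]
    stds = [c.std() for c in coords]
    skews = [skew(c) for c in coords]
    corrs = [np.corrcoef(xs, ys)[0, 1],
             np.corrcoef(xs, zs)[0, 1],
             np.corrcoef(zs, ys)[0, 1]]
    return np.array([area_ratio, *stds, *skews, *corrs])
```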
For each action realization that is represented by a time-series of handcrafted
features the following statistical features are calculated (a sketch computing them is given after the list):

1. average
2. std
3. skewness
4. number of local minima in the time series
5. number of local maxima in the time series
6. location of the global maximum in the time series (as a fraction of the time-series length)
7. location of the global minimum in the time series (as a fraction of the time-series length)
8. mean change ($\frac{1}{n-1}\sum_{i=1}^{n-1}\frac{|f_{i+1}-f_i|}{f_i}$, where $f_i$ denotes the feature value in frame $i$ and $n$ is the number of frames)
9. difference between the average and the median
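A minimal sketch of these nine statistics for a single (univariate) handcrafted feature series, assuming NumPy/SciPy; the way local extrema are detected is an assumption, and the mean-change term assumes non-zero feature values:

```python
import numpy as np
from scipy.stats import skew
from scipy.signal import argrelextrema

def series_statistics(f):
    """f: 1-D array holding one handcrafted feature over the frames of a
    sequence. Returns the nine statistics listed above (items 1-9)."""
    n = len(f)
    mean_change = np.mean(np.abs(np.diff(f)) / f[:-1])    # item 8 (assumes f != 0)
    return np.array([
        f.mean(),                                          # 1 average
        f.std(),                                           # 2 std
        skew(f),                                           # 3 skewness
        len(argrelextrema(f, np.less)[0]),                 # 4 number of local minima
        len(argrelextrema(f, np.greater)[0]),              # 5 number of local maxima
        np.argmax(f) / n,                                  # 6 location of global maximum
        np.argmin(f) / n,                                  # 7 location of global minimum
        mean_change,                                       # 8 mean change
        f.mean() - np.median(f),                           # 9 average minus median
    ])
```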

The features mentioned above were grouped into the following feature collections:
I - (1, 2, 3), II - (4, 5), III - (6, 7), IV - (8), V - (9). In order to determine the
discriminative power of the handcrafted features we performed evaluations of
action classification using the following feature sets: I, I+II+III, I+II+III+IV,
I+II+III+V and I+II+III+IV+V. For the mentioned feature sets the number of
handcrafted features was equal to: 30, 70, 80, 80 and 90, respectively. The CNN-
based features that are action-specific and handcrafted features that are common
for all actions were concatenated together resulting in action feature vectors.
This means that class-specific feature vectors of size 300 that are extracted by
CNNs were concatenated with handcrafted feature vectors whose size depends on
the feature set used. More details are given in Sect. 4.
For each action feature vector we train a multi-class classifier with one-hot
encoding of output labels. The prediction of the action is done by a voting-based
ensemble operating on such one-hot encoded outputs. Thanks to training a CNN
for each class, the features extracted by the CNNs are uncorrelated and thus
provide sufficient diversity for the ensemble. What is more, the handcrafted features
describing individual depth maps are uncorrelated with the CNN-based features and
thus introduce additional diversity into the ensemble.
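The sketch below illustrates the ensemble under the assumption that scikit-learn's LogisticRegression is used as the member classifier; the majority vote over predicted labels is equivalent to summing the members' one-hot outputs and taking the argmax.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class VotingEnsemble:
    """One multi-class logistic regression model per action-specific feature
    vector (class-specific CNN features concatenated with the common
    handcrafted features); the final label is chosen by majority voting."""
    def __init__(self, n_members):
        self.members = [LogisticRegression(max_iter=1000) for _ in range(n_members)]

    def fit(self, feature_sets, y):
        # feature_sets[k]: (n_sequences, d_k) matrix for the k-th ensemble member
        for clf, X in zip(self.members, feature_sets):
            clf.fit(X, y)

    def predict(self, feature_sets):
        votes = np.stack([clf.predict(X) for clf, X in zip(self.members, feature_sets)])
        # majority vote across members for each test sequence
        # (assumes integer class labels 0..K-1)
        return np.array([np.bincount(v).argmax() for v in votes.T])
```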

4 Empirical Results and Discussion


The proposed framework has been evaluated on two publicly available benchmark
datasets: the MSR-Action3D dataset [8] and the UTD-MHAD dataset [16]. The
datasets were chosen due to their frequent use as benchmark data by the action
recognition community. In all experiments and evaluations, 557 sequences of
the MSR-Action3D dataset were investigated. Half of the subjects were used as
training data and the rest of the subjects were used as the test subset. In
the discussed classification setting, half of the subjects are used for training
and the rest for testing, which is different from evaluation protocols based
on the AS1, AS2 and AS3 data splits with averaging of the classification accuracies
over these splits. It is worth mentioning that the classification performances
achieved in this setting are lower than those achieved in the
AS1, AS2, AS3 setting due to larger variations
across the same actions performed by different subjects. The cross-subject evaluation
scheme [13,15] has been applied in all evaluations. This scheme is different
from the scheme utilized in [7], in which more subjects were in the training
subset.
The UTD-MHAD dataset consists of 27 different actions performed by 8
subjects (4 females and 4 males). Each performer repeated each action 4 times.
All actions were performed in an indoor environment with a fixed background.
The dataset was collected using a Kinect sensor and a wearable inertial sensor. It
includes 861 data sequences.
At the beginning we evaluated the discriminative power of the handcrafted features
on the UTD-MHAD dataset. A multi-class linear SVM has been trained on
the statistical features of the time-series of handcrafted features. The evaluation
has been performed for the feature sets discussed in Sect. 3. The value of the C
parameter of the linear SVM has been determined by grid search. As we can
observe in Table 1, the best results were achieved for the I+II+III+IV+V feature
set. We also considered a logistic regression-based multi-class classifier, but its
best classification accuracy was about five percent worse than the classification
accuracy achieved by the linear SVM.
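A sketch of this evaluation step with scikit-learn; the grid of C values and the cross-validation setting are assumptions, since the paper only states that C was selected by a grid search.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Grid search over the C parameter of a multi-class linear SVM trained on
# the statistical descriptors of the handcrafted time-series.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(LinearSVC(max_iter=10000), param_grid, cv=5)
# search.fit(X_train, y_train); accuracy = search.score(X_test, y_test)
```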

Table 1. Recognition performance on UTD-MHAD dataset that was achieved by SVM using handcrafted features.

f. set f. num. Accuracy Precision Recall F1-score
I 30 0.6884 0.7143 0.6884 0.6752
I+II+III 70 0.7419 0.7615 0.7419 0.7350
I+II+III+IV 80 0.7531 0.7748 0.7531 0.7531
I+II+III+V 80 0.7535 0.7792 0.7535 0.7501
I+II+III+IV+V 90 0.7744 0.7923 0.7744 0.7718

Table 2 shows the recognition performance obtained by the ensemble
on the UTD-MHAD dataset. The class-specific CNN-based features were concatenated
with the common handcrafted features and then used to train class-specific
multi-class logistic regression classifiers. As a second option, the Recursive Feature
Elimination (RFE) algorithm was executed to select the most informative
CNN-based features, which were then concatenated with the handcrafted
features and used to train the classifiers. The first row presents results achieved
by the ensemble using only CNN-based features. The following rows present
results achieved on the basis of CNN-based features concatenated with
handcrafted features. The next two rows of the table present results achieved
using the I feature set concatenated with CNN-based features. The following
groups of two rows show results obtained with the I+II+III, I+II+III+IV,
I+II+III+V and I+II+III+IV+V feature sets, respectively. In each group of
two rows, the first row depicts results without selection of the most informative
features, whereas the second row demonstrates results achieved on the basis of
the most informative features concatenated with the handcrafted features. The
number of features utilized by the classifiers is shown in the third column. The
values in the remaining columns list the accuracies, precisions, recalls and
F1-scores obtained using the feature sets discussed above. As we can observe,
the best results were obtained using both handcrafted and deep features together
with feature selection by RFE. In general, the use of RFE leads to better results
for a given feature set. The resulting feature vectors are about half the length of
the feature vectors not processed by RFE, which in turn leads to better
generalization. Owing to the use of handcrafted features the classification accuracy
has been improved from 61.6% to 83.0%. On the other hand, the best classification
accuracy obtained using only handcrafted features was 77.4%, see also Table 1.
Table 2. Recognition performance on UTD-MHAD dataset that was achieved by the ensemble.

f. set f. sel. f. num. Accuracy Precision Recall F1-score
- - 300 0.6163 0.6259 0.6163 0.6040
I - 330 0.7163 0.7359 0.7163 0.7101
I RFE 130 0.7535 0.7724 0.7535 0.7510
I+II+III - 370 0.7674 0.7741 0.7674 0.7580
I+II+III RFE 170 0.7953 0.8044 0.7953 0.7897
I+II+III+IV - 380 0.7860 0.8010 0.7860 0.7755
I+II+III+IV RFE 180 0.8209 0.8314 0.8209 0.8167
I+II+III+V - 380 0.7860 0.8010 0.7860 0.7755
I+II+III+V RFE 180 0.8070 0.8256 0.8070 0.7981
I+II+III+IV+V - 390 0.7953 0.8083 0.7953 0.7885
I+II+III+IV+V RFE 190 0.8302 0.8398 0.8302 0.8250
Fig. 1. Confusion matrix obtained by the ensemble on UTD-MHAD dataset.
Figure 1 depicts the confusion matrix that was determined on the basis of
results achieved by the ensemble on the UTD-MHAD dataset.
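The feature-selection step can be sketched with scikit-learn's RFE as below. Keeping 100 of the 300 CNN-based features is consistent with the feature counts in Table 2 (e.g. 130 = 100 + 30 for feature set I); the choice of the base estimator and the elimination step size are assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# RFE drops the least informative CNN-based features before they are
# concatenated with the handcrafted ones.
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=100, step=10)
# cnn_selected = selector.fit_transform(X_cnn_train, y_train)
# X_train = np.hstack([cnn_selected, X_handcrafted_train])
```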
Table 3 presents the recognition performance of the proposed method compared
with state-of-the-art methods. Most current methods for action
recognition on the UTD-MHAD dataset are based on skeleton data. It is worth
noting that methods based on the skeleton modality usually achieve better results
than methods relying on depth data only. Despite the fact that our
method is based on the depth modality, we include recent skeleton-based methods
to show that it outperforms many of them. Methods based on depth data
have a wide range of applications since not all sensors or cameras delivering the
depth modality support skeleton extraction. Our method is considerably better
than the WHDMM+3DConvNets method that employs weighted hierarchical
depth motion maps (WHDMMs) and three 3D ConvNets. The WHDMMs are
employed at several temporal scales to encode spatiotemporal motion patterns
of actions into 2D spatial structures. In order to provide sufficient amount of
training data, the 3D points are rotated and then used to synthesize new exem-
plars. In contrast, our algorithm operates on class-specific CNN features that
are concatenated with handcrafted features. The improved performance of our
method may suggest that the proposed method has better viewpoint tolerance
in comparison to depth-based algorithms, including [15].

Table 3. Comparative recognition performance of the proposed method with recent algorithms on the UTD-MHAD dataset.

Method Modality Accuracy [%]
JTM [17] Skeleton 85.81
SOS [18] Skeleton 86.97
Cov3DJ [19] Skeleton 85.58
Kinect & inertial [16] Skeleton 79.10
ELC-KCSVD [20] Skeleton 76.19
Struct. body [21] Skeleton 66.05
Struct. part [21] Skeleton 78.70
Struct. joint [21] Skeleton 86.81
Struct. SzDDI [21] Skeleton 89.04
WHDMMs + ConvNets [15, 21] Depth 73.95
Proposed method Depth 83.02

Table 4 presents results achieved on the MSR-Action3D dataset by
a multi-class linear SVM operating only on the handcrafted features. As we
can notice, the best classification accuracy has been achieved on the basis of the
I+II+III+V feature set. We also considered a multi-class logistic regression
classifier, but it turned out that its best classification accuracy was about four
percent worse than the classification accuracy achieved by the linear SVM
classifier.
Table 5 shows the recognition performance achieved on the basis
of different feature sets on the MSR-Action3D dataset. As we can observe, the best
results were achieved on the basis of the I+II+III+IV handcrafted feature set that
was concatenated with CNN-based features selected by the RFE
algorithm. The accuracy achieved using the I+II+III+IV+V feature set is
identical, but the precision is worse. The RFE-based feature elimination decreases
the number of CNN-based features and the results on the reduced feature subsets
are better.
Table 4. Recognition performance on MSR-Action3D dataset that was achieved by SVM using handcrafted features.

f. set f. num. Accuracy Precision Recall F1-score
I 30 0.8109 0.8232 0.8109 0.7948
I+II+III 70 0.8182 0.8270 0.8182 0.8102
I+II+III+IV 80 0.7927 0.8049 0.7927 0.7883
I+II+III+V 80 0.8400 0.8440 0.8400 0.8420
I+II+III+IV+V 90 0.8036 0.8134 0.8036 0.7922

The results presented in Tables 2 and 5 were achieved by an ensemble
consisting of multi-class logistic regression classifiers with one-hot encoding of output labels.
We also considered multi-class logistic classifiers and multi-class SVMs with
probabilistic outputs, which were then used to determine the ensemble output.
However, the results were not better.

Table 5. Recognition performance on MSR-Action3D dataset that was achieved by the ensemble.

f. set f. sel. f. num. Accuracy Precision Recall F1-score
- - 300 0.8000 0.8098 0.8000 0.7883
I - 330 0.8982 0.9092 0.8982 0.8690
I RFE 130 0.8545 0.8761 0.8545 0.8491
I+II+III - 370 0.8727 0.8812 0.8727 0.8671
I+II+III RFE 170 0.8945 0.8977 0.8945 0.8827
I+II+III+IV - 380 0.8691 0.8827 0.8691 0.8578
I+II+III+V RFE 180 0.8836 0.8944 0.8836 0.8740
I+II+III+V - 380 0.8873 0.8920 0.8873 0.8770
I+II+III+IV RFE 180 0.9055 0.9204 0.9055 0.8945
I+II+III+IV+V - 390 0.8836 0.8941 0.8836 0.8712
I+II+III+IV+V RFE 190 0.9055 0.9095 0.9055 0.8955

Table 6 illustrates the classification performance of the proposed method in
comparison to previous depth-based methods on the MSR-Action3D dataset.
The classification performance of the proposed framework has been determined
using the cross-subject evaluation [22], where subjects 1, 3, 5, 7, and 9 were
employed for training and subjects 2, 4, 6, 8, and 10 were utilized for testing.
As we can notice, the proposed method achieves better classification accuracy
than the recently proposed method [17], and it performs worse than the recently
proposed methods [18,21,23].
Table 6. Comparative recognition performance of the proposed method with recent algorithms on MSR-Action3D dataset.

Method Split Modality Accuracy [%]
3DCNN [17] Split II Depth 84.07
Depth motion maps [10] Split II Depth 88.73
PRNN [18] Split II Depth 94.90
Range sample [24] Not shown Skeleton 95.62
WHDMM+CNN [23] Split I Depth 100.00
S DDI [21] Split I Depth 100.00
Proposed method Split I Depth 90.6
One of the main reasons for this is the limited amount of training samples in the MSR-Action3D dataset. In
order to cope with such a limitation, Wang et al. generated synthesized training
samples on the basis of 3D points. This means that the discussed algorithm
is not based on depth maps only. Comparing the results in Tables 3 and 6,
we can notice that the classification performance achieved by the WHDMM
algorithm on the UTD-MHAD dataset is worse than that achieved
by the proposed algorithm.
The proposed method has been implemented in Python using the Theano and
Lasagne deep learning frameworks. The Lasagne library is built on top of Theano.
The values of the initial weights of the CNNs were drawn randomly from
uniform distributions. The binary cross-entropy loss function has been used in
the minimization. The CNNs were trained using SGD with momentum.
The computations were performed on a PC equipped with an NVIDIA GPU card.
The source code of the proposed algorithms is freely available at
https://github.com/tjacek/DeepActionLearning.
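A minimal sketch of this training configuration is given below, written with PyTorch for brevity rather than the original Theano/Lasagne code; the learning rate, momentum value and number of epochs are assumptions, as the paper does not report them, and ActionSpecificCNN refers to the hypothetical class sketched in Sect. 3.

```python
import torch

# Binary cross-entropy over a two-neuron softmax is equivalent to a
# two-class cross-entropy loss, which is used here.
model = ActionSpecificCNN()                 # class from the earlier sketch
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# for epoch in range(30):
#     for maps, labels in train_loader:     # labels: class-vs-rest (0/1)
#         optimizer.zero_grad()
#         loss = criterion(model(maps), labels)
#         loss.backward()
#         optimizer.step()
```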

5 Conclusions
In this work a method for action recognition on depth map sequences has been
proposed. Due to the considerable amount of noise in depth maps, which prevents
applying local differential operators, the number of sequential depth-map-based
approaches is limited and most of the approaches are skeleton-based. We
demonstrated experimentally that the proposed algorithm can achieve supe-
rior results in comparison to results achieved by state-of-the-art algorithms,
including recently proposed deep learning-based algorithms. The method has
been evaluated on two widely employed benchmark datasets and compared with
state-of-the-art methods. We demonstrated experimentally that on the challenging
MSR-Action3D and UTD-MHAD datasets the proposed method achieves superior
or at least comparable results to those achieved by recent methods.

Acknowledgment. This work was supported by Polish National Science Center (NCN) under a research grant 2017/27/B/ST6/01743.

References
1. Aggarwal, J., Ryoo, M.: Human activity analysis: a review. ACM Comput. Surv.
43(3), 16:1–16:43 (2011)
2. Liang, B., Zheng, L.: A survey on human action recognition using depth sensors.
In: International Conference on Digital Image Computing: Techniques and Appli-
cations, pp. 1–8 (2015)
3. Aggarwal, J., Xia, L.: Human activity recognition from 3D data: a review. Pattern
Recognit. Lett. 48, 70–80 (2014)
4. Chen, L., Wei, H., Ferryman, J.: A survey of human motion analysis using depth
imagery. Pattern Recognit. Lett. 34(15), 1995–2006 (2013)
5. Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., Gall, J.: A survey on human motion
analysis from depth data. In: Grzegorzek, M., Theobalt, C., Koch, R., Kolb, A.
(eds.) Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications.
LNCS, vol. 8200, pp. 149–187. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-44964-2_8
6. Lo Presti, L., La Cascia, M.: 3D skeleton-based human action classification. Pattern
Recognit. 53(C), 130–147 (2016)
7. Xia, L., Chen, C.C., Aggarwal, J.: View invariant human action recognition using
histograms of 3D joints. In: CVPR Workshops, pp. 20–27 (2012)
8. Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In:
IEEE International Conference on Computer Vision and Pattern Recognition -
Workshops, pp. 9–14 (2010)
9. Chen, C., Jafari, R., Kehtarnavaz, N.: Action recognition from depth sequences
using depth motion maps-based local binary patterns. In: 2015 IEEE Winter Con-
ference on Applications of Computer Vision, pp. 1092–1099 (2015)
10. Yang, X., Zhang, C., Tian, Y.L.: Recognizing actions using depth motion maps-
based histograms of oriented gradients. In: Proceedings of the 20th ACM Interna-
tional Conference on Multimedia, pp. 1057–1060. ACM (2012)
11. Wang, J., Liu, Z., Chorowski, J., Chen, Z., Wu, Y.: Robust 3D action recognition
with random occupancy patterns. In: Fitzgibbon, A., Lazebnik, S., Perona, P.,
Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, pp. 872–885. Springer, Heidelberg
(2012). https://doi.org/10.1007/978-3-642-33709-3_62
12. Vieira, A.W., Nascimento, E.R., Oliveira, G.L., Liu, Z., Campos, M.F.M.:
STOP: space-time occupancy patterns for 3D action recognition from depth map
sequences. In: Alvarez, L., Mejail, M., Gomez, L., Jacobo, J. (eds.) CIARP 2012.
LNCS, vol. 7441, pp. 252–259. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33275-3_31
13. Xia, L., Aggarwal, J.: Spatio-temporal depth cuboid similarity feature for activity
recognition using depth camera. In: IEEE International Conference on Computer
Vision and Pattern Recognition, pp. 2834–2841 (2013)
14. Oreifej, O., Liu, Z.: HON4D: histogram of oriented 4D normals for activity recogni-
tion from depth sequences. In: IEEE International Conference on Computer Vision
and Pattern Recognition, pp. 716–723 (2013)
15. Wang, P., Li, W., Gao, Z., Zhang, J., Tang, C., Ogunbona, P.: Action recognition
from depth maps using deep convolutional neural networks. IEEE Trans. Hum.-
Mach. Syst. 46(4), 498–509 (2016)
16. Chen, C., Jafari, R., Kehtarnavaz, N.: UTD-MHAD: a multimodal dataset for
human action recognition utilizing a depth camera and a wearable inertial sensor.
In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 168–172,
September 2015
17. Wang, P., Li, W., Li, C., Hou, Y.: Action recognition based on joint trajectory
maps with convolutional neural networks. Knowl.-Based Syst. 158, 43–53 (2018)
18. Hou, Y., Li, Z., Wang, P., Li, W.: Skeleton optical spectra-based action recognition
using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol.
28(3), 807–811 (2018)
19. Hussein, M.E., Torki, M., Gowayyed, M.A., El-Saban, M.: Human action recog-
nition using a temporal hierarchy of covariance descriptors on 3D joint locations.
In: Proceedings of the Twenty-Third International Joint Conferences on Artificial
Intelligence, IJCAI 2013, pp. 2466–2472. AAAI Press (2013)
20. Zhou, L., Li, W., Zhang, Y., Ogunbona, P., Nguyen, D., Zhang, H.: Discrimina-
tive key pose extraction using extended LC-KSVD for action recognition. In: 2014
International Conference on Digital Image Computing: Techniques and Applica-
tions (DICTA), pp. 1–8 (2014)
21. Wang, P., Wang, S., Gao, Z., Hou, Y., Li, W.: Structured images for RGB-D
action recognition. In: 2017 IEEE International Conference on Computer Vision
Workshops (ICCVW), pp. 1005–1014 (2017)
22. Wu, Y.: Mining actionlet ensemble for action recognition with depth cameras. In:
IEEE International Conference on Computer Vision and Pattern Recognition, pp.
1290–1297 (2012)
23. Wang, P., Li, W., Gao, Z., Zhang, J., Tang, C., Ogunbona, P.O.: Action recognition
from depth maps using deep convolutional neural networks. IEEE Trans. Hum.-
Mach. Syst. 46(4), 498–509 (2016)
24. Lu, C., Jia, J., Tang, C.: Range-sample depth feature for action recognition. In:
IEEE International Conference on Computer Vision and Pattern Recognition, pp.
772–779 (2014)
