You are on page 1of 26

3200 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO.

3, MARCH 2023

Human Action Recognition From Various


Data Modalities: A Review
Zehua Sun , Qiuhong Ke , Hossein Rahmani , Mohammed Bennamoun , Gang Wang , and Jun Liu

Abstract—Human Action Recognition (HAR) aims to understand human behavior and assign a label to each action. It has a wide
range of applications, and therefore has been attracting increasing attention in the field of computer vision. Human actions can be
represented using various data modalities, such as RGB, skeleton, depth, infrared, point cloud, event stream, audio, acceleration,
radar, and WiFi signal, which encode different sources of useful yet distinct information and have various advantages depending on the
application scenarios. Consequently, lots of existing works have attempted to investigate different types of approaches for HAR using
various modalities. In this article, we present a comprehensive survey of recent progress in deep learning methods for HAR based on
the type of input data modality. Specifically, we review the current mainstream deep learning methods for single data modalities and
multiple data modalities, including the fusion-based and the co-learning-based frameworks. We also present comparative results on
several benchmark datasets for HAR, together with insightful observations and inspiring future research directions.

Index Terms—Human action recognition, deep learning, data modality, single modality, multi-modality

1 INTRODUCTION affordable sensors, and the distinct advantages of different


modalities for HAR depending on the application scenarios.
UMAN Action Recognition (HAR), i.e., recognizing and
H understanding human actions, is crucial for a number
of real-world applications. It can be used in visual surveil-
Specifically, according to the visibility, data modalities can
be roughly divided into two categories, namely, visual modali-
ties and non-visual modalities. As shown in Table 1, RGB, skel-
lance systems [1] to identify dangerous human activities,
eton, depth, infrared sequence, point cloud, and event stream
and autonomous navigation systems [2] to perceive human
are visually “intuitive” for representing human actions, and
behaviors for safe operation. Besides, it is important for a
can be seen as visual modalities. Generally, visual modalities
number of other applications, e.g., video retrieval [3],
are very effective for HAR. Among them, RGB video data is
human-robot interaction [4], and entertainment [5].
the most common data type for HAR, which has been widely
In the early days, most of the works focused on using
used in surveillance and monitoring systems. Skeleton data
RGB or gray-scale videos as input for HAR [6], due to their
encodes the trajectories of human body joints. It is succinct
popularity and easy access. Recent years have witnessed an
and efficient for HAR when the action performing does not
emergence of works [7], [8], [9], [10], [11], [12], [13], [14],
involve objects or scene context. Point cloud and depth data,
[15], [16] using other data modalities, such as skeleton,
which capture the 3D structure and distance information, are
depth, infrared sequence, point cloud, event stream, audio,
popularly used for HAR in robot navigation and self-driving
acceleration, radar, and WiFi for HAR. This is mainly due to
applications. In addition, infrared data can be utilized for
the development of different kinds of accurate and
HAR in even dark environments, while the event stream keeps
the foreground movement of the human subjects and avoids
much visual redundancy, making it suitable for HAR as well.
 Zehua Sun and Jun Liu are with Singapore University of Technology Meanwhile, audio, acceleration, radar, and WiFi, etc. are
and Design, Singapore 487372. E-mail: zehua.sun@my.cityu.edu.hk, non-visual modalities, i.e., they are not visually “intuitive”
jun_liu@sutd.edu.sg.
 Qiuhong Ke is with Monash University, Clayton, VIC 3800, Australia.
for representing human behaviors. Nevertheless, these
E-mail: Qiuhong.Ke@monash.edu. modalities can also be used for HAR in some scenarios
 Hossein Rahmani is with Lancaster University, LA1 4YW Lancaster, UK. which require the protection of privacy of subjects. Among
E-mail: h.rahmani@lancaster.ac.uk. them, audio data is suitable for locating actions in the tem-
 Mohammed Bennamoun is with the University of Western Australia, Craw-
ley, WA 6009, Australia. E-mail: mohammed.bennamoun@uwa.edu.au. poral sequences, while acceleration data can be adopted for
 Gang Wang is with Alibaba Group, Hangzhou, Zhejiang 310052, China. fine-grained HAR. Besides, as a non-visual modality, radar
E-mail: wanggang@ntu.edu.sg. data can even be used for through-wall HAR. More details
Manuscript received 29 Jan. 2021; revised 7 Apr. 2022; accepted 11 June 2022. of different modalities are discussed in Section 2.
Date of publication 14 June 2022; date of current version 3 Feb. 2023. Single modality-based HAR has been extensively
This work was supported in part by National Research Foundation, Singapore
under its AI Singapore Programme under Grants AISG Award No. AISG- investigated in the past decades [6], [17], [18], [19], [20],
100E-2020-065, and in part by SUTD SRG. This work was supported in part [21], [22]. However, since different modalities have differ-
by TAILOR, a project funded by EU Horizon 2020 research and innovation ent strengths and limitations for HAR, the fusion of mul-
programme under Grant 952215. tiple data modalities and the transfer of knowledge
(Corresponding author: Jun Liu.)
Recommended for acceptance by G. Farinella. across modalities to enhance the accuracy and robustness
Digital Object Identifier no. 10.1109/TPAMI.2022.3183112 of HAR, have also received great attention recently [23],
0162-8828 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3201

TABLE 1 (1) To the best of our knowledge, this is the first survey
Action Samples of Different Data Modalities paper that comprehensively reviews the HAR methods
(With Pros and Cons)
from the perspective of various data modalities, including
RGB, depth, skeleton, infrared sequence, point cloud, event
stream, audio, acceleration, radar, and WiFi.
(2) We comprehensively review the multi-modality-
based HAR methods, and categorize them into two types,
namely, multi-modality fusion-based approaches and cross-
modality co-learning-based approaches.
(3) We focus on reviewing the more recent and advanced
deep learning methods for HAR, and hence provide the
readers with the state-of-the-art approaches.
(4) We provide comprehensive comparisons of existing
methods and their performance on several benchmark data-
sets (e.g., Tables 2, 3, 4, 5), with brief summaries and insight-
ful discussions.
The rest of this paper is organized as follows. Section 2
reviews HAR methods using different single data modalities.
Section 3 introduces multi-modality HAR methods. The
benchmark datasets are listed in Section 4. Finally, potential
future development of HAR is discussed in Section 5.

2 SINGLE MODALITY
Different modalities can have different characteristics with
corresponding advantages and disadvantages, as shown in
Table 1. Hence numerous works [7], [8], [9], [10], [11], [12],
[13], [14], [15], [16] exploited various modalities for HAR.
Below we review the methods that use single data modali-
ties, including RGB, skeleton, depth, infrared, point cloud,
event stream, audio, acceleration, radar, and WiFi, for HAR.

2.1 RGB Modality


The RGB modality generally refers to images or videos
(sequences of images) captured by RGB cameras which aim
to recreate what human eyes see. RGB data is usually easy
to collect, and it contains rich appearance information of the
captured scene context. RGB-based HAR has a wide range
[24]. More specifically, fusion means the combination of of applications, such as visual surveillance [1], autonomous
the information of two or more modalities to recognize navigation [2], and sport analysis [36]. However, action rec-
actions. For example, audio data can serve as complemen- ognition from RGB data is often challenging, owing to the
tary information of the visual modalities to distinguish variations of backgrounds, viewpoints, scales of humans,
the actions “putting a plate” and “putting a bag” [25]. and illumination conditions. Besides, RGB videos have gen-
Besides fusion, some other methods have also exploited erally large data sizes, leading to high computational costs
co-learning, i.e., transferring knowledge across different when modeling the spatio-temporal context for HAR.
modalities to strengthen the robustness of HAR models. In this subsection, we review the methods which use the
For example, in [26], the skeleton data serves as an auxil- RGB modality for HAR. Specifically, since videos contain
iary modality enabling the model to extract more discrim- the temporal dynamics of human motions that are often cru-
inative features from RGB videos for HAR. cial for HAR, most of the existing works focused on using
Considering the significance of using different single videos to handle HAR [37], and only a few approaches uti-
modalities for HAR and also leveraging their complementary lized static images [38], [39], [40]. Therefore, we focus, in the
characteristics for fusion and co-learning-based HAR, we following, on reviewing RGB video-based HAR methods.
review existing HAR methods from the perspective of data In the pre-deep learning era, many hand-crafted feature-
modalities. Specifically, we review the mainstream deep learn- based approaches, including the space-time volume-based
ing architectures that use single data modalities for HAR, and methods [41], space-time interest point (STIP)-based methods
also the methods taking advantage of multiple modalities for [42], and trajectory-based methods [43], were well-designed
enhanced HAR. Note that in this survey, we mainly focus on for RGB video-based HAR. Recently, with the great progress
HAR from short and trimmed video segments, i.e., each video of deep learning techniques, various deep learning architec-
contains only one action instance. We also report and discuss tures have also been proposed. Due to the strong representa-
the benchmark datasets. The main contributions of this review tion capability and superior performance of deep learning-
are summarized as follows. based methods, current mainstream research in this field

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3202 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

TABLE 2 TABLE 3
Performance Comparison of RGB Video-Based Deep Performance of Skeleton-Based Deep Learning HAR Methods
Learning Methods for HAR on the UCF101, HMDB51, on NTU RGB+D and NTU RGB+D 120 Datasets
and Kinectis-400 Datasets

‘CS’, ‘CV,’ and ‘CP’ denote Cross-Subject, Cross-View, and Cross-Setup eval-
uation criteria.
Note in RGB video-based HAR methods, some other information, e.g., optical
flow or motion vector, may also be used along with the RGB data as input for
HAR. For simplicity, ‘%’ after the value is omitted. ‘-’ indicates the result is as shown in Fig. 1a. In this section, we review the classic
unavailable. ‘Flow’ denotes optical flow. two-stream methods [44], [45] and also their extensions [46].
As a classic two-stream framework, Simonyan and Zis-
focuses on designing different types of deep learning frame- serman [44] proposed a two-stream CNN model consisting
works. Consequently, we review, in the following, the of a spatial network and a temporal network. More specifi-
advanced deep learning works for RGB-based HAR, which cally, given a video, each individual RGB frame and multi-
can be mainly divided into three categories, namely, two- frame-based optical flows were fed to the spatial stream
stream 2D Convolutional Neural Network (CNN), Recurrent and temporal stream, respectively. Hence, appearance fea-
Neural Network (RNN) and 3D CNN-based methods, as tures and motion features were learned by these two
shown in Table 2. In particular, Section 2.1.1 mainly reviews streams for HAR. Finally, the classification scores of these
the two-stream methods and their extensions (e.g., multi- two streams were fused to generate the final classification
stream architectures), which use 2D CNNs as their backbone result. In another classic work, Karpathy et al. [45] fed low-
models. Section 2.1.2 mainly reviews RGB-based HAR meth- resolution RGB frames and high-resolution center crops to
ods which mainly employ RNN models together with 2D two separate streams to speed up the computation. Differ-
CNNs as feature extractors. In Section 2.1.3, we review the 3D ent fusion strategies were investigated to model the tempo-
CNN-based methods and their extensions. ral dynamics in videos. Several studies endeavored to
extend and improve over these classic two-stream CNNs,
and they are reviewed below.
2.1.1 Two-Stream 2D CNN-Based Methods To produce better video representations for HAR, Wang
As the name suggests, the two-stream 2D CNN framework et al. [47] fed multi-scale video frames and optical flows to a
generally contains two 2D CNN branches taking different two-stream CNN to extract convolutional feature maps,
input features extracted from the RGB videos for HAR, and which were then sampled over the spatio-temporal tubes
the final result is usually obtained through fusion strategies, centered at the extracted trajectories. Finally, the resulting

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3203

TABLE 4 TABLE 5
Performance of Depth, Infrared, Point Cloud, and Event Performance Comparison of Multi-Modality Fusion-Based HAR
Stream-Based Deep Learning HAR Methods Methods on MSRDailyActivity3D (M), UTD-MHAD (U),
and NTU RGB+D (N) Datasets

The datasets are MSRDailyActivity3D (M) [237], Northwestern-UCLA (N-


UCLA) [238], UWA3D Multiview II (UWA3D II) [239], NTU RGB+D (N)
[161], InfAR [30], MSR-Action3D [240], DvsGesture [241], and DHP19 [32].

features were aggregated using Fisher Vector representation


[48] followed by a SVM for HAR. Cheron et al. [49] used the
positions of human body joints to crop multiple human
body parts from the RGB and optical flow images, which
were passed through a two-stream network for feature
extraction followed by a SVM classifier for HAR.
There are also several other works [50], [51], [52], [53], S: Skeleton, D: Depth, IR: Infrared, PC: Point Cloud, Ac: Acceleration, Gyr:
which extended the two-stream framework to extract Gyroscope
long-term video-level information for HAR. Wang et al.
[50] divided each video into three segments and proc- information were then fed to a multi-stream framework
essed each segment with a two-stream network. The clas- for HAR. Dynamic images have also been adopted by
sification scores of the three segments were then fused by some other works as representations of videos for HAR
an average pooling method to produce the video-level [56], [57], [58]. Besides, the two-stream network has also
prediction. Instead of fusing the scores of the segments as been extended to a two-stream Siamese network to extract
in [50], Diba et al. [52] aggregated the features of the seg- features from the frames before an action happens (pre-
ments by element-wise multiplication. Girdhar et al. [51] condition) and the frames after the action (effect), and the
extracted features from sampled appearance and motion action was then represented as a transformation between
frames based on a two-stream framework, and used the the two sets of features [59].
vocabulary of “action words” to aggregate the features To tackle the high computational cost of computing
into a single video-level representation for classification. accurate optical flow, some works [60] aimed to mimic the
Feichtenhofer et al. [53] multiplied the appearance resid- knowledge of the flow stream during training, in order to
ual features with the motion information at the feature avoid the usage of optical flow during testing. Zhang et al.
level for gated modulation. Zong et al. [54] extended the [60] proposed a teacher-student framework, which trans-
two-stream CNN in [53] to a three-stream CNN by add- fers the knowledge from the teacher network trained on
ing the motion saliency stream to better capture the optical flow data to the student network trained on
salient motion information. Bilen et al. [46] constructed motion vectors which can be obtained from compressed
dynamic images from RGB sequences and optical flow videos without extra calculation. Specifically, the knowl-
sequences to summarize the long-term global appearance edge was transferred by using the soft labels produced by
and motion via rank pooling [55]. The dynamic images the teacher model as an additional supervision to train
together with the original RGB images and optical flow the student network. Different from [60], Piergiovanni

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3204 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

Fig. 2. Illustration of deep learning frameworks for skeleton-based HAR.


(a) Skeleton sequences can be processed by RNN and LSTM as time
series. (b) Skeleton sequences can be converted to 2D pseudo-images
and then be fed to 2D CNNs for feature learning. (c) The joint depen-
dency structure can naturally be represented via a graph structure, and
thus GCN models are also suitable for this task. (a)-(c) are originally
shown in [152], [160], [175].
Fig. 1. Illustration of RGB-based deep learning methods for HAR. (c) and
(d) are originally shown in [66].
As shown in Fig. 1b, RNN-based methods usually employ
and Ryoo [61] proposed a trainable flow layer which cap- 2D CNNs, which serve as feature extractors, followed by an
tures the motion information without the need of comput- LSTM model for HAR. Donahue et al. [7] introduced the
ing optical flows. Long-term Recurrent Convolutional Network (LRCN),
There have also been some works extending the two- which consists of a 2D CNN to extract frame-level RGB fea-
stream CNN architectures in other aspects. Wang et al. [62] tures followed by LSTMs to generate a single action label.
found that many two-stream CNNs were relatively shallow, Ng et al. [71] extracted frame-level RGB and “flow image”
and thus, they designed very deep two-stream CNNs for features from pre-trained 2D CNNs, and then fed these fea-
achieving better recognition results. Considering many tures to a stacked LSTM framework for HAR. In the work of
frames of a video sequence could be irrelevant or useless for [72], an encoder LSTM was used to map an input video into
HAR, Kar et al. [63] recursively predicted the discriminative a fixed-length representation, which was then decoded
importance of each frame for feature pooling. To perform through a decoder LSTM to perform the tasks of video
HAR on low spatial resolution videos, Zhang et al. [64] pro- reconstruction and prediction in an unsupervised manner.
posed two video super-resolution methods producing high Wu et al. [73] leveraged two LSTMs, which operate on
resolution videos, which were fed to the spatial and tempo- coarse-scale and fine-scale CNN features cooperatively for
ral streams to predict the action class. In the work of [65], efficient HAR. Majd and Safabakhsh [74] proposed a
several fusion strategies were studied, showing that it is C 2 LSTM which incorporates convolution and cross-corre-
effective to fuse the spatial and temporal networks at the lation operators to learn motion and spatial features while
last convolution layer hence reducing the number of param- modeling temporal dependencies. Some other works [75],
eters, yet keeping the accuracy. [76], [77] adopted the Bi-directional LSTM, which consists
The two-stream 2D CNN architectures, that learn differ- of two independent LSTMs to learn both the forward and
ent types of information (e.g., spatial and temporal) from backward temporal information, for HAR.
the input videos through separate networks and then per- The introduction of attention mechanisms, in terms of
form fusion to get the final result, enable the traditional 2D spatial attention [78], [79], [80], [81], temporal attention [82],
CNNs to effectively handle the video data and achieve high [83], and both spatial and temporal attention [84], [85], have
HAR accuracy. However, this type of architectures is still also benefited the LSTM-based frameworks to achieve bet-
not powerful enough for long-term dependency modeling, ter HAR performance. Sharma et al. [78] designed a multi-
i.e., it has limitations in effectively modeling the video-level layer LSTM model, which recursively outputs attention
temporal information, while temporal sequence modeling maps weighting the input features of the next frame to focus
networks, such as LSTM, can make up for it. on the important spatial features, leading to better recogni-
tion performance. Sudhakaran et al. [80] introduced a recur-
rent unit with built-in spatial attention to spatially localize
2.1.2 RNN-Based Methods the discriminative information across a video sequence.
RNNs can be used to analyze temporal data due to the LSTM has also been used to learn temporal attention to
recurrent connections in their hidden layers. However, the weight features of all frames [82]. Li et al. [85] proposed a
traditional vanilla RNN suffers from the vanishing gradient Video-LSTM, which incorporates convolutions and motion-
problem, making it incapable of effectively modeling the based attention into the soft-attention LSTM [86], to better
long-term temporal dependency. Thus, most of the existing capture both spatial and motion information.
methods have adopted gated RNN architectures, such as Besides using LSTM, some works [87], [88], [89], [90]
Long-Short Term Memory (LSTM) [67], [68], [69], [70], to have used Gated Recurrent Units (GRUs) [91] for HAR,
model the long-term temporal dynamics in video which can also alleviate the vanishing gradient problem of
sequences. the vanilla RNN. Compared to LSTM, GRU has fewer gates,

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3205

leading to fewer model parameters, yet it can usually pro- and pooling kernels of a 2D CNN with an additional tempo-
vide a similar performance of LSTM for HAR [92]. ral dimension. Wang et al. [98] integrated a two-stream 3D
Several other works [76], [93], [94] focused on the hybrid CNN with an LSTM model to capture the long-range tem-
architectures combining two-stream 2D CNN and RNN poral dependencies. Feichtenhofer et al. [112] designed a
models for HAR. For example, Wu et al. [93] utilized two- two-stream 3D CNN framework containing a slow pathway
stream 2D CNN models to extract spatial and short-term and a fast pathway that operate on RGB frames at low and
motion features, which were fed to two LSTMs respectively high frame rates to capture semantic and motion, respec-
to model longer-term temporal information for HAR. tively. Inspired by [113], the features of the fast pathway
Generally, most of the RNN-based methods operate on were fused (by summation or concatenation) with the fea-
the CNN features of frames instead of the high-dimensional tures of the slow pathway at each layer. Li et al. [114] intro-
frame images for HAR. Instead of using 2D CNN, some duced a two-stream spatio-temporal deformable 3D CNN
other works [95], [96], [97], [98] used 3D CNNs as their fea- with attention mechanisms to capture the long-range tem-
ture extractors. In the following section, we review the 3D poral and long-distance spatial dependencies. Besides, deep
CNN-based methods. learning frameworks combining 2D CNNs and 3D CNNs
have also been investigated for HAR [115], [116], [117]. For
example, the ECO architecture [115] uses 2D CNNs to
2.1.3 3D CNN-Based Methods extract spatial features, which are stacked and then fed to
Plenty of researches [66], [99], [100], [101] have extended 2D 3D CNNs to model long-term dependencies for HAR.
CNNs (Fig. 1c) to 3D structures (Fig. 1d), to simultaneously Some works focused on addressing certain other prob-
model the spatial and temporal context information in vid- lems in 3D CNNs. For example, Zhou et al. [118] analyzed
eos that is crucial for HAR. As one of the earliest works, Ji the spatio-temporal fusion in 3D CNN from a probabilistic
et al. [99] segmented human subjects in videos by utilizing a perspective. Yang et al. [119] proposed a generic feature-
human detector method, and then fed the segmented videos level Temporal Pyramid Network (TPN) to model speed
to a novel 3D CNN model to extract spatio-temporal fea- variations in performing actions. Kim et al. [120] proposed a
tures from videos. Unlike [99], Tran et al. [66] introduced a Random Mean Scaling (RMS) regularization method to
3D CNN model, dubbed C3D, to learn the spatio-temporal address the problem of overfitting. The viewpoint variation
features from raw videos in an end-to-end learning frame- problem has also been investigated [121], [122]. Inspired by
work. However, these networks were mainly used for clip- [123], Varol et al. [122] proposed to train 3D CNN on syn-
level learning (e.g., 16 frames in each clip) instead of learn- thetic action videos to address the problem of variations in
ing from full videos, which thus ignored the long-range spa- viewpoints. Piergiovanni and Ryoo [121] proposed a geo-
tio-temporal dependencies in videos. Hence, several metric convolutional layer to learn view-invariant represen-
approaches focused on modeling the long-range spatio-tem- tations of actions by imposing the latent action
poral dependencies in videos. For example, Diba et al. [102] representations to follow 3D geometric transformations and
extended DenseNet [103] with 3D filters and pooling ker- projections. Knowledge distillation has also been investi-
nels, and designed a Temporal 3D CNN (T3D), where the gated to improve motion representations of 3D CNN frame-
temporal transition layer can model variable temporal con- works. For example, Stroud et al. [124] introduced a
volution kernel depths. T3D can densely and efficiently cap- Distilled 3D Network (D3D) consisting of a student network
ture the appearance and temporal information at short, and a teacher network, where the student network was
middle, and also long terms. In their subsequent study trained on RGB videos and it also distilled knowledge from
[104], they introduced a new block embedded in some archi- the teacher network that was trained on optical flow
tectures such as ResNext and ResNet, which can model the sequences. Crasto et al. [125] transferred the knowledge of a
inter-channel correlations of a 3D CNN with respect to the teacher model trained on the optical flows to a 3D CNN stu-
temporal and spatial features. Varol et al. [105] proposed a dent network operating on RGB videos, by minimizing the
Long-term Temporal Convolution (LTC) framework, which mean squared error between the feature maps of the two
increases the temporal extents of 3D convolutional layers at streams. Following the research line of handling the large
the cost of reducing the spatial resolution, to model the computational cost of obtaining optical flow, Shou et al.
long-term temporal structure. Hussein et al. [106] proposed [126] proposed an adversarial framework with a light-
multi-scale temporal-only convolutions, dubbed Timecep- weight generator to approximate flow information by refin-
tion, to account for large variations and tolerate a variety of ing the noisy and coarse motion vector available in the
temporal extents in complex and long actions. Wang et al. compressed video. Differently, Wang et al. [127] adopted an
[107] proposed a non-local operation, which models the cor- efficient learnable correlation operator to better learn
relations between any two positions in the feature maps to motion information from 3D appearance features. Fayyaz
capture long-range dependencies. Besides, Li et al. [101] pro- et al. [128] addressed the problem of dynamically adapting
posed a Channel Independent Directional Convolution the temporal feature resolution within the 3D CNNs to
(CIDC) which can be attached to I3D [108] to better capture reduce their computational cost. A Similarity Guided Sam-
the long-term temporal dynamics of the full input video. pling (SGS) module was proposed to enable 3D CNNs to
To enhance the HAR performance, several other works dynamically adapt their computational resources by select-
[108], [109], [110], [111], [112] have investigated 3D CNN ing the most informative and distinctive temporal features.
models with two-stream or multi-stream designs. For exam- The 3D CNN-based methods are very powerful in
ple, Carreira and Zisserman [108] introduced the two- modeling discriminative features from both the spatial and
stream Inflated 3D CNN (I3D) inflating the convolutional temporal dimensions for HAR. However, many 3D CNN-

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3206 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

based frameworks contain a large number of parameters, variations. Meanwhile, motion capture systems that are
and thus require a large amount of training data. Therefore, insensitive to view and lighting can provide reliable skele-
some studies [109], [129], [130], [131], [132] aimed to factor- ton data. However, in many application scenarios, it is not
ize 3D convolutions. Sun et al. [129] proposed a Factorized convenient to deploy motion capture systems. Thus many
spatio-temporal CNN (Fst CN), factorizing the 3D convolu- recent works on skeleton-based HAR used skeleton data
tion to 2D spatial convolutional layers followed by 1D tem- obtained from depth maps [151] or RGB videos [152].
poral convolutional layers. Similarly, Qiu et al. [130] There are many advantages of using skeleton data for
decomposed the 3D convolution into a 2D convolution for HAR, due to its provided body structure and pose informa-
the spatial domain, followed by a 1D convolution for the tion, its essentially simple and informative representation,
temporal domain, to economically and effectively simulate its scale invariance, and its robustness against variations of
3D convolutions. Xie et al. [131] explored the performance clothing textures and backgrounds. Due to these advantages
of several variants of I3D [108] by utilizing a combination of and also the availability of accurate and low-cost depth sen-
3D and 2D convolutional filters in the I3D network, and sors, skeleton-based HAR has attracted much attention in
introduced temporally separable convolutions and spatio- the research community recently.
temporal feature gating to enhance HAR. Yang et al. [133] Early works focused on extracting hand-crafted spatial
proposed efficient asymmetric one-directional 3D convolu- and temporal features from skeleton sequences for HAR.
tions to approximate the traditional 3D convolution. The hand-crafted feature-based methods can be roughly
Besides, some other works [134], [135], [136], [137] also uti- divided into joint-based [153] and body part-based [154]
lized 2D CNNs with powerful temporal modules to methods, depending on the used feature extraction techni-
decrease the computational cost of 3D CNNs. Lin et al. [134] ques. Due to the strong feature learning capability, deep
proposed a Temporal Shift Module (TSM), which shifts a learning has been widely used for skeleton-based HAR, and
part of the channels along the temporal dimension to per- has become the mainstream research in this field. Conse-
form temporal interaction between the features from adja- quently, we review, in the following, the deep learning
cent frames. Unlike parameter-free temporal shift methods, which can be mainly divided into three categories:
operations in [134], Sudhakaran et al. [135] introduced a RNN, CNN, and Graph Neural Network (GNN) or Graph
lightweight Gate-Shift Module (GSM), which uses learnable Convolutional Network (GCN)-based methods, as shown in
spatial gating blocks for spatial-temporal decomposition of Fig. 2. Table 3 shows the results achieved by different skele-
3D convolutions. Wang et al. [136] designed a two-level ton-based HAR methods on two benchmark datasets.
Temporal Difference Module (TDM) to capture both finer
local and long-range global motion information. Wang et al.
[137] introduced an ACTION module leveraging multipath 2.2.1 RNN-Based Methods
excitation to respectively model spatial-temporal, channel- As mentioned in Section 2.1.2, RNNs and their gated var-
wise, and motion patterns. iants (e.g., LSTMs) are capable of learning the dynamic
Apart from the above-mentioned architectures, i.e., two- dependencies in sequential data. Hence, various methods
stream 2D CNN, RNN, and 3D CNN, there are also some [155], [156], [157], [158] have applied and adapted RNNs
other frameworks designed for HAR using RGB videos, and LSTMs to effectively model the temporal context infor-
such as Convolutional Gated Restricted Boltzmann Mac- mation within the skeleton sequences for HAR.
hines [138], Graph-based Modeling [139], [140], Transform- In one of the classical methods, Du et al. [155] proposed
ers [141], and 4D CNNs [142]. an end-to-end hierarchical RNN, dividing the human skele-
In general, RGB data is the most commonly used modal- ton into five body parts instead of inputting the skeleton in
ity for HAR in real-world application scenarios, since it is each frame as a whole. These five body parts were then sep-
easy to collect and contains rich information. Besides, RGB- arately fed to multiple bidirectional RNNs, whose output
based HAR methods can leverage large-scale web videos to representations were hierarchically fused to generate high-
pre-train models for better recognition performance [143], level representations of the action. Differential RNN
[144]. However, it often requires complex computations for (dRNN) [159] learns salient spatio-temporal information by
feature extraction from RGB videos. Moreover, HAR meth- quantifying the change of the information gain caused by
ods based on the RGB modality are often sensitive to view- the salient motions between frames. The Derivative of States
point variations and background clutters, etc. Hence, (DoS) was proposed inside the LSTM unit to act as a signal
recently, HAR with other modalities, such as 3D skeleton controlling the information flow into and out of the internal
data, has also received great attention, and is therefore dis- state over time. Zhu et al. [160] introduced a novel mecha-
cussed in the subsequent sections. nism for the LSTM network to achieve automatic co-occur-
rence mining, since co-occurrence intrinsically characterizes
the actions. Sharoudy et al. [161] proposed a Part-aware
2.2 Skeleton Modality LSTM (P-LSTM) by introducing a mechanism to simulate
Skeleton sequences encode the trajectories of human body the relations among different body parts inside the LSTM
joints, which characterize informative human motions. unit. Liu et al. [8], [162] extended the RNN design to both
Therefore, skeleton data is also a suitable modality for the temporal and spatial domains. Specifically, they utilized
HAR. The skeleton data can be acquired by applying pose the tree structure-based skeleton traversal method to further
estimation algorithms on RGB videos [150] or depth maps exploit the spatial information, and trust gates to deal with
[5]. It can also be collected with motion capture systems. noise and occlusions. In their subsequent study [163], an
Generally, human pose estimation is sensitive to viewpoint attention-based LSTM network named Global Context-

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3207

Aware Attention LSTM (GCA-LSTM) was proposed to the joints in the frames, and Fourier Temporal Pyramids [9]
selectively focus on the informative joints using the global were utilized to exploit the macro-temporal relations.
context information. The GCA-LSTM network contains two Plenty of researches focused on addressing certain spe-
LSTM layers, where the first layer encodes the skeleton cific problems. For example, the works of [171], [178], [179]
sequence and outputs a global context memory, while the focused on handling the viewpoint variation issue. Since
second layer outputs attention representations to refine the features learned from skeleton data are not always transla-
global context. Finally, a softmax classifier outputs the HAR tion, scale, and rotation invariant, several methods [180],
result. In the work of [164], a two-stream RNN structure [181] have also been designed to handle these issues. CNN
was proposed to model both the temporal dynamics and architectures are often complex, resulting in high computa-
spatial configurations. Deep LSTM with spatio-temporal tional costs. Thus Yang et al. [182] proposed a Double-fea-
attention was proposed in [165], where a spatial attention ture Double-motion Network (DD-Net) to make the CNN-
sub-network and a temporal attention sub-network work based recognition models run faster. Additionally, Tang
jointly under the main LSTM network. To model variable et al. [183] proposed a self-supervised learning framework
temporal dynamics of skeleton sequences, Lee et al. [166] under the unsupervised domain adaptation setting, which
proposed an ensemble Temporal Sliding LSTM (TS-LSTM) segments and permutes the time segments or body parts to
framework composed of multiple parts, containing short- reduce domain-shift and improve the generalization ability
term, medium-term, and long-term TS-LSTM networks. of the model.
IndRNN [167] not only addresses the gradient vanishing
and exploding issues, but it is also faster than the original
LSTM. 2.2.3 GNN or GCN-Based Methods
Due to the expressive power of graph structures, analyzing
graphs with learning models have received great attention
2.2.2 CNN-Based Methods recently [184], [185]. As shown in Fig. 2c, skeleton data is
CNNs have achieved great success in 2D image analysis naturally in the form of graphs. Hence, simply representing
due to their superior capability in learning features in the skeleton data as a vector sequence processed by RNNs, or
spatial domain. However, when facing skeleton-based 2D/3D maps processed by CNNs, cannot fully model the
HAR, the modeling of spatio-temporal information becomes complex spatio-temporal configurations and correlations of
a challenge. Nevertheless, plenty of advanced approaches the body joints. This indicates that topological graph repre-
have been proposed, which apply temporal convolution on sentations can be more suitable for representing the skele-
skeleton data [168], or represent skeleton sequences as ton data. As a result, many GNN and GCN-based HAR
pseudo-images that are then fed to standard CNNs for methods [152], [186] have been proposed to treat the skele-
HAR [169], [170]. These pseudo-images were designed to ton data as graph structures of edges and nodes.
simultaneously encode spatial structure information in each GNN is a connection model capturing the dependence
frame and temporal dynamic information between frames. within a graph through message passing between nodes. Si
Hou et al. [169] and Wang et al. [170] respectively pro- et al. [186] proposed a spatial reasoning network, followed
posed the skeleton optical spectra and the joint trajectory by RNNs, to capture the high-level spatial structural and
maps. They both encoded the spatio-temporal information temporal dynamics of skeleton data. Shi et al. [187] repre-
of skeleton sequences into color-texture images, and then sented the skeleton as a directed acyclic graph to effectively
adopted CNNs for HAR. As an extension of these works, incorporate both the joint and bone information, and uti-
the Joint Distance Map (JDM) [171] produces view-invariant lized a GNN to perform HAR.
color-texture images encoding pair-wise distances of skele- More recently, GCN-based HAR has become a hot research
ton joints. Ke et al. [172] transformed each skeleton sequence direction [188], [189], [190], [191], [192]. Yan et al. [152]
into three “video clips”, which were then fed to a pre- exploited GCNs for skeleton-based HAR by introducing Spa-
trained CNN to produce a compact representation, followed tial-Temporal GCNs (ST-GCNs) that can automatically learn
by a multi-task learning network for HAR. Kim and Reiter both the spatial and temporal patterns from skeleton data, as
[168] utilized the Temporal CNN (TCN) [173] explicitly pro- shown in Fig. 2c. More specifically, the pose information was
viding a type of interpretable spatio-temporal representa- estimated from the input videos and then passed through the
tions. An end-to-end framework to learn co-occurrence spatio-temporal graphs to achieve action representations with
features with a hierarchical methodology was proposed in strong generalization capabilities for HAR. Since implicit joint
[174]. Specifically, they learned the point-level features of correlations have been ignored by previous works [152], Li
each joint independently, and then utilized these features as et al. [193] further proposed an Actional-Structural GCN (AS-
a channel of the convolutional layer to learn hierarchical co- GCN), combining actional links and structural links into a gen-
occurrence features, and a two-stream framework was eralized skeleton graph. The actional links were used to cap-
adopted to fuse the motion features. Caetano et al. intro- ture action-specific latent dependencies and the structural
duced SkeleMotion [175] and the Tree Structure Reference links were used to represent higher-order dependencies. To
Joints Image (TSRJI) [176] as representations of skeleton better explore the implicit joint correlations, Peng et al. [194]
sequences for HAR. In [177], Skepxels were utilized as basic determined their GCN architecture via a neural architecture
building blocks to construct skeletal images mainly encod- search scheme. Specifically, they enriched the search space to
ing the spatio-temporal information of human joint loca- implicitly capture the joint correlations based on multiple
tions and velocities. Besides, Skepxels were used to dynamic graph sub-structures and higher-order connections
hierarchically capture the micro-temporal relations between with a Chebyshev polynomial approximation. Besides, context

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3208 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

information integration was used to effectively model long- of devices have been developed to obtain depth images, which
range dependencies in the work of [195]. include active sensors (e.g., Time-of-Flight and structured-
Shi et al. [196] proposed a two-stream Adaptive GCN (2s- light-based cameras) and passive sensors (e.g., stereo cameras)
AGCN), where the topology of graphs can be either uni- [19]. Active sensors emit radiation in the direction towards the
formly or individually learned by the back-propagation objects in the scene, and then measure the reflected energy
algorithm, instead of setting it manually. The 2s-AGCN from the objects to acquire the depth information. In contrast,
explicitly combined the second-order information (lengths passive sensors measure the natural energy that is emitted or
and directions of human bones) of the skeleton with the reflected by the objects in the scene. For example, as a type of
first-order information (coordinates of joints). Wu et al. [197] passive sensor, stereo cameras usually have two or more
introduced a cross-domain spatial residual layer to capture lenses to simulate the binocular vision of human eyes to
the spatio-temporal information, and a dense connection acquire depth information, and depth maps are recovered by
block to learn global information based on ST-GCN. Li et al. seeking image point correspondences between stereo pairs
[198] fed frame-wise skeleton and node trajectories of a skel- [214]. Compared to active sensors (e.g., Kinect and Real-
eton sequence to a spatial graph router and a temporal Sense3D), passive depth map generation is usually computa-
graph router to generate new skeleton-joint-connectivity tionally expensive, and moreover, depth maps obtained with
graphs, followed by a ST-GCN for classification. Liu et al. passive sensors can be ineffective in texture-less regions or
[199] integrated a disentangled multi-scale aggregation highly textured regions with repetitive patterns.
scheme and a spatio-temporal graph convolutional operator In this section, we review methods that used depth maps
named G3D to achieve a powerful feature extractor. High- for HAR. Note that only a few works [215], [216] used depth
level semantics of joints were introduced for HAR in [200]. maps captured by stereo cameras for HAR, while most of
Attention mechanisms were applied for extracting discrimi- the other methods [9], [56], [217] focused on using depth
native information and global dependencies in [201], [202]. videos captured by active sensors to recognize actions, due
Besides, to reduce computational costs of GCNs, a Shift- to the availability of low-cost and reliable active sensors like
GCN was designed by Cheng et al. [203], which adopts shift Kinect. Consequently, here we focus on reviewing the meth-
graph operations and lightweight point-wise convolutions, ods using depth videos captured with active sensors for
instead of using heavy regular graph convolutions. Follow- HAR. Table 4 shows the results of some depth-based HAR
ing this research line, Song et al. [204] proposed a multi- methods on several benchmark datasets.
stream GCN model, which fuses the input branches includ- Compared to hand-crafted feature-based methods [218],
ing joint positions, motion velocities, and bone features at [219], deep learning models [9], [56] have shown to be more
early stage, and utilized separable convolutional layers and powerful and achieved better performance for HAR from
a compound scaling strategy to extremely reduce the redun- depth maps. Inspired by the hand-crafted Depth Motion Map
dant trainable parameters while increasing the capacity of (DMM) features [220], a deep learning framework utilizing
model. Different from the above mentioned methods, Li weighted hierarchical DMMs was proposed in [221]. As
et al. [205] proposed symbiotic GCNs to handle both action DMMs are not able to capture detailed temporal information,
recognition and motion prediction tasks simultaneously. Wang et al. [56] proposed to represent depth sequences with
The proposed Sym-GNN consists of a multi-branch multi- three pairs of structured dynamic images [55] at the body,
scale GCN followed by an action recognition and a motion body part, and joint levels, which were then fed to CNNs fol-
prediction heads, to jointly conduct action recognition and lowed by a score fusion module for fine-grained HAR.
motion prediction tasks. This allows the two tasks to The performance of this method was further improved in
enhance each other. [217] by introducing three representations of depth maps,
In summary, the skeleton modality provides the body including dynamic depth images, dynamic depth normal
structure information, which is simple, efficient, and infor- images, and dynamic depth motion normal images. Rah-
mative for representing human behaviors. Nevertheless, mani et al. [9] transferred the human data obtained from dif-
HAR using skeleton data still faces challenges, due to its ferent views to a view-invariant high-level space to address
very sparse representation, the noisy skeleton information, the view-invariant HAR problem. Specifically, they utilized
and the lack of shape information that can be important a CNN model to learn the view-invariant human pose
when handling human-object interactions. Hence, some of model and Fourier Temporal Pyramids to model the tempo-
the existing works on HAR also focused on using depth ral action variations. To obtain more multi-view training
maps, as discussed in the following section, since depth data, synthetic data was generated by fitting synthetic 3D
maps not only provide the 3D geometric structural informa- human models to real motion capture data, and rendering
tion but also conserve the shape information. the human data from various viewpoints. In [57], multi-
view dynamic images were extracted through multi-view
projections from depth videos for action recognition. In
2.3 Depth Modality order to effectively capture spatio-temporal information in
Depth maps refer to images where the pixel values represent depth videos, Sanchez et al. [222] proposed a 3D fully CNN
the distance information from a given viewpoint to the points architecture for HAR. In their subsequent work [223], a vari-
in the scene. The depth modality, which is often robust to var- ant of LSTM unit was introduced to address the problem of
iations of color and texture, provides reliable 3D structural memory limitation during video processing, which can be
and geometric shape information of human subjects, and thus used to perform HAR from long and complex videos.
can be used for HAR. The essence of constructing a depth map Note that besides using depth maps obtained with active
is to convert the 3D data into a 2D image, and different types sensors or stereo cameras for HAR, there has been another

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3209

approach designed for depth-based HAR from RGB videos. as inputs, and the joint discriminator tried to discriminate
Specifically, Zhu and Newsam [224] estimated the depth the real thermal data and optical flow from the recon-
maps from RGB videos using existing depth estimation tec- structed ones.
hniques [225], [226], which were then passed through a In all, infrared data has been used for HAR, especially for
deep learning architecture for action classification. night HAR. However, infrared images may suffer from a
In general, the depth modality provides geometric shape relatively low contrast and a low signal-to-noise ratio, mak-
information that is useful for HAR. However, the depth ing it challenging for robust HAR in some scenarios.
data is often not used alone, due to the lack of appearance
information that can also be helpful for HAR in some sce-
narios. Thus, many works focused on fusing depth informa- 2.5 Point Cloud Modality
tion with other data modalities for enhanced HAR, and Point cloud data is composed of a numerous collection of
more details can be found in Section 3. points that represent the spatial distribution and surface
characteristics of the target under a spatial reference system.
There are two main ways to obtain 3D point cloud data,
2.4 Infrared Modality namely, (1) using 3D sensors, such as LiDAR and Kinect, or
Generally, infrared sensors do not need to rely on external (2) using image-based 3D reconstruction. As a 3D data
ambient light, and thus are particularly suitable for HAR at modality, point cloud has great power to represent the spa-
night. Infrared sensing technologies can be divided into tial silhouettes and 3D geometric shapes of the subjects, and
active and passive ones. Some infrared sensors, such as Kin- hence can be used for HAR.
ect, rely on active infrared technology, which emit infrared Early methods [239], [253] focused on extracting hand-
rays and utilize target reflection rays to perceive objects in crafted spatio-temporal descriptors from the point cloud
the scene. In contrast, thermal sensors relying on passive sequences for HAR, while the current mainstream research
infrared technology do not emit infrared rays. Instead, they focuses on deep learning architectures [246], [247] that gen-
work by detecting rays (i.e., heat energy) emitted from erally achieve better performance. Wang et al. [11] trans-
targets. formed the raw point cloud sequence into regular voxel
Some deep learning methods [30], [227] have been pro- sets. Then temporal rank pooling [55] was applied on all the
posed for HAR from infrared data. In [228], the extremely voxel sets to encode 3D action information into one single
low-resolution thermal images were first cropped by the voxel set. Finally, the voxel representation was abstracted
gravity center of human regions. The cropped sequences and passed through the PointNet++ model [254] for 3D
and the frame differences were then passed through a CNN HAR. However, converting point clouds into voxel repre-
followed by an LSTM layer to model spatio-temporal infor- sentations causes quantization errors and inefficient proc-
mation for HAR. To simultaneously learn both the spatial essing performance. Instead of quantizing the point clouds
and temporal features from thermal videos, Shah et al. [229] into voxels, Liu et al. [242] proposed MeteorNet, which
utilized a 3D CNN and achieved real-time HAR. Instead of directly stacks multi-frame point clouds and calculates local
using raw thermal images, Meglouli et al. [230] passed the features by aggregating information from spatio-temporal
optical flow information computed from thermal sequences neighboring points. Unlike MeteorNet [242], which learns
into a 3D CNN for HAR. spatio-temporal information by appending 1D temporal
Inspired by the two-stream CNN model [44], several dimension to 3D points, PSTNet [247] disentangles the
multi-stream architectures [10], [30], [231] have also been space and time to reduce the impacts of the spatial irregu-
proposed. Gao et al. [30] passed optical flow and optical larity of points on temporal modeling. Fan et al. [246] intro-
flow motion history images (OF-MHIs) [232] through a two- duced a point 4D convolution, followed by a transformer to
stream CNN for HAR. In [10], both infrared data and optical capture the global appearance and motion information
flow volumes were fed to a two-stream 3D CNN framework across the entire point cloud video. In the work of [245], a
to effectively capture the complementary information on self-supervised learning framework was introduced to learn
appearance from still thermal frames and motion between 4D spatio-temporal information from the point cloud
frames. The two-stream model was trained using a combi- sequence by predicting the temporal orders of 4D clips
nation of cross-entropy and label consistency regularization within the sequence. A 4D CNN model [242], [255] followed
loss in [233]. To better represent the global temporal infor- by an LSTM was utilized to predict the temporal order.
mation, Optical flow motion history images (OF-MHIs) Finally, the 4D CNN+LSTM network was fine-tuned on the
[232], optical flow and a stacked difference of optical flow existing HAR dataset to evaluate its performance. Wang
images, were fed to a three-stream CNN for HAR [231]. et al. [244] proposed an anchor-based spatio-temporal atten-
Imran and Raman [234] proposed a four-stream architec- tion convolution model to capture the dynamics of 3D point
ture, where each stream consists of a CNN followed by an cloud sequences. However, these methods are not able to
LSTM. The local/global stacked dense flow difference sufficiently capture the long-term relationships within point
images [235] and local/global Stacked saliency difference cloud sequences. To address this issue, Min et al. [243] intro-
images were fed to these four streams to capture both the duced a modified version of LSTM unit named PointLSTM
local and global spatio-temporal information in videos. to update state information for neighbor point pairs to per-
Mehta et al. [236] presented an adversarial framework con- form HAR. The recent work [248] proposed a Point Spatial-
sisting of a two-stream 3D convolutional auto-encoder as Temporal Transformer (PST2 ) for processing point cloud
the generator and two 3D CNNs as the joint discriminator. sequences. A self-attention-based module, called Spatio-
The generator network took thermal data and optical flow Temporal Self-Attention (STSA), was introduced to capture

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3210 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

the spatial-temporal context information, which can be lev- basically larger than the original Neuromorphic Vision
eraged for action recognition in 3D point clouds. Stream (NVS), the advantages of event cameras are diluted.
In all, point clouds can effectively capture the 3D shapes Therefore, different from the aforementioned methods,
and silhouettes of subjects, and can be used for HAR. Gen- some other deep learning approaches [12], [249], [251],
erally, 3D point cloud-based HAR methods can be insensi- [260], [261] have been proposed, which directly take the
tive to viewpoint variations, since viewpoint normalization event data as input of deep neural networks. Ghosh et al.
can be conveniently performed by rotating the point cloud [12] learned a set of 3D spatio-temporal convolutional filters
in the 3D space. However, point clouds often suffer from in an unsupervised manner to generate a structured matrix
the presence of noise and high non-uniformity distribution form of the raw event data, which was then fed to a 3D
of points, making it challenging for robust HAR. In addi- CNN for HAR. George et al. [260] utilized Spiking Neural
tion, processing all the points within the point cloud Networks (SNNs) for event stream-based HAR. The work
sequence is often computationally expensive. in [249] treated the event stream as a 3D point cloud, which
was then fed to a PointNet [254] for gesture recognition.
Recently, Bi et al. [261] represented events as graphs and
2.6 Event Stream Modality used a GCN network for an end-to-end feature learning
Event cameras, also known as neuromorphic cameras or directly from the raw event data.
dynamic-vision sensors, which can capture illumination In general, the event stream modality is an emerging
changes and produce asynchronous events independently modality for HAR, and has received great attention in the
for each pixel [252], have received lots of attention recently. past few years. Processing event data is computationally
Different from conventional video cameras capturing the cheap, and the captured frames do not usually contain back-
entire image arrays, event cameras only respond to changes ground information, which can be helpful for action under-
in the visual scene. Taking a high-speed object as an exam- standing. However, the event stream data cannot generally
ple, traditional RGB cameras may not be able to capture be effectively and directly processable using conventional
enough information of the object, due to the low frame rate video analysis techniques, and thus effectively and directly
and motion blur. However, this issue can be significantly utilizing the asynchronous event stream for HAR is still a
mitigated when using event cameras that operate at challenging research problem.
extremely high frequencies, generating events at a ms tem-
poral scale [252]. Besides, event cameras have some other 2.7 Audio Modality
characteristics, such as high dynamic range, low latency,
Audio signal is often available with videos for HAR. Due to
low power consumption, and no motion blur, which make
the synchronization between the visual and audio streams,
them suitable for HAR. Particularly, event cameras are able
the audio data can be used to locate actions to reduce
to effectively filter out background information and keep
human labeling efforts and decrease computational costs.
foreground movement only, avoiding considerable redun-
Some deep learning-based methods [13], [262] have been
dancies in the visual information. However, the information
proposed to perform general activity recognition from
obtained with event cameras is generally spatio-temporally
audio signals. However, in recent years, only a few deep
sparse, and asynchronous. Common event cameras include
learning methods were proposed for HAR from audio sig-
the Dynamic Vision Sensor (DVS) [256] and the Dynamic
nal alone. For example, Liang and Thomaz [13] used a pre-
and Active-pixel Vision Sensor (DAVIS) [257].
trained VGGish model [263] as the feature extractor fol-
The output data of event cameras is highly different from
lowed by a deep classification network to perform HAR.
that of conventional RGB cameras, as shown in Table 1.
Using audio modality alone is not a very popular scheme
Thus, some of the existing methods [12], [252] mainly
for HAR, compared to other data modalities, since audio
focused on designing event aggregation strategies convert-
signal does not contain enough information for accurate
ing the asynchronous output of the event camera into syn-
HAR. Nevertheless, the audio modality can serve as com-
chronous visual frames, which can then be processed with
plementary information for more reliable and efficient
conventional computer vision techniques. While hand-
HAR, and thus most of the audio-based HAR methods [25],
crafted feature-based methods [258], [259] have been pro-
[264], [265], [266] focused on taking advantage of multi-
posed, deep learning methods have been more popularly
modal deep learning techniques, which are discussed in
used recently. Given a fixed time interval Dt, Innocenti et al.
Section 3.
[252] first built a sequence of binary representations from
the raw event data by checking the presence or absence of
an event for each pixel during Dt. These intermediate binary 2.8 Acceleration Modality
representations were then stacked together to form a single Acceleration signals obtained from accelerometers have
frame via a binary to decimal conversion. The sequence of been used for HAR [267], due to their robustness against
such frames extracted from the whole event stream was occlusion, viewpoint, lighting, and background variations,
finally fed to a CNN+LSTM for HAR. Huang et al. [250] uti- etc. Specifically, a tri-axial accelerometer can return an esti-
lized time-stamp image encoding to transform the event mation of acceleration along the x, y, and z axes, that can be
data sequence into frame-based representations, which used to perform human activity analysis [268]. As for the
were then fed to a CNN for HAR. More precisely, the value feasibility of using acceleration signal for HAR, although
of each pixel in the generated time-stamp image encodes the size and proportion of the human body vary from per-
the amount of event taking place within a time-window. son to person, people generally have similar qualitative
However, since the size of the frame to be processed is ways to perform an action, so the acceleration signal usually

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3211

does not have obvious intra-class variations for the same architectures [288], [289] have also been proposed recently. For
action. HAR using acceleration signal can generally achieve example, Yang et al. [289] and Wang et al. [288] utilized an
a high accuracy, and thus has been adopted for remote LSTM model and a stacked RNN model, respectively to pre-
monitoring systems [269], [270] while taking care of privacy dict the action classes.
issues. In general, the characteristics and advantages of the
Recently, deep learning networks have been widely used radar modality make it suitable to be used for HAR in some
for acceleration-based HAR. In [14], [271], [272], [273], the scenarios, but radars are relatively expensive. Though HAR
tri-axial acceleration data was fed to different CNN architec- using radar data has achieved satisfactory results on some
tures for HAR. Wang et al. [274] proposed a framework con- datasets, there is still plenty of room for the development of
sisting of CNN and Bi-LSTM networks to extract spatial radar-based methods, and the work of [22] also pointed out
and temporal features from the raw acceleration data. some future directions in this area, such as handling more
Unlike the above-mentioned works, Lu et al. [275] utilized a complex actions in real-world scenarios with radar data.
modified Recurrence Plot (RP) [276] to transform the raw
tri-axial acceleration data into color images, which were
then fed to a ResNet for HAR. Besides, some acceleration- 2.10 WiFi Modality
based methods [277] focused on the fall detection task. WiFi is considered to be one of the most common indoor
In general, the acceleration modality can be utilized for wireless signal types nowadays [290]. Since human bodies
fine-grained HAR, and has been used for action monitoring, are good reflectors of wireless signal, WiFi signal can be uti-
especially for elderly care due to its privacy-protecting char- lized for HAR, and sometimes even for through-wall HAR
acteristic. However, the subject needs to carry wearable sen- [291]. There are some advantages of using the WiFi modal-
sors that are often cumbersome and disturbing. In addition, ity for action analysis, mainly due to the convenience, sim-
the position of the sensors on the human body can also plicity, and privacy protection of WiFi signal, and also the
affect the HAR performance. low cost of WiFi devices. Specifically, most of the existing
WiFi-based HAR methods [292], [293] focused on using the
Channel State Information (CSI) to conduct the HAR task.
2.9 Radar Modality CSI is the fine-grained information computed from the raw
Radar is an active sensing technology that transmits electro- WiFi signal, and the WiFi signal reflected by a person who
magnetic waves and receives returned waves from targets. performs an action, usually generates unique variations in
Continuous-wave radars, such as Doppler [278] and Fre- the CSI on the WiFi receiver.
quency Modulated Continuous Wave (FMCW) radars [279], Deep learning-based HAR from the CSI signal has
are most often chosen for HAR. Specifically, Doppler radars received attention recently. Wang et al. [294] proposed a
detect the radial velocity of the body parts, and the fre- deep sparse auto-encoder to learn discriminative features
quency changes according to the distance, which is known from the CSI streams. In the work of [16], the WiFi-based
as the Doppler shift. The micro-Doppler signatures gener- sample-level HAR model, named Temporal Unet, was pro-
ated by the micro-motion of the radar contain the motion posed, which consists of several temporal convolutional,
and structure information of the target, which thus can be deconvolutional, and max pooling layers. This method clas-
used for HAR. As for the FMCW radars, they can measure sified every WiFi distortion sample in the series into one
the distances of the targets as well. There are some advan- action for fine-grained HAR. LSTM networks have also
tages of using spectrograms obtained from radars for HAR, been used for HAR using CSI signal [295], [296], [297].
which include the robustness to variations in illumination Huang et al. [295] passed raw CSI measurements through a
and weather conditions, privacy protection, and the capabil- noise and redundancy removal module followed by a clip
ity of through-wall HAR [22]. reconstruction module to segment the processed CSI signal
Several deep learning architectures have been proposed into multiple clips. These clips were then fed to a multi-
recently. The works in [15], [280], [281], [282], [283], [284] stream CNN+LSTM model to perform HAR. In the work of
fed the micro-Doppler spectrogram images into CNNs for [296], the spatial features of the CSI signal were first
predicting action classes in different scenarios, such as extracted from a fully connected layer of a pre-trained
aquatic activities [280], [282], different geometrical locations CNN. These features were then fed to a Bi-LSTM to capture
[284], and simulated environments [281]. Unlike the afore- the temporal information for HAR. Chen et al. [297] directly
mentioned methods, the work in [285] directly took the raw passed the raw CSI signal through an attention-based Bi-
radar range data as input. The raw data was passed through LSTM to predict the action class. Different from the above-
an auto-correlation function, followed by a CNN to extract mentioned works, Gao et al. [298] transformed the CSI sig-
action-related features. Finally, a Random Forest classifier nal into radio images, which were then fed to a deep sparse
was used to predict the action class. auto-encoder to learn discriminative features for HAR.
The two-stream architecture has also been investigated. In general, because of the advantages (e.g., convenience),
Hernang omez et al. [286] designed a two-stream CNN tak- the WiFi modality can be used for HAR in some scenarios.
ing both micro-Doppler and range spectrograms represent- However, there are still some challenges which need to be
ing the structural signatures of the target as inputs. In the further addressed, such as how to more effectively use the
work of [287], micro-Doppler spectrograms and echos of the CSI phase and amplitude information, and to improve the
radar were fed to a two-stream CNN model for HAR. robustness when handling dynamic environments.
By interpreting micro-Doppler spectrograms as temporal Apart from the various modalities mentioned above,
sequences rather than images, several RNN-based some other modalities, such as pressure [299], orientation

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3212 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

[299], gyroscope [299], [300], radio frequency [301], [302], [220] from three different viewpoints) and RGB data (i.e., a
energy harvester [303], [304], and Electromyography [305], Motion History Image [348]). The output scores of these
have also been used for HAR. four streams were fused to perform action classification. By
considering the depth and RGB modalities as a single entity,
3 MULTI-MODALITY Wang et al. [308] extracted scene flow features from the spa-
tially aligned and temporally synchronized RGB and depth
In real life, humans often perceive the environment in a
frames. The bidirectional rank pooling [55] was utilized to
multi-modal cognitive way. Similarly, multi-modal machine
generate two dynamic images from the sequence of scene
learning is a modeling approach aiming to process and
flow features. The dynamic images were then fed to two dif-
relate the sensory information from multiple modalities
ferent CNNs, and finally, their classification scores were
[306]. By aggregating the advantages and capabilities of var-
fused to perform HAR. Wang et al. [58] represented the
ious data modalities, multi-modal machine learning can
RGB and depth data as two pairs of RGB and depth
often provide more robust and accurate HAR. There are
dynamic images, which were then passed through a cooper-
two main types of multi-modality learning methods,
atively trained CNN (c-ConvNet). The c-ConvNet consists
namely, fusion and co-learning. Fusion refers to the integra-
of two streams that exploit features for action classification
tion of information from two or more modalities for training
by jointly optimizing a ranking loss and a softmax loss. In
and inference, while co-learning refers to the transfer of
[311], a hybrid network that consists of multi-stream CNNs
knowledge between different data modalities.
and 3D ConvLSTMs [349] was introduced to extract features
from RGB and depth videos. These features were then fused
3.1 Fusion
via canonical correlation analysis to perform action classifi-
As discussed in Section 2, different modalities can have dif- cation. Wang et al. [309] proposed a generative framework
ferent strengths. Thus it becomes a natural choice to take to explore the feature distribution across the RGB and depth
advantage of the complementary strengths of different data modalities. The fusion was performed by constructing a
modalities via fusion, so as to achieve enhanced HAR per- cross-modality discovery matrix, which was then fed to a
formance. There are two widely used multi-modality fusion modality correlation discovery network for the final predic-
schemes in HAR, namely, score fusion and feature fusion. tion. Dhiman et al. [310] designed a two-stream network
Generally, the score fusion [310] integrates the decisions composed of a motion stream and a Shape Temporal
that are separately made based on different modalities (e.g., Dynamic (STD) stream to encode features from RGB and
by weighted averaging [326] or by learning a score fusion depth videos, respectively. Particularly, the motion stream
model [309]) to produce the final classification results. On takes a dynamic image from the RGB data as input and out-
the other hand, the feature fusion [323] generally combines puts classification scores. The STD network consists of the
the features from different modalities to yield aggregated Human Pose Model in [9] as its backbone, followed by sev-
features that are often very discriminative and powerful for eral LSTMs and a softmax layer. The final classification
HAR. Note that data fusion, i.e., fusing the multi-modality scores were obtained by aggregating the scores of these two
input data before feature extraction [342], has also been streams.
exploited. Since the input data can be treated as the original Fusion of RGB and Skeleton Modalities. The appearance
raw features, here we simply categorize the data fusion information provided by the RGB data, and the body pos-
approaches under feature fusion. Table 5 gives the results of ture and joint motion information provided by the skeleton
multi-modality fusion-based HAR methods on the MSRDai- sequences, are complementary and useful for activity analy-
lyActivity3D [237], UTD-MHAD [300], and NTU RGB+D sis. Thus, several works have investigated deep learning
[161] benchmark datasets. architectures to fuse RGB and skeleton data for HAR. Zhao
et al. [315] introduced a two-stream deep network, which
3.1.1 Fusion of Visual Modalities consists of an RNN and a CNN to process skeleton and
With the emergence of low-cost RGB-D cameras, many RGB data, respectively. Both the feature fusion and the
multi-modality datasets [161], [237] have been created by score fusion were evaluated and the former achieved better
the community, and consequently, some multi-modality performance. The two-stream architecture was also investi-
fusion-based HAR methods have been proposed. Most of gated in [314], [316], where the classification scores from a
these methods focused on the fusion of visual modalities CNN model [316] or an RNN model [314] trained on skele-
which are reviewed below. ton data and a skeleton conditioned spatio-temporal atten-
Fusion of RGB and Depth Modalities. The RGB and depth tion network trained on RGB data were fused for final
videos respectively capture rich appearance and 3D shape classification. Zolfaghari [313] designed a three-stream 3D-
information, that are complementary and can be used for CNN to process pose, motion, and the raw RGB images.
HAR. Early methods [346], [347] focused on extracting The three streams were fused via a Markov chain model for
hand-crafted features which capture spatio-temporal struc- action classification. Liu et al. [8] proposed a spatio-tempo-
tural relationships from the RGB and depth modalities for ral LSTM network, which is able to effectively fuse the RGB
more reliable HAR. Since the current mainstream of and skeleton features within the LSTM unit. In order to
research focuses on deep learning architectures, we review, effectively capture the subtle variations in action perform-
in the following, deep learning methods which fuse RGB ing, Song et al. [317] proposed a learning framework con-
and depth data modalities. In [307], a four-stream deep taining two streams of skeleton-guided deep CNN to
CNN was introduced to extract features from different rep- extract features from RGB and optical flow. Particularly, the
resentations of depth data (i.e., three Depth Motion Maps local image patches around human body parts were first

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3213

extracted from both RGB and flow sequences, and then fed score multiplications. Rani et al. [326] proposed a three-
into two individual CNNs followed by a part-aggregated stream 2D CNN to perform classification on three different
pooling layer to generate two fixed-length feature vectors hand-crafted features extracted from the depth and skeleton
corresponding to RGB and flow frames. The skeleton data sequences, followed by a score fusion module to obtain the
and the two sequences of RGB and flow feature vectors final classification result.
were then passed through a three-stream LSTM model fol- Fusion of RGB, Skeleton, and Depth Modalities. Several
lowed by a fusion layer to obtain the final classification hand-crafted feature-based methods [347], [353] have
scores. Das et al. [318] introduced a pose guided spatio-tem- explored the fusion of RGB, skeleton, and depth modalities
poral attention network on top of a 3D CNN model taking to further enhance the robustness of HAR. Meanwhile, deep
RGB videos as input to perform HAR. In their subsequent learning-based methods [23], [327], [328], [329], [330] have
work [320], they extended their work in [318] by paying received greater attention due to their superior perfor-
attention to the topology of the human body while comput- mance. Shahroudy et al. [327] focused on studying the corre-
ing the spatio-temporal attention maps. Li et al. [319] pro- lation between modalities, and factorizing them into their
posed a two-stream network, which consists of three main correlated and independent components. Then a structured
components namely the ST-GCN network [152] to extract sparsity-based classifier was utilized for HAR. Hu et al.
the skeleton features, R(2+1)D network [109] to extract RGB [328] learned the time-varying information across RGB,
features, and a guided block which takes these features to skeleton, and depth modalities by extracting temporal fea-
enhance the action-related information in RGB videos. ture maps from each modality, and then concatenating
Finally, the score fusion approach was utilized to perform them along the modality dimension. These multi-modal
classification. Cai et al. [322] introduced a two-stream GCN temporal features were then fed to a stack of bilinear blocks
network, respectively taking skeleton data and joint-aligned to exploit the mutual information from both modality and
flow patches obtained from RGB videos as the inputs for temporal directions. Khaire et al. [329] proposed a five-
HAR.PoseC3D [321] represents skeleton data as a 3D heat- stream CNN network, which takes Motion History Images
map volume and passes the RGB video and the heatmap [348], Depth Motion Maps [220] (i.e., from three different
volume through a two-stream 3D CNN framework whose viewpoints), and skeleton images generated respectively
architecture is inspired by [112]. from RGB, depth, and skeleton sequences as inputs. Each
Fusion of Skeleton and Depth Modalities. The skeleton data CNN was trained individually, and the output scores of
has been shown to be a succinct yet informative representa- these five streams were fused with a weighted product
tion for human behavior analysis [350], which, however, is model [354] to obtain the final classification scores. In their
a very sparse representation without the encoding of the subsequent work [330], three fusion methods were explored
shape information of the human body and the interacted for combining skeletal, RGB, and depth modalities. Carde-
objects. Besides, skeleton data is often noisy, limiting the nas and Chavez [23] fed three different optical spectra chan-
performance of HAR when it is used alone [8]. Meanwhile, nels from skeleton data [169] and dynamic images from
depth maps provide discriminative 3D shape and silhouette both RGB and depth videos to a pre-trained CNN to extract
information that can be helpful for HAR. Therefore, early multi-modal features. Finally, a feature aggregation module
methods [351], [352] have attempted to fuse hand-crafted was used to perform classification.
features extracted from both the skeleton and the depth Other Fusion Methods. Besides, some other multi-modality
sequences for more accurate HAR. Recently, deep learning- fusion methods were proposed [332], [333]. In the work of
based approaches have become the mainstream for fusing [332], a multi-modal fusion model was built on top of the TSN
the skeleton and depth modalities. In order to learn the rela- framework [50]. The RGB, depth, infrared, and optical flow
tionships between the human body and the objects, as well sequences were passed through TSNs to perform initial classi-
as the relations between different human body parts, Rah- fication, followed by a fusion network applied on the initial
mani et al. [323] proposed an end-to-end learning frame- classification scores to obtain the final classification scores.
work, which consists of a CNN followed by the bilinear Main and Noumeir [333] utilized a 3D CNN and a 2D CNN to
compact pooling and fully-connected layers for action clas- extract features from infrared and skeleton data, respectively.
sification. In particular, the model takes the relative geome- The feature vectors were then concatenated and exploited by a
try between every body part and others as the skeleton multi-layer perceptron for HAR.
features and depth image patches around different body
parts as the appearance features to encode body part-object
and body part-body part relations for reliable HAR. Kamel 3.1.2 Fusion of Visual and Non-Visual Modalities
et al. [324] utilized a three-stream CNN to extract action- Visual and non-visual modalities can also be fused to lever-
related features from Depth Motion Images (DMIs), Moving age their complementary discriminative capabilities for
Joint Descriptors (MJDs), and their combination. The DMI more accurate and robust HAR models.
and MJD encode the depth and skeleton sequences as Fusion of Audio and Visual Modalities. Audio data provides
images. The output scores of these three CNNs were fused complementary information to the appearance and motion
to obtain the final classification scores. Zhao et al. [325] uti- information in the visual data. Several deep learning-based
lized two 3D CNN streams taking raw depth data and methods have been proposed to fuse these two types of
Depth Motion Maps [220] as inputs, and a manifold repre- modalities for HAR. Wang et al. [334] introduced a three-
sentation stream taking 3D skeleton as input, for feature stream CNN to extract multi-modal features from audio sig-
extraction from depth and skeleton sequences. The classifi- nal, RGB frames, and optical flows. Both feature fusion and
cation scores from these three networks were fused via score fusion were evaluated, and the former achieved better

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3214 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

performance. Owens and Efros [264] trained a two-stream Although the aforementioned multi-modality fusion-
CNN in a self-supervised manner to identify any misalign- based HAR methods have achieved promising results on
ment between the audio and visual sequences. The learned some benchmark datasets, the task of effective modality
model was then fine-tuned on the HAR datasets for audio- fusion is still largely open. Specifically, most of the existing
visual HAR. Inspired by TSN [50], Kazakos et al. [25] intro- multi-modality methods have complicated architectures
duced a Temporal Binding Network (TBN), which takes which require high computational costs. Thus efficient
audio, RGB, and optical flow as inputs for egocentric HAR. multi-modality HAR also needs to be addressed.
The TBN network contains a three-stream CNN to fuse the
multi-modal input sequences within each Temporal Bind- 3.2 Co-Learning
ing Window, followed by a temporal aggregation module Co-learning explores how knowledge learned from auxil-
for classification. Their results showed that the TBN outper- iary modalities can be used to assist the learning of a model
formed the Temporal Segment Network (TSN) [50] for the on another modality [306]. Transferring knowledge among
audio-visual HAR task. Gao et al. [265] utilized audio signal different modalities can overcome the shortcomings of a
to reduce temporal redundancies in videos. Particularly, single data modality and enhance its performance. Unlike
this method distills the knowledge from a teacher network fusion methods, in co-learning methods, the data of the aux-
trained on video clips to a student network trained on iliary modalities is only required during training rather than
image-audio pairs for efficient HAR. The student network is testing. This is particularly beneficial in cases where some
a two-stream deep model which takes the starting frame modalities are missing during testing. Co-learning can also
and audio signal of the video clip as input followed by a fea- benefit the learning of a certain modality with fewer sam-
ture fusion, fully-connected, and softmax layers for audio- ples, by leveraging other correlated modalities with richer
visual HAR. Inspired by the work of [112], Xiao et al. [266] samples for assisting model training.
introduced a hierarchically integrated audio-visual repre-
sentation framework containing slow and fast visual path-
ways that are deeply integrated with a faster audio 3.2.1 Co-Learning With Visual Modalities
pathway at multiple layers. Two training strategies, includ- Most of the co-learning-based HAR methods focused on co-
ing randomly dropping the audio pathway and hierarchical learning with visual modalities, such as RGB with depth
audio-visual synchronization, were also introduced to modalities, and RGB with skeleton modalities.
enable joint training of audio and video modalities. Co-Learning With RGB and Depth Modalities. Knowledge
Fusion of Acceleration and Visual Modalities. Some hand- transfer [306] between the RGB modality which captures
crafted feature-based methods [355], [356] have exploited appearance information, and depth data which encodes 3D
the fusion of acceleration and visual modalities for HAR. shape information, has been shown to be useful for improv-
Besides, many deep learning methods [336], [337], [338], ing the representation capability of each modality for HAR,
[339], [340], [341] have also been proposed recently. Dawar especially when one of these modalities has a limited
et al. [337] represented inertial signal as an image, and uti- amount of annotated data for training. There have been
lized two CNNs to fuse the depth images and inertial sig- some methods that used hand-crafted features for co-learn-
nals using score fusion. Due to the limited depth-inertial ing [346], [347]. Recently, Garcia et al. [357], [358], [359] pro-
training data, Dawar et al. [336] proposed a data augmenta- posed several deep learning-based HAR methods based on
tion framework based on depth and inertial modalities, cross-modality knowledge distillation. In [357], a knowl-
which were separately fed to a CNN and a CNN+LSTM edge distillation framework was proposed to distill knowl-
respectively. The scores of the two models were fused dur- edge from a teacher network taking depth videos as input
ing testing for better classification. Wei et al. [338] respec- to a RGB-based student network. The knowledge distilla-
tively fed the 3D video frames and 2D inertial images to a tion was achieved by forcing the feature maps and predic-
3D CNN and a 2D CNN for HAR, and the score fusion tion scores of the student network to be similar to the
achieved better performance than feature fusion. Several teacher network. While in [359], an adversarial learning-
depth-inertial fusion techniques have also been investigated based knowledge distillation strategy was proposed to learn
in [339], [340] by designing different two-stream CNN the student stream. In their subsequent work [358], a three-
architectures, where the inertial signal was transformed stream network taking RGB, depth, and optical flow data as
into images using the technique in [342]. inputs, was trained using a cooperative learning strategy,
Other Fusion Methods. Memmesheimer et al. [345] trans- i.e., the predicted labels generated by the modality stream
formed the skeleton, inertial, and WiFi data to a color image, with the minimum classification loss were used as an addi-
which was fed to a CNN to perform HAR. Imran and tional supervision for the training of other streams.
Raman [235] designed a three-stream architecture, where a Co-Learning With RGB and Skeleton Modalities. Co-learning
1D CNN, a 2D CNN, and an RNN were used for gyroscopic, between the RGB and skeleton modalities has also been
RGB, and skeleton data, respectively. Several fusion meth- investigated in existing works [24], [26], [360]. Mahasseni
ods were adopted to predict the final class label, while fus- and Todorovic [24] utilized a CNN+LSTM network to per-
ing features using canonical correlation analysis achieved form classification based on RGB videos, and an LSTM
the best performance. Zou et al. [290] designed a two-stream model trained on skeleton data to act as a regularizer by
architecture, named WiVi, which takes the RGB and WiFi forcing the output features of the two models (i.e., CNN
frames as inputs of the C3D and CNN streams. A multi- +LSTM and LSTM) to be similar. Thoker and Gall [360] uti-
modal score fusion module was built on top of the network lized TSN [50] as the teacher network taking RGB videos,
to perform HAR. and an ensemble of student networks (e.g., ST-GCN [152]

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3215

and HCN [174]) handling skeleton data, to perform knowl- video/audio modality, respectively. The audio-visual corre-
edge distillation for cross-modality HAR. Specifically, each spondence and temporal synchronization have also been
student network was trained from the supervisory informa- used as self-supervised signals to learn discriminative audio
tion provided by the teacher network as well as other stu- and visual features [368], [369]. Several other works [370],
dent networks. In [26], a supervision signal from an [371] leverage narrations of videos as a weak supervisory
auxiliary LSTM network taking the skeleton data as input signal to jointly learn video and text representations for
was fed to a two-stream CNN+LSTM architecture taking action recognition and detection. Miech et al. [370] intro-
RGB and optical flow sequences as inputs. The auxiliary duced a text-video embedding model, which is trained
supervision signal forces the model to align the distribu- from the caption-clip pairs using the max-margin ranking
tions of the auxiliary (skeleton) and source (RGB and optical loss for learning joint representations of video and language
flow) modalities, which thus enhances the representation of automatically. In their further study [371], a novel training
the RGB and optical flow modalities. loss derived from Multiple Instance Learning and Noise
Other Co-Learning Methods. Besides using RGB with depth Contrastive Estimation, was introduced to address mis-
or RGB with skeleton data for co-learning, other visual alignments in narrated videos. Besides, Sun et al. [372] intro-
modalities have also been investigated. For example, Wang duced a joint visual-linguistic model by extending BERT
et al. [361] learned a transferable generative model which [373] to videos for zero-shot HAR.
takes an infrared video as input and generates a fake feature
representation of its corresponding RGB video. The discrimi-
nator consists of two sub-networks, i.e., a deep binary classi- 4 DATASETS
fier which attempts to differentiate between the generated A large number of datasets have been created to train and
fake features and the real RGB feature representation, and a evaluate the HAR methods. Table 6 lists a series of bench-
predictor which takes both the infrared video representation mark datasets. In particular, the attributes of these datasets
and the generated features as inputs to perform action classi- are also summarized.
fication. Chadha et al. [362] embedded the PIX2NVS emula- For RGB-based HAR, the UCF101 [374], HMDB51 [27],
tor [363] (i.e., converting pixel domain video frames to and Kinectis-400 [375] are widely used as benchmark data-
neuromorphic events) into a teacher-student framework, sets. Besides, Kinetics-600 [376], Kinetics-700 [377], EPIC-
which transfers knowledge from a pre-trained optical flow KITCHENS-55 [378], THUMOS Challenge 15 [379], Activi-
teacher network to a neuromorphic event student network. tyNet [380], and Something-Something-v1 [381] datasets are
also popularly used. For 3D skeleton, depth, infrared, and
point cloud-based HAR, the large-scale NTU RGB+D [161]
3.2.2 Co-Learning With Visual and Non-Visual and NTU RGB+D 120 [151] are widely used benchmark
Modalities datasets. Besides, the MSRDailyActivity3D [237], North-
There are also several works on co-learning between visual western-UCLA [238], and UWA3D Multiview II [239] data-
and non-visual modalities [299], [364], [365]. Kong et al. sets are also widely used for depth-based HAR, while the
[299] and Liu et al. [364] trained multiple teacher networks InfAR [30] dataset is often used for infrared-based HAR.
on non-visual modalities, such as acceleration, gyroscope, Some point cloud-based methods [242], [245] have also been
and orientation signal, to teach an RGB video-based student evaluated on point cloud data obtained from depth images.
network through knowledge distillation with an attention DvsGesture [241] and DHP19 [32] datasets are popular
mechanism. In [299], the knowledge was transferred by first datasets for event stream-based HAR. However, there are
integrating the classification scores of the teachers with no widely used benchmark datasets for non-visual modal-
attention weights learned by feeding the concatenated clas- ity-based HAR.
sification scores of the teachers to a feed-forward neural net- In addition, for multi-modality-based HAR, the NTU
work. Then the attended-fusion scores were used as an RGB+D [161], NTU RGB+D 120 [151], and MMAct [299] are
additional supervision to train the student network. In large benchmark datasets suitable for fusion and co-learn-
[364], knowledge distillation was achieved by forcing the ing of different modalities. The MSRDailyActivity3D [237],
visual explanations (i.e., attention maps that highlight the UTD-MHAD [300], and PKU-MMD [28] datasets are also
important regions for the classification scores) of both the popularly used.
student and teacher networks to be similar to bridge the
modality gap. In another work, Perez et al. [365] proposed a
knowledge distillation framework to transfer knowledge 5 DISCUSSION
from a teacher network trained on RGB videos to a student In the previous sections, we review the methods and data-
network taking the raw sound data as input. Specifically, an sets for HAR with various data modalities. Below we dis-
objective function [366], which encourages the student cuss some of the potential and important directions that
model to predict the true labels as well as matching the soft could need further investigation in this domain.
labels provided by the teacher network, was utilized. Datasets. Large and comprehensive datasets generally
Besides fusing multiple modalities and transferring have a vital importance for the development of HAR, espe-
knowledge between different modalities, there are also a cially for the deep learning-based HAR methods. There are
few works that leverage the correlation between different many aspects which indicate the quality of a dataset, such
modalities for self-supervised learning. For example, as its size, diversity, applicability, and type of modality.
Alwassel et al. [367] leveraged unsupervised clustering in Despite the large number of existing datasets that have
the audio/video modality as the supervisory signal for the advanced the HAR area greatly, to further facilitate the

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3216 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

TABLE 6 remain opportunities to design more effective fusion and


Some Representative Benchmark Datasets with Various co-learning strategies for multi-modality HAR.
Data Modalities for HAR
Efficient Action Analysis. The superior performance of
many HAR methods is built on high computational com-
plexity, while efficient HAR is also crucial for many real-life
practical applications. Hence how to reduce the computa-
tional costs and resource consumption (e.g., CPU, GPU, and
energy consumption), and achieve efficient and fast HAR,
deserve further studies [134], [203].
Early Action Recognition. Early action recognition (action
prediction) enables recognition when only a part of the
action has been performed, i.e., recognizing an action before
it has been fully performed [418], [419], [420]. This is also an
important problem due to its relevance in some applica-
tions, such as online human-robot interaction and early
alarm in some real-life scenarios.
Few-Shot Action Analysis. It can be difficult to collect a
large amount of training data (especially multi-modality
data) for all action classes. To handle this issue, one of the
possible solutions is to take advantage of few-shot learning
techniques [421], [422]. Though there have been some
attempts for few-shot HAR [100], [151], considering the sig-
nificance of handling the issues of scarcity of data in many
practical scenarios, more advanced few-shot action analysis
can still be further explored.
Unsupervised and Semi-Supervised Learning. Supervised
learning methods, especially deep learning-based ones,
often require a large amount of data with expensive labels
for model training. Meanwhile, unsupervised and semi-
supervised learning techniques [423], [424], [425] often
enable to leverage the availability of unlabelled data to train
the models, which significantly reduces the requirement of
large labeled datasets. Since unlabelled action samples are
often easier to be collected than labeled ones, unsupervised
and semi-supervised HAR are also an important research
direction that is worthy of further development.

S: Skeleton, D: Depth, IR: Infrared, PC: Point Cloud, ES: Event Stream, Au:
Audio, Ac: Acceleration, Gyr: Gyroscope, EMG: Electromyography. 6 CONCLUSION
HAR is an important task that has attracted significant
research on HAR, new benchmark datasets are still research attention in the past decades, and various data
required. For example, most of the existing multi-modality modalities with different characteristics have been used for
datasets were collected in controlled environments, where this task. In this paper, we have given a comprehensive
actions were usually performed by volunteers. Thus collect- review of HAR methods using different data modalities.
ing multi-modality data from uncontrolled environments to Besides, multi-modality recognition methods, including
achieve large and challenging benchmarks, for further pro- fusion and co-learning methods, have been also surveyed.
moting multi-modality HAR in practical applications, can Benchmark datasets have also been reviewed, and some
be important. Besides, the construction of large and chal- potential research directions have been discussed.
lenging datasets for every person’s action recognition in
crowded environments can also be further investigated. REFERENCES
Multi-Modality Learning. As discussed in Section 3, sev- [1] W. Lin, M.-T. Sun, R. Poovandran, and Z. Zhang, “Human activ-
eral multi-modality learning methods, including multi- ity recognition for video surveillance,” in Proc. IEEE Int. Symp.
modality fusion and cross-modality transfer learning, have Circuits Syst., 2008, pp. 2737–2740.
[2] M. Lu, Y. Hu, and X. Lu, “Driver action recognition using
been proposed for HAR. The fusion of multi-modality data
deformable and dilated faster R-CNN with optimized region
which can often complement each other, results in improve- proposals,” Appl. Intell., vol. 50, no. 4, pp. 1100–1111, 2020.
ments of HAR performance, while co-learning can be used [3] W. Hu, D. Xie, Z. Fu, W. Zeng, and S. Maybank, “Semantic-based
to handle the issue of the lack of data of some modalities. surveillance video retrieval,” IEEE Trans. Image Process., vol. 16,
no. 4, pp. 1168–1181, Apr. 2007.
However, as pointed out by [417], many existing multi- [4] I. Rodomagoulakis et al., “Multimodal human action recognition
modality methods are not as effective as expected owing to in assistive human-robot interaction,” in Proc. IEEE Int. Conf.
a series of challenges, such as over-fitting. This implies there Acoust. Speech Signal Process., 2016, pp. 2702–2706.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3217

[5] J. Shotton et al., “Real-time human pose recognition in parts from [29] B. Ni, G. Wang, and P. Moulin, “RGBD-HuDaAct: A color-depth
single depth images,” in Proc. IEEE Conf. Comput. Vis. Pattern video database for human daily activity recognition,” in Proc.
Recognit., 2011, pp. 1297–1304. IEEE Int. Conf. Comput. Vis. Workshops, 2011, pp. 1147–1153.
[6] R. Poppe, “A survey on vision-based human action recognition,” [30] C. Gao et al., “InfAR dataset: Infrared action recognition at differ-
Image Vis. Comput., vol. 28, pp. 976–990, 2010. ent times,” Neurocomputing, vol. 212, pp. 36–47, 2016.
[7] J. Donahue et al., “Long-term recurrent convolutional networks [31] H. Cheng and S. M. Chung, “Orthogonal moment-based descrip-
for visual recognition and description,” in Proc. IEEE Conf. Com- tors for pose shape query on 3D point cloud patches,” Pattern
put. Vis. Pattern Recognit., 2015, pp. 2625–2634. Recognit., vol. 52, pp. 397–409, 2016.
[8] J. Liu, A. Shahroudy, D. Xu, A. C. Kot, and G. Wang, “Skeleton- [32] E. Calabrese et al., “DHP19: Dynamic vision sensor 3D human
based action recognition using spatio-temporal lstm network pose dataset,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
with trust gates,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, Workshops, 2019, pp. 1695–1704.
no. 12, pp. 3007–3021, Dec. 2018. [33] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy, “Berkeley
[9] H. Rahmani and A. Mian, “3D action recognition from novel MHAD: A comprehensive multimodal human action database,” in
viewpoints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Proc. IEEE Workshop Appl. Comput. Vis., 2013, pp. 53–60.
2016, pp. 1506–1515. [34] J. R. Kwapisz, G. M. Weiss, and S. A. Moore, “Activity recogni-
[10] Z. Jiang, V. Rozgic, and S. Adali, “Learning spatiotemporal fea- tion using cell phone accelerometers,” ACM SigKDD Explorations
tures for infrared action recognition with 3D convolutional neu- Newslett., vol. 12, no. 2, pp. 74–82, 2011.
ral networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. [35] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, “Understanding
Workshops, 2017, pp. 309–317. and modeling of WiFi signal based human activity recognition,” in
[11] Y. Wang et al., “3DV: 3D dynamic voxel for action recognition in Proc. 21st Annu. Int. Conf. Mobile Comput. Netw., 2015, pp. 65–76.
depth video,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [36] K. Soomro and A. R. Zamir, “Action recognition in realistic sports
2020, pp. 508–517. videos,” in Vision in Sports, Berlin, Germany: Springer, 2014.
[12] R. Ghosh, A. Gupta, A. Nakagawa, A. Soares, and N. Thakor, [37] J. M. Chaquet, E. J. Carmona, and A. Fernandez-Caballero, “A sur-
“Spatiotemporal filtering for event-based action recognition,” vey of video datasets for human action and activity recognition,”
2019, arXiv:1903.07067. Comput. Vis. Image Understanding, vol. 117, no. 6, pp. 633–659, 2013.
[13] D. Liang and E. Thomaz, “Audio-based activities of daily living [38] V. Delaitre, I. Laptev, and J. Sivic, “Recognizing human actions
(ADL) recognition with large-scale acoustic embeddings from in still images: A study of bag-of-features and part-based repre-
online videos,” Proc. ACM InterAct. Mobile Wearable Ubiquitous sentations,” in Proc. Brit. Mach. Vis. Conf., 2010, pp. 97.1–97.11.
Technol., vol. 3, no. 1, 2019, Art. no. 17. [39] B. Yao and L. Fei-Fei, “Grouplet: A structured image representa-
[14] M. Zeng et al., “Convolutional neural networks for human activ- tion for recognizing human and object interactions,” in Proc.
ity recognition using mobile sensors,” in Proc. 6th Int. Conf. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 9–16.
Mobile Comput. Appl. Serv., 2014, pp. 197–205. [40] G. Sharma, F. Jurie, and C. Schmid, “Discriminative spatial
[15] Y. Kim and T. Moon, “Human detection and activity classifica- saliency for image classification,” in Proc. IEEE Conf. Comput. Vis.
tion based on micro-doppler signatures using deep convolu- Pattern Recognit., 2012, pp. 3506–3513.
tional neural networks,” IEEE Geosci. Remote Sens. Lett., vol. 13, [41] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri,
no. 1, pp. 8–12, Jan. 2016. “Actions as space-time shapes,” IEEE Trans. Pattern Anal. Mach.
[16] F. Wang, Y. Song, J. Zhang, J. Han, and D. Huang, “Temporal Intell., vol. 29, no. 12, pp. 2247–2253, Dec. 2007.
unet: Sample level human action recognition using WiFi,” [42] I. Laptev, “On space-time interest points,” Int. J. Comput. Vis.,
2019, arXiv:1904.11953. vol. 64, no. 2–3, pp. 107–123, 2005.
[17] J. K. Aggarwal and M. S. Ryoo, “Human activity analysis: A [43] H. Wang, A. Kl€aser, C. Schmid, and C.-L. Liu, “Action recogni-
review,” ACM Comput. Surv., vol. 43, no. 3, 2011, Art. no. 16. tion by dense trajectories,” in Proc. IEEE Conf. Comput. Vis. Pat-
[18] B. Ren, M. Liu, R. Ding, and H. Liu, “A survey on 3D skeleton- tern Recognit., 2011, pp. 3169–3176.
based action recognition using learning method,” [44] K. Simonyan and A. Zisserman, “Two-stream convolutional net-
2020, arXiv:2002.05907. works for action recognition in videos,” in Proc. 27th Int. Conf.
[19] L. Chen, H. Wei, and J. Ferryman, “A survey of human motion Neural Informat. Process. Syst., 2014, pp. 568–576.
analysis using depth imagery,” Pattern Recognit. Lett., vol. 34, [45] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
no. 15, pp. 1995–2006, 2013. L. Fei-Fei, “Large-scale video classification with convolutional
[20] S. Slim, A. Atia, M. Elfattah, and M. Mostafa, “Survey on human neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
activity recognition based on acceleration data,” Int. J. Adv. Com- nit., 2014, pp. 1725–1732.
put. Sci. Appl., vol. 10, pp. 84–98, 2019. [46] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, “Action recog-
[21] S. Yousefi, H. Narui, S. Dayal, S. Ermon, and S. Valaee, “A survey nition with dynamic image networks,” IEEE Trans. Pattern Anal.
on behavior recognition using WiFi channel state information,” Mach. Intell., vol. 40, no. 12, pp. 2799–2813, Dec. 2018.
IEEE Commum. Mag., vol. 55, no. 10, pp. 98–104, Oct. 2017. [47] L. Wang, Y. Qiao, and X. Tang, “Action recognition with trajec-
[22] X. Li, Y. He, and X. Jing, “A survey of deep learning-based tory-pooled deep-convolutional descriptors,” in Proc. IEEE Conf.
human activity recognition in radar,” Remote Sens., vol. 11, no. 9, Comput. Vis. Pattern Recognit., 2015, pp. 4305–4314.
2019, Art. no. 1068. [48] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image clas-
[23] E. J. E. Cardenas and G. C. Chavez, “Multimodal human action sification with the fisher vector: Theory and practice,” Int. J. Com-
recognition based on a fusion of dynamic images using CNN put. Vis., vol. 105, no. 3, pp. 222–245, 2013.
descriptors,” in Proc. SIBGRAPI Conf. Graph. Patterns Images, [49] G. Cheron, I. Laptev, and C. Schmid, “P-CNN: Pose-based CNN
2018, pp. 95–102. features for action recognition,” in Proc. IEEE/CVF Int. Conf. Com-
[24] B. Mahasseni and S. Todorovic, “Regularizing long short term put. Vis., 2015, pp. 3218–3226.
memory with 3D human-skeleton sequences for action recog- [50] L. Wang et al., “Temporal segment networks: Towards good practi-
nition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, ces for deep action recognition,” in Proc. Eur. Conf. Comput. Vis., 2016,
pp. 3054–3062. pp. 20–36.
[25] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen, “EPIC- [51] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell,
fusion: Audio-visual temporal binding for egocentric action rec- “Actionvlad: Learning spatio-temporal aggregation for action
ognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, classification,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
pp. 5491–5500. 2017, pp. 3165–3174.
[26] S. Song, J. Liu, Y. Li, and Z. Guo, “Modality compensation net- [52] A. Diba, V. Sharma, and L. Van Gool, “Deep temporal linear
work: Cross-modal adaptation for action recognition,” IEEE encoding networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
Trans. Image Process., vol. 29, pp. 3957–3969, 2020. ognit., 2017, pp. 1541–1550.
[27] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, [53] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal
“HMDB: A large video database for human motion recognition,” multiplier networks for video action recognition,” in Proc. IEEE
in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2011, pp. 2556–2563. Conf. Comput. Vis. Pattern Recognit., 2017, pp. 7445–7454.
[28] C. Liu, Y. Hu, Y. Li, S. Song, and J. Liu, “PKU-MMD: A large [54] M. Zong, R. Wang, X. Chen, Z. Chen, and Y. Gong, “Motion
scale benchmark for continuous multi-modal human action saliency based multi-stream multiplier resnets for action recog-
understanding,” 2017, arXiv:1703.07475. nition,” Image Vis. Comput., vol. 107, 2021, Art. no. 104108.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3218 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

[55] B. Fernando, E. Gavves, J. Oramas, A. Ghodrati, and T. Tuyte- [81] R. Girdhar and D. Ramanan, “Attentional pooling for action recog-
laars, “Rank pooling for action recognition,” IEEE Trans. Pattern nition,” in Proc. 27th Int. Conf. Neural Informat. Process. Syst., 2017,
Anal. Mach. Intell., vol. 39, no. 4, pp. 773–787, Apr. 2017. pp. 33–44.
[56] P. Wang, S. Wang, Z. Gao, Y. Hou, and W. Li, “Structured images [82] L. Meng et al., “Interpretable spatio-temporal attention for video
for RGB-D action recognition,” in Proc. IEEE Int. Conf. Comput. action recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Work-
Vis. Workshops, 2017, pp. 1005–1014,. shops, 2019, pp. 1513–1522.
[57] Y. Xiao, J. Chen, Y. Wang, Z. Cao, J. T. Zhou, and X. Bai, “Action [83] Z. Wu, C. Xiong, C.-Y. Ma, R. Socher, and L. S. Davis, “AdaFrame:
recognition for depth video using multi-view dynamic images,” Adaptive frame selection for fast video recognition,” in Proc. IEEE
Inf. Sci., vol. 480, pp. 287–304, 2019. Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1278–1287.
[58] P. Wang, W. Li, J. Wan, P. Ogunbona, and X. Liu, “Cooperative [84] Z. Liu, Z. Li, R. Wang, M. Zong, and W. Ji, “Spatiotemporal
training of deep aggregation networks for RGB-D action recog- saliency-based multi-stream networks with attention-aware
nition,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 7404–7411. LSTM for action recognition,” Neural. Comput. Appl., vol. 32,
[59] X. Wang, A. Farhadi, and A. Gupta, “Actions transformations,” in pp. 14593–14602, 2020.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2658–2667. [85] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek,
[60] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, “Real-time “VideoLSTM convolves, attends and flows for action recognition,”
action recognition with enhanced motion vector CNNs,” in Proc. Comput. Vis. Image Understanding, vol. 166, pp. 41–50, 2018.
IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2718–2726. [86] K. Xu et al., “Show, attend and tell: Neural image caption genera-
[61] A. Piergiovanni and M. S. Ryoo, “Representation flow for action tion with visual attention,” in Proc. 32nd Int. Conf. Int. Conf.
recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Mach. Learn., 2015, pp. 2048–2057.
2019, pp. 9937–9945. [87] N. Ballas, L. Yao, C. Pal, and A. Courville, “Delving deeper into
[62] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, “Towards good practi- convolutional networks for learning video representations,” in
ces for very deep two-stream convnets,” 2015, arXiv:1507.02159. Proc. Int. Conf. Learn. Representations, 2016, pp. 1–11.
[63] A. Kar, N. Rai, K. Sikka, and G. Sharma, “AdaScan: Adaptive [88] Y. Shi, Y. Tian, Y. Wang, W. Zeng, and T. Huang, “Learning long-
scan pooling in deep convolutional neural networks for human term dependencies for action recognition with a biologically-
action recognition in videos,” in Proc. IEEE Conf. Comput. Vis. inspired deep network,” in Proc. IEEE/CVF Int. Conf. Comput.
Pattern Recognit., 2017, pp. 5699–5708. Vis., 2017, pp. 716–725.
[64] H. Zhang, D. Liu, and Z. Xiong, “Two-stream action recognition- [89] D. Dwibedi, P. Sermanet, and J. Tompson, “Temporal reasoning
oriented video super-resolution,” in Proc. IEEE/CVF Int. Conf. in videos using convolutional gated recurrent units,” in Proc.
Comput. Vis., 2019, pp. 8799–8808. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2018.
[65] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional [90] P.-S. Kim, D.-G. Lee, and S.-W. Lee, “Discriminative context
two-stream network fusion for video action recognition,” in Proc. learning with gated recurrent unit for group activity recog-
IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1933–1941. nition,” Pattern Recognit., vol. 76, pp. 149–161, 2018.
[66] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, [91] K. Cho et al., “Learning phrase representations using RNN encoder-
“Learning spatiotemporal features with 3D convolutional decoder for statistical machine translation,” in Proc. Conf. Empirical
networks,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2015, Methods Natural Lang. Process., 2014, pp. 1724–1734.
pp. 4489–4497. [92] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evalua-
[67] W. Du, Y. Wang, and Y. Qiao, “RPAN: An end-to-end recurrent tion of gated recurrent neural networks on sequence modeling,”
pose-attention network for action recognition in videos,” in Proc. 2014, arXiv:1412.3555.
IEEE/CVF Int. Conf. Comput. Vis., 2017, pp. 3745–3754. [93] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, “Modeling spa-
[68] L. Sun, K. Jia, K. Chen, D.-Y. Yeung, B. E. Shi, and S. Savarese, tial-temporal clues in a hybrid deep learning framework for
“Lattice long short-term memory for human action recognition,” video classification,” in Proc. ACM Int. Conf. Multimedia, 2015,
in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2017, pp. 2166–2175. pp. 461–470.
[69] T. Perrett and D. Damen, “DDLSTM: Dual-domain LSTM for [94] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Two
cross-dataset action recognition,” in Proc. IEEE Conf. Comput. Vis. stream LSTM: A deep fusion framework for human action recog-
Pattern Recognit., 2019, pp. 7844–7853. nition,” in Proc. IEEE Workshop Appl. Comput. Vis., 2017, pp. 177–186.
[70] Y. Meng et al., “Ar-net: Adaptive frame resolution for efficient action [95] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt,
recognition,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 86–104. “Sequential deep learning for human action recognition,” in
[71] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, Proc. Int. Workshop Hum. Behav. Understanding, 2011, pp. 29–39.
R. Monga, and G. Toderici, “Beyond short snippets: Deep net- [96] X. Wang, L. Gao, J. Song, and H. Shen, “Beyond frame-level CNN:
works for video classification,” in Proc. IEEE Conf. Comput. Vis. Saliency-aware 3-D CNN with LSTM for video action recognition,”
Pattern Recognit., 2015, pp. 4694–4702. IEEE Signal Process. Lett., vol. 24, no. 4, pp. 510–514, Apr. 2017.
[72] N. Srivastava, E. Mansimov, and R. Salakhudinov, “Unsupervised [97] S. Vyas, Y. S. Rawat, and M. Shah, “Multi-view action recogni-
learning of video representations using LSTMs,” in Proc. 32nd Int. tion using cross-view video prediction,” in Proc. Eur. Conf. Com-
Conf. Int. Conf. Mach. Learn., 2015, pp. 843–852. put. Vis., 2020, pp. 427–444.
[73] Z. Wu, C. Xiong, Y.-G. Jiang, and L. S. Davis, “LiteEval: A coarse-to- [98] X. Wang, L. Gao, P. Wang, X. Sun, and X. Liu, “Two-stream 3-D
fine framework for resource efficient video recognition,” in Proc. convnet fusion for action recognition in videos with arbitrary size
27th Int. Conf. Neural Informat. Process. Syst., 2019, pp. 7780–7789. and length,” IEEE Trans. Multimedia, vol. 20, no. 3, pp. 634–644,
[74] M. Majd and R. Safabakhsh, “Correlational convolutional LSTM Mar. 2018.
for human action recognition,” Neurocomputing, vol. 396, [99] S. Ji, W. Xu, M. Yang, and K. Yu, “3D convolutional neural net-
pp. 224–229, 2020. works for human action recognition,” IEEE Trans. Pattern Anal.
[75] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
“Action recognition in video sequences using deep bi-directional [100] H. Zhang, L. Zhang, X. Qui, H. Li, P. H. Torr, and P. Koniusz,
LSTM with CNN features,” IEEE Access, vol. 6, pp. 1155–1166, 2017. “Few-shot action recognition with permutation-invariant
[76] J.-Y. He, X. Wu, Z.-Q. Cheng, Z. Yuan, and Y.-G. Jiang, “DB- attention,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 525–542.
LSTM: Densely-connected bi-directional LSTM for human action [101] X. Li, B. Shuai, and J. Tighe, “Directional temporal modeling for
recognition,” Neurocomputing, vol. 444, pp. 319–331, 2021. action recognition,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 275–291.
[77] H. Zhao and X. Jin, “Human action recognition based on [102] A. Diba et al., “Temporal 3D convnets: New architecture and
improved fusion attention CNN and RNN,” in Proc. 5th Int. Conf. transfer learning for video classification,” 2017, arXiv:1711.08200.
Comput. Intell. Appl., 2020, pp. 108–112. [103] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger,
[78] S. Sharma, R. Kiros, and R. Salakhutdinov, “Action recognition “Densely connected convolutional networks,” in Proc. IEEE Conf.
using visual attention,” 2015, arXiv:1511.04119. Comput. Vis. Pattern Recognit., 2017, pp. 2261–2269.
[79] H. Ge, Z. Yan, W. Yu, and L. Sun, “An attention mechanism based [104] A. Diba et al., “Spatio-temporal channel correlation networks for
convolutional LSTM network for video action recognition,” Mul- action classification,” in Proc. Eur. Conf. Comput. Vis., 2018,
timed. Tools. Appl., vol. 78, no. 14, pp. 20533–20556, 2019. pp. 299–315.
[80] S. Sudhakaran, S. Escalera, and O. Lanz, “LSTA: Long short-term [105] G. Varol, I. Laptev, and C. Schmid, “Long-term temporal convo-
attention for egocentric action recognition,” in Proc. IEEE Conf. lutions for action recognition,” IEEE Trans. Pattern Anal. Mach.
Comput. Vis. Pattern Recognit., 2019, pp. 9946–9955. Intell., vol. 40, no. 6, pp. 1510–1517, Jun. 2018.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3219

[106] N. Hussein, E. Gavves, and A. W. Smeulders, “Timeception for [131] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spa-
complex action recognition,” in Proc. IEEE Conf. Comput. Vis. Pat- tiotemporal feature learning: Speed-accuracy trade-offs in video
tern Recognit., 2019, pp. 254–263. classification,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 318–335.
[107] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural [132] K. Li, X. Li, Y. Wang, J. Wang, and Y. Qiao, “CT-net: Channel ten-
networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., sorization network for video classification,” in Proc. Int. Conf.
2018, pp. 7794–7803. Learn. Representations, 2021.
[108] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A [133] H. Yang et al., “Asymmetric 3D convolutional neural networks
new model and the kinetics dataset,” in Proc. IEEE Conf. Comput. for action recognition,” Pattern Recognit., vol. 85, pp. 1–12, 2019.
Vis. Pattern Recognit., 2017, pp. 4724–4733. [134] J. Lin, C. Gan, and S. Han, “TSM: Temporal shift module for effi-
[109] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, cient video understanding,” in Proc. IEEE/CVF Int. Conf. Comput.
“A closer look at spatiotemporal convolutions for action recog- Vis., 2019, pp. 7082–7092.
nition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, [135] S. Sudhakaran, S. Escalera, and O. Lanz, “Gate-shift networks for
pp. 6450–6459. video action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern
[110] Y. Zhou, X. Sun, Z.-J. Zha, and W. Zeng, “MiCT: Mixed 3D/2D Recognit., 2020, pp. 1099–1108.
convolutional tube for human action recognition,” in Proc. IEEE [136] L. Wang, Z. Tong, B. Ji, and G. Wu, “TDN: Temporal difference
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 449–458. networks for efficient action recognition,” in Proc. IEEE Conf.
[111] Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann, “Hidden two- Comput. Vis. Pattern Recognit., 2021, pp. 1895–1904.
stream convolutional networks for action recognition,” in Proc. [137] Z. Wang, Q. She, and A. Smolic, “ACTION-net: Multipath excita-
Asian Conf. Comput. Vis., 2018, pp. 363–378. tion for action recognition,” in Proc. IEEE Conf. Comput. Vis. Pat-
[112] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “SlowFast net- tern Recognit., 2021, pp. 13209–13218.
works for video recognition,” in Proc. IEEE/CVF Int. Conf. Com- [138] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler,
put. Vis., 2019, pp. 6201–6210. “Convolutional learning of spatio-temporal features,” in Proc.
[113] C. Feichtenhofer, A. Pinz, and R. P. Wildes, “Spatiotemporal resid- Eur. Conf. Comput. Vis., 2010, pp. 140–153.
ual networks for video action recognition,” 2016, arXiv:1611.02155. [139] D. Li, Z. Qiu, Y. Pan, T. Yao, H. Li, and T. Mei, “Representing
[114] J. Li, X. Liu, M. Zhang, and D. Wang, “Spatio-temporal deform- videos as discriminative sub-graphs for action recognition,” in
able 3D convnets with attention for action recognition,” Pattern Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021.
Recognit., vol. 98, 2020, Art. no. 107037. [140] J. Zhou, K.-Y. Lin, H. Li, and W.-S. Zheng, “Graph-based high-
[115] M. Zolfaghari, K. Singh, and T. Brox, “ECO: Efficient convolu- order relation modeling for long-term action recognition,”
tional network for online video understanding,” in Proc. Eur. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 8980–
Conf. Comput. Vis., 2018, pp. 713–730. 8989.
[116] L. Wang, W. Li, W. Li, and L. Van Gool, “Appearance-and-rela- [141] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video
tion networks for video classification,” in Proc. IEEE Conf. Com- action transformer network,” in Proc. IEEE Conf. Comput. Vis. Pat-
put. Vis. Pattern Recognit., 2018, pp. 1430–1439. tern Recognit., 2019, pp. 244–253.
[117] B. Martinez, D. Modolo, Y. Xiong, and J. Tighe, “Action recogni- [142] S. Zhang, S. Guo, W. Huang, M. R. Scott, and L. Wang, “V4D: 4D
tion with spatial-temporal discriminative filter banks,” in Proc. convolutional neural networks for video-level representation
IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 5481–5490. learning,” in Proc. Int. Conf. Learn. Representations, 2020.
[118] Y. Zhou, X. Sun, C. Luo, Z.-J. Zha, and W. Zeng, “Spatiotemporal [143] H. Duan, Y. Zhao, Y. Xiong, W. Liu, and D. Lin, “Omni-sourced
fusion in 3D CNNs: A probabilistic view,” in Proc. IEEE Conf. webly-supervised learning for video recognition,” in Proc. Eur.
Comput. Vis. Pattern Recognit., 2020, pp. 9826–9835. Conf. Comput. Vis., 2020, pp. 670–-688.
[119] C. Yang, Y. Xu, J. Shi, B. Dai, and B. Zhou, “Temporal pyramid [144] D. Ghadiyaram, D. Tran, and D. Mahajan, “Large-scale weakly-
network for action recognition,” in Proc. IEEE Conf. Comput. Vis. supervised pre-training for video action recognition,” in Proc.
Pattern Recognit., 2020, pp. 588–597. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12038–12047.
[120] J. Kim, S. Cha, D. Wee, S. Bae, and J. Kim, “Regularization on [145] G. Lev, G. Sadeh, B. Klein, and L. Wolf, “RNN fisher vectors for
spatio-temporally smoothed feature for action recognition,” in action recognition and image annotation,” in Proc. Eur. Conf.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 12100– Comput. Vis., 2016, pp. 833–850.
12109. [146] Y. Pan et al., “Compressing recurrent neural networks with ten-
[121] A. Piergiovanni and M. S. Ryoo, “Recognizing actions in videos sor ring for action recognition,” in Proc. AAAI Conf. Artif. Intell.,
from unseen viewpoints,” in Proc. IEEE Conf. Comput. Vis. Pattern 2019, pp. 4683–4690.
Recognit., 2021, pp. 4122–4130. [147] C. Feichtenhofer, “X3D: Expanding architectures for efficient
[122] G. Varol, I. Laptev, C. Schmid, and A. Zisserman, “Synthetic video recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
humans for action recognition from unseen viewpoints,” Int. J. nit., 2020, pp. 200–210.
Comput. Vis., vol. 129, pp. 2264–2287, 2021. [148] M. E. Kalfaoglu, S. Kalkan, and A. A. Alatan, “Late temporal
[123] H. Rahmani, A. Mian, and M. Shah, “Learning a deep model for modeling in 3D CNN architectures with bert for action recog-
human action recognition from novel viewpoints,” IEEE Trans. nition,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 731–747.
Pattern Anal. Mach. Intell., vol. 40, no. 3, no. 3, pp. 667–681, [149] X. Yang, H. Fan, L. Torresani, L. S. Davis, and H. Wang, “Beyond
Mar. 2018. short clips: End-to-end video-level learning with collaborative
[124] J. Stroud, D. Ross, C. Sun, J. Deng, and R. Sukthankar, “D3D: Dis- memories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
tilled 3D networks for video action recognition,” in Proc. IEEE 2021, pp. 7563–7572.
Workshop Appl. Comput. Vis., 2020, pp. 614–623. [150] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution
[125] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid, “MARS: representation learning for human pose estimation,” in Proc.
Motion-augmented RGB stream for action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5686–5696,.
IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7874–7883. [151] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L.-Y. Duan, and A. C.
[126] Z. Shou et al., “DMC-net: Generating discriminative motion cues Kot, “NTU RGB+D: A large-scale benchmark for 3D human
for fast compressed video action recognition,” in Proc. IEEE Conf. activity understanding,” IEEE Trans. Pattern Anal. Mach. Intell.,
Comput. Vis. Pattern Recognit., 2019, pp. 1268–1277. vol. 42, no. 10, pp. 2684–2701, Oct. 2020.
[127] H. Wang, D. Tran, L. Torresani, and M. Feiszli, “Video modeling [152] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolu-
with correlation networks,” in Proc. IEEE Conf. Comput. Vis. Pat- tional networks for skeleton-based action recognition,” in Proc.
tern Recognit., 2020, pp. 349–358. AAAI Conf. Artif. Intell., 2018, pp. 7444–7452.
[128] M. Fayyaz et al., “3D CNNs with adaptive temporal feature reso- [153] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Learning actionlet ensem-
lutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, ble for 3D human action recognition,” IEEE Trans. Pattern Anal.
pp. 4729–4738. Mach. Intell., vol. 36, no. 5, pp. 914–927, May 2014.
[129] L. Sun, K. Jia, D.-Y. Yeung, and B. E. Shi, “Human action recogni- [154] R. Vemulapalli, F. Arrate, and R. Chellappa, “Human action recog-
tion using factorized spatio-temporal convolutional networks,” nition by representing 3D skeletons as points in a lie group,” in
in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2015, pp. 4597–4605. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 588–595.
[130] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal represen- [155] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural
tation with pseudo-3D residual networks,” in Proc. IEEE/CVF network for skeleton based action recognition,” in Proc. IEEE
Int. Conf. Comput. Vis., 2017, pp. 5534–5542. Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1110–1118.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3220 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

[156] S. Zhang, X. Liu, and J. Xiao, “On geometric features for skeleton- [179] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, “View
based action recognition using multilayer LSTM networks,” in adaptive neural networks for high performance skeleton-based
Proc. IEEE Workshop Appl. Comput. Vis., 2017, pp. 148–157. human action recognition,” IEEE Trans. Pattern Anal. Mach.
[157] S. Zhang et al., “Fusing geometric features for skeleton-based Intell., vol. 41, no. 8, pp. 1963–1978, Aug. 2019.
action recognition using multilayer LSTM networks,” IEEE [180] Q. Ke, S. An, M. Bennamoun, F. Sohel, and F. Boussaid, “Skeletonnet:
Trans. Multimedia, vol. 20, no. 9, pp. 2330–2343, Sep. 2018. Mining deep part features for 3-D action recognition,” IEEE Signal
[158] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, Process. Lett., vol. 24, no. 6, pp. 731–735, Jun. 2017.
“Skeleton-based human action recognition with global context- [181] B. Li, Y. Dai, X. Cheng, H. Chen, Y. Lin, and M. He, “Skeleton
aware attention LSTM networks,” IEEE Trans. Image Process., based action recognition using translation-scale invariant image
vol. 27, no. 4, pp. 1586–1599, Apr. 2018. mapping and multi-scale deep CNN,” in Proc. IEEE Int. Conf.
[159] V. Veeriah, N. Zhuang, and G.-J. Qi, “Differential recurrent neu- Multimedia Expo Workshops, 2017, pp. 601–604.
ral networks for action recognition,” in Proc. IEEE/CVF Int. Conf. [182] F. Yang, Y. Wu, S. Sakti, and S. Nakamura, “Make skeleton-based
Comput. Vis., 2015, pp. 4041–4049. action recognition model smaller, faster and better,” in Proc.
[160] W. Zhu et al., “Co-occurrence feature learning for skeleton based ACM Multimedia Asia, 2019. Art. no. 31.
action recognition using regularized deep LSTM networks,” in [183] Y. Tang, X. Liu, X. Yu, D. Zhang, J. Lu, and J. Zhou, “Learning
Proc. AAAI Conf. Artif. Intell., 2016, pp. 3697–3703. from temporal spatial cubism for cross-dataset skeleton-based
[161] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “NTU RGB+D: A action recognition,” ACM Trans. Multimedia Comput. Commun.
large scale dataset for 3D human activity analysis,” in Proc. IEEE Appl., vol. 18, pp. 1–24, 2022.
Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1010–1019. [184] L. Zhao et al., “T-GCN: A temporal graph convolutional network
[162] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal for traffic prediction,” IEEE Trans. Intell. Transp. Syst., vol. 21,
LSTM with trust gates for 3D human action recognition,” in Proc. no. 2, pp. 3848–3858, Sep. 2020.
Eur. Conf. Comput. Vis., 2016, pp. 816–833. [185] F. Monti, K. Otness, and M. M. Bronstein, “MotifNet: A motif-
[163] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, “Global context- based graph convolutional network for directed graphs,” in Proc.
aware attention LSTM networks for 3D action recognition,” in Proc. IEEE Data Sci. Workshop, 2018, pp. 225–228.
IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3671–3680. [186] C. Si, Y. Jing, W. Wang, L. Wang, and T. Tan, “Skeleton-based
[164] H. Wang and L. Wang, “Modeling temporal dynamics and spa- action recognition with spatial reasoning and temporal stack
tial configurations of actions using two-stream recurrent neural learning,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 106–121.
networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [187] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Skeleton-based action rec-
2017, pp. 3633–3642. ognition with directed graph neural networks,” in Proc. IEEE
[165] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, “An end-to-end spatio- Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7904–7913.
temporal attention model for human action recognition from skele- [188] Y. Tang, Y. Tian, J. Lu, P. Li, and J. Zhou, “Deep progressive rein-
ton data,” in Proc. AAAI Conf. Artif. Intell., 2017, pp. 4263–4270. forcement learning for skeleton-based action recognition,” in Proc.
[166] I. Lee, D. Kim, S. Kang, and S. Lee, “Ensemble deep learning for IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5323–5332.
skeleton-based action recognition using temporal sliding LSTM [189] X. Zhang, C. Xu, X. Tian, and D. Tao, “Graph edge convolutional
networks,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2017, neural networks for skeleton-based action recognition,” IEEE
pp. 1012–1020. Trans. Neural Netw. Learn. Syst., vol. 31, no. 8, pp. 3047–3060,
[167] S. Li, W. Li, C. Cook, Y. Gao, and C. Zhu, “Deep independently Aug. 2020.
recurrent neural network (indRNN),” 2019, arXiv:1910.06251. [190] M. Korban and X. Li, “DDGCN: A dynamic directed graph con-
[168] T. S. Kim and A. Reiter, “Interpretable 3 D human action analysis volutional network for action recognition,” in Proc. Eur. Conf.
with temporal convolutional networks,” in Proc. IEEE Conf. Com- Comput. Vis., 2020, pp. 761–776.
put. Vis. Pattern Recognit. Workshops, 2017, pp. 1623–1631. [191] K. Cheng, Y. Zhang, C. Cao, L. Shi, J. Cheng, and H. Lu,
[169] Y. Hou, Z. Li, P. Wang, and W. Li, “Skeleton optical spectra- “Decoupling GCN with dropgraph module for skeleton-based
based action recognition using convolutional neural networks,” action recognition,” in Proc. 16th Eur. Conf. Comput. Vis., 2020,
IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 3, pp. 807– pp. 536–553.
811, Mar. 2018. [192] P. Yu, Y. Zhao, C. Li, J. Yuan, and C. Chen, “Structure-aware
[170] P. Wang, Z. Li, Y. Hou, and W. Li, “Action recognition based on human-action generation,” in Proc. Eur. Conf. Comput. Vis., 2020,
joint trajectory maps using convolutional neural networks,” in pp. 18–34.
Proc. 24th ACM Int. Conf. Multimedia, 2016, pp. 102–106. [193] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian,
[171] C. Li, Y. Hou, P. Wang, and W. Li, “Joint distance maps based “Actional-structural graph convolutional networks for skeleton-
action recognition with convolutional neural networks,” IEEE based action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern
Signal Process. Lett., vol. 24, no. 5, pp. 624–628, May 2017. Recognit., 2019, pp. 3590–3598.
[172] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “A [194] W. Peng, X. Hong, H. Chen, and G. Zhao, “Learning graph con-
new representation of skeleton sequences for 3D action recog- volutional network for skeleton-based human action recognition
nition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, by neural searching,” in Proc. AAAI Conf. Artif. Intell., 2020,
pp. 4570–4579. pp. 2669–2676.
[173] C. Lea, M. Flynn, R. Vidal, A. Reiter, and G. Hager, “Temporal con- [195] X. Zhang, C. Xu, and D. Tao, “Context aware graph convolution
volutional networks for action segmentation and detection,” Proc. for skeleton-based action recognition,” in Proc. IEEE Conf. Com-
IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1003–1012. put. Vis. Pattern Recognit., 2020, pp. 14321–14330.
[174] C. Li, Q. Zhong, D. Xie, and S. Pu, “Co-occurrence feature learn- [196] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive
ing from skeleton data for action recognition and detection with graph convolutional networks for skeleton-based action recog-
hierarchical aggregation,” 2018, arXiv:1804.06055. nition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019,
[175] C. Caetano, J. Sena, F. Bremond, J. A. Dos Santos, and W. R. pp. 12018–12027.
Schwartz, “SkeleMotion: A new representation of skeleton joint [197] C. Wu, X.-J. Wu, and J. Kittler, “Spatial residual layer and dense
sequences based on motion information for 3D action recog- connection block enhanced spatial temporal graph convolutional
nition,” in Proc. IEEE Int. Conf. Adv. Video Signal Based Surveill., network for skeleton-based action recognition,” in Proc. IEEE Int.
2019, pp. 1–8. Conf. Comput. Vis. Workshops, 2019, pp. 1740–1748.
[176] C. Caetano, F. Bremond, and W. R. Schwartz, “Skeleton image [198] B. Li, X. Li, Z. Zhang, and F. Wu, “Spatio-temporal graph routing
representation for 3D action recognition based on tree structure for skeleton-based action recognition,” in Proc. AAAI Conf. Artif.
and reference joints,” in Proc. SIBGRAPI Conf. Graph. Patterns Intell., 2019, pp. 8561–8568.
Images, 2019, pp. 16–23. [199] Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang,
[177] J. Liu, N. Akhtar, and A. Mian, “Skepxels: Spatio-temporal image “Disentangling and unifying graph convolutions for skeleton-
representation of human skeleton joints for action recognition,” based action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, Recognit., 2020, pp. 140–149.
pp. 10–19. [200] P. Zhang, C. Lan, W. Zeng, J. Xing, J. Xue, and N. Zheng,
[178] M. Liu, H. Liu, and C. Chen, “Enhanced skeleton visualization “Semantics-guided neural networks for efficient skeleton-based
for view invariant human action recognition,” Pattern Recognit., human action recognition,” in Proc. IEEE Conf. Comput. Vis. Pat-
vol. 68, pp. 346–362, 2017. tern Recognit., 2020, pp. 1109–1118.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3221

[201] C. Si, W. Chen, W. Wang, L. Wang, and T. Tan, “An attention [222] A. Sanchez-Caballero et al., “3DFCNN: Real-time action recogni-
enhanced graph convolutional LSTM network for skeleton-based tion using 3D deep neural networks with raw depth
action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec- information,” 2020, arXiv:2006.07743.
ognit., 2019, pp. 1227–1236. [223] A. Sanchez-Caballero, D. Fuentes-Jimenez, and C. Losada-
[202] Y.-H. Wen, L. Gao, H. Fu, F.-L. Zhang, and S. Xia, “Graph CNNs Gutierrez, “Exploiting the ConvLSTM: Human action recogni-
with motif and variable temporal block for skeleton-based action tion using raw depth video-based recurrent neural networks,”
recognition,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8989–8996. 2020, arXiv:2006.07744.
[203] K. Cheng, Y. Zhang, X. He, W. Chen, J. Cheng, and H. Lu, [224] Y. Zhu and S. Newsam, “Depth2Action: Exploring embedded
“Skeleton-based action recognition with shift graph convolu- depth for large-scale action recognition,” in Proc. Eur. Conf. Com-
tional network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., put. Vis., 2016, pp. 668–684.
2020, pp. 180–189. [225] F. Liu, C. Shen, and G. Lin, “Deep convolutional neural fields for
[204] Y.-F. Song, Z. Zhang, C. Shan, and L. Wang, “Constructing stron- depth estimation from a single image,” in Proc. IEEE Conf. Com-
ger and faster baselines for skeleton-based action recognition,” put. Vis. Pattern Recognit., 2015, pp. 5162–5170.
2021, arXiv:2106.15125. [226] D. Eigen and R. Fergus, “Predicting depth, surface normals and
[205] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, semantic labels with a common multi-scale convolutional
“Symbiotic graph neural networks for 3D skeleton-based human architecture,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2015,
action recognition and motion prediction,” IEEE Trans. Pattern pp. 2650–2658.
Anal. Mach. Intell., vol 44, no. 6, pp. 3316–3333, Jun. 2022. [227] A. Akula, A. K. Shah, and R. Ghosh, “Deep learning approach for
[206] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, “View human action recognition in infrared images,” Cogn. Syst. Res.,
adaptive recurrent neural networks for high performance human vol. 50, 2018.
action recognition from skeleton data,” in Proc. IEEE/CVF Int. [228] T. Kawashima et al., “Action recognition from extremely low-res-
Conf. Comput. Vis., 2017, pp. 2136–2145. olution thermal image sequence,” in Proc. 14th IEEE Int. Conf.
[207] C. Xie et al., “Memory attention networks for skeleton-based Adv. Video Signal Based Surveill., 2017, pp. 1–6.
action recognition,” 2018, arXiv:1804.08254. [229] A. K. Shah, R. Ghosh, and A. Akula, “A spatio-temporal deep
[208] Y. Du, Y. Fu, and L. Wang, “Skeleton based action recognition learning approach for human action recognition in infrared vid-
with convolutional neural network,” in Proc. IAPR Asian Conf. on eos,” in Proc. Opt. Photon. Informat. Process. XII, 2018, pp. 249–257.
Pattern Recognit., 2015, pp. 579–583. [230] H. Meglouli et al., “A new technique based on 3D convolutional
[209] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, neural networks and filtering optical flow maps for action classi-
“Learning clip representations for skeleton-based 3D action rec- fication in infrared video,” Control Eng. Appl. Inf., vol. 21, no. 4,
ognition,” IEEE Trans. Image Process., vol. 27, no. 6, pp. 2842–2855, pp. 43–50, 2019.
Jun. 2018. [231] Y. Liu, Z. Lu, J. Li, T. Yang, and C. Yao, “Global temporal repre-
[210] Y. Xu, J. Cheng, L. Wang, H. Xia, F. Liu, and D. Tao, “Ensemble sentation based CNNs for infrared action recognition,” IEEE Sig-
one-dimensional convolution neural networks for skeleton- nal Process. Lett., vol. 25, no. 6, pp. 848–852, Jun. 2018.
based action recognition,” IEEE Signal Process. Lett., vol. 25, no. 7, [232] D.-M. Tsai, W.-Y. Chiu, and M.-H. Lee, “Optical flow-motion his-
pp. 1044–1048, Jul. 2018. tory image (OF-MHI) for action recognition,” Signal Image Video
[211] Y. Li, R. Xia, X. Liu, and Q. Huang, “Learning shape-motion rep- Process., vol. 9, pp. 1897–1906, 2015.
resentations from geometric algebra spatio-temporal model for [233] Z. Jiang, Y. Wang, L. Davis, W. Andrews, and V. Rozgic, “Learning
skeleton-based action recognition,” in Proc. IEEE Int. Conf. Multi- discriminative features via label consistent neural network,” Proc.
media Expo, 2019, pp. 1066–1071. IEEE Workshop Appl. Comput. Vis., 2017, pp. 207–216.
[212] A. Banerjee, P. K. Singh, and R. Sarkar, “Fuzzy integral based [234] J. Imran and B. Raman, “Deep residual infrared action recogni-
CNN classifier fusion for 3D skeleton action recognition,” IEEE tion by integrating local and global spatio-temporal cues,” Infra-
Trans. Circuits Syst. Video Technol., vol. 31, no. 6, pp. 2206–2216, red Phys. Technol., vol. 102, 2019, Art. no. 103014.
Jun. 2021. [235] J. Imran and B. Raman, “Evaluating fusion of RGB-D and inertial
[213] L. Li, M. Wang, B. Ni, H. Wang, J. Yang, and W. Zhang, “3D sensors for multimodal human action recognition,” J. Ambient
human action representation learning via cross-view consistency Intell. Humanized Comput., vol. 11, no. 1, pp. 189–208, 2020.
pursuit,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, [236] V. Mehta, A. Dhall, S. Pal, and S. Khan, “Motion and region aware
pp. 4739–4748. adversarial learning for fall detection with thermal imaging,”
[214] H. Mohaghegh, N. Karimi, R. Soroushmehr, S. Samavi, and K. early access, May 5, 2021, 10.1109/ICPR48806.2021.9412632.
Najarian, “Aggregation of rich depth aware features in a modi- [237] J. Wang, Z. Liu, Y. Wu, and J. Yuan, “Mining actionlet ensemble
fied stacked generalization model for single image depth for action recognition with depth cameras,” in Proc. IEEE Conf.
estimation,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, Comput. Vis. Pattern Recognit., 2012, pp. 1290–1297.
no. 3, pp. 683–697, Mar. 2019. [238] J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, “Cross-view action
[215] M. Harville and D. Li, “Fast, integrated person tracking and modeling, learning and recognition,” in Proc. IEEE Conf. Comput.
activity recognition with plan-view templates from a single ste- Vis. Pattern Recognit., 2014, pp. 2649–2656.
reo camera,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [239] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, “Histogram
2004, pp. II–II. of oriented principal components for cross-view action recog-
[216] M.-C. Roh, H.-K. Shin, and S.-W. Lee, “View-independent nition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 12,
human action recognition with volume motion template on sin- pp. 2430–2443, Dec. 2016.
gle stereo camera,” Pattern Recognit. Lett., vol. 31, no. 7, pp. 639– [240] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag
647, 2010. of 3D points,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
[217] P. Wang, W. Li, Z. Gao, C. Tang, and P. O. Ogunbona, “Depth Workshops, 2010, pp. 9–14.
pooling based large-scale 3-D action recognition with convolu- [241] A. Amir et al., “A low power, fully event-based gesture recogni-
tional neural networks,” IEEE Trans. Multimedia, vol. 20, no. 5, tion system,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
pp. 1051–1061, May 2018. 2017, pp. 7388–7397.
[218] X. Yang and Y. Tian, “Super normal vector for activity recogni- [242] X. Liu, M. Yan, and J. Bohg, “Meteornet: Deep learning on
tion using depth sequences,” in Proc. IEEE Conf. Comput. Vis. Pat- dynamic 3D point cloud sequences,” in Proc. IEEE/CVF Int. Conf.
tern Recognit., 2014, pp. 804–811. Comput. Vis., 2019, pp. 9245–9254.
[219] O. Oreifej and Z. Liu, “HON4D: Histogram of oriented 4D nor- [243] Y. Min, Y. Zhang, X. Chai, and X. Chen, “An efficient pointLSTM
mals for activity recognition from depth sequences,” in Proc. for point clouds based gesture recognition,” in Proc. IEEE Conf.
IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 716–723. Comput. Vis. Pattern Recognit., 2020, pp. 5760–5769.
[220] X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using [244] G. Wang, H. Liu, M. Chen, Y. Yang, Z. Liu, and H. Wang, “Anchor-
depth motion maps-based histograms of oriented gradients,” in based spatial-temporal attention convolutional networks for
Proc. 20th ACM Int. Conf. Multimedia, 2012, pp. 1057–1060. dynamic 3D point cloud sequences,” 2020, arXiv:2012.10860.
[221] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. O. Ogunbona, [245] H. Wang, L. Yang, X. Rong, J. Feng, and Y. Tian, “Self-supervised
“Action recognition from depth maps using deep convolutional 4D spatio-temporal feature learning via order prediction of
neural networks,” IEEE Trans. Hum. Mach. Syst., vol. 46, no. 4, sequential point cloud clips,” Proc. IEEE Workshop Appl. Comput.
pp. 498–509, Aug. 2016. Vis., 2021, pp. 3761–3770.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3222 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

[246] H. Fan, Y. Yang, and M. Kankanhalli, “Point 4D transformer [270] U. Maurer, A. Smailagic, D. P. Siewiorek, and M. Deisher,
networks for spatio-temporal modeling in point cloud vid- “Activity recognition and monitoring using multiple sensors on
eos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, different body positions,” in Proc. Int. Workshop Wearable Implant-
pp. 14199–14208. able Body Sensor Netw., 2006, Art. no. 116.
[247] H. Fan, X. Yu, Y. Ding, Y. Yang, and M. Kankanhalli, “PSTNet: [271] Y. Chen and Y. Xue, “A deep learning approach to human activ-
Point spatio-temporal convolution on point cloud sequences,” in ity recognition based on single accelerometer,” in Proc. IEEE Int.
Proc. Int. Conf. Learn. Representations, 2021, pp. 1–23. Conf. Syst., Man, Cybern., 2015, pp. 1488–1492.
[248] Y. Wei, H. Liu, T. Xie, Q. Ke, and Y. Guo, “Spatial-temporal [272] M. Panwar et al., “CNN based approach for activity recognition
transformer for 3D point cloud sequences,” in Proc. IEEE Work- using a wrist-worn accelerometer,” in Proc. 39th Annu. Int. Conf.
shop Appl. Comput. Vis., 2022, pp. 657–666. IEEE Eng. Med. Biol. Soc., 2017, pp. 2438–2441.
[249] Q. Wang, Y. Zhang, J. Yuan, and Y. Lu, “Space-time event clouds [273] A. Ignatov, “Real-time human activity recognition from acceler-
for gesture recognition: From RGB cameras to event cameras,” in ometer data using convolutional neural networks,” Appl. Soft
Proc. IEEE Workshop Appl. Comput. Vis., 2019, pp. 1826–1835. Comput., vol. 62, pp. 915–922, 2018.
[250] C. Huang, “Event-based action recognition using timestamp [274] J. Wang et al., “Human action recognition on cellphone using
image encoding network,” 2020, arXiv:2009.13049. compositional bidir-LSTM-CNN networks,” in Proc. Int. Conf.
[251] J. Chen, J. Meng, X. Wang, and J. Yuan, “Dynamic graph CNN Comput. Netw. Commun. Inf. Syst., 2019, pp. 687–692.
for event-camera based gesture recognition,” in Proc. IEEE Int. [275] J. Lu and K.-Y. Tong, “Robust single accelerometer-based activity
Symp. Circuits Syst., 2020, pp. 1–5. recognition using modified recurrence plot,” IEEE Sens. J.,
[252] S. U. Innocenti, F. Becattini, F. Pernici, and A. Del Bimbo, vol. 19, no. 15, pp. 6317–6324, Aug. 2019.
“Temporal binary representation for event-based action recog- [276] J. Eckmann et al., “Recurrence plots of dynamical systems,”
nition,” in Proc. Int. Conf. Pattern Recognit., 2021, pp. 10426–10432. World Sci. Ser. Nonlinear Sci. Ser. A, vol. 16, pp. 441–446, 1995.
[253] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, “HOPC: [277] G. L. Santos, P. T. Endo, K. H. D. C. Monteiro, E. D. S. Rocha, I.
Histogram of oriented principal components of 3D pointclouds Silva, and T. Lynn, “Accelerometer-based human fall detection
for action recognition,” in Proc. Eur. Conf. Comput. Vis., 2014, using convolutional neural networks,” Sensors, vol. 19, no. 7,
pp. 742–757. 2019, Art. no. 1644.
[254] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep hierar- [278] A. Lin and H. Ling, “Doppler and direction-of-arrival (DDOA)
chical feature learning on point sets in a metric space,” in Proc. radar for multiple-mover sensing,” IEEE Trans. Aerosp. Electron.
27th Int. Conf. Neural Informat. Process. Syst., 2017, pp. 5105–5114. Sys., vol. 43, no. 4, pp. 1496–1509, Oct. 2007.
[255] C. Choy, J. Gwak, and S. Savarese, “4D spatio-temporal convnets: [279] A. Stove, “Modern FMCW radar-techniques and applications,”
Minkowski convolutional neural networks,” in Proc. IEEE Conf. in Proc. 1st Eur. Radar Conf., 2004, pp. 149–152.
Comput. Vis. Pattern Recognit., 2019, pp. 3070–3079. [280] J. Park, R. J. Javier, T. Moon, and Y. Kim, “Micro-doppler based clas-
[256] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128  128120 db 15 ms sification of human aquatic activities via transfer learning of convo-
latency asynchronous temporal contrast vision sensor,” IEEE J. Solid- lutional neural networks,” Sensors, vol. 16, no. 12, 2016, Art. no. 1990.
Statist. Circuits, vol. 43, no. 2, pp. 566–576, Feb. 2008. [281] R. Trommel, R. Harmanny, L. Cifola, and J. Driessen, “Multi-tar-
[257] R. Berner, C. Brandli, M. Yang, S.-C. Liu, and T. Delbruck, get human gait classification using deep convolutional neural
“A 240  180 10mw 12us latency sparse-output vision sensor for networks on micro-doppler spectrograms,” in Proc. Eur. Radar
mobile applications,” in Proc. Symp. VLSI Circuits, 2013, Conf., 2016, pp. 81–84.
pp. C186–C187. [282] Y. Kim, J. Park, and T. Moon, “Classification of micro-doppler
[258] S. A. Baby, B. Vinod, C. Chinni, and K. Mitra, “Dynamic vision signatures of human aquatic activity through simulation and
sensors for human activity recognition,” in Proc. IAPR Asian measurement using transferred learning,” in Radar Sensor Tech-
Conf. Pattern Recognit., 2017, pp. 316–321. nology XXI, Bellingham, WA, USA: SPIE, 2017.
[259] X. Clady, J.-M. Maro, S. Barre, and R. B. Benosman, “A motion- [283] H. Du, Y. He, and T. Jin, “Transfer learning for human activities
based feature for event-based pattern recognition,” Front. Neuro- classification using micro-doppler spectrograms,” in Proc. IEEE
sci., vol. 10, 2017, Art. no. 594. Int. Conf. Comput. Electromagnetics, 2018, pp. 1–3.
[260] A. M. George, D. Banerjee, S. Dey, A. Mukherjee, and P. Bala- [284] S. A. Shah and F. Fioranelli, “Human activity recognition: Pre-
murali, “A reservoir-based convolutional spiking neural network liminary results for dataset portability using FMCW radar,” in
for gesture recognition from DVS input,” in Proc. Int. Joint Conf. Proc. Int. Radar Conf., 2019, pp. 1–4.
Neural Netw., 2020, pp. 1–9. [285] Y. Lin, J. Le Kernec, S. Yang, F. Fioranelli, O. Romain, and Z.
[261] Y. Bi, A. Chadha, A. Abbas, E. Bourtsoulatze, and Y. Andreopou- Zhao, “Human activity classification with radar: Optimization
los, “Graph-based spatio-temporal feature learning for neuro- and noise robustness with iterative convolutional neural net-
morphic vision sensing,” IEEE Trans. Image Process., vol. 29, works followed with random forests,” IEEE Sens. J., vol. 18,
pp. 9084–9098, 2020. no. 23, pp. 9669–9681, Dec. 2018.
[262] G. Laput, K. Ahuja, M. Goel, and C. Harrison, “Ubicoustics: [286] R. Hernang omez, A. Santra, and S. Sta nczak, “Human activity
Plug-and-play acoustic activity recognition,” in Proc. 31st Annu. classification with frequency modulated continuous wave radar
ACM Symp. User Interface Softw. Technol., 2018, pp. 213–224. using deep convolutional neural networks,” in Proc. Int. Radar
[263] S. Hershey et al., “CNN architectures for large-scale audio classi- Conf., 2019, pp. 1–6.
fication,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., [287] H. Zhou, Y. He, H. Huang, and X. Jing, “Two-stream convolu-
2017, pp. 131–135. tional neural network for action recognition in radar,” in Proc.
[264] A. Owens and A. A. Efros, “Audio-visual scene analysis with Signal Inf. Process. Netw. Comput., 2020, pp. 19–26.
self-supervised multisensory features,” in Proc. Eur. Conf. Com- [288] M. Wang, Y. D. Zhang, and G. Cui, “Human motion recognition
put. Vis., 2018, pp. 639–658. exploiting radar with stacked recurrent neural network,” Digit.
[265] R. Gao, T.-H. Oh, K. Grauman, and L. Torresani, “Listen to look: Signal Process., vol. 87, pp. 125–131, 2019.
Action recognition by previewing audio,” in Proc. IEEE Conf. [289] S. Yang, J. Le Kernec, and F. Fioranelli, “Action recognition using
Comput. Vis. Pattern Recognit., 2020, pp. 10454–10464. indoor radar systems,” 2019. [Online]. Available: http://eprints.gla.
[266] F. Xiao, Y. J. Lee, K. Grauman, J. Malik, and C. Feichtenhofer, ac.uk/190039/
“Audiovisual slowfast networks for video recognition,” [290] H. Zou, J. Yang, H. P. Das, H. Liu, Y. Zhou, and C. J. Spanos,
2020, arXiv:2001.08740. “WiFi and vision multimodal learning for accurate and robust
[267] P. Casale, O. Pujol, and P. Radeva, “Human activity recognition device-free human activity recognition,” in Proc. IEEE Conf. Com-
from accelerometer data using a wearable device,” in Proc. Ibe- put. Vis. Pattern Recognit. Workshops, 2019, pp. 426–433.
rian Conf. Pattern Recognit. Image Anal., 2011, pp. 289–296. [291] X. Wu, Z. Chu, P. Yang, C. Xiang, X. Zheng, and W. Huang,
[268] A. Bayat, M. Pomplun, and D. A. Tran, “A study on human activ- “TW-see: Human activity recognition through the wall with
ity recognition using accelerometer data from smartphones,” commodity Wi-Fi devices,” IEEE Trans. Veh. Technol, vol. 68,
Procedia Comput. Sci., vol. 34, pp. 450–457, 2014. no. 1, pp. 306–319, Jan. 2019.
[269] A. M. Khan, Y.-K. Lee, S.-Y. Lee, and T.-S. Kim, “Human activity [292] Y. Wang, J. Liu, Y. Chen, M. Gruteser, J. Yang, and H. Liu, “E-
recognition via an accelerometer-enabled-smartphone using ker- eyes: Device-free location-oriented activity identification using
nel discriminant analysis,” in Proc. 5th Int. Conf. Future Inf. Tech- fine-grained WiFi signatures,” in Proc. 21st Annu. Int. Conf.
nol., 2010, pp. 1–6. Mobile Comput. Netw., 2014, pp. 617–628.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3223

[293] W. Wang, A. X. Liu, M. Shahzad, K. Ling, and S. Lu, “Device-free [315] R. Zhao, H. Ali, and P. Van der Smagt, “Two-stream RNN/CNN
human activity recognition using commercial WiFi devices,” IEEE for action recognition in 3D videos,” in Proc. IEEE/RSJ Int. Conf.
J. Sel. Areas Commun., vol. 35, no. 5, pp. 1118–1131, May 2017. Intell. Robots Syst., 2017, pp. 4260–4267.
[294] J. Wang, X. Zhang, Q. Gao, H. Yue, and H. Wang, “Device-free [316] F. Baradel, C. Wolf, and J. Mille, “Pose-conditioned spatio-
wireless localization and activity recognition: A deep learning temporal attention for human action recognition,” 2017,
approach,” IEEE Trans. Veh. Technol., vol. 66, no. 7, pp. 6258–6267, arXiv:1703.10106.
Jul. 2017. [317] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, “Skeleton-indexed deep
[295] S. Huang, D. Wang, R. Zhao, and Q. Zhang, “Wiga: A WiFi- multi-modal feature learning for high performance human action
based contactless activity sequence recognition system based on recognition,” in Proc. IEEE Int. Conf. Multimedia Expo, 2018, pp. 1–6.
deep learning,” in Proc. 15th Int. Conf. Mobile Ad-Hoc Sensor [318] S. Das et al., “Toyota smarthome: Real-world activities of daily liv-
Netw., 2019, pp. 69–74. ing,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 833–842.
[296] B. Sheng, F. Xiao, L. Sha, and L. Sun, “Deep spatial–temporal model [319] J. Li, X. Xie, Q. Pan, Y. Cao, Z. Zhao, and G. Shi, “SGM-Net: Skel-
based cross-scene action recognition using commodity WiFi,” IEEE eton-guided multimodal network for action recognition,” Pattern
Internet Things J., vol. 7, no. 4, pp. 3592–3601, Apr. 2020. Recognit., vol. 104, 2020, Art. no. 107356.
[297] Z. Chen, L. Zhang, C. Jiang, Z. Cao, and W. Cui, “WiFi CSI based [320] S. Das, S. Sharma, R. Dai, F. Bremond, and M. Thonnat, “VPN:
passive human activity recognition using attention based Learning video-pose embedding for activities of daily living,” in
BLSTM,” IEEE Trans. Mob. Comput., vol. 18, no. 11, pp. 2714– Proc. Eur. Conf. Comput. Vis., 2020, pp. 72–90.
2724, Nov. 2019. [321] H. Duan, Y. Zhao, K. Chen, D. Shao, D. Lin, and B. Dai,
[298] Q. Gao, J. Wang, X. Ma, X. Feng, and H. Wang, “CSI-based “Revisiting skeleton-based action recognition,” ,
device-free wireless localization and activity recognition using 2021, arXiv:2104.13586.
radio image features,” IEEE Trans. Veh. Technol., vol. 66, no. 11, [322] J. Cai, N. Jiang, X. Han, K. Jia, and J. Lu, “JOLO-GCN: Mining
pp. 10 346–10 356, Nov. 2017. joint-centered light-weight information for skeleton-based action
[299] Q. Kong, Z. Wu, Z. Deng, M. Klinkigt, B. Tong, and T. Murakami, recognition,” in Proc. IEEE Workshop Appl. Comput. Vis., 2021,
“MMAct: A large-scale dataset for cross modal human action pp. 2734–2743.
understanding,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, [323] H. Rahmani and M. Bennamoun, “Learning action recognition
pp. 8657–8666. model from depth and skeleton videos,” in Proc. IEEE/CVF Int.
[300] C. Chen, R. Jafari, and N. Kehtarnavaz, “UTD-MHAD: A multi- Conf. Comput. Vis., 2017, pp. 5833–5842.
modal dataset for human action recognition utilizing a depth [324] A. Kamel, B. Sheng, P. Yang, P. Li, R. Shen, and D. D. Feng,
camera and a wearable inertial sensor,” in Proc. IEEE Int. Conf. “Deep convolutional neural networks for human action recogni-
Image Process., 2015, pp. 168–172. tion using depth maps and postures,” IEEE Trans. Syst. Man
[301] J. Yang, J. Lee, and J. Choi, “Activity recognition based on RFID Cybern. Syst., vol. 49, no. 9, pp. 1806–1819, Sep. 2019.
object usage for smart mobile devices,” J. Comput. Sci. Technol., [325] C. Zhao, M. Chen, J. Zhao, Q. Wang, and Y. Shen, “3D behavior
vol. 26, no. 2, pp. 239–246, 2011. recognition based on multi-modal deep space-time learning,”
[302] T. Li, L. Fan, M. Zhao, Y. Liu, and D. Katabi, “Making the Appl. Sci., vol. 9, no. 4, 2019, Art. no. 716.
invisible visible: Action recognition through walls and [326] S. S. Rani, G. A. Naidu, and V. U. Shree, “Kinematic joint descrip-
occlusions,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, tor and depth motion descriptor with convolutional neural net-
pp. 872–881. works for human action recognition,” in Proc. MATPR, 2021,
[303] W. Xu et al., “Keh-gait: Towards a mobile healthcare user authen- pp. 3164–3173.
tication system by kinetic energy harvesting,” in Proc. Netw. Dis- [327] A. Shahroudy, T.-T. Ng, Y. Gong, and G. Wang, “Deep multi-
trib. Syst. Secur., 2017, pp. 1–15. modal feature analysis for action recognition in RGB+ D videos,”
[304] D. Ma, G. Lan, W. Xu, M. Hassan, and W. Hu, “Simultaneous IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, pp. 1045–
energy harvesting and gait recognition using piezoelectric 1058, May 2018.
energy harvester,” 2020, arXiv:2009.02752. [328] J.-F. Hu, W.-S. Zheng, J. Pan, J. Lai, and J. Zhang, “Deep bilinear
[305] L. Wang, B. Sun, J. Robinson, T. Jing, and Y. Fu, “EV-action: learning for RGB-D action recognition,” in Proc. Eur. Conf. Com-
Electromyography-vision multi-modal action dataset,” in put. Vis., 2018, pp. 346–362.
Proc. 15th IEEE Int. Conf. Autom. Face Gesture Recognit., 2020, [329] P. Khaire, P. Kumar, and J. Imran, “Combining CNN streams of
pp. 160–167. RGB-D and skeletal data for human activity recognition,” Pattern
[306] T. Baltrusaitis, C. Ahuja, and L.-P. Morency, “Multimodal Recognit. Lett., vol. 115, pp. 107–116, 2018.
machine learning: A survey and taxonomy,” IEEE Trans. Pattern [330] P. Khaire, J. Imran, and P. Kumar, “Human activity recognition
Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019. by fusion of RGB, depth, and skeletal data,” in Proc. 2nd Int.
[307] J. Imran and P. Kumar, “Human action recognition using RGB-D Conf. Comput. Vis. Image Process., 2018, pp. 409–421.
sensor and deep convolutional neural networks,” in Proc. Int. [331] J. Ye, K. Li, G.-J. Qi, and K. A. Hua, “Temporal order-preserving
Conf. Adv. Comput. Commun. Informat., 2016, pp. 144–148. dynamic quantization for human action recognition from multi-
[308] P. Wang, W. Li, Z. Gao, Y. Zhang, C. Tang, and P. Ogunbona, modal sensor streams,” in Proc. 5th ACM Int. Conf. Multimedia
“Scene flow to action map: A new representation for RGB-D based Retrieval, 2015, pp. 99–106.
action recognition with convolutional neural networks,” in Proc. [332] S. Ardianto and H.-M. Hang, “Multi-view and multi-modal
IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 416–425. action recognition with learned fusion,” in Proc. Asia-Pacific Sig-
[309] L. Wang, Z. Ding, Z. Tao, Y. Liu, and Y. Fu, “Generative multi- nal Inf. Process. Assoc. Annu. Summit Conf., 2018, pp. 1601–1604.
view human action recognition,” in Proc. IEEE/CVF Int. Conf. [333] A. Main de Boissiere and R. Noumeir, “Infrared and 3D
Comput. Vis., 2019, pp. 6211–6220. skeleton feature fusion for RGB-D action recognition,” 2020,
[310] C. Dhiman and D. K. Vishwakarma, “View-invariant deep archi- arXiv:2002.12886.
tecture for human action recognition using two-stream motion [334] C. Wang, H. Yang, and C. Meinel, “Exploring multimodal video
and shape temporal dynamics,” IEEE Trans. Image Process., representation for action recognition,” in Proc. Int. Joint Conf.
vol. 29, pp. 3835–3844, Jan. 2020. Neural Netw., 2016, pp. 1924–1931.
[311] H. Wang, Z. Song, W. Li, and P. Wang, “A hybrid network for [335] M. Brousmiche, J. Rouat, and S. Dupont, “Multi-level attention
large-scale action recognition from RGB and depth modalities,” fusion network for audio-visual event recognition,” 2021,
Sensors, vol. 20, no. 11, 2020, Art. no. 3305. arXiv:2106.06736.
[312] Z. Ren, Q. Zhang, X. Gao, P. Hao, and J. Cheng, “Multi-modality [336] N. Dawar, S. Ostadabbas, and N. Kehtarnavaz, “Data augmentation
learning for human action recognition,” Multimedia Tools Appl., in deep learning-based fusion of depth and inertial sensing for
vol. 80, no. 11, pp. 16 185–16 203, 2021. action recognition,” IEEE Sens. Lett., vol. 3, no. 1, pp. 1–4, Jan. 2019.
[313] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, “Chained [337] N. Dawar and N. Kehtarnavaz, “A convolutional neural net-
multi-stream networks exploiting pose, motion, and appearance work-based sensor fusion system for monitoring transition
for action classification and detection,” in Proc. IEEE/CVF Int. movements in healthcare applications,” in Proc. IEEE 14th Int.
Conf. Comput. Vis., 2017, pp. 2923–2932. Conf. Control Autom., 2018, pp. 482–485.
[314] F. Baradel, C. Wolf, and J. Mille, “Human action recognition: [338] H. Wei, R. Jafari, and N. Kehtarnavaz, “Fusion of video and iner-
Pose-based attention draws focus to hands,” in Proc. IEEE Int. tial sensing for deep learning–based human action recognition,”
Conf. Comput. Vis. Workshops, 2017, pp. 604–613. Sensors, vol. 19, no. 17, 2019, Art. no. 3680.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
3224 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 3, MARCH 2023

[339] Z. Ahmad and N. Khan, “Human action recognition using deep [362] A. Chadha, Y. Bi, A. Abbas, and Y. Andreopoulos,
multilevel multimodal (M) fusion of depth and inertial sensors,” “Neuromorphic vision sensing for CNN-based action recog-
IEEE Sens. J., vol. 20, no. 3, pp. 1445–1455, Feb. 2020. nition,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process.,
[340] Z. Ahmad and N. Khan, “CNN based multistage gated average 2019, pp. 7968–7972.
fusion (MGAF) for human action recognition using depth and iner- [363] Y. Bi and Y. Andreopoulos, “PIX2NVS: Parameterized conversion
tial sensors,” IEEE Sens. J., vol. 21, no. 3, pp. 3623–3634, Feb. 2021. of pixel-domain video frames to neuromorphic vision streams,” in
[341] C. Pham, L. Nguyen, A. Nguyen, N. Nguyen, and V.-T. Nguyen, Proc. IEEE Int. Conf. Image Process., 2017, pp. 1990–1994.
“Combining skeleton and accelerometer data for human fine- [364] Y. Liu, G. Li, and L. Lin, “Semantics-aware adaptive knowledge
grained activity recognition and abnormal behaviour detection distillation for sensor-to-vision action recognition,” 2020,
with deep temporal convolutional networks,” Multimedia Tools arXiv:2009.00210.
Appl., vol. 80, pp. 28919–28940, 2021. [365] A. Perez, V. Sanguineti, P. Morerio, and V. Murino, “Audio-
[342] W. Jiang and Z. Yin, “Human activity recognition using wearable visual model distillation using acoustic images,” in Proc. IEEE
sensors by deep convolutional neural networks,” in Proc. 23rd Workshop Appl. Comput. Vis., 2020, pp. 2843–2852.
ACM Int. Conf. Multimedia, 2015, pp. 1307–1310. [366] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in
[343] F. J. Ord on~ ez and D. Roggen, “Deep convolutional and LSTM a neural network,” in Proc. NeurIPS Deep Learn. Representation
recurrent neural networks for multimodal wearable activity rec- Learn. Workshop, 2015, pp. 1–9.
ognition,” Sensors, vol. 16, no. 1, 2016, Art. no. 115. [367] H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran,
[344] Q. Zou, Y. Wang, Q. Wang, Y. Zhao, and Q. Li, “Deep learning- “Self-supervised learning by cross-modal audio-video
based gait recognition using smartphones in the wild,” IEEE clustering,” 2019, arXiv:1911.12667.
Trans. Inf. Forens. Sec., vol. 15, pp. 3197–3212, Apr. 2020. [368] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in
[345] R. Memmesheimer, N. Theisen, and D. Paulus, “Gimme signals: Proc. IEEE/CVF Int. Conf. Comput. Vis., 2017, pp. 609–617.
Discriminative signal encoding for multimodal activity recog- [369] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of audio
nition,” 2020, arXiv:2003.06156. and video models from self-supervised synchronization,” in Proc.
[346] Y. Kong and Y. Fu, “Bilinear heterogeneous information machine 27th Int. Conf. Neural Informat. Process. Syst., 2018, pp. 7774–7785.
for RGB-D action recognition,” in Proc. IEEE Conf. Comput. Vis. [370] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J.
Pattern Recognit., 2015, pp. 1054–1062. Sivic, “HowTo100M: Learning a text-video embedding by watch-
[347] Y. Kong and Y. Fu, “Max-margin heterogeneous information ing hundred million narrated video clips,” in Proc. IEEE/CVF Int.
machine for RGB-D action recognition,” Int. J. Comput. Vis., Conf. Comput. Vis., 2019, pp. 2630–2640.
vol. 123, no. 3, pp. 350–371, 2017. [371] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zis-
[348] A. F. Bobick and J. W. Davis, “The recognition of human move- serman, “End-to-end learning of visual representations from
ment using temporal templates,” IEEE Trans. Pattern Anal. Mach. uncurated instructional videos,” in Proc. IEEE Conf. Comput. Vis.
Intell., vol. 23, no. 3, pp. 257–267, Mar. 2001. Pattern Recognit., 2020, pp. 9876–9886.
[349] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. [372] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid,
Woo, “Convolutional LSTM network: A machine learning “VideoBERT: A joint model for video and language representa-
approach for precipitation nowcasting,” in Proc. 27th Int. Conf. tion learning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019,
Neural Informat. Process. Syst., 2015, pp. 802–810. pp. 7463–7472.
[350] J. Liu, A. Shahroudy, G. Wang, L.-Y. Duan, and A. C. Kot, [373] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-
“Skeleton-based online action prediction using scale selection training of deep bidirectional transformers for language under-
network,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 6, standing,” 2018, arXiv:1810.04805.
pp. 1453–1467, Jun. 2020. [374] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101
[351] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, “Real human actions classes from videos in the wild,”
time action recognition using histograms of depth gradients and 2012, arXiv:1212.0402.
random decision forests,” in Proc. IEEE Workshop Appl. Comput. [375] W. Kay et al., “The kinetics human action video dataset,”
Vis., 2014, pp. 626–633. 2017, arXiv:1705.06950.
[352] A. Shahroudy, T.-T. Ng, Q. Yang, and G. Wang, “Multimodal multi- [376] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zis-
part learning for action recognition in depth videos,” IEEE Trans. serman, “A short note about kinetics-600,”
Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2123–2129, Oct. 2016. 2018, arXiv:1808.01340.
[353] J.-F. Hu, W.-S. Zheng, J. Lai, and J. Zhang, “Jointly learning het- [377] J. Carreira, E. Noland, C. Hillier, and A. Zisserman, “A short note
erogeneous features for RGB-D activity recognition,” in Proc. on the kinetics-700 human action dataset,”
IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5344–5352. 2019, arXiv:1907.06987.
[354] E. Triantaphyllou, B. Shu, S. N. Sanchez, and T. Ray, “Multi-crite- [378] D. Damen et al., “Scaling egocentric vision: The epic-kitchens
ria decision making: An operations research approach,” Encyclo- dataset,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 753–771.
pedia Elect. Electron. Eng., vol. 15, no. 1998, pp. 175–186, 1998. [379] A. Gorban et al., “THUMOS challenge: Action recognition with a
[355] Q. Zou, L. Ni, Q. Wang, Q. Li, and S. Wang, “Robust gait recogni- large number of classes,” 2015. [Online]. Available: http://
tion by integrating inertial and RGBD sensors,” IEEE Trans. www.thumos.info/
Cybern., vol. 48, no. 4, pp. 1136–1150, Apr. 2018. [380] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles,
[356] N. E. D. Elmadany, Y. He, and L. Guan, “Multimodal learning “ActivityNet: A large-scale video benchmark for human activity
for human action recognition via bimodal/multimodal hybrid understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
centroid canonical correlation analysis,” IEEE Trans. Multimedia, nit., 2015, pp. 961–970.
vol. 21, no. 5, pp. 1317–1331, May 2019. [381] R. Goyal et al., “The “something something” video database for
[357] N. C. Garcia, P. Morerio, and V. Murino, “Modality distillation learning and evaluating visual common sense,” in Proc. IEEE/
with multiple stream networks for action recognition,” in Proc. CVF Int. Conf. Comput. Vis., 2017, pp. 5843–5851.
Eur. Conf. Comput. Vis., 2018, pp. 106–121. [382] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human
[358] N. C. Garcia, S. A. Bargal, V. Ablavsky, P. Morerio, V. Murino, actions: A local SVM approach,” in Proc. 17th Int. Conf. Pattern
and S. Sclaroff, “DMCL: Distillation multiple choice learning for Recognit., 2004, pp. 32–36.
multimodal action recognition,” 2019, arXiv:1912.10982. [383] D. Weinland, R. Ronfard, and E. Boyer, “Free viewpoint action
[359] N. C. Garcia, P. Morerio, and V. Murino, “Learning with privi- recognition using motion history volumes,” Comput. Vis. Image
leged information via adversarial discriminative modality distil- Understanding, vol. 104, pp. 249–257, 2006.
lation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 10, [384] M. M€ uller, T. R€
oder, M. Clausen, B. Eberhardt, B. Kr€ uger, and A.
pp. 2581–2593, Oct. 2020. Weber, “Documentation mocap database HDM05,” Universit€at
[360] F. M. Thoker and J. Gall, “Cross-modal knowledge distillation Bonn, no. CG-2007-2, Jun. 2007.
for action recognition,” in Proc. IEEE Int. Conf. Image Process., [385] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning
2019, pp. 6–10. realistic human actions from movies,” in Proc. IEEE Conf. Com-
[361] L. Wang, C. Gao, L. Yang, Y. Zhao, W. Zuo, and D. Meng, “PM- put. Vis. Pattern Recognit., 2008, pp. 1–8.
GANs: Discriminative representation learning for action recogni- [386] M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,”
tion using partial-modalities,” in Proc. Eur. Conf. Comput. Vis., in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009,
2018, pp. 389–406. pp. 2929–2936.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.
SUN ET AL.: HUMAN ACTION RECOGNITION FROM VARIOUS DATA MODALITIES: A REVIEW 3225

[387] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, “Modeling temporal [407] M. Moreaux, M. G. Ortiz, I. Ferrane, and F. Lerasle, “Benchmark
structure of decomposable motion segments for activity classi- for Kitchen20, a daily life dataset for audio-based human action
fication,” in Proc. Eur. Conf. Comput. Vis., 2010, pp. 392–405. recognition,” in Proc. Int. Conf. Content-Based Multimedia Indexing,
[388] J. Sung, C. Ponce, B. Selman, and A. Saxena, “Human activity 2019, pp. 1–6.
detection from RGBD images,” in Proc. AAAI Conf. Artif. Intell., [408] M. Monfort et al., “Moments in time dataset: One million videos
2011, pp. 47–55. for event understanding,” IEEE Trans. Pattern Anal. Mach. Intell.,
[389] Z. Cheng, L. Qin, Y. Ye, Q. Huang, and Q. Tian, “Human daily vol. 42, no. 2, pp. 502–508, Feb. 2020.
action analysis with multi-view and color-depth data,” in Proc. [409] F. Wang, J. Feng, Y. Zhao, X. Zhang, S. Zhang, and J. Han, “Joint
Eur. Conf. Comput. Vis., 2012, pp. 52–61. activity recognition and indoor localization with WiFi finger-
[390] Y.-C. Lin, M.-C. Hu, W.-H. Cheng, Y.-H. Hsieh, and H.-M. Chen, prints,” IEEE Access, vol. 7, pp. 80 058–80 068, 2019.
“Human action recognition and retrieval using sole depth [410] D. Damen et al., “Rescaling egocentric vision,” 2020,
information,” in Proc. 20th ACM Int. Conf. Multimedia, 2012, arXiv:2006.13256.
pp. 1053–1056. [411] J. Jang, D. Kim, C. Park, M. Jang, J. Lee, and J. Kim, “ETRI-activi-
[391] L. Xia, C.-C. Chen, and J. K. Aggarwal, “View invariant human ty3D: A large-scale RGB-D dataset for robots to recognize daily
action recognition using histograms of 3D joints,” in Proc. IEEE activities of the elderly,” 2020, arXiv:2003.01920.
Conf. Comput. Vis. Pattern Recognit. Workshops, 2012, pp. 20–27. [412] Y. Ben-Shabat et al., “The IKEA ASM dataset: Understanding
[392] H. S. Koppula, R. Gupta, and A. Saxena, “Learning human activi- people assembling furniture through actions, objects and pose,”
ties and object affordances from RGB-D videos,” Int. J. Robot. in Proc. IEEE Workshop Appl. Comput. Vis., 2020, pp. 846–858.
Res., vol. 32, no. 8, pp. 951–970, 2013. [413] A. Miech, J.-B. Alayrac, I. Laptev, J. Sivic, and A. Zisserman,
[393] M. Munaro, G. Ballin, S. Michieletto, and E. Menegatti, “3D flow esti- “RareAct: A video dataset of unusual interactions,” 2020,
mation for human action recognition from colored point clouds,” arXiv:2008.01018.
Biologically Inspired Cogn. Architectures, vol. 5, pp. 42–51, 2013. [414] A. R. Punnakkal, A. Chandrasekaran, N. Athanasiou, A. Quiros-
[394] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards Ramirez, and M. J. Black, “BABEL: Bodies, action and behavior
understanding action recognition,” in Proc. IEEE/CVF Int. Conf. with English labels,” in Proc. IEEE Conf. Comput. Vis. Pattern Rec-
Comput. Vis., 2013, pp. 3192–3199. ognit., 2021, pp. 722–731.
[395] C. Ellis, S. Z. Masood, M. F. Tappen, J. J. LaViola, and R. Sukthan- [415] N. Rai et al., “Home action genome: Cooperative compositional
kar, “Exploring the trade-off between accuracy and observa- action understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern
tional latency in action recognition,” Int. J. Comput. Vis., vol. 101, Recognit., 2021, pp. 11179–11188.
no. 3, pp. 420–436, 2013. [416] T. Li, J. Liu, W. Zhang, Y. Ni, W. Wang, and Z. Li, “UAV-human:
[396] A.-A. Liu, Y.-T. Su, P.-P. Jia, Z. Gao, T. Hao, and Z.-X. Yang, A large benchmark for human behavior understanding with
“Multiple/single-view human action recognition via part- unmanned aerial vehicles,” in Proc. IEEE Conf. Comput. Vis. Pat-
induced multitask structural learning,” IEEE Trans. Cybern., tern Recognit., 2021, pp. 16261–16270.
vol. 45, no. 6, pp. 1194–1208, Jun. 2015. [417] W. Wang, D. Tran, and M. Feiszli, “What makes training multi-
[397] I. Theodorakopoulos, D. Kastaniotis, G. Economou, and S. Foto- modal classification networks hard?,” in Proc. IEEE Conf. Comput.
poulos, “Pose-based human action recognition via sparse repre- Vis. Pattern Recognit., 2020, pp. 12692–12702.
sentation in dissimilarity space,” J. Vis. Commun. Image [418] T. Li, J. Liu, W. Zhang, and L. Duan, “HARD-Net: Hardness-
Represent., vol. 25, no. 1, pp. 12–23, 2014. AwaRe discrimination network for 3D early activity prediction,”
[398] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, “Action clas- in Proc. Eur. Conf. Comput. Vis., 2020, pp. 420–436.
sification with locality-constrained linear coding,” in Proc. 22nd [419] J. Liu, A. Shahroudy, G. Wang, L.-Y. Duan, and A. C. Kot,
Int. Conf. Pattern Recognit., 2014, pp. 3511–3516. “SSNet: Scale selection network for online 3D action prediction,”
[399] A.-A. Liu, W.-Z. Nie, Y.-T. Su, L. Ma, T. Hao, and Z.-X. Yang, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
“Coupled hidden conditional random fields for RGB-D human pp. 8349–8358.
action recognition,” Signal Process., vol. 112, pp. 74–82, 2015. [420] B. Fernando and S. Herath, “Anticipating human actions by cor-
[400] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and relating past with the future with jaccard similarity measures,”
A. Gupta, “Hollywood in homes: Crowdsourcing data collection in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021,
for activity understanding,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 13219–13228.
pp. 510–526. [421] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a
[401] S. Abu-El-Haija et al., “YouTube-8M: A large-scale video classifi- few examples: A survey on few-shot learning,” ACM Comput.
cation benchmark,” 2016, arXiv:1609.08675. Surv., vol. 53, no. 3, 2020, Art. no. 63.
[402] C. Gu et al., “AVA: A video dataset of spatio-temporally local- [422] T.-T. Xie, C. Tzelepis, F. Fu, and I. Patras, “Few-shot action locali-
ized atomic visual actions,” in Proc. IEEE Conf. Comput. Vis. Pat- zation without knowing boundaries,” in Proc. Int. Conf. Multime-
tern Recognit., 2018, pp. 6047–6056. dia Retrieval, 2021, pp. 339–348.
[403] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang, “Exploiting [423] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and
feature and class relationships in video categorization with regu- S. Lacoste-Julien, “Unsupervised learning from narrated instruc-
larized deep neural networks,” IEEE Trans. Pattern Anal. Mach. tion videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Intell., vol. 40, no. 2, pp. 352–364, Feb. 2018. 2016, pp. 4575–4583.
[404] D. Micucci, M. Mobilio, and P. Napoletano, “UniMiB SHAR: A [424] A. Singh et al., “Semi-supervised action recognition with tempo-
dataset for human activity recognition using acceleration data ral contrastive learning,” in Proc. IEEE Conf. Comput. Vis. Pattern
from smartphones,” Appl. Sci., vol. 7, no. 10, 2017, Art. no. 1101. Recognit., 2021, pp. 10384–10394.
[405] Y. Ji, F. Xu, Y. Yang, F. Shen, H. T. Shen, and W.-S. Zheng, “A [425] X. Song et al., “Spatio-temporal contrastive domain adaptation
large-scale RGB-D database for arbitrary-view human action rec- for action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern
ognition,” in Proc. 26th ACM Int. Conf. Multimedia, 2018, Recognit., 2021, pp. 9782–9790.
pp. 1510–1518.
[406] M. Martin et al., “Drive&Act: A multi-modal dataset for fine-
grained driver behavior recognition in autonomous vehicles,” in " For more information on this or any other computing topic,
Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 2801–2810. please visit our Digital Library at www.computer.org/csdl.

Authorized licensed use limited to: Pontificia Universidad Javeriana. Downloaded on March 28,2023 at 16:53:12 UTC from IEEE Xplore. Restrictions apply.

You might also like