
Continuous Human Action Recognition for Human-Machine Interaction: A Review

Harshala Gammulle, David Ahmedt-Aristizabal, Simon Denman, Lachlan Tychsen-Smith, Lars Petersson, Clinton Fookes

Abstract—With advances in data-driven machine learning research, a wide variety of prediction models have been proposed to capture spatio-temporal features for the analysis of video streams. Recognising actions and detecting action transitions within an input video are challenging but necessary tasks for applications that require real-time human-machine interaction. By reviewing a large body of recent related work in the literature, we thoroughly analyse, explain and compare action segmentation methods and provide details on the feature extraction and learning strategies that are used in most state-of-the-art methods. We cover the impact of the performance of object detection and tracking techniques on human action segmentation methodologies. We investigate the application of such models to real-world scenarios and discuss several limitations and key research directions towards improving interpretability, generalisation, optimisation and deployment.

I. INTRODUCTION

As humans, we perform countless daily actions. The intent behind an action may be to achieve a day-to-day task, to convey an idea, or to communicate non-verbally as part of human-to-human interactions. The performance of these actions plays a crucial part in our lives, and occurs (at least in part) automatically. With advancements in robotics and machine learning, there is an increasing interest in developing robots that can work alongside humans. Such a service robot should understand the surrounding actions performed by its human co-workers. Related applications exist within intelligent surveillance, where it is desirable to recognise human behaviour and identify abnormal patterns. To address such scenarios, an action recognition method that is capable of understanding varied and complex human actions plays a major role.

Human action recognition (HAR) aims to automatically detect and understand actions of interest (e.g. running, walking, etc.) performed by a human subject, using information captured from a camera or other sensors. HAR is a widely investigated problem in the fields of computer vision and machine learning. In the past few decades, action recognition has been investigated using both images and videos. Compared to image-based methods, which learn actions through only spatial (appearance) features, video-based approaches learn temporal patterns in addition to the spatial patterns contained in the video frames. Even though in some instances actions can be recognised through spatial information alone (e.g. "riding bicycle" vs "lifting weight"), many day-to-day actions involve similar behavioural patterns (e.g. "walking" and "running") and can only be separated by exploiting temporal information.

Video-based action recognition research can be broadly grouped into two types: discrete and continuous action recognition (see Fig. 1) [1]. Discrete action recognition uses segmented videos where only a single action is included within the video input. In contrast, continuous action recognition operates over unsegmented videos that contain multiple actions per video. Unlike discrete action recognition, recognising continuous actions involves not only recognising the actions but also detecting action transitions. Therefore, continuous action recognition methods are well aligned with real-world scenarios where actions are continuous and are related to their surrounding actions. Despite the complexity and challenging nature of the continuous action recognition task, a considerable amount of research has been conducted on it. In this review, our main focus is evaluating recent advancements in continuous action recognition methods, and assessing them in real-world settings.

Continuous action recognition methods may also be referred to as action segmentation and action detection methods. Both address the same problem scenario, yet the final output takes different forms. As stated in [2], action segmentation aims to predict the actions that occur at every frame of the video input, while action detection aims to determine a sparse set of action segments where each segment includes the start time, end time and action label. In some research [3], [4], action segmentation/detection is referred to as fine-grained action recognition, to highlight the low variation that exists among the classes that are to be detected [3]. For example, in the MPII cooking dataset, actions such as "grating", "chopping" and "peeling" exhibit fine-grained characteristics.

In addition to the actions of interest, continuous action recognition videos also include frames that contain action transitions. These frames are typically annotated as "background" or "null", and must also be detected. This is further illustrated through an example action sequence in Fig. 2. As such, if a dataset contains N actual actions, N + 1 classes should be detected by the model. However, the background/null examples often have similar characteristics to their surrounding action frames, which makes precise detection of the boundary a challenging task. Regardless of these challenges, to date methods have shown considerable performance, yet there is capacity for methods to be further improved and utilised in real-world settings.

H. Gammulle, S. Denman, and C. Fookes are with the Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT) Lab, Queensland University of Technology, Brisbane, Australia. E-mail: pranali.gammule@qut.edu.au
D. Ahmedt-Aristizabal, L. Tychsen-Smith and L. Petersson are with the Imaging and Computer Vision group, CSIRO Data61, Canberra, Australia.
Fig. 1. Discrete and Continuous Action Recognition Comparison: (a) discrete action recognition maps a segmented video to a single action label; (b) continuous action recognition maps an unsegmented video to a sequence of action labels.

Fig. 2. An Example of a Continuous Action Video Sequence: The frames that include action transitions are often labelled as "background" (BCKG) or "Null".

Earlier work [5], [6] on video-based continuous action recognition has focused on detecting actions/interactions in movies; while in [3] the authors utilised holistic video features based on dense trajectories [7], histograms of oriented gradients (HOG)/histograms of optical flow (HOF) [8], motion boundary histograms (MBH) [9], and articulated pose tracks motivated by [10]. Work has also been completed that incorporates object information [11], [12] or utilises grammar-based approaches [13], [14] to model human actions. However, these methods are based on hand-crafted features, which are highly dependent on the knowledge of the human expert and can lack the ability to generalise.

Therefore, recent efforts have shifted towards deep learning, as such models are able to learn action-specific features automatically. While the data-driven nature of deep learning methods requires large annotated datasets to train the models, they have nonetheless achieved state-of-the-art performance. For instance, in [15] the authors improved upon existing hand-crafted approaches by replacing the hand-crafted feature-based model with a CNN model. However, as previously mentioned, irrespective of the technique used it is essential to capture spatio-temporal features when analysing video data. Multiple video-based discrete action recognition methods [16], [17], [18] have used two-stream models to capture spatial and temporal information. However, with continuous action recognition, input sequences are much lengthier than those used by discrete action methods, and thus require the capture of long-range relationships.

Early works on action detection [12], [19] proposed sliding window approaches, yet these were unable to capture long-range temporal patterns. A two-stream architecture based on bi-directional LSTMs (Bi-LSTM) that operates over short video chunks was introduced in [4]. However, this model is claimed to be time-consuming due to its sequential predictions [20]. Considering the limitations of these early methods, a class of time-series models, called Temporal Convolutional Networks (TCN), was introduced for action segmentation and detection in [2]. This model is able to overcome the shortcomings of previous methods and capture long-range temporal patterns through a hierarchy of temporal convolutional filters. The method in [2] employs an encoder-decoder architecture where the encoder is built using temporal convolutions and pooling, while the decoder is composed of upsampling followed by convolutions. The utilisation of temporal pooling enables the capture of long-range temporal dependencies, although the approach may result in the loss of fine-grained details required for action segmentation [20]. This model is extended in [21] by using deformable convolutions and adding residual connections to the encoder-decoder model. Motivated by these works, [20] proposed a multi-stage TCN architecture which operates over the full temporal resolution, and utilises dilated convolutions to capture long-range temporal patterns. Following these works, a large number of novel models have been introduced for the action segmentation task; we provide further details in Sec. II-C and investigate model capabilities.
TABLE I
COMPARISON OF OUR REVIEW TO OTHER RELATED STUDIES.
Topics compared across [22], [23], [1], [24] and our review: feature extractors (covered by all five); action segmentation; object detection and tracking for action segmentation; discussion of challenges with real-world application; analysis of extensions for real-time response (ours only); methods for handling multiple-person data (ours only).
A. Our Contributions

Although there exist numerous recent survey articles [22], [23], [1], [24] concerning human action recognition and interaction detection, there is no systematic review of human action segmentation methodologies. In particular, to the best of our knowledge, none of the existing literature has investigated the application of state-of-the-art human action segmentation methodologies to real-world scenarios and the extensions that are required to enable this. Such an in-depth analysis would allow readers to compare and contrast the strengths and weaknesses of different deep learning action segmentation models in a real-world context, and select methods accordingly. In this paper, we address the limitations of existing surveys (which are summarised in Table I) and go beyond a simple feature and neural architecture level comparison of existing human action segmentation methods. We empirically compare the performance of four state-of-the-art deep learning architectures with respect to the utilised feature extraction models. In particular, we evaluate their performance with respect to the size of the extracted feature vector, and assess the possibility of using a compressed feature representation to achieve real-time throughput. Furthermore, we analyse different model training strategies that can be used to augment the feature extraction process, and demonstrate the utility of fine-tuning feature extractors on a subset of the training data, rather than merely using pre-trained feature extractors.

Moreover, this paper provides a comprehensive overview of the impact on performance of the hyper-parameters (including the observed sequence length and the feature extraction window size) of human action segmentation methodologies. These hyper-parameters are critical factors when applying such systems in multi-person environments and time-critical applications.

Lastly, our paper details the limitations of existing human action segmentation methods and lists key research directions to encourage more generalisable, fast, and interpretable methodologies.

B. Organisation

In Sec. II we outline the traditional action segmentation pipeline that most state-of-the-art action segmentation methods are built upon, while providing a detailed review of each step of the pipeline. Specifically, in Sec. II-A we discuss in detail the different backbone models that are used for the feature extraction step. In Sec. II-B we provide an overview of the different networks and learning strategies that are used in existing action segmentation methods, while providing a summary of each technique utilised. Sec. II-C provides a detailed description of the action segmentation models that we selected for further experiments. Sec. III introduces the benefits of incorporating detection and tracking techniques to aid action segmentation. Recent advances in object detection and multi-object tracking are systematically discussed in Sec. III-A and Sec. III-B, respectively. In Sec. IV-E, we investigate the application of the action segmentation, detection and tracking methods to real-world applications. Limitations and future directions for research conclude this paper, and are presented in Sec. V.

II. HUMAN ACTION SEGMENTATION

Continuous action recognition methods aim to recognise and localise actions in an unsegmented video stream. Fig. 3 illustrates the general pipeline of state-of-the-art action segmentation models. Generally, the first step is to extract salient semantic features from each frame of the video input. These features are saved in a feature buffer and are passed through the action segmentation model. Finally, the action segmentation model outputs a sequence of action predictions, one per frame of the input video stream.

In the following sections, we provide a more detailed explanation of these steps in the action segmentation pipeline.

A. Feature Extraction

The feature extraction step plays a major role in the action segmentation task, encoding input frames and extracting salient information for the subsequent action recognition step. Ideally, the encoding step should ensure that only relevant information is retained, to help avoid confusion within the action recognition. As such, the choice of feature extractor plays a major role in the overall action segmentation performance. Recently, a popular approach has been to use pre-trained CNNs such as VGG [25], GoogleNet [26], Residual Neural Network (ResNet) [27], EfficientNet-B0 [28], MobileNet-V2 [29] and Inflated 3D ConvNet (I3D) [30] as feature extraction methods. We refer interested readers to Appendix A of the supplementary material, where we discuss the architectures of the ResNet, EfficientNet-B0, MobileNet-V2 and I3D models in detail.
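As an illustration of this first stage of the pipeline (Fig. 3), the following sketch extracts a per-frame feature vector with a pre-trained ResNet-50 and returns the resulting feature buffer. It assumes torchvision (version 0.13 or later for the `weights` argument) with ImageNet weights; the function and variable names are illustrative rather than drawn from any of the surveyed methods.

```python
# Sketch: per-frame feature extraction feeding a feature buffer, as in Fig. 3.
# Frames are assumed to be RGB uint8 arrays of shape (H, W, 3); for long
# videos the frames would be processed in chunks rather than in one batch.
import torch
import torchvision.models as models
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled feature
backbone.eval().to(device)

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frames):
    """Map a list of video frames to a (T, 2048) feature buffer."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    return backbone(batch).cpu()           # one feature vector per frame
```

The resulting buffer of frame-wise vectors is what the action segmentation models discussed below consume as input.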
Fig. 3. Action Segmentation Pipeline: a feature extractor encodes each frame t = 1, ..., N, the features are stored in a feature buffer, and the action segmentation model maps them to per-frame action labels.

B. Networks and Learning Strategies

In this section, we provide a brief theoretical overview of recent advancements in the network architectures and learning strategies used in state-of-the-art action segmentation methods. An in-depth discussion of the state-of-the-art action segmentation models is subsequently provided in Sec. II-C.

1) Temporal Convolutional Networks (TCN): As explained in Sec. I, the action segmentation task involves analysing a sequence of frames to recognise actions as they evolve over time. Therefore, an action segmentation model must be able to capture temporal patterns. Many previous approaches to action segmentation and detection use Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) [31] and Gated Recurrent Units (GRU) [32]. While these can capture temporal patterns, they are unable to attend to the entire sequence as they maintain a fixed-size memory. In contrast, Temporal Convolutional Networks (TCN) have the ability to analyse the entire sequence at once, and have a lower computational cost compared to recurrent models. Therefore, TCNs have been widely used in action segmentation research [2], [20], [33].

A TCN uses a 1D (1-dimensional) fully-convolutional network architecture. Each hidden layer outputs a representation the same length as the input layer, with zero padding of length (kernel size − 1) used to maintain the equal length of the layers. As such, the network can take an input sequence of any length and after processing obtain an output of the same length, aligning well with the action segmentation task.

As illustrated in Fig. 4, in a TCN the 1D convolutions are applied across the temporal axis. The network receives a 3-dimensional tensor of shape [batch size, sequence length, input channels], and outputs a tensor of shape [batch size, sequence length, output channels], obtained through 2-dimensional kernels. Fig. 4 (a) visualises a scenario where input channels = 1, while Fig. 4 (b) illustrates a scenario where the input tensor contains more than one channel (input channels = 2). At each step, the output is calculated by taking the dot product between the subsequence that lies within the sliding window and the learned weights of the kernel vector. To obtain the next output element, the window is shifted to the right by one input element (i.e. stride = 1). This process is repeated across the entire sequence, generating an output sequence that has the same length as the input. Therefore, the prediction at each frame is a function of a fixed-length period of time (i.e. the receptive field), such that a subset of input elements impacts a specific output element. This receptive field varies according to the requirements of the problem.

One effective way to increase the receptive field while maintaining a relatively small number of layers is to utilise dilation. As defined in [34], dilated convolution (also known as convolution with holes) is a convolution operation where the filter is applied over an area greater than the filter length by skipping input elements with a desired step, based on the dilation factor. The dilation rate allows the network to maintain the temporal order of the samples, yet capture long-term dependencies without an explosion in model complexity. A layer with a dilation factor of d and a kernel of size k has a receptive field spreading over a length l, where l can be defined as,

l = 1 + d × (k − 1).   (1)

As shown in Fig. 5, when d = 1 and the kernel size is 3, the operation takes the form of a standard convolutional layer where the input elements chosen to calculate an output element are adjacent and the receptive field is 3 elements. In the case where d = 2, the receptive field expands to spread across 5 elements, and when d = 4 the receptive field spreads across 9 elements. In practice the dilation factor is increased exponentially, resulting in an increase of the receptive field at each layer. For example, in [20], when the kernel size is 3, the receptive field RField_l at layer l (l ∈ [1, L]) is given by the formula,

RField_l = 2^(l+1) − 1.   (2)

A consequence of this very large receptive field, however, is that networks are limited to having only a few layers to avoid over-fitting.

Recent works [35], [20] have shown that by adding residual connections, TCN models can be further improved. In particular, this has benefited deeper networks by facilitating gradient flow. A residual block is composed of a series of transformations whose outputs are added to the input of the block. This allows the layer to learn suitable modifications to the representation, rather than needing to learn a complete transform.
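To make the above concrete, the following sketch implements a single dilated residual layer of this type and verifies the receptive field growth of Eq. (1) and Eq. (2) for a stack whose dilation doubles at each layer. The layer arrangement follows the general description above rather than any specific published implementation.

```python
# Sketch of a dilated residual TCN layer; "same" padding keeps the output
# length equal to the input length, as required for frame-wise prediction.
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, kernel_size,
                                      padding=dilation * (kernel_size - 1) // 2,
                                      dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, 1)
        self.relu = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, time)
        out = self.conv_1x1(self.relu(self.conv_dilated(x)))
        return x + out                    # residual connection

def receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack whose dilation doubles per layer."""
    field = 1
    for l in range(num_layers):
        field += (2 ** l) * (kernel_size - 1)   # per-layer growth, cf. Eq. (1)
    return field

# A 10-layer stack with kernel size 3 sees 2^{11} - 1 = 2047 frames (Eq. (2)).
layers = nn.Sequential(*[DilatedResidualLayer(64, 2 ** l) for l in range(10)])
print(receptive_field(10))   # 2047
```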
Fig. 4. Temporal Convolutional Networks (TCN): (a) TCN operation on an input with a single channel; (b) TCN operation on an input with multiple channels. A kernel of size 3 slides over the input subsequence with stride 1.

Fig. 5. Dilation operations with different dilation factors (d) with kernel size = 3. In the case where d = 1, 3 adjacent input elements are chosen to compute a particular output element, giving a receptive field of size 3. When d = 2 the receptive field is increased to length 5. When d = 4, the receptive field expands to a length of 9.

2) Generative Adversarial Networks (GAN): A standard Generative Adversarial Network (GAN) [36] is able to learn a mapping from a random noise vector z to an output vector y. This is achieved through two networks, the "Generator" (G) and the "Discriminator" (D), which compete in a two-player min-max game. G seeks to learn the input data distribution, while D estimates the authenticity of an input (real/generated). Both models, G and D, are trained simultaneously.

Fig. 6 illustrates the GAN training procedure. Noise sampled from P_z(z) is fed to G. G aims to learn the distribution of the target data, P_data(x), by mapping from the noise space to the target data space. D seeks to output a scalar variable when given an input sample, which may be generated by G (fake) or be a real example. The discriminator is trained to perform this real/fake classification. The GAN objective can be written as,

min_G max_D V(D, G) = E_{x∼P_data(x)}[log D(x)] + E_{z∼P_z(z)}[log(1 − D(G(z)))].   (3)

The GAN learning process is considered to be unsupervised. During learning, D is provided with a supervision signal (real/fake ground truth labels), though G is not provided with any labelled data. However, the primary task that the GAN seeks to address is to learn the data distribution (i.e. the objective of G); therefore, the overall process is considered to be unsupervised.

In [37], the authors extend the GAN framework and introduce the conditional GAN (cGAN). The generator and discriminator outputs of the cGAN are conditioned on additional data, c. The cGAN objective is defined as,

min_G max_D V(D, G) = E_{x∼P_data(x)}[log D(x|c)] + E_{z∼P_z(z)}[log(1 − D(G(z|c)|c))].   (4)
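The following sketch shows one alternating optimisation step for the objective in Eq. (3), using stand-in fully-connected networks for G and D and the common non-saturating generator update; all sizes and hyper-parameters are illustrative assumptions. Conditioning both networks on c, as in Eq. (4), can be implemented by feeding c to G and D alongside their usual inputs.

```python
# Minimal sketch of the alternating updates behind Eq. (3). G and D are
# placeholder MLPs; optimiser settings are illustrative only.
import torch
import torch.nn as nn

noise_dim, data_dim = 16, 32
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()              # D outputs logits

def train_step(real):                     # real: (batch, data_dim) samples
    batch = real.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator: maximise log D(x) + log(1 - D(G(z)))
    fake = G(torch.randn(batch, noise_dim)).detach()
    loss_d = bce(D(real), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: non-saturating surrogate (maximise log D(G(z)))
    loss_g = bce(D(G(torch.randn(batch, noise_dim))), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

train_step(torch.randn(8, data_dim))      # one update on dummy "real" data
```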
Following [38], the cGAN can be considered a general-purpose image-to-image translation method. Through utilising a cGAN framework, a network can learn a mapping from an input image to an output image while learning a loss formulation to perform this mapping. Prior works using image-based conditional models have widely investigated image generation problems including future frame prediction [39], product photo generation [40], and photographic text-to-image generation [41], [42]. Furthermore, multiple methods have adapted the cGAN architecture for action recognition [43], [44], [45] and prediction [46] tasks.

3) Domain Adaptation: Domain adaptation is used to address scenarios in which a model is employed on a target distribution which is different (but related) to the source data distribution on which the model was trained. In general, domain adaptation strategies can be divided into: (i) discrepancy-based domain adaptation; (ii) adversarial-based domain adaptation; and (iii) reconstruction-based domain adaptation [47].

In discrepancy-based methods, a divergence criterion between the source and target domain is minimised such that a domain-invariant feature representation can be obtained. Popular domain discrepancy measures include Maximum Mean Discrepancy (MMD) [48], Correlation Alignment (CORAL) [49], Contrastive Domain Discrepancy (CDD) [50] and the Wasserstein metric [51]. When comparing two inputs using MMD, we first transform the feature maps to a latent space where the similarity between the two feature maps is measured based on the mean embedding of the features. In contrast, CORAL utilises second-order statistics (correlations) between the source and target domains [52]. In CDD, class-level domain discrepancies are considered, where intra-class discrepancies are minimised and the domain discrepancy between different classes is maximised [50]. For the Wasserstein metric, the Wasserstein distance measure is used to evaluate domain discrepancies.
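As an illustration of a discrepancy-based criterion, the sketch below estimates a (biased) kernel MMD between batches of source- and target-domain features. The Gaussian kernel and its bandwidth are assumptions; in practice this term would be added to the task loss and minimised jointly.

```python
# Sketch of a biased RBF-kernel MMD estimate between source and target features.
import torch

def rbf_kernel(a, b, bandwidth=1.0):
    # a: (n, d), b: (m, d) -> (n, m) Gaussian kernel matrix
    sq_dist = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dist / (2 * bandwidth ** 2))

def mmd(source, target, bandwidth=1.0):
    k_ss = rbf_kernel(source, source, bandwidth).mean()
    k_tt = rbf_kernel(target, target, bandwidth).mean()
    k_st = rbf_kernel(source, target, bandwidth).mean()
    return k_ss + k_tt - 2 * k_st        # squared MMD in the kernel's RKHS

src = torch.randn(64, 128)               # e.g. frame features from the source domain
tgt = torch.randn(64, 128) + 0.5         # shifted target-domain features
print(mmd(src, tgt))                     # added to the task loss during adaptation
```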
Fig. 6. Generative Adversarial Networks (GAN): noise z is fed to the generator G, and the discriminator D classifies real samples and generated (fake) samples as real/fake.

Adversarial-based approaches use generative models to minimise domain confusion. In the CoGAN architecture [53], two generator/discriminator pairs are used and some weights are shared such that a domain-invariant feature space can be learned.

Following a similar line of work, in [54] a domain confusion loss is introduced in addition to the model's primary loss (i.e. the classification loss), and the authors seek to make samples from both domains mutually indistinguishable to the classifier.

In contrast, in the reconstruction-based domain adaptation approach an auxiliary reconstruction task is created such that the shared representation between the source and target domains can be learned by the network through jointly training both primary and auxiliary (reconstruction) tasks. An example of this paradigm is [55], where the authors propose the two tasks of classifying the source data and reconstructing the unlabelled target data. Another popular reconstruction-based approach is to employ Cycle GANs [56], where data is translated between domains.

4) Graph Convolution Networks: In a conventional convolution layer, the input is multiplied by a filter or kernel, which is defined by a set of weights. This filter slides across the input representation, generating an activation map. In Graph Convolution Networks (GCNs), the same concept is applied to irregular or non-structured data, rather than the regular Euclidean grids over which CNNs operate. Hence, GCNs can be seen as a generalised form of CNNs.

Traditional CNNs analyse local areas based on fixed connectivity (determined by the convolutional kernel), potentially limiting performance and leading to difficulty in interpreting the structures being modelled, particularly when the relationships being modelled cannot be easily fitted to a regular grid. Graphs, on the other hand, offer more flexibility to analyse unordered data by preserving neighbouring relations. Similar to CNNs, GCNs learn an abstract feature representation for each node via message passing, in which nodes successively aggregate feature vectors from their neighbourhood to compute a new feature vector at the next hidden layer in the network.

GCNs can be broadly categorised into two classes: spectral GCNs and spatial GCNs. In spectral GCN approaches the graph convolution operation is formulated based on graph signal processing, while in spatial GCNs it is formulated as aggregating information from neighbours [57]. The popularity of spatial GCNs has rapidly grown over the years compared to their spectral counterpart, despite the solid mathematical foundation of spectral GCNs. One of the major drawbacks of spectral GCNs is the requirement that the entire graph be processed simultaneously, which is impractical for large graphs with millions of connections. In contrast, in spatial GCNs operations can be performed locally. Furthermore, the assumption of a fixed graph in spectral GCNs leads to poor generalisation [57].

With the network structure and node information as inputs, GNNs can be formulated to perform various graph analytic tasks such as:

• Node-level prediction: A GNN operating at the node-level computes values for each node in the graph and is thus useful for node classification and regression. In node classification, the task is to predict the node label for every node in the graph. To compute node-level predictions, the node embedding is fed to a Multi-Layer Perceptron (MLP).
• Graph-level prediction: GNNs that predict a single value for an entire graph perform a graph-level prediction. This is commonly used to classify entire graphs, or compute similarities between graphs. To compute graph-level predictions, the same node embedding used in node-level prediction is input to a pooling process followed by a separate MLP.

One of the key reasons to apply GCNs to action segmentation is their ability to simultaneously handle spatial and temporal information. In spatio-temporal GCNs, the node and/or edge structure changes over time, allowing them to seamlessly model both spatial and temporal relations in the data. For example, location nodes could simply represent the spatial dependencies of the observations, such as distances between pairs of sensors, while by using recurrent neural networks to model the nodes one can represent the temporal evolution of the observations, generating a spatio-temporal graph.
network. C. Action Segmentation Models
GCNs can be broadly categorised into two classes: spectral In the following subsections we discuss recent action seg-
GCNs and spatial GCNs. In Spectral GCN approaches the mentation models in detail. In this section we focus on end-
graph convolution operation is formulated based on graph to-end models, that receive the raw data as input and perform
signal processing, while in spatial GCNs it is formulated as segmentation. A growing number of methods seek to optimise
aggregating information from neighbours [57]. The popularity either the results or even the structure of these networks, and
of spatial GCNs has rapidly grown over the years compared these are discussed in Sec. II-D.
to its spectral counterpart, despite the solid mathematical 1) Temporal Convolutional Network for Action Segmenta-
foundation of spectral GCNs. One of the major drawbacks tion and Detection [2]: This work can be considered one
of spectral GCNs is the requirement that the entire graph of the first to utilise TCNs for human action segmentation.
TABLE II
PERFORMANCE OF SOTA ACTION SEGMENTATION MODELS. THE RESULTS FOR EPIC-KITCHENS ARE OBTAINED FROM [58].

                               50 Salads [59]                          EPIC-Kitchens [60]
Method               Acc   Edit  F1@0.1 F1@0.25 F1@0.5   Acc   Edit  F1@0.1 F1@0.25 F1@0.5
Bi-LSTM              55.7  55.6  62.6   58.3    47.0     43.3  29.1  19.0   11.7    5.0
ED-TCN [2]           64.7  59.8  68.0   63.9    52.6     42.9  23.7  21.8   13.8    6.5
SS-AGAN [45]         73.3  69.8  74.9   71.7    67.0     -     -     -      -       -
Coupled-AGAN [44]    74.5  76.9  80.1   78.7    71.1     -     -     -      -       -
MS-TCN [20]          80.7  67.9  76.3   74.0    64.5     43.6  25.3  19.4   12.3    5.7
GTRM [58]            -     -     -      -       -        43.4  42.1  31.9   22.8    10.7
DTGRM [61]           80.0  72.0  79.1   75.9    66.1     -     -     -      -       -
Global2Local [62]    82.2  73.4  80.3   78.0    69.8     -     -     -      -       -
SSTDA [63]           83.2  75.8  83.0   81.5    73.8     -     -     -      -       -
MS-TCN + HASR [64]   81.7  77.4  83.4   81.8    71.9     -     -     -      -       -
SSTDA + HASR [64]    83.5  82.1  74.1   77.3    82.7     -     -     -      -       -
MS-TCN++ [65]        83.7  74.3  80.7   78.5    70.1     -     -     -      -       -
BACN [66]            84.4  74.3  82.3   81.3    74.0     -     -     -      -       -
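Table II reports frame-wise accuracy together with the segmental Edit and F1@k scores. As a point of reference, the sketch below computes frame-wise accuracy and an overlap-based F1@k in the spirit of the segmental metrics used with [2]; the treatment of the background class and of tied matches is simplified and left to the caller, so it should be read as an illustration rather than the official evaluation code.

```python
# Sketch of frame-wise accuracy and segmental F1@k over frame-wise label lists.
def segments(labels):
    """Collapse a frame-wise label list into (label, start, end) segments."""
    segs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segs.append((labels[start], start, t))
            start = t
    return segs

def framewise_accuracy(pred, gt):
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)

def f1_at_k(pred, gt, threshold=0.5):
    p_segs, g_segs = segments(pred), segments(gt)
    used, tp = set(), 0
    for label, s, e in p_segs:
        best_iou, best_j = 0.0, None
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != label or j in used:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= threshold:          # matched segment counts as a true positive
            tp += 1
            used.add(best_j)
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0

pred = [0, 0, 1, 1, 1, 2, 2]
gt   = [0, 0, 0, 1, 1, 2, 2]
print(framewise_accuracy(pred, gt), f1_at_k(pred, gt, 0.5))
```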

The authors proposed two TCN based architectures: Encoder-Decoder TCN (ED-TCN) and Dilated TCN. The ED-TCN model has temporal convolution, pooling and upsampling operations. This network is relatively shallow (only 3 layers in the encoder) compared to prior networks that were proposed to segment human actions. However, due to the length of the 1D convolution filters used, the network was able to capture long-term temporal dependencies in the input feature sequences. In the second architecture, Dilated TCN, the authors removed the pooling and upsampling operations and used dilated convolution filters. This allows the Dilated TCN network to capture long-term temporal patterns as well as pairwise transitions between action segments. The authors utilised an off-the-shelf categorical cross entropy loss to train the model, and evaluations were conducted on 3 datasets (50 Salads [67], MERL Shopping [4], and GTEA [67]). The ED-TCN model showed promising performance on all 3 datasets. Specifically, it achieved 64.7% frame-wise action classification accuracy, a 59.8% Edit score and 52.6 F1@50 on the 50 Salads dataset. In their empirical comparisons, the authors demonstrate that this architecture is not only capable of outperforming recurrent architectures such as Bi-LSTM [4], but is also faster to train. For instance, a typical Bidirectional LSTM architecture only achieves 55.7% frame-wise action classification accuracy, a 55.6% Edit score and 47.0 F1@50 on the 50 Salads dataset, despite its greater computational burden.

2) Multi-Stage Temporal Convolutional Network (MS-TCN) [20]: Motivated by the tremendous success of the TCN architecture of [2], Farha et al. proposed a multi-stage TCN model which stacks multiple single-stage TCN blocks. This was proposed to address the limited temporal resolution offered by the TCN convolution operation. MS-TCN addresses this limitation by aggregating temporal information across multiple stages. Fig. 7 illustrates the multi-stage architecture of MS-TCN.

The first layer of the first-stage TCN model of [20] is a 1×1 convolution layer, which is then followed by several dilated 1-dimensional convolutional layers. Inspired by [34], the dilation factor is doubled at each layer (i.e. 1, 2, 4, ..., 512), while using a constant number of convolutional filters at each layer. In order to support gradient flow, residual connections are applied. The MS-TCN model is based only on TCNs; no pooling or fully-connected layers are used. [20] reports that pooling layers reduce the temporal resolution, while the addition of fully-connected layers greatly increases the number of trainable parameters and limits the model to operating on fixed-size inputs. Through this fully-convolutional approach, the model is efficient during both the training and testing phases.

In the subsequent stages of the network, Farha et al. [20] take the predictions from the previous stage and refine them through subsequent TCN models which have an architecture similar to that outlined above. This allows the network to capture dependencies between different action classes and action segments, and refine predictions when appropriate, helping to reduce segmentation errors.

The authors also proposed to augment the model learning process with a combination of a categorical cross entropy loss and a custom smoothing loss which reduces over-segmentation. Specifically, this loss is based on a truncated mean squared error over the log probabilities of the frame-wise predictions, and penalises the model for having predictions oscillate across consecutive frames. The authors tested the proposed algorithm on the 50 Salads, GTEA and Breakfast [68] datasets, where it shows a significant performance gain compared to baseline methods. For instance, this model achieves 80.7% frame-wise classification accuracy, a 67.9% Edit score, and 64.5% F1@50 on the 50 Salads dataset, which is an approximately 15% gain in frame-wise classification accuracy over the ED-TCN model of [2].
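A minimal sketch of the smoothing term described above is given below: a truncated mean squared error over the log-probabilities of consecutive frames, added to the cross entropy loss of each stage. The truncation threshold and loss weighting shown are common choices in public MS-TCN implementations and should be treated as assumptions here.

```python
# Sketch of the truncated MSE smoothing loss combined with cross entropy.
import torch
import torch.nn.functional as F

def smoothing_loss(logits, tau=4.0):
    """logits: (batch, num_classes, time) output of one TCN stage."""
    log_probs = F.log_softmax(logits, dim=1)
    # difference between log-probabilities of consecutive frames
    # (the previous frame is detached, following public implementations)
    delta = torch.abs(log_probs[:, :, 1:] - log_probs[:, :, :-1].detach())
    delta = torch.clamp(delta, max=tau)          # truncate large jumps
    return (delta ** 2).mean()

def stage_loss(logits, targets, lam=0.15):
    """Cross entropy + lambda * truncated MSE, per stage; lambda is assumed."""
    ce = F.cross_entropy(logits, targets)        # targets: (batch, time) class ids
    return ce + lam * smoothing_loss(logits)

logits = torch.randn(2, 10, 200, requires_grad=True)   # 10 classes, 200 frames
targets = torch.randint(0, 10, (2, 200))
stage_loss(logits, targets).backward()
```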
3) Multi-Stage Temporal Convolutional Network - Extended (MS-TCN++) [65]: This model extends the MS-TCN model of [20]. In particular, the authors show that the dilation factor and its use in deep networks may lead to information being lost, as some input samples may receive less consideration due to the skipping of inputs through the dilated convolutions. While the deeper layers have larger temporal receptive fields, the shallow layers have very small receptive fields, which could restrict them from learning informative temporal relationships. To address this, the authors propose expanding the network to use dual dilated convolution layers. Fig. 8 provides an overview of this dual dilated convolution layer.
Fig. 7. Multi-Stage Temporal Convolutional Network (MS-TCN) architecture, recreated from [20]: a feature extractor encodes the video frames, and the resulting features are refined through stages 1 to N, each producing frame-wise predictions.

Fig. 8. Dilated Residual Layers of MS-TCN (a) and MS-TCN++ (b).

In the first convolution the dilation factor is exponentially increased, for instance to 2^l, where l is the layer number. In the second dilation layer, the lower layers start with a higher dilation factor, 2^(L−l), where L is the total number of layers in the network, and this is exponentially decreased with the network depth.

The authors show that the dual dilation strategy in MS-TCN++ enables the capture of both local and global features in all layers, irrespective of the network depth. Therefore, the predictions in all stages of the multi-stage architecture are more accurate and fewer refinements are needed. As such, the authors were able to use a relatively smaller number of stages compared to the original MS-TCN formulation, making the model comparatively more efficient.
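The sketch below illustrates one plausible realisation of such a dual dilated residual layer: two parallel dilated convolutions, with dilation factors 2^l and 2^(L−l), fused by a 1×1 convolution before the residual connection. The fusion order and activation placement are assumptions for illustration rather than the reference MS-TCN++ implementation.

```python
# Sketch of a dual dilated residual layer in the spirit of MS-TCN++.
import torch
import torch.nn as nn

class DualDilatedResidualLayer(nn.Module):
    def __init__(self, channels, layer_idx, num_layers, kernel_size=3):
        super().__init__()
        d1, d2 = 2 ** layer_idx, 2 ** (num_layers - layer_idx)
        pad = lambda d: d * (kernel_size - 1) // 2     # "same" padding
        self.conv_local = nn.Conv1d(channels, channels, kernel_size,
                                    padding=pad(d1), dilation=d1)
        self.conv_global = nn.Conv1d(channels, channels, kernel_size,
                                     padding=pad(d2), dilation=d2)
        self.fuse = nn.Conv1d(2 * channels, channels, 1)
        self.relu = nn.ReLU()

    def forward(self, x):                       # x: (batch, channels, time)
        both = torch.cat([self.conv_local(x), self.conv_global(x)], dim=1)
        return x + self.fuse(self.relu(both))   # residual connection

stack = nn.Sequential(*[DualDilatedResidualLayer(64, l, 10) for l in range(10)])
out = stack(torch.randn(1, 64, 500))            # same temporal length in and out
```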

In their empirical evaluations the authors utilised the 50 Salads, GTEA and Breakfast datasets, where approximately a 3% increase in frame-wise classification accuracy is observed when comparing MS-TCN++ to the MS-TCN architecture on the 50 Salads dataset. Furthermore, MS-TCN++ is capable of outperforming MS-TCN with an approximately 6% gain in both the edit distance and F1@50 score values.

4) Boundary-Aware Cascade Networks [66]: Another extension to the TCN based human action segmentation pipeline is presented in [66], where the authors tackle the problem of inaccurate action segmentation boundaries and misclassified short action segments. The authors show that this misclassification occurs due to deficiencies in the temporal modelling strategies employed by existing state-of-the-art models. Specifically, they identified that ambiguities in action boundaries and ambiguities in frames within long action segments cause these problems, and propose a cascade learning strategy to address this. They learn a series of cascade modules which progressively learn weak-to-strong frame-level classifiers. In the earlier stages, the classifier has weak capacity and learns to recognise actions from less ambiguous and more informative frames. In later stages, more capacity is added to the classifiers, and they pay more attention to ambiguous frames and try to make the predictions for those frames more accurate.

To further augment the learning process, the authors propose a temporal smoothing loss function which further refines the predictions of action boundaries. Unlike the MS-TCN framework, where the loss function is hand-engineered using prior knowledge, the authors propose to exploit information from the action boundaries and use it to maintain the semantic consistency of the frames within the predicted boundary. The proposed Local Barrier Pooling (LBP) module uses a binary classifier which predicts the boundaries of the current segment. A set of video-specific weights is used to aggregate frame-level predictions within the identified boundaries. Hence, the prediction smoothing weights used are video and boundary specific.

This system is evaluated using the 50 Salads, GTEA and Breakfast datasets. As shown in Tab. II, on 50 Salads the model achieves 84.4% frame-wise classification accuracy, a 74.3% edit score and a 74.0% F1@50 score, achieving a notable improvement in the temporal segmentation metrics over MS-TCN.

5) Coupled Action GAN [44]: A GAN based human action recognition framework is proposed in [44], where the generator network of the GAN is trained to generate what is termed an "action code", a unique representation of the action. The discriminator is trained to discriminate between the ground truth action codes and the synthesised action codes. Through this adversarial learning process, the generator learns an embedding space which segregates different action classes. To aid the temporal learning in this architecture, a context extractor is proposed which receives the previously generated action codes and the features from the current RGB frame. Using this information the context extractor generates an embedding to represent the evolution of the current action. This is leveraged by the generator in the action code synthesis process. To further augment this architecture an auxiliary branch is proposed, which receives auxiliary information such as depth or optical flow data. The generator network in the auxiliary branch also synthesises the action code using the auxiliary input, and both generator networks try to synthesise action codes which are indistinguishable from real codes by the discriminator. In the coupled GAN setting, the context extractor also receives the past action codes and auxiliary features from the auxiliary branch.

This framework is tested using the 50 Salads, MERL Shopping and GTEA datasets. It achieves 74.5%, 76.9% and 71.1% for frame-wise classification accuracy, Edit score and F1@50, respectively, on the 50 Salads dataset. Tab. II further compares the results against the other state-of-the-art methods.

6) Semi-Supervised Action GAN [45]: An extension to this architecture is proposed in [45], where the discriminator was extended to jointly classify the action class of the input frame together with performing the real/fake classification. This allows the framework to seamlessly utilise both synthesised unlabelled data and real labelled data, rendering a semi-supervised learning solution.

To further aid the learning, the context extractor is extended to a Gated Context Extractor in [44]. This maintains a fixed-length context queue to which we sequentially push features of the observed frames. Utilising a series of gated operations, which individually evaluate the informativeness of each embedding in the context queue relative to the current frame, we dynamically determine how to combine context information together with the current observation. From this augmented feature vector, the generator network generates the action code.

Similar to [44], we conducted evaluations on the 50 Salads, MERL Shopping and GTEA datasets, and Semi-Supervised Action GAN achieves 73.3%, 69.8% and 67.0% for frame-wise classification accuracy, Edit score and F1@50, respectively, on the 50 Salads dataset. The comparatively lower performance compared to our Coupled Action GAN architecture (see Tab. II) arises from the lack of the auxiliary feature stream, which offers additional valuable information.

D. Action Segmentation Augmentation Methods

In this section we briefly discuss methods that augment those presented in Section II-C. The models discussed here either build upon and refine the results produced by an earlier segmentation approach, or seek to optimise the segmentation architecture.

1) Self-Supervised Temporal Domain Adaptation (SSTDA) [63]: SSTDA [63] is also an extension of the MS-TCN model; however, it tackles the domain discrepancy challenge within the action segmentation task and proposes an augmentation to the MS-TCN model to address this challenge. Specifically, the authors of [63] consider the fact that there are significant spatio-temporal variations among the ways that humans perform the same action, due to personalised preferences, habits and provided instructions.

To overcome this issue the authors utilise two domain predictors based on two self-supervised auxiliary tasks: binary domain prediction (BDP) and sequential domain prediction (SDP). The BDP predicts the domain of each frame while the SDP predicts the domain of each video segment.

Through these auxiliary tasks the framework can learn the similarities and differences between the way that a certain action is performed by different participants, allowing a more comprehensive modelling and understanding of context. The authors combine the auxiliary domain prediction task losses with the frame-wise classification loss generated using the MS-TCN model, and train the complete framework end-to-end.

Similar to both MS-TCN and MS-TCN++, the evaluations were conducted on the 50 Salads, GTEA and Breakfast datasets. On the 50 Salads dataset, a 1.1% frame-wise action classification accuracy increase is observed compared to MS-TCN. Furthermore, substantial 6.9% and 8.6% increases are observed for the Edit and F1@50 score values, respectively. Note that Edit and F1@50 measure the temporal segmentation accuracy, and the critical performance gain of SSTDA in these metrics clearly illustrates the superior temporal learning capabilities of SSTDA, where it has been able to overcome the small-scale differences between different subjects performing the same action.

2) Graph-based Temporal Reasoning Module (GTRM) [58]: The GTRM proposed in [58] is composed of two GCNs, where each node is a representation of an action segment. This module is proposed to build on top of a backbone model, which itself is an existing action segmentation model. When given an output prediction from the chosen backbone, each segment is mapped to a graph node (where a node represents an action segment of arbitrary length), and then passed through the two graph models to refine the classification and temporal boundaries of the nodes. To encourage the temporal relation mapping, the backbone and the proposed GTRM model are jointly optimised.

The model's effectiveness is evaluated on two public datasets, Extended GTEA (EGTEA) [69] and EPIC-Kitchens [60]. Through the experimental results, the authors have shown that by utilising GTRM on top of backbone models such as Bi-LSTM [4], ED-TCN [2] and MS-TCN, the backbone's initial results can be improved. For example, the original MS-TCN model achieves an edit score of 25.3% on the EPIC-Kitchens dataset, while this improves by 7.2% when the MS-TCN is integrated with the GTRM module, achieving an edit score of 32.5%.

3) Dilated Temporal Graph Reasoning Module (DTGRM) [61]: The DTGRM is an action segmentation module that is designed based on relational reasoning and GCNs. The authors propose to overcome the difficulties of GCNs when applied to long video sequences, as a large number of nodes (video frames) makes it hard for a GCN to effectively map temporal relations within the video. The proposed approach models temporal relations and their dependencies between different time spans, and is composed of two main components: multi-level dilated temporal graphs that map the temporal relations; and an auxiliary self-supervised task that encourages the dilated graph reasoning module to perform the temporal relational reasoning task while reducing model over-fitting.

The authors reuse the dilated TCN in the MS-TCN framework (Sec. II-C2) as the backbone model. Similar to the original MS-TCN, the backbone model is fed pre-trained I3D features, and the action class likelihood predictions are obtained through a final softmax function. These predictions are then sent through the proposed DTGRM framework. Inspired by the multi-stage refinement in the MS-TCN architecture, the authors of DTGRM also iteratively refine the predictions through the DTGRM S times before obtaining the final prediction results.

The proposed model is evaluated on the 50 Salads, GTEA and Breakfast datasets. On 50 Salads, this model achieved an 80% frame-wise classification accuracy, a 72% edit score and a 66.1% F1@50 score. Compared to the MS-TCN model, the DTGRM achieved better segmentation performance, with 4.1% and 1.6% improvements in the edit and F1@50 scores respectively.

4) Hierarchical Action Segmentation Refiner (HASR) [64]: A model-agnostic add-on to refine action segmentation predictions is presented in [64], which encodes the context of the observed video in a hierarchical manner and uses this context to refine fine-grained predictions.

Specifically, in this architecture frame-level predictions from the backbone action segmentation model are encoded as a vector which is subsequently used as a query to generate attention weights for the corresponding frame-level features. Then, a video-level representation is generated by an encoder which consists of 2 residual blocks interleaved with a 1D max-pooling operation. Instead of directly passing the generated segment-level embeddings to this encoder module, the authors sample multiple sub-sequences from the segment-level embeddings and pass them through multiple times. The authors demonstrate that this mechanism can compensate for erroneous predictions generated by the backbone action segmentation model when modelling the overall context of the video. The refiner network is a GRU network which accepts the video context and the initial predictions from the action segmentation model, and corrects the erroneous predictions.

The authors have tested the proposed refiner on the 50 Salads, GTEA and Breakfast datasets, where it shows a significant improvement in accuracy over the backbone network. For instance, on the 50 Salads dataset an approximate 7% performance gain is observed for the MS-TCN model in both the edit and F1@50 metrics. This gain is approximately 2% and 3% for the edit and F1@50 scores for the SSTDA architecture.

5) Global2Local: Efficient Structure Search for Video Action Segmentation [62]: Different from the above research, Gao et al. [62] propose a mechanism to locate optimal receptive field combinations for an action segmentation model through a coarse-to-fine search mechanism which is termed a global-to-local search.

In particular, the authors illustrate the deficiencies of existing state-of-the-art architectures, where receptive field parameters such as the dilation rate and pooling size of each layer are hand-defined by evaluating a small number of combinations. However, these parameters are crucial factors which determine the capacity of the network to recover short-term and long-term temporal dependencies.
Fig. 9. Action Segmentation with Object Detection: an object detection and tracking stage crops the person of interest from each frame before feature extraction, feature buffering and action segmentation.

To address this limitation, the authors propose a search algorithm which automatically identifies the best parameters to configure a given network. However, exhaustively evaluating all possible parameters is computationally infeasible; for instance, the possible combinations are in the range of 1024^40 for the MS-TCN architecture. A solution is proposed via a global-to-local search algorithm which narrows the search space using a low-cost genetic-algorithm-based global search operation, after which a more exhaustive fine-grained search is conducted in the narrowed search space using an expectation-guided algorithm.

This receptive field refinement strategy has been evaluated on the 50 Salads and Breakfast datasets (see Tab. II). One of the noteworthy achievements of this architecture is the significant 5% increase in both the Edit and F1@50 segmental scores on the 50 Salads dataset compared to the MS-TCN model, obtained only by refining the receptive field sizes, without modifying the overall network structure.

III. OBJECT DETECTION FOR ACTION SEGMENTATION

Features relating to detected humans can be incorporated into the action segmentation workflow to guide the localisation of salient regions (i.e. regions that contain humans) in the video stream, and promote the selection of discriminative features from regions-of-interest (ROIs). As common applications of human action segmentation (including the target videos within this study) involve human-object interaction, it is often of value to track the subject through time as well.

In a real-world setting, there is typically a large amount of background clutter which contains objects that are unrelated to the action being performed. Furthermore, in some situations multiple people may be observed, yet actions are performed by a single subject. In such situations, object detection and tracking can help identify the person of interest. This pipeline is illustrated in Fig. 9. By incorporating this pre-processing phase, the person of interest is identified throughout the video, which allows the action segmentation model to focus on information that is relevant to the action performed. In particular, the detected person of interest is cropped and features are extracted from a cropped region around the subject, rather than using the full frame as shown in Fig. 3.

A. Object Detection

Object detection, an important yet challenging task in computer vision, aims to discover object instances in an image given a set of predefined object categories. Detecting objects is a difficult problem that requires the solution of two main tasks. First, the detector must handle the recognition problem, distinguishing between foreground and background objects, and assigning them the correct object class labels. Second, the detector must solve the localisation problem, assigning precise bounding boxes to objects. Object detectors have achieved exceptional performance in recent years, thanks to advances in deep convolutional neural networks.

Object detectors can be categorised as anchor-based or anchor-free methods [70]. The core idea of anchor-based models is to introduce a constant set of bounding boxes, referred to as anchors, which can be viewed as a set of pre-defined proposals for bounding box regression. Such anchors are defined by the user before the model is trained. Models typically refine these anchors to produce the final set of bounding boxes that contain the detected objects. Nevertheless, using anchors requires several hyperparameters (e.g. the number of boxes, their sizes and aspect ratios). Even slight changes in these hyperparameters impact the end result, thus selecting an optimal set of anchors is, to an extent, dependent on the experience and skill of the researcher.

Overcoming the limitations imposed by hand-crafted anchors, anchor-free methods offer significant promise to cope with extreme variations in object scales and aspect ratios [71].

Such approaches, for example, can perform object bounding box regression based on anchor points instead of boxes (i.e. the object detection is reformulated as a keypoint localisation problem).

The design of object detectors can also be broadly divided into two types: two-stage and one-stage detectors [70].

• Two-stage detection frameworks use a region proposal network (RPN) to identify regions of interest (ROIs). A second network is then applied to the ROIs to identify the object class and to regress an improved bounding box. Two-stage detectors are often more flexible than their one-stage counterparts, since other per-instance operations can be easily added to the second network to enhance their capability (e.g. instance segmentation and instance tracking). Fig. 10 provides a high level algorithmic overview of the two-stage method.
• One-stage detection frameworks use a single network to perform classification and bounding box regression. Single-stage detectors are often faster to evaluate, however most designs cannot match two-stage detectors for bounding box localisation accuracy. Both families typically reduce their scored candidate boxes to a final set of detections via non-max suppression, as sketched after this list.
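The two-stage pipeline in Fig. 10 ends with a non-max suppression (NMS) clustering step, and one-stage detectors rely on the same post-processing. The following is a minimal sketch of IoU-based greedy NMS; the box format and thresholds are illustrative assumptions rather than the settings of any particular detector.

```python
# Sketch of IoU computation and greedy NMS over (x1, y1, x2, y2) boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box in each cluster of overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(10, 10, 60, 80), (12, 14, 62, 84), (100, 40, 150, 120)]
scores = [0.9, 0.75, 0.8]
print(nms(boxes, scores))    # -> [0, 2]: the second box is suppressed
```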
Variants of models are based on changes to the main components of the standard architecture of an object detector. The structure of two-stage and one-stage object detectors is shown in Fig. 11. The common components that may be changed are detailed as follows:

• Backbone: Backbones are used as feature extractors. As discussed above in Sec. II-A, they are most commonly feed-forward CNNs or residual networks. These networks are pre-trained on image classification datasets (e.g. ImageNet [72]), and then fine-tuned on the detection dataset. In addition to the networks already introduced, other popular backbones include Inception-v3 [73] and Inception-v4 [74], ResNeXt [75] and ResNet-vd [76], SqueezeNet [77], ShuffleNet [78], Darknet-53 [79], and CSPNet [80].
• Neck: These are extra layers that sit between the backbone and the head, and are used to extract neighbouring feature maps from different stages of the backbone. Such features are summed element-wise, or concatenated prior to being fed to the head. Usually, a neck consists of several bottom-up and top-down paths, such that enriched information is fed to the head. In this scheme, the top-down network propagates high-level, large-scale semantic information down to shallow network layers, while the bottom-up network encodes the smaller-scale visual details via deep network layers. Therefore, the head's input will contain spatially rich information from the bottom-up path, and semantically rich information from the top-down path. The neck can, for example, be a feature pyramid network (FPN) [81], a bi-directional FPN (BiFPN) [82], a spatial pyramid pooling (SPP) [83] network, or a path aggregation network (PANet) [84].
• Head: This is the network component in charge of the detection (classification and regression) of bounding boxes. The head can be a dense prediction (one-stage) or a sparse prediction (two-stage) network. Object detectors that are anchor-based apply the head network to each anchor box.

Given the basic object detector design, there are many methods to improve classification accuracy, bounding box estimation and evaluation speed. For the purposes of this review, we categorise these improvements as either bag of freebies (BoF) or bag of specials (BoS):

• Bag of freebies (BoF): Methods that can improve object detector accuracy without increasing the inference cost. These methods only change the training strategy, or only increase the training cost. Examples are data augmentation and regularisation techniques used to avoid over-fitting, such as DropOut, DropConnect and DropBlock.
• Bag of specials (BoS): Modules and post-processing methods that only increase the inference cost by a small amount, but can significantly improve the accuracy of object detection. These modules/methods usually introduce attention mechanisms, including Squeeze-and-Excitation and Spatial Attention Modules, to enlarge the receptive field of the model and enhance feature integration capabilities.

We refer readers to Appendix B-A of the supplementary material for a thorough review of the most relevant object detector architectures from the literature. We organise the object detectors as follows:
1) Anchor-based (Two-stage Frameworks): Faster R-CNN [86], R-FCN [87], Libra R-CNN [88], Mask R-CNN [89], Chained Cascade Network and Cascade R-CNN [90], [91], and TridentNet [92].
2) Anchor-based (One-stage Frameworks): Single Shot MultiBox Detector (SSD) [93], DSSD [94], RetinaNet [95], M2Det [96], EfficientDet [82], and YOLO family detectors (YOLOv3 [79], YOLOv4 [85], and Scaled-YOLOv4 [97]).
3) Anchor-free Frameworks: DeNet [98], CornerNet [99], CornerNet-Lite [100], CenterNet (objects as points) [101], CenterNet (keypoint triplets) [102], FCOS [103], and YOLOX [104].

Among the well-known single-stage object detectors, the You Only Look Once (YOLO) family of detectors (YOLO [105], YOLOv2 [106], YOLOv3 [79], YOLOv3-tiny, YOLOv3-SPP) has demonstrated impressive speed and accuracy. These detectors can run well on low-powered hardware, thanks to their intelligent and conservative model design. In particular, the recent variants YOLOv4 [85] and Scaled-YOLOv4 [97] achieve one of the best trade-offs between speed and accuracy.

YOLOv4 [85] is composed of CSPDarknet53 as the backbone, an additional SPP module, PANet as the neck, and the YOLOv3 head. CSPDarknet53 is a novel backbone that can enhance the learning capability of the CNN by integrating feature maps from the beginning and the end of a network stage. The BoF for the YOLOv4 backbone includes CutMix and Mosaic data augmentation [107], DropBlock regularisation [108], and class label smoothing [73]. The BoS for the same CSPDarknet53 backbone are the Mish activation [109], cross-stage partial connections (CSP) [80], and multi-input weighted residual connections (MiWRC).
Fig. 10. Typical high level algorithm applied by a two-stage detector. Most modern methods implement stage 1 and 2 using neural networks, and estimate
instances via a non-max suppression (NMS) clustering algorithm.
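The caption above notes that the final instances are estimated with a non-max suppression (NMS) clustering step. For readers unfamiliar with this step, a minimal greedy IoU-based NMS sketch is given below; the score ordering and the 0.5 IoU threshold are illustrative assumptions rather than the settings of any particular detector discussed in this review.

```python
# Minimal sketch of greedy non-max suppression (NMS), used to cluster the
# per-ROI predictions of a detector into final object instances.
# Thresholds below are illustrative, not taken from a specific model.
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring box and suppress overlapping ones."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

if __name__ == "__main__":
    boxes = np.array([[10, 10, 60, 60], [12, 12, 58, 62], [100, 100, 150, 160]], dtype=float)
    scores = np.array([0.9, 0.75, 0.8])
    print(nms(boxes, scores))  # -> [0, 2]: the near-duplicate box is suppressed
```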

Fig. 11. Structure of object detectors, highlighting the main components of the Backbone, Neck and Head, and how information flows between these
components. Image recreated from [85].
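To make the backbone-neck-head decomposition of Fig. 11 concrete, the following sketch wires three placeholder modules into a single one-stage detector in PyTorch. The tiny CNN backbone, the FPN-style top-down neck and the dense head are illustrative stand-ins chosen for brevity; real systems substitute the specific backbones, necks and heads listed above.

```python
# Schematic PyTorch sketch of the backbone-neck-head structure in Fig. 11.
# All modules are small illustrative stand-ins, not a published architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Feature extractor returning maps at two scales (strides 8 and 16)."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                                    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                                    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        c3 = self.stage1(x)   # stride 8: shallow but spatially rich
        c4 = self.stage2(c3)  # stride 16: deeper and semantically richer
        return c3, c4

class FPNNeck(nn.Module):
    """Top-down neck: propagate semantic information from deep to shallow maps."""
    def __init__(self, ch3=128, ch4=256, out_ch=128):
        super().__init__()
        self.lat3 = nn.Conv2d(ch3, out_ch, 1)
        self.lat4 = nn.Conv2d(ch4, out_ch, 1)

    def forward(self, c3, c4):
        p4 = self.lat4(c4)
        p3 = self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:], mode="nearest")
        return p3, p4

class DenseHead(nn.Module):
    """One-stage head: per-location class scores and box offsets."""
    def __init__(self, in_ch=128, num_classes=2):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_classes, 3, padding=1)
        self.reg = nn.Conv2d(in_ch, 4, 3, padding=1)

    def forward(self, feats):
        return [(self.cls(f), self.reg(f)) for f in feats]

class OneStageDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone, self.neck, self.head = TinyBackbone(), FPNNeck(), DenseHead()

    def forward(self, x):
        c3, c4 = self.backbone(x)
        return self.head(self.neck(c3, c4))

if __name__ == "__main__":
    for cls_map, reg_map in OneStageDetector()(torch.randn(1, 3, 256, 256)):
        print(cls_map.shape, reg_map.shape)
```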

Finally, the scaling cross stage partial network (Scaled-YOLOv4) [97] achieves one of the best trade-offs between speed and accuracy. In this approach, YOLOv4 is redesigned to form YOLOv4-CSP with a network scaling approach that modifies not only the network depth, width, and resolution, but also the structure of the network. Thus, the backbone is optimised and the neck (PANet) uses CSP and Mish activations.

B. Multi-object Tracking
Given the location of an arbitrary target of interest in the first frame of a video, the aim of visual object tracking is to estimate its position in all the subsequent frames. The ability to perform reliable and effective object tracking depends on how a tracker can deal with challenges such as occlusion, scale variations, low resolution targets, fast motion and the presence of noise. Visual object tracking algorithms can be categorised into single-object and multiple-object trackers (MOT), where the latter is the scope of this manuscript.
Due to recent progress in object detection, tracking-by-detection has become the leading paradigm for multiple object tracking. This method is composed of two discrete components: object detection and data association. Detection aims to locate potential targets-of-interest from video frames, which are then used to guide the tracking process. The association component uses geometric and visual information to allocate these detections to new or existing object trajectories (also known as tracklets), e.g., the re-identification (ReID) task [110]. We refer to a tracklet as a set of linked regions defined over consecutive frames [111].
The core algorithm in multi-object tracking is this data association method. In many implementations, this method computes the similarity between detections and existing tracklets, identifies which detections should be matched with existing tracklets, and finally creates new tracklets where appropriate.
The similarity function computes a score between two object instances, I0 and I1, observed at differing times, and indicates the likelihood that they are the same object instance. The similarity function is typically implemented via a combination of geometric methods (e.g. motion prediction or bounding box overlap) and visual appearance methods (embeddings).
Despite the wide range of methodologies described in the literature, the great majority of MOT algorithms include some or all of the following steps: 1) Detection stage; 2) Feature extraction/motion prediction stage (appearance, motion or interaction features); 3) Affinity stage (similarity/distance score calculation); and 4) Association stage (associate detections and tracklets). MOT approaches can be divided, with respect to their complexity, into separate detection and embedding (SDE) methods, and joint detection and embedding (JDE) algorithms.
• SDE methods completely separate the stages of detection and embedding extraction. Such a design allows the system to adapt to various detectors with fewer changes, and the two components can be tuned separately (e.g. Simple online and realtime tracking (SORT) [112], DeepSORT [113], ByteTrack [114]).
• JDE methods learn to detect objects and extract embeddings at the same time via a shared neural network, and use multi-task learning to train the network. This design takes into account both accuracy and speed, and can achieve high-precision real-time multi-target tracking (e.g. Tracktor [115], CenterTrack [116], FairMOT [117], SiamMOT [118]).

We refer the readers to Appendix B-B of the supplementary material, where we introduce these SDE and JDE frameworks in detail. Of these methods, ByteTrack [114] is a simple and effective tracking-by-association method for real-time applications, and makes the best use of detection results to enhance multi-object tracking. ByteTrack keeps all detection boxes (detected by YOLOX) and makes associations across all boxes instead of only considering high-scoring boxes, to reduce missed detections. In the matching process, an algorithm called BYTE first predicts the tracklets using a Kalman filter, which are then matched with high-scoring detected bounding boxes using motion similarity. Next, the algorithm performs a second matching between the detected bounding boxes with lower confidence values and the objects in the tracklets that could not be matched.
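To make the association step concrete, the sketch below scores tracklet-detection pairs with a plain IoU similarity and then performs a BYTE-style two-round matching: high-confidence detections are matched first, and the remaining tracklets get a second chance against the low-confidence detections. The greedy matching and the confidence/IoU thresholds are simplifying assumptions; ByteTrack itself combines Kalman-filter motion prediction with Hungarian assignment.

```python
# Sketch of a BYTE-style two-round association over axis-aligned boxes.
# Greedy matching and the thresholds are simplifications for illustration.
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between two sets of (x1, y1, x2, y2) boxes."""
    if len(a) == 0 or len(b) == 0:
        return np.zeros((len(a), len(b)))
    x1 = np.maximum(a[:, None, 0], b[None, :, 0]); y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2]); y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def greedy_match(sim, thresh):
    """Greedily pair rows (tracklets) with columns (detections) above a threshold."""
    matches, used_r, used_c = [], set(), set()
    for r, c in sorted(np.ndindex(*sim.shape), key=lambda rc: -sim[rc]):
        if sim[r, c] >= thresh and r not in used_r and c not in used_c:
            matches.append((r, c)); used_r.add(r); used_c.add(c)
    return matches, used_r

def byte_associate(track_boxes, det_boxes, det_scores, high=0.6, iou_thr=0.3):
    """Round 1: match tracklets to high-score detections. Round 2: give the
    still-unmatched tracklets a second chance against low-score detections."""
    hi_idx, lo_idx = np.where(det_scores >= high)[0], np.where(det_scores < high)[0]
    m1, used_t = greedy_match(iou_matrix(track_boxes, det_boxes[hi_idx]), iou_thr)
    matches = [(t, int(hi_idx[d])) for t, d in m1]
    remaining = np.array([t for t in range(len(track_boxes)) if t not in used_t], dtype=int)
    m2, _ = greedy_match(iou_matrix(track_boxes[remaining], det_boxes[lo_idx]), iou_thr)
    matches += [(int(remaining[t]), int(lo_idx[d])) for t, d in m2]
    return matches  # list of (tracklet index, detection index) pairs

if __name__ == "__main__":
    tracks = np.array([[10, 10, 50, 80], [200, 40, 240, 110]], dtype=float)
    dets = np.array([[12, 12, 52, 82], [198, 42, 238, 112]], dtype=float)
    scores = np.array([0.9, 0.3])  # the second detection is low confidence
    print(byte_associate(tracks, dets, scores))  # -> [(0, 0), (1, 1)]
```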
IV. EXPERIMENTS

A. Dataset
Widely used datasets in the current literature include Breakfast [68], 50Salads [59], the MPII cooking activities dataset [3], the MPII cooking 2 dataset [119], EPIC-KITCHENS-100 [60], GTEA [67], ActivityNet [120], THUMOS15 [121], the Toyota Smart-home Untrimmed dataset [122], and FineGym [123]. We refer the readers to Appendix C of the supplementary material for more details on these publicly available datasets.
For our experiments, we use the MPII Cooking 2 fine-grained action dataset [119], considering its popularity and challenging nature due to the unstructured manner in which the actions evolve over time. Note that the egocentric view of the EPIC-KITCHENS-100 [60] and GTEA [67] datasets means that the human subject cannot be detected and segmented. As our evaluation explicitly considers the role that object detection and tracking can play in supporting the action segmentation task, we cannot utilise these datasets for our evaluation, despite their recent popularity. Furthermore, due to the lack of available public multi-person action segmentation datasets, we cannot directly evaluate the state-of-the-art methods in a multi-person setting. However, we believe that our evaluations illustrate a general purpose pipeline for the readers to use in a real-world setting where there are multiple people concurrently conducting temporally evolving actions.
The MPII Cooking 2 dataset contains 273 videos with a total length of more than 27 hours, captured from 30 subjects preparing a certain dish, and comprising 67 fine-grained actions. Similar to the MPII cooking activities dataset, this dataset is captured from a single camera at a 1624 × 1224 pixel resolution, with a frame rate of 30 fps. The camera is mounted on the ceiling and captures the front view of a person working at the counter. The duration of the videos ranges from 1 to 41 minutes. As per the MPII cooking activities dataset, this dataset offers different subject-specific patterns and behaviours for the same dish, as no instructions regarding how to prepare a certain dish are provided to the participants. In addition to video data, the dataset authors provide human pose annotations, their trajectories, and text-based video script descriptions. Example frames of the dataset are shown in Fig. 12.

B. Evaluation Metrics
Our evaluations utilise both frame-wise and segmentation metrics. As the frame-wise metric we utilise the frame-wise accuracy, which is a widely used metric for action recognition and segmentation tasks [18], [16], [20]. However, as stated in [20], such frame-wise metrics are unable to properly detect over-segmentation errors, as such errors have a low impact on the frame-wise accuracy. Furthermore, longer duration action classes have a higher impact on the frame-wise metric. Therefore, it is essential to utilise segmentation metrics to fully evaluate action segmentation/detection performance. In our experiments, we use the following segmentation metrics in addition to frame-wise accuracy.
• Segmental edit score (edit) is calculated through the normalised edit distance S_e between each ground truth (G) and predicted (P) label sequence, computed using the Wagner-Fischer algorithm. The score is then calculated as follows,
    (1 - S_e(G, P)) × 100,    (5)
where the best score is 100 while the worst is 0.
• Segmental F1 score (F1@k) is calculated by first computing whether each predicted action segment is a true positive (TP) or false positive (FP), by comparing its IoU with respect to the corresponding ground truth against the threshold k. Then, the segmental F1 score is obtained through precision (P) and recall (R) calculations as follows,
    F1 = 2 × (P × R) / (P + R),    (6)
where,
    P = TP / (TP + FP),  R = TP / (TP + FN).    (7)
Fig. 12. Single frames from the MPII Cooking 2 fine-grained action [119] dataset that illustrate the full scene of the environment and several fine-grained
cooking activities with different participants. Samples of those activities include take out, dicing, peel, cut, squeeze, spread, wash, and throw in garbage.
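The segmentation metrics defined above can be computed directly from per-frame label sequences. The sketch below is a simplified implementation under those definitions: segments are obtained by run-length encoding, the edit score uses a Wagner-Fischer (Levenshtein) distance normalised by the longer segment sequence, and F1@k counts a predicted segment as a true positive when its temporal IoU with an unmatched ground-truth segment of the same class reaches the threshold k. Details such as background-class handling differ between published implementations.

```python
# Simplified sketch of the segmental metrics defined above, computed from
# per-frame label sequences. Background handling and tie-breaking details
# vary between published implementations; this follows the definitions only.
import numpy as np

def segments(labels):
    """Run-length encode a per-frame label sequence into (label, start, end) segments."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i)); start = i
    return segs

def edit_score(gt, pred):
    """Segmental edit score: 100 * (1 - normalised Levenshtein distance)."""
    a, b = [s[0] for s in segments(gt)], [s[0] for s in segments(pred)]
    d = np.zeros((len(a) + 1, len(b) + 1))
    d[:, 0], d[0, :] = np.arange(len(a) + 1), np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return (1 - d[-1, -1] / max(len(a), len(b), 1)) * 100

def f1_at_k(gt, pred, k=0.5):
    """Segmental F1@k with temporal IoU threshold k on same-class segments."""
    gt_segs, pr_segs = segments(gt), segments(pred)
    used, tp = set(), 0
    for lbl, s, e in pr_segs:
        ious = [max(0, min(e, ge) - max(s, gs)) / (max(e, ge) - min(s, gs)) if glbl == lbl else 0.0
                for glbl, gs, ge in gt_segs]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= k and best not in used:
            tp += 1; used.add(best)
    fp, fn = len(pr_segs) - tp, len(gt_segs) - tp
    precision, recall = tp / (tp + fp + 1e-9), tp / (tp + fn + 1e-9)
    return 2 * precision * recall / (precision + recall + 1e-9)

if __name__ == "__main__":
    gt   = ["cut"] * 50 + ["wash"] * 30 + ["peel"] * 20
    pred = ["cut"] * 45 + ["wash"] * 40 + ["peel"] * 15
    print(round(edit_score(gt, pred), 2), round(f1_at_k(gt, pred, 0.5), 3))
```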

C. Implementation Details
For the selected action segmentation models we follow the settings outlined in the original works, except we use different feature extraction backbones. More information on the selection of the feature extraction models is provided in Sec. IV-D. More details on the feature extraction layers and the feature dimensions are provided in Table III.

TABLE III
FEATURE EXTRACTION BACKBONES USED IN THE EVALUATIONS.

Backbone        | Implementation   | Layer                             | Feat dim
ResNet50        | Keras-Tensorflow | avg pool (GlobalAveragePooling2D) | 2048
EfficientNet-B0 | PyTorch          | AdaptiveAvgPool2d-277             | 1280
MobileNet-V2    | PyTorch          | AdaptiveAvgPool2d-157             | 1280

D. Initial Evaluations
For our initial experiment, we evaluate the selected action segmentation and detection models on the MPII Cooking 2 dataset. We select the Multi-Stage Temporal Convolutional Network (MS-TCN) [20], Multi-Stage Temporal Convolutional Network - Extended (MS-TCN++) [65], Self-Supervised Temporal Domain Adaptation (SSTDA) [63] and Dilated Temporal Graph Reasoning Module (DTGRM) [61] models for evaluation. Models are selected considering both their performance and efficiency. In practical applications, such as in mobile-robotics, computing resources are limited and as such we have to consider both performance and efficiency when selecting the models.
We use three different feature extractors: ResNet-50, EfficientNet-B0 and MobileNet-V2. More details regarding these backbone models are provided in Sec. II-A. Here, we select the ResNet-50 backbone model instead of the I3D model (see Sec. II-A) that is widely used in action segmentation methods [20], [63], as the I3D model requires additional calculation of the optical flow inputs. Keeping the ResNet-50 as our first backbone, we consider two light-weight models, EfficientNet-B0 and MobileNet-V2, as these are of more relevance given the final goal of adapting the models to a real-world application. We only make use of the action labels during training, and additional information such as the object annotations that are available with the dataset is not utilised. In Table IV we report the experimental results when using the original full frames as the model input. In addition, as described in Sec. III, selection of individual persons in the scene and recognising their individual actions has to be carried out when there are multiple people in the scene. For instance, consider a human-robot interaction scenario as our target application. In this scenario, the robot must identify the actions performed by the human in order to react appropriately. In a real-world environment where there are multiple humans, prior to action recognition the system must detect the human, generating a single bounding box for each person detected. Then, an object tracking technique is integrated to estimate human position in subsequent frames, and to maintain the identity of the subject. Therefore, in our initial evaluations presented in Table IV we also evaluate how the state-of-the-art models perform with the image cropped using a bounding box.
Among the object detectors, we consider single-stage methods as they are better suited to practical applications due to their faster inference. We seek to obtain an optimal balance between inference time and accuracy, noting that there is a trade-off between these, by evaluating different components such as the backbone, neck and head of the object detection architecture. From the models discussed earlier, we adopt the Scaled-YOLOv4 model available in [124], which achieves competitive accuracy while maintaining a high processing frame rate (inference speed greater than 30 FPS).
To alleviate the high computational demand and achieve real-time object tracking, we use ByteTrack [125], which is a comparatively simple tracking method that can handle occlusion and view-point changes. Both of these approaches also allow use for commercial purposes.
The next step is to select a backbone that is light-weight and offers good performance for the action segmentation task. As stated in the previous section (Sec. IV-C), we consider EfficientNet-B0, MobileNet-V2 and ResNet50 as backbones. Action segmentation results for the MPII Cooking 2 dataset with these backbones applied to the four models, MS-TCN (Sec. II-C2), MS-TCN++ (Sec. II-C3), SSTDA (Sec. II-D1) and DTGRM (Sec. II-D3), are reported in Table IV.
Overall, the DTGRM model with EfficientNet features shows a high level of performance, achieving the best accuracy and F1 scores, although the edit score is 1.91% lower than the results obtained through SSTDA with a MobileNet backbone. Considering the performance across feature extraction backbones, no single backbone is the most effective across all action segmentation models. However, we observe that MS-TCN, MS-TCN++ and DTGRM all obtained better results when using the EfficientNet backbone, while SSTDA achieves the best results with MobileNet features. We also note that in some instances, the cropped features (or Bbox) have a positive impact on the results. For example, MS-TCN using ResNet50 features has improved by 3.27% by utilising cropped frame features. However, for the most part, the cropped-frame features tend to slightly degrade the overall results.
Considering this degradation of results, we carried out further experiments by first fine-tuning the feature extraction model with randomly selected cooking data from the Cooking 2 dataset. The fine-tuning for full-frame and cropped frames has been performed separately to align with the overall experiment. Furthermore, we also experimented with expanding the detected bounding box by 10% before cropping the frames, to capture relevant context information in the images. We report the results in Table V. As shown in Table V, the fine-tuning has significantly improved the results for all the experiments. In some cases, applying 10% bounding box padding has slightly improved the accuracy. For example, through 10% padding, the models MS-TCN and DTGRM have achieved a 0.44% and 0.12% accuracy gain respectively. However, when considering the edit score and F1 scores, applying 10% bounding box padding has resulted in improvements for all the action segmentation models.

E. Adapting to Real-World Applications
The action segmentation methods presented in Table IV are challenging to directly adapt to real-world applications. One primary challenge is due to the temporal modelling structure that these models employ, which requires us to provide a sequence of frames instead of a single frame. Specifically, these methods operate over frame buffers which are sequentially filled with the observed frames, and thus, depending on how the buffering is performed and the size of the buffer, there may be delays between when actions are first observed and when the predictions made by the model are reported. In this section, we investigate and propose a pipeline to apply an action segmentation method to a real-world task, where the current action should be detected with low latency.
In a real-world setting where the action is temporally evolving, it is important to be able to accurately estimate the current action. Action segmentation pipelines use a buffer of frames to provide spatio-temporal context and improve segmentation accuracy. To enable low-latency action segmentation, we use a first-in-first-out (FIFO) buffer, which allows the action segmentation models to observe the temporal progression of the action, and recognise new actions/behaviours as they occur. The FIFO buffer is updated by removing the oldest feature vector while the feature vector of the latest frame is pushed into the end of the buffer. This is illustrated in Fig. 13. When the feature buffer is updated, the buffer is passed through the action segmentation model to generate the action predictions.
The buffer length has a crucial impact on the accuracy and the throughput of the action segmentation model. If the buffer length is large, the action segmentation model has a larger temporal receptive field; however, when processing the buffer regularly this leads to redundant predictions, as the model repetitively predicts a large number of frames. Furthermore, when the buffer length is b frames, there exists a b frame delay between the observation of the first frame and the first prediction made by the action segmentation model, as the buffer needs to be filled in order to invoke the action segmentation model.
In a real-world system our interest is the current action (i.e. the action observed in the most recent frames). For example, in Fig. 13 the predictions at t = i + L and t = i + L + 1 are considered to be the current action prediction.
As illustrated above, buffer size plays a crucial role in this real-world setup. We conduct experiments with different buffer lengths to investigate the optimal attributes that are required to achieve accurate action recognition. We report the corresponding results in Table VI, evaluating the models with buffer lengths of 100, 250, 500, 750 and 1000, and we compare the results with those obtained on the original video sequences (by maintaining a variable length buffer). We use the fine-tuned feature extractor backbone with 10% bounding box padding applied. We observe a slight drop in results, especially in the accuracies and the edit scores, when the buffer length is reduced from the original sequence length to a buffer length of 1000. However, DTGRM achieves a higher F1 segmental score when the buffer length is 1000, while for the other models the highest results are obtained by maintaining the original sequence lengths. Overall, when the buffer lengths are reduced the action segmentation results drop. However, in some instances, the models achieved slightly higher results with shorter buffer lengths than with longer ones. For example, MS-TCN++ achieves better accuracy and edit score values when the buffer lengths are 500 and 250 than the corresponding results on 750 and 1000 length buffers. Similarly, for DTGRM, maintaining a 750 length buffer achieved higher accuracy and edit score values compared to maintaining a 1000 length buffer. Even though these scenarios with longer buffer lengths are still able to produce higher F1 segmental scores compared to the results where the buffer length is 1000, we noticed a slight reduction in F1 scores on the DTGRM results when the original lengths are maintained.
TABLE IV
EVALUATION RESULTS ON THE MPII COOKING 2 DATASET: THE EVALUATIONS ARE CARRIED OUT BASED ON DIFFERENT FEATURE EXTRACTION BACKBONES (RESNET50, EFFICIENTNET-B0 AND MOBILENET-V2). PERFORMANCE USING FULL-FRAME FEATURES (FULL) AND CROPPED FRAME FEATURES BASED ON OBJECT DETECTION BOUNDING BOXES (BBOX) ARE GIVEN.

Method   | Backbone        | Full/Bbox | Acc   | Edit  | F1@0.10 | F1@0.25 | F1@0.50
MS-TCN   | ResNet50        | Full      | 39.98 | 25.78 | 24.70   | 22.18   | 15.49
MS-TCN   | ResNet50        | Bbox      | 40.48 | 24.71 | 22.34   | 20.14   | 13.72
MS-TCN   | EfficientNet-B0 | Full      | 44.43 | 31.14 | 27.29   | 25.47   | 18.88
MS-TCN   | EfficientNet-B0 | Bbox      | 40.37 | 30.06 | 25.86   | 24.38   | 16.78
MS-TCN   | MobileNet-V2    | Full      | 40.55 | 29.51 | 27.82   | 25.14   | 18.40
MS-TCN   | MobileNet-V2    | Bbox      | 37.13 | 24.06 | 19.77   | 18.11   | 11.28
MS-TCN++ | ResNet50        | Full      | 41.47 | 25.45 | 23.92   | 21.32   | 13.97
MS-TCN++ | ResNet50        | Bbox      | 39.77 | 27.90 | 23.80   | 21.16   | 14.61
MS-TCN++ | EfficientNet-B0 | Full      | 44.40 | 29.64 | 27.42   | 25.49   | 19.42
MS-TCN++ | EfficientNet-B0 | Bbox      | 41.32 | 28.60 | 25.11   | 23.25   | 17.59
MS-TCN++ | MobileNet-V2    | Full      | 41.61 | 27.91 | 24.96   | 22.45   | 16.07
MS-TCN++ | MobileNet-V2    | Bbox      | 37.71 | 23.25 | 19.56   | 17.14   | 10.56
SSTDA    | ResNet50        | Full      | 32.77 | 17.80 | 14.37   | 12.66   | 10.65
SSTDA    | ResNet50        | Bbox      | 31.78 | 19.50 | 12.26   | 10.88   | 8.55
SSTDA    | EfficientNet-B0 | Full      | 36.09 | 25.71 | 14.98   | 13.90   | 10.99
SSTDA    | EfficientNet-B0 | Bbox      | 32.76 | 19.57 | 14.10   | 10.07   | 9.84
SSTDA    | MobileNet-V2    | Full      | 42.50 | 32.18 | 26.81   | 24.52   | 17.62
SSTDA    | MobileNet-V2    | Bbox      | 35.50 | 23.89 | 20.33   | 17.84   | 10.04
DTGRM    | ResNet50        | Full      | 40.71 | 28.60 | 24.70   | 22.09   | 15.46
DTGRM    | ResNet50        | Bbox      | 40.17 | 25.96 | 24.05   | 21.88   | 15.22
DTGRM    | EfficientNet-B0 | Full      | 45.43 | 30.27 | 28.04   | 26.17   | 19.97
DTGRM    | EfficientNet-B0 | Bbox      | 42.84 | 27.09 | 25.49   | 23.18   | 17.73
DTGRM    | MobileNet-V2    | Full      | 41.96 | 29.36 | 25.80   | 24.00   | 18.02
DTGRM    | MobileNet-V2    | Bbox      | 37.57 | 24.58 | 20.42   | 18.97   | 11.64

TABLE V
EVALUATION RESULTS ON THE MPII COOKING 2 DATASET USING THE FINE-TUNED MOBILENET-V2 FEATURE EXTRACTION MODEL. FINE-TUNING IS PERFORMED CONSIDERING FULL-FRAME (FULL) AND CROPPED FRAME FEATURES BASED ON OBJECT DETECTIONS (BBOX), AND ALSO BY APPLYING 10% BOUNDING BOX PADDING. WE COMPARE THE RESULTS WITH THE INITIAL RESULTS BASED ON THE PRE-TRAINED MOBILENET-V2 FEATURE EXTRACTOR (TRAINED ON IMAGENET DATA, WITH NO FINE-TUNING PERFORMED).

Method   | Full/Bbox | FE               | Acc   | Edit  | F1@0.10 | F1@0.25 | F1@0.50
MS-TCN   | Full      | pre-trained      | 40.55 | 29.51 | 27.82   | 25.14   | 18.40
MS-TCN   | Full      | fine-tuned       | 43.82 | 27.65 | 25.13   | 23.28   | 17.37
MS-TCN   | Bbox      | pre-trained      | 37.13 | 24.06 | 19.77   | 18.11   | 11.28
MS-TCN   | Bbox      | fine-tuned       | 41.23 | 25.69 | 23.07   | 20.66   | 14.26
MS-TCN   | Bbox      | fine-tuned (10%) | 41.67 | 27.38 | 24.67   | 22.73   | 16.77
MS-TCN++ | Full      | pre-trained      | 41.61 | 27.91 | 24.96   | 22.45   | 16.07
MS-TCN++ | Full      | fine-tuned       | 43.11 | 29.94 | 27.39   | 25.46   | 18.67
MS-TCN++ | Bbox      | pre-trained      | 37.71 | 23.25 | 19.56   | 17.14   | 10.56
MS-TCN++ | Bbox      | fine-tuned       | 41.08 | 26.12 | 22.82   | 20.10   | 13.00
MS-TCN++ | Bbox      | fine-tuned (10%) | 41.08 | 28.24 | 25.01   | 23.17   | 16.78
SSTDA    | Full      | pre-trained      | 42.50 | 32.18 | 26.81   | 24.52   | 17.62
SSTDA    | Full      | fine-tuned       | 43.22 | 33.31 | 29.44   | 26.73   | 19.20
SSTDA    | Bbox      | pre-trained      | 35.50 | 23.89 | 20.33   | 17.84   | 10.04
SSTDA    | Bbox      | fine-tuned       | 38.78 | 24.26 | 21.36   | 18.66   | 10.89
SSTDA    | Bbox      | fine-tuned (10%) | 38.08 | 24.83 | 21.76   | 18.93   | 10.12
DTGRM    | Full      | pre-trained      | 41.96 | 29.36 | 25.80   | 24.00   | 18.02
DTGRM    | Full      | fine-tuned       | 41.98 | 28.16 | 26.09   | 24.11   | 18.29
DTGRM    | Bbox      | pre-trained      | 37.57 | 24.58 | 20.42   | 18.97   | 11.64
DTGRM    | Bbox      | fine-tuned       | 42.62 | 23.57 | 17.39   | 15.78   | 11.40
DTGRM    | Bbox      | fine-tuned (10%) | 42.74 | 23.79 | 18.69   | 17.18   | 12.23
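The "fine-tuned (10%)" rows in Table V correspond to features extracted from person crops whose detected bounding box is enlarged by 10% before cropping, so that some surrounding context is retained. A minimal sketch of this step is shown below; the MobileNet-V2 global pooling mirrors Table III, while the function names, the resize resolution and the omission of ImageNet normalisation are simplifications for illustration.

```python
# Sketch of cropping a detected person with 10% bounding-box padding before
# feature extraction. The backbone below is randomly initialised; the
# experiments use an ImageNet pre-trained and then fine-tuned network.
import numpy as np
import torch
import torchvision

def padded_crop(frame, box, pad_ratio=0.10):
    """Crop (x1, y1, x2, y2) from an HxWx3 frame after enlarging the box by pad_ratio."""
    h, w = frame.shape[:2]
    x1, y1, x2, y2 = box
    pad_x, pad_y = pad_ratio * (x2 - x1), pad_ratio * (y2 - y1)
    x1, x2 = max(0, int(x1 - pad_x)), min(w, int(x2 + pad_x))
    y1, y2 = max(0, int(y1 - pad_y)), min(h, int(y2 + pad_y))
    return frame[y1:y2, x1:x2]

# MobileNet-V2 convolutional trunk with global average pooling -> 1280-d feature.
backbone = torchvision.models.mobilenet_v2().features.eval()

def frame_feature(crop):
    x = torch.from_numpy(crop).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    x = torch.nn.functional.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    with torch.no_grad():
        fmap = backbone(x)                   # (1, 1280, 7, 7)
    return fmap.mean(dim=(2, 3)).squeeze(0)  # (1280,)

if __name__ == "__main__":
    frame = np.random.randint(0, 255, (1224, 1624, 3), dtype=np.uint8)
    feat = frame_feature(padded_crop(frame, (400, 200, 700, 900)))
    print(feat.shape)  # torch.Size([1280])
```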

In Tab. VII we provide evaluation times for models with different buffer sizes, which were introduced in Tab. VI. Note that runtimes are calculated only for the evaluation of the action segmentation models. Feature extraction time is not taken into consideration, as all the models use the MobileNet-V2 feature extractor and thus have a constant per-frame feature extraction cost. We would like to highlight the trade-off between accuracy and shorter runtimes. Even though models with shorter buffer sizes are less robust, they have higher throughput, and depending on the application requirements they may be more appropriate.

V. LIMITATIONS AND FUTURE DIRECTIONS
In this section, we outline the limitations of existing state-of-the-art human action segmentation techniques, discuss various open research questions, and highlight future research directions.
Fig. 13. Adapting the feature buffer to a real world setting: the initial buffer (left) and the updated buffer (right) are shown. The buffer is updated to function
in a first in, first out (FIFO) manner by removing the oldest feature vector while the feature vector of the latest frame is pushed into the end of the buffer.
Once the buffer is updated, the features within the buffer are passed through the Action segmentation model and the prediction corresponding to the latest
frame is considered to be the recognised action.
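A compact sketch of the buffering scheme in Fig. 13 is given below: per-frame features enter a fixed-length first-in-first-out buffer, the buffered sequence is passed through the temporal model after every update, and only the prediction for the most recent frame is reported as the current action. The `segmentation_model` callable, the buffer length and the feature dimensionality are placeholders for whichever backbone and temporal model are used.

```python
# Sketch of the FIFO feature buffer in Fig. 13. `segmentation_model` is a
# placeholder for any temporal model that maps a (T, D) feature sequence to
# (T, num_classes) frame-wise scores; buffer length and sizes are illustrative.
from collections import deque
import numpy as np

class FIFOActionBuffer:
    def __init__(self, segmentation_model, buffer_len=1000):
        self.model = segmentation_model
        self.buffer = deque(maxlen=buffer_len)   # oldest feature is dropped automatically

    def update(self, frame_feature):
        """Push the newest frame feature and return the current-action prediction."""
        self.buffer.append(frame_feature)
        sequence = np.stack(self.buffer)         # (t, D), with t <= buffer_len
        frame_scores = self.model(sequence)      # (t, num_classes)
        return int(np.argmax(frame_scores[-1]))  # label of the latest frame only

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dummy_model = lambda seq: rng.random((seq.shape[0], 67))  # 67 MPII Cooking 2 classes
    stream = FIFOActionBuffer(dummy_model, buffer_len=250)
    for _ in range(300):                                      # simulate a 300-frame stream
        current_action = stream.update(rng.random(1280).astype(np.float32))
    print("current action id:", current_action)
```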

TABLE VI
EVALUATION RESULTS ON THE MPII COOKING 2 DATASET BASED ON VARYING BUFFER LENGTHS: THE RESULTS ARE COMPARED AGAINST THE RESULTS OBTAINED WITH ORIGINAL SEQUENCE LENGTHS (BY MAINTAINING A VARIABLE LENGTH BUFFER) WHERE THE FEATURE EXTRACTION BACKBONE IS FINE-TUNED WITH 10% BOUNDING BOX PADDING APPLIED.

Method   | Length   | Acc   | Edit  | F1@0.10 | F1@0.25 | F1@0.50
MS-TCN   | 100      | 37.87 | 19.72 | 15.39   | 14.33   | 11.46
MS-TCN   | 250      | 39.13 | 19.64 | 18.63   | 16.83   | 12.31
MS-TCN   | 500      | 40.01 | 19.42 | 20.18   | 17.94   | 13.22
MS-TCN   | 750      | 40.00 | 17.85 | 22.35   | 19.73   | 13.60
MS-TCN   | 1000     | 40.17 | 20.44 | 23.23   | 21.00   | 15.19
MS-TCN   | Original | 41.67 | 27.38 | 24.67   | 22.73   | 16.77
MS-TCN++ | 100      | 39.59 | 19.49 | 14.88   | 13.52   | 10.92
MS-TCN++ | 250      | 40.48 | 19.38 | 17.41   | 15.52   | 11.85
MS-TCN++ | 500      | 40.97 | 20.07 | 21.15   | 18.87   | 14.08
MS-TCN++ | 750      | 38.24 | 18.39 | 22.49   | 20.33   | 13.44
MS-TCN++ | 1000     | 38.82 | 21.32 | 23.95   | 21.10   | 13.62
MS-TCN++ | Original | 41.08 | 28.24 | 25.01   | 23.17   | 16.78
SSTDA    | 100      | 31.76 | 19.14 | 18.40   | 16.21   | 7.45
SSTDA    | 250      | 36.56 | 23.20 | 20.00   | 17.01   | 8.31
SSTDA    | 500      | 36.70 | 23.22 | 20.67   | 17.03   | 9.63
SSTDA    | 750      | 37.75 | 23.97 | 20.73   | 17.33   | 9.97
SSTDA    | 1000     | 37.84 | 24.09 | 20.97   | 18.09   | 9.98
SSTDA    | Original | 38.08 | 24.83 | 21.76   | 18.93   | 10.12
DTGRM    | 100      | 35.29 | 20.75 | 16.63   | 15.23   | 12.36
DTGRM    | 250      | 37.83 | 17.29 | 14.79   | 13.09   | 9.67
DTGRM    | 500      | 38.27 | 17.31 | 16.35   | 14.38   | 10.35
DTGRM    | 750      | 41.28 | 17.71 | 16.81   | 14.90   | 10.60
DTGRM    | 1000     | 40.57 | 17.67 | 19.77   | 18.05   | 13.64
DTGRM    | Original | 42.74 | 23.79 | 18.69   | 17.18   | 12.23

TABLE VII
EVALUATION TIMES FOR DIFFERENT BUFFER SIZES. NOTE THAT RUNTIME IS CALCULATED ONLY FOR THE EVALUATION OF THE ACTION SEGMENTATION MODEL. FEATURE EXTRACTION TIME IS NOT TAKEN INTO CONSIDERATION.

Model    | Buffer 100 | Buffer 500 | Buffer 1000
MS-TCN   |   112.9 ms |   729.1 ms |   1410.5 ms
MS-TCN++ |   169.8 ms |   719.6 ms |   1470.6 ms
SSTDA    |   387.7 ms |  1675.4 ms |   3225.8 ms
DTGRM    |  1601.4 ms |  8830.6 ms |  24819.8 ms
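Runtimes such as those in Table VII can be gathered with a simple wall-clock harness around the temporal model's forward pass, as sketched below. The dummy model, warm-up call and repeat count are assumptions for illustration; GPU measurements would additionally require device synchronisation before reading the clock.

```python
# Minimal timing harness of the kind used to produce per-buffer-size runtimes.
# The dummy model and repeat count are illustrative only.
import time
import numpy as np

def time_model(model, buffer_size, feat_dim=1280, repeats=10):
    sequence = np.random.rand(buffer_size, feat_dim).astype(np.float32)
    model(sequence)                            # warm-up call
    start = time.perf_counter()
    for _ in range(repeats):
        model(sequence)
    return (time.perf_counter() - start) / repeats * 1000.0   # milliseconds per call

if __name__ == "__main__":
    dummy_model = lambda seq: seq @ np.random.rand(seq.shape[1], 67).astype(np.float32)
    for size in (100, 500, 1000):
        print(size, f"{time_model(dummy_model, size):.1f} ms")
```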

A. Interpretation of the Action Segmentation Models
Models that result from deep learning are hard to interpret, as most of the decisions are made in an end-to-end manner and the models are highly parameterised. As such, model interpretation plays a crucial role when making black-box deep learning models transparent. However, to the best of our knowledge, no existing state-of-the-art deep learning action segmentation methods have utilised model interpretation techniques, or tried to explain why certain model decisions are made. As such, in some instances it is hard to evaluate whether a given model is making informed predictions, or whether it is simply
memorising the data via its high model capacity.
We believe one of the major hindrances in generating such interpretation outputs is the lack of a model-agnostic spatio-temporal model interpretation pipeline. For instance, off-the-shelf deep model interpretation frameworks such as Grad-CAM [126], Local Interpretable Model-agnostic Explanations (LIME) [127], and Guided backpropagation (GBP) [128] are widely used to interpret deep spatial models. However, spatio-temporal deep model interpretation developments are not as advanced. Therefore, more research is invited towards the design of model-agnostic spatio-temporal model interpretation strategies. Furthermore, attention should be paid to the end-users when designing such model interpretation mechanisms. For instance, a given interpretation pipeline can be informative for machine learning practitioners, however it may not be useful to the end-users of the action segmentation system due to their lower technical knowledge, and hence may not serve to build their trust in the system.

B. Model Generalisation
One of the major obstacles that is regularly faced during the deployment of an action segmentation model in a real-world setting is a mismatch between the training and deployment environments, which leads the deployed model to generate erroneous decisions. This mismatch can be due to changes in the data capture settings, operating conditions, or changes in the sensor types and modalities. With particular respect to action segmentation models, differences between the humans observed in the training and testing data and the way they perform actions are also a major source of mismatch. Therefore, the design of generalisable models plays a pivotal role within the real-world deployment of human action segmentation models. Nevertheless, to the best of our knowledge, SSTDA [63] is the only model among recent state-of-the-art methods which tries to compensate for this mismatch. In particular, the SSTDA model tries to overcome subject-level differences observed when performing the same action, and learn a feature representation that is invariant to such differences. We note that none of the recent literature has investigated how to overcome other domain shifts such as changes in operating conditions, or sensor modalities.
We propose meta-learning [129] and domain generalisation [130] as two fruitful pathways for future research towards the design of real-world deployment-ready action segmentation frameworks. Meta-learning is a subfield of transfer learning and focuses on how a model trained for a particular task can be quickly adapted to a novel situation, without completely re-training on the data. On the other hand, domain generalisation offers methods that may allow a model trained for the same task to be adapted to a particular sub-domain (i.e. data with domain shift). For instance, in the action segmentation setting this could involve adapting a model trained using videos from an overhead camera to work with videos captured from a head mounted camera with a comparatively narrow field of view. Via domain generalisation such a model could be rapidly adapted to work with first-person view videos without significant tuning to the new domain.

C. Deployment on Embedded Systems
In the past few years, the focal point of human action segmentation research has been on attaining higher accuracies, which has led to deeper, more complex, and computationally expensive architectures. To date, direct deployment of human action segmentation pipelines on onboard embedded hardware for applications such as robotics has been deemed infeasible. In such a setting, human detection, tracking, a feature extraction backbone, and action segmentation models are all necessary, and this entire pipeline should run in real-time to allow for timely decisions. Therefore, mechanisms to reduce model complexity and/or to improve the computational power and throughput of embedded systems are of benefit.
We observe that this challenge can be addressed using both hardware and software augmentations. For instance, Vision Processing Units such as an Intel Movidius Myriad-X^1 or a Pixel Visual Core^2 can be coupled with the onboard evaluation hardware, which will enable fast inference of deep neural networks. Furthermore, deep learning acceleration libraries such as NVIDIA TensorRT^3 can be used to augment this throughput and optimise large networks for embedded systems. Most importantly, mechanisms to reduce the complexity of deep models without impacting their performance are an important avenue for research. Model pruning strategies can be used to remove nodes that do not have a significant impact on the final decision layer of the model. Another interesting paradigm we suggest is to reuse the feature extraction backbones for subsequent evaluations. For instance, an action segmentation model can share the same features that the human detector or tracker uses, which would avoid multiple feature extraction evaluation steps within a pipeline. Further research can be conducted to evaluate the viability of such feature sharing.

^1 https://www.intel.com.au/content/www/au/en/products/details/processors/movidius-vpu/movidius-myriad-x.html
^2 https://blog.google/products/pixel/pixel-visual-core-image-processing-and-machine-learning-pixel-2/
^3 https://developer.nvidia.com/tensorrt
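One way to realise the feature-sharing idea suggested above is to compute the backbone feature map once per frame and let both the detection head and the frame-level action feature consume it, rather than running two separate backbones. The sketch below illustrates this pattern with lightweight placeholder modules; it is not the architecture of any specific published pipeline.

```python
# Sketch of sharing one backbone between detection and action segmentation,
# so each frame is encoded only once. All modules are lightweight placeholders.
import torch
import torch.nn as nn

class SharedBackbonePipeline(nn.Module):
    def __init__(self, feat_ch=128, num_actions=67):
        super().__init__()
        self.backbone = nn.Sequential(                 # shared per-frame encoder
            nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_ch, 3, stride=4, padding=1), nn.ReLU())
        self.det_head = nn.Conv2d(feat_ch, 5, 1)       # per-location (score + 4 box offsets)
        self.pool = nn.AdaptiveAvgPool2d(1)            # frame-level feature for the temporal model
        self.temporal = nn.GRU(feat_ch, 128, batch_first=True)
        self.classifier = nn.Linear(128, num_actions)

    def forward(self, clip):                           # clip: (T, 3, H, W)
        fmap = self.backbone(clip)                     # encode every frame once
        detections = self.det_head(fmap)               # reused for detection/tracking
        frame_feats = self.pool(fmap).flatten(1)       # (T, feat_ch)
        temporal_out, _ = self.temporal(frame_feats.unsqueeze(0))
        action_logits = self.classifier(temporal_out.squeeze(0))   # (T, num_actions)
        return detections, action_logits

if __name__ == "__main__":
    det, actions = SharedBackbonePipeline()(torch.randn(8, 3, 128, 128))
    print(det.shape, actions.shape)
```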
D. Optimising Repeated Predictions from a Feature Buffer
As illustrated in Sec. IV-E, a FIFO buffer is used when executing a temporal model (such as an action segmentation model) in real-world environments. We observe that it is redundant to make repetitive predictions across the entire sequence of frames within the frame buffer, yet it is also critical that a model has access to a sufficiently large temporal window to understand how behaviours are evolving, which aids prediction.
We conducted a preliminary evaluation of the effect of the buffer size on model performance (see Table VI), and these results suggest that there is a considerable fluctuation in the performance of state-of-the-art models with different buffer sizes. It is unreasonable to assume a fixed buffer size across applications, as the buffer size is an application-specific parameter; therefore, further investigation should be conducted to minimise the dependency of these models on the buffer size.
Another interesting avenue for future research is how to adapt the buffer-model evaluation pipeline to facilitate real-time response. In such a setting, the action segmentation model has to be invoked at every observed frame or batch of frames. However, as the model has already made predictions for the majority of earlier frames in the buffer, a mechanism that reuses previously obtained knowledge and directs computational resources to the more recent frames is desirable. The use of custom loss functions that give more emphasis to recent frames in the observed sequence, to improve their classification accuracy, is one such possible approach. Such mechanisms can be investigated in future research, and would be valuable for adapting state-of-the-art methods for real-world applications.

E. Handling Unlabelled and Weakly Labelled Data
Despite the increase in publicly available datasets for action segmentation tasks, they remain substantially smaller in size than large-scale datasets such as ImageNet. Therefore, it is vital that action segmentation models have the ability to leverage unlabelled or weakly labelled data. This has added benefits with respect to adapting models to different environments where it is costly and infeasible to hand annotate large training corpora. Some preliminary research in this direction is presented in [45], [131], [132]. For instance, in [45] GANs are used to perform learning in a semi-supervised setting, exploiting both real training data and synthesised training data obtained from the generator of the GAN, whereas in [131], [132] the authors apply a pseudo-labelling strategy to generate one-hot labels for unlabelled examples.
Furthermore, weakly supervised action segmentation is a newly emerging domain and has gained a lot of attention within the action segmentation community. Specifically, in a weakly supervised action segmentation system, lower levels of supervision are provided to the model. For instance, the supervision signal may simply be a list of actions without any information regarding their order or how many repetitions of each action occur, or annotations may be presented as a list of actions without individual action start and end times. One popular approach to solve this task has been to generate pseudo ground truth labels and iteratively refine them [133], [134]. A new line of work called time-stamp supervision [135] is also emerging within the weakly-supervised action segmentation domain. Here, ground truth action classes for different segments of a video are provided and the goal is to determine action transition boundaries. Despite these recent efforts, the domain is less explored and the accuracy gap between semi-supervised and weakly-supervised models and fully supervised models is large. Hence, further research in this direction to reduce this accuracy gap is encouraged.

F. Incorporating Background Context Together with Human Detections
In a multi-human action segmentation setting, one potential challenge that we observe is the loss of background context when the features from cropped human detections are fed to an action segmentation model. Depending on the predicted bounding boxes, the action segmentation model may fail to capture information from the objects in the background, or objects that the human is interacting with. This information is crucial to properly recognise the ongoing action. Therefore, mechanisms to incorporate this information into the action segmentation pipeline should be investigated.
A naive approach would be to expand the size of the detected bounding box such that information from the surroundings is also captured. However, this could lead to the inclusion of misleading information, such as information from other humans in the surroundings if the environment is cluttered. We suggest following more informative pathways for further investigation: i) explicitly detecting the interacting objects and feeding these features as additional feature vectors to the model; ii) capturing the features from a scene parsing network, such as an image segmentation model, and propagating these features to the action segmentation model such that it can see the global context of the scene.
A preliminary investigation towards the first pathway is presented in prior works [136], [137], [138]. For instance, in the group activity recognition setting, person-level features and scene-level features are incorporated into the action recognition model in [136], whereas in [137] the authors extend this idea to utilise pose context as well. However, these works try to recognise actions in videos that are already segmented. Towards this end, a mechanism to extract temporal context is proposed in [138] for egocentric action segmentation; however, more sophisticated modelling schemes are welcomed in order to capture the complete background context across the entire unsegmented video.

REFERENCES
[1] Y. Kong and Y. Fu, "Human action recognition and prediction: A survey," arXiv preprint arXiv:1806.11230, 2018.
[2] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks for action segmentation and detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
[3] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, "A database for fine grained activity detection of cooking activities," in 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 1194–1201.
[4] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, "A multi-stream bi-directional recurrent neural network for fine-grained action detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Las Vegas, NV, USA: IEEE, 2016, pp. 1961–1970.
[5] I. Laptev and P. Pérez, "Retrieving actions in movies," in 2007 IEEE 11th International Conference on Computer Vision. Rio de Janeiro, Brazil: IEEE, 2007, pp. 1–8.
[6] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. Reid, "High five: Recognising human interactions in tv shows," in Proceedings of the British Machine Vision Conference. Aberystwyth, UK: Citeseer, 2010, p. 33.
[7] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Action recognition by dense trajectories," in CVPR 2011. Colorado Springs, CO, USA: IEEE, 2011, pp. 3169–3176.
[8] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. Anchorage, AK, USA: IEEE, 2008, pp. 1–8.
[9] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in European Conference on Computer Vision. Graz, Austria: Springer, 2006, pp. 428–441.
[10] A. Zinnen, U. Blanke, and B. Schiele, "An analysis of sensor-oriented vs. model-based activity recognition," in 2009 International Symposium on Wearable Computers. Linz, Austria: IEEE, 2009, pp. 93–100.
[11] B. Ni, V. R. Paramathayalan, and P. Moulin, "Multiple granularity analysis for fine-grained action detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH, USA: IEEE, 2014, pp. 756–763.
[12] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, [33] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Tmmf:
M. Pinkal, and B. Schiele, “Recognizing fine-grained and composite Temporal multi-modal fusion for single-stage continuous gesture recog-
activities using hand-centric features and script data,” International nition,” IEEE Transactions on Image Processing, vol. 30, pp. 7689–
Journal of Computer Vision, vol. 119, no. 3, pp. 346–373, 2016. 7701, 2021.
[13] H. Pirsiavash and D. Ramanan, “Parsing videos of actions with seg- [34] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
mental grammars,” in Proceedings of the IEEE conference on computer A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet:
vision and pattern recognition. Columbus, OH, USA: IEEE, 2014, A generative model for raw audio,” arXiv preprint arXiv:1609.03499,
pp. 612–619. 2016.
[14] H. Kuehne, A. Arslan, and T. Serre, “The language of actions: [35] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic
Recovering the syntax and semantics of goal-directed human activities,” convolutional and recurrent networks for sequence modeling,” arXiv
in Proceedings of the IEEE conference on computer vision and pattern preprint arXiv:1803.01271, pp. 1–14, 2018.
recognition. Columbus, OH, USA: IEEE, 2014, pp. 780–787. [36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
[15] L. Wang, Y. Qiao, and X. Tang, “Action recognition with trajectory- S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,”
pooled deep-convolutional descriptors,” in Proceedings of the IEEE Advances in neural information processing systems, vol. 27, 2014.
conference on computer vision and pattern recognition. Boston, MA, [37] M. Mirza and S. Osindero, “Conditional generative adversarial nets,”
USA: IEEE, 2015, pp. 4305–4314. arXiv preprint arXiv:1411.1784, pp. 1–7, 2014.
[16] K. Simonyan and A. Zisserman, “Two-stream convolutional networks [38] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image transla-
for action recognition in videos,” in Advances in Neural Information tion with conditional adversarial networks,” in Proceedings of the IEEE
Processing Systems, 2014, pp. 568–576. conference on computer vision and pattern recognition. Honolulu, HI,
[17] G. Gkioxari and J. Malik, “Finding action tubes,” in Proceedings of USA: IEEE, 2017, pp. 1125–1134.
the IEEE conference on computer vision and pattern recognition, 2015, [39] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video
pp. 759–768. prediction beyond mean square error,” in 4th International Conference
[18] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Two stream on Learning Representations, ICLR 2016, San Juan, Puerto Rico,, 2016,
lstm: A deep fusion framework for human action recognition,” in 2017 pp. 1–14.
IEEE Winter Conference on Applications of Computer Vision (WACV). [40] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon, “Pixel-
IEEE, 2017, pp. 177–186. level domain transfer,” in European Conference on Computer Vision.
[19] B. Ni, X. Yang, and S. Gao, “Progressively parsing interactional Amsterdam, The Netherlands: Springer, 2016, pp. 517–532.
objects for fine grained action detection,” in Proceedings of the IEEE [41] T. Hu, C. Long, and C. Xiao, “A novel visual representation on text us-
Conference on Computer Vision and Pattern Recognition, 2016, pp. ing diverse conditional gan for visual recognition,” IEEE Transactions
1020–1028. on Image Processing, vol. 30, pp. 3499–3512, 2021.
[20] Y. A. Farha and J. Gall, “Ms-tcn: Multi-stage temporal convolutional [42] Z. Qi, J. Sun, J. Qian, J. Xu, and S. Zhan, “Pccm-gan: Photographic
network for action segmentation,” in Proceedings of the IEEE/CVF text-to-image generation with pyramid contrastive consistency model,”
Conference on Computer Vision and Pattern Recognition. Long Beach, Neurocomputing, vol. 449, pp. 330–341, 2021.
CA, USA: Springer-Verlag, 2019, pp. 3575–3584. [43] K. Gedamu, Y. Ji, Y. Yang, L. Gao, and H. T. Shen, “Arbitrary-view
human action recognition via novel-view action generation,” Pattern
[21] P. Lei and S. Todorovic, “Temporal deformable residual networks for
Recognition, vol. 118, p. 108043, 2021.
action segmentation in videos,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2018, pp. 6742–6751. [44] H. Gammulle, T. Fernando, S. Denman, S. Sridharan, and C. Fookes,
“Coupled generative adversarial network for continuous fine-grained
[22] H.-B. Zhang, Y.-X. Zhang, B. Zhong, Q. Lei, L. Yang, J.-X. Du, and
action segmentation,” in 2019 IEEE Winter Conference on Applications
D.-S. Chen, “A comprehensive survey of vision-based human action
of Computer Vision (WACV). IEEE, 2019, pp. 200–209.
recognition methods,” Sensors, vol. 19, no. 5, p. 1005, 2019.
[45] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Fine-grained
[23] B. Liu, H. Cai, Z. Ju, and H. Liu, “Rgb-d sensing based human action
action segmentation using the semi-supervised action gan,” Pattern
and interaction analysis: A survey,” Pattern Recognition, vol. 94, pp.
Recognition, vol. 98, p. 107039, 2020.
1–12, 2019.
[46] ——, “Predicting the future: A jointly learnt model for action antici-
[24] I. Jegham, A. B. Khalifa, I. Alouani, and M. A. Mahjoub, “Vision- pation,” in Proceedings of the IEEE/CVF International Conference on
based human action recognition: An overview and real world chal- Computer Vision, 2019, pp. 5562–5571.
lenges,” Forensic Science International: Digital Investigation, vol. 32, [47] M. Wang and W. Deng, “Deep visual domain adaptation: A survey,”
p. 200901, 2020. Neurocomputing, vol. 312, pp. 135–153, 2018.
[25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for [48] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola,
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. “A kernel two-sample test,” The Journal of Machine Learning Re-
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, search, vol. 13, no. 1, pp. 723–773, 2012.
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with [49] B. Sun, J. Feng, and K. Saenko, “Correlation alignment for unsuper-
convolutions,” in Proceedings of the IEEE conference on computer vised domain adaptation,” in Domain Adaptation in Computer Vision
vision and pattern recognition, 2015, pp. 1–9. Applications. Springer, 2017, pp. 153–171.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image [50] G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann, “Contrastive
recognition,” in Proceedings of the IEEE conference on computer vision adaptation network for unsupervised domain adaptation,” in Proceed-
and pattern recognition. Las Vegas, NV, USA: IEEE, 2016, pp. 770– ings of the IEEE/CVF Conference on Computer Vision and Pattern
778. Recognition, 2019, pp. 4893–4902.
[28] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for con- [51] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided
volutional neural networks,” in International Conference on Machine representation learning for domain adaptation,” in Thirty-Second AAAI
Learning. Long Beach, CA, USA: PMLR, 2019, pp. 6105–6114. Conference on Artificial Intelligence, 2018.
[29] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mo- [52] M. M. Rahman, C. Fookes, M. Baktashmotlagh, and S. Sridharan,
bilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of “Correlation-aware adversarial domain adaptation and generalization,”
the IEEE conference on computer vision and pattern recognition, 2018, Pattern Recognition, vol. 100, p. 107124, 2020.
pp. 4510–4520. [53] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,”
[30] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new in Advances in Neural Information Processing Systems, D. Lee,
model and the kinetics dataset,” in proceedings of the IEEE Conference M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29.
on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308. Barcelona, Spain: Curran Associates, Inc., 2016.
[31] K. Greff, R. K. Srivastava, J. Koutnı́k, B. R. Steunebrink, and [54] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Lavi-
J. Schmidhuber, “Lstm: A search space odyssey,” IEEE transactions on olette, M. Marchand, and V. Lempitsky, “Domain-adversarial training
neural networks and learning systems, vol. 28, no. 10, pp. 2222–2232, of neural networks,” The journal of machine learning research, vol. 17,
2016. no. 1, pp. 2096–2030, 2016.
[32] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, [55] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, “Deep
H. Schwenk, and Y. Bengio, “Learning phrase representations using reconstruction-classification networks for unsupervised domain adapta-
rnn encoder-decoder for statistical machine translation,” arXiv preprint tion,” in European Conference on Computer Vision. Amsterdam, The
arXiv:1406.1078, 2014. Netherlands: Springer, 2016, pp. 597–613.
[56] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image [77] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally,
translation using cycle-consistent adversarial networks,” in Proceedings and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer
of the IEEE international conference on computer vision. Venice, Italy: parameters and¡ 0.5 mb model size,” arXiv preprint arXiv:1602.07360,
IEEE, 2017, pp. 2223–2232. 2016.
[57] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A [78] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely effi-
comprehensive survey on graph neural networks,” IEEE transactions cient convolutional neural network for mobile devices,” in Proceedings
on neural networks and learning systems, vol. 32, no. 1, pp. 4–24, of the IEEE conference on computer vision and pattern recognition,
2021. 2018, pp. 6848–6856.
[58] Y. Huang, Y. Sugano, and Y. Sato, “Improving action segmentation [79] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,”
via graph-based temporal reasoning,” in Proceedings of the IEEE/CVF arXiv preprint arXiv:1804.02767, 2018.
Conference on Computer Vision and Pattern Recognition, 2020, pp. [80] C.-Y. Wang, H.-Y. M. Liao, I.-H. Yeh, Y.-H. Wu, P.-Y. Chen, and J.-W.
14 024–14 034. Hsieh, “Cspnet: A new backbone that can enhance learning capability
[59] S. Stein and S. J. McKenna, “Combining embedded accelerometers of cnn,” arXiv preprint arXiv:1911.11929, 2019.
with computer vision for recognizing food preparation activities,” [81] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie,
in Proceedings of the 2013 ACM international joint conference on “Feature pyramid networks for object detection,” in Proceedings of the
Pervasive and ubiquitous computing, 2013, pp. 729–738. IEEE conference on computer vision and pattern recognition, 2017,
[60] D. Damen, H. Doughty, G. M. Farinella, , A. Furnari, J. Ma, pp. 2117–2125.
E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, [82] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient
“Rescaling egocentric vision: Collection, pipeline and challenges for object detection,” in Proceedings of the IEEE/CVF Conference on
epic-kitchens-100,” International Journal of Computer Vision (IJCV), Computer Vision and Pattern Recognition, 2020, pp. 10 781–10 790.
2021. [Online]. Available: https://doi.org/10.1007/s11263-021-01531-2 [83] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep
[61] D. Wang, D. Hu, X. Li, and D. Dou, “Temporal relational modeling with self-supervision for action segmentation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 4, 2021, pp. 2729–2737.
[62] S.-H. Gao, Q. Han, Z.-Y. Li, P. Peng, L. Wang, and M.-M. Cheng, “Global2local: Efficient structure search for video action segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 16805–16814.
[63] M.-H. Chen, B. Li, Y. Bao, G. AlRegib, and Z. Kira, “Action segmentation with joint self-supervised temporal domain adaptation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9454–9463.
[64] H. Ahn and D. Lee, “Refining action segmentation with hierarchical video representations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16302–16310.
[65] S.-J. Li, Y. AbuFarha, Y. Liu, M.-M. Cheng, and J. Gall, “Ms-tcn++: Multi-stage temporal convolutional network for action segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[66] Z. Wang, Z. Gao, L. Wang, Z. Li, and G. Wu, “Boundary-aware cascade networks for temporal action segmentation,” in European Conference on Computer Vision. Springer, 2020, pp. 34–51.
[67] A. Fathi, X. Ren, and J. M. Rehg, “Learning to recognize objects in egocentric activities,” in CVPR 2011. IEEE, 2011, pp. 3281–3288.
[68] H. Kuehne, A. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 780–787.
[69] Y. Li, M. Liu, and J. M. Rehg, “In the eye of beholder: Joint learning of gaze and actions in first person video,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 619–635.
[70] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, and R. Qu, “A survey of deep learning-based object detection,” IEEE Access, vol. 7, pp. 128837–128868, 2019.
[71] W. Ke, T. Zhang, Z. Huang, Q. Ye, J. Liu, and D. Huang, “Multiple anchor learning for visual object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10206–10215.
[72] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[73] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[74] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[75] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[76] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, “Bag of tricks for image classification with convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 558–567.
[83] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
[84] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[85] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
[86] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[87] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via region-based fully convolutional networks,” in Advances in Neural Information Processing Systems, 2016, pp. 379–387.
[88] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, “Libra r-cnn: Towards balanced learning for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 821–830.
[89] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[90] W. Ouyang, K. Wang, X. Zhu, and X. Wang, “Chained cascade network for object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1938–1946.
[91] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[92] Y. Li, Y. Chen, N. Wang, and Z. Zhang, “Scale-aware trident networks for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6054–6063.
[93] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[94] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “Dssd: Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.
[95] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[96] Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling, “M2det: A single-shot object detector based on multi-level feature pyramid network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 9259–9266.
[97] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “Scaled-yolov4: Scaling cross stage partial network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13029–13038.
[98] L. Tychsen-Smith and L. Petersson, “Denet: Scalable real-time object detection with directed sparse sampling,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 428–436.
[99] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
[100] H. Law, Y. Teng, O. Russakovsky, and J. Deng, “Cornernet-lite: Efficient keypoint based object detection,” arXiv preprint arXiv:1904.08900, 2019.
[101] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
[102] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, “Centernet: Keypoint triplets for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6569–6578.
[103] Z. Tian, C. Shen, H. Chen, and T. He, “Fcos: Fully convolutional one-stage object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627–9636.
[104] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, “Yolox: Exceeding yolo series in 2021,” arXiv preprint arXiv:2107.08430, 2021.
[105] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[106] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[107] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6023–6032.
[108] G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Dropblock: A regularization method for convolutional networks,” in Advances in Neural Information Processing Systems, 2018, pp. 10727–10737.
[109] D. Misra, “Mish: A self regularized non-monotonic neural activation function,” arXiv preprint arXiv:1908.08681, 2019.
[110] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1116–1124.
[111] J. Peng, T. Wang, W. Lin, J. Wang, J. See, S. Wen, and E. Ding, “Tpm: Multiple object tracking with tracklet-plane matching,” Pattern Recognition, vol. 107, p. 107480, 2020.
[112] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 3464–3468.
[113] N. Wojke, A. Bewley, and D. Paulus, “Simple online and realtime tracking with a deep association metric,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3645–3649.
[114] Y. Zhang, P. Sun, Y. Jiang, D. Yu, Z. Yuan, P. Luo, W. Liu, and X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,” arXiv preprint arXiv:2110.06864, 2021.
[115] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 941–951.
[116] X. Zhou, V. Koltun, and P. Krähenbühl, “Tracking objects as points,” in European Conference on Computer Vision. Springer, 2020, pp. 474–490.
[117] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, “Fairmot: On the fairness of detection and re-identification in multiple object tracking,” International Journal of Computer Vision, pp. 1–19, 2021.
[118] B. Shuai, A. Berneshawi, X. Li, D. Modolo, and J. Tighe, “Siammot: Siamese multi-object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12372–12382.
[119] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele, “Recognizing fine-grained and composite activities using hand-centric features and script data,” International Journal of Computer Vision, pp. 1–28, 2015. [Online]. Available: http://dx.doi.org/10.1007/s11263-015-0851-8
[120] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[121] H. Idrees, A. R. Zamir, Y.-G. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah, “The thumos challenge on action recognition for videos “in the wild”,” Computer Vision and Image Understanding, vol. 155, pp. 1–23, 2017.
[122] S. Das, R. Dai, M. Koperski, L. Minciullo, L. Garattoni, F. Bremond, and G. Francesca, “Toyota smarthome: Real-world activities of daily living,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.
[123] D. Shao, Y. Zhao, B. Dai, and D. Lin, “Finegym: A hierarchical video dataset for fine-grained action understanding,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[124] AlexeyAB, “Darknet: Open source neural networks in c,” https://github.com/AlexeyAB/darknet, 2021.
[125] Ifzhang, “Bytetrack: Multi-object tracking by associating every detection box,” https://github.com/ifzhang/ByteTrack, 2021.
[126] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 618–626.
[127] M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why should I trust you?” Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
[128] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
[129] H. Coskun, M. Z. Zia, B. Tekin, F. Bogo, N. Navab, F. Tombari, and H. Sawhney, “Domain-specific priors and meta learning for few-shot first-person action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[130] Y. Zhu, Y. Long, Y. Guan, S. Newsam, and L. Shao, “Towards universal representation for unseen action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9436–9445.
[131] A. Singh, O. Chakraborty, A. Varshney, R. Panda, R. Feris, K. Saenko, and A. Das, “Semi-supervised action recognition with temporal contrastive learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10389–10399.
[132] H. Terao, W. Noguchi, H. Iizuka, and M. Yamamoto, “Semi-supervised learning combining 2dcnns and video compression for action recognition,” in Proceedings of the 2020 4th International Conference on Vision, Image and Signal Processing, 2020, pp. 1–6.
[133] A. Richard, H. Kuehne, and J. Gall, “Weakly supervised action learning with rnn based fine-to-coarse modeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 754–763.
[134] L. Ding and C. Xu, “Weakly-supervised action segmentation with iterative soft boundary assignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6508–6516.
[135] Z. Li, Y. Abu Farha, and J. Gall, “Temporal action segmentation from timestamp supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8365–8374.
[136] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Multi-level sequence gan for group activity recognition,” in Asian Conference on Computer Vision. Springer, 2018, pp. 331–346.
[137] A. Dasgupta, C. Jawahar, and K. Alahari, “Context aware group activity recognition,” in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 10098–10105.
[138] E. Kazakos, J. Huh, A. Nagrani, A. Zisserman, and D. Damen, “With a little help from my temporal context: Multimodal egocentric action recognition,” British Machine Vision Conference, 2021.
[139] H. Bilen, B. Fernando, E. Gavves, A. Vedaldi, and S. Gould, “Dynamic image networks for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3034–3042.
[140] P. P. Busto, A. Iqbal, and J. Gall, “Open set domain adaptation for image and action recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 413–429, 2018.
[141] R. Gao, B. Xiong, and K. Grauman, “Im2flow: Motion hallucination from static images for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5937–5947.
[142] W. Liu, C. Zhang, J. Zhang, and Z. Wu, “Global for coarse and part for fine: A hierarchical action recognition framework,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2630–2634.
[143] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in European Conference on Computer Vision. Springer, 2016, pp. 766–782.
[144] K. Vats, M. Fani, P. Walters, D. A. Clausi, and J. Zelek, “Event detection in coarsely annotated sports videos via parallel multi-receptive field 1d convolutions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 882–883.
[145] N. Hussein, E. Gavves, and A. W. Smeulders, “Timeception for complex action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 254–263.
[146] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Hierarchical attention network for action segmentation,” Pattern Recognition Letters, vol. 131, pp. 442–448, 2020.
[147] ——, “Forecasting future action sequences with neural memory networks,” British Machine Vision Conference (BMVC), 2019.
[148] D. Liu, N. Kamath, S. Bhattacharya, and R. Puri, “Adaptive context reading network for movie scene detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 9, pp. 3559–3574, 2020.
[149] A. K. Bhunia, A. Sain, A. Kumar, S. Ghose, P. N. Chowdhury, and Y.-Z. Song, “Joint visual semantic reasoning: Multi-stage decoder for text recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14940–14949.
[150] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Two-stream deep feature modelling for automated video endoscopy data analysis,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020, pp. 742–751.
[151] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu et al., “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” Advances in Neural Information Processing Systems, vol. 32, pp. 103–112, 2019.
[152] X. Yin, D. Wu, Y. Shang, B. Jiang, and H. Song, “Using an efficientnet-lstm for the recognition of single cow’s motion behaviours in a complicated environment,” Computers and Electronics in Agriculture, vol. 177, p. 105707, 2020.
[153] Y. Huo, X. Xu, Y. Lu, Y. Niu, M. Ding, Z. Lu, T. Xiang, and J.-r. Wen, “Lightweight action recognition in compressed videos,” in European Conference on Computer Vision. Springer, 2020, pp. 337–352.
[154] H. Kim, M. Jain, J.-T. Lee, S. Yun, and F. Porikli, “Efficient action recognition via dynamic knowledge propagation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13719–13728.
[155] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[156] S. N. Gowda, M. Rohrbach, and L. Sevilla-Lara, “Smart frame selection for action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1451–1459.
[157] Z. Zhang, Z. Kuang, P. Luo, L. Feng, and W. Zhang, “Temporal sequence distillation: Towards few-frame action recognition in videos,” in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 257–264.
[158] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision. Santiago, Chile: IEEE, 2015, pp. 4489–4497.
[159] Y. Wang, J. Song, L. Wang, L. Van Gool, and O. Hilliges, “Two-stream sr-cnns for action recognition in videos,” in BMVC. York, UK, 2016.
[160] M. Zhang, C. Gao, Q. Li, L. Wang, and J. Zhang, “Action detection based on tracklets with the two-stream cnn,” Multimedia Tools and Applications, vol. 77, no. 3, pp. 3303–3316, 2018.
[161] J.-K. Tsai, C.-C. Hsu, W.-Y. Wang, and S.-K. Huang, “Deep learning-based real-time multiple-person action recognition system,” Sensors, vol. 20, no. 17, p. 4758, 2020.
[162] H. Ishioka, X. Weng, Y. Man, and K. Kitani, “Single camera worker detection, tracking and action recognition in construction site,” in ISARC. Proceedings of the International Symposium on Automation and Robotics in Construction, vol. 37. IAARC Publications, 2020, pp. 653–660.
[163] A. Ali, F. Negin, F. Bremond, and S. Thümmler, “Video-based behavior understanding of children for objective diagnosis of autism,” in VISAPP 2022 - International Conference on Computer Vision Theory and Applications, 2022.
[164] T. Ahmed, K. Thopalli, T. Rikakis, P. Turaga, A. Kelliher, J.-B. Huang, and S. L. Wolf, “Automated movement assessment in stroke rehabilitation,” Frontiers in Neurology, p. 1396, 2021.
[165] D. Ahmedt-Aristizabal, S. Denman, K. Nguyen, S. Sridharan, S. Dionisio, and C. Fookes, “Understanding patients’ behavior: Vision-based analysis of seizure disorders,” IEEE Journal of Biomedical and Health Informatics, vol. 23, no. 6, pp. 2583–2591, 2019.
[166] C. Chen, T. Wang, D. Li, and J. Hong, “Repetitive assembly action recognition based on object detection and pose estimation,” Journal of Manufacturing Systems, vol. 55, pp. 325–333, 2020.
[167] P. Ghosh, Y. Yao, L. Davis, and A. Divakaran, “Stacked spatio-temporal graph convolutional networks for action segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 576–585.
[168] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[169] S. Li, J. Yi, Y. A. Farha, and J. Gall, “Pose refinement graph convolutional network for skeleton-based action recognition,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 1028–1035, 2021.
[170] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 172–186, 2019.
[171] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[172] X. Long, K. Deng, G. Wang, Y. Zhang, Q. Dang, Y. Gao, H. Shen, J. Ren, S. Han, E. Ding et al., “Pp-yolo: An effective and efficient implementation of object detector,” arXiv preprint arXiv:2007.12099, 2020.
[173] X. Huang, X. Wang, W. Lv, X. Bai, X. Long, K. Deng, Q. Dang, S. Han, Q. Liu, X. Hu et al., “Pp-yolov2: A practical object detector,” arXiv preprint arXiv:2104.10419, 2021.
[174] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision. Springer, 2016, pp. 483–499.
[175] Z. Ge, S. Liu, Z. Li, O. Yoshie, and J. Sun, “Ota: Optimal transport assignment for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 303–312.
[176] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: Point set representation for object detection,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9657–9666.
[177] Y. Chen, Z. Zhang, Y. Cao, L. Wang, S. Lin, and H. Hu, “Reppoints v2: Verification meets regression for object detection,” Advances in Neural Information Processing Systems, vol. 33, 2020.
[178] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module for single-shot object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 840–849.
[179] X. Zhou, J. Zhuo, and P. Krahenbuhl, “Bottom-up object detection by grouping extreme and center points,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 850–859.
[180] R. E. Kalman, “A new approach to linear filtering and prediction problems,” 1960.
[181] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
[182] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “Siamrpn++: Evolution of siamese visual tracking with very deep networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4282–4291.
[183] J. Pang, L. Qiu, X. Li, H. Chen, Q. Li, T. Darrell, and F. Yu, “Quasi-dense similarity learning for multiple object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 164–173.
[184] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, and J. Yuan, “Track to detect and segment: An online multi-object tracker,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12352–12361.
[185] Q. Wang, Y. Zheng, P. Pan, and Y. Xu, “Multiple object tracking with correlation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3876–3886.
[186] P. Sun, Y. Jiang, R. Zhang, E. Xie, J. Cao, X. Hu, T. Kong, Z. Yuan, C. Wang, and P. Luo, “Transtrack: Multiple-object tracking with transformer,” arXiv preprint arXiv:2012.15460, 2020.
[187] F. Zeng, B. Dong, T. Wang, C. Chen, X. Zhang, and Y. Wei, “Motr: End-to-end multiple-object tracking with transformer,” arXiv preprint arXiv:2105.03247, 2021.
[188] D. Shan, J. Geng, M. Shu, and D. F. Fouhey, “Understanding human hands in contact at internet scale,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9869–9878.
Continuous Human Action Recognition for
Human-Machine Interaction: A Review
Supplementary Material

APPENDIX A
FEATURE EXTRACTION

In practice, a pre-trained CNN is commonly utilised for feature extraction. Early works [139], [140], [141] used pre-trained networks such as AlexNet [72], which are comparatively shallow networks with only 5 convolutional layers. More recent methods [142], [18], [143] have used deeper architectures such as VGG [25] and GoogleNet [26]. Even though deeper models can learn more discriminative features than shallow networks, deeper networks can struggle during training due to problems such as vanishing gradients caused by the increased network depth.

The following subsections discuss recent backbones that are commonly used within the action segmentation domain.

1) Residual Neural Network (ResNet): In [27], the proposed residual block with identity mapping has been shown to overcome the limitations of previous networks and make the training of deeper networks far easier. Fig. 14 shows the residual block formulation, where the identity mapping is achieved through a skip connection which adds the output of an earlier layer to a later layer without needing any additional parameters. By utilising this identity mapping, the authors have shown that residual networks are easier to optimise and can achieve better accuracies in networks with increased depth.

Fig. 14. Residual Block Formulation. Adapted from [27]

The authors introduced multiple residual networks with depths of 18, 34, 50, 101, and 152 layers, and the architectural details are included in Fig. 15. These ResNet models are less complex (in terms of parameters) than previous pre-trained networks such as VGG, where even the deepest ResNet (i.e. ResNet152) requires fewer floating point operations (11.3 billion FLOPs) than both the 16 and 19 layer VGG variants (i.e. VGG16 and VGG19), which require 15.3 and 19.6 billion FLOPs respectively.

Fig. 15. Residual Network Architectures: Networks with a depth of 18, 34, 50, 101, and 152 layers. Adapted from [27]

Due to the aforementioned advantages, pre-trained ResNet networks have been widely used as a feature extraction backbone within both the action recognition domain [144], [145], [146], [46], [147] and related problem domains [148], [149], [150].
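To make the residual formulation concrete, the following is a minimal PyTorch sketch of a single residual block and of frame-feature extraction with a pre-trained ResNet-50. The module layout and the 224 × 224 input size are illustrative assumptions rather than the exact implementation of [27].

import torch
import torch.nn as nn
from torchvision import models

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions plus an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                          # identity mapping: no extra parameters
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)      # F(x) + x

# Using a pre-trained ResNet-50 as a frame-level feature extractor: the final
# classification layer is discarded so that the 2048-d pooled feature is returned.
backbone = models.resnet50(pretrained=True)
backbone.fc = nn.Identity()
features = backbone(torch.randn(1, 3, 224, 224))   # shape: (1, 2048)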
2) EfficientNet-B0: In [28], the authors proposed a CNN architecture which improves both accuracy and efficiency through a reduction in the number of parameters and FLOPs compared to previous models such as GPipe [151]. In particular, they proposed an effective compound scaling mechanism to increase the model capacity, facilitating improved accuracy. Among their models, EfficientNet-B0 is the simplest and most efficient, and achieves 77.3% accuracy on ImageNet while having only 5.3M parameters and 0.39B FLOPs. In comparison, ResNet50 achieves only 76% accuracy [28] despite a substantially larger number of trainable parameters (i.e. 26M trainable parameters and 4.1B FLOPs).

The main building block of the EfficientNet-B0 architecture is the mobile inverted bottleneck MBConv, to which a squeeze-and-excitation optimisation is added. MBConv is similar to the inverted residual block used in MobileNet-V2 (details in Sec. A-3), and contains skip connections between the start and the end of a convolutional block. Table VIII shows the network architecture of EfficientNet-B0.

TABLE VIII
THE NETWORK ARCHITECTURE OF EFFICIENTNET-B0.
Stage | Operator | Resolution | #Channels | #Layers
1 | Conv3×3 | 224 × 224 | 32 | 1
2 | MBConv1, k3×3 | 112 × 112 | 16 | 1
3 | MBConv6, k3×3 | 112 × 112 | 24 | 2
4 | MBConv6, k5×5 | 56 × 56 | 40 | 2
5 | MBConv6, k3×3 | 28 × 28 | 80 | 3
6 | MBConv6, k5×5 | 14 × 14 | 112 | 3
7 | MBConv6, k5×5 | 14 × 14 | 192 | 4
8 | MBConv6, k3×3 | 7 × 7 | 320 | 1
9 | Conv1×1 & Pooling & FC | 7 × 7 | 1280 | 1

An important component of EfficientNet is the compound scaling method, which focuses on uniformly scaling all dimensions (i.e. depth, width and resolution) using a compound coefficient. When the depth of the network is individually increased, the network gains the capacity to learn more complex features, while increasing the width improves the network's ability to learn fine-grained features. Similarly, when the resolution is increased the model is able to learn finer details from the input image (i.e. detect smaller objects and finer patterns).
Although each scaling operation improves network performance, care should be taken, as the accuracy gain diminishes as model size increases. For example, as the depth of the network is continually increased, the network can suffer from vanishing gradients and become difficult to train. Therefore, in EfficientNet, instead of performing arbitrary scaling, a method called compound scaling is applied by uniformly scaling the network depth, width and resolution with fixed scaling coefficients. This scaling method is illustrated in Fig. 16. As described in [28], if 2^N times more computational resources are available, it is only a matter of increasing the network depth by α^N, the width by β^N, and the resolution by γ^N, where α, β and γ are constant coefficients that are determined by a small grid search over the original small model. In EfficientNet, the compound coefficient φ is used to uniformly scale the network width, depth and resolution, where φ is a user-specified coefficient determining the number of resources available for model scaling, while α, β and γ determine how these extra resources are deployed to increase the network depth, width and resolution, respectively.

Fig. 16. Compound Scaling of EfficientNet: Uniformly scale all three dimensions, including width, depth and resolution with a fixed ratio. Recreated from [28]
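The scaling arithmetic can be illustrated with a few lines of Python. The α, β and γ values below are the coefficients reported for EfficientNet in [28]; the base dimensions passed to the helper are placeholder values, and the sketch covers only the scaling rule, not the construction of the scaled network.

# Compound scaling: depth, width and resolution grow together under one coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # grid-searched so that alpha * beta^2 * gamma^2 ~= 2

def compound_scale(base_depth, base_width, base_resolution, phi):
    """Scale the three network dimensions with a single compound coefficient phi."""
    depth = base_depth * (ALPHA ** phi)            # number of layers
    width = base_width * (BETA ** phi)             # number of channels
    resolution = base_resolution * (GAMMA ** phi)  # input image size
    return round(depth), round(width), round(resolution)

# With phi = 1, roughly 2x more compute is spent; with phi = 2, roughly 4x, and so on.
print(compound_scale(base_depth=18, base_width=64, base_resolution=224, phi=2))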
The advancements and flexibility offered by the EfficientNet architecture have seen it widely adopted across computer vision as a feature extractor [152], [153], [154].

3) MobileNet-V2: The initial version of MobileNet, MobileNet-V1 [155], dramatically reduced the computational cost and model complexity in comparison to other networks, enabling the use of DCNNs in mobile or low computational power environments. To achieve this, MobileNet-V1 introduced depth-wise separable convolutions. Essentially, each such block contains two layers: a depthwise convolution, which performs lightweight filtering by applying a single convolutional filter to each input channel; and a point-wise convolution, which is a 1 × 1 convolution that learns new features through a linear combination of the input channels.

MobileNet-V2 [29] builds upon the findings of MobileNet-V1 by utilising depth-wise separable convolutions as the building block, while adding two other techniques to further improve performance: linear bottlenecks between the layers, and shortcut connections between the bottlenecks. Fig. 17 compares the convolutional blocks used in MobileNet-V1 and MobileNet-V2. Table IX describes the MobileNet-V2 architecture, where t, c, n and s refer to the expansion factor, number of output channels, number of repetitions and stride, respectively. For spatial convolutions, 3 × 3 kernels are used. Due to its light-weight architecture, MobileNet is widely used as a feature extraction backbone in applications that operate in a live or real-time setting [156], [157].

Fig. 17. MobileNet V1 and V2 convolutional blocks. Recreated from [155]

TABLE IX
MOBILENET V2 ARCHITECTURE. EACH LAYER IS REPEATED N TIMES AND T, C, S DENOTE THE EXPANSION FACTOR, OUTPUT CHANNELS AND STRIDES RESPECTIVELY.
Input | Operator | t | c | n | s
224² × 3 | conv2d | - | 32 | 1 | 2
112² × 32 | bottleneck | 1 | 16 | 1 | 1
112² × 16 | bottleneck | 6 | 24 | 2 | 2
56² × 24 | bottleneck | 6 | 32 | 3 | 2
28² × 32 | bottleneck | 6 | 64 | 4 | 2
14² × 64 | bottleneck | 6 | 96 | 3 | 1
14² × 96 | bottleneck | 6 | 160 | 3 | 2
7² × 160 | bottleneck | 6 | 320 | 1 | 1
7² × 320 | conv2d 1×1 | - | 1280 | 1 | 1
7² × 1280 | avgpool 7×7 | - | - | 1 | -
1 × 1 × 1280 | conv2d 1×1 | - | k | - | -
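A depth-wise separable block can be sketched in PyTorch as follows. This is a simplified MobileNet-V1-style block (the grouped convolution performs the per-channel depthwise filtering and the 1 × 1 convolution is the point-wise step); the layer ordering and normalisation choices are illustrative assumptions.

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 convolution (one filter per input channel) followed by a 1x1 pointwise convolution."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=stride,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))     # lightweight per-channel filtering
        return self.relu(self.bn2(self.pointwise(x)))  # linear combination across channels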
4) Inflated 3D ConvNet (I3D): The previously discussed feature extraction backbones are image-based networks, capturing only appearance information from an input image. However, in video-based action recognition, capturing temporal information alongside spatial features is vital. One way of achieving this is by utilising 3D convolutional neural networks. 3D convolutional networks are similar to standard convolutional networks; however, they are capable of creating a hierarchical representation of spatio-temporal data through the use of spatio-temporal filters. The limitation of such 3D convolutional networks is that the training process becomes more difficult due to the additional kernel dimension compared to 2D convolutional networks. Therefore, most previous state-of-the-art 3D networks are relatively shallow. For example, the C3D network [158] is composed of only 8 convolutional layers. In [30], the authors introduced a novel model, the two-stream Inflated 3D ConvNet (I3D), which proposed inflating a 2D CNN (i.e. Inception-V1 [26]) such that the filters and pooling kernels within an existing deep CNN model are expanded with a third, temporal dimension, allowing the network to learn spatio-temporal features from video inputs. An N × N 2D filter is inflated to an N × N × N filter with the addition of a temporal dimension.
As outlined in [30], compared to the C3D network, the I3D architecture is much deeper, yet contains fewer parameters. Fig. 18 shows the network architecture of Inflated Inception-V1. In this network, the first two pooling layers do not perform temporal pooling (1 × 3 × 3 kernels with a temporal stride of 1 are used), while the remaining max-pooling layers have symmetric kernels and strides. However, as stated in [30], the final model is trained on 64-frame snippets, which can be extracted across a large temporal receptive field. As such, the model captures long-term temporal information through the fusion of both RGB and flow streams. The authors further improved this architecture by individually training two networks on RGB and optical flow inputs respectively, and performing average pooling afterwards. This two-stream architecture has been widely used in action recognition research as a feature extractor. For example, in [20], [65], [63], 2048-dimensional feature vectors are formed by concatenating the 1024-dimensional features extracted by each of the two networks (RGB and optical flow).

Fig. 18. Inflated Inception-V1 Architecture. Recreated from [26]
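The inflation step at the core of I3D can be sketched as follows: a 2D kernel is repeated along a new temporal axis and rescaled so that, on an input that is constant over time, the inflated filter initially produces the same response as the original 2D filter. Tensor shapes follow PyTorch conventions; this is a sketch of the idea described in [30], not the reference implementation.

import torch

def inflate_2d_kernel(weight_2d, time_dim):
    """Inflate a 2D conv kernel (out_c, in_c, kH, kW) into a 3D kernel (out_c, in_c, T, kH, kW).

    The kernel is repeated T times along the temporal axis and divided by T, so a
    video of identical frames initially yields the same activations as the 2D network.
    """
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return weight_3d / time_dim

# Example: inflating a 7x7 kernel from a 2D backbone into a 7x7x7 spatio-temporal kernel.
w2d = torch.randn(64, 3, 7, 7)
w3d = inflate_2d_kernel(w2d, time_dim=7)   # shape: (64, 3, 7, 7, 7)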
APPENDIX B
OBJECT DETECTION AND TRACKING

Table X lists selected samples from the literature that have demonstrated the benefits of incorporating detection and tracking techniques to aid action recognition.

TABLE X
EXAMPLE METHODS WHICH HAVE USED OBJECT DETECTION TO SUPPORT ACTION RECOGNITION OR SEGMENTATION.
Authors | Detector + Tracker | Action rec. model | Dataset | Additional remarks
Wang et al. (2016) [159] | Faster R-CNN | SR-CNN | UCF101 | No tracking
Zhang et al. (2018) [160] | Faster R-CNN + distance-based tracker | Two-stream CNN | J-HMDB, UCF | -
Tsai et al. (2020) [161] | YOLOv3 + DeepSORT | I3D [30] | NTU RGB+D | -
Ishioka et al. (2020) [162] | Faster R-CNN + Kalman filter | I3D [30] | Construction site | -
Ali et al. (2022) [163] | YOLOv3 + DeepSORT | I3D [30] | Activis | Diagnosis of children with autism
Ahmed et al. (2021) [164] | Faster R-CNN + SORT | MS-TCN++ | SARAH tasks | Stroke rehabilitation assessment
Ahmedt-Aristizabal et al. (2019) [165] | Mask R-CNN | Skeleton-based (LSTM Pose Machine) + LSTM; CNN+LSTM | Mater Hospital | Diagnosis of epilepsy
Chen et al. (2020) [166] | YOLOv3 | Skeleton-based (CPM) | Assembly action | -
Ghosh et al. (2020) [167] | Faster R-CNN | Skeleton-based (STGCN [168]) | CAD120 | Object detector as feature extractor; dataset provides skeletal data
Li et al. (2021) [169] | OpenPose [170] | Skeleton-based (PR-GCN [169]) | CAD120 | -
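As a concrete illustration of this detect-then-recognise pipeline, the sketch below uses torchvision's pre-trained Faster R-CNN to obtain person boxes, which could then be cropped and passed to an action recognition model. The confidence threshold and the downstream recognition step are placeholder choices rather than settings taken from the surveyed works.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(pretrained=True).eval()
PERSON_CLASS, SCORE_THRESH = 1, 0.8   # COCO label 1 is "person"; the threshold is illustrative

@torch.no_grad()
def person_boxes(frame):
    """Return high-confidence person boxes for a single RGB frame (3, H, W) with values in [0, 1]."""
    output = detector([frame])[0]
    keep = (output["labels"] == PERSON_CLASS) & (output["scores"] > SCORE_THRESH)
    return output["boxes"][keep]       # (N, 4) boxes in (x1, y1, x2, y2) format

# Each box can then be cropped from the frame (or a tracklet of frames) and passed to an
# action recognition backbone, as done by the methods summarised in Table X.
boxes = person_boxes(torch.rand(3, 480, 640))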
A. Object Detection
In this subsection, we introduce some relevant object detector architectures from the literature.
1) Anchor-based: Two-stage Frameworks:
R-CNN series. Faster R-CNN [86] comprises two modules: an RPN, which generates a set of object proposals, and an object detection network, based on the Fast R-CNN detector [171], which refines the proposal locations. One extension to Faster R-CNN is the feature pyramid network (FPN) [81], which provides a robust way to deal with images of different scales while maintaining real-time performance. Another extension is the region-based fully convolutional network (R-FCN) [87], which seeks to make the Faster R-CNN network faster by making it fully convolutional and delaying the cropping step. To reduce imbalance at the sample, feature, and objective level during training, Libra R-CNN [88] integrates three components: intersection over union (IoU)-balanced sampling, a balanced feature pyramid, and a balanced L1 loss. Finally, Mask R-CNN [89] is an extension of the Faster R-CNN architecture which adds a branch for predicting segmentation masks for each region of interest, in parallel with the existing branch for classification and bounding box regression. Mask R-CNN can be seen as a more accurate object detector through its inclusion of a feature pyramid network.
Chained cascade network and Cascade R-CNN [90], [91] are multi-stage extensions of R-CNN frameworks. They consist of a sequence of detectors trained end-to-end with increasing IoU thresholds, to be sequentially more selective and thus reduce false positives.
TridentNet [92] is an object detector that generates scale-specific feature maps using a parallel multi-branch architecture (trident blocks), in which each branch shares the same transformation parameters but has a different receptive field through dilated convolutions.
2) Anchor-based: One-stage Frameworks:
Single shot multibox detector (SSD) [93] is a single-shot detector. As such, it predicts the bounding boxes and classes directly from feature maps in a single pass. To improve accuracy, SSD introduces small convolutional filters to predict object classes and offsets to a default set of boundary boxes. Deconvolutional single shot detector (DSSD) [94] is a modified version of SSD which adds a prediction and deconvolution module. DSSD uses an integration structure similar to the FPN in SSD to generate an integrated feature map.
RetinaNet [95] is a detector composed of a backbone network and two task-specific subnetworks. The backbone, an off-the-shelf convolutional network, is responsible for computing a convolutional feature map over the entire input image. The first subnetwork performs classification on the backbone's output; the second performs convolutional bounding box regression. RetinaNet uses an FPN to replace the multi-CNN layers in SSD, integrating features from higher and lower layers in the backbone network.
M2det [96] uses a multi-level feature pyramid network (MLFPN) to generate more effective feature pyramids. Features with multiple scales and from multiple levels are processed as per the SSD architecture to obtain bounding box locations and classification results in an end-to-end manner.
EfficientDet [82] is an object detector that adopts EfficientNet and several optimisation strategies, such as the use of a BiFPN to determine the importance of different input features, and a compound scaling method that uniformly scales the depth, width and resolution, as well as the box/class network, at the same time.
You only look once (YOLO) family of detectors. Among well-known single-stage object detectors, YOLO [105] and YOLOv2 [106] have demonstrated impressive speed and accuracy. This detector can run well on low-powered hardware, thanks to its intelligent and conservative model design. YOLOv3 [79] makes an incremental improvement to YOLO by adopting the Darknet-53 backbone network as the feature extractor. YOLOv3-tiny is a compact version of YOLOv3 that has only nine convolutional layers and six pooling layers, making it much faster but less accurate. YOLOv3-SPP is a revised YOLOv3, which has one SPP module [83] in front of its first detection header.
Other relevant models in the YOLO series are: YOLOv4 [85], which is composed of CSPDarknet53 as the backbone, an additional SPP module, a PANet as the neck, and a YOLOv3 head. CSPDarknet53 is a novel backbone that can enhance the learning capability of the CNN by integrating feature maps from the beginning and the end of a network stage. The BoF for YOLOv4 backbones include CutMix and Mosaic data augmentation [107], DropBlock regularisation [108], and class label smoothing [73]. The BoS for the same CSPDarknet53 are Mish activation [109], cross-stage-partial-connections (CSP) [80], and multi-input weighted residual connections (MiWRC). Paddle-Paddle YOLO (PP-YOLO) [172] is based on YOLOv4 with a ResNet50-vd backbone, an SPP for the top feature map, an FPN as the detection neck, and the detection head of YOLOv3. An optimised version of this model was published as PP-YOLOv2 [173], where a PANet is included for the FPN to compose bottom-up paths. Finally, the scaling cross stage partial network (Scaled-YOLOv4) [97] achieves one of the best trade-offs between speed and accuracy. In this approach, YOLOv4 is redesigned to form YOLOv4-CSP with a network scaling approach that modifies not only the network depth, width and resolution, but also the structure of the network. Thus, the backbone is optimised and the neck (PANet) uses CSP and Mish activations.
3) Anchor-free Frameworks: As discussed previously, use of an anchor-free mechanism significantly reduces the number of hyperparameters and the design choices which need heuristic tuning and expertise, making detector configuration, training and decoding considerably simpler. Anchor-free object detectors can be categorised into keypoint-based approaches and center-based approaches. The former predicts predefined key points from the network, which are then utilised to generate the bounding box around an object and classify it. The latter employs the center point or any part-point of an object to define positive and negative samples, and from these points predicts the distances to the four coordinates used to generate a bounding box.
DeNet [98] is a two-stage detector which first determines how likely it is that each location belongs to either the top-left, top-right, bottom-left or bottom-right corner of a bounding box. It then generates ROIs by enumerating all possible corner combinations, and rejects poor ROIs with a sub-detection network. Finally, the model classifies and regresses the ROIs.
CornerNet [99] is a one-stage approach which detects objects represented by paired heatmaps, the top-left corner and bottom-right corner, using a stacked hourglass network [174]. The network uses associative embeddings and predicts a group of offsets to group corners and produce tighter bounding boxes. One extension to this model is CornerNet-Lite [100], which combines two variants (CornerNet-Saccade and CornerNet-Squeeze) to reduce the number of pixels processed and the amount of processing per pixel, respectively. CornerNet-Saccade uses an attention mechanism to eliminate the need for exhaustively processing all pixels of the image, and CornerNet-Squeeze introduces a compact hourglass backbone that leverages ideas from SqueezeNet and MobileNets.
CenterNet (objects as points) [101] considers the center of a box as an object and a key point, and then uses this predicted center to find the coordinates/offsets of the bounding box. The model uses two prediction heads, one to predict the confidence heat map, and the other to predict regression values for box dimensions and offsets. This helps the model remove the Non-Maximum Suppression step commonly required during post-processing. Another variant of this keypoint-based approach, CenterNet (keypoint triplets) [102], detects each object as a triplet using a centre keypoint and a pair of corners. The model introduces two specialised modules, cascade corner pooling and centre pooling, which enrich the information acquired by the top-left and bottom-right corners and extract richer data from the central regions. If a center keypoint is detected in the central region, the bounding box is preserved.
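A simplified sketch of this "objects as points" decoding is given below: a peak in the centre heatmap is combined with the predicted offset and size to recover one box. This is a single-class, single-peak illustration with an assumed output stride of 4; the actual decoder extracts multiple peaks per class via max-pooling.

import torch

def decode_center(heatmap, offsets, sizes, stride=4):
    """Decode one box from CenterNet-style outputs.

    heatmap: (H, W) centre confidences; offsets and sizes: (2, H, W) per-location predictions.
    """
    idx = torch.argmax(heatmap)
    cy, cx = divmod(idx.item(), heatmap.shape[1])
    # Refine the centre with the sub-pixel offset, then map back to the input resolution.
    centre_x = (cx + offsets[0, cy, cx]) * stride
    centre_y = (cy + offsets[1, cy, cx]) * stride
    w, h = sizes[0, cy, cx] * stride, sizes[1, cy, cx] * stride
    return torch.stack([centre_x - w / 2, centre_y - h / 2,
                        centre_x + w / 2, centre_y + h / 2])   # (x1, y1, x2, y2)

# Example with random network outputs on a 128 x 128 output grid.
box = decode_center(torch.rand(128, 128), torch.rand(2, 128, 128), torch.rand(2, 128, 128))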
FCOS [103] (fully convolutional one-stage detector) is a center-based approach which computes per-pixel predictions in a fully convolutional manner (i.e. it performs object detection like segmentation). FCOS is constructed on top of an FPN which acts as a pyramid to aggregate multi-level features from the backbone. FPN predictions are obtained across five feature levels. The outputs are then fed to a detection head (per-pixel predictions) consisting of three branches. FCOS introduces a third branch, called the centerness branch, in addition to the two typical branches, classification and regression. The centerness branch provides a measure of how centred the positive sample location is within the regressed bounding box, which improves the performance of anchor-free detectors and brings them on par with anchor-based detectors.
YOLOX [104] adapts the YOLO series to an anchor-free setting, and incorporates other improvements such as a decoupled head and an advanced label assignment strategy, SimOTA, based on OTA [175].
Other relevant anchor-free architectures are RepPoints [176] and RepPoints v2 [177], FSAF [178], and ExtremeNet [179].

B. Multi-object Tracking
In this subsection, we introduce well-known tracking algorithms under the categories of separate detection and embedding (SDE) and joint detection and embedding (JDE) methods.
1) SDE Frameworks:
Simple online and realtime tracking (SORT) [112] is a framework that combines location and motion cues in a very simple way. The model uses Kalman filtering [180] to predict the location of the tracklets in the next frame, and then performs the data association using the Hungarian method [181] with an association metric that measures bounding box overlap.
DeepSORT [113] augments the overlap-based association cost in SORT by integrating a deep association metric and appearance information. The Hungarian algorithm is used to resolve associations between the predicted Kalman states and newly-arrived measurements. The tracker is trained in an entirely offline manner. At test time, when tracking novel objects, the network weights are frozen, and no online fine-tuning is required.
ByteTrack [114] keeps all detection boxes (detected by YOLOX) and associates across every box instead of only the high-scoring boxes, to reduce missed detections. In the matching process, an algorithm called BYTE first predicts the tracklets using a Kalman filter, then matches them with high-scoring detected bounding boxes using motion similarity. Next, the algorithm performs a second matching between the detected bounding boxes with lower confidence values and the objects in the tracklets that could not be matched.
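The association step shared by these SDE trackers can be sketched as an IoU cost matrix solved with the Hungarian method. The snippet below is a simplified illustration of that matching step: scipy's linear_sum_assignment stands in for a dedicated Hungarian implementation, the Kalman prediction of the track boxes is assumed to have already been performed, and the IoU threshold is an illustrative value.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_tracks, detections, iou_threshold=0.3):
    """Match Kalman-predicted track boxes to new detections by maximising total IoU."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in predicted_tracks])
    track_idx, det_idx = linear_sum_assignment(cost)
    # Discard matches whose overlap falls below the threshold, as done in SORT [112].
    return [(int(t), int(d)) for t, d in zip(track_idx, det_idx)
            if 1.0 - cost[t, d] >= iou_threshold]

# Example: two predicted track boxes matched against two new detections.
tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]
detections = [[21, 19, 31, 29], [1, 1, 11, 11]]
print(associate(tracks, detections))   # [(0, 1), (1, 0)]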
2) JDE Frameworks:
Tracktor [115] uses the Faster R-CNN framework to directly take the tracking results of previous frames as the ROI. The model then removes the box association by directly propagating the identities of region proposals using bounding box regression.
CenterTrack [116] is based on the CenterNet model [101], and predicts the offsets of objects relative to the previous frame while detecting objects in the next frame.
FairMOT [117] uses an encoder-decoder network to extract high resolution features from an image, and thus incorporates a Re-ID module within CenterTrack [116]. The Re-ID features are an additional target for CenterNet to regress to, in addition to the centre points, the object size, and the offset.
SiamMOT [118] combines the Faster R-CNN model with two motion models based on siamese single-object tracking [182]: an implicit motion model and an explicit motion model. The motion model estimates an instance's movement between two frames such that detected instances are associated.
Other relevant real-time multi-object trackers that have a JDE framework are QDTrack [183], TraDeS [184], CorrTracker [185], and transformer-based tracking models (TransTrack [186], MOTR [187]).

APPENDIX C
ACTION SEGMENTATION DATASETS

In this section, we discuss action segmentation datasets that are widely used in the current literature.

A. Breakfast [68]
This dataset contains 50 fine-grained actions related to breakfast preparation, performed by 52 different subjects in 18 different kitchens. This dataset offers an uncontrolled setting in which to evaluate action segmentation models. The dataset is captured from different camera types, including webcams, standard industry cameras and stereo cameras. Furthermore, the videos are captured from different viewpoints. The provided dataset has a resolution of 320 × 240 pixels with a frame rate of 15 fps.

B. 50Salads [59]
This dataset offers a multi-modal human action segmentation challenge where the provided data contains RGB video data and depth maps sampled at 640 × 480 pixels at 30 fps, as well as 3-axis accelerometer data at 50 fps. This dataset includes 50 sequences of people preparing a mixed salad, with two sequences per subject. In total there are 52 fine-grained action classes. Specifically, this dataset offers a setting in which to train models to recognise manipulative gestures such as hand-object interactions, which are important in food preparation, manufacturing, and assembly tasks.

C. MPII cooking activities dataset [3]
The dataset consists of 12 subjects performing 65 fine-grained cooking related activities. The dataset is captured from a single camera view at a 1624 × 1224 pixel resolution with a frame rate of 30 fps. The dataset consists of 44 videos and has a total length of more than 8 hours. The duration of each video ranges from 3 to 41 minutes. To maintain a realistic recording setting, the dataset authors did not provide instructions for individual activities, but informed the participants in advance what dish they are required to prepare (e.g. salad), the ingredients to use (cucumber, tomatoes, cheese), and the utensils (i.e. grater) that they can use. Therefore, this dataset provides a real-world action segmentation setting, with subject-specific variations across the same activity due to personal preferences (e.g. washing a vegetable before or after peeling it).

D. MPII cooking 2 dataset [119]
This dataset contains 273 videos with a total length of more than 27 hours, captured from 30 subjects preparing a certain dish, and comprising 67 fine-grained actions. Similar to the MPII cooking activities dataset, this dataset is captured from a single camera at a 1624 × 1224 pixel resolution, with a frame rate of 30 fps. The camera is mounted on the ceiling and captures the front view of a person working at the counter. The duration of the videos ranges from 1 to 41 minutes. As per the MPII cooking activities dataset, this dataset offers different subject-specific patterns and behaviours for the same dish, as no instructions regarding how to prepare a certain dish are provided to the participants. In addition to video data, the dataset authors provide human pose annotations, their trajectories, and text-based video script descriptions.
E. EPIC-KITCHENS-100 [60]
EPIC-KITCHENS-100 is a recent and popular dataset within the action segmentation research community. Data is captured from 45 different kitchens using a head mounted GoPro Hero 7. This egocentric perspective of the actions provides a challenging evaluation setting, as the hand-object interactions and critical informative regions are sometimes occluded by the hand, or occur out of the camera's field of view.
In total there are 89,977 action segments of fine-grained actions extracted from 700 videos, with a total duration exceeding 100 hours. The dataset has a resolution of 1920 × 1080 pixels with a frame rate of 50 fps. This dataset follows a verb-noun action labelling format, where an action class such as "put down gloves" is broken into the verb "put down" and the noun "gloves". In total this dataset consists of 97 verb and 300 noun classes. In addition to video data, the dataset authors provide hand-object detections which are extracted from two state-of-the-art object detectors, Mask R-CNN [89] and the model of Shan et al. [188], and provide these as an additional feature modality.

F. GTEA [67]
The Georgia Tech Egocentric Activities (GTEA) dataset offers an egocentric action segmentation evaluation setting, where data is captured from a head mounted GoPro camera with a 1280 × 720 pixel resolution capturing at 15 fps. In total the dataset contains 28 videos where 7 different food preparation activities (such as Hotdog Sandwich, Instant Coffee, Peanut Butter Sandwich) are performed by 4 subjects. This dataset offers 11 different fine-grained action classes, and on average there are 20 action instances per video. In addition to action annotations, this dataset offers hand and object segmentation masks as well as object annotations, which can be used as additional information cues when training action segmentation models.

G. ActivityNet [120]
ActivityNet consists of videos that are obtained from online video sharing sites, and comprises 648 hours of video recordings. In contrast to the previously described datasets, where there are multiple actions per video, ActivityNet contains only one or two action instances per video. This dataset has 203 different action classes and an average of 193 video samples per class. The majority of videos have a duration between 5 and 10 minutes, and are captured at 1280 × 720 pixel resolution at 30 fps. This dataset offers a balanced distribution of actions which include personal care, sports and exercises, socialising and leisure, and household activities.

H. THUMOS15 [121]
This dataset contains footage drawn from public videos on YouTube. The dataset contains more than 430 hours of video, and manual filtering was conducted to ensure that the videos contain visible actions. Similar to ActivityNet, THUMOS15 also contains one or two action instances per video. Specifically, this dataset has 101 action classes, and the authors provide semantic attributes for the videos such as visible body parts, body motion, visible objects and the location of the activity, in addition to action class labels. The majority of the videos in this dataset have a short duration, where the average length of an action is 4.6 seconds.

I. Toyota Smart-home Untrimmed dataset [122]
The dataset consists of videos recorded at 640 × 480 pixel resolution at a 20 fps frame rate using Microsoft Kinect sensors. Each action is captured from at least 2 distinct camera angles, and the dataset offers 3 modalities: RGB, depth and the 3D pose of the subject. In total this dataset contains 536 videos with an average duration of 21 minutes. There are 51 unique fine-grained action classes in this dataset, with high intra-class variation within each video. Furthermore, this dataset captures a real-world daily living setting with various spontaneous behaviours, partially captured or occluded frames, as well as elementary activities that do not follow a specific temporal ordering. Therefore, this dataset offers a challenging real-world evaluation setting in which to test action segmentation models.

J. FineGym [123]
This dataset is composed of videos which capture professional gymnastic performances, across 303 videos for a total duration of 708 hours. These videos were sub-sampled into 32,697 samples which range in length from 8 to 55 seconds. The majority of the videos are captured at either 720P or 1080P resolution. Compared to other publicly available datasets, this dataset offers a rich annotation structure where the action labels are categorised using a three-level semantic hierarchy. For instance, the action "balance beam" contains fine-grained actions such as "leap-jumphop" and "beam-turns", and these sub-actions are further divided into finely defined class labels such as "Handspring forward with leg change" and "Handspring forward". This three-level annotation structure offers 99, 288 and 530 unique fine-grained action classes. Furthermore, this dataset is highly diverse in terms of viewpoints and poses due to the "in the wild" nature of the videos used.

APPENDIX D
ADAPTING TO REAL-WORLD APPLICATIONS: RUNTIMES

In Table XI we provide evaluation times in seconds for models with different buffer sizes, which were introduced in Sec. IV-E of the main paper.

TABLE XI
EVALUATION TIMES IN SECONDS FOR DIFFERENT BUFFER SIZES. NOTE THAT RUNTIME IS CALCULATED ONLY FOR THE EVALUATION OF THE ACTION SEGMENTATION MODEL. FEATURE EXTRACTION TIME IS NOT TAKEN INTO CONSIDERATION.
Model | Buffer Size | Runtime (s)
MSTCN | 100 | 0.1129
MSTCN | 250 | 0.3502
MSTCN | 500 | 0.7291
MSTCN | 750 | 1.0435
MSTCN | 1000 | 1.4105
MSTCN++ | 100 | 0.1698
MSTCN++ | 250 | 0.3416
MSTCN++ | 500 | 0.7196
MSTCN++ | 750 | 1.1406
MSTCN++ | 1000 | 1.4706
SSTDA | 100 | 0.3877
SSTDA | 250 | 0.7610
SSTDA | 500 | 1.6754
SSTDA | 750 | 2.1942
SSTDA | 1000 | 3.2258
DGTRM | 100 | 1.6014
DGTRM | 250 | 3.9543
DGTRM | 500 | 8.8306
DGTRM | 750 | 17.7307
DGTRM | 1000 | 24.8198
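A sketch of how such a measurement could be reproduced is given below; the stand-in model, the 2048-d feature size and the CPU device are placeholder assumptions, so absolute timings will differ from those in Table XI.

import time
import torch

@torch.no_grad()
def time_inference(model, feature_dim=2048, buffer_sizes=(100, 250, 500, 750, 1000), device="cpu"):
    """Report the forward-pass time of an action segmentation model for several buffer sizes."""
    model = model.to(device).eval()
    for buffer_size in buffer_sizes:
        # A buffer of pre-extracted frame features; feature extraction time is excluded,
        # matching the protocol of Table XI.
        features = torch.randn(1, feature_dim, buffer_size, device=device)
        start = time.perf_counter()
        model(features)
        # Note: on a GPU, torch.cuda.synchronize() should be called before reading the timer.
        print(f"buffer={buffer_size}: {time.perf_counter() - start:.4f} s")

# Example with a trivial stand-in model that maps frame features to 48 class scores.
time_inference(torch.nn.Conv1d(2048, 48, kernel_size=3, padding=1))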
