
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2019.2931471, IEEE Access.

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI

Object-level Trajectories based Fine-Grained Action Recognition in Visual IoT Applications
JIAN XIONG, (Member, IEEE), LIGUO LU, HENGBING WANG, JIE YANG, AND GUAN GUI,
(Senior Member, IEEE).
College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China, 210003.
Corresponding authors: Jian Xiong (e-mail: jxiong@njupt.edu.cn) and Guan Gui (e-mail: guiguan@njupt.edu.cn).
This work was supported in part by the National Natural Science Foundation of China under Grants 61701258, 61701310, 61801219, and 61876093; the Natural Science Foundation of Jiangsu Province under Grants BK20170906 and BK20181393; and the Natural Science Foundation of Jiangsu Higher Education Institutions under Grants 17KJB510044 and 17KJB510038.

ABSTRACT Emerging computer vision and deep learning technologies are being applied to the intelligent analysis of sports training videos. In this paper, a deep learning based fine-grained action recognition (FGAR) method is proposed to analyze soccer training videos. The proposed method was applied to indoor training equipment to evaluate whether a player has stopped a soccer ball successfully or not. Firstly, the problem of FGAR is modeled as human-object (player-ball) interactions, and object-level trajectories are proposed as a new descriptor to identify fine-grained sports videos. The proposed descriptor can take advantage of high-level semantics and human-object interaction motions. Secondly, a cascaded scheme of deep networks based on the object-level trajectories is proposed to realize FGAR. The cascaded network is constructed by concatenating a detector network with a classifier network (a long short-term memory (LSTM) based network). The cascaded scheme combines the high efficiency of the detector for object detection with the outstanding performance of the LSTM-based network for processing time series. Experimental results show that the proposed method can achieve an accuracy of 93.24%.

INDEX TERMS Event Classification, Object Detection Network, Sports Video Analysis, LSTM

I. INTRODUCTION
In recent years, with the rapid development of information processing and Internet of Things (IoT) technologies [1]–[5], sports services and sports training tend to be digital and intelligent. Specifically, advanced technologies such as wearable devices [1], [2], virtual reality [6]–[8], machine learning [9], [10], data mining [11], [12], and cloud computing [13] are applied to obtain and analyze sports data. Generally, the dynamic sports data are in the form of various heterogeneous physiological signals [12], such as heart rate, oxygen consumption, carbon dioxide, oxygen concentration, electromyography (EMG), electroencephalogram (EEG), and blood lactate. These data are usually collected by a set of external sensors and wearable devices, such as spirometers, ergometers [12], the Myo armband [7], electrode caps [11], and microwave sensors [1]. However, for collecting more details about the athlete's postures, decisions, and actions, visual sensors are better solutions than most other sensors, because video cameras can capture a fair amount of motion information [14]–[16], such that they are widely used in video surveillance applications [17], [18]. The details can be obtained by analyzing the motion information using computer vision and deep learning technologies [19]–[22].

Many efforts have been made to apply visual analysis to sports video data. With respect to better assisting sports training and sports video broadcasting, some works address player detection [23]–[25], ball detection [26], and posture estimation [27]–[29]. Furthermore, in order to reduce labor-intensive tasks, other works [30]–[32] focus on automatic sports video editing and summarization.

Besides the above, in order to extract more important details from sports videos, we often need fine-grained classification within the same type of sport. For example, golf events were classified into full swing, incomplete swing, and irrelevant events [33]. Soccer game events were divided into attack, defense, middle game, and goal opportunity [34]. Such work is defined as fine-grained action recognition


This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/.

(FGAR). FGAR is a major problem in the analysis of sports training videos. It differs from traditional action recognition, such as the activity classification in the Olympic Sports Dataset¹, UCF101 Dataset², and HMDB51 Dataset³, in which the motions and even the scenes have obvious differences. Traditional action recognition methods are generally based on dense trajectory features [35]–[38]. However, in the field of FGAR, the motions and the scenes generally have no obvious difference; the major difference lies in the human-object interaction motions. Nevertheless, the dense trajectory-based features used in traditional methods cannot represent the human-object interactions efficiently.

In this paper, we propose an object-level trajectory-based FGAR method for analyzing soccer training videos. Indoor training equipment was installed to improve the players' passing accuracy and first-touch skill. The first touch is defined as the first stopping and control of the received ball. That is, evaluating the success rate of the first touch is to judge whether a player has stopped a soccer ball successfully or not. In order to obtain the success rate automatically and in a timely manner, we handle it as a typical FGAR problem. The main contributions are listed as follows.
1) We have collected and annotated a soccer training dataset containing in total 2543 ball-stopping events, including 1597 successful events and 946 failed ones. This dataset can be used in further research on the FGAR problem and sports analysis.
2) The problem of FGAR is modeled as human-object (player-ball) interactions. The object-level trajectories are proposed as a new descriptor to identify fine-grained sports videos. The object-level descriptor can take advantage of the high-level semantics and the human-object interaction motions.
3) A cascaded scheme of deep networks based on the object-level trajectories is proposed to realize FGAR. The cascaded network is constructed by concatenating a detector network (the YOLOv3 network) with a classifier network (a long short-term memory (LSTM) based network). The cascaded scheme combines the high efficiency of the detector for object detection with the outstanding performance of the LSTM-based network for processing time series.

The remainder of the paper is organized as follows. The related works are introduced in Section II. The background and the problem are stated in Section III. Observations and analysis are presented in Section IV. The proposed object-level trajectory-based cascaded network is presented in Section V. Experiments and further discussion are provided in Section VI. Finally, we draw conclusions in Section VII.

1 http://vision.stanford.edu/Datasets/OlympicSports/
2 https://www.crcv.ucf.edu/research/data-sets/human-actions/ucf101/
3 http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/#dataset

II. RELATED WORKS
A. VISUAL ANALYSIS IN SPORTS
Visual analysis has been widely applied to sports services and sports training. Firstly, visual analysis was used to track and identify the players. Noor et al. [23] proposed a method for estimating the number of soccer players by using computer graphics and virtual simulations. Lu et al. [24] presented a method for tracking and identifying the player in broadcast sports videos; the players were identified and tracked using a DPM detector and a Kalman filter, respectively. Arda et al. [25] proposed part-based player identification using deep convolutional representations and multi-scale pooling. Vito et al. [26] utilized convolutional neural networks for ball detection in tennis games. Secondly, visual analysis was used for pose estimation of the athletes. Hwang et al. [28] proposed an athlete pose estimation method that combines global and local information in 2D images. Chen et al. [27] utilized Kinect video analysis for rectifying yoga postures in a self-training system. Sumer et al. [29] presented a self-supervised learning method for pose estimation in sports videos. Thirdly, video analysis was used to reduce labor-intensive and time-consuming work in broadcast video editing. Deep action recognition features were used for summarization of user-generated sports videos by Antonio et al. [30]. Multimodal excitement features were utilized for automatic curation of golf highlights by Merler et al. [31]. An automatic cricket highlight generation method [32] was proposed using event-driven and excitement-based features.

B. FINE-GRAINED ACTION RECOGNITION
Fine-grained action recognition is crucial for extracting more important details from sports videos. Wang et al. [33] proposed a hidden conditional random field (HCRF) model for sports video event classification using independent component analysis (ICA) mixture feature functions. For example, the model was applied to classify golf events into three classes: full swing, incomplete swing, and other irrelevant events. Cioppa et al. [34] presented a bottom-up approach for the interpretation of soccer game videos based on semantic features. Using deep neural networks, the soccer game events were classified into four classes: attack, defense, middle game, and goal or goal opportunity. In [39], diving video clips were extracted and classified by using a 3D convolutional neural network; the diving events were classified along dimensions including handstand, rotation type, somersaults and twists, and pose type. In order to obtain player statistics, Theagarajan et al. [40] proposed a visual analysis method to obtain the possession percentage by classifying the soccer teams and identifying the player controlling the ball.

C. TRADITIONAL ACTION RECOGNITION
Recently, dense trajectories, as well as optical flow, have been widely used as features in event classification and action recognition. Wang et al. [35] proposed an action recognition

method based on dense trajectories. The dense trajectories were extracted by tracking feature points via a dense optical flow algorithm, where the feature points are densely sampled on dense grids at multiple spatial scales. To reduce the impact of inconsistent matches, the authors improved the dense trajectories by estimating the camera motion [36]. Peng et al. [37] proposed a video representation named stacked Fisher vector (SFV) for action recognition, in which the dense trajectories are encoded with Fisher vectors nested in two layers. Karen et al. [38] presented a two-stream ConvNet architecture for action recognition, in which the spatial stream (single frame) and the temporal stream (multi-frame dense optical flow) are scored and fused to obtain the final results. After that, deep neural network based fusion methods [41]–[44] were proposed to improve the two-stream ConvNet architecture.

III. BACKGROUND AND PROBLEM STATEMENT
A. INTRODUCTION OF THE SOCCER TRAINING EQUIPMENT
Recently, new soccer training equipment was developed to improve the players' reflexes, first-touch skill, and pass accuracy. As shown in Figure 1, the training equipment is installed as a square indoor training ground. The equipment is composed of 8 ball ejectors and 64 small goals, i.e., at the middle of each side there are 2 ball ejectors and 16 goals (both in two rows). After a player enters the equipment and starts training, one ejector is randomly selected and ejects a soccer ball into the center of the ground. At the same time, a goal is randomly selected as the target goal. In the equipment, lights and sounds are used as indicators. According to the indicators, the player needs to stop the soccer ball and pass it into the target goal. Afterward, the player waits for the next ball. That is, the player on the training ground should continually stop and pass the ball until the end of the training.

FIGURE 1. The diagram of the indoor soccer training equipment.

B. PROBLEM STATEMENT
In this training, the success rate of the first touch and the pass accuracy are two important indices, which can be used to evaluate the players' ability and track the players' development in a timely manner. Firstly, pass accuracy is important in soccer, and it is widely used to evaluate the performance of a soccer player or a team. In this training equipment, the pass accuracy is evaluated as the success rate of the player scoring the target goal; an infrared transceiver module can easily distinguish whether the ball is kicked into the right goal or not. Secondly, the first touch is defined as the skill of first stopping and controlling the received soccer ball. It is considered to be the most important skill in soccer.⁴ A good first touch increases the players' ability to maintain possession, such that they have opportunities to use their other skills and can be more successful at high levels. However, the success rate of the first touch is not easy to evaluate automatically, and traditional sensors and IoT technologies cannot provide an efficient solution. Fortunately, visual sensors can pick up a large amount of information that is useful for event classification and action recognition. Therefore, in this work, we present a computer vision based assessment to obtain the success rate of the first touch.

4 https://www.active.com/soccer/articles/the-importance-of-the-first-touch

IV. OBSERVATION AND ANALYSIS
Assessment of the player's first touch is a typical FGAR problem: the scenes are essentially the same, and the players' motion patterns show no significant difference. By observing a large number of ball-stopping events, we found that the human-object (player-ball) interaction motions are significantly different between the successful events and the failed ones.

Figure 2 shows examples of the motion trajectories of the player and the ball, including successful cases (subfigures (a)-(c)), failures with touching (subfigures (d)-(f)), and failures without touching (subfigures (g)-(i)). Generally, for the successful events, the ball flies to the center of the training ground and then stops close to the player until the player kicks it to the target goal. In that case, the players' motion trajectories are relatively short, and the turning points of the soccer ball's motion trajectories are close to the players. However, for the failed events, there are usually two cases. The first is that the ball flies to the player and the player touches the ball but fails to stop it. This case generally happens when the player has chosen an unreasonable body part, posture, strength, foot shape, or striking spot on the ball. In that case, the ball will quickly bounce off the player, producing sudden deflections in the ball's motion trajectory. At the same time, the player will instinctively pursue the ball, so the players' trajectories are generally longer than those of the successful events. The second failed case is that the ball flies past the


FIGURE 2. Motion trajectories of both the player and the soccer ball in different cases: (a)-(c) successful cases, (d)-(f) failed cases with touching, (g)-(i) failed cases without touching.

player but the player did not touch the ball at all. As can be seen, in this case the motion trajectories of the balls are relatively smooth.

The above observations show that the motion trajectories have obvious differences. In this work, we propose the object-level motion trajectories as the descriptor of the sports videos (see Section V-C). On one hand, in FGAR applications, the scenes generally have no obvious difference while the motion trajectories are significantly different. Directly fusing the frames at the input layer may focus excessively on spatial information such as the scenes and local structures but cannot extract motion information efficiently [45]. In contrast, extracting object-level motion trajectories and then fusing them helps to obtain more distinguishable information. On the other hand, the object-level trajectories are different from dense trajectories, which are basically low-level features. The object-level trajectories can take advantage of high-level semantics and efficiently represent the human-object interaction motions.

V. THE PROPOSED METHOD
In this section, a deep learning based assessment method is proposed to evaluate the success rate of the first touch in soccer training. In Section V-A, we describe the overall framework of the proposed method, which is a cascaded deep neural network consisting of a detector and a classifier. The detector is briefly introduced in Section V-B. Then, we explain the extracted features in Section V-C. The classifier is described in Section V-D.

A. FRAMEWORK OF THE PROPOSED METHOD
The framework of the proposed method is shown in Figure 3. It is a cascaded model based on deep neural networks, consisting of a detector network and a classifier network. The input of the model is a series of successive video frames, which is called a ball-stopping event. It should be noted that the videos can be divided into a number of ball-stopping events (as shown in Figure 4). In practice, the events can be easily separated by recording the time when the ball ejectors eject the ball. For each ball-stopping event, all the frames are successively input into the detector to obtain the bounding boxes of the player and the ball.
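As a rough illustration of the cascaded scheme described above, the per-event processing can be sketched as follows. The callables `detect_frame` and `classify` are hypothetical stand-ins for the retrained YOLOv3 detector and the LSTM classifier introduced later; this is a sketch of the data flow under those assumptions, not the authors' implementation.

```python
from typing import Callable, List, Sequence, Tuple

Center = Tuple[float, float]  # (x, y) center of a detected bounding box

def run_cascade(frames: Sequence,
                detect_frame: Callable[[object], Tuple[Center, Center]],
                classify: Callable[[List[Tuple[float, ...]]], int]) -> int:
    """Sketch of the cascade: detector -> object-level features -> classifier.

    `detect_frame` returns (player_center, ball_center) for one frame and
    stands in for the retrained YOLOv3; `classify` maps the feature sequence
    f_0..f_{n-1} to 0 (failed) or 1 (successful) and stands in for the LSTM.
    """
    features: List[Tuple[float, ...]] = []
    prev_ball = None
    for frame in frames:
        (px, py), (sx, sy) = detect_frame(frame)
        # Instantaneous ball speed between consecutive frames (Section V-C).
        v = 0.0 if prev_ball is None else (
            (sx - prev_ball[0]) ** 2 + (sy - prev_ball[1]) ** 2) ** 0.5
        features.append((px, py, sx, sy, v))
        prev_ball = (sx, sy)
    if len(features) > 1:
        # The first frame has no predecessor, so v_0 is set to v_1.
        features[0] = features[0][:4] + (features[1][4],)
    return classify(features)
```

In deployment, one call per ball-stopping event would suffice, since the events are already delimited by the ejector timestamps.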

FIGURE 3. The framework of the proposed object-level trajectories based fine-grained action recognition (FGAR) method.

FIGURE 4. The instances of the ball-stopping events.

The outputs of the detector are a series of bounding boxes recorded by their coordinates. These series of coordinates are called object-level motion trajectories, since they are the coordinates of objects such as the player and the ball. Then, the features are extracted based on the object-level motion trajectories. Finally, based on the features, the ball-stopping events are classified into success and failure by using an LSTM neural network. The cascaded scheme combines the high efficiency of the detector for object detection with the outstanding performance of the LSTM-based network for processing time series.

B. DETECTION OF PLAYER AND SOCCER BALL
In order to extract the trajectories efficiently, the player and the soccer ball should be detected first. In this method, the players and the balls are detected by using YOLOv3 [46], which is one of the state-of-the-art detectors. Compared with its predecessor YOLO9000, YOLOv3 predicts boxes at three different scales, and more convolutional layers are added to process up-sampled features, such that YOLOv3 achieves better accuracy and efficiency.

It should be noted that, since a soccer ball moving at very high speed and captured by a regular camera shows an obvious motion-blur trail, a detector trained on images of static balls cannot detect the fast-moving one. Therefore, in this work, a new dataset was constructed to retrain the detector for locating the deformed ball more accurately. A total of 7946 images containing the player and the soccer ball were collected and labeled in this dataset. Figure 5 shows some examples of the labeled images. In order to verify the robustness of the proposed method, the dataset for adjusting the YOLOv3 performance is separate from the one used for the evaluation of ball-stopping events (introduced in Section VI-A).

Table 1 shows the average precisions (APs) of the detector with the original weights and the retrained weights, respectively. It is observed that the APs for both the players and the balls have improved after retraining with the new dataset. In particular, the AP for the balls has increased from 45.11% to 92.11%. Moreover, the retrained AP for the player is 99.94%, i.e., almost no player was missed. The high APs indicate that the retrained detector can extract near-complete trajectories.

TABLE 1. Average precisions (APs) of the YOLOv3 network with the original weights and the retrained weights.

                             Player    Soccer ball
Original Weights AP (%)      98.83     45.11
Retrained Weights AP (%)     99.94     92.11

C. FEATURE EXTRACTION
The observation and analysis indicate that object-level motion trajectories can efficiently represent the human-object interactions of ball-stopping events. In this section, we present the details of the object-level motion trajectory based features, including the motion trajectories and the instantaneous velocities.

In this work, each ball-stopping event is composed of a series of n successive frames (n is set to 45 in the experiments). For each ball-stopping event, the detector outputs a series of bounding boxes of the players and the soccer balls.
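To make the feature construction concrete, the following sketch turns the per-frame boxes (given as a left-bottom corner plus width and height) into a normalized, interpolated feature sequence f_t = (px_t, py_t, sx_t, sy_t, v_t). The function names and decomposition are ours, assumed for illustration; the computation follows the center extraction, velocity, unification, and interpolation steps of this section.

```python
import math
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_left, y_bottom, width, height)

def centers(boxes: List[Optional[Box]], width: float,
            height: float) -> List[Optional[Tuple[float, float]]]:
    """Normalized box centers; None marks a missed detection."""
    out: List[Optional[Tuple[float, float]]] = []
    for b in boxes:
        if b is None:
            out.append(None)
        else:
            x, y, w, h = b
            out.append(((x + w / 2) / width, (y + h / 2) / height))
    return out

def interpolate(track):
    """Fill missing centers linearly, weighting by frame distance."""
    known = [i for i, c in enumerate(track) if c is not None]
    filled = []
    for i, c in enumerate(track):
        if c is not None:
            filled.append(c)
            continue
        lo = max((k for k in known if k < i), default=None)
        hi = min((k for k in known if k > i), default=None)
        if lo is None:
            filled.append(track[hi])
        elif hi is None:
            filled.append(track[lo])
        else:
            w = (i - lo) / (hi - lo)
            filled.append(tuple((1 - w) * a + w * b
                                for a, b in zip(track[lo], track[hi])))
    return filled

def event_features(player_boxes, ball_boxes, width, height):
    """f_t = (px_t, py_t, sx_t, sy_t, v_t) for one ball-stopping event."""
    p = interpolate(centers(player_boxes, width, height))
    s = interpolate(centers(ball_boxes, width, height))
    v = [0.0] + [math.hypot(s[t][0] - s[t - 1][0], s[t][1] - s[t - 1][1])
                 for t in range(1, len(s))]
    if len(v) > 1:
        v[0] = v[1]  # no frame precedes t = 0
    return [(p[t][0], p[t][1], s[t][0], s[t][1], v[t])
            for t in range(len(s))]
```

Dividing by the video width and height makes the features comparable across the two recording resolutions, and the interpolation keeps the feature dimension fixed at n frames per event.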

Assuming that both the player and the ball are detected in each frame, there will be two bounding boxes per frame. The symbols p_t and s_t denote the bounding boxes of the player and the ball detected in the t-th frame, respectively, and (px_t, py_t) and (sx_t, sy_t) denote the coordinates of the center points of p_t and s_t. Two features are extracted based on the detected bounding boxes. It should be noted that the detected bounding boxes are given by their widths, heights, and the coordinates of their left-bottom points, from which the coordinates of the center points are easily calculated.

Firstly, the object-level motion trajectory is described as the series of coordinates of the center points of the bounding boxes, denoted as om_t = (px_t, py_t, sx_t, sy_t), t = 0, 1, ..., n − 1.

Secondly, in successful events, the ball stops close to the player for a short time, which is different from the failed events. Therefore, the instantaneous velocities of the ball are also extracted as a feature, again computed from the object-level motion trajectories. The instantaneous velocity of the ball is described as the instantaneous variation of the ball coordinates, expressed as v_t = √((sx_t − sx_{t−1})² + (sy_t − sy_{t−1})²), t = 1, 2, ..., n − 1. Note that, since there is no frame before the first frame of an event, it is reasonable to set v_0 = v_1.

In order to standardize the data, two operations are performed on the features, namely unification and interpolation. (1) Unification: The resolutions of the input videos are not all the same, which makes the scales of the coordinates different. Therefore, the features are unified by dividing the coordinates by the width and the height of the input videos, respectively. (2) Interpolation: Even though the APs of the retrained detector are significantly high (99% and 92% for players and soccer balls, respectively, as in Table 1), missed detections are inevitable; that is, not all the players and balls are successfully detected. To ensure the consistency of the feature dimensions, linear interpolation is conducted on the trajectories: the closest detected coordinates are used to interpolate the missing detections, with the frame distances used as the interpolation weights.

The final feature is constructed by combining the object-level motion trajectory om_t and the instantaneous velocity v_t, as f_t = (px_t, py_t, sx_t, sy_t, v_t), t = 0, 1, ..., n − 1. That is, a ball-stopping event is represented as (f_0, f_1, ..., f_{n−1}).

FIGURE 5. Examples of the labeled players and soccer balls.

D. CLASSIFICATION
The object-level trajectories are typical time series. In view of the high performance of the long short-term memory (LSTM) neural network in processing time series, an LSTM-based network is employed to classify the trajectory-based features. The LSTM network is constructed by stacking three basic LSTM cells, with the number of neurons set to 150 in each cell.

The input vector at each time step is the feature of the corresponding frame in the ball-stopping event. Specifically, the ball-stopping event is denoted as (f_0, f_1, ..., f_{n−1}), and the input vector at step t is f_t. The state of an LSTM cell is split into two vectors: the short-term state and the long-term state. The long-term state traverses the network, dropping some memories and adding some new ones. The loss is computed as the cross-entropy of the softmax outputs.

VI. EXPERIMENTAL RESULTS
The experiments were performed on a computer with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10 GHz and a GTX 1080Ti GPU with 6.4 TFLOPS of single precision, 484 GB/s of memory bandwidth, and 11 GB of memory. The proposed method was trained and evaluated on TensorFlow. In this section, we give the details of the experiments. Firstly, we describe the dataset which was used to train and validate the proposed method. Secondly, we present the overall results of the proposed method. Then, the proposed method is compared with state-of-the-art dense trajectory-based methods. Finally, we compare the object-level motion trajectories with some other features.

A. DATASET
The proposed method was trained and evaluated on a video dataset collected from the soccer training equipment. In total, 84 soccer players are included in the dataset. The players come from 5 different football clubs, with ages ranging from 9 to 28. By installing one camera in the soccer training equipment, we acquired 132 videos of the players' training status. It should be noted that the data collection lasted for more than half a year, during which time the camera was debugged by changing its resolution. Therefore, each video in the dataset has one of two resolutions: 704 × 576 (32 videos) or 1920 × 1080 (100 videos).

By separating and annotating the videos, we obtained 2543 ball-stopping events, of which 1597 are successful events and 946 are failed ones. In the experiments, 80% of the events were randomly selected as the training set and the remaining 20% were used as the testing set.

FIGURE 6. The confusion matrix of the proposed method.
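The accuracy and per-class recall values quoted for the confusion matrices in this section follow directly from the event counts. A minimal sketch (the counts below are those reported for Figure 6 in Section VI-B; the helper names are ours):

```python
def accuracy(matrix):
    """matrix[i][j] counts events of true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def recall(correct, missed):
    """Per-class recall, e.g. 218 failed events classified correctly vs 19 not."""
    return correct / (correct + missed)

# Counts reported for Figure 6: rows are the true (failed, successful) classes.
confusion = [[218, 19], [24, 375]]
```

The same helpers applied to the Figure 7 counts (378/21 successful, 193/44 failed) give the 94.73% and 81.43% recalls quoted in Section VI-C.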

B. THE OVERALL RESULTS OF THE PROPOSED METHOD
As described in the above section, the proposed features consist of the object-level motion trajectories and the instantaneous velocities, and the features are divided by the widths and heights of the input videos, respectively. In order to verify their efficiency, different combinations of the proposed features, with and without unification, are compared in the experiments. Table 2 shows the overall results of the proposed method. On one hand, the results show that unification improves the efficiency of the features significantly, except for the instantaneous velocities. On the other hand, it can be observed that if only the motion trajectories are used as features, the approach achieves an accuracy of 92.30%, and if only the instantaneous velocities are used, the accuracy is 87.11%. This indicates that both features are efficient descriptors of the ball-stopping events.

TABLE 2. The accuracies of the proposed method with different combinations of features (√ = used).

Motion Trajectories   Instantaneous Velocity   Unification   Accuracy (%)
√                                                            91.67
                      √                                      88.05
√                     √                                      92.14
√                                              √             92.30
                      √                        √             87.11
√                     √                        √             93.24

Furthermore, if both features are utilized, the proposed method achieves an accuracy of 93.24%, which means that the two features are complementary to each other. Figure 6 shows the confusion matrix of the proposed method. It can be seen that 91.98% (218/(218+19)) of the failed events and 93.98% (375/(375+24)) of the successful events were classified into the right classes. That is, the recall ratios of the two classes are almost equal to the overall accuracy, which indicates that the proposed method is effective for both the successful and the failed classes.

C. COMPARISON WITH DENSE TRAJECTORY-BASED FEATURES
The proposed method is also compared with the dense trajectory-based method [36], in which the features are composed of descriptors including HOF and MBH. In the comparison method, we retrained the codebook of the dense trajectories, and with the newly retrained codebook the features are encoded as Fisher vectors. A linear SVM is utilized as the classifier, in which the parameter C is set to 100.

Figure 7 shows the confusion matrix of the dense trajectory-based method [36]. The results show that the accuracy of the method [36] is 89.78%, which is lower than that of the proposed method (93.24%). This indicates that, for the FGAR problem, the proposed object-level trajectory-based feature can outperform the dense trajectory-based feature. It can also be observed that 94.73% (378/(378+21)) of the successful events were correctly classified, whereas only 81.43% (193/(193+44)) of the failed events were. That is, the dense trajectory-based method is inclined to classify failed events as successful ones.

Furthermore, we recorded the computational costs of both the proposed method and the dense trajectory-based method [36], in terms of the average computation time for handling each event. The results are shown in Table 3. Firstly, since the method [36] lacks a parallel processing solution, for a fair comparison, both methods were run on the CPU. For the proposed method, the average computation time is 13.68 s, whereas for the comparison method it is 102.2 s because of the highly intensive computation of extracting dense trajectories. This indicates that the proposed method is about 7.5× faster than the comparison method [36]. Secondly, the proposed method consists of two cascaded neural networks, so it can be run on a GPU. When run on the GPU, the average computation time of the proposed method is only 3.05 s; that is, the proposed method is about 33.5× faster than the comparison method and is able to handle the events in real time. Therefore, it can be concluded that the proposed method significantly outperforms the dense trajectory-based method [36], in terms of both accuracy and computational cost.

D. COMPARISON WITH DIFFERENT FEATURES
Other features such as the heights and the aspect ratios of
posed of several descriptors, such as dense trajectory, HOG, the bounding boxes are used in object tracking [40]. In this
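The recall ratios and overall accuracies quoted above follow directly from the confusion-matrix counts. As a quick arithmetic check, this short Python sketch (counts taken from the Figure 6 and Figure 7 matrices; the helper name is illustrative, not from the paper's code) reproduces the reported percentages up to rounding:

```python
# Per-class recall and overall accuracy from 2x2 confusion-matrix counts.
# Rows: true class (failed, successful); columns: predicted class.
def recall_and_accuracy(cm):
    recalls = [row[i] / sum(row) for i, row in enumerate(cm)]
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return recalls, correct / total

# Proposed method (Figure 6): 218/237 failed and 375/399 successful correct.
proposed = [[218, 19], [24, 375]]
# Dense trajectory-based method [36] (Figure 7): 193/237 and 378/399 correct.
dense = [[193, 44], [21, 378]]

(r_fail, r_succ), acc = recall_and_accuracy(proposed)
print(f"proposed: failed {r_fail:.2%}, successful {r_succ:.2%}, overall {acc:.2%}")
(r_fail, r_succ), acc = recall_and_accuracy(dense)
print(f"dense [36]: failed {r_fail:.2%}, successful {r_succ:.2%}, overall {acc:.2%}")
# Note: 378/399 = 94.74% when rounded; the text truncates it to 94.73%.
```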
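The speed-up factors cited alongside Table 3 are simple ratios of the average per-event computation times. A one-step check with the Table 3 values:

```python
# Average per-event computation times from Table 3 (seconds).
t_dense_cpu = 102.20  # dense trajectory-based method [36], CPU
t_prop_cpu = 13.68    # proposed method, CPU
t_prop_gpu = 3.05     # proposed method, GPU

print(f"CPU speed-up: {t_dense_cpu / t_prop_cpu:.1f}x")  # about 7.5x
print(f"GPU speed-up: {t_dense_cpu / t_prop_gpu:.1f}x")  # about 33.5x
```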
TABLE 3. COMPARISON OF THE AVERAGE COMPUTATION TIME FOR HANDLING ONE BALL-STOPPING EVENT.

Method                                     | Computation Time per Event (s)
Dense trajectory-based method [36] (CPU)   | 102.20
The proposed method (CPU)                  | 13.68
The proposed method (GPU)                  | 3.05

FIGURE 7. The confusion matrix of the dense trajectory-based method [36].

D. COMPARISON WITH DIFFERENT FEATURES
Other features, such as the heights and the aspect ratios of the bounding boxes, are used in object tracking [40]. In this work, these features are compared with the proposed features: the heights and aspect ratios are combined with the proposed features, i.e., the motion trajectories (center points) and the instantaneous velocities, respectively. The results are shown in Table 4. They show that adding these features to the proposed features decreases the accuracy of the approach.

TABLE 4. EXPERIMENTAL RESULTS OF THE PROPOSED FEATURES COMBINED WITH THE OTHER FEATURES.

Center Point | Instantaneous Velocity | Height & Aspect Ratio | Accuracy (%)
     √       |                        |           √           |    92.14
             |           √            |           √           |    88.21
     √       |           √            |           √           |    91.50
     √       |           √            |                       |    93.24

VII. CONCLUSION
FGAR is a major problem in the analysis of sports training videos. Different from common action recognition applications, the scenes and motion patterns in FGAR applications generally have no obvious differences. In that case, the human-object interactions should be efficiently represented in order to carry out FGAR. This paper proposes an object-level trajectory-based FGAR method for the analysis of soccer training videos. Firstly, the problem of FGAR is modeled as human-object (player-ball) interactions. The object-level trajectories are proposed as a new descriptor to identify fine-grained sports videos. This descriptor can take advantage of the high-level semantics and the human-object interaction motions. Secondly, a cascaded scheme of deep networks based on the object-level trajectories is proposed to realize FGAR. The cascaded network is constructed by concatenating a detector network with a classifier network. The cascaded scheme synthesizes the high efficiency of the detector on object detection and the outstanding performance of the LSTM-based network on processing time series.

In order to analyze sports training videos, the proposed method adopts the object-level trajectories to represent the human-object interactions. For more fine-grained actions, the human postures, estimated as series of keypoints, may take advantage of the middle-level semantics. In our future work, we would like to employ pose estimation along with the object-level trajectories in the analysis of more fine-grained actions.

REFERENCES
[1] A. Mason, O. Korostynska, J. Louis, L. E. Cordova-Lopez, B. Abdullah, J. Greene, R. Connell, and J. Hopkins, "Noninvasive in situ measurement of blood lactate using microwave sensors," IEEE Transactions on Biomedical Engineering, vol. 65, no. 3, pp. 698–705, March 2018.
[2] E. Shahabpoor, A. Pavic, J. M. W. Brownjohn, S. A. Billings, L. Guo, and M. Bocian, "Real-life measurement of tri-axial walking ground reaction forces using optimal network of wearable inertial measurement units," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 26, no. 6, pp. 1243–1253, June 2018.
[3] M. Jia, Z. Yin, D. Li, Q. Guo, and X. Gu, "Toward improved offloading efficiency of data transmission in the IoT-cloud by leveraging secure truncating OFDM," IEEE Internet of Things Journal, pp. 1–1, 2019.
[4] M. Jia, X. Zhang, X. Gu, Q. Guo, Y. Li, and P. Lin, "Inter-beam interference constrained resource allocation for shared spectrum multi-beam satellite communication systems," IEEE Internet of Things Journal, pp. 1–1, 2019.
[5] M. Jia, D. Li, Z. Yin, Q. Guo, and X. Gu, "High spectral efficiency secure communications with non-orthogonal physical and multiple access layers," IEEE Internet of Things Journal, pp. 1–1, 2019.
[6] H. Pan, "Research on application of computer virtual reality technology in college sports training," in 2015 Seventh International Conference on Measuring Technology and Mechatronics Automation, June 2015, pp. 842–845.
[7] M. Wirth, S. Grad, D. Poimann, R. Richer, J. Ottmann, and B. M. Eskofier, "Exploring the feasibility of EMG based interaction for assessing cognitive capacity in virtual reality," in 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), July 2018, pp. 4953–4956.
[8] Y. Qiu and X. Luo, "Application of computer virtual reality technology in modern sports," in 2013 Third International Conference on Intelligent System Design and Engineering Applications, Jan 2013, pp. 362–364.
[9] W. R. Johnson, J. Alderson, D. Lloyd, and A. Mian, "Predicting athlete ground reaction forces and moments from spatio-temporal driven CNN models," IEEE Transactions on Biomedical Engineering, vol. 66, no. 3, pp. 689–694, March 2019.
[10] M. Altini and O. Amft, "Estimating running performance combining non-invasive physiological measurements and training patterns in free-living," in 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), July 2018, pp. 2845–2848.
[11] J. Li and W. Wang, "Extracting impact characteristics of sports training on EEG by genetic algorithm," in 2011 First International Workshop on Complexity and Data Mining, Sep. 2011, pp. 76–79.
[12] Y. Li and Y. Zhang, "Application of data mining techniques in sports training," in 2012 5th International Conference on BioMedical Engineering and Informatics, 2012, pp. 954–958.
[13] W. Hong-jiang, Z. Hai-yan, and Z. Jing, "Application of the cloud computing technology in the sports training," in 2013 3rd International Conference on Consumer Electronics, Communications and Networks, Nov 2013, pp. 162–165.
[14] J. Xiong, H. Li, Q. Wu, and F. Meng, "A fast HEVC inter CU selection method based on pyramid motion divergence," IEEE Transactions on Multimedia, vol. 16, no. 2, pp. 559–564, 2014.
[15] J. Xiong, H. Li, F. Meng, S. Zhu, Q. Wu, and B. Zeng, "MRF-based fast HEVC inter CU decision with the variance of absolute differences," IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2141–2153, Dec. 2014.

[16] J. Xiong, H. Li, F. Meng, Q. Wu, and K. N. Ngan, "Fast HEVC inter CU decision based on latent SAD estimation," IEEE Transactions on Multimedia, vol. 17, no. 12, pp. 2147–2159, Dec 2015.
[17] J. Xiong, X. Long, R. Shi, M. Wang, J. Yang, and G. Gui, "Background error propagation model based RDO in HEVC for surveillance and conference video coding," IEEE Access, vol. 6, pp. 67206–67216, 2018.
[18] J. Xiong, X. Long, R. Shi, M. Wang, Q. Zhou, and G. Gui, "Periodic enhanced frame based long-short-term reference in HEVC for conference and surveillance video coding," IEEE Access, vol. 7, pp. 46422–46433, 2019.
[19] J. Pan, Y. Yin, J. Xiong, W. Luo, G. Gui, and H. Sari, "Deep learning-based unmanned surveillance systems for observing water levels," IEEE Access, vol. 6, pp. 73561–73571, 2018.
[20] W. Li, H. Li, Q. Wu, F. Meng, L. Xu, and K. N. Ngan, "HeadNet: An end-to-end adaptive relational network for head detection," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
[21] H. Huang, W. Xia, J. Xiong, J. Yang, G. Zheng, and X. Zhu, "Unsupervised learning-based fast beamforming design for downlink MIMO," IEEE Access, vol. 7, pp. 7599–7605, 2019.
[22] S. Xie, X. Zheng, W. Shao, Y. Zhang, T. Lv, and H. Li, "Non-blind image deblurring method by the total variation deep network," IEEE Access, pp. 1–1, 2019.
[23] N. U. Huda, K. H. Jensen, R. Gade, and T. B. Moeslund, "Estimating the number of soccer players using simulation-based occlusion handling," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2018, pp. 1905–190509.
[24] W. Lu, J. Ting, J. J. Little, and K. P. Murphy, "Learning to track and identify players from broadcast sports videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1704–1716, July 2013.
[25] A. Senocak, T. Oh, J. Kim, and I. S. Kweon, "Part-based player identification using deep convolutional representation and multi-scale pooling," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2018, pp. 1813–18137.
[26] V. Renò, N. Mosca, R. Marani, M. Nitti, T. D'Orazio, and E. Stella, "Convolutional neural networks based ball detection in tennis games," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2018, pp. 1839–18396.
[27] H. Chen, Y. He, C. Chou, S. Lee, B. P. Lin, and J. Yu, "Computer-assisted self-training system for sports exercise using Kinects," in 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), July 2013, pp. 1–4.
[28] J. Hwang, S. Park, and N. Kwak, "Athlete pose estimation by a global-local network," in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 114–121.
[29] O. Sumer, T. Dencker, and B. Ommer, "Self-supervised learning of pose embeddings from spatiotemporal relations in videos," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 4308–4317.
[30] A. Tejero-de-Pablos, Y. Nakashima, T. Sato, N. Yokoya, M. Linna, and E. Rahtu, "Summarization of user-generated sports video by using deep action recognition features," IEEE Transactions on Multimedia, vol. 20, no. 8, pp. 2000–2011, Aug 2018.
[31] M. Merler, D. Joshi, Q. Nguyen, S. Hammer, J. Kent, J. R. Smith, and R. S. Feris, "Automatic curation of golf highlights using multimodal excitement features," in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 57–65.
[32] P. Shukla, H. Sadana, A. Bansal, D. Verma, C. Elmadjian, B. Raman, and M. Turk, "Automatic cricket highlight generation using event-driven and excitement-based features," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2018, pp. 1881–18818.
[33] X. Wang and X. Zhang, "An ICA mixture hidden conditional random field model for video event classification," IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 1, pp. 46–59, Jan 2013.
[34] A. Cioppa, A. Deliège, and M. Van Droogenbroeck, "A bottom-up approach based on semantics for the interpretation of the main camera stream in soccer games," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2018, pp. 1846–184609.
[35] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," International Journal of Computer Vision, vol. 103, no. 1, pp. 60–79, 2013.
[36] H. Wang and C. Schmid, "Action recognition with improved trajectories," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3551–3558.
[37] X. Peng, C. Zou, Y. Qiao, and Q. Peng, "Action recognition with stacked Fisher vectors," in European Conference on Computer Vision. Springer, 2014, pp. 581–595.
[38] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568–576.
[39] A. Nibali, Z. He, S. Morgan, and D. Greenwood, "Extraction and classification of diving clips from continuous video footage," in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 94–104.
[40] R. Theagarajan, F. Pala, X. Zhang, and B. Bhanu, "Soccer: Who has the ball? Generating visual analytics and player statistics," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), June 2018, pp. 1830–18308.
[41] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4694–4702.
[42] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1933–1941.
[43] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, "Temporal segment networks: Towards good practices for deep action recognition," in European Conference on Computer Vision. Springer, 2016, pp. 20–36.
[44] A. Diba, V. Sharma, and L. Van Gool, "Deep temporal linear encoding networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2329–2338.
[45] W. Ren, J. Zhang, X. Xu, L. Ma, X. Cao, G. Meng, and W. Liu, "Deep video dehazing with semantic segmentation," IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1895–1908, April 2019.
[46] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.
