Article
Comparison of Tracking Techniques on
360-Degree Videos
Tzu-Wei Mi and Mau-Tsuen Yang *
Department of Computer Science & Information Engineering, National Dong Hwa University,
Hualien 974, Taiwan
* Correspondence: mtyang@gms.ndhu.edu.tw

Received: 29 June 2019; Accepted: 12 August 2019; Published: 14 August 2019

Featured Application: Tracking techniques are essential for attaching an AR tag to a physical
target in 360-degree videos.

Abstract: With the availability of 360-degree cameras, 360-degree videos have become popular recently.
To attach a virtual tag on a physical object in 360-degree videos for augmented reality applications,
automatic object tracking is required so the virtual tag can follow its corresponding physical object
in 360-degree videos. Relative to ordinary videos, 360-degree videos in an equirectangular format
have special characteristics such as viewpoint change, occlusion, deformation, lighting change, scale
change, and camera shakiness. Tracking algorithms designed for ordinary videos may not work well
on 360-degree videos. Therefore, we thoroughly evaluate the performance of eight modern trackers in
terms of accuracy and speed on 360-degree videos. The pros and cons of these trackers on 360-degree
videos are discussed. Possible improvements to adapt these trackers to 360-degree videos are also
suggested. Finally, we provide a dataset containing nine 360-degree videos with ground truth of
target positions as a benchmark for future research.

Keywords: 360-degree videos; augmented reality; real-time tracking

1. Introduction
Nowadays, 360-degree videos are becoming more and more popular. Omnidirectional cameras,
also called 360-degree cameras, are widely available, increasingly lightweight, and can even be installed
on drones [1]. They are useful for recording indoor or outdoor activities, covering views from all perspectives.
Rendering 360-degree videos on a virtual reality headset can provide an immersive experience for users
in education, entertainment, and tourism [2]. For augmented reality applications using 360-degree
videos, a common request is to register a virtual tag to a physical target. As shown in Figure 1, a virtual
billboard marked in red color must follow its corresponding physical target over time. For this purpose,
automatic tracking of a specific target in 360-degree videos is highly desirable. Therefore, we explore
the characteristics of 360-degree videos and compare the performance of existing tracking techniques
on 360-degree videos.
A 360-degree video consists of a sequence of 360-degree images with a fixed interval of time.
Each 360-degree image is a panorama either captured by an omnidirectional camera or combined
by multiple cameras to cover the complete horizontal field of view (i.e., 360-degree FOV). As shown
in Figure 1, a 360-degree image is typically flattened into an equirectangular format, in which longitude
lines are projected to vertical straight lines of constant spacing. Similarly, latitude lines are mapped to
horizontal straight lines of constant spacing.
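For reference, the mapping can be written in closed form: with image width W and height H, a viewing direction with longitude λ ∈ [−π, π) and latitude φ ∈ [−π/2, π/2] lands at pixel (x, y) = ((λ + π)/(2π)·W, (π/2 − φ)/π·H). A minimal Python sketch of this mapping and its inverse (our illustration; the function names are ours, not from any tracking library):

```python
import numpy as np

def lonlat_to_pixel(lon, lat, width, height):
    """Map a viewing direction (longitude/latitude in radians) to
    equirectangular pixel coordinates. Longitude in [-pi, pi) maps
    linearly to x in [0, width); latitude in [-pi/2, pi/2] maps
    linearly to y in [0, height), with y = 0 at the north pole."""
    x = (lon + np.pi) / (2.0 * np.pi) * width
    y = (np.pi / 2.0 - lat) / np.pi * height
    return x, y

def pixel_to_lonlat(x, y, width, height):
    """Inverse mapping from pixel coordinates back to a viewing direction."""
    lon = x / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - y / height * np.pi
    return lon, lat
```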


Figure 1. Tagging an Augmented Reality (AR) marker on a physical target in 360-degree videos.

Due to their immersive nature and full field of view, 360-degree videos are ideal for virtual
reality applications. However, the ways of interacting during playback of 360-degree videos are quite
limited. Users cannot walk within the scene in 360-degree videos; that is, 3D translation is not possible
beyond the passive motion caused by a moving camera. Nevertheless, users can freely select their point
of view during playback of 360-degree videos (i.e., 3D rotation is possible). Conveniently, a 360-degree
video can be viewed via an ordinary web browser, in which a user can pan around by clicking and
dragging. Alternatively, a 360-degree video can be observed via a head-mounted display (HMD),
in which a user can pan around simply by rotating his or her head.

In ordinary monoscopic videos, real-time tracking techniques rely on either offline datasets or
online learning to train an appearance model, then apply the trained model to track potential targets
frame by frame. However, tracking algorithms designed for ordinary videos may not perform well
on 360-degree videos with their unique characteristics [3,4]. For example, occlusion problems are
almost unavoidable in panoramic 360-degree videos. Moreover, an object may disappear from the
left but reappear on the right border, or disappear from the top then reappear on the bottom border
in 360-degree videos. Nonrigid deformation is obvious in 360-degree videos in an equirectangular
format. Lighting and scale changes happen frequently in 360-degree videos due to the continuous
change of viewpoints. Camera shakiness is another common problem in aerial 360-degree videos
captured by drones. Trackers designed for ordinary videos tend to confuse or even lose the tracking
target under these circumstances in 360-degree videos. To this end, Cai et al. [5] adapted the Kernelized
Correlation Filter (KCF) tracking algorithm to work on 360-degree videos. Delforouzi et al. [6] modified
the Track–Learn–Detection (TLD) tracker to fit the needs of 360-degree videos. Nevertheless, a thorough
evaluation of state-of-the-art tracking algorithms on 360-degree videos is still missing. For this purpose,
we implement eight popular tracking techniques and evaluate their performance on 360-degree videos.
To make a fair comparison of both quality and time of tracking, we adopt the default parameters of
these trackers in an open-sourced library called OpenCV. The experimental results are analyzed to
reveal the pros and cons of these tracking algorithms on 360-degree videos.

The contribution of this paper is a thorough evaluation of eight popular tracking algorithms
on 360-degree videos. Both qualitative and quantitative comparisons are made in terms of accuracy
and speed. According to the experimental results, we discuss the strengths and weaknesses of these
trackers on 360-degree videos, and suggest potential ways to adapt them to 360-degree videos for
better tracking performance. As a basis of the comparison, we capture nine 360-degree videos in a
variety of scenarios. Three of them are captured on the ground and six of them are captured in the air.
Positions of interesting targets in these 360-degree videos are manually marked in each frame as the
ground truth of tracking. The dataset containing these nine 360-degree videos with the ground truth is
provided (online link in supplementary materials) to be a benchmark for future research.

2. Background
With the advance of virtual reality technology in the past few years, 360-degree images and
videos have become a blooming research topic. Neng and Chambel [7] designed and evaluated
360-degree hypervideos that allow users to explore and navigate through links. Berning et al. [8]
adopted 360-degree interactive video to create evaluation scenarios where users can select their point
of view during playback. Rupp et al. [9] used 360-degree videos as a learning tool and analyzed the
effects of immersiveness of three devices: a smartphone, a Google Cardboard, and an Oculus Rift.
Pakkanen et al. [10] compared three interactive ways for 360-degree video playback: remote control,
head orientation, and hand gesture. Huang et al. [11] presented an automatic approach to generate
spatial audio for panorama images based on object detection and action recognition.
Given the initial position of an unknown object, the purpose of tracking is to locate the object
in successive frames of a video. Mousas [12] proposed a method for controlling motions of a virtual
partner character based on a performance-capturing process using multiple inertial measurement units
(IMUs). Instead of relying on IMUs for human tracking, this paper focuses on vision-based methods
for unknown object tracking. Among all the existing online visual tracking algorithms, we choose
eight modern and popular trackers for evaluation and comparison on 360-degree videos.
The Boosting tracker, proposed by Grabner et al. [13], is an online version of the AdaBoost feature
selection algorithm. The online boosting algorithm maintains a global classifier pool of weak classifiers
with multiple selectors. Each new training sample is used to update each weak classifier in the pool.
A cascade system initializes the first selector with the current sample’s importance, selects the best
weak classifier with the least error, and passes the estimated importance to the next selector until all
selectors have been updated. In the end, a strong classifier is chosen from the best weak classifiers,
and the worst weak classifier is replaced with a random one. The Boosting tracker utilizes the initial
target area in the current frame as a positive example, and exploits other areas with the same size
around the target as negative examples. Then, the online-trained classifier searches the neighborhood
for potential targets in the next frame. The Boosting tracker can handle temporary occlusions as well
as complex backgrounds.
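To make the selector mechanism concrete, the following toy Python sketch mimics one online boosting update step in the spirit of Grabner et al. [13]. It is our simplified illustration, not the tracker's actual implementation: the weak classifiers are plain threshold tests on scalar features, all selectors share one classifier pool, and the replacement of the worst classifier is omitted.

```python
import numpy as np

class OnlineBoostingSelectors:
    """Toy sketch of online boosting feature selection (simplified)."""

    def __init__(self, n_weak, n_sel, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.feat = rng.integers(0, dim, n_weak)   # feature index per weak classifier
        self.thr = rng.normal(0.0, 1.0, n_weak)    # fixed random thresholds
        self.lam_c = np.ones(n_weak)               # accumulated importance, correct
        self.lam_w = np.ones(n_weak)               # accumulated importance, wrong
        self.n_sel = n_sel
        self.chosen = np.zeros(n_sel, dtype=int)   # weak classifier picked per selector
        self.alpha = np.zeros(n_sel)               # voting weight per selector

    def update(self, x, y):
        """One online training step; x: feature vector, y: label in {-1, +1}."""
        h = np.where(x[self.feat] > self.thr, 1, -1)     # all weak predictions
        lam = 1.0                                        # sample importance
        for s in range(self.n_sel):
            correct = (h == y)
            self.lam_c[correct] += lam                   # update running error estimates
            self.lam_w[~correct] += lam
            err = self.lam_w / (self.lam_c + self.lam_w)
            j = int(np.argmin(err))                      # best weak classifier wins
            self.chosen[s] = j
            e = float(np.clip(err[j], 1e-3, 1.0 - 1e-3))
            self.alpha[s] = 0.5 * np.log((1.0 - e) / e)  # AdaBoost voting weight
            # pass the re-weighted sample importance to the next selector
            lam = lam / (2.0 * (1.0 - e)) if correct[j] else lam / (2.0 * e)

    def predict(self, x):
        h = np.where(x[self.feat] > self.thr, 1, -1)
        return int(np.sign(self.alpha @ h[self.chosen]))
```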
The Multiple Instance Learning (MIL) tracker, proposed by Babenko et al. [14], extends the online
boosting algorithm by using a set of image patches (called a bag) instead of a single sample for training.
A bag containing at least one positive example is called a positive bag, otherwise it is called a negative
bag. The MIL tracker collects lots of small image patches centered at the tracking object as potential
positive bags, and chooses the best one to be the positive example. This strategy not only prevents the
MIL tracker from losing important information but also avoids the mislabeling problem.
The MedianFlow tracker, proposed by Kalal et al. [15], is a bidirectional approach that combines
forward and backward tracking. The forward and backward consistency is analyzed as a quality
measure to assist the tracking. The MedianFlow tracker constructs both forward and backward
trajectories at each time instant, and their corresponding errors are estimated. The trajectory with the
minimum forward–backward error is chosen as the candidate for the succeeding tracking. As a result,
the MedianFlow tracker is more reliable at following objects with consistent movement.
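The core consistency check can be sketched with OpenCV's pyramidal Lucas–Kanade optical flow. This is our illustration under simplifying assumptions: the real MedianFlow additionally applies normalized cross-correlation checks and derives the box motion from the median displacement of the surviving points.

```python
import cv2
import numpy as np

def forward_backward_error(prev_gray, next_gray, points):
    """Track points forward in time, track the results backward, and measure
    how far each point drifts from its starting position (the FB error)."""
    pts = points.astype(np.float32).reshape(-1, 1, 2)
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(next_gray, prev_gray, fwd, None)
    fb_err = np.linalg.norm(pts - bwd, axis=2).ravel()   # per-point drift
    valid = (st_f.ravel() == 1) & (st_b.ravel() == 1)
    # Keep only the points whose error is below the median of the valid ones.
    keep = valid & (fb_err <= np.median(fb_err[valid]))
    return fwd.reshape(-1, 2), fb_err, keep
```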
The Minimum Output Sum of Squared Error (MOSSE) tracker, proposed by Bolme et al. [16], is a
tracker based on correlation filters. It achieves high efficiency by computing correlation in the Fourier
domain, where correlation reduces to element-wise multiplication. The MOSSE filter improves on the
ASEF filter to overcome its potential overfitting problem. The MOSSE tracker minimizes the output
sum of squared error to find the most probable location of the
tracking object. The benefits of using a correlation filter make the MOSSE tracker more robust to the
problems of scaling, rotation, deformation, and occlusion compared to traditional approaches. Also,
MOSSE is more flexible than other correlation-filter-based trackers because the target is not required to
be in the center of the image in the beginning of tracking.
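For reference, the closed-form filter from Bolme et al. [16] can be stated compactly. Writing F_i and G_i for the Fourier transforms of training patch f_i and its desired Gaussian-shaped output g_i, with element-wise multiplication denoted by the circle operator and complex conjugation by an overbar, the filter that minimizes the output sum of squared error is

\hat{H}^{*} = \frac{\sum_{i} G_{i} \odot \overline{F_{i}}}{\sum_{i} F_{i} \odot \overline{F_{i}}}

In practice the numerator and denominator are maintained as running averages with a learning rate, which is what gives the tracker its online adaptivity.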
The TLD tracker, proposed by Kalal et al. [17], is mainly composed of three parts: a tracker,
a learner, and a detector. The job of the tracker is to follow the target through consecutive frames; the
learner relies on a P-expert and an N-expert to estimate misdetection and false alarm, respectively, then
updates the detector. The detector locates potential targets according to an appearance model, feeds
the outputs to the learner, and corrects the tracker if necessary. The TLD tracker is well known for its
ability of failure recovery at the expense of instability. Compared to other online trackers struggling
with the problem of accumulating errors, the combination of tracking and detecting modules makes
the TLD tracker more reliable for long-term tracking.
The KCF tracker, proposed by Henriques et al. [18], extends the MOSSE concept and takes
advantage of overlapping regions in multiple positive samples. The abundant data is computed in
Fourier domain to increase the learning speed. The KCF tracker emphasizes the importance of the
negative samples and tends to use more samples for better training. To this end, a cyclic shift is applied
to generate more samples from each important sample. The characteristic of circulant matrices for
regression samples is utilized to speed up the computation. Also, kernel tricks are exploited to deal
with the problem of nonlinear regression. Instead of scanning through raw pixels, the KCF tracker
extracts Histogram of Oriented Gradients (HoG) features to improve the accuracy of tracking.
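The closed-form solution from Henriques et al. [18] is equally compact. With x the training patch, y the desired Gaussian response, k^{xx} the kernel autocorrelation of x, k^{xz} the kernel correlation between x and a new patch z, λ the regularization weight, and hats denoting discrete Fourier transforms, training and detection reduce to

\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}, \qquad \hat{f}(z) = \hat{k}^{xz} \odot \hat{\alpha}

where the response map f(z) is recovered by an inverse transform. Both steps cost only element-wise operations and FFTs, which explains the high frame rates reported in Section 4.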
The Generic Object Tracking Using Regression Networks (GOTURN) tracker, proposed by
Held et al. [19], adopts an offline dataset to train a Convolutional Neural Network (CNN) model in
advance. Then, it relies on the generated model for online tracking. The process of pretraining takes
advantage of readily available information in offline datasets to learn both target appearance and
motion relationship. Without the requirement to update CNN weights in run-time, the GOTURN
tracker has another significant advantage of online tracking speed. Although it is not necessary to
include specific tracking targets in the dataset for pretraining, the GOTURN tracker tends to favor
objects in the training set over objects that are not in the training set. A potential issue of the GOTURN
tracker is the quality of the pretrained model that can seriously affect the performance of the online
tracking process.
The Channel and Spatial Reliability Tracker (CSRT), proposed by Lukezic et al. [20], is based on
the Discriminative Correlation Filter (DCF) algorithm. It improves the DCF tracker by introducing
spatial and channel reliability. The spatial reliability map is used to find out the optimal filter size.
The ability to adjust filter size makes the CSRT tracker better than the traditional DCF algorithm by
excluding unrealistic samples. Another benefit from the spatial reliability map is its ability to handle
nonrectangular targets. The channel reliability is measured to weigh the importance of each channel
filter; the per-channel responses are then combined to obtain the final response map. Using only the
standard HoG and Colorname feature sets, the CSRT tracker can achieve impressive accuracy at real-time speed.
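All eight trackers described above are available in OpenCV's contrib tracking module, which is also what our experiments in Section 3 use. A minimal sketch of a uniform factory (our own convenience wrapper; the constructor names follow the OpenCV 3.4.2 Python API):

```python
import cv2

# Factory mapping for the eight evaluated trackers (OpenCV 3.4.2 contrib API).
TRACKER_FACTORIES = {
    "BOOST": cv2.TrackerBoosting_create,
    "MIL": cv2.TrackerMIL_create,
    "MedianFlow": cv2.TrackerMedianFlow_create,
    "MOSSE": cv2.TrackerMOSSE_create,
    "TLD": cv2.TrackerTLD_create,
    "KCF": cv2.TrackerKCF_create,
    # GOTURN additionally expects goturn.prototxt and goturn.caffemodel
    # to be present in the working directory.
    "GOTURN": cv2.TrackerGOTURN_create,
    "CSRT": cv2.TrackerCSRT_create,
}

def create_tracker(name):
    """Instantiate one of the eight trackers with its default parameters."""
    return TRACKER_FACTORIES[name]()
```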

3. Experiments

3.1. Experiment Setup


To measure the performance of eight trackers on 360-degree videos in a variety of situations,
we prepared a dataset containing nine 360-degree videos captured using a Garmin Virb 360-degree
camera. The 360-degree camera can be installed on top of a helmet for ground video capturing as shown
in Figure 2a. Alternatively, the 360-degree camera can be attached to a drone for aerial video capturing
as shown in Figure 2b. A Garmin Virb 360-degree camera contains two 12-megapixel sub-cameras that
are opposite to each other. Each sub-camera has a wide field of view (FOV) of 202 degrees. At each
time instant, the hardware inside the camera analyzes the overlap between two images captured by
sub-cameras, then aligns and stitches two images together to form a seamless 360-degree image.

Figure 2. Two ways to set up an omnidirectional camera: (a) installed on top of a helmet for ground
video capturing, (b) attached to a drone for aerial video capturing.

One problem of 360-degree videos is the huge file size. Typically, a 360-degree video contains
30 360-degree images per second, and each 360-degree image has a resolution of 3840 × 2160 with
three color channels, resulting in a total of 746 MB per second. Video compression techniques can be
applied to effectively reduce the size of the captured video file. A smaller video size can increase the
frame rate of tracking and make the motion between two consecutive frames smaller, and hence
improve the performance of tracking. On the other hand, a high compression ratio and low bit rate can
reduce the video quality, and hence degrade the accuracy of tracking. Wang et al. [21] studied the
influence of the choice of video coding parameters on the performance of visual object tracking. In the
default setting, the hardware inside the Garmin Virb 360-degree camera applies the most commonly
used AVC/H.264 encoding and generates a standard MP4 video file with a maximum bit rate of 120 Mbps.
To make a fair comparison of the eight trackers, our experiments are based on the same video encoding
and bit rate in all nine video sequences.

As shown in Table 1, these 360-degree video sequences cover multiple scenarios, each containing
a combination of characteristics such as viewpoint change, occlusion, deformation, lighting change,
scale change, and camera shakiness. Each 360-degree video sequence lasts 100~1000 frames with a
resolution of 3840 × 2160. For speedup purposes, all video sequences are down-sampled to 1920 × 1080
in our tracking experiments. The benchmark machine is a PC with a 3.2 GHz CPU and 16 GB RAM.
The operating system is Microsoft Windows 10.
Table 1. Our dataset containing nine 360-degree videos, each with a combination of special
characteristics (columns: viewpoint change, occlusion, deformation, lighting change, scale change,
shakiness; rows: Sequences 1~9; a check mark indicates that a sequence exhibits the characteristic).
3.2. Experiment Process

Our implementation of the eight tracking algorithms was based on the open-sourced OpenCV
library, Version 3.4.2 (Intel Corporation, Santa Clara, CA, USA). All eight trackers were initialized
with default parameters. Among these trackers, the GOTURN tracker was the only one based on a
Convolutional Neural Network (CNN), and it utilized a standard Caffe model for online tracking.
For each frame in a 360-degree video, the centroid of the tracking target was marked manually in
advance as the ground truth of tracking. Figure 3 shows the flowchart of the proposed experiment to
evaluate eight trackers on nine 360-degree videos. Each tracker was executed on individual 360-degree
video sequences in turn to measure the tracking speed in terms of Frames Per Second (FPS). A spatial
displacement (in pixels) was computed as the absolute distance between the tracker's output position
and the ground truth. If the displacement was smaller than a predefined tolerated threshold, the frame
was counted as a correct tracking frame. The accuracy was defined as the ratio of the number of
correct tracking frames to the number of all frames. Because only the GOTURN, TLD, CSRT,
and MedianFlow trackers could adjust and update the target window size dynamically, adaptable
window size was implemented in these four trackers for qualitative evaluation. To make a fair
comparison of the eight trackers, the size of the target window was not considered for quantitative
evaluations. For the same reason, the Kalman filter was not applied for any tracker in our experiments.
Figure 3. Flowchart of the proposed experiment to evaluate eight trackers on nine 360-degree videos.
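Concretely, the per-sequence measurement described above can be sketched as follows. This is our simplified illustration: `frames` is a list of decoded images, `ground_truth` holds the manually marked centroids, and the 50-pixel default threshold is a placeholder for the values swept in Figure 13.

```python
import time

def evaluate(tracker, frames, ground_truth, init_box, threshold=50):
    """Run one tracker over a sequence; report accuracy and speed (FPS).

    init_box is the (x, y, w, h) target window in the first frame; a frame
    counts as correct when the centroid of the tracker's output window lies
    within `threshold` pixels of the ground-truth centroid."""
    tracker.init(frames[0], init_box)
    correct = 0
    start = time.time()
    for frame, (gx, gy) in zip(frames[1:], ground_truth[1:]):
        ok, (x, y, w, h) = tracker.update(frame)
        if ok:
            cx, cy = x + w / 2.0, y + h / 2.0    # centroid of the output window
            if ((cx - gx) ** 2 + (cy - gy) ** 2) ** 0.5 < threshold:
                correct += 1
    fps = (len(frames) - 1) / (time.time() - start)
    accuracy = correct / (len(frames) - 1)
    return accuracy, fps
```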
3.3. Experiment Results

To demonstrate the qualitative tracking outputs of the eight trackers on the nine 360-degree video
sequences, six representative snapshots with a fixed interval of time in each 360-degree video sequence
are shown in detail in Figures 4–12. The tracking results of all trackers are marked as rectangular
windows with different colors in each snapshot (MOSSE: yellow; MIL: blue; MedianFlow: purple;
BOOST: turquoise; TLD: red; KCF: green; GOTURN: pink; CSRT: cyan). The video sequence 1 was
captured by a moving biker. The tracking target was another bike with huge scale change, some
viewpoint change, and minor lighting change over time. Among the eight trackers, the CSRT, MIL,
BOOST, and MOSSE trackers performed quite well, but the other trackers lost the target in the middle
of the sequence as shown in Figure 4. The video sequence 2 was captured by a moving motorcycle.
The tracking target was a stadium on one side of the road. The building was occasionally occluded by
trees and streetlamps. All trackers were affected by the occlusion problem and so only produced decent
tracking results as shown in Figure 5. Especially, the MedianFlow tracker confused the tracking target
with other obstacles. It suffered seriously from temporal occlusion in this sequence. The video
sequence 3 was captured by a drone with an overlooking view of a lake. The tracking target was the
roof of a green building. The shape of the target deformed dramatically due to the nature of 360-degree
videos in an equirectangular format. Luckily, some trackers could still follow the target smoothly for a
short period of time.

Figure 4. Tracking results on 360-degree video sequence 1 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).
Figure 5. Tracking results on 360-degree video sequence 2 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).

Figure 6. Tracking results on 360-degree video sequence 3 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).

The video sequence 4 was captured by a drone flying around a lake harbor. The tracking target
was a small boat docking at a pier. The scale of the target changed a lot over time as shown in Figure 7.
Though the KCF and MOSSE trackers achieved a fair accuracy at the beginning, they lost the target in
the middle of the sequence. Even worse, the GOTURN and TLD trackers got confused and tracked
the wrong objects in the early stage. The video sequence 5 was captured by a drone flying along a
lakeshore. The tracking target was a lakeside building with a red roof. The scale of the building
changed slowly over time, hence the MedianFlow tracker performed quite well. The sequence 5
exhibited another nature of 360-degree videos in that the tracking target disappeared from one side
and reappeared on the other side of the panoramic image. Most trackers cannot recover from this
problem. Interestingly, the TLD tracker correctly recovered the target as shown in the last snapshot of
Figure 8. The video sequence 6 was captured by a drone flying across a lake. The tracking target was a
fast-moving boat with apparent viewpoint change. With the problem of large motion in this sequence,
only the GOTURN tracker obtained great results. In comparison, the KCF and MOSSE trackers lost the
tracking target very early as shown in Figure 9.
Figure 7. Tracking results on 360-degree video sequence 4 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).

Figure 8. Tracking results on 360-degree video sequence 5 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).

Figure 9. Tracking results on 360-degree video sequence 6 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).

The video sequence 7 was captured by a moving biker. The tracking target was a large building
with slow viewpoint change and some partial occlusion. Although several trackers received decent
scores, they were not really focused on the center of the building. Only the BOOST tracker accurately
tracked the whole building throughout this sequence as shown in Figure 10. The video sequence 8
was captured by a drone flying on windy days. The characteristic of this sequence was camera
shakiness, which is a common problem in drone-recorded 360-degree videos. Even with the jittery
motion and target deformation caused by the shaking camera, most trackers still performed quite well
throughout the sequence as shown in Figure 11. The video sequence 9 was captured by a drone flying
along a seashore. The tracking target was the summit of a mountain, and the illumination changed
dramatically over space and time. In the middle of the sequence, the sun sat just behind the mountain
top. The problem of backlighting caused tracking loss for the KCF tracker, and tracking error for the
TLD and GOTURN trackers. Surprisingly, the other trackers still followed the target very well as shown
in Figure 12. In summary, Figure 13 outlines the quantitative results of the eight trackers on the nine
sequences. The vertical axis indicates the tracking accuracy, and the horizontal axis represents the
predefined value of the tolerated threshold.
Figure 10. Tracking results on 360-degree video sequence 7 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).

Figure 11. Tracking results on 360-degree video sequence 8 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).

Figure 12. Tracking results on 360-degree video sequence 9 (tracking windows with different colors:
MOSSE MIL MedianFlow BOOST TLD KCF GOTURN CSRT).
Figure 13. Accuracy comparison of eight trackers on nine 360-degree video sequences. (a) Seq.1,
(b) Seq.2, (c) Seq.3, (d) Seq.4, (e) Seq.5, (f) Seq.6, (g) Seq.7, (h) Seq.8, (i) Seq.9, (j) Overall average.

4. Discussion
In terms of Frames Per Second (FPS), Table 2 summarizes the speed comparison of all eight trackers
on nine video sequences. According to the experimental results, the MOSSE is the fastest tracker, with an
average of 3776 FPS. The KCF is the next most efficient tracker, with an average of 175 FPS. The MedianFlow
tracker can also achieve 63 FPS. Among all eight trackers, the TLD is the slowest. In fact, it is
about 600 times slower than the MOSSE tracker and not suitable for real-time applications.

Table 2. Speed comparison of eight trackers on nine video sequences in terms of Frames Per Second
(FPS).

Sequences \ Trackers: GOTURN MedianFlow TLD KCF BOOST CSRT MIL MOSSE
Sequence 1 23.7 62.8 2.6 188.9 41.6 38.9 12.8 4118.2
Sequence 2 20.5 66.0 9.5 204.8 35.8 35.6 14.3 2250.0
Sequence 3 16.0 61.8 2.8 172.9 26.7 32.8 14.3 2532.2
Sequence 4 14.3 62.8 4.9 309.5 56.9 47.6 14.9 10356.8
Sequence 5 16.4 61.4 5.6 39.0 22.7 34.8 13.0 457.0
Sequence 6 21.3 65.2 7.1 304.8 49.8 47.3 17.2 8614.0
Sequence 7 21.2 63.1 8.7 41.4 15.8 22.8 12.1 440.3
Sequence 8 23.4 63.6 8.0 104.3 23.2 31.5 12.1 1233.8
Sequence 9 19.3 64.1 4.9 210.8 37.4 37.8 12.7 3982.4
Average 19.6 63.4 6.0 175.1 34.4 36.6 13.7 3776.1

In terms of tracking quality, Table 3 summarizes the accuracy comparison of all eight trackers in
nine video sequences. The strengths and weaknesses of all eight trackers are outlined in Table 4. The
GOTURN is the only tracker based on deep learning but does not perform well in terms of accuracy
on 360-degree videos. Interestingly, it works well in some special cases. For example, it is the only
tracker that could flawlessly track a fast-moving boat in sequence 6, possibly because the pretrained
dataset contained boats. In fact, the performance of GOTURN heavily depends on the appearance
model of the tracking target. Hence, we believe the GOTURN tracker can be improved by training
specific target models for 360-degree videos in advance.

Table 3. Overall accuracy comparison of eight trackers on nine video sequences (bad: 0~20%; poor:
20~40%; ok: 40~60%; good: 60~80%; great: 80~100%).

Sequences \ Trackers: GOTURN MedianFlow TLD KCF BOOST CSRT MIL MOSSE
Sequence 1 poor poor bad poor great great great great
Sequence 2 great poor poor ok ok ok poor bad
Sequence 3 bad good bad good good great great great
Sequence 4 poor great poor ok great great great poor
Sequence 5 bad great great great poor great great great
Sequence 6 great ok ok bad ok good ok bad
Sequence 7 ok good poor great great great great good
Sequence 8 great great great great good great great great
Sequence 9 bad great bad poor great great great great

Table 4. Strengths and weaknesses of all eight trackers on 360-degree videos.

Trackers | Principle | Strength | Weakness | Improvement Suggestion
GOTURN | Pretrained CNN model | Recovery from failure and occlusion | Target not in training data | Include specific targets for training in advance
MedianFlow | Min forward–backward error | Reliable on slow changing target | Fast-moving target | Support motion detection
TLD | Track, learn, and detect | Recovery from failure and occlusion | High false alarm | Combine with reliable filter
KCF | Kernelized correlation filter | Report tracking failure | Fixed target size | Adaptable target size
BOOST | AdaBoost | Decent accuracy | Seldom report tracking failure | Adaptable tolerance
CSRT | Discriminative correlation filter | Robust and high accuracy | Long-term occlusion | Failure recovery with spatial relationship
MIL | Multi-instance learning | High accuracy | Long-term occlusion | Failure recovery with spatial relationship
MOSSE | Min square error | High tracking speed | Fixed target size | Adaptable target size

Generally, the MedianFlow tracker performs well on consistent and slowly changing video
sequences. However, occasional occlusion prevents it from reaching agreement in the bidirectional
analysis, and the tracking fails as shown in video sequence 2.
The TLD is a slow tracker with a high false detection rate, but it works well in the case of failure
recovery. In particular, the TLD tracker is a good choice to track a target that disappears from one
place and reappears in another place in 360-degree videos.
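A cheap complementary fix, independent of the tracker, is to make displacement computations aware of the horizontal wrap-around of the equirectangular image, so that a target leaving the left border and re-entering on the right is treated as a small motion rather than a jump. A sketch (our illustration, not part of any evaluated tracker):

```python
def wrap_dx(x1, x2, width):
    """Horizontal distance that treats the left and right image borders as
    adjacent (longitude wraps around at +/-180 degrees)."""
    dx = abs(x1 - x2)
    return min(dx, width - dx)

def wrap_displacement(p, q, width):
    """Displacement between two centroids with horizontal wrap-around."""
    dx = wrap_dx(p[0], q[0], width)
    dy = abs(p[1] - q[1])
    return (dx ** 2 + dy ** 2) ** 0.5
```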
The KCF tracker performs well on ordinary videos but not on 360-degree videos due to its
fixed-size filters. The characteristics of 360-degree videos such as scale change, viewpoint change, and
deformation easily lead the KCF tracker to a track loss. Thus, it is only usable for tracking plain targets
that do not exhibit these characteristics.
The Boost tracker achieves a fair accuracy on 360-degree videos, though it does not sense tracking
failure and continues to track a wrong target as shown in video sequence 5. The parameters of tolerance
need to be adjusted accordingly to avoid false tracks for the Boost tracker.
The CSRT tracker is a good choice for tracking on 360-degree videos because it detects target
objects using the HoG features instead of raw pixels. It can adjust the size of target window dynamically
as well. Nonetheless, it still has a hard time recovering from a temporarily disappearing target as
shown in video sequence 2, or tracking a fast-moving target as shown in video sequence 6.
The MIL tracker can properly handle most of the cases on 360-degree videos. Its weakness is the
problem of occlusion caused by change of viewpoints; the MIL tracker tends to fail to recover the
tracking target even after the occlusion ends.
For applications with high-speed demand or large motion, the MOSSE tracker is the best choice
since its tracking speed is significantly higher than that of the other trackers, though the fixed-size
tracking window can be a problem for video sequences with huge scale change.
A typical example of 360-degree videos in an equirectangular format is shown in Figure 14.
An ideal tracker should be able to tackle the problems in 360-degree videos such as viewpoint change,
occlusion, deformation, lighting change, scale change, and camera shakiness. For viewpoint change
caused by a moving camera, a motion model is helpful to assist tracking. To handle occlusion problems,
the ability to recover the temporarily missing target is essential. To alleviate the nonrigid deformation,

trackers must learn and update the appearance model in run-time. To accommodate lighting change,
extracting illumination-robust features is critical. To solve the problem of scale change, adaptable and
dynamic target window size is beneficial. To deal with a shaking camera, trackers should measure and
compensate the
Appl. Sci. 2019, 9, global
x FOR PEERmotion.
REVIEW 14 of 16
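The last point can be made concrete. The sketch below (a minimal, translation-only model of our own, not part of any tracker evaluated here) estimates the dominant inter-frame shift with OpenCV's cv2.phaseCorrelate and warps the new frame to cancel it. Note that in an equirectangular video, pure horizontal camera rotation appears as a horizontal wrap-around shift, which cv2.warpAffine does not wrap; a production implementation would roll the image around the 360-degree seam instead.

```python
import cv2
import numpy as np

def compensate_global_motion(prev_gray, next_gray):
    """Estimate the dominant inter-frame translation by phase correlation
    and warp the new frame to cancel it (translation-only sketch)."""
    (dx, dy), _response = cv2.phaseCorrelate(np.float32(prev_gray),
                                             np.float32(next_gray))
    h, w = next_gray.shape
    # Shift the new frame back by the estimated global motion
    m = np.float32([[1, 0, -dx], [0, 1, -dy]])
    return cv2.warpAffine(next_gray, m, (w, h))
```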
Figure 14. A typical example of 360-degree videos in an equirectangular format with several
characteristics: seashore deformation, island scale change, building occlusion, mountain summit
lighting change.
An inherent problem of 360-degree videos in an equirectangular format is the image distortion,
especially in the northern and southern polar areas of a panoramic image. The distortion problem
affects tracking performance in two aspects. First, the motion of the tracking target is distorted by
the equirectangular projection [22]. In video sequence 6, a boat moving in a straight line looks like
it is moving along a curve in the 360-degree video in an equirectangular format. The distortion of
the trajectory degrades the accuracy of all trackers, especially for fast-moving targets. Nevertheless,
the GOTURN tracker can handle this problem properly as long as its pretraining includes training
samples with distorted motions. Second, the tracking target suffers from nonrigid deformations in
an equirectangular format. In video sequence 2, the deformation of the target building makes straight
lines become curves, which tends to cause a track loss for trackers using a static target appearance
model. Surprisingly, most trackers survive the slow target deformation in this case except the TLD
tracker: the deformed target triggers frequent reinitialization in the TLD modules, and the target
appearance is prone to drift in the presence of occasional occlusions. In video sequence 3, the tracking
target is subject to both motion distortion and target deformation. As a result, most trackers are
unstable and achieve low tracking accuracy in this case.
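The geometry behind the first aspect can be made explicit. The following sketch (our illustration, following the constant-spacing convention of the equirectangular projection described in Section 1) maps a pixel to spherical coordinates and then to a 3D unit vector on the viewing sphere, where the target's motion is free of the projection distortion.

```python
import numpy as np

def pixel_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to (longitude, latitude) in radians;
    columns span [-pi, pi) and rows span [pi/2, -pi/2]."""
    lon = (u / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - v / height) * np.pi
    return lon, lat

def sphere_to_vector(lon, lat):
    """Convert (longitude, latitude) to a 3D unit vector on the viewing sphere,
    where straight-line motion stays straight."""
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])
```

A target moving along a straight 3D line sweeps a great circle on this sphere; resampling that great circle back to (u, v) yields the curved image trajectory seen in video sequence 6, which is why reprojecting the target's neighborhood to a local perspective view can linearize the motion before tracking.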
5. Conclusions

The problems of viewpoint change, occlusion, deformation, lighting change, scale change, and
shakiness occur frequently in 360-degree videos. According to our experiments with the maximum
tolerated threshold, the CSRT achieves the best overall accuracy and is the most robust tracker on
360-degree videos. Alternatively, the MOSSE is the most efficient tracker in terms of speed. For future
work, developing an ideal tracker that can deal with all of these problems in 360-degree videos is
crucial. We believe that a multimodal fusion is beneficial for combining the abilities of failure recovery,
robustness, and adaptable target size for online tracking on 360-degree videos. A Kalman filter can
also be applied for better prediction and stabilization of unknown object tracking in 360-degree videos.
Our dataset containing nine 360-degree videos with ground truth is accessible through the link at the
end of the paper. It can be utilized as a benchmark for future research.
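As a hint of the Kalman filtering suggested above, a minimal constant-velocity sketch over the target center using OpenCV's cv2.KalmanFilter is shown below; the noise covariances are illustrative assumptions that would need tuning per sequence.

```python
import cv2
import numpy as np

def make_position_kalman(dt=1.0):
    """Constant-velocity Kalman filter over the target center (x, y)."""
    kf = cv2.KalmanFilter(4, 2)  # state: [x, y, vx, vy], measurement: [x, y]
    kf.transitionMatrix = np.array([[1, 0, dt, 0],
                                    [0, 1, 0, dt],
                                    [0, 0, 1,  0],
                                    [0, 0, 0,  1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = 1e-3 * np.eye(4, dtype=np.float32)
    kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)
    return kf

# Per frame: predict first, then correct with the tracker's box center when
# tracking succeeds; on failure, the prediction alone bridges the gap:
#   prediction = kf.predict()
#   kf.correct(np.float32([[cx], [cy]]))
```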
Supplementary Materials: The dataset is available online at: https://drive.google.com/open?id=1Ybp0G6yWXYCsP06nzEMRJR-exK0DSos8.
Author Contributions: Conceptualization, M.-T.Y.; Methodology, T.-W.M.; Software, T.-W.M.; Validation, T.-W.M.;
Formal analysis, T.-W.M.; Investigation, M.-T.Y.; Resources, M.-T.Y.; Data curation, T.-W.M.; Writing—original
draft preparation, T.-W.M.; Writing—review and editing, M.-T.Y.; Visualization, T.-W.M.; Supervision, M.-T.Y.;
Project administration, M.-T.Y.; Funding acquisition, M.-T.Y.
Acknowledgments: This research is partially funded by the Ministry of Science and Technology (MOST) under
contracts 107-2221-E-259-019-MY2, 107-2511-H-259-004-MY2, 108-2218-E-259-001, 108-2634-F-259-001 through
Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Zia, O.; Kim, J.; Han, K.; Lee, J.W. 360° Panorama Generation using Drone Mounted Fisheye Cameras.
In Proceedings of the IEEE International Conference on Consumer Electronics, Las Vegas, NV, USA,
11–13 January 2019; pp. 1–3.
2. Kelling, C.; Väätäjä, H.; Kauhanen, O. Impact of device, context of use, and content on viewing experience
of 360-degree tourism video. In Proceedings of the International Conference on Mobile and Ubiquitous
Multimedia, New York, NY, USA, 26–29 November 2017.
3. Afzal, S.; Chen, J.; Ramakrishnan, K.K. Characterization of 360-degree Videos. In Proceedings of the ACM
SIGCOMM 2017 Workshop on Virtual Reality and Augmented Reality Network, Los Angeles, CA, USA,
25 August 2017.
4. Boonsuk, W. Evaluation of Desktop Interface Displays for 360-Degree Video. Master’s Thesis, Iowa State
University, Ames, IA, USA, 2011.
5. Cai, C.; Liang, X.; Wang, B.; Cui, Y.; Yan, Y. A Target Tracking Method Based on KCF for Omnidirectional
Vision. In Proceedings of the 37th Chinese Control Conference, Wuhan, China, 25–27 July 2018; pp. 2674–2679.
6. Delforouzi, A.; Tabatabaei, S.; Shirahama, K.; Grzegorzek, M. Unknown object tracking in 360-degree
camera images. In Proceedings of the International Conference on Pattern Recognition, Cancun, Mexico,
4–8 December 2016.
7. Neng, L.; Chambel, T. Get around 360 hypervideo: Its design and evaluation. Int. J. Ambient Comput. Intell.
2012, 4, 40–57. [CrossRef]
8. Berning, M.; Yonezawa, T.; Riedel, T.; Nakazawa, J.; Beigl, M.; Tokuda, H. pARnorama: 360 degree interactive
video for augmented reality prototyping. In Proceedings of the ACM International Joint Conference on
Pervasive and Ubiquitous Computing, Zurich, Switzerland, 8–12 September 2013; pp. 1471–1474.
9. Rupp, M.; Kozachuk, J.; Michaelis, J.; Odette, K.; Smither, J.; McConnell, D. The effects of immersiveness
and future VR expectations on subjective experiences during an educational 360 video. In Proceedings of
the International Annual Meeting of the Human Factors and Ergonomics Society, Washington, DC, USA,
19–23 September 2016; pp. 2108–2112.
10. Pakkanen, T.; Hakulinen, J.; Jokela, T.; Rakkolainen, I.; Kangas, J.; Piippo, P.; Raisamo, R.; Salmimaa, M.
Interaction with WebVR 360 video player: Comparing three interaction paradigms. In Proceedings of the
IEEE Virtual Reality, Los Angeles, CA, USA, 18–22 March 2017; pp. 279–280.
11. Huang, H.; Solah, M.; Li, D.; Yu, L. Audible Panorama: Automatic Spatial Audio Generation for Panorama
Imagery. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Glasgow, UK,
4–9 May 2019.
12. Mousas, C. Performance-driven dance motion control of a virtual partner character. In Proceedings of
the IEEE Conference on Virtual Reality and 3D User Interfaces, Reutlingen, Germany, 18–22 March 2018;
pp. 57–64.
13. Grabner, H.; Grabner, M.; Bischof, H. Real-Time Tracking via On-line Boosting. In Proceedings of the British
Machine Vision Conference, Edinburgh, UK, 4–7 September 2006.
14. Babenko, B.; Yang, M.; Belongie, S.J. Visual tracking with online Multiple Instance Learning. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009;
pp. 983–990.
15. Kalal, Z.; Mikolajczyk, K.; Matas, J. Forward-Backward Error: Automatic Detection of Tracking Failures.
In Proceedings of the International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010;
pp. 2756–2759.
16. Bolme, D.S.; Beveridge, J.R.; Draper, B.A.; Lui, Y.M. Visual object tracking using adaptive correlation filters.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA,
USA, 13–18 June 2010; pp. 2544–2550.
17. Kalal, Z.; Mikolajczyk, K.; Matas, J. Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012,
34, 1409–1422. [CrossRef] [PubMed]
18. Henriques, J.F.; Caseiro, R.; Martins, P.; Batista, J. High-Speed Tracking with Kernelized Correlation Filters.
IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 583–596. [CrossRef] [PubMed]
19. Held, D.; Thrun, S.; Savarese, S. Learning to Track at 100 FPS with Deep Regression Networks. In Proceedings
of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016.
20. Lukezic, A.; Vojír, T.; Zajc, L.C.; Matas, J.; Kristan, M. Discriminative Correlation Filter Tracker with Channel
and Spatial Reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Honolulu, HI, USA, 21–26 July 2017; pp. 4847–4856.
21. Wang, P.; Zhang, L.; Wu, Z.; Xu, R. Research on video coding parameters affecting object tracking.
In Proceedings of the International Conference on Wireless Communications & Signal Processing, Nanjing,
China, 15–17 October 2015.
22. Sacht, L.; Carvalho, P.; Velho, L.; Gattass, M. Face and Straight Line Detection in Equirectangular Images.
In Proceedings of the Workshop de Visão Computacional, São Paulo, Brazil, 4–7 July 2010.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).