
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 22, NO. 11, NOVEMBER 2021

Multi-Stream Siamese and Faster Region-Based Neural Network for Real-Time Object Tracking

Yi Liu, Liming Zhang, Member, IEEE, Zhihui Chen, Yan Yan, Member, IEEE, and Hanzi Wang, Senior Member, IEEE

Abstract— Object tracking is a challenging task in computer vision based intelligent transportation systems. Recently, Siamese based object tracking methods have attracted significant attention due to their highly efficient performance. These tracking methods usually train a Siamese network to match the initial target patch of the first frame with candidates in a new frame. In these methods, the offline training of the deep neural network and the online instance searching are effectively combined. However, these methods usually do not include template update or object re-identification, which easily results in the drift problem. In this paper, we propose a novel real-time object tracking method to overcome the above problems by effectively combining a multi-stream Siamese network and a region-based convolutional neural network. Specifically, a novel multi-stream Siamese network is proposed to search for the target and update the instance template in a new frame. In addition, a faster region-based convolutional neural network detector is used to perform object re-identification in order to improve the tracking performance by making full use of the object category information. These two networks are tightly coupled to ensure that the proposed tracking method has high efficiency and strong discriminative capability. Experimental results on several object tracking benchmarks show that our tracking method can effectively track vehicles and pedestrians in video sequences by exploiting the object category information. The proposed tracking method achieves real-time operation and outperforms several other state-of-the-art methods.

Index Terms— Object tracking, instance search, multi-stream Siamese network, Faster R-CNN.

Manuscript received November 26, 2019; revised April 23, 2020; accepted June 3, 2020. Date of publication July 28, 2020; date of current version November 1, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant U1605252 and Grant 61872307, in part by the National Key Research and Development Program of China under Grant 2017YFB1302400, in part by the Science and Technology Development Fund of Macao SAR under Grant FDCT 079/2016/A2, and in part by the University of Macau Research Grant under Grant MYRG2018-00111-FST. The Associate Editor for this article was W. Lin. (Corresponding author: Hanzi Wang.)

Yi Liu, Zhihui Chen, Yan Yan, and Hanzi Wang are with the Fujian Key Laboratory of Sensing and Computing for Smart City, School of Informatics, Xiamen University, Xiamen 361005, China (e-mail: liuyi_xmu@163.com; zhihui-chen@foxmail.com; yanyan@xmu.edu.cn; wang.hanzi@gmail.com).

Liming Zhang is with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Taipa, Macau (e-mail: lmzhang@um.edu.mo).

Digital Object Identifier 10.1109/TITS.2020.3006927

1558-0016 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

OBJECT tracking is one of the fundamental problems in computer vision. It has a variety of applications, such as video surveillance, human-computer interaction, behavior analysis, and intelligent transportation systems. For autonomous driving applications, a car needs to track the objects in the scene and estimate where and how they will move. Although numerous tracking methods have been proposed in recent years, this task is still challenging since the scenarios can be highly complex and the appearance of the object may change significantly due to unpredictable factors (e.g., fast motion, occlusion, and in/out-of-plane rotation) during tracking.

To deal with target appearance changes, traditional tracking methods, including MIL [1], PST [2], TLD [3], CREST [4], DSST [5], Struck [6], and QKCF [7], usually learn hand-crafted feature representations of a tracked object during the online tracking process and use the learned features to track the target. However, these tracking methods are trained online without taking advantage of the large number of video sequences available offline. Note that the offline video sequences involve various scenarios, and they can be used for learning robust feature representations of objects to handle complex challenges in the object tracking task.

Recently, convolutional neural networks (CNNs) have been adopted to learn robust feature representations of objects from large-scale datasets [8], and they have achieved significant progress on various computer vision tasks, including face recognition [9], [10], image annotation [11], [12], object classification [13], [14] and object detection [15], [16]. Motivated by the great success of CNNs, several recent works [17], [18] formulate the object tracking task as a binary classification problem, which aims to separate the tracked object from the background using CNNs. These methods use pre-trained CNN models and exploit the robust feature representation capability of CNNs for the classification task. However, these pre-trained CNN models may not be optimal for the object tracking task because object tracking is not only a binary classification problem but also an instance-level problem.

Different from the previous classification based tracking methods, some methods (e.g., [19], [20]) treat object tracking as an instance search problem. These methods learn a fully-convolutional deep Siamese network for object tracking. The existing Siamese instance search based tracking methods have the following limitations. Firstly, the single Siamese instance search network does not consider online template update. As a result, these tracking methods may easily drift when the object appearance changes significantly, especially for visual surveillance in traffic scenes, since only the object state in the first frame is used as the template


in the Siamese network [20]. Secondly, the search region in these tracking methods is too small, which may result in low accuracy. However, the computational complexity will increase if the search region is enlarged. It is difficult to achieve a trade-off between running speed and tracking accuracy. Therefore, there are two main issues to be addressed for these Siamese instance search based tracking methods, namely, how to update the template in an effective way, and how to perform object re-identification in a larger search region.

Fig. 1. Tracking results obtained by SiamFC and our method on three challenging videos.

To this end, we propose a novel real-time object tracking method that effectively combines a multi-stream Siamese network and a region-based convolutional neural network, namely MSRT. The proposed method can implement both the object template update and object re-identification tasks. Firstly, we propose a multi-stream Siamese network to efficiently search for the target and update the instance template during the tracking process. Secondly, we train a metric learning based model (consisting of several distance functions) to determine whether the tracked target is lost or not. When the tracked target is lost, a detector built on a Faster Region-based Convolutional Neural Network (Faster R-CNN) is triggered to perform object re-identification by using the object category information, which leads to significant performance improvements.

Since SiamFC [19] is the baseline of our proposed MSRT, we illustrate some tracking results obtained by SiamFC [19] and our method on different videos in Fig. 1. The results show that learning the multi-stream deep Siamese network with the R-CNN detector, instead of using a single Siamese search network, can achieve superior accuracy and real-time performance at the same time.

The main contributions of this paper can be summarized as follows:
• A new multi-stream deep Siamese network for real-time object tracking is proposed. Unlike previous Siamese based tracking methods, which cannot effectively deal with significant target appearance changes, the proposed tracking method can effectively alleviate this problem by introducing the online update of the instance template into a multi-stream Siamese network.
• An object re-identification algorithm, based on both the metric learning based model and the Faster R-CNN detector, is also proposed. As a result, the proposed tracking method can easily re-locate the lost target by using the object category information (e.g., vehicles or pedestrians).

Experimental results demonstrate that our tracking method outperforms several state-of-the-art tracking methods on three public object tracking benchmarks. In particular, it can effectively deal with the challenges of out-of-view and occlusion. The rest of this paper is organized as follows. Related work for visual tracking is reviewed in Section II. The implementation details of the proposed method are introduced in Section III. Extensive experiments and the analysis are presented in Section IV, followed by the conclusion in Section V.

II. RELATED WORK

Deep learning [11], [13], [21], [22] has recently been used in the object tracking task [24], [26], [27]. In general, most deep learning based tracking methods are based on deep classification, and their running speeds are slow. In this section, we briefly review the tracking methods closely related to our work, including deep classification based tracking methods, Siamese instance search based tracking methods and real-time tracking methods.

A. Deep Classification Based Tracking Methods

A number of deep classification based tracking methods have been developed in the literature [4], [17], [24]–[26], [28], [29]. These methods aim to extract target feature representations that are robust to target appearance changes by classifying the target candidates in each frame. HCF [26] uses the convolutional feature maps generated by a CNN (e.g., AlexNet [13] or VGG-Net [22]) to encode the target appearance variations. CREST [4] integrates feature extraction, response map generation and model update into the neural network in an end-to-end training manner. The methods in [23]–[25], [28], [29] also focus on the deep features, which are utilized to represent the target appearance. The disadvantage of these methods is that they suffer from heavy computational costs because the neural networks used to evaluate a large number of candidate objects are complex. For example, the multi-domain tracking method (MDNet) [17], which is one of the most accurate tracking methods, can only be implemented at 1 fps on a high-end GPU. SANet [18] is also a highly accurate tracking method based on a recurrent neural network, and it is much slower than MDNet. These tracking methods are difficult to use in real-time applications.

B. Siamese Instance Search Based Tracking Methods

The Siamese instance search based tracking methods (e.g., [19], [20], [30]–[32]) select the target from a search region by using a Siamese network trained offline on a large-scale dataset. The major part of a Siamese network is a

two-branch CNN. The two branches share the same parameters and measure the similarity between the target object and the target candidates cropped from a search region in each frame. SiamFC [19] uses a fully convolutional network to perform offline training and online object searching together, which can be implemented in real time. SINT [20] uses a Siamese network that contains a fully connected layer to learn a generic matching function for tracking. However, SINT is much slower (about 2 fps) than SiamFC (about 80 fps), because the fully connected layer in SINT contains many parameters (more than 4000). Based on [19] and [20], some effective tracking methods (such as [30], [31], and [32]) have been proposed. However, they only work in restricted environments due to the lack of online update and re-identification. In this paper, we propose to learn a multi-stream Siamese network by introducing an online update branch and an adaptive metric-based re-identification detector to the Siamese network. Based on these improvements, our method outperforms several other state-of-the-art tracking methods on challenging videos.

C. Real-Time Tracking Methods

The running speed of a tracking method is critical to many practical applications, such as intelligent transportation systems, video surveillance, motion analysis, and so on [33], [34]. The recently proposed real-time tracking methods can be divided into two categories. Most of the tracking methods are in the first category, which is based on correlation filters. This category of tracking methods can be efficiently operated in the frequency domain. For example, Bolme et al. [35] propose a new correlation filter based tracking method via learning the minimum output sum of squared error (MOSSE) filter, and this method can achieve a high running speed. Henriques et al. [36] introduce the kernel function into the correlation filter and propose the kernelized correlation filter (KCF) tracking method. Later, SRDCF [37] and Staple [38] are respectively proposed to improve the performance of the correlation filters by solving the boundary effects and extracting more effective features. More efforts on improving correlation filter based tracking methods can be found in [40]–[43]. Moreover, there are a few real-time tracking methods proposed in the literature based on deep features (e.g., Re3 [44] and GOTURN [45]). However, these tracking methods lack online adaptability. In this paper, we propose a deep Siamese instance search based tracking method with online update and re-identification to achieve a good tradeoff between running speed and tracking accuracy.

III. THE PROPOSED METHOD

In this section, we present the proposed real-time object tracking method, which combines a multi-stream Siamese network and a region-based convolutional neural network (MSRT). We firstly introduce the pipeline of the proposed MSRT (see Fig. 2 for an illustration). Then we describe the details of the multi-stream Siamese network and the region-based convolutional neural network, respectively. Finally, we show in detail how to use MSRT to achieve highly accurate and real-time online tracking.

Fig. 2. Pipeline of the proposed tracking method. The proposed tracking method consists of two modules: one is a multi-stream instance search module, and the other is a re-identification and location module. These two modules are separately trained.

A. Pipeline of the Proposed MSRT

MSRT contains two modules: a multi-stream instance search module (termed MS) and a re-identification and location module (termed RL). MS is a multi-stream Siamese network. The inputs of MS consist of two templates and one search region. One template I1 is cropped from the bounding box of the target in the first frame. Meanwhile, the other template I2 is cropped from a confident bounding box in the frame U. In particular, the frame U is selected, from the frames between the first frame and the current frame, as the frame with the highest confidence value based on the Mahalanobis distance. In each frame, MS aims to search for the target in the search region χ by using the two templates. Finally, MS outputs two response maps, and it generates a large number of target candidates according to the response maps.

RL consists of a metric learning (ML) based model and a Faster R-CNN detector. It is employed to re-identify and locate the target via the following steps. At first, the metric learning based model in RL uses the target candidates generated by MS as the inputs, and it outputs the estimated tracking state in the current frame. The outputs contain two types of tracking states: the target is tracked (Ttrack) or the target is lost (Tlost). When the target is tracked, RL locates the target by using the response maps generated by MS, and the template I2 is not updated.

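The interplay between the two tracking states just introduced can be sketched as a minimal, runnable control loop. This is only an illustration of the MS-then-RL flow described above; the feature embedding, the dissimilarity measure, the threshold value, and all function names are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

# Toy sketch of the per-frame MS + RL control flow.
# embed(), dissimilarity(), redetect() and LOST_THRESHOLD are
# hypothetical placeholders, not the paper's actual components.

LOST_THRESHOLD = 0.5

def embed(patch):
    # stand-in for the shared convolutional feature extractor
    v = patch.astype(float).ravel()
    return v / (np.linalg.norm(v) + 1e-9)

def dissimilarity(a, b):
    # stand-in for the Mahalanobis-based MLM score (here: Euclidean)
    return float(np.linalg.norm(embed(a) - embed(b)))

def redetect(frame):
    # stand-in for the Faster R-CNN re-identification step
    return frame

def track_frame(search_region, template1, template2):
    # MS would produce candidates from two response maps; here the
    # search region itself plays the role of the single candidate.
    score = (dissimilarity(search_region, template1)
             + dissimilarity(search_region, template2))
    if score > LOST_THRESHOLD:           # state T_lost: trigger re-identification
        target = redetect(search_region)
        template2 = target               # I2 is updated after re-identification
        state = "lost"
    else:                                # state T_track: locate via response maps
        target = search_region
        state = "tracked"
    return state, target, template2

patch = np.ones((4, 4))
state, target, template2 = track_frame(patch, patch, patch)
print(state)  # identical patches -> zero dissimilarity -> "tracked"
```

The point of the sketch is the asymmetry between the two states: only the lost branch touches the detector and the template, which is what keeps the per-frame cost low in the tracked case.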

When the target is lost, the pre-trained Faster R-CNN detector re-identifies the target in the current frame by exploiting the object category information. Meanwhile, the template I2 is updated by RL after re-identification. In particular, the category information in the first frame is identified by using the pre-trained Faster R-CNN detector. The details of MS and RL are described in the next two subsections.

B. MS: Multi-Stream Instance Search Module

To track objects at a fast speed, we design MS as a fully-convolutional neural network that includes several streams sharing the same parameters with one another.

Fig. 3. An illustration of the proposed multi-stream Siamese network. The proposed network includes three streams sharing the same parameters with one another.

As shown in Fig. 2, the inputs of MS include two templates (i.e., I1 and I2) and a search region (χ), and the outputs of MS are two response maps and a large number of target candidates. To avoid confusion, the input of the first stream is the target template I1, which is marked in the first frame. The second stream takes the search region χ in the current frame as its input. The input of the third stream is the second template I2, which is selected by RL from the target candidates based on the Mahalanobis distance. MS extracts the appearance features from the three inputs and uses these features to generate the two response maps, as shown in Fig. 2. MS aims to find the target in the search region χ in each frame by using the two templates (i.e., I1 and I2). The extracted appearance features are denoted as ϕ(·). The two generated response maps Rm(·, ·) can be written as follows:

Rm(I1, χ) = corr(ϕ(I1), ϕ(χ)),
Rm(I2, χ) = corr(ϕ(I2), ϕ(χ)), (1)

where corr(·, ·) is the correlation operation. Finally, MS outputs a large number of target candidates based on the two response maps in the frame. These target candidates are cropped from the regions with high response scores in the two response maps during tracking.

In this paper, we train MS using the following steps. At first, we train the first stream and the second stream with a discriminative loss, as shown in Fig. 3. Different from SiamFC [19] (which uses the logistic loss), we adopt the contrastive loss [46] for MS because the contrastive loss is more effective for the tracking task. The contrastive loss is defined as follows:

ł(y[i], ρ[i]) = y[i] · [max(Smargin − ρ[i], 0)]² + (1 − y[i]) · ρ[i]², (2)

where Smargin is a distance threshold, and ρ[i] = ||xI[i] − xC[i]|| is the Euclidean distance between the two feature vectors (xI[i] and xC[i]) extracted from a template-candidate pair, which comprises a template and a candidate cropped from a larger search region. y[i] ∈ {1, 0} is the ground-truth label provided in the training data. After training these two streams, the third stream uses the same parameters as the first stream.

We train MS by using a set of template-region pairs, which are cropped from the annotated video sequences in the ImageNet [8] dataset. To ensure that all the search regions have the same size, the missing areas are filled with the mean RGB value of the image using the strategy of [19]. By randomly selecting target candidates in the search region, a template-region pair can effectively generate N template-candidate pairs. Fig. 4 illustrates examples of the training template-region pairs extracted from three video sequences.

Fig. 4. Examples of training pairs extracted from three video sequences. The first row represents the instance templates and the second row represents the search regions from the same video sequences.

Finally, we define the overall contrastive loss as

Ł(y, ρ) = (1/N) Σ_{i=1}^{N} ł(y[i], ρ[i]). (3)

The proposed MS is a fully-convolutional network without any fully-connected layers, so it has fewer parameters than deep networks with fully-connected layers. Thus, MS runs faster than deep networks with fully-connected layers, and it is able to achieve real-time performance.

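For concreteness, the correlation of Eq. (1) and the contrastive loss of Eqs. (2) and (3) can be sketched numerically. The feature-map sizes, the margin value, and the example distances below are illustrative assumptions, not the paper's actual dimensions or hyper-parameters.

```python
import numpy as np

# Toy sketch of Eq. (1) (correlation response map) and
# Eqs. (2)-(3) (contrastive loss, in the form printed in the paper).

def response_map(template_feat, search_feat):
    # Eq. (1): slide the template feature over the search feature
    # (valid cross-correlation, computed naively for clarity)
    th, tw = template_feat.shape
    sh, sw = search_feat.shape
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template_feat * search_feat[i:i + th, j:j + tw])
    return out

def contrastive_loss(y, rho, margin=1.0):
    # Eq. (2) per pair, averaged over N pairs as in Eq. (3)
    per_pair = y * np.maximum(margin - rho, 0.0) ** 2 + (1 - y) * rho ** 2
    return per_pair.mean()

rng = np.random.default_rng(0)
phi_t = rng.standard_normal((3, 3))   # phi(I1): template features
phi_x = rng.standard_normal((8, 8))   # phi(chi): search-region features
R = response_map(phi_t, phi_x)
print(R.shape)                         # (6, 6)

y = np.array([1.0, 0.0])               # labels for two template-candidate pairs
rho = np.array([0.2, 1.5])             # their Euclidean feature distances
print(round(contrastive_loss(y, rho), 3))
```

In practice the correlation is of course computed by the network as a batched convolution rather than an explicit double loop; the loop form is only meant to make the sliding-window structure of Eq. (1) explicit.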

C. RL: Re-Identification and Location Module

The RL module is designed for re-identification, and it is an important module of the tracking method, especially for long-term tracking. The RL module consists of a metric learning based model and a Faster R-CNN detector. In this paper, RL is employed to re-identify and locate the target, and it formulates the re-identification and location problem as a Markov Decision Process (MDP). At first, the inputs of RL are the two response maps and the large number of target candidates generated by MS. Then, the metric learning based model in RL uses the target candidates to estimate the tracking state (i.e., Ttrack or Tlost) of MSRT in the current frame. Finally, RL outputs the location of the tracked target.

Next, we introduce the details of the main components used in the proposed MSRT.

1) Metric Learning Based Model: To estimate the tracking state of MSRT, we propose a model based on metric learning, namely MLM, which consists of two Mahalanobis distance functions. The inputs of MLM are the candidates generated by MS, and the output is the tracking state of MSRT. MLM estimates the tracking state by calculating the dissimilarity scores between the target candidates and the two templates. If the sum of the dissimilarity scores is higher than a threshold, MLM outputs the tracking state Tlost (i.e., the target is lost) and sends a re-identification request to the Faster R-CNN detector. Otherwise, MLM outputs the tracking state Ttrack (i.e., the target is tracked) and locates the target in the current frame through the two response maps generated by MS.

The dissimilarity function ψ(Ci, Cj) between two cropped images Ci and Cj measures the distance between the two feature vectors xCi and xCj extracted from Ci and Cj, respectively. It is defined as follows:

ψ(Ci, Cj) = Φ(xCi, xCj) = (xCi − xCj)^T M (xCi − xCj), (4)

where Φ(·, ·) yields a dissimilarity score between the two vectors xCi and xCj. M is a metric matrix, and it is usually learned in two steps: dimensionality reduction (e.g., by using principal component analysis (PCA)) is first applied to xCi and xCj, and then metric learning (e.g., the KISS metric [47]) is performed.

Köstinger et al. [47] propose an effective and efficient method to learn a Mahalanobis distance. We follow this method to learn the metric matrix M. Given a pair of cropped images Ci and Cj, we define the pairwise difference as xij = xCi − xCj. Note that we partition the training data into xij+ and xij−: xij+ represents that Ci and Cj are cropped images containing the same object, while xij− represents that Ci and Cj are cropped images containing different objects. The metric matrix M is defined as follows:

M = (Σ (xij+)(xij+)^T)^(−1) − (Σ (xij−)(xij−)^T)^(−1). (5)

In this paper, we use two metric matrices MA and MB, where MA is used for calculating the dissimilarity score between one target candidate generated by MS and the template I1, and MB is used for calculating the dissimilarity score between one target candidate generated by the first response map and the other target candidate generated by the second response map. These two metric matrices are calculated separately. We train the two metric matrices MA and MB with a large number of pairs of cropped images via Eq. (5). We finally define our dissimilarity function Φ(·, ·) as

Φ(xCi, xCj) = (xCi − xCj)^T MA (xCi − xCj) + λ (xCi − xCj)^T MB (xCi − xCj), (6)

where λ is a constant coefficient.

2) Faster R-CNN Based Re-Identification Algorithm: In recently proposed tracking methods, the object appearance model is usually learned without considering the object category information, even though this information is important for object tracking, especially for long-term object tracking. To effectively exploit the object category information, we propose an object re-identification algorithm, namely the Faster R-CNN based re-identification algorithm.

The Faster R-CNN based re-identification algorithm consists of two steps: the initialization of the Faster R-CNN detector and the re-identification of the target in the frames where the target is lost. For the first frame, the bounding box of the specified target is used as the initial input of the Faster R-CNN detector for its initialization. For the following frames, the Faster R-CNN detector is used to re-identify the lost target by using the category information of the target object when the target is lost. Firstly, the cropped templates and the search region in the current frame are used as the inputs of MS. MS generates two response maps and a number of target candidates, which are further used as the inputs of MLM. Secondly, MLM outputs the tracking state (i.e., Ttrack or Tlost). When the tracking state is Tlost, the Faster R-CNN detector is activated and used to re-identify the target by using the object category information. When the tracking state is Ttrack, RL locates the target by using the two response maps generated by MS. Instead of re-identifying the object in each frame, we re-identify the object only in the frames where the target is lost, which reduces the computational costs.

In particular, we use the same network architecture as in [48] to estimate the output by using the softmax function. Max-pooling layers and ReLU layers are employed after the convolutional layers. The training data of the network include 20 object categories, which cover most object categories (e.g., vehicles and pedestrians) in the tracking dataset.

3) High-Confidence Template Update: Most recent tracking methods [50]–[54] update the templates at each frame without considering the tracking state. In fact, incorrect tracking results (due to the target being severely occluded or completely lost) may result in inaccurate template updates, which can further lead to tracking failures.

For a frame in which the estimated tracking state of MSRT is Ttrack, both the shape of the response maps and the score of the softmax function in the Faster R-CNN detector can be used to indicate, to some extent, the estimated confidence degree of the tracking state. The ideal response maps should have only one sharp peak and be smooth in all other areas when the detected target is matched to the templates. We only use the most confident target candidate (which is selected based on MLM) to update the template I2.

We update the template I2 when the response maps have sharper peaks and less noise. We transform the response maps Rm(I1, χ) and Rm(I2, χ) into matrices S1 and S2, respectively. Then, we define the maximum response scores of the two response maps (which correspond to the first frame and the U-th frame) as Smax1 and Smax2, respectively.

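The metric-learning components of Eqs. (4)-(6) can be sketched with synthetic data. Everything below is illustrative: the feature dimensionality, the sampled differences, and the value of λ are assumptions, and, following the KISS-style formulation of [47], each sum in Eq. (5) is normalized by the number of pairs, which only rescales M.

```python
import numpy as np

# Toy sketch of Eq. (5) (KISS-style metric matrix from same-object and
# different-object difference vectors) and Eq. (6) (combined dissimilarity
# with two metric matrices M_A and M_B).

rng = np.random.default_rng(1)

def kiss_metric(diffs_same, diffs_diff):
    # Eq. (5), with each sum of outer products normalized by N
    # (a uniform rescaling of M that does not change score ordering)
    cov_same = diffs_same.T @ diffs_same / len(diffs_same)
    cov_diff = diffs_diff.T @ diffs_diff / len(diffs_diff)
    return np.linalg.inv(cov_same) - np.linalg.inv(cov_diff)

def dissimilarity(x_i, x_j, M_A, M_B, lam=0.5):
    # Eq. (6): weighted sum of two Mahalanobis terms; lam is illustrative
    d = x_i - x_j
    return float(d @ M_A @ d + lam * (d @ M_B @ d))

dim = 4
same = rng.standard_normal((200, dim)) * 0.1  # small differences: same object
diff = rng.standard_normal((200, dim))        # large differences: different objects
M_A = kiss_metric(same, diff)
M_B = kiss_metric(same, diff)

x = rng.standard_normal(dim)
close, far = x + 0.01, x + 2.0
print(dissimilarity(x, close, M_A, M_B) < dissimilarity(x, far, M_A, M_B))
```

Since the same-object differences are much smaller than the different-object ones, the learned M weights deviations heavily, so a candidate close to the template scores a far lower dissimilarity than a distant one.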

We measure the distance σ between the two response maps as follows:

σ = ( Σ_{i=1}^{n∗m} |S1i − S2i|² ) / (n ∗ m), (7)

where S1i and S2i denote the elements of the matrices S1 and S2, respectively, and n ∗ m is the size of the matrices. When the estimated tracking state is Ttrack, and the response maps have sharp peaks (which means that the Smax1 and Smax2 values of the response maps are much higher than the average values of the response maps) and low noise (i.e., a low σ value), the template I2 is updated. By using the above template update strategy, we update the template under certain conditions rather than updating it at every frame, which significantly reduces the computational costs.

D. Online Tracking

We first train the multi-stream Siamese network, the Faster R-CNN detector and the metric learning based model offline. Then, we implement the proposed MSRT for online tracking. The whole algorithm of the proposed MSRT is summarized in Algorithm 1.

1) Initialization: At first, we crop a template I1 from the first frame, which is the labeled bounding box centered on the target. The template I1 is used as the input of the proposed MSRT to initialize the multi-stream Siamese network and the Faster R-CNN detector. Meanwhile, a search region χ2, which is centered on the target in the second frame, is cropped. Secondly, the template I1 and the search region χ2 are used as the inputs of the first two streams of MS. After that, MS outputs a number of target candidates. RL takes these target candidates as its input, and it estimates the target location by selecting the most similar target candidate (which has the highest response value). We crop the tracked target in the second frame as the template I2. Finally, the proposed MSRT is well initialized after these steps.

2) Tracking: As shown in Algorithm 1, the templates I1 and I2 are initialized by using the first two frames. At time step t (t ≥ 3), we crop a search region χt from the target region predicted in the previous frame. The templates I1, I2 and the search region χt are used as the inputs of MS to generate two response maps and target candidates. Then these target candidates are input to MLM to estimate the tracking state of MSRT in the current frame. If the target is tracked, the proposed MSRT locates the target by searching the region with the maximum response value in the two response maps. Otherwise, RL is activated to re-identify the target by using the Faster R-CNN detector and then locates the target. During the process of online tracking, the proposed MSRT generates two response scores (i.e., Smax1 and Smax2) for the two response maps and estimates the distance σ between the two response maps after predicting the target location. Then, MSRT updates the template I2 with the most confident tracking result according to these three values (i.e., Smax1, Smax2, and σ).

Algorithm 1 The Proposed MSRT
1: Input:
2:   Image Ft.
3:   Previous target location lt−1.
4: Output:
5:   The estimated target location lt.
/* Initialization of MSRT */
6: if t = 1 then
7:   Crop the bounding box and initialize the first template I1.
8:   Initialize the multi-stream Siamese network.
9:   Initialize the Faster R-CNN detector.
10: end if
11: if t = 2 then
12:   Crop the search region χ2.
13:   Generate the target candidates by using the multi-stream Siamese network.
14:   Estimate the target location lt using Eq. (1).
15:   Initialize the second template I2.
16: end if
/* Online tracking */
17: if t = 3, . . . , T then
18:   Crop the search region χt.
19:   Generate the target candidates by using the multi-stream Siamese network.
20:   Estimate the tracking state of MSRT using Eq. (6).
21:   if the target is tracked then
22:     Estimate the target location lt using Eq. (1).
23:   else if the target is lost then
24:     Re-identify the target and estimate the target location lt by using the Faster R-CNN detector.
25:     Update the template I2.
26:   end if
27: end if

IV. EXPERIMENTS

In this section, the experimental settings are introduced first. Then the comparison results between the proposed tracking method and several state-of-the-art tracking methods on the OTB-2013, OTB-2015 and LaSOT datasets are presented for the performance evaluation.

A. Experimental Settings

1) Evaluation Metrics: In the tracking task, the most common evaluation method is to initialize a tracking method with the bounding box in the first frame and report the average precision or success of all the results. This straightforward method is referred to as one-pass evaluation (OPE). We adopt the precision plot and the success plot of OPE as the evaluation metrics [55], [56]. The precision plot depicts the average center location error (CLE), which is the Euclidean distance between the center of the ground-truth target location and the center of the estimated target region at each frame. The success plot shows the average overlap score (OS), which is the area under the curve (AUC) between the ground-truth target bounding box and the tracked target bounding box of each frame.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on January 30,2023 at 17:51:38 UTC from IEEE Xplore. Restrictions apply.
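The two OPE metrics described under Evaluation Metrics can be made concrete with a small NumPy sketch. The `(x, y, w, h)` box format and the function names below are assumptions for illustration, not the benchmark toolkit's API:

```python
import numpy as np

def center_error(box_a, box_b):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    ca = np.array([box_a[0] + box_a[2] / 2.0, box_a[1] + box_a[3] / 2.0])
    cb = np.array([box_b[0] + box_b[2] / 2.0, box_b[1] + box_b[3] / 2.0])
    return float(np.linalg.norm(ca - cb))

def overlap_score(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_at(gt_boxes, pred_boxes, threshold=20.0):
    """Fraction of frames whose CLE is within `threshold` pixels
    (one point on the precision plot; 20 px is the reported threshold)."""
    errors = [center_error(g, p) for g, p in zip(gt_boxes, pred_boxes)]
    return float(np.mean([e <= threshold for e in errors]))
```

Sweeping `threshold` (or an overlap threshold for `overlap_score`) over a range and averaging produces the precision and success curves whose legend values are reported in the figures.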
LIU et al.: MULTI-STREAM SIAMESE AND FASTER REGION-BASED NEURAL NETWORK 7285

2) Competing Methods: In our experiments, we compare our method with eight other state-of-the-art tracking methods, which can be roughly divided into two categories. The first category includes two baseline Siamese based tracking methods (i.e., SiamFC [19] and SINT [20]). The second one consists of several recently proposed tracking methods that can run faster than 5 fps (including BACF [40], Staple [38], GOTURN [45], Re3 [44], SRDCF [37] and KCF [36]).
3) Implementation Details: Each branch of the multi-stream Siamese network uses the AlexNet-like [13] network, which consists of five convolutional layers. Max pooling is only employed after the first two convolutional layers. In addition, batch normalization [59] is adopted after each convolutional layer. We use the ReLU layer as the non-linear function of the hidden layers. The Faster R-CNN detector uses the same network structure as in [48]. The number of candidates output by the MS module affects the performance of MSRT. In the experiments, when this number is set to 500, MSRT achieves the highest accuracy while maintaining real-time performance. In addition, the proposed MSRT uses k different scales to track the target. The value of k denotes the number of layers in the image pyramid structure, and it affects the tracking accuracy and speed of MSRT. When k is large (such as 5), the accuracy improvement of the proposed method is limited, but its speed is much slower. In this paper, in order to maintain real-time tracking, k is set to 3 to achieve a good tradeoff between the running speed and tracking accuracy.

Fig. 5. The precision and success plots of OPE on the OTB-2013 dataset. The solid lines in different colors represent 9 different tracking methods. The center location errors at a threshold of 20 pixels for precision plots and the average overlap scores for success plots are respectively shown in the legend. (a) Precision plots of OPE. (b) Success plots of OPE.

Fig. 6. The speed and accuracy plots obtained by the state-of-the-art tracking methods on the OTB-2015 [55] dataset. The proposed MSRT achieves the best accuracy among the real-time tracking methods.
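The k-scale search over an image pyramid can be illustrated as below. The multiplicative scale step of 1.025 is an assumed value in the style of SiamFC-like trackers; the paper does not report its step size:

```python
import numpy as np

def scale_factors(k=3, step=1.025):
    """Return k multiplicative scale factors centered on 1.0.

    With k = 3 this yields (1/step, 1.0, step); the tracker resizes the
    search region by each factor and keeps the scale with the highest
    response. `step` is an assumed value, not taken from the paper.
    """
    exponents = np.arange(k) - (k - 1) / 2.0   # e.g. [-1, 0, 1] for k = 3
    return step ** exponents
```

A larger k (e.g., 5) simply widens this set of factors, which explains the speed/accuracy tradeoff noted above: each extra scale costs one more forward pass per frame.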
4) Training: We adopt a 3-step training algorithm for MSRT. Firstly, we train the proposed MS by using the ILSVRC DET dataset [8]. The parameters of the embedding function of MS are estimated by minimizing Eq. (2) with Stochastic Gradient Descent (SGD) using MatConvNet [60], and the threshold score Smargin is experimentally set to 0.5. The training process is performed over 100 epochs, and we set the size of the mini-batches as 8 in each iteration. Secondly, we train the Faster R-CNN, which consists of two components: the RPN and the Fast R-CNN. We use the PASCAL VOC 2010 dataset [49] (which includes 20 types of objects) for training, since the category information in it is sufficient to cover most categories in the tracking dataset. More specifically, we first train the RPN, and then we use the proposals generated by the RPN to train the Fast R-CNN. We train the Fast R-CNN using 500 proposals. Thirdly, we train the metric learning based model MLM of RL with ILSVRC DET [8] and set the threshold of the score of the dissimilarity function in MLM to 0.4.

Fig. 7. The precision and success plots of OPE on the OTB-2015 dataset. The solid lines in different colors represent 9 different tracking methods. The center location errors at a threshold of 20 pixels for precision plots and the average overlap scores for success plots are respectively shown in the legend. (a) Precision plots of OPE. (b) Success plots of OPE.

5) Tracking: We set the size of the templates as 127 × 127, and set the size of the search region as 255 × 255, which is four times as large as the size of the templates. We conduct some experiments to analyze the size of the search region. This size of the search region is generally sufficient to cover the target position in the next frame in practice. When the size of the search region is large enough (i.e., over four times that of the template), increasing the size has little influence on the accuracy, while the computational costs are significantly increased. However, when the target is lost, the proposed tracking method needs a larger search region. This is the reason why we use the Faster R-CNN detector to re-identify the target in the whole image. The proposed MSRT is implemented using MATLAB on a computer equipped with a single NVIDIA GTX 1080 GPU and an Intel 6700K 4.0 GHz CPU. The proposed MSRT runs at 27 frames per second.

B. Experiments on the OTB-2013 Dataset

We compare the proposed MSRT with 8 other tracking methods (including SINT, KCF, SiamFC, BACF, SRDCF, Staple, Re3 and GOTURN) on the OTB-2013 dataset [56]. The evaluation is based on the precision and success plots of one-pass evaluation on the 50 video sequences of the OTB-2013 dataset. As shown in Fig. 5, the proposed MSRT achieves the second best performance in the precision plots of OPE and the best performance in the success plots of OPE. The proposed MSRT performs significantly better than the real-time tracking methods based on deep networks [44], [45].
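The fixed-size template (127 × 127) and search-region (255 × 255) crops described under Tracking can be sketched with a simple centered-crop helper. Zero-padding pixels that fall outside the frame is an assumption about border handling, not a detail stated in the paper:

```python
import numpy as np

def centered_crop(image, center, size):
    """Crop a size x size patch centered at `center` (row, col),
    zero-padding where the window falls outside the image
    (assumed border behavior)."""
    h, w = image.shape[:2]
    half = size // 2
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    y0, x0 = center[0] - half, center[1] - half
    ys, xs = max(0, y0), max(0, x0)
    ye, xe = min(h, y0 + size), min(w, x0 + size)
    out[ys - y0:ye - y0, xs - x0:xe - x0] = image[ys:ye, xs:xe]
    return out

# Template (127 x 127) vs. search region (255 x 255): the search window
# covers roughly four times the template's area (255^2 ~= 4 * 127^2).
```

Calling `centered_crop(frame, predicted_center, 255)` would then produce the search region around the previous prediction, while the same helper with `size=127` yields a template-sized patch.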

Fig. 8. Qualitative evaluation of 7 tracking methods on 7 challenging sequences (from top to bottom: CarScale, Biker1, Basketball, Singer2, Tiger, Coke, Bird1) from the OTB-2015 dataset. The proposed MSRT performs favorably against the 6 state-of-the-art competing tracking methods, including SINT [20], KCF [36], SiamFC [19], BACF [40], SRDCF [37] and Staple [38].

Among the 9 competing tracking methods in Fig. 5, the proposed MSRT (66.0%/87.8%) shows better performance than KCF [36] (51.3%/74.1%), SiamFC [19] (60.6%/80.1%), BACF [40] (61.9%/80.7%), SRDCF [37] (62.6%/83.8%), Staple [38] (60.0%/79.3%), Re3 [44] (31.6%/46.0%) and GOTURN [45] (44.4%/62.0%), and it shows comparable results to SINT [20] (65.5%/88.2%). Although SINT obtains the best performance in the precision plots of OPE, it runs at only 4 fps and achieves a lower average OS than MSRT with regard to the success plots of OPE. These results show the superior effectiveness and efficiency of the proposed tracking method.

Fig. 9. The precision and success plots of OPE on the LaSOT dataset. The solid lines in different colors represent the results obtained by 9 different tracking methods. The center location errors at a threshold of 20 pixels for precision plots and the average overlap scores for success plots are respectively shown in the legend. (a) Precision plots of OPE. (b) Success plots of OPE.

C. Experiments on the OTB-2015 Dataset

The OTB-2015 [55] dataset is an extension of the OTB-2013 dataset and it is more challenging than the OTB-2013 dataset. We also compare the proposed MSRT with the 8 competing tracking methods on all 100 videos of the OTB-2015 dataset. Fig. 6 shows the speed and accuracy plots obtained by the 9 competing tracking methods on the OTB-2015 dataset. The proposed MSRT method

Fig. 10. Attribute based evaluation. The experimental results on 11 different challenging factors (including scale variation, out-of-plane rotation, in-plane rotation, occlusion, deformation, fast motion, illumination variation, background clutter, motion blur, out-of-view and low resolution) are presented. The number of video sequences for each challenging factor is shown in parentheses.

achieves the best accuracy among the 9 competing tracking methods, and it maintains real-time tracking speed. The detailed precision and success plots of OPE obtained by each tracking method are also reported in Fig. 7. MSRT (64.3%/86.2%) shows better performance than the other 8 tracking methods, including KCF (47.5%/69.2%), SiamFC (58.1%/76.9%), BACF (60.1%/79.3%), SRDCF (59.8%/78.9%), Staple (58.2%/78.4%), Re3 (28.9%/40.6%), GOTURN (42.7%/62.0%) and SINT (59.2%/78.6%).

D. Experiments on the LaSOT Dataset

The Large-scale Single Object Tracking (LaSOT) dataset [57] is a large-scale dataset, which consists of 1,400 sequences with more than 3.52 million frames. The average length of the videos in LaSOT is more than 2,500 frames. Some sequences in LaSOT can be used for long-term tracking, and each sequence in this dataset is very challenging. We compare the proposed MSRT with the 8 competing tracking methods (including SiamFC, SINT, BACF, Staple, SRDCF, GOTURN, KCF and LCT [58]) on the LaSOT dataset. Fig. 9 shows the precision and success plots obtained by the 9 competing tracking methods on the LaSOT dataset. As shown in Fig. 9, the proposed MSRT method achieves the best accuracy among the 9 competing tracking methods. The detailed precision and success plots of OPE obtained by each tracking method are also reported in Fig. 9. MSRT (39.9%/35.7%) achieves better performance than the other 8 tracking methods: SiamFC (35.8%/34.1%), SINT (33.9%/29.9%), BACF (27.7%/23.9%), Staple (26.6%/23.1%), SRDCF (27.1%/22.7%), GOTURN (22.5%/17.9%), KCF (21.1%/18.4%) and LCT [58] (24.6%/19.3%). The proposed MSRT is more effective than the 8 competing tracking methods for long-term tracking. Although MSRT achieves only 1% better performance than the baseline SiamFC in terms of the precision, it obtains over 4% improvement in terms of the success. Compared with another Siamese based tracking method (i.e., SINT), MSRT achieves over 5% performance improvement in terms of the precision on the LaSOT dataset. These results show the superior effectiveness and efficiency of the proposed tracking method. In addition, LCT is a well-known long-term tracking method. The proposed MSRT achieves better performance than the LCT method on the LaSOT dataset, because the proposed MSRT takes advantage of both the template update and object re-identification.

E. Qualitative Evaluation and Attribute Analysis

Fig. 8 summarizes the qualitative comparisons of MSRT with 6 competing tracking methods (SINT, KCF, SiamFC, BACF, SRDCF, and Staple) on 7 challenging sequences from OTB-2015. These challenging sequences contain most of the challenging attributes. Overall, the proposed MSRT achieves better results than the other 6 competing methods. For the CarScale sequence, the scale of the target changes significantly and the proposed MSRT achieves the highest accuracy. The Biker1 sequence includes occlusion and fast motion; Staple, KCF and SRDCF lose the target in frame 75. The Tiger sequence includes motion blur and fast motion; all the tracking methods lose the target in frame 150, and the proposed MSRT re-identifies the target after a few

Fig. 11. The overlap score (OS) plots obtained by the 9 tracking methods over 6 challenging video sequences. The solid lines in different colors represent the results obtained by different tracking methods. Larger OS values indicate better performance (best viewed in color).

frames. For the Coke sequence, all the tracking methods fail to track the target when the coke can is in the situation of in-plane rotation, except for the proposed MSRT. For the Singer2 sequence, the appearance of the target changes significantly due to shadows. SiamFC gradually drifts away from the target when the singer turns around. For the Basketball sequence, some tracking methods fail to reliably track the target (i.e., the human) because of the illumination variations. The BACF and the proposed MSRT are able to track the target accurately for this sequence. The Bird1 sequence includes full occlusion and fast motion, which make the tracking task challenging. Only the proposed MSRT performs accurately across the whole sequence.

In Fig. 10, we further analyze the tracking method performance on all 11 video attributes (e.g., illumination variation and out-of-plane rotation) annotated in the OTB-2015 dataset. The results show that the proposed MSRT achieves the best results on the videos annotated with 10 out of the 11 attributes. Moreover, we find that MSRT is more effective at handling the out-of-view (OV) and occlusion (OC) attributes because it can re-identify the lost target by using the metric learning based model (MLM) and the Faster R-CNN detector. When the target is occluded or lost, the two baseline Siamese based tracking methods (i.e., SiamFC and SINT) cannot accurately locate the target. This is because both methods do not include the template update and target re-identification procedures. Fig. 11 shows the frame-by-frame OS plots obtained by all 9 competing tracking methods over six challenging video sequences. The proposed MSRT achieves the best performance among all the competing tracking methods.

However, for the low-resolution (LR) videos, we also note that the performance of the proposed MSRT is worse than that of SiamFC. This is because Siamese based tracking methods track the target by using one template. When the template is a low-resolution template, these tracking methods will lose the target. The proposed MSRT tracks the target by using two low-resolution templates, which makes MSRT lose the target even more frequently.

F. Tracking on Traffic Video Sequences

The OTB-2015 dataset contains 14 traffic video sequences, which include vehicles and pedestrians. Fig. 12 shows the frame-by-frame center location error plots obtained by the 9 competing tracking methods over 9 challenging video sequences. The solid lines in different colors represent the results obtained by different tracking methods. From Fig. 12, we can see that the proposed MSRT achieves lower center location errors than the other competing methods, which demonstrates the robustness and the effectiveness of the proposed MSRT. Table I shows the performance comparison on the 13 traffic video sequences in terms of the average overlap score. From Table I, we can see that the vehicle in the Car1, Car2, Car24, Car4 and CarDark sequences is successfully tracked by 4 tracking methods (i.e., SINT, Staple, BACF, and MSRT), among which SINT, BACF, and MSRT track the target more accurately on these vehicle sequences. In addition, for the human2, human4, human5, human6, human7, human8 and human9 video sequences, BACF, Staple and MSRT perform well. In particular, the human3 sequence is very challenging because it contains five different challenging factors, including scale variation, occlusion, deformation, out-of-plane rotation and background clutter. The proposed MSRT achieves better performance than these competing tracking methods on this sequence. In short, the proposed MSRT achieves the best overall performance among these competing tracking methods on the traffic video sequences.

Fig. 12. The center location error (CLE) plots obtained by the 9 tracking methods over 9 challenging video sequences. The solid lines in different colors represent the results obtained by different tracking methods. Smaller CLE values indicate better performance (best viewed in color).

TABLE I
The Average Center Location Errors at a Threshold of 20 Pixels on 13 Traffic Video Sequences of the OTB-2015 Dataset Obtained by the Proposed MSRT and the Other 8 Tracking Methods. The Best Value Is Highlighted in Bold

G. Ablation Study

In this section, we evaluate two variants of the proposed MSRT. The first one is the multi-stream Siamese network based tracking method (MST), which is the proposed MSRT without the Faster R-CNN detector. The second one is the single Siamese and faster region-based convolutional neural network based tracking method (SRT), which eliminates the third stream of the multi-stream Siamese network from the proposed MSRT. We report the average center location errors at a threshold of 20 pixels for the precision (ACLS20), the average overlap score (AOS) and the speed on two different datasets (i.e., OTB-2015 and LaSOT) in Table II.

1) MSRT vs. MST: Table II shows the tracking results of MST on the OTB-2015 dataset. MST is much faster than MSRT, but it achieves a lower ACLS20 and AOS than MSRT.

TABLE II
The ACLS20, the AOS and the Speed Obtained by the Proposed MSRT, MST and SRT on the OTB-2015 and the LaSOT Datasets

In MSRT, the Faster R-CNN detector aims to re-identify the lost target for long-term tracking. If the tracking method loses the target object, the Faster R-CNN detector can re-identify the target again. Even if the target appearance has significantly changed, the Faster R-CNN detector can still locate it by using the object category information. A reliable Faster R-CNN detector performs more accurately on the out-of-view and occlusion attributes.

2) MSRT vs. SRT: Table II also shows the tracking results of SRT on the OTB-2015 dataset. SRT achieves the lowest ACLS20 and AOS among the three tracking methods, and it is slightly faster than the others. The proposed MSRT uses two template streams to improve the accuracy of the Siamese based tracking method. Different from the existing Siamese based tracking methods, the proposed MSRT uses one stream to update the template online. We test SRT without using the third stream, which is used to update the template. As shown in Table II, SRT obtains a faster running speed because it does not need to consume time for template updating. However, removing the third stream also leads to much less accurate performance, especially for the AOS. Note that the AOS of SRT is even worse than the AOS of SiamFC. These results show that a Siamese based tracking method achieves better performance in terms of the ACLS20 by using the Faster R-CNN detector, but this also leads to worse performance in terms of the AOS. The proposed MSRT combines the advantages of both the Siamese based tracking method and the Faster R-CNN detector to achieve better performance.

V. CONCLUSION

In this paper, we propose a new Siamese search based tracking method, namely MSRT, which effectively combines a multi-stream Siamese network and a region-based convolutional neural network. The proposed MSRT method is able to provide reliable online tracking accuracy, while maintaining real-time tracking speed. More specifically, we treat object tracking as two sub-tasks: Siamese search based tracking and re-identification. Unlike previous Siamese search based tracking methods, which cannot effectively deal with the significant appearance changes of a target, the proposed tracking method can achieve robust tracking results by updating the template in the multi-stream Siamese network. Moreover, a re-identification algorithm based on metric learning is also proposed. Therefore, the proposed method can re-locate a lost target during tracking. Extensive experiments show that the proposed MSRT can effectively track objects in challenging video sequences, and it can also effectively track vehicles and pedestrians in traffic video sequences. Compared with several state-of-the-art tracking methods, the proposed MSRT achieves better tracking accuracy. In future work, the proposed MSRT can be extended to multiple object tracking by using multiple templates and networks, since it is an instance search based tracking-by-detection method for object tracking.

REFERENCES

[1] B. Babenko, M.-H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1619-1632, Aug. 2011.
[2] S. Zhang, Y. Qi, F. Jiang, X. Lan, P. C. Yuen, and H. Zhou, "Point-to-set distance metric learning on deep representations for visual tracking," IEEE Trans. Intell. Transport. Syst., vol. 19, no. 1, pp. 187-198, Jan. 2018.
[3] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409-1422, Jul. 2012.
[4] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. H. Lau, and M.-H. Yang, "CREST: Convolutional residual learning for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2555-2564.
[5] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Discriminative scale space tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 8, pp. 1561-1575, Aug. 2017.
[6] S. Hare et al., "Struck: Structured output tracking with kernels," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 2096-2109, Oct. 2016.
[7] G. Ding, W. Chen, S. Zhao, J. Han, and Q. Liu, "Real-time scalable visual tracking via quadrangle kernelized correlation filters," IEEE Trans. Intell. Transport. Syst., vol. 19, no. 1, pp. 140-150, Jan. 2018.
[8] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, Dec. 2015.
[9] N. Wang, D. Tao, X. Gao, X. Li, and J. Li, "Transductive face sketch-photo synthesis," IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 9, pp. 1364-1376, Sep. 2013.
[10] N. Wang, X. Gao, L. Sun, and J. Li, "Bayesian face sketch synthesis," IEEE Trans. Image Process., vol. 26, no. 3, pp. 1264-1274, Mar. 2017.
[11] X. Ke, J. Zou, and Y. Niu, "End-to-end automatic image annotation based on deep CNN and multi-label data augmentation," IEEE Trans. Multimedia, vol. 21, no. 8, pp. 2093-2106, Aug. 2019.
[12] Y. Niu, J. Chen, and W. Guo, "Meta-metric for saliency detection evaluation metrics based on application preference," Multimedia Tools Appl., vol. 77, no. 20, pp. 26351-26369, Mar. 2018.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2012, pp. 1097-1105.
[14] Y. Niu, H. Zhang, W. Guo, and R. Ji, "Image quality assessment for color correction based on color contrast similarity and color value difference," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 4, pp. 849-862, Apr. 2018.
[15] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 580-587.
[16] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440-1448.
[17] H. Nam and B. Han, "Learning multi-domain convolutional neural networks for visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4293-4302.
[18] H. Fan and H. Ling, "SANet: Structure-aware network for visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jul. 2017, pp. 2217-2224.
[19] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, "Fully-convolutional siamese networks for object tracking," in Proc. Eur. Conf. Comput. Vis. Workshops, Oct. 2016, pp. 850-865.
[20] R. Tao, E. Gavves, and A. W. M. Smeulders, "Siamese instance search for tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1420-1429.
[21] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2014, pp. 2672-2680.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Repres., May 2015, pp. 1-14.

[23] Y. Niu, L. Lin, Y. Chen, and L. Ke, "Machine learning-based framework for saliency detection in distorted images," Multimedia Tools Appl., vol. 76, no. 24, pp. 26329-26353, Dec. 2017.
[24] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr, "End-to-end representation learning for correlation filter based tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5000-5008.
[25] T. Liu, X. Cao, and J. Jiang, "Visual object tracking with partition loss schemes," IEEE Trans. Intell. Transport. Syst., vol. 18, no. 3, pp. 633-642, Mar. 2017.
[26] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang, "Hierarchical convolutional features for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3074-3082.
[27] M. Gao, L. Jin, Y. Jiang, and B. Guo, "Manifold siamese network: A novel visual tracking convnet for autonomous vehicles," IEEE Trans. Intell. Transport. Syst., vol. 21, no. 4, pp. 1612-1623, Apr. 2020.
[28] S. Pu, Y. Song, C. Ma, H. Zhang, and M.-H. Yang, "Deep attentive tracking via reciprocative learning," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2018, pp. 1935-1945.
[29] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, and M. Felsberg, "Unveiling the power of deep tracking," in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 493-509.
[30] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, "Distractor-aware Siamese networks for visual object tracking," in Proc. Eur. Conf. Comput. Vis. (ECCV), Oct. 2018, pp. 103-119.
[31] X. Wang, C. Li, B. Luo, and J. Tang, "SINT++: Robust visual tracking via adversarial positive instance generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4864-4873.
[32] A. He, C. Luo, X. Tian, and W. Zeng, "A twofold siamese network for real-time object tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4834-4843.
[33] Y. Niu, W. Lin, X. Ke, and L. Ke, "Fitting-based optimisation for image visual salient object detection," IET Comput. Vis., vol. 11, no. 2, pp. 161-172, Mar. 2017.
[34] Y. Niu, W. Lin, and X. Ke, "CF-based optimisation for saliency detection," IET Comput. Vis., vol. 12, no. 4, pp. 365-376, Jun. 2018.
[35] D. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2544-2550.
[36] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583-596, Mar. 2015.
[37] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Learning spatially regularized correlation filters for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4310-4318.
[38] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr, "Staple: Complementary learners for real-time tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1401-1409.
[39] X. Ke, M. Zhou, Y. Niu, and W. Guo, "Data equilibrium based automatic image annotation by fusing deep model and semantic propagation," Pattern Recognit., vol. 71, pp. 60-77, Nov. 2017.
[40] H. K. Galoogahi, A. Fagg, and S. Lucey, "Learning background-aware correlation filters for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1144-1152.
[41] M. Wang, Y. Liu, and Z. Huang, "Large margin object tracking with circulant feature maps," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4800-4808.
[42] Y. Yao, X. Wu, L. Zhang, S. Shan, and W. Zuo, "Joint representation and truncated inference learning for correlation filter based tracking," in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 560-575.
[43] M. Zhang et al., "Visual tracking via spatially aligned correlation filters network," in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 484-500.
[44] D. Gordon, A. Farhadi, and D. Fox, "Re3: Real-time recurrent regression networks for object tracking," IEEE Robot. Autom. Lett., vol. 3, no. 2, pp. 788-795, Apr. 2018.
[45] D. Held, S. Thrun, and S. Savarese, "Learning to track at 100 fps with deep regression networks," in Proc. Eur. Conf. Comput. Vis., Oct. 2016, pp. 749-765.
[46] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2006, pp. 1735-1742.
[47] M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, "Large scale metric learning from equivalence constraints," in Proc. IEEE Conf.
[48] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Adv. Neural Inf. Process. Syst., Dec. 2015, pp. 91-99.
[49] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303-338, Jun. 2010.
[50] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564-577, May 2003.
[51] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, "MUlti-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 749-758.
[52] H. Nam, M. Baek, and B. Han, "Modeling and propagating CNNs in a tree structure for visual tracking," 2016, arXiv:1608.07242. [Online]. Available: http://arxiv.org/abs/1608.07242
[53] J. Ning, J. Yang, S. Jiang, L. Zhang, and M.-H. Yang, "Object tracking via dual linear structured SVM and explicit feature map," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4266-4274.
[54] S. Zhang, S. Zhao, Y. Sui, and L. Zhang, "Single object tracking with fuzzy least squares support vector machine," IEEE Trans. Image Process., vol. 24, no. 12, pp. 5723-5738, Dec. 2015.
[55] Y. Wu, J. Lim, and M.-H. Yang, "Object tracking benchmark," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834-1848, Sep. 2015.
[56] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2411-2418.
[57] H. Fan et al., "LaSOT: A high-quality benchmark for large-scale single object tracking," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5369-5378.
[58] C. Ma, X. Yang, C. Zhang, and M.-H. Yang, "Long-term correlation tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5388-5396.
[59] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., Jul. 2015, pp. 448-456.
[60] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," in Proc. ACM Int. Conf. Multim., Oct. 2015, pp. 689-692.

Yi Liu received the B.S. degree from the School of Automation, Beijing Institute of Technology, in 2011, and the M.S. degree from the School of Aeronautic Science and Engineering, Beihang University, in 2015. He is currently pursuing the Ph.D. degree with the School of Informatics, Xiamen University, China. His current research interests include pattern recognition, metric learning, and visual tracking.

Liming Zhang (Member, IEEE) received the B.S. degree in computer software from Nankai University, China, the M.S. degree in signal processing from the Nanjing University of Science and Technology, China, and the Ph.D. degree in image processing from the University of New England, Australia. She is currently an Assistant Professor with the Faculty of Science and Technology, University of Macau. Her research interests include signal processing, image processing, computer vision, and IT in
Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2288–2295. education.
Zhihui Chen received the B.S. degree from the School of Informatics, Xiamen University, China, in 2011, where he is currently pursuing the Ph.D. degree. His current research interests include pattern recognition, computer vision, and machine learning.

Yan Yan (Member, IEEE) received the Ph.D. degree in information and communication engineering from Tsinghua University, China, in 2009. From 2009 to 2010, he worked as a Research Engineer with the Research and Development Center, Nokia Japan, and in 2011 as a Project Leader with Panasonic Singapore Lab. He is currently an Associate Professor with the School of Informatics, Xiamen University, China. He has published over 60 papers in international journals and conferences. His research interests include computer vision and pattern recognition.

Hanzi Wang (Senior Member, IEEE) received the Ph.D. degree in computer vision from Monash University, Australia. He is currently a Distinguished Professor of "Minjiang Scholars" in Fujian Province and the Founding Director of the Center for Pattern Analysis and Machine Intelligence (CPAMI), Xiamen University, China. His research interests include computer vision and pattern recognition, in particular visual tracking, robust statistics, object detection, video segmentation, model fitting, 3D structure from motion, image segmentation, and related fields.