Zizhang Wu (1), Man Wang (1), Lingxiao Yin (1), Weiwei Sun (2), Jason Wang (1), Huangbin Wu (3)
(1) Zongmu Technology  (2) University of Victoria  (3) Xiamen University
{zizhang.wu, man.wang, linxiao.yin, jason.wang}@zongmutech.com
weiweisun@uvic.ca, wuhuangbin95@stu.xmu.edu.cn
arXiv:2006.16503v1 [cs.CV] 30 Jun 2020
Abstract
Re-ID methods, most of them are designed for the surveillance scenario. Meanwhile, in the field of autonomous driving, surround-view camera systems are becoming more and more popular. To achieve a 360° blind-spot-free environment perception, we mount multiple fisheye cameras around the vehicle, as shown in Fig. 1. Therefore, it is essential to learn the relationship between the same vehicle among different vehicle-mounted cameras and among different frames in a single camera. We state two main challenges for achieving a vehicle Re-ID solution in the surround-view multi-camera scenario:

1) In single-camera view, vehicle features in consecutive frames vary dramatically due to fisheye distortion, occlusion, truncation, etc. It is difficult to recognize the same vehicle from the past image archive under such interference.

2) In multi-camera view, the appearance of the same vehicle varies significantly from different viewpoints. The similarity between individual vehicles also leads to great confusion in matching.

In this paper, we propose our methods respectively for these challenges. To balance accuracy and efficiency, the SiamRPN++ network [13] and BDBnet [4] are employed for vehicle re-identification in single camera and multi-camera, respectively. However, these two models are mainly designed for surveillance scenes and are unable to deal with the significant variations of vehicle appearance in a fisheye system. Therefore, a post-processing module named the quality evaluation mechanism for the output bounding box is proposed to alleviate the target drift caused by fisheye distortion, occlusion, etc. Besides, an attention module and a spatial constraint strategy are introduced to address the intra-class difference and inter-class similarity of vehicles [16]. To drive the study of vehicle Re-ID in surround-view scenarios and fill the gap in related datasets, we will release a large-scale annotated fisheye dataset.

Our contributions can be summarized as follows:

• We design an integral vehicle Re-ID solution for the surround-view multi-camera scenario.

• We propose a quality evaluation mechanism for the output bounding box, which can alleviate the problem of target drift.

• We utilize a novel spatial constraint strategy for regularizing the Re-ID results in the surround-view camera system.

• We will release a large-scale fisheye dataset with annotations to promote the relevant research.

2. Related Work

In this section, we briefly review the literature on vehicle Re-ID methods and tracking algorithms, which are both closely related to our work.

Vehicle Re-ID: Vehicle Re-ID has been widely studied in recent years. As is well known, the common challenge is how to deal with the inter-class similarity and the intra-class difference [16]: different vehicles have similar appearance, while the same vehicle looks different due to diverse perspectives and distortion. Until now, many works have been proposed to tackle this challenge. Liu et al. [16] designed a pipeline that adopts deep relative distance learning (DRDL) to project vehicle images into Euclidean space, then calculates the distance in Euclidean space to measure the similarity of two vehicle images. Liu et al. [17] created a dataset named VeRi-776, which employs visual features, license plates and spatial-temporal information to explore the vehicle Re-ID task. Meanwhile, further works [23], [1], [20], [8], [28] introduced spatial-temporal information to boost the performance of Re-ID. Shen et al. [23] proposed a two-stage framework that utilizes complex spatial-temporal information of vehicles to effectively regularize Re-ID results. Subsequently, [1] used spatial prior knowledge to generate tracklets and selected the vehicle in the middle frame as the feature of a tracklet. Furthermore, [20] utilized location information between cameras to improve the accuracy of Re-ID. In light of the success of using location information, we exploit a novel spatial constraint strategy to enhance the Re-ID results in the surround-view camera system.

Since spatial-temporal information is not always available in datasets, approaches for Re-ID that combine local and global features of targets have also been proposed [7], [4], [12], [11], [15], [6]. For instance, [7] designed a network with a combination of local and global branches, which employs joint feature vectors to enhance robustness to occlusion. Different from DropBlock, [4] proposed Batch DropBlock, which drops the same region in a batch of images to better accomplish the metric learning task. [12] enabled the model to perform intensive learning of local features from the loss-function aspect. Inspired by the attention mechanism, methods of processing local and global information have become more flexible [14], [22], [34], [24], [39]. In this paper, we also utilize the attention mechanism to make the model focus on target regions, and use triplet loss [11] and softmax loss to enhance the performance of vehicle Re-ID.

Tracking algorithms related to vehicle Re-ID: Object tracking algorithms play an important role in the implementation of a vehicle Re-ID scheme [3], [10], [36], [37]. For object tracking tasks, trackers based on Siamese networks [25], [9], [2], [40], [27] have received significant attention for their well-balanced tracking accuracy and efficiency, which is of great value for boosting the vehicle Re-ID task in the surround-view multi-camera scenario. The seminal works [25], [9], [2] formulated visual tracking as a cross-correlation problem and produced a tracking similarity map from the deep model with a Siamese network, so as to find the location of the target by comparing the similarity between the target and the search region.
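The cross-correlation formulation behind these Siamese trackers can be sketched in a few lines. This is a didactic single-channel NumPy version on raw arrays; SiamRPN++ correlates learned deep features and adds region-proposal heads, so the function below only illustrates the response-map idea, not the actual tracker.

```python
import numpy as np

def response_map(template: np.ndarray, search: np.ndarray) -> np.ndarray:
    """Slide `template` over `search` and record the cross-correlation
    score at every valid offset, producing a similarity map."""
    th, tw = template.shape
    out_h = search.shape[0] - th + 1
    out_w = search.shape[1] - tw + 1
    out = np.empty((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(template * search[y:y + th, x:x + tw])
    return out

# The predicted target location is the peak of the response map.
search = np.zeros((8, 8))
search[3:5, 4:6] = 1.0                 # a bright 2x2 "target" patch
template = np.ones((2, 2))             # template resembling that patch
resp = response_map(template, search)
peak = np.unravel_index(np.argmax(resp), resp.shape)  # -> (3, 4)
```

The peak coincides with the top-left corner of the target patch, which is exactly how such trackers localize the target inside the search region.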
Figure 3. The overall framework of vehicle Re-ID in single camera. Each object is assigned a single tracker to realize Re-ID in single
channel. Tracking templates are initialized with object detection results. All tracking outputs are post-processed by the quality evaluation
module to deal with the distorted or occluded objects.
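The per-object bookkeeping in Figure 3 can be sketched as follows. This is a hypothetical minimal skeleton, not the paper's implementation: class and method names are ours, and the real quality evaluation also scores boxes with IoU and confidence and refreshes tracking templates. It shows only the create/evaluate/delete life cycle, including dropping a tracker after N consecutive occluded frames (N = 4 is the value Section 4.3 finds optimal).

```python
# Hypothetical skeleton of the per-object tracker bookkeeping in Figure 3.
class SingleCameraReID:
    def __init__(self, max_occluded_frames: int = 4):
        self.trackers = {}                 # tracker id -> occluded-frame count
        self.next_id = 0
        self.max_occluded = max_occluded_frames

    def spawn(self) -> int:
        """Create a new tracker (new identity) for an unmatched detection."""
        tid = self.next_id
        self.next_id += 1
        self.trackers[tid] = 0
        return tid

    def step(self, occluded_ids: set) -> None:
        """Per-frame quality evaluation: a tracker occluded for
        `max_occluded` consecutive frames is deleted."""
        for tid in list(self.trackers):
            if tid in occluded_ids:
                self.trackers[tid] += 1
                if self.trackers[tid] >= self.max_occluded:
                    del self.trackers[tid]     # continuous occlusion: drop
            else:
                self.trackers[tid] = 0         # visible again: reset count

# A tracker occluded for 4 consecutive frames is removed; others survive.
reid = SingleCameraReID(max_occluded_frames=4)
a, b = reid.spawn(), reid.spawn()
for _ in range(4):
    reid.step({a})
```

Resetting the counter on reappearance is what keeps briefly occluded vehicles from being deleted and re-assigned a new ID.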
Vehicle Re-ID in multi-camera aims at building the correlation between the same vehicle in different cameras. Most existing vehicle Re-ID methods mainly serve surveillance scenarios. To meet the requirements of vehicle Re-ID for the surround-view camera system, we apply a modified BDBnet [4] as our multi-camera Re-ID model. BDBnet consists of a global branch and a local branch. Specifically, a fixed mask is added to the local branch to help the network learn semantic features, and it is shown to be effective in pedestrian Re-ID. Different from pedestrian Re-ID, vehicle Re-ID for the surround-view camera system suffers from deformation across the cameras. Fixed templates can hardly improve the learning outcome, so we introduce an attention module that leads the network to learn a self-adaptive template to focus on target regions. As shown in Fig. 5, the structure in the red box is the attention module. For each new target, the network is applied to extract features and measure the Euclidean distance between this feature and the features in the gallery, which is converted to the score s1:

s1 = ln(1/DF + 1),   (5)

where DF is the Euclidean distance between the feature of the new target and a gallery feature.

For the spatial constraint, we first calculate the coordinate distances between key points and then convert them to the score s2:

s2 = ln(1/DK + 1),   (6)

where DK is the distance between the projection coordinates of the key points, DK = Df + Dr, and Df and Dr are the projection coordinate distances of the two front key points and the two rear key points, respectively.

Figure 6. Projection uncertainty of key points. Ellipse 1 and ellipse 2 are the uncertainty ranges of the front and left (right) cameras, respectively.
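Both scores share the same mapping from a distance to a similarity score, which can be written down directly (function name ours); s1 applies it to the feature distance DF and s2 to the key-point projection distance DK = Df + Dr:

```python
import math

def distance_score(d: float) -> float:
    """Shared form of Eqs. (5) and (6): s = ln(1/d + 1).
    Small distances map to large scores; the score tends to 0 as d grows."""
    return math.log(1.0 / d + 1.0)

# s1 = distance_score(D_F)  with D_F the feature-space Euclidean distance
# s2 = distance_score(D_K)  with D_K = D_f + D_r from the key-point projections
```

The logarithmic form compresses large distances, so the score is sensitive exactly where it matters: among close candidates.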
3.2.3 Framework of vehicle Re-ID in multi-camera
Figure 7. The overall framework of the vehicle Re-ID in multi-camera. For a new target, the Re-ID model is first used to extract the features, followed by the distance metric carried out between this feature and the features in the gallery. Besides, the spatial constraint strategy is adopted to improve the correlation effect.
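A minimal sketch of the matching step in Figure 7, under stated assumptions: a plain Euclidean metric, an illustrative distance threshold, and a user-supplied plausibility predicate standing in for the spatial constraint strategy (the paper does not specify these values here, so they are placeholders).

```python
import numpy as np

def match_to_gallery(feature, gallery, spatially_plausible, threshold=1.0):
    """Return the gallery ID nearest to `feature` in Euclidean distance,
    restricted to spatially plausible candidates; None means no match,
    so a new ID should be created for the target."""
    best_id, best_dist = None, threshold
    for gid, gfeat in gallery.items():
        if not spatially_plausible(gid):
            continue                                  # spatial constraint gate
        dist = float(np.linalg.norm(feature - gfeat))
        if dist < best_dist:
            best_id, best_dist = gid, dist
    return best_id

# Nearby feature matches ID 7; gating ID 7 out leaves no candidate in range.
gallery = {7: np.array([0.0, 0.0]), 8: np.array([5.0, 5.0])}
query = np.array([0.2, 0.1])
matched = match_to_gallery(query, gallery, spatially_plausible=lambda g: True)
unmatched = match_to_gallery(query, gallery, spatially_plausible=lambda g: g == 8)
```

The spatial gate prunes candidates before the feature comparison, which is how the spatial constraint reduces confusion between visually similar vehicles in different regions.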
3.3. Overall framework

The surround-view multi-camera Re-ID system processes the data of each camera serially. Each new object in a single channel will be assigned a single target tracker and matched with the data of the other channels according to the Re-ID strategy in multi-camera. If not matched successfully, a new ID will be created for it; else, the ID of the matching object will be inherited by the newcomer. All targets in a single camera will be matched in time sequence according to the Re-ID strategy in single camera.

4. Experiments

4.1. Dataset and evaluation metric

We generate the fisheye dataset from structured roads. It consists of a total of 22190 annotated images sampled from 36 sequences. The resolution of each image is 1280×720. We use 80% of the images as the training set and 20% as the testing set. We utilize three cameras (i.e., front, left and right) to capture these images at 30 frames per second (fps).

We evaluate the results with the identity consistency (IC). It is formulated as

IC = 1 − (Σ_t IDSW_t) / (Σ_t ID_t),   (8)

where t is the frame index, an identity switch (IDSW) is counted if a ground-truth target i is matched to tracking output j while the last known assignment was k, k ≠ j, and ID_t is the number of ground-truth targets with an ID in frame t.

4.2. Implementation Details

We train all networks with stochastic gradient descent (SGD) on a GTX 1080Ti. For the single-camera Re-ID model SiamRPN++, the number of training epochs is 50. Batch size is set to 24 in the training stage due to the limit of GPU memory. The learning rate is 0.0005, weight decay is 0.0001, and momentum is 0.9. In the case of the multi-camera Re-ID model, the momentum and weight decay are set to 0.9 and 0.0005, respectively. We use an initial learning rate of 0.00035 for the first 10 epochs. We train the network for 150 epochs. Batch size is 256.

4.3. Evaluation of the proposed method

Quality Evaluation Mechanism. The proposed quality evaluation mechanism is a key component for optimizing Re-ID performance in a single camera. We conduct several ablation studies on our dataset to find out the contribution of our method to performance.

We first compare the template updating metrics. As shown in Table 1, Default is our baseline model without updating templates. It is obvious that updating the tracking template with IoU_T and C_T as metrics brings a significant improvement in identity consistency. Furthermore, when utilizing IoU_R and C_R, the identity consistency is greatly improved once again. That means the revised metrics help the model pay more attention to the demands of the Re-ID task.

Method        | IC_front | IC_left | IC_right
Default       | 0.82     | 0.81    | 0.87
+IoU_T + C_T  | 0.88     | 0.91    | 0.90
+IoU_R + C_R  | 0.94     | 0.96    | 0.97

Table 1. The impact of the different methods of updating the tracking template. IC_front, IC_left and IC_right correspond to the IC of the front, left and right cameras.

The number of frames (N) after which an occluded object is deleted is summarized in Table 2. Zero frames means that we do not handle occluded cars. When N = 2, the identity switch decreases slightly. This is expected, because frequently deleting temporarily occluded cars contributes to ID switches upon their rapid reappearance. As N grows, the IC gradually becomes steady. However, maintaining too many frames means occlusion is barely handled, and the consistency descends once again. Experimental results show that N = 4 is optimal in this paper.
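As a sanity check, the identity consistency of Eq. (8) can be computed directly from per-frame counts (the function name is ours, for illustration):

```python
def identity_consistency(idsw_per_frame, ids_per_frame):
    """Eq. (8): IC = 1 - (sum_t IDSW_t) / (sum_t ID_t), where IDSW_t counts
    identity switches and ID_t counts ground-truth identities in frame t."""
    return 1.0 - sum(idsw_per_frame) / sum(ids_per_frame)

# One identity switch over 15 ground-truth identities across 3 frames.
ic = identity_consistency([0, 1, 0], [5, 5, 5])
```

IC equals 1 when identities are never switched, and decreases with every switch relative to the total number of tracked identities.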
Vehicle Re-ID strategy in multi-camera. We examine the influence of various matching strategies in multi-camera in Table 3. All the experiments are based on the same implementations in single view. The first row shows the method of matching with feature metrics as in [4]. After introducing the attention module, the Re-ID accuracy is improved considerably. Based on that, adding the spatial constraint strategy provides a further improvement, as shown in the last row.

N | IC_front | IC_left | IC_right
0 | 0.83     | 0.81    | 0.80
2 | 0.85     | 0.83    | 0.84
3 | 0.92     | 0.90    | 0.93
4 | 0.94     | 0.96    | 0.97
5 | 0.93     | 0.91    | 0.92

Table 2. The impact of the number of frames N after which occluded objects are deleted.

ID matching strategy in multi-camera        | IC_front | IC_left | IC_right
Baseline                                    | 0.84     | 0.85    | 0.86
+Attention module                           | 0.87     | 0.92    | 0.89
+Attention module + Spatial constraint      | 0.94     | 0.96    | 0.97

Table 3. The impact of the different ID matching strategies in multi-camera.

5. Conclusion

In this paper, we propose a complete vehicle Re-ID solution for the surround-view multi-camera scenario. The introduced quality evaluation mechanism for the output bounding box can alleviate the problems caused by distortion and occlusion. Besides, we deploy an attention module to make the network focus on target regions. A novel spatial constraint strategy is also introduced to regularize the Re-ID results in this scenario. Extensive component analysis and comparison experiments on the fisheye dataset show that our vehicle Re-ID solution achieves promising results. Furthermore, the annotated fisheye dataset will be made public to advance the development of research in this field. In future studies, we will further optimize the performance of vehicle Re-ID in the surround-view multi-camera scenario.

References

[1] Pedro Antonio Marin-Reyes, Andrea Palazzi, Luca Bergamini, Simone Calderara, Javier Lorenzo-Navarro, and Rita Cucchiara. Unsupervised vehicle re-identification using triplet networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 166–171, 2018.
[2] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pages 850–865. Springer, 2016.
[3] David S Bolme, J Ross Beveridge, Bruce A Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2544–2550. IEEE, 2010.
[4] Zuozhuo Dai, Mingqiang Chen, Xiaodong Gu, Siyu Zhu, and Ping Tan. Batch dropblock network for person re-identification and beyond. In Proceedings of the IEEE International Conference on Computer Vision, pages 3691–3701, 2019.
[5] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6638–6646, 2017.
[6] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2360–2367, June 2010.
[7] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock: A regularization method for convolutional networks. In Advances in Neural Information Processing Systems, pages 10727–10737, 2018.
[8] Omar Hamdoun, Fabien Moutarde, Bogdan Stanciulescu, and Bruno Steux. Person re-identification in multi-camera system by signature based on interest point descriptors collected on short video sequences. In 2008 Second ACM/IEEE International Conference on Distributed Smart Cameras, pages 1–6, 2008.
[9] David Held, Sebastian Thrun, and Silvio Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, pages 749–765. Springer, 2016.
[10] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2014.
[11] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. 2017.
[12] Ratnesh Kumar, Edwin Weill, Farzin Aghdasi, and Parthasarathy Sriram. Vehicle re-identification: an efficient baseline using triplet embedding. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2019.
[13] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.
[14] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. 2018.
[15] Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2197–2206, 2015.
[16] Hongye Liu, Yonghong Tian, Yaowei Yang, Lu Pang, and Tiejun Huang. Deep relative distance learning: Tell the difference between similar vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2167–2175, 2016.
[17] Xinchen Liu, Wu Liu, Tao Mei, and Huadong Ma. A deep learning-based approach to progressive vehicle re-identification for urban surveillance. In European Conference on Computer Vision, pages 869–884. Springer, 2016.
[18] Xinchen Liu, Wu Liu, Huadong Ma, and Huiyuan Fu. Large-scale vehicle re-identification in urban surveillance videos. In IEEE International Conference on Multimedia and Expo (ICME), 2016.
[19] Xiaobin Liu, Shiliang Zhang, Qingming Huang, and Wen Gao. RAM: A region-aware deep model for vehicle re-identification. In 2018 IEEE International Conference on Multimedia and Expo (ICME), 2018.
[20] Jianming Lv, Weihang Chen, Qing Li, and Can Yang. Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7948–7956, 2018.
[21] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2016.
[22] Yantao Shen, Hongsheng Li, Tong Xiao, Shuai Yi, Dapeng Chen, and Xiaogang Wang. Deep group-shuffling random walk for person re-identification. 2018.
[23] Yantao Shen, Tong Xiao, Hongsheng Li, Shuai Yi, and Xiaogang Wang. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In Proceedings of the IEEE International Conference on Computer Vision, pages 1900–1909, 2017.
[24] Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. 2018.
[25] Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1420–1429, 2016.
[26] Huibing Wang, Jinjia Peng, Dongyan Chen, Guangqi Jiang, Tongtong Zhao, and Xianping Fu. Attribute-guided feature learning network for vehicle re-identification. 2020.
[27] Qiang Wang, Zhu Teng, Junliang Xing, Jin Gao, Weiming Hu, and Stephen Maybank. Learning attentions: residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4854–4863, 2018.
[28] Zhongdao Wang, Luming Tang, Xihui Liu, Zhuliang Yao, and Xiaogang Wang. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017.
[29] Lin Wu, Richang Hong, Yang Wang, and Meng Wang. Cross-entropy adversarial view adaptation for person re-identification. 2018.
[30] Lin Wu, Yang Wang, Xue Li, and Junbin Gao. What-and-where to match: Deep spatially multiplicative integration networks for person re-identification. Pattern Recognition, 2017.
[31] Lin Wu, Yang Wang, Ling Shao, and Meng Wang. 3-d personvlad: Learning deep global representations for video-based person reidentification. IEEE Transactions on Neural Networks and Learning Systems, pages 1–13, 2018.
[32] Lin Wu, Yang Wang, Ling Shao, and Meng Wang. 3d personvlad: Learning deep global representations for video-based person re-identification. 2018.
[33] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
[34] Yang Shen, Weiyao Lin, Junchi Yan, Mingliang Xu, and Jingdong Wang. Person re-identification with correspondence structure learning. In 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
[35] Dominik Zapletal and Adam Herout. Vehicle re-identification for automatic video traffic surveillance. pages 1568–1574, June 2016.
[36] Mengdan Zhang, Junliang Xing, Jin Gao, and Weiming Hu. Robust visual tracking using joint scale-spatial correlation filters. In 2015 IEEE International Conference on Image Processing (ICIP), pages 1468–1472. IEEE, 2015.
[37] Mengdan Zhang, Junliang Xing, Jin Gao, Xinchu Shi, Qiang Wang, and Weiming Hu. Joint scale-spatial correlation tracking with adaptive rotation estimation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 32–40, 2015.
[38] Aihua Zheng, Xianmin Lin, Chenglong Li, Ran He, and Jin Tang. Attributes guided feature learning for vehicle re-identification. 2019.
[39] Wei-Shi Zheng, Xiang Li, Tao Xiang, Shengcai Liao, Jianhuang Lai, and Shaogang Gong. Partial person re-identification. 2016.
[40] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware siamese networks for visual object tracking. 2018.