[Figure residue: framework overview with panels "Mono3D", "Tracking Label", "Spatial-Temporal Information Flow" and "Temporal Information Flow" (Add & Norm, Feed Forward, Matching, Gradient), plus an example affinity matrix over objects o1 to o5 and the unmatched slot ∅.]

While separately leveraging monocular 3D object detection and 2D multi-object tracking can be straightforwardly … tasks in the same framework. Sun et al. [29] use a siamese network with the paired frame as input and predict the similarity score of detections. Brasó and Leal-Taixé [3] propose a graph partitioning method to treat the association problem as an edge …

… $1, \dots, n_t\}$, dimension $D_t = \{D_t^i \in \mathbb{R}^3 \mid i = 1, \dots, n_t\}$, …
\begin{equation}
\bm{Y} = \mathrm{MultiHeadAttn}(\bm{\mathcal{Q}}, \bm{\mathcal{K}}, \bm{\mathcal{V}}, \bm{PE}) \tag{1}
\end{equation}

where $\bm{PE}$ is the positional encoding function, which removes the permutation invariance of attention. The attention is called self-attention if the input query and key are the same, and cross-attention otherwise. The transformer stacks self-attention, normalization, and feed-forward layers in the encoder and decoder, and the cross-attention focuses on the interaction between them. We refer the reader to the literature [31] for more detailed descriptions.

The transformer architecture can be naturally extended to spatial-temporal information, in which self-attention propagates the objects' information at a certain time and cross-attention aggregates the objects' information across times. The structure of the spatial information flow is shown at the bottom of Fig. 2. We first extract the top-$K$ center points in image $I_t$ from the main center head of our 3D detector and index the corresponding appearance features $A_t^i$ and geometric features $G_t^i$ (details in Sec. 3.2), which are concatenated and passed through an MLP layer to generate the input embedding $F_t^i$. $K$ is set to be larger than the typical number $n_t$ of objects in an image (e.g., 128 in the nuScenes dataset). So that …

\begin{equation}
\bm{\Gamma}(i, j)_t^{t-\Delta t} = \mathrm{FFN}\left( \bm{W}_q^i \hat{\bm{\mathcal{F}}}_t^i * \bm{W}_k^j \check{\bm{\mathcal{F}}}_{t-\Delta t}^j / \sqrt{C} + \bm{0} \right) \tag{3}
\end{equation}

where $\bm{W}_q$ and $\bm{W}_k$ are learnable weights, and $\bm{\Gamma}(i, j)_t^{t-\Delta t}$ is the affinity score between the $i$th object in frame $I_t$ and the $j$th object in frame $I_{t-\Delta t}$. Similar to the spatial information flow module, the temporal positional encoding is set to 0. Considering that the objects in the image pair may not have a corresponding relationship, we learn an extra row and column for un-identified targets, following DAN [29]. We simply add an FFN to the weights of the cross-attention and estimate the affinity matrix $\bm{\Gamma}_t^{t-\Delta t} \in \mathbb{R}^{(K+1)\times(K+1)}$, which will be trained as a unimodal matrix for one-to-one matching.

The temporal information flow branch can generate tracking information and guide the temporal aggregation of appearance and geometric information. The temporal aggregation module then models the temporal transition of targets to predict the box smoothness refinement and the time-related variables (e.g., velocity, motion attribute). Thus, the mechanism of temporal information flow can be summarized as:

\begin{equation}
\tilde{\bm{\mathcal{F}}}_t^i = \mathrm{MultiHeadAttn}(\hat{\bm{\mathcal{F}}}_t^i, \check{\bm{\mathcal{F}}}_{t-\Delta t}, \check{\bm{\mathcal{F}}}_{t-\Delta t}, \bm{0}) \tag{4}
\end{equation}
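As a rough illustration of Eq. (4), the following sketch aggregates the features of the previous frame for one query target. It is a deliberately simplified stand-in, not the paper's module: it uses a single head, no learned projections, and no positional encoding (matching the zero encoding in Eq. (4)). The function name `cross_attention` and the toy feature vectors are hypothetical.

```python
import math

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Simplified stand-in for MultiHeadAttn in Eq. (4): no learned weights,
    no multiple heads, and a zero positional encoding.
    """
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Toy example: one target in the current frame, three targets in the previous frame.
current = [1.0, 0.0]
previous = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
aggregated, weights = cross_attention(current, previous, previous)
```

The previous-frame features serve as both keys and values, so the output is a convex combination of them, weighted towards the entries most similar to the query.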
[Figure 3 panels: frames $I_t$ and $I_{t-\Delta t}$; Heat Map; Affinity Matrix; Column Softmax; Row Softmax; Multiplication.]

Figure 3. Learnt unimodal affinity matrix. Our simple tracking loss can learn a unimodal affinity matrix, which can perform one-to-one matching.
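The pipeline in Fig. 3 (row softmax, column softmax, multiplication) and the cross-entropy tracking loss of Sec. 3.4 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the product of the two softmax matrices is taken element-wise, the loss sums over every entry of the ground-truth matrix rather than the exact index ranges of Eq. (6), and all function names and the toy logits are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def row_softmax(matrix):
    """Each row sums to one: frame t associated to frame t - dt."""
    return [softmax(row) for row in matrix]

def col_softmax(matrix):
    """Each column sums to one: frame t - dt associated to frame t."""
    cols = [softmax(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

def unimodal_affinity(gamma):
    """Element-wise product of the two softmax matrices (assumed reading of the product)."""
    rows, cols = row_softmax(gamma), col_softmax(gamma)
    return [[r * c for r, c in zip(rr, cr)] for rr, cr in zip(rows, cols)]

def tracking_loss(gamma, target, eps=1e-12):
    """Cross-entropy between a 0/1 ground-truth matrix and the fused affinity."""
    fused = unimodal_affinity(gamma)
    loss = 0.0
    for fused_row, target_row in zip(fused, target):
        for p, y in zip(fused_row, target_row):
            if y:
                loss -= math.log(p + eps)
    return loss

# Toy example with K = 2 objects plus the extra row/column for un-identified targets.
gamma = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
matched = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
swapped = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
fused = unimodal_affinity(gamma)
```

An entry of the fused matrix is large only when it dominates both its row and its column, which is what pushes the learnt matrix towards a one-to-one (unimodal) response.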
In Eq. (4), $\tilde{\bm{\mathcal{F}}}_t^i \in \mathbb{R}^{K\times d}$ is the aggregated feature of the $i$th target at time $t$. The final prediction is computed by an FFN that consists of a 3-layer MLP. The FFN predicts the velocity $V_t = \{V_t^i \in \mathbb{R}^3 \mid i = 1, \dots, K\}$, the motion attribute $M_t = \{M_t^i \in \mathbb{R}^3 \mid i = 1, \dots, K\}$, and the box smoothness refinement value $\Delta B = \{\Delta B_t^i = (\Delta X_t^i, \Delta Y_t^i, \Delta Z_t^i, \Delta W_t^i, \Delta H_t^i, \Delta L_t^i) \mid i = 1, \dots, K\}$.

3.4. Training Loss
We divide the multi-task loss into three parts: the monocular 3D object detection loss $\mathcal{L}_{Mono3D}$, the tracking loss $\mathcal{L}_{Tracking}$, and the temporal-consistency loss $\mathcal{L}_{Cons}$.

1). Monocular object 3D detection loss. We adopt the same loss as KM3D [18]:

\begin{equation}
\mathcal{L}_{Mono3D} = \mathcal{L}_{m} + \mathcal{L}_{kc} + \mathcal{L}_{D} + \mathcal{L}_{O} + \mathcal{L}_{T} \tag{5}
\end{equation}

where $\mathcal{L}_{m}$, $\mathcal{L}_{kc}$, $\mathcal{L}_{D}$, $\mathcal{L}_{O}$, and $\mathcal{L}_{T}$ are the main center loss, keypoint loss, dimension loss, orientation loss, and position loss, respectively.

2). Tracking loss. Thanks to our explicit modeling of the target appearance features and geometric features in spatial-temporal space, we can design a simpler and more effective loss function than DAN [29] to constrain the affinity matrix to generate a unimodal response. Specifically, we first apply row- and column-wise softmax to the affinity matrix $\bm{\Gamma}_t^{t-\Delta t} \in \mathbb{R}^{(K+1)\times(K+1)}$, which yields the association matrices $\hat{\bm{\Gamma}}_{t,t-\Delta t}$ and $\hat{\bm{\Gamma}}_{t-\Delta t,t}$. The $i$th row of the matrix $\hat{\bm{\Gamma}}_{t,t-\Delta t}$ associates the $i$th object in frame $I_t$ with the $K+1$ identities in frame $I_{t-\Delta t}$, where the $+1$ represents the un-identified targets in frame $I_{t-\Delta t}$. The matrix $\hat{\bm{\Gamma}}_{t-\Delta t,t}$ signifies similar associations from $I_{t-\Delta t}$ to $I_t$. We then multiply these two matrices, $\tilde{\bm{\Gamma}}_t^{t-\Delta t} = \hat{\bm{\Gamma}}_{t,t-\Delta t} \times \hat{\bm{\Gamma}}_{t-\Delta t,t}$, and use a cross-entropy loss for training:

\begin{equation}
\mathcal{L}_{Tracking} = - \sum_{j=1}^{K} \sum_{i=1}^{K+1} {}^{*}\bm{\Gamma}_{t}^{t-\Delta t}(i, j) \log\left(\tilde{\bm{\Gamma}}_{t}^{t-\Delta t}(i, j)\right) \tag{6}
\end{equation}

where ${}^{*}\bm{\Gamma}_t^{t-\Delta t}$ represents the ground-truth association matrix, and ${}^{*}\bm{\Gamma}_t^{t-\Delta t}(i, j) = 1$ indicates that the $i$th identity in frame $I_{t-\Delta t}$ and the $j$th object in frame $I_t$ are the same object. Fig. 3 shows sample frames from a nuScenes video to illustrate the processing of the tracking loss.

3). Temporal-consistency loss. Most functions are designed to check temporal stability in order to ensure a smooth transition. However, previous approaches calculated detection results independently for each frame and failed to constrain a trajectory's inner consistency in the network. We propose an auxiliary loss to clarify the temporal topology between frames during the training phase to ensure trajectory smoothness. Specifically, we first add the smoothness refinement box to the outputs of our monocular 3D detector to generate the final box parameters and then calculate the relative 3D corner distance of the same object in different frames. The ground truth then supervises these corner distances:

\begin{equation}
\begin{aligned}
\mathcal{L}_{Cons} = \sum_{\zeta=1}^{\tau} \sum_{i=0}^{n_t} \sum_{j=0}^{n_{t-k}} \Big| &\left| \mathcal{C}_3(\bm{B}_{t}^i + \Delta\bm{B}_{t}^i) - \mathcal{C}_3(\bm{B}_{t-\zeta}^j + \Delta\bm{B}_{t-\zeta}^j) \right|_2 \\
&- \left| \mathcal{C}_3(B^*_t) - \mathcal{C}_3(B^*_{t-\zeta}) \right|_2 \Big|_2
\end{aligned} \tag{7}
\end{equation}

where $\mathcal{C}_3$ maps the parameters of a 3D box to its 8 corners and $B^*$ denotes the ground truth. Interestingly, although every potential target is supervised by its corresponding ground truth, the temporal-consistency loss can still minimize the relative location. Fig. 6 shows a sample to explain that.

Finally, Time3D is supervised in an end-to-end fashion with the multi-task loss combination:

\begin{equation}
\mathcal{L} = \mathcal{L}_{Mono3D} + \mathcal{L}_{Tracking} + \mathcal{L}_{Cons} \tag{8}
\end{equation}

3.5. Inference of Tracking
The inference of tracking is shown in Fig. 4. We first perform the 3D object detection, heterogeneous cues extractor, and spatial information flow sequentially at each timestamp. Next, the spatial features are stored with their timestamps. Given the current frame image $I_t$ with its spatial features, the affinity matrix is computed by forward passing
through temporal information flow. In order to reduce ID switches, we follow DAN [29] to calculate the affinity matrix between the objects in the current frame and the spatial features stored in all the trajectories and sum them as the similarity score between the objects and the trajectories. Finally, the Hungarian algorithm [23] is adopted to get the optimal assignments. Running the assignment process frame by frame yields the object trajectories. Thus each frame image is passed through the heavy-weight network of 3D detection, embedding extractor, and spatial information flow only once, but the stored spatial features are used multiple times through the light-weight temporal information flow to compute the similarity score. Time3D, therefore, can run in real-time.

[Figure 4 diagram: video frames pass through Stage 1 (3D Object Detection, Heterogeneous Cues Embedding, Spatial Information Flow) into a Memory of Spatial Features; Stage 2 (Cat as Batch, Expand as Batch, Temporal Information Flow, Feed Forward) scores them against a Memory of Object Trajectories over frames t-5 to t-1, with unmatched detections scattered as newborn objects; per-frame scores are summed, e.g. 0.8 + 0.9 + 0.9 + 0.7 + 0.9 = 4.2.]

Figure 4. Inference of tracking. Each frame of images is first fed into 3D object detection, embedding extractor, and spatial information flow module to generate spatial features. We then equip an explicit memory to store the past trajectories with corresponding spatial features. Given the current spatial features, only the fast and straightforward temporal information flow is used to compute the similarity score.

4. Experiments

4.1. Dataset
We evaluate Time3D on the popular large-scale nuScenes [6] dataset for autonomous driving. nuScenes is composed of 1000 video scenes, with 700/150/150 scenes for the training, validation, and testing sets, respectively. The data were collected in Boston and Singapore, in both day and night and under different weather conditions. Each video contains images from six cameras forming an entire 360° field of view, with keyframes sampled at 2 Hz. In total, 1.4M 3D bounding box annotations from 10 categories are contained. Considering that nuScenes simultaneously includes sequential monocular images, 3D annotations, tracking IDs, and the pose of each frame, we use it as the benchmark to validate the performance of our method.

4.2. Implementation Details
In mono3D, we follow KM3D [18] to report the results with the DLA-34 backbone. We stack three self- … without softmax. Time3D is trained without the KM3D pre-trained model but with an ImageNet pre-trained model as initialization. We use shifting and scaling augmentation. We resize the image from 900 × 1600 to 448 × 800 for fast training. The batch size is 80 on 8 2080Ti GPUs, with ten images per GPU. We train Time3D for 200 epochs with a starting learning rate of 1.25e−4 and reduce it by a factor of 10 at epochs 90 and 120. We only associate the past five frames with the current frame for fast running speed.

4.3. Comparison with state-of-the-art
We report four official evaluation metrics [15] for tracking quantitative analyses: Average Multi-Object Tracking Accuracy (AMOTA), Average Multi-Object Tracking Precision (AMOTP), Multi-Object Tracking Accuracy (MOTA), and Multi-Object Tracking Precision (MOTP), while reporting the seven official evaluation metrics for 3D detection: Average Precision (AP), Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Velocity Error (AVE), Average Attribute Error (AAE), and nuScenes detection score (NDS). We only compare the online (no peeking into the future) tracking methods from published approaches for a fair comparison. As shown in Tab. 1, Time3D outperforms all the other online trackers on different metrics while maintaining outstanding running speed. MOT performance is known to rely heavily on the accuracy of detection. We therefore list the results of two popular LiDAR-based 3D detectors [16, 46] with the baseline tracker AB3DMOT [35]. Megvii achieves high detection accuracy by employing LiDAR point clouds, but Time3D outperforms Megvii by a large margin for 3D tracking, improving AMOTA by 41.7%, AMOTP by 9%, and MOTA by 11%. We also evaluate a Time3D variant without end-to-end training, in which we separately train KM3D, the Re-ID extractor, and the spatial-temporal module, as detailed in the supplementary. End-to-end Time3D improves AMOTA by 37% and MOTA by 22%. All these results show the power of Time3D, which jointly learns detection and tracking in an end-to-end manner and can outperform methods that treat detection and tracking as two different steps, even if these methods have more powerful detection accuracy. Tab. 1 shows that Megvii has a smaller MOTP than Time3D, which is because the LiDAR-based Megvii is more accurate in translation.

Table 1. 3D tracking and 3D object detection performance on the nuScenes test set. † indicates that AB3DMOT [35] is used as the tracker. Time3D‡ is trained in a non-end-to-end manner.

Time3D is proposed to combine 3D detection and 3D tracking in a unified framework and make autonomous driving more practical. We, therefore, employ a lightweight monocular 3D detection method, KM3D, for the trade-off between speed and accuracy. The 3D detection accuracy of Time3D still has a gap to the LiDAR-based method Megvii and is competitive with the single-task methods CenterNet [44] and FCOS3D [32]. The use of a more powerful 3D detector can improve the performance of 3D detection, and we will take it as future work. Note that Time3D‡ achieves higher accuracy in the 3D detection benchmark, and we will analyze this phenomenon in the ablation study.

4.4. Qualitative Results
Fig. 5 shows some qualitative results of Time3D. Time3D produces strong tracking results and outputs smooth trajectories, even for occluded or high-speed moving targets. It is worth noting that Time3D only uses monocular images as input and outputs 3D trajectories. Small targets such as pedestrians have relatively rough trajectories, which is attributable to the 3D detector; it can be replaced with a more robust 3D detector to generate more accurate 3D positions. More qualitative results can be seen in the supplementary.

Heterogeneous Cues Embedding. We first examine the effects of different cues, such as Re-ID features and the embeddings of the 2D box, 3D box, and category. As shown in Tab. 2, for the single cue testing, using only Re-ID features achieves the highest AMOTA of 16.5. Adding the other cues can further boost the performance. The best performance is achieved by combining all heterogeneous cues. We observe that the biggest improvement occurs when adding the 3D box cue, which demonstrates that the 3D location is important to 3D object tracking. We hope this informs future research to focus on 3D information to further improve tracking performance.

The geometric embedding extractor learns a compatible feature representation for 2D boxes and 3D boxes. We investigate its effect by comparing against a solution that directly encodes the 4-DoF and 7-DoF parameters of the 2D box and 3D box with the same MLP layer as the proposed geometric embedding extractor. As shown in the last row of Tab. 2, direct encoding drops performance significantly, which demonstrates that an explicit geometry encoding method is more conducive to the learning of the network.

Table 3. Ablation with and without the Re-ID feature. Performance on the nuScenes val set.

Setting   mAP    mATE   mASE   mAOE   NDS
w/        29.1   0.79   0.24   0.48   39.0
w/o       32.2   0.75   0.26   0.49   39.5
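The trajectory association described in Sec. 3.5 (sum the affinity between a detection and every stored feature of a trajectory, then solve the assignment) can be sketched as below. This is an illustrative sketch only. The dot-product `affinity` is a hypothetical stand-in for the temporal information flow, the exhaustive search stands in for the Hungarian algorithm [23] and is only practical for a handful of objects, and newborn or unmatched objects are not handled.

```python
from itertools import permutations

def affinity(feat_a, feat_b):
    """Hypothetical stand-in for the temporal information flow: a plain dot product."""
    return sum(a * b for a, b in zip(feat_a, feat_b))

def similarity_matrix(detections, trajectories, max_history=5):
    """Sum the affinity of each detection over the last `max_history` features of each trajectory."""
    return [
        [sum(affinity(det, past) for past in traj[-max_history:]) for traj in trajectories]
        for det in detections
    ]

def best_assignment(scores):
    """Exhaustive search for the assignment with the highest total score.

    Stand-in for the Hungarian algorithm; requires at least as many
    trajectories as detections.
    """
    n_det, n_traj = len(scores), len(scores[0])
    if n_det > n_traj:
        raise ValueError("this sketch needs n_det <= n_traj")
    best, best_total = None, float("-inf")
    for perm in permutations(range(n_traj), n_det):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return list(best)

# Toy memory: two trajectories, each storing the features of two past frames.
trajectories = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
detections = [[0.1, 0.9], [1.0, 0.0]]
scores = similarity_matrix(detections, trajectories)
matches = best_assignment(scores)  # matches[i] is the trajectory index for detection i
```

Summing over the stored history means a single noisy frame in a trajectory has limited influence on the match, which is the stated motivation for reducing ID switches. The default `max_history=5` mirrors the five past frames mentioned in Sec. 4.2.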
Figure 5. Qualitative Results. We visualize the tracking ID, 2D box, 3D box, and category in the front view in the 1st row. The 2nd row shows the bird's eye view (BEV) of the 3D boxes with the center point trajectories of the past 15 frames. Different colors represent objects with different tracking IDs.
[Figure 6 panels: "with Temporal Consistency Loss" (L1 Loss: 0.86, Temporal Consistency Loss: 0.27) and "without Temporal Consistency Loss" (L1 Loss: 0.89, Temporal Consistency Loss: 1.13), each shown against the GT.]

Figure 6. A sample to illustrate the temporal-consistency loss. We show two results predicted by Time3D with and without the temporal-consistency loss. They have almost the same corner L1 loss but different temporal-consistency losses. Time3D with the temporal-consistency loss generates a smoother trajectory.

Spatial-Temporal Information Flow. Tab. 4 examines the effects of the spatial-temporal information flow module in improving the tracking accuracy. To avoid the influence caused by additional network parameters, we replace the spatial and temporal information mod…

5. Conclusion
This work proposes a novel framework to jointly learn 3D object detection and 3D multi-object tracking from only monocular video while running in real-time. Our framework encodes the heterogeneous cues, including category, 2D box, 3D box, and Re-ID feature, into compatible embeddings. The transformer-based architecture performs spatial-temporal information flow to estimate trajectories, which are optimized by the temporal-consistency loss to be smoother. On the nuScenes dataset, the proposed Time3D achieves state-of-the-art tracking performance while running in real-time. Time3D may inspire future researchers to combine 3D tracking and 3D detection in a unified framework, and to encode more 3D information to make vision-based autonomous driving more practical.
References

[1] Seung Hwan Bae and Kuk-Jin Yoon. Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. IEEE Trans. Pattern Anal. Mach. Intell., 40(3):595–610, 2018.
[2] Alex Bewley, ZongYuan Ge, Lionel Ott, Fabio Tozeto Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25-28, 2016, pages 3464–3468. IEEE, 2016.
[3] Guillem Brasó and Laura Leal-Taixé. Learning a neural solver for multiple object tracking. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 6246–6256. Computer Vision Foundation / IEEE, 2020.
[4] Garrick Brazil and Xiaoming Liu. M3d-rpn: Monocular 3d region proposal network for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, South Korea, 2019.
[5] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3d object detection in monocular video. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIII, volume 12368 of Lecture Notes in Computer Science, pages 135–152. Springer, 2020.
[6] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 11618–11628. Computer Vision Foundation / IEEE, 2020.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, ECCV (1), volume 12346 of Lecture Notes in Computer Science, pages 213–229. Springer, 2020.
[8] Mohamed Chaabane, Peter Zhang, J. Ross Beveridge, and Stephen O’Hara. DEFT: detection embeddings for tracking. CoRR, abs/2102.02267, 2021.
[9] Mingyu Ding, Yuqi Huo, Hongwei Yi, Zhe Wang, Jianping Shi, Zhiwu Lu, and Ping Luo. Learning depth-guided convolutions for monocular 3d object detection. In CVPR Workshops, pages 4306–4315. IEEE, 2020.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[11] Kuan Fang, Yu Xiang, Xiaocheng Li, and Silvio Savarese. Recurrent autoregressive networks for online multi-object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 466–475. IEEE Computer Society, 2018.
[12] Zeyu Fu, Pengming Feng, Syed Mohsen Naqvi, and Jonathon A. Chambers. Particle PHD filter based multi-target tracking using discriminative group-structured dictionary learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pages 4376–4380. IEEE, 2017.
[13] Hou-Ning Hu, Qi-Zhi Cai, Dequan Wang, Ji Lin, Min Sun, Philipp Krähenbühl, Trevor Darrell, and Fisher Yu. Joint monocular 3d vehicle detection and tracking. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 5389–5398. IEEE, 2019.
[14] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. CoRR, abs/1711.11575, 2017.
[15] Tino Kutschbach, Erik Bochinski, Volker Eiselein, and Thomas Sikora. Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2017, Lecce, Italy, August 29 - September 1, 2017, pages 1–5. IEEE Computer Society, 2017.
[16] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 12697–12705. Computer Vision Foundation / IEEE, 2019.
[17] Laura Leal-Taixé, Cristian Canton-Ferrer, and Konrad Schindler. Learning by tracking: Siamese CNN for robust target association. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2016, Las Vegas, NV, USA, June 26 - July 1, 2016, pages 418–425. IEEE Computer Society, 2016.
[18] Peixuan Li and Huaici Zhao. Monocular 3d detection with geometric constraint embedding and semi-supervised training. IEEE Robotics Autom. Lett., 6(3):5565–5572, 2021.
[19] Peixuan Li, Huaici Zhao, Pengfei Liu, and Feidao Cao. Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving, 2020.
[20] Keng-Chi Liu, Yi-Ting Shen, and Liang-Gee Chen. Simple online and realtime tracking with spherical panoramic camera. In IEEE International Conference on Consumer Electronics, ICCE 2018, Las Vegas, NV, USA, January 12-14, 2018, pages 1–6. IEEE, 2018.
[21] Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, and Wanli Ouyang. Rethinking pseudo-lidar representation. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, ECCV (13), volume 12358 of Lecture Notes in Computer Science, pages 311–327. Springer, 2020.
[22] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka. 3d bounding box estimation using deep learning and geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7074–7082, 2017.
[23] J. Munkres. Algorithms for the assignment and transportation problems. SIAM. J, 10, 1962.
[24] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation, 2016. arXiv:1612.00593.
[25] Hossein Rahmani, Ajmal S. Mian, and Mubarak Shah. Learning a deep model for human action recognition from novel viewpoints. IEEE Trans. Pattern Anal. Mach. Intell., 40(3):667–681, 2018.
[26] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[28] Sarthak Sharma, Junaid Ahmed Ansari, J. Krishna Murthy, and K. Madhava Krishna. Beyond pixels: Leveraging geometry and shape cues for online multi-object tracking. In 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018, pages 3508–3515. IEEE, 2018.
[29] ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal Mian, and Mubarak Shah. Deep affinity network for multiple object tracking. IEEE Trans. Pattern Anal. Mach. Intell., 43(1):104–119, 2021.
[30] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3701–3710. IEEE Computer Society, 2017.
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv, 2017.
[32] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 913–922, October 2021.
[33] Yan Wang, Wei-Lun Chao, Divyansh Garg, Bharath Hariharan, Mark Campbell, and Kilian Q Weinberger. Pseudo-lidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8445–8453, 2019.
[34] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. Towards real-time multi-object tracking. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XI, volume 12356 of Lecture Notes in Computer Science, pages 107–122. Springer, 2020.
[35] Xinshuo Weng and Kris Kitani. A baseline for 3d multi-object tracking. CoRR, abs/1907.03961, 2019.
[36] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing, ICIP 2017, Beijing, China, September 17-20, 2017, pages 3645–3649. IEEE, 2017.
[37] Jiarui Xu, Yue Cao, Zheng Zhang, and Han Hu. Spatial-temporal relation networks for multi-object tracking. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 3987–3997. IEEE, 2019.
[38] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10), 2018.
[39] Bin Yang, Wenjie Luo, and Raquel Urtasun. Pixor: Real-time 3d object detection from point clouds. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7652–7660, 2018.
[40] Shengping Zhang, Xiangyuan Lan, Hongxun Yao, Huiyu Zhou, Dacheng Tao, and Xuelong Li. A biologically inspired appearance model for robust visual tracking. IEEE Trans. Neural Networks Learn. Syst., 28(10):2357–2370, 2017.
[41] Xingming Zhang, Xuehan Zhang, Xuedan Du, Xiangming Zhou, and Jun Yin. Learning multi-domain convolutional network for RGB-T visual tracking. In Wei Li, Qingli Li, and Lipo Wang, editors, 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics, CISP-BMEI 2018, Beijing, China, October 13-15, 2018, pages 1–6. IEEE, 2018.
[42] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis., 129(11):3069–3087, 2021.
[43] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV, volume 12349 of Lecture Notes in Computer Science, pages 474–490. Springer, 2020.
[44] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. In arXiv preprint arXiv:1904.07850, 2019.
[45] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. 2017.
[46] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced grouping and sampling for point cloud 3d object detection. CoRR, abs/1908.09492, 2019.