HAIDONG WANG, XUAN HE, ZHIYONG LI, JIN YUAN, and SHUTAO LI,
Hunan University, China
In the last few years, enormous strides have been made in object detection and data association, which are vital subtasks of one-stage online multi-object tracking (MOT). However, the two submodules involved in the MOT pipeline are typically processed and optimized separately, resulting in complex method designs that require manual settings. In addition, few works integrate the two subtasks into a single end-to-end network to optimize the overall task. In this study, we propose an end-to-end MOT network called Joint Detection and Association Network (JDAN) that performs training and inference in a single network. All layers in JDAN are differentiable and can be optimized jointly to detect targets and output an association matrix for robust multi-object tracking. Moreover, we generate suitable pseudo-labels to address the data inconsistency between object detection and association. The detection and association submodules can be optimized by a composite loss function derived from the detection results and the generated pseudo association labels, respectively. The proposed approach is evaluated on two MOT challenge datasets, and achieves promising performance compared with classic and state-of-the-art methods.
CCS Concepts: • Computing methodologies → Tracking; Object detection; Supervised learning by classification; • Computer systems organization → Real-time system architecture;
Additional Key Words and Phrases: Online multi-object tracking, end-to-end model, object detection and data
association
1 INTRODUCTION
Online Multi-Object Tracking (MOT), which aims to detect and track multiple targets in a sequence of video frames, has recently attracted extensive attention in computer vision [1, 21, 28].
Haidong Wang and Zhiyong Li are co-first authors; they contributed equally to this work.
This work was partially supported by National Key Research and Development Program of China (No. 2018YFB1308604),
National Natural Science Foundation of China (No. U21A20518, No. 61976086), and Special Project of Foshan Science and
Technology Innovation Team (No. FS0AA-KJ919-4402-0069).
Authors’ address: H. Wang, X. He, Z. Li (corresponding author), J. Yuan (corresponding author), and S. Li, Hunan University,
China; emails: haidong@hnu.edu.cn, 419432961@qq.com, {zhiyong.li, yuanjin, shutao_li}@hnu.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Association for Computing Machinery.
1551-6857/2023/01-ART45 $15.00
https://doi.org/10.1145/3533253
ACM Trans. Multimedia Comput. Commun. Appl., Vol. 19, No. 1s, Article 45. Publication date: January 2023.
45:2 H. Wang et al.
Fig. 1. An illustration of two-stage MOT, one-stage MOT, and our end-to-end MOT. (a) The traditional two-stage MOT pipeline contains two separate stages and extracts features twice. (b) Existing one-stage MOT pipelines process detection and association independently during training, and therefore cannot be trained end-to-end. (c) Our proposed end-to-end MOT model integrates object detection and data association into a one-stage MOT tracker, and uses a traditional association method [43] to generate pseudo association labels, which are utilized to keep the data consistent between object representations and association labels in each training iteration.
In practice, many applications benefit from the success of MOT, such as intelligent driving [12], video surveillance [44], and human activity recognition [29]. Currently, there are two kinds of mainstream MOT approaches, namely two-stage methods and one-stage methods. Two-stage methods [10, 41, 49] usually contain two independent stages: detection and association. The detection stage adopts an offline training pattern, and aims to localize the to-be-tracked targets in the current video frame. Then, the association stage extracts re-identification representations for the targets and matches them with the previous tracks. Benefiting from advances in detection [35, 36, 56], re-identification [27], and data association [41], two-stage approaches have significantly improved tracking accuracy for the MOT task. Nevertheless, as depicted in Figure 1(a), this paradigm first requires offline detection and then repeats feature extraction to generate trajectories. Therefore, it is difficult to achieve real-time tracking.
Different from two-stage approaches, one-stage methods [40, 42] try to integrate both online detection and association into one framework, and can thus share model parameters for target representations (see Figure 1(b)) to decrease the tracking cost [20, 42]. However, common one-stage methods still process detection and association independently: they train an effective detection model and then employ complex association tricks to generate trajectories. The association results therefore depend heavily on the detection accuracy. In other words, detection and association are independent of each other during the training process, so the pipeline cannot be trained end-to-end. As a result, object detection errors are propagated to the association stage, thereby reducing the accuracy of MOT.
Motivated by the above analysis, this article proposes an end-to-end training framework called
Joint Detection and Association Network (JDAN) to alleviate this problem. Our framework mainly
consists of three parts: detection submodule, joint submodule, and association submodule. Concretely,
we first use a pre-trained two-stream detection network to extract initial object candidates and
JDAN: Joint Detection and Association Network for Real-Time Online MOT 45:3
their representations. Then, the joint submodule merges all possible representation concatenations between two frames to generate a confusion tensor. Finally, the association submodule converts the tensor into an association matrix, which indicates the matching relation among multiple targets from the two frames. One challenge in jointly training these submodules lies in the inconsistent object problem: the detection submodule may generate objects that are inconsistent in quality, location, and size compared with the tracking ground truth of the MOT task. To bridge this gap, our approach abandons the existing tracking ground truth and leverages a traditional association method [43] to produce pseudo labels for the detection results (see Figure 1(c)), which are then fed to the association submodule to generate tracking results. Consequently, our model can jointly train all the submodules in an end-to-end way to obtain a robust one-stage model. Moreover, as pseudo labels are only used during training rather than testing, they have no effect on prediction speed during the inference stage.
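The pseudo-label generation step can be illustrated with a simplified sketch. The paper relies on the traditional association method of [43]; as a stand-in (the function names and the greedy IoU-threshold rule below are our own assumptions, not the authors' exact procedure), each detected box can be greedily matched to the best-overlapping ground-truth track to inherit its identity, with unmatched detections marked uncertain:

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def pseudo_labels(detections, gt_tracks, thresh=0.5):
    """Greedily assign each detection the identity of the best-overlapping
    ground-truth track; unmatched detections get label -1 (uncertain)."""
    labels, used = [], set()
    for det in detections:
        best, best_iou = -1, thresh
        for tid, box in gt_tracks.items():
            o = iou(det, box)
            if o > best_iou and tid not in used:
                best, best_iou = tid, o
        if best != -1:
            used.add(best)
        labels.append(best)
    return labels
```

Labeling detections (rather than the tracking ground truth) is what keeps the association targets consistent with what the detection submodule actually produces.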
Different from previous one-stage methods, our approach jointly trains both the detection and association submodules in an end-to-end fashion, which alleviates the error propagation problem. We evaluate the proposed method on the MOT15 [19] and MOT17 [29] datasets, and find that it outperforms multiple online trackers. We believe that this study will be a good inspiration for one-stage online MOT.
To sum up, our primary contributions are listed below:
(1) We propose an end-to-end architecture that jointly processes object detection and association to alleviate the error propagation problem. To the best of our knowledge, our work is the first attempt to implement end-to-end training for the MOT task.
(2) Our approach employs pseudo labels to solve the object inconsistency problem, and we propose a joint submodule followed by association prediction to produce precise tracking results based on these pseudo samples.
(3) We conduct extensive experiments on MOT benchmark datasets with ablation studies. The results demonstrate that our approach achieves real-time online tracking as well as competitive tracking performance compared with several popular models.
2 RELATED WORK
We summarize several recent research progress on online MOT by grouping them into two-stage
and one-stage approaches, and analyze the advantages and disadvantages of these approaches.
since both detection and representation extraction require abundant computing resources. Hence, it is very difficult to achieve real-time tracking.
3 OUR ALGORITHM
In this section, we first describe the framework of our proposed JDAN. Then, we introduce the
details including our detection submodule, joint submodule, and association submodule, respectively.
Finally, we present the training and testing strategies to formulate our end-to-end online MOT.
3.1 Framework
Figure 2 demonstrates the framework of our proposed JDAN, which mainly consists of three sub-
modules: detection submodule, joint submodule, and association submodule. First, two input video
frames are disposed of the two-stream detection submodule with shared parameters to learn object
representations. On this basis, all object representations are then copied along the vertical and
horizontal directions, respectively, to form a confusion tensor in joint submodule. Finally, the gen-
erated tensor is converted into an association matrix via an association predictor in the association
submodule. As a result, the matching relationship between multiple targets from the two frames is
predicted according to the association matrix. Different from previous approaches, our algorithm
could solve the data inconsistent problem between detection and tracking tasks, and implement
an end-to-end training pattern to alleviate the error propagation problem.
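The vertical/horizontal copying in the joint submodule can be sketched in a few lines of NumPy (a simplified illustration; the tensor layout and names are our assumptions): given $N_m$ objects with $C$-dimensional representations in each frame, entry $(k, l)$ of the confusion tensor concatenates object $k$ of the first frame with object $l$ of the second.

```python
import numpy as np

def confusion_tensor(feat_prev, feat_cur):
    """Tile the N_m x C object representations of two frames along
    orthogonal axes and concatenate them channel-wise, giving an
    N_m x N_m x 2C tensor whose (k, l) entry pairs object k of the
    previous frame with object l of the current frame."""
    n, c = feat_prev.shape
    a = np.repeat(feat_prev[:, None, :], n, axis=1)   # vertical copy
    b = np.repeat(feat_cur[None, :, :], n, axis=0)    # horizontal copy
    return np.concatenate([a, b], axis=2)             # shape (n, n, 2c)
```

The association predictor then maps each 2C-dimensional pair entry to a single matching score, producing the $N_m \times N_m$ association matrix.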
Fig. 2. An illustration of the framework of our Joint Detection and Association Network (JDAN), which consists of three submodules: the detection submodule $S_D$, joint submodule $S_J$, and association submodule $S_A$.
balance between model complexity and precision. In order to extract robust object representations, we introduce a variant of DLA (deep layer aggregation) [56] into ResNet-34. Compared with the raw DLA [50], this variant has additional bypasses between the low- and high-layer representations, and thus a stronger ability to handle various complex environments. It has been demonstrated that the DLA variant helps alleviate the identity-switch problem for one-stage approaches because the encoder-decoder network can deal with objects of varied sizes [53, 56]. Moreover, all the convolutional blocks in the upsampling process are replaced by deformable convolution modules, which can adaptively accommodate changes in target size [55].
3.2.2 Positioning Head. The positioning head generates location and size information by using
three heads: heatmap head, size head, and displacement head.
The heatmap head aims to predict an object center for each object, where 1 indicates the center and smaller values in the range (0, 1) represent the distance to the center. Given a bounding box $b^i = (x_1^i, y_1^i, x_2^i, y_2^i)$ in the ground truth, we obtain the corresponding center position $p^i = (\frac{x_1^i + x_2^i}{2}, \frac{y_1^i + y_2^i}{2})$. Considering the downsampling operation, we use $q^i = \frac{p^i}{G}$ to represent the center position on the representation map, where $G$ is the downsampling factor. The heatmap response at the center is set to 1, and to $r_q = \exp(-\frac{(q - q^i)^2}{2\sigma^2})$ at other positions, where $\sigma$ is the Gaussian kernel width that is a function of the target size [18]. Given a predicted heatmap response $\hat{r}_q \in [0, 1]^{\frac{W}{G} \times \frac{H}{G}}$, the heatmap loss function is formulated according to the focal loss [23] as follows:

$$L_h = -\frac{1}{N} \sum_q \begin{cases} (1 - \hat{r}_q)^\alpha \log(\hat{r}_q), & \text{if } r_q = 1, \\ (1 - r_q)^\beta (\hat{r}_q)^\alpha \log(1 - \hat{r}_q), & \text{otherwise}, \end{cases} \quad (1)$$

where $N$ indicates the number of targets in the current video frame, and $\alpha, \beta$ are the hyperparameters of the focal loss.
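For concreteness, the center-focal loss of Equation (1) can be written as a short NumPy sketch (an illustration under the stated definitions, assuming the target heatmap has already been rendered with the Gaussians described above; this is not the authors' training code):

```python
import numpy as np

def heatmap_focal_loss(r_hat, r, alpha=2.0, beta=4.0, eps=1e-12):
    """Pixel-wise focal loss over a predicted heatmap r_hat and a
    Gaussian-rendered target heatmap r (both of shape (H/G, W/G)).
    Centers are the pixels where r == 1; N is the number of centers."""
    pos = (r == 1.0)
    n = max(pos.sum(), 1)  # number of targets N
    # if r_q = 1: (1 - r_hat)^alpha * log(r_hat)
    pos_loss = ((1 - r_hat[pos]) ** alpha * np.log(r_hat[pos] + eps)).sum()
    # otherwise: (1 - r_q)^beta * r_hat^alpha * log(1 - r_hat)
    neg = ~pos
    neg_loss = ((1 - r[neg]) ** beta * r_hat[neg] ** alpha
                * np.log(1 - r_hat[neg] + eps)).sum()
    return -(pos_loss + neg_loss) / n
```

A perfect prediction (r_hat identical to a binary center map) drives both terms to zero, while confident wrong responses are penalized heavily.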
The size head is used for predicting the width and height of an object around its central position, and its output is defined as $\hat{Z} \in \mathbb{R}^{\frac{W}{G} \times \frac{H}{G}}$. Given a ground-truth box $b^i$, the box size is calculated as $z^i = (x_2^i - x_1^i, y_2^i - y_1^i)$, and the predicted bounding-box size is denoted as $\hat{z}^i$. Moreover, as the downsampling operation produces a strong quantization bias, we use the displacement head to address this problem, which is important for improving the accuracy of data association [53]. Let $\hat{d}^i$ be the predicted box displacement; $d^i$ denotes the corresponding ground truth, which is calculated as $d^i = \frac{p^i}{G} - \lfloor \frac{p^i}{G} \rfloor$. The size and displacement losses are then combined as:

$$L_s = \frac{1}{N} \sum_{i=1}^{N} \| z^i - \hat{z}^i \|_1 + \frac{1}{N} \sum_{i=1}^{N} \| d^i - \hat{d}^i \|_1. \quad (2)$$
Finally, the positioning loss $L_p$ is formulated as the sum of the previous two losses:

$$L_p = L_h + L_s. \quad (3)$$
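Equations (2) and (3) reduce to per-target L1 averages, which can be sketched as follows (a minimal NumPy illustration of the loss arithmetic, with array shapes assumed by us):

```python
import numpy as np

def positioning_loss(z, z_hat, d, d_hat, l_h):
    """L_s (Eq. (2)): mean L1 error of box sizes plus mean L1 error of
    center displacements over N targets; L_p (Eq. (3)) adds the heatmap
    loss l_h. z, z_hat, d, d_hat: arrays of shape (N, 2)."""
    n = z.shape[0]
    l_s = np.abs(z - z_hat).sum() / n + np.abs(d - d_hat).sum() / n
    return l_s + l_h
```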
3.2.3 Representation Head. The representation head aims to extract object representations that can differentiate various tracked targets. Given a representation map $E \in \mathbb{R}^{\frac{W}{S} \times \frac{H}{S} \times C_e}$ produced by the backbone, where $S$ represents the scale and $C_e$ indicates the number of channels, the representation head adopts the heatmap approach to generate an object representation $E_p \in \mathbb{R}^{C}$ for each target centered at $p$. To better learn discriminative features for different objects, our approach designs an identity classification loss based on the identity representation $E_p$ as follows:

$$L_c = -\frac{1}{N \times J} \sum_{i=1}^{N} \sum_{j=1}^{J} u^i(j) \log(v^i(j)), \quad (4)$$

where $j$ represents the identity index, and $i$ records the instance index of each identity. $v^i(j)$ is the classification probability of instance $i$ for identity $j$, and $u^i(j)$ is the corresponding ground truth. $J$ is the total number of identities and $N$ is the instance number of each identity.

Finally, the total detection loss $L_d$ is formed as the sum of Equations (3) and (4):

$$L_d = L_p + L_c. \quad (5)$$
where $B_1$ and $B_2$ are constructed by deleting the last row and the last column of $B_{t-n,t}$, respectively. The operator $\odot$ denotes the Hadamard product [15].

The second loss, called the non-maximum loss $L_m$, penalizes non-maximum associations in the disappearance and appearance branches of the association computation [39]:

$$L_m = \frac{-\sum_{k=1}^{N_m} \sum_{l=1}^{N_m} \log(A_m^{kl}) \, B_3^{kl}}{\sum_{k=1}^{N_m} \sum_{l=1}^{N_m} B_3^{kl}}, \quad (7)$$

where $B_3$ is defined by deleting both the last row and the last column of $B_{t-n,t}$. The third loss, called the balance loss $L_b$, penalizes any imbalance between the disappearance and appearance branches, which means
Fig. 3. The technical route of the association submodule in processing target disappearance and appearance. For the input historical and current frames, we predict the similarity association matrix $M_a$. In consideration of multi-object disappearances and appearances between the historical and current video frames, $M_1$ and $M_2$ are designed by adding an additional column and row to $M_a$, respectively. Then, horizontal and vertical softmax operations are conducted on $M_1$ and $M_2$, and the cropped matrices $A_c$, $A_r$ are utilized to compute the loss $L_m$ using the pseudo association labels. Finally, the total association loss is constructed as the sum of $L_t$, $L_m$, and $L_b$.
that the number of targets appearing in and disappearing from the field of view should be equal:

$$L_b = \sum_{k=1}^{N_m} \sum_{l=1}^{N_m} \left| A_c^{kl} - A_r^{kl} \right|. \quad (8)$$

Finally, the total association loss is the sum of Equations (6)–(8):

$$L_s = L_t + L_m + L_b. \quad (9)$$
In the implementation, we only calculate pseudo positive and pseudo negative pairs, and discard
uncertain pairs for the three losses.
Fig. 4. Online MOT by JDAN. Given an input video frame $F_t$, the target boxes and representations are supplied by the positioning and representation heads, respectively, with a single-stream detection submodule in JDAN. The object representation $R_t$ is paired with the latest $n$ historical representations $R_{t-n:t-1}$, and every pair of representations is passed through the joint and association submodules to estimate the corresponding association matrix $M_{t-n:t-1,t}$. In addition, the representation $R_t$ is reserved to estimate the association matrices of the next $n$ frames. Finally, we update the track recorder by linking the objects of the present frame with those of the $n$ historical frames using the inferred association matrices.
pair, where the key is the index of a historical frame, and the value is the corresponding target representation in that frame. After receiving a new target representation $R_t$ in $F_t$, the trajectory manager reserves the representation $R_t$ and feeds all the representations $R_{t-n:t-1}$, $R_t$ into the joint and association submodules to generate the association matrix $M_{t-n:t-1,t}$ between each historical frame $F_{t-n}$ and the current frame $F_t$. After that, the accumulator in the trajectory manager is updated, and the tracking results $T_t$ in $F_t$ are generated. Some special cases are considered in our method. For instance, a historical track may have no matching target in the current frame, which indicates that a target has disappeared. In such a case, we add a column to the accumulator so that each track corresponds to a sole column. The cost of online tracking is composed of two parts: the detection submodule, which generates the representation $R_t$ for the current frame, and the joint and association submodules, which output the tracking results $T_t$. To ensure tracking speed, we adopt lightweight detection and association networks, and all the historical information including $R_{t-n:t-1}$ is stored in the trajectory manager without recalculation. Therefore, our online tracking achieves a fast speed.
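The trajectory-manager loop above can be sketched as a small class (a simplified stand-in: the class and method names are ours, and a plain thresholded argmax over a score matrix replaces the learned joint and association submodules):

```python
import numpy as np

class TrajectoryManager:
    """Minimal sketch: keeps the last n frames' object representations
    keyed by frame index, and assigns track ids by matching the current
    frame against history, newest frame first."""
    def __init__(self, n=5):
        self.n = n
        self.history = {}          # frame index -> list of feature vectors
        self.tracks = {}           # frame index -> list of track ids
        self.next_id = 0

    def step(self, frame_idx, feats, associate):
        """feats: current-frame representations; associate(prev, cur)
        returns a score matrix of shape (len(prev), len(cur)).
        Returns the track id assigned to each current-frame object."""
        if not feats:
            self.history[frame_idx], self.tracks[frame_idx] = feats, []
            return []
        ids = [None] * len(feats)
        for past in sorted(self.history, reverse=True):
            if not self.history[past]:
                continue
            scores = associate(self.history[past], feats)
            for j in range(len(feats)):
                if ids[j] is None:
                    k = int(scores[:, j].argmax())
                    if scores[k, j] > 0.5:   # confident match -> reuse id
                        ids[j] = self.tracks[past][k]
        for j in range(len(feats)):
            if ids[j] is None:               # newly appearing target
                ids[j] = self.next_id
                self.next_id += 1
        self.history[frame_idx], self.tracks[frame_idx] = feats, ids
        for old in [f for f in self.history if f <= frame_idx - self.n]:
            del self.history[old]; del self.tracks[old]
        return ids
```

Storing the last $n$ representations and reusing them without recomputation is exactly what keeps the per-frame cost bounded.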
4 EXPERIMENTS
This section demonstrates the experiments by JDAN. We first introduce datasets and implementa-
tion details of our proposed JDAN, and then illustrate the experimental results as well as make a
deep analysis.
For evaluation, we employ a variety of metrics, including MOT accuracy (MOTA), MOT precision (MOTP), the ratio of correctly identified detections to the mean number of ground-truth and predicted detections (IDF1 [37]), mostly tracked (MT) trajectories, mostly lost (ML) trajectories, identity switches (ID_Sw), the total number of false positives (FP), the total number of missed targets (FN), the total number of fragmented trajectories (Frag) [4], and the number of frames processed per second (FPS). Different metrics focus on different aspects of MOT performance.
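For concreteness, the two headline metrics follow the standard CLEAR MOT [4] and IDF1 [37] definitions (the helper functions below are our own illustration, not the evaluation code used for the benchmarks):

```python
def mota(fn, fp, id_sw, num_gt):
    """MOTA = 1 - (FN + FP + ID_Sw) / GT, where GT is the total number
    of ground-truth objects summed over all frames."""
    return 1.0 - (fn + fp + id_sw) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN): the F1 score of
    identity-preserving detections."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```

Note that MOTA can be negative when errors outnumber ground-truth objects, which is why it is usually read together with IDF1 and ID_Sw.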
Table 2. Ablative Study with Different Training Paradigms on MOT17 Benchmark Data
submodule fine-tuned with the generated pseudo labels. On this basis, Detection Training retrains the detection submodule on the detection datasets, and then uses an association submodule trained on the generated pseudo labels. Considering the visual gap between the MOT and detection datasets, the third training paradigm, called Detection Fine-tuning, employs a detection submodule fine-tuned on the MOT detection datasets to boost detection performance, while Association Fine-tuning does not use MOT datasets for the detection submodule but fine-tunes the association submodule on MOT datasets. JDAN is the whole model fine-tuned on MOT datasets with end-to-end training of both the detection and association submodules.
Table 2 reports the performance comparison among the above training paradigms, and our analysis is listed below:
(1) Compared to Baseline, Detection Training achieves a significant performance improvement, with MOTA increasing from 20.7% to 34.6% and MOTP rising from 38.3% to 42.8%. This improvement indicates the effectiveness of retraining the detection submodule on the detection datasets, since pretraining on Microsoft COCO cannot offer accurate pedestrian detections for the crowded scenes of the MOT task.
(2) Detection Fine-tuning further improves performance compared with Detection Training, with a significant improvement from 42.8% to 52.8% on MOTP, and from 18,264 to 13,236 on ID_Sw. This improvement stems from the use of MOT datasets for the detection submodule to bridge the visual gap between the detection and MOT datasets. Meanwhile, Association Fine-tuning achieves the same improvement as Detection Fine-tuning, due to the use of MOT datasets to update the association submodule.
(3) JDAN improves all of the metrics and achieves the best performance. This enhancement is due not only to the fine-tuned detection submodule but also to the fine-tuned association submodule. Specifically, JDAN achieves 58.1% on MOTA, which reflects the tracking accuracy. As mentioned before, JDAN adopts end-to-end training to alleviate the error propagation problem and thus obtains powerful object association capability.
4.3.2 Representation Dimension. Previous works in person re-identification normally exploit high-dimensional object representations and attain excellent performance without studying this choice. Nevertheless, FairMOT [53] discovers that the representation dimension plays an important role, and that lower-dimensional object representations are preferable in the MOT task to avoid overfitting, since the training data are insufficient compared with the re-identification task. Therefore, extracting low-dimensional representations relieves the overfitting problem on small datasets, and thus enhances MOT performance. The previous two-stage methods are not affected by data deficiency because they can utilize abundant re-identification data. In contrast, one-stage approaches cannot utilize these re-identification data because they require raw unclipped video frames. One way to alleviate this data-dependence problem is to decrease the dimension of the object representation.
For this experiment, we test various dimension configurations, and the results are listed in Table 3. It is observed that MOTA increases continually as the dimension is reduced from 1,024 to 128, which demonstrates the merit of lower-dimensional representations for the MOT task. Furthermore, MOTA starts to decrease when the dimensionality is reduced to 64, as the object representation becomes impaired. Even though the improvements in the MOTA score are small, ID_Sw decreases markedly from 8,107 to 6,129, which plays an important role in enhancing overall MOT performance. The model's running speed also marginally increases as the representation dimension decreases. However, low-dimensional object representations are effective only when the training datasets are insufficient; with more data, the problem arising from the representation dimension is alleviated.
4.3.3 Maximum Object Number. Considering the various object densities in the MOT task, we utilize an appropriate maximum object number Nm to adapt to different MOT contexts. We try three Nm values (100, 150, 200) under different representation dimensions, and the experimental results are listed in Table 4. The best MOTA is achieved when Nm = 150, and the best ID_Sw when Nm = 100. For MOTA, we think that too small an Nm limits the fitting ability of the model, while too large an Nm causes overfitting in the association submodule and degrades MOT performance. For ID_Sw, a small Nm reduces the number of object candidates, and thus helps avoid identity switches.
Table 5. Comparison with the online MOT Methods Using Private Detection Agreement
RAR15 [10]. Meanwhile, we compare JDAN with the one-stage approach JDE [42], which is a representative MOT method. TrackRCNN [40] is not compared because it requires additional image segmentation ground truth, which is not available on the MOT datasets. Besides, FairMOT [53] adopts both visual features and IoU distance with complex, hand-designed tricks in its association stage to improve tracking performance; its engineering implementation requires manually tuning many model parameters to adapt to different datasets. In contrast, our JDAN is simple and abandons manual parameter settings in the association stage by using an association matrix.
Table 5 lists the comparison results. Compared with two-stage approaches, JDAN slightly improves the accuracy measured by all the metrics on both the MOT15 and MOT17 datasets. Moreover, JDAN achieves a near real-time tracking speed, at least 10 FPS faster than two-stage methods. This high efficiency stems from the shared representation features and the end-to-end tracking paradigm. Compared to the one-stage JDE, JDAN performs better on all the metrics, with IDF1 increasing from 56.9 to 57.8 on MOT15, and from 55.8 to 59.2 on MOT17. This improvement shows that the end-to-end training in JDAN alleviates the error propagation problem for online tracking. Furthermore, JDAN runs faster for online tracking, because JDAN
Fig. 5. Tracking results on the MOT17 dataset. "MOT17-03" in the first row (a crowded street at night from an elevated camera view), "MOT17-05" in the second row (a street scene from a moving platform), and "MOT17-09" in the last row (a crowded pedestrian environment filmed from a lower viewpoint) indicate that the proposed algorithm can re-associate objects when the tracking process drifts, as shown in the penultimate column.
directly adopts an association matrix to generate tracking results, while JDE uses a larger backbone
and many complex association tricks.
Table 6 reports the detailed performance on the individual videos of the MOT17 test set. For instance, JDAN performs well on MOT17-03, which is a crowded context for the MOT task. Finally, Figure 5 illustrates three representative examples from the MOT17 dataset tracked by JDAN. The first example is a crowded street at night from a high viewpoint, and the second and third examples show crowded pedestrian environments from a low viewpoint with moving and stationary platforms, respectively. The results demonstrate that JDAN can successfully re-associate targets when the tracking process drifts.
5 CONCLUSION
In this article, we designed an end-to-end framework to alleviate the error propagation problem
for MOT task. Going beyond the existing methods, JDAN could jointly train both detection and
association tasks, with the hope that both target detection and tracking could be optimized. Tech-
nically, we employed pseudo labels to solve the object inconsistent problem, and designed a joint
submodule as well as an association predictor to produce precise tracking results. The end-to-end
architecture is quite simple yet effective, and avoids tedious manual settings in association as com-
pared to the existing MOT approaches. Extensive experiments on widely used MOT benchmarks
demonstrate the superiority of our proposed approach in terms of effectiveness and efficiency. We
believe our work can encourage and motivate further studies for MOT task.
REFERENCES
[1] Na An and Wei Qi Yan. 2021. Multitarget tracking using Siamese neural networks. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 2s (2021), 1–16.
[2] Seung-Hwan Bae and Kuk-Jin Yoon. 2018. Confidence-based data association and discriminative deep appearance
learning for robust online multi-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018),
1–1.
[3] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixe. 2019. Tracking without bells and whistles. In Proceedings of
the IEEE/CVF International Conference on Computer Vision. 941–951.
[4] Keni Bernardin and Rainer Stiefelhagen. 2008. Evaluating multiple object tracking performance: The CLEAR MOT
metrics. EURASIP J. Image and Video Processing (2008), Article No. 1.
[5] Long Chen, Haizhou Ai, Chong Shang, Zijie Zhuang, and Bo Bai. 2017. Online multi-object tracking with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 645–649.
[6] Peng Chu and Haibin Ling. 2019. Famnet: Joint learning of feature, affinity and multi-dimensional assignment for
online multiple object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6172–
6181.
[7] Patrick Dendorfer, Seyed Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian D. Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. 2019. CVPR19 tracking and detection challenge: How crowded can it get? CoRR (2019).
[8] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. 2009. Pedestrian detection: A benchmark. In Proceed-
ings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 304–311.
[9] Andreas Ess, Bastian Leibe, Konrad Schindler, and Luc Van Gool. 2008. A mobile vision system for robust multi-person
tracking. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
[10] Kuan Fang, Yu Xiang, Xiaocheng Li, and Silvio Savarese. 2018. Recurrent autoregressive networks for online multi-
object tracking. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE,
466–475.
[11] Zeyu Fu, Federico Angelini, Jonathon Chambers, and Mohsen Syed Naqvi. 2019. Multi-level cooperative fusion of
GM-PHD filters for online multiple human tracking. IEEE Transactions on Multimedia (2019), 1–1.
[12] Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? The KITTI vision
benchmark suite. Computer Vision and Pattern Recognition (2012), 3354–3361.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017), 1–1.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2016). 770–778.
[15] Jin-Hwa Kim, Kyoung-Woon On, Jeonghee Kim, Jung-Woo Ha, and Byoung-Tak Zhang. 2017. Hadamard product for
low-rank bilinear pooling. In Proceedings of the International Conference on Learning Representations (ICLR'17).
[16] Iasonas Kokkinos. 2017. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level
vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 6129–6138.
[17] Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 1–2
(1955), 83–97.
[18] Hei Law and Jia Deng. 2018. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Confer-
ence on Computer Vision (ECCV). 734–750.
[19] Laura Leal-Taixé, Anton Milan, Ian D. Reid, Stefan Roth, and Konrad Schindler. 2015. MOTChallenge 2015: Towards
a benchmark for multi-target tracking. CoRR (2015).
[20] Guiji Li, Manman Peng, Ke Nai, Zhiyong Li, and Keqin Li. 2019. Multi-view correlation tracking with adaptive memory-
improved update model. Neural Computing and Applications (2019), 9047–9063.
[21] Zhiyong Li, Ke Nai, Guiji Li, and Shilong Jiang. 2020. Learning a dynamic feature fusion tracker for object tracking.
IEEE Transactions on Intelligent Transportation Systems (2020).
[22] Zhiyong Li, Dongming Wang, Ke Nai, Tong Shen, and Ying Zeng. 2016. Robust object tracking via weight-based local
sparse appearance model. In Proceedings of the International Conference on Natural Computation, Fuzzy Systems and
Knowledge Discovery (ICNC-FSKD). 560–565.
[23] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In
Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence
Zitnick. 2014. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer
Vision. Springer, 740–755.
[25] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg.
2016. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision. Springer,
21–37.
[26] Zhipeng Luo, Zhiguang Zhang, and Yuehan Yao. 2020. A strong baseline for multiple object tracking on VidOR dataset.
In Proceedings of the 28th ACM International Conference on Multimedia. 4595–4599.
[27] Jingyi Lv, Zhiyong Li, Ke Nai, Ying Chen, and Jin Yuan. 2020. Person re-identification with expanded neighborhoods
distance re-ranking. Image and Vision Computing (2020).
ACM Trans. Multimedia Comput. Commun. Appl., Vol. 19, No. 1s, Article 45. Publication date: January 2023.
[28] Nima Mahmoudi, Seyed Mohammad Ahadi, and Mohammad Rahmati. 2019. Multi-target tracking using CNN-based
features: CNNMTT. Multimedia Tools and Applications 78, 6 (2019), 7077–7096.
[29] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. 2016. MOT16: A benchmark for multi-
object tracking. CoRR (2016).
[30] James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the Society for
Industrial and Applied Mathematics 5, 1 (1957), 32–38.
[31] Ke Nai, Zhiyong Li, Guiji Li, and Shanquan Wang. 2018. Robust object tracking via local sparse appearance model.
IEEE Transactions on Image Processing (2018), 4958–4970.
[32] Ke Nai, Degui Xiao, Zhiyong Li, Shilong Jiang, and Yu Gu. 2019. Multi-pattern correlation tracking. Knowledge-Based
Systems (2019).
[33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming
Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library.
In Advances in Neural Information Processing Systems (NeurIPS'19). 8024–8035.
[34] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. 2017. Hyperface: A deep multi-task learning framework for face
detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence 41, 1 (2017), 121–135.
[35] Joseph Redmon and Ali Farhadi. 2018. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with
region proposal networks. In Proceedings of the International Conference on Neural Information Processing Systems
(NIPS 2015). 91–99.
[37] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. 2016. Performance measures and a data
set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision (ECCV’16).
Springer, 17–35.
[38] Ricardo Sanchez-Matilla, Fabio Poiesi, and Andrea Cavallaro. 2016. Online multi-target tracking with strong and weak
detections. In Proceedings of the European Conference on Computer Vision. Springer, 84–99.
[39] ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal S. Mian, and Mubarak Shah. 2019. Deep affinity network for
multiple object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[40] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger,
and Bastian Leibe. 2019. MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR'19). 7942–7951.
[41] Haidong Wang, Saizhou Wang, Jingyi Lv, Chenming Hu, and Zhiyong Li. 2020. Non-local attention association scheme
for online multi-object tracking. Image and Vision Computing (2020), 103983.
[42] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. 2020. Towards real-time multi-object tracking.
In Proceedings of the European Conference on Computer Vision. Springer, 107–122.
[43] Greg Welch, Gary Bishop, et al. 1995. An introduction to the Kalman filter. (1995).
[44] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. 2017. Simple online and realtime tracking with a deep association
metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP'17).
[45] Yu Xiang, Alexandre Alahi, and Silvio Savarese. 2015. Learning to track: Online multi-object tracking by decision
making. In Proceedings of the IEEE International Conference on Computer Vision. 4705–4713.
[46] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. 2017. Joint detection and identification feature
learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3415–
3424.
[47] Xin Xu, Shiqin Wang, Zheng Wang, Xiaolong Zhang, and Ruimin Hu. 2021. Exploring image enhancement for salient
object detection in low light images. ACM Transactions on Multimedia Computing, Communications, and Applications
(TOMM) 17, 1s (2021), 1–19.
[48] Yihong Xu, Aljosa Osep, Yutong Ban, Radu Horaud, Laura Leal-Taixé, and Xavier Alameda-Pineda. 2020. How to train
your deep multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR'20). 6786–6795.
[49] Fengwei Yu, Wenbo Li, Quanquan Li, Yu Liu, Xiaohua Shi, and Junjie Yan. 2016. POI: Multiple object tracking with
high performance detection and appearance feature. In Proceedings of the European Conference on Computer Vision.
Springer, 36–42.
[50] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. 2018. Deep layer aggregation. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 2403–2412.
[51] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. 2017. CityPersons: A diverse dataset for pedestrian detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3213–3221.
[52] Sixin Zhang, Anna E. Choromanska, and Yann LeCun. 2015. Deep learning with elastic averaging SGD. In Advances
in Neural Information Processing Systems. 685–693.
[53] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. 2021. FairMOT: On the fairness of detection
and re-identification in multiple object tracking. International Journal of Computer Vision (2021), 1–19.
[54] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, and Qi Tian. 2017. Person re-
identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1367–1376.
[55] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. 2020. Tracking objects as points. In Proceedings of the European
Conference on Computer Vision. Springer, 474–490.
[56] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. 2019. Objects as points. arXiv preprint arXiv:1904.07850 (2019).
[57] Zongwei Zhou, Junliang Xing, Mengdan Zhang, and Weiming Hu. 2018. Online multi-target tracking with tensor-
based high-order graph matching. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR).
IEEE, 1809–1814.
[58] Ji Zhu, Hua Yang, Nian Liu, Minyoung Kim, Wenjun Zhang, and Ming-Hsuan Yang. 2018. Online multi-object tracking
with dual matching attention networks. In Proceedings of the European Conference on Computer Vision (ECCV'18). 379–396.