
JDAN: Joint Detection and Association Network for

Real-Time Online Multi-Object Tracking

HAIDONG WANG, XUAN HE, ZHIYONG LI, JIN YUAN, and SHUTAO LI,
Hunan University, China

In the last few years, enormous strides have been made in object detection and data association, which are
vital subtasks of one-stage online multi-object tracking (MOT). However, the two submodules involved in
the whole MOT pipeline are usually processed or optimized separately, resulting in complex method designs
and manual settings. In addition, few works integrate the two subtasks into a single end-to-end network to
optimize the overall task. In this study, we propose an end-to-end MOT network called the joint detection
and association network (JDAN) that performs training and inference in a single network. All layers in JDAN
are differentiable and can be optimized jointly to detect targets and output an association matrix for robust
multi-object tracking. Moreover, we generate suitable pseudo-labels to address the data inconsistency between
object detection and association. The detection and association submodules are optimized by a composite loss
function derived from the detection results and the generated pseudo association labels, respectively. The
proposed approach is evaluated on two MOT challenge datasets and achieves promising performance compared
with classic and state-of-the-art methods.

CCS Concepts: • Computing methodologies → Tracking; Object detection; Supervised learning by classification; • Computer systems organization → Real-time system architecture;

Additional Key Words and Phrases: Online multi-object tracking, end-to-end model, object detection and data
association

ACM Reference format:


Haidong Wang, Xuan He, Zhiyong Li, Jin Yuan, and Shutao Li. 2023. JDAN: Joint Detection and Association
Network for Real-Time Online Multi-Object Tracking. ACM Trans. Multimedia Comput. Commun. Appl. 19,
1s, Article 45 (January 2023), 17 pages.
https://doi.org/10.1145/3533253

1 INTRODUCTION
Online Multi-Object Tracking (MOT), which aims to detect and track multiple targets in a se-
quence of video frames, has attracted extensive attention in computer vision [1, 21, 28] recently.

Haidong Wang and Zhiyong Li are co-first authors; they contributed equally to this work.
This work was partially supported by National Key Research and Development Program of China (No. 2018YFB1308604),
National Natural Science Foundation of China (No. U21A20518, No. 61976086), and Special Project of Foshan Science and
Technology Innovation Team (No. FS0AA-KJ919-4402-0069).
Authors’ address: H. Wang, X. He, Z. Li (corresponding author), J. Yuan (corresponding author), and S. Li, Hunan University,
China; emails: haidong@hnu.edu.cn, 419432961@qq.com, {zhiyong.li, yuanjin, shutao_li}@hnu.edu.cn.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2023 Association for Computing Machinery.
1551-6857/2023/01-ART45 $15.00
https://doi.org/10.1145/3533253


Fig. 1. An explanation of two-stage MOT, one-stage MOT, and our end-to-end MOT. (a) The traditional two-stage
MOT pipeline contains two separate stages and extracts features twice. (b) The existing one-stage MOT pipeline
adopts an independent processing pattern for detection and association during the training process, and
cannot be trained end to end. (c) Our proposed end-to-end MOT model integrates object detection
and data association into a one-stage MOT tracker, and uses a traditional association method [43] to generate
pseudo association labels, which are utilized to keep the data consistent between object representations and
association labels in each training iteration.

In practice, many applications benefit from the success of MOT tasks, such as intelligent driv-
ing [12], video surveillance [44], human activity recognition [29], and so on. Currently, there are
two kinds of mainstream MOT approaches, namely two-stage methods and one-stage methods. Two-
stage methods [10, 41, 49] usually contain two independent stages: detection and association. The
detection stage adopts an offline training pattern, and aims to localize the to-be-tracked targets in
the current video frame. Then, the association stage extracts re-identification representations for
the targets, and matches them with the previous tracks. Benefiting from the advances in detection [35, 36, 56],
re-identification [27], and data association [41], two-stage approaches have significantly improved tracking
accuracy for the MOT task. Nevertheless, as depicted in Figure 1(a), this paradigm first requires offline
detection and then repeats feature extraction to generate trajectories. Therefore, it is difficult to achieve
real-time tracking.
Different from two-stage approaches, one-stage methods [40, 42] try to integrate both online
detection and association into one framework, and thus could share model parameters for target
representations (see Figure 1(b)) to decrease the tracking cost [20, 42]. However, common one-
stage methods adopt an independent processing pattern for detection and association, including
training an effective detection model and then employing complex association tricks for generat-
ing trajectories. The association results greatly depend on the detection accuracy. In other words,
detection and association are independent of each other during the training process, and the pipeline
cannot be trained end to end. As a result, object detection errors are propagated to the association stage,
thereby reducing the accuracy of MOT.
Motivated by the above analysis, this article proposes an end-to-end training framework called
Joint Detection and Association Network (JDAN) to alleviate this problem. Our framework mainly
consists of three parts: detection submodule, joint submodule, and association submodule. Concretely,
we first use a pre-trained two-stream detection network to extract initial object candidates and
their representations. Then, the joint submodule is employed to merge all possible representation
concatenations between two frames to generate a confusion tensor. Finally, the association sub-
module converts the tensor to an association matrix, which indicates the matching relation among
multiple targets from the two frames. To jointly train these submodules, one challenge lies in the object
inconsistency problem: the detection submodule may generate objects that are inconsistent in quality, location,
and size with the tracking ground truth of the MOT task. To bridge this gap, our approach abandons the existing
tracking ground truth and leverages a traditional association method [43] to produce pseudo labels for the
detection results (see Figure 1(c)), which are then fed to the association submodule to generate tracking
results. Consequently, our model can jointly train all the submodules in an end-to-end way to obtain a robust
one-stage model. Moreover, as pseudo labels are only used in the training stage instead of testing, they have
no effect on prediction speed during inference.
Different from previous one-stage methods, our approach could jointly train both detection and
association submodules in an end-to-end fashion, which alleviates the error propagation problem.
We evaluate the proposed method on the MOT15 [19] and MOT17 [29] datasets, and find that it outperforms
multiple online trackers. We believe that this study will be a good inspiration
for one-stage online MOT.
To sum up, our primary contributions are listed below:
(1) We propose an end-to-end architecture that jointly processes object detection and association to
alleviate the error propagation problem. To the best of our knowledge, our work is the first attempt
to implement end-to-end training for the MOT task.
(2) Our approach employs pseudo labels to solve the object inconsistency problem, and we propose a joint
submodule followed by association prediction to produce precise tracking results based on these pseudo
samples.
(3) We conduct extensive experiments on MOT benchmark datasets with ablation studies. The results
demonstrate that our approach achieves real-time online tracking as well as competitive tracking
performance compared to several popular models.

2 RELATED WORK
We summarize recent research progress on online MOT by grouping the methods into two-stage
and one-stage approaches, and analyze their advantages and disadvantages.

2.1 Two-Stage MOT Approaches


Traditional tracking algorithms [28, 44, 49, 57] usually consider object detection and data associa-
tion as two independent stages: First, they utilize the object detection technology to find all targets
for each video frame in the form of bounding boxes [13, 26, 35]; Second, they conventionally follow
a typical paradigm that computes a similarity loss according to the spatial relationship and appear-
ance representation of detection results, and then perform state estimation [21, 22, 31, 32, 43] and
data association [17, 57] to solve the association problem across video frames to generate a series
of trajectories. To improve the performance, advanced studies [28] utilize sophisticated associative
methods such as recurrent neural networks [10] and graph matching [57]. The merit of these two-stage
approaches is that they can utilize the most appropriate method for each stage to maximize performance.
Moreover, they crop video frames based on the detected bounding boxes and resize them to the same scale
before learning object representations, which helps handle the scale variation of tracked targets. Generally,
two-stage methods [49] achieve the best tracking performance on MOT benchmarks, but are quite time-consuming

since both detection and representation extraction require abundant computing resources. Hence,
it is very difficult to achieve real-time tracking.

2.2 One-Stage MOT Approaches


With the development of object detection [47] and multi-task learning [16, 34], a recent trend
in MOT tries to combine both detection and data association into a single framework, with the
aim of reducing the runtime by using shared parameters. For instance, MOTS [40] appends a
re-identification branch to Mask R-CNN [13] to predict bounding boxes, and uses a fully
connected layer to predict an association vector for each proposal based on additional image
segmentations. Moreover, JDE [42], built on YOLOv3 [35], obtains a high tracking speed, but its
anchors are not well suited for extracting object features in the MOT task, resulting in decreased
accuracy. FairMOT [53] uses an anchor-free approach to alleviate this problem of anchor-based
methods. However, detection and association remain independent of each other during the training
process, which may cause detection error propagation. Typically, the performance of one-stage
approaches is worse than that of two-stage ones. Moreover, since the data inconsistency between
the two subtasks is not resolved, these methods cannot implement end-to-end online MOT.
Unlike the previous approaches [42, 53], our approach implements an end-to-end architecture
to jointly train detection and association, which alleviates the error propagation problem
and achieves fast tracking.

3 OUR ALGORITHM
In this section, we first describe the framework of our proposed JDAN. Then, we introduce the
details of the detection submodule, joint submodule, and association submodule. Finally, we present
the training and testing strategies that formulate our end-to-end online MOT.

3.1 Framework
Figure 2 demonstrates the framework of our proposed JDAN, which mainly consists of three sub-
modules: the detection submodule, the joint submodule, and the association submodule. First, two input video
frames are processed by the two-stream detection submodule with shared parameters to learn object
representations. On this basis, all object representations are copied along the vertical and horizontal
directions, respectively, to form a confusion tensor in the joint submodule. Finally, the generated tensor is
converted into an association matrix via an association predictor in the association submodule. As a result,
the matching relationship between multiple targets from the two frames is predicted according to the
association matrix. Different from previous approaches, our algorithm can solve the data inconsistency problem
between the detection and tracking tasks, and implements an end-to-end training pattern to alleviate the error
propagation problem.

3.2 Detection Submodule


Given an input frame F ∈ RW ×H ×3 , where W and H are the width and height, respectively, the
detection submodule generates several object bounding boxes and their corresponding representa-
tions. In our model, two types of predictive heads are added to the backbone, where the positioning
head locates target bounding boxes, and the representation head generates object representations.
Next, we will elaborate the backbone and the two heads.
3.2.1 Backbone Network. The backbone network is crucial for the MOT task since both low-resolution and
high-resolution representations need to be exploited to handle tracked targets at various scales. In our work,
we use ResNet-34 [14] as the backbone network to strike a
balance between model complexity and precision.

Fig. 2. An illustration of the framework of our Joint Detection Association Network (JDAN), which consists
of three submodules: detection submodule S_D, joint submodule S_J, and association submodule S_A.

In order to extract robust object representations,
we introduce a variant of DLA [56] (deep layer aggregation) into ResNet-34. Compared with the
raw DLA [50], this variant has additional bypasses between the low- and high-layer representations,
and thus has a stronger ability to handle various complex environments. It has been demonstrated
that this variant of DLA helps alleviate the identity switch problem for one-stage approaches
because the encoder-decoder network can deal with objects of varied sizes [53, 56]. Moreover, all
the convolutional blocks in the upsampling process are replaced by deformable convolution modules,
which can adaptively accommodate changes in target size [55].
3.2.2 Positioning Head. The positioning head generates location and size information by using
three heads: heatmap head, size head, and displacement head.
The heatmap head aims to predict an object center for each target, where 1 indicates the center and smaller
values in the range (0, 1) encode the distance to the center. Given a ground-truth bounding box
b^i = (x_1^i, y_1^i, x_2^i, y_2^i), we obtain the corresponding center position
p^i = ((x_1^i + x_2^i)/2, (y_1^i + y_2^i)/2). Considering the downsampling operation, we use
q^i = ⌊p^i / G⌋ to represent the center position on the representation map, where G is the downsampling
factor. The heatmap response at the center is set to 1, and to r_q = exp(−(q − q^i)^2 / (2σ^2)) at other
positions, where σ is the Gaussian kernel width that is a function of the target size [18]. Given a predicted
heatmap response r̂_q ∈ [0, 1]^{W/G × H/G}, the heatmap loss is formulated according to the focal loss [23]
as follows:
$$
L_h = -\frac{1}{N}\sum_{q}
\begin{cases}
(1-\hat{r}_q)^{\alpha}\log(\hat{r}_q), & \text{if } r_q = 1,\\
(\hat{r}_q)^{\alpha}\log(1-\hat{r}_q)\,(1-r_q)^{\beta}, & \text{otherwise},
\end{cases}
\qquad (1)
$$

where N indicates the number of targets in the current video frame, and α, β are the hyperparam-
eters of the focal loss.
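For concreteness, a minimal PyTorch sketch of this penalty-reduced focal loss is shown below; the function name, the eps term, and the default α, β values are illustrative assumptions rather than the authors' code.

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss over a predicted heatmap (Eq. (1)).

    pred, gt: tensors of shape (W/G, H/G) with values in [0, 1];
    gt equals 1 at object centers and decays with a Gaussian elsewhere.
    """
    pos_mask = gt.eq(1).float()            # object centers (r_q = 1)
    neg_mask = 1.0 - pos_mask              # all other locations

    pos_loss = (1 - pred).pow(alpha) * torch.log(pred + eps) * pos_mask
    neg_loss = pred.pow(alpha) * torch.log(1 - pred + eps) * (1 - gt).pow(beta) * neg_mask

    num_objects = pos_mask.sum().clamp(min=1.0)   # N in Eq. (1)
    return -(pos_loss.sum() + neg_loss.sum()) / num_objects
```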
The size head predicts the width and height of an object around its central position, and its output is
defined as Ẑ ∈ R^{W/G × H/G}. Given a ground-truth box b^i, the box size is calculated as
z^i = (x_2^i − x_1^i, y_2^i − y_1^i), and the predicted bounding box size is denoted as ẑ^i. Moreover, as the
downsampling operation introduces a strong quantization bias, we use the displacement head to compensate for
it, which is important for improving the accuracy of data association [53]. Let d̂^i be the predicted box
displacement; the corresponding ground truth is calculated as d^i = p^i/G − ⌊p^i/G⌋. We employ the L_1 loss
for the size and displacement heads as follows:
$$
L_s = \frac{1}{N}\sum_{i=1}^{N}\left\| z^i - \hat{z}^i \right\|_1
    + \frac{1}{N}\sum_{i=1}^{N}\left\| d^i - \hat{d}^i \right\|_1. \qquad (2)
$$
Finally, the positioning loss L_p is formulated as the sum of the previous two losses:
$$
L_p = L_h + L_s. \qquad (3)
$$
3.2.3 Representation Head. The representation head aims to extract object representations that can
differentiate the various tracked targets. Given a representation map E ∈ R^{W/S × H/S × C_e} from the
backbone, where S represents the scale factor and C_e indicates the number of channels, the representation
head adopts the heatmap approach to generate an object representation E_p ∈ R^C for each target centered at p.
To better learn discriminative features for different objects, our approach designs an identity classification
loss based on the identity representation E_p as follows:
$$
L_c = -\frac{1}{N \times J}\sum_{i=1}^{N}\sum_{j=1}^{J} u^i(j)\log\bigl(v(j)\bigr), \qquad (4)
$$
where j represents the identity index, i records the instance index within each identity, v(j) is the
classification probability for identity j, u^i(j) is the ground truth for identity j, J is the total number of
identities, and N is the number of instances per identity.
Finally, the total detection loss L_d is the sum of Equations (3) and (4):
$$
L_d = L_p + L_c. \qquad (5)
$$
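As one possible concrete form (not specified in the paper), the identity classification in Equation (4) can be realized as a linear classifier over the center representation E_p followed by cross-entropy; the sketch below uses illustrative dimensions.

```python
import torch.nn as nn

class IdentityHead(nn.Module):
    """Maps a center representation E_p to identity logits (cf. Eq. (4))."""
    def __init__(self, feat_dim=128, num_identities=1000):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_identities)

    def forward(self, center_feats, identity_labels):
        # center_feats: (num_objects, feat_dim), identity_labels: (num_objects,)
        logits = self.classifier(center_feats)
        # Cross-entropy is the negative log-likelihood form of Eq. (4)
        return nn.functional.cross_entropy(logits, identity_labels)
```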

3.3 Joint Submodule


The joint submodule connects detection and association submodules, and aims to generate a con-
fusion tensor for paired frames. Concretely, given two representations R_t ∈ R^{128×N_m} and
R_{t−n} ∈ R^{128×N_m} (placeholder positions are filled with zeros), where N_m is the maximum number of
targets in a frame, our approach first copies R_t N_m times along the vertical direction to construct a tensor
M_t ∈ R^{128×N_m×N_m}, and copies R_{t−n} N_m times along the horizontal direction to generate a tensor
M_{t−n} ∈ R^{128×N_m×N_m}. Then, M_t and M_{t−n} are merged along the channel direction to output a confusion
tensor M_{t−n,t} ∈ R^{(128+128)×N_m×N_m}, which contains the representation pairs of any two objects from the
two frames. Therefore, it offers all the object pairs to the subsequent association submodule for generating an
association matrix for MOT.
In the joint submodule, all the object representations R t , R t −n come from the detection sub-
module, and may have data inconsistency with the tracking ground truth on MOT benchmark
datasets. Therefore, it is difficult to implement an end-to-end training. To solve this problem, we
discard the tracking ground truth, and adopt a simple yet effective traditional association method
called the Kalman Filter [43] to predict the locations of tracklets, resulting in tracking pseudo la-
bels for the object representations R t , R t −n . According to the pseudo labels, we obtain a pseudo
association matrix B_{t−n,t} ∈ R^{(N_m+1)×(N_m+1)}, where each element b_{k,l} indicates the matching
relationship between objects k and l, and the extra column/row (the "+1" in B_{t−n,t}) represents object
disappearance/appearance in the current frame. We define three values for b_{k,l}: 1 indicates the same
identity between objects k and l (a pseudo positive pair), 0 indicates different identities (a pseudo negative
pair), and 0.5 means uncertain. In the implementation, we set a high threshold in the Kalman Filter to reduce
false matches among pseudo positive pairs, and a low threshold to ensure true mismatches among pseudo negative
ones. The remaining pairs are set to uncertain.
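The tiling-and-concatenation step described above can be written in a few lines of PyTorch; the sketch below assumes 128-dimensional representations stored as (128, N_m) matrices and is only meant to illustrate the tensor layout.

```python
import torch

def build_confusion_tensor(rep_t, rep_t_minus_n):
    """Pair every object in frame t with every object in frame t-n.

    rep_t, rep_t_minus_n: (128, Nm) representation matrices, zero-padded
    at placeholder positions. Returns a (256, Nm, Nm) confusion tensor.
    """
    c, num_max = rep_t.shape
    # Copy R_t Nm times along the vertical direction (targets of F_t vary per column).
    m_t = rep_t.unsqueeze(1).expand(c, num_max, num_max)              # (128, Nm, Nm)
    # Copy R_{t-n} Nm times along the horizontal direction (targets of F_{t-n} vary per row).
    m_t_minus_n = rep_t_minus_n.unsqueeze(2).expand(c, num_max, num_max)
    # Merge along the channel direction.
    return torch.cat([m_t_minus_n, m_t], dim=0)                       # (256, Nm, Nm)
```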


Table 1. A Specification of the Association Predictor

Index Input Output Kernel Stride Padding ReLU BatchNorm


1 256 128 1×1 1 0 Yes Yes
2 128 64 1×1 1 0 Yes Yes
3 64 32 1×1 1 0 Yes Yes
4 32 16 1×1 1 0 Yes No
5 16 1 1×1 1 0 Yes No

3.4 Association Submodule


Given a confusion tensor Mt −n,t , association submodule S A computes the association between two
target groups in the frame Ft −n and Ft . Table 1 demonstrates the structure of the association pre-
dictor, which transforms the confusion tensor Mt −n,t into an association matrix Ma ∈ RNm ×Nm ,
which records the association information between those inter-frame targets (Nm is the max num-
ber of targets). Concretely, the prediction implements the dimensionality compression from 256
to 1 step-by-step along the direction of object representation with a 1 × 1 kernel. Each row in Ma
represents a target in Ft −n , and each column indicates that in Ft . The element mk,l in Ma records
the matching probability between the kth target in Ft −n and the lth target in Ft .
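Read as a stack of 1×1 convolutions, Table 1 corresponds to a network along the following lines; this is a sketch under the listed kernel, stride, padding, ReLU, and BatchNorm settings, not the authors' exact implementation.

```python
import torch.nn as nn

def make_association_predictor():
    """1x1 convolution stack from Table 1: compresses the 256-channel
    confusion tensor to a single-channel (Nm x Nm) association map."""
    def block(in_ch, out_ch, batch_norm):
        layers = [nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=1, padding=0)]
        if batch_norm:
            layers.append(nn.BatchNorm2d(out_ch))
        layers.append(nn.ReLU(inplace=True))
        return layers

    return nn.Sequential(
        *block(256, 128, batch_norm=True),
        *block(128, 64, batch_norm=True),
        *block(64, 32, batch_norm=True),
        *block(32, 16, batch_norm=False),
        *block(16, 1, batch_norm=False),
    )
```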
Figure 3 illustrates the detailed process of dealing with object disappearance and appearance across frames.
Given an association matrix M_a, we append a column to build M_1 ∈ R^{N_m × (N_m+1)}, where the mth row
represents the mth target in F_{t−n} associated with the N_m + 1 targets in F_t, and the newly added column
accounts for targets of F_{t−n} that have no match (i.e., disappear) in F_t. The appended column is set to
u = λ1 ∈ R^{N_m}, where λ is a hyperparameter and 1 is an all-ones vector [39]. This design simulates target
disappearance with a certain probability. On this basis, we normalize each row by a softmax function [48],
resulting in the association probability matrix A_1 between all the targets in F_{t−n} and F_t. We delete the
last column of A_1 to obtain the matrix A_c of dimension N_m × N_m. Similarly, target appearance is simulated
by appending a row to build M_2 ∈ R^{(N_m+1) × N_m}, applying a column-wise softmax to obtain
A_2 ∈ R^{(N_m+1) × N_m}, and deleting the last row to obtain A_r ∈ R^{N_m × N_m}. Finally, we take the
element-wise maximum of A_c and A_r to obtain A_m ∈ R^{N_m × N_m}.
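A minimal PyTorch sketch of this padding, softmax, and element-wise maximum procedure is given below; the tensor shapes and the function name are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def association_probabilities(ma, lam=10.0):
    """Handle disappearance/appearance as in Figure 3.

    ma: (Nm, Nm) raw association matrix; returns Ac, Ar, Am of shape (Nm, Nm).
    """
    num_max = ma.size(0)
    pad = lam * torch.ones(num_max, 1, device=ma.device)

    # M1: extra column models targets of F_{t-n} with no match in F_t.
    m1 = torch.cat([ma, pad], dim=1)              # (Nm, Nm + 1)
    a1 = F.softmax(m1, dim=1)                     # row-wise (horizontal) softmax
    ac = a1[:, :num_max]                          # drop the appended column

    # M2: extra row models new targets appearing in F_t.
    m2 = torch.cat([ma, pad.t()], dim=0)          # (Nm + 1, Nm)
    a2 = F.softmax(m2, dim=0)                     # column-wise (vertical) softmax
    ar = a2[:num_max, :]                          # drop the appended row

    am = torch.max(ac, ar)                        # element-wise maximum
    return ac, ar, am
```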
We design three loss functions to train the association submodule. The first one is the direction loss L_t,
which suppresses false target associations in the disappearance and appearance directions:
$$
L_t = \frac{\sum_{k=1}^{N_m}\sum_{l=1}^{N_m+1}\left(-\log A_1^{kl}\right) \odot B_1^{kl}}
           {\sum_{k=1}^{N_m}\sum_{l=1}^{N_m+1} B_1^{kl}}
    + \frac{\sum_{k=1}^{N_m+1}\sum_{l=1}^{N_m}\left(-\log A_2^{kl}\right) \odot B_2^{kl}}
           {\sum_{k=1}^{N_m+1}\sum_{l=1}^{N_m} B_2^{kl}}, \qquad (6)
$$
where B_1 and B_2 are constructed by deleting the last row and the last column of B_{t−n,t}, respectively, and
the operator ⊙ denotes the Hadamard product [15].
The second one is the non-maximum loss L_m, which penalizes non-maximum associations in the disappearance and
appearance branches during association computation [39]:
$$
L_m = \frac{\sum_{k=1}^{N_m}\sum_{l=1}^{N_m}\left(-\log A_m^{kl}\right) \odot B_3^{kl}}
           {\sum_{k=1}^{N_m}\sum_{l=1}^{N_m} B_3^{kl}}, \qquad (7)
$$

where B_3 is defined by deleting both the last row and the last column of B_{t−n,t}.

Fig. 3. The technical route of the association submodule in processing target disappearance and appearance. For
the input historical and current frames, we predict the similarity association matrix M_a. In consideration of
multi-object disappearances and appearances between the historical and current video frames, M_1 and M_2 are
designed by adding an additional column and row to M_a, respectively. Then, horizontal and vertical softmax
operations are conducted on M_1 and M_2, and the cropped matrices A_c, A_r are utilized to compute the loss L_m
by using the pseudo association labels. Finally, the total association loss L_s is constructed as the sum of
L_t, L_m, and L_b.

The third one is the balance loss L_b, which penalizes any imbalance between the disappearance and appearance
branches, meaning that the numbers of targets that appear in and disappear from the field of view should be
equal:
$$
L_b = \sum_{k=1}^{N_m}\sum_{l=1}^{N_m} \left| A_c^{kl} - A_r^{kl} \right|. \qquad (8)
$$
Finally, the total association loss Ls is the sum of Equations (6)–(8):
Ls = Lt + Lm + Lb . (9)
In the implementation, we only calculate pseudo positive and pseudo negative pairs, and discard
uncertain pairs for the three losses.
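Assuming the pseudo-label convention above (1 for pseudo positive, 0 for pseudo negative, 0.5 for uncertain), the three association losses could be sketched as follows; normalizing only over pseudo positive pairs is one possible reading of Equations (6)–(8) combined with the rule of discarding uncertain pairs, and all names are illustrative.

```python
import torch

def masked_log_term(assoc, pseudo, eps=1e-8):
    """Sum of -log(assoc) over pseudo positive pairs, normalized by their count.
    Uncertain entries (0.5) contribute neither to numerator nor denominator."""
    positives = pseudo.eq(1).float()
    return (-(assoc + eps).log() * positives).sum() / positives.sum().clamp(min=1.0)

def association_losses(a1, a2, am, ac, ar, b):
    """Compute L_t, L_m, L_b from Eqs. (6)-(8).

    a1: (Nm, Nm+1), a2: (Nm+1, Nm), am/ac/ar: (Nm, Nm),
    b: (Nm+1, Nm+1) pseudo association matrix B_{t-n,t}.
    """
    b1 = b[:-1, :]          # drop last row
    b2 = b[:, :-1]          # drop last column
    b3 = b[:-1, :-1]        # drop last row and column

    l_t = masked_log_term(a1, b1) + masked_log_term(a2, b2)   # direction loss, Eq. (6)
    l_m = masked_log_term(am, b3)                             # non-maximum loss, Eq. (7)
    l_b = (ac - ar).abs().sum()                                # balance loss, Eq. (8)
    return l_t + l_m + l_b                                     # total loss, Eq. (9)
```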

3.5 Training and Testing


3.5.1 End-to-End Training. JDAN contains two trainable submodules: the detection submodule S_D and the
association submodule S_A. Our approach trains the model with an iterative process consisting of two steps:
First, we take an object detection model pretrained on several detection datasets [8, 9, 46, 51, 54] and
fine-tune its parameters according to the detection loss L_d on the basis of ground truth; Second, given the
pseudo labels generated on the detected targets by the Kalman Filter [43], we update both detection and
association submodules according to L_s in Equation (9). We repeat the above two steps iteratively until the
loss L_s converges. In contrast to previous approaches, errors can be back-propagated to update both detection
and association submodules, and thus our approach achieves end-to-end training that alleviates the error
propagation problem.
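At a high level, this alternating schedule can be summarized by the sketch below, where the step functions are placeholders for the detection update (Equation (5)), the Kalman-filter pseudo-label generation, and the joint update (Equation (9)); they are not part of a released implementation.

```python
def train_jdan(detect_step, associate_step, make_pseudo_labels,
               det_loader, mot_loader, num_rounds=10):
    """Alternating training schedule (illustrative sketch).

    detect_step(batch): fine-tunes the detection submodule with L_d (Eq. (5)).
    make_pseudo_labels(pair): returns the pseudo association matrix B_{t-n,t}.
    associate_step(pair, pseudo): updates both submodules with L_s (Eq. (9));
        gradients flow back into the detection submodule as well.
    """
    for _ in range(num_rounds):
        for batch in det_loader:        # Step 1: detection fine-tuning on ground truth
            detect_step(batch)
        for pair in mot_loader:         # Step 2: joint update with pseudo labels
            associate_step(pair, make_pseudo_labels(pair))
```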
3.5.2 Online Tracking. Figure 4 illustrates the online tracking process by JDAN. Given an in-
put frame Ft , the detection submodule predicts target representations R t for all the targets in Ft .
Then, the representation R_t and the historical representations R_{t−n:t−1} are passed into the joint &
association submodules to generate the corresponding association matrices M_{t−n:t−1,t}, which are passed into
the trajectory manager to update the accumulator using the Kuhn-Munkres method [30]. Concretely, the tracks in
the trajectory manager store tracking information for the historical frames F_{t−n:t−1}, and every element in
the accumulator is the sum of similarities of a track to the detected objects in the current frame. Here, a
track corresponds to a target identity and is a recorder of key-value pairs, where the key is the index of a
historical frame and the value is the corresponding target representation in that frame.

Fig. 4. Online MOT by JDAN. Given an input video frame F_t, the target boxes and representations are supplied
by the positioning and representation heads, respectively, with a single-stream detection submodule in JDAN.
The object representation R_t is paired with the latest n historical representations R_{t−n:t−1}, and every
pair of representations is passed through the joint & association submodules to estimate the corresponding
association matrix M_{t−n:t−1,t}. In addition, the representation R_t is reserved for estimating the
association matrices in the next n frames. Finally, we update the track recorder by linking the objects in the
present frame with those in the n historical frames by utilizing the inferred association matrices.

After receiving a new target representation R_t in F_t, the trajectory manager reserves R_t and feeds all the
representations R_{t−n:t−1}, R_t into the joint & association submodules to generate the association matrix
M_{t−n:t−1,t} between each historical frame F_{t−n} and the current frame F_t. After that, the accumulator in
the trajectory manager is updated, and the tracking results T_t in F_t are generated. Some special cases are
also handled by our method. For instance, a historical track may have no matching target in the current frame,
which indicates that the target disappears. In such a case, we add a column to the accumulator so that each
track corresponds to a sole column. The cost of online tracking is composed of two parts: the detection
submodule, which generates the representation R_t for the current frame, and the joint & association
submodules, which output the tracking results T_t. To ensure tracking speed, we adopt lightweight detection and
association networks, and all the historical information, including R_{t−n:t−1}, is stored in the trajectory
manager without recalculation. Therefore, our online tracking achieves a fast speed.
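As one concrete way to turn the accumulated similarities into track-to-detection assignments, the Kuhn-Munkres step could be implemented with SciPy's Hungarian solver as sketched below; the accumulator layout (a NumPy array) and the similarity threshold are assumptions.

```python
from scipy.optimize import linear_sum_assignment

def assign_detections_to_tracks(accumulator, min_similarity=0.3):
    """accumulator[i, j]: accumulated similarity between track i and detection j
    in the current frame. Returns a list of (track_idx, detection_idx) matches."""
    # The Kuhn-Munkres algorithm minimizes total cost, so negate the similarities.
    track_idx, det_idx = linear_sum_assignment(-accumulator)
    matches = []
    for t, d in zip(track_idx, det_idx):
        if accumulator[t, d] >= min_similarity:   # reject weak assignments
            matches.append((int(t), int(d)))
    return matches
```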

4 EXPERIMENTS
This section presents the experiments with JDAN. We first introduce the datasets and implementation details of
the proposed JDAN, and then report the experimental results together with an in-depth analysis.

4.1 Datasets and Metrics


Our datasets consist of two components: detection datasets and MOT datasets. We first adopt sev-
eral detection datasets to train our detection submodule. In particular, CityPersons [51] and ETH [9]
datasets merely supply bounding box information, and we use them to train the positioning head.
Moreover, CUHK [46], Caltech [8] and PRW [54] could supply both pedestrian bounding box and
identity information, and thus we can train the positioning and representation heads jointly. For
MOT task, we train and test our model on two benchmark datasets: MOT15 [19] and MOT17 [7].
MOT15 consists of 11 training and 11 testing sequences with 5,783 frames, 61,440 bounding boxes,
and 721 tracks, while MOT17 contains 7 training and 7 testing videos with a longer duration in-
cluding 17,757 frames, 564,228 bounding boxes, and 2,355 tracks.


For evaluation, we employ a variety of metrics, including MOT accuracy (MOTA), MOT preci-
sion (MOTP), the ratio of correctly identified detection results to the mean number of the ground
truth and predicted detection results (IDF1 [37]), mostly tracked (MT) trajectories, mostly lost
(ML) trajectories, identity switch (ID_Sw), the total number of false positives (FP), the to-
tal number of missed targets (FN), the total number of fragmented trajectories (Frag) [4],
and the number of frames processed per second (FPS). Different metrics focus on different aspects
to evaluate MOT performance.
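For reference, MOTA aggregates the FN, FP, and ID_Sw counts relative to the total number of ground-truth objects; under the standard CLEAR MOT definition [4] (not restated in this paper) it reads
$$
\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{ID\_Sw}}{\mathrm{GT}},
$$
where GT is the total number of ground-truth boxes over all frames.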

4.2 Implementation Details


We implement JDAN by using PyTorch 1.2.0 [33], and the training process takes 20 hours on a
Quadro RTX 6000 GPU. For each input frame, we first resize it to 1088 × 608, and then use a
modification of DLA [56] in the detection submodule as the backbone. As mentioned earlier, our
training procedure consists of two steps. First, we initialize the detection submodule with weights trained on
Microsoft COCO [24], and adjust its parameters with the SGD optimizer [52] for 50 epochs on the detection
datasets. The initial learning rate is 1e-4, multiplied by 0.1 at epochs 25 and 40. Second, we train the
association submodule with the following settings: the maximum number of objects N_m in each frame is set to
100, the minibatch size B is set to 4, the number of training epochs is set to 160, and the all-ones-vector
multiplier λ is set to 10. Moreover, we set the maximum gap N_a between the historical and current frames to
30. Then, we train the whole model with the SGD optimizer [52], where the momentum and weight decay are set to
0.9 and 5e-4, respectively. We launch the training with a learning rate of 0.01 that is multiplied by 0.1 at
epochs 60, 100, and 140.
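In PyTorch terms, the whole-model schedule above roughly corresponds to the following optimizer configuration; this is a sketch of the stated hyperparameters, not the released training script.

```python
import torch

def make_optimizer_and_scheduler(model):
    """SGD with momentum 0.9, weight decay 5e-4, lr 0.01 decayed by 0.1
    at epochs 60, 100, and 140, as described in Section 4.2."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60, 100, 140], gamma=0.1)
    return optimizer, scheduler
```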
An overlong trajectory may lead to large storage and computational costs. Therefore, we set a threshold μ_m to
limit the number of frames that are reviewed in an existing trajectory, and the oldest target information is
discarded once the number of frames in a track exceeds μ_m = 15. Moreover, if a tracked object disappears for
more than μ_r = 12 frames, it is removed from the tracking list [48]. These hyperparameters have clear physical
meanings. After the detection and association submodules are trained, we choose the optimal values of the two
hyperparameters by grid search for the best MOTA performance on the MOT training set, constructing the grid
from multiples of three in the range [3, 30].
In order to improve the accuracy, we exploit a series of data augmentation methods. First, we enlarge the frame
with a random sampling rate in the range of 1.0–1.25, and fill the pixels in the expanded image with the
average pixel value of the MOT training set. Second, we crop the video frame randomly in the range of 0.75–1.0,
and each pixel value in an image is multiplied by a value in the range of 0.75–1.25. The output frame is
transformed to the HSV space, the saturation component is multiplied by a value in the range of 0.75–1.25, and
the image is then transformed back to the RGB space and multiplied by a randomly sampled factor [25]. In
addition, note that the historical frame F_{t−n} and the current frame F_t are not necessarily consecutive in
the video sequence; we set them to be separated by n frames, where n ∈ [0, N_a − 1].

4.3 Ablation Studies


In order to prove the effectiveness of the proposed model, we conduct three ablation studies in-
cluding the training paradigm, the representation dimension, and the maximum object number.
4.3.1 Training Paradigm. We first evaluate the effect of several training paradigms built from our proposed
components. As depicted in Table 2, we compare five settings: Baseline, Detection Training, Detection
Fine-tuning, Association Fine-tuning, and JDAN. Baseline adopts a detection submodule pre-trained on Microsoft
COCO [24], and then uses an association submodule fine-tuned with the generated pseudo labels.

Table 2. Ablative Study with Different Training Paradigms on MOT17 Benchmark Data

Method MOTA↑ MOTP↑ IDF1↑ MT↑ ML↓ ID_Sw↓ Frag↓


Baseline 20.7 38.3 38.2 17.6 48.8 24,875 6,731
Detection Training 34.6 42.8 40.4 19.9 46.7 18,264 4,084
Detection Fine-tuning 42.9 52.8 48.3 21.5 45.8 13,236 3,387
Association Fine-tuning 53.0 65.2 49.7 23.4 44.3 11,875 2,845
JDAN 58.1 79.8 59.2 27.7 32.9 6,129 1,515

submodule fine-tuned with the generated pseudo labels. On this basis, Detection Training retrains
the detection submodule on the detection datasets, and then uses an association submodule trained
on the generated pseudo labels. Considering the visual gap between MOT and detection datasets,
the third training paradigm called Detection Fine-tuning employs a detection submodule fine-tuned
on MOT detection datasets to boost the detection performance, while Association Fine-tuning
does not use MOT datasets for detection submodule but fine-tunes association submodule on MOT
datasets. JDAN is the whole model fine-tuned on MOT datasets with an end-to-end training for
both detection and association submodules.
Table 2 reports the performance comparison results among the above training paradigms, and
our analysis is listed below:
(1) Compared to Baseline, Detection Training achieves a significant performance improvement,
with MOTA increasing from 20.7% to 34.6% and MOTP rising from 38.3% to 42.8%. This
improvement indicates the effectiveness of retraining detection submodule on the detection
datasets since the pretraining on Microsoft COCO could not offer accurate pedestrian results
for crowded scenes in MOT task.
(2) Detection Fine-tuning further improves performance compared with Detection Training, with a significant
improvement from 42.8% to 52.8% on MOTP and from 18,264 to 13,236 on ID_Sw. This improvement stems from
the use of MOT data on the detection submodule to bridge the visual gap between the detection and MOT
datasets. Meanwhile, Association Fine-tuning achieves a similar improvement over Detection Fine-tuning
owing to the use of MOT data to update the association submodule.
(3) JDAN achieves the best performance on all of the metrics. This enhancement is due not only to the
fine-tuned detection submodule but also to the fine-tuned association submodule. Specifically, JDAN
achieves 58.1% on MOTA, which reflects the overall tracking accuracy. As mentioned before, JDAN adopts
end-to-end training to alleviate the error propagation problem and thus obtains a powerful object
association capability.

4.3.2 Representation Dimension. Previous works in person re-identification normally exploit high-dimensional
object representations and attain excellent performance without exploring this design choice. Nevertheless,
FairMOT [53] finds that the representation dimension plays an important role, and that lower-dimensional object
representations are preferable in the MOT task: the training data are insufficient compared with the
re-identification task, so high-dimensional representations are prone to overfitting. Therefore, extracting
low-dimensional representations relieves the overfitting problem on small datasets and thus enhances MOT
performance. Previous two-stage methods are not affected by this data deficiency because they can utilize
abundant re-identification data. In comparison, one-stage approaches cannot utilize these re-identification
data because they require raw unclipped video frames. One way to mitigate this data dependence is to decrease
the dimension of the object representation.


Table 3. Evaluation of the Object Representation


Dimension on MOT17 Dataset

Feature dim MOTA↑ MOTP↑ IDF1↑ ID_Sw↓ FPS↑


1024 55.3 76.3 57.3 8,107 15.3
512 54.7 75.1 57.1 6,810 18.7
256 56.2 78.9 59.7 7,657 19.6
128 58.1 79.8 59.2 6,129 21.7
64 58.1 71.6 56.7 11,675 21.9

Table 4. The Effect of Maximum Object Number Nm


on the Object Association Matrix

Feature dim Det-1024 Det-512 Det-256 Det-128 Det-64


MOTAS 56.3 55.7 57.8 58.1 53.6
MOTAM 57.8 57.9 55.3 58.4 55.3
MOTAL 52.1 56.0 52.8 49.8 47.3
ID_SwS 8,519 7,236 7,083 6,129 8,913
ID_SwM 2,107 8,013 7,657 6,346 6,675
ID_SwL 10,397 9,281 9,597 7,475 7,286
Small (S): maximum object number is 100; Medium (M): maximum object
number is 150; Large (L): maximum object number is 200. Feature dim is the
object representation dimension.

For this experiment, we test various dimension configurations, and the results are listed in Table 3. MOTA
generally increases as the dimension is reduced from 1,024 to 128, which demonstrates the merit of
lower-dimensional representations for the MOT task. Furthermore, overall performance starts to degrade when the
dimensionality is decreased to 64 (MOTP and IDF1 drop and ID_Sw rises sharply), as the object representation is
impaired. Even though the improvements in the MOTA score are small, ID_Sw decreases substantially from 8,107 to
6,129, which plays an important role in enhancing the overall MOT performance. The model running speed also
increases marginally as the representation dimension decreases. However, the use of low-dimensional object
representations is effective only when the training data are insufficient. With more data, the problem arising
from the representation dimension can be alleviated.
4.3.3 Maximum Object Number. Considering various object densities for MOT task, we utilize
an appropriate maximum object number Nm to adapt to different MOT contexts. We try three Nm
values (100, 150, 200) under different representation dimensions, and the experimental results are
listed in Table 4. The best MOTA is achieved when N_m = 150, and the best ID_Sw when N_m = 100. For MOTA, we
believe that too small an N_m limits the fitting ability of the model, while too large an N_m causes
underfitting in the association submodule and degrades MOT performance. For ID_Sw, a smaller N_m reduces the
number of object candidates and thus helps avoid identity switches in the MOT task.

4.4 Compared with the State-of-the-arts Approaches


In this part, we compare and analyze the proposed method against existing MOT methods, including one-stage and
two-stage approaches. Concretely, to prove the validity of the proposed end-to-end detection-tracking method,
we compare against several two-stage approaches on MOT15 and MOT17, including MDP_SubCNN [45], CDA_DDAL [2],
EAMTT [38], AP_HWDPL [5], and


Table 5. Comparison with the online MOT Methods Using Private Detection Agreement

Dataset Method IDF1↑ MOTA↑ MT↑ ML↓ ID_Sw↓ FPS↑


MOT15 MDP_SubCNN [45] 55.7 47.5 30.0% 18.6% 628 2.1
CDA_DDAL [2] 54.1 51.3 36.3% 22.2% 544 1.3
EAMTT [38] 54.0 53.0 35.9% 19.6% 7,538 11.5
AP_HWDPL [5] 52.2 53.0 29.1% 20.2% 708 6.7
RAR15 [10] 61.3 56.5 45.1% 14.6% 428 5.1
JDE [42]* 56.9 62.1 34.4% 16.7% 1,608 22.5
JDAN (proposed)* 61.8 57.8 45.3% 12.9% 494 23.5
MOT17 DMAN [58] 55.7 48.2 19.3% 38.3% 2,193 0.3
MTDF [11] 45.2 49.6 18.9% 33.1% 5,567 1.3
FAMNet [6] 48.7 52.0 19.1% 33.4% 3,072 0.6
Tracktor++ [3] 52.3 53.5 19.5% 36.6% 2,072 2.0
SST [39] 49.5 52.4 21.4% 30.7% 8,431 6.3
JDE [42]* 55.8 64.4 32.8% 17.9% 1,544 18.5
JDAN (proposed)* 59.2 58.1 27.7% 32.9% 6,129 21.7
It is notable that the tracking speed (FPS) does not include the detection stage in the two-stage
MOT methods. However, in the one-stage approaches, it includes the detection and association.
Note that we label the one-stage MOT methods with “*”. The best results are shown in bold.

Table 6. Detailed Tracking Results of JDAN on MOT17 Dataset

Sequence MOTA↑ IDF1↑ MOTP↑ MT↑ ML↓ FP↓ FN↓


MOT17-01 55.57 53.88 80.49 33.33% 20.83% 249 2,507
MOT17-03 77.09 73.35 80.69 75.68% 7.43% 10,102 13,533
MOT17-06 26.68 18.91 64.54 2.25% 62.16% 6,017 8,464
MOT17-07 51.53 41.13 80.60 20.67% 18.33% 512 5,767
MOT17-08 34.49 29.95 81.18 19.74% 38.16% 287 12,706
MOT17-12 50.90 54.45 80.08 28.57% 28.57% 792 3,310
MOT17-14 41.83 41.56 75.13 23.17% 21.95% 739 7,696
Total 58.05 59.22 79.84 27.73% 32.99% 18,698 53,983

RAR15 [10]. Meanwhile, we compare JDAN with the one-stage approach JDE [42], which is a representative MOT
method. TrackRCNN [40] is not listed because it requires additional image segmentation ground truth, which is
not provided by the MOT datasets. Besides, FairMOT [53] adopts both visual features and IoU distance with
complex, hand-designed tricks in its association stage to improve tracking performance, and its engineering
implementation requires manually tuning many model parameters to adapt to different datasets. In contrast, our
JDAN is simple and avoids manual parameter settings in the association stage by using an association matrix.
Table 5 lists the comparison results. Compared with the two-stage approaches, JDAN improves accuracy on most
metrics on both the MOT15 and MOT17 datasets. What's more, JDAN achieves a near real-time tracking speed, at
least 10 FPS faster than the two-stage methods. This high efficiency stems from the shared representation
features and the end-to-end tracking paradigm. Compared to the one-stage JDE, JDAN achieves a higher IDF1,
increasing from 56.9 to 61.8 on MOT15 and from 55.8 to 59.2 on MOT17. This improvement suggests that the
end-to-end training in JDAN can well alleviate the error propagation problem for online tracking. Moreover,
JDAN runs faster for online tracking, because JDAN directly adopts an association matrix to generate tracking
results, while JDE uses a larger backbone and many complex association tricks.

Fig. 5. Tracking results on the MOT17 dataset. "MOT17-03" in the first row (a crowded street at night, elevated
camera view), "MOT17-05" in the second row (a street scene from a moving platform), and "MOT17-09" in the last
row (a crowded pedestrian environment filmed from a lower viewpoint) show that the proposed algorithm can
re-associate objects when the tracking process starts to drift (penultimate column).
Table 6 reports the detailed performance on the individual videos of the MOT17 test set. For instance, JDAN
performs well on MOT17-03, which is a crowded scene for the MOT task. Finally, Figure 5 illustrates three
representative examples from the MOT17 dataset tracked by JDAN. The first example is a crowded street at night
viewed from an elevated position, and the second and third examples show crowded pedestrian environments viewed
from a lower position with a moving and a stationary platform, respectively. It is demonstrated that JDAN can
successfully re-associate the targets when the tracking process starts to drift.

5 CONCLUSION
In this article, we designed an end-to-end framework to alleviate the error propagation problem in the MOT
task. Going beyond existing methods, JDAN jointly trains the detection and association tasks so that both
target detection and tracking can be optimized together. Technically, we employed pseudo labels to solve the
object inconsistency problem, and designed a joint submodule as well as an association predictor to produce
precise tracking results. The end-to-end architecture is simple yet effective, and avoids the tedious manual
settings in association required by existing MOT approaches. Extensive experiments on widely used MOT
benchmarks demonstrate the superiority of the proposed approach in terms of effectiveness and efficiency. We
believe our work can encourage and motivate further studies on the MOT task.

REFERENCES
[1] Na An and Wei Qi Yan. 2021. Multitarget tracking using Siamese neural networks. ACM Transactions on Multimidia
Computing Communications and Applications 17, 2s (2021), 1–16.
[2] Seung-Hwan Bae and Kuk-Jin Yoon. 2018. Confidence-based data association and discriminative deep appearance
learning for robust online multi-object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018),
1–1.


[3] Philipp Bergmann, Tim Meinhardt, and Laura Leal-Taixe. 2019. Tracking without bells and whistles. In Proceedings of
the IEEE/CVF International Conference on Computer Vision. 941–951.
[4] Keni Bernardin and Rainer Stiefelhagen. 2008. Evaluating multiple object tracking performance: The CLEAR MOT
metrics. EURASIP J. Image and Video Processing (2008), Article No. 1.
[5] Long Chen, Haizhou Ai, Chong Shang, Zijie Zhuang, and Bo Bai. 2017. Online multi-object tracking with convolutional
neural networks. In Propceedings of the 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 645–649.
[6] Peng Chu and Haibin Ling. 2019. Famnet: Joint learning of feature, affinity and multi-dimensional assignment for
online multiple object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6172–
6181.
[7] Patrick Dendorfer, Hamid Seyed Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, D. Ian Reid, Stefan Roth, Konrad
Schindler, and Laura Leal-Taixeacute. 2019. CVPR19 tracking and detection challenge: How crowded can it get? In
Proceedings of CoRR (2019).
[8] Piotr Dollár, Christian Wojek, Bernt Schiele, and Pietro Perona. 2009. Pedestrian detection: A benchmark. In Proceed-
ings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 304–311.
[9] Andreas Ess, Bastian Leibe, Konrad Schindler, and Luc Van Gool. 2008. A mobile vision system for robust multi-person
tracking. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
[10] Kuan Fang, Yu Xiang, Xiaocheng Li, and Silvio Savarese. 2018. Recurrent autoregressive networks for online multi-
object tracking. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE,
466–475.
[11] Zeyu Fu, Federico Angelini, Jonathon Chambers, and Mohsen Syed Naqvi. 2019. Multi-level cooperative fusion of
GM-PHD filters for online multiple human tracking. IEEE Transactions on Multimedia (2019), 1–1.
[12] Andreas Geiger, Philip Lenz, and Raquel Urtasun. 2012. Are we ready for autonomous driving? The KITTI vision
benchmark suite. Computer Vision and Pattern Recognition (2012), 3354–3361.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. 2017. Mask R-CNN. IEEE Transactions on Pattern
Analysis and Machine Intelligence (2017), 1–1.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2016). 770–778.
[15] Jin-Hwa Kim, Woon Kyoung On, Jeonghee Kim, JungWoo Ha, and Byoung-Tak Zhang. 2017. Hadamard product for
low-rank bilinear pooling. In Proceedings of ICLR (2017).
[16] Iasonas Kokkinos. 2017. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level
vision using diverse datasets and limited memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 6129–6138.
[17] Harold W. Kuhn. 1955. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 1–2
(1955), 83–97.
[18] Hei Law and Jia Deng. 2018. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Confer-
ence on Computer Vision (ECCV). 734–750.
[19] Laura Leal-Taixé, Anton Milan, D. Ian Reid, Stefan Roth, and Konrad Schindler. 2015. MOTChallenge 2015: Towards
a benchmark for multi-target tracking. CoRR (2015).
[20] Guiji Li, Manman Peng, Ke Nai, Zhiyong Li, and Keqin Li. 2019. Multi-view correlation tracking with adaptive memory-
improved update model. Neural Computing and Applications (2019), 9047–9063.
[21] Zhiyong Li, Ke Nai, Guiji Li, and Shilong Jiang. 2020. Learning a dynamic feature fusion tracker for object tracking.
IEEE Transactions on Intelligent Transportation Systems (2020).
[22] Zhiyong Li, Dongming Wang, Ke Nai, Tong Shen, and Ying Zeng. 2016. Robust object tracking via weight-based local
sparse appearance model. ICNC-FSKD (2016), 560–565.
[23] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In
Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence
Zitnick. 2014. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer
Vision. Springer, 740–755.
[25] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg.
2016. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision. Springer,
21–37.
[26] Zhipeng Luo, Zhiguang Zhang, and Yuehan Yao. 2020. A strong baseline for multiple object tracking on vidor dataset.
In Proceedings of the 28th ACM International Conference on Multimedia. 4595–4599.
[27] Jingyi Lv, Zhiyong Li, Ke Nai, Ying Chen, and Jin Yuan. 2020. Person re-identification with expanded neighborhoods
distance re-ranking. Image and Vision Computing (2020).


[28] Nima Mahmoudi, Seyed Mohammad Ahadi, and Mohammad Rahmati. 2019. Multi-target tracking using CNN-based
features: CNNMTT. Multimedia Tools and Applications 78, 6 (2019), 7077–7096.
[29] Anton Milan, Laura Leal-Taixe, Ian Reid, Stefan Roth, and Konrad Schindler. 2016. MOT16: A benchmark for multi-
object tracking. (2016).
[30] James Munkres. 1957. Algorithms for the assignment and transportation problems. Journal of the Society for
Industrial and Applied Mathematics 5, 1 (1957), 32–38.
[31] Ke Nai, Zhiyong Li, Guiji Li, and Shanquan Wang. 2018. Robust object tracking via local sparse appearance model.
IEEE Transactions on Image Processing (2018), 4958–4970.
[32] Ke Nai, Degui Xiao, Zhiyong Li, Shilong Jiang, and Yu Gu. 2019. Multi-pattern correlation tracking. Knowledge-Based
Systems (2019).
[33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming
Lin, Natalia Gimelshein, and Luca Antiga. 2019. PyTorch: An imperative style, high-performance deep learning library.
In Advances in Neural Information Processing Systems (NIPS’19). 8024–8035.
[34] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. 2017. Hyperface: A deep multi-task learning framework for face
detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence 41, 1 (2017), 121–135.
[35] Joseph Redmon and Ali Farhadi. 2018. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with
region proposal networks. In Proceedings of the International Conference on Neural Information Processing Systems
(NIPS 2015). 91–99.
[37] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. 2016. Performance measures and a data
set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision (ECCV’16).
Springer, 17–35.
[38] Ricardo Sanchez-Matilla, Fabio Poiesi, and Andrea Cavallaro. 2016. Online multi-target tracking with strong and weak
detections. In Proceedings of the European Conference on Computer Vision. Springer, 84–99.
[39] ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal S. Mian, and Mubarak Shah. 2019. Deep affinity network for
multiple object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[40] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Balachandar Gnana Berin Sekar, Andreas Geiger,
and Bastian Leibe. 2019. MOTS: Multi-object tracking and segmentation. In Proceedings of (CVPR’2019), 7942–7951.
[41] Haidong Wang, Saizhou Wang, Jingyi Lv, Chenming Hu, and Zhiyong Li. 2020. Non-local attention association scheme
for online multi-object tracking. Image and Vision Computing (2020), 103983.
[42] Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, and Shengjin Wang. 2020. Towards real-time multi-object tracking.
In Proceedings of the European Conference on Computer Vision. Springer, 107–122.
[43] Greg Welch and Gary Bishop. 1995. An introduction to the Kalman filter. (1995).
[44] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. 2017. Simple online and realtime tracking with a deep association
metric. In Proceedings of the (ICIP’17).
[45] Yu Xiang, Alexandre Alahi, and Silvio Savarese. 2015. Learning to track: Online multi-object tracking by decision
making. In Proceedings of the IEEE International Conference on Computer Vision. 4705–4713.
[46] Tong Xiao, Shuang Li, Bochao Wang, Liang Lin, and Xiaogang Wang. 2017. Joint detection and identification feature
learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3415–
3424.
[47] Xin Xu, Shiqin Wang, Zheng Wang, Xiaolong Zhang, and Ruimin Hu. 2021. Exploring image enhancement for salient
object detection in low light images. ACM Transactions on Multimedia Computing, Communications, and Applications
(TOMM) 17, 1s (2021), 1–19.
[48] Yihong Xu, Aljosa Osep, Yutong Ban, Radu Horaud, Laura Leal-Taixé, and Xavier Alameda-Pineda. 2020. How to train
your deep multi-object tracker. In Proceedings of the (CVPR’20), 6786–6795.
[49] Fengwei Yu, Wenbo Li, Quanquan Li, Yu Liu, Xiaohua Shi, and Junjie Yan. 2016. POI: Multiple object tracking with
high performance detection and appearance feature. In Proceedings of the European Conference on Computer Vision.
Springer, 36–42.
[50] Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. 2018. Deep layer aggregation. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition. 2403–2412.
[51] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. 2017. Citypersons: A diverse dataset for pedestrian detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3213–3221.
[52] Sixin Zhang, Anna E. Choromanska, and Yann LeCun. 2015. Deep learning with elastic averaging SGD. In Advances
in Neural Information Processing Systems. 685–693.
[53] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. 2021. Fairmot: On the fairness of detection
and re-identification in multiple object tracking. International Journal of Computer Vision (2021), 1–19.


[54] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, and Qi Tian. 2017. Person re-
identification in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1367–1376.
[55] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. 2020. Tracking objects as points. In Proceedings of the European
Conference on Computer Vision. Springer, 474–490.
[56] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. 2019. Objects as points. arXiv preprint arXiv:1904.07850 (2019).
[57] Zongwei Zhou, Junliang Xing, Mengdan Zhang, and Weiming Hu. 2018. Online multi-target tracking with tensor-
based high-order graph matching. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR).
IEEE, 1809–1814.
[58] Ji Zhu, Hua Yang, Nian Liu, Minyoung Kim, Wenjun Zhang, and Ming-Hsuan Yang. 2018. Online multi-object tracking
with dual matching attention networks. In Proceedings of the ECCV (2018), 379–396.

Received 14 August 2021; revised 5 February 2022; accepted 22 April 2022
