Keyfilter-Aware Real-Time UAV Object Tracking
Yiming Li^1, Changhong Fu^{1,*}, Ziyuan Huang^2, Yinqiang Zhang^3, and Jia Pan^4
[Fig. 1: overview diagram (training and response map), with model update weights ρ and 1 − ρ.]

Abstract— Correlation filter-based tracking has been widely …
trackers [16]–[18] have limited performance due to the lack of negative samples, i.e., the circulant artificial samples created to train the CF hugely reduce its discriminative power. One solution to this problem is spatial penalization to punish the filter values at the boundary [6]–[10]. Another solution is cropping both the background and target so that real-world negative samples are used instead of synthetic ones [11]–[13]. However, the aforementioned approaches are prone to introduce context distraction because of the enlarged search area, especially in scenarios with similar objects around.

B. Prior work on context noise and filter corruption

In the literature, M. Mueller et al. [14] proposed to repress the response of context patches, i.e., the features extracted from the surrounding context are directly fed into the classic DCF framework and their desired responses are suppressed to zero. The context distraction is thus effectively repressed, and consequently the discriminative ability of the filter is enhanced. Nevertheless, the frame-by-frame context learning is effective but not efficient, and its redundancy can be significantly reduced.

Another problem of classic DCF trackers is that the learned single filter is commonly subject to corruption because of frequent appearance variation. Online passive-aggressive learning has been incorporated into the DCF framework [19] to mitigate the corruption. Compared to [19], the presented keyfilter performs better in both precision and speed.

C. Tracking by deep neural network

Recently, deep neural networks have contributed a lot to the development of computer vision. For visual tracking, some deep trackers [20]–[22] fine-tune the deep network online for high precision yet run too slowly (around 1 fps on a high-end GPU) to be used in practice. Other methods like deep reinforcement learning [23], unsupervised learning [24], the continuous operator [8], end-to-end learning [25], and deep feature representation [26] have also increased tracking accuracy. Among them, incorporating lightweight deep features into the online-learned DCF framework has exhibited competitive performance in both precision and efficiency.

The BACF objective is formulated as

E(\mathbf{w}) = \frac{1}{2}\Big\|\mathbf{y} - \sum_{d=1}^{D}\mathbf{B}\mathbf{x}_0^d \star \mathbf{w}^d\Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\|\mathbf{w}^d\|_2^2 ,   (1)

where y ∈ R^M, x^d ∈ R^N, and w^d ∈ R^M denote the desired response, the d-th of the D feature channels, and the correlation filter, respectively. λ is a regularization parameter, and E(w) refers to the error between the desired response y and the actual one. ⋆ is the spatial correlation operator. The main idea of BACF is to utilize a cropping matrix B ∈ R^{M×N} to extract real negative samples. However, more background distraction is introduced because of the enlargement.

IV. KEYFILTER-AWARE OBJECT TRACKER

Inspired by the keyframe technique used in SLAM, the keyfilter is, for the first time, introduced to visual tracking to boost accuracy and efficiency, as illustrated in Fig. 2. The objective function of the KAOT tracker is written as follows:

E(\mathbf{w}) = \frac{1}{2}\Big\|\mathbf{y} - \sum_{d=1}^{D}\mathbf{B}\mathbf{x}_0^d \star \mathbf{w}^d\Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\|\mathbf{w}^d\|_2^2 + \sum_{p=1}^{P}\frac{S_p}{2}\Big\|\sum_{d=1}^{D}\mathbf{B}\mathbf{x}_p^d \star \mathbf{w}^d\Big\|_2^2 + \frac{\gamma}{2}\sum_{d=1}^{D}\|\mathbf{w}^d - \tilde{\mathbf{w}}^d\|_2^2 ,   (2)

where the third term is the response repression of the context patches (their desired responses are zero), and S_p is the score of the p-th patch measuring the necessity of penalization (introduced in IV-B). w^d ∈ R^M and w̃^d ∈ R^M are the current filter and the keyfilter, respectively. γ is the penalty parameter on the gap between w^d and w̃^d. To improve the calculation speed, Eq. (2) is evaluated in the frequency domain:

E(\mathbf{w}, \hat{\mathbf{g}}) = \frac{1}{2}\|\hat{\mathbf{X}}\hat{\mathbf{g}} - \hat{\mathbf{Y}}\|_2^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \frac{\gamma}{2}\|\mathbf{w} - \tilde{\mathbf{w}}\|_2^2 , \quad \mathrm{s.t.}\ \hat{\mathbf{g}} = \sqrt{N}\,(\mathbf{I}_D \otimes \mathbf{F}\mathbf{B}^\top)\,\mathbf{w} ,   (3)

where ⊗ is the Kronecker product and I_D ∈ R^{D×D} is an identity matrix. The hat ˆ denotes the discrete Fourier transform with the orthogonal matrix F. Here \hat{X}^\top = [\hat{X}_0^\top, \sqrt{S_1}\hat{X}_1^\top, \cdots, \sqrt{S_P}\hat{X}_P^\top] and \hat{Y} = [\hat{y}^\top, \mathbf{0}, \cdots, \mathbf{0}]^\top; \hat{X}_p ∈ C^{N×DN} (p = 0, 1, ..., P), \hat{g} ∈ C^{DN×1}, and w ∈ R^{DM×1} are respectively defined as \hat{X}_p = [\mathrm{diag}(\hat{x}_p^1)^\top, \cdots, \mathrm{diag}(\hat{x}_p^D)^\top], \hat{g} = [\hat{g}^{1\top}, \cdots, \hat{g}^{D\top}]^\top, \tilde{w} = [\tilde{w}^{1\top}, \cdots, \tilde{w}^{D\top}]^\top, and w = [w^{1\top}, \cdots, w^{D\top}]^\top.
Fig. 2. Illustration of the advantages of KAOT. With the keyfilter restriction, the filter corruption is mitigated, as shown on the top right. With the context learning, the distraction is reduced, as shown in the response maps from frame 281 to 286. Setting the keyfilter update period T to 1-8 frames (learning the context every 2-16 frames), the object is tracked successfully by all eight trackers, while the FPS (frames per second) is raised from 9.8 to 15.2, significantly lowering the redundancy of context learning. In addition, trackers lacking either the keyfilter restriction or the context learning both lose the target.
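The intermittent scheme the caption describes — keyfilter refreshed every T frames, context patches learned every 2T frames — can be sketched as a simple frame scheduler. This is a hypothetical illustration; the function and event names are my own, not the authors':

```python
def kaot_schedule(num_frames, T):
    # Sketch of KAOT's intermittent updates: refresh the keyfilter every T frames
    # and learn context patches every 2*T frames (per the Fig. 2 caption).
    # Returns {frame: [events fired]} for frames where anything happens.
    events = {}
    for f in range(1, num_frames + 1):
        fired = []
        if f % T == 0:
            fired.append("refresh_keyfilter")
        if f % (2 * T) == 0:
            fired.append("learn_context")
        if fired:
            events[f] = fired
    return events

# With T = 4 over 16 frames: keyfilter refreshes at frames 4, 8, 12, 16;
# context is learned only at frames 8 and 16.
print(kaot_schedule(16, 4))
```

Raising T trades a small accuracy margin for speed: context learning, the expensive step, fires half as often as the keyfilter refresh, which is why FPS improves while all eight settings still track the target.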
Fig. 3. Precision and success plots based on one-pass evaluation [31] of KAOT and other real-time trackers on DTB70 [32] and UAV123@10fps [33].
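The success plots in Fig. 3 score each frame by the intersection-over-union (IoU) between the predicted and ground-truth bounding boxes; the success rate at a threshold is the fraction of frames whose IoU exceeds it. A minimal sketch of the overlap computation (my own helper, assuming boxes in (x, y, w, h) form):

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes given as (x, y, w, h).
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))   # width of the intersection
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))   # height of the intersection
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # half-overlapping equal boxes: IoU = 1/3
```

The bracketed numbers in the plots' legends are the area under this success-rate curve swept over all thresholds in [0, 1].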
Fig. 4. Attribute-based evaluation on precision. KAOT ranks first on five out of six challenging attributes.
TABLE I
Average precision (threshold at 20 pixels) and speed (FPS; * means GPU speed, otherwise CPU speed) of the top ten real-time trackers. Red, green, and blue fonts respectively indicate the best, second, and third place among the ten trackers.

Tracker        Avg. precision   Speed (FPS)
KAOT           72.2             14.7*
ECO [8]        71.7             11.6*
ARCF [13]      68.0             15.3
UDT+ [24]      66.5             43.4*
STRCF [19]     63.8             26.3
CSRDCF [10]    63.5             11.8
ECO_HC [8]     64.3             62.19
CF2 [26]       62.5             14.4*
MCCT_H [34]    60.3             59.0
UDT [24]       58.9             57.5*
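The precision metric used in Table I (and in the precision plots) is, at a 20-pixel threshold, the fraction of frames whose center location error stays within that many pixels. A minimal sketch with toy numbers, not the paper's data:

```python
import numpy as np

def precision_at(cle_per_frame, threshold=20.0):
    # Fraction of frames whose center location error (CLE) is within `threshold` pixels.
    cle = np.asarray(cle_per_frame, dtype=float)
    return float(np.mean(cle <= threshold))

# Toy CLE trace (pixels), for illustration only.
print(precision_at([3.1, 7.4, 18.9, 25.0, 40.2]))  # 3 of 5 frames within 20 px -> 0.6
```

Sweeping the threshold from 0 to 50 pixels produces the precision curves plotted in Figs. 3 and 4; Table I reports the single value at 20 pixels.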
[Fig. 6: center location error (CLE) curves of ARCF, UDT+, and KAOT over the frames of the five sequences.]

…attributed to the keyfilter restriction, which can prevent the filter from aberrant variation. In addition, KAOT exhibits excellent performance in the scenario of fast camera motion … five difficult UAV image sequences are shown in Figure 5. Besides, the respective center location error (CLE) variations of the five sequences are visualized in Figure 6. Specifically, in … distraction of the context, so it can perform well despite the large movement in a certain complex context. Only the pre-trained UDT+ tracks successfully in addition to KAOT. … KAOT has kept tracking owing to the mitigated filter corruption. As for the last two sequences, the keyfilter restriction and intermittent context learning have collaboratively contributed … using deep neural networks, as shown in Table II. To sum up, …
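The CLE reported throughout is simply the Euclidean pixel distance between the predicted and ground-truth target centers; a two-line helper (illustrative only):

```python
import math

def cle(pred_center, gt_center):
    # Center location error: Euclidean distance in pixels between the
    # predicted and ground-truth target centers, given as (x, y) tuples.
    return math.hypot(pred_center[0] - gt_center[0],
                      pred_center[1] - gt_center[1])

print(cle((105.0, 60.0), (102.0, 56.0)))  # 3-4-5 triangle -> 5.0
```

A flat, low CLE curve in Figure 6 therefore means stable tracking, while a sudden jump marks the frame at which a tracker loses the target.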