is set as 12 during training.

The optical flow V0 is obtained by regression of the orthogonal correlation through the CRI module; the details of the CRI module are described below. Each iteration produces an update field ∆V, so the updated optical flow is Vi+1 = Vi + ∆V. After each update, Vi is upsampled to the original input scale: we upsample the optical flow to full resolution by taking a convex combination over the low-resolution 3 × 3 grid. We then compute the L1 norm between the predicted Vi and the ground truth Vt as the supervision loss; the initial estimate V0 given by CRI is also supervised. For the estimated flow sequence {V0, V1, V2, ..., Vm}, we calculate the weighted sequence loss:

    L = \sum_{i=0}^{m} \Upsilon^{N-i} \| V_t - V_i \|_1 .    (2)

In the pre-training stage, Υ is set to 0.8; in the fine-tuning stage, Υ is adjusted to 0.85.

B. Cross Strip Correlation Module

To capture global context and reduce the computational complexity of non-local information encoding, we introduce a Cross Strip Correlation (CSC) module, whose structure is shown in Fig. 2. CSC extracts the orthogonal query matrices Qv and Qh of the target image and the orthogonal key matrices K̂v and K̂h of the involved images through a strip operation, while keeping the directional consistency of the global information. We then compute the correlation between the orthogonal query and key matrices to obtain the global visual similarity of I1 and I2 in the vertical and horizontal directions, which guides the iterative updates of the flow field.

Specifically, given I1, I2, the feature maps F1, F2 ∈ R^{C×H×W} are obtained by the feature encoder e(·). We first apply two 1 × 1 convolutional layers to F1 to obtain the vertical and horizontal query matrices Qv, Qh ∈ C′ × H × W. Simultaneously, two additional 1 × 1 convolutional layers are applied to F2 to obtain Kv, Kh ∈ C′ × H × W. The orthogonal global key matrices K̂v ∈ C′ × W and K̂h ∈ C′ × H are then obtained through vertical and horizontal striping operations:

    \hat{K}_v(i, j) = \frac{1}{H} \sum_{k=1}^{H} K_v(i, k, j),
    \hat{K}_h(i, j) = \frac{1}{W} \sum_{k=1}^{W} K_h(i, j, k),    (3)

where the average pooling windows are H × 1 and 1 × W, respectively. The intuition behind this operation is that it emphasizes the vertical and horizontal features, which corresponds to the definition of optical flow. We then transpose the orthogonal query matrices from C′ × H × W to H × W × C′ and take the dot product with the orthogonal keys to obtain the vertical and horizontal correlation volumes Cv ∈ H × W × W, Ch ∈ H × W × H, which encode the non-local visual similarity:

    C_v(x) = Q_v(x) \cdot \hat{K}_v,
    C_h(x) = Q_h(x) \cdot \hat{K}_h .    (4)

Cv and Ch are then concatenated with the all-pair correlation volume C ∈ H × W × H × W to obtain an aggregated correlation Ĉ ∈ H × W × 2 × H × W. Ĉ encodes both fine-grained matching costs and global visual similarity. Ĉ is then sent to a 4-layer average pooling pyramid with kernel sizes 1, 2, 4, 8, which pools the last two dimensions to establish a 4-layer correlation pyramid Ĉ1, Ĉ2, Ĉ3, Ĉ4. Since the high-resolution information of the first two dimensions is preserved, the correlation pyramid can provide the displacement information of fast-moving small objects. Additionally, the correlation pyramid further aggregates non-local scene cues, helping the update block distinguish the optical flow of extreme displacements on similar textures.

C. Correlation Regression Initialization Module

As mentioned above, we introduce the Correlation Regression Initialization (CRI) module to take full advantage of the orthogonal correlations Cv, Ch given by CSC. We regress the orthogonal correlation volumes instead of the all-pair correlation C, because C contains redundant spatial details that introduce useless information during regression; we verify this in the ablation experiments (Sec. IV). The rich high-level context information in Cv and Ch is more helpful for obtaining the initialized optical flow field. In consideration of the speed requirements of autonomous driving, the CRI module regresses Cv and Ch without any learnable parameters.

The details of the CRI module are shown in Fig. 3. Given the orthogonal correlation volumes Cv ∈ H × W × W, Ch ∈ H × W × H, we apply a softmax layer to Cv and Ch respectively, and multiply the result element-wise with the original Cv and Ch to obtain the orthogonal energy maps. Finally, we obtain the vertical and horizontal initializations of the optical flow, v0, h0 ∈ H × W:

    v_0(x, y) = \sum_{w=1}^{W} \sigma(C_v(x, y, w)) \, C_v(x, y, w),
    h_0(x, y) = \sum_{h=1}^{H} \sigma(C_h(x, y, h)) \, C_h(x, y, h).    (5)

We then stack the orthogonal optical flows v0 and h0 to obtain the initialized flow field V0 ∈ H × W × 2, which is sent to the update block as the initial distribution for optical flow refinement. The module exploits global visual similarity to initialize the optical flow, lowering the workload of the subsequent iterative updates so that the update block can focus on distinguishing foreground from background and estimating the motion of regions with similar textures.