The document describes an optical flow estimation method with three key components:
1. A Cross Strip Correlation module extracts vertical and horizontal feature maps from two input images to calculate orthogonal correlation volumes, encoding global visual similarity.
2. An update block takes the initial optical flow estimate and correlation volumes to iteratively refine the optical flow field.
3. A Correlation Regression Initialization module regresses the orthogonal correlation volumes, without learnable parameters, to obtain the initial optical flow estimate. This captures global context to guide subsequent iterative refinement.
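To make the data flow between the three components concrete, the following is a minimal shape-level numpy sketch (not the authors' implementation): the learned feature encoder, 1 × 1 convolutions, and update block are replaced by random features, identity maps, and a zero update respectively, so only the tensor shapes and the V_{i+1} = V_i + ΔV / weighted-loss structure are illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 6, 5, 8  # toy spatial size and channel count

def csc(F1, F2):
    """Cross Strip Correlation (sketch): orthogonal correlation volumes.
    Strip-pools F2 over one spatial axis, then correlates with F1."""
    Q = F1.transpose(1, 2, 0)          # (H, W, C)
    Cv = Q @ F2.mean(axis=1)           # (H, W, W) vertical similarity
    Ch = Q @ F2.mean(axis=2)           # (H, W, H) horizontal similarity
    return Cv, Ch

def cri(Cv, Ch):
    """Correlation Regression Initialization: parameter-free initial flow."""
    def sm(x):
        e = np.exp(x - x.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)
    v0 = (sm(Cv) * Cv).sum(-1)         # vertical component, (H, W)
    u0 = (sm(Ch) * Ch).sum(-1)         # horizontal component, (H, W)
    return np.stack([v0, u0], axis=-1) # (H, W, 2)

F1, F2 = rng.standard_normal((2, C, H, W))  # stand-in encoder features
Cv, Ch = csc(F1, F2)
V = cri(Cv, Ch)                        # initial estimate V0
preds = [V]
for _ in range(3):                     # update block: V_{i+1} = V_i + dV
    V = V + np.zeros_like(V)           # learned update stands in as zero
    preds.append(V)

# Weighted sequence loss over {V0, ..., Vm}, decay gamma = 0.8 (pre-training).
Vt = rng.standard_normal((H, W, 2))    # ground-truth flow (random here)
m = len(preds) - 1
loss = sum(0.8 ** (m - i) * np.abs(Vt - Vi).sum() for i, Vi in enumerate(preds))
```

The zero update keeps every V_i equal to V_0; in the actual method the update block predicts ΔV from the correlation volumes at each iteration.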
The optical flow V_0 is obtained by regression of the orthogonal correlation volumes through the CRI module; the details of the CRI module are described below. Each iteration produces an update field ΔV, so the updated optical flow is V_{i+1} = V_i + ΔV. After each update, V_i is upsampled to the original input scale: we upsample the optical flow to full resolution by taking the convex combination of a 3 × 3 grid of its low-resolution neighbors. After that, we compute the L1 norm between the predicted V_i and the ground truth V_t as the supervision loss, and the initial estimate V_0 given by CRI is also supervised. For the estimated flow sequence {V_0, V_1, V_2, ..., V_m}, we calculate the weighted sequence loss:

    L = \sum_{i=0}^{m} \gamma^{m-i} \lVert V_t - V_i \rVert_1 .    (2)

In the pre-training stage, γ is set to 0.8; in the fine-tuning stage, it is adjusted to 0.85.

B. Cross Strip Correlation Module

To capture the global context and reduce the computational complexity of non-local information encoding, we introduce a Cross Strip Correlation (CSC) module. The structure of the module is shown in Fig. 2. CSC extracts the orthogonal query matrices Q_v and Q_h of the target image and the orthogonal key matrices K̂_v and K̂_h of the involved image through the strip operation, while keeping the directional consistency of the global information. We then calculate the correlation between the orthogonal query matrices and the orthogonal key matrices to obtain the global visual similarity of I_1 and I_2 in the vertical and horizontal directions, which guides the iterative updates of the flow field.

Specifically, given I_1 and I_2, the feature maps F_1, F_2 ∈ ℝ^{C×H×W} are obtained by the feature encoder e(·). We first introduce two 1 × 1 convolutional layers to activate F_1 and obtain the vertical and horizontal query matrices Q_v, Q_h ∈ ℝ^{C'×H×W}. Simultaneously, we use two additional 1 × 1 convolutional layers to activate F_2 and obtain K_v, K_h ∈ ℝ^{C'×H×W}. Then, the orthogonal global key matrices K̂_v ∈ ℝ^{C'×W} and K̂_h ∈ ℝ^{C'×H} are obtained through vertical and horizontal striping operations:

    \hat{K}_v(i, j) = \frac{1}{H} \sum_{k=1}^{H} K_v(i, k, j),
    \hat{K}_h(i, j) = \frac{1}{W} \sum_{k=1}^{W} K_h(i, j, k),    (3)

where the average pooling windows are H × 1 and 1 × W, respectively. The intuition of the operation is that it emphasizes the vertical and horizontal features, which corresponds to the definition of optical flow. We then transpose the orthogonal query matrices from C' × H × W to H × W × C' and perform the dot product with the orthogonal keys to obtain the vertical and horizontal correlation volumes C_v ∈ ℝ^{H×W×W} and C_h ∈ ℝ^{H×W×H}, which encode the non-local visual similarity:

    C_v(x) = Q_v(x) \cdot \hat{K}_v,
    C_h(x) = Q_h(x) \cdot \hat{K}_h.    (4)

C_v and C_h are then concatenated with the all-pair correlation volume C ∈ ℝ^{H×W×H×W} to obtain an aggregated correlation Ĉ ∈ ℝ^{H×W×2×H×W}. Ĉ encodes both fine-grained matching costs and global visual similarity. Ĉ is then sent to a 4-layer average pooling pyramid with kernel sizes 1, 2, 4, and 8, which pools the last two dimensions to establish a 4-layer correlation pyramid {Ĉ_1, Ĉ_2, Ĉ_3, Ĉ_4}. Since the high-resolution information of the first two dimensions is preserved, the correlation pyramid can give the displacement information of fast-moving small objects. Additionally, the correlation pyramid further aggregates non-local scene cues, assisting the update block in distinguishing the optical flow of extreme displacements on similar textures.

C. Correlation Regression Initialization Module

As mentioned above, we introduce the Correlation Regression Initialization (CRI) module to take full advantage of the orthogonal correlations C_v and C_h given by CSC. We regress the orthogonal correlation volumes instead of the all-pair correlation C, because C contains redundant spatial details that introduce useless information during regression; we verify this in the ablation experiments (Sec. IV). The rich high-level context information in C_v and C_h is more helpful for obtaining the initialized optical flow field. In consideration of the speed performance important for autonomous driving, the CRI module regresses C_v and C_h without any learnable parameters.

The details of the CRI module are shown in Fig. 3. Given the orthogonal correlation volumes C_v ∈ ℝ^{H×W×W} and C_h ∈ ℝ^{H×W×H}, we apply a softmax layer σ(·) to C_v and C_h respectively, and multiply the result element-wise with the original C_v and C_h to obtain the orthogonal energy maps. Finally, we obtain the vertical and horizontal initializations of the optical flow, v_0, u_0 ∈ ℝ^{H×W}:

    v_0(x, y) = \sum_{w=1}^{W} \sigma(C_v(x, y, w)) \, C_v(x, y, w),
    u_0(x, y) = \sum_{h=1}^{H} \sigma(C_h(x, y, h)) \, C_h(x, y, h).    (5)

We then stack the orthogonal optical flows v_0 and u_0 to obtain the initialized flow field V_0 ∈ ℝ^{H×W×2}, which is further sent to the update block as the initial distribution for optical flow refinement. The module takes advantage of global visual similarity to initialize the optical flow, lowering the workload of the subsequent iterative updates, so that the update block can focus on distinguishing foreground from background and on estimating the motion of areas with similar textures.
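The striping, correlation, and regression steps of CSC and CRI (Eqs. 3–5 above) can be sketched directly in numpy. This is an illustrative sketch, not the authors' code: the 1 × 1 convolutions that produce the queries and keys are replaced here by random tensors of the stated shapes.

```python
import numpy as np

rng = np.random.default_rng(1)
Cp, H, W = 4, 6, 5  # C' channels after the (hypothetical) 1x1 convs

# Stand-ins for the activated query/key tensors, each of shape (C', H, W).
Qv, Qh = rng.standard_normal((2, Cp, H, W))
Kv, Kh = rng.standard_normal((2, Cp, H, W))

# Eq. (3): strip (average) pooling of the keys.
Kv_hat = Kv.mean(axis=1)             # (C', W), pooled over H (window H x 1)
Kh_hat = Kh.mean(axis=2)             # (C', H), pooled over W (window 1 x W)

# Eq. (4): dot product between transposed queries and pooled keys.
Cv = Qv.transpose(1, 2, 0) @ Kv_hat  # (H, W, W) vertical correlation
Ch = Qh.transpose(1, 2, 0) @ Kh_hat  # (H, W, H) horizontal correlation

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Eq. (5): parameter-free regression of the orthogonal volumes:
# softmax activation times the original correlation, summed out.
v0 = (softmax(Cv) * Cv).sum(axis=-1)  # (H, W) vertical init
u0 = (softmax(Ch) * Ch).sum(axis=-1)  # (H, W) horizontal init
V0 = np.stack([v0, u0], axis=-1)      # (H, W, 2) initial flow field
```

Because no step involves learnable parameters, the initialization adds only a few tensor contractions on top of the correlation computation, which is what makes CRI cheap enough for the latency budget discussed above.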