IV. EXPERIMENTS

We evaluate CSFlow on the Sintel [30] and KITTI [4] datasets. The model is first pre-trained on the synthetic FlyingChairs [10] and FlyingThings [31] datasets, and then fine-tuned on Sintel or KITTI-2015. In the following, we present the experimental settings and results.
A. Training Details

CSFlow is implemented in PyTorch and trained on an NVIDIA Tesla P40. All weights are initialized randomly. We choose the AdamW optimizer [32] and set the learning rate according to the one-cycle learning rate policy [33]. During training, the flow field is updated 12 times; when evaluating on KITTI, the number of updates is set to 24, and when evaluating on Sintel, to 32. The final model is first trained on FlyingChairs (C) for 150k iterations, with a batch size of 10, a learning rate of 4e-4, and images cropped to 368×496. Then 150k training iterations are performed on FlyingThings (T), with a batch size of 6, a learning rate of 1.25e-4, and an image size of 400×720. Finally, we fine-tune the model on Sintel (S) or KITTI-2015 (K). The ablation experiments are trained for 100k iterations on the synthetic Chairs data with a batch size of 10, and the flow field is updated 32 times when evaluating their results. Following the settings of RAFT, all experiments use data augmentation, including spatial and photometric augmentation.
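For illustration, here is a minimal PyTorch sketch of the optimizer and schedule described above. The learning rate, step count, and batch/crop sizes follow the text; the placeholder network and dummy training step are our assumptions, not the paper's code.

```python
import torch

# Placeholder standing in for the CSFlow network (assumption, not the paper's model).
model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1)

# AdamW [32] with a one-cycle learning rate policy [33] over the 150k
# iterations of the FlyingChairs (C) stage.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=4e-4, total_steps=150_000)

for step in range(150_000):
    # A real step would sample a batch of 10 image pairs cropped to
    # 368x496 and unroll 12 iterative flow updates (24 on KITTI and
    # 32 on Sintel at evaluation time). A dummy forward keeps this runnable.
    images = torch.zeros(1, 3, 8, 8)
    loss = model(images).abs().mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```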
B. Zero-Shot Generalization

We evaluate the synthetic-to-real zero-shot generalization capability of CSFlow on KITTI-12 and KITTI-15. Our model is trained with 150k iterations on Chairs (C) and Things (T); that is, it is trained on synthetic data only and then directly evaluated on unseen real-world datasets. We compare against previous methods under the same zero-shot setting. As shown in Tab. I, CSFlow exhibits state-of-the-art performance on complex street scenes in the zero-shot setting. When trained only on Chairs (C), CSFlow achieves an F1-all error of 22.3% on KITTI-12, a 27.1% error reduction compared with RAFT. After the C+T training, CSFlow achieves an end-point error of 1.96 pixels on KITTI-12, an 8.4% error reduction from the best prior network trained on the same data (RAFT, 2.14 pixels). This capability is vital for autonomous driving applications, considering that no large-scale real-world optical flow data is available for training.

TABLE I
SYNTHETIC TO REAL GENERALIZATION EXPERIMENTS.

Training data   Method              KITTI-12 (train)     KITTI-15 (train)
                                    F1-epe    F1-all     F1-epe    F1-all
C               PWC-Net [13]        5.14      28.7       13.2      41.8
                RAFT [14]           4.72      30.6       9.86      37.6
                LiteFlowNet2 [22]   4.11      -          11.31     32.1
                Ours                4.05      22.3       9.28      32.3
C+T             PWC-Net             4.14      21.4       10.35     33.7
                FlowNet2 [11]       4.09      -          10.06     30.0
                LiteFlowNet2 [22]   3.42      -          8.97      25.9
                VCN [21]            -         -          8.36      25.1
                HD3 [34]            4.65      -          13.17     24.0
                DICL-Flow [28]      -         -          8.70      23.6
                MaskFlowNet [35]    2.94      -          -         23.1
                RAFT [14]           2.14      9.3        5.04      17.4
                Ours                1.96      8.6        4.69      16.5
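For reference, the reported reductions follow directly from the numbers in Tab. I:

\[
\frac{30.6 - 22.3}{30.6} \approx 27.1\% \;\;\text{(F1-all, C)}, \qquad
\frac{2.14 - 1.96}{2.14} \approx 8.4\% \;\;\text{(F1-epe, C+T)}.
\]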
C. Ablations

We conduct a series of ablation experiments to verify the component requirements of the proposed CSFlow. As shown in Tab. II, the settings used in the final model are underlined. We test each component individually, keeping the settings of the other components consistent with the final model, and evaluate the model on Sintel and KITTI.

CRI: The optical flow initialization method we propose significantly improves the estimation accuracy without additional parameters. Initial method: A flow head uses an additional convolutional layer to process the correlation volume and compute the initial optical flow, which improves the result compared to the baseline. However, we notice that it is more efficient to initialize the optical flow by regression, without introducing additional computation. Correlation Regression: Compared to directly regressing the all-pairs correlation volume, it is better to use orthogonal correlations to initialize the optical flow, because they make greater use of non-local visual similarity. GRU Levels: Following RAFT-Stereo [36], we adapt the multi-layer GRU from the disparity estimation task to optical flow. The multi-layer GRU maintains and updates optical flow fields at 1/32, 1/16, and 1/8 scales, respectively. However, we find that this change introduces a large number of parameters while the improvement is negligible. CSC: Our proposed cross strip correlation aggregates global context, leading to an end-point error of 3.98 pixels on Sintel-final, an 11.6% error reduction w.r.t. the baseline. Queries: We further try using the same query volume for I1 to compute the orthogonal correlations from I2, which slightly reduces the number of parameters. Using separate query volumes significantly improves the estimation accuracy, which shows that two orthogonal query volumes are required for the horizontal and vertical key matrices; a sketch of the orthogonal correlations follows below.
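To make the orthogonal correlations concrete, here is a minimal PyTorch sketch of horizontal and vertical strip correlations between two feature maps. The function name, shapes, and attention-style scaling are illustrative assumptions; the actual CSC module may differ in how the query volumes and aggregation are formed.

```python
import torch

def strip_correlations(fmap1: torch.Tensor, fmap2: torch.Tensor):
    """Horizontal and vertical strip correlations (illustrative sketch).

    fmap1, fmap2: (B, C, H, W) feature maps extracted from I1 and I2.
    """
    c = fmap1.shape[1]  # number of feature channels
    # Horizontal strips: each row of I1 is correlated with the same row
    # of I2, giving a (B, H, W, W) similarity volume along the x-axis.
    corr_h = torch.einsum('bchu,bchv->bhuv', fmap1, fmap2)
    # Vertical strips: each column of I1 is correlated with the same
    # column of I2, giving a (B, W, H, H) similarity volume along the y-axis.
    corr_v = torch.einsum('bcuw,bcvw->bwuv', fmap1, fmap2)
    # Scale as in dot-product attention to keep magnitudes stable.
    return corr_h / c ** 0.5, corr_v / c ** 0.5
```

A soft-argmax over these two volumes could then regress initial horizontal and vertical flow components, in the spirit of the correlation-regression initialization discussed above.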
D. Quantitative Experiment

We quantitatively evaluate CSFlow on Sintel and KITTI-2015. In Tab. III, the best results are shown in bold and the second-best results are underlined; † indicates that the method uses the estimation result of the previous frame to refine the subsequent optical flow. The model is trained in two stages: the first stage is pre-training on the synthetic Chairs (C) and Things (T) data, and the second stage is fine-tuning on Sintel (S) or KITTI (K). Following RAFT, our final model comes in two variants: one is fine-tuned only on the training set of the target benchmark (C+T+S/K), while the other is fine-tuned on mixed data (C+T+S+K+H). Our mixed data distribution is consistent with RAFT. We use the average End Point Error (EPE) as the evaluation metric.
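For reference, a minimal sketch of the average EPE metric (the standard definition, not code from the paper), assuming flow tensors of shape (B, 2, H, W):

```python
import torch

def average_epe(flow_pred: torch.Tensor, flow_gt: torch.Tensor) -> torch.Tensor:
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    epe = torch.linalg.norm(flow_pred - flow_gt, dim=1)  # per-pixel error, (B, H, W)
    return epe.mean()  # averaged over all pixels and the batch
```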
