Shubhra Aich∗ , Jean Marie Uwabeza Vianney∗ , Md Amirul Islam, Mannat Kaur, and Bingbing Liu†
Fig. 2: The computation graph of our proposed bidirectional attention network (BANet). The directed arrows refer to the flow of data through this graph. The forward and backward attention modules take stage-wise features as input and process them through multiple operations to generate stage-wise attention weights. The rectangles represent either a single or a composite operation with learnable parameters. The white circles with inscribed symbols (⊗, ϕ, ⊙, Σ, σ) denote parameterless operations. The colored circles highlight the outputs of different conceptual stages of processing. The construction of the D2S modules is shown inside the gray box on the right. Note that the green D2S modules used for the feature computation F do not employ the last BN+ReLU block inside D2S.
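The core of each D2S module is a depth-to-space (pixel-shuffle) rearrangement that trades channels for spatial resolution. A minimal NumPy sketch of that rearrangement, assuming a single (C·r², H, W) feature map and upscale factor r (the function and variable names are ours, not from the paper):

```python
import numpy as np

def depth_to_space(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r).

    Each group of r*r channels is laid out as an r x r spatial block,
    following the usual pixel-shuffle convention.
    """
    c2, h, w = x.shape
    assert c2 % (r * r) == 0, "channel count must be divisible by r*r"
    c = c2 // (r * r)
    # (C, r, r, H, W) -> (C, H, r, W, r) -> (C, H*r, W*r)
    return (x.reshape(c, r, r, h, w)
             .transpose(0, 3, 1, 4, 2)
             .reshape(c, h * r, w * r))

x = np.arange(16).reshape(4, 2, 2)   # 4 channels, 2x2 spatial
y = depth_to_space(x, 2)             # 1 channel, 4x4 spatial
```

In Fig. 2, one such module follows each backbone stage S1-S5, so every stage contributes a full-resolution map to the subsequent attention-weighted fusion.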
attention mechanisms of a bidirectional RNN. Next, all the forward and backward attention maps are concatenated channel-wise and processed with a 3×3 convolution (G in Eq. 1) and softmax (ϕ) to generate the per-stage pixel-level attention weights, A (see Fig. 3).

Fig. 3: An illustration of the generated stage-wise attention weights using the forward and backward attention modules.

The procedure of computing the feature representation, f_i, from the stage-wise feature maps, S_i, using the D2S module can be formalized as:

f_i = D_i^F(s_i), ∀ i ∈ {1, 2, …, N};  F = f_1 ⊗ f_2 ⊗ … ⊗ f_N   (2)

Then we compute the unnormalized prediction, D̂_u, from the concatenated features, F, and attention maps, A, using the Hadamard (element-wise) product followed by pixel-wise summation (Σ). Finally, the normalized prediction, D̂, is generated using the sigmoid (σ) function. The operations can be expressed as:

D̂_u = Σ (A ⊙ F),  D̂ = σ(D̂_u)   (3)

Global Context Aggregation. We incorporate global context inside the D2S modules by applying average pooling with relatively large kernels followed by fully connected layers and bilinear upsampling operations. This aggregation of global context in the D2S module helps to resolve ambiguities (see Figs. 4 and 5) for thinner objects in more challenging scenarios (i.e., very bright or dark contexts). Additionally, we provide details of a few alternative realizations of our proposed architecture as follows:

• BANet-Full: This is the complete realization of the architecture as shown in Fig. 2.
• BANet-Vanilla: It contains only the backbone followed by a 1×1 convolution, a single D2S operation, and a sigmoid to generate the final depth prediction. This is similar to [32].
• BANet-Forward: The backward attention modules of BANet are removed in this setup.
• BANet-Backward: The forward attention modules of BANet are not included here.
• BANet-Markov: This follows the Markovian assumption that the feature at each time step (or stage) i depends only on the feature from the immediately preceding (for forward
attention) or succeeding (for backward attention) time step (or stage) i∓1, i.e., X_i ⊥ {X_{i∓2}, X_{i∓3}, …, X_{i∓k}} | X_{i∓1}.
• BANet-Local: This variant replaces the global context aggregation part with a single 9×9 convolution.

To demonstrate that the performance improvement on the MDE task from the time-dependent structuring is not due to the mere increase in capacity and number of parameters, we conduct pilot experiments without any time-dependent structuring by simply concatenating the different stage features all at once, followed by similar post-processing. However, we empirically find that such a naive implementation performs much worse than our proposed time-dependent realizations mentioned above. Therefore, we exclude this straightforward variant from further experimental analysis.

IV. EXPERIMENTS

Datasets: Among several publicly available standard datasets [23], [24], [33]–[35] for monocular depth estimation, we choose DIODE [34] and KITTI [35] due to their comparatively high density of ground-truth annotations. We hypothesize that otherwise our model assessment might be biased towards the better nonlinear interpolator rather than the candidate best capable of learning to predict depth maps from monocular cues. Therefore, our experiments on datasets comprising truly dense ground truths provide an unbiased measure of the different architectural design choices.

• DIODE: Dense Indoor/Outdoor DEpth (DIODE) [34] is the first standard dataset for monocular depth estimation comprising diverse indoor and outdoor scenes acquired with the same hardware setup. The training set consists of 8574 indoor and 16884 outdoor samples from 20 scans each. The validation set contains 325 indoor and 446 outdoor samples, each set from 10 different scans. The ground-truth density of the indoor training and validation splits is approximately 99.54% and 99%, respectively. The density of the outdoor sets is naturally lower, with 67.19% for the training and 78.33% for the validation subsets. The indoor and outdoor ranges of the dataset are 50m and 300m, respectively.

• KITTI: The KITTI dataset for monocular depth estimation [35] is a derivative of the KITTI sequences. The training, validation, and test sets comprise 85898, 1000, and 500 samples, respectively. Following previous works, we set the prediction range to [0, 80m] wherever applicable, although there are a few pixels with depth values greater than 80m.

Implementation Details: For a fair comparison, all the recent models [4], [5], [7] are trained from scratch alongside ours using the hyperparameter settings proposed in the corresponding papers. Our models are trained for 50 epochs with a batch size of 32 on an NVIDIA Tesla V100 32GB GPU using the sparse labels without any densification. The runtime was measured on a single NVIDIA GeForce RTX 2080 Ti GPU because of its availability at the time of writing. We use the Adam optimizer with an initial learning rate of 1e−4 and a weight decay of 0. On plateau, the learning rate is multiplied by 0.1 after 10 epochs until it reaches 1e−5. Random horizontal flipping and color jitter are the only augmentations used in training. For KITTI, we randomly crop 352×1216 training samples, following the resolution of the validation and test splits. Also, for computational efficiency, the inputs are half-sampled (and zero-padded if necessary) prior to feeding into the models, and the predictions are bilinearly upsampled to match the ground truth before comparison. For DORN [7], we set the number of ordinal levels to the optimal value of 80 (or 160 considering both positive and negative planes). This value is set to 50, 75, and 75 for the DIODE indoor, outdoor, and combined splits following the same principle. We employ the SILog and L1 losses for training all the models on KITTI and DIODE, respectively, except for DORN, which comes with its own ordinal loss. Following BTS [5] and DenseDepth [4], we use DenseNet161 [10] as our BANet backbone.

Evaluation Metrics: For the error (lower is better) metrics, we mostly follow the measures (SILog, SqRel, AbsRel, MAE, RMSE, iRMSE) directly from the KITTI leaderboard. Additionally, we use the following thresholded accuracy (higher is better) metrics:

δ_k = (1/N) Σ_i [ max(a_i/t_i, t_i/a_i) < 1 + k/100 ] × 100 (%),  k ∈ {25, 56, 95}   (4)

A. Results on Monocular Depth Estimation

Quantitative Comparison: Tables I, II, III, and IV present the comparison of our BANet variants with SOTA architectures. Note that DORN performs much worse due to the effect of granularity introduced by its discretization or ordinal levels. A higher value of this parameter might improve its precision, but with an exponential increase in memory consumption during training, thus making it difficult to train on large datasets like KITTI and DIODE. Overall, our BANet variants perform better than, or close to, the existing SOTA, in particular BTS. Finally, our model performs better than SOTA architectures on the KITTI leaderboard¹ (Table V) in terms of the primary metric (SILog) used for ranking.

Qualitative Comparison: Figures 4 and 5 show the qualitative comparison of different SOTA methods. Also, we

¹http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction
TABLE I: Quantitative results on the DIODE Indoor val set. SiLog, SqRel, AbsRel, MAE, RMSE, iRMSE: lower is better; δ25, δ56, δ95: higher is better (%).
Model #Params(10^6) Time(ms) SiLog SqRel AbsRel MAE RMSE iRMSE δ25 δ56 δ95
DORN [7] 91.21 57 24.45 49.54 47.25 1.63 1.86 208.30 37.61 62.39 77.97
DenseDepth [4] 44.61 27 19.67 25.10 34.50 1.39 1.56 177.33 48.13 70.19 83.00
BTS [5] 47.00 34 19.19 25.55 33.31 1.42 1.61 171.41 50.14 72.71 83.44
BANet-Vanilla 28.73 21 18.78 27.75 34.36 1.43 1.61 175.35 51.28 70.63 81.32
BANet-Forward 32.58 28 18.39 25.81 35.49 1.49 1.65 179.58 47.53 68.63 80.91
BANet-Backward 32.58 28 18.25 24.43 34.19 1.46 1.62 176.02 48.94 69.50 81.52
BANet-Markov 35.64 31 18.62 21.78 32.58 1.44 1.61 173.58 48.99 72.28 82.29
BANet-Local 35.08 32 18.83 23.65 33.49 1.43 1.59 172.11 49.71 71.18 82.88
BANet-Full 35.64 33 18.43 26.49 34.99 1.45 1.61 177.22 48.48 70.66 82.26
TABLE III: Quantitative results on the DIODE All (Indoor+Outdoor) val set. SiLog, SqRel, AbsRel, MAE, RMSE, iRMSE: lower is better; δ25, δ56, δ95: higher is better (%).
Model #Params(10^6) Time(ms) SiLog SqRel AbsRel MAE RMSE iRMSE δ25 δ56 δ95
DORN [7] 91.31 57 34.23 226.87 47.30 3.55 5.34 135.86 46.41 73.54 86.02
DenseDepth [4] 44.61 27 31.05 157.70 39.33 3.10 4.77 182.26 54.25 76.46 87.24
BTS [5] 47.00 34 29.83 146.12 37.03 3.01 4.71 152.42 56.35 77.29 88.14
BANet-Vanilla 28.73 21 30.11 156.16 39.30 3.08 4.73 129.02 53.76 76.67 87.37
BANet-Forward 32.58 28 30.54 181.53 39.09 3.05 4.77 128.88 55.17 76.86 87.21
BANet-Backward 32.58 28 30.28 179.41 40.25 3.05 4.73 129.98 54.17 76.83 87.47
BANet-Markov 35.64 31 30.16 188.45 40.10 3.12 4.80 129.31 54.04 76.04 86.76
BANet-Local 35.08 32 29.69 148.91 37.68 2.98 4.66 126.96 56.08 77.34 87.96
BANet-Full 35.64 33 29.93 150.76 39.28 3.06 4.73 130.39 53.26 76.40 87.15
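The δ25/δ56/δ95 columns in the tables above follow the thresholded accuracy of Eq. 4, where k ∈ {25, 56, 95} approximates the common 1.25, 1.25², 1.25³ thresholds. A NumPy sketch (function and variable names are ours):

```python
import numpy as np

def delta_k(pred, gt, k):
    """Percentage of pixels with max(pred/gt, gt/pred) < 1 + k/100 (Eq. 4)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < 1.0 + k / 100.0).mean() * 100.0)

pred = np.array([1.0, 2.0, 4.0, 10.0])   # toy predicted depths
gt   = np.array([1.1, 2.0, 6.0,  9.0])   # toy ground-truth depths
accs = [delta_k(pred, gt, k) for k in (25, 56, 95)]  # [75.0, 100.0, 100.0]
```

In the toy example, only the third pixel (ratio 1.5) misses the tightest δ25 threshold; all pixels pass the looser δ56 and δ95 thresholds.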
TABLE V: Results on the KITTI leaderboard (test set).
Method SILog SqRel AbsRel iRMSE
DORN [7] 11.77 2.23 8.78 12.98
BTS [5] 11.67 2.21 9.04 12.23
BANet-Full 11.55 2.31 9.34 12.17

provide a video of our predictions for the KITTI validation sequences². The DORN prediction map clearly indicates its semi-discretized nature of prediction. For both the indoor and outdoor images in Fig. 4, BTS and DenseDepth suffer from the high-intensity regions surrounding the window frame in the bottom right of the left image and the trunks in the right image. Our BANet variants work better in these delicate regions. To better tease out the effect of the different components in BANet, we provide observations as follows:

• Why does vanilla perform better than SOTA in many cases? The vanilla architecture inherently maintains and generates the full-resolution depth map using the D2S conversion without increasing the number of parameters. We hypothesize this to be the primary reason why it performs better than DORN (a huge number of parameters) and DenseDepth (generates a 4× downsampled output followed by 4× upsampling).

• Why is Markovian worse than vanilla? BANet-Markov utilizes only the S_{i−1} stage features to generate the attention map for the S_i stage features. For the other variants, when s_{i−1}, s_{i−2}, s_{i−3}, … are used for the attention mechanism, the forgotten (or future, for backward) information from s_{i−2}, s_{i−3}, … is used for the refinement. However, essentially no forgotten information is brought back to the stage for the Markovian formulation. The same information (or s_{i−1})

²https://www.youtube.com/watch?v=a9wW02pfeKw&ab_channel=MannatKaur
Fig. 4: Sample results on DIODE indoor (top) and outdoor (bottom) images (columns: Image, Ground truth, DORN, DenseDepth, BTS, BANet-Local, BANet-Full). Top: The window frame on the bottom right is detected well by BANets compared to BTS and DenseDepth. Bottom: All methods but our BANet-Full fail to localize the highly illuminated tree trunks. This shows the importance of global context aggregation in BANet-Full.
and the vanilla and local variants perform well for these (comparatively easier) regions. However, in a real scenario, it is more important to identify the relative depth of both the difficult and the easy pixels satisfactorily than to be better only at the easy pixels, which exist in bulk quantities in all
DORN