Fig. 1. (a) The overall architecture of our U-Net v2 model, which consists of an Encoder, the SDI (Semantics and Detail Infusion) module, and a Decoder. (b) The architecture of the SDI module. For simplicity, we only show the refinement of the third-level features (l = 3). SmoothConv denotes a 3 × 3 convolution for feature smoothing; ⊙ denotes the Hadamard product.
In [8], reverse attention was utilized to explicitly establish connections among multi-scale features. In [9], ReLU activation was applied to higher-level features, and the activated features were multiplied with lower-level features. Additionally, in [10], the authors proposed to extract features from CNN and Transformer models separately, combining the features from both the CNN and Transformer branches at multiple levels to enhance the feature maps. However, these approaches are complex, and their performance remains unsatisfactory, leaving room for further improvement.

In this paper, we present U-Net v2, a new U-Net-based segmentation framework with straightforward and efficient skip connections. Our model first extracts multi-level feature maps using a CNN or Transformer encoder. Next, for a feature map at the i-th level, we explicitly infuse higher-level features (which contain more semantic information) and lower-level features (which capture finer details) through a simple Hadamard product operation, thereby enhancing both the semantics and details of the i-th level features. Subsequently, the refined features are transmitted to the decoder for resolution reconstruction and segmentation. Our method can be seamlessly integrated into any encoder-decoder network.

We evaluate our new method on two medical image segmentation tasks, skin lesion segmentation and polyp segmentation, using publicly available datasets. The experimental results demonstrate that our U-Net v2 consistently outperforms state-of-the-art methods on these tasks while preserving FLOPs and GPU memory efficiency.

2. METHOD

2.1. Overall Architecture

The overall architecture of our U-Net v2 is shown in Fig. 1(a). It comprises three main modules: the encoder, the SDI (Semantics and Detail Infusion) module, and the decoder. Given an input image I, with I ∈ R^{H×W×C}, the encoder produces features at M levels. We denote the i-th level features as f_i^0, 1 ≤ i ≤ M. These collected features, {f_1^0, f_2^0, ..., f_M^0}, are then transmitted to the SDI module for further refinement.
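To make this three-module data flow concrete, the following is a minimal PyTorch sketch; UNetV2Sketch and its constructor arguments are illustrative names for assumed stand-in modules, not our exact implementation:

import torch
import torch.nn as nn

class UNetV2Sketch(nn.Module):
    """Minimal encoder -> SDI -> decoder pipeline (illustrative stand-ins)."""

    def __init__(self, encoder: nn.Module, sdi: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # yields multi-level features [f_1^0, ..., f_M^0]
        self.sdi = sdi          # refines every level with semantics and details
        self.decoder = decoder  # reconstructs resolution and predicts the mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x)    # list of M feature maps
        refined = self.sdi(feats)  # list of M refined maps [f_1^5, ..., f_M^5]
        return self.decoder(refined)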
2.2. Semantics and Detail Infusion (SDI) Module

With the hierarchical feature maps generated by the encoder, we first apply the spatial and channel attention mechanisms [11] to the features f_i^0 of each level i. This process enables the features to integrate both local spatial information and global channel information, as formulated below:

f_i^1 = ϕ_i^c(φ_i^s(f_i^0)),    (1)

where f_i^1 represents the processed feature map at the i-th level, and φ_i^s and ϕ_i^c denote the parameters of the spatial and channel attentions at the i-th level, respectively. Furthermore, we apply a 1 × 1 convolution to reduce the channels of f_i^1 to c, where c is a hyper-parameter. The resulting feature map is denoted as f_i^2, with f_i^2 ∈ R^{H_i×W_i×c}, where H_i, W_i, and c represent the height, width, and channels of f_i^2, respectively.
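To make this per-level refinement concrete, the following PyTorch sketch applies spatial attention, channel attention, and the 1 × 1 channel reduction. The SpatialAttention and ChannelAttention blocks are minimal CBAM-style stand-ins [11] rather than the exact modules used in our experiments, and LevelRefine is an illustrative name:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: a minimal stand-in
    for the channel attention ϕ_i^c in Eq. (1)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # global context
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # reweight channels

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: a minimal stand-in for φ_i^s."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)   # channel-wise average pooling
        mx, _ = x.max(dim=1, keepdim=True)  # channel-wise max pooling
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate                     # reweight spatial positions

class LevelRefine(nn.Module):
    """Eq. (1) plus the 1 x 1 channel reduction: maps f_i^0 to f_i^2."""

    def __init__(self, in_channels: int, c: int):
        super().__init__()
        self.spatial = SpatialAttention()
        self.channel = ChannelAttention(in_channels)
        self.reduce = nn.Conv2d(in_channels, c, kernel_size=1)

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        f1 = self.channel(self.spatial(f0))  # Eq. (1): spatial, then channel
        return self.reduce(f1)               # f_i^2 with c channels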
Next, we need to send the refined feature maps to the decoder. At each decoder level i, we use f_i^2 as the target reference. Then, we adjust the size of the feature map at each j-th level to match the resolution of f_i^2, formulated as:

f_{ij}^3 = \begin{cases} D(f_j^2, (H_i, W_i)) & \text{if } j < i, \\ I(f_j^2) & \text{if } j = i, \\ U(f_j^2, (H_i, W_i)) & \text{if } j > i, \end{cases}    (2)

where D, I, and U represent adaptive average pooling, identity mapping, and bilinear interpolation of f_j^2 to the resolution H_i × W_i, respectively, with 1 ≤ i, j ≤ M.
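Eq. (2) maps directly onto standard PyTorch operations. The sketch below is a minimal version; resize_to_level is an illustrative helper name, and the align_corners=False setting is an assumption:

import torch
import torch.nn.functional as F

def resize_to_level(f_j2: torch.Tensor, target_hw: tuple, j: int, i: int) -> torch.Tensor:
    """Resize level-j features f_j^2 to level i's resolution (H_i, W_i), per Eq. (2)."""
    if j < i:
        # Shallower level: higher resolution than the target, so downsample
        # with adaptive average pooling (the operator D).
        return F.adaptive_avg_pool2d(f_j2, target_hw)
    if j == i:
        # Same level: identity mapping (the operator I).
        return f_j2
    # Deeper level: lower resolution than the target, so upsample with
    # bilinear interpolation (the operator U).
    return F.interpolate(f_j2, size=target_hw, mode="bilinear", align_corners=False)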
Afterwards, a 3 × 3 convolution is applied to smooth each resized feature map f_{ij}^3, formulated as:

f_{ij}^4 = θ_{ij}(f_{ij}^3),    (3)

where θ_{ij} represents the parameters of the smoothing convolution, and f_{ij}^4 is the j-th smoothed feature map at the i-th level.

After resizing all the feature maps to the i-th level's resolution, we apply the element-wise Hadamard product to all the resized feature maps to enhance the i-th level features with both more semantic information and finer details:

f_i^5 = H([f_{i1}^4, f_{i2}^4, ..., f_{iM}^4]),    (4)

where H(·) denotes the Hadamard product (see Fig. 1(b)). Afterwards, f_i^5 is dispatched to the i-th level decoder for further resolution reconstruction and segmentation.
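Eqs. (3) and (4) together amount to a per-level 3 × 3 convolution followed by an element-wise product over all M resized maps. Below is a minimal sketch, assuming the resized maps f_{ij}^3 from Eq. (2) all share the shape (B, c, H_i, W_i); SDIFuse is an illustrative name:

import torch
import torch.nn as nn

class SDIFuse(nn.Module):
    """Sketch of Eqs. (3)-(4) for one target level i: smooth every resized
    map with its own 3 x 3 convolution, then fuse all levels by Hadamard
    product."""

    def __init__(self, c: int, num_levels: int):
        super().__init__()
        # One SmoothConv theta_ij per source level j (Eq. (3)).
        self.smooth = nn.ModuleList(
            [nn.Conv2d(c, c, kernel_size=3, padding=1) for _ in range(num_levels)]
        )

    def forward(self, resized_maps: list) -> torch.Tensor:
        # resized_maps holds the M tensors f_ij^3, all of shape (B, c, H_i, W_i).
        f5 = None
        for theta, f3 in zip(self.smooth, resized_maps):
            f4 = theta(f3)                      # Eq. (3): smoothing
            f5 = f4 if f5 is None else f5 * f4  # Eq. (4): Hadamard product
        return f5

For example, with M = 4 levels and c = 32 channels, SDIFuse(32, 4) consumes the four resized maps for the i-th level and returns f_i^5, which is then dispatched to the i-th decoder level.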
Comparison results with state-of-the-art methods on the ISIC datasets are presented in Table 1. As shown, our proposed U-Net v2 improves the DSC scores by 1.44% and 2.48% and the IoU scores by 2.36% and 3.90% on the ISIC 2017 and ISIC 2018 datasets, respectively. These improvements demonstrate the effectiveness of our proposed method in infusing semantic information and finer details into each feature map.

Dataset     Method            DSC (%)   IoU (%)
ISIC 2017   U-Net [4]         86.99     76.98
ISIC 2017   TransFuse [10]    88.40     79.21
ISIC 2017   MALUNet [12]      88.13     78.78
ISIC 2017   EGE-UNet [13]     88.77     79.81
ISIC 2017   U-Net v2 (ours)   90.21     82.17
ISIC 2018   U-Net [4]         87.55     77.86
ISIC 2018   UNet++ [5]        87.83     78.31
ISIC 2018   TransFuse [10]    89.27     80.63
ISIC 2018   SANet [9]         88.59     79.52
ISIC 2018   EGE-UNet [13]     89.04     80.25
ISIC 2018   U-Net v2 (ours)   91.52     84.15

Table 1. Experimental comparison with state-of-the-art methods on the ISIC datasets.

Table 2. Experimental comparison with state-of-the-art methods on the Polyp datasets.

Dataset     Method              DSC (%)        IoU (%)
ISIC 2017   UNet++ (PVT) [5]    89.60 ± 0.17   81.16 ± 0.07
ISIC 2017   U-Net v2 w/o SDI    89.85 ± 0.14   81.57 ± 0.06
ISIC 2017   U-Net v2 w/o SC     90.20 ± 0.13   82.16 ± 0.05
ISIC 2017   U-Net v2 (ours)     90.21 ± 0.13   82.17 ± 0.05
ColonDB     UNet++ (PVT) [5]    78.0 ± 4.3     69.6 ± 3.9
ColonDB     U-Net v2 w/o SDI    79.2 ± 4.1     71.5 ± 3.7
ColonDB     U-Net v2 w/o SC     81.3 ± 3.7     72.8 ± 4.0
ColonDB     U-Net v2 (ours)     81.2 ± 3.9     73.1 ± 4.4

Table 3. Ablation study on the ISIC 2017 and ColonDB datasets. SC denotes the spatial and channel attentions.

The ablation results show that both proposed components, the spatial and channel attentions (SC) and the semantics and detail infusion (i.e., SDI), consistently yield performance improvements.
Model            DSC (ISIC 2017)   Input size         # Params (M)   GPU memory usage (MB)   FLOPs (G)   FPS
U-Net (PVT)      89.85             (1, 3, 256, 256)   28.15          478.82                  8.433       39.678
UNet++ (PVT)     89.60             (1, 3, 256, 256)   29.87          607.31                  19.121      34.431
U-Net v2 (ours)  90.21             (1, 3, 256, 256)   25.02          411.42                  5.399       36.631

Table 4. Comparison of computational complexity, GPU memory usage, and inference time, using an NVIDIA P100 GPU.
[Figure: qualitative segmentation results; columns: input, gt, Ours, UNet, UNet++, EGE-UNet]
[3] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia, "Path aggregation network for instance segmentation," in CVPR, 2018, pp. 8759–8768.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in MICCAI, Proceedings, Part III. Springer, 2015, pp. 234–241.

[5] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in DLMIA 2018. Springer, 2018, pp. 3–11.

[6] Jiawei Zhang, Yuzhen Jin, Jilan Xu, Xiaowei Xu, and Yanchun Zhang, "MDU-Net: Multi-scale densely connected U-Net for biomedical image segmentation," arXiv preprint arXiv:1812.00352, 2018.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.

[8] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao, "PraNet: Parallel reverse attention network for polyp segmentation," in MICCAI. Springer, 2020, pp. 263–273.

[9] Jun Wei, Yiwen Hu, Ruimao Zhang, Zhen Li, S Kevin Zhou, and Shuguang Cui, "Shallow attention network for polyp segmentation," in MICCAI, Proceedings, Part I 24. Springer, 2021, pp. 699–708.

[10] Yundong Zhang, Huiye Liu, and Qiang Hu, "TransFuse: Fusing Transformers and CNNs for medical image segmentation," in MICCAI, Proceedings, Part I 24. Springer, 2021, pp. 14–24.

[11] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, "CBAM: Convolutional block attention module," in ECCV, 2018, pp. 3–19.

[12] Jiacheng Ruan, Suncheng Xiang, Mingye Xie, Ting Liu, and Yuzhuo Fu, "MALUNet: A multi-attention and light-weight UNet for skin lesion segmentation," in BIBM. IEEE, 2022, pp. 1150–1156.

[15] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al., "Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC)," arXiv preprint arXiv:1902.03368, 2019.

[16] Matt Berseth, "ISIC 2017 - Skin lesion analysis towards melanoma detection," arXiv preprint arXiv:1703.00523, 2017.

[17] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen, "Kvasir-SEG: A segmented polyp dataset," in MMM, Part II 26, 2020, pp. 451–462.

[18] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño, "WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians," CMIG, vol. 43, pp. 99–111, 2015.

[19] Nima Tajbakhsh, Suryakanth R Gurudu, and Jianming Liang, "Automated polyp detection in colonoscopy videos using shape and context information," TMI, vol. 35, no. 2, pp. 630–644, 2015.

[20] David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, Aaron Courville, et al., "A benchmark for endoluminal scene segmentation of colonoscopy images," Journal of Healthcare Engineering, vol. 2017, 2017.

[21] Juan Silva, Aymeric Histace, Olivier Romain, Xavier Dray, and Bertrand Granado, "Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer," Journal of CARS, vol. 9, pp. 283–293, 2014.

[22] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao, "Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions," in IEEE/CVF ICCV, 2021, pp. 568–578.