
U-NET V2: RETHINKING THE SKIP CONNECTIONS OF U-NET FOR MEDICAL IMAGE SEGMENTATION

Yaopeng Peng¹   Milan Sonka²   Danny Z. Chen¹

¹University of Notre Dame   ²University of Iowa

arXiv:2311.17791v1 [eess.IV] 29 Nov 2023

ABSTRACT

In this paper, we introduce U-Net v2, a new robust and efficient U-Net variant for medical image segmentation. It aims to augment the infusion of semantic information into low-level features while simultaneously refining high-level features with finer details. For an input image, we begin by extracting multi-level features with a deep neural network encoder. Next, we enhance the feature map of each level by infusing semantic information from higher-level features and integrating finer details from lower-level features through the Hadamard product. Our novel skip connections empower features of all the levels with enriched semantic characteristics and intricate details. The improved features are subsequently transmitted to the decoder for further processing and segmentation. Our method can be seamlessly integrated into any Encoder-Decoder network. We evaluate our method on several public medical image segmentation datasets for skin lesion segmentation and polyp segmentation, and the experimental results demonstrate the improved segmentation accuracy of our new method over state-of-the-art methods, while preserving memory and computational efficiency. Code is available at: https://github.com/yaoppeng/U-Net_v2.

Index Terms— Medical image segmentation, U-Net, skip connections, semantics and detail infusion

This research was supported in part by NIH NIBIB grant R01-EB004640.

1. INTRODUCTION

With the advance of modern deep neural networks, significant progress has been made in semantic image segmentation. A typical paradigm for semantic image segmentation involves an Encoder-Decoder network with skip connections [1]. In this framework, the encoder extracts hierarchical and abstract features from an input image, while the decoder takes the feature maps generated by the encoder and reconstructs a pixel-wise segmentation mask or map, assigning a class label to each pixel in the input image. A series of studies [2, 3] have been conducted to incorporate global information into the feature maps and enhance multi-scale features, resulting in substantial improvements in segmentation performance.

In the field of medical image analysis, accurate image segmentation plays a pivotal role in computer-aided diagnosis and analysis. U-Net [4] was originally introduced for medical image segmentation, utilizing skip connections to connect the encoder and decoder stages at each level. The skip connections empower the decoder to access features from earlier encoder stages, hence preserving both high-level semantic information and fine-grained spatial details. This approach facilitates precise delineation of object boundaries and extraction of small structures in medical images. Additionally, a dense connection mechanism was applied to reduce dissimilarities between features in the encoders and decoders by concatenating features from all levels and all stages [5]. A mechanism was designed to enhance features by concatenating features of different scales from both higher and lower levels [6].

However, these connections in U-Net based models may not be sufficiently effective in integrating low-level and high-level features. For example, in ResNet [7], a deep neural network was formed as an ensemble of multiple shallow networks, and an explicitly added residual connection illustrated that the network can struggle to learn the identity map function, even when trained on a million-scale image dataset. Regarding the features extracted by the encoders, the low-level features usually preserve more details but lack sufficient semantic information and may contain undesired noise. In contrast, the high-level features contain more semantic information but lack precise details (e.g., object boundaries) due to the significant resolution reduction. Simply fusing features through concatenation relies heavily on the network's learning capacity, which is often proportional to the training dataset size. This is a challenging issue, especially in the context of medical imaging, which is commonly constrained by limited data. Such information fusion, accomplished by concatenating low-level and high-level features across multiple levels through dense connections, may limit the contribution of information from different levels and potentially introduce noise. On the other hand, although the additional convolutions introduced do not significantly increase the number of parameters, GPU memory consumption will rise because all intermediate feature maps and the corresponding gradients must be stored for forward passes and backward gradient computations. This leads to an increase in both GPU memory usage and floating point operations (FLOPs).
Fig. 1. (a) The overall architecture of our U-Net v2 model, which consists of an Encoder, the SDI (Semantics and Detail Infusion) module, and a Decoder. (b) The architecture of the SDI module. For simplicity, we only show the refinement of the third level features (l = 3). SmoothConv denotes a 3 × 3 convolution for feature smoothing; ⊙ denotes the Hadamard product.

In [8], reverse attention was utilized to explicitly establish connections among multi-scale features. In [9], ReLU activation was applied to higher-level features and the activated features were multiplied with lower-level features. Additionally, in [10], the authors proposed to extract features from CNN and Transformer models separately, combining the features from both the CNN and Transformer branches at multiple levels to enhance the feature maps. However, these approaches are complex, and their performance remains not fully satisfactory, leaving room for further improvement.

In this paper, we present U-Net v2, a new U-Net based segmentation framework with straightforward and efficient skip connections. Our model first extracts multi-level feature maps using a CNN or Transformer encoder. Next, for a feature map at the i-th level, we explicitly infuse higher-level features (which contain more semantic information) and lower-level features (which capture finer details) through a simple Hadamard product operation, thereby enhancing both the semantics and details of the i-th level features. Subsequently, the refined features are transmitted to the decoder for resolution reconstruction and segmentation. Our method can be seamlessly integrated into any Encoder-Decoder network.

We evaluate our new method on two medical image segmentation tasks, skin lesion segmentation and polyp segmentation, using publicly available datasets. The experimental results demonstrate that our U-Net v2 consistently outperforms state-of-the-art methods in these segmentation tasks while preserving FLOPs and GPU memory efficiency.

2. METHOD

2.1. Overall Architecture

The overall architecture of our U-Net v2 is shown in Fig. 1(a). It comprises three main modules: the encoder, the SDI (Semantics and Detail Infusion) module, and the decoder.

Given an input image I, with I ∈ R^{H×W×C}, the encoder produces features in M levels. We denote the i-th level features as f_i^0, 1 ≤ i ≤ M. These collected features, {f_1^0, f_2^0, ..., f_M^0}, are then transmitted to the SDI module for further refinement.

2.2. Semantics and Detail Infusion (SDI) Module

With the hierarchical feature maps generated by the encoder, we first apply the spatial and channel attention mechanisms [11] to the features f_i^0 of each level i. This process enables the features to integrate both local spatial information and global channel information, as formulated below:

    f_i^1 = ϕ_i^c(φ_i^s(f_i^0)),                                  (1)

where f_i^1 represents the processed feature map at the i-th level, and φ_i^s and ϕ_i^c denote the parameters of the spatial and channel attentions at the i-th level, respectively. Furthermore, we apply a 1 × 1 convolution to reduce the channels of f_i^1 to c, where c is a hyper-parameter. The resulting feature map is denoted as f_i^2, with f_i^2 ∈ R^{H_i×W_i×c}, where H_i, W_i, and c represent the height, width, and number of channels of f_i^2, respectively.
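To make Eq. (1) and the subsequent 1 × 1 channel reduction concrete, a minimal PyTorch sketch is given below. It assumes CBAM-style [11] attention blocks applied in the order of Eq. (1) (spatial, then channel); the class names, kernel size, and reduction ratio are illustrative assumptions rather than details taken from the released code.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: weight each location by a map derived
    from channel-wise average- and max-pooled descriptors."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)        # (B, 1, H, W)
        max_map = x.max(dim=1, keepdim=True).values  # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn


class ChannelAttention(nn.Module):
    """CBAM-style channel attention: weight each channel by a shared MLP
    applied to globally average- and max-pooled features."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg_desc = self.mlp(x.mean(dim=(2, 3)))   # (B, C)
        max_desc = self.mlp(x.amax(dim=(2, 3)))   # (B, C)
        attn = torch.sigmoid(avg_desc + max_desc).view(b, c, 1, 1)
        return x * attn


class LevelRefinement(nn.Module):
    """Eq. (1) plus the 1x1 channel reduction: f_i^0 -> f_i^1 -> f_i^2."""

    def __init__(self, in_channels: int, out_channels: int = 32):
        super().__init__()
        self.spatial = SpatialAttention()
        self.channel = ChannelAttention(in_channels)
        self.reduce = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, f0):
        f1 = self.channel(self.spatial(f0))  # spatial attention, then channel attention
        return self.reduce(f1)               # f_i^2 with c = out_channels channels
```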
Next, we need to send the refined feature maps to the decoder. At each decoder level i, we use f_i^2 as the target reference. Then, we adjust the sizes of the feature maps at every j-th level to match the same resolution as f_i^2, formulated as:

    f_ij^3 = D(f_j^2, (H_i, W_i))   if j < i,
             I(f_j^2)               if j = i,                     (2)
             U(f_j^2, (H_i, W_i))   if j > i,

where D, I, and U represent adaptive average pooling, identity mapping, and bilinearly interpolating f_j^2 to the resolution of H_i × W_i, respectively, with 1 ≤ i, j ≤ M.

Afterwards, a 3 × 3 convolution is applied in order to smooth each resized feature map f_ij^3, formulated as:

    f_ij^4 = θ_ij(f_ij^3),                                        (3)

where θ_ij represents the parameters of the smooth convolution, and f_ij^4 is the j-th smoothed feature map at the i-th level.

After resizing all the i-th level feature maps into the same resolution, we apply the element-wise Hadamard product to all the resized feature maps to enhance the i-th level features with both more semantic information and finer details, as:

    f_i^5 = H([f_i1^4, f_i2^4, ..., f_iM^4]),                     (4)

where H(·) denotes the Hadamard product (see Fig. 1(b)). Afterwards, f_i^5 is dispatched to the i-th level decoder for further resolution reconstruction and segmentation.
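The following is a minimal PyTorch sketch of the fusion defined by Eqs. (2)–(4), operating on the attention-refined, channel-reduced maps f_i^2. The class and variable names are ours, and choices such as align_corners=False for the bilinear interpolation are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDI(nn.Module):
    """Sketch of the Semantics and Detail Infusion (SDI) fusion step.

    For each target level i, every level-j map is resized to the level-i
    resolution (Eq. 2), smoothed with a 3x3 convolution (Eq. 3), and the
    resized maps are fused by an element-wise Hadamard product (Eq. 4).
    """

    def __init__(self, num_levels: int = 4, channels: int = 32):
        super().__init__()
        self.num_levels = num_levels
        # One 3x3 "SmoothConv" theta_ij per (target level i, source level j) pair.
        self.smooth_convs = nn.ModuleList([
            nn.ModuleList([
                nn.Conv2d(channels, channels, kernel_size=3, padding=1)
                for _ in range(num_levels)
            ])
            for _ in range(num_levels)
        ])

    def forward(self, feats):
        # feats: list of M tensors f_i^2, each of shape (B, c, H_i, W_i),
        # already attention-refined and channel-reduced to c channels.
        outputs = []
        for i, ref in enumerate(feats):
            h, w = ref.shape[-2:]
            fused = None
            for j, fj in enumerate(feats):
                if j < i:     # higher-resolution source: adaptive average pooling (D)
                    fij = F.adaptive_avg_pool2d(fj, (h, w))
                elif j == i:  # same level: identity mapping (I)
                    fij = fj
                else:         # lower-resolution source: bilinear upsampling (U)
                    fij = F.interpolate(fj, size=(h, w), mode="bilinear",
                                        align_corners=False)
                fij = self.smooth_convs[i][j](fij)             # Eq. (3)
                fused = fij if fused is None else fused * fij  # Hadamard product, Eq. (4)
            outputs.append(fused)
        return outputs  # one refined map f_i^5 per level, sent to the decoder
```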
Dataset     Method            DSC (%)   IoU (%)
ISIC 2017   U-Net [4]         86.99     76.98
            TransFuse [10]    88.40     79.21
            MALUNet [12]      88.13     78.78
            EGE-UNet [13]     88.77     79.81
            U-Net v2 (ours)   90.21     82.17
ISIC 2018   U-Net [4]         87.55     77.86
            UNet++ [5]        87.83     78.31
            TransFuse [10]    89.27     80.63
            SANet [9]         88.59     79.52
            EGE-UNet [13]     89.04     80.25
            U-Net v2 (ours)   91.52     84.15

Table 1. Experimental comparison with state-of-the-art methods on the two ISIC datasets.

Dataset     Method              DSC (%)       IoU (%)
ISIC 2017   UNet++ (PVT) [5]    89.60±0.17    81.16±0.07
            U-Net v2 w/o SDI    89.85±0.14    81.57±0.06
            U-Net v2 w/o SC     90.20±0.13    82.16±0.05
            U-Net v2 (ours)     90.21±0.13    82.17±0.05
ColonDB     UNet++ (PVT) [5]    78.0±4.3      69.6±3.9
            U-Net v2 w/o SDI    79.2±4.1      71.5±3.7
            U-Net v2 w/o SC     81.3±3.7      72.8±4.0
            U-Net v2 (ours)     81.2±3.9      73.1±4.4

Table 3. Ablation study on the ISIC 2017 and ColonDB datasets. SC denotes spatial and channel attentions.

Dataset      Method            DSC (%)   IoU (%)   MAE
Kvasir-SEG   U-Net [4]         81.8      74.6      0.055
             UNet++ [5]        82.1      74.3      0.048
             PraNet [8]        89.8      84.0      0.030
             SANet [9]         90.4      84.7      0.028
             Polyp-PVT [14]    91.7      86.4      0.023
             U-Net v2 (ours)   92.8      88.0      0.019
ClinicDB     U-Net [4]         82.3      75.5      0.019
             UNet++ [5]        79.4      72.9      0.022
             PraNet [8]        89.9      84.9      0.009
             SANet [9]         91.6      85.9      0.012
             Polyp-PVT [14]    93.7      88.9      0.006
             U-Net v2 (ours)   94.4      89.6      0.006
ColonDB      U-Net [4]         51.2      44.4      0.061
             UNet++ [5]        48.3      41.0      0.064
             PraNet [8]        71.2      64.0      0.043
             SANet [9]         75.3      67.0      0.043
             Polyp-PVT [14]    80.8      72.7      0.031
             U-Net v2 (ours)   81.2      73.1      0.030
ETIS         U-Net [4]         39.8      33.5      0.036
             UNet++ [5]        40.1      34.4      0.035
             PraNet [8]        62.8      56.7      0.031
             SANet [9]         75.0      65.4      0.015
             Polyp-PVT [14]    78.7      70.6      0.013
             U-Net v2 (ours)   79.0      70.5      0.013
Endoscene    U-Net [4]         71.0      62.7      0.022
             UNet++ [5]        70.7      62.4      0.018
             PraNet [8]        87.1      79.7      0.010
             SANet [9]         88.8      81.5      0.008
             Polyp-PVT [14]    90.0      83.3      0.007
             U-Net v2 (ours)   89.7      83.1      0.007

Table 2. Experimental comparison with state-of-the-art methods on the Polyp datasets.

3. EXPERIMENTS

3.1. Datasets

We evaluate our new U-Net v2 using the following datasets.

ISIC Datasets: Two skin lesion segmentation datasets are used: ISIC 2017 [15, 16], which comprises 2050 dermoscopy images, and ISIC 2018 [15], which contains 2694 dermoscopy images. For fair comparison, we follow the train/test split strategy outlined in [13].

Polyp Segmentation Datasets: Five datasets are used: Kvasir-SEG [17], ClinicDB [18], ColonDB [19], Endoscene [20], and ETIS [21]. For fair comparison, we use the train/test split strategy in [8]. Specifically, 900 images from ClinicDB and 548 images from Kvasir-SEG are used as the training set, while the remaining images serve as the test set.

3.2. Experimental Setup

We conduct experiments on an NVIDIA P100 GPU with PyTorch. Our network is optimized using the Adam optimizer, with an initial learning rate of 0.001, β1 = 0.9, and β2 = 0.999. We employ a polynomial learning rate decay with a power of 0.9. The maximum number of training epochs is set to 300. The hyper-parameter c is set to 32. Following the approach in [13], we report DSC (Dice Similarity Coefficient) and IoU (Intersection over Union) scores for the ISIC datasets. For the polyp datasets, we report DSC, IoU, and MAE (Mean Absolute Error) scores. Each experiment is run 5 times, and the averaged results are reported. We use the Pyramid Vision Transformer (PVT) [22] as the encoder for feature extraction.
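As a hedged illustration of this setup, the sketch below configures the optimizer with a per-epoch polynomial decay and computes DSC/IoU for binary masks; the helper names and the exact decay and metric formulations are our assumptions, not taken from the released code.

```python
import torch
from torch import nn, optim


def build_optimizer_and_scheduler(model: nn.Module, max_epochs: int = 300):
    """Adam with lr=1e-3, beta1=0.9, beta2=0.999, and a polynomial decay
    of the learning rate with power 0.9 over the training epochs."""
    optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    poly = lambda epoch: (1 - epoch / max_epochs) ** 0.9
    scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)
    return optimizer, scheduler  # call scheduler.step() once per epoch


def dsc_and_iou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """Dice Similarity Coefficient and IoU for binary masks (values in {0, 1})."""
    pred, target = pred.float().flatten(), target.float().flatten()
    inter = (pred * target).sum()
    dsc = (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (pred.sum() + target.sum() - inter + eps)
    return dsc.item(), iou.item()
```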
3.3. Results and Analysis

Comparison results with state-of-the-art methods on the ISIC datasets are presented in Table 1. As shown, our proposed U-Net v2 improves the DSC scores by 1.44% and 2.48%, and the IoU scores by 2.36% and 3.90%, on the ISIC 2017 and ISIC 2018 datasets, respectively. These improvements demonstrate the effectiveness of our proposed method for infusing semantic information and finer details into each feature map.
Model             DSC (ISIC 2017)   Input size         # Params (M)   GPU memory usage (MB)   FLOPs (G)   FPS
U-Net (PVT)       89.85             (1, 3, 256, 256)   28.15          478.82                  8.433       39.678
UNet++ (PVT)      89.60             (1, 3, 256, 256)   29.87          607.31                  19.121      34.431
U-Net v2 (ours)   90.21             (1, 3, 256, 256)   25.02          411.42                  5.399       36.631

Table 4. Comparison of computational complexity, GPU memory usage, and inference time, using an NVIDIA P100 GPU.

Comparison results with state-of-the-art methods on the polyp segmentation datasets are presented in Table 2. As shown, our proposed U-Net v2 outperforms Polyp-PVT [14] on the Kvasir-SEG, ClinicDB, ColonDB, and ETIS datasets, with DSC score improvements of 1.1%, 0.7%, 0.4%, and 0.3%, respectively. This underscores the consistent effectiveness of our proposed method in infusing semantic information and finer details into feature maps at each level.

3.4. Ablation Study

We conduct an ablation study using the ISIC 2017 and ColonDB datasets to examine the effectiveness of our U-Net v2, as reported in Table 3. Specifically, we use the PVT [22] model as the encoder for UNet++ [5]. Note that U-Net v2 is reverted to a vanilla U-Net with a PVT backbone when our SDI module is removed. SC denotes spatial and channel attentions within the SDI module. One can see from Table 3 that UNet++ exhibits a slight performance reduction compared to U-Net v2 without SDI (i.e., U-Net with the PVT encoder). This decrease may be attributed to the simple concatenation of multi-level features generated by dense connections, which could confuse the model and introduce noise. Table 3 demonstrates that the SDI module contributes the most to the overall performance, highlighting that our proposed skip connections (i.e., SDI) consistently yield performance improvements.

Fig. 2. Example segmentations from the ISIC 2017 dataset (columns: input, ground truth, ours, U-Net, UNet++, EGE-UNet). We use PVT as the encoder for U-Net and UNet++.

3.5. Qualitative Results

Some qualitative examples on the ISIC 2017 dataset are given in Fig. 2, which demonstrate that our U-Net v2 is capable of incorporating semantic information and finer details into the feature maps at each level. Consequently, our segmentation model can capture finer details of object boundaries.

3.6. Computation, GPU Memory, and Inference Time

To examine the computational complexity, GPU memory usage, and inference time of our U-Net v2, we report the parameters, GPU memory usage, FLOPs, and FPS (frames per second) for our method, U-Net [4], and UNet++ [5] in Table 4. The experiments use float32 as the data type, which results in 4 bytes of memory usage per variable. The GPU memory usage records the size of the parameters and intermediate variables that are stored during the forward/backward pass. (1, 3, 256, 256) represents the size of the input image. All the tests are conducted on an NVIDIA P100 GPU.

In Table 4, one can observe that UNet++ introduces more parameters, and its GPU memory usage is larger due to the storage of intermediate variables (e.g., feature maps) during the dense forward process. Typically, such intermediate variables consume much more GPU memory than the parameters. Furthermore, the FLOPs and FPS of U-Net v2 are also superior to those of UNet++. The FPS reduction of our U-Net v2 compared to U-Net (PVT) is limited.
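For reference, a rough way to obtain such numbers in PyTorch is sketched below. It is a hypothetical helper of ours: it measures parameter count, inference-time peak GPU memory (a simplification of the forward/backward accounting described above), and FPS; FLOPs counting would require an external profiler and is omitted.

```python
import time
import torch


@torch.no_grad()
def profile_model(model, input_size=(1, 3, 256, 256), warmup=10, iters=100):
    """Rough measurement of parameter count, peak GPU memory for one
    forward pass, and FPS on a single CUDA device."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)

    n_params = sum(p.numel() for p in model.parameters())  # reported in millions below

    torch.cuda.reset_peak_memory_stats(device)
    model(x)
    peak_mem_mb = torch.cuda.max_memory_allocated(device) / 1024 ** 2

    for _ in range(warmup):  # warm-up iterations before timing
        model(x)
    torch.cuda.synchronize(device)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize(device)
    fps = iters / (time.perf_counter() - start)

    return n_params / 1e6, peak_mem_mb, fps
```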
4. CONCLUSIONS

A new U-Net variant was introduced, U-Net v2, which features a novel and straightforward design of skip connections for improved medical image segmentation. This design explicitly integrates semantic information from higher-level features and finer details from lower-level features into the feature maps at each level produced by the encoder, using a Hadamard product. Experiments conducted on skin lesion and polyp segmentation datasets validated the effectiveness of our U-Net v2. Complexity analysis suggested that U-Net v2 is also efficient in FLOPs and GPU memory usage.
5. REFERENCES

[1] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," in IEEE CVPR, 2015, pp. 3431–3440.

[2] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia, "Pyramid scene parsing network," in CVPR, 2017, pp. 2881–2890.

[3] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia, "Path aggregation network for instance segmentation," in CVPR, 2018, pp. 8759–8768.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in MICCAI, Proceedings, Part III. Springer, 2015, pp. 234–241.

[5] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in DLMIA 2018. Springer, 2018, pp. 3–11.

[6] Jiawei Zhang, Yuzhen Jin, Jilan Xu, Xiaowei Xu, and Yanchun Zhang, "MDU-Net: Multi-scale densely connected U-Net for biomedical image segmentation," arXiv preprint arXiv:1812.00352, 2018.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.

[8] Deng-Ping Fan, Ge-Peng Ji, Tao Zhou, Geng Chen, Huazhu Fu, Jianbing Shen, and Ling Shao, "PraNet: Parallel reverse attention network for polyp segmentation," in MICCAI. Springer, 2020, pp. 263–273.

[9] Jun Wei, Yiwen Hu, Ruimao Zhang, Zhen Li, S Kevin Zhou, and Shuguang Cui, "Shallow attention network for polyp segmentation," in MICCAI, Proceedings, Part I 24. Springer, 2021, pp. 699–708.

[10] Yundong Zhang, Huiye Liu, and Qiang Hu, "TransFuse: Fusing Transformers and CNNs for medical image segmentation," in MICCAI, Proceedings, Part I 24. Springer, 2021, pp. 14–24.

[11] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, "CBAM: Convolutional block attention module," in ECCV, 2018, pp. 3–19.

[12] Jiacheng Ruan, Suncheng Xiang, Mingye Xie, Ting Liu, and Yuzhuo Fu, "MALUNet: A multi-attention and light-weight UNet for skin lesion segmentation," in BIBM. IEEE, 2022, pp. 1150–1156.

[13] Jiacheng Ruan, Mingye Xie, Jingsheng Gao, Ting Liu, and Yuzhuo Fu, "EGE-UNet: An efficient group enhanced UNet for skin lesion segmentation," arXiv preprint arXiv:2307.08473, 2023.

[14] Bo Dong, Wenhai Wang, Deng-Ping Fan, Jinpeng Li, Huazhu Fu, and Ling Shao, "Polyp-PVT: Polyp segmentation with Pyramid Vision Transformers," arXiv preprint arXiv:2108.06932, 2021.

[15] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al., "Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC)," arXiv preprint arXiv:1902.03368, 2019.

[16] Matt Berseth, "ISIC 2017 - skin lesion analysis towards melanoma detection," arXiv preprint arXiv:1703.00523, 2017.

[17] Debesh Jha, Pia H Smedsrud, Michael A Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D Johansen, "Kvasir-SEG: A segmented polyp dataset," in MMM, Part II 26, 2020, pp. 451–462.

[18] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño, "WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians," CMIG, vol. 43, pp. 99–111, 2015.

[19] Nima Tajbakhsh, Suryakanth R Gurudu, and Jianming Liang, "Automated polyp detection in colonoscopy videos using shape and context information," TMI, vol. 35, no. 2, pp. 630–644, 2015.

[20] David Vázquez, Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Antonio M López, Adriana Romero, Michal Drozdzal, Aaron Courville, et al., "A benchmark for endoluminal scene segmentation of colonoscopy images," Journal of Healthcare Engineering, vol. 2017, 2017.

[21] Juan Silva, Aymeric Histace, Olivier Romain, Xavier Dray, and Bertrand Granado, "Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer," Journal of CARS, vol. 9, pp. 283–293, 2014.

[22] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao, "Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions," in IEEE/CVF CVPR, 2021, pp. 568–578.
