Published in AIGuys


Vishal Rajput

Apr 1, 2022 · 8 min read


Attention U-Net, ResUnet, & many more


It's been a while since the inception of U-Net. The network was initially designed for medical image segmentation, but it has since been used for all sorts of segmentation tasks. In my limited experience, I've found that segmentation works better than object detection (given that you have the labeled data). The probable reason is that segmentation learns to classify each pixel as part of the object or not, whereas detection tries to learn the four coordinates of the box surrounding the object (a much tougher and more error-prone optimization task). In this blog, we are going to look at all the different versions of U-Net that are highly efficient and lift the performance of the original U-Net by multiple notches.

Following are the different versions of U-Net:

U-Net

V-Net

U-Net++

R2U-Net

Attention U-Net
ResUnet

U²-Net

UNET3+

TransUNET

Swin-UNET

Let's take a look at the exciting world of U-Net variants. Feel free to skip the explanation of U-Net itself if you are already familiar with it.

U-Net
To understand the architecture of U-Net, let's understand the given task first. Given an input image, the network should generate a segmentation mask, meaning each pixel is classified as belonging to the desired object or not (look at the figure below). The idea behind U-Net was that if we feed the image to an encoder that keeps decreasing the spatial size of the feature block, then after sufficient training, the network will generalize to store only the important features and discard the less useful data. Finally, a decoder applied to the encoder output will generate the desired mask. The problem was that the decoder layers were not getting enough context from the encoder output to generate the segmentation mask.

The great idea introduced in the U-Net paper to solve this context issue was to add skip connections from the encoder to the decoder before each size-reduction step. In the U-Net architecture given below, we can see that after every pair of Conv blocks the image is reduced by half, and from each pair of Conv blocks there is a skip connection that takes the features from the encoder and concatenates them to the decoder, giving the decoder enough context to generate a proper segmentation mask. If we replace the concatenation operation with addition, we get the network called LinkNet. LinkNet performs similarly to U-Net (in some cases even beating it).
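To make the two merge styles concrete, here is a minimal PyTorch sketch (my own illustration, not code from either paper) contrasting U-Net-style concatenation with LinkNet-style addition; all module names and channel counts are made up for the example.

```python
import torch
import torch.nn as nn

class UNetMerge(nn.Module):
    """U-Net style: concatenate skip features along the channel axis, then convolve."""
    def __init__(self, dec_ch, enc_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(dec_ch + enc_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, decoder_feat, encoder_feat):
        merged = torch.cat([decoder_feat, encoder_feat], dim=1)
        return torch.relu(self.conv(merged))

class LinkNetMerge(nn.Module):
    """LinkNet style: add skip features element-wise; channel counts must match."""
    def __init__(self, ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(ch, out_ch, kernel_size=3, padding=1)

    def forward(self, decoder_feat, encoder_feat):
        return torch.relu(self.conv(decoder_feat + encoder_feat))

# A 64-channel encoder feature merged into a 64-channel decoder feature.
dec = torch.randn(1, 64, 128, 128)
enc = torch.randn(1, 64, 128, 128)
print(UNetMerge(64, 64, 64)(dec, enc).shape)  # torch.Size([1, 64, 128, 128])
print(LinkNetMerge(64, 64)(dec, enc).shape)   # torch.Size([1, 64, 128, 128])
```

Note that addition keeps the channel count fixed going into the following convolution, while concatenation doubles it, which is part of why LinkNet is slightly lighter at the merge points.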

The basic principle behind each U-Net variation is to first decrease the feature block spatially and then increase it back to the original size, creating a bottleneck for learning the important features, and to add skip connections between the encoder and decoder so that the network has enough context while generating the segmentation mask.

V-Net
The original V-Net was proposed for 3D or volumetric data but it can be still used
with 2D images. The only difference between U-Net and V-Net is that V-Net uses a
convolutional layer replacing the up-sampling and down-sampling pooling layer.
The idea behind V-Net is that using the Maxpool operation leads to a lot of
information loss thus replacing it with another series of Conv operations without
padding will help in preserving more information.
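As a minimal illustration (2D for simplicity, though V-Net itself is 3D and uses 2x2x2 kernels with stride 2), here is a hedged sketch of the difference between a fixed pooling step and a learned strided-convolution step; the channel counts are illustrative only.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 64, 64)

# Fixed downsampling: max pooling halves the spatial size, keeps channels,
# and has no learnable parameters.
maxpool_down = nn.MaxPool2d(kernel_size=2, stride=2)

# Learned downsampling: a strided convolution halves the spatial size and
# can simultaneously change the channel count (V-Net doubles it).
conv_down = nn.Conv2d(32, 64, kernel_size=2, stride=2)

print(maxpool_down(x).shape)  # torch.Size([1, 32, 32, 32])
print(conv_down(x).shape)     # torch.Size([1, 64, 32, 32])
```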


V-Net architecture (Image taken from original V-Net paper)

U-Net++
This network took the idea of skip connection even one step further. Why take the
contextual information from the same spatial dimension between encoder and
decoder. Instead, they took the context from each encoder level (spatial size-wise)
and scaled it accordingly to feed it to every level in the decoder. See the given below
image to understand it clearly.
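Here is a minimal sketch of one U-Net++ "nested" node, in my own simplified reading of the paper: each intermediate node concatenates every earlier feature map at its own spatial level with an upsampled feature map from the level below, then convolves. All names and channel counts are illustrative.

```python
import torch
import torch.nn as nn

class NestedNode(nn.Module):
    def __init__(self, same_level_ch, below_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(same_level_ch + below_ch, out_ch,
                              kernel_size=3, padding=1)

    def forward(self, same_level_feats, below_feat):
        # same_level_feats: list of all previous node outputs at this depth.
        feats = same_level_feats + [self.up(below_feat)]
        return torch.relu(self.conv(torch.cat(feats, dim=1)))

# Two earlier 32-channel maps at this level plus a 64-channel map from below.
node = NestedNode(same_level_ch=2 * 32, below_ch=64, out_ch=32)
out = node([torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)],
           torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 32, 64, 64])
```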


The image is taken from the original U-Net++ paper

R2U-Net
This paper tried to use the idea of the recurrent neural network to give the temporal
dynamic behavior to the network. I’m not sure why they think this is going to give
better results but in my testing, it completely failed to converge on a custom
dataset.
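For intuition, here is a minimal sketch of a recurrent convolutional unit in the spirit of R2U-Net (a simplified reading, not the paper's exact block, which stacks such units inside a residual wrapper): the same convolution is applied several times, re-injecting the block input at each step.

```python
import torch
import torch.nn as nn

class RecurrentConvUnit(nn.Module):
    def __init__(self, channels, steps=2):
        super().__init__()
        self.steps = steps
        # One shared convolution, applied repeatedly (the "recurrence").
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        h = torch.relu(self.conv(x))
        for _ in range(self.steps):
            # Each step sees the original input plus the evolving state.
            h = torch.relu(self.conv(x + h))
        return h

print(RecurrentConvUnit(32)(torch.randn(1, 32, 64, 64)).shape)
# torch.Size([1, 32, 64, 64])
```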


Attention U-Net
This network borrowed the idea of an attention mechanism from NLP and used it in
skip connections. It gave the skip connections an extra idea of which region to focus
on while segmenting the given object. This works great even with very small objects
due to the attention present in the skip connections. This one is a little bit more
complex to implement on your own from scratch but the idea behind this is quite
ingenious and simple.


Attention U-Net (Image taken from the original Attention U-Net paper)

Attention mechanism (Image taken from the original Attention U-Net paper)

How the attention mechanism works is as follows (a code sketch follows the list):

1. The attention gate takes in two inputs, vectors x and g.

2. The vector, g, is taken from the next lowest layer of the network. The vector has
smaller dimensions and better feature representation, given that it comes from
deeper into the network.


3. In the example figure above, vector x would have dimensions of 64x64x64 (filters x height x width) and vector g would be 32x32x32.

4. Vector x goes through a strided convolution such that its dimensions become 64x32x32, and vector g goes through a 1x1 convolution such that its dimensions become 64x32x32.

5. The two vectors are summed element-wise. This process results in aligned weights becoming larger while unaligned weights become relatively smaller.

6. The resultant vector goes through a ReLU activation layer and a 1x1 convolution
that collapses the dimensions to 1x32x32.

7. This vector goes through a sigmoid layer which scales the vector between the
range [0,1], producing the attention coefficients (weights), where coefficients
closer to 1 indicate more relevant features.

8. The attention coefficients are upsampled to the original dimensions (64x64) of the x vector using trilinear interpolation. The attention coefficients are multiplied element-wise with the original x vector, scaling it according to relevance. The result is then passed along the skip connection as normal.
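Here is a minimal 2D sketch of the attention gate following the steps above. Note that the paper works with 3D volumes (hence trilinear interpolation); this 2D version uses bilinear interpolation instead, and the channel sizes come from the example figure (x: 64 channels at 64x64, g: 32 channels at 32x32).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    def __init__(self, x_ch=64, g_ch=32, inter_ch=64):
        super().__init__()
        # Step 4: a strided conv brings x to g's spatial size; a 1x1 conv
        # lifts g to the shared intermediate channel count.
        self.theta_x = nn.Conv2d(x_ch, inter_ch, kernel_size=2, stride=2, bias=False)
        self.phi_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1, bias=False)
        # Step 6: collapse to a single-channel attention map.
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        theta = self.theta_x(x)              # (B, 64, 32, 32)
        phi = self.phi_g(g)                  # (B, 64, 32, 32)
        f = F.relu(theta + phi)              # step 5: element-wise sum + ReLU
        alpha = torch.sigmoid(self.psi(f))   # step 7: coefficients in [0, 1]
        alpha = F.interpolate(alpha, size=x.shape[2:], mode="bilinear",
                              align_corners=False)  # step 8: back to x's size
        return x * alpha                     # re-weight the skip features

gate = AttentionGate()
out = gate(torch.randn(1, 64, 64, 64), torch.randn(1, 32, 32, 32))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```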

ResUnet
ResUnet is a very interesting idea that takes the performance gain of Residual
networks and uses it with the U-Net. Given below is the architecture of ResUnet. In
my testing, I’ve found that it is a very capable network but with a slightly large
number of parameters.
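Here is a minimal sketch of the residual building block idea that ResUnet borrows (a generic pre-activation residual block, not the paper's exact layer ordering): the block output is body(x) plus a shortcut, where the shortcut becomes a 1x1 projection whenever the shape changes.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
        )
        # Identity shortcut, or a 1x1 projection when the shape changes.
        self.shortcut = (nn.Identity() if in_ch == out_ch and stride == 1
                         else nn.Conv2d(in_ch, out_ch, 1, stride=stride))

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

print(ResidualConvBlock(32, 64, stride=2)(torch.randn(1, 32, 64, 64)).shape)
# torch.Size([1, 64, 32, 32])
```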


The image is taken from the original ResUnet paper

U²-Net
It uses the idea of U-Net and implements that in each Conv block. It is basically a U-
Net of U-Net. If you want to know more about this network click here.
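To illustrate the nesting, here is a deliberately tiny sketch of a "U-Net inside a block" (the real U²-Net residual U-blocks are deeper and use dilated convolutions; this only shows a one-level nested encoder-decoder with a residual connection).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.enc = nn.Conv2d(ch, ch, 3, padding=1)
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)
        self.dec = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, x):
        e = F.relu(self.enc(x))              # encoder level of the inner U
        m = F.relu(self.mid(self.down(e)))   # inner bottleneck
        u = F.interpolate(m, size=e.shape[2:], mode="bilinear",
                          align_corners=False)
        d = F.relu(self.dec(torch.cat([e, u], dim=1)))  # inner decoder + skip
        return x + d                         # residual connection around the U

print(TinyUBlock(32)(torch.randn(1, 32, 64, 64)).shape)
# torch.Size([1, 32, 64, 64])
```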


The image is taken from the original U²-Net paper

UNET3+
This is similar to UNet++ but with fewer parameters. This works extremely well,
comparable to Attention U-Net but with even fewer parameters. Another novel idea
in this paper is that classification results are also used to augment the process of
segmentation (it’s called the Classification Guided Module). Explaining the full
implementation detail of this paper is beyond this blog.
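As a rough, simplified reading of the paper's full-scale skip connections (not its exact block): every decoder node gathers features from all scales, resizes each to its own resolution, and reduces each to the same small channel count before fusing, which is where the parameter savings over UNet++ come from. Everything below is illustrative; the paper itself uses max pooling rather than interpolation on the downsampling paths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleNode(nn.Module):
    def __init__(self, in_channels_list, per_scale_ch=16):
        super().__init__()
        # One cheap 3x3 conv per incoming scale, all mapping to the same width.
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, per_scale_ch, 3, padding=1) for c in in_channels_list
        )
        n = per_scale_ch * len(in_channels_list)
        self.fuse = nn.Conv2d(n, n, 3, padding=1)

    def forward(self, feats, out_size):
        # Resize every scale to this node's resolution, then fuse.
        resized = [F.interpolate(conv(f), size=out_size, mode="bilinear",
                                 align_corners=False)
                   for conv, f in zip(self.reduce, feats)]
        return torch.relu(self.fuse(torch.cat(resized, dim=1)))

# Three feature maps from different scales fused at 64x64.
node = FullScaleNode([32, 64, 128])
out = node([torch.randn(1, 32, 128, 128),
            torch.randn(1, 64, 64, 64),
            torch.randn(1, 128, 32, 32)], out_size=(64, 64))
print(out.shape)  # torch.Size([1, 48, 64, 64])
```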


The image is taken from the original UNet3+ paper

TransUNET
TransUNET is based on the idea of using Transformers. Recently Vision Image
transformers made a huge noise in the field thus researchers of this paper thought
why not add that to UNet as well. It is indeed a very capable network but takes a lot
of time to train, much slower to train compared to other variants of UNet (I couldn’t
run this because it has more than 400 million parameters). Explaining the
mechanism of the transformer is another blog in itself. But if you are interested in
Vision Transformers click here.


Swin-UNET
This architecture is also based on Transformers but this time it is Swin-
transformers. This is also pretty slow to train but manageable with an RTX-GPU. In
my testing, it gave some decent results but was still too big and slow compared to
smaller, faster, and better Attention U-Net. To know more about Swin Transformers
click here.


The image is taken from the original Swin-UNET paper

Conclusion
In my testing of all these U-Net variants on a custom dataset, I found that Attention U-Net and UNET3+ are the best-performing networks with a limited number of parameters (less than 10 million). Other networks might outperform these two, but they require a huge amount of data and computation power.


References:

U-Net: Convolutional Networks for Biomedical Image Segmentation (arxiv.org)

V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation (arxiv.org)

UNet++: A Nested U-Net Architecture for Medical Image Segmentation (arxiv.org)

Recurrent Residual Convolutional Neural Network based on U-Net (R2U-Net) for Medical Image Segmentation (arxiv.org)

Attention U-Net: Learning Where to Look for the Pancreas (arxiv.org)

ResUNet-a: a deep learning framework for semantic segmentation of remotely sensed data (arxiv.org)

UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation (arxiv.org)

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation (arxiv.org)

Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation (arxiv.org)
