
Medical Image Analysis 76 (2022) 102313


ResGANet: Residual group attention network for medical image classification and segmentation

Junlong Cheng a,b, Shengwei Tian c,d,∗, Long Yu a, Chengrui Gao b, Xiaojing Kang e, Xiang Ma f, Weidong Wu e, Shijia Liu a, Hongchun Lu g

a College of Information Science and Engineering, Xinjiang University, Urumqi 830000, China
b College of Computer Science, Sichuan University, Chengdu 610065, China
c College of Software Engineering, Xinjiang University, Urumqi 830000, China
d Key Laboratory of Software Engineering Technology, Xinjiang University, China
e Xinjiang Key Laboratory of Dermatology Research, People's Hospital of Xinjiang Uygur Autonomous Region, China
f The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830000, China
g School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 610031, Sichuan, China

ARTICLE INFO

Article history:
Received 24 July 2020
Revised 25 October 2021
Accepted 22 November 2021
Available online 26 November 2021

Keywords:
Deep learning
Medical image analysis
Residual group attention network
Image classification
Image segmentation

ABSTRACT

In recent years, deep learning technology has shown superior performance in different fields of medical image analysis. Some deep learning architectures have been proposed and used for computational pathology classification, segmentation, and detection tasks. Due to their simple, modular structure, most downstream applications still use ResNet and its variants as the backbone network. This paper proposes a modular group attention block that can capture feature dependencies in medical images in two independent dimensions: channel and space. By stacking these group attention blocks in ResNet-style, we obtain a new ResNet variant called ResGANet. The stacked ResGANet architecture has 1.51–3.47 times fewer parameters than the original ResNet and can be directly used for downstream medical image segmentation tasks. Many experiments show that the proposed ResGANet is superior to state-of-the-art backbone models in medical image classification tasks. Applying it to different segmentation networks can improve the baseline model in medical image segmentation tasks without changing the network architecture. We hope that this work provides a promising method for enhancing the feature representation of convolutional neural networks (CNNs) in the future.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

Image classification is the primary task of computer vision. A deep neural network trained on large-scale datasets (such as ImageNet (Russakovsky et al., 2015)) is used as a backbone network to extract representative features for various downstream tasks, involving object detection (Litjens et al., 2017; He et al., 2017) and segmentation (Long et al., 2015; Zhu et al., 2019). A network with good classification performance can usually mine features that are more relevant to the current task and thus benefit downstream tasks. Therefore, enhancing the feature representation ability of CNNs is the focus of our research.

At present, the latest work on medical image segmentation (Alom et al., 2018; Kaul et al., 2019; Cheng et al., 2020) still uses ResNet (He et al., 2016) or one of its variants (Woo et al., 2018; Li et al., 2019; Hu et al., 2018; Gao et al., 2019; Xie et al., 2017) as the backbone CNN. A model with a simple modular design and the ability to extract features effectively can quickly adapt to various medical image processing tasks. However, ResNet was initially designed for specific image classification tasks (Russakovsky et al., 2015; Krizhevsky et al., 2010), has a limited receptive field size, and lacks cross-channel and cross-spatial interactions. It may therefore not be suitable as a backbone used directly for downstream medical image tasks. This means that for a given computational pathology task, it is necessary to manually adjust the network architecture to modify ResNet and make it more effective for that specific task. For example, some methods add squeeze-and-excitation blocks (Kaul et al., 2019; Woo et al., 2018), introduce long-range connection methods (Alom et al., 2018; Ronneberger et al., 2015), or add pyramid modules (Chen et al., 2014, 2017, 2018). In addition, applying attention modules (Woo et al., 2018; Hu et al., 2018; Fu et al., 2019) or non-local blocks (He et al., 2019; Cao et al., 2019; Wang et al., 2018) in downstream tasks has proven to be effective.

Recent research on image classification networks has focused more on group or depthwise convolutions (Xie et al., 2017; Howard et al., 2017, 2019; Tan et al., 2019).

∗ Corresponding author at: College of Software Engineering, Xinjiang University, Urumqi 830000, China.
E-mail address: tianshengwei@163.com (S. Tian).

Fig. 1. Overview of the group attention block.

Although the above methods can indeed improve the learning performance of specific computer vision tasks, these improvements are limited to a single aspect. For example, Res2Net (Gao et al., 2019) increases the receptive field in the residual block; ResNeXt (Xie et al., 2017) uses grouped convolution to improve accuracy; and CBAM (Woo et al., 2018) adds a spatial attention module on top of SENet (Hu et al., 2018) to enhance the feature representation ability. Because these models only improve the performance of specific computer vision tasks at a single level, their performance on other tasks is often far inferior to that on their original target task when we transfer them. Therefore, it is highly desirable to construct a general backbone network with rich feature representations that simultaneously improves the performance of different medical computer vision tasks.

In the first section of this paper, we explore the modification of the ResNet architecture. We divide the feature map into several groups and emphasize the channel dependence between any two channel maps in a group. Meanwhile, the spatial relationship between features is used to generate a spatial attention map. More specifically, we first divide each group into four subgroups along the channel direction of the feature map. Then, we send the feature map after feature transformation to the channel attention module to obtain a channel attention map with the same number of groups and apply a weighted sum to all grouped attention maps. Finally, we use the spatial relationship of features to aggregate the features of each location to ensure that similar features are mutually promoted across the spatial extent. We call this unit a group attention block (as shown in Fig. 1); it has a high degree of modularity and functionalization. By stacking several group attention blocks, we can create a ResNet-like network called ResGANet. Our architecture has fewer parameters than ResNet. In addition, ResGANet can process different medical image data: not only does it perform well in medical image classification tasks, but it is also easily used as the basis for medical image segmentation tasks.

The second part of this paper studies the effect of feature transformation on model performance in grouped convolution. Unlike the usual group convolution, we perform a simple spatial transformation of the feature maps within each group, further strengthening the weight of some essential features and alleviating the CNN's limited ability to work with spatially constant input data. In Section 4.3, we illustrate the effect of feature transformation on network performance through ablation experiments.

The third section of this paper benchmarks the application of medical image classification and segmentation. We find that the accuracy of ResGANet on the two medical image classification datasets is higher than that of the current ResNet and its variants, and it can also maintain excellent performance when we directly use ResGANet as the backbone network for medical image segmentation. Moreover, we also design a decoding module for medical image segmentation, called the multiscale atrous spatial pyramid pooling module (MsASPP), which is used to cooperate with ResGANet to obtain more accurate medical image segmentation results. All experimental results can be found in Sections 4.4 and 4.5.

2. Related work

2.1. Modern architectural design

In recent years, many novel network architectures have emerged (He et al., 2016; Krizhevsky et al., 2012; Simonyan et al., 2014; Szegedy et al., 2015). The emergence of these architectures has led to deep CNNs occupying the dominant position in image classification and being seen as state-of-the-art technology in many computer vision tasks. AlexNet (Krizhevsky et al., 2012) implemented the basic principles of a CNN and applied them to deeper and wider networks. After that, VGGNet (Simonyan and Zisserman, 2014) successfully constructed 16- to 19-layer deep convolutional neural networks to expand the receptive field, enabling the network to extract features over a more extensive range. GoogleNet (Szegedy et al., 2015) used parallel filters of different kernel sizes to enhance multiscale representation capabilities. Based on the success of previous work, ResNet (He et al., 2016) introduced identity skip connections to alleviate problems such as gradient vanishing or explosion and deepened the network.


ResNet has become one of the most successful CNN frameworks and is widely used in various computer vision tasks.

2.2. Multipath and attention mechanisms

The InceptionNet series (Szegedy et al., 2015; Ioffe and Szegedy, 2015; Szegedy et al., 2016) has achieved widespread success with multipath representations. They stacked filters of different kernel sizes in each of several parallel paths to further expand the size of the receptive field. ResNeXt changed the convolution in the residual block of ResNet to a group convolution and converted the multipath structure into a unified operation. ResNeXt can increase accuracy without increasing the complexity of the parameters while also reducing the number of hyperparameters. ShuffleNet (Zhang et al., 2018) evenly mixes the feature maps after group convolution by shuffling channels so as to obtain global information better. The network dramatically reduces the computation of the model while maintaining accuracy. SENet used the interdependence between channels for modeling to improve the network's feature representation ability and won the championship in the image classification task of the last ImageNet competition. CBAM improved on SENet, applying attention-based feature refinement in two different modules: channel and space. Inspired by these methods, our network summarizes channel attention into a feature map group representation and aggregates all the grouped spatial information through the spatial attention module to enhance the feature representation ability within a single residual block.

2.3. Feature transformation learning

Feature transformation learning is beneficial for enhancing the feature representation ability of CNNs. Lenc and Vedaldi (2015) explored the invariance and equivariance of CNNs to input image transformations by estimating the linear relationship between the original and transformed images. Gens and Domingos (2014) proposed a deep symmetric network that uses sparse high-dimensional feature maps to deal with high-dimensional groups of transformations. Dieleman et al. (2015) showed that rotational symmetry can be achieved by rotating feature maps in CNNs; this symmetry can be used to construct a rotation-invariant convolutional neural network for galaxy morphology classification. Spatial transformer networks (Jaderberg et al., 2015) insert a spatial transform module into the existing convolutional structure, so the CNN can actively perform spatial transformations according to the feature map without additional training and supervision and successfully perform small-scale image classification. This work was later extended to various computer vision problems for evaluating cyclic symmetry (Lin et al., 2017). Our method applies a simple feature transformation to the feature map without increasing the amount of computation, allowing the CNN to obtain richer feature information in a single residual block and improving the network's overall performance. We believe that the affine transformation of feature maps helps improve the feature representation ability of the CNN, which can be studied further in future work.

3. Residual group attention network

We now introduce the group attention block, which makes channel attention between different feature map groups possible. At the same time, the feature map groups weighted by channel attention aggregate all the grouped spatial information through the spatial attention module, which improves the feature representation ability of the CNN.

3.1. Group attention block

Our group attention block is a computing unit similar to the residual block in ResNet. It consists of a feature map group, a feature transformation, and channel and spatial attention operations. Fig. 1 depicts an overview of the group attention block.

3.1.1. Feature map group

In ResNeXt blocks, features can be divided into groups for convolution operations, and the number of feature map groups is determined by the "cardinality". Similar to ResNeXt, we divide the input features into N groups and concurrently introduce a new parameter "S", which indicates the number of groups for channel shuffling (Zhang et al., 2018) and the number of subgroups in each group. The purpose of channel shuffling is to help channel information flow without increasing the amount of computation. This operation is conducive to the feature subgroup transformation. Finally, the total number of feature groups is G = N · S. In this article, we fix "S" to 4.

3.1.2. Feature transformation

After channel shuffling, we use Eq. (1) to perform a simple feature transformation on the subgroups in each group.

$g(r, i, j) = \begin{bmatrix} \cos(r\pi/2) & -\sin(r\pi/2) \\ \sin(r\pi/2) & \cos(r\pi/2) \end{bmatrix} \begin{bmatrix} i \\ j \end{bmatrix}$   (1)

Here, 0 ≤ r < 4, and (i, j) represents the coordinates of each value in the original matrix.

We use K(·) to represent the 3 × 3 convolution of the bottleneck block in ResNet and y_s to represent the output of K(·). Then, for each input x_s, we have:

$y_s = \begin{cases} K(g_r(x_s)), & r, s = 0, \\ K(g_r(x_s)) \odot y_0, & 0 < r = s < 4, \end{cases}$   (2)

Here, g_r(·) represents performing the corresponding feature transformation on the input matrix x_s, and "⊙" represents elementwise multiplication. Note that each 3 × 3 convolution operator K(·) receives feature information from all x_s undergoing feature transformations. The outputs y_s contain the same number of different types of feature maps. We use elementwise multiplication to enhance the distinguishability between channel features.

3.1.3. Channel and spatial attention modules

Using the interdependence between channel maps can improve the feature representation of specific semantics. We treat each channel of the feature map as a feature detector. As shown in Fig. 2(A), we send the feature map G^n ∈ R^(C/N×H×W) of the n-th group to the channel attention module, where n ∈ {1, 2, ..., N}. First, global contextual information (Woo et al., 2018; Li et al., 2019; Hu et al., 2018) with embedded channel statistics is collected through global avg-pooling (GAP) across the spatial dimensions. Then, a shared fully connected layer is used to infer a 1D channel attention map C^n ∈ R^(C/N):

$C^n = D_{\mathrm{Sigmoid}}(D_{\mathrm{ReLU}}(\mathrm{GAP}(G^n)))$   (3)

"D_Sigmoid" and "D_ReLU" represent the shared fully connected layers that use "Sigmoid" and "ReLU" as activation functions.

Finally, the Hadamard product of the inferred grouped attention map and the corresponding input feature is computed, and all the grouped features are weighted and summed to obtain the final channel attention feature map C ∈ R^(C/N×H×W):

$C = \sum_{n=1}^{N} (C^n \odot G^n)$   (4)

Here "⊙" represents elementwise multiplication.
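As a concrete illustration of Sections 3.1.1 and 3.1.2, the following is a minimal sketch (not the authors' released code) of channel shuffling and of the rotation mapping g(r, i, j) of Eq. (1) applied to the S = 4 subgroups of one group. Channels-last tensors, square feature maps, and the exact grouping choices in the usage example are assumptions made for illustration.

```python
import tensorflow as tf

def channel_shuffle(x, groups):
    # Evenly remix channels across `groups` (Zhang et al., 2018); x is (B, H, W, C).
    b, h, w, c = tf.unstack(tf.shape(x))
    x = tf.reshape(x, (b, h, w, groups, c // groups))
    x = tf.transpose(x, (0, 1, 2, 4, 3))      # swap the group and per-group axes
    return tf.reshape(x, (b, h, w, c))

def rotate_subgroups(group, num_subgroups=4):
    # Split one group into S = 4 subgroups along the channel axis and rotate the
    # r-th subgroup by r * 90 degrees, i.e. apply g(r, i, j) from Eq. (1).
    # Square feature maps are assumed so the rotation preserves the shape.
    subgroups = tf.split(group, num_subgroups, axis=-1)
    return [tf.image.rot90(s, k=r) for r, s in enumerate(subgroups)]

# Example: a (1, 32, 32, 64) feature map split into N = 2 groups of 4 subgroups.
x = tf.random.normal((1, 32, 32, 64))
x = channel_shuffle(x, groups=4)              # S = 4 shuffle groups (assumed)
g0, g1 = tf.split(x, 2, axis=-1)              # the N = 2 parallel groups
rotated_parts = rotate_subgroups(g0)          # inputs to the 3x3 convolutions K(.)
```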


Fig. 2. The details of the channel and spatial attention modules are illustrated in (A) and (B).

The weights of the convolutional layer with a 1 × 1 kernel in each group and of the convolutional layer with a 3 × 3 kernel in the subgroups are shared. Therefore, the equal weighting of channel attention is equivalent to adding the channel attention weights obtained by each group, which does not affect the global feature dependency.

Fig. 2(B) shows that we use the spatial attention module to aggregate spatial relationships to ensure that similar features are mutually promoted across the spatial extent. It differs from the channel attention module. First, we use both global avg-pooling (GAP) and global max-pooling (GMP) (Woo et al., 2018) to aggregate the spatial information of the feature maps and generate two different context descriptors. Then, these two descriptors GAP(C) ∈ R^(1×H×W) and GMP(C) ∈ R^(1×H×W) are concatenated to obtain S_C ∈ R^(2×H×W):

$S_C = \mathrm{GAP}(C) + \mathrm{GMP}(C)$   (5)

Here, "+" indicates feature map concatenation. Finally, the spatial weight information S_conv ∈ R^(1×H×W) is obtained through a standard convolutional layer. To maintain the original spatial size, we perform elementwise multiplication between S_conv and the input feature map C to get the final spatial attention map S ∈ R^(C/N×H×W):

$S = \mathrm{Conv}_{3\times3}(S_C) \odot C$   (6)

"Conv_{3×3}" represents a standard convolution, and its activation function is "Sigmoid".

3.2. ResGANet block

Let "x" denote the input, and let F(·) denote the series of operations of the group attention block. Then, the output of each group attention block can be expressed as F(x). Like ResNet, each of our ResGANet blocks uses residual learning. If the input and output feature maps share the same shape, we obtain Y = F(x) + x. For blocks with a stride, appropriate transformations T(·) and T1(·) are applied to align the output shapes: Y = F(T1(x)) + T(x), where T(·) acts on the shortcut connection. In ResNet and its variants, T(·) can be max-pooling or a convolution with a stride of 2. This article introduces a new stride operation T1(·); T1(·) denotes avg-pooling with a pool size of 4, and the output shapes of T(·) and T1(·) should be consistent. It should be noted that the ResGANet block does not use feature transformation operations when the input features and the output features have the same shape, which is beneficial for maintaining the consistency of global features.
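To make the two attention operations concrete, here is a minimal Keras-style sketch of the grouped channel attention of Eqs. (3)–(4) and the spatial attention of Eqs. (5)–(6). The reduction ratio of 2 and the 3 × 3 sigmoid convolution follow the text above; the layer wiring and names are otherwise illustrative, and the avg/max descriptors in the spatial branch are taken to act along the channel axis in the CBAM style (Woo et al., 2018).

```python
import tensorflow as tf
from tensorflow.keras import layers

def grouped_channel_attention(groups):
    # groups: list of N tensors G_n of shape (B, H, W, C/N).
    c = groups[0].shape[-1]
    fc_reduce = layers.Dense(c // 2, activation="relu")      # shared D_ReLU, ratio 2
    fc_expand = layers.Dense(c, activation="sigmoid")        # shared D_Sigmoid
    attended = []
    for g in groups:
        s = layers.GlobalAveragePooling2D()(g)                # GAP(G_n)
        a = fc_expand(fc_reduce(s))                           # C_n of Eq. (3)
        attended.append(g * layers.Reshape((1, 1, c))(a))     # C_n (Hadamard) G_n
    return layers.Add()(attended)                             # weighted sum, Eq. (4)

def spatial_attention(c_map):
    # c_map: the channel-attended features C of shape (B, H, W, C/N).
    avg_pool = tf.reduce_mean(c_map, axis=-1, keepdims=True)  # 1 x H x W descriptor
    max_pool = tf.reduce_max(c_map, axis=-1, keepdims=True)   # 1 x H x W descriptor
    s_c = layers.Concatenate()([avg_pool, max_pool])           # S_C of Eq. (5)
    w = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(s_c)
    return w * c_map                                            # S of Eq. (6)
```

In the full block, the output of spatial_attention would be combined with the shortcut branch, where the strided shortcut of Section 3.2 uses average pooling with a pool size of 4 (the T1(·) operator).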


3.3. Instantiations

Fig. 1 shows instantiations of the group attention block. In the channel attention module, the reduction ratio of the shared fully connected layer is 2. To evaluate the performance of ResGANet, we consider both 50- and 101-layer bottleneck structures. For simplicity, we maintain all hyperparameters related to ResNet. Unless otherwise specified, the number of groups defaults to 2.

3.3.1. Relation to grouped convolutions

The usual grouped convolution adopts a split-transform-merge strategy. For example, Res2Net performs convolution within a single residual block in a hierarchical form, or the convolution operates in parallel in multiple identical groups, as in ResNeXt. In contrast to the above methods, we first strengthen the flow of channel information in the feature map group by channel shuffling. Then, we divide each parallel group into four subgroups and perform different feature transformations. Finally, we use convolution operations to extract various features and fuse them. This dramatically increases the distinguishability of the features within the group, thereby enhancing the feature representation ability of the CNN.

3.3.2. Relation to existing attention methods

As shown in Fig. 3, SENet (Hu et al., 2018) first proposed and used the global context to predict channel weights. After that, CBAM (Woo et al., 2018) extended and improved SENet, using the global max-pooling layer and the global avg-pooling layer to infer spatial and channel attention. However, SENet and CBAM add a corresponding attention module at the top of each residual block and do not consider multiple grouping situations. In previous work, SKNet (Li et al., 2019) introduced channel feature attention between two different network branches to add a multipath and dynamic selection design without excessive overhead, but it did not consider the importance of the feature space dimension. DANet (Fu et al., 2019) uses a dual attention network to capture the global feature dependence in the spatial and channel dimensions and improve semantic segmentation results. However, it only adds these two types of attention modules to a traditional dilated FCN and does not optimize the training efficiency or the extension to large-scale neural networks.

Our work relies on existing methods of attaching attention blocks, but it is also very different. We extend the channel attention to each grouping, and this method remains effective in actual calculations. When Group = 1, ResGANet applies channel attention to the set of feature subgroups. In addition, we send the weighted and summed channel attention map of each group to the spatial attention module, which retains the importance of each feature in the channel dimension and increases the weight of the features in the spatial extent that are valuable for the current task. It should be noted that ResGANet is similar to SKNet in that both attention modules are integrated into the residual structure, while SENet, CBAM, and DANet are plug-and-play modules that embed the attention modules on the residual block or the entire network.

3.3.3. Parameter calculation

To illustrate the number of parameters intuitively, we take a residual block as an example to calculate the number of parameters of the ResNet block and the ResGANet block; the shortcut connection is not considered in Fig. 4. Assuming that the input feature map is H × W × C and the output feature map is H × W × C', the parameter count of a standard 3 × 3 convolution is 9(C · C').

In the residual block, the sizes of the input feature map and the output feature map are equal (i.e., H × W × C). The feature map generated by the middle layer is H × W × C', and the computational cost of a residual block is:

$2(C \cdot C') + 9(C' \cdot C')$   (7)

Fig. 4(B) shows that the ResGANet block divides the input feature map into N groups for calculation. The number of parameters before the spatial attention module is N times the number of parameters of a single group. First, we calculate the number of parameters before the channel attention module for a single group:

$\frac{C \cdot C'}{N} + \frac{9(C' \cdot C')}{4N^2}$   (8)

Two shared fully connected layers are used in the channel attention module (the dimensionality reduction factor is 2), and the computational cost of the channel attention module is:

$\frac{C'}{N} \cdot \frac{C'}{2N} + \frac{C'}{2N} \cdot \frac{C'}{N} = \frac{C' \cdot C'}{N^2}$   (9)

Then, the calculation cost of the N groups is N times the sum of (8) and (9):

$\left( \frac{C \cdot C'}{N} + \frac{9(C' \cdot C')}{4N^2} + \frac{C' \cdot C'}{N^2} \right) \cdot N = C \cdot C' + \frac{13(C' \cdot C')}{4N}$   (10)

The spatial attention module uses a standard convolution with a 3 × 3 kernel to obtain the weight value of each spatial position in the feature map, and its parameter count is fixed at 18 (the calculation formula is 3 · 3 · 2 · 1). Finally, the cost of a 1 × 1 convolution is (C · C')/N.

Therefore, the total number of parameters of a ResGANet block is:

$\left( C \cdot C' + \frac{13(C' \cdot C')}{4N} \right) + \frac{C \cdot C'}{N} + 18 = \frac{(N+1) \cdot C \cdot C'}{N} + \frac{13(C' \cdot C')}{4N} + 18$   (11)

To compare the number of parameters, we replace C with 4 · C' and ignore the parameters of the spatial attention module. The parameter ratio of the ResNet block to the ResGANet block ((7)/(11)) is:

$\frac{68N}{16N + 29}$   (12)

When N = 1, ResGANet has 1.51 times fewer parameters than ResNet, and when the maximum grouping N = 8, it has 3.47 times fewer parameters than ResNet.
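The arithmetic of Eqs. (7)–(12) can be checked numerically. The short script below is only an illustration of that arithmetic (with C = 4 · C' and the fixed 18 spatial-attention parameters ignored, as in the text), not profiling of a real implementation.

```python
def resnet_block_params(c, c_mid):
    # Eq. (7): two 1x1 convolutions (C <-> C') plus one 3x3 convolution (C' -> C').
    return 2 * c * c_mid + 9 * c_mid * c_mid

def resganet_block_params(c, c_mid, n):
    # Eq. (11) without the fixed 18 spatial-attention parameters.
    return (n + 1) * c * c_mid / n + 13 * c_mid * c_mid / (4 * n)

c_mid = 64
c = 4 * c_mid                          # the substitution C = 4 * C' used for Eq. (12)
for n in (1, 2, 4, 8):
    ratio = resnet_block_params(c, c_mid) / resganet_block_params(c, c_mid, n)
    print(f"N = {n}: ResNet / ResGANet parameter ratio = {ratio:.2f}")
# Equivalent to Eq. (12), 68N / (16N + 29): about 1.51 for N = 1
# and about 3.46-3.47 for N = 8, matching the figures quoted above.
```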


Fig. 3. Comparing our ResGANet block with existing attention methods.

4. Experiments

To evaluate the method in this paper, we conduct comprehensive experiments on two public medical image classification datasets (Codella et al., 2019; Yang et al., 2020) and three public medical image segmentation datasets (Codella et al., 2018; Setio et al., 2017). In the next section, we first introduce the details of each dataset and the implementation. Then, we conduct a series of ablation experiments on the International Skin Imaging Collaboration 2018 (ISIC2018) medical image classification dataset to verify the importance of each component in our proposed architecture. At the same time, we present the results of different medical image classification tasks and compare them with state-of-the-art methods. Finally, the network proposed in this paper can be directly used for image segmentation tasks and improves the baseline models. We report the experimental results of baseline segmentation models whose backbone network is replaced with ResGANet (other parameters unchanged). We also design an image segmentation model based on ResGANet and compare the experimental results with state-of-the-art methods using the same datasets.

4.1. Datasets

ISIC2018 (Codella et al., 2019): We use the ISIC2018 skin lesion diagnosis dataset.¹ There are a total of 10,015 images in this dataset, which contains seven different categories: melanocytic nevus (6705), dermatofibroma (115), melanoma (1113), actinic keratosis (327), benign keratosis (1099), basal cell carcinoma (514), and vascular lesion (142). The image size in the dataset is 650×450 pixels. We resize all images to 256×256 pixels; 70% of the samples (7010) are then used for training and validation and the remaining 30% (3005) for testing.

COVID19-CT (Yang et al., 2020): This dataset contains the medical images collected by He et al. (2020) from the medRxiv² and bioRxiv³ literature related to COVID-19. It has 349 COVID-19 positive CT scan images and 397 normal or negative CT scans containing other types of diseases.


Fig. 4. ResNet block and ResGANet block calculation process.

The sizes of the images in this dataset range from 143×76 to 1637×1225. We use bilinear interpolation to enlarge images smaller than 256×256 to 256×256, and the remaining images are downsampled to 256×256. We follow the data division method in He et al. (2020), dividing the dataset into a ratio of 0.6:0.15:0.25 for training, validation, and testing.

ISIC2017 (Codella et al., 2018): ISIC2017 is a skin lesion segmentation dataset⁴ released by the International Skin Imaging Collaboration in 2017. The dataset consists of 2000 training images, 150 validation images, and 600 test images. The images in the original dataset provided by ISIC have different resolutions. We first use the gray world color constancy algorithm to normalize the colors of the images and then adjust the size of all images to 256×256 pixels. The experimental results reported on this dataset in this article are all from the official test set.

Lung Nodule Analysis (LUNA) (Setio et al., 2017): LUNA is a dataset for segmenting lung structures in 2D CT images. The dataset contains 267 samples (512×512 pixels) and corresponding label images. It can be downloaded for free from the official website.⁵ We adjusted the sizes of all images to 256×256, used 80% of the images for training and the rest for testing, and conducted cross-validation.

Kaggle 2018 Data Science Bowl (referred to as Nuclei segmentation)⁶: The Booz Allen Foundation provides this dataset, which contains 670 nuclei feature maps and a label for each image. We adjusted all images and corresponding labels to 256×256 pixels, used 80% of the images for training and the rest for testing, and performed 5-fold cross-validation.

1 https://challenge2018.isic-archive.com/task3/
2 https://www.medrxiv.org/
3 https://www.biorxiv.org/
4 https://challenge.isic-archive.com/landing/2017
5 https://www.kaggle.com/kmader/finding-lungs-in-ct-data/data/
6 https://www.kaggle.com/c/data-science-bowl-2018/

4.2. Implementation details

We implement our method with TensorFlow and train on an NVIDIA Tesla V100 GPU with 16 gigabytes of memory. The input image size for all datasets is 256×256. In the image classification experiments, SGD with a momentum of 0.9 is used to optimize the model and speed up convergence. The batch size is set to 16, and the maximum number of epochs is 120. The initial learning rate is 1e-3; the learning rate decays to 0.1 times its value every 40 epochs and gradually decays to the final learning rate of 1e-5. In the image segmentation experiments, the Adam optimizer is used with a fixed learning rate of 1e-4. The batch size is set to 8, and an early stopping mechanism halts training when the validation loss is stable and shows no significant change for 15 epochs. All comparative experiments share the same operating environment and hyperparameters and use the same training, validation, and test sets.

When training on the ISIC2018 dataset, we use "Softmax" as the output layer and the categorical cross-entropy (CE) loss function to calculate the loss value:

$\mathrm{CE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} y_i^t \log y_i^p$   (13)

The remaining datasets belong to binary classification tasks. "Sigmoid" is used as the output layer of the model, and the binary cross-entropy (BCE) loss function is used to calculate the loss value:

$\mathrm{BCE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{C} \left[ y_i^t \log y_i^p + (1 - y_i^t) \log(1 - y_i^p) \right]$   (14)

where N represents the total number of samples, y_i^t is the true label of the i-th category, and y_i^p is the corresponding model output value. C represents the number of categories, i ∈ [1, C]; C = 7 in the ISIC2018 dataset and C = 2 in the remaining datasets.
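For reference, the training recipe above translates into a few lines of Keras. This is only a hedged sketch with a dummy model and random data standing in for ResGANet and the real datasets; the stated hyperparameters (SGD with momentum 0.9, step decay from 1e-3 to 1e-5, batch size 16, 120 epochs, and Adam at 1e-4 with early stopping of patience 15 for segmentation) come from the text.

```python
import numpy as np
import tensorflow as tf

# Dummy stand-ins so the sketch runs end to end; swap in ResGANet and real data.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(256, 256, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(7, activation="softmax"),      # 7 ISIC2018 classes
])
x_train = np.random.rand(32, 256, 256, 3).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 7, 32), 7)

def step_decay(epoch, lr):
    # 1e-3 decayed by a factor of 0.1 every 40 epochs, floored at 1e-5.
    return max(1e-3 * 0.1 ** (epoch // 40), 1e-5)

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
    loss="categorical_crossentropy",   # Eq. (13); binary datasets instead use a
    metrics=["accuracy"],              # sigmoid output and binary cross-entropy, Eq. (14)
)
model.fit(
    x_train, y_train, batch_size=16, epochs=120, validation_split=0.25,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(step_decay)],
    verbose=0,
)

# Segmentation variant: Adam at a fixed 1e-4, batch size 8, and early stopping
# after 15 epochs without improvement in validation loss.
seg_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
seg_callbacks = [tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=15)]
```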


Table 1
Comparison of different groups of ResGANet on the ISIC2018 dataset.

Model              Params    Acc%   Prec%  Recall%
50-layer
ResGANet (G = 1)   15.59 M   80.40  79.20  80.54
ResGANet (G = 2)   11.25 M   81.66  81.18  82.37
ResGANet (G = 4)   8.92 M    81.13  81.26  81.55
ResGANet (G = 8)   7.91 M    80.91  80.43  80.85
101-layer
ResGANet (G = 1)   27.65 M   81.29  80.47  80.69
ResGANet (G = 2)   19.45 M   82.35  82.07  82.51
ResGANet (G = 4)   15.35 M   82.04  81.96  82.47
ResGANet (G = 8)   13.30 M   81.54  80.43  81.29

Table 3
Ablation experiments with different pooling types and sizes applied to the ISIC2018 dataset (ResGANet-50, G = 2).

Pooling   Pool size   Acc%   Prec%  Recall%
Max       2           80.50  80.24  80.29
Max       3           79.51  80.09  79.15
Max       4           80.03  79.77  80.31
Average   2           81.00  80.32  81.53
Average   3           80.41  80.11  80.16
Average   4           81.66  81.18  82.37
4.3. Ablation study and visualization

4.3.1. Ablation study of the different groups

We conduct ablation studies on different groups of ResGANet on the ISIC2018 dataset. We show the results produced by the 50- and 101-layer versions of the ResGANet model when the grouping (G) is 1, 2, 4, and 8. The accuracy (Acc), precision (Prec), and recall of the experiments are shown in Table 1. We can see that increasing the number of groups from 1 to 8 reduces the number of network parameters. We might expect to obtain better classification accuracy with a larger number of groups; however, the experimental results show that better classification accuracy is achieved when the grouping is 2 or 4. Furthermore, empirically, an increase in the number of groups reduces the inference speed of the network. To make a good trade-off among speed, accuracy, and parameters, we use groupings of 2 and 4 in subsequent experiments. Finally, these experimental results also prove that increasing the number of groups in the network can improve the classification results for medical images.

4.3.2. Ablation study of channel shuffle and feature transformation

In Table 2, we compare the classification performance of ResGANet (50-layer) with and without channel shuffling and feature transformation. Considering the design of the network structure, we uniformly divide the channel features into four parts for the shuffle operation. The experimental results show that, when channel shuffling is used as the variable with the feature transformation fixed (the first and second rows of the results), the model with channel shuffling performs better than the corresponding model without it, which demonstrates the importance of cross-group information exchange. Moreover, the model with feature transformation can obtain richer feature information, thereby improving the model's overall performance (Acc 78.40% vs. 81.66%). Note that the above two operations do not introduce additional parameters, which shows that an appropriate and reasonable transformation of the feature map can further enhance the CNN's feature representation ability.

4.3.3. Ablation study for attention modules

We use Grad-CAM (Selvaraju et al., 2020) as an attention extraction tool and visualize the attention generated by ResGANet-50 without the attention module, with only the channel attention module, with only the spatial attention module, and with both attention modules. As shown in Fig. 5, when no attention module is added, the network can only focus on a specific part of the target area. When the channel attention module is added, the area of interest expands accordingly, but it does not match the target area well. Using the two attention modules together is better than the other two cases, and the generated attention points can more accurately locate and cover the target object. The visualization effect of adding only the spatial attention module is slightly worse than that of adding only the channel attention module; its ability to locate the boundary area is not as strong as that of the channel attention module. Compared to the visualization without the attention module, the spatial attention module is able to pay more attention to the target area. Moreover, when the target object is small, our network can still limit its focus to the semantically relevant region, showing that ResGANet helps to find more complete target objects even if they are small.

4.3.4. Average pooling vs. max pooling

Beyond the above ablation experiments, we also study the effect of different pooling sizes and types on image classification performance. In this ablation study, we replace the T1() operator described in Section 3.2 with pooling operations of different sizes (2, 3, 4) and different types (Avg, Max) and compare the performance differences. As shown in Table 3, when using the max-pooling operator and keeping other parameters unchanged, the best classification accuracy (80.50%) is obtained when the pooling size is 2. Moreover, using average pooling with a size of 4 achieves the best classification accuracy overall. Compared to the max-pooling operator with the highest classification accuracy, accuracy is improved by 1.16% (80.50 vs. 81.66). We think this may be because, unlike max pooling, average pooling establishes connections between all locations in the pooling window, which can better capture local context information.

4.4. Medical image classification task

4.4.1. Results on ISIC2018 dataset

For comparisons with ResNet and its variants (He et al., 2016; Hu et al., 2018; Woo et al., 2018; Xie et al., 2017; Gao et al., 2019; Wang et al., 2020; Li et al., 2019), all networks use a 256×256 image size for training. Following previous practice, we consider the 50- and 101-layer networks in each task. In addition, we have also compared some lightweight methods.

Table 2
Comparison of experiments with/without channel shuffling and feature transformation on the ISIC2018 dataset.

Model              Params    Shuffle  Trans  Acc%   Prec%  Recall%
50-layer
ResGANet (G = 2)   11.25 M   ✓        –      80.27  79.33  80.15
                   11.25 M   –        –      78.40  79.06  79.41
                   11.25 M   ✓        ✓      81.66  81.18  82.37
ResGANet (G = 4)   8.92 M    ✓        –      80.51  79.49  80.07
                   8.92 M    –        –      79.92  80.54  79.16
                   8.92 M    ✓        ✓      81.13  81.26  81.55
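The attention maps in Fig. 5 are produced with Grad-CAM (Selvaraju et al., 2020), as described in Section 4.3.3. A minimal Keras implementation of the idea is sketched below; the model and the convolutional layer name are placeholders, not the authors' code.

```python
import tensorflow as tf

def grad_cam(model, image, conv_layer_name, class_index=None):
    """Return a [0, 1] heatmap over the named conv layer for one (H, W, 3) image."""
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        if class_index is None:
            class_index = tf.argmax(preds[0])
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)               # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))      # channel-wise importance
    cam = tf.nn.relu(tf.reduce_sum(conv_out[0] * weights, axis=-1))
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# Usage (hypothetical layer name): heatmap = grad_cam(model, image, "last_conv_block")
# The heatmap is then resized to the input resolution and overlaid on the image.
```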


Fig. 5. Visualized results of the attention module ablation study. In the figure, the class with a higher weight reflects a higher thermal value. The first column is the input
image, and the second column is the visualization result without the attention module (AM). The third, fourth, and fifth columns are the ResGANet visualization results of
adding only the channel attention module (CAM), adding only the spatial attention module (SAM), and adding two attention modules simultaneously. All networks are set
up with 50 layers.

For ShuffleNet (Zhang et al., 2018), the complexity level is 1.0×, using 4 and 8 groups to compare with the 50- and 101-layer networks. For ShuffleNetV2 (Ma et al., 2018), the complexity levels of 1.0× and 1.5× are compared with the 50- and 101-layer networks, respectively. Based on prior experience, it is fair to compare EfficientNet-B0 (Tan and Li, 2019) and EfficientNet-B1 (Tan et al., 2019) to the 50- and 101-layer networks. The comparison models are adopted from their original implementations and share the operating environment and the necessary hyperparameters (such as loss function, batch size, learning rate, etc.). The experimental results are displayed in the bar graph on the right side of Fig. 6. The results show that ResGANet is superior to the most advanced ResNet variant networks in medical image classification. Similar to ResNeXt and Res2Net, we use grouped convolution to improve accuracy; unlike them, however, we improve the accuracy of medical image classification while keeping the number of network parameters small (the left side of Fig. 6 compares the parameters and accuracy of each model). Moreover, comparable to SENet and CBAM, we build attention modules to improve the feature representation ability of convolutional neural networks. Still, our network's classification accuracy is improved by 3.69% compared to that of SENet-50 (77.97% vs. 81.66%) and by 3.19% compared to that of CBAM-50 (78.47% vs. 81.66%). SKNet applies channel attention to multiple groups and improves the classification accuracy of the baseline network without significantly increasing the number of parameters. Compared to ResNet-101, SKNet-101 has an increase of 2.09 M in parameters and an accuracy increase of 4.81% (76.61 vs. 81.42). When ShuffleNet's grouping is set to 4, it maintains a similar number of parameters to ResGANet-50 (10.34 M vs. 11.25 M), but its overall performance is poor. As the complexity level increases, the parameter volume and performance of ShuffleNetV2 correspondingly improve, and the accuracy increases from 77.7% to 79.07%. These experimental results further confirm that adding attention modules to the baseline network or refining the features into multiple groups can reduce (or avoid increasing) the number of parameters and improve the performance of the classification model to varying degrees. ResGANet has both of these characteristics; ResGANet-101 has only 19.45 M parameters and achieves the best classification accuracy (82.35%) on the ISIC2018 dataset.

4.4.2. Results on COVID19-CT dataset

Table 4 shows the evaluation results of the proposed model and some of the most advanced classification algorithms on the COVID19-CT dataset. We adopt the data division method and evaluation indices (i.e., accuracy, F1 value, and AUC) found in the literature (He et al., 2020) and use the initialization network for training.


Fig. 6. Comparison of the classification performance and network parameters of ResGANet with ResNet and its variants (i.e., ResNet (He et al., 2016), SENet (Hu et al., 2018), CBAM (Woo et al., 2018), ResNeXt (Xie et al., 2017), Res2Net (Gao et al., 2019), ECANet (Wang et al., 2020), SKNet (Li et al., 2019), EfficientNet (Tan et al., 2019), ShuffleNet (Zhang et al., 2018), ShuffleNetV2 (Ma et al., 2018)) on the ISIC2018 dataset. Note that our ResGANet achieves higher accuracy with fewer model parameters.

Table 4
Performance comparison of different initialization networks on the COVID19-CT dataset.

Network                                        Params     Acc    F1     AUC
VGG-16 (Simonyan et al., 2014)                 131.95 M   0.66   0.58   0.74
ResNet-18 (He et al., 2016)                    11.15 M    0.67   0.66   0.76
ResNet-50 (He et al., 2016)                    24.37 M    0.72   0.73   0.78
DenseNet-121 (Huang et al., 2017)              7.61 M     0.76   0.77   0.82
DenseNet-169 (Huang et al., 2017)              13.49 M    0.80   0.79   0.86
EfficientNet-b0 (Tan et al., 2019)             5.04 M     0.72   0.71   0.76
EfficientNet-b1 (Tan et al., 2019)             7.43 M     0.70   0.62   0.77
ShuffleNet 1.0× (G = 4) (Zhang et al., 2018)   10.34 M    0.72   0.72   0.80
ShuffleNet 1.0× (G = 8) (Zhang et al., 2018)   7.25 M     0.71   0.72   0.76
CRNet (He et al., 2020)                        0.52 M     0.72   0.76   0.77
ShuffleNetV2 (1.0×) (Ma et al., 2018)          3.83 M     0.74   0.74   0.79
ShuffleNetV2 (1.5×) (Ma et al., 2018)          16.89 M    0.73   0.76   0.79
SENet-50 (Hu et al., 2018)                     27.31 M    0.76   0.77   0.80
CBAM-50 (Woo et al., 2018)                     27.31 M    0.78   0.80   0.79
ResNeXt-50 (Xie et al., 2017)                  21.98 M    0.72   0.75   0.78
Res2Net-50 (Gao et al., 2019)                  13.69 M    0.73   0.74   0.78
ECANet-50 (Wang et al., 2020)                  22.49 M    0.75   0.74   0.78
SKNet-50 (Li et al., 2019)                     23.57 M    0.77   0.76   0.77
ResGANet-50 (G = 2) (ours)                     11.25 M    0.80   0.81   0.82
ResGANet-101 (G = 2) (ours)                    19.45 M    0.78   0.81   0.82

Among them, red, green, and blue indicate the best, second-best, and third-best performance (likewise in the following tables).

DenseNet establishes dense connections between all the front layers and the back layers, realizes feature reuse, and obtains the best AUC (86%) on this dataset. EfficientNet applies depthwise separable convolution and channel attention modules to learn the importance of different channel features and dramatically reduces the number of parameters of the backbone network. However, when compared with other attention-based methods (such as SENet, CBAM, ECANet, and SKNet), the classification performance of EfficientNet on this dataset is limited. After analyzing the situation, we suppose that its network structure is mainly designed for large datasets (such as ImageNet) and is unsuitable for small datasets such as COVID19-CT. The experimental results verify that deeper or wider networks usually show higher classification performance, which also benefits from their more complex network structures. The accuracy (80%) and F1 value (81%) of ResGANet-50 (G = 2) achieve the best performance on this dataset. These results prove that ResGANet can be directly used in other applications in the same field without adjusting the network structure and maintains a high classification performance.

Fig. 7 shows box plots of the results of five experiments, with the standard deviation marked above each box plot to show the degree of stability of the different models. The median and average values of ResGANet on the ISIC2018 dataset are better than those of the other models. Among them, the stability of ResGANet-50 (0.96) is second only to SENet-50 (0.36), but it is far superior to SENet-50 in terms of classification performance. On the COVID19-CT dataset, the accuracy of ResGANet decreases as the depth increases, but the F1 value and AUC do not fluctuate significantly. This suggests that when ResGANet is applied to other smaller medical image classification datasets, using the shallower ResGANet-50 may perform better. Fig. 8 shows the t-test results of different methods on different datasets. There are substantial differences in performance between ResGANet and most methods. Although the method presented in this paper is an improvement of ResNet, there are highly significant differences between the performance of ResGANet and that of ResNet.

4.5. Medical image segmentation task

To explore the generalization ability of ResGANet, we apply it to different medical image segmentation tasks. First, we use ResU-Net (Alom et al., 2018), PSPNet (Zhao et al., 2017), DeepLabV3+ (Chen et al., 2018), and DANet (Fu et al., 2019) as the baseline methods for implementation on the ISIC2017 dataset. Then, we directly replace the encoder network in each baseline method with ResGANet and keep the other parameters unchanged to compare the differences in segmentation performance. Finally, inspired by the DeepLab (Chen et al., 2014, 2017, 2018) series of networks and Jahanifar et al. (2018), we design a decoding module for medical image segmentation, called the multiscale atrous spatial pyramid pooling module (MsASPP), as shown in Fig. 9. This module aims to cooperate with ResGANet to obtain more accurate segmentation results (exploring the optimal segmentation framework is beyond the scope of this article). Next, we report the segmentation results on the three datasets in turn and compare them with state-of-the-art networks.

4.5.1. Results for the ISIC2017 dataset

Among the four baseline networks, except for DeepLabV3+, which uses Xception (Chollet et al., 2017) as the backbone, the remaining three all use ResNet-101 as the backbone network to extract semantic features from medical images.
10

Fig. 7. Box-plots of different methods on the ISIC2018 and COVID19-CT datasets.

Fig. 8. A heatmap of statistical significance testing to evaluate the classification capabilities of different methods. (a) The t-test result of the statistical significance of the
50-layer network on the ISIC2018 dataset. (b) The t-test result of the statistical significance of the 101-layer network on the ISIC2018 dataset. (c) The t-test results of the
statistical significance of different methods on the COVID19-CT dataset.

For performance evaluation, we adopt several metrics recommended by the ISIC, namely, accuracy (Acc), sensitivity (Sen), specificity (Spec), Jaccard index (JI), and Dice coefficient (DC). Among these metrics, JI and DC are the leading indicators for measuring the segmentation results. Table 5 shows the experimental results. Our ResGANet-101 (G = 2) backbone network improves the JI and DC of the PSPNet model by 0.5% (0.746 vs. 0.751) and 1.4% (0.812 vs. 0.826), respectively, while maintaining similar overall model complexity. In addition, when the backbone networks of the other benchmark segmentation models are replaced with ResGANet, the segmentation performance is also improved to varying degrees. These results show that a well-designed backbone network can help improve segmentation performance.

Next, we use ResGANet-101 (G = 2) to extract the semantic features for medical image segmentation and combine it with the MsASPP module as the segmentation model (ResGANet-MsASPP) of this paper. Table 6 shows the experimental results of our segmentation model and some of the most advanced segmentation methods on the ISIC2017 medical image segmentation dataset.
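For completeness, the five metrics listed above can be computed from a predicted mask and a ground-truth mask as follows (an illustrative helper with an assumed 0.5 binarization threshold, not the evaluation code used by the challenge).

```python
import numpy as np

def segmentation_metrics(pred, target, threshold=0.5, eps=1e-8):
    """Acc, Sen, Spec, Jaccard index (JI) and Dice (DC) for one binary mask pair."""
    p = (np.asarray(pred) >= threshold).astype(np.float64).ravel()
    t = (np.asarray(target) >= 0.5).astype(np.float64).ravel()
    tp = np.sum(p * t)                    # lesion pixels predicted as lesion
    tn = np.sum((1 - p) * (1 - t))        # background predicted as background
    fp = np.sum(p * (1 - t))
    fn = np.sum((1 - p) * t)
    return {
        "Acc":  (tp + tn) / (tp + tn + fp + fn + eps),
        "Sen":  tp / (tp + fn + eps),
        "Spec": tn / (tn + fp + eps),
        "JI":   tp / (tp + fp + fn + eps),
        "DC":   2 * tp / (2 * tp + fp + fn + eps),
    }
```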


Fig. 9. Overview of the multiscale atrous spatial pyramid pooling module.
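Fig. 9 gives an overview of MsASPP. Purely as an illustration of the general idea — parallel atrous convolutions over encoder features fused with the decoder by elementwise summation — a generic ASPP-style decoder block in Keras might look like the sketch below. The dilation rates, channel width, and the assumption of a stride-2 feature pyramid are illustrative choices, not the paper's exact MsASPP configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp_block(x, filters=256, rates=(1, 6, 12, 18)):
    # Parallel atrous (dilated) 3x3 convolutions at several rates, summed elementwise.
    branches = [
        layers.Conv2D(filters, 3, dilation_rate=r, padding="same", activation="relu")(x)
        for r in rates
    ]
    return layers.Add()(branches)

def aspp_style_decoder(encoder_features, out_channels=1):
    # encoder_features: shallow-to-deep feature maps (e.g. five levels), each level
    # assumed to be at half the spatial resolution of the previous one.
    x = aspp_block(encoder_features[-1])
    for skip in reversed(encoder_features[:-1]):
        x = layers.UpSampling2D(2, interpolation="bilinear")(x)
        skip = layers.Conv2D(x.shape[-1], 1, padding="same")(skip)  # match channels
        x = layers.Add()([x, skip])    # elementwise summation with the encoder feature
    return layers.Conv2D(out_channels, 1, activation="sigmoid")(x)
```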

These comparison methods include U-Net (Ronneberger et al., 2015), AutoED (Attia et al., 2017), LIN (Li and Shen, 2018), DCGAN (Radford et al., 2016), Pix2pix (Isola et al., 2017), SCANet (Dai et al., 2018), RefineNet (Lin et al., 2017), CDNN (Yuan et al., 2017), MResNet-Seg (Bi et al., 2017), FocusNet (Kaul et al., 2019), DAGAN (Lei et al., 2020), ResU-Net (Xiao et al., 2018), DoubleU-Net (Jha et al., 2020), U-Net++ (Zhou et al., 2018), DANet (Fu et al., 2019), FCANet (Cheng et al., 2020), Attention U-Net (Oktay et al., 2018) and Attention R2U-Net (Alom et al., 2018). We first briefly introduce these comparison models; detailed methods and descriptions can be found in the corresponding references. U-Net improves on FCN (Long et al., 2015) and is a U-shaped network for medical image segmentation. Each time it upsamples, U-Net fuses feature maps of the same scale and channel number from the corresponding feature extraction stage, making it possible to segment medical images more accurately. CDNN is an improvement of U-Net, and its output mapping has the same dimension as the input image.


Table 5
Segmentation results of different baseline networks and the baseline networks after replacing the backbone with ResGANet on the ISIC2017 test set.

Network                           Acc    Sen    Spec   JI     DC
ResU-Net (Zhu et al., 2019)       0.901  0.806  0.970  0.740  0.811
ResGAU-Net (ours)                 0.905  0.802  0.954  0.743  0.814
PSPNet (Zhao et al., 2017)        0.906  0.835  0.954  0.746  0.812
PSPNet (ours)                     0.906  0.819  0.970  0.751  0.826
DeepLabV3+ (Chen et al., 2018)    0.900  0.785  0.977  0.731  0.814
DeepLabV3+ (ours)                 0.900  0.796  0.964  0.735  0.821
DANet (Fu et al., 2019)           0.859  0.771  0.930  0.694  0.788
DANet (ours)                      0.882  0.784  0.963  0.712  0.804

Table 6
Segmentation results of different methods on the ISIC2017 test set.

Method                                   Acc    Sen    Spec   JI     DC
U-Net (Ronneberger et al., 2015)         0.933  0.806  0.954  0.696  0.783
AutoED (Attia et al., 2017)              0.936  0.836  0.966  0.738  0.824
LIN (Li et al., 2018)                    0.934  0.855  0.974  0.753  0.839
DCGAN (Radford et al., 2016)             0.853  0.837  0.883  0.616  0.724
Pix2pix (Isola et al., 2017)             0.888  0.848  0.928  0.642  0.739
SCANet (Dai et al., 2018)                0.882  0.791  0.973  0.721  0.815
RefineNet (Lin et al., 2017)             0.918  0.839  0.945  0.705  0.772
CDNN (Yuan et al., 2017)                 0.934  0.825  0.975  0.765  0.849
MResNet-Seg (Bi et al., 2017)            0.934  0.802  0.985  0.760  0.844
FocusNet (Kaul et al., 2019)             0.921  0.767  0.990  0.756  0.832
DAGAN (Lei et al., 2020)                 0.935  0.835  0.976  0.771  0.859
ResU-Net (Xiao et al., 2018)             0.901  0.806  0.970  0.740  0.811
DoubleU-Net (Jha et al., 2020)           0.933  0.841  0.967  0.760  0.845
U-Net++ (Zhou et al., 2018)              0.925  0.830  0.956  0.743  0.832
DANet (Fu et al., 2019)                  0.859  0.771  0.930  0.694  0.788
FCANet (Cheng et al., 2020)              0.934  0.839  0.968  0.761  0.855
Attention U-Net (Oktay et al., 2018)     0.933  0.839  0.959  0.741  0.836
Attention R2U-Net (Alom et al., 2018)    0.933  0.840  0.957  0.752  0.856
ResGANet-MsASPP (ours)                   0.936  0.842  0.950  0.764  0.862

Table 7
Performance comparison of LUNA segmentation (mean ± standard deviation).

Network                                  1-JI          Acc           Sen
FCN-8s (Long et al., 2015)               0.091±0.066   0.931±0.042   0.927±0.037
SegNet (Badrinarayanan et al., 2017)     0.050±0.019   0.956±0.014   0.965±0.015
U-Net (Ronneberger et al., 2015)         0.087±0.090   0.975±0.032   0.938
Backbone (Gu et al., 2019)               0.044±0.063   0.988±0.024   0.967
CE-Net (Gu et al., 2019)                 0.038±0.061   0.990±0.023   0.980
ResU-Net (Xiao et al., 2018)             0.076±0.024   0.947±0.011   0.965±0.018
DoubleU-Net (Jha et al., 2020)           0.062±0.018   0.955±0.009   0.959±0.010
U-Net++ (Zhou et al., 2018)              0.061±0.009   0.949±0.011   0.970±0.013
DANet (Fu et al., 2019)                  0.056±0.005   0.949±0.009   0.974±0.009
FCANet (Cheng et al., 2020)              0.040±0.016   0.959±0.006   0.974±0.010
Attention U-Net (Oktay et al., 2018)     0.043±0.009   0.949±0.005   0.972±0.006
Attention R2U-Net (Alom et al., 2018)    0.043±0.007   0.956±0.003   0.977±0.007
ResGANet-MsASPP (ours)                   0.037±0.004   0.959±0.003   0.981±0.006

AutoED combines deep convolution and recurrent neural networks and has achieved good results in skin lesion segmentation. The LIN, MResNet-Seg, and FocusNet methods are similar, and all construct multiscale residual networks to segment skin lesions. Unlike previous work, DCGAN, Pix2pix, and DAGAN use generative adversarial networks to segment medical images. DCGAN replaces pooling operators with convolutions, using strided convolutions in the discriminator and fractional-strided convolutions in the generator. Pix2pix uses U-Net as a generator and adds a specific loss function (Isola et al., 2017) to the objective function. DAGAN is an improvement of the above two methods and uses dual discriminators to complete skin lesion segmentation. In addition, DANet, FCANet, Attention U-Net, and Attention R2U-Net are all attention-based segmentation methods. They add attention modules at different stages to aggregate detailed information lost by excessive down-sampling operations. In contrast to the above techniques, ResGANet enhances the feature representation ability in the down-sampling process in two different dimensions, space and channel, and uses MsASPP to restore the size of the feature map, reducing the loss of global semantic information in the up-sampling process and thereby achieving the accuracy needed for medical image segmentation.

The segmentation results in Table 6 show that our proposed ResGANet-MsASPP model has certain potential to be effective in skin lesion segmentation tasks. These results benefit from ResGANet's powerful feature representation ability and the up-sampling method of the MsASPP module. Unlike the traditional U-Net up-sampling method, we combine the feature maps between the encoder and the decoder through elementwise summation operators. Meanwhile, we extract feature maps from five different levels and construct a decoding path in the form of a spatial pyramid pool. This operation allows us to combine various levels of features and reduce the parameters of the entire network. Although our method is not as robust as other methods on some indicators, it shows excellent performance; the accuracy and Dice reached 93.6% and 86.2%, respectively. In addition, we provide a solid framework improvement idea for future image segmentation work.

4.5.2. Results for the LUNA dataset

The second task is to segment the lung structure in 2D CT images. In addition to making a comparison with the state-of-the-art methods of the current study, we also implement some benchmark medical image segmentation methods (Long et al., 2015; Badrinarayanan et al., 2017; Xiao et al., 2018; Jha et al., 2020; Zhou et al., 2018) and some attention-based medical image segmentation methods (Fu et al., 2019; Cheng et al., 2020; Oktay et al., 2018; Alom et al., 2018) and report the experimental results. We use the evaluation metrics in the literature (Gu et al., 2019), namely, overlapping error (1-JI), Acc, and Sen.

We can see from the comparison in Table 7 that our method has an overlap error of 0.037 and a sensitivity score of 0.981, which is better than the state-of-the-art CE-Net (Gu et al., 2019). Although the segmentation content of this dataset is relatively simple, the standard deviation reflects that our network has higher stability. Compared to DANet, which only adds an attention module on top of the backbone network, ResGANet aggregates spatial and channel feature information during residual learning, so the segmentation method that uses ResGANet as the backbone achieves a better segmentation effect. In addition, adding attention modules at different stages (as in FCANet, Attention U-Net, and Attention R2U-Net) can effectively improve medical image segmentation performance. At the same time, these results also show that the proposed segmentation model can outperform other benchmark and attention-based segmentation methods on lung segmentation datasets where the distribution of positive and negative samples is relatively concentrated.

4.5.3. Results for nuclei segmentation

The last application is the task of nuclei segmentation. Automatically detecting the nucleus could help medical staff cure the common cold and other rarer diseases faster. We use the same evaluation metrics as for the ISIC2017 dataset and report the fivefold cross-validation results for nuclei segmentation. The experimental results are shown in Table 8.

Table 8
Performance comparison of nuclei segmentation (mean ± standard deviation).

Network                                  Acc           Sen           Spec          JI            DC
FCN-8s (Long et al., 2015)               0.974±0.012   0.944±0.016   0.984±0.009   0.894±0.014   0.942±0.012
SegNet (Badrinarayanan et al., 2017)     0.979±0.009   0.957±0.017   0.987±0.011   0.912±0.009   0.954±0.011
U-Net (Ronneberger et al., 2015)         0.972±0.014   0.960±0.012   0.981±0.009   0.904±0.007   0.946±0.009
ResU-Net (Xiao et al., 2018)             0.971±0.008   0.936±0.013   0.977±0.010   0.889±0.019   0.936±0.014
DoubleU-Net (Jha et al., 2020)           0.976±0.006   0.948±0.010   0.985±0.009   0.902±0.012   0.943±0.007
U-Net++ (Zhou et al., 2018)              0.978±0.006   0.961±0.009   0.985±0.003   0.910±0.006   0.949±0.009
DANet (Fu et al., 2019)                  0.978±0.006   0.952±0.011   0.981±0.006   0.899±0.010   0.946±0.008
FCANet (Cheng et al., 2020)              0.977±0.002   0.955±0.005   0.983±0.006   0.907±0.007   0.950±0.003
Attention U-Net (Oktay et al., 2018)     0.975±0.004   0.933±0.020   0.983±0.005   0.898±0.011   0.935±0.012
Attention R2U-Net (Alom et al., 2018)    0.971±0.006   0.953±0.014   0.985±0.007   0.911±0.010   0.941±0.007
ResGANet-MsASPP (ours)                   0.981±0.003   0.969±0.007   0.985±0.002   0.918±0.007   0.956±0.006

Fig. 10. Comparison of segmentation performance of different state-of-the-art networks on the ISIC2017, LUNA, and Nuclei segmentation datasets (one row per dataset). The first two columns show the input image and label, and the remaining columns are the visualization results of the corresponding methods (FCN-8s, U-Net, SegNet, ResU-Net, DoubleU-Net, U-Net++, DANet, FCANet, Attention U-Net, Attention R2U-Net, and ours).

The distribution of positive and negative sample features in the nuclei segmentation dataset is not uniform. Due to the relatively simple decoder of the FCN-8s model, a large amount of semantic information is lost during decoding, which makes its segmentation results on this dataset poor. SegNet achieves the best specificity, indicating that the network pays more attention to the accuracy of the background area. U-Net++, ResU-Net, and DoubleU-Net are all variants of U-Net that aim to fully mine the richer semantic information in medical images. U-Net++ introduces a built-in collection of U-Nets of varying depths, which improves segmentation performance and increases nuclei segmentation accuracy by 0.6% (0.972 vs. 0.978). Compared to the other attention-based segmentation methods (DANet, FCANet, Attention U-Net, and Attention R2U-Net), the method proposed in this paper achieves the best performance in accuracy, sensitivity, Jaccard index, and Dice. This shows not only that the attention-based ResGANet backbone can improve the segmentation of medical images, but also that ResGANet can process a variety of biomedical image data and improve the feature representation ability of convolutional neural networks.

4.5.4. Visual results

The visual performance comparison of the different segmentation methods on the three medical image segmentation datasets is shown in Fig. 10. The experimental results show that the proposed method achieves better segmentation results than the state-of-the-art methods, even when the boundary of the segmentation target is not clear (as shown in the first row of Fig. 10). In addition, the attention-based segmentation methods are superior to the other benchmark methods in dealing with medical images with unclear boundaries, because the attention module produces a more discriminative feature representation by selecting the focus position. The previous segmentation models can also show a good segmentation effect when the target area and the background area are clearly distinguishable (as shown in the second and third rows of Fig. 10). However, due to the similarity of pixels in some areas (red box area in the fourth row), the FCN-8s, ResU-Net, and Attention U-Net methods still produce inaccurate segmentation results. The method presented in this paper can effectively distinguish these similar pixels and achieves boundary segmentation that is almost identical to the labels. As shown in the last two rows of Fig. 10, the distribution of positive and negative sample features in the nuclei segmentation dataset is uneven, and the background is similar to the lesion area (red box area in the fifth row), which requires the segmentation model to make an accurate decision on the features of the background region and the focus region to ensure segmentation performance. The attention-based ResGANet-MsASPP method improves semantic consistency, balances the difference between positive and negative samples well, and produces a better segmentation effect than the other methods. Finally, these visualization results further confirm the effectiveness of the proposed method in medical image segmentation.

5. Discussion

Due to their modular and straightforward structure, ResNet and its variants have been widely applied as the backbone network for downstream medical image analysis applications. However, the diversity of medical images, the high cost of labeling, and the small amount of data compared with natural images result in limited training data. Simply applying a backbone network that performs well on natural images to the field of medical images cannot obtain satisfactory results. Therefore, finding an effective and robust medical image backbone network is still a challenging task. Based on these requirements, we design a highly modular and functional group attention block and stack group attention blocks in ResNet style to obtain a new variant called ResGANet.

In the case of small datasets, researchers usually tend to use affine transformations to increase the diversity of the data and expand the amount of data, so that the network can mine deep features more fully and improve its generalization ability. Dieleman et al. (2015) rotate the input image to various angles to generate different viewpoints, use a shared encoder to learn the features of the different views, and splice the output features into the fully connected layer and the classification layer. Considering that using a shared encoder to learn features increases time and parameter costs, our method performs a simple feature transformation on the feature map without increasing the number of calculations. Therefore, a single residual block can obtain richer feature information and alleviate the limited ability of a CNN to handle spatially transformed input data. In addition, unlike traditional group convolution, each feature subgroup of the group attention block contains the information of the first feature subgroup. This gives the block stronger feature extraction capabilities and increases the receptive field of each group to a certain extent.
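The feature transformation and subgroup reuse described above can be sketched as follows. This is only an illustration under stated assumptions: we assume the transformation is a cheap, parameter-free flip of the feature maps and that the first subgroup is added to each subsequent subgroup; the actual transformation and grouping in ResGANet follow the architecture described earlier in the paper.

```python
import torch

def transform_subgroups(x, groups=4):
    """Illustrative feature transformation for a grouped residual block.

    Assumptions (not the paper's exact design): channels are split into
    `groups` subgroups, each later subgroup receives the information of the
    first subgroup by elementwise summation, and a parameter-free spatial
    flip plays the role of the feature transformation.
    """
    subgroups = torch.chunk(x, groups, dim=1)        # split along the channel axis
    first = subgroups[0]
    outputs = [first]
    for i, sg in enumerate(subgroups[1:], start=1):
        transformed = torch.flip(sg, dims=[2 + (i % 2)])  # flip H or W alternately
        outputs.append(transformed + first)               # reuse first-subgroup information
    return torch.cat(outputs, dim=1)

# Example: a feature map with 64 channels split into 4 subgroups of 16
feat = torch.randn(2, 64, 32, 32)
out = transform_subgroups(feat, groups=4)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```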
Hu et al. (2018) first proposed the concept of channel attention (squeeze-and-excitation networks in the original paper), using the global context to predict a channel-wise attention factor. Many attention-based methods (CBAM (Woo et al., 2018), ECANet (Wang et al., 2020), DANet (Fu et al., 2019), SKNet (Li et al., 2019)) have since been built on this basis. However, these methods either focus only on the importance of the channel dimension without considering the importance of the spatial dimension (such as SENet, ECANet, and SKNet), or only embed the attention module on top of the entire block or the entire network without considering the importance of multiple groups (such as CBAM and DANet). The work presented in this paper builds on the existing practice of attaching attention blocks and extends channel attention to each group, which remains efficient in actual computation. We send each group's weighted and summed channel attention map into the spatial attention module, which preserves the importance of features in the channel dimension and increases the weight of features that are valuable for the current task in the spatial extent.
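A minimal sketch of per-group channel attention followed by spatial attention is given below. The pooling choices, reduction ratio, kernel size, and the way the weighted group maps are recombined are generic SE/CBAM-style assumptions, not the exact ResGANet modules.

```python
import torch
import torch.nn as nn

class GroupChannelSpatialAttention(nn.Module):
    """Illustrative per-group channel attention followed by spatial attention
    (SE/CBAM-style assumptions; not the exact ResGANet implementation)."""

    def __init__(self, channels, groups=4, reduction=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        c = channels // groups
        # one small SE-style excitation per group
        self.channel_fc = nn.ModuleList(
            nn.Sequential(
                nn.Linear(c, max(c // reduction, 1)),
                nn.ReLU(inplace=True),
                nn.Linear(max(c // reduction, 1), c),
                nn.Sigmoid(),
            )
            for _ in range(groups)
        )
        # a single spatial attention applied to the recombined map
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, _, h, w = x.shape
        weighted = []
        for sg, fc in zip(torch.chunk(x, self.groups, dim=1), self.channel_fc):
            s = sg.mean(dim=(2, 3))            # global average pooling per group
            w_ch = fc(s).view(b, -1, 1, 1)     # channel attention weights
            weighted.append(sg * w_ch)
        y = torch.cat(weighted, dim=1)         # recombine the weighted group maps
        avg_map = y.mean(dim=1, keepdim=True)
        max_map, _ = y.max(dim=1, keepdim=True)
        att = self.spatial(torch.cat([avg_map, max_map], dim=1))
        return y * att                         # spatial attention applied on top

m = GroupChannelSpatialAttention(channels=64, groups=4)
print(m(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```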
Moreover, the proposed method differs from previous plug-and-play attention modules: we integrate the attention module into the entire residual structure to form a highly modular model. The stacked ResGANet reduces the parameters by 1.51–3.47 times (depending on the group size) compared to the original ResNet. Numerous experimental results show that the proposed ResGANet is superior to state-of-the-art convolutional neural network backbone models in medical image classification tasks. The model can also improve the baseline model in medical image segmentation tasks without changing the network architecture. Although our model focuses on medical computer vision tasks, the ideas presented in this paper can provide some insights for researchers working on designing feature representations for improving convolutional neural networks.
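As a back-of-the-envelope illustration of why grouping reduces parameters, the weight count of a k x k convolution drops roughly by the group factor; the network-level figures quoted above (1.51–3.47x) are smaller than this per-layer factor because not every layer in the network is grouped. The layer sizes in the example are illustrative, not taken from the ResGANet configuration.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (bias ignored)."""
    return (c_in // groups) * c_out * k * k

# Example: a 3x3 convolution on 256 channels
standard = conv_params(256, 256, 3, groups=1)  # 589,824 weights
grouped = conv_params(256, 256, 3, groups=4)   # 147,456 weights
print(standard / grouped)                      # 4.0: grouping by 4 cuts this layer 4x
```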
While our proposed method has shown promising results, some limitations should be addressed in future research:

• We adopt the hyper-parameter settings optimized for the residual network. In the future, detailed adjustment of the hyper-parameter settings (such as the size of the input image and the number of output channels) may further reduce the parameters of our model. Because grouped convolution with feature transformation carries richer feature information, reducing the number of channels of the output feature map may not affect the model's performance.
• Designing targeted feature transformations and fine-grained subgrouping for different tasks may further improve the performance of ResGANet.
• This paper shows that the proposed method contributes to improving the performance of downstream segmentation tasks. In the future, designing a suitable decoder structure could further improve the performance of ResGANet on different tasks.

6. Conclusion

This work proposes the ResGANet architecture with a novel group attention block. This structure enhances the feature representation ability of CNNs and improves the performance of medical image classification and segmentation. We show that ResGANet has good generalization ability and performs well on multiple tasks in medical computer vision. In the image segmentation task, simply replacing the backbone network with ResGANet outperforms the same segmentation model with a standard backbone network (such as ResNet or Xception). In addition, ResGANet requires fewer parameters and fewer calculations to achieve state-of-the-art performance. We believe that ResGANet should be widely applicable to various medical vision tasks.

Declaration of Competing Interest

The authors declare that there are no conflicts of interest in this work.

CRediT authorship contribution statement

Junlong Cheng: Writing – original draft, Methodology, Software, Writing – review & editing. Shengwei Tian: Data curation, Investigation. Long Yu: Writing – review & editing, Formal analysis. Chengrui Gao: Writing – original draft. Xiaojing Kang: Visualization, Investigation. Xiang Ma: Supervision, Funding acquisition. Weidong Wu: Data curation, Formal analysis. Shijia Liu: Investigation, Validation, Visualization. Hongchun Lu: Resources, Software.

Acknowledgements

This research is partially supported by the National Natural Science Foundation of China (No. 62162058), the Science and Technology Department of Xinjiang Uyghur Autonomous Region (2020E0234), and the Xinjiang Autonomous Region key research and development project (2021B03001–4). We would also like to thank our tutor for the careful guidance and all the participants for their insightful comments.

References

Alom, M.Z., Hasan, M., Yakopcic, C., Taha, T.M., Asari, V.K., 2018. Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation, [Online]. Available: https://arxiv.org/abs/1802.06955.
Attia, M., Hossny, M., Nahavandi, S., Yazdabadi, A., 2017. Spatially aware melanoma segmentation using hybrid deep learning techniques, [Online]. Available: https://arxiv.org/abs/1702.07963.

Bi, L., Kim, J., Ahn, E., Feng, D., 2017. Automatic skin lesion analysis using large-scale dermoscopy images and deep residual networks, [Online]. Available: https://arxiv.org/abs/1703.04197.
Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39 (12), 2481–2495. doi:10.1109/TPAMI.2016.2644615.
Cheng, J., Tian, S., Yu, L., Lu, H., Lv, X., 2020. Fully convolutional attention network for biomedical image segmentation. Artif. Intell. Med. 107, 101899. doi:10.1016/j.artmed.2020.101899.
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L., 2014. Semantic image segmentation with deep convolutional nets and fully connected CRFs. Comput. Sci. (4), 357–361. [Online]. Available: https://arxiv.org/abs/1412.7062.
Chen, L., Papandreou, G., Schroff, F., Adam, H., 2017. Rethinking atrous convolution for semantic image segmentation, [Online]. Available: https://arxiv.org/abs/1706.05587.
Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the IEEE ECCV, pp. 833–851.
Cao, Y., Xu, J., Lin, S., Wei, F., Hu, H., 2019. GCNet: non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE ICCV, pp. 1971–1980. doi:10.1109/ICCVW.2019.00246.
Codella, N.C., Rotemberg, V., Tschandl, P., Celebi, M.E., Dusza, S.W., Gutman, D.A., Halpern, A.C., 2019. Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the International Skin Imaging Collaboration (ISIC), [Online]. Available: http://arxiv.org/abs/1902.03368.
Codella, N., Gutman, D., Celebi, M., Helba, B., Marchetti, M., Dusza, S., Kalloo, A., Liopyris, K., Mishra, N., Kittler, H., Halpern, A., 2018. Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). In: Proceedings of the IEEE ISBI, pp. 168–172.
Chollet, F., 2017. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE CVPR, pp. 1800–1807. doi:10.1109/CVPR.2017.195.
Dieleman, S., Willett, K.W., Dambre, J., 2015. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Mon. Not. R. Astron. Soc. 450 (2), 1441–1459. doi:10.1093/mnras/stv632.
Dai, W., Doyle, J., Liang, X., Zhang, H., Dong, N., Li, Y., Xing, E.P., 2018. SCAN: structure correcting adversarial network for organ segmentation in chest X-rays. In: Proceedings of the DLMIA, pp. 263–273. doi:10.1007/978-3-030-00889-5_30.
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H., 2019. Dual attention network for scene segmentation. In: Proceedings of the IEEE CVPR, pp. 3146–3154. doi:10.1109/CVPR.2019.00326.
Jahanifar, M., Tajeddin, N.Z., Koohbanani, N.A., Gooya, A., Rajpoot, N., 2018. Segmentation of skin lesions and their attributes using multi-scale convolutional neural networks and domain specific augmentations, [Online]. Available: https://arxiv.org/abs/1809.10243.
Jha, D., Riegler, M.A., Johansen, D., Halvorsen, P., Johansen, H.D., 2020. DoubleU-Net: a deep convolutional neural network for medical image segmentation, [Online]. Available: https://arxiv.org/abs/2006.04868.
Gao, S., Cheng, M., Zhao, K., Zhang, X., Yang, M., Torr, P.H., 2019. Res2Net: a new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. doi:10.1109/tpami.2019.2938758.
Gens, R., Domingos, P., 2014. Deep symmetry networks. In: Proceedings of the NIPS, pp. 2537–2545.
Gu, Z., Cheng, J., Fu, H., Zhou, K., Hao, H., Zhao, Y., Liu, J., 2019. CE-Net: context encoder network for 2d medical image segmentation. IEEE Trans. Med. Imaging 38 (10), 2281–2292.
He, K., Gkioxari, G., Dollar, P., Girshick, R., 2017. Mask R-CNN. In: Proceedings of the IEEE ICCV, pp. 2980–2988.
He, K., Zhang, X., Ren, S., Sun, J., 2016a. Deep residual learning for image recognition. In: Proceedings of the IEEE CVPR, pp. 770–778. doi:10.1109/CVPR.2016.90.
Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E., 2018. Squeeze-and-excitation networks. In: Proceedings of the IEEE CVPR, pp. 7132–7141. doi:10.1109/CVPR.2018.00745.
He, X., Yang, S., Li, G., Li, H., Chang, H., Yu, Y., 2019. Non-local context encoder: robust biomedical image segmentation against adversarial attacks. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8417–8424. doi:10.1609/aaai.v33i01.33018417.
Howard, A., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Adam, H., 2017. MobileNets: efficient convolutional neural networks for mobile vision applications, [Online]. Available: http://arxiv.org/pdf/1704.04861.
Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Adam, H., 2019. Searching for mobilenetv3. In: Proceedings of the IEEE ICCV, pp. 1314–1324. doi:10.1109/ICCV.2019.00140.
He, X., Yang, X., Zhang, S., Zhao, J., 2020. Sample-efficient deep learning for COVID-19 diagnosis based on CT scans, [Online]. Available: doi:10.1101/2020.04.13.20063941.
Huang, G., Liu, Z., Der Maaten, L.V., Weinberger, K.Q., 2017. Densely connected convolutional networks. In: Proceedings of the IEEE CVPR, pp. 2261–2269. doi:10.1109/CVPR.2017.243.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning (ICML), 1, pp. 448–456.
Isola, P., Zhu, J., Zhou, T., Efros, A.A., 2017. Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE CVPR, pp. 5967–5976. doi:10.1109/CVPR.2017.632.
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K., 2015. Spatial transformer networks. In: Proceedings of the NIPS, pp. 2017–2025.
Kaul, C., Manandhar, S., Pears, N., 2019. Focusnet: an attention-based fully convolutional network for medical image segmentation. In: Proceedings of the IEEE ISBI, pp. 455–458. doi:10.1109/ISBI.2019.8759477.
Krizhevsky, A., Hinton, G., 2010. Convolutional deep belief networks on cifar-10. Unpubl. Manuscr. 1–9. [Online]. Available: http://www.cs.toronto.edu/~kriz/conv-cifar10-aug2010.pdf.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Proceedings of the NIPS, pp. 1097–1105.
Litjens, G., Kooi, T., Bejnordi, B., Setio, A., Ciompi, F., Ghafoorian, M., 2017. A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88. doi:10.1016/j.media.2017.07.005.
Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE CVPR, pp. 3431–3440.
Li, X., Wang, W., Hu, X., Yang, J., 2019. Selective kernel networks. In: Proceedings of the IEEE CVPR, pp. 510–519. doi:10.1109/CVPR.2019.00060.
Lenc, K., Vedaldi, A., 2015. Understanding image representations by measuring their equivariance and equivalence. In: Proceedings of the IEEE CVPR, pp. 991–999.
Li, Y., Shen, L., 2018. Skin lesion analysis towards melanoma detection using deep learning network. Sensors 18 (2), 556. doi:10.3390/s18020556.
Lin, G., Milan, A., Shen, C., Reid, I., 2017. RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE CVPR, pp. 5168–5177. doi:10.1109/CVPR.2017.549.
Lei, B., Xia, Z., Jiang, F., Jiang, X., Wang, S., 2020. Skin lesion segmentation via generative adversarial networks with dual discriminators. Med. Image Anal. 64, 101716. doi:10.1016/j.media.2020.101716.
Ma, N., Zhang, X., Zheng, H.T., et al., 2018. ShuffleNet v2: practical guidelines for efficient CNN architecture design. In: Proceedings of the ECCV, pp. 116–131. doi:10.1007/978-3-030-01264-9_8.
Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Rueckert, D., 2018. Attention u-net: learning where to look for the pancreas, [Online]. Available: https://arxiv.org/abs/1804.03999v3.
Russakovsky, O., Deng, J., Su, H., Fei, L., 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115 (3), 211–252. doi:10.1007/s11263-015-0816-y.
Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: convolutional networks for biomedical image segmentation. In: Proceedings of the International Conference MICCAI, pp. 234–241. doi:10.1007/978-3-319-24574-4_28.
Radford, A., Metz, L., Chintala, S., 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the 4th International Conference on Learning Representations, ICLR Conference Track, pp. 1–16.
Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, Conference Track, pp. 1–14.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A., 2015. Going deeper with convolutions. In: Proceedings of the IEEE CVPR, pp. 1–9. doi:10.1109/CVPR.2015.7298594.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE CVPR, pp. 2818–2826. doi:10.1109/CVPR.2016.308.
Setio, A.A.A., Traverso, A., De Bel, T., Berens, M.S., Den Bogaard, C.V., Cerello, P., Jacobs, C., 2017. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med. Image Anal. 42, 1–13. doi:10.1016/j.media.2017.06.015.
Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., 2020. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 128 (2), 336–359. doi:10.1007/s11263-019-01228-7.
Tan, M., Chen, B., Pang, R., Vasudevan, V.K., Sandler, M., Howard, A., Le, Q.V., 2019. MnasNet: platform-aware neural architecture search for mobile. In: Proceedings of the IEEE CVPR, pp. 2815–2823. doi:10.1109/CVPR.2019.00293.
Tan, M., Le, Q.V., 2019. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the ICML, pp. 6105–6114.
Woo, S., Park, J., Lee, J., Kweon, I.S., 2018. CBAM: convolutional block attention module. In: Proceedings of the IEEE ECCV, pp. 3–19. doi:10.1007/978-3-030-01234-2_1.
Wang, Q., Wu, B., Zhu, P., Li, P., Hu, Q., 2020. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
Wang, X., Girshick, R., Gupta, A., He, K., 2018. Non-local neural networks. In: Proceedings of the IEEE CVPR, pp. 7794–7803. doi:10.1109/CVPR.2018.00813.
Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE CVPR, pp. 5987–5995. doi:10.1109/CVPR.2017.634.
Xiao, X., Shen, L., Luo, Z., Li, S., 2018. Weighted Res-UNet for high-quality retina vessel segmentation. In: Proceedings of the 9th International Conference on Information Technology in Medicine and Education (ITME). IEEE Computer Society.
Yang, X., He, X., Zhao, J., Zhang, Y., Zhang, S., Xie, P., 2020. COVID-CT-Dataset: a CT scan dataset about COVID-19, [Online]. Available: http://arxiv.org/abs/2003.13865?context=stat.
Yuan, Y., Chao, M., Lo, Y., 2017a. Automatic skin lesion segmentation using deep fully convolutional networks with jaccard distance. IEEE Trans. Med. Imaging 36 (9), 1876–1886.

Zhu, Y., Sapra, K., Reda, F.A., Shih, K.J., Newsam, S., Tao, A., Catanzaro, B., 2019. Improving semantic segmentation via video propagation and label relaxation. In: Proceedings of the IEEE CVPR, pp. 8856–8865. doi:10.1109/CVPR.2019.00906.
Zhang, X., Zhou, X., Lin, M., Sun, J., 2018. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE CVPR, pp. 6848–6856. doi:10.1109/CVPR.2018.00716.
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network. In: Proceedings of the IEEE CVPR, pp. 6230–6239. doi:10.1109/CVPR.2017.660.
Zhou, Z., Siddiquee, M., Tajbakhsh, N., Liang, J., 2018. UNet++: a nested u-net architecture for medical image segmentation. In: Proceedings of the 4th Deep Learning in Medical Image Analysis (DLMIA) Workshop.
