
Front. Comput. Sci., 2023, 17(6): 176338
https://doi.org/10.1007/s11704-022-2126-1

RESEARCH ARTICLE

A feature-wise attention module based on the difference with surrounding features for convolutional neural networks

Shuo TAN, Lei ZHANG (✉), Xin SHU, Zizhou WANG


Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, China

© Higher Education Press 2023
Received March 7, 2022; accepted September 29, 2022
E-mail: leizhang@scu.edu.cn

Abstract  The attention mechanism has become a widely researched approach for improving the performance of convolutional neural networks (CNNs). Most research focuses on designing channel-wise and spatial-wise attention modules, but neglects the unique information carried by each individual feature, which is critical for deciding both "what" and "where" to focus. In this paper, a feature-wise attention module is proposed that gives every feature of the input feature map its own attention weight. Specifically, the module is based on the well-known phenomenon of surround suppression from neuroscience and consists of two sub-modules: the Minus-Square-Add (MSA) operation and a group of learnable non-linear mapping functions. The MSA imitates surround suppression and defines an energy function that can be applied to each feature to measure its importance. The group of non-linear functions refines the energy calculated by the MSA to more reasonable values. Together, these two sub-modules capture feature-wise attention well. Meanwhile, owing to the simple structure and few parameters of the two sub-modules, the proposed module can easily be integrated into almost any CNN. To verify the performance and effectiveness of the proposed module, experiments were conducted on the Cifar10, Cifar100, Cinic10, and Tiny-ImageNet datasets. The experimental results demonstrate that the proposed module is flexible and effective in improving the performance of CNNs.

Keywords  feature-wise attention, surround suppression, image classification, convolutional neural networks

1 Introduction
Convolutional neural networks (CNNs) have shown outstanding performance on a wide range of computer vision tasks [1−5]. Recent studies (e.g., [6−11]) find that CNNs combined with attention modules achieve better performance.
However, most existing attention modules focus on capturing channel-wise or spatial-wise attention and cannot capture both at the same time. This limits their capability to capture the unique information of each feature, which plays an important role in deciding both "what" and "where" to focus [12]. Meanwhile, from the perspective of the human brain, channel-wise attention and spatial-wise attention correspond to feature-based attention and spatial-based attention, respectively [13]. During visual processing in the human brain, the two kinds of attention co-exist and jointly contribute to the selection of important information, which further demonstrates the necessity of capturing feature-wise attention.
To capture feature-wise attention, there are mainly two categories of methods. The first is to design an elaborate encoder-decoder structure (e.g., the Residual Attention Network for Image Classification (RANet) [9] and Learning Pixel-wise Contextual Attention for Saliency Detection (PiCANet) [14]). The second is to design a simple and efficient weight calculation that can be applied to each feature (e.g., the Simple, Parameter-Free Attention Module for Convolutional Neural Networks (SimAM) [10] and 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval (PCAN) [15]). The structures designed in the first category are not flexible or modularized enough because of their complexity and numerous parameters. Therefore, the second category of methods may be more generally applicable to CNNs. Based on the well-known phenomenon in neuroscience called surround suppression, SimAM proposes a method to calculate feature-wise attention. Specifically, surround suppression shows that the most important neurons are those that exhibit a significant difference from other neurons; moreover, an important neuron may suppress the neurons that surround it [16]. Conversely, the neurons that present a more notable difference can be considered more important. By imitating surround suppression, attention modules can be designed that make CNNs pay more attention to features showing a significant difference from their surrounding features and pay less attention to the others. Based on this, SimAM defines a linear function to measure the difference from each feature to the others and hence the importance of each feature. However, in order to obtain the parameter values of this linear function, SimAM has to accept a number of drawbacks.

Specifically, it assumes that, during the calculation of the energy of each feature, the distribution of the current feature is similar to that of the others. The assumption is reasonable to some extent. However, it is possible that the distribution of important features differs significantly from that of unimportant ones, which may affect the reliability of the whole module. Meanwhile, it introduces additional hyperparameters, and different values for them can have distinctly different effects on performance. Therefore, it is necessary to explore other approaches.
In this paper, we follow the idea of surround suppression and propose another simple module that can be applied to each feature to capture feature-wise attention. Specifically, the proposed module contains two sub-modules: the Minus-Square-Add (MSA) operation and a group of learnable non-linear mapping functions. The MSA imitates surround suppression and uses an energy function, defined as the average sum of the squared Euclidean distances from each feature to the others, to measure the difference each feature exhibits from the others and hence its importance. Moreover, the calculation can easily be simplified by a series of equivalent transformations without any assumptions, and the final form of the MSA is flexible and lightweight. Next, since the MSA cannot control the range of its output very well, the group of non-linear functions is introduced to refine that range through some learnable parameters. As the training of the entire network progresses, the group of non-linear functions gradually learns energy values that are more reasonable for the features.
In summary, the contributions of this work are as follows:

● A simple method based on surround suppression in neuroscience, called the MSA, is proposed to capture feature-wise attention.
● A series of simple equivalent transformations is derived to speed up the energy calculation, turning the MSA into a lightweight form.
● A group of learnable non-linear mapping functions is introduced to refine the energy calculated by the MSA, and the effectiveness of combining it with the MSA is verified.

2 Related work
In this section, some representative works on powerful CNN architectures and attention modules are discussed.
Increasingly powerful architectures  Some works [17−23] show that the representation power of CNNs can be improved by increasing their depth, width, and cardinality. Furthermore, a series of works (e.g., [24−29]) use neural architecture search (NAS) from the field of automated machine learning (AutoML) to search for the best combination of depth, width, and cardinality. These efforts have greatly advanced the development of CNNs, and the existing basic architectures are now very powerful. The attention mechanism is now a popular and effective way to further improve the performance of CNNs without greatly increasing their complexity or parameter count. It aims to improve representation power by telling CNNs where to focus. The aim of this work is to design a lightweight attention module which can be easily integrated into most CNNs to improve their performance without big changes in architecture.
Channel-wise and spatial-wise attention modules  Numerous existing studies on the attention mechanism concentrate on capturing and utilizing channel-wise and spatial-wise attention. Specifically, Squeeze-and-Excitation Networks (SE) [6] use average-pooled features to compute channel-wise attention from a global view. The Convolutional Block Attention Module (CBAM) [7] and Squeeze and Excitation Blocks (scSE) [30] use a convolution module to compute spatial-wise attention corresponding to the channel attention of SE and combine it with their designed channel-wise attention to capture more useful information. Non-Local Attention [8] introduces an approach that uses the relationships between features in the spatial dimension to capture long-range dependencies. Double Attention Networks (A2-Net) [31] introduce a novel relation function for Non-Local Attention. The Dual Attention Network (DANet) [32] shows that Non-Local Attention is a spatial-wise attention module and introduces a corresponding channel-wise attention module. GCNet [33] proposes a simplified Non-Local Attention and integrates it into SE, obtaining a lightweight channel-wise attention module with the ability to capture long-range dependencies. Gated Channel Transformation for Visual Recognition (GCT) [34] modifies SE by replacing the FC layers with an l2 normalization and obtains a more stable and effective channel-wise attention module. However, all of these attention modules can only capture either channel-wise or spatial-wise attention at one time, and they cannot capture feature-wise attention. In contrast, this work aims at designing a feature-wise attention module that can capture and make good use of the unique information of each feature.
Feature-wise attention modules  To capture feature-wise attention, some methods such as RANet [9] and PiCANet [14] propose well-designed encoder-decoder architectures. SimAM [10] introduces the well-known surround suppression from neuroscience and, based on it, designs a simple weight calculation that can be applied to each feature. In contrast, unlike methods such as RANet and PiCANet, the proposed module is more flexible and modularized; and unlike SimAM, it requires neither assumptions nor hyperparameters.

3 Method
In this section, the details of the proposed module are presented.

Fig. 1 Comparisons of channel-wise, spatial-wise, and feature-wise attention modules. In each subfigure, the left side represents the input features and the right side represents the feature weights calculated by different attention modules. Most of the existing attention modules are channel-wise attention modules (a) and spatial-wise attention modules (b). They give the same attention weights to features in the same channel or spatial position, while feature-wise attention modules (c) can give each feature its own attention weight

3.1 Overview of the proposed module
The starting point of this work is to design a novel feature-wise attention module (the differences among channel-wise, spatial-wise, and feature-wise attention modules are illustrated in Fig. 1) that is simple, hyperparameter-free, parameter-free, modularized, plug-and-play, and effective. However, it is obviously difficult to design an attention module that combines all of these advantages. For greater effectiveness, the module introduces some learnable parameters. Specifically, it consists of the MSA and a group of learnable non-linear mapping functions.

The MSA defines an energy function to measure the importance of each feature in order to capture feature-wise attention. Furthermore, the energy function is simplified by a series of equivalent transformations, and its final form is simple, with no hyperparameters and no parameters. However, the energy calculated by the MSA may be too large to remain effective at all times, the gaps between the calculated energies of different categories of features are not reasonable enough, and there is some noise among the features that receives large energy. Therefore, the group of non-linear functions is introduced to address these issues through some learnable parameters. Specifically, each of its non-linear mapping functions consists of a linear function, a non-linear hyperbolic tangent (tanh) function, and another linear function, and each of them corresponds to one channel. Meanwhile, each linear function possesses two parameters, so the group of non-linear functions adds four additional parameters per channel, which is few and acceptable for most CNNs. Combining these two sub-modules forms the final proposed module. An overview of the whole module can be seen in Fig. 2. The details of the MSA and of the group of non-linear functions are presented in turn in the following two subsections.

Fig. 2 Overview of the proposed module. Two 1 × 1 convolution modules, with the number of groups equal to the number of channels, together with a tanh function, are used to implement the group of non-linear functions

3.2 MSA
According to surround suppression, features that show a more significant difference from their surrounding features may be more important. Based on this, the MSA is proposed to measure the difference each feature exhibits from the others and hence the importance of each feature. Because the MSA can be applied to each feature to estimate its individual importance, it possesses the ability to capture feature-wise attention. Specifically, because the Euclidean distance between two features measures their difference, the average sum of the squared Euclidean distances from each feature to the other features measures the difference that each feature presents from the others. On the basis of surround suppression, the difference that each feature shows from the others can be used to estimate the importance of that feature. We therefore define the above measure of difference as an energy function and use its output to represent the importance of each feature. The energy function is as follows:

e_{i,c} = \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} (x_{i,c} - x_{j,c})^2, \qquad (1)

where x_{i,c} and x_{j,c} denote the target feature and the other features in channel c of the feature map X \in \mathbb{R}^{C \times H \times W}, i and j are two indexes ranging from 1 to n, and n = H \times W denotes the total number of features in channel c.
However, computing the energy e_{i,c} directly via Eq. (1) requires complex calculations. To simplify the calculations for efficiency, Eq. (1) is turned into an easily computable form through simple equivalent transformations. Specifically, because (x_{i,c} - x_{i,c})^2 = 0, the index i can be added to the range of j; expanding the square and summing term by term then converts Eq. (1) into the following form:

e_{i,c} = \frac{1}{n-1} \left( n x_{i,c}^2 + \|x_c\|_2^2 - 2 x_{i,c} \sum_{j=1}^{n} x_{j,c} \right), \qquad (2)

where \|x_c\|_2^2 = \sum_{j=1}^{n} x_{j,c}^2 and \sum_{j=1}^{n} x_{j,c} can be calculated once before computing e_{i,c}.
It is worth noting that, for all features of the input feature map X, the calculations of \sum_{j=1}^{n} x_{j,c}^2, \sum_{j=1}^{n} x_{j,c}, and Eq. (2) can be carried out with matrix operations that are processed in parallel, which is not possible with Eq. (1). Meanwhile, it is obvious that Eq. (2) eliminates the direct dependence of each feature on the others. Therefore, by using Eq. (2), the importance of each feature can be computed simply and efficiently.
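To illustrate this parallel computation, the following is a minimal PyTorch sketch (our own illustration, not the authors' released code) that evaluates Eq. (2) for every feature of a batch of feature maps at once; the closing self-check compares one entry against the naive Eq. (1).

```python
import torch

def msa_energy(x: torch.Tensor) -> torch.Tensor:
    """Per-feature MSA energy of Eq. (2) for a feature map x of shape (B, C, H, W)."""
    n = x.shape[2] * x.shape[3]
    sum_sq = (x ** 2).sum(dim=(2, 3), keepdim=True)  # sum_j x_{j,c}^2, shape (B, C, 1, 1)
    sum_x = x.sum(dim=(2, 3), keepdim=True)           # sum_j x_{j,c}, shape (B, C, 1, 1)
    # Eq. (2): all B*C*H*W energies are produced by element-wise tensor operations in parallel.
    return (n * x ** 2 + sum_sq - 2.0 * x * sum_x) / (n - 1)

if __name__ == "__main__":
    x = torch.randn(2, 3, 8, 8)
    e = msa_energy(x)
    # Cross-check one entry against the naive Eq. (1); the j = i term contributes zero.
    b, c, h, w = 0, 0, 3, 5
    diffs = x[b, c, h, w] - x[b, c].flatten()
    naive = (diffs ** 2).sum() / (x[b, c].numel() - 1)
    assert torch.allclose(naive, e[b, c, h, w], atol=1e-5)
```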
3.3 A group of learnable non-linear mapping functions
However, it is found that the MSA can usually extract important features, but it is unavoidable that some noise also receives large energy. What is more, the MSA cannot produce negative numbers and can easily output very large positive results. Such results may cause the subsequent sigmoid function, which is often used to scale the range of the output of attention modules, to produce values very close to 1, which means that the MSA does not always work well. Because batch normalization [35] narrows the range of its input, which can indirectly narrow the range of the output of the MSA, the proposed module is always integrated after batch normalization. Its exact position is shown in Fig. 3 and is the same as that of most attention modules. However, doing only this has little effect.

Fig. 3 The exact position where the proposed module is integrated into a ResBlock. The module is applied to each ResBlock of the ResNet

To go one step further in solving this problem and to learn a more reasonable range, two approaches were considered: (1) instead of using the subsequent sigmoid function, use a normalization to narrow the range; (2) introduce a group of learnable non-linear mapping functions to learn and map the energy to a more reasonable range. The first approach does narrow the range, but the process may lose some useful information, so it cannot be effective. Therefore, the second approach is finally adopted.
The group of non-linear functions should possess the following capabilities: (1) mapping the importance of different categories of features to different and reasonable ranges; (2) expanding or shrinking the gaps in importance between different categories of features; (3) alleviating the importance of noise. For example, for an image of a person as input, the energies calculated by the MSA for the facial features are usually similar and lie within one range. Likewise, the calculated energies of the hand features are normally similar and lie within another range. However, as mentioned above, the values of these ranges may be too large. Meanwhile, for classifying whether the image shows a person or not, the facial features may be more important than those of the hands, and the gap in importance between the two categories of features may be larger (or smaller) than the one calculated by the MSA. In this case, the group of non-linear functions should learn to map the two categories of features to reasonable ranges and to expand (or shrink) the gap to a more reasonable one. What is more, if there is some noise, such as a scar on the person's face that truly disturbs the classification, the group of non-linear functions should learn to map the scar to a smaller energy. To implement these capabilities, each of the non-linear functions uses a combination of a linear function, a non-linear activation function, and another linear function, because such functions are well suited to the above-mentioned abilities and do not introduce an enormous number of parameters. As the training process progresses, suitable parameter values for the linear functions are gradually learned, and finally the group of non-linear functions can learn energy values that are more reasonable for the features.
The tanh function is chosen as the non-linear activation function because its output ranges from −1 to 1, which means that it can stretch the gap wider for the subsequent sigmoid function. The group of functions is then as follows:

e^{r}_{i,c} = \omega_{2,c} H(\omega_{1,c} e_{i,c} + b_{1,c}) + b_{2,c}, \qquad (3)

where H denotes the tanh function, e^{r}_{i,c} denotes the final refined energy, and \omega_{1,c}, b_{1,c}, \omega_{2,c}, b_{2,c} denote the weights and biases of the first and second linear functions corresponding to channel c, respectively.
In summary, the overall calculation process of the proposed module is given in Algorithm 1.
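Putting the two sub-modules together, the following is a minimal PyTorch sketch of the whole module (our reconstruction from Sections 3.2 and 3.3 and Fig. 2, not the authors' released code). The final sigmoid rescaling of the input is an assumption based on the text's reference to "the subsequent sigmoid function" that scales the output of attention modules.

```python
import torch
import torch.nn as nn

class FeatureWiseAttention(nn.Module):
    """Sketch of the proposed module: MSA energy + per-channel non-linear refinement."""

    def __init__(self, channels: int):
        super().__init__()
        # Group of learnable non-linear mapping functions (Eq. (3)): a per-channel linear
        # map, tanh, and another per-channel linear map, implemented as two 1x1 convolutions
        # with groups == channels (i.e., four learnable parameters per channel).
        self.fc1 = nn.Conv2d(channels, channels, kernel_size=1, groups=channels, bias=True)
        self.fc2 = nn.Conv2d(channels, channels, kernel_size=1, groups=channels, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        # MSA: Eq. (2), computed for all features in parallel.
        sum_sq = (x ** 2).sum(dim=(2, 3), keepdim=True)
        sum_x = x.sum(dim=(2, 3), keepdim=True)
        energy = (n * x ** 2 + sum_sq - 2.0 * x * sum_x) / (n - 1)
        # Refinement: e^r = w2 * tanh(w1 * e + b1) + b2, per channel (Eq. (3)).
        refined = self.fc2(torch.tanh(self.fc1(energy)))
        # Assumed scaling: sigmoid attention weights applied to the input features.
        return x * torch.sigmoid(refined)

# Usage (cf. Fig. 3): one instance per residual block, applied to the output of the
# block's batch normalization, e.g. out = FeatureWiseAttention(channels=256)(bn_output).
```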
4 Experiments
In this section, the details and results of our experiments on Cifar10, Cifar100 [2], Cinic10 [36], and Tiny-ImageNet are presented.

4.1 Datasets
Cifar10 has ten categories, and Cifar100 is a fine-grained version of Cifar10 with one hundred categories. Both Cifar10 and Cifar100 consist of 32 × 32 colour images, and both have 50,000 training images and 10,000 validation images.

Cinic10 is compiled as a bridge between Cifar10 and ImageNet; it has ten categories, and each category consists of 90,000 32 × 32 color images for training and 90,000 32 × 32 color images for validation. Tiny-ImageNet is a reduced version of the ImageNet dataset; it contains two hundred categories, each with 500 64 × 64 color images for training and 50 64 × 64 color images for validation.

4.2 Implementation details
The specific experimental configuration is as follows. For data augmentation, the simple data augmentation of the training set used in [37] is followed. Specifically, on Cifar10, Cifar100, and Cinic10, each side of each image is zero-padded by 4 pixels, and a 32 × 32 training image is then randomly cropped from the padded image or its horizontal flip. On Tiny-ImageNet, because 70 = 3 + 3 + 64, each side of each image is zero-padded by 3 pixels instead of 4, and a 64 × 64 training image is then randomly cropped from the padded image or its horizontal flip. For the optimizer, Stochastic Gradient Descent (SGD) is used with a momentum of 0.9 and a weight decay of 0.0001. The learning rate starts at 0.1 and is cyclically annealed towards 0 over 100 epochs. For the hyperparameters of SE [6], CBAM [7], and SimAM [10], the default settings are used without any changes. For the ResNets [19], unlike many works (e.g., the ResNet110 used in SimAM has even fewer parameters than its MobileNetV2 [38]), the ResNets used here are not heavily adjusted; only the first 7 × 7 convolution is changed to a 3 × 3 convolution because of the image sizes.
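For concreteness, a minimal PyTorch/torchvision sketch of this setup is given below. It is our reading of Section 4.2, not the authors' released training script; in particular, the cosine interpretation of "cyclically annealed" and the stride of the replacement 3 × 3 convolution are our assumptions.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# Augmentation for Cifar10/Cifar100/Cinic10: zero-pad 4 px per side,
# random 32x32 crop, and random horizontal flip.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# ResNet18 baseline with only the first 7x7 convolution replaced by a 3x3 one
# (stride 1 is our assumption); insertion of the attention module is omitted here.
model = torchvision.models.resnet18(num_classes=10)
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0.0)
```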
4.3 Image classification on Cifar10, Cifar100, Cinic10 and Tiny-ImageNet
Because deep neural networks have many local extrema and parameter initialization is random, for Cifar10, Cifar100, and Cinic10 we report the average top1 accuracy and standard deviation over five runs for each model in Table 1. It can be seen that our module clearly improves the prediction accuracies of the baseline models and bears comparison with the powerful SE, CBAM, ECA, GCT, and SimAM. Specifically, with ResNet18, the module increases the average accuracy of the baseline by 0.57%, 1.02%, and 0.69% on the Cifar10, Cifar100, and Cinic10 validation sets, respectively, and achieves the best accuracy on all three datasets. With ResNet50, the module increases the average accuracy of the baseline by 1.59%, 5.54%, and 1.67% on the Cifar10, Cifar100, and Cinic10 validation sets, respectively, and reaches performance similar to that of SE, CBAM, ECA, GCT, and SimAM. These results demonstrate the effectiveness of the proposed module.
For the more complex Tiny-ImageNet, we report the parameters, additional parameters over the baseline, FLOPs, top1 accuracy, and top5 accuracy of each model. The experimental results are shown in Table 2. Note that the proposed module adds only a few parameters and FLOPs to the baseline. Because the group of non-linear functions is implemented by two 1 × 1 convolution modules whose number of groups equals the number of channels, and there are currently few optimizations for such convolution modules, the FLOPs of the proposed module are slightly higher than those of the other modules. In contrast, the number of parameters is small and competitive.
Meanwhile, Table 2 also demonstrates that by integrating the proposed module, the ResNets and MobileNetV2 achieve better performance.

Table 1 Top-1 accuracies (%) for ResNet18 and ResNet50 with different attention modules (SE [6], CBAM [7], ECA [39], GCT [34], SimAM [10], and the proposed module) on the Cifar10, Cifar100, and Cinic10 datasets. All results are reported as mean±std over five trials
Model Cifar10 Cifar100 Cinic10
ResNet18 (Baseline) 93.21±0.38 73.37±0.20 84.84±0.45
ResNet18 + SE 93.67±0.19 73.93±0.10 85.49±0.10
ResNet18 + CBAM 93.65±0.13 73.41±0.25 85.27±0.13
ResNet18 + ECA 93.45±0.08 72.11±0.48 84.81±0.13
ResNet18 + GCT 93.15±0.35 73.51±0.31 84.97±0.49
ResNet18 + SimAM 93.57±0.11 74.21±0.21 85.48±0.19
ResNet18 + proposed module 93.78±0.09 74.39±0.09 85.53±0.03
ResNet50 (Baseline) 91.49±0.57 69.58±1.54 83.21±1.06
ResNet50 + SE 92.59±0.40 74.46±0.43 84.84±0.54
ResNet50 + CBAM 93.74±0.21 75.91±0.13 85.46±0.41
ResNet50 + ECA 92.33±1.69 74.73±0.77 85.20±0.51
ResNet50 + GCT 90.84±1.24 69.23±1.37 83.78±0.77
ResNet50 + SimAM 92.55±0.26 71.85±1.59 84.95±0.37
ResNet50 + proposed module 93.08±0.52 75.12±0.49 84.88±0.55

Table 2 Parameters, additional parameters to baseline, FLOPs, Top-1 and Top-5 accuracies (%) for various models with SE [6], CBAM [7], ECA [39], GCT
[34], SimAM [10] and the proposed module on Tiny-ImageNet
Model Parameters + Parameters-to-baseline FLOPs Top-1 Acc/% Top-5 Acc/%
ResNet18 (Baseline) 11.27M 0 2.23G 65.12 84.23
ResNet18 + SE [6] 11.36M 0.0870M 2.23G 66.38 85.44
ResNet18 + CBAM [7] 11.36M 0.0899M 2.23G 66.04 85.05
ResNet18 + ECA [39] 11.27M 24 2.23G 64.94 84.60
ResNet18 + GCT [34] 11.28M 0.0058M 2.23G 66.21 85.15
ResNet18 + SimAM [10] 11.27M 0 2.23G 65.48 84.37
ResNet18 + proposed module 11.28M 0.0077M 2.23G 65.90 85.00
ResNet34 (Baseline) 21.38M 0 4.65G 66.73 85.29
ResNet34 + SE [6] 21.54M 0.1572M 4.65G 67.29 86.16
ResNet34 + CBAM [7] 21.54M 0.1628M 4.65G 67.14 85.80
ResNet34 + ECA [39] 21.38M 48 4.65G 66.22 85.35
ResNet34 + GCT [34] 21.39M 0.0113M 4.65G 67.37 85.92
ResNet34 + SimAM [10] 21.38M 0 4.65G 67.29 85.87
ResNet34 + proposed module 21.39M 0.0151M 4.65G 67.49 85.86
ResNet50 (Baseline) 23.91M 0 5.22G 68.19 86.66
ResNet50 + SE [6] 26.43M 2.5149M 5.23G 69.68 87.83
ResNet50 + CBAM [7] 26.44M 2.5326M 5.23G 69.30 87.81
ResNet50 + ECA [39] 23.91M 48 5.23G 68.63 86.60
ResNet50 + GCT [34] 21.96M 0.0453M 5.22G 69.14 87.13
ResNet50 + SimAM [10] 23.91M 0 5.22G 68.93 87.27
ResNet50 + proposed module 23.97M 0.0604M 5.25G 69.89 87.72
ResNet101 (Baseline) 42.90M 0 10.08G 69.92 87.58
ResNet101 + SE [6] 47.65M 4.7431M 10.10G 71.06 88.53
ResNet101 + CBAM [7] 47.68M 4.7810M 10.09G 70.48 88.09
ResNet101 + ECA [39] 42.90M 99 10.09G 69.52 87.61
ResNet101 + GCT [34] 43.00M 0.0975M 10.08G 71.02 88.33
ResNet101 + SimAM [10] 42.90M 0 10.08G 69.79 87.78
ResNet101 + proposed module 43.03M 0.1300M 10.13G 70.46 88.46
MobileNetV2 (Baseline) 2.54M 0 0.38G 62.65 84.47
MobileNetV2 + SE [6] 2.57M 0.0284M 0.38G 62.39 84.17
MobileNetV2 + CBAM [7] 2.57M 0.0317M 0.38G 62.39 83.89
MobileNetV2 + ECA [39] 2.54M 51 0.38G 61.74 83.50
MobileNetV2 + GCT [34] 2.54M 0.0045M 0.38G 63.55 85.09
MobileNetV2 + SimAM [10] 2.54M 0 0.38G 63.50 84.92
MobileNetV2 + proposed module 2.55M 0.0060M 0.38G 63.62 85.14
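As a cross-check of the "+ Parameters-to-baseline" column (a back-of-the-envelope calculation on our part, assuming one instance of the proposed module per residual block at that block's output width, as suggested by Fig. 3 and the four parameters per channel noted in Section 3.1):

4 \times (2 \cdot 64 + 2 \cdot 128 + 2 \cdot 256 + 2 \cdot 512) = 7680 \approx 0.0077\,\mathrm{M} \quad (\text{ResNet18}),
4 \times (3 \cdot 256 + 4 \cdot 512 + 6 \cdot 1024 + 3 \cdot 2048) = 60416 \approx 0.0604\,\mathrm{M} \quad (\text{ResNet50}),

which matches the additional parameters reported for the proposed module in the table.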

Specifically, with ResNet18, ResNet34, ResNet50, ResNet101, and MobileNetV2, the proposed module increases the top1 and top5 accuracy by 0.78% and 0.77%, 0.76% and 0.57%, 1.70% and 1.06%, 0.54% and 0.88%, and 0.97% and 0.67%, respectively. Moreover, compared with other attention modules, the proposed module is also very competitive (e.g., with ResNet34, ResNet50, and MobileNetV2, the proposed module achieves the best top1 accuracies).
The proposed module is based on very simple operations rather than a stack of modules such as convolutions and global average pooling, and it can capture and make rational use of feature-wise attention. These are the reasons why the module has few parameters yet delivers strong performance.

4.4 Analysis of the MSA
To explore the MSA in more depth, we plot intermediate results from a ResNet50 that integrates the MSA in Fig. 4. Combining these visualization results with Visualizing and Understanding Convolutional Networks [40], it can be considered that the front layers of the network mainly extract edge features while the back layers primarily extract features such as texture. From Fig. 4 we can also see that the MSA placed after the front layers is good at noticing edge features, which are indeed the important features for those layers. Meanwhile, we can observe that the MSA also picks up some noise that should not be focused on. As the back layers are more difficult to interpret, we do not show or analyse their visualization results here. To a certain extent, these observations empirically support the effectiveness of the MSA and the necessity of the subsequent group of learnable non-linear mapping functions.

Fig. 4 Visualization results of the intermediate features and the attention weights calculated by the MSA

4.5 Ablation studies on Tiny-ImageNet
In this subsection, the results of several ablation studies are presented in Table 3. It can be observed that both the MSA and the group of non-linear functions slightly improve the performance of the baselines. This empirically shows that the MSA has an impact on capturing feature-wise attention and that some additional parameters can improve the performance of CNNs. Meanwhile, it is worth noting that combining the two sub-modules improves the performance further. Moreover, and unsurprisingly, normalization does not work well, because its process may lose some useful information. These results are good indicators that the two sub-modules work well together and show the reasonableness of the overall design of the proposed module.

4.6 Sigmoid or tanh as the non-linear activation function
To verify the reasonableness of choosing tanh rather than sigmoid as the non-linear activation function, we make several fair comparisons between the two functions.
As shown in Table 4, using sigmoid occasionally gives a very slightly better result, while using tanh is more often the better choice. Specifically, sigmoid is marginally better in a few cases (the top1 accuracy of ResNet18 and the top5 accuracy of ResNet34 and MobileNetV2), whereas tanh is clearly more accurate on the remaining models and metrics. Through these results, it can be confirmed that it is reasonable to choose tanh as the non-linear activation function: its larger output range can stretch the gap wider for the subsequent sigmoid function and may therefore provide greater representation ability.
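As a small worked illustration of this range argument (our own, not from the paper): suppose the first linear map and the activation saturate for the most and the least important features. With tanh, H(·) ∈ [−1, 1], so the refined energy of Eq. (3) can span

e^{r}_{i,c} \in [\, b_{2,c} - |\omega_{2,c}|,\; b_{2,c} + |\omega_{2,c}| \,],

an interval of width 2|\omega_{2,c}|, whereas a sigmoid activation with outputs in [0, 1] only yields an interval of width |\omega_{2,c}| for the same learned \omega_{2,c} and b_{2,c}. The wider pre-sigmoid interval lets the final attention weights separate the most and least important features more strongly without requiring larger weights.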

4.7 Visualization with Grad-CAM
To better analyse the proposed module, Grad-CAM [41] is used to visualize the features learned by different attention modules. The visualization results are illustrated in Fig. 5. It can be observed that the proposed module helps capture the labelled objects and the global features well. Therefore, it is reasonable to conjecture that the proposed module captures important feature-wise attention and that these important features are well utilized.
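For reference, the following is a minimal Grad-CAM sketch (our own reimplementation of the method of [41], not the script used to produce Fig. 5), showing how such class-activation maps are obtained from a trained model.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """image: tensor of shape (1, 3, H, W); target_layer: e.g. the last conv stage of a ResNet50."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        logits = model(image)
        if class_idx is None:
            class_idx = int(logits.argmax(dim=1))
        model.zero_grad()
        logits[0, class_idx].backward()
        acts, grads = activations[0], gradients[0]                 # (1, C, h, w)
        weights = grads.mean(dim=(2, 3), keepdim=True)             # global-average-pooled gradients
        cam = F.relu((weights * acts).sum(dim=1, keepdim=True))    # weighted sum over channels
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized to [0, 1]
    finally:
        h1.remove()
        h2.remove()
```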

5 Conclusion
In this paper, inspired by surround suppression and a series of existing attention modules, a feature-wise attention module was proposed.
Table 3 Ablation studies on Tiny-ImageNet
Model Top-1 Acc/% Top-5 Acc/%
ResNet18 (Baseline) 65.12 84.23
ResNet18 + MSA (With normalization) 59.45 80.49
ResNet18 + MSA (With sigmoid) 65.49 84.69
ResNet18 + group of non-linear functions 65.63 84.38
ResNet18 + proposed module 65.90 85.00
ResNet34 (Baseline) 66.73 85.29
ResNet34 + MSA (With normalization) 62.49 82.38
ResNet34 + MSA (With sigmoid) 66.87 85.53
ResNet34 + group of non-linear functions 67.24 85.63
ResNet34 + proposed module 67.49 85.86
ResNet50 (Baseline) 68.19 86.66
ResNet50 + MSA (With normalization) 66.60 86.31
ResNet50 + MSA (With sigmoid) 69.16 87.25
ResNet50 + group of non-linear functions 69.50 87.22
ResNet50 + proposed module 69.89 87.72
ResNet101 (Baseline) 69.92 87.58
ResNet101 + MSA (With normalization) 59.40 81.65
ResNet101 + MSA (With sigmoid) 69.02 87.38
ResNet101 + group of non-linear functions 70.29 87.86
ResNet101 + proposed module 70.46 88.46
MobileNetV2 (Baseline) 62.65 84.47
MobileNetV2 + MSA (With normalization) 52.08 76.87
MobileNetV2 + MSA (With sigmoid) 63.43 84.80
MobileNetV2 + group of non-linear functions 63.11 84.50
MobileNetV2 + proposed module 63.62 85.14

Table 4 Experiments on Tiny-ImageNet for choosing the non-linear activation function of the non-linear mapping functions
Model Top-1 Acc/% Top-5 Acc/%
ResNet18 (Baseline) 65.12 84.23
ResNet18 + Proposed Module (Using sigmoid) 65.91 84.85
ResNet18 + Proposed Module (Using tanh) 65.90 85.00
ResNet34 (Baseline) 66.73 85.29
ResNet34 + Proposed Module (Using sigmoid) 67.20 86.02
ResNet34 + Proposed Module (Using tanh) 67.49 85.86
ResNet50 (Baseline) 68.19 86.66
ResNet50 + proposed module (Using sigmoid) 68.92 86.98
ResNet50 + proposed module (Using tanh) 69.89 87.72
ResNet101 (Baseline) 69.92 87.58
ResNet101 + proposed module (Using sigmoid) 70.31 87.71
ResNet101 + proposed module (Using tanh) 70.46 88.46
MobileNetV2 (Baseline) 62.65 84.47
MobileNetV2 + proposed module (Using sigmoid) 63.14 85.16
MobileNetV2 + proposed module (Using tanh) 63.62 85.14

Fig. 5 Visualization results using Grad-CAM [41]. The visualization results of SE, CBAM, SimAM, and the proposed module integrated into
ResNet50 on the Tiny-ImageNet validation set, respectively

This work focused on how to make good use of the information from each feature, which is critical for deciding both "what" and "where" to focus, and introduced the MSA. Then, to simplify the calculations for efficiency, the MSA was turned into a simpler form. Furthermore, since the MSA cannot control the range of its output very well, a group of learnable non-linear mapping functions was introduced. Finally, extensive experiments were conducted, and the results demonstrate the effectiveness of the whole module.
For future research, Section 4.4 shows that the MSA can extract edge features well, so it is promising to extend the MSA to other computer vision tasks, such as object detection and semantic segmentation, to help capture boundaries.

Acknowledgements  This work was supported by the National Natural Science Fund for Distinguished Young Scholar (No. 62025601).

References

1. Deng J, Dong W, Socher R, Li L J, Li K, Fei-Fei L. ImageNet: a large-scale hierarchical image database. In: Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009, 248–255
2. Krizhevsky A. Learning multiple layers of features from tiny images. Toronto: University of Toronto, 2009
3. Everingham M, van Gool L, Williams C K I, Winn J, Zisserman A. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 2010, 88(2): 303–338
4. Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B. The cityscapes dataset for semantic urban scene understanding. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 3213–3223
5. Lin T Y, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick C L. Microsoft COCO: common objects in context. In: Proceedings of the 13th European Conference on Computer Vision. 2014, 740–755
6. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 7132–7141
7. Woo S, Park J, Lee J Y, Kweon I S. CBAM: convolutional block attention module. In: Proceedings of the 15th European Conference on Computer Vision. 2018, 3–19
8. Wang X, Girshick R, Gupta A, He K. Non-local neural networks. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 7794–7803
9. Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6450–6458
10. Yang L, Zhang R Y, Li L, Xie X. SimAM: a simple, parameter-free attention module for convolutional neural networks. In: Proceedings of the 38th International Conference on Machine Learning. 2021, 11863–11874
11. Wang L, Zhang L, Qi X, Yi Z. Deep attention-based imbalanced image classification. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(8): 3320–3330
12. Chen L, Zhang H, Xiao J, Nie L, Shao J, Liu W, Chua T S. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 6298–6306
13. Carrasco M. Visual attention: the past 25 years. Vision Research, 2011, 51(13): 1484–1525
14. Liu N, Han J, Yang M H. PiCANet: learning pixel-wise contextual attention for saliency detection. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 3089–3098
15. Zhang W, Xiao C. PCAN: 3D attention map learning using contextual information for point cloud based retrieval. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 12428–12437
16. Webb B S, Dhruv N T, Solomon S G, Tailby C, Lennie P. Early and late mechanisms of surround suppression in striate cortex of macaque. Journal of Neuroscience, 2005, 25(50): 11666–11675
17. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014, arXiv preprint arXiv: 1409.1556
18. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition. 2015, 1–9
19. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. 2016, 770–778
20. Zagoruyko S, Komodakis N. Wide residual networks. In: Proceedings of British Machine Vision Conference. 2016, 87.1–87.12
21. Huang G, Liu Z, van der Maaten L, Weinberger K Q. Densely connected convolutional networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 2261–2269
22. Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 1800–1807
23. Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition. 2017, 5987–5995
24. Domhan T, Springenberg J T, Hutter F. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In: Proceedings of the 24th International Conference on Artificial Intelligence. 2015, 3460–3468
25. Ha D, Dai A, Le Q V. Hypernetworks. 2016, arXiv preprint arXiv: 1609.09106
26. Zoph B, Le Q V. Neural architecture search with reinforcement learning. In: Proceedings of the 5th International Conference on Learning Representations. 2017
27. Mendoza H, Klein A, Feurer M, Springenberg J T, Hutter F. Towards automatically-tuned neural networks. In: Proceedings of Workshop on Automatic Machine Learning. 2016, 58–65
28. Bello I, Zoph B, Vasudevan V, Le Q V. Neural optimizer search with reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning. 2017, 459–468
29. Fernando C, Banarse D, Blundell C, Zwols Y, Ha D, Rusu A A, Pritzel A, Wierstra D. PathNet: evolution channels gradient descent in super neural networks. 2017, arXiv preprint arXiv: 1701.08734
30. Roy A G, Navab N, Wachinger C. Recalibrating fully convolutional networks with spatial and channel "squeeze and excitation" blocks. IEEE Transactions on Medical Imaging, 2019, 38(2): 540–549
31. Chen Y, Kalantidis Y, Li J, Yan S, Feng J. A2-Nets: double attention networks. 2018, arXiv preprint arXiv: 1810.11579
32. Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. Dual attention network for scene segmentation. In: Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, 3141–3149
33. Cao Y, Xu J, Lin S, Wei F, Hu H. GCNet: non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of 2019 IEEE/CVF International Conference on Computer Vision Workshop. 2019, 1971–1980
34. Yang Z, Zhu L, Wu Y, Yang Y. Gated channel transformation for visual recognition. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 11791–11800
35. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning. 2015, 448–456
36. Darlow L N, Crowley E J, Antoniou A, Storkey A J. CINIC-10 is not ImageNet or CIFAR-10. 2018, arXiv preprint arXiv: 1810.03505
37. Lee C Y, Xie S, Gallagher P, Zhang Z, Tu Z. Deeply-supervised nets. In: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. 2015, 562–570
38. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L C. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018, 4510–4520
39. Wang Q, Wu B, Zhu P, Li P, Zuo W, Hu Q. ECA-Net: efficient channel attention for deep convolutional neural networks. In: Proceedings of 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, 11531–11539
40. Zeiler M D, Fergus R. Visualizing and understanding convolutional networks. In: Proceedings of the 13th European Conference on Computer Vision. 2014, 818–833
41. Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of 2017 IEEE International Conference on Computer Vision. 2017, 618–626

Shuo Tan is currently pursuing the MS degree at the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. His current research interests include convolutional neural networks and medical image analysis.

Lei Zhang received the BS and MS degrees in mathematics and the PhD degree in computer science from the University of Electronic Science and Technology of China, China in 2002, 2005, and 2008, respectively. She was a Post-Doctoral Research Fellow with the Department of Computer Science and Engineering, Chinese University of Hong Kong, China from 2008 to 2009. She was an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems and an Associate Editor of IEEE Transactions on Cognitive and Developmental Systems. Her current research interests include theory and applications of neural networks based on neocortex computing and big data analysis methods by very deep neural networks.

Xin Shu is currently pursuing the PhD degree with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. His current research interests include neural networks and intelligent medicine.

Zizhou Wang is currently pursuing the PhD degree with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, China. His current research interests include neural networks and medical image analysis.
