
EFFECTIVE FEATURE FUSION NETWORK IN BIFPN FOR SMALL OBJECT DETECTION

Jun Chen1,2,3, HongSheng Mai1,2,3, Linbo Luo4, Xiaoqiang Chen1,2,3, Kangle Wu1,2,3

1 School of Automation, China University of Geosciences
2 Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems
3 Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education
4 School of Mechanical Engineering and Electronic Information, China University of Geosciences

ABSTRACT

In view of the difficulty and low accuracy of small object detection in remote sensing images, this paper proposes a bidirectional cross-scale connection feature fusion network with an information direct connection layer and a shallow information fusion layer. Aiming at the problem that the detection targets in remote sensing images are mainly small and medium-sized targets, we fuse the shallow feature maps with rich spatial information into the bidirectional cross-scale connection feature fusion network instead of directly using the shallow feature maps for regression and classification. While ensuring the model inference speed, the detection accuracy of small objects is improved. At the same time, we use the information direct connection layer to perform feature fusion with the initial information in each iteration of the bidirectional cross-scale connection feature fusion pyramid, to prevent the loss of small object information. Experimental results show that the algorithm proposed in this paper obtains good accuracy and real-time performance on the NWPU VHR-10 dataset.

Index Terms— Remote sensing, deep learning, small object detection, feature fusion

1. INTRODUCTION

With the development of remote sensing technology, object detection datasets based on remote sensing images are becoming more and more abundant. Object detection algorithms based on remote sensing images are increasingly used in military object detection and environmental monitoring, but small object detection remains one of the outstanding difficulties. This article is dedicated to finding an efficient small object detection algorithm that does not rely on high-resolution input images.

For remote sensing image object detection, there are the following difficulties and problems: First, the proportion of small and medium objects in remote sensing images is large, the resolution of small objects in the image is low, and their feature information is sparse. This leads to high false detection and missed detection rates when common object detection algorithms are applied to small objects. Second, in remote sensing images the object scale span is large and multiple scales coexist, which makes traditional object detection algorithms ineffective in small object detection, with a high missed detection rate. Third, the current mainstream object detection algorithms are all based on deep learning, in which the network model obtains rich semantic information through successive convolutional layers before producing object regression and classification results. The feature information of small objects is mostly concentrated in the shallow feature layers with rich spatial resolution information, and it is easily lost during the continuous convolution process. How to merge the information of different layers therefore plays an important role in the effect of small object detection.

In recent years, with the development of deep learning technology, researchers have devoted themselves to designing networks that solve the multi-scale problem in object detection, and a number of multi-scale feature fusion methods have emerged. The proposal of FPN[1] made multi-scale feature fusion technology widely used in computer vision tasks. Since then, works such as PANet[2], NAS-FPN[3], DFPN[4], RefineDet[6], DSSD[7], MDSSD[8] and other studies[11][12][13][5][9][10] have developed cross-scale feature fusion based on the FPN network structure. BiFPN[14] proposes a simple and efficient weighted bidirectional feature pyramid network, which introduces learnable weights to learn the importance of different input feature layers and can be applied repeatedly as a module. The advantage of EfficientDet[14] is that it uses fewer parameters and floating point operations (FLOPs) to achieve better performance than other object detectors, and its BiFPN has excellent cross-scale feature fusion capabilities. However, experiments show that its detection effect on small objects is still not good, so we have made targeted improvements to the network structure of BiFPN.

This work was supported by the National Natural Science Foundation of China nos. 62073304, 41977242 and 61973283.


This paper proposes a remote sensing image small object detection algorithm built on a bidirectional cross-scale connection feature fusion network with an information direct connection layer and a shallow information fusion layer. Our contribution is mainly in three aspects: First, we indirectly integrate the shallow feature maps with rich spatial information into BiFPN instead of directly using the shallow feature maps for regression and classification, which preserves the model inference speed while improving small object detection accuracy. Second, we use deconvolution to adjust the resolution of the feature maps, which improves the detection effect of the network. Third, we design an information direct connection layer: aiming at the problem that small object features are easily lost in the convolution process, this layer performs a one-time feature fusion with the original information in each iteration of the bidirectional cross-scale connection feature fusion pyramid, preventing the loss of small object feature information. In the experiments, this paper uses the NWPU VHR-10 dataset [15]. The experimental results confirm that the proposed method can better solve the problem of small object detection in remote sensing images.

2. METHOD

The algorithm proposed in this paper takes EfficientDet[14] as its theoretical basis and adds a shallow information fusion layer and an information direct connection layer for small target detection. Fig.1 is a schematic diagram of the network structure designed in this article.

2.1. Problem formulation

The purpose of multi-scale feature fusion is to fuse feature information from feature maps of different resolutions. Given a set of multi-scale features $P^{in} = (P_1^{in}, P_2^{in}, \ldots)$, where $P_i^{in}$ represents the feature at level $i$. As shown in Fig.1, we use EfficientNet[16] as the backbone network, and the feature fusion network uses input features from levels 2-7, $P^{in} = (P_2^{in}, \ldots, P_7^{in})$, where $P_i^{in}$ represents a feature level with a resolution of $1/2^i$ of the input image. For example, if the input image is $512 \times 512$, then $P_2^{in}$ represents the feature layer with a resolution of $128 \times 128$ ($512/2^2 = 128$), and $P_7^{in}$ represents the input feature layer with a resolution of $4 \times 4$ ($512/2^7 = 4$). The traditional FPN[1] multi-scale fusion method is:

$$
\begin{aligned}
P_7^{out} &= Conv\left(P_7^{in}\right)\\
P_6^{out} &= Conv\left(P_6^{in} + Resize\left(P_7^{out}\right)\right)\\
&\ \ \vdots\\
P_2^{out} &= Conv\left(P_2^{in} + Resize\left(P_3^{out}\right)\right)
\end{aligned}
\tag{1}
$$

where $Resize$ is usually an up-sampling or down-sampling operation for resolution matching, and $Conv$ is usually a convolution operation for feature processing.
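To make Eq. (1) concrete, the following PyTorch sketch implements the traditional FPN top-down pathway. The per-level 3x3 convolutions, the channel count, and the nearest-neighbor Resize are illustrative assumptions, not the exact configuration of this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down fusion of Eq. (1): P_i_out = Conv(P_i_in + Resize(P_{i+1}_out))."""

    def __init__(self, channels=64, num_levels=6):
        super().__init__()
        # One Conv per level for feature processing (the Conv in Eq. (1)).
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_levels)]
        )

    def forward(self, feats):
        # feats: [P2_in, ..., P7_in], ordered from high to low resolution.
        outs = [None] * len(feats)
        outs[-1] = self.convs[-1](feats[-1])  # P7_out = Conv(P7_in)
        for i in range(len(feats) - 2, -1, -1):
            # Resize: nearest-neighbor up-sampling to match the finer level.
            up = F.interpolate(outs[i + 1], size=feats[i].shape[-2:], mode="nearest")
            outs[i] = self.convs[i](feats[i] + up)
        return outs

# For a 512x512 input, levels P2..P7 have resolutions 128, 64, 32, 16, 8, 4.
feats = [torch.randn(1, 64, 512 // 2**i, 512 // 2**i) for i in range(2, 8)]
outs = SimpleFPN()(feats)  # six fused maps, same shapes as the inputs
```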
2.2. Multi-scale fusion

We use a weighted fusion method to fuse feature layers of different resolutions: a weight is added to each input, and the network learns to adjust the fusion weight of the different inputs. The fusion formula is:

$$
O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i \tag{2}
$$

ReLU is applied to each $w_i$ to ensure that $w_i \geq 0$, and $\epsilon = 0.0001$. As a specific example, as shown in Fig.1, the dashed box is the basic unit of the feature fusion pyramid, and the entire network is composed of repeated applications of this unit. For intuitive display, we take the first iteration as an example, because the $P^{out}$ nodes of the first iteration do not need to fuse the directly connected layers. As shown in Fig.2, the $P_3$ layer merges the features of the upper and lower layers as:

$$
\begin{aligned}
P_3^{td} &= Conv\left(\frac{w_1 \cdot P_3^{in} + w_2 \cdot Deconv\left(P_4^{in}\right)}{w_1 + w_2 + \epsilon}\right)\\
P_3^{out} &= Conv\left(\frac{w_1' \cdot P_3^{in} + w_2' \cdot P_3^{td} + w_3' \cdot Resize\left(P_2^{out}\right)}{w_1' + w_2' + w_3' + \epsilon}\right)\\
P_3^{out} &= Swish\left(P_3^{out}\right)
\end{aligned}
\tag{3}
$$

where $P_3^{td}$ is the intermediate feature of the third layer, $P_3^{out}$ is the output feature of the third layer, and $P_4^{in}$ is matched to the resolution of the $P_3$ layer through deconvolution. $P_3^{out}$ is activated by the Swish[17] activation function after fusion. We use depthwise separable convolution[18] for feature fusion and add batch normalization after each convolution.
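Below is a minimal sketch of the weighted fusion of Eq. (2) and the $P_3^{td}$ node of Eq. (3): ReLU keeps the learnable scalar weights non-negative, $\epsilon = 0.0001$ stabilizes the normalization, and a depthwise separable convolution with batch normalization follows the fusion. The class names and the exact ConvTranspose2d configuration standing in for Deconv are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion of Eq. (2): O = sum_i w_i / (eps + sum_j w_j) * I_i."""

    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.w)             # ensure w_i >= 0
        w = w / (self.eps + w.sum())   # normalize the weights
        return sum(wi * x for wi, x in zip(w, inputs))

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable conv + batch norm, used after each fusion (Sec. 2.2)."""

    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return self.bn(self.pointwise(self.depthwise(x)))

# Following the pattern of Eq. (3): the transposed convolution stands in for
# Deconv and doubles P4's resolution to match P3; SiLU is PyTorch's Swish.
channels = 64
deconv = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
fuse, conv = WeightedFusion(2), DepthwiseSeparableConv(channels)
p3_in, p4_in = torch.randn(1, channels, 64, 64), torch.randn(1, channels, 32, 32)
p3_td = F.silu(conv(fuse([p3_in, deconv(p4_in)])))
```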
2.3. Direct connection layer

Aiming at the problem that small objects are easily lost in the convolution process, we design an information direct connection layer in the feature fusion network. The information direct connection layer performs feature fusion with the initial input feature map at every iterative output node of the bidirectional cross-scale connection feature fusion pyramid, so that the small object feature information remains intact throughout the feature extraction process. As shown by the orange straight lines in Fig.1, we save the input layers $P_i^{in}$ of the first feature fusion network iteration as $P_i^{s}$. Starting from the second feature fusion network iteration, the $P_i^{in}$ of each iteration is fused with $P_i^{s}$. As an example, we choose the last iteration of the $P_3$ layer in Fig.1, as shown in Fig.2. The $P_3^{out}$ node fusion formula in Fig.2 is:

$$
P_3^{out} = Conv\left(\frac{w_1' \cdot P_3^{in} + w_2' \cdot P_3^{td} + w_3' \cdot Resize\left(P_2^{out}\right) + w_4' \cdot P_3^{s}}{w_1' + w_2' + w_3' + w_4' + \epsilon}\tag{4}\right)
$$
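A minimal sketch of the output-node fusion in Eq. (4): the node takes four weighted inputs, including the saved first-iteration input $P_3^{s}$, and then applies the depthwise separable convolution and Swish described in Sec. 2.2. The module name and kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectConnectionOutputNode(nn.Module):
    """P3_out node of Eq. (4): fuses P3_in, P3_td, Resize(P2_out) and saved P3_s."""

    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(4))  # w'_1 .. w'_4
        self.eps = eps
        # depthwise separable conv + batch norm (Sec. 2.2)
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, p3_in, p3_td, p2_out, p3_s):
        w = F.relu(self.w)
        w = w / (self.eps + w.sum())
        # Resize: down-sample P2_out to the P3 resolution.
        p2_resized = F.interpolate(p2_out, size=p3_in.shape[-2:], mode="nearest")
        fused = w[0] * p3_in + w[1] * p3_td + w[2] * p2_resized + w[3] * p3_s
        return F.silu(self.conv(fused))  # Swish activation
```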

2.4. Shallow information fusion layer

The high-resolution spatial information of the shallow feature layers also plays an important role in the detection of small objects and has an impact that cannot be ignored. We add the P2 feature layer, which is downsampled 4 times relative to the input and has rich spatial information, to the bidirectional cross-scale connection feature fusion network for indirect fusion instead of feeding it directly into the final regression and classification detector. This preserves the model inference speed while improving the detection accuracy of small objects, so that the feature fusion network can make maximal use of the different information in each feature layer.

The red dashed box in Fig.1 shows the shallow information fusion layer we designed. In the feature fusion network, we fuse the information of the P2 feature layer in each of the first few iterations, but in the final iteration the P2 feature layer is fused with the P3 feature layer instead of directly entering the regression and classification network. The reason is that our experiments show that directly using the P2 feature layer for regression and classification prediction greatly increases the number of a priori boxes that the network needs to predict, which has a considerable impact on the model inference speed. Experiments show that introducing the P2 feature layer in this way improves the accuracy of small object detection; a sketch of the idea follows.
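The sketch below illustrates the shallow information fusion idea: P2 participates in the fusion iterations, but its final output is down-sampled and merged into P3 rather than handed to the prediction heads. The pooling choice and module name are assumptions, not the paper's verified implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowInfoFusion(nn.Module):
    """Merges the final P2 output into P3 so P2 never reaches the detection heads."""

    def __init__(self, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, p2_out, p3_out):
        w = F.relu(self.w)
        w = w / (self.eps + w.sum())
        p2_down = F.max_pool2d(p2_out, kernel_size=2)  # match the P3 resolution
        return self.conv(w[0] * p2_down + w[1] * p3_out)

# Only the fused P3..P7 maps go to the class/box prediction nets, avoiding the
# large number of prior boxes a 128x128 P2 map would otherwise require.
```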


[Fig. 1: the backbone takes the input image and produces feature levels P1/2 through P7/128; the outputs of the feature fusion network feed the class prediction net and the box prediction net.]

Fig. 1. Structure schematic diagram of the network.

[Fig. 2: the P2-P4 nodes with their P^in, P^td, and P^out features and the saved P^s direct connections.]

Fig. 2. P3 layer fusion method of upper and lower layer features and direct connection layer.

3. EXPERIMENTS

3.1. Implementation details

In the training process, only the labeled positive sample images in the NWPU VHR-10 dataset[15] are used; the 650 positive sample images are divided into a training set of 550 and a test set of 100.

In order to obtain a good training effect and training speed, our ablation experiments are all carried out on the basis of EfficientDet-D0[14]. The input image size is 512*512, the backbone network is EfficientNet-B0[16], the number of channels in the convolutional layers of the feature fusion network is 64, the number of iterations of the feature fusion network unit is 3, and the classification and regression network depth is 3 layers. In the final algorithm comparison, we deepen the backbone network and increase the number of iterations and channels of the feature fusion network: keeping the image input size at 512*512, the backbone network is EfficientNet-B3, the number of channels in the convolutional layers of the feature fusion network is 160, the number of iterations of the feature fusion network unit is 6, and the classification and regression network depth is 4 layers.

All experiments in this paper are performed on an NVIDIA GeForce GTX 1080Ti with 11G graphics memory.
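For reference, the two settings above can be kept side by side as plain data; the field names below are hypothetical, and only the values come from the text.

```python
# The two network configurations of Sec. 3.1, restated as plain data.
ABLATION_CONFIG = {
    "backbone": "EfficientNet-B0",
    "input_size": 512,
    "fusion_channels": 64,
    "fusion_iterations": 3,
    "head_depth": 3,
}

FINAL_CONFIG = {
    "backbone": "EfficientNet-B3",
    "input_size": 512,
    "fusion_channels": 160,
    "fusion_iterations": 6,
    "head_depth": 4,
}
```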
3.2. Experimental results

We design a deconvolution ablation experiment: in the BiFPN on the EfficientNet-B0 backbone, the feature map upsampling operation is replaced by deconvolution. The experiment shows that this change improves mAP by 0.37% over EfficientDet-D0. The ablation results are shown in Table 1.
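A sketch of the swap tested in this ablation: the parameter-free up-sampling used for Resize is replaced by a learnable transposed convolution. The kernel/stride choice is an assumption that doubles the spatial resolution, as the fusion requires.

```python
import torch.nn as nn

def make_upsample(channels: int, use_deconv: bool) -> nn.Module:
    """Return the Resize operator: learnable deconvolution vs. plain up-sampling."""
    if use_deconv:
        # 2x learnable up-sampling (the ablation's deconvolution)
        return nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
    # the baseline's parameter-free 2x nearest-neighbor up-sampling
    return nn.Upsample(scale_factor=2, mode="nearest")
```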


Table 1. Comparative results of EfficientDet-D0 and Ours-deconv on the NWPU VHR-10 dataset.

Method | airplane | baseball diamond | basketball court | bridge | ground track field | harbor | ship | storage tank | tennis court | vehicle | mAP
EfficientDet-D0 (512) | 97.66% | 96.52% | 75.43% | 58.60% | 100.00% | 89.36% | 74.35% | 67.70% | 90.80% | 39.45% | 78.99%
Ours-deconv (512) | 98.06% | 95.66% | 88.10% | 64.02% | 100.00% | 85.33% | 72.67% | 64.26% | 86.69% | 38.81% | 79.36%

Table 2. Comparative results of EfficientDet-D0 and Ours-D0 on the NWPU VHR-10 dataset.

Method | airplane | baseball diamond | basketball court | bridge | ground track field | harbor | ship | storage tank | tennis court | vehicle | mAP
EfficientDet-D0 (512) | 97.66% | 96.52% | 75.43% | 58.60% | 100.00% | 89.36% | 74.35% | 67.70% | 90.80% | 39.45% | 78.99%
Ours-D0 (512) | 97.70% | 95.63% | 89.90% | 61.42% | 99.78% | 88.11% | 70.44% | 76.39% | 90.29% | 42.72% | 81.24%

Table 3. Performance comparison of algorithms.

Method | airplane | baseball diamond | basketball court | bridge | ground track field | harbor | ship | storage tank | tennis court | vehicle | mAP | FPS
YOLOv2 (512) | 73.32% | 88.91% | 27.60% | 51.87% | 98.85% | 75.47% | 74.95% | 34.42% | 29.19% | 51.37% | 60.56% | 30.83
YOLOv3 (512) | 97.17% | 94.06% | 92.93% | 52.00% | 99.03% | 72.76% | 81.62% | 71.58% | 88.36% | 74.00% | 82.30% | 15.82
YOLOv4 (512) | 97.60% | 95.40% | 93.30% | 58.60% | 99.00% | 86.50% | 82.20% | 74.50% | 89.50% | 77.80% | 85.40% | 10.60
SSD (512) | 90.40% | 89.90% | 80.60% | 67.20% | 98.30% | 73.40% | 60.90% | 79.80% | 82.60% | 52.10% | 77.50% | 15.30
DSSD (513) | 96.21% | 97.82% | 84.93% | 97.13% | 98.18% | 61.13% | 91.70% | 29.13% | 89.22% | 76.61% | 82.39% | 3.77
EfficientDet-D0 (512) | 97.66% | 96.52% | 75.43% | 58.60% | 100.00% | 89.36% | 74.35% | 67.70% | 90.80% | 39.45% | 78.99% | 17.10
Ours (512) | 98.89% | 97.72% | 98.10% | 68.82% | 100.00% | 94.22% | 85.70% | 88.69% | 97.75% | 57.06% | 88.69% | 11.06

We also introduce a shallow information fusion layer and a direct information connection layer, and perform ablation experiments on EfficientDet-D0 while keeping the other parameters unchanged. Experiments show that our method improves mAP by 2.25% compared with EfficientDet-D0. The results of this ablation experiment are shown in Table 2.

For the final comparison, we deepen the backbone network and increase the number of iterations and channels of the feature fusion network: keeping the image input size at 512*512, the backbone is EfficientNet-B3, the number of channels in the convolutional layers of the feature fusion network is 160, the number of iterations of the feature fusion network unit is 6, and the classification and regression network depth is 4 layers.

The quantitative results of the different methods on the NWPU VHR-10 dataset, including the AP values of the 10 categories and the mAP, are shown in Table 3. The experiments show that, compared with the other algorithms tested on the NWPU VHR-10 dataset, the algorithm proposed in this paper delivers better performance. Fig.3 shows part of the detection results obtained by our algorithm on NWPU VHR-10; the missed detection rate and false detection rate are low.

Fig. 3. Results display.

4. CONCLUSION

In this paper, we design a shallow feature fusion layer and an information direct connection layer. The shallow feature fusion layer fuses the high-resolution spatial information of the shallow feature layers, and the information direct connection layer effectively retains the feature information of small objects. At the cost of an increase in the number of network parameters, the detection effect on small objects in remote sensing images is improved. Experimental results show that, compared with other object detection algorithms, the method in this paper has better performance in remote sensing small target detection.

At the same time, due to the high graphics memory usage of EfficientDet[14] during training, we could only maintain batchsize = 3 when training our final method, which limits the training effect. We did not perform extensive training hyperparameter optimization, and we believe the final network performance still has considerable room for improvement.


5. REFERENCES

[1] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.

[2] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.

[3] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le, "NAS-FPN: Learning scalable feature pyramid architecture for object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7036–7045.

[4] Zhenwen Liang, Jie Shao, Dongyang Zhang, and Lianli Gao, "Small object detection using deep feature pyramid networks," in Pacific Rim Conference on Multimedia. Springer, 2018, pp. 554–564.

[5] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár, "Panoptic feature pyramid networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.

[6] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li, "Single-shot refinement neural network for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4203–4212.

[7] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C Berg, "DSSD: Deconvolutional single shot detector," arXiv preprint arXiv:1701.06659, 2017.

[8] Lisha Cui, Rui Ma, Pei Lv, Xiaoheng Jiang, Zhimin Gao, Bing Zhou, and Mingliang Xu, "MDSSD: Multi-scale deconvolutional single shot detector for small objects," arXiv preprint arXiv:1805.07009, 2018.

[9] Zuoxin Li and Fuqiang Zhou, "FSSD: Feature fusion single shot multibox detector," arXiv preprint arXiv:1712.00960, 2017.

[10] Guimei Cao, Xuemei Xie, Wenzhe Yang, Quan Liao, Guangming Shi, and Jinjian Wu, "Feature-fused SSD: Fast detection for small objects," in Ninth International Conference on Graphic and Image Processing (ICGIP 2017). International Society for Optics and Photonics, 2018, vol. 10615, p. 106151E.

[11] Tao Kong, Fuchun Sun, Chuanqi Tan, Huaping Liu, and Wenbing Huang, "Deep feature pyramid reconfiguration for object detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 169–185.

[12] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling, "M2Det: A single-shot object detector based on multi-level feature pyramid network," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 9259–9266.

[13] Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko, "Parallel feature pyramid network for object detection," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 234–250.

[14] Mingxing Tan, Ruoming Pang, and Quoc V Le, "EfficientDet: Scalable and efficient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10781–10790.

[15] Gong Cheng, Junwei Han, Peicheng Zhou, and Lei Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 98, pp. 119–132, 2014.

[16] Mingxing Tan and Quoc Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.

[17] Prajit Ramachandran, Barret Zoph, and Quoc V Le, "Searching for activation functions," arXiv preprint arXiv:1710.05941, 2017.

[18] François Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.


