
The Visual Computer

https://doi.org/10.1007/s00371-024-03284-8

RESEARCH

I-YOLO: a novel single-stage framework for small object detection


Kang Tong1 · Yiquan Wu1

Accepted: 20 January 2024


© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024

Abstract
Small object detection is a challenging task in computer vision. We argue that the large performance gap between small object detectors and normal-sized object detectors stems from two aspects: the small object dataset and the small object itself. In terms of datasets, we build a large-scale, high-image-resolution dataset dubbed Small-PCB, in order to promote detection in the semiconductor industry. For the small object itself, we utilize multi-scale feature learning and a feature fusion strategy to help detect objects. More concretely, we devise two novel components to predict small objects better: the re-parameterized module with channel shuffle (RMCS) and the multi-scale feature enhanced convolution (MFEC). MFEC splits the input channels into several parts, applies convolutions of different sizes to each part, and adopts point-by-point convolution to fuse the individual channel features. RMCS uses not only structural re-parameterization but also channel shuffle. Channel shuffle can be seen as a fusion of channel features: it strengthens the feature information interaction between different channel groups, which brings more informative feature clues. Based on the RMCS and the MFEC, we introduce OIU-RMCS and M-MFEC, respectively. Finally, we build our I-YOLO by integrating these two components into a YOLO-based detector. Extensive qualitative and quantitative experimental results indicate that our proposed I-YOLO achieves state-of-the-art performance on the popular AI-TODv2 and Small-PCB datasets.

Keywords Small object detection · Channel shuffle · Multi-scale feature learning · Feature fusion · Small-PCB · YOLO

1 Introduction

Object detection [1–9] is an essential problem in computer vision, which aims at discovering the locations of instances and providing their corresponding categories. As a sub-field of object detection, small object detection focuses on predicting objects of small size. It has great theoretical and practical significance in remote sensing [10–12], surveillance [13], defect detection in industrial scenarios [14, 15], etc. Research on small object detection [16–22] proceeds at a relatively slow pace compared with generic object detection, and a large performance gap remains between normal-sized object detectors and small object detectors. We hold that this phenomenon originates from the small object dataset and the small object itself.

For small object detection, there are few widely approved datasets. Many researchers have to train and test their algorithms on subsets of large-scale datasets [23, 24] that satisfy the definition of small objects. Although some small object detection datasets have been proposed [25–27], these datasets are either small in scale (TinyPerson [27] has only 1610 images) or only applicable to daily life (e.g. SOD [26] and SDOD-MT [25]). In addition, existing industrial datasets (such as Augmented PCB [28] and DeepPCB [29]) have low spatial resolution, which affects the detection of small objects. To sum up, the currently available datasets cannot meet the training needs of small object detection models in industrial scenarios. Thus, we construct a large-scale dataset named Small-PCB to promote small object detection in the semiconductor industry, where detecting small defects at an early stage is crucial to improving yield and saving costs.

Small objects lack the information necessary to distinguish them from the background or from similar categories. Therefore, learning proper representations of small objects is an extremely hard task.

Corresponding author: Yiquan Wu (nuaavision@163.com)

1 College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China


Fig. 1 The framework of our improved YOLO detector (I-YOLO)

Fig. 2 The structure of OIU-RMCS. It mainly includes RMCS and RepVGG. As shown in RMCS, RepVGG and RepConv are adopted at the
training stage (a) and inference stage (b), respectively

To capture the features of small objects more efficiently, a series of works has been proposed, covering super-resolution schemes [30–33], optimization of the loss function [34–37], contextual clues [38–40], multi-scale feature learning [41–46] and feature fusion [47–49]. Super-resolution schemes recover high-resolution features from the corresponding low-resolution ones. Here the generative adversarial network (GAN) [50], which consists of a generator and a discriminator contesting with each other in a minimax optimization framework, can be employed to reconstruct high-resolution images. The loss function also plays an important role in training a good model, so small object detection can benefit from its optimization. Moreover, since small objects carry limited information, the region surrounding a small object can provide the contextual clues needed to assist its detection.

In addition to the above three methods, multi-scale feature learning and feature fusion are also significant means of dealing with small object detection: the former learns to capture features at multiple scales, while the latter applies strategies to fuse features. In this paper, we propose the re-parameterized module with channel shuffle (RMCS) and the multi-scale feature enhanced convolution (MFEC). To some extent, MFEC and RMCS belong to multi-scale feature learning and feature fusion, respectively. MFEC splits the input channels into several parts, applies convolutions of different sizes to each part, and uses point-by-point convolution to fuse the individual channel features. RMCS adopts not only structural re-parameterization (RepVGG and RepConv are used in the training and inference phases, respectively) but also channel shuffle. Channel shuffle strengthens the feature information interaction between different channel groups, which brings more informative feature clues; it can be seen as a fusion of channel features.


Based on the RMCS and the MFEC, we introduce the One-time Integration Unit with RMCS (OIU-RMCS) and the Module based on MFEC (M-MFEC), respectively. Finally, we construct our Improved YOLO (I-YOLO) object detector by integrating these two components into a YOLO-based detector.

The main contributions of this paper are as follows:

1. We design two novel components: the re-parameterized module with channel shuffle (RMCS) and the multi-scale feature enhanced convolution (MFEC). On top of them, we introduce the one-time integration unit with RMCS (OIU-RMCS) and the module based on MFEC (M-MFEC), and we construct the I-YOLO framework by integrating OIU-RMCS and M-MFEC into a YOLO-based detector. Our proposed I-YOLO uses multi-scale feature learning and a feature fusion strategy to obtain accurate detection of small objects. We will release our trained models.

2. We build a large-scale small object detection dataset named Small-PCB, which concentrates on the semiconductor industry. Small-PCB includes 8806 high-resolution printed circuit board (PCB) images with 39,894 annotated instances across 6 categories. This dataset, which we will release, makes up for the lack of small object detection datasets in industrial scenarios and will facilitate future research in this field. The resources are available.¹

3. We evaluate the effectiveness of our method on two datasets, Small-PCB and AI-TODv2. A large number of qualitative and quantitative results indicate that our proposed detector (e.g. I-YOLOm) achieves state-of-the-art detection performance on AI-TODv2 and our constructed Small-PCB.

The principal contributions above underline the novelty of our work. A new dataset named Small-PCB is constructed to make up for the lack of small object datasets in industrial scenarios and to facilitate the early discovery of small defects in the semiconductor industry. In addition, the two proposed components have special designs and are carefully integrated into the YOLO detector. Concretely, MFEC splits the input channels into four parts, applies convolutions of different sizes to each part, and adopts point-by-point convolution to fuse the individual channel features. RMCS uses not only structural re-parameterization but also channel shuffle, which strengthens the feature information interaction between different channel groups and brings more informative feature clues. In other words, MFEC captures effective object features through multi-scale feature learning, while RMCS adopts feature fusion to obtain discriminative object features. By combining these two components, our method achieves state-of-the-art performance on the AI-TODv2 and Small-PCB datasets. The design of these modules is thus novel and contributes to the overall effectiveness of our approach.

The remainder of this work is arranged as follows. Section 2 briefly introduces related work. The proposed method and the experiments are described in Sects. 3 and 4, respectively. Section 5 concludes this paper.

Fig. 3 The structure of M-MFEC

2 Related work

2.1 Datasets for small object detection

Datasets are the foundation of deep learning-based object detectors. However, there are few datasets dedicated to detecting small objects, so many works have to be evaluated on subsets of large-scale datasets (e.g. MS-COCO [23] and WIDER FACE [24]) that satisfy the definition of small objects. Recently, some small object detection datasets have been proposed, including SDOD-MT [25], SOD [26], TinyPerson [27], TT100K [51] and AI-TOD [52]. Specifically, AI-TOD is dedicated to aerial scenes; SDOD-MT and SOD are devoted to small object detection in daily life; TinyPerson and TT100K concentrate on tiny-person and traffic-sign detection, respectively. These datasets either do not involve industrial scenarios or are small in scale (TinyPerson has 1610 images and SOD only 8393 instances), while our Small-PCB contains 8806 PCB images and 39,894 object instances and is crucial for defect detection in the semiconductor industry. In addition, existing industrial datasets (e.g. Augmented PCB [28] and DeepPCB [29]) have low spatial resolution, which hinders the detection of small objects. Compared with Augmented PCB (600 × 600) and DeepPCB (640 × 640), the average resolution of 2613 × 2093 in our Small-PCB presents a clear advantage.

¹ https://github.com/graceveryear/iyolo


Fig. 4 Visualized results between YOLOv8m (left) and our I-YOLOm (right) on the AI-TODv2 dataset

Table 1 Numbers of instances of each category under different divisions of Small-PCB

Division   Missing hole   Mouse bite   Open circuit   Spurious copper   Short    Spur
Trainval   2484           2372         10,824         2392              11,508   2302
Test       498            580          2834           626               2848     626

2.2 Approaches for small object detection

Generally, existing paradigms of small object detection introduce custom components into generic object detectors. These works cover several aspects, including super-resolution schemes [30–33], optimization of the loss function [34–37], contextual clues [38–40], feature fusion [43, 44, 47–49], and multi-scale feature learning [41–46].

Super-resolution schemes often use a generative adversarial network (GAN). A common GAN contains a generator and a discriminator network that contest with each other in a minimax optimization framework. Li et al. [30] show that their perceptual GAN narrows the representation gap between large objects and small ones. For detecting small-scale faces and common small objects, Zhang et al. [31] propose a multi-task GAN whose generator up-samples small blurry images into fine-scale ones and recovers details to realize accurate detection.
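To make the minimax framework concrete, the following is a minimal, hedged sketch of one GAN training step for super-resolving low-resolution patches. The two networks are illustrative stubs, not the architectures of [30] or [31]:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in networks; real generators/discriminators are far deeper.
G = torch.nn.Sequential(torch.nn.Upsample(scale_factor=2),
                        torch.nn.Conv2d(3, 3, 3, padding=1))   # LR patch -> HR patch
D = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1))   # real/fake score map
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def gan_step(lr_patch, hr_patch):
    # Discriminator: push scores of real HR patches up and generated ones down.
    fake = G(lr_patch).detach()
    real_s, fake_s = D(hr_patch), D(fake)
    d_loss = F.binary_cross_entropy_with_logits(real_s, torch.ones_like(real_s)) \
           + F.binary_cross_entropy_with_logits(fake_s, torch.zeros_like(fake_s))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator (non-saturating form): make generated patches look real to D.
    fake_s = D(G(lr_patch))
    g_loss = F.binary_cross_entropy_with_logits(fake_s, torch.ones_like(fake_s))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```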


Table 2 The performance of different methods on the AI-TODv2 validation set

Year   Method               AP     AP50   AP75   APvt   APt    APs    APm
2023   GFL [72]             15.0   33.6   10.7   2.3    12.2   23.4   34.6
2022   SSPNet [57]          13.1   30.3   8.8    0.0    9.7    27.1   37.6
2021   DyHead [73]          14.0   32.0   9.5    1.7    10.9   22.9   37.9
2021   TOOD [74]            18.6   43.0   12.7   3.2    16.8   26.2   38.1
2021   Cascade R-CNN [75]   14.4   32.7   10.6   0.0    9.9    28.3   39.9
2020   ATSS [76]            15.5   36.5   9.6    1.9    12.7   24.6   36.2
2020   RetinaNet [35]       7.4    21.1   3.5    2.5    6.5    13.1   22.9
2019   FASF [77]            14.4   35.3   8.4    3.4    14.4   19.9   24.2
2019   TridentNet [78]      9.7    23.3   6.5    0.0    5.2    20.5   32.7
2017   Faster RCNN [69]     12.4   28.3   8.1    0.0    8.4    26.3   36.2
2023   YOLOv8m              18.2   41.3   13.6   3.9    15.8   26.2   42.0
2023   I-YOLOm              19.1   44.0   13.3   3.7    16.4   27.5   42.3

Bold and underlined fonts indicate the best and second-best results, respectively. (in %)

Table 3 Category-wise AP of different methods on the AI-TODv2 validation set

Method               Airplane   Bridge   Storage tank   Ship   Swimming pool   Vehicle   Person   Wind mill
GFL [72]             10.4       0.0      29.2           26.3   25.7            25.1      3.5      0.0
SSPNet [57]          15.5       3.8      22.8           20.1   23.4            15.8      3.5      0.2
DyHead [73]          11.4       5.8      25.1           22.9   22.7            19.6      3.8      0.4
TOOD [74]            14.6       8.5      35.3           31.5   26.2            27.2      5.1      0.6
Cascade R-CNN [75]   13.8       5.5      22.6           24.5   28.2            17.4      3.4      0.0
ATSS [76]            13.7       3.4      30.2           25.6   24.2            22.5      3.9      0.1
RetinaNet [35]       2.4        7.5      13.0           18.8   2.9             12.3      2.3      0.1
FASF [77]            1.6        8.6      30.5           31.4   14.8            23.2      4.9      0.1
TridentNet [78]      12.2       0.0      17.9           13.5   20.0            11.9      1.9      0.0
Faster RCNN [69]     13.4       2.9      21.7           19.2   24.7            14.4      2.8      0.0
YOLOv8m              21.2       11.9     37.2           30.4   21.7            28.9      6.75     0.804
I-YOLOm              23.3       12.7     38.2           30.7   24.9            29.1      6.99     1.69

Bold and underlined fonts indicate the best and second-best results, respectively. (in %)

Furthermore, QueryDet [53] predicts the potential locations of small objects in low-resolution features and constructs sparse feature maps at these locations using high-resolution features, which accelerates detection.

The optimization of the loss function can also help detect small objects. Liu et al. [34] design a feedback-driven loss function that supervises small objects more efficiently and effectively; by taking the loss distribution clues as a feedback signal, the models can be trained in a balanced manner. To address class imbalance, Zhang et al. [37] assign larger weights to minority real PCB defects and then optimize a cost-sensitive residual CNN by minimizing the weighted cross-entropy loss function.
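As an illustration of this cost-sensitive idea (a sketch, not the exact formulation of CS-ResNet [37]), minority defect classes can simply be given larger weights in the cross-entropy loss; the weights below are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical per-class weights: rarer defect classes get larger weights.
# Order: missing hole, mouse bite, open circuit, spurious copper, short, spur.
class_weights = torch.tensor([2.0, 2.0, 1.0, 2.0, 1.0, 2.0])

criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 6)            # a batch of 8 predictions over 6 classes
targets = torch.randint(0, 6, (8,))   # ground-truth class indices
loss = criterion(logits, targets)     # minority classes contribute more to the loss
```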


Fig. 5 Comparison of visual detection results before and after adding noise in daytime scene

Fig. 6 Comparison of visual detection results before and after adding noise in night scene


Fig. 7 Comparison of loss curves between our methods and the baselines

Table 4 The results of different detectors on the Small-PCB test-set

Method           AP     AP50   AP75   Missing hole   Mouse bite   Open circuit   Spurious copper   Short   Spur
FR with Swin-T   58.2   99.1   62.5   63.5           57.2         57.5           58.5              56.2    56.3
Faster RCNN      57.7   99.1   61.3   63.4           56.6         57.1           56.8              57.6    54.6
RetinaNet        55.6   98.8   55.1   62.8           52.6         55.5           54.8              55.0    53.0
Sparse RCNN      54.5   95.8   55.8   60.2           53.9         55.2           52.0              55.0    50.5
FCOS             62.8   99.4   73.9   68.8           62.1         61.1           63.3              60.7    61.0
I-YOLOn          58.8   97.8   66.3   64.4           57.8         58.2           59.8              58.9    57.0
I-YOLOs          62.9   98.3   73.1   69.0           64.3         60.6           65.6              61.1    60.9
I-YOLOm          65.0   98.8   77.5   70.2           68.0         61.6           68.9              61.7    65.4

The best result is marked in bold. (in %)

Oliva and Torralba [54] argue that the surroundings of a small object provide useful context-based clues for detecting the target object, so contextual clues are of great significance for small objects. Leng et al. [38] build an internal–external network (IENet), which makes use of both the appearance and the context of objects for robust small object detection. Afterwards, Leng et al. [55] devise a context-guided reasoning network (CRNet) to explore the relationships between objects and use easily detected objects to help understand hard ones. Leng et al. [56] also propose a pareto refocusing detection network (PRDet) that distinguishes hard regions from vanilla regions under reverse-attention guidance and refocuses on the hard regions with the assistance of region-specific context. The LocalNet proposed by Yan et al. [40] retains more details at an early stage to enhance the representation of small objects; moreover, its local detail-context module strengthens the semantic clues of the detection layer, so as to recover details missing in the network and exploit the local context within the limited receptive field.

Feature fusion combines features of different levels to promote object detection. FS-SSD [49] uses feature scaling with feature fusion and a de-convolution block to better predict different objects. FA-SSD [39] employs an attention module and feature fusion to integrate features from different layers.


Fig. 8 Comparison of performance curves between our approaches and the baselines

Table 5 Improvements over baselines on the Small-PCB test-set

Baseline   AP           AP50         AP75         MH           MB           OC           SC           SH           SP           Param    Time
YOLOv8n    57.8         97.5         62.7         64.1         56.6         57.2         58.6         58.1         54.7         3.0 M    1.8 ms
I-YOLOn    58.8(+1.0)   97.8(+0.3)   66.3(+3.6)   64.4(+0.3)   57.8(+1.2)   58.2(+1.0)   59.8(+1.2)   58.9(+0.8)   57.0(+2.3)   4.9 M    2.1 ms
YOLOv8s    62.3         98.4         72.3         67.6         62.6         60.1         65.8         60.9         61.6         11.1 M   2.5 ms
I-YOLOs    62.9(+0.6)   98.3(−0.1)   73.1(+0.8)   69.0(+1.4)   64.3(+1.7)   60.6(+0.5)   65.6(−0.2)   61.1(+0.2)   60.9(−0.7)   18.8 M   3.2 ms
YOLOv8m    64.8         98.5         76.1         70.7         65.3         61.2         68.8         61.3         64.9         25.8 M   5.2 ms
I-YOLOm    65.0(+0.2)   98.8(+0.3)   77.5(+1.4)   70.2(−0.5)   68.0(+2.7)   61.6(+0.4)   68.9(+0.1)   61.7(+0.4)   65.4(+0.5)   38.7 M   6.7 ms

We also report the performance gains of the proposed method compared with the baseline. (in %)

FPN [43] fuses different-level features through a top-down architecture and lateral connections, so as to detect small objects better. ABFPN [44] adopts atrous convolution to reinforce multi-scale feature fusion and thereby facilitate small PCB defect detection. SSPNet [57] builds proper feature-sharing rules for shallow and deep layers to tackle the gradient inconsistency among different feature maps, which benefits tiny object detection.
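To make the top-down pathway with lateral connections concrete, here is a minimal sketch in the style of FPN [43]; the channel widths are arbitrary assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Top-down pathway with 1x1 lateral connections, in the style of FPN [43]."""
    def __init__(self, in_channels=(128, 256, 512), out_channels=64):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooths = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels)

    def forward(self, feats):               # feats: [C3, C4, C5], fine to coarse
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 1, 0, -1):   # coarse-to-fine top-down fusion
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(l) for s, l in zip(self.smooths, laterals)]

# Example: three feature maps at strides 8/16/32 of a 256 x 256 input.
feats = [torch.randn(1, 128, 32, 32), torch.randn(1, 256, 16, 16), torch.randn(1, 512, 8, 8)]
p3, p4, p5 = TinyFPN()(feats)   # all pyramid levels now carry 64 channels
```

The 1 × 1 lateral convolutions align channel widths so that upsampled coarse maps can be added to finer ones, which is what gives every pyramid level high-level semantics.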


Fig. 9 Visual comparison of missed detections of small defects between YOLOv8n (left) and I-YOLOn (right)

Multi-scale feature learning is also an important means for small object detection. SSD [42] uses the features of multiple layers to predict objects of different sizes, which improves detection accuracy to a certain extent. The top-down architecture and lateral connections of FPN [43] construct high-level semantic feature maps across various scales. IPG-Net [45] introduces image pyramid guidance (IPG) into the backbone network to handle the problem of information imbalance, thus relieving the feature disappearance of small objects. Different from IPG-Net, Zheng et al. [46] design an interactive multi-scale feature representation enhancement strategy: the proposed multi-scale auxiliary enhancement network and adaptive interaction module not only scale the input to multiple scales corresponding to the prediction layers, but also aggregate the features of adjacent layers, which enhances the detection of small objects. Gong et al. [58] introduce a fusion factor to control the transmission of information from deep to shallow layers in the FPN, and explore how to estimate a valid fusion factor for a particular dataset via a statistical strategy; the model obtains great improvements on small objects when an appropriate fusion factor is used in the FPN. Based on FPN, Deng et al. [59] design an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small traffic-sign detection and small object detection in daily life. To detect small targets in infrared images, Wu et al. [60] propose an effective framework named UIU-Net. It contains two modules: resolution-maintenance deep supervision (RMDS) and interactive-cross attention (ICA). RMDS produces deep multi-scale resolution-maintenance features while learning global context clues, and ICA encodes the local context information between high-level semantic features and low-level details. In short, UIU-Net realizes multi-level and multi-scale representation learning of objects.

3 The proposed method

Figure 1 shows the overall architecture of our proposed improved YOLO (I-YOLO) framework. It integrates two new modules into the YOLO-based detector: the one-time integration unit with RMCS (OIU-RMCS) and the module based on MFEC (M-MFEC). We focus on the re-parameterized module with channel shuffle (RMCS) and the multi-scale feature enhanced convolution (MFEC) next.


Fig. 10 Visual comparison of false detections of small defects between YOLOv8n (left) and I-YOLOn (right)

3.1 Re-parameterized module with channel shuffle (RMCS)

The dense connections in DenseNet [61] are inefficient. To overcome this problem, VoVNet [62] and VoVNetV2 [63] design a one-shot aggregation module to build large-scale yet lightweight detectors, which brings faster speed and higher efficiency. This module aggregates all features only once, in the last feature maps, by representing different features with multiple receptive fields. Inspired by the one-shot aggregation module, we devise a new one-time integration unit with RMCS, dubbed OIU-RMCS. As displayed in Fig. 2, stacked RMCS modules allow features to be reused, boosting the signal flow among different channels between features of adjacent layers. To ease the computational burden of the network and reduce memory usage, we maintain only three feature cascades on the one-time integration path. Motivated by the idea of PANet [64], our feature fusion strategy aligns feature maps of different sizes to realize information exchange between the two prediction feature layers, so as to obtain object detection with high accuracy and fast inference. Furthermore, OIU-RMCS reduces the memory access cost by keeping the number of input channels the same and the number of output channels at a minimum.
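As a rough sketch of the one-shot aggregation pattern that OIU-RMCS builds on, the block below collects every stage output and fuses them exactly once at the end, instead of using dense connections. The inner block is a placeholder (the paper stacks RMCS modules here), and the cascade count of three follows the description above:

```python
import torch
import torch.nn as nn

class OneShotAggregation(nn.Module):
    """One-shot aggregation in the spirit of VoVNet [62]: every stage output is
    kept and concatenated exactly once at the end, not densely at every stage.
    The inner conv block is a stand-in for the RMCS module of Fig. 2."""
    def __init__(self, channels=64, num_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels), nn.SiLU())
            for _ in range(num_blocks))
        # Fuse the input plus all cascade outputs back to the input width, once.
        self.fuse = nn.Conv2d(channels * (num_blocks + 1), channels, 1)

    def forward(self, x):
        outs = [x]
        for block in self.blocks:
            x = block(x)
            outs.append(x)          # collected here, fused only once below
        return self.fuse(torch.cat(outs, dim=1))

y = OneShotAggregation()(torch.randn(1, 64, 40, 40))   # spatial shape preserved
```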


Fig. 11 Comparison of loss curves between the baseline and different components

Next, we focus on the re-parameterized module with channel shuffle (RMCS). RMCS is motivated by ShuffleNet [65], and Fig. 2 exhibits its structure. Suppose the input feature is a tensor of dimension C × H × W; the channel splitting operation divides it into two channel-wise tensors, each of dimension C/2 × H × W. During training, we adopt RepVGG (which primarily includes an identity branch and 1 × 1 and 3 × 3 convolutions) to build RMCS and train the model. In the inference phase, RepVGG is replaced with RepConv (which mainly involves a 3 × 3 convolution) through structural re-parameterization. At the training stage, the multi-branch structure can capture rich feature clues, while the simplified single-branch structure accelerates inference and saves memory at test time. After multi-branch training, one tensor is concatenated to the other in a channel-wise way, and the channel shuffle operation strengthens the information interaction between the two tensors. This realizes a deep mixing of the different channel features of the input at low computational complexity.

If we do not utilize channel shuffle, the output features of a certain group correlate only with the inputs of that group, which hinders the signal flow between channel groups and impairs the capability to extract features. In contrast, using channel shuffle achieves more efficient feature information interaction between different groups: the input and output features become fully correlated, meaning that one convolutional group can obtain information from the other groups. By operating on stacked grouped convolutions, channel shuffle thus brings more informative feature clues.
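A simplified sketch of this data flow is given below, assuming a split into two equal halves and a RepVGG-style multi-branch block on one half. The exact branch placement follows Fig. 2, which this sketch only approximates, and the training-time branches are not fused into a single 3 × 3 convolution here:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channels across groups (ShuffleNet [65]) so that the two
    concatenated halves exchange information."""
    n, c, h, w = x.shape
    return (x.view(n, groups, c // groups, h, w)
             .transpose(1, 2).reshape(n, c, h, w))

class RepBranch(nn.Module):
    """RepVGG-style training-time block: identity + 1x1 + 3x3 branches.
    At inference the branches can be fused into one 3x3 conv (RepConv)."""
    def __init__(self, c):
        super().__init__()
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)
        self.conv1 = nn.Conv2d(c, c, 1)
    def forward(self, x):
        return x + self.conv1(x) + self.conv3(x)

class RMCSSketch(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.branch = RepBranch(c // 2)
    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)               # two C/2 x H x W tensors
        out = torch.cat([x1, self.branch(x2)], dim=1)
        return channel_shuffle(out, groups=2)    # mix the two channel groups

y = RMCSSketch(64)(torch.randn(1, 64, 40, 40))
```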


3.2 Multi-scale feature enhanced convolution (MFEC)

Figure 3 depicts the architecture of the module based on MFEC (M-MFEC). Our proposed multi-scale feature enhanced convolution (MFEC) introduces several convolutions of different sizes, enabling it to learn various spatial features at multiple scales. As observed in Fig. 3, MFEC splits the input channels into four parts and applies a different convolution to each part, which reduces the computational burden and the number of parameters. However, the features extracted by each convolution are computed on a single channel part, so the information between the features of different channel parts remains independent. Therefore, we adopt a 1 × 1 convolution kernel to implement point-by-point convolution and fuse the individual channel features. The mathematical form of MFEC is expressed as:

MFEC(X) = conv1×1(conv1×1(x1), conv3×3(x2), conv5×5(x3), conv7×7(x4))    (1)

where X = {x1, x2, x3, x4} denotes dividing the input feature X into four parts along the channel dimension, conv1×1, conv3×3, conv5×5 and conv7×7 represent convolutions with kernel sizes of 1 × 1, 3 × 3, 5 × 5 and 7 × 7, respectively, and the outer conv1×1 is the point-by-point convolution applied to the concatenation of the four branch outputs.
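Equation (1) translates directly into the following sketch (channel counts are assumed divisible by four; normalization and activation layers are omitted):

```python
import torch
import torch.nn as nn

class MFEC(nn.Module):
    """Sketch of Eq. (1): split the channels into four parts, apply 1x1/3x3/5x5/7x7
    convs per part, then fuse all parts with a point-by-point (1x1) conv."""
    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2) for k in (1, 3, 5, 7))
        self.fuse = nn.Conv2d(channels, channels, 1)   # point-by-point fusion

    def forward(self, x):
        parts = x.chunk(4, dim=1)                       # x1..x4 along channels
        out = torch.cat([conv(p) for conv, p in zip(self.convs, parts)], dim=1)
        return self.fuse(out)

y = MFEC(64)(torch.randn(1, 64, 80, 80))   # shape preserved: (1, 64, 80, 80)
```

Padding each branch by k // 2 keeps the spatial size identical across the four kernel sizes, so the branch outputs can be concatenated directly.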
4 Experiments

4.1 Datasets

1. Small-PCB dataset

The images in our Small-PCB mainly come from PKU-Market-PCB,² INTERNET³ and a self-built part. PKU-Market-PCB and INTERNET contain 693 and 664 images, respectively, and the self-built part has 7449 images. Concretely, we perform horizontal mirroring, vertical mirroring and combined horizontal-vertical mirroring on the images of the existing datasets. In addition to mirroring, we also rotate the images by certain angles, such as 5, 60, 120, 180 and 357 degrees. Eventually, we obtain 8806 high-resolution (2613 × 2093) defect images with 39,894 instances. Krishna and Jawahar [66] regard an object as small if it occupies less than 1% of the image area; by this (relative size) definition, all objects in our Small-PCB dataset are treated as small objects. We focus on 6 categories: missing hole (MH), mouse bite (MB), open circuit (OC), spurious copper (SC), short (SH) and spur (SP). Following the common principle of dataset division, Small-PCB is split into a trainval-set and a test-set, which occupy about 80% (7044 images) and 20% (1762 images), respectively. Table 1 shows the number of instances of each class under the different divisions of Small-PCB.
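The mirroring and rotation operations used to expand Small-PCB can be sketched with torchvision as follows; the interpolation and border handling actually used by the authors are not specified, so this is only illustrative, and the file name is hypothetical:

```python
from PIL import Image
import torchvision.transforms.functional as TF

def expand(img: Image.Image):
    """Mirroring and rotation augmentations as described for Small-PCB."""
    out = [
        TF.hflip(img),                # horizontal mirroring
        TF.vflip(img),                # vertical mirroring
        TF.vflip(TF.hflip(img)),      # combined horizontal-vertical mirroring
    ]
    for angle in (5, 60, 120, 180, 357):   # rotation angles from the paper
        out.append(TF.rotate(img, angle, expand=True))
    return out

# variants = expand(Image.open("pcb.png"))  # hypothetical file name
```

Note that the defect box annotations must be transformed consistently with the images; that bookkeeping is omitted here.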
2. AI-TODv2 dataset

AI-TODv2 is the dataset with the smallest object size in the earth observation community, with an absolute object size of 12.7 pixels. It re-labels AI-TOD [52] to fix many un-annotated objects in the previous dataset. AI-TODv2 includes 28,036 aerial images (800 × 800) with 752,745 commonly seen objects of 8 classes: airplane, bridge, storage tank, ship, swimming pool, vehicle, person and wind mill. Because its test set is unavailable, we report results on the validation set. Based on absolute size, objects are divided into four subsets: very tiny objects (2–8 pixels), tiny objects (8–16 pixels), small objects (16–32 pixels) and medium objects (32–64 pixels).

4.2 Implementation details

We adopt an NVIDIA RTX3080Ti GPU to train all the models, and the datasets utilized are Small-PCB and AI-TODv2 [67]. We use YOLOv8⁴ as the baseline detector. The weight decay and momentum are set to 0.0005 and 0.937, and all models are trained for 100 epochs. For Small-PCB, the batch sizes of our I-YOLOn, I-YOLOs and I-YOLOm are set to 32, 32 and 16, respectively. For AI-TODv2, the batch size of our I-YOLOm is set to 8.

On the Small-PCB dataset, we also re-implemented several popular detectors based on the PaddleDetection⁵ framework, including Sparse RCNN [68], Faster RCNN [69], RetinaNet [35], FCOS [70] and Swin Transformer [71]. Except for the learning rate, we follow most of the framework's default settings. For Sparse RCNN, the learning rate is set to 0.000025; for Faster RCNN, RetinaNet, FCOS and FR with Swin-T, it is set to 0.00125. The batch size is set to 1 for Faster RCNN and FR with Swin-T, 2 for FCOS and RetinaNet, and 4 for Sparse RCNN. In the training phase, the '1x' schedule (12 epochs) is used. For all experiments, FPN is employed to capture hierarchical and detailed representations for better performance.

4.3 Evaluation metrics

Following the COCO-style [23] evaluation protocol, we adopt average precision (AP) and average recall (AR) to evaluate the performance of detectors. Specifically, the overall AP is obtained by averaging the AP across 10 IoU thresholds between 0.5 and 0.95 (with an interval of 0.05). The overall AR refers to the maximum recall given a fixed number of detections per image, averaged over all categories and IoU thresholds. AP50 and AP75 denote AP at IoU thresholds of 0.5 and 0.75, respectively. Since the AI-TODv2 dataset uses its own definitions of object sizes, the evaluation metrics differ slightly: APvt, APt, APs and APm signify AP for very tiny, tiny, small and medium objects, respectively.

² https://robotics.pkusz.edu.cn/resources/dataset/
³ https://aistudio.baidu.com/aistudio/datasetdetail/117696
⁴ https://github.com/ultralytics/ultralytics
⁵ https://github.com/PaddlePaddle/PaddleDetection
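The COCO-style AP/AR computation described above can be reproduced with pycocotools; the file names below are placeholders for ground truth and detections exported in COCO format:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical paths: ground-truth annotations and detector outputs in COCO format.
coco_gt = COCO("smallpcb_test_annotations.json")
coco_dt = coco_gt.loadRes("iyolo_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP (IoU 0.5:0.95), AP50, AP75, AR, etc.
```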


Fig. 12 Comparison of performance curves between the baseline and different components

Table 6 The performance impact of different components on the Small-PCB test-set

Number               M-MFEC   OIU-RMCS   AP           AP50         AP75         AR1          AR10         AR100        Parameters   GFLOPs   Time
YOLOv8n (baseline)                       57.8         97.5         62.7         18.4         64.9         64.9         3.0 M        8.1      1.8 ms
1                    ✓                   58.1(+0.3)   97.6(+0.1)   63.4(+0.7)   18.3(−0.1)   65.0(+0.1)   65.0(+0.1)   2.9 M        8.0      2.0 ms
2                             ✓          58.5(+0.7)   97.9(+0.4)   65.4(+2.7)   18.6(+0.2)   65.1(+0.2)   65.2(+0.3)   5.0 M        18.2     2.0 ms
3                    ✓        ✓          58.8(+1.0)   97.8(+0.3)   66.3(+3.6)   18.5(+0.1)   65.4(+0.5)   65.5(+0.6)   4.9 M        18.1     2.1 ms

We also report the performance gains from the newly-added components. (in %)

4.4 Comparison with state-of-the-art methods

4.4.1 Results on the AI-TODv2 dataset

As exhibited in Fig. 4, we show the visual comparison between the baseline YOLOv8m and our I-YOLOm. Our model clearly predicts more object instances, such as ship and vehicle, than YOLOv8m.

Moreover, we compare ten state-of-the-art methods with our approach: GFL [72], SSPNet [57], DyHead [73], TOOD [74], Cascade R-CNN [75], ATSS [76], RetinaNet [35], FASF [77], TridentNet [78] and Faster RCNN [69]. From Table 2, we find that I-YOLOm achieves the top performance, with AP and AP50 results of 19.1% and 44.0%, respectively, surpassing the second-place detector (TOOD) by 0.5% and 1.0%. Delving deeper into the specific metrics, we observe that our


approach obtains one first place (APm) and three second places (APvt, APt and APs). Compared with the baseline detector YOLOv8m, our I-YOLOm achieves gains of 0.6%, 1.3% and 0.3% on the metrics APt, APs and APm, respectively.

In terms of category-wise AP, our I-YOLOm outperforms the other detectors, as shown in Table 3. I-YOLOm achieves the top results in six categories: airplane, bridge, storage tank, vehicle, person and wind mill. Compared with the baseline YOLOv8m, our I-YOLOm brings varying degrees of improvement across all eight categories, which again confirms the visualization results in Fig. 4.

As is well known, remote sensing data usually suffer from various degradation and noise effects in the imaging process [79]. We further explore the detection performance of our approach on such noisy images. Figures 5 and 6 show the visual detection results before and after adding noise (salt-and-pepper noise and Gaussian noise) in the daytime and night scenarios, respectively. From Fig. 5, it can be observed that the detection performance of the model decreases when the daytime images suffer from salt-and-pepper or Gaussian noise; here, salt-and-pepper noise has a greater impact on the detection results than Gaussian noise. The night scene in Fig. 6 exhibits the same phenomenon: the model can detect only a small portion of the objects, or even fails to detect them accurately, as shown in Fig. 6b and c. These degradation and noise effects thus have a significant impact on the performance of the detector.

4.4.2 Results on the Small-PCB dataset

Table 4 lists the detection results of our proposed approaches and several representative methods (FR with Swin-T, Faster RCNN, RetinaNet, Sparse RCNN and FCOS) on the Small-PCB test-set. FCOS ranks first in AP50 with 99.4%, while our I-YOLOm reaches an AP of 65.0% and an AP75 of 77.5%, steadily outperforming the other algorithms. In addition, I-YOLOm obtains the best performance in each category. More concretely, compared with Sparse RCNN, our I-YOLOm shows great dominance in several classes, including mouse bite (14.1 points higher), spurious copper (16.9 points higher) and spur (14.9 points higher). Against FCOS, I-YOLOm is better by 4.4%, 5.6% and 5.9% in the categories spur, spurious copper and mouse bite, respectively.

4.4.3 Improvements over baseline detectors

Figure 7 displays the loss curves of our methods and the baselines, and Fig. 8 reports the corresponding performance curves. From these qualitative graphs, it can be observed that our method outperforms the baseline detectors: for the baselines YOLOv8n, YOLOv8s and YOLOv8m, our approaches (I-YOLOn, I-YOLOs and I-YOLOm) all achieve varying degrees of performance improvement, as shown by the performance curves in Fig. 8.

In addition to the qualitative results, we also conducted a quantitative analysis. Table 5 compares our methods with the baselines (YOLOv8n, YOLOv8s and YOLOv8m) in terms of overall performance, category-wise AP, parameters and time. In parameters and time, our approaches show a slight increase over the baselines, but in overall performance they are significantly better than the three baseline detectors. More concretely, on the metrics AP, AP50 and AP75, our I-YOLOn achieves gains of 1.0%, 0.3% and 3.6% over YOLOv8n, and our I-YOLOm obtains improvements of 0.2%, 0.3% and 1.4% over YOLOv8m, respectively. Digging into the specific categories, our detectors also show dominance: compared with YOLOv8n and YOLOv8m, our I-YOLOn and I-YOLOm obtain large gains in the categories spur (2.3 points higher) and mouse bite (2.7 points higher), respectively.

4.4.4 Visualization results

In addition to the quantitative analysis mentioned above, we also exhibit qualitative results. Taking our I-YOLOn and the baseline YOLOv8n as an example, visual comparisons between the two are displayed in Figs. 9 and 10. Compared with YOLOv8n, I-YOLOn detects more instances, such as spurious copper and short, as shown in Fig. 9. Furthermore, our I-YOLOn accurately detects some categories (such as mouse bite and spur) that the baseline detector YOLOv8n detects incorrectly, as shown in Fig. 10.

4.4.5 Component analysis

We investigate the individual contribution of each component by conducting an ablation study on our I-YOLOn. Figures 11 and 12 show the loss curves and performance curves of the baseline and the different components, respectively. As a qualitative result, these curves show the superiority of our proposed components: compared with the baseline YOLOv8n, the proposed OIU-RMCS and M-MFEC both bring certain performance improvements, and when OIU-RMCS and M-MFEC are used in combination, the performance improvement is even more significant.


Table 7 Effects of different components on category-wise AP on the Small-PCB test-set

Number               M-MFEC   OIU-RMCS   Missing hole   Mouse bite    Open circuit   Spurious copper   Short         Spur
YOLOv8n (baseline)                       64.1           56.6          57.2           58.6              58.1          54.7
1                    ✓                   64.2           56.1          56.9           60.1              58.4          53.8
2                             ✓          63.9           56.5          58.6           60.2              58.8          56.2
3                    ✓        ✓          64.4 (+0.3)    57.8 (+1.2)   58.2 (+1.0)    59.8 (+1.2)       58.9 (+0.8)   57.0 (+2.3)

We also report the performance gains from the newly-added M-MFEC and OIU-RMCS, compared with the baseline. (in %)

In addition, we report the quantitative results. The effects of the proposed components on the model's performance are shown in Tables 6 and 7. Overall, our approach achieves better detection performance and stable processing times compared with the baseline, although its GFLOPs value is slightly larger. Specifically, the presented M-MFEC has fewer parameters and proves effective for object detection, resulting in gains of 0.3%, 0.1% and 0.7% in AP, AP50 and AP75, respectively. Moreover, the proposed OIU-RMCS leads to increases of 0.7%, 0.4% and 2.7% in AP, AP50 and AP75. The combination of these two components brings significant improvements, with gains of 1.0%, 0.3% and 3.6% in AP, AP50 and AP75, respectively, and also improves AR1, AR10 and AR100 over the baseline. For specific categories, the integration of OIU-RMCS and M-MFEC likewise brings varying degrees of performance improvement, as demonstrated in Table 7; in particular, their combined use yields a significant gain in the category spur (2.3 points higher).

5 Conclusion

This work first devises two new modules: the re-parameterized module with channel shuffle (RMCS) and the multi-scale feature enhanced convolution (MFEC). We then propose an effective I-YOLO detector by integrating OIU-RMCS and M-MFEC into a YOLO-based detector. In addition, we construct a large-scale dataset dubbed Small-PCB in order to facilitate industrial small object detection; it includes 8806 high-resolution PCB images and 39,894 instances of 6 classes. The experimental results indicate that our approach achieves state-of-the-art performance on the popular AI-TODv2 and Small-PCB datasets.

In the future, we will continue to study both the small object dataset and the small object itself. In terms of the dataset, we aim to extend Small-PCB to more defects in real-world scenarios to better support industrial semiconductor inspection. For the small object itself, we will further explore semantic and contextual information to help improve the detection performance on small objects.

Author contributions K. T. conducted all the experiments and wrote the main manuscript text, including figures and tables. Y. Q. W. and K. T. reviewed the manuscript.

Data availability The link to obtain the dataset has been provided in this article.

Declarations

Conflict of interest The authors declare that they have no conflict of interest or personal relationships related to the work in this paper.

References

1. Li, J., et al.: Automatic detection and classification system of domestic waste via multimodel cascaded convolutional neural network. IEEE Trans. Industr. Inf. 18(1), 163–173 (2022)
2. Guo, Z., Shuai, H., Liu, G., Zhu, Y., Wang, W.: Multi-level feature fusion pyramid network for object detection. Vis. Comput. 39(9), 4267–4277 (2023)
3. Ma, Y., Wang, Y.: Feature refinement with multi-level context for object detection. Mach. Vis. Appl. 34(4), 49 (2023)
4. Wang, Q., Zhou, L., Yao, Y., Wang, Y., Li, J., Yang, W.: An interconnected feature pyramid networks for object detection. J. Vis. Commun. Image Represent. 79, 103260 (2021)
5. Liu, L., et al.: Deep learning for generic object detection: a survey. Int. J. Comput. Vis. 128(2), 261–318 (2020)
6. Tong, K., Wu, Y.: Object detection with shallow feature learning network. Presented at the 10th International Conference on Computing and Pattern Recognition, Shanghai, China (2021)
7. Wang, H., Chen, Y., Wu, M., Zhang, X., Huang, Z., Mao, W.: Attentional and adversarial feature mimic for efficient object detection. Vis. Comput. 39(2), 639–650 (2023)
8. Lin, X., Sun, S., Huang, W., Sheng, B., Li, P., Feng, D.D.: EAPT: efficient attention pyramid transformer for image processing. IEEE Trans. Multimedia 25, 50–61 (2023)
9. Li, C., Zhang, B., Hong, D., Yao, J., Chanussot, J.: LRR-Net: an interpretable deep unfolding network for hyperspectral anomaly detection. IEEE Trans. Geosci. Remote Sens. 61 (2023)
10. Cheng, G., Han, J.: A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 117, 11–28 (2016)
11. Hong, D., et al.: Cross-city matters: a multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks. Remote Sens. Environ. 299, 113856 (2023)
12. Hong, D., et al.: More diverse means better: multimodal deep learning meets remote-sensing imagery classification. IEEE Trans. Geosci. Remote Sens. 59(5), 4340–4354 (2021)


13. Amin, S.U., Kim, Y., Sami, I., Park, S., Seo, S.: An efficient attention-based strategy for anomaly detection in surveillance video. Comput. Syst. Sci. Eng. 46(3), 3939–3958 (2023)
14. Üzen, H., Turkoglu, M., Aslan, M., Hanbay, D.: Depth-wise squeeze and excitation block-based efficient-unet model for surface defect detection. Vis. Comput. 39(5), 1745–1764 (2023)
15. Yu, X., Li, H.-X., Yang, H.: Collaborative learning classification model for PCBs defect detection against image and label uncertainty. IEEE Trans. Instrum. Meas. 72, 1–8 (2023)
16. Tong, K., Wu, Y.: Deep learning-based detection from the perspective of small or tiny objects: a survey. Image Vis. Comput. 123, 104471 (2022)
17. Wang, S.-Y., Qu, Z., Li, C.-J., Gao, L.: BANet: small and multi-object detection with a bidirectional attention network for traffic scenes. Eng. Appl. Artif. Intell. 117, 105504 (2023)
18. Min, K., Lee, G.-H., Lee, S.-W.: Attentional feature pyramid network for small object detection. Neural Netw. 155, 439–450 (2022)
19. Chen, G., et al.: A survey of the four pillars for small object detection: multiscale representation, contextual information, super-resolution, and region proposal. IEEE Trans. Syst. Man Cybern. Syst. 52(2), 936–953 (2022)
20. Tong, K., Wu, Y.: Rethinking PASCAL-VOC and MS-COCO dataset for small object detection. J. Vis. Commun. Image Represent. 93, 103830 (2023)
21. Gong, L., Huang, X., Chao, Y., Chen, J., Lei, B.: An enhanced SSD with feature cross-reinforcement for small-object detection. Appl. Intell. 53(16), 19449–19465 (2023)
22. Sun, C., Ai, Y., Wang, S., Zhang, W.: Mask-guided SSD for small-object detection. Appl. Intell. 51(6), 3311–3322 (2021)
23. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. Presented at the Proceedings of European Conference on Computer Vision, Zurich, Switzerland (2014)
24. Yang, S., Luo, P., Loy, C.C., Tang, X.: WIDER FACE: a face detection benchmark. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV (2016)
25. Ji, Z., Kong, Q., Wang, H., Pang, Y.: Small and dense commodity object detection with multi-scale receptive field attention. Presented at the ACM International Conference on Multimedia, Nice, France (2019)
26. Chen, C., Liu, M.-Y., Tuzel, O., Xiao, J.: R-CNN for small object detection. Presented at the Asian Conference on Computer Vision, Taipei, Taiwan (2016)
27. Yu, X., Gong, Y., Jiang, N., Ye, Q., Han, Z.: Scale match for tiny person detection. Presented at the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO (2020)
28. Ding, R., Dai, L., Li, G., Liu, H.: TDD-net: a tiny defect detection network for printed circuit boards. CAAI Trans. Intell. Technol. 4(2), 110–116 (2019)
29. He, F., Tang, S., Mehrkanoon, S., Huang, X., Yang, J.: A real-time PCB defect detector based on supervised and semi-supervised learning. Presented at the 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges, Belgium (2020)
30. Li, J., Liang, X., Wei, Y., Xu, T., Feng, J., Yan, S.: Perceptual generative adversarial networks for small object detection. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI (2017)
31. Zhang, Y., Bai, Y., Ding, M., Ghanem, B.: Multi-task generative adversarial network for detecting small objects in the wild. Int. J. Comput. Vis. 128(6), 1810–1828 (2020)
32. Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: SOD-MTGAN: small object detection via multi-task generative adversarial network. Presented at the Proceedings of European Conference on Computer Vision, Munich, Germany (2018)
33. Lian, J., et al.: Deep-learning-based small surface defect detection via an exaggerated local variation-based generative adversarial network. IEEE Trans. Industr. Inf. 16(2), 1343–1351 (2020)
34. Liu, G., Han, J., Rong, W.: Feedback-driven loss function for small object detection. Image Vis. Comput. 111, 104197 (2021)
35. Lin, T.-Y., Goyal, P., Girshick, R.B., He, K., Dollár, P.: Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42(2), 318–327 (2020)
36. Wang, Z., Fang, J., Dou, J., Xue, J.: Small object detection on road by embedding focal-area loss. Presented at the 10th International Conference on Image and Graphics, Beijing, China (2019)
37. Zhang, H., Jiang, L., Li, C.: CS-ResNet: cost-sensitive residual convolutional neural network for PCB cosmetic defect detection. Exp. Syst. Appl. 185, 115673 (2021)
38. Leng, J., Ren, Y., Jiang, W., Sun, X., Wang, Y.: Realize your surroundings: exploiting context information for small object detection. Neurocomputing 433, 287–299 (2021)
39. Lim, J.-S., Astrid, M., Yoon, H.-J., Lee, S.-I.: Small object detection using context and attention. Presented at the International Conference on Artificial Intelligence in Information and Communication, Jeju Island, South Korea (2021)
40. Yan, Z., Zheng, H., Li, Y., Chen, L.: Detection-oriented backbone trained from near scratch and local feature refinement for small object detection. Neural Process. Lett. 53(3), 1921–1943 (2021)
41. Liang, W., Sun, Y.: ELCNN: a deep neural network for small object defect detection of magnetic tile. IEEE Trans. Instrum. Meas. 71, 1–10 (2022)
42. Liu, W., et al.: SSD: single shot MultiBox detector. Presented at the Proceedings of European Conference on Computer Vision, Amsterdam, The Netherlands (2016)
43. Lin, T.-Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI (2017)
44. Zeng, N., Wu, P., Wang, Z., Li, H., Liu, W., Liu, X.: A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Trans. Instrum. Meas. 71, 1–14 (2022)
45. Liu, Z., Gao, G., Sun, L., Fang, L.: IPG-Net: image pyramid guidance network for small object detection. Presented at the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA (2020)
46. Zheng, Q., Chen, Y.: Interactive multi-scale feature representation enhancement for small object detection. Image Vis. Comput. 108, 104128 (2021)
47. Cao, G., Xie, X., Yang, W., Liao, Q., Shi, G., Wu, J.: Feature-fused SSD: fast detection for small objects. Presented at the 9th International Conference on Graphic and Image Processing, Qindao, China (2017)
48. Li, Z., Zhou, F.: FSSD: feature fusion single shot multibox detector. Comput. Res. Reposit. (2018)
49. Liang, X., Zhang, J., Zhuo, L., Li, Y., Tian, Q.: Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 30(6), 1758–1770 (2020)
50. Goodfellow, I.J., et al.: Generative adversarial nets. Presented at the Neural Information Processing Systems, Montreal, Quebec, Canada (2014)
51. Zhu, Z., Liang, D., Zhang, S.-H., Huang, X., Li, B., Hu, S.-M.: Traffic-sign detection and classification in the wild. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV (2016)
52. Wang, J., Yang, W., Guo, H., Zhang, R., Xia, G.-S.: Tiny object detection in aerial images. Presented at the International Conference on Pattern Recognition, Milan, Italy (2021)


53. Yang, C., Huang, Z., Wang, N.: QueryDet: cascaded sparse query for accelerating high-resolution small object detection. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA (2022)
54. Oliva, A., Torralba, A.: The role of context in object recognition. Trends Cogn. Sci. 11(12), 520–527 (2007)
55. Leng, J., Liu, Y., Gao, X., Wang, Z.: CRNet: context-guided reasoning network for detecting hard objects. IEEE Trans. Multimed., 1–13 (2023)
56. Leng, J., Mo, M., Zhou, Y., Gao, C., Li, W., Gao, X.: Pareto refocusing for drone-view object detection. IEEE Trans. Circuits Syst. Video Technol. 33(3), 1320–1334 (2023)
57. Hong, M., Li, S., Yang, Y., Zhu, F., Zhao, Q., Lu, L.: SSPNet: scale selection pyramid network for tiny person detection from UAV images. IEEE Geosci. Remote Sens. Lett. 19, 1–5 (2022)
58. Gong, Y., Yu, X., Ding, Y., Peng, X., Zhao, J., Han, Z.: Effective fusion factor in FPN for tiny object detection. Presented at the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI (2021)
59. Deng, C., Wang, M., Liu, L., Liu, Y., Jiang, Y.: Extended feature pyramid network for small object detection. IEEE Trans. Multimedia 24, 1968–1979 (2022)
60. Wu, X., Hong, D., Chanussot, J.: UIU-Net: U-Net in U-Net for infrared small object detection. IEEE Trans. Image Process. 32, 364–376 (2023)
61. Huang, G., Liu, Z., Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI (2017)
62. Lee, Y., Hwang, J.-W., Lee, S., Bae, Y., Park, J.: An energy and GPU-computation efficient backbone network for real-time object detection. Presented at the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA (2019)
63. Lee, Y., Park, J.: CenterMask: real-time anchor-free instance segmentation. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA (2020)
64. Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT (2018)
65. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT (2018)
66. Krishna, H., Jawahar, C.V.: Improving small object detection. Presented at the Asian Conference on Pattern Recognition, Nanjing, China (2017)
67. Xu, C., Wang, J., Yang, W., Yu, H., Yu, L., Xia, G.-S.: Detecting tiny objects in aerial images: a normalized Wasserstein distance and a new benchmark. ISPRS J. Photogramm. Remote Sens. 190, 79–93 (2022)
68. Sun, P., et al.: Sparse R-CNN: end-to-end object detection with learnable proposals. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Virtual (2021)
69. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
70. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: a simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 44(4), 1922–1933 (2022)
71. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. Presented at the IEEE International Conference on Computer Vision, Montreal, QC, Canada (2021)
72. Li, X., Lv, C., Wang, W., Li, G., Yang, L., Yang, J.: Generalized focal loss: towards efficient representation learning for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 45(3), 3139–3153 (2023)
73. Dai, X., et al.: Dynamic head: unifying object detection heads with attentions. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Virtual (2021)
74. Feng, C., Zhong, Y., Gao, Y., Scott, M.R., Huang, W.: TOOD: task-aligned one-stage object detection. Presented at the IEEE International Conference on Computer Vision, Montreal, QC, Canada (2021)
75. Cai, Z., Vasconcelos, N.: Cascade R-CNN: high quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 43(5), 1483–1498 (2021)
76. Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA (2020)
77. Zhu, C., He, Y., Savvides, M.: Feature selective anchor-free module for single-shot object detection. Presented at the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA (2019)
78. Li, Y., Chen, Y., Wang, N., Zhang, Z.-X.: Scale-aware trident networks for object detection. Presented at the IEEE International Conference on Computer Vision, Seoul, South Korea (2019)
79. Hong, D., Yokoya, N., Chanussot, J., Zhu, X.X.: An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Trans. Image Process. 28(4), 1923–1938 (2019)

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Kang Tong received a master's degree from Jiangxi University of Science and Technology in 2019. He is currently a Ph.D. student at Nanjing University of Aeronautics and Astronautics. He is dedicated to research on small object detection.


Yiquan Wu received his M.S. and Ph.D. degrees from Nanjing University of Aeronautics and Astronautics in 1987 and 1998, respectively. He is currently a professor and Ph.D. supervisor in the Department of Information and Communication Engineering at the Nanjing University of Aeronautics and Astronautics. His research interests include image processing and machine vision.

