
Intelligent Systems with Applications 20 (2023) 200296


Anti-drone systems: An attention-based improved YOLOv7 model for real-time detection and identification of multi-airborne targets

Ghazlane Yasmine a, b, *, Gmira Maha a, Medromi Hicham b, c

a School of Digital Engineering and Artificial Intelligence, Euromed Research Center, Euromed University, Fes, 30110, Morocco
b Research Foundation for Development and Innovation in Science and Engineering (FRDISI), Casablanca, 16469, Morocco
c National Higher School of Electricity and Mechanics (ENSEM), Hassan II University, Casablanca, 20310, Morocco

ARTICLE INFO

Keywords: Deep learning; Real-time detection; Airspace safety; Drone identification

ABSTRACT

Recently, with the significant rise of drones, reinforcing and securing aerial security and privacy has become an urgent task. Their malicious use benefits from malevolent deployments that exploit existing gaps in Artificial Intelligence (AI) and cybersecurity. Anti-drone systems are the spotlighted security solution developed to ensure aerial safety and security against rogue drones. However, anti-drone systems are constrained by the need for accurate airborne target identification and real-time detection in order to neutralize the target properly without causing damage. In this paper, we develop a real-time multi-target detection model based on Yolov7 that aims to detect, identify and locate airborne targets properly and rapidly, using a varied dataset which is biased and imbalanced due to the differences between the targets. In order to develop a model with the best compromise between high performance and fast speed, we apply a series of improvements by incorporating the CSPResNeXt module in the backbone, a transformer block with the C3TR attention mechanism, and a decoupled head structure. The comparative and ablation experiments confirm the effectiveness of the proposed ensemble learning-based model. The experiments show that the improved model reaches high performance, with 0.97 precision, 0.961 recall, 0.979 mAP@0.50 and 0.732 mAP@0.50–0.95. Additionally, the real-time detection condition is satisfied with 92 FPS and an inference time of 0.02 ms per image. The results show that the model succeeds in achieving an optimal balance between inference speed and detection performance, and it achieves competitive results compared with existing state-of-the-art models.

1. Introduction

Recently, with the significant rise of drones, it has been reported that their anarchic and uncontrolled deployments have caused far-reaching impacts and heavy consequences on both rural and urban areas. To deal with this security concern, anti-drone systems have been developed to regulate and counter the illegal activities of malicious drones following a pre-defined anti-drone process. An exhaustive overview of the anti-drone process and the technologies used can be found in Yasmine et al. (2022). As explained in that paper, the airborne target identification and classification phase has significant impacts on the performance of the system.

The anti-drone system should be able to depict all the airborne targets at long distances and identify them precisely in order to adopt efficient countermeasures. There are many airborne objects that can be easily misidentified and confused due to the high altitude and long distance between the target and the anti-drone system, especially where ground-based anti-drone platforms are concerned. That is, not only drones should be detected, but also any other airborne target that is likely to mislead the system and cause damage during the neutralization phase, especially when the target is not harmful. Drones are the first targets that anti-drone systems should identify, track and intercept properly. Still, other targets such as birds and airplanes are also present in the sky. Likewise, lighting day frames and buildings affect the identification of drones and can be assimilated to them. Missed detections or confused identifications cause false alarms, leading to unsuccessful target neutralization. Thus, the other airborne targets or the background should not be identified as drones, based on their visual characteristics.

* Corresponding author.
E-mail address: y.ghazlane@ueuromed.org (G. Yasmine).

https://doi.org/10.1016/j.iswa.2023.200296
Received 19 April 2023; Received in revised form 25 September 2023; Accepted 25 October 2023
Available online 31 October 2023
2667-3053/© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

The identification and detection modules process the collected signals from the sensing unit to get the target status and locate it in real time. The use of visual signals is the most advantageous detection and identification method with respect to radar, Radio Frequency (RF) and acoustic technologies, mostly when it is used with appropriate deep learning algorithms, as pointed out in Ajakwe et al. (2022); Al-Qubaydhi et al. (2022); Behera and Bazil Raj (2020); B. Liu and Luo (2022); Singha and Aydin (2021); X. Wang et al. (2021).

To perform a real-time detection task, one-stage object detectors are the most suitable solution thanks to the optimal compromise they present between speed and performance (Soviany & Ionescu, 2018). Indeed, You Only Look Once (Yolo) models are end-to-end one-stage object detection algorithms used to perform object detection, prediction and localization by drawing bounding boxes around the targets (Redmon et al., 2016; Ultralytics/Yolov5, 2020/2022; C.-Y. Wang et al., 2022; Yasmine et al., 2023).

Recently, researchers have been leveraging Yolo algorithms massively for their specific object detection tasks while improving the baseline models to meet the required constraints. Indeed, most of the existing object detection methods have used and customized the Yolov5 model according to the specific recognition task, dataset and related constraints by bringing specific changes and improvements to the detection architecture. Thus, in Zhu, Lyu, Wang, and Zhao (2021), the authors developed a model for object detection in drone-captured scenarios based on Yolov5. The incorporated improvements are the use of a fourth prediction head and a transformer structure, as well as the integration of the Convolutional Block Attention Module (CBAM). Further, a novel detection model was developed for real-time multi-scale traffic sign detection (Wang et al., 2023), which adds a Feature Enhancement Module (FEM) and an Adaptive Attention Module (AAM) to the Yolov5 model to meet real-time requirements and improve the detection of multi-scale targets.

Knowing the degree of lethality and hazard of the potential airborne targets, detection and identification should be done properly for safety and security reasons, since false alarm responses due to missed or confused detections may lead to an unsuccessful anti-drone process, with a doomed tracking and interception that could cause weighty damage to both the target and the environment under operation. Therefore, it is of high importance to deploy an accurate recognition model that responds to the anti-drone needs and requirements and follows the process.

Further, visual-based detection is the most advantageous in view of the quality and quantity of information delivered by electro-optical and infrared sensors (Park et al., 2021; Yasmine et al., 2022). With the rise of machine and deep learning, object detection and tracking using visual methods have become more automated, mostly using Convolutional Neural Networks (CNN) as the foundational part to process and extract the visual features from the dataset and provide a probability distribution over a set of categories (Garcia et al., 2020; Isaac-Medina et al., 2021; Lykou et al., 2020).

The collected images and videos are more reliable in recognizing and identifying the targets by gathering visual cues, such as appearance features (e.g., shapes, colors, geometric forms, contour lines and edges) and motion across consecutive frames (Shi et al., 2018). Recent advances in AI and computer vision have reinforced the visual detection module in anti-drone systems, allowing fine-grained details about the target to be obtained, contextual and high-level visual information to be provided, and real-time detection and tracking to be ensured by continuously processing the visual information. When coupled with AI and advanced computer vision algorithms, visual detection has enabled a significant step forward in making the anti-drone system intelligent, flexible and autonomous.

The proposed model is designed for real-time implementation in the anti-drone system. First, the model is trained beforehand on a large training dataset and tested on unseen images in different complex scenarios, which makes it suitable for integration in the anti-drone system. Once the anti-drone system is deployed, the sensors feed the developed model with the captured data (images or videos) for inference. The received visual data, which are equivalent to testing data, are processed by the model, which takes them as input and provides confidence scores along with bounding boxes around the targets. Based on the identity of the detected target, especially if the detected target is a drone, the system is triggered and alarmed to track and intercept it using an appropriate countermeasure and neutralization strategy.

The purpose of this paper is to develop a real-time detection model able to detect, identify and localize multiple targets while satisfying the speed-performance compromise and mainly overcoming the real-time, data and performance challenges. Overall, the significant findings of this work are summarized as follows:

- We develop a proper and varied labelled dataset with five classes, namely drones, birds, airplanes, day frames and buildings.
- We introduce an improved single-stage object detector model based on Yolov7 through the use of a CSPResNeXt module in the backbone, a transformer block with C3TR and decoupled head structures.
- We further design a proper methodology to retain the maximum information from the learned features and optimize them during testing and detection using training from scratch, Transfer Learning (TL), Test Time Augmentation (TTA) and ensemble learning.

The rest of this paper is presented as follows. Section 2 provides an overview of previous research studies related to the detection of airborne targets. In Section 3, we investigate the structure of the improved model with the incorporated modules. Section 4 discusses the proposed methodology and the conducted experimental setup, and presents the results and the comparative analyses. Then, we conclude in the last section.

2. Related work

As explained earlier, the unrestricted deployment of drones is threatening public safety and privacy. Thus, this problem falls under an aerial security context. Many research studies have considered the detection of airborne objects as a challenging task to enhance airspace safety. Thus, many models have been developed and proposed accordingly and several novel approaches have been adopted.

Al-Qubaydhi et al. (2022) proposed the use of Yolov5 with TL to detect drones accurately and rapidly. The model was trained on 1359 images of drones and was enhanced with the integration of TL and data augmentation techniques to address data scarcity. The experiments showed that the proposed model achieved an average precision of 94.7 %. Further, the data augmentation technique was adopted in Liu and Luo (2022) to train Yolov5 for detecting multi-rotor UAVs. The authors proposed to improve the model by replacing the model backbone with Efficient lite, integrating spatial feature fusion into the head part and optimizing the regression loss function with the use of an angle cost criterion. The proposed model achieved 93.54 % precision, 91.09 % recall and 94.82 mAP. A similar approach was used in Akyon et al. (2021) with the incorporation of a fine-tuned Yolov5 and a Kalman tracker on real and synthetically generated datasets for the detection of drones. It has been demonstrated that the proposed model trained on the combination of both real and synthetic data performs better than any single model, achieving 79.4 % mAP. This improvement is explained by reducing the rate of false positives and filling out missing frames. The advantages of using synthetic data for training a drone detection model are also pointed out in Wisniewski et al. (2022). The authors used synthetic images for training a DenseNet201 model and then tested it on a real-life dataset. The model reached an accuracy of 92.4 %, a precision of 88.8 %, a recall of 88.6 %, and an F1 score of 88.7 %.


Fig. 1. The structure of the CSPResNeXt module.

Similar studies (Behera & Bazil Raj, 2020; Singha & Aydin, 2021) used the third and fourth versions of the Yolo detection algorithms for real-time drone detection. The work reported in Behera and Bazil Raj (2020) used Yolov4 to detect different types of drones in fixed and moving contexts. The model achieved its best performance with 0.74 mAP at 150 epochs. Singha and Aydin (2021) used Yolov4 to distinguish between drones and birds. The main contribution is the use of images captured at three different altitudes: 60, 40, and 20 feet. The model achieved 0.74 mAP, 0.95 precision, 0.68 recall, and a 0.79 F1 score.

Due to the significant similarities between drones and birds in the sky, the authors in Fujii et al. (2021) proposed a model to tackle bird detection using the CenterNet model, data augmentation and hard-negative training techniques. The experiments showed that the use of the aforementioned techniques achieved 72.03 % and 72.13 mAP, surpassing the other proposed combinations. To address the similarities between drones and birds, Samadzadegan et al. (2022) used the Yolov4 model to distinguish between drones and birds. Bag of freebies and bag of specials are leveraged to improve the performance of the proposed model. Based on the presented results, the accuracy, mAP and IoU reached 83 %, 84 %, and 81 %, respectively.

Aerial detection is a challenging task due to scale variance and the dense distribution of the data. To deal with this, the paper (X. Wang et al., 2021) presents an end-to-end object detector, SPB-YOLO, inspired by Yolov5. The main contributions are the integration of a strip bottleneck module to process the width and height dependencies of the different targets, following an upsampling strategy based on the Path Aggregation Network (PANet), and adding a fourth detection head. The SPB-YOLO model surpassed Yolov5 by 5.3 % mAP, which confirms the effectiveness of the proposed model. To complete the visual information on the drones, the authors in (Kim et al., 2022) used data fusion, integrating both acoustic and visual data for the detection using an OR logical function. The developed model is divided into two sub-models: a CNN model to process the acoustic data and a Yolov5 model for the detection of the visual features. The models reached accuracies of 88.96 % and 90.26 %, respectively. After applying the fusion, the accuracy improved to 92.53 %.

Knowing that intruder drones usually carry a payload for their malicious mission, such as a gun, missile or explosive, the paper (Ajakwe, Ihekoronye, Kim, & Lee, 2022) presents a detection model aiming to detect the types of drone and identify the attached payloads. The selected model is Yolov5s in view of its optimal performance, timeliness and low computational complexity. The proposed model achieved promising results with respect to each type of drone and payload.

Through the information delivered by the aforementioned papers, it is shown that detection of airborne targets is a challenging task due to real-time constraints and the visual similarities of the flying targets in the sky. Therefore, it is necessary to develop proper detection modules to enhance and improve the effectiveness of the anti-drone system.

Several authors have developed proper models using specific detection algorithms to enhance airborne target detection. However, in the existing literature, the compromise between performance and speed is not prioritized, nor is the variety of the target categories used in the dataset. In addition, the proposed detection models are inappropriate for real-time deployment on an anti-drone device. Further, there are still research gaps in enhancing the detection of airborne targets under various flying speeds, weather conditions and altitudes while satisfying both fast speed and high performance constraints. The most challenging part is to include the predominant aerial targets in the sky that are likely to mislead the anti-drone system. Therefore, these latter challenges are the main motivations of this research study.

3. Improved yolov7: solving methodology

In this section, we investigate the proposed methodology and the improved real-time detection model, which is developed to maximize the performance while reducing the inference time. The developed model provides novel research contributions with the development of the transformer and backbone modules, the followed training strategy and the use of the preliminarily evolved hyperparameters for training the model.

To meet our specific requirements and satisfy the optimal compromise between speed and performance, we have made a set of improvements to the Yolov7 model. First, we have introduced the CSPResNeXt module into the backbone to improve the feature extraction. Secondly, the C3TR attention mechanism is integrated to collect more global information and enhance the feature fusion. In addition, this specific attention mechanism leads to more effective object detection by weighting the importance of the existing objects in an image and focusing on the relevant instances while ignoring negligible regions. Next, we modify the head part and use a decoupled head structure to separate the classification and regression tasks. Further, we have incorporated some regularization techniques to improve the overall performance while reducing the inference time.


3.1. Backbone with CSPResNeXt

Based on the first collected results, it was noted that the training deteriorated because of the undesirable gradient flow and the increasing error rate during the backpropagation process, due to the complexity of the dataset, and the expected performance was not reached. Therefore, we have introduced a ResNeXt CNN in the backbone to leverage the identity shortcut connections, which add the input features to the output block, and to use the split-transform-merge strategy in an extensible way as well as the cardinality to provide complex transformations (Xie et al., 2017). Due to the loss of information through the feature extraction process and the undesirable gradient flow, we have added the Cross Stage Partial Network (CSPNet) approach into the ResNeXt module to strengthen its learning capacity, enhance the variability of the learned features and make the model lightweight for Graphics Processing Unit (GPU) deployment (C.-Y. Wang et al., 2020). Further, ResNeXt uses a multi-branch strategy during training and a single one during inference, which improves both the detection accuracy and speed. In addition, the improved spatial pyramid pooling structure with the Cross Stage Partial (CSP) structure (SPPCSPC) is also used. Fig. 1 shows the ResNeXt module with and without the CSP approach.

3.2. Transformer C3TR

Knowing that the resolution of the feature maps decreases significantly during feature extraction and aggregation, we have replaced the C3 modules with vision transformer blocks (C3TR) to first collect more global information from the extracted features and then select and emphasize the most meaningful ones while ignoring insignificant information. The use of C3TR improves the model's learning ability by increasing the attention mechanism with the receptive field and enhances the extraction performance of complex features. After giving satisfactory results in Natural Language Processing (NLP), the integration of the transformer has shown better performance in image processing tasks (Dosovitskiy et al., 2021; Vaswani et al., 2017; Wu et al., 2020). Instead of applying convolutions equally on the image pixels regardless of their importance and content, we use transformers with self-attention mechanisms, which leverage the semantic segmentation. In fact, their main contribution is to deal with the pixel-convolution paradigm and relate the most significant semantic concepts in the image by maximizing the extraction capability of the semantic features (Wu et al., 2020). Three transformer blocks are incorporated to relate the significant extracted and aggregated features from complex backgrounds and foregrounds and to localize them accurately. Our model contains three transformer prediction modules to detect all target sizes with small, medium and large coverage. Therefore, we have replaced the conventional convolution modules C3 with transformer blocks C3TR. The latter includes a patch and position embedding that are in charge of generating the Region of Interest (RoI) and adding the corresponding positional information into a linear sequence, which is fed to the transformer encoder that performs the feature extraction. The encoder consists of two main blocks: Multi-Head self-Attention (MSA) and a Multilayer Perceptron (MLP). MSA collects and retrieves all the dependencies among the features using a scaled dot-product from the source to the target sequence (L. Liu et al., 2021) through various query, key and value representation subspaces, noted Q, K and V ∈ R. The MSA module is used to leverage and fuse features from different representation subspaces at different positions. Then, the MLP module processes the received features whilst following a self-attention mechanism to collect all the global and contextual information and reduce the loss. The structure of the C3TR transformer is shown in Fig. 2.

Fig. 2. The transformer C3TR block.
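To make the MSA computation concrete, the following is a minimal PyTorch sketch of scaled dot-product multi-head self-attention as used inside a C3TR-style block; it corresponds to the operation written out in Eqs. (1) and (2) below, and the module name, channel width and head count are illustrative assumptions rather than the authors' implementation.

# Minimal sketch of multi-head self-attention for a C3TR-style block.
# Names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # W^q, W^k, W^v projections
        self.proj = nn.Linear(dim, dim)      # fuses the heads back together

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); tokens = flattened feature-map positions
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                       # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5    # QK^T / sqrt(d)
        attn = attn.softmax(dim=-1)                                # Eq. (2)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)          # combine heads, Eq. (1)
        return self.proj(out)

# usage: a 20x20 feature map with 256 channels flattened into 400 tokens
tokens = torch.randn(1, 400, 256)
print(MultiHeadSelfAttention()(tokens).shape)  # torch.Size([1, 400, 256])

In the detection model, the tokens are the flattened positions of a backbone feature map, so attention can weight distant image regions against each other instead of treating every pixel neighbourhood identically.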

Fig. 3. Decoupled head structure.


Fig. 4. Overall structure of the developed model with the incorporated improved modules.


The MSA module is calculated as follows:

H = \sum_{i=1}^{N} h_i = \sum_{i=1}^{N} f(W_i^q q, W_i^k k, W_i^v v)    (1)

where the variable f is expressed as:

f = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) v \in \mathbb{R}^{n \times v}    (2)

The notation W refers to the parameter matrices representing the linear projections of the keys, queries and values, with their respective dimensional projections d_q, d_k, d_v. N refers to the number of attention heads used and h is the attention coefficient.

3.3. Decoupled head

With the continuous improvements made on YOLO algorithms, the followed structure with backbone and neck, as well as the used parameters and feature models, have known progressive improvements. However, the detection heads have not been improved (Ge et al., 2021). As is well known, one-stage object detection algorithms perform regression and classification as a whole, which causes a dual conflict, especially when the dataset is large and the number of parameters is high. To this end, we have introduced a decoupled head structure to perform the detection following three branches instead of one. The regression, classification and objectness are performed separately, as shown in Fig. 3. Once the input is received, the candidate features of different scales are extracted and fused to collect contextual and spatial information about the targets from the Feature Pyramid Network (FPN) used in the backbone and the neck, following bottom-up and top-down pathways, respectively. Following this, the head uses the decoupled method to refine the object detection, determine the class probability of the target and apply a bounding box around it. As pointed out in Ge et al. (2021) and Zhuang et al. (2022), the use of decoupled heads ensures an end-to-end process and significantly improves the convergence speed.

Further, Fig. 4 illustrates the main units of our improved model with the relevant modules from the input image to the outputted results. Furthermore, the location and the number of the aforementioned modules in the structure are provided.
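The separation just described can be pictured with the simplified PyTorch sketch below, in which a shared stem feeds three parallel branches for class probabilities, box regression and objectness; the layer widths and the single-scale layout are assumptions for readability, not the exact head used in the paper.

# Simplified decoupled detection head: separate classification, box-regression
# and objectness branches fed by a shared stem (illustrative sketch only).
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_ch: int = 256, num_classes: int = 5, num_anchors: int = 3):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, in_ch, 1), nn.SiLU())
        self.cls_branch = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU())
        self.reg_branch = nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU())
        self.cls_pred = nn.Conv2d(in_ch, num_anchors * num_classes, 1)  # class probabilities
        self.box_pred = nn.Conv2d(in_ch, num_anchors * 4, 1)            # box offsets
        self.obj_pred = nn.Conv2d(in_ch, num_anchors * 1, 1)            # objectness

    def forward(self, feat: torch.Tensor):
        x = self.stem(feat)
        cls_feat, reg_feat = self.cls_branch(x), self.reg_branch(x)
        return self.cls_pred(cls_feat), self.box_pred(reg_feat), self.obj_pred(reg_feat)

# usage on one FPN level (e.g. an 80x80 map covering small targets)
cls, box, obj = DecoupledHead()(torch.randn(1, 256, 80, 80))
print(cls.shape, box.shape, obj.shape)

Keeping classification and regression in separate branches avoids the dual conflict mentioned above, since each branch can specialize its features.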

Fig. 5. Overview of the research pipeline used to train and test the proposed model.


Table 1
Sample images from the dataset used.

3.4. Test time augmentation (TTA)

The Test Time Augmentation (TTA) technique is applied during the testing phase to improve the results. TTA applies a series of enhancement techniques by changing the image's lighting and configuration at three scales, i.e., 1, 0.83 and 0.67, applying flip orientations, processing the images at three different resolutions, and increasing the image size by 30 %. The modifications applied to the images are merged with the training outputs before the Non-Max Suppression (NMS) process (Ultralytics/Yolov5, 2020/2022). In fact, TTA is used to improve the detection performance of the model, reduce the effect of variations and bias in the data and increase the diversity of the data, as well as to provide more accurate predictions. This is achieved by integrating specific data augmentation techniques during the inference phase. We have developed a specific TTA module to be incorporated with our improved Yolov7, applying lighting changes, random cropping, flipping and scaling.

3.5. Ensemble learning

Since single machine learning models have limitations in conducting new sophisticated tasks on imbalanced datasets, it is challenging to develop a single model with high performance. Thus, ensembling more than one model into one framework allows upgrading the overall performance and performing better to achieve the desired objective (Chujai et al., 2015; MacKay, 1995; Tan et al., 2022).

Ensemble learning is a fusion approach that integrates data fusion and the knowledge of numerous models into a unified framework (Dong et al., 2020). This approach allows training several models and combining their learned weights (Sagi & Rokach, 2018). In this study, we apply ensemble learning by fusing the improved model and the model trained from scratch. The first model has more general knowledge, as TL is applied, while the second is a self-learned model trained from scratch on the dataset. The self-trained model learns more specific features about the targets, while the pre-trained model focuses more on general knowledge. Fig. 5 highlights the main steps followed to develop the proposed model.
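The following sketch illustrates, under simplifying assumptions, how TTA and the two-model ensemble can be combined at inference: the image is rescaled and flipped, each augmented copy is passed through the ensembled detectors, the boxes are mapped back to the original frame, and everything is pooled before NMS. The detector callable and its (x1, y1, x2, y2, confidence, class) output layout are hypothetical, not the Ultralytics implementation.

# Illustrative TTA inference: run the detector on flipped/rescaled copies of the
# image, map boxes back to the original frame, and pool everything before NMS.
import torch
import torch.nn.functional as F

def tta_detect(model, image, scales=(1.0, 0.83, 0.67)):
    """image: (1, 3, H, W) tensor; model returns (N, 6) rows [x1, y1, x2, y2, conf, cls]."""
    h, w = image.shape[-2:]
    detections = []
    for s in scales:
        for flip in (False, True):
            img = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
            if flip:
                img = torch.flip(img, dims=[-1])          # horizontal flip
            det = model(img).clone()
            det[:, :4] /= s                                # undo the rescaling
            if flip:                                       # undo the flip on x-coordinates
                det[:, [0, 2]] = w - det[:, [2, 0]]
            detections.append(det)
    return torch.cat(detections, dim=0)                   # merged detections, NMS applied afterwards

def ensemble(models, img):
    # simple detection-level ensembling: concatenate the outputs of both trained
    # models (fine-tuned and trained from scratch) before the shared NMS step
    return torch.cat([m(img) for m in models], dim=0)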


Fig. 6. Correlogram of the used dataset with a) the correlation of the targets' locations, widths, heights and coordinates and b) the label distribution of the target classes.

3.6. Evaluation indices

The conducted experiments are quantitatively assessed using the current predefined metrics: precision, recall, Intersection over Union (IoU), mean Average Precision (mAP), Frames Per Second (FPS) and inference time. These metrics evaluate the robustness of the detection models since they rely on the verified and missed detections, which are either true or false alarms. The precision and recall show the validity of the positively detected aerial targets by highlighting the percentage of relevant detected targets and of correctly classified targets, respectively. mAP is the average AP over the target classes and is measured by calculating the area under the precision-recall curve. Further, the number of parameters and the Floating Point Operations (FLOPS) indices inform on the needed computational resources and cost. The precision, recall and mAP metrics are calculated using the following equations:

Precision = \frac{\sum TP}{\sum TP + \sum FP}    (3)

Recall = \frac{\sum TP}{\sum TP + \sum FN}    (4)

mAP = \frac{1}{N} \sum_{n=1}^{N} AP_n = \frac{1}{N} \sum_{n=1}^{N} \int_0^1 P(R)\, dR    (5)

mAP@0.5 and mAP@0.5–0.95 are calculated at fixed IoU threshold values from 50 % to 95 %: mAP@0.5 is fixed at 50 % IoU, whereas mAP@0.5–0.95 covers the IoU range from 50 % to 95 %.

FPS = \frac{\sum_{i=1}^{k} I}{\text{Inference time}}    (6)

where TP, FP and FN refer to True Positives, False Positives and False Negatives, N is the number of classes, P and R are the precision and recall, I is the total number of images, and Inference time refers to the time spent by the model to infer all the images.
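For clarity, the indices above can be reproduced numerically as in the short sketch below; the counts are illustrative and the AP integration is the usual area under a monotonic precision-recall curve.

# Toy computation of precision, recall, AP (area under the P-R curve) and FPS.
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)          # Eq. (3)
    recall = tp / (tp + fn)             # Eq. (4)
    return precision, recall

def average_precision(recall_curve, precision_curve):
    # Eq. (5) for one class: integrate precision over recall in [0, 1];
    # mAP is then the mean of the per-class AP values.
    r = np.concatenate(([0.0], recall_curve, [1.0]))
    p = np.concatenate(([1.0], precision_curve, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]     # enforce a decreasing precision envelope
    return float(np.trapz(p, r))

def fps(num_images: int, total_inference_time_s: float) -> float:
    return num_images / total_inference_time_s   # Eq. (6)

print(precision_recall(tp=950, fp=30, fn=40))             # illustrative counts
print(fps(num_images=1179, total_inference_time_s=12.8))  # roughly 92 FPS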


Fig. 7. The evolved hyperparameters on our dataset.

4. Experimental results and discussion

In this section, we present the main milestones that we considered to develop our model. During our preliminary experimentation phase, we started by experimenting with and comparing many detection algorithms, including one-stage and two-stage object detectors, in order to find the best one.

4.1. Experimental data

• Dataset

The primary objective of the anti-drone system is the neutralization of the unwanted target in real time. This can be achieved by the use of an accurate identification and detection module, which is highly needed to avoid collateral damage to the environment and to non-hostile targets. Thus, developing and integrating an appropriate detection module is of great importance to avoid target misidentification. Knowing that there are different airborne objects that may be confused with flying drones, especially at high altitudes where they may look physically similar, we have tried to use the most encountered targets in the sky. Therefore, we have selected the targets that share similarities with drones. Indeed, there is no specific speed or type of aircraft the model can detect, since it is trained on different types of aircraft, at different altitudes during take-off and cruise, captured from different angles. We have collected the most representative aircraft images, which have different speeds.


To deal with these perspectives, we have developed a varied and representative dataset, collected mainly from Roboflow Universe (2022) and Pawełczyk and Wojtyra (2020), that includes five classes which are likely to be confused with drones. These classes are captured in different environments, contexts and times from a variety of capture angles. Table 1 shows a selection of the images from our dataset used to train our model. Indeed, we have covered the most encountered cases by selecting five categories:

- The first category represents drones with different shapes and frame configurations, i.e., quadcopter, octocopter, high-wing and low-wing. These images are taken from far and close in several backgrounds and foregrounds.
- The second category illustrates the most encountered birds in the sky at different altitudes and in different contexts.
- The third category shows images of airplanes in different areas. In fact, we have selected different types of aircraft at different altitudes during take-off and cruise, captured at different angles. Also, the used aircraft have different speeds.
- The fourth category presents day frames, which are the extremities of the electric poles massively present in urban areas. These lighting day frames are located in the altitude range of drones and birds and have many similarities with drones, especially from the ground view.
- The last category uses building images. We have used this category since many drones get wrongly identified when they are in front of a building. Therefore, we fed the model with pictures of buildings so that it can distinguish parts of buildings from the other airborne objects.

Algorithm. Development of a detection and tracking model for multiple airborne targets.
Input Data: Images I = {i_d1,.., i_dn; i_b1,.., i_bn; i_a1,.., i_an; i_da1,.., i_dan; i_bu1,.., i_bun} and their labels L = {l_d1,.., l_dn; l_b1,.., l_bn; l_a1,.., l_an; l_da1,.., l_dan; l_bu1,.., l_bun}, where n ∈ N. The dataset is composed of five airborne targets: Drones_i, Birds_i, Airplane_i, Dayframe_i, Building_i.
Output: Target class, confidence score, bounding boxes, inference time, tracking coordinates, trajectory prediction.
Goal: Perform airborne target classification, detection and tracking.
Initialization: Collect and clean the dataset N;
  Annotate the images into Yolo format (center_x, center_y, width, height);
  For I_n ∈ I (data augmentation):
    Perform Mosaic, MixUp and CutMix;
    Repeat for all training samples;
    Save original and augmented images;
    Split the images into training, validation and testing clusters;
  End for
Processing: Run and compare Yolov5 and Yolov7 algorithms from the smallest to the largest model;
  Select the model presenting the highest compromise between speed and performance;
  Train the model from scratch;
  Initialize and evolve the hyperparameters;
  Fine-tune the selected model;
  Use ensemble learning with the model trained from scratch and the fine-tuned model during testing;
  Apply TTA and ensemble learning;
Detection: While confidence threshold ≥ 0.6, IoU threshold ≥ 0.6 and t_inference < 0 s:
  Validate the model;
  Save the results;
  Export the model and the weights.

• Data augmentation

We integrate data augmentation techniques during the training to generate more diversified samples, aiming to enhance the recognition performance in complex and varied situations and to deal with a wide range of semantic variations. Mosaic, MixUp and CutMix techniques are employed to enrich the dataset by applying some predefined transformations. Mosaic data augmentation uses random cutting, random scaling and random layout to generate batch mosaic images in which four random images are combined, cropped and fed to the neural network, improving the detection of small targets at a smaller scale (Bochkovskiy et al., 2020). In fact, it is used to decrease the need for large batch sizes during training and, thus, for high computational resources. In addition, the MixUp and CutMix techniques are used mainly to generate inter-class training samples by interpolating the pixel values between two random images and by selecting random areas from one image and overlaying them on another, respectively (Yun et al., 2019; Zhang et al., 2018). Both techniques reduce the likelihood of overfitting during the training and increase the model's ability to generalize to unseen images that do not fit the training distribution. Further, we have added nighttime augmentation by changing the contrast and brightness of the images to enhance the detection performance at night.

• Histogram

Detecting targets with various shapes and frames is a challenging task since the dataset becomes more biased and varied, which is the case of our dataset. Fig. 6 presents correlation diagrams highlighting the distribution of the dataset and the corresponding positions (x, y), label coordinates and bounding box sizes. It shows the significant differences between the targets with respect to their physical characteristics and behavior. This confirms the fact that the dataset is biased, since the airborne targets have different characteristics, e.g., shape, color and size. Further, the widths and heights of the targets are scattered throughout the images and have a wide distribution, since they range from very small targets to large ones, as shown in Fig. 6.a. In addition, the capture angle of the target also impacts its characteristics. Fig. 6.b highlights the number of labels and their corresponding distributions over the images. Therefore, the dataset presents a significant challenge because of the variation in the size and location of the targets as well as the variability of the foregrounds and backgrounds. This is explained by the physical dissimilarities between the targets, which make it challenging for the detection model to accurately identify these targets against different ambient backgrounds and foregrounds. Therefore, the detection of these targets requires appropriate tools.

4.2. Experimental design

Our experiments were conducted using the Moroccan national research and education network MARWAN server, with high performance computing characterized by 4 Tesla Volta 100S-PCIE-32GB processors, NVIDIA-SMI 495.29.05 and 192 GB of total system memory; we have also used a local desktop with an NVIDIA Quadro P4000 GPU card, an Intel(R) Xeon(R) W-2155 @ 3.30 GHz, 32 GB of memory and Windows as OS.

Developing and training a specific model with custom data requires customized parameters that best fit the detection problem. In order to create new offspring with the maximized fitness of the appropriate combination of parameters (including learning rate, decay and momentum), we preliminarily evolved the properly initialized parameters by running the base model for 300 generations. It is noted that optimal performance is achieved only when the hyperparameters are adjusted and optimized correctly on the dataset. In our case, most of the targets are small in size and occupy small regions in the images. Thus, adapting the hyperparameter settings is of utmost importance to optimize the detection performance. Fig. 7 depicts the evolved hyperparameters used to train our model; it is divided into several subplots showing the distribution and concentration of each hyperparameter, represented by the fitness on the ordinate axis versus the hyperparameter values on the abscissa axis. The evolved hyperparameters are used to fine-tune the model.
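A toy version of this evolutionary search is sketched below: the current best genome is repeatedly mutated and kept only if its fitness improves. The fitness function here is a synthetic stand-in; in the actual pipeline it is the mAP obtained after training the base model with the candidate hyperparameters, and the listed hyperparameter values are only initialization examples, not the evolved values of Fig. 7.

# Toy mutation/selection loop for hyperparameter evolution (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
best = {"lr0": 0.01, "momentum": 0.937, "weight_decay": 0.0005}   # assumed starting genome

def fitness(hyp):
    # synthetic stand-in: in practice, train briefly and return the resulting mAP
    return -((hyp["lr0"] - 0.008) ** 2 + (hyp["momentum"] - 0.95) ** 2)

for generation in range(300):                 # 300 generations, as in the paper
    candidate = {k: max(v * float(rng.normal(1.0, 0.1)), 1e-6) for k, v in best.items()}
    if fitness(candidate) > fitness(best):
        best = candidate                      # keep the fitter genome

print(best)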


• Research methodology

In this study, the proposed model is designed to detect aerial targets in real time under different weather conditions and environmental contexts while presenting the best compromise between performance and speed. Indeed, we have proposed a suitable strategy able to overcome overfitting and biased training. To address these challenges, we started by collecting a large amount of varied and representative images that meet the quantity and quality requirements. Then, the images are thoroughly labelled and annotated using the Yolo format. Next, data augmentation techniques are applied to increase the variability of the training samples, overcome overfitting and enhance the performance of the detection model. Then, we compared the Yolov5 and Yolov7 single-shot object detection algorithms in view of their highly achieved results (Ultralytics/Yolov5, 2020/2022; C.-Y. Wang et al., 2022). Based on the experimental results, the model with the optimal compromise between speed and performance is selected. Then, the corresponding hyperparameters are evolved to further adjust the model's parameters on our dataset. Following this, we use TL and then fine-tune the new model to best fit the requirements and constraints. Once the training is finished, the weights are tested while using TTA and ensemble learning. As the last step, we set our confidence and IoU thresholds to evaluate the performance limit of the model as well as the inference time.

Algorithm 1 sheds light on the main steps followed to develop the airborne target detection algorithm.

4.3. Comparison analysis

Finding a suitable model is an important task in the development of a specific object detection model. Since deeper models have the ability to create a profound analysis of the input features, to fit detection functions better and to perform better when using large datasets (Schindler, Lidy, & Rauber, 2016; Zhong et al., 2019), we have compared the largest pre-trained models of Yolov5 and Yolov7, since the overall performance is also related to the depth of the model. Table 2 presents the results of training a selection of the models that we experimented with. It is shown that the models have different performances and that the pre-trained model Yolov7x presents the highest confidence scores.

Table 2. Performance comparison of the largest versions of Yolov5 and Yolov7.

Model        Precision   Recall   mAP@50   mAP@50–95
Yolov7-e6e   0.948       0.969    0.969    0.714
Yolo-tiny    0.935       0.679    0.862    0.594
YoloX        0.79        0.779    0.786    0.629
Yolov5x      0.883       0.878    0.911    0.628
Yolov6x      0.796       0.694    0.449    0.628
Yolov7x      0.965       0.946    0.979    0.716

Based on the presented results, Yolov7x shows the highest detection performance in comparison to the other models with respect to the evaluation metrics, i.e., accuracy, precision and mean Average Precision (mAP). It is shown that the overall performance depends mostly on the depth of the model, and the Yolov7x model outperforms the other models since it has the largest size metrics, with 70,839,714 parameters, 70,839,714 gradients and 189.0 Giga Floating Point Operations (GFLOPS). In fact, Yolov7x is a pre-trained model characterised by its learned parameters and knowledge, which are transferred and fine-tuned on the airborne target recognition task. Further, the number of layers and parameters has a direct impact on the performance, since the largest models extract the maximum amount of informative and discriminative features.

4.4. Performance analysis

Based on the previous results, we have fine-tuned the pretrained Yolov7x model for our detection task. This model was pre-trained on the large COCO dataset. After evolving the pretrained model to adjust the hyperparameters to our research study, the model is fine-tuned with the newly evolved hyperparameters, as described in the previous section. The performance of the improved model is shown in Fig. 8. The training and validation behaviors are detailed with respect to the aforementioned metrics (recall, precision, mAP@0.5 and mAP@0.5–0.95), in addition to the objectness, class and box losses. All the curves converge toward a fixed threshold after 100 training epochs. Besides, the model has demonstrated optimal performance as well as a high generalization ability to fit new data without bias, variance, overfitting or underfitting. Also, both the training and validation curves have similar behaviors without gaps between them, and they both converge starting from a specific point (≈ 80 epochs). Further, it is noteworthy that the validation loss stabilizes after 100 epochs, showing the effectiveness of the improved model in detecting and predicting the aerial targets under new conditions.
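The final filtering step of the methodology, keeping only predictions above fixed confidence and IoU thresholds (0.6 in the Algorithm of Section 4.1), can be sketched as follows; the tensor layout of the raw predictions is an assumption for illustration, not the exact post-processing code.

# Post-processing sketch: keep detections above the confidence threshold, then
# apply class-wise IoU-based NMS (thresholds of 0.6 follow the paper's setup).
import torch
from torchvision.ops import nms

def filter_detections(pred: torch.Tensor, conf_thr: float = 0.6, iou_thr: float = 0.6):
    """pred rows: [x1, y1, x2, y2, confidence, class_id]."""
    pred = pred[pred[:, 4] >= conf_thr]                 # confidence filtering
    keep = []
    for cls in pred[:, 5].unique():
        cls_mask = pred[:, 5] == cls
        idx = nms(pred[cls_mask, :4], pred[cls_mask, 4], iou_thr)   # per-class NMS
        keep.append(pred[cls_mask][idx])
    return torch.cat(keep) if keep else pred

raw = torch.tensor([[10., 10., 110., 110., 0.92, 0.],
                    [12., 12., 108., 108., 0.75, 0.],
                    [300., 40., 360., 90., 0.40, 1.]])
print(filter_detections(raw))   # overlapping boxes of class 0 merged, low-confidence box dropped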

Fig. 8. Training performance of the improved model.


Table 3. Detailed detection performance of the improved model.

Target class   Precision   Recall   mAP@0.50
All            0.97        0.961    0.979
Drone          0.992       0.984    0.993
Bird           0.97        0.924    0.966
Dayframe       1           0.962    0.976
Airplane       0.924       0.963    0.967
Building       0.965       0.974    0.991

• Prediction performance

To further assess the prediction performance of the model, we have plotted the evolution of the confidence versus the considered metrics. Fig. 9 shows the performance of the model with respect to the precision, recall and F1 score versus the confidence progress. The F1 curve (Fig. 9.a) reports the weighted harmonic mean of the precision and recall as well as the optimized confidence threshold, fixed at 0.505. The latter gives high results and the best performance balance between the two metrics. At this level, the average F1 score rapidly reaches a maximum value of 0.9. It is important to know the optimal confidence score of the model, since it is required to perform an accurate and real-time detection. Fig. 9.b shows the exponential evolution of the precision with respect to the confidence. The maximum precision is achieved at 0.945, which is explained by the fact that all the target classes reach high true positive detections. In addition, the recall curve has a decreasing behavior in contrast to the precision, as shown in Fig. 9.c. As the confidence score goes up, the recall decreases significantly because of the false detections. Moreover, the precision-recall curve presented in Fig. 9.d is more informative regarding our imbalanced dataset: the target classes have different threshold performance metrics, namely 0.993, 0.966, 0.976, 0.967 and 0.991 for the bird, drone, day frame, airplane and building classes. Table 3 highlights the overall detection performance as well as the detection results for each target class. We can see that the model has high confidence scores. Indeed, the improved model has shown better performance and its corresponding metrics have made significant progress compared to Yolov7x (see Fig. 9).

4.5. Real-time detection

Knowing that anti-drone platforms are real-time systems, it is important to select the fastest detection model, with respect to the total inference time, FPS and per-image detection time, that offers the highest detection performance. Indeed, the model is tested on 1179 unseen images with 2296 target labels including the five classes. We have ensembled the training weights from the fine-tuned improved model and the self-trained model into our model to be able to identify our target classes in whatever environment and context and to enhance the deployment performance on the edge device. As a result, the total inference time reaches 24.6 ms, and the pre-process and Non-Max Suppression (NMS) times are equal to 0.3 ms and 1.4 ms, respectively. Further, our model has the ability to process 92 FPS, which is much faster than the other detection models, e.g., Yolov5x and Yolov7x. The average time that our model takes to process an image and to perform the detection and prediction of the potential targets is equal to 0.02 ms.
detection and prediction of the potential targets is equal to 0.02 ms. The

Fig. 9. Improved model performance parameters: a) F1 curve b) precision curve c) recall curve and d) precision recall curve.


Table 4. Comparison experiments.

Model            Speed (ms)                          FPS    Detection time per image (ms)
                 Pre-process   Inference   NMS
Improved model   0.3           24.6        1.4       92     0.02
Yolov7x          0.3           45.8        4.9       51     0.038
Yolov5x          0.3           63.2        0.6       35     0.054
Yolov6x          0.20          71.81       0.99      16.4   0.06

The inference time, FPS and detection performance on the five different classes confirm the real-time detection capability of the suggested model. Table 4 highlights the speed performance of our model versus Yolov5 and Yolov7.

Fig. 10 presents a selection of the images generated during the testing stage of the proposed model. The testing images are generated with the aforementioned confidence score value, which optimizes the capacity of the model to detect the targets.

As shown in Fig. 11, the model gives satisfactory results in detecting, identifying and locating aerial targets of different types and sizes under different contexts, and especially in performing the detection within a small amount of time. Therefore, the combination of our improvements and the proposed methodology has demonstrated its effectiveness and its ability to be deployed in anti-drone systems. In addition, each class has a specific bounding box color, e.g., red for birds and pink for drones. Also, we have set the IoU and confidence thresholds at 0.6. When they exceed or are equal to this value, the detection result is considered positive; otherwise, the result is considered negative. It is confirmed that our model outperforms the existing models in view of the high compromise between speed and performance. In addition, the use of transformer blocks with C3TR modules, CSPResNeXt and decoupled head structures has proven its efficiency in identifying and detecting complex targets in different environments.

4.6. Ablation experiments

To evaluate the efficiency of our developed model with the contributions of CSPResNeXt, the transformer attention blocks and the decoupled head, ablation experiments were performed. Table 5 presents the results of the ablation experiments of the used modules to assess the contribution and effect of each one. Indeed, it is shown that the addition of each module increases the performance significantly. Thus, the developed model inherently has high performance thanks to the feature extraction and fusion module used in the backbone, which extracts and aggregates the utmost features from the images, as well as the decoupled head module, which significantly optimizes the detection performance.

4.7. The effect of using different improvement modules

To illustrate the effectiveness of the proposed model, we have conducted comparative experiments. During the training, we implemented and tested other improvement methods. We used different visual transformers and attention modules such as HorNet (Rao et al., 2022), ConvNeXt (Woo et al., 2023), Squeeze and Excitation (SE) (Hu et al., 2019), CBAM (Woo et al., 2018) and Spatial Pyramid Pooling Fast (SPPF) (He et al., 2015). Table 6 shows the experimental results. Finally, our model demonstrates its effectiveness in detecting the airborne targets and its significant improvement with respect to both the performance metrics and the processing speed.

4.8. Comparison with benchmark

The results of our proposed detection model and of the detection models discussed in the introduction are presented in Table 7. To the best of our knowledge, we confirm that our proposed model achieves the highest detection performance in terms of precision, recall, mAP@50 and mAP@50–95, and thus it outperforms the other models. In addition, we have carefully selected and used suitable performance and speed evaluation metrics to best assess the model according to our requirements and constraints. Further, our model is assessed with respect to several evaluation metrics in comparison to the other models, which shows that the reported performance is highly reliable.

Further, the achieved detection performance is explained by the integration of the CSPResNeXt module, the C3TR transformer and the decoupled head structure, as well as TTA, data augmentation and ensemble learning. Also, the selected combination of the dataset and the AI approaches has proven its effectiveness on our model and has generated satisfactory results with respect to the performance and speed compromise.

Fig. 10. A selection of the testing batches of the proposed model.


Fig. 11. Detection results showing drones, birds and day frames, illustrated with the corresponding confidence score of the detected target.

5. Conclusion

Due to the rise of accidents caused by the anarchic and malicious deployment of drones, it has become highly urgent to use anti-drone systems able to distinguish drones from non-drone objects in order to enhance airspace and civilian safety. In this work, we have proposed a novel airborne target detection model able to identify, detect and localize the most encountered aerial targets in a variety of contexts and environments. The improved Yolov7 has proven its effectiveness in distinguishing and detecting the most encountered aerial targets with small inference times. The effectiveness of the improved model is demonstrated by the use of CSPResNeXt modules in the backbone, which enhance the feature extraction to strengthen the learning capacity; Transformer C3TR blocks, which increase the attention mechanism with the receptive field and enhance the extraction performance of complex features; and decoupled head modules, which ensure an end-to-end process and improve the convergence speed. Further, the comparative experiments have shown that all the performances have improved significantly, and thus our proposed model presents the optimal compromise between performance and speed and produces high detection rates with fast inference times compared to different benchmark instances recently reported in the literature. The experimental results show that the model has a fast detection speed of 92 FPS with high detection performance: 0.97 precision, 0.961 recall, 0.979 mAP@0.50 and 0.732 mAP@0.50–0.95. Future works include inter-frame trajectory tracking and upcoming movement prediction using this model.

Table 5. Ablation experiments of the different used modules.

Model                              Precision   Recall   mAP@0.50
Without TL                         0.957       0.947    0.972
With TL                            0.965       0.945    0.978
TL + TRANS                         0.967       0.947    0.975
TL + Decoupled                     0.955       0.936    0.971
TL + TRANS + RESNEXT               0.95        0.879    0.921
TL + TRANS + DECOUPLED             0.95        0.952    0.974
TL + TRANS + RESNEXT + DECOUPLED   0.97        0.961    0.979

Table 6. Effect of using different improvements on the base model.

Improvement   Precision   Recall   mAP@50   mAP@50–95
HorNet        0.952       0.917    0.965    0.678
ConvNeXt      0.955       0.838    0.903    0.647
SE            0.961       0.951    0.955    0.698
SPPF          0.954       0.94     0.959    0.697
CBAM          0.96        0.95     0.967    0.706
Our model     0.97        0.961    0.979    0.732

Table 7. Comparison of our proposed model with other contributions.

Paper                             Baseline model   mAP@50   mAP@50–95
(Zhu, Lyu, Wang, & Zhao, 2021)    Yolov5           0.357    0.5731
(J. Wang et al., 2023)            Yolov5           0.6514   -
Our model                         Yolov7           0.979    0.732

CRediT authorship contribution statement

Ghazlane Yasmine: Conceptualization, Methodology, Software, Data curation. Gmira Maha: Validation, Supervision, Investigation, Methodology. Medromi Hicham: Visualization, Project administration.


Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

The conducted experiments were achieved through computational resources of HPC-MARWAN provided by the National Center for Scientific and Technical Research (CNRST), Rabat, Morocco.

References

Ajakwe, S. O., Ihekoronye, V. U., Kim, D.-S., & Lee, J. M. (2022). DRONET: Multi-tasking framework for real-time industrial facility aerial surveillance and safety. Drones, 6(2), 46. https://doi.org/10.3390/drones6020046
Ajakwe, S. O., Ihekoronye, V. U., Kim, D.-S., & Lee, J. M. (2022). Scenario-based drone detection and identification system for real-time industrial facility aerial surveillance and safety. 4.
Akyon, F. C., Eryuksel, O., Ozfuttu, K. A., & Altinuc, S. O. (2021). Track boosting and synthetic data aided drone detection. In 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1–5). https://doi.org/10.1109/AVSS52988.2021.9663759
Al-Qubaydhi, N., Alenezi, A., Alanazi, T., Senyor, A., Alanezi, N., Alotaibi, B., Alotaibi, M., Abdelhamid, A. A., Razaque, A., & Alotaibi, A. (2022). Unauthorized unmanned aerial vehicle detection using YOLOv5 and transfer learning [Preprint]. ENGINEERING. https://doi.org/10.20944/preprints202202.0185.v1
Behera, D. K., & Bazil Raj, A. (2020). Drone detection and classification using deep learning. In 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS) (pp. 1012–1016). https://doi.org/10.1109/ICICCS48265.2020.9121150
Bochkovskiy, A., Wang, C.-Y., & Liao, H.-Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection (arXiv:2004.10934). arXiv. https://arxiv.org/abs/2004.10934
Caltech-UCSD Birds-200-2011. (2022). Retrieved March 7, 2022, from http://www.vision.caltech.edu/visipedia/CUB-200-2011.html
Chujai, P., Chomboon, K., Teerarassamee, P., Kerdprasop, N., & Kerdprasop, K. (2015). Ensemble learning for imbalanced data classification problem. In The Proceedings of the 2nd International Conference on Industrial Application Engineering (pp. 449–456). https://doi.org/10.12792/iciae2015.079
Dong, X., Yu, Z., Cao, W., Shi, Y., & Ma, Q. (2020). A survey on ensemble learning. Frontiers of Computer Science, 14(2), 241–258. https://doi.org/10.1007/s11704-019-8208-z
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16×16 words: Transformers for image recognition at scale
Park, S., Kim, H. T., Lee, S., Joo, H., & Kim, H. (2021). Survey on anti-drone systems: Components, designs, and challenges. IEEE Access, 9, 42635–42659. https://doi.org/10.1109/ACCESS.2021.3065926
Pawełczyk, M. Ł., & Wojtyra, M. (2020). Real world object detection dataset for quadcopter unmanned aerial vehicle detection. IEEE Access, 8, 174394–174409. https://doi.org/10.1109/ACCESS.2020.3026192
Rao, Y., Zhao, W., Tang, Y., Zhou, J., Lim, S.-N., & Lu, J. (2022). HorNet: Efficient high-order spatial interactions with recursive gated convolutions (arXiv:2207.14284). arXiv. https://doi.org/10.48550/arXiv.2207.14284
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 779–788). https://doi.org/10.1109/CVPR.2016.91
Roboflow Universe: Open source computer vision community. (2022). Roboflow. Retrieved November 1, 2022, from https://universe.roboflow.com/
Sagi, O., & Rokach, L. (2018). Ensemble learning: A survey. WIREs Data Mining and Knowledge Discovery, 8(4), Article e1249. https://doi.org/10.1002/widm.1249
Samadzadegan, F., Dadrass Javan, F., Ashtari Mahini, F., & Gholamshahi, M. (2022). Detection and recognition of drones based on a deep convolutional neural network using visible imagery. Aerospace, 9(1), 31. https://doi.org/10.3390/aerospace9010031
Schindler, A., Lidy, T., & Rauber, A. (2016). Comparing shallow versus deep neural network architectures for automatic music genre classification, 5.
Shi, X., Yang, C., Xie, W., Liang, C., Shi, Z., & Chen, J. (2018). Anti-drone system with multiple surveillance technologies: Architecture, implementation, and challenges. IEEE Communications Magazine, 56(4), 68–74. https://doi.org/10.1109/MCOM.2018.1700430
Singha, S., & Aydin, B. (2021). Automated drone detection using YOLOv4. Drones, 5(3), 95. https://doi.org/10.3390/drones5030095
Soviany, P., & Ionescu, R. T. (2018). Optimizing the trade-off between single-stage and two-stage object detectors using image difficulty prediction. arXiv:1803.08707 [cs]. https://arxiv.org/abs/1803.08707
Tan, K. L., Lee, C. P., Lim, K. M., & Anbananthen, K. S. M. (2022). Sentiment analysis with ensemble hybrid deep learning model. IEEE Access, 10, 103694–103704. https://doi.org/10.1109/ACCESS.2022.3210182
Ultralytics/yolov5. (2022). [Python]. Ultralytics. https://github.com/ultralytics/yolov5 (Original work published 2020)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need.
Wang, C.-Y., Mark Liao, H.-Y., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., & Yeh, I.-H. (2020). CSPNet: A new backbone that can enhance learning capability of CNN. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1571–1580). https://doi.org/10.1109/CVPRW50498.2020.00203
Wang, X., Li, W., Guo, W., & Cao, K. (2021). SPB-YOLO: An efficient real-time detector for unmanned aerial vehicle images. In 2021 International Conference on Artificial Intelligence in Information and Communication (ICAIIC) (pp. 099–104). https://doi.org/10.1109/ICAIIC51459.2021.9415214
Wang, C.-Y., Bochkovskiy, A., & Liao, H.-Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors (arXiv:2207.02696). arXiv. https://arxiv.org/abs/2207.02696
Wang, J., Chen, Y., Dong, Z., & Gao, M. (2023). Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Computing and Applications, 35(10), 7853–7865. https://doi.org/10.1007/s00521-022-08077-5
Wisniewski, M., Rana, Z. A., & Petrunin, I. (2022). Drone model classification using convolutional neural network trained on synthetic data. Journal of Imaging, 8(8),
(arXiv:2010.11929). arXiv. https://arxiv.org/abs/2010.11929. 218. https://doi.org/10.3390/jimaging8080218
Fujii, S., Akita, K., & Ukita, N. (2021). Distant bird detection for safe drone flight and its Woo, S., Park, J., Lee, J.-Y., & Kweon, I.S. (2018). CBAM: Convolutional block attention
dataset, 2021. In 17th International Conference on Machine Vision and Applications module. 3–19. https://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun
(MVA) (pp. 1–5). https://doi.org/10.23919/MVA51890.2021.9511386. _Woo_Convolutional_Block_Attention_ECCV_2018_paper.html.
Garcia, A. J., Min Lee, J., & Kim, D. S (2020). Anti-drone system: A visual-based drone Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I. S., & Xie, S. (2023). ConvNeXt
detection using neural networks. In 2020 International Conference on Information and V2: Co-Designing and Scaling ConvNets With Masked Autoencoders, 16133–16142.
Communication Technology Convergence (ICTC) (pp. 559–561). https://doi.org/ https://openaccess.thecvf.com/content/CVPR2023/html/Woo_ConvNeXt_V2_Co
10.1109/ICTC49870.2020.9289397 -Designing_and_Scaling_ConvNets_With_Masked_Autoencoders_CVPR_2023_paper.
Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). YOLOX: Exceeding YOLO Series in 2021 html.
(arXiv:2107.08430). arXiv. https://arxiv.org/abs/2107.08430. Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan, Z., Tomizuka, M., Gonzalez, J., Keutzer,
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep K., & Vajda, P. (2020). Visual transformers: Token-based image representation and
convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis processing for computer vision (arXiv:2006.03677). arXiv. https://arxiv.org/abs/2
and Machine Intelligence, 37(9), 1904–1916. https://doi.org/10.1109/ 006.03677.
TPAMI.2015.2389824 Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual
Hu, J., Shen, L., Albanie, S., Sun, G., & Wu, E. (2019). Squeeze-and-excitation networks transformations for deep neural networks (arXiv:1611.05431). arXiv. https://arxiv.
(arXiv:1709.01507). arXiv. https://doi.org/10.48550/arXiv.1709.01507. org/abs/1611.05431.
Isaac-Medina, B.K.S., Poyser, M., Organisciak, D., Willcocks, C.G., Breckon, T.P., & Yasmine, G., Maha, G., & Hicham, M. (2022). Survey on current anti-drone systems:
Shum, H.P.H. (2021). Unmanned aerial vehicle visual detection and tracking using Process, technologies, and algorithms. International Journal of System of Systems
deep neural networks: A performance benchmark. ArXiv:2103.13933 [Cs]. htt Engineering, 12(3), 235–270. https://doi.org/10.1504/IJSSE.2022.125947
ps://arxiv.org/abs/2103.13933. Yasmine, G., Maha, G., & Hicham, M. (2023). Overview of single-stage object detection
Kim, J., Lee, D., Kim, Y., Shin, H., Heo, Y., Wang, Y., & Matson, E. T. (2022). Deep models: From Yolov1 to Yolov7. 2023 International wireless communications and
learning based malicious drone detection using acoustic and image data. 7. mobile computing (IWCMC), 1579–1584. 10.1109/IWCMC58020.2023.10182423.
Liu, B., & Luo, H. (2022). An improved Yolov5 for multi-rotor UAV detection. Electronics, Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., & Yoo, Y. (2019). CutMix: Regularization
11(15), 2330. https://doi.org/10.3390/electronics11152330 strategy to train strong classifiers with localizable features (arXiv:1905.04899).
Liu, L., Liu, J., & Han, J. (2021). Multi-head or single-head? An empirical comparison for arXiv. https://arxiv.org/abs/1905.04899.
transformer training (arXiv:2106.09650). arXiv. https://arxiv.org/abs/2106.09650. Zhang, H., Cisse, M., Dauphin, Y.N., & Lopez-Paz, D. (2018). mixup: Beyond empirical
Lykou, G., Moustakas, D., & Gritzalis, D. (2020). Defending airports from UAS: A survey risk minimization (arXiv:1710.09412). arXiv. https://arxiv.org/abs/1710.09412.
on cyber-attacks and counter-drone sensing technologies. Sensors, 20(12), 3537.
https://doi.org/10.3390/s20123537
MacKay, D. J. C. (1995). Ensemble learning and evidence maximization.

14
G. Yasmine et al. Intelligent Systems with Applications 20 (2023) 200296

Zhong, G., Ling, X., & Wang, L. (2019). From shallow feature learning to deep learning: Zhuang, W., Xing, F., Fan, J., Gao, C., & Zhang, Y. (2022). An integrated model for on-site
Benefits from the width and depth of deep architectures. WIREs Data Mining and teaching quality evaluation based on deep learning. Wireless Communications and
Knowledge Discovery, 9(1). https://doi.org/10.1002/widm.1255 Mobile Computing, 2022, 1–13. https://doi.org/10.1155/2022/9027907
Zhu, X., Lyu, S., Wang, X., & Zhao, Q. (2021). TPH-YOLOv5: Improved YOLOv5 based on
transformer prediction head for object detection on drone-captured scenarios.
