
Article
A Study on Tomato Disease and Pest Detection Method
Wenyi Hu 1, Wei Hong 1, Hongkun Wang 1, Mingzhe Liu 1 and Shan Liu 2,*

1 Department of Computer and Network Security, Chengdu University of Technology, Chengdu 610059, China
2 School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China
* Correspondence: shanliu@uestc.edu.cn

Abstract: In recent years, with the rapid development of artificial intelligence technology, computer
vision-based pest detection technology has been widely used in agricultural production. Tomato
diseases and pests are serious problems affecting tomato yield and quality, so it is important to detect
them quickly and accurately. In this paper, we propose a tomato disease and pest detection model
based on an improved YOLOv5n to overcome the problems of low accuracy and large model size
in traditional pest detection methods. Firstly, we use the Efficient Vision Transformer as the feature
extraction backbone network to reduce model parameters and computational complexity while
improving detection accuracy, thus solving the problems of poor real-time performance and model
deployment. Second, we replace the original nearest neighbor interpolation upsampling module with
the lightweight general-purpose upsampling operator Content-Aware ReAssembly of FEatures to
reduce feature information loss during upsampling. Finally, we use Wise-IoU instead of the original
CIoU as the regression loss function of the target bounding box to improve the regression prediction
accuracy of the predicted bounding box while accelerating the convergence speed of the regression
loss function. We perform statistical analysis on the experimental results of tomato diseases and pests
under data augmentation conditions. The results show that the improved algorithm improves mAP50
and mAP50:95 by 2.3% and 1.7%, respectively, while reducing the number of model parameters by
0.3 M and the computational complexity by 0.9 GFLOPs. The improved model has a parameter
count of only 1.6 M and a computational complexity of only 3.3 GFLOPs, demonstrating a certain
advantage over other mainstream object detection algorithms in terms of detection accuracy, model
parameter count, and computational complexity. The experimental results show that this method is
suitable for the early detection of tomato diseases and pests.
Keywords: identification of pests and diseases; object detection; YOLOv5; WIoU Loss; CARAFE; EfficientViT

1. Introduction

Tomatoes are an important vegetable crop that is widely grown throughout the world [1]. The tomato crop is important in terms of food and nutrition, economics, ecosystem services, and cultural and historical values. First, as a nutrient-rich vegetable, tomatoes play an important role in human health. They are rich in nutrients such as vitamin C, folate, and potassium [2], and are also a low-calorie food suitable for weight loss and maintaining a healthy diet. In addition, tomatoes can be used in a variety of sauces and condiments to enhance the taste and flavor of food. Second, tomato crops make a significant contribution to the economy and agricultural industry [3]. The cultivation and sale of tomatoes drive many related industries and employment opportunities, including seed production, greenhouse construction, transportation, and sales. Tomato crops also make a positive contribution to the ecosystem. Tomatoes can absorb a large amount of carbon dioxide, reducing greenhouse gases in the atmosphere [4], providing important habitat and food resources, and supporting local biodiversity. Tomatoes are also widely used as research objects and model plants, with significant research value in genetics, cell biology, biotechnology, molecular biology, and genomics [5]. Tomato production is often associated with the occurrence of
pests and diseases that can cause significant yield losses [6], particularly the impact of
late blight on tomato production in humid areas [7]. Therefore, prevention and control
of tomato pests and diseases are key to improving tomato yield and quality, and early
detection of diseases is extremely important for selecting the correct control methods and
stopping the spread of diseases [8]. Accordingly, it is of great significance to design a simple
and efficient real-time detection model for tomato pests and diseases to improve tomato
yield.
The main contributions of this paper include the following three points:
1. In this paper, we propose a lightweight model for tomato pest and disease detection
called YOLOv5n-VCW. This model improves the YOLOv5n architecture by replacing
the original backbone network with Efficient Vision Transformer (EfficientViT) [9],
replacing the original upsampling method with the lightweight and general-purpose
Content-Aware ReAssembly of FEatures (CARAFE) algorithm [10], and replacing
Complete-IoU (CIoU) Loss with Wise-IoU (WIoU) Loss [11]. All these improvements
are effective in improving the performance of the model in tomato disease and pest
detection tasks.
2. This paper evaluates and compares the performance of mainstream object detection
models, including YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x, SSD, Faster
R-CNN, and the proposed YOLOv5n-VCW model, in the task of detecting tomato
pests and diseases. The evaluation results show that YOLOv5n-VCW achieves mAP50
and mAP50:95 scores of 98.1% and 84.8%, respectively, which is a 2.3% and 1.7%
improvement over YOLOv5n and even outperforms other models such as YOLOv5s,
YOLOv5m, YOLOv5l, and YOLOv5x.
3. Another contribution of this paper is that it reduces the size of the model parameters
to 1.6 M, which is a reduction of 0.3 M compared with YOLOv5n. In addition, the
computational complexity is reduced by 0.9 GFLOPs, making the YOLOv5n-VCW
model much smaller than other evaluated models. This makes the YOLOv5n-VCW
model more suitable for use on devices with limited computational resources.
This paper is organized as follows: Following the introduction, Section 2 briefly
reviews previous work related to the detection of tomato pests and diseases using deep
learning methods. Section 3 describes the base model used in this paper. Section 4 describes
the method that improves on the base model. Section 5 provides experimental results.
Section 6 discusses the experimental results and the limitations of this study, and finally,
Section 7 summarizes some comments and future work.

2. Related Works
In recent years, with the development of machine learning and deep learning, the
application of computer vision in agriculture has achieved remarkable results, especially
in the field of plant disease recognition. Traditional object detection algorithms require
researchers to manually design features and use machine learning algorithms to classify
the extracted features. Representative feature descriptors include Haar features [12], His-
tograms of Oriented Gradients (HOG) features [13], and the Deformable Parts Model
(DPM) [14]. However, these algorithms rely greatly on manually extracted features, result-
ing in poor generalization, low robustness, and high computational complexity. Compared
with traditional methods, deep learning-based object detection algorithms improve detec-
tion speed and accuracy and have become a focus of research in pest and disease detection.
Mokhtar et al. [15] proposed a method to identify tomato yellow leaf curl virus and tomato
spotted wilt virus, achieving an average diagnostic accuracy of 90%. Lin et al. [16] pro-
posed a feature pyramid structure based on Faster R-CNN that fully exploits the features
of each layer, enabling the detector to maintain high detection accuracy for small targets.
Fuentes et al. [17] proposed incorporating refined filter banks into deep neural networks to
address the class imbalance of the tomato dataset. Ale et al. [18] proposed a lightweight
deep neural network-based plant disease detection method that reduces the model size
and parameter set. Zhao et al. [19] proposed the use of the YOLOv2 algorithm for tomato
Appl. Sci. 2023, 13, 10063 3 of 21

disease detection. Latif et al. [20] improved the ResNet model for pest and disease detection
and increased the detection accuracy. Prabhakar et al. [21] proposed the use of ResNet 101
to determine the severity of early leaf blight in tomatoes. Pattnaik et al. [22] proposed a
transfer learning-based Convolutional Neural Network (CNN) framework for tomato pest
classification. Jiang et al. [23] used a deep learning method to extract features of tomato leaf
diseases such as yellow leaf curl virus, bacterial spot, and late blight. Liu et al. [24] proposed
a tomato pest and disease detection algorithm based on the YOLOv3 convolutional neural
network. Wang et al. [25] proposed the YOLOv3-tiny-IRB algorithm, which improves the
feature extraction network, ameliorates the gradient disappearance phenomenon caused by
excessive depth of the network structure, and improves the detection accuracy of tomato
pests and diseases under occlusion and overlap conditions in real natural environments.
To address the problem of lost feature information for small targets during transmission,
Huang et al. [26] proposed an automatic identification and detection method for crop
leaf diseases based on a fully convolutional-switchable normalized dual-path network
(FC-SNDPN) to reduce the influence of complex backgrounds on the image identification
of crop diseases and pests.
Although the above studies have strongly demonstrated the effectiveness of CNN
structures in the field of plant pest and disease recognition, these models inevitably have
problems due to their large number of parameters and high computational complexity.
Therefore, many studies have focused on the design of lightweight CNNs. Kamal et al. [27]
used a MobileNet architecture based on depth-separable convolution constructed on public
datasets for plant disease recognition. Albahli et al. [28] proposed the use of DenseNet-77
as a backbone network for CornerNet, which reduces the model parameters and improves
accuracy. Zhong et al. [29] designed LightMixer, a lightweight tomato leaf disease recog-
nition model, to improve the computational efficiency of the entire network architecture
and reduce the loss of disease feature information. Chen et al. [30] optimized the Mo-
bileNetV2 model using an augmented loss function approach for recognizing rice diseases
and pests in complex contexts. The above studies performed well in the field of plant
disease recognition, but the lightweight design and accuracy of these models can be further
improved.

3. YOLOv5 Object Detection Algorithm


The YOLOv5 object detection algorithm was released by Ultralytics in June 2020 and has since maintained a fast iteration pace, currently reaching version 7.0. Compared with the newer YOLOv8, YOLOv5 has a simpler architecture and faster inference because its models are smaller; YOLOv8 achieves higher accuracy because its models are more complex. After weighing detection accuracy, model size, and detection speed, we chose to carry out the related research and improvement work based on YOLOv5n because it is better suited to real-time detection tasks. There are five network models in the YOLOv5 family, namely YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Each model has the same structure but differs in depth and width, with YOLOv5n having the smallest depth and width and therefore the smallest number of parameters and computational complexity, resulting in the fastest inference speed. Considering practical application scenarios, this paper selects YOLOv5n as the model to be improved, and the whole network structure is shown in Figure 1.
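For reference, the YOLOv5n baseline used in this work can be loaded directly from the Ultralytics hub. The snippet below is purely illustrative and is not part of the paper's code; the image path is a placeholder:

```python
import torch

# Load the pretrained YOLOv5n checkpoint from the Ultralytics hub and run one image.
model = torch.hub.load("ultralytics/yolov5", "yolov5n", pretrained=True)
results = model("tomato_leaf.jpg")   # placeholder path to an example image
results.print()                      # class, confidence, and box for each detection
```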
Figure 1. Structure of the YOLOv5n.

The network structure of YOLOv5n mainly consists of four parts: the input part, the backbone part, the neck part, and the prediction part. The Backbone consists of three modules: CBS, C3_1, and SPPF. The CBS module consists of convolution, batch normalization (BN), and the SiLU activation function. The C3_1 module is a stack of CBS modules with a residual structure that divides the input feature map into two parts. One part is processed by a small convolutional network, and the other part is processed directly by the next layer, and then the two parts are stitched together as the input of the next layer, which can reduce the parameters and computation of the network and at the same time improve the efficiency of feature extraction, thus speeding up model training and inference, as shown in Figure 2.

Figure 2. C3_1 structure diagram.

The SPPF module is a spatial pyramid pooling structure that achieves adaptive output sizes. Unlike traditional pooling structures, the output size of SPPF is independent of the input size, and it can achieve a fixed-dimensional output. The SPPF structure first performs pooling operations of different sizes on the input feature map, then uses convolutional operations to fuse the results of different scales, and finally outputs the fused feature map. The neck uses the PAN (Path Aggregation Network) + FPN (Feature Pyramid Networks) [31] network structure, which mainly consists of CBS, C3_2, Concat, and upsampling modules. C3_1 and C3_2 are used in the backbone and neck of the network, respectively, and the only difference between them is the BottleNeck structure inside, as shown in Figure 3.

Figure 3. C3_2 structure diagram.

The purpose of FPN and PAN is to achieve feature fusion. The FPN network generates a feature pyramid with different resolutions by connecting feature maps from different levels of the network. These feature pyramids can capture object information at different scales. The PAN network generates a feature pyramid with different resolutions and semantic information by successively merging higher-level feature maps with lower-level ones. In complex scenarios, using FPN and PAN networks, YOLOv5 can detect objects of different shapes and sizes, improving the robustness and reliability of object detection. The prediction part uses the three different scales of feature maps obtained in the neck to predict large, medium, and small objects. First, these feature maps are divided into lattices, and then predictions are made for each lattice using three anchor boxes. Finally, each detection box outputs a feature vector, including classification probability and object confidence. When an object has multiple prediction boxes, NMS (non-maximum suppression) [32] is used to filter the target boxes.

4. Methods

4.1. Backbone Network Improvements
Although YOLOv5's CSPDarknet53 backbone network performs well in object detection tasks, it has some drawbacks. For example, it has high computational and storage costs because CSPDarknet53 contains multiple convolutional and pooling layers that require large amounts of computational and storage resources for training and inference. Deploying CSPDarknet53 can be difficult on resource-constrained devices such as edge and mobile devices. The network structure of CSPDarknet53 is relatively fixed, making it difficult to customize and extend. As a result, CSPDarknet53 may not be able to meet the requirements of some specific application scenarios, such as those requiring specific receptive fields or specific feature extraction capabilities. In addition, although YOLOv5 provides several versions of models, including YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, these models may still have high computational and storage costs on resource-constrained devices.

To address these issues, the Google Brain team proposed the Vision Transformer (ViT) model [33] in 2020. The ViT model is an image classification model based on the Transformer architecture. It divides the image into a fixed number of image patches and uses each patch as an input to the model. ViT uses the multi-head self-attention mechanism of the Transformer encoder to process image data, so the model does not require convolutional operations to process images and therefore has better scalability and generalization ability. In the ViT model, each image patch is converted to a vector and then input to a multi-layer Transformer encoder for processing. The encoder maps the input vector sequence to another vector sequence, where each vector contains all the information in the sequence. To make the ViT model adaptable to images of different sizes, a deformable attention mechanism is introduced to adapt to different spatial structures when processing image patches. The structure is shown in Figure 4.

Figure 4. ViT structure.

The ViT model partially compensates for the shortcomings of traditional CNN backbone networks, such as the ability to process large images, computational and memory costs, generalization, and flexibility. Subsequently, the MIT-IBM Watson AI Lab improved the traditional ViT model to obtain the EfficientViT model. The EfficientViT model can maintain high accuracy while having lower computational and memory costs. It also enhances the multi-scale attention mechanism and has greater scalability, which can be adapted to different tasks and devices. For example, by increasing or decreasing the number of attention heads, accuracy can be increased or decreased, and the depth of the model can be adjusted to balance accuracy and computational cost. The structure of EfficientViT is shown in Figure 5.

Figure 5. EfficientViT structure.

MBConv [34] is a module for building efficient neural networks on mobile and embedded devices. It can reduce the number of parameters and computation while improving the accuracy of the model by using Depthwise Separable Convolution (DSC) and Inverted Residual Connection.
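To make these two ideas concrete, the following is a minimal PyTorch sketch of an MBConv-style block. The expansion ratio, activation choice, and class name are illustrative assumptions and do not reproduce the exact EfficientViT implementation:

```python
import torch
import torch.nn as nn

class MBConvSketch(nn.Module):
    """Illustrative MBConv-style block: 1x1 expansion, depthwise 3x3 convolution
    (the depthwise half of DSC), 1x1 projection, and an inverted residual connection."""
    def __init__(self, in_ch, out_ch, expand=4, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),  # depthwise 3x3
            nn.BatchNorm2d(mid), nn.SiLU(),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),             # pointwise projection
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

# e.g., MBConvSketch(32, 32)(torch.randn(1, 32, 56, 56)) keeps the (1, 32, 56, 56) shape
```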
Lightweight MSA is a lightweight multi-scale attention mechanism in the EfficientViT model that is used to capture features at different scales. It consists of scale grouping modules, lightweight attention modules, and scale fusion modules, which are responsible for partitioning the input feature tensor into several groups of different scales, with each group containing features of the same scale. Attention weighting is performed on the features of each scale to capture feature information at different scales. This module uses a set of lightweight convolutional layers and attention mechanisms to reduce computational and memory costs. The weighted feature tensors are then fused in the channel dimension to produce the final multi-scale feature representation. This multi-scale attention mechanism can effectively capture feature information at different scales in the input image and fuse them together to generate more expressive and rich feature representations. In addition, this mechanism has lower computational and memory costs due to the use of lightweight attention modules. The structure of the lightweight MSA is shown in Figure 6.

Figure 6. Lightweight MSA structure.

Therefore, in this article, the EfficientViT, a model improved from the traditional ViT model, is used as the backbone network, replacing the original backbone network in YOLOv5.

4.2. Up-Sampling Improvements
In the feature fusion network, YOLOv5 uses nearest-neighbor interpolation for upsampling, which only considers the position of the pixel points and does not fully exploit the information in the feature map. This can reduce the quality of the upsampled feature map. To address this issue, this paper proposes to use a lightweight and general upsampling operator called Content-Aware ReAssembly of FEatures (CARAFE) to replace nearest-neighbor interpolation and obtain a higher-quality upsampled feature map. The CARAFE operator consists of two blocks: the feature reassembly module and the upsampling kernel prediction module. The upsampling kernel prediction module analyzes and encodes the input feature map to predict the upsample kernels corresponding to different positions of the feature points. The feature reassembly module then performs upsampling using the predicted upsample kernels. Compared with nearest-neighbor interpolation, the CARAFE operator makes better use of the semantic information in the feature map during the upsampling process, thereby improving the quality of the upsampled feature map. The structure of the CARAFE module is shown in Figure 7.

Figure 7. CARAFE module structure.

The upsampling kernel prediction module consists of a content encoding sub-module, a kernel normalization sub-module, and a channel compression sub-module. First, the channel compression sub-module reduces the computational cost by using a $1 \times 1$ convolutional layer to compress the channel dimension of the input feature map. Then, the content encoding sub-module uses a convolutional layer with a size of $k_{encoder} \times k_{encoder} \times \sigma^{2} \times k_{up}^{2}$ to encode the deep semantic information contained in each feature point and its surrounding points in the input feature map, generating an upsample kernel with a shape of $H \times W \times \sigma^{2} \times k_{up}^{2}$. The channel dimension of the upsample kernel is then expanded to the width and height dimensions, resulting in an expanded upsample kernel with a shape of $\sigma H \times \sigma W \times k_{up}^{2}$, where $k_{up}^{2}$ is the size of the upsample kernel for a single feature point, and $\sigma$ is the upsampling rate. Finally, the kernel normalization sub-module uses the softmax function to normalize the predicted upsample kernel. Overall, the upsampling kernel prediction module uses the semantic information in the input feature map to adaptively generate corresponding upsample kernels for different feature points.
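To make the two CARAFE stages concrete, the following is a minimal PyTorch sketch of the kernel prediction step and the reassembly step described in this subsection. The module name, the 64-channel compression width, and the default kernel sizes are illustrative assumptions rather than the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFESketch(nn.Module):
    """Content-Aware ReAssembly of FEatures, illustrative sketch."""
    def __init__(self, channels, mid_channels=64, scale=2, k_up=5, k_enc=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, mid_channels, kernel_size=1)          # channel compression
        self.encode = nn.Conv2d(mid_channels, (scale * k_up) ** 2,                # sigma^2 * k_up^2 channels
                                kernel_size=k_enc, padding=k_enc // 2)            # content encoding

    def forward(self, x):
        n, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1) predict and normalize the upsampling kernels
        kernels = self.encode(self.compress(x))           # (n, s^2*k^2, h, w)
        kernels = F.pixel_shuffle(kernels, s)              # (n, k^2, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)                # kernel normalization
        # 2) feature reassembly: weighted sum over each k x k neighbourhood
        patches = F.unfold(x, kernel_size=k, padding=k // 2)        # (n, c*k^2, h*w)
        patches = patches.view(n, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(n, c, k * k, s * h, s * w)
        return (kernels.unsqueeze(1) * patches).sum(dim=2)          # (n, c, s*h, s*w)

# quick shape check: CARAFESketch(128)(torch.randn(1, 128, 20, 20)).shape -> (1, 128, 40, 40)
```

The pixel-shuffle step expands the predicted kernels to the σH × σW grid, and the unfold step gathers each k_up × k_up neighbourhood for the weighted reassembly.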
The role of the feature reassembly module is to map each feature point in the output feature map back to the input feature map and extract a region of size $k_{up} \times k_{up}$ centered on that feature point. The feature reassembly module then performs a dot product operation between the extracted region and the upsample kernel predicted by the upsample kernel prediction module for that point, generating the upsampled feature for that point. Since the feature reassembly module pays more attention to the information contained in the relevant feature points in the local region during the reassembly process, the reassembled feature map usually contains richer semantic information and is more expressive than the original feature map.

Compared with the original nearest-neighbor interpolation upsampling, the CARAFE upsampling operator can aggregate contextual semantic information within a larger receptive field and perform adaptive upsampling operations for different feature points using the predicted upsampling kernels. This operation effectively reduces the loss, ensures the integrity of the feature information, and improves the quality of the upsampled feature map.

4.3. Bounding Box Regression Loss Function Improvement

The loss function of YOLOv5 is composed of three sub-loss functions, namely the confidence loss function, the bounding box regression loss function, and the classification loss function, which are calculated as follows:

$$L_{v5}=\sum_{i}^{N}\left(\lambda_{1}L_{cls}+\lambda_{2}L_{obj}+\lambda_{3}L_{box}\right)=\sum_{i}^{N}\left(\lambda_{1}\sum_{j}^{B_{i}}L_{cls_{j}}+\lambda_{2}\sum_{j}^{S_{i}\times S_{i}}L_{obj_{j}}+\lambda_{3}\sum_{j}^{B_{i}}L_{CIoU_{j}}\right) \quad (1)$$
Lbox is the bounding box regression loss, Lobj is the object confidence loss, Lcls is the
object classification loss, λ1 , λ2 and λ3 are the weights of the three losses, where the object
classification loss and the object confidence loss are calculated using the Binary Cross
Entropy (BCE) loss function, and the calculation formulas are as follows:

$$L_{BCE}=-\frac{1}{n}\sum_{i}^{n}\left[y_{i}\times \log(\sigma(x_{i}))+(1-y_{i})\times \log(1-\sigma(x_{i}))\right] \quad (2)$$

$$\sigma(a)=\frac{1}{1+\exp(-a)} \quad (3)$$
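Equations (2) and (3) together are the standard binary cross-entropy on sigmoid outputs. As a quick sanity check (an illustrative snippet, not part of the paper's training code), PyTorch's built-in loss reproduces the formula:

```python
import torch
import torch.nn as nn

x = torch.randn(4)                       # raw logits x_i
y = torch.tensor([1.0, 0.0, 1.0, 0.0])   # binary targets y_i
builtin = nn.BCEWithLogitsLoss()(x, y)   # applies the sigmoid of Equation (3) internally
manual = -(y * torch.log(torch.sigmoid(x))
           + (1 - y) * torch.log(1 - torch.sigmoid(x))).mean()   # Equation (2)
assert torch.allclose(builtin, manual, atol=1e-6)
```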
The bounding box regression loss in version 5.0 and later uses the CIoU loss as the
bounding box regression loss function. The formula for calculating the CIoU loss is as
follows:
$$L_{CIoU}=1-IoU+\frac{\rho^{2}\left(b,b^{gt}\right)}{c^{2}}+\alpha v \quad (4)$$

$$IoU=\frac{\left|b\cap b^{gt}\right|}{\left|b\cup b^{gt}\right|} \quad (5)$$

$$v=\frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}}-\arctan\frac{w}{h}\right)^{2} \quad (6)$$

$$\alpha=\frac{v}{(1-IoU)+v} \quad (7)$$
Compared with the Generalized-IoU (GIoU) loss [35] used in previous versions of
YOLOv5, the CIoU loss takes into account additional penalties for the distance between
the centers of the predicted and ground truth boxes, the difference in their areas, and
incomplete overlap between the two boxes. Specifically, the distance between the centers
of the predicted and ground truth boxes and the difference in their areas are taken into
account before the IoU is calculated. This ensures that for the same IoU, the predicted box
with a smaller distance and a smaller area will score higher. The CIoU loss provides more
significant penalties for incomplete overlap between the boxes, resulting in a more accurate
regression of the predicted box to the shape and size of the ground truth box. However,
when the aspect ratio of the predicted box and the ground truth box is linearly related, the
aspect ratio penalty term becomes ineffective, which affects the regression of the predicted
box. Therefore, in this paper, we propose to replace the CIoU loss with the Wise-IoU (WIoU)
loss. The WIoU loss focuses more on the aspect ratio, position offset, and scale changes
of the bounding box than the traditional IoU loss and other improved versions such as
GIoU and CIoU. In addition, the WIoU loss has an important parameter, the outlier weight,
which adjusts the degree of penalty for different IoU values in the bounding box matching
process. Specifically, the larger the value of the outlier weight, the stronger the penalty
for low IoU matching results and the weaker the penalty for high IoU matching results.
This allows the bounding box regression to focus on the anchor boxes of normal quality,
preventing low-quality examples from producing large harmful gradients and making
the model more accurate in matching the shape and size of the ground truth box, thereby
improving the accuracy of object detection.
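For reference, the following is a minimal PyTorch sketch of the CIoU terms in Equations (4)-(7) for boxes in (x1, y1, x2, y2) format. It is an illustrative implementation, not the exact YOLOv5 code, and the WIoU loss used in this paper additionally weights examples by their outlier degree, which is not reflected in this baseline:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss per Equations (4)-(7); pred and target are (N, 4) tensors of corner boxes."""
    # intersection and union -> IoU, Equation (5)
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared centre distance rho^2 and squared diagonal c^2 of the smallest enclosing box
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio term v and trade-off weight alpha, Equations (6)-(7)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)      # reference implementations often detach this term

    return 1 - iou + rho2 / c2 + alpha * v  # Equation (4)
```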

4.4. Improved Network Structure


As a result of the above discussion, an improved model was designed. It is shown in Figure 8. It consists of three parts: the backbone is EfficientViT; the neck is formed by combining the YOLOv5 neck with the CARAFE operator, which replaces the original upsampling module; and the head is consistent with YOLOv5. The box loss function is changed to WIoU.
Figure 8. Improved network structure.

5. Experiment and Result

5.1. Datasets

The experimental dataset in this paper uses the tomato dataset from Kaggle, which contains eight categories of tomato diseases and pests, some of which are shown in Figure 9. In order to increase the training volume of the network model and improve its generalization ability, this paper uses the data augmentation approach to expand the tomato diseases and pests dataset based on the original data. The final dataset contains 20,413 images, including 16,332 images in the training set and 4081 images in the test set. Table 1 shows the statistical table for all data. The specific data enhancement methods used in this paper include (1) flipping the original images vertically or horizontally with a probability of 0.3; (2) translating the images with a scale of 1:1:10; (3) randomly rotating the images with angles ranging from −10° to 10° with a step size of 1°; (4) enhancing the contrast of the images with an enhancement range of 1:1:10; (5) injecting noise into the images with a mean of 0 and a standard deviation of 1:1:5; (6) partially masking the original image with a probability of 0.3; and (7) using a random light transformation. Through these data enhancement methods, the dataset can be expanded, the model's generalization ability can be improved, and overfitting problems can be effectively avoided.

Table 1. Statistical table.
Class Training Set (Sheets) Test Set (Sheets)


healthy 1700 425
bacterial spot 1900 475
early blight 1900 475
late blight 1850 462
leaf mold 1880 470
powdery mildew 1827 456
septoria leaf spot 1740 435
spider mites 1740 435
mosaic virus 1790 447
yellow leaf curl virus 1965 491
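As an illustration only, an image-level approximation of the augmentations listed in Section 5.1 can be written with torchvision as below. The exact magnitudes used in the paper (the 1:1:10 and 1:1:5 sweeps) are not reproduced, and in a detection setting the bounding-box labels would additionally have to be transformed together with the images:

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.05):
    """Inject zero-mean Gaussian noise into a tensor image with values in [0, 1]."""
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.3),                      # (1) horizontal flip
    transforms.RandomVerticalFlip(p=0.3),                        # (1) vertical flip
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),   # (2)-(3) translation and rotation within ±10°
    transforms.ColorJitter(contrast=0.5, brightness=0.5),        # (4), (7) contrast and lighting changes
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),                       # (5) noise injection
    transforms.RandomErasing(p=0.3),                             # (6) partial masking
])
```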

5.2. Experimental Environment


The experimental environment in this paper is Ubuntu 20.04, with a system running
memory of 80 GB, a CPU of Intel(R) Xeon(R) Platinum 8358P CPU @ 2.60 GHz, and a GPU
of NVIDIA GeForce RTX2080 Super (8 GB). The deep learning framework used in this paper is PyTorch 1.11.0, and the CUDA version is 11.5.
paper is PyTorch 1.11.0, and the CUDA version is 11.5.

Figure 9.
Figure 9. Tomato
Tomato pest
pestand
anddisease
diseaseexample
examplediagram.
diagram.

Table
5.3. 1. Statistical
Model table.
Evaluation Metrics and Training Parameter Settings
The experimental
Class part of thisTraining
paper uses
Set mean average precision
(Sheets) (mAP)
Test Set of 50 and
(Sheets)
50:95 as metrics
healthyfor model detection accuracy;
1700 FLOPs are used as a measure
425 of model
computation; the higher
bacterial spot the FLOPs, the higher
1900 the computational complexity
475 of the model,
which requires more computational resources to complete training and inference. Param-
early blight 1900 475
eters is the number of parameters to be learned in the model, and in general, the higher
late blight 1850 462
the number of parameters, the higher the complexity of the model, which requires more
leaf mold 1880 470
data and computational resources for training and inference. mAP50 and mAP50:95 were
powdery mildew 1827 456
chosen as metrics to assess detection accuracy because they can simultaneously assess the
model’sseptoria
ability leaf spot and classify targets
to locate 1740
for detection, which is more435reflective of the
spider mites
model’s detection 1740recall. AP50 is the average 435
capability than accuracy and accuracy of each
categorymosaic
at an IoUvirus 1790
threshold of 0.5, and similarly, 447
AP95 is the average accuracy of each
yellow leaf curl virus 1965 491
category at an IoU threshold of 0.95. mAP is the average of the APs of all categories at that
IoU threshold, calculated as follows:
5.2. Experimental Environment n
The experimental environment in this
mAP
memory of 80 GB, a CPU of Intel(R) Xeon(R)
1

= paperAP is (Ubuntu
j) 20.04, with a system running
n Platinum 8358P CPU @ 2.60 GHz, and a GPU
(8)
j =1
of NVIDIA GeForce RTX2080 Super (8 GB). The deep learning framework used in this
n
paper is PyTorch 1.11.0, and the CUDA 1 version is 11.5.
∑PiIoU =0.5 RiIoU =0.5

AP50 = (9)
n
5.3. Model Evaluation Metrics and Training
i =1Parameter Settings
The experimental part of this 1paper uses mean average precision (mAP) of 50 and
AP50 : 95 = ( AP50 + AP55 + . . . + AP95) (10)
50:95 as metrics for model detection 10 accuracy; FLOPs are used as a measure of model
computation; the higher
In the formula, thenumber
n is the FLOPs,ofthe higher the
categories, R iscomputational
the recall rate, complexity
which is theofratio
the
model, which requires more computational resources to complete training and
of the number of true positive and predicted positive samples to the total number of true inference.
Parameters
positive is the and
samples, number of parameters
P is the to be
precision rate, learned
which in ratio
is the the model, and in general,
of the number the
of samples
higher
that arethe number
true of parameters,
positive and predicted thepositive
higher the complexity
samples to theoftotal
the model,
number which requires
of predicted
more data
positive and computational resources for training and inference. mAP50 and mAP50:95
samples.
wereFor
chosen as metrics topart,
the experimental assess
thedetection accuracyrecommended
official YOLOv5 because they can simultaneously
hyperparameters as-
were
sess the model’s ability to locate and classify targets for detection, which
used for the model training parameter settings. These hyperparameters were selected is more reflective
of the on
based model’s detection
experiments capability
conducted onthan accuracy
the MS COCO and recall. AP50 is the average accuracy
dataset.
of each category at an IoU threshold of 0.5, and similarly, AP95 is the average accuracy of
each category at an IoU threshold of 0.95. mAP is the average of the APs of all categories
at that IoU threshold, calculated as follows:
positive samples, and 𝑃 is the precision rate, which is the ratio of the number of samples
that are true positive and predicted positive samples to the total number of predicted pos-
itive samples.
For the experimental part, the official YOLOv5 recommended hyperparameters were
used for the model training parameter settings. These hyperparameters were selected
Appl. Sci. 2023, 13, 10063 12 of 21
based on experiments conducted on the MS COCO dataset.

5.4. Experimental Results and Analysis

5.4.1. Experimental Analysis of Improved Backbone Networks
To verify the effectiveness of the improved backbone network, the original CSPDarknet53 backbone network was improved to EfficientViT, leaving the rest of the network unchanged. The model with the improved backbone network is renamed YOLOv5n-V. Experimental comparisons were performed between the two models before and after the improvement, and the results are shown in Table 2.
Table 2. Improved backbone network validation experiment.

Models mAP@0.5/% mAP@0.5:0.95/% Params/M FLOPs
YOLOv5n 95.8 83.1 1.9 4.2 G
YOLOv5n-V 96.9 83.7 2.0 4.4 G

Figures 10 and 11 show the mAP curves of the original model and the model with
Figures 10 and 11 show the mAP curves of the original model and the model with the
the EfficientViT.
EfficientViT.

Appl. Sci. 2023, 13, x FOR PEER REVIEW 13 of 22

Figure10.
Figure 10.mAP@0.5
mAP@0.5 curve.
curve.

Figure 11.
Figure 11. mAP@0.5:0.95 curve.

5.4.2. Experimental Analysis of the Improved Upsampling Operator


To verify the effectiveness of the improved upsampling operator, the nearest neigh-
bor interpolation upsampling was improved to the CARAFE upsampling, leaving the rest
of the network unchanged. The model with the improved upsampling operator is re-
Appl. Sci. 2023, 13, 10063 13 of 21
Figure 11. mAP@0.5:0.95 curve.

5.4.2. Experimental
5.4.2. Experimental Analysis
Analysis of
of the
the Improved
Improved Upsampling
Upsampling Operator
Operator
To verify
To verifythe
theeffectiveness
effectivenessofofthe
theimproved
improvedupsampling
upsampling operator,
operator, thethe nearest
nearest neigh-
neighbor
bor interpolation
interpolation upsampling
upsampling waswas improved
improved to the
to the CARAFE
CARAFE upsampling,
upsampling, leaving
leaving thethe
restrest
of
of the network unchanged. The model with the improved upsampling
the network unchanged. The model with the improved upsampling operator is renamed operator is re-
named YOLOv5n-C.
YOLOv5n-C. Experimental
Experimental comparisons
comparisons were made werebetween
made between
the two the two models
models before andbe-
fore and after the improvement, and the results are shown
after the improvement, and the results are shown in Table 3. in Table 3.

Table 3.
Table 3. Improved
Improved upsampling
upsampling operator
operator validation
validation experiment.
experiment.

Models
Models mAP@0.5/%
mAP@0.5/% mAP@0.5:0.95/%
mAP@0.5:0.95/% Params/M
Params/M FLOPs
FLOPs
YOLOv5n
YOLOv5n 95.8
95.8 83.1
83.1 1.9
1.9 4.2GG
4.2
YOLOv5n-C
YOLOv5n-C 97.1
97.1 84.1
84.1 1.5
1.5 3.0GG
3.0

and 13
Figures 12 and 13 show
show the
the mAP
mAP curves
curves for
for the
the original
original model
model and
and the
the model
model using
using
CARAFE upsampling.
CARAFE upsampling.

Appl. Sci. 2023, 13, x FOR PEER REVIEW 14 of 22


Figure 12.
Figure 12. mAP@0.5 curve.

Figure 13. mAP@0.5:0.95


Figure13. mAP@0.5:0.95 curve.
curve.

5.4.3. Experimental Analysis of the Improved Bounding Box Regression Loss Function
In order to verify the effectiveness of improving the box loss function to WIoU Loss,
the box loss function of the original YOLOv5n was improved from CIoU Loss to WIoU
Loss while keeping the rest of the network unchanged. The model with the improved loss
Appl. Sci. 2023, 13, 10063 14 of 21
Figure 13. mAP@0.5:0.95 curve.

5.4.3. Experimental
5.4.3. Experimental Analysis
Analysis of of the
the Improved
Improved Bounding
Bounding BoxBox Regression
Regression Loss
Loss Function
Function
In order
In order to
to verify
verify the
the effectiveness
effectiveness of of improving
improving the
the box
box loss
loss function
function to
to WIoU
WIoU Loss,
Loss,
the box loss function of the original YOLOv5n was improved from CIoU
the box loss function of the original YOLOv5n was improved from CIoU Loss to WIoU Loss to WIoU
Loss while
Loss while keeping
keeping the
the rest
rest of
of the
the network
network unchanged.
unchanged. The
The model
model with
with the
the improved
improved loss
loss
function is called YOLOv5n-W. Experimental comparisons were made
function is called YOLOv5n-W. Experimental comparisons were made between the two between the two
models before
models before and
and after
afterthe
theimprovement,
improvement,and andthe
theresults
resultsare
areshown
shownin inTable
Table4.4.

Table4.4. Improved
Table Improved WIoU
WIoULoss
Lossvalidation
validationexperiment.
experiment.

Models
Models mAP@0.5/%
mAP@0.5/% mAP@0.5:0.95/%
mAP@0.5:0.95/% Params/M
Params/M FLOPs
FLOPs
YOLOv5n
YOLOv5n 95.8
95.8 83.1
83.1 1.9
1.9 4.2GG
4.2
YOLOv5n-W
YOLOv5n-W 96.2
96.2 83.7
83.7 1.9
1.9 4.2GG
4.2

Figures
Figures1414and
and1515show
showthe
themAP
mAPcurves of of
curves thethe
original model
original andand
model the the
model withwith
model the
WIoU LossLoss
the WIoU replacement.
replacement.

Appl. Sci. 2023, 13, x FOR PEER REVIEW 15 of 22


Figure 14. mAP@0.5
Figure 14. mAP@0.5 curve.

Figure 15. mAP@0.5:0.95 curve.

5.4.4. Ablation Experiments

In this paper, three improvement methods are proposed, namely V (EfficientViT), C (CARAFE upsampling operator), and W (WIoU Loss). In order to test the effectiveness of the three improvement methods, the ablation experiments in this paper are designed in the following two directions:
(1) Using the original YOLOv5n as a base, only one of the above improvements was
added to each group of experiments separately to verify the effectiveness of each
improvement method on the original algorithm.
(2) Based on the finally obtained improved algorithm, YOLOv5n-VCW, each experimental
group eliminated only one of the above improvement methods separately to verify
the effect of each improvement method on the final improved algorithm.
“3” indicates the introduction of the method; the design of the ablation experiment
is shown in Table 5, and the average accuracy mean curve index at training is shown in
Figures 16 and 17. The PR curve is shown in Figure 18.

Table 5. Ablation experiment results.

Models V C W mAP@0.5% mAP@0.5:0.95/% Params/M FLOPs


YOLOv5n 95.8 83.1 1.9 4.2 G
YOLOv5n-V 3 96.9 83.7 1.5 3.0 G
YOLOv5n-C 3 97.1 84.1 2.0 4.4 G
YOLOv5n-W 3 96.2 83.7 1.9 4.2 G
YOLOv5n-VC 3 3 97.8 84.5 1.6 3.3 G
YOLOv5n-VW 3 3 97.3 84.1 1.5 3.0 G
YOLOv5n-CW 3 3 97.7 84.6 2.0 4.4 G
YOLOv5n-VCW 3 3 3 98.1 84.8 1.6 3.3 G

Figure16.
Figure 16.mAP@0.5
mAP@0.5 curve.
curve.
Appl. Sci. 2023, 13, 10063 16 of 21

Figure 16. mAP@0.5 curve.

Appl. Sci. 2023, 13, x FOR PEER REVIEW 17 of 22

Figure17.17.
Figure mAP@0.5:0.95
mAP@0.5:0.95 curve.
curve.

Figure 18. PR curve.


Figure 18. PR curve.
5.4.5. Comparison Experiments

To further verify that the improvement algorithm proposed in this paper has certain advantages over other mainstream object detection algorithms in terms of detection accuracy, model size, and detection speed, we compared the proposed improvement algorithm YOLOv5n-VCW with the Faster R-CNN, SSD, YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x algorithms on this dataset, and the experimental results are shown in Table 6.

Table 6. Comparison of experimental results with mainstream object detection algorithms.



Models              mAP@0.5/%    mAP@0.5:0.95/%    Params/M    FLOPs
YOLOv5n             95.8         83.1              1.9         4.2 G
YOLOv5s             96.8         83.7              7.2         16.5 G
YOLOv5m             97.1         84.1              21.2        49.0 G
YOLOv5l             97.4         84.3              46.5        109.1 G
YOLOv5x             97.5         84.7              86.7        205.7 G
YOLOv3              92.3         75.9              61.5        155.4 G
SSD                 78.5         59.7              23.6        273.1 G
Faster R-CNN        81.7         64.5              136.6       369.7 G
YOLOv5n-VCW (Ours)  98.1         84.8              1.6         3.3 G

Finally, the validation set images are detected using the base model and the improved model. Figure 19 shows the validation set with real labels, and Figures 20 and 21 show the detection results on the validation set. From the figures, it can be seen that the improved model detects objects with generally higher confidence, and more objects are detected.
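For reference, a minimal sketch of how such a side-by-side check could be scripted with the standard YOLOv5 torch.hub interface is given below; the weight files and image paths ("yolov5n_baseline.pt", "best_vcw.pt", "data/val/images/...") are placeholders introduced here for illustration, not artifacts released with this paper.

```python
import torch

# Hypothetical comparison script: the weight files and image paths below are
# placeholders; only the torch.hub loading interface is standard YOLOv5 usage.
baseline = torch.hub.load("ultralytics/yolov5", "custom", path="yolov5n_baseline.pt")
improved = torch.hub.load("ultralytics/yolov5", "custom", path="best_vcw.pt")

images = ["data/val/images/sample_001.jpg", "data/val/images/sample_002.jpg"]
for name, model in [("YOLOv5n", baseline), ("YOLOv5n-VCW", improved)]:
    results = model(images)          # AutoShape inference at the default 640 px size
    print(name)
    print(results.pandas().xyxy[0])  # boxes, confidence, and class for the first image
```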

Figure 19. Labeled images.

Figure 20. YOLOv5n predictive images.

Figure 21. YOLOv5n-VCW predictive images.

6. Discussion
Based on the experimental results of this study, the following conclusions can be
drawn:
First, by replacing the backbone network with EfficientViT, we found that the number of parameters in the model was significantly reduced while the accuracy was slightly improved. Specifically, mAP@0.5 and mAP@0.5:0.95 increased by 1.1% and 0.6%, respectively, while the model's floating-point operations decreased by 1.2 GFLOPs, a 28% decrease, and the number of parameters decreased by 0.4 M, a 21% decrease. This indicates that the EfficientViT backbone network adopted in this paper is effective for tomato disease and pest detection tasks.
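Parameter and FLOPs figures of this kind can be measured with a model profiler; the sketch below, which assumes the third-party thop package, profiles the unmodified YOLOv5n baseline at a 640 × 640 input. The EfficientViT-based variant is project-specific and would be profiled in the same way once its definition and weights are available.

```python
import torch
from thop import profile  # assumes the thop package is installed

# Profile the stock YOLOv5n at a 640x640 input; autoshape=False returns the
# raw detection model rather than the pre/post-processing wrapper.
model = torch.hub.load("ultralytics/yolov5", "yolov5n", autoshape=False)
dummy = torch.zeros(1, 3, 640, 640)
macs, params = profile(model, inputs=(dummy,), verbose=False)
# YOLOv5 conventionally reports GFLOPs as roughly 2 x MACs / 1e9
print(f"params: {params / 1e6:.1f} M, GFLOPs: {2 * macs / 1e9:.1f}")
```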
Secondly, we found that improving the original nearest neighbor interpolation up-
sampling to CARAFE upsampling can improve the model’s detection accuracy with a
slight increase in model size and computational complexity. Specifically, mAP@0.5 and
mAP@0.5:0.95 increased by 1.3% and 1.0%, respectively, while the increase in model floating-
point operations and number of parameters was within an acceptable range. This indicates
that the CARAFE upsampling operator adopted in this paper is effective in improving
model accuracy in tomato disease and pest detection tasks.
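For readers unfamiliar with the operator, the following is a minimal, unoptimized PyTorch sketch of the CARAFE idea (kernel prediction followed by content-aware reassembly); the default channel and kernel sizes shown here are illustrative and are not necessarily the exact configuration used in our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Minimal sketch of Content-Aware ReAssembly of FEatures (upsampling by `scale`)."""
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)                # channel compressor
        self.encode = nn.Conv2d(c_mid, (scale * k_up) ** 2, k_enc,
                                padding=k_enc // 2)                  # content encoder

    def forward(self, x):
        b, c, h, w = x.shape
        s, k = self.scale, self.k_up
        # 1) predict a normalized k x k reassembly kernel for every output pixel
        kernels = F.pixel_shuffle(self.encode(self.compress(x)), s)  # (b, k*k, s*h, s*w)
        kernels = F.softmax(kernels, dim=1)
        # 2) gather the k x k neighborhood of every low-resolution pixel
        patches = F.unfold(x, k, padding=k // 2).view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(b, c, k * k, s * h, s * w)
        # 3) reassemble: content-aware weighted sum over each neighborhood
        return (patches * kernels.unsqueeze(1)).sum(dim=2)           # (b, c, s*h, s*w)
```

In a YOLOv5-style neck, a module of this form would take the place of the nearest neighbor upsampling layers.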
Thirdly, we found that improving the original CIoU Loss to a WIoU Loss can improve
model detection accuracy without affecting model volume or detection speed. Specifically,
mAP@0.5 and mAP@0.5:0.95 increased by 0.4% and 0.6%, respectively. This indicates that
the WIoU Loss adopted in this paper is significantly effective for tomato disease and pest detection tasks.
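For reference, the sketch below implements the monotonic (v1) form of the Wise-IoU loss described by Tong et al. [11]; the dynamic focusing mechanism of the full loss multiplies this term by a non-monotonic gain and is omitted here for brevity.

```python
import torch

def wiou_v1_loss(pred, target, eps=1e-7):
    """Sketch of the Wise-IoU v1 box loss; `pred` and `target` are (N, 4) tensors
    of (x1, y1, x2, y2) corner coordinates."""
    # plain IoU between predicted and ground-truth boxes
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # width/height of the smallest box enclosing both boxes (detached, as in the original paper)
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    # squared distance between box centers, normalized by the enclosing-box diagonal
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    r_wiou = torch.exp((dx ** 2 + dy ** 2) / (cw ** 2 + ch ** 2 + eps).detach())
    return r_wiou * (1.0 - iou)  # per-box loss; reduce with mean() in practice
```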
Additionally, we found that the proposed improved algorithm, YOLOv5n-VCW, has
the highest detection accuracy while ensuring model lightness. Compared with other
mainstream object detection algorithms, the proposed algorithm in this paper can maintain
high detection accuracy even with a decrease in the number of model parameters and
computational complexity, making it highly applicable.
Finally, we found that the proposed YOLOv5n-VCW algorithm has significant advan-
tages in resource-limited situations. Compared with other algorithms, this algorithm has
the highest detection accuracy with a smaller number of model parameters and computa-
tional complexity. This indicates that the YOLOv5n-VCW algorithm proposed in this paper
is significantly effective and superior for tomato disease and pest detection tasks.
However, our study has some limitations. First, this study conducted experiments
only for tomato pest detection tasks and did not consider other types of object detection
tasks. Second, this study was only based on a single dataset, and we did not consider the
differences and influences between different datasets. Finally, the lighting and occlusion effects in the experiments were simulated through data augmentation, which may differ from real scenes, so the evaluation of these conditions remains largely theoretical; validating the model under real field conditions is also a direction for future research.
In conclusion, the results of this study show that the proposed improved algorithm
theoretically has significant effectiveness and superiority in tomato disease and pest de-
tection tasks. These results provide useful clues for future research and feasible solutions
for practical applications. Future research can go a step further to explore how to apply
the proposed improved algorithm to other types of object detection tasks and conduct
experimental verification, as well as how to improve the real-time performance of the
model and better deal with the differences and influences between different datasets.

7. Conclusions
To address the problems of large model size and low detection performance of existing
models, this paper proposes an improved tomato pest and disease detection algorithm
called YOLOv5n-VCW, based on YOLOv5n. First, EfficientViT is used to replace the feature
extraction module of the original YOLOv5, which significantly reduces the computational
and parameter costs of the model. Second, the nearest neighbor interpolation upsampling
module is replaced with the CARAFE upsampling module to reduce the loss of feature
information during upsampling. Finally, the WIoU Loss is used to replace the CIoU
Loss as the new target box loss function to optimize the calculation of the loss function.
The hyperparameters for training were set as follows: 0.01 for lr0, 0.01 for lrf, 0.937 for momentum, 0.0005 for weight_decay, 3.0 for warmup_epochs, 0.8 for warmup_momentum, 0.1 for warmup_bias_lr, 0.05 for box, and 0.5 for cls.
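For clarity, these values correspond to the following entries of a YOLOv5-style hyperparameter file, written here as a Python dictionary; keys that are not mentioned in the text (such as the objectness gain and the augmentation settings) are omitted.

```python
# Training hyperparameters reported above, using YOLOv5's hyperparameter key names.
hyp = {
    "lr0": 0.01,             # initial learning rate
    "lrf": 0.01,             # final learning-rate fraction for the scheduler
    "momentum": 0.937,       # SGD momentum
    "weight_decay": 0.0005,  # optimizer weight decay
    "warmup_epochs": 3.0,    # number of warm-up epochs
    "warmup_momentum": 0.8,  # initial momentum during warm-up
    "warmup_bias_lr": 0.1,   # initial bias learning rate during warm-up
    "box": 0.05,             # box (regression) loss gain
    "cls": 0.5,              # classification loss gain
}
```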
The experimental results on the tomato pest and disease detection dataset show that the YOLOv5n-VCW proposed in this paper clearly outperforms the original YOLOv5n in detection accuracy while using fewer parameters and less computation. With only 1.6 M model parameters and a computational cost of 3.3 GFLOPs, mAP50 and mAP50:95 reach 98.1%
and 84.8%, respectively. Compared with other mainstream object detection algorithms
in terms of detection accuracy, model size, and computational cost, YOLOv5n-VCW has
certain advantages and is more practical and feasible in real-world applications. Next, de-
ployment of the proposed model in mobile or embedded device environments for accurate
identification and detection of tomato pests and diseases is our main goal, and further
exploration of the proposed model for detection and identification of various other plant
pests and diseases will be a part of future plans.

Author Contributions: Conceptualization, S.L., W.H. (Wei Hong) and W.H. (Wenyi Hu); methodol-
ogy, W.H. (Wei Hong); software, W.H. (Wei Hong); validation, M.L. and W.H. (Wenyi Hu); formal
analysis, S.L. and W.H. (Wenyi Hu); investigation, W.H. (Wei Hong); resources, H.W.; data curation,
H.W.; writing—original draft preparation, W.H. (Wei Hong) and M.L.; writing—review and editing,
W.H. (Wei Hong) and S.L.; visualization, W.H. (Wei Hong) and S.L.; supervision, W.H. (Wenyi Hu);
project administration, M.L.; funding acquisition, W.H. (Wenyi Hu). All authors have read and agreed
to the published version of the manuscript.
Funding: Supported by Sichuan Science and Technology Program (2023YFSY0026, 2023YFH0004).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: In this paper, we use publicly available datasets, including Kaustubh B's tomato leaf disease detection dataset (https://www.kaggle.com/datasets/kaustubhb999/tomatoleaf (accessed on 27 July 2023)) and Nouaman Lamrhi's open-source tomato dataset (https://www.kaggle.com/datasets/noulam/tomato (accessed on 27 July 2023)).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Li, J. Research on tomato bacterial pith necrosis. Plant Dis. Pests 2012, 3, 9.
2. Takayama, M.; Ezura, H. How and why does tomato accumulate a large amount of GABA in the fruit? Front. Plant Sci. 2015,
6, 612. [CrossRef] [PubMed]
3. Manríquez-Altamirano, A.; Sierra-Pérez, J.; Muñoz, P.; Gabarrell, X. Analysis of urban agriculture solid waste in the frame of
circular economy: Case study of tomato crop in integrated rooftop greenhouse. Sci. Total Environ. 2020, 734, 139375. [CrossRef]
[PubMed]
4. Rehman, A.; Ulucak, R.; Murshed, M.; Ma, H.; Işık, C. Carbonization and atmospheric pollution in China: The asymmetric impacts
of forests, livestock production, and economic progress on CO2 emissions. J. Environ. Manag. 2021, 294, 113059. [CrossRef]
5. Li, N.; Yu, Q. Tomato super-pangenome highlights the potential use of wild relatives in tomato breeding. Nat. Genet. 2023, 55,
744–745.
6. Wang, X.Y.; Feng, J.; Zang, L.Y.; Yan, Y.L.; Yang, Y.Y.; Zhu, X.P. Natural occurrence of Tomato chlorosis virus in cowpea
(Vigna unguiculata) in China. Plant Dis. 2018, 102, 254. [CrossRef]
7. Arafa, R.A.; Kamel, S.M.; Taher, D.I.; Solberg, S.; Rakha, M.T. Leaf Extracts from Resistant Wild Tomato Can Be Used to Control
Late Blight (Phytophthora infestans) in the Cultivated Tomato. Plants 2022, 11, 1824. [CrossRef]
8. Ferrero, V.; Baeten, L.; Blanco-Sánchez, L.; Planelló, R.; Díaz-Pendón, J.A.; Rodríguez-Echeverría, S.; Haegeman, A.; Peña, E.
Complex patterns in tolerance and resistance to pests and diseases underpin the domestication of tomato. New Phytol. 2020, 226,
254–266. [CrossRef]
9. Han, C.; Gan, C.; Han, S. Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition. arXiv
2022, arXiv:2205.14756.
10. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the
IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
11. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023,
arXiv:2301.10051.
12. Viola, P.; Jones, M.J. Robust real-time face detection. Int. J. Comput. Vis. 2004, 57, 137–154. [CrossRef]
13. Tan, P.S.; Lim, K.M.; Lee, C.P. Human action recognition with sparse autoencoder and histogram of oriented gradients. In
Proceedings of the 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET),
Kota Kinabalu, Malaysia, 26–27 September 2020.
14. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object Detection with Discriminatively Trained Part-Based Models.
IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645. [CrossRef] [PubMed]
15. Mokhtar, U.; Ali, M.A.; Hassanien, A.E.; Hefny, H. Identifying two of tomatoes leaf viruses using support vector machine.
In Information Systems Design and Intelligent Applications, Proceedings of the Second International Conference INDIA 2015, Kalyani,
India, 8–9 January 2015; Springer: Berlin/Heidelberg, Germany, 2015; Volume 1.
16. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
17. Fuentes, A.F.; Yoon, S.; Lee, J.; Park, D.S. High-performance deep neural network-based tomato plant diseases and pests diagnosis
system with refinement filter bank. Front. Plant Sci. 2018, 9, 1162. [CrossRef] [PubMed]
18. Ale, L.; Sheta, A.; Li, L.; Wang, Y.; Zhang, N. Deep learning based plant disease detection for smart agriculture. In Proceedings of
the 2019 IEEE Globecom Workshops (GC Wkshps), Waikoloa, HI, USA, 9–13 December 2019; IEEE: Piscataway, NJ, USA, 2019.
19. Zhao, J.; Qu, J. Healthy and diseased tomatoes detection based on YOLOv2. In Proceedings of the Human Centered Computing:
4th International Conference, HCC 2018, Mérida, Mexico, 5–7 December 2018; Revised Selected Papers 4. Springer International
Publishing: New York City, NY, USA, 2019.
20. Latif, G.; Alghazo, J.; Maheswar, R.; Vijayakumar, V.; Butt, M. Deep learning based intelligence cognitive vision drone for
automatic plant diseases identification and spraying. J. Intell. Fuzzy Syst. 2020, 39, 8103–8114. [CrossRef]
21. Prabhakar, M.; Purushothaman, R.; Awasthi, D.P. Deep learning based assessment of disease severity for early blight in tomato
crop. Multimed. Tools Appl. 2020, 79, 28773–28784. [CrossRef]
22. Pattnaik, G.; Shrivastava, V.K.; Parvathi, K. Transfer learning-based framework for classification of pest in tomato plants. Appl.
Artif. Intell. 2020, 34, 981–993. [CrossRef]
23. Jiang, D.; Li, F.; Yang, Y.; Yu, S. A tomato leaf diseases classification method based on deep learning. In Proceedings of the 2020
Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020.
24. Liu, J.; Wang, X. Tomato diseases and pests detection based on improved Yolo V3 convolutional neural network. Front. Plant Sci.
2020, 11, 898. [CrossRef]
25. Wang, X.; Liu, J.; Liu, G. Diseases detection of occlusion and overlapping tomato leaves based on deep learning. Front. Plant Sci.
2021, 12, 792244. [CrossRef]
26. Huang, X.; Chen, A.; Zhou, G.; Zhang, X.; Wang, J.; Peng, N.; Yan, N.; Jiang, C. Tomato leaf disease detection system based on
FC-SNDPN. Multimed. Tools Appl. 2023, 82, 2121–2144. [CrossRef]
27. Kc, K.; Yin, Z.; Wu, M.; Wu, Z. Depthwise separable convolution architectures for plant disease classification. Comput. Electron.
Agric. 2019, 165, 104948. [CrossRef]
28. Albahli, S.; Nawaz, M. DCNet: DenseNet-77-based CornerNet model for the tomato plant leaf disease detection and classification.
Front. Plant Sci. 2022, 13, 957961. [CrossRef] [PubMed]
29. Zhong, Y.; Teng, Z.; Tong, M. LightMixer: A novel lightweight convolutional neural network for tomato disease detection. Front.
Plant Sci. 2023, 14, 1166296. [CrossRef]
30. Chen, J.; Zhang, D.; Zeb, A.; Nanehkaran, Y.A. Identification of rice plant diseases using lightweight attention networks. Expert
Syst. Appl. 2021, 169, 114514. [CrossRef]
31. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
32. He, Y.; Zhu, C.; Wang, J.; Savvides, M.; Zhang, X. Bounding box regression with uncertainty for accurate object detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019.
33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
34. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
35. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss
for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long
Beach, CA, USA, 15–20 June 2019.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
