


Corn Leaf Disease Classification and Detection using Deep Convolutional Neural Network

Research Project
Final Report
September 19, 2021

Haque Md Shafiul* (University of Potsdam, Department of Computer Science)
Prof. Dr. Niels Landwehr† (University of Potsdam, Department of Computer Science)
Dr. Julian Adolphs† (ATB Potsdam, Junior research group Data Science in Agriculture)

Abstract
Crop diseases are one of the major challenges in the agriculture sector. Traditional
methods for detecting crop diseases are neither time- nor cost-efficient, and their
misdiagnosis rate is high. Misdiagnosis in turn reduces crop yield and increases
economic loss. In the era of machine learning, deep neural networks have shown
prominent results and allowed researchers to improve the accuracy of object detection
and recognition systems. The main objective of this study is to find suitable
deep-learning-based architectures to classify and localize corn leaf diseases. We
propose a convolutional neural network-based system with data augmentation combined
with transfer learning and hyperparameter tuning for the disease classification and
localization tasks. An open-source dataset known as the Corn Leaf Infection Dataset
is used for the experiments, in which we classify healthy leaves against infected
leaves and determine which parts of a leaf are infected. For the classification task
we use a VGG block model, keeping the number of parameters low for faster convergence
while still achieving high accuracy. For the disease localization task, we preprocess
the data and improve generalization by increasing the variability of the data and by
tuning hyperparameters. We use the state-of-the-art deep learning model YOLOv4 to
identify the infected areas in corn leaves. We conduct different end-to-end
experiments and demonstrate the efficacy of the proposed system with the experimental
results. Considering the data we have, the results show that our proposed models
perform quite well: in classification, our model achieves 99.25% accuracy, and the
detection model achieves 55.30% mean average precision (mAP).
Key words: corn leaf; maize leaf; plant disease; CNN; classification; object detection; VGG; YOLOv4
* author
† supervisor

Research module WS2020/21



1 INTRODUCTION

A nation's economic strength depends on various factors; one of the vital factors
is the success of the agriculture sector. Crops can be infected with different kinds
of diseases, so maintaining the health of any crop is crucial in order to ensure
its quality and production (Brahimi et al., 2017). As crops are one of the main
sources of food, studying crop diseases becomes imperative for research and
scientific work. Recent studies show how easily plants are affected by diseases
(Frommer et al., 2018).
Corn is among the several staple foods, and its production is one of the largest
in the world after rice and wheat. Corn is well known as a cereal; apart from that,
it is also used for various products such as corn starch, corn syrup, animal feed,
and ethanol. Corn diseases occur in different parts of the plant, such as the leaf,
stem, or rhizomes, or even in the whole plant (Sun and Wei, 2020). Corn leaves
exhibit a variety of symptoms, and the most common corn leaf infections are gray
leaf spot, common rust maize disease, northern leaf blight, and brown spot
(Sibiya and Sumbwanyambe, 2019).
In general, detecting corn leaf infections can be confusing for farmers compared
with professional plant pathologists or experts (Miller et al., 2009). A disease
diagnosis system could therefore be a great help for farmers to identify plant
diseases from the plant's appearance and visual symptoms, rather than hiring an
expert at high expense. Legacy approaches, for instance chemical analysis of the
infected area (Alvarez, 2004), imaging, and spectroscopy, have been used to remedy
such problems but could not achieve promising results in terms of accuracy, time,
and cost-efficiency. Although traditional methods such as Fisher discriminant
analysis (FDA) and support vector machines (SVM) achieve good results on
image-based crop disease recognition, they require various preprocessing and
feature-extraction steps beforehand (Song et al., 2007; Wang et al., 2009).
Besides, these traditional methods perform well when only small samples of data
are available; for a large number of samples they cannot achieve high accuracy.
With the advent of the era of machine learning and artificial intelligence, deep
convolutional neural networks have made immense advances in the field of computer
vision (LeCun et al., 2015). After the successful run of AlexNet (Krizhevsky et al.,
2012) in the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC)
(Russakovsky et al., 2015), researchers proposed other, deeper networks (Simonyan
and Zisserman, 2014) and achieved state-of-the-art performance on ImageNet,
Pascal VOC, and other benchmark datasets. These results showed that wider and
deeper convolutional neural networks need to be studied in order to achieve higher
accuracy in image classification and object detection.
The main purpose of this study is to identify infected corn leaves as opposed to
healthy leaves and to detect the area of infection. Corn diseases evidently exhibit
heterogeneity in shape, form, color, etc. (Chen et al., 2016), which makes
classification and disease localization more difficult. At this juncture, it is
important to define the difference between the concepts of image classification and
object detection or localization. Classification refers to a predictive modeling
problem where a class label is predicted for a given example of input data;
detection, on the other hand, is responsible for detecting interesting objects
(e.g. a disease area) in an image with respect to the background (Brownlee, 2010).
As shown in Fig. 1, our classification system is able to distinguish between healthy
and infected corn leaves, and the detection system identifies the area of infection.

Fig. 1: Example of healthy (a) and infected (b) corn leaves for classification, and disease localization (c) to identify the area of disease in the corn leaf.
In our research, we propose two improved deep convolutional neural network models
that achieve high classification accuracy while maintaining a lower number of
parameters, and a high mean Average Precision (mAP) on the corn leaf disease
dataset. Our proposed classification model, the VGG block model, works as a binary
classifier and achieves 99.25% accuracy. Our YOLOv4 detection model achieves 52.11%
mAP, so there is still room for improvement on the detection side.

2 RELATED WORK

Over the last few years, computer vision technology has developed at a great rate
and has been used in different industries, including the agriculture sector. In
current agricultural production, it is quite important to observe crop health
conditions during the growth process in order to improve the efficiency of crop
yield. In recent times convolutional neural networks have improved thanks to
extensive research, and one major aspect of this technology is that it can extract
salient features, which is a crucial factor for image recognition. A variety of
convolutional neural network architectures exist; a few popular ones are AlexNet,
VGG16, VGG19, GoogLeNet, ResNet, DenseNet, etc. (Krizhevsky et al., 2012; Simonyan
and Zisserman, 2014; Szegedy et al., 2015; Huang et al., 2017), and these
architectures have also been applied to plant disease detection.
Corn leaf disease classification based on imaging and machine learning techniques
was studied by Alehegn (2020). That work focuses on digital image analysis in which
texture, color, and morphology features are extracted to classify maize leaf
diseases and healthy leaves. In total 800 images were collected for four classes
consisting of healthy leaf, common rust, leaf blight, and leaf spot. Each class
contains 200 images, and a total of 22 features were collected from each image;
K-nearest neighbors (KNN) and Artificial Neural Network (ANN) classifiers were then
applied to this data. The study concludes that ANN outperforms KNN and achieves
94.4% accuracy.
Mohanty et al. (2016) used a public dataset containing 54,306 images of healthy
and diseased plant leaves and trained a deep convolutional neural network to
recognize 14 crop species and 26 diseases. Their trained model achieves an accuracy
of 99.35% on a held-out test set. They used the AlexNet and GoogLeNet architectures
and applied both transfer learning and training from scratch.
The Deep Forest algorithm, an ensemble-based decision-tree approach, was used for
maize leaf disease classification by Arora et al. (2020). Their proposed method
outperforms traditional machine learning algorithms such as SVM, Random Forest,
Logistic Regression, and the KNN classifier. The dataset consisted of 4 categories
with 100 images per category. After optimizing the hyperparameters, the deep forest
model with 1000 trees, 4 forests, and 3 grains achieves an accuracy of 96.26%.
An improved YOLOv3 algorithm was proposed to detect tomato diseases and insect
pests by Liu and Wang (2020). The network is improved by using multi-scale feature
detection based on an image pyramid, object bounding box dimension clustering, and
multi-scale training. Their dataset consists of 15,000 images with 12 different
diseases and 146,912 bounding boxes. The proposed YOLOv3 model outperforms SSD and
Faster R-CNN with an mAP of 92.39%. The improved YOLOv3 network is robust in
detecting objects of different sizes and images of different resolutions in complex
environments.
A robust deep-learning-based detector for real-time tomato plant disease and pest
recognition was proposed by Fuentes et al. (2017). In their study, they consider
three main detectors: the Faster Region-based Convolutional Neural Network
(Faster R-CNN), the Region-based Fully Convolutional Network (R-FCN), and the
Single Shot Multibox Detector (SSD). With these meta-architectures, they combined
deep feature extractors such as VGG net and Residual Network (ResNet). The dataset
contains various task complexities, such as illumination conditions, the size of
objects, and background variations including the surrounding area of the plant.
The study showed that plainer networks can perform better than deeper ones; R-FCN
with ResNet-50 as the feature extractor achieves an mAP of 85.98% and outperforms
all other networks.
Although the works mentioned above show excellent performance in classifying and
detecting different leaf diseases, some challenges still need to be overcome.
Different types of diseases and pests at different locations in the image, pattern
variation, noisy images, backgrounds similar to the objects, and varying image
resolutions can make the task more challenging. We therefore propose a solution to
overcome such problems in corn leaf disease classification and detection.

3 METHODOLOGY

Convolutional neural networks are widely used in natural language processing, image
classification, segmentation, and other computer vision fields for their outstanding
performance and salient feature extraction from data.

In order to successfully train a classification model, we need to preprocess the
data and create an input pipeline. The following block diagram gives a bird's-eye
view of the whole process of our experiments.

Fig. 2: Process of training the classification model

However, for the detection task we also need to prepare our data, but in a different
manner. Images need to be resized and the data needs to be annotated in a specific
format before we train our detector model. After training is complete, we evaluate
the model on the test set. The whole process is summarized in the following block
diagram.

Fig. 3: Process of training the detection model

3.1 Dataset

An appropriate dataset is required for all phases of the classification and
detection tasks, from training to performance evaluation. In our study, we use an
open-source dataset taken from the Kaggle repository (Acharya, 2020), known as the
"Corn Leaf Infection Dataset". It contains 4225 images in total, captured by a
digital device at a very high resolution of 3456 x 4608 pixels. Images are labeled
as "healthy" or "infected": there are 2000 images in the healthy category and 2225
in the infected category, so our data is slightly imbalanced. An annotation file is
also given, in which infected images are annotated with bounding box coordinates;
there are 11,597 bounding box annotations in total for the 2225 infected images.


However, when we checked the annotated data we found some anomalies: bounding box
shapes differ for similar disease spots, and the number of bounding boxes differs
between two similar images. For now we overlook this limitation, as re-annotating
all images would be time-consuming.

Fig. 4: Example of bounding box anomalies: (a) actual image, (b) image with small rotation

3.1.1 Data preprocessing

Data preprocessing is an integral step in machine learning, as the quality of the
data and the useful information that can be derived from it directly affect the
ability of the model to learn. As we have two different objectives, classification
and detection, and we want to train two different models, we treat these as two
separate tasks.
To improve feature extraction and increase consistency, the images in the dataset
required preprocessing before being fed to the deep convolutional neural network
classifier. One of the most significant preprocessing operations is the
normalization of image size and format. Our input images have a very high
resolution, so in order to train the model we resize them to 224 × 224 pixels. As
mentioned earlier, we have an imbalanced dataset, which is why we apply the
class-weights technique during training. This is achieved by giving different
weights to the majority and the minority class: misclassifications of the minority
class are penalized by a higher class weight, while the weight of the majority
class is reduced. We calculated the class weights on the training data and saved
the weights in a file for later use.
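As a concrete sketch of this step, the snippet below computes balanced class weights with scikit-learn and stores them for later use. The label counts follow the full dataset sizes described above (the actual weights were computed on the training split), and the file name class_weights.json is a hypothetical choice.

```python
import json
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 0 = healthy, 1 = infected; counts follow the dataset description
labels = np.array([0] * 2000 + [1] * 2225)

# "balanced" weights each class by n_samples / (n_classes * class_count),
# so the minority class (healthy) receives the larger weight
weights = compute_class_weight(class_weight="balanced",
                               classes=np.unique(labels), y=labels)
class_weights = {int(c): float(w) for c, w in zip(np.unique(labels), weights)}

# save the weights in a file for later use,
# e.g. model.fit(..., class_weight=class_weights)
with open("class_weights.json", "w") as f:
    json.dump(class_weights, f)
```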
In order to fit the data into the detection model, we also need to resize the
images, in this case to 854 × 854 pixels. We collect the bounding box data from a
CSV file; a bounding box is a rectangle that locates an area of infection in an
image.
As we resized our images, we need to adjust the bounding box values: the given
coordinates are based on the original image dimensions, so after resizing we have
to recalculate them. A bounding box is simply a rectangle specified by four
coordinates (see Table 1); here those are xmin, ymin, xmax, ymax, where
(xmin, ymin) is the top-left point and (xmax, ymax) the bottom-right point of the
rectangle.

Name xmin ymin xmax ymax label


2023.jpg 1088 115 2248 850 infected
2024.jpg 132 550 447 721 infected

Table 1: Example of bounding box data

To recalculate, we normalize the coordinates and then multiply the values by the
newly resized dimensions. Our chosen model, YOLOv4, requires a different bounding
box format: a box is represented by its center coordinates (x, y) together with its
width (w) and height (h). For every image we then save a text file where each line
represents one box in the format: class label, center coordinates (x, y), width (w),
height (h). If there are multiple infected areas in a leaf, we write multiple lines
in the same format. In the detection task we only have to detect infected areas,
which is why there is a single label, 'infected'.
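Below is a minimal sketch of this conversion. The original image size is taken from the dataset description (3456 × 4608 pixels; which of the two is the width is an assumption here), and the example uses the first row of Table 1. Since YOLO coordinates are normalized, the result is independent of the 854 × 854 resize target.

```python
def to_yolo_line(xmin, ymin, xmax, ymax, img_w=3456, img_h=4608, class_id=0):
    """Convert a (xmin, ymin, xmax, ymax) box in pixel coordinates to the
    YOLO format: class, normalized center (x, y), width (w), height (h)."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}"

# one text file per image; each line is one 'infected' box (class 0),
# using the first row of Table 1 as an example
with open("2023.txt", "w") as f:
    f.write(to_yolo_line(1088, 115, 2248, 850) + "\n")
```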

3.1.2 Data augmentation

Data augmentation is a strategy that enables users to significantly increase the
diversity of data available for training models without actually collecting new
data. By applying data augmentation techniques we can increase the variability of
our data and achieve higher accuracy. Photometric distortion and geometric
distortion are two commonly used families of augmentation: photometric distortion
includes changing the brightness, contrast, hue, saturation, and noise of an image,
while geometric distortion includes random scaling, cropping, flipping, rotation,
translation, etc. Random Erase and CutOut randomly select a rectangular region in
an image and fill it with a random value, a complementary value, or zero. MixUp
multiplies and superimposes two images with different coefficient ratios and
adjusts the label according to these superimposition ratios. CutMix covers a
rectangular region of one image with a crop from another image and adjusts the
label according to the size of the mixed area. A particular technique called Mosaic
data augmentation combines 4 training images into one in certain ratios and is
applied at run time; thus 4 different contexts are mixed, whereas CutMix mixes only
2 input images. This allows the detection of objects outside their normal context,
and the model learns to identify objects at a smaller scale than usual
(Solawetz, 2020). A minimal sketch of the two basic distortion families is given
below.
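The sketch below illustrates photometric and geometric distortion with tf.image. The parameter ranges are illustrative assumptions, and the run-time techniques above (CutOut, MixUp, CutMix, Mosaic) would be implemented separately.

```python
import tensorflow as tf

def augment(image):
    """Apply simple photometric and geometric distortions to a float image
    with values in [0, 1]."""
    # photometric distortion: brightness, contrast, hue, saturation
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_hue(image, max_delta=0.05)
    image = tf.image.random_saturation(image, lower=0.8, upper=1.2)
    # geometric distortion: random flips (rotation, translation, and shear
    # can be added similarly, e.g. with Keras preprocessing layers)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    return tf.clip_by_value(image, 0.0, 1.0)
```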

3.2 VGG Model

The design of neural network architectures has grown progressively more abstract,
with researchers' interests shifting from individual neurons to whole layers, and
now to blocks, repeating patterns of layers. The idea of using blocks first emerged
from the Visual Geometry Group (VGG) at Oxford University, in their eponymously
named VGG network (Simonyan and Zisserman, 2014). It is easy to implement these
repeated structures in code with any modern deep learning framework by using loops
and subroutines. The VGG network can be partitioned into two parts: the first
consisting mostly of convolutional and pooling layers, and the second consisting of
fully connected layers. In this study we use fewer convolution layers, filters, and
fully connected layers, so the number of parameters of the network is greatly
reduced. One traditional VGG block consists of a sequence of convolutional layers
and an activation function for non-linearity, followed by a max-pooling layer for
spatial downsampling.
In this study, our modified VGG model has 8 convolution layers, and every
convolution layer is followed by a Rectified Linear Unit (ReLU) and a Batch
Normalization layer; together these form a block. Every two convolution blocks are
followed by one max-pooling layer. We then use global average pooling followed by a
fully connected layer and a dropout layer. Adding batch normalization after every
convolutional layer addresses the problem of the data distribution of the middle
layers changing during the training process. A sketch of this architecture is given
below.
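The following Keras sketch mirrors this description: eight 3 × 3 convolutions, each followed by ReLU and batch normalization, a max-pooling layer after every second block, then global average pooling, a dense layer, dropout, and a softmax output. The filter widths, dense layer size, and dropout rate are illustrative assumptions, not the exact VGG-S configuration.

```python
from tensorflow.keras import layers, models

def conv_block(x, filters):
    """One block: 3x3 convolution, then ReLU, then batch normalization."""
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.Activation("relu")(x)
    x = layers.BatchNormalization()(x)
    return x

inputs = layers.Input(shape=(224, 224, 3))
x = inputs
for i, filters in enumerate((32, 32, 64, 64, 128, 128, 256, 256)):  # assumed widths
    x = conv_block(x, filters)
    if i % 2 == 1:  # max pooling after every two conv blocks
        x = layers.MaxPooling2D()(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(64, activation="relu")(x)  # assumed size
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(2, activation="softmax")(x)  # healthy vs. infected

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```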

Fig. 5: Block diagram of basic CNN architecture

The main function of the convolution layer is feature extraction. Convolution is a
mathematical operation that merges two sets of information; it is applied to the
input data using a convolution filter to produce a feature map. We used 3x3
convolutions throughout the network, which helps reduce the number of parameters.
Stacking two convolutions instead of one also means applying two ReLU operations,
and this additional non-linearity gives the model more expressive power.
Rectified Linear Units (ReLUs) are used as the activation function. The unsaturated
nonlinear characteristics of this function effectively alleviate the vanishing-
gradient problem produced by the traditional Sigmoid and Tanh activation functions
(Xu et al., 2015), and ReLU speeds up network training. Each ReLU is followed by a
Batch Normalization (BN) layer. Global data normalization transforms all the data
to zero mean and unit variance; however, as the data flows into the deeper network,
the distribution of the inputs to internal layers changes, and the network can lose
learning capacity and accuracy. The Batch Normalization layer helps reduce this
internal covariate shift through a normalization step that fixes the means and
variances of layer inputs, where the estimates of mean and variance are computed
not over the entire training set but after each mini-batch. Batch normalization
reduces the dependence of gradients on the scale of the parameters or their initial
values, which has a beneficial effect on the gradient flow through the network and
also enables the use of a higher learning rate without the risk of divergence
(Gu et al., 2018).
After a convolution operation we usually perform pooling to reduce the
dimensionality. This reduces the number of parameters, which both shortens training
time and combats overfitting. Pooling layers downsample each feature map
independently, reducing the height and width while keeping the depth intact. In our
model we use max pooling, which simply takes the maximum value in the pooling
window. Contrary to the convolution operation, pooling has no parameters; it
retains the main features and also improves the generalization ability of the
network.
A fully connected layer is simply a feed-forward neural network. Its input is the
output of the final pooling layer, which is flattened and then fed in. Fully
connected layers increase the number of parameters, so their units are likely to
co-adapt excessively, which causes overfitting; dropout is therefore used as a
regularization technique to prevent overfitting in the network. It sets a node's
weight to zero with a given probability during training, so that the neural network
cannot rely on any particular activations in a given forward pass. The final layer
uses the SoftMax activation function to obtain the probabilities of the input
belonging to each class.

3.3 YOLOv4

You Only Look Once (YOLO), a regression-based algorithm for object detection, was
proposed by Redmon et al. (2016). YOLO formulates object detection as a single
regression problem: a single neural network predicts bounding box coordinates and
class probabilities, which makes detection immensely fast and accurate. By design,
the network sees the entire image during training and test time, so it implicitly
encodes contextual information about classes as well as their appearance. YOLOv4
was proposed by Bochkovskiy et al. (2020) and is a major improvement over its
predecessor YOLOv3 (Redmon and Farhadi, 2018). The new Backbone architecture and
the modifications in the Neck improved the mAP by 10% and the frames per second
(FPS) by 12% (Bochkovskiy et al., 2020). Training a neural network with a large
mini-batch size requires a large number of GPUs, and such a system is not suitable
for real-time applications; YOLOv4 resolves this problem by creating a CNN that
operates in real time on a conventional GPU. Two categories of methods, Bag of
Freebies (BoF) and Bag of Specials (BoS), are used to improve the object detector's
accuracy.


3.3.1 Bag of freebies (BoF)

The set of techniques or methods that change the training strategy or training cost
to improve model accuracy is termed the Bag of Freebies. These are measures one can
take during offline training to improve overall accuracy without increasing
inference cost. For object detection, the bag of freebies covers data augmentation,
handling semantic distribution bias in datasets, and the objective function of
bounding box regression. Photometric and geometric distortions are traditional data
augmentation strategies; in addition, to tackle object occlusion, techniques such
as Random Erase, CutOut, Hide-and-Seek, Grid Mask, MixUp, and CutMix belong to this
category. Mosaic data augmentation, DropOut, DropConnect, DropBlock regularization,
random training shapes, class label smoothing, CIoU loss, CmBN, self-adversarial
training, a cosine annealing scheduler, eliminating grid sensitivity, using
multiple anchors for a single ground truth, and optimal hyperparameters are some of
the other techniques used by the bag of freebies.

3.3.2 Bag of specials (BoS)

Bag of Specials contains different plugins and post-processing modules that
increase the inference cost only by a small amount but can drastically improve the
accuracy of the object detector. Such modules usually involve introducing attention
mechanisms (Squeeze-and-Excitation and the Spatial Attention Module), enlarging the
receptive field of the model, and strengthening the feature-integration capability,
among others. Mish activation, Cross-Stage Partial connections (CSP), Multi-input
Weighted Residual Connections (MiWRC), DIoU-NMS, the SPP block, the SAM block, and
the PAN path-aggregation block are some other strategies included in the bag of
specials.

3.3.3 Basic architecture

There are two types of object detection models: one-stage and two-stage. A
one-stage model detects objects without the need for a preliminary step. A
two-stage detector, in contrast, uses a preliminary stage in which regions of
interest are detected and then classified to see whether an object is present in
those areas. The advantage of a one-stage detector is speed: it can make
predictions quickly, allowing real-time use. The main components of a one-stage
detector are the input layer, the backbone, the neck, and the head (dense
prediction). The block diagram of a one-stage detector is given below.


Fig. 6: Architecture of one-stage detector (Bochkovskiy et al., 2020).

Backbone
The backbone is the feature-extraction architecture; models such as ResNet,
DenseNet, VGG, etc. are used as feature extractors. They are pre-trained on image
classification datasets, like ImageNet, and then fine-tuned on the detection
dataset. A backbone for object detection requires a larger input network size, for
better detection of small objects, and more layers, for a larger receptive field.
As its feature extractor, YOLOv4 uses CSPDarknet53 (Bochkovskiy et al., 2020).

Fig. 7: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-
maps as input (Xu and Wu, 2020)

A Dense Block (fig. 7) in the YOLOv4 backbone contains multiple convolution layers,
with each layer Hi composed of batch normalization, ReLU, and a convolution. As
figure 7 shows, Hi takes the output of all previous layers as well as the original
input as its input, i.e. x0, x1, ..., xi-1, instead of using only the output of the
last layer. In the figure, each layer outputs four feature maps, so at each layer
the number of feature maps grows by four, the growth rate (Huang et al., 2017).
A DenseNet can be formed by composing several Dense Blocks with a transition layer,
composed of convolution and pooling, in between.


Fig. 8: A DenseNet consists of three Dense Blocks (Xu and Wu, 2020)

YOLOv4 utilizes Cross-Stage Partial connections (CSP) with Darknet-53. CSPNet
separates the input feature maps of a Dense Block into two parts: the first part
bypasses the Dense Block and becomes part of the input to the next transition
layer, while the second part goes through the Dense Block (Wang et al., 2020). This
design reduces computational complexity because only one of the two parts passes
through the Dense Block. A conceptual sketch follows the figure below.

Fig. 9: Illustrations of (a) DenseNet and (b) Cross Stage Partial DenseNet (CSPDenseNet)
(Wang et al., 2020)
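To make the split concrete, here is a conceptual Keras sketch under simplified assumptions (a plain dense block with growth rate 4 and an even channel split); it illustrates the CSP idea and is not the actual CSPDarknet53 code.

```python
from tensorflow.keras import layers

def dense_layer(x, growth_rate=4):
    """One dense-block layer: BN, ReLU, 3x3 conv, then concatenation so the
    layer sees all preceding feature maps."""
    y = layers.BatchNormalization()(x)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(growth_rate, 3, padding="same")(y)
    return layers.Concatenate()([x, y])

def csp_dense_block(x, num_layers=5):
    c = x.shape[-1] // 2
    part1, part2 = x[..., :c], x[..., c:]        # split the channels in two
    for _ in range(num_layers):
        part2 = dense_layer(part2)               # only part2 goes through the block
    return layers.Concatenate()([part1, part2])  # part1 bypasses it
```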

Activation function
Mish is a smooth, non-monotonic activation function, defined as

$$ f(x) = x \tanh(\delta(x)), \qquad \delta(x) = \ln(1 + e^x), $$

where δ(x) is the softplus activation function.

Fig. 10: Mish activation function
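A direct transcription of this definition, e.g. for plotting or experimentation (for large inputs a numerically stable softplus would be preferred):

```python
import numpy as np

def mish(x):
    softplus = np.log1p(np.exp(x))  # delta(x) = ln(1 + e^x)
    return x * np.tanh(softplus)
```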

The Mish function is used in YOLOv4 because of its low cost and its various
properties: its smooth and non-monotonic nature and its being unbounded above and
bounded below improve its performance compared with other popular functions like
ReLU (Rectified Linear Unit) and Swish. Being unbounded above (positive values can
grow to any height) avoids saturation due to capping. The slight allowance for
negative values, in theory, allows better gradient flow than a hard zero bound as
in ReLU. The non-monotonic property helps preserve small negative values,
stabilizing the gradient flow of the network. Finally, and likely most importantly,
current thinking is that smooth activation functions allow better information
propagation deeper into the neural network, and thus better accuracy and
generalization.

Neck
These are extra layers that go between the backbone and the head. They are used to
extract feature maps from different stages of the backbone. The YOLOv4 neck
consists of a modified Spatial Attention Module (SAM), a Path Aggregation Network
(PAN), and a Spatial Pyramid Pooling (SPP) layer.
Attention mechanisms are widely used in deep learning; they refer to focusing on a
specific part of the input. The original implementation of SAM performs
average-pooling and max-pooling operations along the channel axis and concatenates
the results; a convolution layer with sigmoid activation is then applied to
generate an attention map, which is applied to the original feature map. The
modified SAM in YOLOv4, on the other hand, applies neither max-pooling nor
average-pooling; instead, the feature map goes through a convolutional layer with
sigmoid activation, and the result multiplies the original feature map, as sketched
after the figure below.

Fig. 11: Illustration of (a) SAM and (b) modified SAM used in YOLOv4 (Bochkovskiy et al.,
2020)
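A minimal Keras sketch of the modified SAM: a convolution with sigmoid activation produces a point-wise attention map that multiplies the original feature map, with no pooling involved. The 1 × 1 kernel size is an assumption.

```python
from tensorflow.keras import layers

def modified_sam(feature_map):
    """Point-wise attention: conv + sigmoid, then element-wise multiply."""
    attention = layers.Conv2D(feature_map.shape[-1], kernel_size=1,
                              activation="sigmoid")(feature_map)
    return layers.Multiply()([feature_map, attention])
```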

PANet was originally designed for instance segmentation and is chosen in YOLOv4
because of its ability to preserve spatial information accurately, which helps in
the proper localization of pixels for mask formation. Bottom-up path augmentation,
adaptive feature pooling, and fully connected fusion are some of the properties
that make PANet so accurate. PANet conventionally adds the neighboring layers
together for mask prediction using adaptive feature pooling. This approach is
slightly twisted when PANet is employed in YOLOv4: instead of adding the
neighboring layers, a concatenation operation is applied to them, which improves
the accuracy of the predictions.


Fig. 12: Illustration of (a) PAN and (b) modified PAN used in YOLOv4 (Bochkovskiy et al., 2020)

The Spatial Pyramid Pooling layer generates fixed-size features whatever the size
of the incoming feature maps. To do so it uses pooling layers, such as max pooling,
to generate different representations of the feature maps; the feature maps from
the different kernel sizes are then concatenated together as the output. A sketch
follows the figure below.

Fig. 13: Spatial Pyramid Pooling (Huang et al., 2020)
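In the YOLO variants, the SPP block is implemented with parallel max-pooling layers of different kernel sizes but stride 1 and "same" padding, so every branch keeps the spatial size and the outputs can be concatenated along the channel axis. A minimal sketch, with the kernel sizes (5, 9, 13) taken from the common YOLO configuration:

```python
from tensorflow.keras import layers

def spp_block(x, pool_sizes=(5, 9, 13)):
    """Parallel max pooling at several kernel sizes, stride 1, concatenated
    with the input along the channel axis."""
    pooled = [layers.MaxPooling2D(pool_size=k, strides=1, padding="same")(x)
              for k in pool_sizes]
    return layers.Concatenate()([x] + pooled)
```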

Head
The goal of YOLO is to divide the image into a grid of cells and, for each cell, to
predict the probability of an object being present using anchor boxes. The role of
the head in a one-stage detector is to perform dense prediction. The dense
prediction is the final output: a vector containing the coordinates of the
predicted bounding box (center, height, width), the confidence score of the
prediction, and the label.

Complete IoU Loss

YOLOv4 uses the CIoU loss from the bag of freebies, which concerns how the
predicted bounding box overlaps with the ground truth bounding box. Before getting
into the details of the CIoU loss, we first review the IoU and DIoU losses.
Bounding box regression uses losses based on the overlap area between the predicted
bounding box and the ground truth bounding box, referred to as Intersection over
Union (IoU). The IoU loss fails when the predicted and ground truth boxes do not
overlap. The equations for IoU and the IoU loss are shown below.

$$ \mathrm{IoU} = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}, \qquad L_{\mathrm{IoU}} = 1 - \mathrm{IoU} $$

Fig. 14: Intersection over Union
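A minimal sketch of IoU and the IoU loss for two axis-aligned boxes in (xmin, ymin, xmax, ymax) format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# toy example: intersection 4, union 28, so IoU = 1/7 and L_IoU ~= 0.857
iou_loss = 1.0 - iou((0, 0, 4, 4), (2, 2, 6, 6))
```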

The Distance IoU (DIoU) penalty is the normalized distance between the center
points of the predicted and ground truth boxes. This distance term helps with
faster convergence and more accurate regression.

Fig. 15: Penalty term of distance IoU loss

The DIoU loss equation is given below:

$$ L_{\mathrm{DIoU}} = 1 - \mathrm{IoU} + \frac{d^2}{c^2} $$


In the above equation, d represents the Euclidean distance between the center
points of the predicted and ground truth boxes, and c is the diagonal length of the
smallest enclosing box covering the two boxes. The DIoU loss is invariant to the
scale of the regression problem and also provides moving directions for predicted
bounding boxes in non-overlapping cases.
Now that we know about IoU and DIoU, we can discuss the CIoU loss. CIoU bounding
box regression uses three geometric factors:

• the overlap area between the predicted box and the ground truth bounding box
(IoU loss),

• the distance between the central points of the predicted box and the ground truth
bounding box (DIoU loss),

• the consistency of the aspect ratios of the predicted box and the ground truth
box.

As the CIoU loss uses the complete set of geometric factors, it converges faster
and improves average precision (AP) and average recall (AR) for object detection.

$$ L_{\mathrm{CIoU}} = S(B, B^{gt}) + D(B, B^{gt}) + V(B, B^{gt}) $$

In the above equation, S is the overlap term, S = 1 - IoU; D is the normalized
distance term between the center points of the predicted and ground truth boxes;
and V measures the consistency of the aspect ratios. S, D, and V are all invariant
to the regression scale and are normalized to values between 0 and 1. The CIoU loss
moves the predicted bounding box towards the ground truth bounding box in
non-overlapping cases. A sketch of the DIoU and CIoU computations is given below.
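The sketch below, reusing iou() from the previous snippet, computes the DIoU and CIoU losses following the formulation in the original CIoU paper (the aspect-ratio term v and its weight alpha are as defined there); it is an illustrative sketch, not the darknet implementation.

```python
import math

def diou_ciou_loss(box_a, box_b):
    """DIoU and CIoU losses for two (xmin, ymin, xmax, ymax) boxes."""
    i = iou(box_a, box_b)
    # d^2: squared distance between the two box centers
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    d2 = (cax - cbx) ** 2 + (cay - cby) ** 2
    # c^2: squared diagonal of the smallest box enclosing both boxes
    ex1, ey1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex2, ey2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diou_loss = 1.0 - i + d2 / c2
    # v: aspect-ratio consistency; alpha: its trade-off weight
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / ((1.0 - i) + v + 1e-9)
    ciou_loss = diou_loss + alpha * v
    return diou_loss, ciou_loss
```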

4 EXPERIMENTAL SETUP

4.1 Data distribution

Data distribution refers to the amount of data used for the training, validation,
and test sets; the split ratio can vary from problem to problem. Training data is
used for learning the patterns in the data, validation data is used to understand
the model's behavior and generalization to unseen data, and test data gives an idea
of how the model would perform in a real-world scenario.

Class      Training Images   Validation Images   Test Images
healthy    1700              100                 200
infected   1925              100                 200

Table 2: Data distribution for the classification task


Class      Training Images   Validation Images   Test Images
infected   1780              222                 223

After data augmentation on the training data:
infected   8900              222                 223

Table 3: Data distribution for the detection task

4.2 Training

Data collection, processing, and saving the class weights are prerequisite steps
for training; creating an input pipeline is the next important step. Building the
input pipeline involves loading the data from a data source, parsing it, applying
augmentation and shuffling to the batched data, and finally feeding the data to
training.
In this study, for the classification task, we load and parse the dataset and apply
shuffling and several data augmentation techniques, including scaling, translation,
rotation, and shear, on random images with random values at run time. We use the
Adam optimizer and adjusted the learning rate, batch size, and other
hyperparameters continuously until we achieved high accuracy. Categorical
cross-entropy is used as the loss function and accuracy as the metric. Two
different versions of the VGG block model were trained for 100 epochs, and the
results were recorded for further analysis. A sketch of this input pipeline is
given below.
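A minimal tf.data sketch of this pipeline, assuming lists of file paths and one-hot labels have been prepared and reusing augment() from Section 3.1.2, the class weights from Section 3.1.1, and the model from Section 3.2. The file paths and batch size are hypothetical.

```python
import tensorflow as tf

image_paths = ["data/healthy/0001.jpg"]  # hypothetical paths
labels = [[1.0, 0.0]]                    # one-hot: healthy

def parse(path, label):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0  # resize + normalize
    return image, label

train_ds = (tf.data.Dataset.from_tensor_slices((image_paths, labels))
            .shuffle(buffer_size=1000)
            .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
            .map(lambda x, y: (augment(x), y))  # run-time augmentation
            .batch(32)
            .prefetch(tf.data.AUTOTUNE))

# class_weights as computed in Section 3.1.1; model from the Section 3.2 sketch
model.fit(train_ds, epochs=100, class_weight=class_weights)
```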
For the detection task, we prepare the dataset in the YOLO format and create the
necessary configuration files for training. As we have a smaller training dataset,
we applied various types of augmentation beforehand in order to enlarge the
training set, and then trained our model on these data. We applied random rotation,
translation, and horizontal flipping and saved the resulting images; since the
bounding box values change under these transformations, we recalculated them and
saved the new values in text files. We also used some augmentation at run time,
such as mosaic data augmentation. We adjusted the image size, batch size, learning
rate, IoU thresholds, and other parameters, and trained the YOLOv4 model using
COCO-dataset pre-trained weights.

4.3 Evaluation Metrics

After training the model, we evaluate it on unseen data and use different metrics
to compare one model with another. The classification model uses accuracy,
precision, recall, and F1 score as evaluation metrics; we also keep track of the
number of parameters of each model. In binary classification, a sample correctly
classified as positive is a True Positive (TP); one correctly classified as
negative is a True Negative (TN); one that is actually positive but predicted as
negative is a False Negative (FN); and one that is actually negative but predicted
as positive is a False Positive (FP). Accuracy is the number of correctly
classified data instances over the total number of instances. Precision is the
ratio of correctly predicted positive observations to the total predicted positive
observations. Recall is the ratio of correctly predicted positive observations to
all actual positive observations.

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

$$ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} $$
The detection model uses mean Average Precision (mAP) as its evaluation metric. The
Intersection over Union (IoU) is used to determine whether a predicted bounding box
(BB) is a TP, FP, or FN. Traditionally, a prediction is defined as a TP if the IoU
is greater than 0.5. If the IoU is less than 0.5, or the bounding box is a
duplicate, the prediction is an FP. If there is no detection at all, or the IoU is
greater than 0.5 but the classification is wrong, the prediction is counted as an
FN. Each bounding box has a confidence score, usually given by the softmax layer,
which is used to rank the outputs. Average Precision (AP) is calculated as the area
under the precision-recall curve, and the mAP for object detection is the average
of the AP over all classes.

$$ mAP = \frac{1}{N} \sum_{k=1}^{N} AP_k $$
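For reference, the sketch below computes AP as the area under the precision-recall curve using the common all-point interpolation; with a single class ('infected'), mAP equals this AP. The input PR points are hypothetical.

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the PR curve with all-point interpolation."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # make precision monotonically decreasing from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum precision * recall-step over points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# hypothetical PR points, ordered by descending detection confidence
ap = average_precision(np.array([0.2, 0.4, 0.6]), np.array([1.0, 0.8, 0.6]))
mAP = ap  # single class, so the mean over classes equals the AP
```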

5 RESULTS AND DISCUSSION

5.1 Classification

The main objective of this study is to develop a CNN-based model that can classify
healthy and infected corn leaves with high accuracy while maintaining a smaller
number of parameters, and a model that can localize the infected area in a corn
leaf with high mAP. We conducted several experiments for both the classification
and detection tasks and evaluated the trained models on the test set. For the
evaluation, the test-set images are resized and normalized in the same way as the
training images. Our experimental results are shown below:

Model Name   Accuracy (%)   Total Parameters
VGG-L        99.92          4,694,082
VGG-S        99.25          125,810

Table 4: Classification model performance overview. VGG-L and VGG-S are names we assigned, meaning VGG-Large and VGG-Small respectively.

From Table 4 we can see that although the VGG-L model achieves higher accuracy than
VGG-S, VGG-S has far fewer parameters and thus meets our objectives.


Fig. 16: Confusion matrix (a) for the test data, model accuracy (b), and model loss (c)

Fig. 17: Classification with the VGG-S model

So, based on our goal, we selected VGG-S as our model, with a trade-off between
accuracy and the total number of parameters. We used 200 images for the evaluation,
100 per category, and figure 16 shows the confusion matrix together with the
training accuracy and loss curves. Predictions of the classification model are
illustrated in figure 17, where at the bottom of each image the predicted label is
written, followed by the prediction probability and the actual label.

5.2 Object Detection

We ran experiments with different network sizes and IoU thresholds on the YOLOv4
model.

Model Name   Network Size   IoU Threshold   mAP (%)
YOLOv4       416 × 416      0.45            42.11
YOLOv4       512 × 512      0.45            49.68
YOLOv4       608 × 608      0.45            50.14
YOLOv4       416 × 416      0.50            36.43
YOLOv4       512 × 512      0.50            46.55
YOLOv4       608 × 608      0.50            45.63

After data augmentation on the training data:

YOLOv4       512 × 512      0.45            52.11
YOLOv4       512 × 512      0.50            47.39
YOLOv4       608 × 608      0.45            51.74
YOLOv4       608 × 608      0.50            47.82

Table 5: YOLOv4 model performance comparison for different IoU thresholds and network sizes


We adjusted different hyperparameters such as batch size, subdivisions, and the
adaptive learning rate, and recorded the results. As Table 5 shows, we started
training our model with a small network size and then increased it, used different
IoU thresholds, and achieved good accuracy. Using data augmentation to increase the
size of the training set, we achieved better accuracy than the earlier models with
the same network size and threshold. The YOLOv4 (512 × 512) model is thus our
best-performing model, with an mAP of 52.11% at a 0.45 IoU threshold. Using our
best model, we now examine some predicted images.

Fig. 18: YOLOv4 model training loss and mAP.

Figure 18 shows the training iteration on the x-axis and the training loss and mAP
along the y-axis. We ran 9000 iterations; as the figure shows, the loss converges
and the mAP also increases slowly.


Fig. 19: Disease detection with the YOLOv4 (512 × 512) model: (a), (c) model predictions; (b), (d) ground truth

We ran predictions on the test set using the YOLOv4 (512 × 512) model, as
illustrated in Figure 19. Figures 19a and 19c show model predictions, while Figures
19b and 19d show the corresponding ground truth annotations. In Figures 19a and 19c
the model detected some infected areas correctly but with low confidence, and it
could not detect some of the infected areas properly. So, although our proposed
model performs the detection, the confidence scores are not very high, and in some
complex cases the model fails to detect properly; the model could therefore be
enhanced further.

5.3 Limitations and Future Work

• The dataset we used in this study has some annotation anomalies, which are a
hindrance to achieving a high mAP with good confidence scores. In the future we
need to label the data from scratch and remove all such anomalies.


• Training a deep convolutional neural network is a time-consuming task. For this
study we could not use a proper GPU, and as a result we could not conduct many
experiments.

• If one split each large image into smaller patches, for example three or four
images, the model's detection capability could improve. The problem with this
approach is recalculating the bounding boxes correctly for the split images; we
would need to annotate the split images manually to do this correctly, which is why
we have not done it in this study.

6 CONCLUSIONS

Corn is one of the prime food crops in the world; nevertheless, there are many
kinds of corn diseases, and it is difficult to diagnose them by examining the
symptoms in the laboratory or in a traditional manner. In our study, we identify
healthy leaves and infected leaves, as well as the infected areas, with improved
deep convolutional neural networks. Despite the many complexities in the dataset,
such as illumination conditions and background variations, our proposed system
achieves a high identification accuracy of 99.25% with a small number of
parameters, which makes it very suitable for deployment on handy devices such as
mobile phones. The experiments show that it is possible to obtain good accuracy and
faster convergence by incorporating a reasonable number of Batch Normalization,
ReLU, and dropout operations, along with some adjustments to the model parameters.
We found some annotation anomalies in the dataset during the experiments. Our
proposed detection model improves its mAP with data augmentation techniques, but
the model could still be enhanced.

REFERENCES
Acharya, R. (2020). Corn leaf infection dataset. https://www.kaggle.com/qramkrishna/corn-leaf-infection-dataset. Accessed on 04/15/2021.

Alehegn, E. (2020). Maize Leaf Diseases Recognition and Classification Based on Imaging and Machine
Learning Techniques. PhD thesis.

Alvarez, A. M. (2004). Integrated approaches for detection of plant pathogenic bacteria and diagnosis of
bacterial diseases. Annu. Rev. Phytopathol., 42:339–366.

Arora, J., Agrawal, U., et al. (2020). Classification of maize leaf diseases from healthy leaves using deep
forest. Journal of Artificial Intelligence and Systems, 2(1):14–26.

Bochkovskiy, A., Wang, C.-Y., and Liao, H.-Y. M. (2020). Yolov4: Optimal speed and accuracy of object
detection. arXiv preprint arXiv:2004.10934.

Brahimi, M., Boukhalfa, K., and Moussaoui, A. (2017). Deep learning for tomato diseases: classification
and symptoms visualization. Applied Artificial Intelligence, 31(4):299–315.

Brownlee, J. (2010). 4 types of classification tasks in machine learning. Machine Learning Mastery.

Chen, G., Meng, Y., Lu, J., and Wang, D. (2016). Research on color and shape recognition of maize diseases
based on hsv and otsu method. In International Conference on Computer and Computing Technologies
in Agriculture, pages 298–309. Springer.


Frommer, D., Veres, S., and Radócz, L. (2018). Susceptibility of stem infected sweet corn hybrids to com-
mon smut disease. Acta Agraria Debreceniensis, (74):55–57.

Fuentes, A., Yoon, S., Kim, S. C., and Park, D. S. (2017). A robust deep-learning-based detector for real-
time tomato plant diseases and pests recognition. Sensors, 17(9):2022.

Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Wang, G., Cai, J., et al. (2018).
Recent advances in convolutional neural networks. Pattern Recognition, 77:354–377.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional
networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
4700–4708.

Huang, Z., Wang, J., Fu, X., Yu, T., Guo, Y., and Wang, R. (2020). Dc-spp-yolo: Dense connection and
spatial pyramid pooling based yolo for object detection. Information Sciences, 522:241–258.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional
neural networks. Advances in neural information processing systems, 25:1097–1105.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553):436–444.

Liu, J. and Wang, X. (2020). Tomato diseases and pests detection based on improved yolo v3 convolutional
neural network. Frontiers in plant science, 11:898.

Miller, S. A., Beed, F. D., and Harmon, C. L. (2009). Plant disease diagnostic capabilities and networks.
Annual review of phytopathology, 47:15–38.

Mohanty, S. P., Hughes, D. P., and Salathé, M. (2016). Using deep learning for image-based plant disease
detection. Frontiers in plant science, 7:1419.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object
detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
779–788.

Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A.,
Bernstein, M., et al. (2015). Imagenet large scale visual recognition challenge. International journal of
computer vision, 115(3):211–252.

Sibiya, M. and Sumbwanyambe, M. (2019). A computational procedure for the recognition and classifica-
tion of maize leaf diseases out of healthy leaves using convolutional neural networks. AgriEngineering,
1(1):119–131.

Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recogni-
tion. arXiv preprint arXiv:1409.1556.

Solawetz, J. (2020). Data augmentation in yolov4. https://towardsdatascience.com/data-augmentation-in-yolov4-c16bd22b2617. Accessed on 04/17/2021.

Song, K., Sun, X., and Ji, J. (2007). Corn leaf disease recognition based on support vector machine method.
Transactions of the CSAE, 23(1):155–157.

Sun, X. and Wei, J. (2020). Identification of maize disease based on transfer learning. In Journal of Physics:
Conference Series, volume 1437, page 012080. IOP Publishing.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich,
A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 1–9.


Wang, C.-Y., Liao, H.-Y. M., Wu, Y.-H., Chen, P.-Y., Hsieh, J.-W., and Yeh, I.-H. (2020). Cspnet: A new
backbone that can enhance learning capability of cnn. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition workshops, pages 390–391.

Wang, N., Wang, K., Xie, R., Lai, J., Ming, B., Li, S., et al. (2009). Maize leaf disease identification based on
fisher discrimination analysis. Scientia Agricultura Sinica, 42(11):3836–3842.

Xu, B., Wang, N., Chen, T., and Li, M. (2015). Empirical evaluation of rectified activations in convolutional
network. arXiv preprint arXiv:1505.00853.

Xu, D. and Wu, Y. (2020). Improved yolo-v3 with densenet for multi-scale remote sensing target detection.
Sensors, 20(15):4276.
