
2021 6th International Conference for Convergence in Technology (I2CT)

Pune, India. Apr 02-04, 2021

Smote-DL: A Deep Learning Based Plant Disease Detection Method
Subham Divakar, Persistent Systems, India (shubham.divakar@gmail.com)
Abhishek Bhattacharjee, Cognizant Solutions, India (abhishekb496@gmail.com)
Rojalina Priyadarshini, C.V. Raman Global University, Bhubaneswar, Odisha (priyadarshini.rojalina@gmail.com)

978-1-7281-8876-8/21/$31.00 ©2021 IEEE | DOI: 10.1109/I2CT51068.2021.9417920

Abstract: Over time, computer vision, machine learning and deep learning have been widely used to detect disease in plant leaves. Most work in this area focuses on building accurate models but does not address false predictions, which can be a serious concern: misdiagnosis of a plant leaf could cause large-scale crop destruction. We used a publicly available dataset containing four categories of apple leaf images: Healthy, Scab, Rust and Multiple disease. Upon visualization, this dataset was found to be imbalanced. Our main objective is to reduce false predictions. The main contribution lies in the use of the SMOTE method to balance the dataset and a novel Ensemble algorithm that uses both F1 score and accuracy to compare the classifiers and select the best one. In our experiments, EfficientNetB7 emerged as the best classifier from our list, with both good accuracy and a good F1 score. It also predicts whether a leaf image has multiple diseases or not, which helps to reduce false predictions further.

Keywords- Ensemble learning, Deep Learning, Data Skewness, Classification

I. INTRODUCTION

India is a country of farmers, where more than 40% of the population depends on agriculture as its source of income. With so many people working daily on farms, the experience and knowledge of farming is passed on from generation to generation. Traditionally, people engaged in farming develop expertise in detecting plant disease just by looking at the leaf. However, with the rapid increase in DNA modification and the arrival of hybrid plant varieties, not everyone is able to detect the disease because, just like humans, plants also get infected with new diseases [1]. We are living in the 21st century, which is seeing a tremendous rise in the innovation of new technology in every domain. The agricultural domain has seen a tremendous rise in the use of modern techniques and machinery for farming. When it comes to plant disease detection, modern techniques such as deep learning, lab testing, computer vision and many others have come into the picture, but plant disease detection is still a major concern in this fast-growing world because there exist a very large number of plant diseases that are not properly discovered or treated in time. There are various types of crop disease, such as viral, bacterial or fungal. Not all diseases are visible to the naked eye, so timely detection is very difficult.

Detection of disease using computer science means working on digital images of the diseased plants. Algorithms process these images to find the desired results. The traditional way of finding disease was to use a microscope and perform laboratory tests on the plants [2]. Our proposed work aims at the most fundamental aspect of plant disease detection, which is reducing the number of false predictions that might lead to misdiagnosis, the result of which could be large-scale crop destruction. Our main contributions in this paper are:

1. We have done extensive data pre-processing and visualization to understand the data and then used SMOTE to handle our imbalanced dataset.
2. The unique Ensemble algorithm, which identifies both single and multiple diseased leaves with good accuracy and low false predictions.
3. The use of two parameters at a time, F1 score and accuracy, to compare the models and arrive at a good classifier.

The paper is divided into seven sections. The first section introduces the paper and its contributions, followed by the related work section. In the third section we give a detailed explanation of the problem statement. The fourth section describes the proposed work, followed by the fifth section, which discusses the results. The sixth section concludes the paper and mentions the future scope, and the last section gives the references.

II. RELATED WORK

The literature survey was carried out in the domain of plant disease detection to understand the computer vision techniques used in the process. K. P. Ferentinos developed specialized deep learning models based on convolutional neural network architectures for the identification of plant diseases through simple leaf images of healthy or diseased plants [5]. He used an open database of 87848 images to train the models and obtained a success rate of 99.53% on 17548 previously unseen images with a VGG convolutional neural network model. S. P. Mohanty et al. used a public dataset of 54306 images of diseased and healthy plant leaves and trained a deep convolutional neural network to identify 14 crop species and 26 diseases. They used the AlexNet and GoogLeNet deep learning architectures, with both transfer learning and training from scratch, and reported an accuracy of 99.35% on the test set [6]. Sachin D. Khirade et al. detected plant disease using image processing [7]. They used a number of steps to determine the disease, such as image acquisition, in which they captured the RGB image of the leaf. A. K. Mahlein showed how plant disease can be detected using imaging sensors. She described optical techniques such as RGB imaging, multi- and hyperspectral sensors, thermography and chlorophyll fluorescence, all of which can be used for the identification and quantification of plant disease at early stages [8]. Yi Fang et al. in their work discussed two types of disease


Authorized licensed use limited to: Linkoping University Library. Downloaded on June 20,2021 at 11:43:51 UTC from IEEE Xplore. Restrictions apply.
identification methods used in agriculture, direct and indirect [9]. P. Moghadam et al. used hyperspectral imaging (VNIR and SWIR) and machine learning techniques for the detection of Tomato Spotted Wilt Virus (TSWV) in capsicum plants. They reported an accuracy of 90% with their trained model [10]. Vijai Singh et al. identified the unhealthy part of a plant leaf using image processing and a genetic algorithm. They used leaf samples in RGB format, applied image acquisition and pre-processing to the image samples, then segmented the components using a genetic algorithm and obtained useful segments to classify the leaf disease. They performed the experiment in MATLAB [11]. S. Aasha Nandhini et al. provided a web enabled disease detection system (WEDDS) based on compressed sensing (CS) to detect and classify disease in leaves. They uploaded the CS measurements of the segmented leaf to the cloud, retrieved them at the monitoring site and extracted features from them. They performed the analysis and classification using a support vector machine (SVM), achieving an overall accuracy of 98.5% and a classification accuracy of 98.4% [12].

III. PROBLEM STATEMENT

From the above section it is evident that the use of computer vision to solve the problem of manual disease detection is increasing, which reduces human effort. We also saw a few recent works that claim high accuracy for their machine learning or deep learning models, yet none of them claims to detect multiple diseases in a single leaf with high accuracy and proper identification. Symptoms of a disease vary with the age of the plant, the severity of the disease itself, genetic variation due to hybrid genetic modifications and the light intensity of the images, all of which cause the accuracy to fall very low when testing on actual images from a crop field. With this drop in accuracy, misdiagnosis is prone to occur, which could lead to crop damage on a large scale. To prevent this, the proposed deep learning or machine learning model should predict a single diseased plant leaf with high accuracy and should clearly distinguish the case of a single leaf with multiple diseases, which will increase the overall accuracy and reduce false predictions. One of the main reasons for the lack of research into this is the lack of images of a single leaf with multiple diseases. What is needed, therefore, is a dataset of leaf images with multiple diseases along with a good deep learning model that has low false predictions and high accuracy.

IV. PROPOSED WORK

A. Data Collection
The image dataset [9] was gathered from the Kaggle competition "Plant Pathology 2020-FGVC7" [8], organized in April-May 2020. The dataset contains images of apple leaf diseases in four categories: "healthy", "multiple_diseases", "scab" and "rust". "scab" and "rust" refer to the apple scab and apple rust diseases, and "multiple_diseases" refers to images having multiple diseases. A few samples of the four categories can be seen in Fig 1, 2, 3 and 4. As explained in the problem statement section, the lack of a dataset of leaf images with multiple diseases is a major obstacle; with the new dataset that we have gathered, we can perform multiple disease detection in a single leaf using neural networks, which will now identify such leaves as a separate category and not detect them as any other disease.

Fig. 1. Samples of healthy leaf images
Fig. 2. Samples of multiple diseased leaf images
Fig. 3. Samples of Apple Scab disease leaf images
Fig. 4. Samples of Apple Rust diseased leaf images

B. Data Pre-Processing and Visualization
Data pre-processing and visualization is a very important step before building machine learning and deep learning models. It gives insight into the data and shows how the data is distributed. We therefore examined the data before moving on to model selection and training. The dataset contains 3642 images of apple leaves, divided into Train and Test, which belong to 4 categories: "scab" refers to the Apple Scab leaf disease, "rust" refers to the Apple Rust leaf disease, "multiple_diseases" refers to apple leaf images having multiple diseases and "healthy" refers to healthy apple leaf images.

Since we are dealing with a leaf disease detection dataset, it is interesting to observe the spread of the red, green and blue channel values in the images to identify which channel is dominant in healthy leaves and which is dominant in diseased leaves. We first analyzed some random images from the training set to see the channel distribution and understand the intensity of channel values in diseased and healthy leaf images. Fig 5 shows the RGB values in a healthy leaf image, Fig 6 shows the RGB values of the diseased part of a leaf having only one disease and Fig 7 shows the RGB values of a multiple diseased leaf; these values are of the diseased part as shown in the image. From Table 1, we can see that the blue values are quite low for the healthy leaf and high for the diseased leaf images. This suggests that the blue channel is the key point of our observation and the main key to detecting leaf disease. Yet we need to study the channel distribution in more detail to be sure of our observation.
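The per-channel comparison described above can be reproduced with a few lines of NumPy. The sketch below is illustrative only: the paper does not say which tool was used to read the pixel values, and `mean_rgb`, its optional `region` box and the synthetic patch are hypothetical names and data introduced for the example.

```python
import numpy as np


def mean_rgb(image, region=None):
    """Return the mean (R, G, B) of an H x W x 3 array.

    `region` is an optional (top, left, bottom, right) box, e.g. around
    the diseased part of a leaf; it is an assumption of this sketch.
    """
    pixels = np.asarray(image, dtype=np.float64)
    if pixels.ndim != 3 or pixels.shape[2] != 3:
        raise ValueError("expected an H x W x 3 RGB array")
    if region is not None:
        top, left, bottom, right = region
        pixels = pixels[top:bottom, left:right]
    # Flatten the spatial dimensions and average each channel separately.
    r, g, b = pixels.reshape(-1, 3).mean(axis=0)
    return float(r), float(g), float(b)


# Synthetic 4 x 4 patch filled with the healthy-leaf values from Table 1.
healthy_patch = np.full((4, 4, 3), (60, 137, 33), dtype=np.uint8)
print(mean_rgb(healthy_patch))
```

For a real image, the array would come from any image loader that returns pixels in RGB order, and `region` would be chosen by hand around the lesion.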

Fig. 5. RGB value in a healthy leaf
Fig. 6. RGB value in a leaf with single disease
Fig. 7. RGB value in a leaf with multiple disease

The three RGB values obtained are presented in Table 1.

TABLE I. COMPARISON OF RGB VALUES OF LEAVES

Leaf Category                 Red   Green   Blue
Healthy                        60     137     33
Single Disease in a Leaf      115     111     57
Multiple Disease in a Leaf    141     128    115

Thus we plotted the channel distribution of RGB values.

Fig 8, 9 and 10 show the channel distribution of the red, blue and green values. Fig 8 is the red channel distribution plot; it is slightly right-skewed (positive skew) and the red channel values are roughly normally distributed, concentrated around 100, which can also be observed from the plot. There is also a large variation in average red channel values across the dataset. Fig 9 is the plot for the green channel distribution; compared to the red channel plot in Fig 8, the green channel values have a more uniform distribution but a smaller peak. This plot has a left skew and a larger mode of 140, around which the green channel values are concentrated. This indicates the presence of more green color in the dataset, with a good distribution compared to the red channel, which is expected because the images are of leaves, which are green. Fig 10 is the plot for the blue channel distribution, which is the most uniformly distributed of the three. This channel also shows more variation than the other two channels across the entire dataset. The plot has a slight leftward, or minimal, skew. Fig 11 shows the overall distribution of the RGB channels, from which we can observe that the channel values are concentrated around 105 with a roughly normal distribution. Fig 12 shows the combined plot of RGB channel values, which shows the variation of the channels throughout the images. We also observed green to be the most pronounced color, followed by red and blue. Fig 13 shows the mean value vs. color channel plot, which agrees with the values we read from the channel plots. For red, we found the values to be concentrated at 100, which is the same as observed in Fig 13 for the red channel. The same holds for the green and blue channels. Because of this variation of the blue channel in the images, it is becoming clearer that it is the key point for leaf disease detection.
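The skew statements above can be checked numerically rather than read off the plots. The following is a minimal sketch, assuming the per-image mean channel values are already available as an array of shape (number of images, 3); `skewness` and `channel_skewness` are hypothetical helper names, and the population (Fisher-Pearson) skewness coefficient is one common choice, not necessarily what the plotting library used in the paper reports.

```python
import numpy as np


def skewness(values):
    """Fisher-Pearson skewness: third central moment over std cubed."""
    x = np.asarray(values, dtype=np.float64)
    centered = x - x.mean()
    std = x.std()
    if std == 0:
        return 0.0  # a constant sample has no skew
    return float(np.mean(centered ** 3) / std ** 3)


def channel_skewness(channel_means):
    """Skewness of each colour channel.

    `channel_means` has one row per image and columns (R, G, B).
    A positive value means a right (positive) skew, a negative value
    means a left (negative) skew.
    """
    arr = np.asarray(channel_means, dtype=np.float64)
    names = ("red", "green", "blue")
    return {name: skewness(arr[:, i]) for i, name in enumerate(names)}
```

A positive result for the red column and a negative one for the green column would agree with the right and left skews described for the red and green channel plots.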

Fig. 8. Red channel distribution in dataset
Fig. 9. Blue channel distribution in dataset
Fig. 10. Green channel distribution in dataset
Fig. 11. Overall distribution of channel values in dataset
Fig. 12. RGB channel distribution in dataset

Fig. 13. Mean Value vs RGB Channel

Then we visualized the categories and observed the distribution of data in them along with their relationship with each other. Fig 14 shows a pie chart of the data distribution across the categories: 71.7% of the leaves in the dataset are unhealthy, rust and scab each occupy about one-third of the pie chart and multiple disease leaves are only 5%. This indicates an imbalanced dataset, which will cause the accuracy to fall. Fig 15 is a parallel categories plot, which helps us understand the relationship among the categories; as expected, it is impossible for a healthy leaf to have scab, rust or multiple diseases, and every unhealthy leaf has exactly one of scab, rust or multiple diseases. Fig 16 is a bar graph of the healthy category compared to the entire dataset, which shows that there are more unhealthy leaves than healthy ones: 1305 (72%) unhealthy leaves and 516 (28%) healthy leaves. Fig 17 is a bar graph of the scab category compared to the entire dataset, which shows 592 (33%) leaves with scab disease and 1229 (67%) leaves without scab. Fig 18 is a bar graph of the rust category compared to the entire dataset, which shows 622 (34%) rust infected leaves and 1199 (66%) leaves without rust. The percentages of leaves without scab (67%) and without rust (66%) are almost the same. Fig 19 is a bar graph of the multiple diseases category compared to the entire dataset, which shows only 91 (5%) leaves with multiple diseases and the remaining 95% without.

Fig. 15. Parallel categories plot showing relationship of categories
Fig. 16. Healthy leaf images distribution
Fig. 17. Apple Scab leaf images distribution
Fig. 18. Apple Rust leaf images distribution
Fig. 19. Multiple diseased leaf images distribution

From this extensive visualization we conclude that:
1. The blue channel is the key to plant disease detection in our dataset.
2. Our dataset is imbalanced, as only 5% of it belongs to the multiple disease category.

To resolve this imbalance, which would cause problems later if left unaddressed, we re-sample our dataset using the Synthetic Minority Oversampling Technique (SMOTE) [10].
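The re-sampling step mentioned above can be illustrated with a small NumPy sketch. SMOTE creates each synthetic sample by interpolating between a minority sample and one of its nearest minority neighbours. The paper does not state which SMOTE implementation was used or on which representation of the images (raw pixels or extracted features) it was applied, so everything below is an assumption for illustration: the function name `smote_oversample`, the neighbour count `k=5`, the toy feature matrix and the value `n_new=100`.

```python
import numpy as np


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows from minority-class feature vectors."""
    x = np.asarray(minority, dtype=np.float64)
    n = len(x)
    if n < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise squared Euclidean distances between minority samples.
    diffs = x[:, None, :] - x[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Place the new sample at a random point on the connecting line.
    gap = rng.random((n_new, 1))
    return x[base] + gap * (x[picked] - x[base])


# Toy stand-in for the 91 "multiple_diseases" samples (8 features each).
toy_minority = np.random.default_rng(1).random((91, 8))
synthetic = smote_oversample(toy_minority, n_new=100)
print(synthetic.shape)
```

The pairwise distance matrix keeps the sketch short but grows with the square of the number of minority samples, so it is only practical for a small class such as this one.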
Fig. 14. Pie chart showing distribution of dataset

C. Resampling using SMOTE and dataset splitting
As seen in the above section, our dataset is highly imbalanced, as it contains only 91 images in the "multiple disease" category, which is only 5% of the entire dataset. Although there are other methods, we chose SMOTE because it is a type of data augmentation technique that creates new samples from the existing samples only. It works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along

that line. We applied SMOTE re-sampling to oversample the "multiple disease" category, which is the minority class in our case. We then split the dataset into training and validation sets, which helps us evaluate the model better. We took 80% of the dataset for training and the remaining 20% for validation.

D. Transfer Learning
We used transfer learning to train the pre-trained models on the dataset. The main reason for choosing transfer learning is that many classes seen by the pre-trained models may closely resemble the leaf dataset, so the features extracted by the pre-trained models listed in Table 2 can be reused for training on our dataset, which increases the overall accuracy. This process also reduces the training time of the models.

E. Ensemble Algorithm
Ensemble methods like bagging and boosting have proven to be a good way to build classifiers with increased accuracy. Generally, ensemble methods take only one performance metric into account, accuracy, which is not always the only metric for evaluating models. In real-life classification problems like ours, we often have to deal with imbalanced datasets, and the penalty for every wrong prediction can be both deadly and economically disastrous. In our scenario of plant disease classification, every wrong prediction can lead to the wrong chemical being used, which can ultimately destroy the crops and cause huge economic loss. So the model cannot be judged on accuracy alone, which simply gives the share of correct predictions in the dataset. F1 score is another parameter that is used along with accuracy to test a model's performance. It is the harmonic mean of recall and precision, F1 = 2 x precision x recall / (precision + recall). Since recall and precision are the two parameters on which the F1 score depends, instead of aiming for high values of each separately we only need to focus on getting a good F1 score; a good F1 score implies good recall and precision values. In this paper we therefore propose a novel ensemble method which, unlike other ensemble methods, takes both accuracy and F1 score into account when choosing the best classifier. Algorithm 1 and Algorithm 2 show the detailed steps of the approach.

Algorithm 1: Ensemble Algorithm
Input: $C = \{C_1, \ldots, C_k\}$: set of classifiers; $k$: number of classifiers; $D$: training dataset
Output: Set P1 = $\{(C_i, A_i, F1_i), \ldots\}$, sorted in decreasing order of $A_i$
        Set P2 = $\{(C_i, A_i, F1_i), \ldots\}$, sorted in decreasing order of $F1_i$
1. Initialize $C$, the set of classifiers.
2. For $i = 1$ to $k$, do
3.   Train $C_i$ on $D$.
4.   Evaluate the model.
5.   Calculate and save the accuracy in $A_i$ and the F1 score in $F1_i$.
6.   Store $C_i$, $A_i$ and $F1_i$ as a tuple in sets P1 and P2.
7. Sort P1 in decreasing order of $A_i$ and keep the top $k/2$ elements.
8. Sort P2 in decreasing order of $F1_i$.

Algorithm 2
Input: Set P1 = $\{(C_i, A_i, F1_i), \ldots\}$, sorted in decreasing order of $A_i$
       Set P2 = $\{(C_i, A_i, F1_i), \ldots\}$, sorted in decreasing order of $F1_i$
Output: $C_{best}$ (best classifier)
1. For $i = 1$ to $k$, do
2.   From set P2, pick $C_i$ and $F1_i$. Let these be $C_1$ and $F1_{11}$.
3.   Check whether $C_1$ and $F1_{11}$ are present simultaneously in any tuple of P1. If yes, then $C_1$ is the model we are looking for: set $C_{best} = C_1$ and break the loop.
4.   Else continue from step 2.
5. If no such $C_1$ is found that lies between $F1_{11}$ and $F1_{1(k/2)}$, then $C_1$ (the first model from set P2) is the best model.

V. RESULTS AND DISCUSSIONS

TABLE II. ACCURACY, F1 SCORE AND PRECISION OF CLASSIFIERS

S.No  Pre-Trained Model  Accuracy  F1 Score  Precision
1     MobileNetV2        0.8677    0.8509    0.8827
2     InceptionV3        0.8236    0.8036    0.7848
3     VGG19              0.669     0.6518    0.6376
4     DenseNet           0.9228    0.9124    0.9193
5     VGG16              0.6335    0.6219    0.6146
6     Xception           0.8787    0.8658    0.8615
7     EfficientNetB7     0.9146    0.91903   0.9261
8     NASNet Large       0.9090    0.9078    0.9077

Fig. 20. Accuracy vs F1 Score comparison of classifiers

Table 2 presents a summary of the accuracy, F1 score and precision of the classifiers, which were trained using transfer learning and compared using the ensemble method. Since our proposed work relies on a unique ensemble method, Algorithm 1 and 2, to select the best classifier, one cannot simply say that the model with the best accuracy is the best performing model. According to Algorithm 1 and 2, the best model should have a high F1 score and its accuracy should lie in the top k/2 accuracies, where k (the total number of classifiers) = 8. After applying the algorithm, our ensemble method selected EfficientNetB7 as the best classifier, as it had the best F1 score and its accuracy also lies in the top 4 accuracies. The plot in Fig 20 shows the comparison of accuracy and F1 score for all the models. It can be clearly observed that EfficientNetB7 has the highest F1 score while its accuracy is only slightly less than the top accuracy, which satisfies Algorithm 1 and 2. The proposed algorithm thus yields a classifier with a good F1 score, which indicates good recall and precision and fewer wrong predictions. It also predicts whether a leaf image has multiple diseases or a single disease, which reduces false predictions in real-life scenarios.
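The selection rule of Algorithm 1 and 2 can be sketched in Python using the accuracy and F1 values from Table 2. This is one reading of the pseudocode, not the authors' implementation: `select_best` and `TABLE_2` are hypothetical names, and restricting the search to the top k/2 entries of P2 before falling back to the first entry is an interpretation of step 5 of Algorithm 2.

```python
from typing import List, Tuple

# (classifier name, accuracy, F1 score), values taken from Table 2.
TABLE_2: List[Tuple[str, float, float]] = [
    ("MobileNetV2", 0.8677, 0.8509),
    ("InceptionV3", 0.8236, 0.8036),
    ("VGG19", 0.669, 0.6518),
    ("DenseNet", 0.9228, 0.9124),
    ("VGG16", 0.6335, 0.6219),
    ("Xception", 0.8787, 0.8658),
    ("EfficientNetB7", 0.9146, 0.91903),
    ("NASNet Large", 0.9090, 0.9078),
]


def select_best(results: List[Tuple[str, float, float]]) -> str:
    """Pick the classifier with the best F1 score among the top-k/2 by accuracy."""
    if not results:
        raise ValueError("need at least one evaluated classifier")
    half = len(results) // 2

    # Algorithm 1, steps 7 and 8: build the two sorted sets.
    p1 = sorted(results, key=lambda r: r[1], reverse=True)[:half]
    p2 = sorted(results, key=lambda r: r[2], reverse=True)
    top_by_accuracy = {name for name, _, _ in p1}

    # Algorithm 2: walk P2 in decreasing F1 order (assumed: top k/2 only)
    # and stop at the first classifier that also appears in P1.
    for name, _, _ in p2[:half]:
        if name in top_by_accuracy:
            return name

    # Step 5 fallback: the first model of P2, i.e. the highest F1 score.
    return p2[0][0]


print(select_best(TABLE_2))  # prints "EfficientNetB7"
```

With the Table 2 values, the classifier with the highest F1 score is also among the four most accurate, so the loop stops at its first iteration.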

VI. CONCLUSION

In this paper we have presented a unique Ensemble algorithm, Algorithm 1 and 2, which uses both accuracy and F1 score to choose the best classifier from the list of classifiers [Table 2]. We performed data pre-processing and found that the blue channel was more abundant in the diseased part than in the healthy part. We then analyzed the data and found it to be imbalanced, as the multiple diseased leaf category had only 91 images, which is only 5% of the dataset. We applied the SMOTE resampling method to handle the imbalanced dataset. In our experiments, the proposed algorithm selected EfficientNetB7 as the best classifier, which has the best F1 score and an accuracy among the top k/2 classifiers. Using our proposed algorithm and the new dataset (which has multiple disease leaves as a separate category), our proposed work reduces false predictions in two ways: first, by predicting whether a leaf image has a single disease or multiple diseases, and second, because the selected classifier has low false predictions and good accuracy, as our algorithm chooses a classifier with a good F1 score and good accuracy.

REFERENCES

[1] Singh, Vijai, Namita Sharma, and Shikha Singh. "A review of imaging techniques for plant disease detection." Artificial Intelligence in Agriculture (2020).
[2] Shah, Jitesh P., Harshadkumar B. Prajapati, and Vipul K. Dabhi. "A survey on detection and classification of rice plant diseases." In 2016 IEEE International Conference on Current Trends in Advanced Computing (ICCTAC), pp. 1-8. IEEE, 2016.
[3] Shruthi, U., V. Nagaveni, and B. K. Raghavendra. "A review on machine learning classification techniques for plant disease detection." In 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), pp. 281-284. IEEE, 2019.
[4] Shrivastava, Vimal K., Monoj K. Pradhan, Sonajharia Minz, and Mahesh P. Thakur. "Rice plant disease classification using transfer learning of deep convolution neural network." International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences (2019).
[5] Ferentinos, Konstantinos P. "Deep learning models for plant disease detection and diagnosis." Computers and Electronics in Agriculture 145 (2018): 311-318.
[6] Mohanty, Sharada P., David P. Hughes, and Marcel Salathé. "Using deep learning for image-based plant disease detection." Frontiers in plant science 7 (2016): 1419.
[7] Khirade, Sachin D., and A. B. Patil. "Plant disease detection using image processing." In 2015 International conference on computing communication control and automation, pp. 768-771. IEEE, 2015.
[8] Mahlein, Anne-Katrin. "Plant disease detection by imaging sensors–parallels and specific demands for precision agriculture and plant phenotyping." Plant disease 100, no. 2 (2016): 241-251.
[9] Fang, Yi, and Ramaraja P. Ramasamy. "Current and prospective methods for plant disease detection." Biosensors 5, no. 3 (2015): 537-561.
[10] Moghadam, Peyman, Daniel Ward, Ethan Goan, Srimal Jayawardena, Pavan Sikka, and Emili Hernandez. "Plant disease detection using hyperspectral imaging." In 2017 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1-8. IEEE, 2017.
[11] Singh, Vijai, and A. K. Misra. "Detection of unhealthy region of plant leaves using image processing and genetic algorithm." In 2015 International Conference on Advances in Computer Engineering and Applications, pp. 1028-1032. IEEE, 2015.
[12] Nandhini, S. Aasha, Radha Hemalatha, S. Radha, and K. Indumathi. "Web enabled plant disease detection system for agricultural applications using WMSN." Wireless Personal Communications 102, no. 2 (2018): 725-740.
