
23rd Telecommunications forum TELFOR 2015 Serbia, Belgrade, November 24-26, 2015.

Data mining model for early fruit diseases detection
Milos Ilic, Petar Spalevic, Mladen Veinovic, Abdolkarim Abdala M. Ennaas

Milos N. Ilic (contact author), Faculty of Technical Science Kosovska Mitrovica, University of Pristina, Kneza Miloša 7, 38220 Kosovska Mitrovica, Serbia (phone: 381-69-8702584; e-mail: milos.ilic@pr.ac.rs).
Petar C. Spalevic, Faculty of Technical Science Kosovska Mitrovica, University of Pristina, Kneza Miloša 7, 38220 Kosovska Mitrovica, Serbia (phone: 381-62-273813; e-mail: petar.spalevic@pr.ac.rs).
Mladen Veinovic, Faculty of Informatics and Computing, Singidunum University, Danijelova 32, 11000 Belgrade, Serbia (phone: 381-65-3093227; e-mail: mveinovic@singidunum.ac.rs).
Abdolkarim A. Ennaas, Faculty of Informatics and Computing, Singidunum University, Danijelova 32, 11000 Belgrade, Serbia (phone: +381-11-3094094; e-mail: abdolkarim.ennaas@yahoo.com).
978-1-5090-0055-5/15/$31.00 ©2015 IEEE

Abstract: Automatic methods for early detection of plant diseases could be vital for precise fruit protection. Traditionally, the agricultural expert's knowledge is descriptive and experiment based, so it is difficult to describe it mathematically and subsequently build a decision system that can replace it. Key parameters of a decision-based fruit protection system can differ between classes of plants and diseases. However, such systems are very rare and very complex, and in many cases designed for just one plant class. For effective disease protection of fruit, meteorological data and data about disease appearance are the most important. In this paper the authors propose an idea for a data mining based system for detection of possible fruit infection. For this purpose, different types of data mining techniques were evaluated on unique data sets.

Keywords: Artificial neural networks, Classification, Data mining, Diseases prediction, K-Means.

I. INTRODUCTION

Climatic changes and global warming create difficulties for agricultural production. On the one hand, people have less pure arable land, and on the other, the increasing number of diseases and pests leads to the use of many more chemicals. Chemicals applied in large quantities lead to soil contamination and could endanger human health. This problem can be avoided by applying chemicals at the right time, when the smallest amount of them can suppress the specific pathogen or pest. In order to predict the right time for the suppression of pathogens, many parameters must be processed. For example, a pathogen can infect a fruit species only under certain conditions. Those environmental conditions could be specific weather conditions such as temperature, amount of rainfall, air humidity and leaf wetness. The other group of conditions is the presence of active spores of a given pathogen. Traditionally, farmers decide on a suitable time for protecting fruit from specific pathogens or pests in cooperation with agricultural experts, based on experience and knowledge. Sometimes experience is not enough. For precise prediction of possible fruit infection, and of the correct time for fruit protection, data mining techniques can be used. Data mining is most useful in an exploratory analysis scenario in which there are no predetermined notions about what will constitute an “interesting” outcome. It is a cooperative effort of humans and computers. The best results are achieved by balancing the knowledge of human experts with the search capabilities of computers [1]. Data mining has two primary goals, prediction and description. Here we are interested in prediction. Prediction involves using variables or fields in the data set to predict unknown or future values of other variables of interest. Data mining is an emerging research field in agricultural plant protection too. Recent applications of data mining in agriculture use different techniques, such as K-Means, K-Nearest Neighbor (KNN), Artificial Neural Networks (ANN) and Support Vector Machines (SVM). In our research and system implementation we will use new methodologies to predict possible infection of fruit plants. Consider that data are available from some point in the past at which the corresponding pathogen infection was recorded. In every data mining procedure, data collected in the past serves as training data, which is exploited to learn how to classify the possible emergence of diseases. This paper provides an overview of data mining techniques that can be used for prediction in many different fields, and the authors propose a model for prediction of possible fruit infection.

II. RELATED APPLICATIONS

For our research we explored papers in similar areas of expertise. Related papers and research describe two main problems that we must consider. The first problem is the processing of meteorological data obtained from meteorological radar centers. The second problem is the processing of data obtained from devices called spore traps. The final and most important goal, which results from the previous two, is the prediction of possible diseases. For this purpose we outline a number of similar applications from the reviewed literature in which data mining techniques are used for similar problems. One group of applications is the implementation of data mining techniques for processing meteorological data. Some data mining techniques are related to weather conditions and forecasts. For example, the K-Means algorithm is used to forecast pollution in the atmosphere [2], the K-Nearest Neighbor (KNN) algorithm is applied for simulating daily precipitation and other weather variables [3], and different possible changes of the weather scenarios are
analyzed using SVMs [4]. There are many studies that support the applicability of data mining techniques for weather prediction.

The authors in [5] presented a small application of the CART decision tree algorithm for weather prediction. The data were registered over Hong Kong and recorded between 2002 and 2005. The data used for creating the dataset include the parameters year, month, average pressure, relative humidity, cloud quantity, precipitation and average temperature. WEKA, an open source data mining software package, is used for the implementation of the CART decision tree algorithm. The decision tree, the results and statistical information about the data are used to generate the decision model for weather prediction. The way the data about past events is stored is highlighted. The data must be transformed according to the decision tree algorithm so that WEKA can use them efficiently for weather prediction.

In [6] the authors predict hourly rainfall for any geographical region in a time-efficient way. The chance of rain is determined first. Then, only if there is any chance of rainfall, the hourly rainfall prediction is performed. Although different methodologies have been introduced to predict hourly rainfall, most of them have performance limitations because of the wide range of variation in the data and the limited amount of data. CART and C4.5 are used to provide outcomes, which may reveal hidden and important patterns with transparent reasons. About 18 variables from a weather station were used. For validation, 10-fold cross-validation is performed. CART gives slightly better performance than C4.5. Because the chance of rain is considered first, only a small number of instances is left for the rainfall prediction, which makes it hard to predict. There are several applications of data mining techniques in the field of agriculture. Data mining techniques are often used to study soil characteristics. As an example, the K-Means approach is used for classifying soils in combination with GPS-based technologies [6].

The K-Means approach is used to classify soils and plants [7]. The authors in [8] use SVMs to classify crops. Apples are also checked, using different approaches, before they are sent to the market. In [9] the authors use a K-Means approach to analyze color images of fruits as they run on conveyor belts. In [10] the authors apply a supervised biclustering technique to a dataset of wine fermentations with the aim of selecting and discovering the features that are responsible for problematic fermentations, and also exploit the selected features for predicting the quality of new fermentations.

Taste sensors are used to obtain data from the fermentation process, which are then classified using ANNs [11]. SVMs are used for milk classification in [12]. Here sensors are used to smell the milk. Automatic methods for early detection of plant diseases are vital for precision crop protection. The authors in [13] create a procedure for the early detection and differentiation of sugar beet diseases based on Support Vector Machines and spectral vegetation indices. The discrimination between healthy sugar beet leaves and diseased leaves resulted in classification accuracies up to 97%. The multiple classification between healthy leaves and leaves with symptoms of the three diseases achieved accuracy higher than 86%. All these papers and studies show that data mining techniques provide good support for different agricultural problems. The mentioned and implemented techniques are mainly divided into two groups, classification and clustering techniques. Classification techniques are designed to classify unknown samples using information provided by a set of classified samples. This set is usually referred to as a training set, as it is used to train the classification technique how to perform its classification. Clustering is based on grouping data according to their character, or to any property that they have in common.

III. PROPOSED MODEL

A. Data collecting

As stated above, weather data and data about active spores are important for predicting possible fruit and plant disease infection. Here we propose a model for infection prediction based on data mining techniques. The two key devices that collect data are automatic meteorological (weather) stations and spore traps. An automatic weather station is an automated version of the traditional weather station, used either to save human labor or to enable measurements from remote areas. An automatic weather station measures various meteorological parameters such as wind speed, wind direction, temperature, humidity, ambient pressure, atmospheric pressure and rainfall. In special cases some additional sensors can measure soil temperature and the humidity of fruit tree leaves.

The data collected in the weather station can be monitored on site or transferred to a remote server. Of all these parameters, the most important for our research are air temperature five centimeters above the ground, air temperature one meter above the ground, rainfall, air humidity and the humidity of the fruit tree leaves. One meteorological station will cover from eighty to a hundred hectares. This is because weather conditions over a larger area can vary considerably, especially rainfall. Automatic stations will measure and send data every fifteen minutes. For manual stations, measurements will be carried out every eight hours.

An active pathogenic spore in appropriate weather conditions can lead to fruit tree infection. In agricultural science, infection means infection of the leaves and fruit. Classically, detection and enumeration of airborne spores has been achieved by microscopic examination of surfaces on which spores were impacted. Spore traps have traditionally been used to determine the spore density of airborne plant pathogens.

For our system, microscopic data about registered active spores of specific pathogens are important. Spore traps will be positioned on the same plantation as the meteorological stations, but many more than one trap are needed. Examination of the spore traps will be carried out three times a day. The block scheme of the system is presented in Fig. 1 below.
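The sampling rates described in this subsection differ: automatic stations report every fifteen minutes, while manual stations are read every eight hours and spore traps are examined three times a day. The paper does not state how these streams are combined into one instance, so the following Python sketch is only an assumption. It averages the fifteen-minute readings of one parameter inside each eight-hour window, so that each window could be paired with one spore trap examination. The window boundaries, function names and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Tuple

WINDOW_HOURS = 8  # assumption: matches the three daily spore trap examinations


def window_start(ts: datetime) -> datetime:
    """Return the start (00:00, 08:00 or 16:00) of the window containing ts."""
    hour = (ts.hour // WINDOW_HOURS) * WINDOW_HOURS
    return ts.replace(hour=hour, minute=0, second=0, microsecond=0)


def aggregate(readings: List[Tuple[datetime, float]]) -> Dict[datetime, float]:
    """Average the readings of one parameter within each 8-hour window."""
    buckets: Dict[datetime, List[float]] = defaultdict(list)
    for ts, value in readings:
        buckets[window_start(ts)].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


# Illustrative data only: one day of air temperature, one reading per 15 minutes.
day = datetime(2015, 5, 1)
readings = [(day + timedelta(minutes=15 * i), 10.0 + (i // 32)) for i in range(96)]
per_window = aggregate(readings)

for start, value in per_window.items():
    print(start.isoformat(), value)
```

Averaging suits parameters such as temperature and humidity; for rainfall a sum over the window would probably be the more natural choice.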

Fig. 1. Block scheme of the proposed system

The block called weather/meteorological stations represents the network of meteorological stations. These stations are positioned at locations determined in advance near the plantation. Parameters from both types of stations will be saved in the system database. Reports from automatic stations will be sent through the network automatically. Data from manual stations must be entered by a person. Data collection can be automated if more automatic stations with spore traps on them are available. Data from the electronic microscope about identified spores (which pathogen the spore belongs to, and whether the spore is of the active or passive type) will be saved in the same database as the meteorological data. More spore traps provide better coverage. One spore trap will be placed near each specific plant. In that way examination can be significantly faster, because spores that are not specific to the diseases attacking that plant are eliminated at the start. Spore traps will be examined by a phytopathologist. At this stage, the data collection phase is over. By this we mean the data obtained at the current time. For successful prediction, data from the previous ten or more years will be entered in the database. Information about identified infections in the field was obtained from farmers and from official bodies in charge of monitoring the occurrence of disease.

B. Data preprocessing

Data preprocessing must be applied to both the meteorological data and the data from the electronic microscope. Data preprocessing is an often neglected but important step in the data mining process. Data gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations (Temperature: 29 °C, Snow: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems could produce misleading results. Thus, the representation and quality of the data can be crucial for the analysis process. Steps like data preparation and filtering can take a considerable amount of processing time. Data preprocessing includes cleaning, normalization, transformation, feature extraction, selection, etc. The product of data preprocessing is the final training set. For our research, weather parameters other than those mentioned above are not necessary. Such parameters coming from digital meteorological stations will be removed in this step.

C. Data processing

In this phase the final data set organized in the preprocessing phase will be processed by different data mining techniques. Data about identified spores will be classified according to the group of diseases that they can cause. Besides this, the spores of each pathogen are classified into two types. The first type represents active spores, meaning that such spores could cause diseases in appropriate weather conditions. The second type represents passive spores. Passive spores cannot cause infection even when the weather conditions are fulfilled. Data collected from the meteorological stations are more numerous. For each parameter the measurement is repeated many times during the day. This produces a big set of meteorological data for model training. Depending on the category of meteorological data, different classification or clustering techniques can be used. The selection of a technique must be based on the results that the specific technique provides. From all of the collected meteorological data and data about spores we create a training file. This file starts with the attribute definitions for all data. That means we must define whether each attribute is numeric or nominal. The last attribute represents the class attribute. Attributes must have a defined order of appearance in each instance. In that order we input values for a specific disease or group of diseases. Instances first contain the meteorological parameters, then the characteristics of the spores, and at the end the class value. Based on those instances we will create the training model. The number of instances is very important. Increasing the number of instances provides a better training model. To the created file we apply different classification techniques and measure the percentage of correctly classified instances, standard deviation, mean absolute error, relative absolute error, and root relative squared error. The technique that gives the best results will be chosen for creation of the training model. In order to evaluate different data mining techniques we create a training dataset. The given dataset contains data from a one-year timespan. This dataset is created for two specific diseases. Over this data we apply multiple data mining techniques. Classification results are presented in Table 1. For this evaluation we used WEKA classes in a C# form application.

TABLE 1: CLASSIFICATION RESULTS AND STATISTICS
Classifier output                  J48      SMO      ZeroR
Correctly classified instances     90.32%   85.25%   73.77%
Incorrectly classified instances   9.68%    14.75%   26.23%
Mean absolute error                0.0985   0.2659   0.2783
Standard deviation                 0.2812   0.3414   0.3681
Relative absolute error            25.39%   95.55%   100%
Root relative squared error        64.06%   92.75%   100%

For all classifiers we perform 10-fold cross-validation, without percentage split. This means that we use the whole dataset for training. Evaluation helps us to choose the best technique for model creation. After the evaluation, the training model that will be used for prediction is built. For that purpose we use the classifier that shows the best results. This is very important because if we build a better training model
prediction will be more accurate. From the table above we can see that the J48 classifier provides the highest percentage of correctly classified instances. Because of that, in this case we choose to build the training model with J48.

D. Prediction

For disease infection, all mentioned parameters must have values in a specific range. Based on classified data from the training model, we can predict whether appropriate conditions for possible infection are satisfied. From this moment we use the created trained model for prediction. A new set of instances provided by the meteorological stations and the laboratory will be used to create the test dataset. The test dataset has the same form as the training dataset. For the class values we can input two types of values. The first is a question mark, which indicates that we do not know which class value is appropriate for that set of data. In the second case we can predict the class value intuitively and input our prediction. After entering all the current values of the instances in the dataset, prediction can start.

In the prediction phase we use our saved training model. For the prediction we must select the classifier used for model creation. The prediction output will be a class value for each instance, regardless of whether we entered a question mark or a predicted value. In addition, the degree of probability is also essential. The degree of probability will vary depending on the values of the current parameters. If the current values of all parameters are similar to the corresponding values from the training set, the model will predict that the probability of infection is in a similar range.

For the predicted value, if there is just one disease in the training and testing set, and the conditions for that disease are fulfilled, we will get an answer. If we have more than one possible disease, the output will be the one with the highest probability.

We use mathematical regression methods for verification of the results obtained by the WEKA classification algorithms. Mathematical regression and statistics calculations will be obtained with MATLAB. After the predictions are confirmed, and if some infection is possible, farmers will be notified. The notification will contain information about the disease present and a proposal of measures. For notifications like this we must create a database of farmers and provide a message or mail service.

IV. CONCLUSION

Agricultural production is a complex job. One of the most unpredictable and complex tasks is chemical protection. The key factor for successful chemical protection of fruit from diseases and pests is nothing but the right moment. This means that the selection of chemicals is not as complex as determining the timing of protection. Early fruit disease detection has many benefits. From the farmers' point of view, methods like the suggested one provide important information for successful chemical protection. The second benefit for farmers is economic. They can save money if they reduce the number of chemical treatments. This is because the model indicates when the conditions for disease development are not fulfilled. In that case chemical treatment is not needed. From the perspective of healthy food, a reduced number of chemical treatments is very important. With appropriate detection and prediction we could get successful chemical protection and healthy food.

The authors' future research will be the implementation and integration of the proposed model in real conditions. The authors plan to create a much bigger training dataset, with instances from about twenty years in the past. New parameters from the real world will be collected from automatic meteorological stations mounted on plantations. The obtained results will show the degree of accuracy of the practical application of the proposed model.

ACKNOWLEDGMENT

This paper is a result of collaboration with the Ministry of Education, Science and Technological Development of the Republic of Serbia within the projects TR 32023 and TR 35026.

The authors are grateful to the professors from the College of agriculture and food technology in Prokuplje for their collaboration in acquiring basic knowledge on certain plant pathogens.

REFERENCES

[1] M. Kantardzic, “Data mining concepts models methods and algorithms”, John Wiley & Sons, Inc., Hoboken, New Jersey, pp. 5-21, 2011.
[2] H. Jorquera, R. Perez, A. Cipriano, G. Acuna, “Short term forecasting of air pollution episodes”, In: Zannetti P (eds) Environmental modeling 4. WIT Press, UK, 2001.
[3] B. Rajagopalan, U. Lall, “A K-Nearest Neighbor simulator for daily precipitation and other weather variables”, Water Resources Research, vol. 35, no. 10, pp. 3089-3101, 1999.
[4] S. Tripathi, V. Srinivas, R. Nanjundiah, “Downscaling of precipitation for climate change scenarios: a Support Vector Machine approach”, Journal of Hydrology, vol. 330, Issues 3-4, pp. 621-640, 2006.
[5] E. Georgiana, “A Decision Tree for Weather Prediction”, Buletinul, vol. LXI, no. 1, pp. 77-82, 2009.
[6] K. Verheyen, D. Adriaens, M. Hermy, S. Deckers, “High resolution continuous soil classification using morphological soil profile descriptions”, Geoderma 101, pp. 31-48, 2001.
[7] G. Meyer, J. Neto, D. Jones, T. Hindman, “Intensified fuzzy clusters for classifying plant, soil, and residue regions of interest from color images”, Computers and Electronics in Agriculture, vol. 42, pp. 161-180, 2004.
[8] G. Camps-Valls, L. Gomez-Chova, J. Calpe-Maravilla, E. Soria-Olivas, J. Martin-Guerrero, J. Moreno, “Support Vector Machines for crop classification using hyperspectral data”, Lecture Notes in Computer Science 2652, pp. 134-141, 2003.
[9] V. Leemans, M. Destain, “A real time grading method of apples based on features extracted from defects”, Journal of Food Engineering, vol. 61, pp. 83-89, 2004.
[10] A. Mucherino, A. Urtubia, “Feature Selection for Datasets of Wine Fermentations”, I3M Conference Proceedings, 10th International Conference on Modeling and Applied Simulation (MAS11), Rome, Italy, 2011.
[11] Jr. Riul, H. Sousa, R. Malmegrim, D. Santos, A. Carvalho, F. Fonseca, O. Oliveira, L. Mattoso, “Wine classification by taste sensors made from ultra-thin films and using Neural Networks”, Sensors and Actuators B 98, pp. 77-82, 2004.
[12] K. Brudzewski, S. Osowski, T. Markiewicz, “Classification of milk by means of an electronic nose and SVM neural network”, Sensors and Actuators B 98, pp. 291-298, 2004.
[13] T. Rumpf, A. Mahlein, U. Steiner, E. Oerke, H. Dehne, L. Plumer, “Early detection and classification of plant diseases with Support Vector Machine based on hyperspectral reflectance”, Computers and Electronics in Agriculture, vol. 71, no. 1, pp. 91-99, 2010.
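As an illustration of the file layout described in Sections III.C and III.D (attribute definitions first, each attribute numeric or nominal, the class attribute last, and a question mark for an unknown class value), the following Python sketch writes such a file. The layout matches WEKA's ARFF text format, which is presumably what the authors used, although the paper does not name it. The attribute names, the relation name and all values are hypothetical and serve only to show the structure.

```python
from typing import List, Optional, Sequence, Tuple, Union

# An attribute is either the string "numeric" or a list of nominal labels.
AttrType = Union[str, List[str]]


def to_arff(relation: str,
            attributes: Sequence[Tuple[str, AttrType]],
            rows: Sequence[Sequence[Optional[object]]]) -> str:
    """Render instances as ARFF text; None is written as '?' (unknown value)."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if kind == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("each instance must follow the declared attribute order")
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes: meteorological parameters, then spore data, then class.
ATTRIBUTES = [
    ("air_temp_5cm", "numeric"),
    ("air_temp_1m", "numeric"),
    ("rainfall_mm", "numeric"),
    ("air_humidity", "numeric"),
    ("leaf_humidity", "numeric"),
    ("spore_type", ["active", "passive"]),
    ("infection", ["yes", "no"]),  # class attribute comes last
]

# Training instances carry a known class value (made-up numbers).
training_text = to_arff("fruit_infection", ATTRIBUTES, [
    (14.2, 15.0, 3.5, 88, 71, "active", "yes"),
    (9.8, 10.4, 0.0, 55, 12, "passive", "no"),
])

# Instances to be predicted leave the class unknown, written as '?'.
test_text = to_arff("fruit_infection", ATTRIBUTES, [
    (13.1, 14.0, 2.0, 90, 65, "active", None),
])

print(training_text)
print(test_text)
```

Keeping the attribute list in one place means the training and test files are guaranteed to share the same form, which the prediction step requires.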
