CEREAL CROP YIELD PREDICTION USING MACHINE LEARNING TECHNIQUES IN ETHIOPIA
By
A Thesis Submitted in Partial Fulfilment of the Requirements for the Award of the
Degree of Master of Science in Software Engineering
to
FEBRUARY 2022
Approval Page
This is to certify that the thesis prepared by Mr. Belete Asmare Assefa entitled “Cereal
Crop Yield Prediction Using Machine Learning Techniques in Ethiopia” and
submitted as partial fulfillment for the award of the Degree of Master of Science in
Software Engineering complies with the regulations of the university and meets the
accepted standards with respect to originality, content, and quality.
Declaration
I hereby declare that this thesis entitled “Cereal Crop Yield Prediction Using Machine
Learning Techniques in Ethiopia” was prepared by me, with the guidance of my advisor.
The work contained herein is my own except where explicitly stated otherwise in the text,
and this work has not been submitted, in whole or in part, for any other degree or
professional qualification.
Witnessed by:
………………………………………… ………………………………………
Abstract
Agriculture plays an important role in the economy of Ethiopia. About 85% of the
population lives in rural areas, and their economy depends largely on crop production,
which is influenced by climate, chemicals, and government policies. Prediction of crop
yields is important for planning and for making various policy decisions. Many countries
whose economies depend on agriculture, Ethiopia among them, use conventional data
collection techniques for crop monitoring and yield prediction. The purpose of this study is
to develop a cereal crop yield prediction model based on agricultural input data. To this
end, appropriate machine learning techniques were identified and applied to predict cereal
crop yields from agricultural inputs. To build the prediction model, the collected raw data
were pre-processed and merged on their common features. The final merged dataset
contains the year, the item (crop), the yield value, the average rainfall, the pesticides, and
the average temperature. The data initially had a size of 20 kilobytes and 12 features. After
feature importance analysis, the dataset was reduced to 7 features and 8 kilobytes for
developing the prediction model. For the experimental analysis, we used Gradient Boosting
Regression, Random Forest Regression, Support Vector Machine, and Decision Tree
Regression, and compared the performance of each algorithm using different train/test
split ratios. Finally, among the listed algorithms, Gradient Boosting Regression
outperformed the others.
Keywords: Cereal crop, regression algorithm, machine learning, dataset, yield prediction
Acknowledgements
First of all, I want to thank the Lord of all creation for all that He has done for us and for
bringing me to this point. Next, I would like to offer my special thanks to my advisor,
Dr. Kula Kakeba, for his constructive and concrete advice from the start to the end of the
thesis research and for the intellectual guidance that has motivated me throughout my
work.
Furthermore, I would like to thank my beloved wife, Kasayenesh Nigussie, for her
motivation and contribution, not only during the research but also during the class
sessions. I would like to thank my former teacher, Dr. Sudhir Kumar Mohapatra, and my
beloved friends at my workplace, the Information Network Security Agency (INSA):
Semahegn, Amsalu, and others whose names are not listed here, for their support during
the research.
I would also like to thank all those, too many to list, who contributed, cooperated, and
assisted directly or indirectly in acquiring the necessary data. Besides, I would also like to
thank the academic staff members and the whole Department of Software Engineering of
Addis Ababa Science and Technology University.
Table of Contents
Abstract
Acknowledgements
Introduction
Literature Review
2.1. Introduction
2.2. Factors Affecting Crop Production
2.2.1. Impact of Climate Change on Crop Production
2.2.2. Temperature
2.2.3. Rainfall
2.2.4. Pesticides
2.3. Machine Learning
2.3.1. Types of Machine Learning
2.4. Machine Learning Applications and Techniques in Agriculture
2.4.1. Species Management
2.4.2. Field Conditions Management
2.4.3. Crop Management
2.4.4. Livestock Management
2.5. Crop Yield Prediction Using MLR
2.6. Review of Related Works
Chapter Three
Methodology
3.4.1. Random Forest Regression
3.4.2. Decision Tree Regression
3.4.3. Support Vector Machines (SVM)
3.4.4. Gradient Boosting Regression
Chapter Four
5.1. Conclusion
5.2. Recommendation
References
Abbreviations and Acronyms
AI Artificial Intelligence
ANN Artificial Neural Networks
CCKP Climate Change Knowledge Portal
CPU Central Processing Unit
CSA Central Statistical Agency
CYP Crop Yield Prediction
DNN Deep Learning Neural Network Model
FAO Food and Agriculture Organization
GB Giga Byte
GDP Gross Domestic Product
ICT Information Communication Technology
IPython Interactive Python
IT Information Technology
KNN K Nearest Neighbor
Matlab MATrix LABoratory
ML Machine Learning
MLR Multiple Linear Regression
NCPB National Cereal Produce Board
pH Potential of Hydrogen
REPL Read-Evaluate-Print Loop
RF Random Forest
SVM Support Vector Machine
WEKA Waikato Environment for Knowledge Analysis
List of Tables
Table 1. Summary of studies and their findings
Table 2. Features used for cereal crop yield in the area
Table 3. Sample crop yield dataset
Table 4. Sample rainfall dataset
Table 5. Sample temperature dataset
Table 6. Sample pesticides dataset
Table 7. R² result summary of different train/test values
List of Figures
Figure 1. Predictive analysis process
Figure 2. Machine learning process
Figure 3. Types of machine learning
Figure 4. Workflow of the research
Figure 5. Block diagram for model design
Figure 6. Working of random forest algorithm
Figure 7. Working of decision tree regression algorithm
Figure 8. Working of SVM
Figure 9. Working of gradient boosting algorithm
Figure 10. Correlation map in the dataframe
Figure 11. Model comparison
Figure 12. Actual vs predicted yield
Chapter One
Introduction
This chapter describes the research background, which includes key concepts such as crop
yield prediction, machine learning, and predictive analysis techniques. The motivation of
the research and the statement of the research problem are then described, followed by the
objectives, the research questions, the scope and limitations, and the application of the
results. The chapter concludes with the organization of the thesis and a summary of each
chapter.
Agriculture has been the leading sector for centuries and remains so at present. It is
therefore believed that it will continue to be the determinant sector, playing a dominant
role in bringing about general sustainable economic growth for the country in the years to
come, if and only if strenuous efforts are made by the government and the stakeholders
concerned, such as farmers, to enhance productivity through increased use of farm inputs,
including improved seed and fertilizers, and to modernize farming through increased use
of modern and improved farm implements and farming systems, as well as through the
introduction of modern farming technology to the sector as a whole.
In Ethiopia, cereal production is the dominant form of agricultural practice compared with
other types of crop production. According to the 2019 CSA report, the shares of the total
crop production area are cereals (71.57%), legumes (11.20%), oats (5.17%), vegetables
(1.67%), root crops (1.60%), fruit crops (0.83%), and coffee (5.28%). Among the regional
states of the country, Oromia ranks first both in terms of land area allocation (45.41% of
the country-wide crop production area) and crop production (49.24% of country-wide crop
production) [3].
In 2018, the cereal yield for Ethiopia was 2,395 kg per hectare. Though Ethiopia's cereal
yield has fluctuated appreciably in recent years, it tended to increase over the 1969-2018
period, ending at 2,395 kg per hectare in 2018 [4]. This indicates that cereal crop
production is an important source of livelihood for smallholder farmers in the country, and
thus smallholder farmers' food security and welfare depend on the extent of development
in this subsector. Cereals such as sorghum, wheat, maize, and rice are the major staple
foods of most of the population.
Crop yield is the most important indicator in agriculture and has several connections with
human society. Because of the complexity of the data, crop production forecasting is a
difficult task for policymakers. Researchers in agriculture and agro-economics are
interested in developing new mathematical techniques that can make better predictions
using existing metrics. Research in this direction is concerned with establishing a link
between the rural environment and crop production, considering local variables, soil
quality, irrigation, and land use. These models are based on the laws of measurement [5].
Crop yield prediction is one of the most important and well-known topics in precision
agriculture, because it underlies yield mapping and estimation, the matching of crop
supply with demand, and crop management. Modern approaches go far beyond simple
prediction based on historical data and incorporate computer vision technologies to
provide data on the go and a comprehensive analysis of crop, weather, and economic
conditions [6].
The challenge begins when one realizes that it is not possible to produce such information
for a specific expert system. Manual surveys and remote sensing data are used to predict
crop yields. Manual surveys combined with observations and statistical knowledge from
past years are useful for a small area, but they are difficult to compare with other regions
and countries. Recent advances in crop simulation models have overcome these
problems [7].
Crop yield predictions are valuable to many stakeholders in the agro-food chain, including
farmers, agronomists, commodity traders, and agricultural policymakers [8]. Crop yield is
influenced by many crop-specific parameters, environmental conditions, and management
decisions, and it is difficult to construct a reliable and explainable prediction model [9].
From an engineering perspective, an ML application is a software system that has one or
more components that learn from data. Building it involves gathering and pre-processing
the data, training an ML model, deploying the trained model to carry out inference, and
engineering the surrounding software system that sends new input data to the model to get
answers. Machine learning is usually classified into three types: supervised learning,
unsupervised learning, and reinforcement learning [11].
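The stages named above (gathering, pre-processing, training, and inference) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the pipeline used in this thesis: the records, the field names `rainfall_mm` and `yield_kg_ha`, and the single-feature least-squares model are all invented to keep the example self-contained.

```python
# Illustrative ML workflow: gather -> pre-process -> train -> infer.
# All records and field names below are hypothetical placeholders.

def preprocess(records):
    """Drop records with a missing rainfall or yield value."""
    return [r for r in records
            if r.get("rainfall_mm") is not None
            and r.get("yield_kg_ha") is not None]

def train(records):
    """Fit yield = intercept + slope * rainfall by ordinary least squares."""
    xs = [r["rainfall_mm"] for r in records]
    ys = [r["yield_kg_ha"] for r in records]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"intercept": mean_y - slope * mean_x, "slope": slope}

def predict(model, rainfall_mm):
    """Inference step: apply the trained model to a new input."""
    return model["intercept"] + model["slope"] * rainfall_mm

raw_records = [
    {"rainfall_mm": 600.0, "yield_kg_ha": 1800.0},
    {"rainfall_mm": 700.0, "yield_kg_ha": 2000.0},
    {"rainfall_mm": None, "yield_kg_ha": 2100.0},  # removed in pre-processing
    {"rainfall_mm": 800.0, "yield_kg_ha": 2200.0},
]

clean_records = preprocess(raw_records)
model = train(clean_records)
estimate = predict(model, 750.0)
print(f"Estimated yield at 750 mm rainfall: {estimate:.0f} kg/ha")
```

Keeping each stage in its own function mirrors the separation described in the text: the deployed system only needs `predict` and the stored model, while `preprocess` and `train` run offline.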
Data processing is the basis of the whole agricultural data cycle and has to deal with many
problems in agriculture, including food security, soil conservation, irrigation, pest
identification and prevention, soil health, and agricultural utilization. Traditional analysis
techniques, including data mining, machine learning, and statistical analysis, are not
directly applicable to large-scale data processing in agriculture, although years of research
and development in these techniques have had a large effect on how data is analyzed.
Depending on the characteristics of agricultural data, timeliness is used as a measure.
Research data management poses several technological challenges. These are associated
with environmental modeling and range from metadata-based data retrieval problems to
data mining and data integration. Analysts are used to validate big agricultural data
algorithms, which makes it possible to estimate, to some extent, the effectiveness of the
algorithms and the reliability of the results. Predictive analysis is the branch of data
analysis that is mainly used to predict future events or outcomes. The process of predictive
analysis is shown diagrammatically in Figure 1 [12].
Predictive analysis combines a number of scientific methods and techniques. According to
the science of predictive analytics, its techniques include the following basic
steps [13].
Regression
Classification
Clustering Time Series
Prediction
Data Mining: Data mining aims to process large quantities of data, both structured and
unstructured, in order to recognize hidden patterns and relationships among the variables
provided. Once identified, those relationships can be used to understand the behavior of
the event from which the data was compiled.
Statistical Modelling: In parallel with the data mining process, statistical models can be
developed, depending on the context of what needs to be predicted, using the same data
gathered for data mining. Once the model is built, new data is fed to it to predict future
outcomes. For example, a business expert can build a cross-selling model using current
customer data and predict what other items customers are likely to purchase from the
same company.
Machine Learning: ML can deploy iterative techniques and strategies to identify patterns
in massive data sets and construct models. For example, recommendation engines are
widely used for online shopping recommendations, where predictions are made from
customers' earlier shopping and browsing behavior.
There is still a long way to go before it can be used in any data science application. This
makes it difficult for researchers who want to contribute to research in agricultural data
analysis. Therefore, the current research is intended to work in this field and to develop
useful and valuable data sets and models, so that any researcher who wants to work in this
field will have access to decent information and good models. It is hoped that the
successful implementation of the models will enable one to predict cereal crop productivity.
To achieve the above general objective, the research work will carry out the following
specific objectives:
To review the literature on crop yield prediction and machine learning applications in agriculture
To examine the factors affecting crop production
To collect data from open-source data repositories
To develop an appropriate model using machine learning techniques for cereal crop yield prediction
To evaluate the effectiveness of the developed machine learning model for cereal crop yield prediction
To prepare a dataset of cereal crop production for Ethiopia
1.8. Contribution of the Study
This work contributes to the scientific community and practice in multiple ways:
The factors that affect crop production, specifically for cereal crops, were investigated; the findings have applications for agricultural policy.
Various solutions proposed or used in crop yield prediction were studied, together with the effect of the parameters that influence their results.
Standard algorithms for crop yield prediction were compared, and the most suitable algorithm for a generic set of crops was identified.
The parameters that affect crop yield were evaluated and ranked according to their impact.
A cereal crop dataset for Ethiopia has been developed and published for researchers and other users.
Chapter Two
Literature Review
2.1. Introduction
This chapter discusses the literature review conducted by referring to books, journals,
articles, conference papers, and the internet to gain more insight into the concept of
machine learning and its applications, especially in agriculture. It provides insights into
the work of earlier researchers. The chapter is subdivided into factors affecting crop
production, machine learning applications and techniques in agriculture, crop yield
prediction using MLR, and a summary of recent related work.
Ethiopia is one of the countries most affected by climate change. The agricultural sector,
which contributes more than 45% of GDP, 80% of employment, and 85% of foreign
exchange earnings, is particularly vulnerable to climate change. More than 95 percent of
crop production is rain-fed and comes from smallholder and subsistence farmers, who have
little capacity to adapt to climate change [1].
Global warming is one of the major challenges facing global food security. Climate change
affects the production and productivity of the crop sector by decreasing soil fertility and
increasing pests and crop diseases. Its effects are aggravated by lack of access to inputs and
improved seeds, frequent drought and floods, limited irrigation schemes, poverty, high
population pressure, and lack of institutional capacity for adaptation. Climate change is
projected to decrease the overall yields of cereal crops in Africa by shortening the length of
the growing season, increasing water stress, and increasing disease, pest, and weed
outbreaks [15]. Agriculture always depends on the weather: farmers need the right mixture
of sun, heat, and rain to reliably produce the food that people need.
2.2.2. Temperature
Wheat needs 12 to 15 inches (31 to 38 cm) of water to produce a good crop in Ethiopia. It
grows best when temperatures are warm, from 70° to 75°F (21° to 24°C), but not too hot.
Wheat also needs a lot of sunshine, especially when the grains are filling. Rice requires a
hot and humid climate. The average temperature required throughout the life of the rice
crop ranges from 21 to 37°C. Maize is a warm-weather crop and is not grown in areas
where the mean daily temperature is less than 19°C or where the mean of the summer
months is less than 23°C. Although the minimum temperature for germination is 10°C,
germination is faster and less variable at soil temperatures of 16 to 18°C. Sorghum
germinates quickly at soil temperatures of 65-70°F (18-21°C) but will also germinate at
temperatures as low as 50°F (10°C), although growth is then very slow. Planting should
not begin until soil temperatures at a 2-inch (5 cm) depth have reached an average of 60°F
(about 16°C) over a five-day period [16].
2.2.3. Rainfall
Rainfall seasonality and timing are key climatic features affecting crop yield in rain-fed
agriculture. Crops need water for their growth, for photosynthesis, and for their overall
performance. Rainfall provides water that serves as the medium through which nutrients
are transported for crop development [17]. Rainfall variability has a significant and
negative impact on crop production in Ethiopia. When the annual rainfall diverges from
its mean, whether upward or downward, the production of all crop types diminishes
significantly. When there is extreme rainfall, the effect of fertilizer in raising productivity
also diminishes.
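One way to express this effect as a model input is the absolute deviation of each year's rainfall from the long-term mean, so that unusually wet and unusually dry years both receive a large value. The sketch below assumes invented annual totals; the function name `rainfall_anomalies` is illustrative and is not taken from the thesis.

```python
def rainfall_anomalies(annual_rainfall_mm):
    """Return the absolute deviation of each year's rainfall from the mean."""
    mean = sum(annual_rainfall_mm.values()) / len(annual_rainfall_mm)
    return {year: abs(total - mean)
            for year, total in annual_rainfall_mm.items()}

# Hypothetical annual rainfall totals in millimetres.
rainfall = {2015: 700.0, 2016: 900.0, 2017: 800.0, 2018: 1000.0, 2019: 600.0}
anomalies = rainfall_anomalies(rainfall)

for year in sorted(anomalies):
    print(year, anomalies[year])
```

Because the deviation is taken as an absolute value, a year 200 mm above the mean and a year 200 mm below it are treated alike, which matches the observation that departures in either direction reduce production.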
In Ethiopia, the amount of rainfall required for wheat cultivation varies between 300 mm
and 1000 mm. The major wheat lands of the temperate regions have an annual rainfall of
380 mm to 800 mm. Maize is the second most widely cultivated crop in Ethiopia and is
grown under rain-fed production. The rainfall needed for maize is around 650-1200 mm.
Rice is mainly grown in rain-fed areas that receive heavy annual rainfall. It demands a
rainfall of more than 800 mm. Sorghum is well adapted to semiarid regions with a
minimum annual rainfall of 350-600 mm. It is grown in areas that are hot and dry [18].
2.2.4. Pesticides
Pesticides can contaminate soil, water, grass, and other plants. In addition to killing insects
or weeds, pesticides can be poisonous to many other organisms, including birds, fish,
beneficial insects, and non-target plants. Pesticides are poisonous chemical compounds that
are intentionally released into the environment. Although each pesticide is meant to kill a
certain pest, a very large percentage of pesticides reach a destination other than their target.
Pesticides easily contaminate the air, ground, and water when they run off from fields,
escape storage tanks, are not properly disposed of, and especially when they are sprayed
from the air. Pesticides are agricultural technologies that enable farmers to control pests
and weeds and are an important resource when growing crops [19].
A machine or intelligent computer program learns and extracts knowledge from data and
builds a framework for making predictions or intelligent decisions. Thus, the ML process
is divided into three key parts, i.e., data input, model building, and generalization, as
shown in Figure 2. Generalization is the process of predicting the output for inputs on
which the algorithm has not been trained before.
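Generalization is usually estimated by holding back part of the data and measuring the error only on those unseen rows. The sketch below does this at several train/test ratios. It is a hedged illustration: the rows are invented, `split_rows` is a hand-written helper rather than a library function, and the "model" is a deliberately simple baseline that always predicts the mean training yield.

```python
import random

def split_rows(rows, test_fraction, seed=0):
    """Shuffle rows reproducibly, then split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical (rainfall_mm, yield_kg_ha) rows; the values are invented.
rows = [(float(r), 600.0 + 2.0 * r) for r in range(500, 1000, 50)]

results = {}
for test_fraction in (0.1, 0.2, 0.3):
    train_rows, test_rows = split_rows(rows, test_fraction)
    # Baseline model: always predict the mean yield seen during training.
    baseline = sum(y for _, y in train_rows) / len(train_rows)
    test_yields = [y for _, y in test_rows]
    error = mean_absolute_error(test_yields, [baseline] * len(test_yields))
    results[test_fraction] = (len(train_rows), len(test_rows), error)

for fraction, (n_train, n_test, error) in results.items():
    print(f"test={fraction:.0%} train={n_train} test={n_test} MAE={error:.1f}")
```

The test rows never influence the baseline, so the reported error reflects behavior on inputs the model has not seen, which is the sense of generalization used in the text.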
ML algorithms are mainly used to solve complex problems where human expertise fails,
such as weather prediction, spam filtering, disease identification in plants, and pattern
recognition [20].
Machine Learning (ML) has enabled more in-depth studies in a plethora of fields.
Training artificial neural networks (ANNs) is one of the most popular techniques in ML
and has been applied to many biological and agricultural problems. An interesting and
useful aspect of ANNs is their ability to find complex associations between input and
response variables without pre-defining any constraints or assumptions about the sample
distribution of the data. This makes it possible to describe complicated non-linear
relationships, which are often found in domains such as precision agriculture because of
the wide range of crop conditions and other influencing factors [21].
Machine Learning (ML) deals with problems where the relationship between input and
output variables is not known or is hard to obtain. The term "learning" here denotes the
automated acquisition of structural descriptions from examples of what is being described.
Unlike conventional statistical methods, ML does not make assumptions about the precise
structure of the model that describes the data. This property is very useful for modeling
complicated non-linear behaviors, such as a function for crop yield prediction. ML
techniques have been applied very successfully to Crop Yield Prediction (CYP). A
supervised learning algorithm consists of a target/outcome variable (or dependent variable)
which is to be predicted from a given set of predictors (independent variables). Using
these sets of variables, we generate a function that maps inputs to desired outputs. The
training process continues until the model achieves a desired level of accuracy on the
training data. Examples of supervised learning: regression, decision tree, random forest,
KNN, logistic regression, etc. [22].
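The idea of training until a desired level of accuracy is reached can be sketched as a loop that stops once the training error falls below a target. The example below fits a one-predictor linear function by gradient descent. It is an illustration under stated assumptions: the data, the learning rate, and the error target are invented, and gradient descent is only one of many possible training procedures.

```python
def fit_until_accurate(xs, ys, target_mse=1e-4, learning_rate=0.1,
                       max_epochs=10_000):
    """Fit y = w * x + b by gradient descent, stopping at the target error."""
    w, b = 0.0, 0.0
    n = len(xs)
    mse = float("inf")
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        mse = sum(e * e for e in errors) / n
        if mse <= target_mse:
            break  # desired accuracy on the training data reached
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, mse, epoch

# Hypothetical, already-scaled predictor and target following y = 3x + 1.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.75, 2.5, 3.25, 4.0]

w, b, mse, epochs = fit_until_accurate(xs, ys)
print(f"w={w:.3f} b={b:.3f} mse={mse:.6f} epochs={epochs}")
```

Here `w` and `b` define the learned function from the predictor to the target. The `max_epochs` limit guards against the case where the target error is never reached.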
Supervised Learning: This is the most popular paradigm in machine learning. Given data
in the form of examples with labels, we can feed the algorithm these example-label pairs
one by one, allowing it to predict the label for every example and giving it feedback as to
whether or not it predicted the right answer. Over time, the algorithm learns to
approximate the exact nature of the relationship between examples and their labels. When
fully trained, the supervised learning algorithm is able to observe a new, never-before-seen
example and predict a good label for it [23].
as well as nutrient content or a better taste. Machine learning, and deep learning
algorithms in particular, takes decades of field data to analyze crop performance in
different climates and the new characteristics developed in the process. Based on this
data, a probability model can be built that predicts which genes are most important to a
plant.
Species Recognition
While the traditional human approach for plant classification would be to compare the color
and shape of leaves, machine learning can provide more accurate and faster results by
analyzing leaf morphology, which contains more information about the properties of the
leaf.
2.4.2. Field Conditions Management
Soil Management
For agricultural specialists, soil is a diverse natural resource with complex processes and
vague mechanisms. Its temperature alone can give insights into the effects of climate
change on regional yield. Machine learning algorithms study evaporation processes, soil
moisture, and temperature to understand ecological variability and problems in agriculture.
Water Management
Water management in agriculture affects the hydrological, climatological, and agronomical
balance. So far, the most advanced machine learning-based applications are connected
with the estimation of daily, weekly, or monthly evapotranspiration, which allows a more
effective use of irrigation systems, and with the prediction of the daily dew point
temperature, which helps to identify expected weather phenomena and to estimate
evapotranspiration and evaporation.
2.4.3. Crop Management
Yield Prediction
Yield prediction is one of the most important and popular topics in precision agriculture
because it defines yield mapping and estimation, the matching of crop supply with
demand, and crop management. State-of-the-art approaches have gone far beyond simple
prediction based on historical data and incorporate computer vision technologies to
provide data on the go and a comprehensive multidimensional analysis of crops, weather,
and economic conditions, in order to make the most of the yield for farmers and the
population.
Crop Quality
The accurate detection and classification of crop quality characteristics can increase
product prices and reduce waste. In comparison with human experts, machines can make
use of seemingly meaningless data and interconnections to reveal and detect new qualities
that play a role in the overall quality of the crops.
Disease Detection
In both open-air and greenhouse conditions, pesticides are commonly sprayed uniformly
over the crop. To be effective, this approach requires high doses of pesticides, resulting in
significant financial and environmental costs. Machine learning serves as an integral part
of overall agricultural management, in which pesticide application is targeted in terms of
time, place, and affected plants.
Weed Detection
Apart from diseases, weeds are the most important threat to crop production. The biggest
problem in weed control is that weeds are difficult to detect and to discriminate from
crops. Computer vision and machine learning algorithms can improve the detection and
discrimination of weeds at low cost and with no environmental issues or side effects. In
the future, these technologies will drive robots that destroy weeds, minimizing the need for
herbicides.
2.4.4. Livestock Management
Livestock Production
Similar to crop management, machine learning provides accurate prediction and
estimation of farming parameters to optimize the economic efficiency of livestock
production systems, such as cattle and egg production. For example, weight prediction
systems can estimate future weights 150 days before the slaughter day, allowing farmers
to modify diets and conditions accordingly.
Animal Welfare
In the present-day setting, livestock is increasingly treated not just as food containers but
as animals that can be unhappy and exhausted by their life on a farm. Animal behavior
classifiers can connect chewing signals to the need for diet changes, and from movement
patterns, including standing, moving, feeding, and drinking, they can tell the amount of
stress the animal is exposed to and predict its susceptibility to diseases, its weight gain,
and its production.
The value of smart automation is widely recognized across many sectors, as shown by
examples of AI in fintech or AI in real estate. In agriculture, this branch
of technology is becoming essential. With data at the center of farming decisions and
the development of agrochemical products, the potential is immense. Perhaps more
importantly, machine learning is set to become a behind-the-scenes enabler of
more sustainable use of natural resources and a major contributor to a better environment.
However, for this technology to have a tangible effect on agriculture, it needs wide acceptance
among stakeholders, a different mindset from farmers, and sufficient funding. This is a
long-haul game. Companies need to be prepared to reinvent themselves, learn new
skills, and adapt to the rules imposed by big data.
Crop Yield Prediction (CYP) is one of the methodologies for predicting the yield of crops
using the available parameters. Yield prediction is controlled by various trainable and
untrainable factors. Predictive modeling is a method that uses data mining and probability
to predict outcomes. Data modeling in prediction involves four stages: historical data
analysis, data pre-processing, modeling of data, and performance estimation [28].
Applying ML algorithms and tuning their parameters based on the feature set yields
accurate predictions. Researchers are working toward developing efficient methods to
evaluate the prediction accuracy based on the data they collected. Data-driven models
have gained popularity and found CYP applications using classical statistical and ML
methods. Supervised ML approaches such as Artificial Neural Network (ANN), Support
Vector Regression (SVR), k-Nearest Neighbor (k-NN), and Random Forest (RF), which
are parametric or nonparametric in nature, heavily dominate crop yield
prediction on different agricultural data sets [29].
Machine learning is a critical decision-support tool for crop yield prediction, including
supporting decisions on what crops to grow and what to do during the growing season
of the crops. Several machine learning algorithms have been applied to support crop
yield prediction research [30].
The most important problem in agriculture is crop yield prediction. Agricultural yield
primarily depends on weather conditions (rain, temperature, etc.) and pesticides. Accurate
information about the history of crop yield is important for making decisions related to
agricultural risk management and future predictions.
Yield forecasts can be made using statistical and ML models. The statistical
model multiple linear regression (MLR) is frequently used in agricultural yield prediction,
and its main goal is to quantify the relationships between several independent variables and
a dependent variable. Even when there is no rational dependence between the variables, one
can try to connect them using a mathematical equation. This equation may not have a physical
meaning, but under some assumptions it allows values to be forecast from knowledge of other
variables. The MLR method attempts to build a model that relates a dependent variable to
two or more independent variables by fitting a linear equation to the observed data.
This section analyzes the application of the MLR model in yield prediction by the
researchers [31].
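As a minimal sketch of how MLR fits a linear equation to observed data, the example below
estimates the coefficients by ordinary least squares with NumPy. The rainfall, temperature,
and yield figures are made-up illustrative values, not data from this study, and the
variable names are assumptions introduced only for the example.

```python
import numpy as np

# Hypothetical observations (illustrative values only, not the thesis data):
# columns are rainfall (mm/year) and average temperature (deg C).
X = np.array([
    [800.0, 18.0],
    [950.0, 19.5],
    [700.0, 21.0],
    [1100.0, 17.0],
    [1000.0, 20.0],
])

# Yield generated from a known linear rule so the fit can be checked:
# yield = 500 + 20 * rainfall - 100 * temperature
y = 500.0 + 20.0 * X[:, 0] - 100.0 * X[:, 1]

# Add a column of ones so the first coefficient is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, b_rain, b_temp = beta

# Predict the yield for a new season (900 mm of rain, 19 deg C).
new_season = np.array([1.0, 900.0, 19.0])
predicted_yield = float(new_season @ beta)
```

Because the synthetic yield follows an exact linear rule, the fitted coefficients recover
that rule; with real data the fit would only approximate the underlying relationship.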
MLR and ANN are widely used to predict soil hydraulic properties from easily available
soil variables, and parameters are selected by the data distribution method. The researchers
used a vast array of soil data. From the analysis, they found that the neural network
results depend on the uncertainty in the data. When the uncertainty in the data sets is low,
the neural network provides a better prediction of soil behavior than MLR. However, when the
uncertainty in the data sets is high, the neural network is unable to provide better
prediction accuracy [32].
MLR and ANN algorithms were implemented to estimate the yield of organic potatoes
using the soil quality parameters and tillage system. The effect of tillage
systems on the soil properties used to calculate crop production is discussed [33]. They
established that tillage and soil properties greatly affected the yield. It was also found
that the crop yield was estimated more accurately by the MLR model than by the ANN model.
Still, its prediction effectiveness was lower when compared to the ANN model.
The study by Sarmadian focused on predicting soil parameters using the available soil
dataset. The feedforward back-propagation neural network model and the MLR model
were used to predict the soil parameters. The artificial neural network with two neurons in
the hidden layer performed well with the main soil parameters, including cation exchange
capacity, water percentage at field capacity, and permanent wilting point. The performance
of the selected models was evaluated on the test data. The results indicate that the
neural network models were more suitable to compute the nonlinearity among the variables
[34].
Linear and neural network models were developed to estimate the daily global solar
radiation in a region of Salta Province, Argentina. The features of the dataset were
analyzed with MLR, ANN, and Multilayer Perceptron. The efficiency of the linear models and
neural network models was compared on the dataset, which
consisted of solar radiation data for 1996-2002. Three alternative
combinations of meteorological parameters for neural networks and linear regressions were
considered. The researchers got good results with both prediction methods. However, it
was concluded that neural networks produced better estimates than linear regressions [35].
MLR and ANN were compared by Mohammad Zaefizadeh et al. to estimate the barley
yield. Their prediction model was based on a multilayer ANN with one hidden layer of
15 neurons. The Matlab perceptron-type software used in this study ran
an error back-propagation learning algorithm with a hyperbolic tangent
function. The comparative results of the analysis indicated that the mean deviation index
of estimation in the ANN technique was one-third of that in MLR. This difference in the
mean deviation index was due to the significant interaction
between the genotype and the environment, which affected the MLR
estimates. The study concluded that a neural network approach is
recommended over the regression method for yield prediction, especially when there are
significant genotype-environment interactions and greater speed is required [36].
Safa and Samarasinghe attempted to create an ANN that could predict energy use in wheat
production. The study was held on the irrigated as well as dry wheat fields in Canterbury
in the 2007-08 harvest season. The data were collected by using extensive interviews and
questionnaires. The researchers identified many direct and indirect factors to train the
ANN. The ANN model gave a better prediction on energy consumption than the MLR
model when a dataset was selected for testing and validation [37].
Using ten crop datasets, Gonzalez-Sanchez, Frausto-Solis, and Ojeda-Bustamante studied the
predictive accuracy of ML and linear regression techniques in crop yield prediction, using
data collected from a Mexican irrigation zone. Along with the MLR model, the
researchers used regression trees, neural networks, nearest neighbor, and support vector
models to analyze the predictive ability. M5-Prime and k-nearest neighbor techniques
obtained the best average accuracy metrics, and the researchers concluded that, in
agricultural planning, the planner could use M5-Prime for large-scale crop yield prediction
[38].
Gopal and Bhargavi developed a novel hybrid model to predict paddy crop yield, based on
multiple linear regression (MLR) and Artificial Neural Networks (ANN). In this
model, the initial weights of the neural network are derived from the MLR coefficients. The
paddy data is used to train the backpropagation network, and its performance is compared
with that of other machine learning models. The hybrid ANN-MLR model achieved better
precision than the other models [39].
Shastry and Sanjay created a new cloud-based framework to
classify soil and to predict crop yield. The proposed soil classification framework
is based on the hybrid kernel Support Vector Machine (SVM)
method, and the SVM kernel parameters are derived from GA. The crop yield prediction model
was developed based on Artificial Neural Networks (ANN), and
the parameters of the ANN, such as the hidden layers, neurons, and learning rate,
are customized. The proposed cloud-based framework performs
better than other models in soil classification and crop yield prediction [40].
Pavan Patil, Virendra Panpatil, and Prof. Shrikant Kokate proposed a system and discuss
improving its results by adding more attributes. A combination of Naive
Bayes and decision tree algorithms is used. The decision tree shows poor performance
on the given dataset and has more variation, whereas Naive Bayes provides better results
than the decision tree for such datasets. The combined Naive Bayes
and decision tree classifier performs better than a single classifier model.
The parameters include soil type, soil pH value, humidity, temperature, wind, and rainfall
[41].
Islam, T., Chisty, T. A., & Chakrabarty used a deep neural network (DNN) model
to predict the yield of crops such as rice, jute, wheat, and potato using weather, soil,
and fertilizer data. The newly developed DNN model is compared with other machine
learning models, namely Random Forest, Support Vector Machine, and Linear Regression.
The DNN model gives higher precision in prediction than the other models [42].
Feng et al. exploited the power of machine learning and regression models in the prediction
process. In their research work, they compared the cross-validated Random Forest (RF)
model with the multiple linear regression (MLR) model and also established the correlation
between climate and rainfall parameters. This correlation shows how the wheat
yield percentage decreases when the rainfall is low. In prediction, RF outperforms
MLR [43].
Prakash, S., Sharma, A., & Sahu explored better ways of predicting soil moisture with the
help of machine learning models such as Support Vector Machine (SVM) and RNN, and the
statistical model multiple linear regression (MLR). The predicted outcomes of the
models were compared against each other. The authors suggested that, in short-term
moisture prediction, MLR has better prediction power than the machine learning models [44].
Giritharan and Koteeshwari suggested using one of the most effective tools,
the Artificial Neural Network (ANN), for modeling and prediction. To implement the
ANN, the feedforward and back-propagation networks are combined.
The suggested system is an easy-to-use Android application [45].
Snehal S. Dahikar and Sandeep V. Rode used Artificial Neural Network technology for
estimating long-term or short-term crop production because it offers solutions to many of
the cumbersome problems in agricultural research. Their work presented only
the ANN, aiming to minimize losses under unfavorable conditions while predicting the crop
yield from parameters such as soil, weather, guaranteed price, and cultivation area [46].
Singh and Prabhat Kumar concluded that their paper would help improve crop yields by
applying classification methods and comparing metrics. Crops can also be analyzed and
predicted using Bayesian algorithms. The Bayesian algorithm, K-means
algorithm, clustering algorithm, and Support Vector Machine algorithm were used. The
disadvantage is that the paper does not adequately report the accuracy and performance of
the suggested algorithms as implemented [47].
Arun Kumar, Naveen Kumar, and Vishal Vats proposed a system to predict the yield
of the crop by analyzing past soil, rainfall, and yield datasets. The prediction
was done using the K-Nearest Neighbor, Support Vector Machine, and Least
Squares algorithms [48]. They have done crop prediction using the weather forecast, the
pesticides and fertilizers to be used, and past revenue as input data. Multilinear principal
component analysis (MPCA) has been used for feature reduction. In addition to the
forecast, they take into account prerequisites and feature reduction [49].
There are a few research works on sugarcane yield prediction that can be related
to our work. A sugarcane yield prediction technique using Random Forest [43]
was proposed in one study; the features used in this study consist of biomass
index, climate data (e.g., rainfall), and yields from previous years. Two predictive tasks
are provided in [50]: (i) the classification problem of predicting whether the yield will
be above or below the observed median yield, and (ii) the regression problem of predicting
the yield estimates at two distinct time intervals. In addition, a support vector machine
for rice crop yield prediction was proposed; the data used in this method are precipitation,
minimum, maximum, and average temperature, area, evapotranspiration, and
production. The sequential minimal optimization classifier is applied to the
dataset [51].
Merin Mary Saji, Kevin Tom, Varsha S, Lisha Varghese, and Er. Jinu Thomas proposed a
paper that addresses agricultural problems by examining the agricultural region on
the basis of soil properties. It recommends the most suitable crop to farmers,
thereby helping them to increase productivity and reduce loss. The paper
compares several algorithms and reports which of them is best
for crop prediction. The algorithms tested are
KNN, KNN with Cross-Validation, Decision Tree, Naive Bayes, and SVM. The accuracies
obtained were 85%, 88%, 81%, 82%, and 78%, respectively. KNN with cross-validation
has the highest accuracy and, as a result, can be used for implementation in
the final system [52].
The dataset is processed with the WEKA tool to build the set of rules on the current
dataset. The results were generated in Python using the SVM algorithm. In another study,
decision trees and decision rules were developed based on the C4.5 algorithm, and a website
called Crop Advisor was built: an interactive website for
discovering the effect of weather on crop production using the C4.5 algorithm [53].
This gives an idea of how different climatic parameters affect the growth of the crop. The
selections were made based on the area under the chosen crop. Information on
the associated years' climatic parameters, such as rainfall, high and low temperature, and
wet day frequency, was collected. The ID3 algorithm was used to obtain good quality
and improved tomato crop yield; it is implemented on the PHP platform and uses CSV files
as datasets. The features used in this study include area, production of the tomato crop,
temperature, and humidity [54].
A decision tree classifier for agricultural data was proposed [55]. This new
classifier uses a new data expression and can deal with both complete and incomplete data.
In the experiment, a 10-fold cross-validation technique is used to test the datasets, the
horse-colic dataset and the soybean dataset. Their results showed that the proposed decision
tree is capable of classifying all kinds of agricultural data. A yield prediction model that
uses data mining techniques for classification and prediction was proposed in another study.
This model includes crop name, topography, soil type, soil pH, pest
information, climate, water level, and seed type; it predicted plant growth and
plant diseases and thereby made it possible to select the best crop based on climate
information and the required parameters [56].
Studying the previous research done by the various scholars above provides many techniques
and ideas that help in addressing the problem this study
intends to solve. Using machine learning algorithms, the prediction can be made
more efficient, and there are several ways to approach crop yield prediction. As a
step forward, this study aims to use regression techniques: the values in the dataset are
numerical, so the data is suited to regression.
The table shown below summarizes the works of other researchers, scholars, and
contributors of the domain of crop yield prediction, the algorithms that they use, the
purpose of their studies, and their findings.
Deep Feedforward
Network (DFN
Year | Author(s) | Purpose | Algorithms | Findings
2016 | Giritharan and Koteeshwari | To develop crop predictor and advisor using ANN | Artificial Neural Network | Develop crop predictor and advisor application for smartphones
2015 | J.P. Singh, Rakesh Kumar, M.P. Singh and Prabhat Kumar | To improve the yield rate of crops | Bayesian algorithm, K-means Algorithm, Clustering Algorithm, SVM | Analyzing crop prediction using those models, but they did not show proper accuracy error
2018 | Arun Kumar, Naveen Kumar and Vishal Vats | Efficient Crop Yield Prediction Using Machine Learning Algorithms. | SVM and Least Squares algorithms | It shows that SVM is better here compared to the complexity
2020 | Kevin Tom, Varsha S, Merin Mary Saji, Lisha Varghese, Er. Jinu Thomas | Crop Prediction Using Machine Learning. | KNN, KNN with Cross Validation, Decision Tree, Naive Bayes, and SVM. | The accuracies obtained here are 85%, 88%, 81%, 82% and 78% respectively. KNN with cross validation has the highest accuracy for this paper.
Based on the performance reported in the above literature, we have selected
four machine learning regression algorithms for crop yield prediction. Having analyzed the
gaps in the reviewed research, we
conclude that the data used and the scope of those studies are limited to specific areas;
this study can fill the gap for Ethiopia, and specifically for cereal crops.
Chapter Three
Methodology
3.1. Introduction
The goal of this work is to apply a number of standard machine learning techniques to
an agricultural data set in order to predict cereal crop yield. Before applying the machine
learning techniques to the data set, there should be a methodology that governs the
work. A methodology is more than a method of data collection; it also covers the
concepts and theories that underlie the methods. It is therefore critical to understand the
essential ideas of the method, whether the aim is to focus on a specific feature of a
sociological theory, to check an algorithm for data retrieval, or to check the validity of
a particular system.
The objective of the current research work is to analyze the predictive algorithms using
a small number of relevant features. To meet the objectives of the research work, data
collection and data quality are most important. Since this study is a combination of three
approaches, the methodology discussed in each approach and the various resources are finally
linked into a single platform to achieve the objective. The data used for the current work
are irrigation-related data, related meteorological data, fertilizer usage data, and yield
statistics. The data was collected from various sources during the process. The
pre-processed data were passed to feature selection algorithms to identify the most critical
features. The selected features of the input dataset are given as input to the predictive
algorithms to predict the crop yield.
3.2. Workflow
The research work has been divided into 8 procedures. Figure 4 illustrates the workflow
of the research. The subsequent subsections of the chapter discuss the workflow in detail.

The data is collected from open-source sites which have normalized values. The system
can also be tested against the actual data which can be obtained from the government. For
this research, the crop yield data is obtained from FAO and the climate data is obtained from
the World Data Bank repository, which provides global data on historical and future climate,
vulnerabilities, and impacts.
A dataset is commonly understood as a collection of data, typically a single table in which
each column represents a particular variable, or a combination of such tables describing a
whole entity. The data set can be
organized into several characteristics of information based on the structure and properties
that need to be carried out [57].
Any ML analysis needs data, and a model can only be powerful if it is fed
the right data. The target data should have the right features and the right outcomes,
because these affect the relevance and usability of the model as well as the findings.
The data used for this work was obtained from FAO and the World Data Bank
repository. It was assembled for crop yield prediction.
Agricultural production depends on these factors. Changes in these factors have a
meaningful impact on the yearly agricultural output of the selected areas. The attributes or
parameters used depend mainly on the availability of the data. Two sets of
statistical data were used for the study: the agricultural data for
cereal crops and the weather data for the respective years. The two collected data sets were
combined into a single data set. The parameters of the dataset are listed below.
Temperature
Rainfall
Yield
Pesticides
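The combination of the yield data and the weather data into a single data set can be
sketched as follows. This is a minimal illustration with pandas, assuming that the two
tables share the key columns Area and Year; the rows are made-up values and do not come from
the FAO or World Data Bank files.

```python
import pandas as pd

# Hypothetical yield records (illustrative values, not the thesis data).
yield_df = pd.DataFrame({
    "Area": ["Ethiopia", "Ethiopia", "Ethiopia"],
    "Item": ["Maize", "Wheat", "Maize"],
    "Year": [2010, 2010, 2011],
    "hg/ha_yield": [25000, 18000, 27000],
})

# Hypothetical weather records keyed by country and year.
weather_df = pd.DataFrame({
    "Area": ["Ethiopia", "Ethiopia"],
    "Year": [2010, 2011],
    "average_rain_fall_mm_per_year": [848.0, 848.0],
    "avg_temp": [22.5, 22.8],
})

# Inner join on the common features so that every yield row gains the
# weather values recorded for the same country and year.
merged = yield_df.merge(weather_df, on=["Area", "Year"], how="inner")
```

An inner join keeps only the country and year combinations present in both tables, which
avoids introducing rows with missing weather or yield values.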
Machine learning algorithms do not work well with raw data. Data
pre-processing is a technique used to convert the raw data into a clean data set. In
other words, whenever data is collected from various sources, it is collected in raw
form, which is not easy to analyze.
Data cleaning: the process of cleaning the data so that the
information can be used effectively. Real-world data tend to be incomplete,
noisy, and inconsistent. Data cleaning routines attempt to fill in missing values,
smooth out noise while identifying outliers, and correct inconsistencies in the data.
One way of handling missing values is deleting the rows with null values. This
method is a quick solution and is typically preferred in cases where the
percentage of missing values is relatively low. There were no null values in the
data set.
Data integration: a data analysis task that combines data from
multiple sources into a coherent data store, such as a dataframe.
Data reduction: Complex data analysis and mining on huge amounts of data may
take a very long time, making such analysis impractical or infeasible. Data
reduction is the process of reducing a large data set to a smaller one, in such a way
that the data can easily be transformed further. It obtains a representation reduced in
volume, using techniques such as data discretization, data aggregation, dimensionality
reduction, data compression, and generalization.
Data transformation: the process of converting raw data into a format or
structure that is more suitable for model building and for data discovery in
general. It is an imperative step in feature engineering that facilitates discovering
insights. The techniques of numeric data transformation considered here are log
transformation, clipping methods, and data scaling.
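The cleaning and transformation steps described above can be sketched as follows. This is
a minimal example with pandas and NumPy on made-up records; the clipping limit of 1000 is an
arbitrary assumption chosen only for illustration, not a value used in this study.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records with one missing value and one extreme value.
raw = pd.DataFrame({
    "pesticides_tonnes": [120.0, 95.0, np.nan, 110.0, 5000.0],
    "hg/ha_yield": [21000.0, 19500.0, 20500.0, 20000.0, 22000.0],
})

# Data cleaning: delete the rows that contain null values.
clean = raw.dropna().reset_index(drop=True)

# Clipping: cap extreme values at an upper limit (assumed limit of 1000).
clipped = clean["pesticides_tonnes"].clip(upper=1000.0)

# Log transformation: log(1 + x) compresses large magnitudes and is
# defined at zero, unlike a plain logarithm.
log_pesticides = np.log1p(clipped)
```

Deleting rows is only reasonable when few values are missing; otherwise too much of the data
set would be discarded.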
Looking at the dataset above, it contains features that vary widely in magnitude, unit,
and range. Features with high magnitudes weigh much more in distance
calculations than features with low magnitudes. To suppress this effect, all features need
to be brought to the same level of magnitude. This can be achieved by scaling.
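As an illustration of scaling, the sketch below applies min-max scaling, which maps every
feature to the range 0 to 1. Min-max scaling is one common choice and is assumed here only
for the example; the feature values are made up.

```python
import numpy as np

# Hypothetical feature matrix (illustrative values only): columns are
# rainfall (mm/year), pesticides (tonnes), and average temperature (deg C).
features = np.array([
    [848.0, 121.0, 16.4],
    [1010.0, 95.0, 17.1],
    [657.0, 300.0, 22.3],
    [1200.0, 50.0, 19.0],
])

# Min-max scaling per column: (x - min) / (max - min).
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)
```

When the data is split into training and testing sets, the minimum and maximum should be
computed on the training set only and then reused for the test set, so that no information
from the test set leaks into training.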
3.2.4. Feature Selection
To build the model, a large general crop data set with agricultural metrics was taken.
Another dataset was taken as a feature dataset. The datasets were collected from the FAO and
World Data Bank databases. Initially the data has a size of 20 KB, and the prediction
parameters in this dataset include temperature, rainfall, pesticides, and harvested year.
The dataset covers a number of crops, such as wheat, rice, maize, and sorghum. A
number of values are available for each prediction parameter for a single crop.
For instance, for a crop like maize, each prediction parameter can take any of the values
available for maize in the dataset. The same holds for
all the crops available in the dataset.
Feature selection is the process of reducing a large number of features that describe the
dataset to the relevant ones, which also reduces computing complexity. Having a
relatively large number of features in a dataset might cause overfitting to the training
samples and result in poor generalization to new samples. However, the final dataset did not
require this because it was manually constructed, which avoided the presence of unnecessary
variables in the first place. Feature selection is often a necessary step before applying
learning algorithms. Reducing the attribute dimensionality leads to a more understandable
model and simplifies the use of different visualization techniques; it is the process of
identifying and removing as much irrelevant and redundant information as possible. Reducing
the dimensionality of the data may allow learning algorithms to operate faster and more
efficiently, and accuracy on future classification can be improved. It finds a minimum
set of attributes such that the resulting probability distribution of the data classes is as
close as possible to the original distribution.
The selection of the most relevant features plays a major
role in obtaining accurate predictions. By applying different feature selection algorithms,
such as sequential forward feature selection, correlation-based
feature selection, variance inflation factor, and random forest variable importance,
different feature subsets were selected. These feature
subsets were applied to the MLR model to find the best feature subset. The selected features
were area, number of open wells, number of tanks, canals length, and maximum
temperature during the season [58]. These features gave better prediction accuracy when
they were applied to the machine learning algorithms and the statistical model. The collected
four data sets were combined into a single data set.
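A minimal sketch of how a feature subset can be chosen with a linear regression model is
given below. It uses scikit-learn's SequentialFeatureSelector in forward mode on synthetic
data; the feature names and the data are assumptions made for the example, and this is not
the exact procedure of the cited study.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["rainfall", "pesticides", "avg_temp", "noise_a", "noise_b"]

# Synthetic data: the target depends only on rainfall and avg_temp.
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Forward selection: start with no features and repeatedly add the one
# that most improves the cross-validated score of the linear model.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
selector.fit(X, y)

selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
```

On this synthetic data the two informative features are selected; on real data the chosen
subset depends on the model, the scoring metric, and the number of features requested.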
The table below describes the dataset.
The training dataset is the initial dataset used to train ML algorithms to learn and produce
the right predictions. The test dataset, however, is used to assess how well the ML
algorithm is trained with the training dataset. You can’t simply reuse the training dataset
in the testing stage because the ML algorithm already is aware of the expected result, which
defeats the aim of testing the algorithm.
In this study, various partitions of the dataset into training and testing sets were
tried: 60% for training and 40% for testing, 70% for training and 30% for
testing, 80% for training and 20% for testing, and 90% for training and 10% for testing.
Once training is done, the prediction model is ready to be used
for prediction. During training, the prediction model learns
the overall patterns among the various inputs across different years and within each year
itself.
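The four partitions can be produced as in the following sketch, which uses scikit-learn's
train_test_split on a placeholder data set of 100 rows. The random_state value is an
arbitrary assumption that only makes the shuffling reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data set with 100 samples and 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

split_sizes = {}
for test_percent in (40, 30, 20, 10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_percent / 100, random_state=42
    )
    # Record how many samples ended up in each part.
    split_sizes[test_percent] = (len(X_train), len(X_test))
```

Each model can then be trained and evaluated once per partition, and the partition that
gives the best test performance can be reported.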
3.2.6. Training and Testing of the Algorithm
As mentioned previously, the data was split into two parts. One part of the
data was used for training the algorithm, while the other part was used for
testing it. The same cycle of training and testing on this
partitioned data set was followed for every one of the implemented
algorithms.
model can improve the decision-making process for the user. The final prediction of the
model was generated by running the best-performing algorithm on the selected parameters.
The above architecture shows how the components of the system communicate
with each other, starting from the preprocessing of data. The proposed framework is able to
predict the crop yield. The model gives a clear picture of the data capture
and of the preprocessing that removes unwanted data, such as NULL values.
During the preprocessing step, the dataset is split into training and testing datasets. The
training dataset is used to fit appropriate supervised
learning algorithms, which are then used to predict
crop yield for any new data. After data acquisition, suitable
machine learning algorithms must be applied to assess the efficiency and capability of the
model; here, various machine learning algorithms have been applied, such as random forest
regression, SVR, decision tree regression, and gradient boosting regression. Measurements
such as accuracy are calculated for the proposed model. The system architecture focuses
on three parts: the data flow, the machine learning techniques and modules for predicting
crop yield, and the feature selection modules.
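A minimal sketch of the train, test, and compare cycle for the four regressors named above
is given below. It uses scikit-learn on synthetic data; the data, the hyperparameters, and
the use of the R2 score as the comparison metric are assumptions made for the example, not
the configuration used in this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the crop data set.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 10.0 * X[:, 0] + 5.0 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svr": SVR(kernel="rbf", C=10.0),
    "decision_tree": DecisionTreeRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Fit every model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

best_model = max(scores, key=scores.get)
```

The model with the highest held-out score would then be the one used to generate the final
predictions.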
After cleaning and exploring the relationships among the features, the final data frame,
which carries all the features to be used in the prediction process, can be seen below
in the screenshots:
Area: country of production.
Item: type of crop.
Year: year of production.
Average_rain_fall_mm_per_year: Average amount of rain recorded that year.
Hg/ha_yield: country’s yearly production of the crop that year.
Pesticides_tonnes: Amount of pesticides used on the crop that year.
Avg_temp: Average temperature recorded for that year.
3.4. Models Under Consideration for Crop Yield Prediction
Research on crop yield prediction needs to consider multiple factors of production and
different algorithms. Some of the algorithms are used for finding the best feature
subset for better prediction, and others are used for the prediction itself. Multiple
algorithms were used in the current study so that their performance could be compared. It
has long been recognized that the generation of empirical models to estimate the crop yield
is an important responsibility for the remote sensing community [59].
Machine learning is an essential decision-support tool for crop yield prediction,
including supporting decisions on what crops to grow and what to do during the growing
season of the crops. Regression is a supervised machine learning approach that
is important for prediction from labeled data. It predicts continuous values,
which makes it important in crop yield prediction. Many machine learning algorithms have
been used for crop yield prediction by numerous researchers. Commonly used models for
crop yield prediction are random forest regression, decision tree regression, support vector
machine (SVM), and gradient boosting regression.
Random forest is a supervised learning algorithm that is used for both classification and
regression, although it is mainly used for classification problems. Just as a forest is
made up of trees, and more trees make a more robust forest, a random forest algorithm
builds decision trees on samples of the data, obtains a prediction from each of them, and
finally selects the best solution by voting. It is an ensemble method that improves on a
single decision tree because it reduces over-fitting by averaging the results [61].
As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and predicts the final output based on the
majority vote of those predictions [62].
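For regression, the combination step is an average rather than a vote. The sketch below illustrates this with scikit-learn on synthetic data; the data, the number of trees, and the variable names are assumptions made for the example and are not taken from the study.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the crop-yield features (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Each fitted tree makes its own prediction ...
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
# ... and the forest's regression output is the mean over the trees.
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X)))
```

Averaging many trees, each trained on a different bootstrap sample, is what reduces the variance of a single deep tree.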
Ensemble learning is a method that combines predictions from more than one machine
learning algorithm to make a more accurate prediction than a single model. The
regression algorithm is used in both modules: predicting the crop yield rate and
predicting the crop [63]. The working of the random forest algorithm can be understood
with the help of the following diagrams.
A decision tree is a tree-structured model with three types of nodes. The root node is the
main node that represents the entire sample and can be split into further nodes. The
internal nodes represent the features of the data set, and the branches represent the
decision rules. Finally, the leaf nodes represent the outcome. This algorithm is very
useful for solving decision-making problems [64].
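The three node types can be inspected on a fitted tree. The following is a small sketch using scikit-learn's DecisionTreeRegressor on synthetic data; the data and the depth limit are assumptions chosen only to keep the tree small.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the target depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

structure = tree.tree_
# In scikit-learn's tree arrays, a leaf has no children (child index -1).
is_leaf = structure.children_left == -1
n_leaves = int(is_leaf.sum())
# Every non-leaf node applies a decision rule; node 0 is the root.
n_internal = structure.node_count - n_leaves

print("nodes:", structure.node_count, "internal:", n_internal, "leaves:", n_leaves)
```

Each leaf stores the mean target value of the training samples that reach it, which is the value returned as the regression output.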
The performance of the model is determined by comparing the actual values with the
estimated values at the final stage. By comparing these values, the accuracy of the model
can be estimated. Plotting the values and inspecting the results also helps to assess
the accuracy of the model [65].
Figure 7. Working of decision tree regression algorithm (Source: Adapted from [64])
Support Vector Machine (SVM) is one of the most popular supervised learning algorithms
and is used for classification as well as regression problems. However, it is primarily
used for classification problems in machine learning.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is called a Support
Vector Machine. Consider the diagram below, in which different classes are separated
using a decision boundary or hyperplane [67].
SVMs have proven effective on all types of data, including tabular, text, and image
data. SVMs are known to work well even with a small number of training samples, to scale
well to high-dimensional spaces, and to show state-of-the-art performance in many
problems in the biomedical domain [68].
Support Vector Machines (SVMs) are a popular set of related supervised learning
techniques for data analysis and pattern detection, used for classification and regression
analysis. Methods vary in the structure and characteristics of the classifier. The most
common is the linear SVM classifier, which assigns each input to one of two possible
classes. More precisely, a support vector machine constructs a hyperplane or a set of
hyperplanes. The points closest to the class margin are called support vectors. The SVM’s
goal is to maximize the margin between the hyperplane and the support vectors [69].
Support vector machines are very popular, and many consider them the best off-the-shelf
classifier. Furthermore, there is a wide selection of environments and toolboxes that
implement SVMs. For these reasons, we chose to apply SVMs to the problem of classifying
infeasible test cases.
Gradient boosting is one of the most powerful algorithms in the field of machine
learning. The errors of machine learning algorithms are broadly classified into two
categories: bias error and variance error. As gradient boosting is one of the boosting
algorithms, it is used to minimize the bias error of the model. The working process of the
gradient boosting algorithm is summarized by the diagram shown below
[71].
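The way boosting reduces bias can be sketched by hand for the squared-error loss: each stage fits a shallow tree to the current residuals and adds a damped copy of it to the running prediction. The data, learning rate, tree depth, and number of stages below are assumptions for illustration; library implementations such as scikit-learn's GradientBoostingRegressor add further refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # stage 0: a constant model
mse_per_stage = [float(np.mean((y - prediction) ** 2))]

for _ in range(50):
    # For squared error, the negative gradient is the residual
    # (up to a constant factor).
    residuals = y - prediction
    weak_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a damped correction from the weak learner.
    prediction += learning_rate * weak_tree.predict(X)
    mse_per_stage.append(float(np.mean((y - prediction) ** 2)))

print(mse_per_stage[0], mse_per_stage[-1])
```

The training error shrinks stage by stage because each weak learner corrects part of what the previous stages left unexplained.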
Chapter Four
Result and Discussion of the Study
The Jupyter Notebook is an open-source web application that lets you create and share
documents that contain live code, equations, visualizations, and narrative text. Uses
include data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more. The Jupyter Notebook project evolved from
the IPython Notebook, which was developed primarily to enhance the default Python
interactive console by enabling scientific operations and advanced data analytics
through shareable web documents. Jupyter Notebooks work with what is called a
two-process model based on a kernel-client infrastructure. This model applies a concept
similar to the Read-Evaluate-Print Loop (REPL) programming environment, which takes a
single user’s inputs, evaluates them, and returns the result to the user.
Agricultural yield depends mainly on climate conditions (rainfall and temperature) and
pesticides, and accurate information about the history of crop yield is essential for
making decisions related to agricultural risk management and future predictions. The
basic staples that sustain people are similar. In this study, the yields of the top four
cereal crops are predicted by applying different machine learning techniques. These crops
are maize, rice, sorghum, and wheat.
The table above displays a small part of the crop yield dataset for different types of
crop in different years.
4.2.2. Climate Data
The climatic factors considered are rainfall and temperature. Together with pesticides
and soil, they are abiotic environmental factors that influence plant growth and
development. Rainfall has a dramatic effect on agriculture. For this project, yearly
rainfall information was gathered from the World Data Bank repository.
Year Average rainfall (mm per year)
1963 910.08
1964 943.97
1965 749.42
1966 847.97
1967 1082.36
1968 892.22
1969 777.65
1970 817.73
1971 821.9
1972 842.86
1973 755.97
1974 794.31
1975 901.29
1976 896.37
1977 964.92
1978 772.47
1979 772.21
1980 704.63
1981 794.49
1982 928.02
1983 837.77
1984 629.57
The average temperature for each country was collected from the World Data Bank
repository. The average temperature series starts in 1901 and ends in 2020, with some
empty rows that have to be dropped.
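Dropping the empty rows can be done with pandas. The snippet is a minimal sketch: the column names are hypothetical and the values are placeholders rather than the actual World Data Bank records.

```python
import numpy as np
import pandas as pd

# Placeholder records (NOT the real data); NaN marks an empty row.
temperature = pd.DataFrame({
    "year": [1901, 1902, 1903, 1904],
    "avg_temp": [22.0, np.nan, 22.1, np.nan],
})

# Keep only the rows that actually carry a temperature value.
cleaned = temperature.dropna(subset=["avg_temp"]).reset_index(drop=True)
print(cleaned)
```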
Table 5. Sample temperature dataset
Year Average temperature (°C)
1961 21.98
1962 22.03
1963 22.23
1964 21.82
1965 22.21
1966 22.31
1967 21.82
1968 21.83
1969 22.55
1970 22.48
1971 21.97
1972 22.42
1973 22.8
1974 22.16
1975 22.28
1976 22.59
1977 22.58
1978 22.61
1979 22.75
1980 23.04
1981 22.39
4.2.3. Pesticides Data
Data on the pesticides used for each item and country were also collected from the FAO
database.
To explore the relationships between the columns of the data frame, a quick way to check
them is to view the correlation matrix as a heatmap.
The correlation between all the features has been calculated and illustrated with a
diverging color heatmap.
It is evident from the heatmap above that the variables show no notable linear
correlation with one another across the columns of the data frame.
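The correlation matrix behind such a heatmap can be computed with pandas. This sketch uses randomly generated columns as stand-ins for the real features, so the numbers it prints are not the study's results.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-ins for the real feature columns (assumption).
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "Average_rain_fall_mm_per_year": rng.normal(850, 100, 200),
    "Pesticides_tonnes": rng.normal(100, 20, 200),
    "Avg_temp": rng.normal(22, 0.5, 200),
})

# Pairwise Pearson correlation (the pandas default).
corr = features.corr()
print(corr.round(2))
```

The resulting matrix can be passed to a plotting function such as seaborn's `heatmap` with a diverging color map to obtain the figure. Note that a correlation near zero only rules out a linear relationship between two columns.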
4.4. Model Comparison & Selection
Before deciding on the algorithm to use, we must first evaluate, compare, and select the
one that best suits this particular set of data. Usually, when working on a machine
learning problem with a given set of data, we try different models and techniques to
solve the optimization problem and adopt the most appropriate model, one that neither
overfits nor underfits. We compare the following models for this project:
Gradient Boosting Regressor
Random Forest Regressor
Support vector machines (SVM)
Decision Tree Regressor
From the results above, for an 80/20 train/test data split, the Gradient Boosting
Regressor has the highest R2 score of 93.2%; decision tree regression comes second.
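A comparison of this kind can be sketched with scikit-learn as follows. The data are synthetic and the models use default hyperparameters, both of which are assumptions; the scores printed here are therefore unrelated to the 93.2% reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the crop-yield data frame.
X, y = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=0)

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
    "Support vector machine (SVR)": SVR(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R2 = {score:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair, since no model is evaluated on data it was trained on.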
The comparison of the models is shown graphically below, using an 80/20 train/test data
split.
Adjusted R2 is also calculated. It indicates how well terms fit a curve or line but
adjusts for the number of terms in the model. If more and more useless variables are
added to the model, adjusted R-squared will decrease. If more useful variables are added,
adjusted R-squared will increase. Adjusted R2 will always be less than or equal to R2.
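The adjustment follows the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. The helper below is a small sketch of that formula; the function name and the sample size in the example call are hypothetical.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("n_samples must exceed n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / denominator


# Example call with a hypothetical sample size of 100 and 7 predictors.
print(round(adjusted_r2(0.932, n_samples=100, n_features=7), 4))
```

Because the penalty factor (n - 1)/(n - p - 1) is at least 1, the adjusted value can never exceed R2, and it moves further below R2 as predictors are added without improving the fit.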
The image above shows the goodness of fit of the predictions, visualized as a line. It
can be seen that the R-squared score is excellent. This means that we have found a
well-fitting model to predict the crop yield value for each year.
The most common interpretation of R-squared is how well the regression model fits the
observed data. For example, an R-squared of 60% indicates that the regression model
explains 60% of the variance in the observed data. Generally, a higher R-squared indicates
a better fit for the model. From the results obtained, it is clear that the model fits the
data to a very good degree, with an R-squared of 93.2%.
Feature importance is calculated as the decrease in node impurity weighted by the
probability of reaching that node. The node probability can be calculated as the number
of samples that reach the node divided by the total number of samples. The higher the
value, the more important the feature. The seven most important features for the model
were then extracted.
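Impurity-based importances of this kind are exposed by scikit-learn's tree models through the `feature_importances_` attribute. The sketch below ranks them for a decision tree fitted on synthetic data; the feature names, the data, and the choice of a single tree are assumptions for illustration and do not reproduce the study's model.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with hypothetical feature names; only two features
# actually drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(300, 9)),
    columns=[f"feature_{i}" for i in range(9)],
)
y = 5.0 * X["feature_0"] + 2.0 * X["feature_3"] + rng.normal(0, 0.1, 300)

model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
top7 = importances.sort_values(ascending=False).head(7)
print(top7)
```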
The crop being maize has the highest importance in the model's decision-making, as maize
is the highest-yielding crop in the dataset. The crop being rice comes next; then, as
expected, we see the effect of pesticides, followed by rainfall and temperature. The
initial assumption about these features was correct: all of them significantly affect the
expected crop yield in the model. The boxplot shows the yield for each item. Maize is the
highest, followed by rice, wheat, and sorghum.
Chapter Five
Conclusion and Recommendation
5.1. Conclusion
Research on agriculture is among the areas to which the government gives the most
attention, because the Ethiopian economy is highly dependent on it. Since cereal crop
production dominates other types of crop production, contributing more than 71% of total
crop production, this thesis focuses on cereal crop yield and the effects of different
parameters on the production of such crops. To improve crop yield prediction, machine
learning techniques were implemented and analyzed for cereal crop yield prediction in
Ethiopia. Predicting the size of the crop can influence on-farm decisions, such as how
much pesticide is needed, and can help farmers carefully plan maintenance and labor
schedules to be ready for the start of the harvesting season. For crop yield prediction,
the climate factors (temperature and rainfall) and the amount of pesticides used during
harvesting had different impacts.
Developing accurate models for cereal crop yield estimation using machine learning
techniques may help farmers and other stakeholders improve decision-making in relation
to national food revenue and food security. The purpose of this study is to address the
problem of inaccurate crop yield prediction faced by farmers and governments. For the
experiments in this study, the datasets were collected from FAO and the World Data Bank.
The data were then preprocessed to make them more understandable and were used to build
the machine learning models. There are four sets of data: temperature, rainfall,
pesticide, and crop yield data sets. Based on our dataset, the model was developed by
using four data preprocessing techniques. The prediction of cereal crop yield is based
primarily on the dataset and on the implementation of the algorithms. Each dataset was
analyzed according to the parameters that affect crop yield prediction.
The cereal crop yield prediction experiments were done by applying different machine
learning algorithms: random forest regression, decision tree regression, gradient
boosting regression, and support vector machine (SVM). The gradient boosting regression
model predicted better than the others when compared using R2.
Each model was evaluated using cross-validation techniques, because the data are limited
and overfitting must be reduced. The gradient boosting regression model achieved better
test accuracy on our dataset than the random forest regression, decision tree regression,
and SVM algorithms, with an R2 score of 93.2%. The study also found that the parameters
that most strongly affect crop production are pesticides and the crop being maize.
This study will help to reduce the problems faced by farmers and will serve as an
intermediary that provides farmers with the information they need to maximize their
profits. This study shows that machine learning algorithms are important in the
agricultural sector for yield prediction, species management, field condition
management, crop management, and livestock management.
5.2. Recommendation
In this work, different machine learning techniques were presented and implemented
to analyze and improve the predictive ability of ML algorithms. From the analysis of the
results, it is concluded that the Gradient Boosting Regressor method achieved good
results for cereal crop yield prediction. However, many areas require additional work.
Since agriculture is the main source of food, further research should consider more
machine learning algorithms, AI, and deep learning technology for crop prediction,
additional parameters that affect crop production, and different crop types and seasonal
crops. Since the current work covered only basic cereal crops and rainy seasons, future
work should take into account different seasons, climate factors such as humidity and
wind, and all other factors that affect crops. In future work, various other factors can
be considered to improve the prediction and reduce the error rate.
Finally, future work on this topic would be more applicable if it included mobile and
cloud computing applications to better support the agriculture industry. Beyond crop
yield prediction, machine learning techniques can also be applied to other agricultural
issues such as crop disease detection, weed detection, seed classification, irrigation
management, soil classification, and weather prediction.
References
[3] Central Statistical Agency (CSA), "Agricultural sample survey: Report on area and
production of major crops," Addis Ababa, 2019.
[4] Central Statistical Agency (CSA), "Agricultural sample survey: Report on area and,"
Addis Ababa, 2017.
[5] Zhong, L., Hu, L., Zhou, H., "Deep learning based multi-temporal crop classification,"
Remote Sens. Environ., vol. 221, pp. 430–443, 2019.
[6] Rossana MC, L. D., "Prediction Model Framework for Crop Yield Prediction," in
Asia Pacific Industrial Engineering and Management Society Conference
Proceedings, Cebu, Philippines, 2013.
[7] You, J., Li, X., Low, M., Lobell, D., Ermon, S., "Deep Gaussian process for crop
yield prediction based on remote sensing data," in Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence, 2017.
[8] Basso, B., Liu, L., "Seasonal crop yield forecast: methods, applications, and
accuracies," Advances in Agronomy, vol. 154, pp. 201–255, Elsevier, 2019.
[9] Chipanshi, A., Zhang, Y., Kouadio, L., Newlands, N., Davidson, A., Hill, H.,
Warren, R., Qian, B., Daneshfar, B., Bedard, F., et al., "Evaluation of the integrated
Canadian crop yield forecaster (ICCYF) model for in-season prediction of crop yield
across the Canadian agricultural landscape," Agricultural and Forest Meteorology,
vol. 206, pp. 137–150, 2015.
[10] Fischer, R., "Definitions and determination of crop yield, yield gaps, and of rates of
change," Field Crop Res., vol. 182, pp. 9–18, 2015.
[11] C. Ozer., "Research on Machine Learning Methods and Its Applications," Real-
World Applications and Research, no. Machine Learning: Algorithms, 2018.
[12] Lee JY, Ahn S, Kim D., "Deep learning-based prediction of future growth potential
of technologies," PLoS ONE, 2021.
[13] Kumar, V., & Garg, M.L., "Predictive Analytics: A Review of Trends and
Techniques," International Journal of Computer Applications,, 2018.
[14] IPCC (Intergovernmental Panel on Climate Change), "Climate change: The scientific
basis," 2007.
[15] World Bank, "Economics of adaptation to climate change study," World Bank,
Washington DC, 2008.
[16] Alemayehu N., Masafu M., Ebro A., Tegegne A., Gebru G., "Climate Change and
Variability in the Mixed Crop/Livestock Production Systems of Central Ethiopian
Highland," in Handbook of Climate Change Resilience, Springer International
Publishing, 2018, pp. 1-24.
[17] Francis Ndamani, Tsunemi Watanabe, "Influences of rainfall on crop production and
suggestions for adaptation," International Journal of Agricultural Sciences I, vol. 5
(1), 2015.
[18] Abate, T., Shiferaw, B., Menkir, A. et al., "Factors that transformed maize
productivity in Ethiopia," Food Sec., vol. 7, pp. 965–981, SpringerLink, 2015.
[19] Jansen, K., & Dubois, M., "Local pesticide governance by disclosure: Prior informed
consent and the Rotterdam convention," in Transparency in Environmental Governance,
MIT Press, pp. 107-131, 2014.
[20] Abhinav Sharma, Arpit Jain, Prateek Gupta, Vinay Chowdary, "Machine Learning
Applications for Precision Agriculture: A Comprehensive Review," in IEEE Access,
vol. 9, pp. 4843-4873, 2021.
[21] Uno, Y., Prashera, S.O., Lacroix, R., Goela, P.K., Karimia, Y., Viauc, A., & Patel
R.M., "Artificial neural networks to predict corn yield from compact airborne
spectrographic imager data," Computers and Electronics in Agriculture, vol. 47,
pp. 149–161, 2005.
[22] Priya, P., Muthaiah, U., & Balamurugan, M., "Predicting yield of the crop using
machine learning algorithm," International journal of engineering sciences &
research technology, vol. 7(4), pp. 1-7, 2018.
[23] Witten, I. H., Frank, E. and Hall M., "Data mining: Practical machine learning tools
and techniques," vol. edition, San Francisco, Morgan Kaufmann, 2005.
[24] Russell, Stuart J.; Norvig, Peter, "Artificial Intelligence: A Modern Approach," vol.
Third ed., 2010.
[25] Van Otterlo, M.; Wiering, M., "Reinforcement learning and markov decision
processes. Reinforcement Learning. Adaptation, Learning, and Optimization," 2012.
[27] S. Bhanumathi, M. Vineeth, and N. Rohit, "Crop yield prediction and efficient use of
fertilizers," in Proc. Int. Conf. Commun. Signal Process, Chennai, India, April 2019.
[28] Chlingaryan, A., Sukkarieh, S., & Whelan, B., "Machine learning approaches for
crop yield prediction and nitrogen status estimation in precision agriculture,"
Computers and Electronics in Agriculture, vol. 151, pp. 61-69, 2018.
[29] Sarker, I.H., "Machine Learning: Algorithms, Real-World Applications and Research
Directions.," SN COMPUT. SC, vol. 2, 2021.
[30] R. Ghadge, J. Kulkarni, P. More, S. Nene and R. L. Priya, "Prediction of crop yield
using machine learning," Int. Res. J. Eng. Technology, vol. 5, 2018.
[31] Lobell DB, Burke MB., "On the use of statistical models to predict crop yield
responses to climate change," Agricultural and Forest Meteorology, pp. 1443–52, 2010.
[32] Tamari, S., Wosten, J. and Ruiz-Suárez, J, "Testing an artificial neural network for
predicting soil hydraulic conductivity," Soil Science Society of America Journal, vol.
60(6), pp. 1732–1741, 1996.
[33] Abrougui, K., Gabsi, K., Mercatoris, B., Khemis, C., Amami, R. and Chehaibi, S.,
"Prediction of organic potato yield using tillage systems and soil properties by
artificial neural network (ann) and multiple linear regressions (mlr)," Soil and
Tillage Research, vol. 190, pp. 202–208, 2019.
[34] Sarmadian, F., Mehrjardi, R. T., Akbarzadeh, A. et al., "Modeling of some soil
properties using artificial neural network and multivariate regression in gorgan
province," Australian Journal of Basic and Applied Sciences, vol. 3(1), p. 323–329,
2009.
[35] Bocco, M., Willington, E., Arias, M. et al. , "Comparison of regression and neural
networks models to estimate solar radiation," Chilean Journal of Agricultural
Research, vol. 70(3), p. 428–435, 2010.
[36] Zaefizadeh, M., Jalili, A., Khayatnezhad, M., Gholamin, R. and Mokhtari, T.,
"Comparison of multiple linear regressions (mlr) and artificial neural network (ann)
in predicting the yield using its components in the hulless barley," Advances in
Environmental Biology, 2011.
[39] Gopal, P. M., & Bhargavi, R., "A novel approach for efficient crop yield prediction,"
Computers and Electronics in Agriculture, 2019.
[40] Shastry, K. A., & Sanjay, H. A., "Cloud-Based Agricultural Framework for Soil
Classification and Crop Yield Prediction as a Service," in Emerging Research in
Computing, Information, Communication and Applications, pp. 685-696, 2019.
[41] Pavan Patil, Virendra Panpatil, Prof. Shrikant Kokate, "Crop Prediction System using
Machine Learning Algorithms," International Research Journal of Engineering and
Technology (IRJET) , vol. 07 , no. 02, 2020.
[42] Islam, T., Chisty, T. A., & Chakrabarty, A., "A Deep Neural Network Approach for
Crop Selection and Yield Prediction in Bangladesh," in IEEE Region 10, Bangladesh,
2018.
[43] Feng, P., Wang, B., Li Liu, D., Xing, H., Ji, F., Macadam, I., ... & Yu, Q., "Impacts
of rainfall extremes on wheat yield in semi-arid cropping systems in eastern
Australia," Climatic Change, vol. 147(3-4), pp. 555-569, 2018.
[44] Prakash, S., Sharma, A., & Sahu, S. S., "Soil Moisture Prediction Using Machine
Learning.," in 2018 Second International Conference on Inventive Communication
and Computational Technologies (ICICCT),, 2018.
[45] Giritharan Ravichandran, Koteeshwari R S., "Agricultural Crop Predictor and
Advisor using ANN for Smart phones," IEEE, 2016.
[46] Snehal S.Dahikar, Dr.Sandeep V.Rode, " Agricultural Crop Yield Prediction Using
Artificial Neural Network Approach," International Journal Of Innovative Research
In Electrical, Electronics, Instrumentation And Control Engineering, vol. 2(1), pp.
683-686., 2014.
[47] Rakesh Kumar, M.P. Singh, Prabhat Kumar, J.P. Singh,, "Crop Selection Method to
Maximize Crop Yield Rate using Machine Learning Technique," in International
Conference on Smart Technologies and Management for Computing
Communication, Controls, Energy and Materials (ICSTM), Vel Tech Rangarajan Dr.
Sagunthala R&D Institute of Science and Technology, Chennai, T.N., India, May
2015.
[48] Arun Kumar, Naveen Kumar and Vishal Vats, "Efficient crop yield prediction using
machine learning algorithms," International Research Journal of Engineering and
Technology (IRJET), vol. 05, pp. ISSN: 2395-0072, 2018.
[49] Aakunuri Manjula and Dr. G.Narsimha, "Crop Yield Prediction with Aid of Optimal
Neural Network in Spatial Data Mining," New Approaches, International Journal of
Information & Computation Technology ISSN 09742239 , vol. 6(1), pp. 25-33, 2016.
[51] N. Gandhi, L. J. Armstrong, O. Petkar and A. K. Tripathy, "Rice crop yield
prediction in India using support vector machines," in 13th International Joint
Conference on Computer Science and Software Engineering (JCSSE), Khon Kaen,
2016.
[53] S. Veenadhari, B. Misra and C. Singh, " Machine learning approach for forecasting
crop yield based on climatic parameters," in International Conference on Computer
Communication and Informatics, Coimbatore, 2014.
[54] CH. Vishnu Vardhan chowdary, Dr.K.Venkataramana, "Tomato Crop Yield
Prediction using ID3," IJIRT, vol. 4, no. 10 , pp. 663-62, March 2018.
[55] Jun Wu, Anastasiya Olesnikova, Chi- Hwa Song, Won Don Lee, "The Development
and Application of Decision Tree for Agriculture Dat," IITSI, pp. 6-20, 2009.
[56] R. Sujatha and P. Isakki, " A study on crop yield forecasting using classification
techniques," in 2016 International Conference on Computing Technologies and
Intelligent Data Engineering (ICCTIDE'16), Kovilpatti, 2016.
[57] Ahmad, F. K., et al, " Daily stream flow prediction on time series forecasting.,"
Journal of Theoretical and Applied Information Technology, vol. 95(4), no. ISSN:
1992-8645 and E-ISSN: 1817-3195, 28th February 2017.
[58] Maya Gopal, P.S., Bhargavi, R., " Optimum Feature subset for optimizing crop yield
prediction using filter and wrapper approaches," Appl. Eng. Agri., vol. 35 (1), pp. 9-
14, 2019a.
[59] Kind, M.C., Brunner, R.J., "TPZ: Photometric redshift PDFs and ancillary
information by using prediction trees and random forests," Monthly Notices of the
Royal Astronomical Society, 2013.
[61] Sadeh, I., Abdalla, F.B., Lahav, O.,, "Photometric redshift and probability
distribution function estimation using machine learning," in Publications of the
Astronomical Society of the Pacific, 2016.
[62] Tolles, Juliana; Meurer, William J., " Logistic Regression Relating Patient
Characteristics to Outcomes," JAMA, 2016.
[64] A. M. Ahmed, A. Rizaner and A. H. Ulusoy, "A Decision Tree Algorithm Combined
with Linear Regression for Data Classification," in 2018 International Conference
on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), 2018.
[65] Priyama A, Abhijeeta RG, Ratheeb A, Srivastavab S. , "Comparative analysis of
decision tree classification algorithms," International Journal of Current
Engineering and Technology, 2013.
[66] Achirul Nanda, M., Boro Seminar, K., Nandika, D. and Maddu, A., "A comparison
study of kernel functions in the support vector machine and its application for termite
detection," vol. 9(1), 2018.
[67] Dixon, B. and Candade, N., "Multispectral land use classification using neural
networks and support vector machines: one or the other, or both?," International
Journal of Remote Sensing, 2008.
[69] Guo, Y., Yin, X., Zhao, X., Yang, D., Bai, Y., "Hyperspectral image classification
with SVM and guided filter," EURASIP Journal on Wireless Communications and
Networking, 2019.
Appendix
Predicting Crops Yield: A Machine Learning Approach
In this project, the yields of the top four most consumed cereal crops are predicted by
applying machine learning techniques. These crops are maize, rice, sorghum, and
wheat.
1. Part One: Gathering & Cleaning Data
Importing required libraries
Importing data and the checking for null values, merging and dropping unwanted
Feature selection
Scaling Features:
Looking at the dataset above, it contains features that vary widely in magnitude,
units, and range. To suppress this effect, we need to bring all features to the
same level of magnitude. This can be achieved by scaling.
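One common way to do this is min-max scaling, which maps every feature to the range [0, 1]. The text does not name the scaler that was used, so the choice of scikit-learn's MinMaxScaler is an assumption. The rainfall and temperature values come from the sample tables; the pesticide values are placeholders.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: rainfall (mm), pesticides (tonnes, placeholder values), temperature.
X = np.array([
    [910.08, 100.0, 22.23],
    [943.97, 250.0, 21.82],
    [749.42, 175.0, 22.21],
])

scaler = MinMaxScaler()
# Each column is rescaled independently to the range [0, 1].
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```

In practice the scaler is often fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.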
Training Data:
The common train/test splits are 70/30 or 80/20. The training dataset is the initial
dataset used to train an ML algorithm to learn and produce correct predictions.
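An 80/20 split can be sketched with scikit-learn's `train_test_split`. The arrays below are dummy data and the random seed is arbitrary; both are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy feature matrix and target with 100 samples (assumption).
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```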
Before deciding on an algorithm to use, we first need to evaluate, compare, and choose the
one that best fits this specific dataset. For this project, we compare the following
models: Gradient Boosting, Random Forest, SVM, and Decision Tree Regressor.
From the results above, the Gradient Boosting Regressor has the highest R² score of
93%; the Decision Tree Regressor comes second with 88%.
To show the level of feature importance