ADDIS ABABA SCIENCE AND TECHNOLOGY UNIVERSITY

CEREAL CROP YIELD PREDICTION USING

MACHINE LEARNING TECHNIQUES

IN ETHIOPIA

By

BELETE ASMARE ASSEFA

A Thesis Submitted in Partial Fulfilment of the Requirements for the Award of the
Degree of Master of Science in Software Engineering

to

DEPARTMENT OF SOFTWARE ENGINEERING

COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING

FEBRUARY 2022
Approval Page
This is to certify that the thesis prepared by Mr. Belete Asmare Assefa entitled “Cereal
Crop Yield Prediction Using Machine Learning Techniques in Ethiopia” and
submitted as partial fulfillment for the award of the Degree of Master of Science in
Software Engineering complies with the regulations of the university and meets the
accepted standards with respect to originality, content, and quality.

Signed by Examining Board:

Declaration
I hereby declare that this thesis entitled “Cereal Crop Yield Prediction Using Machine
Learning Techniques in Ethiopia” was prepared by me, with the guidance of my advisor.
The work contained herein is my own except where explicitly stated otherwise in the text,
and this work has not been submitted, in whole or in part, for any other degree or
professional qualification.

Author Signature, Date:

Belete Asmare Assefa

Witnessed by:

Name of student advisor: Signature, Date:

Kula Kekeba (PhD)

Name of student co-advisor: Signature, Date:

………………………………………… ………………………………………

Abstract
Agriculture plays an important role in Ethiopia's economy. About 85% of the population
lives in rural areas, and their livelihood is largely based on crop productivity. Crop
selection depends on several parameters such as market price, production rate, climate
data, chemicals, and government policies. Prediction of crop yields is important for
planning and for making various policy decisions. Many countries whose economies depend
on agriculture, like Ethiopia, use conventional data collection techniques for crop
monitoring and yield prediction. The purpose of this study is to develop a cereal crop
yield prediction model based on agricultural input data. To this end, appropriate machine
learning techniques have been identified and applied to predict cereal crop yields based
on agricultural inputs. To build the prediction model, the collected raw data were
pre-processed and merged based on common features. After merging, the final dataset is
represented by the following fields: the year, the item (crop), the yield value, the
average rainfall, the pesticides, and the average temperature. The data initially had a
size of 20 kilobytes and 12 features. After feature importance analysis, the data were
reduced to 7 features and 8 kilobytes to develop the prediction model. For the
experimental analysis we used Gradient Boosting Regression, Random Forest Regression,
Support Vector Machine, and Decision Tree Regression. We compared the performance of each
algorithm experimentally using different train/test split ratios. Finally, among the
listed algorithms, Gradient Boosting Regression outperforms the other standard
algorithms, showing 93% accuracy in crop yield prediction.

Keywords: Cereal crop, regression algorithm, machine learning, dataset, yield prediction

Acknowledgements
First of all, I want to thank the Lord of all creation for all that He has done for us and
for bringing me to this point. Next, I would like to express my special thanks to my
advisor, Dr. Kula Kekeba, for his constructive and concrete advice from the beginning to
the end of the thesis research and for his intellectual guidance, which has motivated me
throughout my work.

Furthermore, I would like to thank my beloved wife, Kasayenesh Nigussie, for her
motivation and contribution, not only during the research but also during the class
sessions. I would like to thank my former teacher, Dr. Sudhir Kumar Mohapatra, and my
beloved friends at my workplace, the Information Network Security Agency (INSA):
Semahegn, Amsalu, and others whose names are not listed here, for their support during
the research.

I would also like to thank all those, too many to list, who contributed, cooperated, and
assisted directly or indirectly in acquiring the necessary data. In addition, I would
like to thank the academic staff members and the whole Department of Software
Engineering of Addis Ababa Science and Technology University.

Table of Contents

Approval Page
Declaration
Abstract
Acknowledgements
Table of Contents
Abbreviations and Acronyms
List of Tables
List of Figures

Chapter One
Introduction
1.1. Background of the Study
1.2. Motivation of the Study
1.3. Statement of the Problem
1.4. Research Questions
1.5. Research Objectives
1.5.1. General Objective
1.5.2. Specific Objectives
1.6. Scope and Limitation of the Study
1.6.1. Scope of the Study
1.6.2. Limitation of the Study
1.7. Significance of the Study
1.8. Contribution of the Study
1.9. Thesis Organization

Chapter Two
Literature Review
2.1. Introduction
2.2. Factors Affecting Crop Production
2.2.1. Impact of Climate Change on Crop Production
2.2.2. Temperature
2.2.3. Rainfall
2.2.4. Pesticides
2.3. Machine Learning
2.3.1. Types of Machine Learning
2.4. Machine Learning Applications and Techniques in Agriculture
2.4.1. Species Management
2.4.2. Field Conditions Management
2.4.3. Crop Management
2.4.4. Livestock Management
2.5. Crop Yield Prediction Using MLR
2.6. Review of Related Works

Chapter Three
Methodology
3.1. Introduction
3.2. Workflow
3.2.1. Data Collection
3.2.2. Data Pre-Processing
3.2.3. Feature Normalization (Scaling)
3.2.4. Feature Selection
3.2.5. Data Splitting
3.2.6. Training and Testing of the Algorithm
3.2.7. Performing Cross-Validation
3.2.8. Performance Evaluation of Models
3.2.9. Generate Final Prediction
3.3. Model Design
3.4. Models Under Consideration for Crop Yield Prediction
3.4.1. Random Forest Regression
3.4.2. Decision Tree Regression
3.4.3. Support Vector Machines (SVM)
3.4.4. Gradient Boosting Regression

Chapter Four
Result and Discussion of the Study
4.1. Design and Implementation
4.1.1. Hardware and Software Requirements
4.2. Data Gathering and Cleaning
4.2.1. Crops Yield Data
4.2.2. Climate Data
4.2.3. Pesticides Data
4.3. Data Exploration
4.4. Model Comparison & Selection
4.4.1. Result of Evaluation Metrics
4.5. Model Results and Conclusions

Chapter Five
Conclusion and Recommendation
5.1. Conclusion
5.2. Recommendation

References
Appendix

Abbreviations and Acronyms

AI Artificial Intelligence
ANN Artificial Neural Networks
CCKP Climate Change Knowledge Portal
CPU Central Processing Unit
CSA Central Statistical Agency
CYP Crop Yield Prediction
DNN Deep Neural Network
FAO Food and Agriculture Organization
GB Gigabyte
GDP Gross Domestic Product
ICT Information Communication Technology
IPython Interactive Python
IT Information Technology
KNN K Nearest Neighbor
Matlab MATrix LABoratory
ML Machine Learning
MLR Multiple Linear Regression
NCPB National Cereals and Produce Board
pH Potential of Hydrogen
REPL Read-Evaluate-Print Loop
RF Random Forest
SVM Support Vector Machine
WEKA Waikato Environment for Knowledge Analysis

List of Tables
Table 1. Summary of studies and their findings
Table 2. Features used for cereal crop yield in the area
Table 3. Sample crop yield dataset
Table 4. Sample rainfall dataset
Table 5. Sample temperature dataset
Table 6. Sample pesticides dataset
Table 7. R² result summary of different train/test values

List of Figures
Figure 1. Predictive analysis process
Figure 2. Machine learning process
Figure 3. Types of machine learning
Figure 4. Workflow of the research
Figure 5. Block diagram for model design
Figure 6. Working of random forest algorithm
Figure 7. Working of decision tree regression algorithm
Figure 8. Working of SVM
Figure 9. Working of gradient boosting algorithm
Figure 10. Correlation map in the dataframe
Figure 11. Model comparison
Figure 12. Actual vs predicted yield

Chapter One
Introduction
This chapter describes the research background, which includes key concepts such as crop
yield prediction, machine learning, and predictive analysis techniques. The motivation of
the research and the statement of the research problem are then described, followed by
the research questions, objectives, scope and limitations, and the application of the
results. The chapter concludes with the thesis organization and a summary of each
chapter.

1.1. Background of the Study


Ethiopia is one of the countries most prone to climate variability. The agricultural
sector, which contributes over 45% of GDP, 80% of employment, and 85% of foreign exchange
earnings, is very sensitive to climate change [1]. Over 95% of rain-fed agricultural
production comes from smallholders and subsistence farmers, who have less capacity to
adapt to climate change [2]. The crops produced include food crops, cash crops, fruits,
and vegetables. Crop production constitutes the largest share of the country's GDP and
export earnings when compared with livestock production.

Hence, as it has been for centuries and still is at present, agriculture is believed to
remain the leading sector and to play a dominant role in bringing about overall
sustainable economic growth in the country in the years to come. This holds if and only
if strenuous efforts are made by the government and the stakeholders involved, such as
farmers, to enhance productivity through increased use of farm inputs, including improved
seed and fertilizers, and to modernize farming through increased use of modern and
improved farm implements and farming systems, as well as through the introduction of
modern farming technology to the sector as a whole.

In Ethiopia, cereal production is the dominant form of agricultural practice compared
with other types of crop production. According to the 2019 CSA report, the shares of crop
types in the total crop production area are: cereals (71.57%), legumes (11.20%), oats
(5.17%), vegetables (1.67%), root crops (1.60%), fruit crops (0.83%), and coffee (5.28%).
Among the regional states of the country, Oromia ranks first in terms of both land area
allocation (45.41% of the country-wide crop production area) and crop production (49.24%
of country-wide crop production) [3].

In 2018, the cereal yield for Ethiopia was 2,395 kg per hectare. Although Ethiopia's
cereal yield has fluctuated appreciably in recent years, it tended to increase over the
1969-2018 period, ending at 2,395 kg per hectare in 2018 [4]. This indicates that cereal
crop production is an important source of livelihood for smallholder farmers in the
country; thus, smallholder farmers' food security and welfare depend on the extent of
development in this subsector. Cereals like sorghum, wheat, maize, and rice are major
staple foods for most of the population.

Crop yield is the most important indicator in agriculture and has several connections
with human society. Due to the complexity of the data, crop production forecasting is a
difficult task for policymakers. Researchers in agriculture and agro-economics are
interested in developing new mathematical techniques that can make better predictions
using existing metrics. Research in this direction is concerned with establishing a link
between the agricultural environment and crop production, considering local variables,
soil quality, irrigation, and land use. These models are based on the laws of
measurement [5].

Crop yield prediction is one of the most important and well-known topics in precision
agriculture, along with crop mapping and estimation, matching crop supply with demand,
and crop management. Modern approaches go far beyond simple predictions based on
historical data and include computer vision technologies to provide information on the go
about overall crop, weather, and economic conditions [6].

The challenge begins when one realizes that it is not possible to produce such
information for a specific expert system. Manual surveys and remote sensing data are used
to predict crop yields. Manual surveys combined with observations from past years are
useful for a small area but are difficult to compare across regions and countries.
Recent advances in crop simulation models have overcome these problems [7].

Crop yield predictions are valuable to many stakeholders in the agro-food chain,
including farmers, agronomists, commodity traders, and agricultural policymakers [8].
Crop yield is influenced by many crop-specific parameters, environmental conditions, and
management decisions, and it is hard to construct a reliable and explainable prediction
model [9].

Machine learning is an artificial intelligence application in which a computer or system
learns from past experience (input data) and makes predictions about the future. The
performance of such a system should be at least at human level. Machine learning is
present throughout the growing and harvesting cycle. It begins with the planting of a
seed, covering soil preparation, seed selection, and water supply, and ends with robots
harvesting the crop after determining its ripeness using computer vision [10].

From an engineering perspective, an ML application is a software system with one or more
components that learn from data. This involves the gathering and pre-processing of data,
the training of an ML model, the deployment of the trained model to perform inference,
and the software engineering of the surrounding system that sends new input data to the
model to get answers. Machine learning is usually classified into three types:
supervised learning, unsupervised learning, and reinforcement learning [11].
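
The stages named above (gathering, pre-processing, training, and inference on new input) can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the records are invented for illustration, the function names `preprocess`, `train`, and `predict` are hypothetical, and a one-variable least-squares line stands in for the regression models studied later.

```python
from statistics import mean

# Hypothetical gathered records: (average rainfall in mm, yield in kg/ha).
# None marks a missing value. These numbers are invented for illustration.
RAW_RECORDS = [
    (800.0, 1800.0),
    (900.0, 2000.0),
    (None, 2100.0),
    (1000.0, 2200.0),
    (1100.0, 2400.0),
]


def preprocess(records):
    """Pre-processing step: drop records that contain a missing value."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def train(records):
    """Training step: fit y = intercept + slope * x by ordinary least squares."""
    xs = [x for x, _ in records]
    ys = [y for _, y in records]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in records)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Inference step: apply the trained model to a new input value."""
    intercept, slope = model
    return intercept + slope * x


clean = preprocess(RAW_RECORDS)
model = train(clean)
print(predict(model, 950.0))  # prediction for a new, unseen rainfall value
```

In a deployed system the surrounding software would call `predict` with newly collected input; here a single call stands in for that step.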

Data processing is the basis of the whole agricultural data cycle and has to deal with
many problems in agriculture, including food security, soil conservation, irrigation,
pest identification and prevention, soil health, and agricultural utilization.
Traditional analysis techniques, including data mining, machine learning, statistical
analysis, and other techniques, are not applicable to large-scale data processing in
agriculture. Years of research and development in data mining, machine learning,
statistical analysis, and other forms of data analysis have had a large effect on how
data are handled. Depending on the characteristics of agricultural data, timeliness is
used as a measure. Research data management poses numerous technological challenges.
These are related to environmental modeling, ranging from metadata-based data retrieval
problems to data mining and data integration. Analysts are used to verify large-scale
agricultural data algorithms, which makes it possible to estimate, to some extent, the
effectiveness of the algorithms and the reliability of the results. Predictive analysis
is the branch of data analysis that is mainly used to predict future events or outcomes.
The process of predictive analysis is depicted in Figure 1 [12].

Figure 1. Predictive analysis process (Source: Adapted from [12])


In modern times, technology constantly generates large amounts of data to collect and
analyze, and predictive analytics plays an important role in this respect. The modern
digital technology sector uses predictive analytics throughout the business and IT
domains to gain competitiveness. Predictive analytics, as a form of advanced data
analytics, makes predictions about future outcomes by analyzing previous data. To analyze
previous data, this method combines statistical modeling, data mining, and machine
learning tools and techniques to produce accurate and actionable insights. The science of
predictive analytics can generate future insights with a significant degree of accuracy.

Predictive analytics combines several scientific methods and techniques. Based on the
science of predictive analytics, its techniques include the following basic steps [13]:
• Regression
• Classification
• Clustering
• Time series
• Prediction

• Data mining: Data mining aims to process large data sets, both structured and
unstructured, to recognize hidden patterns and relationships among the given variables.
Once identified, these relationships can be used to understand the behavior of the event
from which the data were compiled.
• Statistical modelling: In parallel with the data mining process, statistical models
can be developed, depending on what needs to be predicted, using the same data gathered
for data mining. Once the model is built, new data are fed to it to predict future
outcomes. For example, a business expert can build a cross-selling model using current
customer data and predict what other items customers will likely purchase from the same
company.
• Machine learning: ML can apply iterative techniques to detect patterns in large data
sets and build models. For example, recommendation engines are widely used for online
shopping recommendations, where predictions are made from customers' earlier purchasing
and browsing behavior.

1.2. Motivation of the Study


Agriculture plays a critical role in the Ethiopian economy. The potential for applying
data science in the agricultural sector is very promising. To date, few research studies
of this kind have been conducted in Ethiopia, which was the main motivation for this
research. The use of data science in Ethiopian agriculture is relatively low, and most
of the available information is not digital.

Much needs to be done before this information can be used in any data science
application, which makes it difficult for researchers who want to contribute to
agricultural data analysis. Therefore, the current research is intended to work in this
field and to develop useful and valuable data sets and models, so that any researcher
who wants to work in this field will have access to decent information and good models.
It is hoped that the successful implementation of the models will make it possible to
predict cereal crop productivity.

1.3. Statement of the Problem


Timely prediction of crop yields is important for planning and for making various policy
decisions. Many countries use conventional data collection techniques for crop
monitoring and yield prediction, based on field visits and reports. These methods are
subjective, very expensive, and time-consuming. The underlying issue that calls for this
research is that crop yield prediction is one of the most important services for an
agricultural country like Ethiopia. This research therefore designs and implements a
model trained on the dataset as input, so that the machine can learn the features and
predict the crop yield from the data using machine learning techniques.

1.4. Research Questions


The study is designed with the following research questions in mind:
1. Which machine learning model can be applied for cereal crop yield prediction?
2. How effective is the developed crop yield prediction model?

1.5. Research Objectives


1.5.1. General Objective
The main objective of this research is to implement machine learning techniques in cereal
crop yield prediction to improve crop production management in Ethiopia.
1.5.2. Specific Objectives

To achieve the above general objective, the research work pursues the following
specific objectives:
• To review the literature on crop yield prediction and machine learning applications
in agriculture
• To examine the factors affecting crop production
• To collect data from open-source data repositories
• To develop an appropriate model using machine learning techniques for cereal crop
yield prediction
• To evaluate the effectiveness of the developed machine learning model for cereal crop
yield prediction
• To prepare a dataset of cereal crop production for Ethiopia

1.6. Scope and Limitation of the Study


1.6.1. Scope of the Study
The primary purpose of the study is to develop a cereal crop yield prediction model to
enhance food security. The study is conducted in Ethiopia. The objectives described in
the subsection above define the scope and generality of the study, while the parameters
of the research define its specificity. The prediction model is built from factors such
as rainfall, temperature, pesticides, yield, and other entities, using different machine
learning techniques. The performance of each technique is evaluated based on prediction
accuracy.
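
As an illustration of how one technique could be scored at several train/test split ratios, the sketch below computes the coefficient of determination (R², the metric named in the title of Table 7) in plain Python. The rows are invented, the helper names are hypothetical, and the mean-yield baseline is only a placeholder for the regression models compared in this thesis.

```python
import random


def train_test_split(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_baseline(train_rows):
    """Placeholder model: always predicts the mean yield of the training rows."""
    mean_yield = sum(y for _, y in train_rows) / len(train_rows)
    return lambda features: mean_yield


# Hypothetical rows: ((rainfall_mm, temperature_c, pesticides_t), yield_kg_per_ha).
ROWS = [((800 + 25 * i, 20 + 0.1 * i, 100 + 5 * i), 1500 + 40 * i) for i in range(20)]

for fraction in (0.7, 0.8, 0.9):
    train_rows, test_rows = train_test_split(ROWS, fraction)
    model = mean_baseline(train_rows)
    actual = [y for _, y in test_rows]
    predicted = [model(x) for x, _ in test_rows]
    print(f"train fraction {fraction}: R2 = {r2_score(actual, predicted):.3f}")
```

A constant baseline cannot score above zero on this metric, which makes it a useful reference point: any model worth selecting should score well above it at every split ratio.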

1.6.2. Limitation of the Study


In this project, only four machine learning methods are applied to predict cereal crop
yields, using publicly available data from FAO and World Data Bank. As a result, the
data do not cover all the parameters that affect crop yield; only basic parameters and
selected crops are included. The work is limited to model selection based on model
performance; therefore, it does not include any deployed system or mobile application at
this time.

1.7. Significance of the Study


Agricultural stakeholders are concerned with the estimated crop yield before the
harvest. Many parts of the world use the practice of determining yield ahead of the
harvest to gauge the food security of a nation and to raise warnings related to food
shortages. This practice typically helps strategy planners and decision-makers,
especially in agrarian economies. The findings of this study could be of great
importance to numerous stakeholders such as farmers, the government, the Agricultural
Transformation Agency (ATA), researchers, and scholars.

1.8. Contribution of the Study
This work contributes to the scientific community and to practice in multiple ways:

• Different factors that affect crop production, specifically for cereal crops, were
investigated; these have different applications for agricultural policies.
• Various solutions proposed or used in crop yield prediction were studied, together
with the effect of the various parameters that influence their results.
• Standard algorithms for crop yield prediction were compared, and the most suitable
algorithm for a generic set of crops was identified.
• Various parameters that affect crop yield were evaluated and ranked according to
their impact.
• A cereal crop dataset for Ethiopia has been developed and published for researchers
and other users.

1.9. Thesis Organization


The thesis consists of five chapters. Chapter One gives an introduction to research on
machine learning techniques in the agricultural domain. The overview and importance of
crop yield prediction, the motivation, the objectives, and the scope and limitation of
the research are also presented in this chapter. Chapter Two reviews theoretical and
practical concepts related to crop yield prediction and discusses recent related work on
crop yield prediction. Chapter Three gives a brief explanation of the methodology, the
data collection, and the machine learning techniques and algorithms used. Chapter Four
presents the experimental setup and tools, the findings of the study, and the results
and discussion. Finally, Chapter Five presents the conclusions, the major contributions
of the study, and the recommendations drawn from the study. The references and appendix
follow the chapters.

Chapter Two
Literature Review
2.1. Introduction
This chapter discusses the literature reviewed by referring to books, journals,
articles, conference papers, and the internet to gain more insight into the concept of machine
learning and its applications, especially in agriculture. It provides insights into
the work of earlier researchers. The chapter is subdivided into factors affecting crop
production, machine learning in agriculture (applications and techniques), crop yield
prediction using MLR, and a summary of recent related works.

2.2. Factors Affecting Crop Production


2.2.1. Impact of Climate Change on Crop Production
Global climate change and the associated weather extremes continue to pose considerable
challenges in both developed and developing countries. Climate-induced food shortages
and chronic illnesses affect billions of people in developing countries [14].

Ethiopia is among the countries most affected by climate change. The
agricultural sector, which contributes more than 45% of GDP, 80% of employment, and 85% of
foreign exchange earnings, is particularly vulnerable to climate change. More than 95 percent of
crop production is rain-fed and is produced by smallholder and subsistence
farmers who have limited capacity to adapt to climate change [1].

Global warming is one of the major challenges facing global food security. Climate change
affects the production and productivity of the crop sector by decreasing soil fertility,
increasing pests and crop diseases, and causing frequent droughts and floods. These effects
are aggravated by lack of access to inputs and improved seeds, limited irrigation schemes,
poverty, high population pressure, and a lack of institutional capacity for adaptation.
Climate change is projected to decrease the overall yields of cereal crops in Africa by
shortening the length of the growing season, increasing water stress, and increasing
disease, pest, and weed outbreaks [15].
Agriculture has always depended on the weather: farmers need a mixture of sun, heat, and
rain to produce food reliably.

2.2.2. Temperature

Temperature is the most important environmental factor influencing growth and
development and, hence, the final productivity of agronomic cereal crops. Annual
crop plants have a more restricted range of lethal temperatures. Most of these plants can
survive temperatures from approximately 0 °C to 50 °C. In agriculture, crop plants are
generally grown during those times of the year when the occurrence of lethal temperatures is
unlikely.

Wheat needs 12 to 15 inches (31 to 38 cm) of water to produce a good crop in Ethiopia. It
grows well when temperatures are warm, from 70 °F to 75 °F (21 °C to 24 °C), but not too hot.
Wheat also needs a lot of sunshine, especially when the grains are filling. Rice
requires a hot and humid climate. The average temperature required over the
life of the rice crop ranges from 21 to 37 °C. Maize is a warm-weather crop and
is not grown in areas where the mean daily temperature is less than 19 °C or where the
mean of the summer months is less than 23 °C. Although the minimum temperature for
germination is 10 °C, germination is quicker and much less variable at soil
temperatures of 16 to 18 °C. Sorghum germinates quickly at soil temperatures of
65-70 °F (about 18-21 °C) but will also germinate at temperatures as low as 50 °F (10 °C),
with very slow growth. Planting should not begin until soil temperatures (at 2-inch depth)
have reached an average of 60 °F (about 16 °C) over a five-day period [16].
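
The sorghum thresholds above are quoted in degrees Fahrenheit, whereas the other crops are
described in degrees Celsius. As a small aid for comparing them on one scale, the conversion
can be sketched as follows. The function and dictionary names are illustrative only and are
not part of the study.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Sorghum soil-temperature thresholds quoted in the text, in degrees Fahrenheit.
sorghum_thresholds_f = {
    "quick germination (lower bound)": 65.0,
    "quick germination (upper bound)": 70.0,
    "minimum for germination": 50.0,
    "five-day average before planting": 60.0,
}

for label, temp_f in sorghum_thresholds_f.items():
    temp_c = fahrenheit_to_celsius(temp_f)
    print(f"{label}: {temp_f:.0f} F = {temp_c:.1f} C")
```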
2.2.3. Rainfall
Rainfall seasonality and timing are key climatic features affecting crop yield in rain-fed
agriculture. Crops need water for their growth, for photosynthesis, and for
their overall performance. Rainfall provides water that serves as a medium through which
nutrients are transported for crop development [17]. Rainfall variability has a significant and
negative impact on crop production in Ethiopia. When the annual rainfall deviates
from its mean (either upward or downward), the production of all crop types
diminishes significantly. When rainfall is extreme, the effect of fertilizer in raising
productivity is also diminished.

In Ethiopia, the amount of rainfall required for wheat cultivation varies between 300 mm
and 1000 mm. The major wheat lands of the temperate regions have an annual rainfall of
380 mm to 800 mm. Maize is the second most widely cultivated crop in Ethiopia and is
grown under rain-fed production. The rainfall needed for maize is around 650-1200 mm.
Rice is mainly grown in rain-fed areas that receive heavy annual rainfall. It demands a
rainfall of more than 800 mm. Sorghum is well adapted to semiarid regions with a
minimum annual rainfall of 350-600 mm. It is grown in areas that are too hot and dry [18].
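
The rainfall requirements quoted above can be collected into a small lookup table. The
sketch below is only an illustration: the names are hypothetical, bounds are treated as
inclusive, and for sorghum only the lower end of the stated minimum (350 mm) is used as a
threshold, which is a simplifying assumption rather than something stated in [18].

```python
# Annual rainfall requirements in mm, taken from the figures quoted in the text.
# None marks an upper bound that the text does not give.
RAINFALL_REQUIREMENT_MM = {
    "wheat": (300, 1000),
    "maize": (650, 1200),
    "rice": (800, None),      # "more than 800 mm"
    "sorghum": (350, None),   # assumption: lower end of the stated minimum
}


def crops_in_range(annual_rainfall_mm):
    """Return the crops whose quoted rainfall requirement covers the given value."""
    suitable = []
    for crop, (low, high) in RAINFALL_REQUIREMENT_MM.items():
        above_minimum = annual_rainfall_mm >= low
        below_maximum = high is None or annual_rainfall_mm <= high
        if above_minimum and below_maximum:
            suitable.append(crop)
    return suitable


print(crops_in_range(700))
```

Rainfall is only one of several factors discussed in this chapter, so such a check is a
coarse filter and not a suitability model.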

2.2.4. Pesticides
Pesticides can contaminate soil, water, grass, and other plants. In addition to killing insects
or weeds, pesticides can be poisonous to many other organisms, including birds, fish,
beneficial insects, and non-target plants. Pesticides are poisonous chemical compounds that
are intentionally released into the environment. Although each pesticide is meant to kill a
certain pest, a very large percentage of pesticides reach a destination other than their target.
Pesticides easily contaminate the air, ground, and water when they run off from fields,
escape from storage tanks, are not properly disposed of, and especially when they are
sprayed into the air. Pesticides are agricultural technologies that enable farmers to
control pests and weeds and are an important resource when growing crops [19].

2.3. Machine Learning


Machine learning is an application of artificial intelligence (AI) that allows systems to
learn and improve from experience without being explicitly programmed. It focuses on
developing computer programs that can access data and use it to learn for themselves.

A machine or intelligent computer program learns and extracts knowledge from the data and
builds a model for making predictions or intelligent decisions. Thus, the ML process
is divided into three key parts, i.e. data input, model building, and generalization, as shown
in Figure 2. Generalization is the process of predicting the output for inputs on which
the algorithm has not been trained before.

ML algorithms are mainly used to solve complex problems where human expertise fails,
such as weather prediction, spam filtering, disease identification in plants, and pattern
recognition [20].

Figure 2. Machine learning process (Source: Adapted from [20])

Machine Learning (ML) has enabled more in-depth studies in a plethora of fields.
Training artificial neural networks (ANNs) is one of the most popular techniques in
ML and has been applied to many biological and agricultural problems. An interesting and
useful aspect of ANNs is their ability to find complex associations between input
and response variables without pre-defining any constraints or assumptions about
the sample distribution of the data. This makes it possible to describe the complicated
non-linear relationships that are often found in domains such as
precision agriculture, which result from a wide range of crop conditions and other
influencing factors [21].

Machine Learning (ML) deals with problems where the relationship between input and
output variables is not known or hard to obtain. The term “learning” here denotes the
automated acquisition of structural descriptions from examples of what is being described.
Unlike conventional statistical methods, ML does not make assumptions
about the precise form of the model that describes the data.
This property is very useful for modelling complex non-linear
behaviors, such as a function for crop yield prediction. ML techniques have been
most successfully applied to Crop Yield Prediction (CYP). A supervised
learning algorithm involves a target/outcome variable (or dependent variable) which is
to be predicted from a given set of predictors (independent variables). Using these sets of
variables, we generate a function that maps inputs to desired outputs. The training process
continues until the model achieves a desired level of accuracy on the training data. Examples
of supervised learning methods include regression, decision tree, random forest, KNN, and
logistic regression [22].
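
The idea that training continues until the model reaches a desired level of accuracy on the
training data can be illustrated with a minimal sketch. It fits a straight line by gradient
descent and stops once the mean squared error falls below a tolerance. The data, the learning
rate, and the tolerance are made-up values chosen for illustration; this is not the training
procedure used in this thesis.

```python
def train_until_accurate(xs, ys, learning_rate=0.5, tolerance=1e-6, max_steps=10_000):
    """Fit y = slope * x + intercept by gradient descent on the mean squared error.

    Training stops as soon as the training error is at or below `tolerance`,
    or after `max_steps` updates, whichever comes first.
    """
    slope, intercept = 0.0, 0.0
    n = len(xs)
    mse = float("inf")
    for _ in range(max_steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        mse = sum(e * e for e in errors) / n
        if mse <= tolerance:
            break  # desired accuracy on the training data reached
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept, mse


# Hypothetical training data: a scaled predictor and a target generated as y = 1 + 2x.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]

slope, intercept, mse = train_until_accurate(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, training MSE={mse:.2e}")
```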

2.3.1. Types of Machine Learning


There are multiple forms of machine learning: supervised, unsupervised, semi-supervised,
and reinforcement learning. Each form takes a different approach, but
they all follow the same basic process and theory [23].

Figure 3. Types of machine learning (Source: Adapted from [23])

Supervised Learning: It is the most popular paradigm for machine learning. Given data in
the form of examples with labels, we can feed these example-label pairs to a learning
algorithm one by one, allowing the algorithm to predict the label for every example and
giving it feedback as to whether or not it predicted the right answer. Over time, the
algorithm will learn to approximate the exact nature of the relationship between examples
and their labels. When fully trained, the supervised learning algorithm will be able to
observe a new, never-before-seen example and predict a good label for it [23].

Unsupervised Learning: It is very much the opposite of supervised learning. It features
no labels. Instead, the algorithm is fed a lot of data and given the tools to understand
the properties of the data. From there, it can learn to group, cluster, and organize the data
in such a way that a human can come in and make sense of the newly organized data.
Because unsupervised learning is based upon the data and its properties, we
can say that unsupervised learning is data-driven. The outcomes of unsupervised learning
tasks are controlled by the data and the way it is formatted [24].
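
Grouping unlabeled data can be illustrated with a minimal one-dimensional k-means sketch.
The rainfall values are hypothetical, the function name is illustrative, and the centroids
are initialized deterministically between the minimum and maximum value, which is a
simplification compared with common library implementations.

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Cluster one-dimensional values into k groups with a basic k-means loop."""
    if k < 2:
        raise ValueError("this sketch expects at least two clusters")
    lo, hi = min(values), max(values)
    # Deterministic start: spread the centroids evenly between min and max.
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, labels


# Hypothetical annual rainfall observations in mm, with no labels attached.
rainfall_mm = [310, 330, 350, 900, 950, 1000]
centroids, labels = kmeans_1d(rainfall_mm, k=2)
print(centroids, labels)
```

No label is given to the algorithm; the two groups (here, drier and wetter observations)
emerge from the data alone, and it is left to a human to interpret them.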

Reinforcement learning: It is fairly different when compared to supervised and
unsupervised learning. Reinforcement learning is very behavior-driven. It has
implications for the fields of neuroscience and psychology. For any reinforcement learning
problem, we need an agent and an environment, as well as a way to connect the two through
a feedback loop. To connect the agent to the environment, we give it a set of
actions that it can take that affect the environment. To connect the
environment to the agent, the environment continually issues two signals to the agent: an
updated state and a reward (our reinforcement signal for behavior) [25].
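
The feedback loop between agent and environment can be sketched as follows. This toy example
shows only the interaction (action in, updated state and reward out) and deliberately omits
any learning update. The environment, its moisture levels, the reward rule, and the fixed
policy are all hypothetical.

```python
class FieldEnvironment:
    """Toy environment whose state is a soil-moisture level from 0 (dry) to 4 (wet)."""

    def __init__(self, moisture=2):
        self.moisture = moisture

    def step(self, action):
        """Apply an action and return the two signals: updated state and reward."""
        if action == "irrigate":
            self.moisture = min(4, self.moisture + 1)
        else:  # "wait": the soil dries by one level
            self.moisture = max(0, self.moisture - 1)
        # Hypothetical reward: +1 inside the target band (levels 2 and 3), -1 outside.
        reward = 1 if self.moisture in (2, 3) else -1
        return self.moisture, reward


def threshold_policy(state):
    """Fixed agent behavior: irrigate when moisture is at or below level 2."""
    return "irrigate" if state <= 2 else "wait"


def run_episode(env, policy, steps):
    """Run the agent-environment loop and collect the total reward."""
    state = env.moisture
    total_reward = 0
    history = []
    for _ in range(steps):
        action = policy(state)                 # agent acts on the environment
        next_state, reward = env.step(action)  # environment answers with state and reward
        history.append((state, action, reward, next_state))
        total_reward += reward
        state = next_state
    return total_reward, history


env = FieldEnvironment(moisture=2)
total_reward, history = run_episode(env, threshold_policy, steps=6)
print(total_reward, env.moisture)
```

A learning agent would use the reward signal to adjust its policy over time instead of
following a fixed rule.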

2.4. Machine Learning Applications and Techniques in Agriculture


In a way, successful farming comes down to making complex decisions
based on the interconnections of a multitude of variables, including crop
specifications, soil conditions, climate change, and more. Traditionally, farming
techniques have been applied to a whole field or, at best, to a part of it. Machine learning
in agriculture allows for much greater precision, enabling farmers to treat plants and
animals almost individually, which in turn significantly increases the effectiveness of
farmers’ decisions [26].

2.4.1. Species Management


Species Selection
The selection of species is a tedious process of finding specific genes that determine the
effectiveness of water and nutrient use, adaptation to climate change, disease resistance,
as well as nutrient content or a better taste. Machine learning, and deep learning
algorithms in particular, use decades of field data to analyze
crop performance in different climates and the new characteristics developed in the process.
Based on this information, they can build a probability model that predicts which genes are
most important to a plant.

Species Recognition
While the traditional human approach for plant classification would be to compare the color
and shape of leaves, machine learning can provide more accurate and faster results by
analyzing leaf morphology, which contains more information about the properties of the
leaf.
2.4.2. Field Conditions Management
Soil Management

For agricultural specialists, the soil is a diverse natural resource, with complex processes
and poorly understood mechanisms. Its temperature alone can give insights into the effects
of climate change on regional yield. Machine learning algorithms study evaporation processes,
soil moisture, and temperature to understand the ecological variability and problems in
agriculture.

Water Management
Water management in agriculture affects the hydrological, climatological, and agronomical
balance. So far, the most advanced machine learning-based
applications are concerned with the estimation of daily, weekly, or monthly
evapotranspiration, allowing for more effective use of irrigation systems,
and with the prediction of daily dew point temperature, which helps identify expected
weather phenomena and estimate evapotranspiration and evaporation.
2.4.3. Crop Management
Yield Prediction
Yield prediction is one of the most important and popular topics in precision agriculture
because it defines yield mapping and estimation, the matching of crop supply with
demand, and crop management. State-of-the-art approaches have gone far
beyond simple prediction based on historical data and incorporate computer
vision technologies to provide data on the go and a comprehensive
multidimensional analysis of crops, weather, and economic conditions to make the most of
the yield for farmers and the population.

Crop Quality
The accurate detection and classification of crop quality characteristics can increase
product prices and reduce waste. In comparison with human experts, machines can make use of
seemingly meaningless data and interconnections to reveal new qualities that play a role
in the overall quality of the crops and to detect them.
Disease Detection
Both in the open air and in greenhouse conditions, pesticides are commonly
sprayed uniformly over the crop. To be effective, this approach requires high doses
of pesticides, resulting in significant financial and environmental costs. Machine learning
is used as part of overall precision agriculture management, where pesticide application is
targeted in terms of time, place, and affected plants.
Weed Detection
Apart from diseases, weeds are the most important threat to crop production. The biggest
problem in weed control is that weeds are difficult to detect and discriminate from crops.
Computer vision and machine learning algorithms can
improve the detection and discrimination of weeds at low cost and with no
environmental issues or side effects. In the future, these technologies will drive robots
that destroy weeds, minimizing the need for herbicides.
2.4.4. Livestock Management
Livestock Production
Similar to crop management, machine learning provides accurate prediction and estimation
of farming parameters to optimize the economic efficiency of livestock production
systems, such as cattle and egg production. For example, weight
prediction systems can estimate future weights 150 days before the
slaughter day, allowing farmers to modify diets and conditions accordingly.

Animal Welfare
In the present-day setting, livestock are increasingly treated not
simply as food containers, but as animals that can be unhappy and exhausted by their life
on a farm. Animal behavior classifiers can connect chewing signals to the need for
diet changes, and from movement patterns,
including standing, moving, feeding, and drinking, they can tell the amount of
stress the animal is exposed to and predict its susceptibility to diseases, weight gain,
and production.

The value of smart automation is widely recognized across many sectors, as shown by
examples of AI in fintech or AI in real estate. In agriculture, this kind
of technology is becoming essential. With data at the center of farming decisions and
the development of agrochemical products, the potential is immense. Perhaps more
importantly, machine learning is set to become a behind-the-scenes enabler of
more sustainable use of natural resources and a major contributor to a better environment.

However, for this technology to have a tangible effect on agriculture, it needs wide
acceptance among stakeholders, a different mindset from farmers, and sufficient funding.
This is a long-haul game. Companies need to be prepared to reinvent themselves, learn new
skills, and adapt to the rules imposed by big data.

2.5. Crop Yield Prediction Using MLR


Crop forecasting, and knowing how to increase yield, is important information for any
farmer. Soil pH, soil type and quality, weather patterns (temperature, rainfall, humidity,
and sunshine hours), fertilizers, and harvesting schedules are some of the parameters that
play an important role in predicting crop yield [27].

Crop Yield Prediction (CYP) is a methodology for predicting the yield of crops
using the available parameters. Yield prediction is controlled by various trainable and
untrainable factors. Predictive modeling is a method that uses data mining and probability
to predict outcomes. Data modeling in prediction involves four stages: historical data
analysis, data pre-processing, modeling of data, and performance estimation [28].
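
The four stages named above can be sketched end to end on a tiny example. All records, field
names, and units below are made up for illustration, the model is a simple least-squares
line, and the hold-out evaluation uses a single record; this is not the pipeline or the
dataset of this thesis.

```python
import math

# Hypothetical historical records; None marks a missing value.
records = [
    {"rainfall_mm": 400, "yield_q_ha": 13.2},
    {"rainfall_mm": 500, "yield_q_ha": 14.8},
    {"rainfall_mm": None, "yield_q_ha": 16.0},
    {"rainfall_mm": 600, "yield_q_ha": 17.1},
    {"rainfall_mm": 700, "yield_q_ha": 18.9},
    {"rainfall_mm": 800, "yield_q_ha": None},
    {"rainfall_mm": 900, "yield_q_ha": 23.0},
]

# Stage 1: historical data analysis (here: count incomplete records).
n_incomplete = sum(1 for r in records if None in r.values())

# Stage 2: data pre-processing (here: drop incomplete records).
clean = [(r["rainfall_mm"], r["yield_q_ha"]) for r in records if None not in r.values()]

# Stage 3: modeling (least-squares line fitted on all but the last clean record).
train, test = clean[:-1], clean[-1:]
mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
sxx = sum((x - mean_x) ** 2 for x, _ in train)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Stage 4: performance estimation (root mean squared error on held-out data).
squared_errors = [(intercept + slope * x - y) ** 2 for x, y in test]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(f"incomplete records: {n_incomplete}, clean records: {len(clean)}")
print(f"slope={slope:.4f}, intercept={intercept:.2f}, hold-out RMSE={rmse:.2f}")
```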

Applying ML algorithms and tuning their parameters based on the feature set makes
accurate prediction possible. Researchers are working toward developing efficient methods to
evaluate the prediction accuracy based on the data they collected. Data-driven models
have gained popularity and found CYP applications using classical statistical and ML
methods. Supervised ML approaches such as Artificial Neural Network (ANN), Support
Vector Regression (SVR), k-Nearest Neighbor (k-NN), and Random Forest (RF), which
are parametric or nonparametric in nature, heavily dominate crop yield
prediction across different agricultural data sets [29].

Machine learning is a critical decision-support tool for crop yield prediction, including
supporting decisions on what crops to grow and what to do during the growing season
of the crops. Several machine learning algorithms have been applied to support crop
yield prediction research [30].

Crop yield prediction is among the most important problems in agriculture. Agricultural
yield depends primarily on weather conditions (rain, temperature, etc.) and pesticides.
Accurate information about the history of crop yield is important for making decisions
related to agricultural risk management and future predictions.

Yield forecasts can be made using statistical and ML models. The statistical
model MLR is frequently used in agricultural yield prediction, and its main goal is to
quantify the relationships between several independent variables and a dependent variable.
Even when there is no known functional dependence between the variables, one can try to
relate them using a mathematical equation. This equation may not have a physical meaning,
but under some assumptions, it allows values to be forecast from knowledge of the other
variables. The MLR method attempts to build a model that relates a dependent variable to
two or more independent variables by fitting a linear equation to the observed data.
This section reviews the application of the MLR model to yield prediction by
researchers [31].
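
Fitting a linear equation with two or more independent variables to observed data can be
sketched with the ordinary least-squares normal equations. The sketch below uses only the
standard library; the function names and the data are hypothetical (the targets are generated
from y = 2 + 1.5*x1 + 0.5*x2 so that the recovered coefficients can be checked), and it is
not the implementation used in the reviewed studies.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_mlr(rows, targets):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp via (X^T X) b = X^T y."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(x, targets)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    """Evaluate the fitted linear equation for one observation."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))


# Hypothetical observations: (rainfall in hundreds of mm, mean temperature in deg C).
rows = [(4, 18), (5, 20), (6, 19), (7, 22), (8, 21), (9, 24)]
targets = [17.0, 19.5, 20.5, 23.5, 24.5, 27.5]

coefficients = fit_mlr(rows, targets)
print([round(c, 3) for c in coefficients])
print(round(predict(coefficients, (6.5, 20)), 2))
```

Solving the normal equations directly is adequate for a small illustration; for larger or
ill-conditioned problems, a numerically more stable solver from a numerical library would be
a safer choice.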

MLR and ANN are widely used to predict soil hydraulic properties from easily available
soil variables, and parameters are selected by the data distribution method. The researchers
used a vast array of soil data. From the analysis, they found that the neural network
results were sensitive to uncertainty in the data. When the uncertainty in the data sets
decreases, the neural network provides
a better prediction of soil behavior than MLR. However, when the uncertainty in the data
sets is high, the neural network is unable to provide better prediction accuracy [32].

MLR and ANN algorithms were implemented to estimate the yield of organic potatoes
using soil quality parameters and the tillage system. The effect of tillage
systems on soil properties, and in turn on crop production, is discussed [33]. The authors
established that tillage and soil properties impacted the yield greatly. The MLR model
estimated the crop yield reasonably well, but its predictive performance was lower than
that of the ANN model.

The study by Sarmadian focused on predicting soil parameters using the available soil
dataset. A feedforward back-propagation neural network model and an MLR model
were used to predict the soil parameters. The artificial neural network with two neurons in
the hidden layer performed well for the main soil parameters, including cation exchange
capacity, water percentage at field capacity, and permanent wilting point. The performance
of the selected models was evaluated on the test data. The results indicate that the
neural network models were better suited to capturing the nonlinearity among the variables
[34].

Linear and neural network models were developed to estimate the daily global solar
radiation in a region of Salta Province, Argentina. The features of the dataset were
analyzed with MLR, ANN, and Multilayer Perceptron. The efficiency of the linear models and
the neural network models was compared on the same dataset, which
consisted of solar radiation data for 1996-2002. Three alternative
combinations of meteorological parameters for neural networks and linear regressions were
considered. The researchers obtained good results with both prediction methods. However, it
was concluded that neural networks produced better estimates than linear regressions [35].

MLR and ANN were compared by Mohammad Zaefizadeh et al. for estimating barley
yield. Their prediction model was based on a multilayer ANN with one hidden layer of
15 neurons. The Matlab perceptron software used in this study employed
the error back-propagation learning method and a hyperbolic tangent activation
function. The comparative results of the analysis indicated that the mean deviation index
of estimation for the ANN technique was one-third of that for MLR. The difference in the
mean deviation index was due to the significant interaction
between the genotype and the environment. This interaction had an impact on the MLR
method of estimation. The study concluded that a neural network approach is
recommended over the regression method for yield prediction, especially when there are
significant genotype-environment interactions, and because of its faster estimation [36].

Safa and Samarasinghe attempted to create an ANN that could predict energy use in wheat
production. The study was conducted on irrigated and dryland wheat fields in Canterbury
in the 2007-08 harvest season. The data were collected using extensive interviews and
questionnaires. The researchers identified many direct and indirect factors to train the
ANN. The ANN model predicted energy consumption better than the MLR
model on the data selected for testing and validation [37].

Using ten crop datasets, Gonzalez Sanchez, Frausto Solis, and Ojeda Bustamante compared the
predictive accuracy of ML and linear regression techniques for crop yield prediction, using
data collected from a Mexican irrigation zone. Along with the MLR model, the
researchers used regression trees, neural networks, nearest neighbor, and support vector
models to analyze the predictive ability. M5-Prime and k-nearest neighbor obtained the best
average accuracy metrics, and the researchers concluded that, in
agricultural planning, the planner could use M5-Prime as a tool for large-scale crop yield
prediction [38].

2.6. Review of Related Works


Nowadays, the research community gives more attention to topics related
to agriculture and its contribution to the economic growth of countries. Different
approaches are used to study the forecasting of gross agricultural production. This
section reviews research works related to the use of machine learning
techniques in agriculture, especially for prediction and forecasting.

Gopal and Bhargavi developed a novel hybrid model to predict paddy crop yield that is
based on multiple linear regression (MLR) and Artificial Neural Networks (ANN). In this
model, the initial weights of the neural network are derived from the MLR coefficients. The
paddy data is used to train the back-propagation network, and its performance is compared
with that of other machine learning models. The hybrid ANN-MLR model achieved better
precision than the other models [39].

Shastry and Sanjay created a new cloud-based framework to
classify soil and to predict crop yield. The proposed soil classification framework
is based on a hybrid kernel Support Vector Machine (SVM)
method, and the SVM kernel parameters are derived from a genetic algorithm (GA). The crop
yield prediction model was developed based on Artificial Neural Networks (ANN), and
the parameters of the ANN, such as the hidden layers, the neurons, and the learning rate,
are customized. The proposed cloud-based framework performs
better than other models in soil classification and crop yield prediction [40].

Pavan Patil, Virendra Panpatil, and Prof. Shrikant Kokate proposed a system and discussed
improving its results by adding more attributes. A combination of Naive
Bayes and decision tree algorithms is used. The decision tree shows poor performance
with the given dataset and has more variation, whereas Naive Bayes provides better results
than the decision tree for such datasets. The combined classification algorithm of Naive
Bayes and decision tree performs better than a single classifier model.
The parameters include soil type, soil pH value, humidity, temperature, wind, and rainfall
[41].

Islam, Chisty, and Chakrabarty used a deep neural network (DNN) model
to predict the yield of crops such as rice, jute, wheat, and potato using weather, soil,
and fertilizer data. The newly developed DNN model is compared with other machine
learning models, namely Random Forest, Support Vector Machine, and Linear Regression.
The DNN model gives higher precision in prediction than the other models [42].

Feng et al. exploited the power of machine learning and regression models in the prediction
process. In their research work, they compared a cross-validated Random Forest (RF)
model with a multiple linear regression (MLR) model and also established the correlation
between climate and rainfall parameters. This correlation shows how the wheat
yield percentage decreases when rainfall is low. In prediction, RF outperforms
MLR [43].

Prakash, Sharma, and Sahu explored better ways to predict soil moisture with the
help of machine learning models such as Support Vector Machine (SVM) and RNN, and the
statistical model multiple linear regression (MLR). The predicted outcomes of the
models are compared against each other. The authors suggested that, in short-term
moisture prediction, MLR has better prediction power than the machine learning models [44].

Giritharan and Koteeshwari suggested using Artificial Neural Networks (ANN), one of the
most effective tools for modeling and prediction. To implement the
ANN, the feedforward and back-propagation networks are combined.
The suggested system is an easy-to-use Android application [45].

Snehal S. Dahikar and Sandeep V. Rode used Artificial Neural Network technology for
estimating long-term or short-term crop production because it offers solutions
to a variety of complex problems in agricultural research. This research work presented
the ANN only as a means to minimize losses when conditions are unfavorable, predicting crop
yield from parameters such as soil, weather, guaranteed price, and cultivation area [46].

Singh and Prabhat Kummer concluded that their paper would help improve crop yields by
applying classification methods and comparing metrics. Crops can also be analyzed and
predicted using Bayesian algorithms. The Bayesian algorithm, K-means
algorithm, clustering algorithm, and Support Vector Machine algorithm were used. The
disadvantage is that the paper does not report the accuracy and performance obtained from
implementing the suggested algorithms [47].

Arun Kumar, Naveen Kumar, and Vishal Vats proposed a system to predict crop
yield by analyzing past soil, rainfall, and yield datasets. The prediction
was done using the K-Nearest Neighbor, Support Vector Machine, and Least
Squares algorithms [48]. Another study performed crop prediction using weather forecasts,
the pesticides and fertilizers to be used, and past revenue as input data. Multilinear
principal component analysis (MPCA) was used for feature reduction. In addition to the
forecast, the authors take into account prerequisites and feature reduction [49].

There are a few research works on sugarcane yield prediction that are related
to our work. A sugarcane yield prediction technique using Random Forest [43]
was proposed in one of the surveyed studies; the features used in this study consist of
biomass index, climate data (e.g., rainfall), and yields from previous years. Two predictive
tasks are presented in [50]: (i) the classification problem of predicting whether the yield
will be above or below the observed median yield, and (ii) the regression problem of
predicting the yield estimates in two distinct time periods. In addition, a support vector
machine for rice crop yield prediction was proposed; the dataset used in this method consists
of precipitation, minimum, maximum, and average temperature, area, evapotranspiration, and
production. The sequential minimal optimization classifier is applied to the
dataset [51].
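
Task (i) above turns a continuous yield into a binary label by comparing it with the observed
median. A minimal sketch of that labeling step is shown below; the yield values are
hypothetical, and treating a yield equal to the median as "not above" is an assumption, since
the text does not say how ties are handled.

```python
from statistics import median

# Hypothetical observed yields (arbitrary units).
observed_yields = [2.1, 3.4, 2.8, 4.0, 3.1]

median_yield = median(observed_yields)

# Label 1 if the yield is above the observed median, otherwise 0.
above_median = [1 if y > median_yield else 0 for y in observed_yields]

print(median_yield, above_median)
```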

Mary Mary Saji, Kevin Tom, Varsha S, Lisha Vargesi, and Er. Gene Thomas proposed a
paper that addresses agricultural problems by examining the agricultural region on
the basis of soil properties. It recommends the most suitable crop to farmers,
thereby helping them to increase productivity and reduce loss. The paper
compares several algorithms and reports which algorithm is best
for this crop prediction task. The algorithms tested are
KNN, KNN with cross-validation, Decision Tree, Naive Bayes, and SVM. The accuracies
obtained were 85%, 88%, 81%, 82%, and 78%, respectively. KNN with
cross-validation has the highest accuracy and, as a result, may be used for implementation
in the final system [52].

The dataset was processed with the WEKA tool to build a set of rules on the current
dataset, and the results were generated in Python using the SVM algorithm. In another
study, decision trees and decision rules were developed based on the C4.5 algorithm. The
authors built Crop Advisor, an interactive website for discovering the effect of weather
on crop production using the C4.5 algorithm [53]. It gives an idea of how different
climatic parameters affect the growth of the crop. The selections were made based on the
area under the chosen crop, and information on the associated year's climatic parameters,
such as rainfall, maximum and minimum temperature, and wet-day frequency, was collected.
The ID3 algorithm was used to achieve good quality and improved tomato crop yield; it is
implemented on the PHP platform and uses CSV files as datasets. The features used in this
study include area, production of the tomato crop, temperature, and humidity [54].

A decision tree classifier for agricultural data was proposed in [55]. This new classifier
uses a new data representation and can handle both complete and incomplete records. In the
experiment, 10-fold cross-validation is used to test the classifier on the horse-colic
dataset and the soybean dataset. The results showed that the proposed decision tree is
capable of classifying all kinds of agricultural data. A yield prediction model that uses
data mining techniques for classification and prediction was proposed in another study.
This model includes crop name, topography, soil type, soil pH, pest information, climate,
water level, and seed type. It predicted plant growth and plant diseases and thereby made
it possible to select the best crop based on climate information and the required
parameters [56].

Studying the previous research above yields many techniques and ideas that help in solving
the problem this study addresses. Machine learning algorithms can make prediction more
efficient, and there are several ways to approach crop yield prediction. As a step
forward, this study applies regression techniques to the dataset: because the values in
the dataset are numerical, the problem is suited to regression.

The table below summarizes the works of other researchers in the domain of crop yield
prediction: the algorithms they used, the purpose of their studies, and their findings.

Table 1. Summary of studies and their findings

| Year | Author | Purpose | Model used | Findings |
|------|--------|---------|------------|----------|
| 2019 | Gopal and Bhargavi | Proposed a model to predict the crop yield accurately | Hybrid MLR-ANN model | The hybrid MLR-ANN model gives better prediction accuracy than other models for the same agricultural dataset |
| 2019 | Shastry and Sanjay | Categorize soil and predict crop yield | Hybrid kernel Support Vector Machine (SVM) | Cloud-based framework to categorize soil and predict crop yield |
| 2020 | Pavan Patil, Virendra Panpatil, Prof. Shrikant Kokate | Crop Prediction System using Machine Learning Algorithms | Decision tree and Naïve Bayes | The combined classification algorithm of naïve Bayes and decision tree classifiers performs better than a single classifier model |
| 2018 | Islam, T., Chisty, T. A., & Chakrabarty | Predict the yield of crops such as rice, jute, wheat, and potato | Deep learning neural network model (DNN) | The DNN model has higher precision than RF, SVM, and linear regression in prediction |
| 2018 | Feng, P., Wang | Wheat yield prediction based on rainfall | Random forest (RF) model and multiple linear regression model | RF performed better than the linear regression model after comparison |
| 2018 | Prakash, S., Sharma, A., & Sahu | To predict future soil moisture | Support Vector Machines (SVM), Random Forest (RF), Extremely Randomized Trees (ET), Gradient Boosting Machines (GBM), and Deep Feedforward Network (DFN) | The GBM model showed the lowest prediction error |
| 2016 | Giritharan and Koteeshwari | To develop a crop predictor and advisor using ANN | Artificial Neural Network | Developed a crop predictor and advisor application for smartphones |
| 2014 | Snehal S. Dahikar and Sandeep V. Rode | To develop crop prediction by sensing various parameters of the soil | Artificial Neural Network | Developed powerful tools for modeling and prediction of crops based on soil |
| 2015 | J.P. Singh, Rakesh Kumar, M.P. Singh and Prabhat Kumar | To improve the yield rate of crops | Bayesian algorithm, K-means algorithm, clustering algorithm, SVM | Analyzed crop prediction using those models, but did not properly report accuracy or error |
| 2018 | Arun Kumar, Naveen Kumar and Vishal Vats | Efficient Crop Yield Prediction Using Machine Learning Algorithms | SVM and Least Squares algorithms | Shows that SVM is better here with respect to complexity |
| 2020 | Kevin Tom, Varsha S, Merin Mary Saji, Lisha Varghese, Er. Jinu Thomas | Crop Prediction Using Machine Learning | KNN, KNN with Cross-Validation, Decision Tree, Naive Bayes, and SVM | The accuracies obtained are 85%, 88%, 81%, 82%, and 78% respectively. KNN with cross-validation has the highest accuracy in this paper |

Based on the performance reported in the literature above, we chose a regression approach
and selected four machine learning algorithms for crop yield prediction. Our analysis of
the gaps in these studies shows that the data they use and their scope are limited to
specific areas. This study aims to fill that gap for Ethiopia, and specifically for cereal
crops.

Chapter Three
Methodology
3.1. Introduction

The goal of this work is to apply a number of standard machine learning techniques to an
agricultural dataset in order to predict cereal crop yield. Before applying the machine
learning techniques to the dataset, there should be a methodology that governs the work.
A methodology is more than a method of data collection; it also covers the concepts and
theories that underlie the methods. It is therefore critical to understand the essential
ideas behind the method, whether the aim is to focus on a specific feature of a
sociological theory, to check an algorithm for data retrieval, or to check the validity of
a particular system.

The objective of the current research work is to analyze the predictive algorithms with
fewer, relevant features. To meet the objectives of the research work, data collection and
data quality are most important. Since this study is a combination of three approaches,
the methodology discussed in each approach and the various resources are finally linked
into a single platform to achieve the objective. The data used for the current work are
irrigation-related data, related meteorological data, fertilizer usage data, and yield
statistics. The data was collected from various sources. The pre-processed data were
passed to the most relevant feature selection algorithms to identify the most critical
features. The selected features of the input dataset are then given as input to the
predictive algorithms to predict the crop yield.

3.2. Workflow
The research work has been divided into 8 procedures. Figure 4 illustrates the workflow
of the research. The subsequent subsections of the chapter discuss the workflow in detail.

Figure 4. Workflow of the research

3.2.1. Data Collection

The data was collected from open-source sites that provide normalized values. The system
can also be tested against actual data obtained from the government. For this research,
the crop yield data was obtained from FAO and the climate data from the World Data Bank
repository, which provides global data on historical and future climate, vulnerabilities,
and impacts.

A dataset is commonly understood as a collection of data, typically a single table in
which each column represents a particular variable and the rows together describe the
whole entity. A dataset can be organized according to several characteristics, depending
on its structure and the properties that need to be captured [57].

Machine learning is highly dependent on data. Data is the most important aspect that makes
algorithm training possible: the algorithm uses historical data and information to gain
experience. The better the dataset, the better the accuracy of the model.

Any ML analysis needs data, and a model can only be powerful if it is fed with the right
data. The target data should have the right features and the right outcomes, because they
affect the relevance and usability of the model as well as the findings. The data used for
this work was obtained from FAO and the World Data Bank repository and was assembled for
crop yield prediction.

Agricultural production depends on the factors listed below. Changes in these factors have
a meaningful impact on the yearly agricultural output of the selected areas. The choice of
attributes or parameters depended mainly on the availability of the data. Two different
sets of statistical data were used for the study: the agricultural data for cereal crops
and the weather data for the respective years. The two collected datasets were combined
into a single dataset. The parameters of the dataset are listed below.

• Temperature
• Rainfall
• Yield
• Pesticides

3.2.2. Data Pre-Processing

Machine learning algorithms do not work well with raw data. Data pre-processing is a
technique used to convert raw data into a clean dataset. In other words, whenever
information is collected from various sources, it is collected in raw form, which is not
easy to analyze.
• Data cleaning: the process of cleaning the data so that the information can be
organized effectively. Real-world data tend to be incomplete, noisy, and inconsistent.
Data cleaning routines attempt to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data. One way of handling
missing values is to delete the rows with null values. This method is a quick solution
and is typically preferred when the percentage of missing values is relatively low.
There were no invalid values in the dataset.

• Data integration: a data analysis task that combines data from multiple sources into
a coherent data store, such as a dataframe.

• Data reduction: complex data analysis and mining on huge amounts of data may take a
very long time, making such analysis impractical or infeasible. Data reduction is the
process of reducing large data to a smaller form in such a way that the data can be
easily transformed further. It obtains a representation reduced in volume through
techniques such as data discretization, data aggregation, dimensionality reduction,
data compression, and generalization.
• Data transformation: the process of converting raw data into a format or structure
that is more suitable for model building and for data discovery in general. It is an
essential step in feature engineering that facilitates discovering insights.
Techniques for numeric data transformation include log transformation, clipping
methods, and data scaling.
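
The cleaning and integration steps above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the sample values are copied from Tables 3 to 5, the missing yield for 1967 is a made-up gap added only to show the cleaning step, and the function names `drop_missing` and `integrate` are hypothetical.

```python
# Sample rows copied from Tables 3-5 (maize yield, rainfall, temperature).
# The None entry is a made-up gap, added only to show the cleaning step.
yield_rows = [
    {"Item": "Maize", "Year": 1964, "hg/ha_yield": 9700},
    {"Item": "Maize", "Year": 1965, "hg/ha_yield": 9850},
    {"Item": "Maize", "Year": 1966, "hg/ha_yield": 10000},
    {"Item": "Maize", "Year": 1967, "hg/ha_yield": None},  # hypothetical gap
]
rain_by_year = {1964: 943.97, 1965: 749.42, 1966: 847.97, 1967: 1082.36}
temp_by_year = {1964: 21.82, 1965: 22.21, 1966: 22.31, 1967: 21.82}


def drop_missing(rows):
    """Data cleaning: keep only rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def integrate(rows, rain, temp):
    """Data integration: join yield rows with climate data on the year."""
    merged = []
    for r in rows:
        year = r["Year"]
        if year in rain and year in temp:  # inner join on Year
            merged.append({**r,
                           "average_rain_fall_mm_per_year": rain[year],
                           "avg_temp": temp[year]})
    return merged


clean = drop_missing(yield_rows)
dataset = integrate(clean, rain_by_year, temp_by_year)
for row in dataset:
    print(row)
```

With a dataframe library such as pandas, the same two steps would typically be expressed with `DataFrame.dropna` and `DataFrame.merge` on the `Year` column.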

3.2.3. Feature Normalization (Scaling)

The dataset contains features that vary widely in magnitude, units, and range. Features
with large magnitudes weigh far more in distance calculations than features with small
magnitudes. To suppress this effect, all features need to be brought to the same level of
magnitude, which can be achieved by scaling.
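
As an illustration of scaling, the sketch below applies min-max normalization, which maps each feature linearly onto the range [0, 1]. The text does not state which scaler was used, so the choice of min-max scaling is an assumption; the rainfall and temperature values are taken from Tables 4 and 5.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:               # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Rainfall (mm) and temperature (deg C) for 1964-1967, from Tables 4 and 5.
rainfall = [943.97, 749.42, 847.97, 1082.36]
temperature = [21.82, 22.21, 22.31, 21.82]

print(min_max_scale(rainfall))
print(min_max_scale(temperature))
```

After scaling, rainfall (hundreds of millimetres) and temperature (about 22 degrees) both lie between 0 and 1, so neither dominates a distance calculation because of its units.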
3.2.4. Feature Selection

To build the model, a large general crop dataset with agricultural metrics was taken, and
another dataset was taken as a feature dataset. The datasets were collected from the FAO
and World Data Bank repositories. Initially the data has a size of 20 KB, and the
prediction parameters in this dataset include temperature, rainfall, pesticides, and
harvest year. The dataset covers a number of crops, such as wheat, rice, maize, and
sorghum. A number of values are available for each prediction parameter for a single crop.
For instance, for a crop like maize, each prediction parameter can take any of the values
available for maize in the dataset. The same holds for all the crops in the dataset.

Feature selection is the process of reducing a large number of features that describe the
dataset to the relevant ones, which also reduces computational complexity. A relatively
large number of features in a dataset may cause the model to over-fit the training samples
and generalize poorly to new samples. However, the final dataset did not require this step
because it was constructed manually, which avoided unnecessary variables in the first
place. Feature selection is often a necessary preprocessing step for learning algorithms.
It identifies and removes as much irrelevant and redundant information as possible.
Reducing the dimensionality of the attributes leads to a more understandable model and
simplifies the use of different visualization techniques. It may also allow learning
algorithms to operate faster and more efficiently, and it can improve accuracy in later
classification. It finds a minimum set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution.
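
One simple way to judge relevance is to rank features by the absolute value of their correlation with the target. The sketch below is illustrative only: the study states that feature selection was not required for the final dataset, the helper names `pearson` and `rank_features` are hypothetical, and four data points (wheat yield for 1964 to 1967 from Table 3, with the matching rows of Tables 4 and 5) are far too few for a meaningful ranking.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:     # a constant column carries no signal
        return 0.0
    return cov / (sx * sy)


def rank_features(features, target):
    """Sort feature names by |correlation with target|, strongest first."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


features = {
    "year": [1964, 1965, 1966, 1967],
    "rainfall": [943.97, 749.42, 847.97, 1082.36],
    "temperature": [21.82, 22.21, 22.31, 21.82],
}
wheat_yield = [7100, 7200, 7300, 7327]   # hg/ha, from Table 3

for name, r in rank_features(features, wheat_yield):
    print(f"{name}: r = {r:.3f}")
```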

The selection of the features that contribute most to prediction accuracy plays a major
role in obtaining accurate predictions. In [58], different feature subsets were selected
by applying different feature selection algorithms, such as Sequential Forward Feature
Selection, Correlation-based Feature Selection, Variance Inflation Factor, and Random
Forest Variable Importance. These feature subsets were applied to the MLR model to find
the best feature subset. The selected features were area, number of open wells, number of
tanks, canal length, and maximum temperature during the season [58]. These features gave
better prediction accuracy when they were applied to the machine learning algorithms and
the statistical model. The collected four data sets were combined into a single data set.

The table below describes the dataset.

Table 2. Features used for cereal crop yield in the area

| Feature Name | Data type | Feature Category | Description |
|--------------|-----------|------------------|-------------|
| Item | Numeric | Continuous | The name of the crop |
| Year | Numeric | Continuous | Year of harvest of the crop |
| Temperature | Numeric | Continuous | Average temperature registered for the year |
| Rainfall | Numeric | Continuous | Average rainfall for the year in mm |
| Pesticides | Numeric | Continuous | Total amount of pesticides used for cultivation for the year |
| Yield | Integer | Continuous | Yield of the crop for the year (hg/ha, the unit used in Table 3) |

3.2.5. Data Splitting


The dataset is split into two datasets, a training dataset and a test dataset. Training
the model requires as many data points as possible, so the data is usually divided
unevenly, with the larger share used for training.

The training dataset is the initial dataset used to train ML algorithms to learn and produce
the right predictions. The test dataset, however, is used to assess how well the ML
algorithm is trained with the training dataset. You can’t simply reuse the training dataset
in the testing stage because the ML algorithm already is aware of the expected result, which
defeats the aim of testing the algorithm.

In this study, several ways of partitioning the dataset for training and testing were
tried: 60% for training and 40% for testing, 70% for training and 30% for testing, 80% for
training and 20% for testing, and 90% for training and 10% for testing. Once training is
done, the prediction model is ready to make predictions. During training, the prediction
model learns the patterns among the various inputs across different years and within each
year itself.
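
The four partitions described above can be sketched as follows. This is an illustration under stated assumptions rather than the study's code: the helper `train_test_split` is written here from scratch (it only mirrors the idea of the scikit-learn function of the same name), the random seed is arbitrary, and the 100 integers stand in for dataset rows.

```python
import random


def train_test_split(rows, test_fraction, seed=0):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 dataset rows
for test_fraction in (0.4, 0.3, 0.2, 0.1):
    train, test = train_test_split(rows, test_fraction)
    print(f"{len(train)} train / {len(test)} test")
```

Shuffling before splitting keeps both parts representative of all years; without it, the test set would contain only the first or last rows of the table.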

3.2.6. Training and Testing of the Algorithm
As mentioned previously, the data was separated into two parts. One part was used to train
the algorithm, while the other part was used to test it. The same training and testing
process, using this partitioned dataset, was followed for each of the implemented
algorithms.

3.2.7. Performing Cross-Validation


After the model is trained on the given data, its performance should be checked. In the
evaluation step, the performance of the applied models is examined, and the results of the
different algorithms are analyzed to see whether the models can accomplish the task.
Comparing predictions against held-out test data for validation is known as
cross-validation. This step is done to check whether the model is well fitted or
over-fitted. This technique has been used to compute the mean of the cross-validation
scores over different train/test splits. The problems associated with random sub-sampling
can be overcome using cross-validation, which is more systematic in its approach.
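
A minimal sketch of k-fold cross-validation follows. It is illustrative only: the number of folds used in the study is not stated, the "model" is a deliberately trivial stand-in that always predicts the mean of the training targets, and the helper names are hypothetical. The wheat yields are taken from Table 3.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_val_mean(xs, ys, k, fit, score):
    """Average the score obtained on each held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model,
                            [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)


# Toy model for illustration: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def mean_abs_error(model, xs, ys):
    return sum(abs(y - model) for y in ys) / len(ys)


years = [1961, 1962, 1963, 1964, 1965, 1966, 1967]
wheat = [7127, 7127, 7123, 7100, 7200, 7300, 7327]   # hg/ha, from Table 3
print(cross_val_mean(years, wheat, k=7, fit=fit_mean, score=mean_abs_error))
```

Because every sample is held out exactly once, the averaged score does not depend on one lucky or unlucky random split, which is the systematic property the paragraph above refers to. The folds here are contiguous; shuffling the rows first is a common variation.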

3.2.8. Performance Evaluation of Models


When the evaluation has been done, it should be checked whether any further improvement is
required. If, during the evaluation stage, the model did not achieve a suitable result, or
if there are over-fitting or under-fitting issues, we return to the training step.
Parameter tuning is therefore the next stage after evaluation; this step is known as
hyperparameter tuning and is one of the important stages. The initially set parameters are
varied to test whether the result can be improved and to correct any shortcomings of the
model. This stage is also called the experimental process.
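
Hyperparameter tuning is often organized as a grid search: every combination of candidate values is evaluated and the best one is kept. The sketch below shows only the search loop. The parameter names, the candidate values, and the scoring function `toy_score` are hypothetical; in practice the scoring function would train a model with the given parameters and return its mean cross-validation score.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score).

    `evaluate` maps a parameter dict to a validation score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid and scoring function, for illustration only.
grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]}


def toy_score(params):
    return -abs(params["n_estimators"] - 100) - abs(params["max_depth"] - 5)


print(grid_search(grid, toy_score))
```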

3.2.9. Generate Final Prediction


Once all the steps have been done, we get an answer to the question, i.e., the model
performs the task for which it was built: prediction. This is the last and one of the most
important steps, where the model is ready for practical application. The model that has
been trained on the data is now ready to draw conclusions. A well-executed and efficient
model can improve the decision-making process for the user. The final prediction of the
model was generated by running the best-performing algorithm on the selected parameters.

3.3. Model Design


The first task in this research is understanding the problem domain. This step includes an
overview of agriculture and the determinants of cereal crop yield. The data understanding
step covers domain-specific terminology, data description, and attribute selection. In the
data preparation step, data cleaning, data integration, and data reduction are applied.
The next step is building the model based on the selected algorithm, which is a regression
algorithm.

Figure 5. Block diagram for model design

The above architecture shows how the components of the system communicate with one
another, starting from the preprocessing of the data. The proposed framework is able to
predict the crop yield. The model gives a clear picture of how the large amount of data is
captured and preprocessed to remove unwanted entries, such as NULL values.

During the preprocessing step, the dataset is split into training and testing sets. The
training set is used to learn the crop yield patterns in the data with appropriate
supervised learning algorithms, and the trained model is then applied to predict the crop
yield for new data. After data acquisition, suitable machine learning algorithms must be
applied and their efficiency and capability compared; here we applied random forest
regression, SVR, decision tree regression, and gradient boosting regression. Measures such
as accuracy are calculated for the proposed model. The system architecture consists of
three parts: the data flow, the machine learning techniques, and the modules for crop
yield prediction and feature selection.

After cleaning and exploring the relationships among the features, the final data frame,
which contains all of the features to be used for the prediction process, has the
following columns:
• Area: country of production.
• Item: type of crop.
• Year: year of production.
• Average_rain_fall_mm_per_year: average amount of rain recorded that year.
• Hg/ha_yield: yield of the crop that year, in hectograms per hectare.
• Pesticides_tonnes: amount of pesticides used on the crop that year.
• Avg_temp: average temperature recorded for that year.

3.4. Models Under Consideration for Crop Yield Prediction
Research on crop yield prediction involves multiple factors of production and different
algorithms. Some of the algorithms are used to find the best feature subset for better
prediction, and others are used for the prediction itself. Multiple algorithms were used
in the current study so that their performance could be compared. It has long been
recognized that the generation of empirical models to estimate crop yield is an important
responsibility of the remote sensing community [59].

Machine learning is an essential decision support tool for crop yield prediction,
including supporting decisions on what crops to grow and what to do during the growing
season. Regression is a form of supervised machine learning that predicts continuous
values from labeled data, which makes it important for crop yield prediction. Many machine
learning algorithms have been used for crop yield prediction by numerous researchers.
Commonly used models for crop yield prediction are random forest regression, decision tree
regression, support vector machine (SVM), and gradient boosting regression.

3.4.1. Random Forest Regression


Random Forest Regression is a supervised learning algorithm that uses ensemble learning
for regression. It is flexible and straightforward to use. A random forest builds decision
trees on randomly selected data samples, gets a prediction from each tree, and combines
these predictions into the final result [60].

Random forest is a supervised learning algorithm that is used for both classification and
regression, although it is mainly used for classification problems. A forest is made up of
trees, and more trees mean a more robust forest. Similarly, a random forest algorithm
creates decision trees on data samples, gets a prediction from each of them, and finally
selects the best solution by means of voting. It is better than a single decision tree
because it reduces over-fitting by averaging the results [61].

As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and predicts the final output based on the
majority vote of the predictions [62].

Ensemble learning is a method that combines predictions from more than one machine
learning algorithm to make a more accurate prediction than a single model. The regression
algorithm is used in both modules: predicting the yield rate of a crop and predicting the
crop itself [63]. The working of the Random Forest algorithm is illustrated in Figure 6.

Figure 6. Working of random forest algorithm (Source: Adapted from [63])
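
The bootstrap-and-average idea behind random forest regression can be sketched as follows. This is a simplified illustration, not the implementation used in the study: each "tree" is a one-split stump on a single feature, so the random feature sub-sampling that a full random forest performs at each split does not apply, and the function names, number of trees, and seed are hypothetical. The wheat yields are taken from Table 3.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree on one feature, chosen by squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # identical x values cannot be split
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:                      # all x equal: always predict the mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]                       # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Regression forests average the tree outputs (classifiers vote)."""
    return sum(predict_stump(t, x) for t in forest) / len(forest)


years = [1961, 1962, 1963, 1964, 1965, 1966, 1967]
wheat = [7127, 7127, 7123, 7100, 7200, 7300, 7327]   # hg/ha, from Table 3
forest = fit_forest(years, wheat)
print(predict_forest(forest, 1966))
```

Averaging over trees trained on different bootstrap samples is what reduces the variance of a single tree; for classification the same structure is used with a majority vote in place of the mean.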

3.4.2. Decision Tree Regression


A Decision Tree is one of the most commonly used, practical approaches for supervised
learning. It can be used to solve both regression and classification tasks, with the
latter being used more in practical applications.

It is a tree-structured classifier with three types of nodes. The root node is the main
node that represents the entire sample and can be subdivided into additional nodes. The
internal nodes represent the features of the dataset, and the branches represent the
decision rules. Finally, the leaf nodes indicate the result. This algorithm is very useful
for solving decision-making problems [64].
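
For regression, a decision rule at a node is a threshold on one feature, and each leaf predicts the mean target of the training rows that reach it. The sketch below finds the single best threshold by minimizing the sum of squared errors; a full tree would apply the same search recursively to each side. The function names are hypothetical, and the wheat yields are taken from Table 3.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing the total SSE.

    Candidate thresholds are midpoints between consecutive sorted x values.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


years = [1961, 1962, 1963, 1964, 1965, 1966, 1967]
wheat = [7127, 7127, 7123, 7100, 7200, 7300, 7327]   # hg/ha, from Table 3
stump = best_split(years, wheat)
print(stump, predict(stump, 1962), predict(stump, 1967))
```

On this sample the chosen rule separates the years up to 1965 from 1966 onward, which is where the wheat yield series rises most sharply.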

The performance of the model is determined by comparing the actual values and the
estimated values at the final stage. By comparing these values, the accuracy of the model
can be estimated. Creating a graph of the values and seeing the results also helps to measure
the accuracy of the model [65].
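
Comparing actual and estimated values is usually summarized with error metrics. The sketch below computes three common regression metrics: mean absolute error, root mean squared error, and the coefficient of determination (R²). The choice of these three metrics is an assumption, since the text only speaks of accuracy; the actual values are maize yields from Table 3 and the predicted values are hypothetical.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2_score(actual, predicted):
    """1 - SS_res / SS_tot; 1.0 means perfect prediction."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


actual = [9700, 9850, 10000, 10078]      # maize yield in hg/ha, from Table 3
predicted = [9750, 9800, 10050, 10000]   # hypothetical model output

print(f"MAE  = {mae(actual, predicted):.1f} hg/ha")
print(f"RMSE = {rmse(actual, predicted):.1f} hg/ha")
print(f"R^2  = {r2_score(actual, predicted):.3f}")
```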

Figure 7. Working of decision tree regression algorithm (Source: Adapted from [64])

3.4.3. Support Vector Machines (SVM)


Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used
for both classification and regression challenges. However, it is mostly used in
classification problems. In the SVM algorithm, we plot each data item as a point in
n-dimensional space (where n is the number of features) with the value of each feature
being the value of a particular coordinate. Then, we perform classification by finding the
hyperplane that differentiates the two classes very well (see Figure 8) [66].

Support Vector Machine (SVM) is one of the most popular supervised learning algorithms and
is used for classification as well as regression problems. However, it is primarily used
for classification problems in machine learning.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is called a Support Vector
Machine. Figure 8 shows two different classes that are separated using a decision boundary
or hyperplane [67].

Figure 8. Working of SVM (Source: Adapted from [67])

SVMs have proven effective on all types of data, including tabular, text, and image data.
SVMs are known to work well even with a small number of training samples, scale well to
high-dimensional spaces, and have shown state-of-the-art performance in many problems in
the biomedical domain [68].

Support Vector Machines (SVMs) are a popular set of related supervised learning techniques
for data analysis and pattern detection, used for classification and regression analysis.
The methods vary in the structure and characteristics of the classifier. The most common
is the linear SVM classifier, which assigns each input to one of two possible classes.
More precisely, a support vector machine constructs a hyperplane or a set of hyperplanes.
The points closest to the class margin are called support vectors. The SVM's goal is to
maximize the margin between the hyperplane and the support vectors [69].
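
To make the margin idea concrete, the sketch below computes the decision value and the geometric distance of a point from a hyperplane w · x + b = 0, and adds the epsilon-insensitive loss that support vector regression (SVR) uses in place of a class margin. It does not train an SVM: the hyperplane, the points, and the value of epsilon are hypothetical and chosen only for illustration.

```python
from math import sqrt


def decision_value(w, b, x):
    """w . x + b: the sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w . x + b = 0."""
    return abs(decision_value(w, b, x)) / sqrt(sum(wi * wi for wi in w))


def epsilon_insensitive_loss(actual, predicted, epsilon):
    """Loss used by support vector regression: errors up to epsilon are free."""
    return max(0.0, abs(actual - predicted) - epsilon)


# Hypothetical 2-D hyperplane x1 + x2 - 3 = 0 and one point on each side.
w, b = [1.0, 1.0], -3.0
for point in ([2.0, 2.0], [1.0, 1.0]):
    print(point, decision_value(w, b, point), distance_to_hyperplane(w, b, point))

# Hypothetical yield predictions (hg/ha) with a tolerance of 100 hg/ha.
print(epsilon_insensitive_loss(9700, 9750, epsilon=100))
print(epsilon_insensitive_loss(9700, 9850, epsilon=100))
```

When the closest points on each side satisfy |w · x + b| = 1, as the two points above do, the width of the margin is 2 / ||w||, so maximizing the margin is equivalent to minimizing ||w||.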

Support vector machines are very popular, and many consider them the best off-the-shelf
classifier. Furthermore, there is a wide selection of environments and toolboxes that
implement SVMs. For these reasons, we chose to apply SVMs to the crop yield prediction
problem.

3.4.4. Gradient Boosting Regression


Gradient boosting is a machine learning technique for regression, classification, and
other tasks, which produces a prediction model in the form of an ensemble of weak
prediction models, typically decision trees [70].

The gradient boosting algorithm is one of the most powerful algorithms in the field of
machine learning. The errors of machine learning algorithms are broadly classified into
two categories: bias error and variance error. As gradient boosting is one of the boosting
algorithms, it is used to minimize the bias error of the model. The working process of the
gradient boosting algorithm is illustrated in Figure 9 [71].

Figure 9. Working of gradient boosting algorithm (Source: Adapted from [71])
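
The boosting loop can be sketched for squared-error regression as follows: start from the mean of the targets, then repeatedly fit a weak learner to the current residuals and add a damped copy of it to the model. This is a simplified illustration, not the implementation used in the study: the weak learner is a one-split stump, and the number of rounds and the learning rate are arbitrary. The wheat yields are taken from Table 3.

```python
def fit_stump(xs, ys):
    """Best single split on one feature, chosen by the sum of squared errors."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # identical x values cannot be split
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]                       # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict_gb(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


years = [1961, 1962, 1963, 1964, 1965, 1966, 1967]
wheat = [7127, 7127, 7123, 7100, 7200, 7300, 7327]   # hg/ha, from Table 3

model = fit_gradient_boosting(years, wheat)
baseline = mse(wheat, [sum(wheat) / len(wheat)] * len(wheat))
boosted = mse(wheat, [predict_gb(model, x) for x in years])
print(f"baseline MSE {baseline:.1f}, boosted MSE {boosted:.1f}")
```

Each round corrects part of the error left by the previous rounds, which is how boosting reduces bias. The comparison printed here is on the training data only, so it says nothing about over-fitting; that is what the held-out test set and cross-validation are for.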

Chapter Four
Result and Discussion of the Study

4.1. Design and Implementation


4.1.1. Hardware and Software Requirements
In this research, several hardware and software resources were used to test the proposed
algorithms. A personal computer with an Intel® Pentium CPU B960 at 2.2 GHz, 2.00 GB of
memory, and a 300 GB hard drive was used, running Microsoft Windows 10 Ultimate.
Microsoft® Excel® 2016 was used for statistical analysis (calculating the minimum,
maximum, average, and standard deviation) at the dataset analysis stage.

The Jupyter Notebook is an open-source web application that lets you create and share
documents that include live code, equations, visualizations, and narrative text. Its uses
include data cleaning and transformation, numerical simulation, statistical modeling, data
visualization, machine learning, and much more. The Jupyter Notebook project is the
evolution of the IPython Notebook library, which was developed primarily to enhance the
default Python interactive console by enabling scientific operations and advanced data
analytics capabilities through sharable web documents. Jupyter Notebooks work with what is
called a two-process model based on a kernel-client infrastructure. This model applies a
similar concept to the Read-Evaluate-Print Loop (REPL) programming environment, which
takes a single user's inputs, evaluates them, and returns the result to the user.

4.2. Data Gathering and Cleaning


The science of training machines to learn and produce models for future predictions is
widely used, and for good reason. Agriculture plays a vital role in the global economy.
With the continuing growth of the human population, understanding crop yield is central to
addressing food security challenges and reducing the impacts of climate change. Crop yield
prediction is an important agricultural problem.

Agricultural yield depends mainly on weather conditions (rain and temperature) and
pesticides, and accurate information about the history of crop yield is important for
making decisions related to agricultural risk management and future predictions. The basic
foods that sustain people are similar. In this study, the yields of the top four cereal
crops are predicted by applying different machine learning techniques. These crops are
maize, rice, sorghum, and wheat.

4.2.1. Crops Yield Data


The yield data for the four most consumed cereal crops in the country was downloaded from
the FAO website and loaded after importing the required libraries. The collected data
include the item, the year (from 1961 to 2016), and the yield value.
Table 3. Sample crop yield dataset

Item Code Item Year Unit Value


56 Maize 1964 hg/ha 9700
56 Maize 1965 hg/ha 9850
56 Maize 1966 hg/ha 10000
56 Maize 1967 hg/ha 10078
56 Maize 1970 hg/ha 10731
27 Rice, paddy 1993 hg/ha 18519
27 Rice, paddy 1994 hg/ha 18275
27 Rice, paddy 1995 hg/ha 18644
27 Rice, paddy 1996 hg/ha 18372
27 Rice, paddy 1997 hg/ha 18462
27 Rice, paddy 1998 hg/ha 18571
27 Rice, paddy 1999 hg/ha 18667
83 Sorghum 1964 hg/ha 8021
83 Sorghum 1965 hg/ha 8100
83 Sorghum 1966 hg/ha 8200
83 Sorghum 1967 hg/ha 8164
83 Sorghum 1968 hg/ha 6814
83 Sorghum 1969 hg/ha 8489
15 Wheat 1961 hg/ha 7127
15 Wheat 1962 hg/ha 7127
15 Wheat 1963 hg/ha 7123
15 Wheat 1964 hg/ha 7100
15 Wheat 1965 hg/ha 7200
15 Wheat 1966 hg/ha 7300
15 Wheat 1967 hg/ha 7327

The table above displays a small part of the crop yield dataset for different crop types
in different years.
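
The yield values in Table 3 are reported in hectograms per hectare (hg/ha). As a small
illustration, the sketch below converts a few of the sample rows to tonnes per hectare.
The function name `hg_per_ha_to_t_per_ha` is hypothetical and is not part of the thesis
code; only the conversion factor (1 tonne = 10,000 hectograms) is fixed.

```python
HG_PER_TONNE = 10_000  # 1 tonne = 1,000 kg = 10,000 hectograms


def hg_per_ha_to_t_per_ha(value_hg_per_ha: float) -> float:
    """Convert a yield from hectograms per hectare to tonnes per hectare."""
    return value_hg_per_ha / HG_PER_TONNE


# Sample rows copied from Table 3: (item, year, yield in hg/ha)
sample = [
    ("Maize", 1964, 9700),
    ("Rice, paddy", 1993, 18519),
    ("Sorghum", 1964, 8021),
    ("Wheat", 1961, 7127),
]

converted = [(item, year, hg_per_ha_to_t_per_ha(v)) for item, year, v in sample]
for item, year, tonnes in converted:
    print(f"{item} {year}: {tonnes:.2f} t/ha")
```
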
4.2.2. Climate Data
The climatic factors include rainfall and temperature. Together with pesticides and soil,
they are abiotic components of the environmental factors that influence plant growth and
development. Rainfall has a dramatic effect on agriculture. For this project, annual
rainfall data were gathered from the World Data Bank repository.

Table 4. Sample rainfall dataset


Year Average_rain_fall_mm_per_year

1963 910.08
1964 943.97
1965 749.42
1966 847.97
1967 1082.36
1968 892.22
1969 777.65
1970 817.73
1971 821.9
1972 842.86
1973 755.97
1974 794.31
1975 901.29
1976 896.37
1977 964.92
1978 772.47
1979 772.21
1980 704.63
1981 794.49
1982 928.02
1983 837.77
1984 629.57

The average temperature was collected from the World Data Bank repository. The temperature
series runs from 1901 to 2020 and contains some empty rows that have to be dropped.
Table 5. Sample temperature dataset

Year Average Temperature

1961 21.98
1962 22.03
1963 22.23
1964 21.82
1965 22.21
1966 22.31
1967 21.82
1968 21.83
1969 22.55
1970 22.48
1971 21.97
1972 22.42
1973 22.8
1974 22.16
1975 22.28
1976 22.59
1977 22.58
1978 22.61
1979 22.75
1980 23.04
1981 22.39

4.2.3. Pesticides Data
Data on the pesticides used were also collected from the FAO database.

Table 6. Sample pesticides dataset

Year Unit Value


1993 tonnes of active ingredients 242
1994 tonnes of active ingredients 242
1995 tonnes of active ingredients 242
1996 tonnes of active ingredients 383
1997 tonnes of active ingredients 383
1998 tonnes of active ingredients 383
1999 tonnes of active ingredients 492.5
2000 tonnes of active ingredients 602
2001 tonnes of active ingredients 630
2002 tonnes of active ingredients 822.67
2003 tonnes of active ingredients 1015.35
2004 tonnes of active ingredients 1208.03
2005 tonnes of active ingredients 1400.7
2006 tonnes of active ingredients 2603.1
2007 tonnes of active ingredients 2593.9
2008 tonnes of active ingredients 2270.6
2009 tonnes of active ingredients 3699.2
2010 tonnes of active ingredients 4128.1
2011 tonnes of active ingredients 4128.1
2012 tonnes of active ingredients 4128.1
2013 tonnes of active ingredients 4128.1
2014 tonnes of active ingredients 4128.1
2015 tonnes of active ingredients 4128.1
2016 tonnes of active ingredients 4128.1

4.3. Data Exploration


The final dataframe was obtained by joining four different dataframes, from FAO and the
World Data Bank, to collect all needed features. After cleaning and transforming the data
into a standardized form, we merged them into the final dataframe, yield_df, in order to
understand the relationship between these parameters. The final dataframe starts in 1993
and ends in 2016.
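
The join described above can be sketched in plain Python as an inner join on the year.
This is only an illustration under stated assumptions: the thesis notebook presumably
performs this step with dataframe operations, the helper `inner_join_on_year` and the
column names are hypothetical, and the values are the overlapping sample rows from
Tables 3 to 5 (maize yield only).

```python
def inner_join_on_year(**columns):
    """Join several {year: value} mappings on the year.

    Empty values (None) are dropped first, and only the years that remain in
    every mapping are kept, which is what an inner join does.
    """
    cleaned = {
        name: {year: v for year, v in col.items() if v is not None}
        for name, col in columns.items()
    }
    common_years = sorted(set.intersection(*(set(c) for c in cleaned.values())))
    return [
        {"Year": year, **{name: col[year] for name, col in cleaned.items()}}
        for year in common_years
    ]


# Sample values from Table 3 (maize), Table 4 (rainfall) and Table 5 (temperature)
maize_yield = {1964: 9700, 1965: 9850, 1966: 10000, 1967: 10078}
rainfall = {1963: 910.08, 1964: 943.97, 1965: 749.42, 1966: 847.97, 1967: 1082.36}
temperature = {1963: 22.23, 1964: 21.82, 1965: 22.21, 1966: 22.31, 1967: 21.82}

yield_rows = inner_join_on_year(
    hg_per_ha=maize_yield, rain_mm=rainfall, avg_temp=temperature
)
for row in yield_rows:
    print(row)
```

Because an inner join keeps only the years present in every source, the merged data can
cover no more than the years shared by all four datasets. This would be consistent with
the final dataframe starting in 1993, the first year in the pesticide sample.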

Next, we explore the relationships between the columns of the dataframe. A good way to
quickly check the correlations among columns is to visualize the correlation matrix as a
heatmap.

The correlation between all the features was calculated and illustrated with a
diverging-color heatmap.

Figure 10. Correlation map in the dataframe

The heatmap above indicates that the variables are largely independent of each other,
with no notable correlation between any pair of columns in the dataframe.
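
The correlation matrix behind such a heatmap can be sketched as follows. This is a minimal
plain-Python version of the Pearson correlation coefficient, not the thesis code; the
helper names are hypothetical, and the inputs are the 1963 to 1967 rainfall and
temperature rows from Tables 4 and 5. Five rows are far too few to draw conclusions, so
the example only shows the mechanics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a {name: values} mapping."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Rows for 1963-1967 taken from Table 4 (rainfall) and Table 5 (temperature)
columns = {
    "rain_mm": [910.08, 943.97, 749.42, 847.97, 1082.36],
    "avg_temp": [22.23, 21.82, 22.21, 22.31, 21.82],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```
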

4.4. Model Comparison & Selection
Before deciding on the algorithm to use, we must first evaluate, compare, and select the
one that best fits this particular dataset. Usually, when working on a machine learning
problem with a given dataset, we try different models and techniques to solve the
optimization problem and adopt the most appropriate model, one that neither overfits nor
underfits the data. We compare the following models for this project:
• Gradient Boosting Regressor
• Random Forest Regressor
• Support vector machines (SVM)
• Decision Tree Regressor
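
A comparison of the four models listed above could be set up along the following lines
with scikit-learn. This is a sketch under stated assumptions, not the thesis
configuration: the data are synthetic (generated only so that the example runs on its
own), default hyperparameters are used, and the 80/20 split and `random_state` values are
illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: three numeric features and a noisy target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(0.0, 0.1, size=200)

# 80/20 train/test split, one of the ratios compared in this chapter.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
    "SVR": SVR(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

# Fit every model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R2 = {score:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair; the scores
printed here depend on the synthetic data and say nothing about the thesis results.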

4.4.1. Result of Evaluation Metrics


The evaluation metric is the R² (coefficient of determination) regression score, which
represents the proportion of the variance in crop yield that is explained by the
regression model. The R² score shows how well the data points fit a curve or line. It is a
statistical measure, normally between 0 and 1, of how closely the regression predictions
match the data. If it is 1, the model explains 100% of the variance in the data; if it is
0, the model explains none of it. On held-out test data R² can also be negative, which
happens when the model performs worse than always predicting the mean.
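
The definition above corresponds to R² = 1 - SS_res / SS_tot, where SS_res is the sum of
squared prediction errors and SS_tot is the sum of squared deviations of the actual values
from their mean. The sketch below is a minimal plain-Python version; the function name
`r_squared` is hypothetical, the actual values are the 1964 to 1967 wheat rows from
Table 3, and the predictions are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Wheat yields (hg/ha) for 1964-1967 from Table 3
actual = [7100, 7200, 7300, 7327]
# Hypothetical model predictions, for illustration only
predicted = [7150, 7180, 7290, 7340]

print(f"R2 of the hypothetical predictions: {r_squared(actual, predicted):.3f}")

# Always predicting the mean gives exactly 0; doing worse than that gives a negative score.
mean_only = [sum(actual) / len(actual)] * len(actual)
print(r_squared(actual, mean_only))
print(r_squared(actual, [7400, 7000, 7000, 7000]) < 0)
```
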
In this study, the data were split at several train/test ratios. The following table
compares the machine learning models based on the value of R².

Table 7. R² results (%) for different train/test splits

Model                          90/10   80/20   70/30   60/40
Gradient boosting regression   92      93      87      88
Random forest regression       76      76      69      72
SVR                            73      71      73      71
Decision tree regression       75      88      83      8

From the results above, for the 80/20 train/test split the Gradient Boosting Regressor has
the highest R² score, 93.2%, and decision tree regression comes second.

The comparison of the models for the 80/20 train/test split is shown graphically below.

Figure 11. Model comparison

We also calculate the adjusted R², which indicates how well terms fit a curve or line but
adjusts for the number of terms in the model. If more and more useless variables are added
to the model, the adjusted R² decreases. If more useful variables are added, the adjusted
R² increases. The adjusted R² is always less than or equal to R².
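
The standard formula is adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the
number of samples and p the number of predictors. The sketch below applies it to the
reported R² of 0.932; the sample count and feature count used here are hypothetical,
since the size of the evaluation set is not stated in this section.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# R2 reported in this chapter; n_samples and n_features are illustrative assumptions.
reported_r2 = 0.932
print(f"{adjusted_r2(reported_r2, n_samples=20, n_features=7):.3f}")

# With R2 held fixed, adding predictors lowers the adjusted score.
print(adjusted_r2(0.9, n_samples=50, n_features=5) > adjusted_r2(0.9, n_samples=50, n_features=10))
```
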

Figure 12. Actual vs predicted yield

The figure above shows the goodness of fit, with the predictions visualized as a line. It
can be seen that the R² score is excellent. This means that we have found a well-fitting
model to predict the crop yield value for each year.

4.5. Model Results and Conclusions


The research work was carried out using a crop dataset collected from the FAO and World
Data Bank databases. It contains several basic cereal crops, namely wheat, rice, maize,
and sorghum. It includes prediction parameters such as temperature, rainfall, pesticides,
and year of harvest. For a predictive model, a machine learning system needs two kinds of
data, namely a training set and a test set. The training data are the survey data
collected from past events, while the current survey data are the test data.

The most common interpretation of R² is how well the regression model fits the observed
data. For example, an R² of 60% indicates that 60% of the variance in the data is
explained by the regression model. Generally, a higher R² indicates a better fit for the
model. From the results obtained, it is clear that the model fits the data to a very good
degree, 93.2%. Feature importance is calculated as the decrease in node impurity weighted
by the probability of reaching that node. The node probability can be calculated as the
number of samples that reach the node divided by the total number of samples. The higher
the value, the more important the feature. The top 7 features by importance for the model
are:

Figure 13. Level of feature importance
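
The impurity-based importance described above can be illustrated for a single split of a
regression tree, using variance as the impurity measure. This is a sketch, not the thesis
code: the helper names are hypothetical, and the split ("crop is maize" versus not) is
built from the maize and sorghum sample rows of Table 3. A feature's overall importance is
then the sum of these decreases over all nodes that split on it, usually normalized so
that the importances sum to 1.

```python
def variance(values):
    """Population variance, used here as the impurity of a regression node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Importance contribution of one split.

    p(node) * impurity(node) - p(left) * impurity(left) - p(right) * impurity(right),
    where p(.) is the fraction of all training samples that reach the node.
    """
    p_node = len(parent) / n_total
    p_left = len(left) / n_total
    p_right = len(right) / n_total
    return (
        p_node * variance(parent)
        - p_left * variance(left)
        - p_right * variance(right)
    )


# Yields in hg/ha from Table 3, split at the root on "crop is maize"
maize = [9700, 9850, 10000, 10078]
sorghum = [8021, 8100, 8200, 8164]
root = maize + sorghum

decrease = weighted_impurity_decrease(root, maize, sorghum, n_total=len(root))
print(f"impurity decrease of the split: {decrease:.1f}")
```
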

The crop being maize has the highest importance in the model's decision-making, and maize
is also the highest-yielding crop in the dataset. Rice follows; then, as expected, we see
the effect of pesticides, followed by rainfall and temperature. The initial assumption
that all of these features significantly affect the expected crop yield in the model was
correct. The boxplot shows the yield for each item: maize is the highest, followed by
rice, wheat, and sorghum.

Figure 14. Yield for each item.

Chapter Five
Conclusion and Recommendation
5.1. Conclusion
Agriculture is the research area to which the government gives the most attention, because
the Ethiopian economy is highly dependent on it. Since cereal crops dominate other types
of crop production, contributing more than 71% of total crop production, this thesis
focuses on cereal crop yield and the effects of different parameters on the production of
such crops. To improve crop yield prediction, machine learning techniques were implemented
and analyzed for cereal crops in the case of Ethiopia. Predicting the size of the crop can
inform on-farm decisions, such as how much pesticide is needed, and help farmers carefully
plan maintenance and labor schedules to be ready for the start of the harvesting season.
For crop yield prediction, the climate factors (temperature and rainfall) and the amount
of pesticides used had different impacts.

Developing accurate models for cereal crop yield estimation using machine learning
techniques may help farmers and other stakeholders improve decision-making in relation
to national food revenue and food security. The purpose of this study is to address the
problem of inaccurate crop yield prediction by farmers and governments. For the
experiments, the dataset was collected from FAO and the World Data Bank. The data were
then preprocessed to make them more understandable and used to build the machine learning
models. There are four datasets: temperature, rainfall, pesticides, and crop yield. Based
on our dataset, the model was developed using four data preprocessing techniques. The
prediction of cereal crop yield is based on applying the algorithms to the dataset, and
each dataset was analyzed in terms of the parameters that affect crop yield prediction.

The cereal crop yield prediction experiments were done by applying different machine
learning algorithms: random forest regression, decision tree regression, gradient boosting
regression, and support vector machine (SVM). The gradient boosting regression model
predicted better than the others when compared using R².

Each model was evaluated using cross-validation techniques, because we have limited data
and want to reduce overfitting. The gradient boosting regression model achieved better
test accuracy on our dataset than the random forest regression, decision tree regression,
and SVM algorithms, with an R² of 93.2%. The study also found that the parameters that
most strongly affect crop production are pesticides and the crop being maize. This study
will help to reduce the problems faced by farmers and will serve as an intermediary to
provide farmers with the information they need to maximize profits. It also shows that
machine learning algorithms are important in the agricultural sector, in yield prediction,
species management, field condition management, crop management, and livestock management.

5.2. Recommendation
In this work, different machine learning techniques were presented and implemented to
analyze and improve the predictive ability of ML algorithms. From the analysis of the
results, it is concluded that the Gradient Boosting Regressor achieved good results for
cereal crop yield prediction. However, many areas require additional work. Since
agriculture is the main source of food, further research should consider more machine
learning algorithms, AI, and deep learning technology for crop prediction, additional
parameters that affect crop production, and also different crop types and seasonal crops.
Since the current work covered only basic cereal crops and the rainy season, future work
should take into account the different seasons, climate factors such as humidity and wind,
and the other factors that affect crops. Such additional factors can be considered to
improve the prediction and reduce the error rate.

Finally, future work on this scenario would be more applicable if it included mobile and
cloud computing applications to better support the agriculture industry. Other than crop
yield prediction, machine learning techniques can also be applied to other agricultural
issues such as crop disease detection, weed detection, seed classification, irrigation
management, soil classification, and weather prediction.

53
References

[1] Central Statistical Agency(CSA), "Agriculture Sample Survey," Central Statistical


Agency(CSA), Addis Ababa,Ethiopia, 2011.

[2] MoFED (Ministry of Finance and Economic Development) , "Survey of the


Ethiopian economy," ,Addis Ababa, Ethiopia,, 2006.

[3] Central Statistical Agency (CSA), "Agricultural sample survey: Report on area and
production of major crops," Addis Ababa, 2019.

[4] (CSA), Central Statistical Agency, "Agricultural sample survey: Report on area and,"
Addis Ababa, 2017.

[5] Zhong L. Hu L. Zhou H., "Deep learning based multi-temporal crop classification,"
Remote Sens. Environ, vol. 221, p. 430–443, 2019.

[6] Rossana MC, L. D., "Prediction Model Framework for Crop Yield Prediction," in
Asia Pacific Industrial Engineering and Management Society Conference
Proceedings Cebu, Phillipines, 2013.

[7] You, J., Li, X., Low, M., Lobell, D., Ermon, S., "Deep Gaussian process for crop
yield prediction based on remote sensing data," in Proceedings of the Thirty-First
AAAI Conference on Artificial Intelligence, 2017.

[8] Basso, B., Liu, L., "Seasonal crop yield forecast: methods, applications, and
accuracies," Elsevier, vol. 154, no. Advances in Agronomy , p. 201–255, 2019.

[9] Chipanshi, A., Zhang, Y., Kouadio, L., Newlands, N., Davidson, A., Hill, H.,
Warren,R., Qian, B., Daneshfar, B., Bedard, F., et al, "Evaluation of the integrated
Canadian crop yield forecaster (ICCYF) model for in-season prediction of crop yield
across the Canadian agricultural landscape," vol. 206, no. Agri- cultural and Forest
Meteorology, p. 137–150, 2015.

[10] Fischer, R., "Definitions and determination of crop yield, yield gaps, and of rates of
change.," vol. 182, no. Field Crop Res, p. 9–18, 2015.

[11] C. Ozer., "Research on Machine Learning Methods and Its Applications," Real-
World Applications and Research, no. Machine Learning: Algorithms, 2018.

liv
[12] Lee JY, Ahn S, Kim D., "Deep learning-based prediction of future growth potential
of technologies," PLoS ONE, 2021.

[13] Kumar, V., & Garg, M.L., "Predictive Analytics: A Review of Trends and
Techniques," International Journal of Computer Applications,, 2018.

[14] IPCC (Intergovernmental Panel on Climate Change), "Climate change," no. The
scientific basis , 2007.

[15] World Bank, "Economics of adaptation to climate change study," World Bank,
Washington DC, 2008.

[16] Alemayehu N., Masafu M., Ebro A., Tegegne A., Gebru G., "Climate Change and
Variability in the Mixed Crop/Livestock Production Systems of Central Ethiopian
Highland," in Handbook of Climate Change Resilience, Springer International
Publishing, 2018, pp. 1-24.

[17] Francis Ndamani, Tsunemi Watanabe, "Influences of rainfall on crop production and
suggestions for adaptation," International Journal of Agricultural Sciences I, vol. 5
(1), 2015.

[18] Abate, T., Shiferaw, B., Menkir, A. et al., "Factors that transformed maize
productivity in Ethiopia," Springerlink, vol. 7, no. Food Sec., p. 965–981, 2015.

[19] Jansen, K., & Dubois, M., "Local pesticide governance by disclosure: Prior informed
consent and the Rotterdam convention," MIT Press, no. Transparency in
environmental governance, pp. 107-131, 2014.

[20] Abhinav Sharma, Arpit Jain, Prateek Gupta, Vinay Chowdary, "Machine Learning
Applications for Precision Agriculture: A Comprehensive Review," in IEEE Access,
vol. 9, pp. 4843-4873, 2021.

[21] Uno, Y., Prashera, S.O., Lacroix, R., Goela, P.K., Karimia, Y., Viauc, A., & Patel
R.M., " Artificial neural networks to predict corn yield from compact airborne
spectrographic imager data," vol. 47, no. Computers and Electronics in Agriculture,
p. 149–161, 2005.

[22] Priya, P., Muthaiah, U., & Balamurugan, M., "Predicting yield of the crop using
machine learning algorithm," International journal of engineering sciences &
research technology, vol. 7(4), pp. 1-7, 2018.

lv
[23] Witten, I. H., Frank, E. and Hall M., "Data mining: Practical machine learning tools
and techniques," vol. edition, San Francisco, Morgen Kaufmann, 2005.

[24] Russell, Stuart J.; Norvig, Peter, "Artificial Intelligence: A Modern Approach," vol.
Third ed., 2010.

[25] Van Otterlo, M.; Wiering, M., "Reinforcement learning and markov decision
processes. Reinforcement Learning. Adaptation, Learning, and Optimization," 2012.

[26] "Techniques of machine learning in agriculture," [Online]. Available:


https://www.iflexion.com/blog/machine-learning-agriculture. [Accessed 09 October
2021].

[27] S. Bhanumathi, M. Vineeth, and N. Rohit, "Crop yield prediction and efficient use of
fertilizers," in Proc. Int. Conf. Commun. Signal Process, Chennai, India, April 2019.

[28] Chlingaryan, A., Sukkarieh, S., & Whelan, B. , "Machine learning approaches for
crop yield prediction and nitrogen status estimation in precision agriculture," vol.
151, no. Computers and electronics in agriculture, pp. 61-69, 2018.

[29] Sarker, I.H., "Machine Learning: Algorithms, Real-World Applications and Research
Directions.," SN COMPUT. SC, vol. 2, 2021.

[30] R. Ghadge, J. Kulkarni, P. More, S. Nene and R. L. Priya,, "Prediction of crop yield
using machine learning," Int. Res. J. Eng. Technolgy, vol. 5, 2018.

[31] Lobell DB, Burke MB., "On the use of statistical models to predict crop ieldresponses
to climate change," no. Agricultural and Forest Meteorology, p. 1443–52, 2010.

[32] Tamari, S., Wosten, J. and Ruiz-Su ¨ arez, J, " Testing an artificial neural network for
predicting soil hydraulic conductivity," Soil Science Society of America Journal, vol.
60(6), p. 1732–1741, 1996.

[33] Abrougui, K., Gabsi, K., Mercatoris, B., Khemis, C., Amami, R. and Chehaibi, S.,
"Prediction of organic potato yield using tillage systems and soil properties by
artificial neural network (ann) and multiple linear regressions (mlr)," S, vol. 190, no.
oil and Tillage Research , p. 202–208, 2019.

[34] Sarmadian, F., Mehrjardi, R. T., Akbarzadeh, A. et al., "Modeling of some soil
properties using artificial neural network and multivariate regression in gorgan
province," Australian Journal of Basic and Applied Sciences, vol. 3(1), p. 323–329,
2009.

lvi
[35] Bocco, M., Willington, E., Arias, M. et al. , "Comparison of regression and neural
networks models to estimate solar radiation," Chilean Journal of Agricultural
Research, vol. 70(3), p. 428–435, 2010.

[36] Zaefizadeh, M., Jalili, A., Khayatnezhad, M., Gholamin, R. and Mokhtari, T.,
"Comparison of multiple linear regressions (mlr) and artificial neural network (ann)
in predicting the yield using its components in the hulless barley," no. Advances in
Environmental Biology, 2011.

[37] Safa, M. and Samarasinghe, S., "Determination and modelling of energy


consumption in wheat production using neural networks:," vol. 36(8), no. a case study
in Canterbury province, p. 5140–5147, 2011.

[38] Gonzalez-Sanchez, A., Frausto-Solis, J. and Ojeda-Bustamante, W., "Attribute


selection impact on linear and nonlinear regression models for crop yield prediction,"
The Scientific World Journal, 2014.

[39] Gopal, P. M., & Bhargavi, R., " A novel approach for efficient crop yield prediction,"
no. Computers and Electronics in Agriculture, 2019.

[40] Shastry, K. A., & Sanjay, H. A., "Cloud-Based Agricultural Framework for Soil
Classification and Crop Yield Prediction as a Service," no. Emerging Research in
Computing, Information, Communication and Applications, pp. 685-696, 2019.

[41] Pavan Patil, Virendra Panpatil, Prof. Shrikant Kokate, "Crop Prediction System using
Machine Learning Algorithms," International Research Journal of Engineering and
Technology (IRJET) , vol. 07 , no. 02, 2020.

[42] Islam, T., Chisty, T. A., & Chakrabarty, A., "A Deep Neural Network Approach for
Crop Selection and Yield Prediction in Bangladesh," in IEEE Region 10, Bangladesh,
2018.

[43] Feng, P., Wang, B., Li Liu, D., Xing, H., Ji, F., Macadam, I., ... & Yu, Q. , "Impacts
of rainfall extremes on wheat yield in semi-arid cropping systems in eastern
Australia," Vols. 147(3-4), no. Climatic change, pp. 555-569, 2018.

[44] Prakash, S., Sharma, A., & Sahu, S. S., "Soil Moisture Prediction Using Machine
Learning.," in 2018 Second International Conference on Inventive Communication
and Computational Technologies (ICICCT),, 2018.

lvii
[45] Giritharan Ravichandran, Koteeshwari R S., "Agricultural Crop Predictor and
Advisor using ANN for Smart phones," IEEE, 2016.

[46] Snehal S.Dahikar, Dr.Sandeep V.Rode, " Agricultural Crop Yield Prediction Using
Artificial Neural Network Approach," International Journal Of Innovative Research
In Electrical, Electronics, Instrumentation And Control Engineering, vol. 2(1), pp.
683-686., 2014.

[47] Rakesh Kumar, M.P. Singh, Prabhat Kumar, J.P. Singh,, "Crop Selection Method to
Maximize Crop Yield Rate using Machine Learning Technique," in International
Conference on Smart Technologies and Management for Computing
Communication, Controls, Energy and Materials (ICSTM), Vel Tech Rangarajan Dr.
Sagunthala R&D Institute of Science and Technology, Chennai, T.N., India, May
2015.

[48] Arun Kumar, Naveen Kumar and Vishal Vats, "Efficient crop yield prediction using
machine learning algorithms," International Research Journal of Engineering and
Technology (IRJET), vol. 05, pp. ISSN: 2395-0072, 2018.

[49] Aakunuri Manjula and Dr. G.Narsimha, "Crop Yield Prediction with Aid of Optimal
Neural Network in Spatial Data Mining," New Approaches, International Journal of
Information & Computation Technology ISSN 09742239 , vol. 6(1), pp. 25-33, 2016.

[50] Y. Everingham, J. Sexton, D. Skocaj, and G. Inman-Bamber. , "Accurate prediction


of sugarcane yield using a random forest algorithm," vol. 36(2) , no. Agronomy for
Sustainable, 2016.

[51] N. Gandhi, L. J. Armstrong, O. Petkar and A. K., "Tripathy, Rice crop yield
prediction in India using support vector machines," in 13th International Joint
Conference on Computer Science and Software Engineering (JCSSE), Khon Kaen,
2016 .

[52] M Kalimuthu ,P.Vaishnavi, M.Kishore, "Crop Prediction using Machine Learning,"


in Proceedings of the Third International Conference on Smart Systems and Inventive
Technology (ICSSIT2020), 2020.

[53] S. Veenadhari, B. Misra and C. Singh, " Machine learning approach for forecasting
crop yield based on climatic parameters," in International Conference on Computer
Communication and Informatics, Coimbatore, 2014.

lviii
[54] CH. Vishnu Vardhan chowdary, Dr.K.Venkataramana, "Tomato Crop Yield
Prediction using ID3," IJIRT, vol. 4, no. 10 , pp. 663-62, March 2018.

[55] Jun Wu, Anastasiya Olesnikova, Chi- Hwa Song, Won Don Lee, "The Development
and Application of Decision Tree for Agriculture Dat," IITSI, pp. 6-20, 2009.

[56] R. Sujatha and P. Isakki, " A study on crop yield forecasting using classification
techniques," in 2016 International Conference on Computing Technologies and
Intelligent Data Engineering (ICCTIDE'16), Kovilpatti, 2016.

[57] Ahmad, F. K., et al, " Daily stream flow prediction on time series forecasting.,"
Journal of Theoretical and Applied Information Technology, vol. 95(4), no. ISSN:
1992-8645 and E-ISSN: 1817-3195, 28th February 2017.

[58] Maya Gopal, P.S., Bhargavi, R., " Optimum Feature subset for optimizing crop yield
prediction using filter and wrapper approaches," Appl. Eng. Agri., vol. 35 (1), pp. 9-
14, 2019a.

[59] Kind, M.C., Brunner, R.J., TPZ, "Photometric redshift PDFs and ancillary
information by using prediction trees and random forests," Monthly Notices of the
Royal Astronomical Society, 2013.

[60] Provost F, Hibert C, Malet J P, et al., "Automatic classification of endogenous


seismic sources within a landslide body using random forest algorithm," in General
Assembly Conference Abstracts, 2016.

[61] Sadeh, I., Abdalla, F.B., Lahav, O.,, "Photometric redshift and probability
distribution function estimation using machine learning," in Publications of the
Astronomical Society of the Pacific, 2016.

[62] Tolles, Juliana; Meurer, William J., " Logistic Regression Relating Patient
Characteristics to Outcomes," JAMA, 2016.

[63] Schwartz M H, Rozumalski A, Truong W, et al., "Predicting the outcome of


intramuscular psoas lengthening in children with cerebral palsy using preoperative
gait data and the random forest algorithm," vol. 37(4), pp. 473-479, 2013.

[64] A. M. Ahmed, A. Rizaner and A. H. Ulusoy, "A Decision Tree Algorithm Combined
with Linear Regression for Data Classification," in 2018 International Conference
on Computer, Control, Electrical, and Electronics Engineering ICCCEEE), 2018.

lix
[65] Priyama A, Abhijeeta RG, Ratheeb A, Srivastavab S. , "Comparative analysis of
decision tree classification algorithms," International Journal of Current
Engineering and Technology, 2013.

[66] Achirul Nanda, M., Boro Seminar, K., Nandika, D. and Maddu, A., "A comparison
study of kernel functions in the support vector machine and its application for termite
detection," vol. 9(1), 2018.

[67] Dixon, B. and Candade, N., " Multispectral land use classification using neural
networks and support vector machines: one or the other, or both?’," International
Journal of Remote Sensing, 2008.

[68] Ben-Hur A, Ong CS, Sonnenburg S, Schölkopf B, Rätsch G , "Support Vector


achines and Kernels for Computational Biology," PLoS Comput Biol, 2008.

[69] Guo, Y., Yin, X., Zhao, X., Yang, D., Bai, Y., "Hyperspectral image classification
with SVM and guided filter," EURASIP Journal on Wireless Communications and
Networking, 2019.

[70] Piryonesi, S. Madeh; El-Diraby, Tamer E, "Data Analytics in Asset Management:


Cost-Effective Prediction of the Pavement Condition Index," Journal of
Infrastructure Systems, 2019.

[71] Flair Training, "Gradient Boosting Algorithm – Working and Improvements," 05


October 2021. [Online]. Available: www.https://data-flair.training/.

lx
Appendix
Predicting Crops Yield: A Machine Learning Approach
In this project, the yields of the top four most consumed cereal crops are predicted by
applying machine learning techniques. These crops are maize, rice, sorghum, and wheat.
1. Part One: Gathering & Cleaning Data
• Importing the required libraries

• Importing the data, checking for null values, merging, and dropping unwanted columns

2. Part Two: Data Exploration

• yield_df is the final dataframe obtained.
• To explore the relationships between the columns of the dataframe, a good way to
quickly check correlations among columns is to visualize the correlation matrix as a
heatmap.

3. Part Three: Data Preprocessing

• Encoding categorical variables

• Feature selection

• Scaling features:
The dataset contains features that vary widely in magnitude, units, and range. To
suppress this effect, we need to bring all features to the same level of magnitude.
This can be achieved by scaling.

• Training data:
Common splits are 70/30 or 80/20 for train/test. The training dataset is the initial
dataset used to train the ML algorithm to learn and produce correct predictions.
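
The preprocessing steps listed above (encoding, scaling, and splitting) can be sketched in
plain Python as follows. This is an illustration under stated assumptions rather than the
notebook code: the helper names are hypothetical, min-max scaling is assumed because the
scaler used is not named here, and the pairing of crop items with the 1963 to 1972
rainfall values from Table 4 is made up for the example.

```python
import random


def one_hot(values):
    """One-hot encode a list of category labels; returns (rows, categories)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices reproducibly and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


# Illustrative pairing of crop items with rainfall values (mm/year) from Table 4
items = ["Maize", "Rice, paddy", "Sorghum", "Wheat", "Maize",
         "Rice, paddy", "Sorghum", "Wheat", "Maize", "Wheat"]
rain = [910.08, 943.97, 749.42, 847.97, 1082.36,
        892.22, 777.65, 817.73, 821.9, 842.86]

encoded, categories = one_hot(items)
scaled_rain = min_max_scale(rain)
features = [enc + [r] for enc, r in zip(encoded, scaled_rain)]

train_rows, test_rows = split_train_test(features, test_fraction=0.2)
print(categories)
print(len(train_rows), len(test_rows))
```
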

4. Part Four: Model Comparison & Selection

Before deciding on an algorithm to use, we first need to evaluate, compare, and choose the
one that best fits this specific dataset. For this project, we compare the following
models: Gradient Boosting, Random Forest, SVM, and Decision Tree Regressor.

From the results viewed above, the Gradient Boosting Regressor has the highest R² score,
93%, and the Decision Tree Regressor comes second with 88%.

5. Part Five: Model Results & Conclusions


Feature importance is calculated as the decrease in node impurity weighted by the
probability of reaching that node. The node probability can be calculated by the
number of samples that reach the node, divided by the total number of samples. The
higher the value the more important the feature.

Getting only the top 7 feature importances in the model:

• Showing the level of feature importance