
Crop Yield Prediction Using ML Algorithms

Authors:-

Swayam Verma
Kalinga Institute of Industrial Technology
3rd Year Information Technology B.Tech

Shashwat Sinha
Kalinga Institute of Industrial Technology
3rd Year Information Technology B.Tech

Pratima Chaudhury
Kalinga Institute of Industrial Technology
3rd Year Information Technology B.Tech

Abstract-
Global climate change has badly affected most agricultural crops in India. This project allows farmers to estimate the
yield of their crops before cultivation and thus helps them make the necessary decisions. It uses Random Forest, a
machine learning algorithm. Although factors such as weather, temperature, humidity, and rainfall have been
researched, there are still no adequate solutions to the situation farmers face. In countries like India, economic
growth is increasing in many forms, including in the agricultural sector. The approach is also useful for forecasting
crop production.
Keywords- Machine Learning; Crop yield prediction; Random Forest algorithm

1. Introduction
India's agricultural history began in the Indus Valley Civilization period, and India is ranked second in this industry.
Agriculture and allied sectors accounted for 20.2% of GVA (gross value added) in fiscal year 2020-2021, which is 1.8%
higher than in the previous fiscal year 2019-2020, and for 18.8% of GVA, with 42.6% of the workforce, in fiscal year
2021-2022. In terms of net cultivated area, India leads the world with 9.6% of all arable land, followed by the US
(8.9%), China (8.8%), and Russia (8.8%). India's socio-economic fabric is based mostly on agriculture, although the
GDP contribution of agriculture is declining significantly as industrialization rises. Integration with technology is not
at the desired level, which is why the agriculture sector's full potential is not being used. Because of the overuse of
industrial technologies and non-renewable resources, rainfall and temperature, which affect crop yield, have become
difficult for farmers to predict. Here, machine learning can help farmers by using algorithms like RNN, LSTM, and
others to predict trends in temperature, rainfall, and crop yield. Because crops can be planned in advance according
to the prediction, this eases farmers' lives a little and increases the yield and quality of their harvest.
The practical implementation of machine learning techniques and its quantification are the main topics of this study.
To obtain a consistent trend, the work presented here also takes into account the erratic data from the temperature
and rainfall databases. Contrary to the customary practice of predicting crop yields from one factor at a time, this
method takes all of the factors into account.
The remainder of the paper is structured as follows. Section 2 surveys the research that was done before this paper.
Section 3 gives brief details on the data sources and datasets. Section 4 contains the methodology, which briefly
describes the different algorithms and the requirements for ML algorithms. Section 5 contains the proposed model.
Section 6 contains the crop yield calculation using the yield formula. Section 7 contains the results and analysis
obtained after processing the data in the Random Forest model. Section 8 lists the pros of the proposed model.
Section 9 contains the conclusion of the paper and future uses for the proposed model.

2. Literature Survey
Experiments by Aruvansh Nigam, Saksham Garg, and Archit Agrawal [1] on a dataset from the Indian government
showed that the Random Forest machine learning method provides the best yield forecast accuracy.
Balamurugan [2] implemented crop yield prediction using only the random forest classifier. Features such as
rainfall, temperature, and season were taken into account to predict the crop yield.
According to Dr. Y. Jeevan Nagendra Kumar [3], supervised learning allows machine learning algorithms to forecast
an objective or outcome. The study focuses on supervised learning methods for predicting crop yields.
Jig Han et al. [4] used a random forest algorithm to predict global and regional crop yields for potato, maize, and
wheat, using environmental variables such as soil, climate, photoperiod, fertilization data, and water.
Leo Breiman [5] examines the random forest algorithm's accuracy, strength, and correlation. The random forest
algorithm generates decision trees from different data samples, predicts the data from each subset, and then
provides the best answer for the system by voting.
Mishra [6] theoretically describes various machine learning techniques that can be applied in various forecasting
areas.
Using data mining techniques, Shastry et al. [7] fitted various regression models to forecast crop yield in India. The
crop yields of maize, wheat, and cotton are studied using time series data, soil, and weather parameters.
Manjula et al. [8] aimed to propose and implement a rule-based system to predict crop yield production
from past data by using association rule mining on agricultural data from 2000 to 2012.
Saeed Khaki, Lizhi Wang, and Sotirios V. Archontoulis [9] use a CNN-RNN model for crop yield prediction. It is used
to capture the time dependencies of environmental factors and the genetic improvement of seeds over time without
having their genotype information.
Mayank Champaneri, Darpan Chachpara, Chaitanya Chandvidkar, and Mansing Rathod [10] use the random forest
algorithm for crop yield prediction.
Thomas van Klompenburg, Ayalew Kassahun, and Cagatay Catal [11] compare different deep learning methods to find
the best-performing models for crop yield prediction.
S. Vinson Joshua and A. Selwin Mich Priyadharson [12] used General Regression Neural Networks (GRNN), Back
Propagation Neural Networks (BPNN), and Support Vector Machines (SVM) for crop yield prediction.

3. Data Source and Datasets


Acquiring a dataset for India is somewhat difficult: there is no official compilation of the required data, but
scattered datasets are available that can be merged to provide the desired yield data. The data shown in Table[1]
and Fig[1] were used throughout the paper. We also visualized the features of the dataset through machine learning
techniques, as shown in Fig[2], to better understand the attributes present in the dataset. For the same purpose, we
created a heatmap of the data showing the relationships between attributes, as shown in Fig[3].

Sl no. Attribute Description

1 States Andaman and Nicobar Islands, Andhra Pradesh, Arunachal Pradesh, Assam, Bihar, Chandigarh,
Chhattisgarh, Dadra and Nagar Haveli, Goa, Gujarat, Haryana, Himachal Pradesh, Jammu and Kashmir,
Jharkhand, Karnataka, Kerala, Madhya Pradesh, Maharashtra, Manipur, Meghalaya, Mizoram, Nagaland,
Odisha, Puducherry, Punjab, Rajasthan, Sikkim, Tamil Nadu, Telangana, Tripura, Uttar Pradesh,
Uttarakhand, West Bengal

2 Crops Arecanut, Other Kharif pulses, Rice, Banana, Cashewnut, Coconut, Dry ginger, Sugarcane, Sweet potato,
Tapioca, Black pepper, Dry chillies, other oilseeds, Turmeric, Maize, Moong(Green Gram), Urad, Arhar/Tur,
Groundnut, Sunflower, Bajra, Castor seed, Cotton(lint), Horse-gram, Jowar, Korra, Ragi, Tobacco, Gram,
Wheat, Masoor, Sesamum, Linseed, Safflower, Onion, other misc. Pulses, Samai, Small millets, Coriander,
Potato, Other Rabi pulses, Soyabean, Beans & Mutter(Vegetable), Bhindi, Brinjal, Citrus Fruit, Cucumber,
Grapes, Mango, Orange, other Fibers, Other Fresh Fruits, Other Vegetables, Papaya, Pome Fruits, Tomato,
Rapeseed & Mustard, Mesta, Cowpea(Lobia), Lemon, Pomegranate, Sapota, Cabbage, Peas, Niger seed,
Bottle Gourd, Sannhamp, Varagu, Garlic, Ginger, Oilseeds total, Pulses total, Jute, Peas & beans (Pulses),
Blackgram, Paddy, Pineapple, Barley, Khesari, Guar seed, Other Cereals & Millets, Cond-spcs other, Turnip,
Carrot, Redish, Arcanut (Processed), Atcanut (Raw),Cashew Nut Processed, Cashew Nut Raw, Cardamom,
Rubber, Bitter Gourd, Drum Stick, JackFruit, Snake Guard, Pump Kin, Tea, Coffee, Cauliflower, Other Citrus
Fruit, Water Melon, Total foodgrain, Kapas, Colocasia, Lentil, Bean, Jobster, Perilla, Rajmash Kholar,
Ricebean (nagadal), Ash Gourd, Beet Root, Lab-Lab, Ribbed Gourd, Yam, Apple, Peach, Pear, Plums,
Litchi, Ber, Other Dry Fruit, Jute & mesta

Table[1] Attribute List of dataset

Fig[1] Dataset sample

Fig[2] Graphical Representation of year, Area Size, Production, Temperature.

Fig[3] Correlation Matrix of Different attributes

4. Methodology
Data is crucial to machine learning. Data preprocessing is a technique used to turn raw data into a clean dataset.
The data are acquired from various sources, but because they are collected in raw form, they cannot be analyzed
directly. We convert the data into a comprehensible format using strategies such as substituting missing and null
values. The last step of data preprocessing is dividing the data into training and testing sets. Because training the
model typically requires as many data points as possible, the data are usually divided unevenly, with the larger
share going to training. The training dataset is the initial dataset used to teach ML algorithms how to learn and
make accurate predictions.
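
The preprocessing steps described above can be sketched as follows. This is a minimal illustration rather than the code used in this work: the column names, the sample records, the mean-imputation strategy, and the 80/20 split are all assumptions chosen for the example.

```python
import random
from statistics import mean

def fill_missing(rows, columns):
    """Replace None in each listed numeric column with that column's mean."""
    filled = [dict(r) for r in rows]  # copy so the raw records stay untouched
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        col_mean = mean(observed)
        for r in filled:
            if r[col] is None:
                r[col] = col_mean
    return filled

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them unevenly; most rows go to training."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records; None marks a missing value.
records = [
    {"area": 100.0, "rainfall": 900.0, "temperature": 27.0},
    {"area": 250.0, "rainfall": None, "temperature": 30.0},
    {"area": 80.0, "rainfall": 1100.0, "temperature": None},
    {"area": 120.0, "rainfall": 1000.0, "temperature": 24.0},
    {"area": 60.0, "rainfall": 800.0, "temperature": 29.0},
]

clean = fill_missing(records, ["area", "rainfall", "temperature"])
train, test = train_test_split(clean, test_fraction=0.2)
print(len(train), "training rows,", len(test), "testing rows")
```

Mean imputation is only one way to substitute missing values; a median or a per-state average may suit skewed agricultural data better.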
➢ Factors Affecting the Crop yield
Any crop's yield and production are influenced by a number of variables. In essence, these variables aid in
the prediction of crop yield over a specific time frame. We took into account variables like area, temperature,
rainfall, humidity, and wind speed in this research.
➢ Different Machine Learning algorithms
We must first assess and compare potential algorithms before selecting the one that best fits this particular
dataset. The best method for solving the crop production problem practically is machine learning.
Numerous machine learning methods are employed to forecast agricultural yield. The following machine
learning techniques for selection and accuracy comparison are included in this paper:
● Linear and logistic regression : Logistic regression is a supervised classification approach used
to forecast the likelihood of a target variable; in its basic form the target (dependent) variable is
binary, so there are only two possible classes. Linear regression, by contrast, determines whether
a dependent variable and the independent variables are linearly related. On our dataset, the
logistic regression technique delivers an accuracy of 87.8%.
● Random Forest : Random Forest can examine how crop growth is influenced by the prevailing
climatic factors and biophysical change. It is a supervised machine learning algorithm that is
frequently employed to solve classification and regression problems. The algorithm builds decision
trees from several data samples, obtains a prediction from each tree, and then determines the
final answer by majority voting among the trees. Random Forest trains with the bagging
approach, which increases the accuracy of the outcome. RF offers an accuracy of 90.47% on our
data. As a result, we use the Random Forest algorithm to analyze our data because it performs
more accurately than the alternative algorithm.

5. Proposed Model
The proposed model, shown in Fig[4], is a Random Forest model. It works in several steps:
1. When the algorithm starts, the datasets are loaded into the model and graphs are drawn from them.
Random samples are then taken from the datasets and processed into a form suitable for constructing
decision trees.
2. The decision trees are built using an attribute selection process, in which the selected attributes are a
subset of data points chosen by the user. Each tree receives a different set of data and forms its own set
of rules and formulas for predicting the result.
3. The result from each decision tree is collected and voted upon by the Random Forest classifier, and the
result that gets the most votes is selected as the final result.
4. The final result is displayed and graphs are drawn from it.
Fig[4] Proposed System Flowchart
The Random Forest algorithm is illustrated in Pseudo-code(1) in Table[2]. K features are chosen at random out of all
the features, and the best split point among them is used to split a node. In this way N trees are produced, each
with a node d and several daughter nodes. In this area of prediction, Random Forest provides the highest accuracy
because it trains N trees, and more trees generally lead to greater accuracy. It can also manage enormous volumes
of data.

Pseudocode(1) of the Proposed System:


1. Randomly select k features out of the total m features in the model.
2. Among the k features, calculate node d using the best split point.
3. Using the best split, split the node into daughter nodes.
4. Repeat steps 1 to 3 until the required number of nodes has been reached.
5. To build a forest of n trees, repeat steps 1 to 4 n times.
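
The steps of Pseudocode(1) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: Gini impurity is assumed as the split criterion, a depth limit stands in for the node-count stopping rule, each tree is grown on a bootstrap sample (the bagging approach), and the function names and the tiny (rainfall, temperature) dataset are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, k, rng, depth=0, max_depth=5):
    """Steps 1-4: pick k random features, split on the best one, recurse."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority                                   # leaf node
    feature_ids = rng.sample(range(len(rows[0])), k)      # step 1
    split = best_split(rows, labels, feature_ids)         # step 2
    if split is None:
        return majority
    f, t = split                                          # step 3: daughter nodes
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], k, rng, depth + 1, max_depth),
            grow_tree([r for r, _ in right], [y for _, y in right], k, rng, depth + 1, max_depth))

def grow_forest(rows, labels, n_trees, k, seed=0):
    """Step 5: grow n trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], k, rng))
    return forest

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical data: (rainfall in mm, temperature in deg C) -> crop label.
rows = [(1200, 28), (1100, 30), (400, 18), (350, 20), (1300, 27), (300, 16)]
labels = ["Rice", "Rice", "Wheat", "Wheat", "Rice", "Wheat"]
forest = grow_forest(rows, labels, n_trees=5, k=1)
print([predict_tree(tree, (1250, 29)) for tree in forest])
```

Here k must not exceed the number of features, and class labels are assumed to be strings so that leaves can be told apart from internal nodes.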

The voting process is highlighted in Pseudo-code(2) in Table[3] to provide the final result. Each trained tree utilizes a
random set of data to predict an outcome for each event. This process is repeated numerous times, saving the
results for each event. Next, the voting process is started, and each tree casts votes for each outcome. The outcome
with the most votes is then chosen as the outcome for the event. If two results are in conflict, the data are again voted
on, and the result with the highest number of votes is chosen.

To perform prediction, the trained random forest algorithm uses pseudocode(2) below, as shown in Fig[5]:
1. We used the test features and each random decision tree to predict the outcome, which was then saved.
2. The votes given by the decision trees for each predicted outcome were then counted.
3. Finally, we took the predicted outcome with the most votes as the random forest algorithm's final forecast.

Fig[5] Random Forest Algorithm Pseudocode
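
The voting procedure can be illustrated with a short Python sketch. The three "trees" below are hypothetical stand-in rules rather than trained decision trees, and the feature names and thresholds are assumptions made only for the example. Ties are broken here by taking the first label encountered, which is a simplification of the re-voting described above.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_predict(trees, features):
    """Collect one prediction per tree, then let the trees vote."""
    votes = [tree(features) for tree in trees]
    return majority_vote(votes)

# Hypothetical stand-ins for trained trees: each maps features to a crop label.
trees = [
    lambda x: "Rice" if x["rainfall"] > 1000 else "Wheat",
    lambda x: "Rice" if x["temperature"] > 25 else "Wheat",
    lambda x: "Rice" if x["humidity"] > 70 else "Maize",
]

sample = {"rainfall": 1200, "temperature": 22, "humidity": 80}
print(forest_predict(trees, sample))  # "Rice" receives 2 of the 3 votes
```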


6. Crop Yield Calculation
The crop predicted by the Random Forest Classifier was mapped to the production of that crop. The production was
then divided by the area entered by the user to get the crop yield.
Yield = Production / Area
Fig[6] shows the yield for the dataset:

Fig[6] Yield which indicates Production per unit Area.
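
As a worked example of the formula, the sketch below computes the yield for two records. The production and area figures are hypothetical, and the units (for instance tonnes and hectares) are an assumption, since they depend on the source dataset.

```python
def crop_yield(production, area):
    """Yield = Production / Area; area must be positive."""
    if area <= 0:
        raise ValueError("area must be positive")
    return production / area

# Hypothetical records, not taken from the dataset used in this work.
records = [
    {"crop": "Rice", "production": 3000.0, "area": 1200.0},
    {"crop": "Wheat", "production": 1500.0, "area": 500.0},
]

for r in records:
    r["yield"] = crop_yield(r["production"], r["area"])
    print(r["crop"], r["yield"])  # Rice 2.5, then Wheat 3.0
```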

7. Result & Analysis


The collection contains around 2 lakh (200,000) records. We tested the dataset with several machine learning
methods in order to choose the preferred algorithm for the study. First, we examined the dataset using the random
forest and linear regression algorithms. The random forest algorithm's R2 score was far higher than that of linear
regression, whose coefficient of determination was -66.59 compared with 0.95 for random forest, so linear
regression is clearly not suitable for the dataset. This is because linear regression presumes homoscedasticity,
multivariate normality, and the absence of multicollinearity, all of which the dataset would have to satisfy. The
difference can be seen in the plotted graph in Fig[7], which illustrates the prediction of yield using both
algorithms.

Fig[7] Prediction of Yield through Linear Regression & Random Forest Model
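
For reference, the coefficient of determination is R2 = 1 - SS_res / SS_tot, and it becomes negative whenever a model's squared error exceeds that of always predicting the mean. The sketch below shows this with made-up numbers; the values are purely illustrative and are not results from this study.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]          # hypothetical observed yields
close_fit = [1.0, 2.0, 3.0, 5.0]       # predictions that track the data
poor_fit = [10.0, -5.0, 12.0, -3.0]    # predictions worse than the mean

print(r2_score(y_true, close_fit))  # close to 1
print(r2_score(y_true, poor_fit))   # strongly negative
```

A strongly negative score therefore means the fitted model is worse than a constant baseline, which is consistent with rejecting linear regression for this dataset.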
After comparing linear regression and random forest regression, we analyzed a decision tree regressor, whose R2
value was 0.93, as shown in Fig[8]. This is lower than the R2 score of the random forest, indicating that the random
forest was the most effective technique for the dataset in question, with an accuracy of 95.32% and a standard
deviation of 4.72%, as shown in Fig[9].

Fig[8] Prediction of Yield using Decision Tree Regression

Fig[9] Accuracy and Standard Deviation of Proposed Model
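
An accuracy reported together with a standard deviation is commonly obtained by scoring a model on several folds of the data and summarizing the per-fold scores. The sketch below assumes such a k-fold procedure, which this paper does not describe in detail; the fold scores and helper names are hypothetical.

```python
from statistics import mean, pstdev

def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

def summarize(scores):
    """Return the mean and population standard deviation of the fold scores."""
    return mean(scores), pstdev(scores)

folds = list(k_fold_indices(10, 3))   # 10 rows split into folds of 4, 3, and 3
fold_scores = [0.90, 0.95, 1.00]      # hypothetical per-fold accuracies
avg, spread = summarize(fold_scores)
print(f"accuracy {avg:.2%} +/- {spread:.2%}")
```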

8. Pros of using Random forest model


● Reduced risk of overfitting: Decision trees run the risk of overfitting as they tend to tightly fit all the
samples within training data. However, when there is a robust number of decision trees in a random forest,
the classifier is much less likely to overfit, since averaging uncorrelated trees lowers the overall variance
and prediction error.
● Provides flexibility: Since random forest can handle both regression and classification tasks with a high
degree of accuracy, it is a popular method among data scientists. Feature bagging also makes the random
forest classifier an effective tool for estimating missing values as it maintains accuracy when a portion of the
data is missing.
● Easy to determine feature importance: Random forest makes it easy to evaluate variable importance, or
contribution, to the model. There are a few ways to evaluate feature importance. Gini importance, also
called mean decrease in impurity (MDI), measures how much a given variable reduces node impurity
across the splits in all trees. Permutation importance, also known as mean decrease accuracy (MDA), is
another important measure. MDA identifies the average decrease in accuracy when the values of a feature
are randomly permuted across samples.
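
Permutation importance (MDA) can be sketched in a few lines of Python. This is a minimal illustration, assuming accuracy as the score and a toy rule in place of a trained forest; the data, the `model` function, and the number of repeats are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows for which the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for f in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[f] for r in rows]
            rng.shuffle(column)  # break the link between feature f and the label
            permuted = [r[:f] + (v,) + r[f + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical data: the label depends only on feature 0; feature 1 is noise.
rows = [(i % 2, (i * 7) % 5) for i in range(20)]
labels = [r[0] for r in rows]
model = lambda r: r[0]  # stand-in for a trained model's predict function

imp = permutation_importance(model, rows, labels, n_features=2)
print(imp)  # feature 0 shows a positive drop, feature 1 shows none
```

Unlike MDI, this measure needs only a prediction function and held-out data, so it can be applied to any fitted model.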
9. Conclusion & Future Work
The paper outlined a variety of machine learning techniques for estimating agricultural output based on area, season,
temperature, and rainfall. Experiments using datasets from the Indian government showed that the Random Forest
Regressor has the best accuracy for predicting yield. This will enable farmers in India to determine the yield they
may anticipate in a particular climate and adjust the timing of crop planting accordingly. In future work, we can try
to create a data-independent system, one that performs accurately regardless of the data format. Since crop
selection also takes soil knowledge into account, it is advantageous to incorporate soil information into the system.
Effective irrigation is also necessary for crop cultivation, and rainfall data can indicate whether additional water is
needed.

10. References
1. Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001). https://doi.org/10.1023/A:1010933404324
2. U. Muthaiah, "Predicting yield of the crop using machine learning algorithm", International journal of
engineering sciences & research Technology(IJESRT), 5.164,UGC Approved (2018)
3. Mishra, Subhadra & Mishra, Debahuti & Santra, Gour. (2016). Applications of Machine Learning Techniques
in Agricultural Crop Production: A Review Paper. Indian Journal of Science and Technology. 9. doi:
10.17485/ijst/2016/v9i38/95032.
4. Breiman, L. (2001). Random Forests. Machine Learning, 45, 5-32.
5. Mahore, Pallavi & Bardekar, Dr. (2021). Crop Yield Prediction. International Journal of Scientific Research in
Computer Science, Engineering and Information Technology. 561-569. 10.32628/CSEIT2173168.
6. Champaneri, Mayank & Chachpara, Darpan & Chandvidkar, Chaitanya & Rathod, Mansing. (2020). CROP
YIELD PREDICTION USING MACHINE LEARNING. International Journal of Science and Research
(IJSR). 9. 2.
7. Khaki S, Wang L, Archontoulis SV. A CNN-RNN Framework for Crop Yield Prediction. Front Plant Sci. 2020
Jan 24;10:1750. doi: 10.3389/fpls.2019.01750. PMID: 32038699; PMCID: PMC6993602.
8. Champaneri, Mayank & Chachpara, Darpan & Chandvidkar, Chaitanya & Rathod, Mansing. (2020). CROP
YIELD PREDICTION USING MACHINE LEARNING. International Journal of Science and Research
(IJSR). 9. 2.
9. van Klompenburg, T., Kassahun, A., & Catal, C. (2020). Crop yield prediction using machine learning: A
systematic literature review. Computers and Electronics in Agriculture, 177, [105709].
https://doi.org/10.1016/j.compag.2020.105709
10. Anakha Venugopal, Aparna S, Jinsu Mani, Rima Mathew, Vinu Williams, 2021, Crop Yield Prediction using
Machine Learning Algorithms, INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH &
TECHNOLOGY (IJERT) NCREIS – 2021 (Volume 09 – Issue 13),
11. S. Vinson Joshua, A. Selwin Mich Priyadharson, R. Kannadasan, A. Ahmad Khan, W. Lawanont et al.,
"Crop yield prediction using machine learning approaches on a wide spectrum," Computers, Materials &
Continua, vol. 72, no.3, pp. 5663–5679, 2022.
12. Lontsi Saadio, Cedric and Adoni, Wilfried Yves Hamilton and Aworka, Rubby and Zoueu, Jérémie
Thouakesseh and Kalala Mutombo, Franck and Kimpolo, Charles Lebon Mberi, Crops Yield Prediction
Based on Machine Learning Models: Case of West African Countries. Available at SSRN:
https://ssrn.com/abstract=4003105 or http://dx.doi.org/10.2139/ssrn.40
