1132 IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, VOL. 8, NO. 5, OCTOBER 2021

Prediction of Land Suitability for Crop Cultivation Based on Soil and Environmental Characteristics Using Modified Recursive Feature Elimination Technique With Various Classifiers

G. Mariammal, A. Suruliandi, S. P. Raja, and E. Poongothai

Abstract—Crop cultivation prediction is an integral part of agriculture and is primarily based on factors such as soil, environmental features like rainfall and temperature, and the quantum of fertilizer used, particularly nitrogen and phosphorus. These factors, however, vary from region to region; consequently, farmers are unable to cultivate similar crops in every region. This is where machine learning (ML) techniques step in to help find the most suitable crops for a particular region, thus assisting farmers a great deal in crop prediction. The feature selection (FS) facet of ML is a major component in the selection of key features for a particular region and keeps the crop prediction process constantly upgraded. This work proposes a novel FS approach called modified recursive feature elimination (MRFE) to select appropriate features from a data set for crop prediction. The proposed MRFE technique selects and ranks salient features using a ranking method. The experimental results show that the MRFE method selects the most accurate features, while the bagging technique helps accurately predict a suitable crop. The performance of the proposed MRFE technique is evaluated using various metrics such as accuracy (ACC), precision, recall, specificity, F1 score, area under the curve, mean absolute error, and log loss. The performance analysis shows that the MRFE technique, with 95% ACC, performs better than other FS methods.

Index Terms—Agriculture, classification, crop prediction, feature selection (FS), modified recursive feature elimination (MRFE).

Manuscript received January 28, 2021; revised March 31, 2021; accepted April 17, 2021. Date of publication May 5, 2021; date of current version September 30, 2021. (Corresponding author: G. Mariammal.)
G. Mariammal and A. Suruliandi are with the Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli 627012, India (e-mail: suba.g1212@gmail.com; suruliandi@yahoo.com).
S. P. Raja is with the School of Computer Science and Engineering, Vellore Institute of Technology, Vellore 632014, India (e-mail: avemariaraja@gmail.com).
E. Poongothai is with the Department of Computer Science and Engineering, SRM University, Chennai 603203, India (e-mail: poongothai.rp@gmail.com).
Digital Object Identifier 10.1109/TCSS.2021.3074534

I. INTRODUCTION

AGRICULTURAL research has strengthened the economy worldwide, and is an area that offers humanity, as a whole, inestimable benefits. Crop prediction in agriculture continues to present difficulties, notwithstanding current developments that include the use of an array of technological resources, tools, and procedures. Agri-technology and precision farming, now termed virtual farming, have emerged as new scientific areas of interest that use data-intensive methods to boost agricultural productivity and reduce the impact on the environment. Accurately identifying crops for cultivation, based on soil and environmental factors, is critical to agricultural productivity and has been an active research topic for decades. Most of the existing approaches use machine learning (ML) for crop yield estimation, though very little has been done to predict region-specific crops based on soil and environmental parameters. Parameters such as soil type, nutrients (nitrogen, phosphorus, and potassium), micronutrients (iron, boron, and manganese), temperature, and rainfall influence crop cultivation. Since the parameters differ for every zone, making for a massive crop prediction data set, there is a need to select crucial features that help identify suitable crops for specific areas of land. This process is carried out using feature selection (FS) techniques.

ML algorithms play a major role in prediction. For enhanced ML performance, FS techniques [1]-[6] are used to reduce overfitting and ascertain salient features from the data set for the prediction process. FS techniques are divided into three categories: filter [7], wrapper [8], and embedded [9]. Filter methods are independent of the performance of the classifier, whereas wrapper methods select features based on its performance. The embedded method, which combines the filter and wrapper methods, is somewhat similar to the latter. This work pays special attention to wrapper FS techniques. The features selected are fed to the k-nearest neighbor (kNN), Naive Bayes (NB), decision tree (DT), support vector machine (SVM), random forest (RF), and bagging classifiers to predict a suitable crop and evaluate the performance of the FS process. The objective of this work is to select key features from a data set and improve crop prediction performance. The main contribution of this work is to propose a novel modified recursive feature elimination (MRFE) technique that selects the most appropriate key features using a permuted crop data set based on soil and environmental factors; because it uses the permuted data set, the algorithm need not be updated with the data set at each iteration, which reduces the computational time compared with the existing RFE method.

A. Related Work

Several studies on FS that have been undertaken for improved prediction are discussed in this section.
2329-924X © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY WARANGAL. Downloaded on November 15,2022 at 06:03:48 UTC from IEEE Xplore. Restrictions apply.

Gregorutti et al. [10] compared the RFE and non-RFE (NRFE) techniques. The permutation importance (PIMP) measure was used as a ranking criterion for FS, and the technique was tested on the Landsat satellite data collected from the University of California Irvine (UCI) ML repository. From the results, it was concluded that the RFE is more efficient than the NRFE. Hall and Holmes [11] compared several FS techniques and used benchmark data sets for evaluation. The results show that the wrapper technique is best for FS. Liu and Yu [12] analyzed the existing FS techniques for classification and clustering. Certain real-world applications were used in their work to demonstrate the FS techniques. Granitto et al. [13] compared RF-RFE with SVM-RFE. A performance evaluation was carried out using the proton transfer reaction-mass spectrometry (PTR-MS) data of agro-industrial products. Their analysis concluded that the RF-RFE works better than the SVM-RFE. Araúzo-Azofra and Benítez [14] used 36 data sets from the UCI, Orange (Org), and Silicon Graphics (SGI) repositories to evaluate miscellaneous FS techniques. An experimental analysis concluded that the wrapper approach is the best for selecting features. Altmann et al. [15] proposed an improved RF model with the PIMP measure for FS. The PIMP ranking measure and Gini importance were compared to find that the PIMP-RF model significantly outperformed the Gini-RF model. Kursa and Rudnicki [16] described the Boruta FS technique, with the Boruta package providing their algorithm a convenient interface and the Madelon data set being used for their experimental analysis. Ruß and Kruse [17] proposed a novel FS technique for wheat yield prediction with two regression models, support vector regression (SVR) and the regression tree (Reg tree), for a comparison. Darst et al. [18] compared the RF and RF-RFE in terms of the selection of variables, and concluded that the latter was not likely to scale to high-dimensional data. Hsieh et al. [19] used the RFE algorithm to select key features that impact rice blast disease (RBD). Their work analyzed climatic data collected over five years. Table I illustrates a comparison of the characteristics of the proposed MRFE technique with existing FS techniques such as sequential forward feature selection (SFFS), Boruta, and RFE.

B. Motivation and Justification

Farming plays a critical role in the global economy, in which crop prediction is a decisive factor. FS and classification [20] are central to the crop prediction process. The literature review makes it plain that the wrapper FS technique [21]-[23] predicts crops better than existing techniques. The RFE technique is a wrapper-type FS method that works by searching for a subset of features, commencing with all features in the training data set and thereafter successively removing features until only a desired number remains. The RFE method ranks appropriate features in terms of their importance, discarding the least important ones. The feature that is selected impacts classification accuracy (ACC) as well. This method, however, needs an iterative process for updating the data set in the feature elimination process. Updating the data set is the most difficult part of the RFE, and maximum time is taken to eliminate weak features. Motivated by these facts, this work proposes a new FS technique called the MRFE to overcome the limitations of the RFE. The efficiency of the MRFE is analyzed, following the results of the experiments. After the features are selected, classification algorithms take the lead in the prediction process. In much of the research, single prediction models (such as the kNN [14], NB [24], DT [25], and SVM [26]), along with ensemble prediction models (like the RF [27] and bagging [28] techniques), have been used for crop prediction. Each algorithm displays prediction characteristics of its own. However, there is a need to find the classifier that works best with the proposed FS technique for crop prediction. Therefore, this work analyzes the performance of each classifier with the proposed MRFE technique to predict the most suitable crops for specific land areas.

C. Outline of the Work

Fig. 1 depicts the overall process of the proposed work. [Fig. 1: Outline of the work.] The data set containing soil and environmental features is preprocessed to find missing values and remove redundant data. The preprocessed data are then fed into the proposed MRFE FS algorithm. The features selected are input into the classifier for the learning process. This work uses a supervised learning technique for the prediction process. Training samples are trained with the classifier, and unknown samples are provided to validate the trained classifier. Finally, the results are evaluated, using certain performance metrics, to produce the most suitable crop.

D. Organization of This Article

The remaining part of this article is organized as follows. Section II describes the existing FS techniques and the proposed MRFE technique. Section III discusses the existing classification techniques used to predict the suitable crop. Section IV depicts the crop prediction procedure for cultivation. Section V analyses the experimental results, and Section VI concludes the work.
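As a concrete illustration of the standard RFE loop described in Section I-B (rank all features, drop the weakest, refit, repeat), the sketch below uses scikit-learn's RFE with an RF estimator. The synthetic data, feature counts, and estimator settings are stand-ins for illustration, not the authors' configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for a crop data set: 16 features, as in Section V-A.
X, y = make_classification(n_samples=200, n_features=16, n_informative=8,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# RFE fits the estimator, ranks features by importance, drops the weakest,
# and refits on the reduced set until 8 features remain.
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=8, step=1)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the 8 selected features
print(selector.ranking_)   # rank 1 = selected; larger ranks were dropped earlier
```

Note that each pass refits the estimator on the reduced data set, which is exactly the per-iteration updating cost the proposed MRFE is designed to avoid.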


II. FS TECHNIQUES

FS, which is a preprocessing step in ML [11], removes irrelevant features so as to render the classification models most efficient [24]. Sections II-A and II-B describe existing wrapper FS techniques, such as SFFS, Boruta, and RFE, and the proposed technique, MRFE.

A. Existing FS Techniques

1) Sequential Forward Feature Selection: Sequential feature selection (SFS) is a wrapper-based FS technique. This algorithm is divided into two, SFFS and sequential backward feature selection (SBFS). This work takes the SFFS for the FS process, the working of which is given in [29]. It starts with an empty set, selects important features from the data set, and repeats the process until every important feature is selected. The SFFS algorithm is based on the Akaike information criterion (AIC) value for FS [30].

2) Boruta: The Boruta algorithm is a wrapper FS technique built around the RF classification algorithm. The advantage of RF classification is that it runs quickly and, in addition, estimates the importance of features [16]. The results provide a Z score. In the Boruta, the Z score has a great impact on the FS technique. The pseudo code of the Boruta algorithm is given in [31].

3) Recursive Feature Elimination: The RFE is the most frequently used wrapper FS technique. The RFE starts with the whole data set and removes its weak features using a ranking method. It then updates the data set and continues the process until all the weak features are eliminated. In the RFE, the Gini importance ranking method is used for feature elimination. The pseudo code for the RFE technique is given in Algorithm 1.

Algorithm 1: The Pseudo Code of the RFE Algorithm [13]
  Inputs:
    Training set T
    Set of p features F = {f1, ..., fp}
    Ranking method RM(T, F)
  Output:
    Final ranking R
  Code:
    repeat for i in (1:p)
      Rank set F using RM(T, F)
      f* <- last ranked feature in F
      R(p - i + 1) <- f*
      F <- F - f*
    end for

B. Proposed FS Technique

1) Modified Recursive Feature Elimination: The proposed MRFE technique removes weak features from the data set using a permuted data set and a ranking method. The permutation step shuffles the values in each field of a duplicate of the crop data set fed as input. Fig. 2 shows the process of the MRFE technique. [Fig. 2: Flow diagram of MRFE process.]

Step 1: Initiating the Permutation Process:
1) The given input crop data set is considered an n x m matrix, where n represents the records of each area and m represents the features.

   Example: Given matrix
     1 4 7
     2 5 8
     3 6 9
   Shuffled matrix
     4 1 7
     8 5 2
     3 6 9

   This process does not affect the feature values. A data set that contains n x m records shows no change following the permutation application.

2) The shuffled data set is then combined with the input data set, i.e., the crop data set. In the example below, the given matrix is combined with the shuffled matrix.

   Combined matrix
     1 4 7 4 1 7
     2 5 8 8 5 2
     3 6 9 3 6 9

3) Finally, the combined crop data set is merged with the original data set, and the shuffling process is then applied to obtain the extended data set, also known as the permuted crop data set. Furthermore, this data set is used to find the importance of features. In the example below, the given matrix is extended with the help of the combined matrix.

   Extended matrix
     3 4 9 4 7 1 8 5 9
     1 6 7 5 2 8 4 6 2
     2 5 8 6 9 3 3 1 7

Extending the data set results in a drop in the standard deviation value, indicating that the values are close to the mean. The permutation process offers two distinct advantages. The first is its ability to standardize the coefficients of the variables to help the ranking process eliminate weak features from the data set. The second is that the model needs no retraining, forward or backward, thus making it faster than the existing RFE technique.
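Step 1 can be sketched in NumPy. Following the worked 3 x 3 example, the shuffle permutes values within each record (row), cbind is column-wise concatenation, and the extended matrix is a reshuffle of the original bound to the combined matrix. This is an illustrative reading of the permutation step, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def shuffle_rows(m):
    """Permute the values inside each record (row), as in the worked example."""
    return np.array([rng.permutation(row) for row in m])

a = np.array([[1, 4, 7],
              [2, 5, 8],
              [3, 6, 9]])            # given n x m crop matrix

b = shuffle_rows(a)                  # shuffled matrix
c = np.hstack([a, b])                # combined matrix (cbind), n x 2m
d = shuffle_rows(np.hstack([a, c]))  # extended (permuted) matrix, n x 3m

print(c.shape, d.shape)
```

Each row of d is only a rearrangement of the original values, matching the paper's note that the permutation does not affect the feature values.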


Step 2: Finding the Most Important Features:
The RF classifier is used to discover the most important features, as well as the mean decrease value that helps find the Z score. The extended crop data set is fed into the RF classifier to find the most important features. The two main parameters of the RF classifier are as follows.
1) mtry: This refers to the number of variables used at each split, and is called the mtry parameter. The recommended value for mtry is the square root of the number of features.
2) ntree: This refers to the number of trees, called the ntree parameter, which decides the splitting range of trees in the forest; the default ntree used in the RF classifier is 100.

Step 3: Finding the Z Score:
The Z score is the standard score used to compare the importance of the features selected. To fine-tune the performance of the RF classifier and evaluate it in this work, the ntree value is varied from 100 to 500. The basic Z score formula, applied to the mean decrease in accuracy, is given as follows:

  Z score = (x - mu) / sigma

where x represents the observed value (the mean decrease in accuracy), mu the mean value of the samples, and sigma the standard deviation of the samples.

Step 4: Applying the Ranking Method:
Finally, a ranking method is applied to find weak soil and environmental features in the data set. Several ranking methods [32], [33] are used for FS. This work evaluates the performance of rank aggregation [34], Gini importance [27], PIMP [15], and actual impurity reduction importance (AIRIMP) [35] to find the best ranking method for FS so as to refine the crop prediction process. The AIRIMP ranking method outperforms the others, and is discussed below in the section on results. Hence, it is used in the proposed MRFE FS technique to rank every feature, from the best to the worst.

2) Algorithm for MRFE: The pseudo code for the proposed MRFE technique is given in Algorithm 2.

Algorithm 2: The Pseudo Code for the Proposed MRFE Technique
  Input: D <- data set, F <- features {f1, f2, ..., fp}, RM <- ranking method
  Output: F1 <- selected features
  Let a be the entire crop data set
  Let b be the shuffled crop data set
  Let c be the combination of the crop data set and the shuffled crop data set
  Let d be the extended crop data set
  Begin
    Initialization: F = {f1, f2, ..., fp}, F1 = empty set
    The algorithm starts with the whole data set D
    a <- D
    b <- shuffle(a)
    c <- cbind(a, b)
    d <- shuffle(cbind(a, c))
    Apply RF to d to find the importance of the features
    Zscore <- RF(d)
    Apply the ranking method after predicting the Z score:
    repeat for i = 1 to p    (for all weak features)
      R1 <- RM(d, F)
      W1 <- the weak feature calculated from R1
      if R1 <= W1
        remove weak features from F
      else if R1 > W1
        return F1
      end if
    end for
    Terminate when all weak features are eliminated from the data set
  End

III. CLASSIFICATION TECHNIQUE

Classification is the learning process used in ML to predict the target class of a given input. Classification techniques are divided into two, supervised and unsupervised. In this work, supervised learning methods such as the kNN, NB, DT, SVM, RF, and Bagging are used for the crop prediction process. In addition, they help evaluate the performance of the FS technique.

A. k-Nearest Neighbor
The kNN is a supervised learning process that predicts a suitable crop based on the closest training samples, and is centered on a distance measurement for the prediction process [14]. Using the distance measurement, a new sample from the testing set is allocated to a particular target class, based on how closely it matches the training set.

B. Naive Bayes
The NB classifier [24] is a simple classification algorithm that estimates the probability of every class and chooses a suitable crop with the maximum probability. The NB classifier is trained with the training samples, and its performance is evaluated by using testing samples from the testing set to find the most appropriate crop for cultivation. Fundamentally built on the Bayesian theorem, its principles are drawn from graph and probability theories.

C. Decision Tree
The DT is a supervised learning model with a tree-like structure. Each internal node is labeled with an input feature [25], and the model follows a top-down approach. Each leaf node is labeled with the class used to predict the target variable [25]. For the DT, which holds the prediction class, tree splitting is important. Using the splitting, data values from the testing set are used to identify a suitable crop.

D. Support Vector Machine
The SVM classifier is a supervised learning process that predicts the most suitable crop from the testing set. It separates classes into categories, with several possibilities for the hyperplane, using the maximum margin [36]. The hyperplane, also known as the decision boundary, helps classify crops. The crop that lies closest to the decision boundary is recommended for cultivation. In the SVM, finding the decision boundary is an optimization problem [37].
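Putting Steps 1-4 and Algorithm 2 together, one hedged Python sketch of the MRFE idea follows. Scikit-learn's RandomForestClassifier importances stand in for the Z-score/AIRIMP ranking, the permuted data set is rebuilt per pass for clarity (the paper's point is that permutation avoids updating the data set itself), and the stopping rule is simplified to dropping the weakest feature until n_keep remain; none of these choices reproduce the authors' exact code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def permuted_data_set(X):
    """Step 1: a shuffled copy column-bound to the original data set."""
    Xs = np.array([rng.permutation(row) for row in X])  # shuffle per record
    return np.hstack([X, Xs])

def mrfe(X, y, n_keep):
    """Sketch of Algorithm 2: rank features on the permuted data set,
    drop the weakest, repeat until n_keep features remain."""
    features = list(range(X.shape[1]))
    while len(features) > n_keep:
        Xp = permuted_data_set(X[:, features])
        rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xp, y)
        # Importance of each real feature (ignore the shuffled duplicates).
        imp = rf.feature_importances_[:len(features)]
        weakest = int(np.argmin(imp))   # lowest-ranked feature
        features.pop(weakest)           # eliminate it
    return features

X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)
selected = mrfe(X, y, n_keep=4)
print(selected)
```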


[TABLE I: Characteristic Comparison of FS Methods]

E. Random Forest
The RF classifier is a collection of binary DTs [27]. The RF creates a set of DTs from randomly selected subsets of the training set. Each individual tree splits a class for crop prediction, and the split is chosen from the Gini index value. The class of a new sample from the testing set is determined by the majority of the votes of all the trees in the RF [15].

F. Bagging
Bagging, also known as bootstrap aggregating, is an ensemble meta-algorithm used to improve the stability and ACC of ML algorithms [28]. It combines weak learners for the prediction process. The samples taken from the training data set are used to train the classifier to predict the most suitable crop for cultivation. The classifier then takes votes for each sample, and these are used to improve the prediction level. Bagging is a special case of the model averaging approach [28] and eliminates the need for weight updating. The pseudo code for the Bagging technique is given in Algorithm 3.

Algorithm 3: The Pseudo Code [38] for the Bagging Algorithm
  Input: D -> data set, T -> training set of examples of size N, k -> number of bootstrap samples, LA -> learning algorithm
  Output: C* -> bagging ensemble with k component classifiers
  Learning phase:
    for i = 1 to k do
      Si <- bootstrap sample from T
      Generate classifier Ci <- LA(Si)
    end for
  Predicting the class label for a new instance x:
    C*(x) = arg max_y sum_{i=1..k} [Ci(x) = y]

IV. CROP PREDICTION PROCEDURE

The basic steps for crop prediction follow below.
Step 1: The crop data set containing soil and environmental features is the input data set.
Step 2: The preprocessing procedure is applied to the input data set to standardize the data. Missing values and redundant data are checked for anomalies. Also, variables in the data set are converted into a particular range to keep the data set compatible.
Step 3: The proposed MRFE technique is applied to select important features from the preprocessed data, as discussed in Section II-B.
Step 4: The selected features are used for the prediction process. Using ML techniques, the reduced data set is split for use in the training and testing phases.

A. Training Phase
A subset of the data set is used to train the classifier in the training phase.
Step 5: In all, 70% of the samples from the reduced data set are taken for use as training samples.
Step 6: The classification algorithm is applied to all the training samples for the learning process.
Step 7: The algorithm is well trained with the entire training data set for crop prediction.

B. Testing Phase
The remaining subset of the data set is used to validate the classifier in the testing phase.
Step 8: The remaining 30% of the samples from the reduced data set are taken as testing samples.
Step 9: The learned classifier is applied to the testing samples for the prediction process.
Step 10: The learned classifier finds the target class for the new given input and predicts the most suitable crop for a specific land area.
Step 11: From the results, the most suitable crop for cultivation is recommended.

V. EXPERIMENTAL RESULT ANALYSIS AND DISCUSSION

A. Data Set Description
This work utilized an agricultural data set that chiefly included soil and environmental factors. The environmental factors data set is publicly available from the www.tnau.ac.in website, but the soil characteristics data set is not publicly available. Hence, it was manually collected from various sources, such as the Department of Agriculture, Sankarankovil Taluk, Tenkasi, India, and was constructed specially for this research purpose. The data set contains 1000 instances with 9 classes and 16 features, where 12 features are soil characteristics and the remaining 4 are environmental characteristics. Table II illustrates the types and descriptions of the soil and environmental characteristics which influence the crop prediction process.
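The learning and voting phases of Algorithm 3 can be sketched directly in Python; decision trees act as the base learner LA, and the 70/30 split mirrors Steps 5 and 8. The synthetic data are a stand-in for the crop data set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(T_X, T_y, k):
    """Learning phase: train one classifier per bootstrap sample S_i."""
    models = []
    n = len(T_X)
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # bootstrap sample from T
        models.append(DecisionTreeClassifier().fit(T_X[idx], T_y[idx]))
    return models

def bagging_predict(models, x):
    """Prediction: C*(x) = arg max_y sum_i [C_i(x) = y] (majority vote)."""
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return max(set(votes), key=votes.count)

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
ensemble = bagging_fit(X[:210], y[:210], k=25)  # 70% used for training
pred = bagging_predict(ensemble, X[250])        # a held-out sample
print(pred)
```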


[TABLE II: Data Set Description of Crop Data Set]
[TABLE III: Performance Metrics Description]

B. Performance Metrics
The performance of the proposed technique is evaluated by varying the splitting range of the total number of trees and by reviewing the ranking method that identifies the importance of each feature. In this work, the performance of the FS and classification techniques is evaluated using the metrics of ACC, precision, recall, specificity, F1 score, mean absolute error (MAE), area under the curve (AUC), log loss (LL), and out of bag (OOB) error.

C. Results and Discussion
This section evaluates the performance of the proposed MRFE technique in terms of selecting key features for the crop prediction process and compares it with other wrapper FS techniques. In this work, the performance of the MRFE technique is evaluated using supervised learning methods such as the kNN, NB, DT, SVM, RF, and Bagging to identify the most suitable crop for cultivation. Furthermore, this section depicts the performance of the MRFE technique by varying the RF parameter and the ranking method. Performance is evaluated using the performance metrics mentioned in Table III.
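Most of the metrics in Table III have direct scikit-learn counterparts; the sketch below evaluates placeholder multiclass predictions with macro averaging (an assumption, since the paper does not state its averaging mode). Specificity has no dedicated scikit-learn function and is usually derived from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_absolute_error, log_loss)

# Placeholder true labels and predictions for a 3-class crop problem.
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2, 2])
# Per-class probabilities with rows summing to 1 (needed for log loss).
y_prob = np.full((8, 3), 0.1)
y_prob[np.arange(8), y_pred] = 0.8

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
print(mean_absolute_error(y_true, y_pred))
print(log_loss(y_true, y_prob))
```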


[TABLE IV: Performance Evaluation of the Modified RFE Varying RF Parameter ntree]
[TABLE V: Performance Evaluation of the Modified RFE Varying RF Parameter mtry]
[TABLE VI: Performance Evaluation of the Modified RFE Varying Ranking Method]
[TABLE VII: Performance Evaluation of Modified RFE Using Various Classifier Techniques]
[TABLE VIII: Performance Evaluation of Various FS Techniques With the Bagging Classifier]

1) Performance Evaluation of the Modified RFE Varying RF Parameter: To optimize the performance of the proposed MRFE technique, the parameters of the RF classifier are fine-tuned to identify the most important features. The RF parameters ntree and mtry are varied, and the performance is evaluated to determine the importance of the features chosen. The effect of varying the ntree parameter is analyzed using the OOB error metric, and mtry is analyzed with the ACC metric, to validate the RF model. Tables IV and V describe the performance of the proposed MRFE FS technique when ntree and mtry are varied.

Table IV depicts the fine-tuned performance of the RF classifier. The result is used to select the number of trees to be generated, ranging from 100 to 500, to find the most important features. The range chosen is based primarily on the OOB error rate. It is evident from Table IV that ntree = 400 minimizes the OOB error rate and makes better predictions. Hence, the RF performs better with a splitting range of 400. Furthermore, a lower LL value indicates better predictions, and the results show that the ntree splitting range of 400 gives a lower LL value and a better AUC rate compared with the other splitting ranges.

Table V illustrates the fine-tuned performance of the RF classifier while varying the parameter mtry, which determines the number of attributes used at each split in the RF model. From the result, it is observed that the RF classifier performs better for mtry = 4 than for other values.

2) Performance Evaluation of the Modified RFE Varying Ranking Method: The ranking method ranks features in order, from the best to the worst. It eliminates irrelevant features in the process, greatly impacting the FS process. Table VI illustrates the performance of the MRFE FS technique using ranking methods such as rank aggregation, Gini importance, PIMP, and AIRIMP.

The ranking methods mentioned above play a major role in FS techniques. They help order features in terms of importance and eliminate weak ones from the data set. It is observed from Table VI that the AIRIMP method provides better results than the others, involving less time and bias. The rank aggregation method combines different ranking orders to obtain a better ranking; since it combines different ranking orders to find feature importance, it is more complicated than the other methods. The PIMP results are better than those of the Gini importance, though its running time is slower than that of the Gini importance and AIRIMP. The AIRIMP ranking method, which performs best of all, is used for feature ranking in the proposed technique.

3) Performance Evaluation of the Modified RFE Technique Using Various Classifiers: This evaluation identifies the best classifier to predict a suitable crop for cultivation. Table VII shows the performance of the MRFE technique using classifiers such as the kNN, NB, DT, SVM, RF, and Bagging.

Table VII shows that the MRFE technique produces a better prediction rate with the bagging classifier. Furthermore, it has a lower LL value, indicative of higher prediction ACC. Bagging combines several weak learners for the prediction process, takes votes for each sample, and aggregates them. It is on this basis that the bagging method accurately classifies crops, for particular stretches of land, better than the other classification techniques.
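The ntree/mtry tuning behind Tables IV and V can be reproduced in outline with scikit-learn, where n_estimators plays the role of ntree, max_features of mtry, and the OOB error is 1 - oob_score_. The synthetic data are a stand-in, so the minimizing value need not match the paper's ntree = 400.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, n_informative=8,
                           random_state=0)

# Vary ntree (n_estimators) as in Table IV and compare OOB error rates.
oob_error = {}
for ntree in (100, 200, 300, 400, 500):
    rf = RandomForestClassifier(n_estimators=ntree,
                                max_features=4,      # mtry = 4, as in Table V
                                oob_score=True,
                                random_state=0).fit(X, y)
    oob_error[ntree] = 1.0 - rf.oob_score_

best_ntree = min(oob_error, key=oob_error.get)
print(oob_error)
print(best_ntree)
```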


TABLE IX TABLE X
P ERFORMANCE E VALUATION OF MRFE W ITH THE BAGGING C LASSIFIER P ERFORMANCE E VALUATION OF MRFE W ITH THE BAGGING C LASSIFIER
BASED ON S OIL FACTORS BASED ON E NVIRONMENTAL FACTORS

performance analysis of various wrapper FS techniques with


the proposed MRFE technique, using the bagging classifier.
In this analysis, the FS techniques select essential features
TABLE XI
from the soil and environment, as described in Table II.
P ERFORMANCE E VALUATION OF MRFE W ITH THE BAGGING C LASSIFIER
The MRFE technique selects the most appropriate features BASED ON S OIL AND E NVIRONMENTAL FACTORS
for crop prediction, as depicted in Table VIII. The proposed
MRFE selects eight soil characteristics (N, P, K, Zn, Cu, Fe,
Mn, and EC) and two environmental characteristics (seasons
and rainfall) as important features. N, P, and K are macro
nutrients of soil which helps to increase the crop growth as
well as quality of crop. Zn, Cu, Fe, and Mn are micronutrients
of soil which are involved in photosynthesis and respiration.
Also the purpose of the selected features is given in Table II.
Using the bagging classifier, the MRFE technique outperforms the others. It employs permutation of the crop data set and a ranking method for the FS process, owing to which it selects the most relevant features faster than other FS techniques.
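The permutation-and-ranking idea described above can be sketched with scikit-learn's standard permutation importance. This is only a minimal illustration, not the paper's MRFE algorithm, and a bundled data set stands in for the crop data set, which is not public.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data set; the paper's crop data set is not publicly available.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle one feature column at a time and measure the drop in held-out
# accuracy: the larger the drop, the more the classifier relied on it.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
print("feature ranking:", ranking)
```

An elimination scheme such as MRFE would then repeatedly drop the lowest-ranked features and re-rank the remainder.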
5) Identifying Suitable Crop Using MRFE Technique With the Bagging Classifier Based on Soil Factors: This work utilizes nine classes of crops to identify suitable crops for cultivation. The classes include pulses like Black Gram (Bl G), Bengal Gram (BG), Chick Peas (CP), and Green Gram (GG); vegetables like Brinjal, Lady's Fingers (LF), and Tomatoes; and cereals like Maize and Paddy. Evaluating the performance of the MRFE technique with the bagging classifier thus becomes a multiclass problem. The metrics of precision, recall, F1 score, and AUC are used in the evaluation to predict each class for cultivation.

Table IX shows the performance of the proposed MRFE technique with the bagging classifier for each class. Soil factors from the crop data set in Table II are used for evaluation. It is observed from the results that the MRFE with the bagging technique predicted a suitable crop for cultivation with 91% ACC, based on soil factors (N, P, K, Zn, Cu, Fe, Mn, and EC). Table IX shows that the proposed technique performed poorly at predicting maize but was better in the case of chickpeas. Chickpeas registered a higher precision rate and AUC value than the other crops.

6) Identifying Suitable Crop Using MRFE Technique With the Bagging Classifier Based on Environmental Factors: Table X presents a performance analysis of the MRFE FS technique with the bagging classifier for each class. Environmental factors from the crop data set in Table II are used for evaluation.

Table X shows that the proposed technique performed poorly at predicting maize but was better in the case of chickpeas, similar to the details shown in Table IX. However, the proposed technique predicts the most suitable crop based on environmental factors (such as rainfall and temperature) with a prediction rate of 59%, which is lower than the prediction based on soil factors.

7) Identifying Suitable Crop Using MRFE Technique With the Bagging Classifier Based on Both Soil and Environmental Factors: Table XI presents a performance analysis of the MRFE technique with the bagging classifier for each class. Soil and environmental factors from the crop data set in Table II are used for evaluation.

The proposed technique achieves 95% ACC in identifying a suitable crop for cultivation, based on soil and environmental factors together. This prediction rate is better than those obtained using soil and environmental factors separately. It is evident from the results that both soil and environmental factors are critical to predicting crops for cultivation in different areas. Table XI depicts that the classifier performed poorly for the class maize but better for the class chickpea.

8) Performance Evaluation of Various Classification Techniques to Identify a Suitable Crop Using the MRFE Technique


TABLE XII
Performance Evaluation of MRFE With Various Classifiers Based on Soil and Environmental Factors

TABLE XIII
Benchmark Data Set Description

TABLE XIV
Performance Evaluation of MRFE With Other FS Techniques Using the Bagging Classifier for Benchmark Data Sets

Based on Both Soil and Environmental Factors: Table XII shows a performance analysis, based on both soil and environmental factors, of the MRFE FS technique with the kNN, NB, DT, SVM, and RF classifiers.

Table XII records the performance of the MRFE technique with the kNN, NB, DT, SVM, and RF classifiers for each class. The results show that in terms of ACC, the MRFE with the RF classifier outperforms the other four. The performance metrics of precision, recall, and F1 score, as well as the AUC values, are used to examine the FS technique with each classifier for the multiclass problem. Table XII indicates that the MRFE with all the classifiers performed well for the class chickpeas, but poorly for the class maize. The results also show that the MRFE with the bagging technique offers higher ACC than these classifiers.
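The per-class evaluation described above can be sketched with scikit-learn's standard metrics. This is a minimal illustration: a bundled data set stands in for the crop data set, so the classes are wine cultivars rather than crops such as chickpea and maize.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Precision, recall, and F1 score reported separately for each class.
print(classification_report(y_test, clf.predict(X_test), zero_division=0))

# One-vs-rest AUC computed from the predicted class probabilities.
auc = roc_auc_score(y_test, clf.predict_proba(X_test), multi_class="ovr")
print(f"OvR AUC: {auc:.3f}")
```

Swapping the bagging classifier for kNN, NB, DT, SVM, or RF gives the per-classifier comparison reported in Table XII.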
9) Performance Evaluation of MRFE With Other FS Techniques Using the Bagging Classifier for Other Benchmark Data Sets: To ensure the relevance of the proposed MRFE technique to data sets other than crop-related ones, miscellaneous data sets were downloaded from the UCI Repository. Table XIII describes the three data sets used to examine the MRFE technique's suitability for different kinds of data. Table XIV displays the performance of the MRFE wrapper FS technique against the others and analyzes its suitability to diverse benchmark data sets.

Table XIV reveals that the proposed MRFE technique is as relevant to crop data sets as it is to others. Owing to the use of permutation and ranking, the MRFE selects the most appropriate features with improved prediction ACC in the least time, compared with other techniques.
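A benchmark check of this kind can be sketched as a feature-elimination pipeline evaluated by cross-validation. Scikit-learn's standard RFE stands in here for the paper's MRFE, whose implementation is not public, and a bundled data set stands in for the UCI benchmark sets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

# Recursively eliminate features down to the ten most important ones,
# then classify on the reduced feature set.
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=10)
pipe = make_pipeline(selector,
                     RandomForestClassifier(n_estimators=50, random_state=0))

acc = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
print(f"benchmark accuracy with 10 selected features: {acc:.3f}")
```

Wrapping selection and classification in one pipeline ensures the feature choice is re-fitted inside every cross-validation fold, avoiding selection bias.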
VI. CONCLUSION

Predicting a suitable crop for cultivation is critical to agriculture. In this work, the MRFE, a novel approach, has been proposed for selecting salient features using a permutation crop data set and a ranking method to identify the most suitable crop for a particular region. Experiments were conducted to evaluate the efficiency of the proposed MRFE technique using the kNN, NB, DT, SVM, RF, and bagging classification techniques to predict the most suitable crops for cultivation. Soil and environmental factors were considered for an analysis of the crop prediction process. The results indicate that the MRFE with the bagging classifier gives better crop prediction ACC than the MRFE with the other classifiers. The performance of the MRFE technique on the crop data set was assessed and compared with existing techniques like SFFS, Boruta, and RFE. Furthermore, the suitability of the proposed MRFE technique was evaluated using three benchmark data sets. The results show that the proposed MRFE technique outperforms the others. Nevertheless, the MRFE technique needs performance-wise improvements before it can be used on data sets with large numbers of features.
ACKNOWLEDGMENT

The authors would like to thank the Department of Agriculture, Sankarankovil Taluk, Tenkasi, India, for providing data for the analysis.


A. Suruliandi received the B.E. degree in electronics and communication engineering from the Coimbatore Institute of Technology, Coimbatore, India, in 1987, the M.E. degree in computer science and engineering from the Government College of Engineering, Tirunelveli, India, in 2000, and the Ph.D. degree from Manonmaniam Sundaranar University, Tirunelveli, in 2009. He is currently working as a Professor with the Department of Computer Science and Engineering, Manonmaniam Sundaranar University. He has more than 29 years of teaching experience. He has authored 50 articles in international journals, 23 in IEEE Xplore publications, 33 in national conferences, and 13 in international conferences. His research areas are remote sensing, image processing, and pattern recognition.

S. P. Raja was born in Tuticorin, India. He received the B.Tech. degree in information technology from the Dr. Sivanthi Aditanar College of Engineering, Tiruchendur, India, in 2007, and the M.E. degree in computer science and engineering and the Ph.D. degree in image processing from Manonmaniam Sundaranar University, Tirunelveli, India, in 2010 and 2016, respectively. He is currently working as an Associate Professor with the School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India. He has more than 14 years of teaching experience in engineering colleges. He has authored 42 articles in international journals, 24 in international conferences, and 12 in national conferences. His areas of interest are image processing and cryptography. Dr. Raja is an Associate Editor of the International Journal of Interactive Multimedia and Artificial Intelligence, Brazilian Archives of Biology and Technology, Journal of Circuits, Systems and Computers, Computing and Informatics, International Journal of Image and Graphics, and International Journal of Biometrics.

G. Mariammal received the B.E. degree in computer science and engineering from Francis Xavier Engineering College, Tirunelveli, India, in 2011, and the M.E. degree in computer science and engineering from Manonmaniam Sundaranar University, Tirunelveli, in 2017, where she is currently pursuing the Ph.D. degree in computer science and engineering. Her research areas are machine learning, data analytics, and image processing.

E. Poongothai received the B.E. degree from the Bhajarang Engineering College, Anna University, Chennai, India, in 2011, and the M.E. and Ph.D. degrees in computer science and engineering from Manonmaniam Sundaranar University, Tirunelveli, India, in 2013 and 2020, respectively. She is currently working as an Assistant Professor with the Department of Computer Science and Engineering, SRM University, Chennai, India. Her research areas are machine learning and computer vision.
