
sustainability

Article
Optimization of Crop Recommendations Using Novel Machine
Learning Techniques
Husam Lahza 1 , K. R. Naveen Kumar 2 , B. R. Sreenivasa 3, * , Tawfeeq Shawly 4 , Ahmed A. Alsheikhy 5 ,
Arun Kumar Hiremath 2 and Hassan Fareed M. Lahza 6

1 Department of Information Technology, Faculty of Computing and Information Technology,


King Abdulaziz University, Jeddah 23589, Saudi Arabia
2 Department of Computer Science & Engineering, Bapuji Institute of Engineering & Technology,
Davangere 577004, Karnataka, India
3 Department of Information Science & Engineering, Bapuji Institute of Engineering & Technology,
Davangere 577004, Karnataka, India
4 Department of Electrical Engineering, Faculty of Engineering at Rabigh, King Abdulaziz University,
Jeddah 23589, Saudi Arabia
5 Department of Electrical Engineering, College of Engineering, Northern Border University,
Arar 91431, Saudi Arabia
6 Department of Information Systems, College of Computers and Information Systems,
Umm Al-Qura University, Makkah 21955, Saudi Arabia
* Correspondence: br.sreenu@gmail.com; Tel.: +91-9844647541

Abstract: A farmer can use machine learning to decide which crops to sow, how to care for those crops throughout the growing season, and how to predict crop yields. According to the World Health Organization, agriculture is essential to a nation's rapid economic development, and food security, access, and adoption are the organization's three cornerstones. Without a doubt, the main priority is to ensure that there is enough food for everyone, and increasing agricultural yield can help ensure a sufficient supply. Crop yields vary substantially across the country, which motivates research into whether cluster analysis can be used to identify crop yield patterns in a field. Previous studies were only marginally successful in accomplishing their primary objectives because of unstable conditions and imprecise methodology. The
vast majority of farmers base their predictions of crop yield on prior observations of crop growth on their farms, which can be deceptive. Standard preprocessing methods and random cluster value selection are not always reliable, according to the literature. The proposed study overcomes the shortcomings of conventional methodology by highlighting the significance of machine learning-based classification/partitioning and hierarchical approaches in offering a trained analysis of yield prediction in the state of Karnataka. The dataset used for the study was collected from the ICAR-Taralabalu Krishi Vigyan Kendra, Davangere, Karnataka. Crop area and crop yield are the key variables in the two dataset analysis methods used in the study to detect anomalies. The study emphasizes the importance of a mathematical model and algorithm for identifying yield trends, which can assist farmers in selecting crops that have a large seasonal impact on yield productivity.

Keywords: hierarchical clustering; precision agriculture; yield prediction; cluster analysis; dendrogram;
partition clustering

1. Introduction

Precision agriculture, as it is now known, was pioneered by environmentally conscious farmers before the invention of computers. They succeeded in identifying both the actions required to increase crop yields and the variables that contributed to the field's unpredictability. Farmers achieved this by taking field notes during
the planting and harvesting seasons. Based on the information gathered, they would then
select the most effective plan of action for the following year. Data-generating equipment
and sensors have long been on the rise in agriculture [1]. This has enabled farmers to make
data-driven decisions. This type of farming is known as smart farming. The author [2]
provides a thorough overview of the various goals and strategies used in smart farming.
One of the major problems in precision agriculture is taking into account agricultural
production predictions and the various models that have been proposed and tried so far.
Because crop yield is affected by a variety of factors such as soil, weather, fertilizers, and
seeds, multiple datasets must be used [3]. As a result, estimating agricultural production
is a difficult task. It is simple to estimate the actual yield using agricultural productivity
forecasting models, but improving yield prediction accuracy is still preferable [4,5].
The majority of climate change simulations are based on deterministic biophysical
crop models [6]. Based on detailed illustrations of plant physiology, these models can still
be used to assess response mechanisms and potential adaptation strategies [7]. Statistical
models, on the other hand, outperform them when making predictions at a larger spatial
scale [8]. Several studies [9] have found a strong link between excessive heat and poor
crop performance. This correlation was demonstrated using statistical models. Traditional
econometric methods are used in these methods. In recent research, crop model output and
crop model insights have both been incorporated into statistical model parameterization,
among other attempts to merge crop models with statistical models [10]. These efforts have
been made to better understand how statistical models and crop models may interact [11].
Numerous studies detail the various challenges in developing high-performance forecasting models. Choosing the best algorithm for high performance is therefore a time-consuming and important task. Furthermore, the chosen systems and algorithms must be extremely efficient at handling large amounts of data [12]. In addition, locating zones within a region that have behaved similarly over time is more useful than predicting specific yields within a sector. However, some factors that can affect yield, such as soil type, climate, and harvesting techniques, may vary from season to season. As a result, even if crop yield remains constant from year to year, the yield of one season cannot account for differences in the field [13]. The large area and yield deviations make it difficult to measure variation precisely. Many crops have yield estimates built into the
agricultural planning process. To protect the vested interests of farmers and the government,
the research work categorizes the divisions based on yield and region. On some major
agricultural issues, the government may act rashly. However, this approach assists farmers
in selecting the appropriate crop for various areas in order to provide adequate yields.
This is accomplished by taking into account cluster values based on the heuristic scores
described in the following section. It also intends to elaborate on the importance of avoiding
the cultivation of unnecessary crops in order to protect vital resources, such as time and
money. Most farmers plant crops based on past experience, resulting in a lower yield.
Various clustering algorithms are used in the study to identify the appropriate clusters, and
comparative analyses are performed to determine the optimal cluster values. The section
that follows contains a detailed description of the aforementioned study.
The research performs the following activities:
• Developing clusters with hierarchical and partitioning approaches based on factors such as location, output, and productivity;
• Performing a comparative analysis to identify the best method for structuring zones into clusters;
• Recommending areas or fields with the potential to produce high crop yields, based on the specified scale value.

2. Materials and Methods


2.1. Dataset Overview
The data have been collected from various sources, including the Krishi Kendra (agricultural office) in Davangere district, Karnataka. Area and production statistics are collected from the Ministry of Agriculture and Farmers Welfare, Karnataka, India [14]. In this work, the dataset source is accessible in the records of the Karnataka Government [12,15]. The data cover the years 1998 to 2018, and the preliminary data collection is carried out for various districts in Karnataka. The dataset consists of a large number of observations with the following variables: state, district, year, season, crop, yield (in tons), area (in hectares), and production (in tons).
A summary of the statistical data for each variable is presented in Figure 1. Lines 5 and 6 of the summary describe the variable "crop": 43 crops are available in total, each identified by its specific name. Only the state of Karnataka is used for the study, and the total number of locations is 30 districts. Data are collected from 1997–1998 to 2017–2018, which amounts to 21 years of data, so the crops were observed over this entire period. Lines 20 to 23 describe the variable "season": crops are classified into kharif, rabi, and summer crops. The kharif season begins in June and lasts until September, the rabi season lasts from October to February, and the summer season lasts from March to May. The value "whole_year" denotes crops cultivated throughout the year irrespective of the three seasons, while the value "total" in the season column denotes the overall yield of a crop. Lines 25 to 29 describe the variable "area", which specifies the approximate amount of land, in hectares, used for agricultural purposes. Lines 31 to 36 describe the variable "production", which gives the crop production over the aggregate area. Missing values may be due to withheld records, non-production, or unentered data. Lines 38 to 42 describe the variable "yield", which gives the quantity of crop harvested in tons per hectare.

Figure 1. Summary of the dataset used in the proposed work.

2.2. Proposed Framework for Determination of Yield Trend


The flow chart of the working model, shown in Figure 2, starts by loading the dataset and clearing unwanted data. The cleaned data are then summarized. Once this is performed, the next step is extracting the numeric variables. Extracting the numeric variables helps in generating a correlation table, which shows the correlation coefficient between each pair of variables, such as area, production, and yield.

Figure 2. Data flow diagram for determination of yield trend.



If correlated variables are present, the redundant ones are filtered out, as they would give the same results when used. The data are then sent to the data inspection methods (probability density plot and boxplot). If the density plots reveal multi-modal data, the corresponding districts are filtered out, and the remaining data are sent to the boxplot to check for the presence of outliers.
Afterward, the data are sent to the outlier detection methods (boxplot rule and Grubbs' test). An outlier is removed only when both methods identify it.
Once the data are free of outliers, the next step of the flowchart is picking the optimal k for clustering. The variable of interest is either area or yield. For area, the data are grouped by location and a mean-based partition by area is obtained; for yield, the data are again grouped by location and a mean-based partition by yield is obtained. After clustering, the output is categorized season-wise and presented, and inferences are drawn.
Most existing methods use multiple clustering algorithms to analyse large datasets. CLARA (Clustering Large Applications) is an extension of the k-medoids (PAM) technique for dealing with data containing many objects (more than a few thousand observations). Its purpose is to reduce processing time and RAM storage problems, which it achieves using a sampling technique.
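The sampling idea behind CLARA can be illustrated with a short, self-contained sketch. This is not the implementation used in the study: the function names (pam_on_sample, clara_like), the sample size, and the number of samples are illustrative, and the medoid search is a deliberately simplified PAM-style update.

```python
import numpy as np

def pam_on_sample(X, k, rng, n_iter=10):
    """Crude PAM-style medoid search on a small sample (illustrative only)."""
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        # assign every point to its nearest current medoid
        dist = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # move each medoid to the member that minimises within-cluster distance
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            within = np.linalg.norm(
                X[members][:, None, :] - X[members][None, :, :], axis=2).sum(axis=1)
            medoids[j] = members[within.argmin()]
    return medoids

def clara_like(X, k, n_samples=5, sample_size=200, seed=0):
    """CLARA-style clustering: run PAM on several random samples, keep the best medoids."""
    rng = np.random.default_rng(seed)
    best_cost, best_medoids = np.inf, None
    for _ in range(n_samples):
        sample = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        candidate = sample[pam_on_sample(X[sample], k, rng)]
        # score the candidate medoids on the *whole* dataset, as CLARA does
        cost = np.linalg.norm(X[:, None, :] - X[candidate][None, :, :],
                              axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_medoids = cost, candidate
    return best_medoids, best_cost
```

Drawing several samples and keeping the medoid set with the lowest total dissimilarity over the full data is what allows the method to scale to many thousands of observations.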

2.3. Algorithm for Determination of Yield Trends


Let D represent the training set of 26,906 tuples, each described by an 8-dimensional feature vector, X = (x1, x2, . . . , x8), covering eight measurements: state, district, year, season, crop, yield (in tons), area (in hectares), and production (in tons).
Firstly, Algorithm 1 considers the numeric variables and checks whether the variables change together at a constant rate by using the Pearson correlation [16] given in Equation (1).

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}.   (1)

Then, linearly related variables are removed from the attribute list.
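As a rough sketch of this feature-selection step, assuming the dataset is held in a pandas DataFrame with the column names used in the paper (an assumption), the pairwise Pearson correlations of the numeric variables can be computed and any variable strongly correlated with an already retained variable can be dropped. The 0.9 threshold and the helper name drop_correlated are illustrative.

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, numeric_cols, threshold=0.9):
    """Return the numeric columns kept after removing highly correlated ones."""
    corr = df[numeric_cols].corr(method="pearson").abs()
    kept = []
    for col in numeric_cols:
        # drop a column if it is strongly correlated with one already kept
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

# Illustrative usage with the variables named in the paper:
# kept = drop_correlated(data, ["area", "production", "yield"])
# -> "production" would be dropped, since corr(area, production) is about 0.95 (Table 3)
```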

Algorithm 1: Determination of Yield Trends


Input: Data, D, which is a crop yield dataset;
attrlist, the set of numeric candidate attributes;
Output: A set of k territories’ crop yield based on the scales defined.
1 //find the correlated variables;
2 apply FeatureSelection(D, attrList);
3 if redundantVars then
4 attrList := attrList − redundantVars;
5 end
6 var := “area”;
7 DetectionAndInspection(D, var, locations);
8 //find the best k;
9 k := kSelection(mean(D [var]),locations);
10 //construct k partitions;
11 pObj := pGroup(mean(D [var]), k, locations);
12 var := “yield”;
13 for each outcome k of the partitions do
14 let Dk be the set of observations in the partition D satisfying outcome k;
DetectionAndInspection(Dk, var, locations);
15 k := kSelection(mean(Dk [var]),locations);
16 hObj := hGroup(mean(Dk [var]), k,locations);
17 end

The algorithm calls DetectionAndInspection(), which computes and draws kernel density estimates and detects multi-modal distributions. It also removes the locations, and their observations, for which the method determines the existence of a multi-modal distribution.
The simplest univariate outlier detection method, the boxplot rule, is also applied to the numeric variable area; it tags as an outlier any value outside the interval

[Qtl_1 − 1.5 · IQR, Qtl_3 + 1.5 · IQR],

where Qtl_1 (Qtl_3) is the first (third) quartile and IQR = Qtl_3 − Qtl_1 is the interquartile range.
A related method, the Grubbs' test, starts by calculating the z score z = \frac{|x - \bar{x}|}{s_x} for each observation x, where \bar{x} is the sample mean of the variable and s_x is its sample standard deviation. Using this score, an outlier is declared if the following Equation (2) holds:

z \geq \frac{N - 1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N), N-2}}{N - 2 + t^2_{\alpha/(2N), N-2}}},   (2)

where N is the sample size and t_{\alpha/(2N), N-2} is the critical value of the t-distribution with N − 2 degrees of freedom at the significance level of α/(2N) [8].
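A compact sketch of these two univariate checks, written directly from the formulas above rather than taken from the authors' code, is shown below; as in the study, a value is treated as an outlier only when both methods flag it.

```python
import numpy as np
from scipy import stats

def boxplot_rule(x):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

def grubbs_flags(x, alpha=0.05):
    """Flag values whose z-score exceeds the Grubbs critical value of Equation (2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.abs(x - x.mean()) / x.std(ddof=1)
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return z > g_crit

def consensus_outliers(x):
    """Keep only the outliers identified by *both* methods, as the study requires."""
    x = np.asarray(x, dtype=float)
    return boxplot_rule(x) & grubbs_flags(x)
```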
The key issues with any clustering algorithm are:
• validating whether the obtained clustering solution is precise;
• obtaining an appropriate number of clusters for the yield dataset (compactness and cluster separation).
Internal validation methods, the Calinski–Harabasz Index and the Average Silhouette Width Index, are adopted to address these issues.

2.3.1. Calinski–Harabasz Index


The Calinski–Harabasz Index, also known as the Variance Ratio Criterion, is the ratio of the between-cluster dispersion to the within-cluster dispersion over all clusters; the higher the score, the better the performance.
For a set of data D of size n_D that has to be clustered into k groups, the Calinski–Harabasz score s is given by the equation

s = \frac{tr(B_k)}{tr(W_k)} \times \frac{n_D - k}{k - 1},

where tr(B_k) is the trace of the between-cluster dispersion matrix and tr(W_k) is the trace of the within-cluster dispersion matrix, given by

W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T   (3)

B_k = \sum_{q=1}^{k} n_q (c_q - c_D)(c_q - c_D)^T   (4)

where C_q is the set of points in cluster q, c_q is the centre of cluster q, c_D is the centre of D, and n_q is the number of points in cluster q [17].

2.3.2. Average Silhouette Width Index


For each observation i, the average distance to all objects in the same group as i is computed and called a_i. For each observation, the average distance to the cases belonging to each of the other groups is also computed, and the smallest such value is called b_i. Finally, the silhouette coefficient of an observation, s_i, is given by the equation

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}   [18].
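A minimal sketch of how these two indices can be used to choose k is given below. It groups the data by location and clusters the per-district means of one variable, as the paper does; scikit-learn's KMeans is used here only as a stand-in for the partitioning (PAM) method, and the column names ("district", "area") are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def score_k_range(df: pd.DataFrame, var: str, k_range=range(2, 11), seed=0):
    """CH and ASW scores over a range of k on per-district means of one variable."""
    means = df.groupby("district")[[var]].mean().values   # one row per location
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(means)
        scores[k] = {"CH": calinski_harabasz_score(means, labels),
                     "ASW": silhouette_score(means, labels)}
    return scores

# Illustrative usage: pick the k with the highest average silhouette width.
# scores = score_k_range(data, "area")
# best_asw_k = max(scores, key=lambda k: scores[k]["ASW"])
```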
Initially, on a dataset of 26,906 instances, the algorithm calls the partition method on
the variable area to form clusters. The area division obtained is further partitioned into a
yield-wise group using the Hierarchical Method with linkage criteria.

Given a dataset with this many observations, computing the partitioning and hierarchical methods directly would be extremely computationally expensive [19]. To reduce this computation, and to prevent the locations and their associated observations from splitting up and falling among different clusters [20], the mean() of the observations of each location is used.
The Agglomerative Clustering algorithm supports four linkage criteria: Complete_linkage,
Single_linkage, Average_linkage, and Ward’s method [18,19].
• The Maximum or Complete_linkage clustering measures the difference between two
groups by the largest distance between any two observations in each group, and it is
mathematically given Equation (5) as the distance D(X, Y) between cluster X and Y

D(X, Y) = \max_{x \in X,\, y \in Y} d(x, y)   (5)

• The Minimum or Single_linkage clustering measures the difference between two


groups by the smallest distance between any two observations in each group, and it is
mathematically given Equation (6) as the distance D(X, Y) between cluster X and Y

D(X, Y) = \min_{x \in X,\, y \in Y} d(x, y)   (6)

• The Average_linkage measures the difference between the two groups by the average
distance between any two observations in each group, and it is mathematically given
Equation (7) as the mean distance between elements of each cluster

\frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} d(a, b)   (7)

• Ward’s method aims to minimize the total within-cluster variance. At each step,
the pair of clusters with minimum between-cluster distance is merged, and it is
mathematically given Equation (8) as the squared Euclidean distance between points
d_{ij} = d(\{X_i\}, \{X_j\}) = \|X_i - X_j\|^2   (8)

All the above-discussed linkage methods are applied to the dataset to determine the
yield patterns, and the complete details of the methods are explained in Figures 15 and 16.
From Table 6, the highest score instance is considered for model construction.
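The comparison of linkage criteria summarized in Table 6 can be sketched with SciPy's hierarchical-clustering routines; the snippet below scores each linkage and each k by the average silhouette width on per-district mean yields. It is an illustrative sketch (the "ward" method here plays the role of ward.D2), not the authors' exact code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def asw_by_linkage(values, methods=("complete", "single", "average", "ward"),
                   ks=range(2, 8)):
    """Average silhouette width for each hierarchical linkage and cluster count."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)   # per-district mean yields
    results = {}
    for method in methods:
        Z = linkage(X, method=method, metric="euclidean")
        for k in ks:
            labels = fcluster(Z, t=k, criterion="maxclust")
            if len(set(labels)) > 1:                     # silhouette needs >= 2 clusters
                results[(method, k)] = silhouette_score(X, labels)
    return results
```

The (method, k) pair with the highest score corresponds to the "highest score instance" used for model construction.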

3. Data Analysis
A significant part of the data mining process is data preprocessing. Examining data that have not been thoroughly checked can yield false results, owing, for instance, to issues such as out-of-range values, unlikely combinations of values, and missing values. If such obsolete and repetitive data are present, exploring the evidence becomes more complicated during the modeling process.
The following critical problems were recognized during data preprocessing.

3.1. Eliminating the Unwanted Observations


There are a few districts where the multi-year data have not been consistent over several years, which could lead to incorrect output because crop coverage in such districts falls below the threshold value (40%). These districts have been identified and removed, as shown in Figure 3.

Figure 3. Removal of district observations with a lower threshold value.
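A sketch of this filtering step is given below. It assumes the data sit in a pandas DataFrame with "district" and "year" columns (an assumption) and interprets the 40% threshold as the fraction of the 21-year span for which a district has records.

```python
import pandas as pd

def drop_sparse_districts(df: pd.DataFrame, min_coverage=0.40, total_years=21):
    """Remove districts whose multi-year records cover less than 40% of the period."""
    coverage = df.groupby("district")["year"].nunique() / total_years
    keep = coverage[coverage >= min_coverage].index
    return df[df["district"].isin(keep)].copy()
```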

3.2. Removing the Impossible Data Combination


Table 1 shows that the value "total" is not a season but rather the combination of the various seasons in which a specific crop is grown (for example, kharif + summer) for a given location, production, and output. Rows with such a cumulative value are highlighted and deleted.

Table 1. Impossible combination in the season column.

Crop Year Season Area Production Yield


Rice 1998–1999 kharif 197 316 1.6
Rice 1999–2000 kharif 128 202 1.58
Rice 2000–2001 kharif 171 311 1.82
Rice 2001–2002 kharif 171 411 2.4
Rice 2001–2002 summer 13 19 1.46
Rice 2001–2002 total 184 430 2.34
Rice 2002–2003 kharif 112 230 2.05
Rice 2002–2003 summer 15 16 1.07
Rice 2002–2003 total 127 246 1.94
Rice 2003–2004 kharif 93 210 2.26

3.3. Fill in the Missing Values


It is important to note that the production column (Figure 1, line 32) has sixty missing values. An in-depth examination of the data in Table 2 reveals that the yield column has a value of zero in these rows, implying that no crop yield occurred. Furthermore, the harvested area is very small. As a result, there was no agricultural production, and the missing values in the production column are replaced with zeros.

Table 2. Fill in the Missing Values in the production column.

Crop Year Season Area Production Yield


moong 2015–2016 kharif 1 NA 0
linseed 2016–2017 rabi 2 NA 0
urad 2005–2006 kharif 2 NA 0
cowpea 2015–2016 summer 1 NA 0
rapeseed 2015–2016 kharif 2 NA 0
urad 2016–2017 kharif 1 NA 0
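Both of these cleaning steps, dropping the cumulative "total" rows illustrated in Table 1 and replacing the missing production values illustrated in Table 2 with zeros, amount to a few lines of pandas; the column names below are assumed to match those in Figure 1.

```python
import pandas as pd

def clean_seasons_and_production(df: pd.DataFrame) -> pd.DataFrame:
    """Drop season == 'total' rows and fill missing production with zero."""
    out = df[df["season"].str.strip().str.lower() != "total"].copy()
    out["production"] = out["production"].fillna(0)
    return out
```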

3.4. Removing the Impossible Data Mixes


Figures 4 and 16 show a side-by-side comparison of boxplot results for a portion of the region and the region results. There are two images in total: the boxplot with the outliers is on the left, and the boxplot without the outliers is on the right. The graphicPlot() function generates a boxplot for the chosen parameter. The boxplot is also given a rug, which shows the concrete values of the parameter, and a horizontal dotted line at the mean value [18]. By comparing this dotted line with the solid line inside the box, which marks the median, we can infer that anomalies have skewed the mean value. Two methods for identifying outliers in districts with defined anomalies are the boxplot rule and the Grubbs' test. As a result, they are capable of detecting and eliminating these anomalies.

Figure 4. Shaded density plot—Area.

4. Results and Discussion


Eight variables are used in the presented work, of which area, production, and yield
are numeric variables. To avoid unnecessary analysis, a table of correlated variables is
generated (Table 3) [20]. From this table, area and production appear to be highly correlated
(about 95%). So, area and yield variables are considered for the analysis.

Table 3. Correlation between feature variables.

var var2 cor


production area 95%
yield production 36%
yield area 26%

The shaded density plot on the area in Figure 4 suggests that four districts exhibit
multi-modal distributions. From these plots, it will be easier for the two outlier detection
methods to identify the outliers correctly.
The two outlier detection methods, the boxplot rule and Grubbs' test [18], have identified one or two outliers in three districts, as shown in Figure 5. The outliers are very few, but they distort the mean value, so eliminating them will not cause any misleading analysis. The work then divides the data into yield zones, which is a genuine clustering problem. Because of the lack of external information, knowing the value of the k parameter in advance is critical for proper groupings.

Figure 5. Boxplot with outliers—Area.



The partitioning method was used in conjunction with two criteria to estimate the best number of clusters:
• Calinski–Harabasz Index ("CH");
• Average Silhouette Width Index ("ASW").
Each criterion is evaluated over a range of k values to estimate the best k.
"ASW" suggested 5 clusters, and "CH" suggested 10 clusters (Figure 6). "ASW" is indicative of a good cluster configuration. Out of the cluster counts 2, 4, 5, and 6, a k value of 2 is selected, as their criterion values are nearly the same, as shown in Table 4.

Figure 6. Plot of CH vs. ASW—Area.

Table 4. ASW crit scores for the variable area.

clust: 2 3 4 5 6
crit: 0.6 0.6 0.7 0.7 0.7

Further, it would then be easier to rank the clusters. The k value of 10 proposed by the "CH" criterion is not chosen because it indicates overfitting of the data. After setting the k value to 2, cluster analysis is performed twice, once on the mean(area) and once on the mean(yield), with Euclidean distance as the metric. We use the mean() of the variables (area and yield) because we do not want the locations and their associated observations to split up and fall into different clusters, as this distorts the distribution and leads to false interpretations [20].
The partitioning method optimizes its objective function in two phases, a construct phase and an exchange phase [21]. In the construct phase, the algorithm looks for a good initial set of medoids, and in the exchange phase, it fine-tunes the initial estimates given by the rough clusters determined in the construct phase. Looking at the values of the objective function in Figure 7 (construct phase: 22,350 and exchange phase: 16,357), the function changed significantly from the construct phase to the exchange phase. The partitioning method selected Bangalore_rural and Mandya as the two reference medoids. After dividing the areas into the two groups area 1 and area 2, these clusters are identified and renamed as small and large areas based on the mean values obtained in Table 5.

Table 5. Min to max cluster values for the variable area.

Clust Min Max Mean


area 1 4 92,740 13,284
area 2 21,625 201,286 80,863

Figure 7. Partition object—Area.


The second part focuses on the variable "yield". The density plot shown in Figure 8 indicates a multi-modal distribution in one district, Bagalkot, while Figure 9 identifies three districts, namely Hassan, Koppal, and Shimoga.

Figure 8. Shaded density plot—small area.

Figure 9. Shaded density plot—large area.

Outlier detection methods have detected two outliers in small areas, as shown in
Figure 10, and three outliers in large areas, as shown in Figure 11.

Figure 10. Boxplots with outliers—small area.



Figure 11. Boxplots with outliers—large area.

Picking the k value for the small and large areas uses the "ASW" metric along with the hierarchical linkage criteria complete_linkage, average_linkage, single_linkage, and ward.D2 to estimate the best k. The small area picks up two clusters, whereas the large area picks up four clusters. The resultant scores are presented in Table 6 [18].

Table 6. ASW scores of small and large-area for variable yield.

        Complete        Single          Average         ward.D2
k       Small   Large   Small   Large   Small   Large   Small   Large
2       0.55    0.58    0.39    0.41    0.62    0.55    0.55    0.58
3       0.47    0.59    0.34    0.53    0.54    0.53    0.47    0.59
4       0.41    0.71    0.46    0.42    0.48    0.71    0.41    0.71
5       0.48    0.68    0.43    0.68    0.52    0.68    0.48    0.68
6       0.43    0.61    0.32    0.61    0.43    0.61    0.43    0.65
7       0.43    0.62    0.35    0.56    0.41    0.62    0.43    0.62

After determining the best k value, cluster analysis is carried out on small and large
areas using Agglomerative nesting (Agnes). Figure 12 depicts the results obtained from
Agnes. Lines 2 and 11 specify the call to the Agnes function. The first argument to this
function is the Euclidean distance matrix of variable yield from a small and large area cluster.
On the other hand, the second argument specifies the criterion in this case as “average” to
select the two groups for merging at each step. Lines 3 and 12 define the agglomerative
coefficient, quantifying the amount of clustering structure discovered (values closer to
1 suggest a strong clustering structure).
We obtained 0.88 and 0.93 for the small and large areas, respectively, indicating that
we have a fairly reasonable clustering structure for both groups. Lines 4 and 13 specify
the order of objects, i.e., a vector containing a permutation of the original observations to
allow plotting. Lines 6 and 15 define the height, a vector containing the distances between
merging clusters at each stage.
A dendrogram is a tree-representation diagram. This diagrammatic representation,
which is widely used in hierarchical clustering, shows how the clusters generated by the
relevant analysis are arranged.

Figure 12. Agnes objects of small and large areas—yield.

In this research, we used internal validation criteria, namely the Average Silhouette Width (ASW) Index and the Calinski–Harabasz (CH) Index, together with hierarchical linkage criteria such as complete_linkage, average_linkage, single_linkage, and ward.D2, to estimate the best k. Basically, we want to know how well the original distance matrix is approximated in the cluster space, so a measure of the cophenetic correlation is also useful. The concordance with the ward.D2 hierarchical clustering gives an idea of the stability of the cluster solution. In the current research, the two clusters found in Figure 6 were used for the investigation, as shown in Figure 13.

Figure 13. Dendrogram plot for a small area.
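The cophenetic correlation mentioned above, i.e., how faithfully the dendrogram's merge heights reproduce the original pairwise distances, can be computed with SciPy; the sketch below is illustrative, with the "ward" method standing in for ward.D2 and per-district mean yields assumed as input.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def cophenetic_correlations(values):
    """Cophenetic correlation of average and Ward linkages on per-district mean yields."""
    X = np.asarray(values, dtype=float).reshape(-1, 1)
    d = pdist(X, metric="euclidean")
    out = {}
    for method in ("average", "ward"):
        Z = linkage(X, method=method)
        c, _ = cophenet(Z, d)   # correlation between cophenetic and original distances
        out[method] = c
    return out
```

Values close to 1 indicate that the dendrogram preserves the original distance structure well, which complements the agglomerative coefficients of 0.88 and 0.93 reported above.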

A dendrogram is plotted and compared to a banner plot to determine whether the


data have been properly clustered. Figures 13–16 give comparative interpretations of the
dendrogram and banner plot. The banner plot’s x and y axes represent the height and order
in which objects are clustered in Figure 14.

Figure 14. Banner plot for a small area.

Figure 15. Dendrogram plot—large area.



Figure 16. Banner plot—large area.

The white area of the banner plot (Figure 14) indicates the unclustered data [22], while the white lines mark the red blocks where the clusters are arranged. It can be seen that objects 9 and 13 have a larger bar than objects 1 and 9. Moreover, between objects 2 and 5 there is no bar at all, which indicates that the objects numbered 1, 9, and so on up to 5 belong to one cluster that is completely dissimilar to the objects 2, 12, and so on up to 15, which belong to another cluster [23]. The dendrogram shows a similar pattern, as shown in Figure 13.
The unclustered data are likewise indicated by the white area of the banner plot in Figure 16, and the white lines represent the red blocks where the clusters are arranged. Here, objects 1 and 7 have a larger bar than objects 1 and 3. Furthermore, there is no bar between objects 3 and 2, indicating that objects 1, 7, and 3 belong to one cluster, while objects 2 and 9 belong to a different cluster. Figure 15 depicts a similar pattern in the dendrogram.
After dividing the districts in the small and large areas into different clusters, the two clusters zone 1 and zone 2 in the small area are identified and renamed as low- and high-yield districts based on the mean values obtained in Table 7. Similarly, the four clusters zone 1 to zone 4 from the large area are identified, and it is concluded that zone 1 is low yield, zone 2 is high yield, and zone 3 and zone 4 are considered moderate-yield districts. The average yield distributions of the small and large areas are presented in Figures 17 and 18, respectively.

Table 7. Mean values of a small and large area for yield variable.

Small Area—Yield Large Area—Yield


Clust Min Max Mean Clust Min Max Mean
zone 1 0.4 7.62 3.19 zone 1 1.65 6.1 4.7
zone 2 2.73 11.6 7.17 zone 2 4.04 9.29 6.18
- - - - zone 3 5.16 10.9 7.38
- - - - zone 4 5.62 12.4 8.65

Figure 17. Average yield distribution—small area.

Figure 18. Average yield distribution—large area.



Table 8 shows that the rice crops are cultivated and harvested during all three seasons.
Moreover, larger areas are allocated to this crop for cultivation during the kharif season. On
top of that, this season has a high impact on production compared to the summer season,
followed by the Rabi season.

Table 8. Season-wise data for the rice crop, as an example.

Area (In Hectares) Production (In Tons) Yield (In Tons per Hectare)


Season: Kharif
mean: 37,112 mean: 95,330 mean: 2.47
Season: Rabi
mean: 3240.6 mean: 7468.4 mean: 2.4
Season: Summer
mean: 9845.6 mean: 30,887 mean: 2.9
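The season-wise summary in Table 8 corresponds to a simple grouped aggregation; a sketch for the rice crop, assuming the paper's column names, is shown below.

```python
import pandas as pd

def season_summary(df: pd.DataFrame, crop: str = "Rice") -> pd.DataFrame:
    """Mean area, production and yield per season for one crop."""
    rows = df[df["crop"].str.lower() == crop.lower()]
    return rows.groupby("season")[["area", "production", "yield"]].mean()
```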

5. Conclusions
The research emphasizes the importance of prioritizing the crops relevant to each district, eliminating unnecessary risks and investments, so that yield benefits are achieved instead of harvesting or sowing crops that are not suited to the region. The work further helps government sectors and farmers by providing detailed information on the seasons and crops that have a high impact on yield and production, and by resolving issues related to agricultural activities in weaker locations. Two dataset inspection methods (probability density function and boxplot) are used to detect multi-modal distributions and to identify and eliminate outliers. The research was conducted on the agricultural plains of Karnataka using the dataset furnished by the Ministry of Agriculture and Farmers Welfare, Karnataka, India, and the ICAR-Taralabalu Krishi Vigyan Kendra, Davanagere, Karnataka, India. The proposed work uses linkage parameters, together with two heuristic methods and both hierarchical and partitioning algorithms, to estimate the best value of k. In addition, dendrograms are created and compared with banner plots to verify that the cluster formation is accurate. The algorithm designed in this work computes mean scores on the area and yield variables, which allows fast execution and assists in carrying out a precise analysis to draw the best yield patterns that can be recommended. Finally, based on economic significance, the work recommends the best crops while eliminating risks and investments.

Author Contributions: Conceptualization, H.L., K.R.N.K. and B.R.S.; data curation, T.S. and A.A.A.;
formal analysis, H.L. and K.R.N.K.; funding acquisition, H.L.; investigation, B.R.S. and A.K.H.;
methodology, K.R.N.K., B.R.S. and A.K.H.; supervision, B.R.S.; validation, H.F.M.L., T.S. and A.A.A.;
writing—original draft, K.R.N.K.; writing—review and editing, H.L., K.R.N.K. and B.R.S. All authors
have read and agreed to the published version of the manuscript.
Funding: This research work was funded by the Institutional Fund Projects under grant no. (IFPIP:
959-611-1443). The authors gratefully acknowledge the technical and financial support provided by
the Ministry of Education and King Abdulaziz University, DSR, Jeddah, Saudi Arabia.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: We accessed data on 17 August 2022 from ICAR-Taralabalu Krishi
Vigyan Kendra, and data is available at www.taralabalukvk.com (accessed on 17 August 2022).
Acknowledgments: The authors also thank the Bapuji Institute of Engineering & Technology (BIET) for
providing the support and infrastructure.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Wolfert, S.; Ge, L.; Verdouw, C.; Bogaardt, M.-J. Big Data in Smart Farming—A Review. Agric. Syst. 2017, 153, 69–80. [CrossRef]
2. Kamilaris, A.; Kartakoullis, A.; Prenafeta-Boldú, F.X. A Review on the Practice of Big Data Analysis in Agriculture. Comput.
Electron. Agric. 2017, 143, 23–37. [CrossRef]
3. Xu, X.; Gao, P.; Zhu, X.; Guo, W.; Ding, J.; Li, C.; Zhu, M.; Wu, X. Design of an Integrated Climatic Assessment Indicator (ICAI) for
Wheat Production: A Case Study in Jiangsu Province, China. Ecol. Indic. 2019, 101, 943–953. [CrossRef]
4. Filippi, P.; Jones, E.J.; Wimalathunge, N.S.; Somarathna, P.D.S.N.; Pozza, L.E.; Ugbaje, S.U.; Jephcott, T.G.; Paterson, S.E.; Whelan,
B.M.; Bishop, T.F.A. An Approach to Forecast Grain Crop Yield Using Multi-Layered, Multi-Farm Data Sets and Machine Learning.
Precis. Agric. 2019, 20, 1015–1029. [CrossRef]
5. Van Klompenburg, T.; Kassahun, A.; Catal, C. Crop Yield Prediction Using Machine Learning: A Systematic Literature Review.
Comput. Electron. Agric. 2020, 177, 105709. [CrossRef]
6. Rosenzweig, C.; Jones, J.W.; Hatfield, J.L.; Ruane, A.C.; Boote, K.J.; Thorburn, P.; Antle, J.M.; Nelson, G.C.; Porter, C.; Janssen,
S.; et al. The Agricultural Model Intercomparison and Improvement Project (AgMIP): Protocols and Pilot Studies. Agric. For.
Meteorol. 2013, 170, 166–182. [CrossRef]
7. Ciscar, J.-C.; Fisher-Vanden, K.; Lobell, D.B. Synthesis and Review: An Inter-Method Comparison of Climate Change Impacts on
Agriculture. Environ. Res. Lett. 2018, 13, 070401. [CrossRef]
8. Lobell, D.B.; Asseng, S. Comparing Estimates of Climate Change Impacts from Process-Based and Statistical Crop Models.
Environ. Res. Lett. 2017, 12, 015001. [CrossRef]
9. Schlenker, W.; Roberts, M.J. Nonlinear Temperature Effects Indicate Severe Damages to U.S. Crop Yields under Climate Change.
Proc. Natl. Acad. Sci. USA 2009, 106, 15594–15598. [CrossRef] [PubMed]
10. Roberts, M.J.; Schlenker, W.; Eyer, J. Agronomic Weather Measures in Econometric Models of Crop Yield with Implications for
Climate Change. Am. J. Agric. Econ. 2012, 95, 236–243. [CrossRef]
11. Roberts, M.J.; Braun, N.O.; Sinclair, T.R.; Lobell, D.B.; Schlenker, W. Comparing and Combining Process-Based Crop Models and
Statistical Models with Some Implications for Climate Change. Environ. Res. Lett. 2017, 12, 095010. [CrossRef]
12. Urban, D.W.; Sheffield, J.; Lobell, D.B. The Impacts of Future Climate and Carbon Dioxide Changes on the Average and Variability
of US Maize Yields under Two Emission Scenarios. Environ. Res. Lett. 2015, 10, 045003. [CrossRef]
13. Majumdar, J.; Naraseeyappa, S.; Ankalaki, S. Analysis of Agriculture Data Using Data Mining Techniques: Application of Big
Data. J. Big Data 2017, 4, 20. [CrossRef]
14. Crop Production Statistics for Selected States, Crops and Range of Year. Available online: https://aps.dac.gov.in/APY/Public_Report1.aspx (accessed on 2 January 2021).
15. Gandhi, N.; Armstrong, L.J.; Petkar, O. Proposed decision support system (DSS) for Indian rice crop yield prediction. In
Proceedings of the 2016 IEEE Technological Innovations in ICT for Agriculture and Rural Development (TIAR), Chennai, India,
15–16 July 2016; pp. 13–18. [CrossRef]
16. Pearson Correlation Coefficient—Wikipedia. Available online: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient
(accessed on 22 March 2021).
17. Wei, H. How to Measure Clustering Performances When There Are No Ground Truth? Available online: https://medium.com/
@haataa/how-to-measure-clustering-performances-when-there-are-no-ground-truth-db027e9a871c (accessed on 2 January 2021).
18. Torgo, L. Data Mining with R; Chapman and Hall/CRC Data Mining and Knowledge Discovery Series; CRC Press: Boca Raton,
FL, USA, 2016; ISBN 978-1-315-39909-6.
19. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques; The Morgan Kaufmann Series in Data Management Systems; Morgan Kaufmann: Burlington, MA, USA, 2011; ISBN 978-0-12-381479-1.
20. Williams, G.J. The Essentials of Data Science: Knowledge Discovery Using R; Chapman and Hall/CRC the R Series; Chapman &
Hall/CRC: Boca Raton, FL, USA, 2017; ISBN 978-1-4987-4001-2.
21. Toomey, D. R for Data Science|Packt. Available online: https://www.packtpub.com/product/r-for-data-science/9781784390860
(accessed on 22 March 2021).
22. Spector, P. Stat 133 Class Notes—Spring. UC Berkeley Statistics. 2011. Available online: https://www.stat.berkeley.edu/~s133/all2011.pdf (accessed on 21 March 2022).
23. Bock, T. What Is a Dendrogram? Available online: https://www.displayr.com/what-is-dendrogram/ (accessed on 22 March 2021).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
