Article info
Keywords: Dry bean; Interquartile range; ADASYN; Multiclass classification techniques; Performance measures

Abstract
Background: The application of classification methods through multivariate and machine learning techniques has enormous significance in the agricultural sector. It is vital to classify various types of seeds as well as to identify the quality of seeds, which has a great impact on crop production. There is a wide range of genetic variation in dry beans all over the world. Many previous studies on various datasets have sought to identify the sorts of dry beans; however, most of them focused on machine learning techniques with binary classification.
Objective: The aim of this study is to investigate a reliable classifier which has the lowest noise implications and
establish an algorithm for dry bean classification effectively. This paper focuses on outlier removals, oversampling
with Adaptive Synthetic (ADASYN) algorithm and finding the best classifier to guarantee the highest possible
accuracy.
Methods: The raw dataset for this study was accessed from the UCI Machine Learning Repository. The dataset contained grains having 16 features, 12 dimensions, and 4 distinct shapes. To eliminate outliers from the dataset, the interquartile range (IQR) was applied using Python. Eight popular classifiers were used in this study: Logistic Regression (LR), Naïve Bayes (NB), k-Nearest Neighbor (KNN), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGB), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), each with balanced and imbalanced classes. The authors utilized frequency tables, bar diagrams, boxplots and analysis of variance for descriptive analysis as well as data preprocessing.
Results: The XGB classifier outperformed the other classifiers with both balanced and imbalanced distributions of dry beans across classes. It achieved an accuracy (ACC) of 93.0% and 95.4% with imbalanced and balanced classes, respectively. With the balanced dataset, after application of the ADASYN algorithm, both the KNN and RF techniques also performed well with regard to Classification Accuracy (ACC), Sensitivity (SE), Specificity (SP) and Cohen's kappa coefficient (Kappa). The most important attributes for classifying the dry beans were found to be ShapeFactor2, Minor Axis Length, and ShapeFactor1, along with EquivDiameter, Roundness and ConvexArea.
Conclusions: For classification of dry seeds, the XGB classifier performed well whether the class distribution of the dataset was balanced or imbalanced. It is therefore the primary approach for identifying the classes of seeds/beans, whether or not the classes are balanced. If the classes of the target variable are well balanced, then the KNN and RF algorithms may be applied along with the XGB technique for more accurate classification.
∗
Corresponding author.
E-mail address: salauddinstat@ku.ac.bd (M. Salauddin Khan).
https://doi.org/10.1016/j.ijcce.2023.01.002
Received 10 March 2022; Received in revised form 4 January 2023; Accepted 7 January 2023
Available online 14 January 2023
2666-3074/© 2023 The Authors. Publishing Services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC BY
license (http://creativecommons.org/licenses/by/4.0/)
M. Salauddin Khan, T.D. Nath, M. Murad Hossain et al. International Journal of Cognitive Computing in Engineering 4 (2023) 6–20
Determining the type of dry beans manually requires a skilled person and a great deal of time. When the seeds in an array appear very similar, categorizing them manually becomes a challenging process. It is almost impossible for a human operator to interpret or handle such seeds without specific tools or automatic software procedures (Mendoza and Aguilera, 2010; Liu et al., 2011; Savakar, 2012; Rodríguez-Pulido et al., 2013; Gómez-Sanchis et al., 2012; Stegmayer et al., 2013; Khatri et al., 2022). Today, the inspection of the quality of seeds, fruits and vegetables, along with the examination and categorization of seeds and grains, is performed worldwide with the help of machine learning and computer vision to meet these demands. The purpose of seed categorization is to ensure high-quality food products in greater quantities.

The dry bean (Phaseolus vulgaris L.) is the most nutritious and widely cultivated vegetable (Fabaceae-Leguminosae) found all over the world. The purification of dry beans plays an important role in the economy of agriculture-based countries like Bangladesh, India and Pakistan throughout the winter season. Unfortunately, deterioration in seed quality may begin at any point in the plant's development from fertilization onward, due to the effects of changing climate and other environmental factors. Breeding new seed cultivars and determining their traits, which are significant variables for proper plant growth, may improve the response and/or tolerance of plants to environmental stimuli (Ceyhan et al., 2012). The process of seed identification is time-consuming and may be interpreted in a variety of ways. From a practical point of view, it becomes more challenging with respect to commercial and technical aspects. In particular, dry bean species tend to vary in color, and the geometrical data carry no information about bean color. For this reason, it is crucial both economically and technically to build an automated technique to detect and categorize seed features rapidly and repeatably (Granitto et al., 2002; Bacchetta et al., 2008).

From the perspective of cultivation, seed quality greatly influences crop production. In recent years, knowledge-based technologies such as statistical learning, fuzzy logic and artificial neural networks (ANN) have been used in the inspection, classification, prediction and segmentation of food product quality (León-Roque et al., 2016; Du & Sun, 2004). The combination of Computer Vision Systems (CVS) and ANN produces a potent machine vision inspection tool. Many researchers have employed machine learning algorithms to evaluate the quality of beans using various analytical techniques (León-Roque et al., 2016; Lawi, A. & Adhitya, Y., 2018). Random forest (RF) is an ensemble learning technique that compares well with or outperforms various classification algorithms including SVM, C4.5, AdaBoost, KNN, LR, Stochastic Gradient Boosting Trees, Extreme Learning Machine, Sparse Representation-Based Classification, and Deep Learning (Breiman, 2001; Zhang et al., 2017; Barbon et al., 2016). Most studies follow traditional approaches to predict seed types. For instance, SVM (Subasi, 2015; Yahyaoui et al., 2018), RF (Barbon et al., 2016), KNN, NB, DT and MLP (Koklu et al., 2020) algorithms have shown good performance in solving classification problems in a variety of agricultural fields.

The aim of this study is to apply a set of classification techniques and find the best classifier for identifying the actual types of dry beans. In addition, the ADASYN algorithm was adopted to equalize the class distribution, and boxplots and the Interquartile Range (IQR) were employed to remove outliers, which slightly improves classification performance. Further, our investigation demonstrates that the XGB classifier may be applied in lieu of the widely used KNN, RF and LR techniques.

The present study offers the following contributions:

i) Data processing: The core contribution of this study lies in the data preprocessing and classification stages, including data scaling, outlier removal, and application of the ADASYN algorithm to eliminate the issues arising from the imbalanced class distribution of the dataset.
ii) Classification of balanced and imbalanced dataset: The traditional classifiers employed by researchers today, namely Logistic Regression (LR), Naïve Bayes (NB), k-Nearest Neighbor (KNN), Decision Tree (DT), Random Forest (RF), Extreme Gradient Boosting (XGB), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), are applied to balanced and imbalanced data to provide a comparison with state-of-the-art algorithms.
iii) Performance evaluations: In the experimental evaluation, the performances of the mentioned classifiers are compared and evaluated in terms of Accuracy (ACC), Sensitivity (SE), Specificity (SP), False Positive Rate (FPR), Cohen's Kappa Coefficient (Kappa), F1-score, Mean Square Error (MSE) and Area Under the Curve (AUC).
iv) Comparison of classifiers: There are no comparative studies on multiclass classification techniques and data distribution which address subject-wise problems. In the experimental evaluation, the XGB classification technique shows better performance than the other selected classifiers with the help of the ADASYN algorithm, improving accuracy by 1.4%, 5.5%, 0.40%, and 11.5% compared to RF, DT, KNN and MLP, respectively.
v) Statistical evaluation: Receiver Operating Characteristic (ROC) curves are constructed by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings to assess the ability of the classifiers and validate the proposed ML-based system.

The rest of the paper is outlined as follows. Section 2 reviews the applications of machine learning algorithms in seed identification. Section 3 describes the materials and research framework, including data sources, descriptive statistics of the variables, the data preprocessing stage and the different classification techniques. The performance measures are also presented in that section. Section 4 presents the experimental results, with graphical and statistical performance summaries based on the confusion matrix. A brief discussion is provided in Section 5, and Section 6 draws the conclusions of the work.

2. Related works and motivations

This section provides a brief discussion of recent works related to the classification of different seed varieties. Almost all existing machine learning (ML) algorithms have been used with various morphological, tonal, textural and color features for seed classification. Oliveira et al. developed a fast and reliable computer vision system to classify fermented cocoa beans into four categories (Oliveira et al., 2021). To identify the samples, hand-crafted characteristics were extracted from the beans as predictors. They used RF to determine the quality of fermented beans and proposed it as a cut-test using digital Red, Green and Blue (RGB) images. Sanlı et al. evaluated the performance on different datasets using the KNN, J48, SMO, NB, NBM, BAGGING and JRIP classification algorithms (Sanlı et al., 2020). A machine learning technique was proposed by Islam et al. to identify illness in potato plants using leaf images. Their study obtained 95% accuracy in classifying illness in potato using an SVM on over 300 samples. Their approach enabled the widespread diagnosis of plant diseases by automated detection. However, the seriousness of the identified ailments has not yet been established (Islam et al., 2017). Gürcan et al. analyzed Turkish literature using supervised machine learning methods with a variety of factors (Gürcan, F., 2018). On Turkish news texts, the authors compared the performance of the Multinomial NB, Bernoulli NB, SVM, KNN and DT algorithms. Another study classified hazardous online activities into different categories using J48, PART and SVM (Goseva-Popstojanova et al., 2012). The authors attempted to differentiate between various types of malicious activities directed towards internet platforms.

Koklu et al. introduced another Computer Vision System (CVS) for recognizing registered varieties of dry beans with comparable traits in order to obtain consistent seed types from crop output (Koklu et al., 2020). They assessed their performance by comparing MLP, SVM, KNN, and DT
Table 2
Statistical distribution for the selected features of dry bean (in pixels).
underlying data distribution, including predictor independence and a meaningful variable related to the outcome variable. These assumptions must be satisfied before estimating the model (Agresti, A., 2002).

3.4.2. K-nearest neighbor classifier (KNN)
KNN is a distance-based supervised machine learning technique that uses training data to categorize new data points. It is used to solve classification and regression problems (Mukherjee et al., 2022). It returns an integer representing the output (label) of the classification. KNN is a memory-based classifier that retains all training data points in order to predict test data by comparing an input sample to each training instance. It considers the $k$ training neighbors $x_r$, $r = 1, \dots, k$, that are closest to $x_0$ in terms of distance. For a given new data point $x_0$, the algorithm assigns a label based on a majority vote among the $k$ neighbors (Hastie et al., 2009; Mukherjee et al., 2021). The KNN technique is run repeatedly with various values of $k$, and the $k$ that minimizes the number of errors is selected.

3.4.3. Decision tree classifier (DT)
One of the simplest and most straightforward machine learning algorithms is the Decision Tree (DT) classifier, which is based on the divide-and-conquer principle (Igual, L. & Seguí, S., 2017). A DT, with internal nodes representing tests (on input patterns) and leaf nodes representing categories (of patterns), assigns a class number (or output) to a pattern by filtering it through the tests in the tree. Each test provides conclusive and mutually exclusive results.
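The majority vote and the selection of k described for the KNN classifier in Section 3.4.2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes Euclidean distance and a single held-out validation split, and the function names `knn_predict` and `select_k` as well as the toy two-dimensional points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x0, k):
    """Label x0 by majority vote among its k nearest training points."""
    # Rank training points by Euclidean distance to x0 and keep the k closest.
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(p[0], x0))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def select_k(train_X, train_y, val_X, val_y, candidates):
    """Return the candidate k with the fewest errors on the validation set."""
    def errors(k):
        return sum(knn_predict(train_X, train_y, x, k) != y
                   for x, y in zip(val_X, val_y))
    return min(candidates, key=errors)


# Toy, made-up data (not the dry bean features).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["A", "A", "A", "B", "B", "B"]
val_X = [(1.1, 1.0), (4.0, 4.0)]
val_y = ["A", "B"]

best_k = select_k(train_X, train_y, val_X, val_y, candidates=[1, 3, 5])
print(best_k, knn_predict(train_X, train_y, (1.0, 0.9), best_k))
```

When several values of k give the same number of errors, `min` keeps the first candidate, so the order of `candidates` acts as a tie-breaker in this sketch.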
Fig. 4. Box-plot for the features after removing outliers engineering of the dataset.
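A minimal sketch of IQR-based outlier removal of the kind summarized in Fig. 4 is given below. It assumes the common 1.5 × IQR fences and row-wise deletion, neither of which is confirmed by the text; the helper names `iqr_bounds` and `remove_outliers` and the sample ConvexArea values are made up for illustration.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Lower and upper fences: Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def remove_outliers(rows, columns, factor=1.5):
    """Keep only rows that lie inside the fences for every listed column."""
    bounds = {c: iqr_bounds([r[c] for r in rows], factor) for c in columns}
    return [r for r in rows
            if all(bounds[c][0] <= r[c] <= bounds[c][1] for c in columns)]


# Made-up values; one sample is far larger than the rest.
rows = [{"ConvexArea": v}
        for v in [28000, 30000, 31000, 32000, 33000, 35000, 250000]]
clean = remove_outliers(rows, ["ConvexArea"])
print(len(rows), "->", len(clean))
```

The fences are computed once from the full column before any row is dropped, so the result does not depend on the order of the rows.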
3.4.4. Random forest classifier (RF)
In the RF method, the number of trees and their maximum depth (i.e., the evaluation of interactions) are hyper-parameters. RF (Awad, M. & Khanna, R., 2015) is a classification technique that builds a large number of de-correlated DTs. To develop the RF method in Python, we utilized a small number of DTs with Gini as the impurity index.

Burges, C. J., 1998). It is anticipated that any new observation will fit neatly into one of the categories depending on the maximum-margin hyperplane. Support vectors are the data points closest to the hyperplane that separates the classes (Awad, M. & Khanna, R., 2015; Awal et al., 2021b).
There are several evaluation metrics to measure the performance of a machine learning algorithm. The effectiveness of an ML algorithm is determined by the percentage of correct predictions among all predictions. The reported metrics are derived from the confusion matrix, one of the most reliable techniques for describing the performance of a classifier against a set of known test data. True positives, true negatives, false positives and false negatives are used to construct the confusion matrix. Table 3 shows the multiclass confusion matrix with predicted and actual classes, which is used to visualize the performance of each class. Performance metrics such as Accuracy (ACC), Error Rate, Sensitivity (SE), Specificity (SP), Mean Square Error (MSE), Recall, False Positive Rate (FPR), Kappa and F1-score are employed to evaluate predictions for classification problems, as listed in Table 4 (Li et al., 2021; Islam et al., 2022).

3.5.1. Receiver operating characteristics (ROC) curve
In machine learning, graphical analyses are essential for performance evaluation when multiclass classification problems arise. The ROC curve is plotted to depict the performance of the multiclass classifiers. The Area Under the Curve (AUC) is also computed to measure discriminative ability, or how well the classifier works in a particular clinical setting (Ahmed et al., 2021; Khan et al., 2022; Kumari et al., 2021; Rai et al., 2022b; Islam et al., 2022). The AUC equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one, and so provides an aggregated measure over all thresholds. Values of AUC close to 1 indicate an excellent model with a high degree of separability. Conversely, values of AUC near 0 indicate a poorly performing model. The closer the ROC curve is to the top-left corner, the better the test.

4.1. Confusion matrix for selected different algorithms

In this study, a confusion matrix is adopted to visualize and summarize the performance of the classifiers. The matrix shows the class-specific accuracy for each dry bean type as well as the misclassification rates. Each row of the confusion matrix represents the actual class, while each column represents the predicted class. The diagonal elements denote correctly classified outcomes and the off-diagonal elements represent misclassified outcomes. The confusion matrix is obtained by the XGB classifier using physical features such as form, shape, type and structure. The correct and confused predictions for each class are detailed in Table 5. Table 5(a) reveals that, with the exception of Dermason dry bean seeds, all main diagonal counts are higher when the ADASYN algorithm is used than when it is not. Similarly, the other confusion matrices, Table 5(b) to Table 5(h), report the actual and predicted numbers of observations in the same layout as Table 5(a).

Table 6 shows the classification performance of the eight models, LR, KNN, DT, RF, SVM, NB, XGB, and MLP, listed in the first column, with the performance measures AUC, ACC, MSE, F1-score, FPR, Kappa, SE, and SP in the top row; the upper panel is without ADASYN and the lower one is with ADASYN. When ADASYN is applied so that all classes have an equal number of samples, the XGB classifier attains the highest performance measures, including an ACC of more than 95% and an AUC of 99.64%. Additionally, its remaining performance measures are higher than those of the other selected models when ADASYN is applied. The KNN (ACC 95%) and RF (ACC 94%) models show somewhat superior performance measures with re-
Table 4
Calculation formulas and explanations of multiple class metrics.
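The link between the ROC curve and the AUC discussed in Section 3.5.1 can be checked on a small binary example, sketched below under stated assumptions. For a multiclass problem one class would be treated as positive and the rest as negative (a one-vs-rest reading that is assumed here, not taken from the paper). The scores and labels are made up, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by thresholding at each distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def auc_by_ranking(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up classifier scores for six samples (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

area = trapezoid_area(roc_points(scores, labels))
rank_auc = auc_by_ranking(scores, labels)
print(round(area, 4), round(rank_auc, 4))
```

Both routes give the same value on this example, which is the sense in which the AUC aggregates performance over all thresholds.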
Table 5
(a). Confusion matrix for XGBoost classifier with and without ADASYN algorithm (rows: actual class; columns: predicted class).

Without ADASYN
Actual      Seker  Barbunya  Bombay  Cali  Horoz  Sira  Dermason
Seker         454         6       0     0      0    14        13
Barbunya        0       316       0    22      2     2         1
Bombay          0         0     116     0      0     1         0
Cali            1         6       0   399      8     5         0
Horoz           0         3       0    11    466     8         7
Sira            8         2       0     2      4   558        62
Dermason       13         0       0     0      0    54       839

With ADASYN
Actual      Seker  Barbunya  Bombay  Cali  Horoz  Sira  Dermason
Seker         678         1       0     1      0    19         5
Barbunya        5       660       0    29      4     8         0
Bombay          0         0     710     0      0     0         0
Cali            5         8       0   702      8     3         0
Horoz           0         2       0     6    670    14         5
Sira           16         3       0     4     22   592        44
Dermason       17         0       0     0      4    63       625
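As an illustration of how the measures discussed in the text follow from a multiclass confusion matrix, the sketch below uses the counts of the "without ADASYN" panel of Table 5(a). It relies on the standard one-vs-rest definitions of sensitivity, specificity and false positive rate and on the usual form of Cohen's kappa; these are assumed to correspond to the paper's Table 4 but may differ in detail. The function names are hypothetical.

```python
LABELS = ["Seker", "Barbunya", "Bombay", "Cali", "Horoz", "Sira", "Dermason"]

# Rows: actual class, columns: predicted class (Table 5(a), without ADASYN).
CM = [
    [454,   6,   0,   0,   0,  14,  13],
    [  0, 316,   0,  22,   2,   2,   1],
    [  0,   0, 116,   0,   0,   1,   0],
    [  1,   6,   0, 399,   8,   5,   0],
    [  0,   3,   0,  11, 466,   8,   7],
    [  8,   2,   0,   2,   4, 558,  62],
    [ 13,   0,   0,   0,   0,  54, 839],
]


def overall_accuracy(cm):
    """Share of samples on the main diagonal."""
    total = sum(map(sum, cm))
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Agreement beyond chance: (p_o - p_e) / (1 - p_e)."""
    total = sum(map(sum, cm))
    p_o = overall_accuracy(cm)
    row_sums = [sum(r) for r in cm]
    col_sums = [sum(c) for c in zip(*cm)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


def one_vs_rest(cm, i):
    """Sensitivity, specificity and FPR for class i against all others."""
    total = sum(map(sum, cm))
    tp = cm[i][i]
    fn = sum(cm[i]) - tp
    fp = sum(row[i] for row in cm) - tp
    tn = total - tp - fn - fp
    return {"SE": tp / (tp + fn), "SP": tn / (tn + fp), "FPR": fp / (fp + tn)}


print(f"ACC={overall_accuracy(CM):.4f}  Kappa={cohen_kappa(CM):.4f}")
for i, name in enumerate(LABELS):
    print(name, {k: round(v, 4) for k, v in one_vs_rest(CM, i).items()})
```

Per-class values are computed by collapsing the matrix to a two-by-two table for each class in turn, which is why specificity can be high even for classes with many confusions.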
Table 6
Performance measures of classification models (%) on the dry bean dataset.
without ADASYN
with ADASYN
Fig. 5. (A-P) Distributional patterns of seven classes of dry bean among the features.
gard to accuracy metrics than LR (ACC 83%), SVM (ACC 86%) and MLP (ACC 84%) with the ADASYN algorithm in Table 6.

Without the ADASYN algorithm, the XGB model also achieves better accuracy metrics than the other models. As shown in Table 6, the ACC and AUC of the XGB model are 93% and more than 99%, respectively. The accuracies of the LR (ACC 91%), DT (ACC 90%), RF (ACC 92%) and MLP (ACC 91%) models are higher than those of KNN (ACC 89%) and NB (ACC 89%). We conclude from these discussions that the XGB model demonstrates better performance than the other models with or without the ADASYN algorithm. In addition, the KNN model (ACC 95%) achieves the second highest accuracy when the ADASYN algorithm is used, while the RF model (ACC 92%) has the highest average accuracy when the ADASYN algorithm is not used.

Comparing the same model across the two settings, the XGB model performs better both with and without the ADASYN algorithm. Similarly, the XGB model has better performance in terms of ACC and AUC, which are 95% and 93% respectively with and without ADASYN ap-
5. Discussion
Table 7
Performance comparison among the proposed models and other studies.
Study                           Seed type               Classifiers compared                   Best classifier      Accuracy (%)
Koklu et al. (2020)             Dry bean seeds          MLP, SVM, DT, and KNN                  SVM                  93.13
Sanlı et al. (2020)             Sugar beet seeds        MSI, MIT                               MIT                  82.0
Kiratiratanapruk et al. (2020)  14 kinds of rice seeds  LR, LDA, KNN, CNN and SVM              SVM                  83.9
Pozza et al. (2022)             Bean seeds              RF, rpart, rpart1SE, rpart2, NB, SVM   RF                   80.0
Keya et al. (2020)              5 variants of seed      CNN                                    CNN                  93.0
Słowiński, G. (2021)            Dry bean seeds          NB, SVM, DT, RF, SVC, and ANN          RF                   93.61
OuYang et al. (2010)            Rice seeds              BP-ANN                                 BP-ANN               93.66
Proposed model                  Dry bean seeds          LR, KNN, DT, RF, SVM, NB, XGB and MLP  XGB without ADASYN   93.00
                                                                                               XGB with ADASYN      95.40
all other tree-based classifiers on the imbalanced dry beans dataset in terms of ACC, SE, SP and AUC. Additionally, these models provide the higher accuracy for this dry bean dataset than other metrics. According to the experimental assessments, the XGB classifier outperforms the others both with and without ADASYN. Fig. 7(b) shows that the KNN and RF classifiers perform about equally well, but not as well as the XGB classifier; the performance of KNN is slightly better than that of RF. NB, LR and SVM perform at the cost of a much higher run time. The investigation shows that the multiclass classification model may enhance performance by around 1.87% when oversampled with the ADASYN method.

The performances of our multiclass classification model and of prior competing methods on various datasets are listed in Table 7. It is not possible to directly compare our method with the existing models in terms of precision, recall and accuracy, since the prior approaches followed traditional strategies in the seed pre-processing stage, feature extraction and classification, with different distributions of the experimental datasets. For example, an accuracy of 83.90% was reported for 14 kinds of rice seeds using statistical machine learning approaches and pre-trained deep learning models (Kiratiratanapruk et al., 2020), and 93.00% on 5 variants of seed, developing an adequate integrated framework to replace the current classification system (Keya et al., 2020).

The previous approaches fail to reflect the ability of the classifier for each class of samples because of the unbalanced dataset, which favors the class with the higher probability of occurrence (majority) over the one with the lower probability of occurrence (minority) (Awal et al., 2021a, 2021b). Oliveira et al. experimented on a total of 1800 beans with four classes and observed that the imbalanced dataset represents the true variation of classes. The classification accuracies were 0.93 for the unbalanced dataset and 0.92 for the balanced dataset, but precision, recall and F1-score were also high for a few classes, as those classes are classified inaccurately and affect the performance metrics of their respective class owing to the improper distribution. The RF classification model on the balanced dataset provided more information, as it removes the influence of the number of samples per input class on the classification accuracy (Oliveira et al., 2021).

As mentioned earlier, our experimental evaluations used 13,611 items of 7 different types of dry beans collected from various planting areas in Turkey under varying imaging conditions. Additionally, the adaptive synthetic sampling technique was adopted to eliminate the issues arising from the imbalanced dataset, and an attempt was made to delineate how the algorithm affects classification performance when working at high production volumes. Only the proposed approaches (Słowiński et al. 2021; Pozza et al., 2022) experimented on dry beans with a large number of images and subjects similar to this work. The comparison with state-of-the-art algorithms demonstrates that the XGB classification model achieves the highest accuracy when all classes have an equal number of samples. Moreover, our established model has the lowest possibility of being affected by different levels of noise in geometric feature fusion cases.

5.1. Reasons of better performance behind the XGB classifier

The XGB classifier performs better owing to gradient boosting, loss function minimization, and the avoidance of overfitting. Trees are generated sequentially and additively, which turns weak learners into strong learners by increasing the weights of the weak learners and decreasing the weights of the strong learners. In the same way, every tree boosts and learns consistently from the previously grown tree. Another advantage is that the tedious exact search is avoided by an approximate greedy algorithm, which divides the data into quantiles and adopts the quantiles as candidate split thresholds. The parameters are the key factors behind the better performance of the classifier.

5.2. Strengths, limitations, and future scopes of the study

The main strength of this study is the automatic identification of uniform seed varieties for greater crop production, with reduced computational cost and resolution of inter-seed ambiguities. Owing to the increasing demand for uniform seeds in agriculture, the proposed technique may be applied to determine seed quality for planting and marketing, and for classification. The investigated method shows better performance in identifying the correct types of dry beans.

The experiments are conducted on secondary data consisting of 7 different types of dry beans. The 13,611 items of the dataset were collected from various bean planting areas by a research institute in Turkey. The performance of the developed algorithm may be reduced on other datasets with poor data pre-processing and segmentation, different types of features, different feature dimensionality, and so on.

Although the classifiers have achieved satisfactory accuracy on this dataset, they still face various real-time challenges in uniform bean identification. Here, only variables related to shape and size, and characteristics of the bean cultivar, are included as features. The suture axis of the bean (i.e., third-dimensional analysis) is ignored because it is very time-consuming, although it could increase classification performance. If the coefficient of variance, e.g., the difference in the shape of each bean variety, is considered in the shape and size variables, accuracy will improve. Bean identification performance may also be enhanced by employing feature fusion, namely of shape and size features, texture features and statistical features, and decision-level fusion.

6. Conclusions

This work studies classification performance with imbalanced and balanced distributions of a dry bean dataset, which has a great impact on data science and agricultural fields. Agricultural products depend heavily on the quality of seeds as well as the fertility of land. In this study, a genetically diverse dry bean dataset is used to identify the actual seeds with the help of the ADASYN algorithm and different machine learning techniques, which are described in detail in the classification
stage. The mentioned techniques have shown different performances with various parameter settings. Among them, XGB beats all other approaches with both imbalanced and balanced classes for the experimental dataset, with ACC of 93% and 95% respectively. In the case of a balanced dataset, the KNN and RF algorithms also demonstrate superior performance in terms of accuracy measures such as ACC, SE, SP and Kappa. ShapeFactor2, Minor Axis Length, ShapeFactor1, EquivDiameter, Roundness and ConvexArea are the most important features for determining these dry beans. The existing classification approaches employ multiclass classification algorithms, which are highly recommended for the successful isolation and improvement of uniform dry beans. Before applying machine learning algorithms to identify important predictive factors and the classes of bean, this study suggests that authors should investigate the distribution of datasets across classes, focusing on whether they are balanced or imbalanced.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this paper.

Declaration of Competing Interest

There are no conflicts of interest.

Acknowledgments

The authors gratefully acknowledge the contribution of Statistics Discipline, Science, Engineering and Technology School, Khulna University, Khulna-9208, Bangladesh. We give special thanks to Milon Sheikh, a student of English Discipline, Khulna University, who did his best to improve the language of this article. The authors also gratefully acknowledge the editor and referees for their comments and positive critique.

Supplementary materials

References

Ceyhan, E., Kahraman, A., & Onder, M. (2012). The impacts of environment on plant products. International Journal of Bioscience, Biochemistry and Bioinformatics, 2(1), 48.
Desai, C., Ramanan, D., & Fowlkes, C. C. (2011). Discriminative models for multi-class object layout. International Journal of Computer Vision, 95(1), 1–12.
Gómez-Sanchis, J., Martín-Guerrero, J. D., Soria-Olivas, E., et al. (2012). Detecting rottenness caused by Penicillium genus fungi in citrus fruits using machine learning techniques. Expert Systems with Applications, 39(1), 780–785.
Goseva-Popstojanova, K., Anastasovski, G., & Pantev, R. (2012, November). Using multiclass machine learning methods to classify malicious behaviors aimed at web systems. In 2012 IEEE 23rd International Symposium on Software Reliability Engineering (pp. 81–90). IEEE.
Granitto, P. M., Garralda, P. A., Verdes, P. F., & Ceccatto, H. A. (2002). Boosting classifiers for weed seeds identification. VIII Congreso Argentino de Ciencias de la Computación.
Gürcan, F. (2018). Multi-class classification of Turkish texts with machine learning algorithms. In 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT) (pp. 1–5). IEEE.
Hastie, T., Tibshirani, R., Friedman, J. H., et al. (2009). The elements of statistical learning: Data mining, inference, and prediction (Vol. 2, pp. 1–758). New York: Springer.
Igual, L., & Seguí, S. (2017). Introduction to data science. In Introduction to data science (pp. 1–4). Cham: Springer.
Islam, M. M., Rahman, M. J., Islam, M. M., et al. (2022). Application of machine learning based algorithm for prediction of malnutrition among women in Bangladesh. International Journal of Cognitive Computing in Engineering, 3, 46–57.
Islam, M., Dinh, A., Wahid, K., & Bhowmik, P. (2017, April). Detection of potato diseases using image segmentation and multiclass support vector machine. In 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE) (pp. 1–4). IEEE.
Keya, M., Majumdar, B., & Islam, M. S. (2020). A robust deep learning segmentation and identification approach of different Bangladeshi plant seeds using CNN. In 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1–6).
Khan, W., & Haroon, M. (2022). An unsupervised deep learning ensemble model for anomaly detection in static attributed social networks. International Journal of Cognitive Computing in Engineering, 3, 153–160.
Khatri, A., Agrawal, S., & Chatterjee, J. M. (2022). Wheat seed classification: Utilizing ensemble machine learning approach. Scientific Programming, 2022.
Kiratiratanapruk, K., Temniranrat, P., Sinthupinyo, W., et al. (2020). Development of paddy rice seed classification process using machine learning techniques for automatic grading machine. Journal of Sensors, 2020.
Koklu, M., & Ozkan, I. A. (2020). Multiclass classification of dry beans using computer vision and machine learning techniques. Computers and Electronics in Agriculture, 174, Article 105507. https://doi.org/10.1016/j.compag.2020.105507
Kumari, S., Kumar, D., & Mittal, M. (2021). An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier. International Journal of Cognitive Computing in Engineering, 2, 40–46.
Lang, T., Flachsenberg, F., von Luxburg, U., et al. (2016). Feasibility of active machine learning for multiclass compound classification. Journal of Chemical Information and Modeling, 56(1), 12–20.
Lawi, A., & Adhitya, Y. (2018). Classifying physical morphology of cocoa beans digital
images using multiclass ensemble least-squares support vector machine. Journal of
Supplementary material associated with this article can be found, in
Physics: Conference Series. IOP Publishing Vol. 979, No. 1.
the online version, at doi:10.1016/j.ijcce.2023.01.002. León-Roque, N., Abderrahim, M., Nuñez-Alejos, L., et al., (2016). Prediction of fermen-
tation index of cocoa beans (Theobroma cacao L.) based on color measurement and
References artificial neural networks. Talanta, 161, 31–39.
Li, B. (2021). Hearing loss classification via AlexNet and extreme learning machine. Inter-
Agresti, A. (2002). Categorical data analysis (2nd ed). New York: Wiley. national Journal of Cognitive Computing in Engineering, 2, 144–153.
Ahmed, N., Ahammed, R., Islam, M. M., et al., (2021). Machine learning based diabetes Liu, J., Yang, W. W., Wang, Y., et al., (2011). Optimizing machine vision-based applica-
prediction and development of smart web application. International Journal of Cogni- tions in agricultural products by artificial neural network. International Journal of Food
tive Computing in Engineering, 2, 229–241. Engineering, 7(3).
Alzubi, J. A., Kumar, A., Alzubi, O., & Manikandan, R. (2019). Efficient approaches for Madhu, B., Mukherjee, A., Islam, M. Z., et al., (2021, December). Depth motion map
prediction of brain tumor using machine learning techniques. Indian Journal of Public based human action recognition using adaptive threshold technique. In 2021 5th In-
Health Research & Development, 10(2). ternational Conference on Electrical Information and Communication Technology (EICT)
Alzubi, O. A., Alzubi, J. A., Alweshah, M., et al., (2020). An optimal pruning algorithm of (pp. 1–6). IEEE.
classifier ensembles: Dynamic programming approach. Neural Computing and Applica- Mendoza, F., Dejmek, P., & Aguilera, J. M. (2010). Gloss measurements of raw agricultural
tions, 32(20), 16091–16107. 10.1007/s00521-020-04761-6. products using image analysis. Food Research International, 43(1), 18–25.
Alzubi, O. A., Alzubi, J. A., Al-Zoubi, A. M., et al., (2021). An efficient malware detec- Movassagh, A. A., Alzubi, J. A., Gheisari, M., et al., (2021). Artificial neural net-
tion approach with feature weighting based on Harris Hawks optimization. Cluster works training algorithm integrating invasive weed optimization with differential
Computing, 1–19. 10.1007/s10586-021-03459-1. evolutionary model. Journal of Ambient Intelligence and Humanized Computing, 1–9.
Andhalkar, S., & Momin, B. F. (2018, July). Multiclass IFROWNN classification algorithm 10.1007/s12652-020-02623-6.
using OVA and OVO strategy. In 2018 9th International Conference on Computing, Com- Mukherjee, A., Islam, M. Z., Mamun-Al-Imran, et al., (2021, September). Iris recogni-
munication and Networking Technologies (ICCCNT) (pp. 1–7). IEEE. tion using wavelet features and various distance based classification. In 2021 Interna-
Awad, M., & Khanna, R. (2015). Support vector regression. In Efficient learning machines tional Conference on Electronics, Communications and Information Technology (ICECIT)
(pp. 67–80). Berkeley, CA: Apress. (pp. 1–4). IEEE.
Awal, M. A., Hossain, M. S., Debjit, K., et al., (2021a). An early detection of asthma using Mukherjee, A., Ripon, K. S. N., Ali, L. E., et al., (2022). Image gradient based iris recogni-
BOMLA detector. IEEE Access : Practical Innovations, Open Solutions, 9, 58403–58420. tion for distantly acquired face images using distance classifiers. In International Con-
Awal, M. A., Masud, M., Hossain, M. S., et al., (2021b). A novel bayesian optimiza- ference on Computational Science and Its Applications (pp. 239–252). Cham: Springer.
tion-based machine learning framework for COVID-19 detection from inpatient fa- Müller, K. R., Mika, S., Tsuda, K., et al., (2018). An introduction to kernel-based learning
cility data. IEEE Access : Practical Innovations, Open Solutions, 9, 10263–10281. algorithms. Handbook of neural network signal processing. CRC Press 4-1.
Bacchetta, G., Grillo, O., Mattana, E., & Venora, G. (2008). Morpho-colorimetric character- Muralidharan, K., Ramesh, A., Rithvik, G., et al., (2021). 1D Convolution approach to
ization by image analysis to identify diaspores of wild plant species. Flora-Morphology, human activity recognition using sensor data and comparison with machine learning
Distribution, Functional Ecology of Plants, 203(8), 669–682. algorithms. International Journal of Cognitive Computing in Engineering, 2, 130–143.
Barbon, A. P. A., Barbon Jr„ S., Mantovani, R. G., et al., (2016). Storage time prediction Oliveira, M. M., Cerqueira, B. V., Barbon Jr„ S., et al., (2021). Classification of fermented
of pork by Computational Intelligence. Computers and Electronics in Agriculture, 127, cocoa beans (cut test) using computer vision. Journal of Food Composition and Analysis,
368–375. 97, Article 103771.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. Paliwal, J., Visen, N. S., & Jayas, D. S. (2001). Evaluation of neural network architectures
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data for cereal grain classification using morphological features. Journal of Agricultural En-
mining and Knowledge Discovery, 2(2), 121–167. 10.1023/A:1009715923555. gineering Research, 79(4), 361–370.
19
M. Salauddin Khan, T.D. Nath, M. Murad Hossain et al. International Journal of Cognitive Computing in Engineering 4 (2023) 6–20