
Detecting Pesticide Concentration Levels using Colour Information and Machine Learning

Abstract

The following research compares and tests the proficiency of suitable machine learning models in effectively predicting pesticide concentration levels, particularly organophosphate pesticide concentration levels, using only colour information. For this task, nine diverse machine learning models were trained on five different feature sets derived from the same dataset in order to find the best model and combination of features for the task, with the feature sets comprising the RGB, HSI, and LAB colour spaces, either individually or in combination. The study found that K-Nearest Neighbours achieved the highest accuracy, surpassing the other models. The ensemble models showed promise, but their inability to capture regional patterns resulted in lower accuracies, while the linear and probabilistic models struggled because they could not capture the complex, non-linear patterns present in the data.

Introduction

Pesticides are essential in modern agriculture due to their ability to protect crops from harmful pests and diseases; their indiscriminate usage, however, endangers human health and can result in environmental damage [1]. Moreover, the advent of modern agricultural practices has been instrumental in ensuring global food security, but it also brings forth challenges related to indiscriminate pesticide usage and its impact on the environment and human health [1]. As a result, it is crucial to have accurate, efficient, and cost-effective technologies for measuring pesticide concentration levels in order to maintain safe food production and sustainable agricultural practices, even in areas without developed infrastructure.

Current development in the field involves the usage of expensive and intricate machinery: a cost analysis of pesticide detection techniques in the published literature demonstrates that the various techniques range from medium to high cost, with none having low cost [2]. Pesticide detection methods currently in common use include gas chromatography, liquid chromatography, and GC-mass spectrometry. While these methods are known for their high accuracy, they are burdened by exorbitant costs, substantial instrument volumes, and limited portability [3].

One way to counteract this is to transfer the capabilities of this intricate machinery by leveraging machine learning techniques to build a model that can accurately recognize pesticide concentration levels using colour information. By building a suitable machine learning model, one can ideally eliminate the need for such equipment in the field, owing to its low-cost implementation.

Therefore, in the following research, we aim to compare multiple models to test which can accurately identify pesticide concentration levels based on colour information, in order to find the most efficient. Nine different machine learning models are used in the study, each trained on five different feature sets taken from the dataset. The feature sets include, either separately or jointly, the RGB, HSI, and LAB colour spaces.
Literature Review

Image processing, or picture detection, is a field that has been thoroughly researched, even for agricultural applications [4], with past studies indicating the potential for future implementation of statistical machine learning technology in agriculture [5]. Colour information, meanwhile, has successfully been used in other fields, such as chemistry, in predicting concentration levels for chemicals such as formaldehyde [6], and biology, in detecting glucose concentration [7]. Its success in such comparable fields hints at the promise of further applications in the agricultural landscape. Moreover, although studies have investigated methodologies for detecting pesticide concentration levels using colour information [8], the examination was confined to a restricted set of models. This research seeks to extend the exploration by encompassing a diverse array of models, aiming to facilitate a comparative analysis of multiple models' accuracy levels.

Background:

The data for this dataset was collected using an on-spot biosensor that utilizes the Organophosphate Hydrolase enzyme (OPH) to detect organophosphate pesticides [3]. The developed OPH was then integrated into a 96-well plate format with UIISScan 1.1, an advanced imaging-array-technology-based, field-portable, high-throughput sensory system [8]. The UIIS utilizes the principles of colorimetry and enzymatic reactions to provide a rapid and reliable detection platform.

The OPH bioreceptor is essential to the working of the device. It is an amidohydrolase enzyme that can hydrolyze a wide range of organophosphate insecticides with specific P-F, P-O, P-S, and P-CN bonds in their chemical structures. This unique characteristic of OPH allows it to break down these harmful pesticides, resulting in the formation of a yellow-coloured product called p-nitrophenol (pNP); the concentration of pNP is directly proportional to the pesticide concentration [3]. Therefore, the higher the yellow concentration of a particular sample, the higher the pesticide concentration, since more pNP (the yellow byproduct) equates to more pesticide being present for the OPH to break down.

The colour information is then extracted using a contact image sensor (CIS) as the image array technology. The pNP increases in intensity as the enzymatic activity of the OPH increases; this change in colour intensity is detected by the CIS and translated into electronic impulses, resulting in real-time pesticide data [3].

For the purposes of this exact dataset, the samples were prepared using pesticide-spiked samples in a mixture of water, NaCl, and acetonitrile. The resultant colour was recorded simultaneously with a UV-Visible spectrophotometer and the developed UIIS-based system. This dual recording system likely provided a more accurate and reliable measure of pesticide concentration [3].

For the purposes of this study, the dataset was divided into 6 distinct levels of pesticide concentration (0, 10, 100, 250, 500, 1000 ppb). Each datapoint consists of the respective sample's colour information, or 9 features (RGBHSILAB) representing the RGB, HSI, and LAB colour spaces, allowing us to explore the individual and combined influences of these colour components on model performance.
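Since the dataset's nine features span three colour spaces, it is worth noting that HSI can in principle be derived from RGB. The sketch below shows the standard geometric RGB-to-HSI conversion; it is purely illustrative, as the paper does not specify which conversion the sensor pipeline used:

```python
import math

def rgb_to_hsi(r, g, b):
    """Convert normalised RGB values (in [0, 1]) to (hue, saturation,
    intensity) using the standard geometric derivation. Illustrative
    only: the dataset already ships HSI features."""
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt((r - g) ** 2 + (r - b) * (g - b))
    if den == 0:
        h = 0.0  # hue is undefined for greys; use 0 by convention
    else:
        h = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        if b > g:
            h = 360.0 - h
    return h, s, i

# A fully saturated yellow (the colour of the pNP product) sits at
# hue 60 degrees with saturation 1.
h, s, i = rgb_to_hsi(1.0, 1.0, 0.0)
```

A strongly yellow sample would thus show up jointly in the R and G channels of RGB, but as a single hue value near 60 degrees in HSI, which is one intuition for why the hue-based feature sets behave differently from raw RGB.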
The Models:

Nine diverse models were chosen for this task due to the complex nature of the data. They are:

1. Support Vector Machine (SVM)
2. Random Forest
3. Gradient Boosting
4. Histogram Gradient Boosting (HistGradient Boosting)
5. K-Nearest Neighbours
6. Naive Bayes
7. Multivariate Linear Regression (MLR)
8. Extreme Gradient Boosting (X-Gradient Boosting)
9. Logistic Regression

SVM
The SVM divides data points belonging to different classes in a high-dimensional space by identifying an ideal hyperplane [9]. It then seeks to maximize the distance between the hyperplane and each class's closest data points, known as the support vectors. SVM's ability to maximize this margin makes it resistant to outliers and data noise, enhancing its performance on complex datasets. It can successfully handle high-dimensional datasets, making it appropriate for this dataset as well as others with a variety of attributes.

Random Forest
A Random Forest constructs numerous decision trees whose outcomes are subsequently integrated via majority voting [10]. This technique reduces the likelihood of overfitting and improves the model's capacity to generalize to large datasets. Random Forest is thus a strong option for this purpose, since it can capture feature interactions and manage high-dimensional data.

Gradient Boosting
Gradient Boosting is another powerful ensemble technique that builds an ensemble of weak prediction models, typically decision trees, in an incremental manner [11]. Beginning with a base model, it iteratively enhances its performance by concentrating on the flaws of the earlier models. It effectively lowers the prediction errors by fitting a fresh decision tree to the negative gradient of the loss function in each iteration. By optimizing differentiable loss functions, Gradient Boosting can efficiently handle various classification problems.

HistGradient Boosting
Histogram Gradient Boosting (HistGradient Boosting) is a variant of gradient boosting that combines high-speed training with the predictive advantage of Gradient Boosting, making it a competitive model [12]. This method is especially advantageous for large datasets with numerous features, making it more scalable while maintaining performance.

K-Nearest Neighbours
K-Nearest Neighbours (KNN) is a straightforward algorithm that classifies new instances based on their similarity to neighbouring data points [13]. When classifying a new instance, KNN determines the k closest data points (neighbours) according to a selected distance measure; a majority vote among those k neighbours then determines the instance's class. This makes KNN a popular choice for classification problems in which similar instances belong to the same class, such as this dataset.

Naive Bayes
Naive Bayes classifiers are probabilistic models that make predictions based on Bayes' theorem and assume strong independence between features [14]. During training, these classifiers estimate the probability of each feature value given each class. When classifying a new instance, the model computes the probability of each class given the observed features and chooses the class with the highest probability as the final prediction. This model performs particularly well in text classification and document categorization tasks.

Multivariate Linear Regression
MLR models the linear relationship between a dependent variable and multiple independent variables [15]. MLR is helpful for datasets where a linear equation can reasonably approximate the connection between the variables; however, its performance may be constrained when dealing with complex non-linear relationships.
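The distance-and-vote procedure described for KNN above is simple enough to sketch directly. The following is an illustrative toy implementation with made-up colour features and concentration labels, not the study's actual pipeline:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points under Euclidean distance. `train` is a list of
    (feature_vector, label) pairs."""
    dists = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature colour readings: two low-concentration and
# two high-concentration samples.
train = [((0.10, 0.20), "10ppb"), ((0.15, 0.25), "10ppb"),
         ((0.80, 0.90), "1000ppb"), ((0.85, 0.95), "1000ppb")]

print(knn_predict(train, (0.12, 0.22), k=3))  # -> 10ppb
```

Because the prediction depends only on the nearest neighbours, the model naturally adapts to locally clustered classes, which is the property the Discussion later credits for KNN's strong result on this dataset.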
X-Gradient Boosting
X-Gradient Boosting (XGBoost) is an optimized distributed gradient boosting library designed for speed and accuracy [16]. Like Gradient Boosting, XGBoost creates an ensemble of decision trees, but it adds regularization terms and sophisticated optimization techniques to boost speed. This makes the model highly efficient and flexible, suitable for a wide range of classification tasks.

Logistic Regression
Logistic regression models the relationship between the dependent variable and the independent factors using the logistic function in order to forecast the likelihood that an event will occur [17]. Although the model is frequently employed for binary classification problems, complex non-linear datasets may not lend themselves to it as well.

Methodology

We trained the nine machine learning models on different sets of features to investigate the impact of colour-related information on predicting pesticide concentration levels. Each model was trained on each of the 5 feature sets that the data was divided into (RGBHSILAB, GBHSI, RGB, HSI, LAB).

The motivation behind training models on different feature sets stems from the need to analyse the relative importance of colour-related features in predicting pesticide concentration. Although colour information allows us to detect the concentration, this feature selection is necessary to assess not only the efficacy of each set in capturing essential patterns related to pesticide concentration but also to help us avoid the curse of dimensionality due to potential noise introduced by less relevant features.

The models were first trained on the RGB colour space (Red, Green, Blue), which served as the baseline experiment, allowing us to evaluate performance when considering raw colour information alone. Subsequently, we explored the HSI colour space (Hue, Saturation, Intensity) to assess the influence of hue, saturation, and intensity components on model accuracy. The LAB colour space (Lightness, A: green-red, B: yellow-blue) was also considered to understand how lightness and the two colour components (A and B) contribute to the predictive power.

Additionally, we combined the RGB and HSI colour spaces using selected features. This selection consisted of the GBHSI (Green, Blue, Hue, Saturation, Intensity) features. Not only did this combination showcase the best overall accuracy during initial testing, but by using it we aimed to uncover potential synergistic effects between RGB and HSI colour components, which could provide more comprehensive information for pesticide concentration prediction.

Moreover, we trained models on the entire feature set (RGBHSILAB) to understand how all colour-related features together contribute to the overall predictive performance. This allowed us to gauge whether incorporating information from all colour spaces provides better accuracy than individual colour spaces or selected combinations.
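A minimal sketch of how the five feature sets might be carved out of the nine-feature rows; the column ordering here assumes the RGBHSILAB layout described in the Background and is otherwise hypothetical:

```python
# Column positions assume the 9-feature ordering R,G,B,H,S,I,L,A,B
# described in the Background; the real dataset's layout may differ.
FEATURE_SETS = {
    "RGBHSILAB": [0, 1, 2, 3, 4, 5, 6, 7, 8],
    "GBHSI":     [1, 2, 3, 4, 5],  # Green, Blue, Hue, Saturation, Intensity
    "RGB":       [0, 1, 2],
    "HSI":       [3, 4, 5],
    "LAB":       [6, 7, 8],
}

def select_features(rows, set_name):
    """Project each 9-feature sample onto one of the five feature sets."""
    cols = FEATURE_SETS[set_name]
    return [[row[c] for c in cols] for row in rows]

sample = [[10, 20, 30, 40, 50, 60, 70, 80, 90]]
print(select_features(sample, "GBHSI"))  # -> [[20, 30, 40, 50, 60]]
```

Each model is then fitted once per projection, giving the 9-model by 5-feature-set grid of accuracies reported in the Results.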
Results:

Model                   RGBHSILAB   GBHSI       RGB         HSI         LAB
                        Acc. (%)    Acc. (%)    Acc. (%)    Acc. (%)    Acc. (%)

SVM                     47.34       64.85       62.66       61.20       50.99
Random Forest           61.20       65.58       55.36       64.85       59.01
Gradient Boosting       64.12       64.85       58.28       63.39       61.20
HistGradient Boosting   58.28       59.01       54.64       63.39       62.66
K-Nearest Neighbours    64.85       70.69       62.66       58.28       47.34
Naive Bayes             59.74       59.74       59.01       61.93       60.47
MLR                     46.61       45.91       49.53       50.99       53.91
X-Gradient Boosting     59.74       60.47       56.09       62.66       61.20
Logistic Regression     59.74       63.39       48.80       57.55       50.26

Figure 1: Visual Representation of the Accuracies of the models
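For convenience, the accuracies above transcribe into a small lookup structure from which summary statistics can be pulled; a sketch that recovers the best model/feature-set pair discussed below:

```python
# Accuracies (%) transcribed from the results table above.
results = {
    "SVM":                   {"RGBHSILAB": 47.34, "GBHSI": 64.85, "RGB": 62.66, "HSI": 61.20, "LAB": 50.99},
    "Random Forest":         {"RGBHSILAB": 61.20, "GBHSI": 65.58, "RGB": 55.36, "HSI": 64.85, "LAB": 59.01},
    "Gradient Boosting":     {"RGBHSILAB": 64.12, "GBHSI": 64.85, "RGB": 58.28, "HSI": 63.39, "LAB": 61.20},
    "HistGradient Boosting": {"RGBHSILAB": 58.28, "GBHSI": 59.01, "RGB": 54.64, "HSI": 63.39, "LAB": 62.66},
    "K-Nearest Neighbours":  {"RGBHSILAB": 64.85, "GBHSI": 70.69, "RGB": 62.66, "HSI": 58.28, "LAB": 47.34},
    "Naive Bayes":           {"RGBHSILAB": 59.74, "GBHSI": 59.74, "RGB": 59.01, "HSI": 61.93, "LAB": 60.47},
    "MLR":                   {"RGBHSILAB": 46.61, "GBHSI": 45.91, "RGB": 49.53, "HSI": 50.99, "LAB": 53.91},
    "X-Gradient Boosting":   {"RGBHSILAB": 59.74, "GBHSI": 60.47, "RGB": 56.09, "HSI": 62.66, "LAB": 61.20},
    "Logistic Regression":   {"RGBHSILAB": 59.74, "GBHSI": 63.39, "RGB": 48.80, "HSI": 57.55, "LAB": 50.26},
}

# Best (model, feature set, accuracy) triple over the whole grid.
best = max(
    ((m, fs, acc) for m, row in results.items() for fs, acc in row.items()),
    key=lambda t: t[2],
)
print(best)  # -> ('K-Nearest Neighbours', 'GBHSI', 70.69)
```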

Discussion:

Due to the various characteristics of each method and the nature of the data itself, the machine learning models performed differently on the varied feature sets. First off, the accuracy of the models was considerably impacted by the feature selection. All available colour-related data was included in the RGBHSILAB feature set, which provided a thorough representation of pesticide concentration patterns; however, the accuracy of models trained on this feature set tended to be lower, while models trained on the GBHSI feature set tended to have higher accuracies, showcasing that the inclusion of hue, saturation, and intensity components may have offered useful discriminative capacity for forecasting pesticide concentration levels. The models trained on the RGBHSILAB feature set likely suffered from the increased dimensionality and the inclusion of less relevant or noisy features hindering the functioning of the model.

KNN achieved the highest accuracy (70.69%) amongst the entire batch while utilising the GBHSI feature set. The success of KNN lies in its ability to capture local patterns and work with non-linear relationships effectively, presumably due to its reliance on neighbouring data points, which allows it to discern regional values in pesticide concentrations. The GBHSI feature set in particular seemed to hold the most valuable information for detecting pesticide concentration levels, as this specific combination of features allowed the model to find distinct patterns in the data, in contrast to other feature sets even with the same model (KNN using the LAB feature set: 47.34%). KNN's relative success suggests that the dataset might exhibit spatial or regional patterns, and might not have the strong global patterns that ensemble models like Random Forests and Gradient Boosting excel at capturing.

Relatedly, ensemble models such as Random Forests and Gradient Boosting demonstrated relatively competitive performances but could not reach accuracies as high as KNN's. These models operate by combining multiple weak learners to improve predictive accuracy. Random Forest, despite having a diverse set of decision trees enabling it to handle complex relationships and interactions among features, achieved a highest accuracy of 65.58%. Similarly, Gradient Boosting, whose sequential optimization process enables it to learn from the errors of previous iterations and capture intricate patterns in the data, attained an accuracy of 64.12% on the RGBHSILAB feature set. The lower performance of the ensemble methods is due to the presence of clear regional patterns within the different classes themselves, as opposed to a clear global pattern across all classes. Moreover, since ensemble methods aim to optimize performance across the entire dataset, they may struggle to capture localised fluctuations within classes, whereas KNN's local approach, reliant on neighbouring data points, allows it to adapt to these variations within the dataset.

Meanwhile, the rather lower accuracy of models like logistic regression and MLR shows that a linear model is unable to fully capture complicated patterns. These models were unable to match the accuracy tiers attained by more adaptable models like KNN or even Random Forest, emphasizing the necessity of taking into account and adjusting for the localized differences inherent in the dataset. Additionally, models like Naive Bayes that rely on probability distributions work by making assumptions about the dataset and the connections between its variables, which may not accurately reflect the complex, non-linear structure of the data. SVM also performed worse than KNN in terms of accuracy; this is probably because, if the training data for an SVM is not linearly separable, it becomes hard for the model to determine optimal parameters [9].

The varied performances of the different models, even within a particular feature set, reveal the complexity of the underlying, non-linear relationships within the data. Furthermore, since extreme outliers can still affect the performance of ensemble models (Random Forest, Gradient Boosting, and XGBoost), the dataset may have contained noise or outliers that decreased their performance.

The overall results highlight the complex structure of the dataset, with potential relationships among features within the same colour space and variances in data distribution across colour spaces, as well as the limitations seen in some models.
Conclusion and Future Enhancement

The data was divided into 5 different feature sets, with each of the models being trained on each of the feature sets. The GBHSI feature set produced the most consistent results across the models, possibly due to its discriminatory power or its lack of noise in the form of less relevant features. However, the complex nature of the dataset and the presence of regional patterns resulted in KNN being the superior model for predicting pesticide concentration levels. Ensemble models such as Gradient Boosting and Random Forest showed promising results but had lower accuracies due to their inability to capture localized variations, while linear models struggled relative to the other models, resulting in lower accuracies across the feature sets.

Overall, this research sheds light on the potential of colorimetric data and machine learning models to predict pesticide concentration levels accurately. The findings provide valuable insights into the most effective feature sets and models for this task, offering opportunities for enhanced monitoring of pesticide usage. Future enhancements should include modifying the models to better capture the regional patterns present within the data in order to obtain higher accuracy; other options such as feature engineering, outlier handling, and the development of a larger dataset to train the models on are also worth considering.
Citations

[1] Aktar, Wasim, et al. “Impact of Pesticides Use in Agriculture: Their Benefits and Hazards.”
Interdisciplinary Toxicology, vol. 2, no. 1, 2009, pp. 1–12, https://doi.org/10.2478/v10102-
009-0001-7.

[2] Thorat, Tanmay, et al. “Advancements in Techniques Used for Identification of Pesticide
Residue on Crops.” Journal of Natural Pesticide Research, vol. 4, 2023, p. 100031,
https://doi.org/10.1016/j.napere.2023.100031.

[3] Mukherjee, Subhankar, et al. “On-Spot Biosensing Device for Organophosphate Pesticide
Residue Detection in Fruits and Vegetables.” Current Research in Biotechnology, vol. 3, 2021,
pp. 308–316, https://doi.org/10.1016/j.crbiot.2021.11.002.

[4] Sonawane, Sachin, et al. “A Literature Review on Image Processing and Classification
Techniques for Agriculture Produce and Modeling of Quality Assessment System for Soybean
Industry Sample.” International Journal of Innovative Research in Electronics and
Communications, vol. 6, no. 2, 2019, https://doi.org/10.20431/2349-4050.0602002.

[5] Rehman, Tanzeel U., et al. “Current and Future Applications of Statistical Machine
Learning Algorithms for Agricultural Machine Vision Systems.” Computers and Electronics in
Agriculture, vol. 156, 2019, pp. 585–605, https://doi.org/10.1016/j.compag.2018.12.006.

[6] Cao, Zhihao, et al. “HCHODetector: Formaldehyde Concentration Detection Based on Deep Learning.” Journal of Physics: Conference Series, vol. 1848, no. 1, 2021, p. 012047, https://doi.org/10.1088/1742-6596/1848/1/012047.

[7] Kim, Ji-Sun, et al. “A Study on Detection of Glucose Concentration Using Changes in
Color Coordinates.” Bioengineered, vol. 8, no. 1, 2016, pp. 99–104,
https://doi.org/10.1080/21655979.2016.1227629.

[8] Lapcharoensuk, Ravipat, et al. “Nondestructive Detection of Pesticide Residue (Chlorpyrifos) on Bok Choi (Brassica Rapa Subsp. Chinensis) Using a Portable NIR Spectrometer Coupled with a Machine Learning Approach.” Foods, vol. 12, no. 5, Feb. 2023, p. 955, https://doi.org/10.3390/foods12050955.
[9] Mukherjee, Subhankar, Souvik Pal, Abhra Pal, et al. “UIISScan 1.1: A Field Portable High-
Throughput Platform Tool for Biomedical and Agricultural Applications.” Journal of
Pharmaceutical and Biomedical Analysis, vol. 174, 2019, pp. 70–80,
https://doi.org/10.1016/j.jpba.2019.05.042.

[10] Breiman, Leo. “Random Forests.” Machine Learning, vol. 45, no. 1, 2001, pp. 5–32, https://doi.org/10.1023/a:1010933404324.

[11] Friedman, Jerome H. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics, no. 5, Oct. 2001, https://doi.org/10.1214/aos/1013203451.

[12] Guryanov, Aleksei. “Histogram-Based Algorithm for Building Gradient Boosting Ensembles of Piecewise Linear Decision Trees.” Lecture Notes in Computer Science, Dec. 2019, pp. 39–50, https://doi.org/10.1007/978-3-030-37334-4_4.

[13] Cover, T., and P. Hart. “Nearest Neighbor Pattern Classification.” IEEE Transactions on Information Theory, no. 1, Jan. 1967, pp. 21–27, https://doi.org/10.1109/tit.1967.1053964.

[14] Lewis, David D. “Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval.” Springer Berlin Heidelberg, 1998, https://link.springer.com/chapter/10.1007/bfb0026666.

[15] Alexopoulos, E. C. “Introduction to Multivariate Regression Analysis.” Hippokratia, vol. 14, suppl. 1, 2010, pp. 23–28.

[16] Chen, Tianqi, and Carlos Guestrin. “XGBoost.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2016, https://doi.org/10.1145/2939672.2939785.

[17] Cox, D. R. “The Regression Analysis of Binary Sequences.” Journal of the Royal Statistical Society: Series B (Methodological), no. 1, Jan. 1959, pp. 238–238, doi:10.1111/j.2517-6161.1959.tb00334.
