net/publication/358948488
All content following this page was uploaded by Morteza Abdipourchenarestansofla on 13 March 2022.
Abstract
Having a reliable yield map is an important key to unlock the value of precision agriculture.
This work demonstrates an application of unsupervised machine learning for harvester yield
monitoring recovery in the presence of noise for constructing a reliable yield map. The
objective of this research is to propose a fully automated statistical and machine learning
pipeline to clean primary harvested raw grain observations, regardless of combine series and
types. Harvest yield observations are contaminated with different sources of error that
obfuscate the spatial structure of the yield observations. The error sources are well known
and stem mostly from the dynamic nature of the operation: speed changes, irregular field
topography, a partially used cutter bar, start/end delays at headlands, and filling/emptying
times. To identify measurements with such error sources, the proposed pipeline incorporates
two processing operations to be applied on the data. The first data cleaning operation is
done by setting thresholds based on the lower and upper quartiles as well as the interquartile
range (IQR) to detect outliers. In the second operation, the Robust Kernel Outlier Factor
(RKOF) is deployed to detect local outliers. The developed method is tested on real yield
monitoring data from winter wheat and winter barley crops. Validation (root mean square
error) demonstrates a 75% improvement in the yield map after eliminating the noise. The
developed pipeline provides automated outlier detection and flags fewer observations as
outliers than other publications report.
1. Introduction
Outlier detection is a statistical procedure that aims to find suspicious observations in a set of
measurements under study. With the growth of big data, it has attracted significant interest in
the fields of data mining and machine learning [1]. Two definitions of outliers are generally
distinguished: regression outliers and Hawkins outliers [2]. A regression outlier is an
observation that does not match the predefined distribution model, while a Hawkins outlier is
an observation that deviates so much from other observations as to arouse suspicion that it
was generated by a different mechanism. Harvester technology can measure not only the mass
of harvested grain/crop but also its nutrients and moisture at high frequency. One of the major
challenges in the use of combine harvester yield observations is the contamination of the
signals with a variety of noise sources, which can be difficult to identify. Generally,
combine yield monitoring can be affected by operational dynamics, speed changes, field
topography, a partially used cutter bar, start/end delays, filling/emptying time, and flow delay
[3, 4, 5]. Yield monitoring data follow a spatial structure, which requires a robust local outlier
detection algorithm to spot the outliers accurately and avoid removing a large number of
observations. Several studies have been carried out on combine harvester yield observation
cleaning, with a focus on the Local Moran's I test powered by k/d-nearest neighbours, and on
density-based and distance-based outlier detection [6, 7, 8]. In this paper we propose a
data cleaning pipeline built on two components: an unsupervised method for Hawkins outlier
detection called Robust Kernel-Based Local Outlier Detection, also known as the Robust
Kernel Outlier Factor (RKOF) [2], and a statistical dispersion measure, the interquartile
range (IQR). RKOF learns from the data through local density estimation and then flags
outliers accordingly. To feed RKOF with more consistent data, IQR filtering is applied first
to remove extreme outliers.
2. Proposed Framework
Let the yield monitoring dataset be a matrix $D \in \mathbb{R}^{n \times m}$, where $m$ is the
number of feature dimensions and $n$ is the number of observations. Outlier detection on $D$ can
be done through a univariate or multivariate framework. For detecting extreme outliers (also
called global filtering), a univariate approach is developed based on IQR. For detecting
local/spatial outliers, a local density estimation with a variable kernel, the Robust
Kernel Outlier Factor (RKOF) developed by [2], is implemented. Given the matrix $D$, three
numerical features are selected: wet mass ($w$) as the primary feature dimension, and
elevation ($e$) and heading ($\theta$) of the vehicle as secondary feature dimensions. Our study
shows an improvement using these numerical features compared to the other
numerical features in $D$. Thus, our constructed matrix for local outlier detection is given by,

$$X = [\, w \;\; e \;\; \theta \,] \in \mathbb{R}^{n \times 3}$$

Therefore, each row of $X$ is a 3-dimensional object, for which we are able to compute
whether it is an outlier or not.
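As a minimal sketch of how the three-feature matrix described above might be assembled, the snippet below selects wet mass, elevation, and heading from a list of harvester records. The field names, the extra `speed` field, and all values are hypothetical and serve only to illustrate the selection step.

```python
# Hypothetical harvester records; field names and values are illustrative only.
records = [
    {"wet_mass": 8.1, "elevation": 212.4, "heading": 91.0, "speed": 1.6},
    {"wet_mass": 7.9, "elevation": 212.6, "heading": 90.5, "speed": 1.7},
    {"wet_mass": 8.3, "elevation": 212.9, "heading": 91.2, "speed": 1.6},
]

# Primary feature first, then the two secondary features.
FEATURES = ("wet_mass", "elevation", "heading")


def build_feature_matrix(rows, features=FEATURES):
    """Return one 3-tuple per observation, keeping only the selected features."""
    return [tuple(float(row[name]) for name in features) for row in rows]


X = build_feature_matrix(records)
print(len(X), len(X[0]))  # number of observations, number of selected features
```

Each row of `X` is then one 3-dimensional object that the local outlier detection step can score.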
2.1. IQR
Given the wet-mass attribute $w$ of $D$, IQR filtering consists in finding the samples $w_i$ such
that,

$$Q_1 - c\,(Q_3 - Q_1) \le w_i \le Q_3 + c\,(Q_3 - Q_1)$$

where $Q_1$ and $Q_3$ are the first and third quartiles of $w$ respectively, $Q_3 - Q_1$ is the
IQR, and $c$ is the constant which enables adjusting the decision range. Any point lying outside
this range (lower-upper band) is considered an outlier. We propose $c$ to be 1.5 on combine
monitoring datasets.
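The quartile-based rule above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the function names and the sample readings are hypothetical, and the "inclusive" quartile convention of `statistics.quantiles` is one of several possible choices that the text does not specify.

```python
import statistics


def iqr_bounds(values, c=1.5):
    """Return the (lower, upper) decision band Q1 - c*IQR and Q3 + c*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - c * iqr, q3 + c * iqr


def iqr_filter(values, c=1.5):
    """Split values into (kept, outliers) according to the IQR band."""
    lower, upper = iqr_bounds(values, c)
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers


# Hypothetical wet-mass readings; 0.2 and 25.0 mimic extreme measurement errors.
wet_mass = [7.8, 8.1, 8.0, 7.9, 8.3, 8.2, 0.2, 8.0, 25.0, 8.1]
kept, outliers = iqr_filter(wet_mass)
print(outliers)  # [0.2, 25.0]
```

Raising `c` widens the band and flags fewer points, which is the adjustment of the decision range mentioned above.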
2.2. RKOF
Kernel Density Estimation (KDE) is a well-known non-parametric statistical approach for
outlier detection [01]. RKOF is an unsupervised learning algorithm developed based on the
main framework of Local Outlier Factor (LOF) which can be directly used in the extension of
LOF, such as Feature Bagging [2]. The general idea of RKOF is to compare the relative
density of a point with its neighbours found via distance. The basic assumption is the density
around an outlier is considerably lower than the density around its neighbours. Unlike LOF,
this algorithm can vary the neighbourhood size through an integrated kernel density
estimator and apply a weight to the objects found in each neighbourhood. Therefore, it
demonstrates better performance and scalability on large datasets. The steps for
implementing RKOF on a combine yield monitoring dataset are the following, given the
constructed matrix and an integer number $k$ of neighbours.
First, a k-nearest-neighbour (KNN) search finds the $k$-distance neighbours of each
measurement, and a geometric distance matrix is computed from the KNN result. RKOF then
assigns a score to a given measurement based on its geometric distance from its $k$
neighbours. Let $d_i$ be a row vector extracted from the distance matrix in which the geometric
distances of the $k$ neighbours of a given measurement are presented. A higher RKOF score
indicates a higher probability of a given data point being an outlier. A Gaussian kernel is used
for density estimation, given a bandwidth parameter of 0.05. The $k$-distance can be influenced
by a second parameter, and the weight for objects in each neighbourhood can be tuned through
another parameter, which is the variance parameter for weighting the neighbouring objects.
We leave the remaining computational details to the reader (see the original RKOF
paper [2]).
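The neighbour-density comparison described in this section can be illustrated with a deliberately simplified sketch. It is not the RKOF algorithm of [2]: it omits the neighbour weighting and the additional tuning parameters, and only compares a variable-bandwidth Gaussian kernel density at each point with the mean density of its k nearest neighbours. The function names, the bandwidth multiplier `c`, and the toy data are assumptions made for illustration.

```python
import math


def knn(points, i, k):
    """Return [(distance, index)] for the k nearest neighbours of points[i] (brute force)."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def local_kernel_scores(points, k=5, c=1.0):
    """Score each point by (mean density of its neighbours) / (its own density)."""
    n, dim = len(points), len(points[0])
    neighbours = [knn(points, i, k) for i in range(n)]
    # k-distance: distance to the k-th nearest neighbour, floored to avoid a zero bandwidth.
    kdist = [max(nb[-1][0], 1e-12) for nb in neighbours]

    def kde(i):
        total = 0.0
        for d, j in neighbours[i]:
            h = c * kdist[j]  # variable bandwidth taken from the neighbour's k-distance
            norm = (2.0 * math.pi) ** (dim / 2) * h ** dim
            total += math.exp(-0.5 * (d / h) ** 2) / norm
        return max(total / k, 1e-300)  # floor guards against division by zero

    density = [kde(i) for i in range(n)]
    return [
        sum(density[j] for _, j in neighbours[i]) / k / density[i]
        for i in range(n)
    ]


# Toy data: a regular 3x3x3 grid plus one isolated point far from the grid.
points = [(float(x), float(y), float(z))
          for x in range(3) for y in range(3) for z in range(3)]
points.append((10.0, 10.0, 10.0))

scores = local_kernel_scores(points, k=5)
print(scores.index(max(scores)))  # index of the most outlying point
```

Scores close to 1 indicate that a point is about as dense as its neighbourhood, while a much larger score marks a point lying in a region far sparser than that of its neighbours.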
3. Experimental Setup and Results
In this study, harvester observations from 150 fields of wheat and barley from the 2019-2021
crop seasons are used for developing the outlier detection pipeline. The observations were
collected with a commercial combine harvester at a sampling frequency of 5 Hz. Each
individual yield monitoring dataset is then passed through the cleaning process. The results of
the IQR and RKOF methods are also compared with the previously proposed Local Moran's I
method, see Table 1.
Table 1. Result of cleaning through our proposed outlier detection pipeline.
Algorithm         Observation size   Outliers found   % Outliers
IQR               4925               193              3.9
RKOF              4732               217              4.6
Local Moran's I   -                  -                30
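The figures in Table 1 are consistent with a sequential reading, in which RKOF runs on the observations that remain after IQR filtering. The short check below reproduces the percentages from the counts in the table; the variable names are illustrative, and the combined percentage is derived here rather than reported in the table.

```python
# Counts taken from Table 1.
total_observations = 4925
iqr_outliers = 193
rkof_outliers = 217

# RKOF is applied to what remains after the IQR step.
rkof_input = total_observations - iqr_outliers


def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


pct_iqr = percent(iqr_outliers, total_observations)
pct_rkof = percent(rkof_outliers, rkof_input)
pct_combined = percent(iqr_outliers + rkof_outliers, total_observations)

print(rkof_input, pct_iqr, pct_rkof, pct_combined)  # 4732 3.9 4.6 8.3
```

Under this reading, the two stages together remove about 8.3% of the original observations, against the 30% listed for Local Moran's I.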
The cleaning methods are validated with the root mean square error (RMSE),

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$$

where $\hat{y}_i$ is the i-th predicted grain mass through vegetation indices (VI), $y_i$ is the
i-th actual yield mass observation for the corresponding data point, and $n$ is the number of
data points. The result of the regression analysis is provided in Table 2. A convolutional
neural network (CNN) is used to model the wet mass through VI. Out of 150 harvest monitoring
datasets, 80% are allocated for training and 20% for testing the model stability. The RMSE is
then calculated on the test set (see Table 2).
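For reference, the RMSE used in this validation can be computed as in the sketch below. The function name and the four sample values are hypothetical and do not come from the study's data.

```python
import math


def rmse(predicted, observed):
    """Root mean square error between predicted and observed values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical predicted and observed yield masses for four data points.
predicted = [8.0, 7.5, 9.0, 8.5]
observed = [8.5, 7.0, 9.0, 9.5]

# Squared errors are 0.25, 0.25, 0.0 and 1.0, so the mean is 0.375.
print(round(rmse(predicted, observed), 4))  # 0.6124
```

A lower RMSE on the test set after cleaning indicates that the removed observations were degrading the fit between the vegetation indices and the measured yield.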
Table 2. The comparison of each cleaning method.
Methods RMSE
4. Conclusion
Any definition of "outlier" is arbitrary and will limit generalizability. Nevertheless, this study
demonstrates an unsupervised approach that can deal with complex and large datasets. Our
main objective is to develop an automated processing pipeline to detect outliers in harvester
References
[1] Alghushairy O, Alsini R, Soule T, Ma X. A Review of Local Outlier Factor Algorithms for
Outlier Detection in Big Data Streams. Big Data and Cognitive Computing. 2021;
5(1):1.
[2] Gao J., Hu W., Zhang Z., Zhang X., Wu O. (2011). RKOF: Robust Kernel-Based Local
Outlier Detection. In: Huang J.Z., Cao L., Srivastava J. (eds) Advances in Knowledge
Discovery and Data Mining, 270-283.
[3] Arslan, S., & Colvin, T. (2002). Grain yield mapping: yield sensing, yield reconstruction,
and errors. Precision Agriculture, 135-154.
[4] Blackmore, B. S., & Moore, M. (1999). Remedial correction of yield map data. Precision
Agriculture, 53-66.
[5] Lee, D. H., Sudduth, K. A., Drummond, S. T., Chung, S. O., & Myers, D. B. (2012).
Automated yield map delay identification using phase correlation methodology.
Transactions of the ASABE, 743-752.
[6] Vega, A.; Córdoba, M.A.; Castro-Franco, M.; Balzarini, M. (2019). Protocol for
automating error removal from yield maps. Precis. Agric., 1030-1044.
[7] Leroux, C., Jones, H., Clenet, A. et al. (2018). A general method to filter out defective
spatial observations from yield mapping datasets. Precision Agric, 789-808.
[8] Lyle, Greg et al. Post-processing methods to eliminate erroneous grain yield
measurements: review and directions for future development. Precision Agriculture 15
(2013): 377-402.
[9] Cousineau D., Chartier, S. (2010). Outliers detection and treatment: a review.
International Journal of Psychological Research, 3 (1), 59-68.