Yield observation outlier detection with unsupervised
machine learning in harvest machines

Ms.Eng. Morteza Abdipourchenarestansofla, John Deere, European
Technology and Innovation Centre, Kaiserslautern, Germany;
Prof. H.P. Piepho, Biostatistics Unit, Institute of Crop Science,
University of Hohenheim

Abstract
Having a reliable yield map is an important key to unlocking the value of precision agriculture.
This work demonstrates an application of unsupervised machine learning for recovering harvester
yield monitoring data in the presence of noise in order to construct a reliable yield map. The
objective of this research is to propose a fully automated statistical and machine learning
pipeline to clean the primary raw harvested grain observations, regardless of combine series and
type. Harvest yield observations are contaminated with different sources of error that
obfuscate the spatial structure of the yield observations. The error sources are well-known,
mostly inherited from the dynamic nature of the operation, speed changes, irregular field
topography, a not fully used cutting bar, and start/end delays at headlands and filling/emptying
times. To identify measurements affected by such error sources, the proposed pipeline incorporates
two processing operations that are applied to the data. The first data cleaning operation is
done by setting thresholds based on the lower and upper quartiles as well as the interquartile
range (IQR) to detect outliers. In the second operation, the Robust Kernel Outlier Factor
(RKOF) is deployed to detect local outliers. The developed method is tested on real yield
monitoring data from winter wheat and winter barley crops. Validation (root mean square
error) demonstrates a 75% improvement in the yield map after eliminating the noise. The
developed pipeline provides automated outlier detection with a smaller number of outliers
found in the dataset compared to other publications.

1. Introduction
Outlier detection is a statistical procedure that aims to find suspicious observations in a set of
measurements under study. With the production of big data, outlier detection has attracted
significant interest in the fields of data mining and machine learning [1]. Two common
definitions of outliers are regression outliers and Hawkins outliers [2]. A regression outlier is
an observation that does not match a predefined distribution model, while a Hawkins outlier is
an observation that deviates so much from the other observations as to arouse suspicion that it
was generated by a different mechanism. Harvester technology can measure not only the mass
of the harvested grain/crop but also its nutrient content and moisture at high frequency. One of
the major challenges in the use of combine harvester yield observations is the contamination of
the signals with a variety of noise sources, which can sometimes be tricky to identify. In
general, combine yield monitoring can be affected by operation dynamics, speed changes, field
topography, a not fully used cutting bar, start/end delays, filling/emptying times, and flow delay
[3, 4, 5]. Yield monitoring data follow a spatial structure, which requires a robust local outlier
detection algorithm to spot the outliers accurately while avoiding the removal of a large number
of observations. Several studies have been carried out on cleaning combine harvester yield
observations, with a focus on the Local Moran's I test powered by k/d-nearest neighbours as
well as on density- and distance-based outlier detection [6, 7, 8]. In this paper we propose a
data cleaning pipeline based on an unsupervised method for Hawkins outlier detection called
Robust Kernel-Based Local Outlier Detection, also known as the Robust Kernel Outlier Factor
(RKOF) [2], and on a statistical dispersion measure, the interquartile range (IQR). RKOF learns
from the data through local density estimation and then flags outliers accordingly. To feed this
unsupervised method with more consistent data, the IQR rule is first applied to remove extreme
outliers.

2. Proposed Framework
Let the yield monitoring dataset be a matrix $D \in \mathbb{R}^{n \times m}$, where $m$ is the
number of feature dimensions and $n$ is the number of objects/observations. Our primary variable
of interest is the "wet-mass" of the grain, given as $x_w \in \mathbb{R}^{n}$. Evaluating whether
an observation is an outlier or not can be done within a univariate or a multivariate framework.
For detecting extreme outliers (also called global filtering), a univariate approach is developed
based on the IQR. For detecting local/spatial outliers, a local density estimation with a variable
kernel approach called the Robust Kernel Outlier Factor (RKOF), developed by [2], is
implemented. Given the matrix $D$, three numerical features are selected: "wet-mass" ($x_w$) as
the primary feature dimension, and "elevation" ($x_e$) and "heading" ($x_h$) of the vehicle as
secondary feature dimensions. Our study shows an improvement when using these numerical features
compared to the other numerical features in $D$. Thus, the constructed matrix for local outlier
detection is given by $X = [x_w, x_e, x_h] \in \mathbb{R}^{n \times 3}$. Therefore, to decide
whether a 3-dimensional object $x_i \in X$ is an outlier or not, we compute a vector of RKOF
scores, one for each object in $X$.
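
As a minimal sketch of how the matrix $X$ could be assembled in practice, the snippet below stacks the three selected features from a yield monitoring data frame. The column names and the added standardisation step are assumptions for illustration, not specifications from the paper.

```python
import numpy as np
import pandas as pd

def build_feature_matrix(df: pd.DataFrame) -> np.ndarray:
    """Stack wet-mass, elevation and heading into an (n x 3) matrix X.

    The column names are hypothetical; adapt them to the actual
    yield-monitor export format.
    """
    features = ["wet_mass", "elevation", "heading"]
    X = df[features].to_numpy(dtype=float)
    # Standardisation is not prescribed by the paper; it is added here as a
    # common precaution when mixing features with different units so that no
    # single dimension dominates the later distance computations.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X
```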

2.1. IQR
Given the above attribute $x_w$, the IQR filter consists in finding the samples $x_{w,i}$ such
that

$$Q_1 - c \cdot \mathrm{IQR} \;\le\; x_{w,i} \;\le\; Q_3 + c \cdot \mathrm{IQR}, \qquad
\mathrm{IQR} = Q_3 - Q_1,$$

where $Q_1$ and $Q_3$ are the first and third quartile of $x_w$, respectively, and $c$ is a
constant that enables adjusting the decision range. Any point lying outside this range
(lower/upper band) is considered an outlier. We propose $c$ to be 1.5 for the combine monitoring
dataset.
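
A minimal sketch of this global filter, assuming the wet-mass attribute is available as a NumPy array; the function name and usage are illustrative.

```python
import numpy as np

def iqr_filter(wet_mass: np.ndarray, c: float = 1.5) -> np.ndarray:
    """Return a boolean mask of observations kept by the IQR rule."""
    q1, q3 = np.percentile(wet_mass, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - c * iqr, q3 + c * iqr
    # Points outside [lower, upper] are treated as extreme (global) outliers.
    return (wet_mass >= lower) & (wet_mass <= upper)
```

For example, `clean = df[iqr_filter(df["wet_mass"].to_numpy())]` would keep only the non-extreme observations (the column name is hypothetical).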

2.2. RKOF
Kernel Density Estimation (KDE) is a well-known non-parametric statistical approach for
outlier detection [1]. RKOF is an unsupervised learning algorithm built on the main framework of
the Local Outlier Factor (LOF) and can be used directly in extensions of LOF such as Feature
Bagging [2]. The general idea of RKOF is to compare the relative density of a point with that of
its neighbours, found via a distance measure. The basic assumption is that the density around an
outlier is considerably lower than the density around its neighbours. Unlike LOF, this algorithm
can vary the neighbourhood size through an integrated operation called the kernel density
estimator and applies a weight to the objects found in each neighbourhood. It therefore
demonstrates better performance and scalability on large datasets. The steps for implementing
RKOF on the combine yield monitoring dataset are the following, given the matrix $X$ and an
integer number of neighbours $k$ (a code sketch of both steps is given after step 2).

1. Construct a geometric distance matrix via k-dimensional tree (kd-tree).


Given a 3-dimensional column vector $p$ extracted from $X$, the goal is to find the k-distance
of measurement $p$ and the k-distance neighbourhood of $p$. The k-distance of $p$, denoted
$d_k(p)$, is defined as the distance $d(p, o)$ between $p$ and an object $o \in X$ such that:

a. for at least $k$ objects $o' \in X \setminus \{p\}$, it holds that $d(p, o') \le d(p, o)$;

b. for at most $k-1$ objects $o' \in X \setminus \{p\}$, it holds that $d(p, o') < d(p, o)$.

The k-distance neighbourhood of $p$, named $N_k(p)$, contains every observation whose distance
from $p$ is not greater than the k-distance $d_k(p)$, i.e.,

$$N_k(p) = \{\, q \in X \setminus \{p\} : d(p, q) \le d_k(p) \,\},$$

where any such data point $q$ is called a k-distance neighbour of $p$, and $|N_k(p)|$ is the
number of k-distance neighbours of $p$. The geometric distance matrix
$\Delta \in \mathbb{R}^{n \times k}$ computed by the k-nearest-neighbour search is then used to
compute the RKOF, where $k$ is the number of neighbours and $n$ is the number of objects as
noted before.

2. Compute the RKOF


Given the above geometric distance matrix $\Delta$, the goal is to compute an outlier score for a
given measurement based on its geometric distances from its $k$ neighbours. Let
$\delta_i \in \mathbb{R}^{k}$ be a row vector extracted from $\Delta$ containing the geometric
distances of the $k$ neighbours of a given measurement $p_i$. The computation of the RKOF for all
$p_i$ in $X$ is as follows,

$$\mathrm{RKOF}(p_i) = \frac{\mathrm{wde}(p_i)}{\mathrm{kde}(p_i)},$$

where $\mathrm{wde}(p_i)$ and $\mathrm{kde}(p_i)$ are the weighted density estimation of the
k-distance neighbourhood of $p_i$ and the local density estimation of $p_i$, respectively. A
large RKOF value indicates a higher probability that the given data point is an outlier. A
Gaussian kernel is used for density estimation, with a bandwidth parameter set to 0.05 in this
study. The k-distance can be influenced by the following parameters:

- $C$: multiplication parameter for the k-distance; acts as a bandwidth increaser.

- $\alpha$: sensitivity parameter for the k-distance/bandwidth.

The weight of the objects in each neighbourhood can be tuned through another parameter, $\sigma$,
the variance parameter for weighting the neighbouring objects. We leave the computational details
of $\mathrm{wde}$ and $\mathrm{kde}$ to the reader (see the original RKOF paper [2]).
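
To make the two steps above concrete, here is a minimal sketch under stated assumptions. Step 1 is approximated with SciPy's kd-tree; the helper name knn_distance_matrix is illustrative and not taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_distance_matrix(X: np.ndarray, k: int):
    """Return (n x k) arrays of neighbour distances and indices via a kd-tree."""
    tree = cKDTree(X)
    # Query k + 1 neighbours because the nearest hit of every point is itself.
    dist, idx = tree.query(X, k=k + 1)
    return dist[:, 1:], idx[:, 1:]
```

Step 2 can then be sketched as the wde/kde ratio described above. This is a simplified reading of the RKOF score with default parameter values chosen for illustration, not the exact implementation of [2]:

```python
import numpy as np

def rkof_scores(dist: np.ndarray, idx: np.ndarray,
                C: float = 1.0, alpha: float = 1.0,
                sigma: float = 1.0) -> np.ndarray:
    """Simplified RKOF-style scores from (n x k) neighbour distances/indices.

    dist[i, j] is the distance from observation i to its j-th nearest
    neighbour and idx[i, j] is that neighbour's row index.  C (bandwidth
    multiplier), alpha (k-distance sensitivity) and sigma (neighbour weight
    variance) mirror the tuning parameters described in the text; see the
    original RKOF paper [2] for the exact formulation.
    """
    eps = 1e-12                                   # guard against zero distances
    kdist = dist[:, -1] + eps                     # k-distance of every point
    h = C * kdist ** alpha                        # variable kernel bandwidth per point
    # Gaussian-kernel density estimate of each point from its neighbours,
    # using each neighbour's own bandwidth h[idx].
    kde = np.mean(np.exp(-0.5 * (dist / h[idx]) ** 2) / h[idx], axis=1) + eps
    # Weighted density of the neighbourhood: neighbours whose k-distance is
    # close to the point's own k-distance receive a larger weight.
    w = np.exp(-0.5 * ((kdist[idx] / kdist[:, None] - 1.0) / sigma) ** 2)
    wde = np.sum(w * kde[idx], axis=1) / np.sum(w, axis=1)
    return wde / kde                              # large score -> likely local outlier
```

Observations with the largest scores (for example above a chosen threshold, or in the top few percent) would then be flagged as local outliers.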
3. Experimental Setup and Results
In this study, the harvester observations of 150 fields of wheat and barley from the 2019-2021
crop seasons are used for developing the outlier detection pipeline. The observations are
collected with commercial combine harvester machines at a frequency of 5 Hz. Each individual
yield monitoring dataset is then taken through the cleaning process. The results of the
IQR and RKOF methods are also compared with the previously proposed Local Moran's I method, see
Table 1.
Table 1. Results of cleaning through the proposed outlier detection pipeline.

Algorithm          Observation size    Outliers found    % Outliers
IQR                4925                193               3.9
RKOF               4732                217               4.6
Local Moran's I    -                   -                 30
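
As an illustration of how the two stages could be chained per field, the following sketch applies the hypothetical iqr_filter, build_feature_matrix, knn_distance_matrix, and rkof_scores helpers from the previous sections to one yield monitoring data frame. The neighbourhood size, the RKOF score threshold, and the column names are assumptions, not settings reported in the paper.

```python
import pandas as pd

def clean_field(df: pd.DataFrame, k: int = 30,
                rkof_threshold: float = 2.0) -> pd.DataFrame:
    """Two-stage cleaning of one field's yield monitoring data frame.

    Stage 1 removes extreme (global) outliers with the IQR rule; stage 2
    removes local outliers with the RKOF-style score.  k and rkof_threshold
    are illustrative values only.
    """
    # Stage 1: global filtering on the wet-mass attribute.
    df = df[iqr_filter(df["wet_mass"].to_numpy())]
    # Stage 2: local outlier detection on wet-mass, elevation and heading.
    X = build_feature_matrix(df)
    dist, idx = knn_distance_matrix(X, k)
    scores = rkof_scores(dist, idx)
    return df[scores < rkof_threshold]
```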

The performance of the proposed pipeline is evaluated through an experimental design with
harvest monitoring data and three vegetation indices (VIs), "NDVI", "NDRE", and "LAI",
derived from Sentinel-2 images. The assessment criterion is based on regression analysis: the
outliers detected by IQR are removed and the effect of this deletion on the RMSE is examined.
Put simply, the more erroneous residuals we remove, the smaller the RMSE becomes [9]. The same
procedure is applied for RKOF. The RMSE is given by

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$

where $\hat{y}_i$ is the i-th grain mass predicted from the vegetation indices and $y_i$ is the
i-th actual yield mass observation for the corresponding data point. The results of the
regression analysis are provided in Table 2. A Convolutional Neural Network (CNN) is used to
model the "wet-mass" from the VIs. Out of the 150 harvest monitoring datasets, 80% are allocated
for training and 20% for testing the model stability. The RMSE is then calculated on the test
set (see Table 2).
Table 2. Comparison of each cleaning method.

Method              RMSE
Raw observations    4150 kg/ha
IQR                 1275 kg/ha
IQR + RKOF          1050 kg/ha
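
As a small illustration of the evaluation criterion behind Table 2, the following sketch computes the RMSE between the CNN predictions from the vegetation indices and the observed wet mass on a held-out test set; the variable names in the usage comments are placeholders.

```python
import numpy as np

def rmse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Root mean square error between predicted and observed wet mass."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical comparison mirroring Table 2: the same kind of CNN predictions
# are scored against the raw test observations and against the cleaned ones.
# rmse(y_pred_raw, y_raw)        # raw observations
# rmse(y_pred_clean, y_clean)    # after IQR (+ RKOF) cleaning
```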

4. Conclusion
Any definition of "outlier" is arbitrary and will limit generalizability. Nevertheless, this study
demonstrates an unsupervised approach that can deal with complex and large datasets. Our
main objective is to develop an automated processing pipeline to detect outliers in harvester
yield observations. We deploy an unsupervised machine learning approach which can
consider the heterogeneity in the dataset when detecting heterogeneous outliers. To detect
homogeneous outliers, an IQR-based algorithm is used with a constant of 1.5. As a result, the
proposed pipeline demonstrates more reliable and accurate results when moving from IQR to
RKOF, and the number of detected outliers is considerably smaller (8.4%) than the roughly 30%
removed by the cleaning approach proposed by [4]. The RMSE results show a 75% improvement of
the yield map.

References
[1] Alghushairy O, Alsini R, Soule T, Ma X. A Review of Local Outlier Factor Algorithms for
Outlier Detection in Big Data Streams. Big Data and Cognitive Computing. 2021;
5(1):1.
[2] Gao J., Hu W., Zhang Z., Zhang X., Wu O. (2011). RKOF: Robust Kernel-Based Local
Outlier Detection. In: Huang J.Z., Cao L., Srivastava J. (eds) Advances in Knowledge
Discovery and Data Mining, 270-283.
[3] Arslan, S., & Colvin, T. (2002). Grain yield mapping: yield sensing, yield reconstruction,
and errors. Precision Agriculture, 135-154.
[4] Blackmore, B. S., & Moore, M. (1999). Remedial correction of yield map data. Precision
Agriculture, 53 – 66.
[5] Lee, D. H., Sudduth, K. A., Drummond, S. T., Chung, S. O., & Myers, D. B. (2012).
Automated yield map delay identification using phase correlation methodology.
Transactions of the ASABE, 743–752.
[6] Vega, A.; Córdoba, M.A.; Castro-Franco, M.; Balzarini, M. (2019). Protocol for
automating error removal from yield maps. Precis. Agric., 1030–1044.
[7] Leroux, C., Jones, H., Clenet, A. et al. (2018). A general method to filter out defective
spatial observations from yield mapping datasets. Precision Agric, 789–808.
[8] Lyle, G., et al. (2013). Post-processing methods to eliminate erroneous grain yield
measurements: review and directions for future development. Precision Agriculture, 15,
377-402.
[9] Cousineau D., Chartier, S. (2010). Outliers detection and treatment: a review.
International Journal of Psychological Research, 3 (1), 59-68.
