Yield observation outlier detection with unsupervised
machine learning in harvest machines

Ms.Eng. Morteza Abdipourchenarestansofla, John Deere, European
Technology and Innovation Centre, Kaiserslautern, Germany;
Prof. H.P. Piepho, Biostatistics Unit, Institute of Crop Science,
University of Hohenheim

Abstract
Having a reliable yield map is an important key to unlocking the value of precision agriculture.
This work demonstrates an application of unsupervised machine learning for recovering harvester
yield monitoring data in the presence of noise in order to construct a reliable yield map. The
objective of this research is to propose a fully automated statistical and machine learning
pipeline to clean the primary raw harvested grain observations, regardless of combine series and
type. Harvest yield observations are contaminated with different sources of error that
obfuscate the spatial structure of the yield observations. The error sources are well-known,
mostly inherited from the dynamic nature of the operation, speed changes, irregular field
topography, a not fully used cutting bar, and start/end delays at headlands and filling/emptying
times. To identify measurements affected by such error sources, the proposed pipeline incorporates
two processing operations that are applied to the data. The first data cleaning operation is
done by setting thresholds based on the lower and upper quartiles as well as the interquartile
range (IQR) to detect outliers. In the second operation, the Robust Kernel Outlier Factor
(RKOF) is deployed to detect local outliers. The developed method is tested on real yield
monitoring data from winter wheat and winter barley crops. Validation (root mean square
error) demonstrates a 75% improvement in the yield map after eliminating the noise. The
developed pipeline provides automated outlier detection with a smaller number of outliers
found in the dataset compared to other publications.

1. Introduction
Outlier detection is a statistical procedure that aims to find suspicious observations in a set of
measurements under study. With the production of big data, outlier detection has attracted
significant interest in the fields of data mining and machine learning [1]. Two common
definitions of outliers are regression outliers and Hawkins outliers [2]. A regression outlier is
an observation that does not match a predefined distribution model, while a Hawkins outlier is
an observation that deviates so much from the other observations as to arouse suspicion that it
was generated by a different mechanism. Harvester technology can measure not only the mass
of the harvested grain/crop but also its nutrient content and moisture at high frequency. One of
the major challenges in the use of combine harvester yield observations is the contamination of
the signals with a variety of noise sources, which can sometimes be tricky to identify. In
general, combine yield monitoring can be affected by operation dynamics, speed changes, field
topography, a not fully used cutting bar, start/end delays, filling/emptying times, and flow delay
[3, 4, 5]. Yield monitoring data follow a spatial structure, which requires a robust local outlier
detection algorithm to spot the outliers accurately while avoiding the removal of a large number
of observations. Several studies have been carried out on cleaning combine harvester yield
observations, with a focus on the Local Moran's I test powered by k/d-nearest neighbours as
well as on density- and distance-based outlier detection [6, 7, 8]. In this paper we propose a
data cleaning pipeline based on an unsupervised method for Hawkins outlier detection called
Robust Kernel-Based Local Outlier Detection, also known as the Robust Kernel Outlier Factor
(RKOF) [2], and on a statistical dispersion measure, the interquartile range (IQR). RKOF learns
from the data through local density estimation and then flags outliers accordingly. To feed this
unsupervised method with more consistent data, the IQR rule is first applied to remove extreme
outliers.

2. Proposed Framework
Let the yield monitoring dataset be a matrix $D \in \mathbb{R}^{n \times m}$, where $m$ is the
number of feature dimensions and $n$ is the number of objects/observations. Our primary variable
of interest is the "wet-mass" of the grain, given as $x_w \in \mathbb{R}^{n}$. Evaluating whether
an observation is an outlier or not can be done within a univariate or a multivariate framework.
For detecting extreme outliers (also called global filtering), a univariate approach is developed
based on the IQR. For detecting local/spatial outliers, a local density estimation with a variable
kernel approach called the Robust Kernel Outlier Factor (RKOF), developed by [2], is
implemented. Given the matrix $D$, three numerical features are selected: "wet-mass" ($x_w$) as
the primary feature dimension, and "elevation" ($x_e$) and "heading" ($x_h$) of the vehicle as
secondary feature dimensions. Our study shows an improvement when using these numerical features
compared to the other numerical features in $D$. Thus, the constructed matrix for local outlier
detection is given by $X = [x_w, x_e, x_h] \in \mathbb{R}^{n \times 3}$. Therefore, to decide
whether a 3-dimensional object $x_i \in X$ is an outlier or not, we compute a vector of RKOF
scores, one for each object in $X$.
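
As a minimal sketch of how the matrix $X$ could be assembled in practice, the snippet below stacks the three selected features from a yield monitoring data frame. The column names and the added standardisation step are assumptions for illustration, not specifications from the paper.

```python
import numpy as np
import pandas as pd

def build_feature_matrix(df: pd.DataFrame) -> np.ndarray:
    """Stack wet-mass, elevation and heading into an (n x 3) matrix X.

    The column names are hypothetical; adapt them to the actual
    yield-monitor export format.
    """
    features = ["wet_mass", "elevation", "heading"]
    X = df[features].to_numpy(dtype=float)
    # Standardisation is not prescribed by the paper; it is added here as a
    # common precaution when mixing features with different units so that no
    # single dimension dominates the later distance computations.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X
```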

2.1. IQR
Given the above attribute $x_w$, the IQR filter consists in finding the samples $x_{w,i}$ such
that

$$Q_1 - c \cdot \mathrm{IQR} \;\le\; x_{w,i} \;\le\; Q_3 + c \cdot \mathrm{IQR}, \qquad
\mathrm{IQR} = Q_3 - Q_1,$$

where $Q_1$ and $Q_3$ are the first and third quartile of $x_w$, respectively, and $c$ is a
constant that enables adjusting the decision range. Any point lying outside this range
(lower/upper band) is considered an outlier. We propose $c$ to be 1.5 for the combine monitoring
dataset.
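
A minimal sketch of this global filter, assuming the wet-mass attribute is available as a NumPy array; the function name and usage are illustrative.

```python
import numpy as np

def iqr_filter(wet_mass: np.ndarray, c: float = 1.5) -> np.ndarray:
    """Return a boolean mask of observations kept by the IQR rule."""
    q1, q3 = np.percentile(wet_mass, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - c * iqr, q3 + c * iqr
    # Points outside [lower, upper] are treated as extreme (global) outliers.
    return (wet_mass >= lower) & (wet_mass <= upper)
```

For example, `clean = df[iqr_filter(df["wet_mass"].to_numpy())]` would keep only the non-extreme observations (the column name is hypothetical).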

2.2. RKOF
Kernel Density Estimation (KDE) is a well-known non-parametric statistical approach for
outlier detection [1]. RKOF is an unsupervised learning algorithm built on the main framework of
the Local Outlier Factor (LOF) and can be used directly in extensions of LOF such as Feature
Bagging [2]. The general idea of RKOF is to compare the relative density of a point with that of
its neighbours, found via a distance measure. The basic assumption is that the density around an
outlier is considerably lower than the density around its neighbours. Unlike LOF, this algorithm
can vary the neighbourhood size through an integrated operation called the kernel density
estimator and applies a weight to the objects found in each neighbourhood. It therefore
demonstrates better performance and scalability on large datasets. The steps for implementing
RKOF on the combine yield monitoring dataset are the following, given the matrix $X$ and an
integer number of neighbours $k$ (a code sketch of both steps is given after step 2).

1. Construct a geometric distance matrix via k-dimensional tree (kd-tree).


Given a 3-dimensional column vector $p$ extracted from $X$, the goal is to find the k-distance
of measurement $p$ and the k-distance neighbourhood of $p$. The k-distance of $p$, denoted
$d_k(p)$, is defined as the distance $d(p, o)$ between $p$ and an object $o \in X$ such that:

a. for at least $k$ objects $o' \in X \setminus \{p\}$, it holds that $d(p, o') \le d(p, o)$;

b. for at most $k-1$ objects $o' \in X \setminus \{p\}$, it holds that $d(p, o') < d(p, o)$.

The k-distance neighbourhood of $p$, named $N_k(p)$, contains every observation whose distance
from $p$ is not greater than the k-distance $d_k(p)$, i.e.,

$$N_k(p) = \{\, q \in X \setminus \{p\} : d(p, q) \le d_k(p) \,\},$$

where any such data point $q$ is called a k-distance neighbour of $p$, and $|N_k(p)|$ is the
number of k-distance neighbours of $p$. The geometric distance matrix
$\Delta \in \mathbb{R}^{n \times k}$ computed by the k-nearest-neighbour search is then used to
compute the RKOF, where $k$ is the number of neighbours and $n$ is the number of objects as
noted before.

2. Compute the RKOF


Given the above geometric distance matrix $\Delta$, the goal is to compute an outlier score for a
given measurement based on its geometric distances from its $k$ neighbours. Let
$\delta_i \in \mathbb{R}^{k}$ be a row vector extracted from $\Delta$ containing the geometric
distances of the $k$ neighbours of a given measurement $p_i$. The computation of the RKOF for all
$p_i$ in $X$ is as follows,

$$\mathrm{RKOF}(p_i) = \frac{\mathrm{wde}(p_i)}{\mathrm{kde}(p_i)},$$

where $\mathrm{wde}(p_i)$ and $\mathrm{kde}(p_i)$ are the weighted density estimation of the
k-distance neighbourhood of $p_i$ and the local density estimation of $p_i$, respectively. A
large RKOF value indicates a higher probability that the given data point is an outlier. A
Gaussian kernel is used for density estimation, with a bandwidth parameter set to 0.05 in this
study. The k-distance can be influenced by the following parameters:

- $C$: multiplication parameter for the k-distance; acts as a bandwidth increaser.

- $\alpha$: sensitivity parameter for the k-distance/bandwidth.

The weight of the objects in each neighbourhood can be tuned through another parameter, $\sigma$,
the variance parameter for weighting the neighbouring objects. We leave the computational details
of $\mathrm{wde}$ and $\mathrm{kde}$ to the reader (see the original RKOF paper [2]).
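
To make the two steps above concrete, here is a minimal sketch under stated assumptions. Step 1 is approximated with SciPy's kd-tree; the helper name knn_distance_matrix is illustrative and not taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_distance_matrix(X: np.ndarray, k: int):
    """Return (n x k) arrays of neighbour distances and indices via a kd-tree."""
    tree = cKDTree(X)
    # Query k + 1 neighbours because the nearest hit of every point is itself.
    dist, idx = tree.query(X, k=k + 1)
    return dist[:, 1:], idx[:, 1:]
```

Step 2 can then be sketched as the wde/kde ratio described above. This is a simplified reading of the RKOF score with default parameter values chosen for illustration, not the exact implementation of [2]:

```python
import numpy as np

def rkof_scores(dist: np.ndarray, idx: np.ndarray,
                C: float = 1.0, alpha: float = 1.0,
                sigma: float = 1.0) -> np.ndarray:
    """Simplified RKOF-style scores from (n x k) neighbour distances/indices.

    dist[i, j] is the distance from observation i to its j-th nearest
    neighbour and idx[i, j] is that neighbour's row index.  C (bandwidth
    multiplier), alpha (k-distance sensitivity) and sigma (neighbour weight
    variance) mirror the tuning parameters described in the text; see the
    original RKOF paper [2] for the exact formulation.
    """
    eps = 1e-12                                   # guard against zero distances
    kdist = dist[:, -1] + eps                     # k-distance of every point
    h = C * kdist ** alpha                        # variable kernel bandwidth per point
    # Gaussian-kernel density estimate of each point from its neighbours,
    # using each neighbour's own bandwidth h[idx].
    kde = np.mean(np.exp(-0.5 * (dist / h[idx]) ** 2) / h[idx], axis=1) + eps
    # Weighted density of the neighbourhood: neighbours whose k-distance is
    # close to the point's own k-distance receive a larger weight.
    w = np.exp(-0.5 * ((kdist[idx] / kdist[:, None] - 1.0) / sigma) ** 2)
    wde = np.sum(w * kde[idx], axis=1) / np.sum(w, axis=1)
    return wde / kde                              # large score -> likely local outlier
```

Observations with the largest scores (for example above a chosen threshold, or in the top few percent) would then be flagged as local outliers.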
3. Experimental Setup and Results
In this study, the harvester observations of 150 fields of wheat and barley from the 2019-2021
crop seasons are used for developing the outlier detection pipeline. The observations are
collected with commercial combine harvester machines at a frequency of 5 Hz. Each individual
yield monitoring dataset is then taken through the cleaning process. The results of the
IQR and RKOF methods are also compared with the previously proposed Local Moran's I method, see
Table 1.
Table 1. Results of cleaning through the proposed outlier detection pipeline.

Algorithm          Observation size    Outliers found    % Outliers
IQR                4925                193               3.9
RKOF               4732                217               4.6
Local Moran's I    -                   -                 30
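
As an illustration of how the two stages could be chained per field, the following sketch applies the hypothetical iqr_filter, build_feature_matrix, knn_distance_matrix, and rkof_scores helpers from the previous sections to one yield monitoring data frame. The neighbourhood size, the RKOF score threshold, and the column names are assumptions, not settings reported in the paper.

```python
import pandas as pd

def clean_field(df: pd.DataFrame, k: int = 30,
                rkof_threshold: float = 2.0) -> pd.DataFrame:
    """Two-stage cleaning of one field's yield monitoring data frame.

    Stage 1 removes extreme (global) outliers with the IQR rule; stage 2
    removes local outliers with the RKOF-style score.  k and rkof_threshold
    are illustrative values only.
    """
    # Stage 1: global filtering on the wet-mass attribute.
    df = df[iqr_filter(df["wet_mass"].to_numpy())]
    # Stage 2: local outlier detection on wet-mass, elevation and heading.
    X = build_feature_matrix(df)
    dist, idx = knn_distance_matrix(X, k)
    scores = rkof_scores(dist, idx)
    return df[scores < rkof_threshold]
```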

The performance of the proposed pipeline is evaluated through an experimental design with
harvest monitoring data and three vegetation indices (VIs), "NDVI", "NDRE", and "LAI",
derived from Sentinel-2 images. The assessment criterion is based on regression analysis: the
outliers detected by IQR are removed and the effect of this deletion on the RMSE is examined.
Put simply, the more erroneous residuals we remove, the smaller the RMSE becomes [9]. The same
procedure is applied for RKOF. The RMSE is given by

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},$$

where $\hat{y}_i$ is the i-th grain mass predicted from the vegetation indices and $y_i$ is the
i-th actual yield mass observation for the corresponding data point. The results of the
regression analysis are provided in Table 2. A Convolutional Neural Network (CNN) is used to
model the "wet-mass" from the VIs. Out of the 150 harvest monitoring datasets, 80% are allocated
for training and 20% for testing the model stability. The RMSE is then calculated on the test
set (see Table 2).
Table 2. Comparison of each cleaning method.

Method              RMSE
Raw observations    4150 kg/ha
IQR                 1275 kg/ha
IQR + RKOF          1050 kg/ha
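
As a small illustration of the evaluation criterion behind Table 2, the following sketch computes the RMSE between the CNN predictions from the vegetation indices and the observed wet mass on a held-out test set; the variable names in the usage comments are placeholders.

```python
import numpy as np

def rmse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    """Root mean square error between predicted and observed wet mass."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Hypothetical comparison mirroring Table 2: the same kind of CNN predictions
# are scored against the raw test observations and against the cleaned ones.
# rmse(y_pred_raw, y_raw)        # raw observations
# rmse(y_pred_clean, y_clean)    # after IQR (+ RKOF) cleaning
```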

4. Conclusion
Any definition of "outlier" is arbitrary and will limit generalizability. Nevertheless, this study
demonstrates an unsupervised approach that can deal with complex and large datasets. Our
main objective is to develop an automated processing pipeline to detect outliers in harvester
yield observations. We deploy an unsupervised machine learning approach which can
consider the heterogeneity in the dataset when detecting heterogeneous outliers. To detect
homogeneous outliers, an IQR-based algorithm is used with a constant of 1.5. As a result, the
proposed pipeline demonstrates more reliable and accurate results when moving from IQR to
RKOF, and the number of detected outliers is considerably smaller (8.4%) than the roughly 30%
removed by the cleaning approach proposed by [4]. The RMSE results show a 75% improvement of
the yield map.

References
[1] Alghushairy O, Alsini R, Soule T, Ma X. A Review of Local Outlier Factor Algorithms for
Outlier Detection in Big Data Streams. Big Data and Cognitive Computing. 2021;
5(1):1.
[2] Gao J., Hu W., Zhang Z., Zhang X., Wu O. (2011). RKOF: Robust Kernel-Based Local
Outlier Detection. In: Huang J.Z., Cao L., Srivastava J. (eds) Advances in Knowledge
Discovery and Data Mining, 270-283.
[3] Arslan, S., & Colvin, T. (2002). Grain yield mapping: yield sensing, yield reconstruction,
and errors. Precision Agriculture, 135-154.
[4] Blackmore, B. S., & Moore, M. (1999). Remedial correction of yield map data. Precision
Agriculture, 53 – 66.
[5] Lee, D. H., Sudduth, K. A., Drummond, S. T., Chung, S. O., & Myers, D. B. (2012).
Automated yield map delay identification using phase correlation methodology.
Transactions of the ASABE, 743–752.
[6] Vega, A.; Córdoba, M.A.; Castro-Franco, M.; Balzarini, M. (2019). Protocol for
automating error removal from yield maps. Precis. Agric., 1030–1044.
[7] Leroux, C., Jones, H., Clenet, A. et al. (2018). A general method to filter out defective
spatial observations from yield mapping datasets. Precision Agric, 789–808.
[8] Lyle, G., et al. (2013). Post-processing methods to eliminate erroneous grain yield
measurements: review and directions for future development. Precision Agriculture, 15,
377-402.
[9] Cousineau D., Chartier, S. (2010). Outliers detection and treatment: a review.
International Journal of Psychological Research, 3 (1), 59-68.
