
6th IFAC Conference on Sensing, Control and Automation for Agriculture
December 4-6, 2019. Sydney, Australia

Available online at www.sciencedirect.com

ScienceDirect

IFAC PapersOnLine 52-30 (2019) 70–75
Deep Orange: Mask R-CNN based Orange Detection and Segmentation

P. Ganesh ∗  K. Volle ∗∗  T. F. Burks ∗∗∗  S. S. Mehta ∗
∗ Department of Mechanical and Aerospace Engineering, University of Florida, Shalimar, FL-32579 (e-mail: {prashant.ganesh,siddhart}@ufl.edu).
∗∗ National Research Council, University of Florida, Shalimar, FL-32579 (e-mail: kyle.volle@gmail.com).
∗∗∗ Department of Agricultural and Biological Engineering, University of Florida, Gainesville, FL-32611 (e-mail: tburks@ufl.edu).
Abstract: The objective of this work is to detect individual fruits and obtain a pixel-wise mask for each detected fruit in an image. To this end, we present a deep learning approach, named Deep Orange, for detection and pixel-wise segmentation of fruits based on the state-of-the-art instance segmentation framework, Mask R-CNN. The presented approach uses multi-modal input data comprising RGB and HSV images of the scene. The developed framework is evaluated using images obtained from an orange grove in Citra, Florida under natural lighting conditions. The performance of the algorithm is compared using RGB and RGB+HSV images. Our preliminary findings indicate that inclusion of HSV data improves the precision to 0.9753 from 0.8947 when using RGB data alone. The overall F1 score obtained using RGB+HSV is close to 0.89.

© 2019, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Keywords: Deep learning; Convolutional neural networks; Multi-modal instance segmentation
1. INTRODUCTION

Fruit detection is the process of classifying individual fruits and localizing each using a bounding box in the image-space. A comprehensive review of classical computer vision methods used in fruit detection can be found in Gongal et al. (2015). Recent advances in machine learning (ML), particularly in the field of deep learning (DL), have led to the development of superior object detection algorithms, which is evident from the growth in the use of DL methods in agriculture. DL has demonstrated large potential in a variety of agricultural scenarios including leaf classification (Hall et al., 2015), leaf and plant disease detection (Sladojevic et al., 2016; Mohanty et al., 2016), land cover classification (Chen et al., 2014; Luus et al., 2015; Ienco et al., 2017), crop classification (Kussul et al., 2017; Mortensen et al., 2016), plant recognition (Pound et al., 2017; Grinblat et al., 2016), plant phenotyping (Namin et al., 2018; Yalcin, 2017), crop yield estimation (Kuwata and Shibasaki, 2015; Minh et al., 2017), fruit detection (Rahnemoonfar and Sheppard, 2017; Chen et al., 2017; Sa et al., 2016; Liu et al., 2018; Zhang et al., 2019; Mureşan and Oltean, 2018; Bargoti and Underwood, 2017b; Stein et al., 2016), weed detection (Dyrmann et al., 2016; McCool et al., 2017; Dyrmann et al., 2017; Milioto et al., 2017), and animal research (Santoni et al., 2015; Demmers et al., 2012, 2010). A survey of DL algorithms used in agricultural applications can be found in Kamilaris and Prenafeta-Boldu (2018); Liakos et al. (2018).

DL is a machine learning method based on artificial neural networks that uses multiple layers between the input and output layers to extract higher-level features from input data. The convolutional neural network (CNN) is the DL architecture most commonly used in computer vision to analyze visual data. CNNs gained huge traction in object classification following the development of AlexNet (Krizhevsky et al., 2012) for the ImageNet challenge. Subsequently, various architectures of CNN have been applied in agriculture for fruit detection and counting. Rahnemoonfar and Sheppard (2017) presented a modified version of Inception-ResNet for fruit counting towards yield estimation; the architecture demonstrated 91% accuracy in tomato detection. Mureşan and Oltean (2018) applied a CNN to fruit classification using the TensorFlow framework. Similarly, Zhang et al. (2019) used a 13-layer CNN for fruit category classification, which demonstrated significant improvement over classical ML approaches. While CNNs outperform humans in many cases on the ImageNet challenge, the complexity of the environments, such as multiple overlapping objects and different backgrounds, can pose several challenges. To this end, the current state-of-the-art object detection method, Faster R-CNN (Ren et al., 2015), can detect different objects in an image. Further, Faster R-CNN uses a region proposal network based on convolutional feature maps to generate region proposals, thus significantly improving processing time over prior architectures, namely R-CNN (Girshick et al., 2014) and Fast R-CNN (Girshick, 2015). Sa et al. (2016) adopted the Faster R-CNN architecture for sweet pepper detection using multi-modal input data consisting of RGB and Near-Infrared (NIR) images. Additionally, the authors demonstrated the performance of the network for strawberry, apple, avocado, mango, and orange detection using (RGB) images obtained from Google search. Similarly, Bargoti and Underwood (2017a) applied Faster R-CNN to apple, mango, and almond detection. However, in contrast to Sa et al. (2016), Bargoti and Underwood (2017a) applied the framework to fruit detection in orchards, where the imagery can pose several challenges including low pixel count per fruit and greater illumination variability. Stein et al. (2016) applied the Faster R-CNN architecture for mango detection using input (RGB) images.

⋆ This research is supported in part by the USDA Small Business Innovation Research (SBIR) grant 2018-33610-28228 through GeoSpider Inc. and the USDA Capacity Building Grant (CBG) 2019-38821-29147. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agency.

2405-8963 © 2019, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
Peer review under responsibility of International Federation of Automatic Control.
10.1016/j.ifacol.2019.12.499
While fruit detection aims at localizing fruits in the image-space by identifying bounding boxes, often it is desirable to segment individual fruits from the background. The goal of semantic segmentation is to compute a pixel-wise mask for each object (or the objects of interest) in the image by classifying each pixel into a fixed set of categories (e.g., fruit and non-fruit). Fruit segmentation is beneficial in robotic harvesting for centroid detection, estimating the orientation of the fruit and stem (Gongal et al., 2015), and identifying the structure of clusters. Fully convolutional networks (FCNs) are a classical approach of end-to-end deep learning for semantic segmentation. FCNs are constructed from locally connected layers, such as convolution, pooling, and upsampling, and do not have any fully connected layers, as used in CNNs for classification. Examples of image segmentation using FCNs in agriculture include mixed crop segmentation (Mortensen et al., 2016), orange segmentation (Chen et al., 2017; Liu et al., 2018), and apple segmentation (Bargoti and Underwood, 2017b). However, semantic segmentation, e.g., using FCNs, does not provide a means to distinguish between individual instances of the same class (e.g., individual fruits on a tree). Fundamentally, FCNs perform pixel-wise multi-class categorization through coupled segmentation and classification, which has been observed to have poor performance for instance segmentation (He et al., 2017). The current state-of-the-art instance segmentation framework, Mask R-CNN (He et al., 2017), combines object detection and semantic segmentation to efficiently detect objects in an image while simultaneously generating a mask for each instance. This is achieved in Mask R-CNN by adding a branch to Faster R-CNN that predicts an object mask in parallel to the object and class prediction of Faster R-CNN. Further, in contrast to FCNs, Mask R-CNN decouples mask and class prediction by generating a mask for every class of objects for each region of interest (RoI) and selecting the output mask corresponding to the class predicted by the Faster R-CNN branch, which is shown to improve object instance segmentation in He et al. (2017).

To this end, we present an application of Mask R-CNN to fruit detection and segmentation in orchard environments. The objective is to detect individual fruits and obtain a pixel-wise mask for each detected fruit in an image. In this paper, a multi-modal segmentation approach is developed by using RGB and HSV input image data. The developed framework is evaluated using images captured in orange orchards in Citra, Florida, USA under natural lighting conditions. The fruit detection and segmentation performance is compared for RGB, HSV, and RGB+HSV input data.

Fig. 1. Example of instance segmentation providing bounding box locations of the fruits as well as pixel-wise masks for individual oranges.

2. METHODOLOGY

The objective of the developed Deep Orange fruit detection and segmentation framework is to improve the performance of autonomous agricultural operations, such as robotic harvesting, by efficiently and correctly identifying fruits on a tree. The presented approach is based on the recently developed state-of-the-art instance segmentation framework, Mask R-CNN (He et al., 2017). Deep Orange is a multi-modal segmentation framework that augments Mask R-CNN by including HSV image data for orange detection in grove environments. To compare the performance of the multi-modal framework, three separate models - one with RGB color space inputs, one with HSV color space inputs, and one with both RGB and HSV inputs - are trained and tested. The implementation from Abdulla (2017), written using Python and TensorFlow, was adopted to accommodate the different models.
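To make the multi-modal input concrete, the sketch below shows one plausible way to assemble a six-channel RGB+HSV array from a single capture using OpenCV; the channel ordering, data type, and file name are illustrative assumptions rather than the authors' exact preprocessing, which also includes mean pixel subtraction (see Section 2.2).

```python
# Minimal sketch of assembling a multi-modal network input by stacking the
# RGB and HSV representations of the same image into a six-channel array.
# The channel ordering and normalization here are illustrative assumptions,
# not the authors' exact preprocessing.
import cv2
import numpy as np

def build_multimodal_input(bgr_image: np.ndarray) -> np.ndarray:
    """Return an H x W x 6 array: RGB channels followed by HSV channels."""
    rgb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB)
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    return np.concatenate([rgb, hsv], axis=-1).astype(np.float32)

image = cv2.imread("orange_grove_tile.png")  # hypothetical 256x256 sub-image
net_input = build_multimodal_input(image)
print(net_input.shape)                       # (256, 256, 6)
```

With such an input, only the earliest convolution layers of the network need to accommodate the extra channels, which is consistent with the selective re-training described in Section 2.3.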
The following sections briefly discuss the Mask R-CNN architecture, the data acquisition process, and the training methodology.

2.1 Preliminaries

Mask R-CNN is a popular neural network architecture for instance segmentation. Building on a series of previous works, most proximally Faster R-CNN (Ren et al., 2015), it seeks not only to detect instances of objects that it has been trained to recognize but also to identify which pixels belong to each instance.

Mask R-CNN inherited two fundamental components from Faster R-CNN. The first stage is called a Region Proposal Network (RPN), which proposes RoIs around potential objects. The second stage of Faster R-CNN is based on Fast R-CNN (Girshick, 2015) and performs feature extraction on the proposed regions to classify objects in each region and refine bounding boxes around detected objects. The feature extraction "backbone" can have any of several architectures; in this work, the ResNet-101 architecture (He et al., 2016) was used. This stage is augmented in Mask R-CNN to generate a binary object mask for each region of interest. The masks are of size Km² where K is the number of object classes and the region of interest is m × m pixels. The processing of each RoI involves pooling and other quantization steps; these trade spatial accuracy for dimensional reduction. While vital to the performance of the network, this can cause the object masks to be misaligned from the regions of the original image. Therefore, Mask R-CNN introduced the RoIAlign procedure to correct the mask alignment for each RoI by interpolating the RoI onto the feature map using bilinear interpolation.

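As a conceptual illustration of that interpolation step, the minimal NumPy sketch below samples a single-channel feature map at a continuous coordinate, which is the operation RoIAlign applies on a grid of points inside each RoI; it is a didactic sketch, not the batched multi-channel implementation used in practice.

```python
# Simplified NumPy illustration of the bilinear sampling used by RoIAlign:
# the feature map is sampled at non-integer (y, x) locations instead of
# being quantized to the nearest cell, preserving sub-pixel alignment.
import numpy as np

def bilinear_sample(feature_map: np.ndarray, y: float, x: float) -> float:
    """Sample a 2-D feature map at a continuous (y, x) coordinate."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bottom

fmap = np.arange(16, dtype=np.float32).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 2.25))  # interpolates between four neighbors
```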

2.2 Dataset

The image data to train and validate the models was acquired from an orange orchard in Citra, Florida, USA. The images were acquired in 2018 just ahead of the commercial harvesting season using a consumer grade digital camera in natural lighting. The original images were of size 2816×1880 pixels. The fruit count per image was observed to be about 60, thereby having relatively low pixel count per fruit. In an effort to reduce the number of training images to be acquired, the original images were divided into sub-images of size 256×256 pixels while retaining the original pixel density.
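A minimal sketch of this tiling step is given below; how partial tiles at the image border were handled is not specified in the paper, so discarding them here is an assumption, as are the file names.

```python
# Sketch of dividing a 2816x1880 grove image into non-overlapping 256x256
# sub-images at the original pixel density. Partial tiles at the border
# are discarded here, which is an assumption.
import cv2

def tile_image(image, tile=256):
    """Yield non-overlapping tile x tile crops of an image."""
    h, w = image.shape[:2]
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            yield image[y:y + tile, x:x + tile]

full = cv2.imread("grove_image.jpg")  # hypothetical 2816x1880 capture
for i, sub in enumerate(tile_image(full)):
    cv2.imwrite(f"sub_{i:04d}.png", sub)
```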
From this cache, 150 sub-images of varying levels of lighting conditions, occlusions, and overlapping fruits were arbitrarily chosen to train the neural network. Although it is well known, it must be emphasized that, due to the limited dynamic range of digital cameras, it is important to choose images corresponding to different illumination conditions, since the perceived color of the fruit changes depending on the position of the fruit on a tree and the position of the sun. Further, the selected images were manually annotated using the VGG Image Annotator (VIA) (Dutta et al., 2016). Fig. 2 shows one such sample image with the manually generated masks using polygon-shaped regions. Note that only the fruits that were clearly visible in the image were labelled in the manual annotation process.

Fig. 2. Image on the right is the input training image and the one on the left shows the masks generated manually using VIA.
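For training, the VIA polygon annotations must be rasterized into per-instance binary masks. The sketch below shows one way to do this from a VIA JSON export; the field names follow the VIA export format, while the file name and the fixed 256×256 sub-image size are assumptions taken from the tiling described above.

```python
# Hedged sketch of converting VIA polygon annotations into per-instance
# binary masks. The JSON field names follow the VIA export format; the
# annotation file name is a placeholder.
import json
import numpy as np
from skimage.draw import polygon

annotations = json.load(open("via_region_data.json"))
masks = {}
for entry in annotations.values():
    regions = entry["regions"]
    if isinstance(regions, dict):  # VIA 1.x exports regions as a dict
        regions = regions.values()
    inst = []
    for r in regions:
        shape = r["shape_attributes"]            # polygon vertices drawn in VIA
        mask = np.zeros((256, 256), dtype=bool)  # sub-image size from Section 2.2
        rr, cc = polygon(shape["all_points_y"], shape["all_points_x"], mask.shape)
        mask[rr, cc] = True
        inst.append(mask)
    masks[entry["filename"]] = (np.stack(inst, axis=-1) if inst
                                else np.empty((256, 256, 0), dtype=bool))
```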
Fig. 3 shows a sample image pair used to train the neural network. The RGB image input is the mean pixel subtracted original sub-image, and the HSV image input is the original sub-image in the HSV color space.

Fig. 3. Input images to the neural network - mean pixel subtracted RGB image (left) and HSV image (right).

2.3 Training

The implementation of Mask R-CNN using ResNet-101 (He et al., 2016) serves as a feature extractor. Being a deep network with 101 layers, the network has millions of trainable hyperparameters, which means that the number of labelled images required will also be large. In order to limit the number of images of oranges needed to train the entire network, transfer learning is adopted. In ML, transfer learning refers to applying knowledge gained in solving one problem to solve a different but related problem. This means that instead of training the network parameters from scratch, we can use the weights of the network trained on another dataset as a starting point for further fine-tuning of the weights for our problem of identifying oranges. This not only reduces the number of images needed to train the network but also decreases the time required to train the models, as only a limited amount of hand-labelled training data is required.

To this end, the Common Objects in Context (COCO) dataset (Lin et al., 2014) is employed, which consists of more than 120,000 images from 80 object categories (including oranges). The network weights for Mask R-CNN trained on the COCO dataset are freely available and are adopted for this work. In order to fine-tune the weights for detecting oranges, we only have to train the weights for the RPN, classifier, and mask generation portions of the network. However, this process is valid only for training the network with RGB input data, as the COCO dataset contains only RGB images. In order to train the network for the other two cases (i.e., HSV and RGB+HSV input data), we re-train the RPN, classifier, mask generation, and the first three convolution layers while other parameters are kept constant.
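A hedged sketch of this fine-tuning step for the RGB case, using the Matterport implementation (Abdulla, 2017), is shown below. The configuration values are illustrative placeholders, and dataset_train/dataset_val are assumed to be prepared mrcnn Dataset objects; layers="heads" restricts training to the RPN, classifier, and mask branches.

```python
# Hedged transfer-learning sketch with the Matterport Mask R-CNN package.
# Starting from COCO weights, only the network "heads" are fine-tuned for
# the RGB case; config values are illustrative, not the authors' settings.
import mrcnn.model as modellib
from mrcnn.config import Config

class OrangeConfig(Config):
    NAME = "orange"
    NUM_CLASSES = 1 + 1   # background + orange
    IMAGES_PER_GPU = 2    # assumed batch size

config = OrangeConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs/")

# Load COCO weights, skipping layers whose shapes depend on the class count.
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val: prepared mrcnn Dataset objects (assumed).
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40, layers="heads")
```

For the HSV and RGB+HSV cases, the layers argument would instead be widened (e.g., via a regular expression matching the early convolution stages) so that the first three convolution layers are also re-trained, as described above.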
All three models were trained for 40 epochs on a computer with an Intel Xeon processor and four NVIDIA GTX 1070 Ti Graphical Processing Units (GPUs). Each model took about four hours to train and validate.

3. RESULTS

The performance of the presented deep learning framework is evaluated on a test dataset of 200 randomly selected images using each of the three trained models. The metrics selected to validate the fruit detection performance are precision, recall, and F1 score. Precision is the fraction of relevant instances among all the retrieved instances, while recall is the fraction of relevant instances that have been retrieved from all the relevant instances. Roughly, precision is an indicator of false positives in the retrieved instances, and recall is an indicator of false negatives in the retrieved instances. In fruit detection, larger precision would relate to higher correctness of detection, and larger recall corresponds to higher detection efficiency. The F1 score is the harmonic mean of precision and recall, and hence provides an overall indication of the detection performance.
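Once detections have been matched to ground-truth annotations (e.g., by an IoU criterion, which is outside this snippet), the three metrics reduce to simple ratios of true positives (TP), false positives (FP), and false negatives (FN), as in the sketch below; the example counts are purely illustrative.

```python
# Sketch of the evaluation metrics. TP/FP/FN counts are assumed to come
# from matching detections to ground-truth annotations beforehand.

def detection_metrics(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only, not the paper's data:
p, r, f1 = detection_metrics(tp=880, fp=23, fn=135)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```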

The outputs of the models are the masks, bounding boxes, and the probability of detection. A confidence threshold of 0.95 was chosen for fruit detection. Fig. 4 shows detection and segmentation results for each model, where the first column is the original image and the subsequent columns show the results for the RGB, HSV, and RGB+HSV models.
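Continuing the training sketch from Section 2.3, inference with this confidence threshold using the Matterport implementation could look as follows; the weights file and test image are placeholders.

```python
# Hedged inference sketch: run detection on a sub-image and keep instances
# whose confidence meets the 0.95 threshold used in the paper.
import cv2
import mrcnn.model as modellib

class InferenceConfig(OrangeConfig):  # OrangeConfig from the training sketch
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1                # detect() runs one image per batch here

model = modellib.MaskRCNN(mode="inference", config=InferenceConfig(),
                          model_dir="logs/")
model.load_weights("mask_rcnn_orange.h5", by_name=True)  # placeholder weights

image = cv2.cvtColor(cv2.imread("test_tile.png"), cv2.COLOR_BGR2RGB)
result = model.detect([image], verbose=0)[0]
keep = result["scores"] >= 0.95       # confidence threshold from the paper
boxes = result["rois"][keep]          # (N, 4) boxes as (y1, x1, y2, x2)
masks = result["masks"][:, :, keep]   # (H, W, N) boolean instance masks
print(f"{int(keep.sum())} oranges detected above the threshold")
```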
Table 1 provides the performance of fruit detection using the selected metrics for each model. From Table 1, it is clear that the RGB+HSV model has a significantly higher precision (0.9753) than the RGB model (0.8947); however, it has a lower recall (0.8128) compared to the RGB model (0.8673). On closer analysis of the results, it is observed that the RGB+HSV model is more conservative in identifying oranges, which results in its lower recall score. The HSV only model performs poorly by reporting many false positive instances. This model is observed to have inaccuracies when differentiating between oranges, branches, and leaves, as seen in images 3 and 5 of Fig. 4. The RGB+HSV model is shown to have improved pixel-wise segmentation (mask) compared to the RGB model; our future work will provide quantitative measures to compare segmentation performance among these models. The average inference time per image is 11 ms, obtained as the average over all the validation images.

Table 1. Performance of fruit detection using RGB, HSV, and RGB+HSV input data

Input data    Precision   Recall     F1-Score
RGB Only      0.89473     0.867346   0.88082
HSV Only      0.5222      0.60567    0.56085
RGB + HSV     0.97538     0.812820   0.88671
By way of simultaneous detection and segmentation, the developed framework provides fruit location in the image space using bounding boxes and provides a pixel-wise segmentation mask for each detected instance. The segmentation masks not only differentiate fruits from the background (leaves, branches, sky) but also provide a means to identify the structure of the clustered fruits. Specifically, it enables identifying the fruits that are in the foreground of the cluster, which will benefit the harvest planning operation of robotic harvesting.

4. CONCLUSIONS

This paper introduces the state-of-the-art instance segmentation framework for orange detection and segmentation. A multi-modal deep learning approach is presented by augmenting Mask R-CNN to include HSV input data. The performance of the developed framework is validated using images obtained in orange groves under natural lighting conditions. The results indicate that inclusion of HSV data with RGB images can significantly reduce the false positive rate (i.e., improve the precision score) and improve mask segmentation performance, which are beneficial to autonomous robotic harvesting. One of the avenues for future work is to consider reducing color channels. Specifically, only a subset of RGB and HSV channels will be employed in an effort to improve fruit detection efficiency and reduce segmentation time.
Fig. 4. Validation images (first column) and the corresponding detection and segmentation results using RGB, HSV, and RGB+HSV input data.
REFERENCES

Waleed Abdulla. Mask r-cnn for object detection and instance segmentation on keras and tensorflow. https://github.com/matterport/Mask_RCNN, 2017.
Suchet Bargoti and James Underwood. Deep fruit detection in orchards. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3626–3633. IEEE, 2017a.
Suchet Bargoti and James P Underwood. Image segmentation for fruit detection and yield estimation in apple orchards. Journal of Field Robotics, 2017b.
Steven W Chen, Shreyas S Shivakumar, Sandeep Dcunha, Jnaneshwar Das, Edidiong Okon, Chao Qu, Camillo J Taylor, and Vijay Kumar. Counting apples and oranges with deep learning: A data-driven approach. IEEE Robotics and Automation Letters, 2(2):781–788, 2017.
Yushi Chen, Zhouhan Lin, Xing Zhao, Gang Wang, and Yanfeng Gu. Deep learning-based classification of hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(6):2094–2107, 2014.
Theo GM Demmers, Yi Cao, Sophie Gauss, John C Lowe, David J Parsons, and Christopher M Wathes. Neural predictive control of broiler chicken growth. IFAC Proceedings Volumes, 43(6):311–316, 2010.
Theo GM Demmers, Yi Cao, David J Parsons, Sophie Gauss, and Christopher M Wathes. Simultaneous monitoring and control of pig growth and ammonia emissions. In 2012 IX International Livestock Environment Symposium (ILES IX), page 3. American Society of Agricultural and Biological Engineers, 2012.
A. Dutta, A. Gupta, and A. Zisserman. VGG image annotator (VIA). http://www.robots.ox.ac.uk/~vgg/software/via/, 2016.
Mads Dyrmann, Henrik Karstoft, and Henrik Skov Midtiby. Plant species classification using deep convolutional neural network. Biosystems Engineering, 151:72–80, 2016.
Mads Dyrmann, Rasmus Nyholm Jørgensen, and Henrik Skov Midtiby. Roboweedsupport - detection of weed locations in leaf occluded cereal crops using a fully convolutional neural network. Advances in Animal Biosciences, 8(2):842–847, 2017.
Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
A. Gongal, S. Amatya, M. Karkee, Q. Zhang, and K. Lewis. Sensors and systems for fruit detection and localization: A review. Computers and Electronics in Agriculture, 116:8–19, 2015.
Guillermo L Grinblat, Lucas C Uzal, Mónica G Larese, and Pablo M Granitto. Deep learning for plant identification using vein morphological patterns. Computers and Electronics in Agriculture, 127:418–424, 2016.
David Hall, Chris McCool, Feras Dayoub, Niko Sunderhauf, and Ben Upcroft. Evaluation of features for leaf classification in challenging conditions. In 2015 IEEE Winter Conference on Applications of Computer Vision, pages 797–804. IEEE, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
Dino Ienco, Raffaele Gaetano, Claire Dupaquier, and Pierre Maurel. Land cover classification via multitemporal spatial data by deep recurrent neural networks. IEEE Geoscience and Remote Sensing Letters, 14(10):1685–1689, 2017.
Andreas Kamilaris and Francesc X Prenafeta-Boldu. Deep learning in agriculture: A survey. Computers and Electronics in Agriculture, 147:70–90, 2018.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
Nataliia Kussul, Mykola Lavreniuk, Sergii Skakun, and Andrii Shelestov. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters, 14(5):778–782, 2017.
Kentaro Kuwata and Ryosuke Shibasaki. Estimating crop yields with deep learning and remotely sensed data. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 858–861. IEEE, 2015.
Konstantinos Liakos, Patrizia Busato, Dimitrios Moshou, Simon Pearson, and Dionysis Bochtis. Machine learning in agriculture: A review. Sensors, 18(8):2674, 2018.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
Xu Liu, Steven W Chen, Shreyas Aditya, Nivedha Sivakumar, Sandeep Dcunha, Chao Qu, Camillo J Taylor, Jnaneshwar Das, and Vijay Kumar. Robust fruit counting: Combining deep learning, tracking, and structure from motion. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1045–1052. IEEE, 2018.
Francois PS Luus, Brian P Salmon, Frans Van den Bergh, and Bodhaswar Tikanath Jugpershad Maharaj. Multiview deep learning for land-use classification. IEEE Geoscience and Remote Sensing Letters, 12(12):2448–2452, 2015.
Chris McCool, Tristan Perez, and Ben Upcroft. Mixtures of lightweight deep convolutional neural networks: applied to agricultural robotics. IEEE Robotics and Automation Letters, 2(3):1344–1351, 2017.
Andres Milioto, Philipp Lottes, and Cyrill Stachniss. Real-time blob-wise sugar beets vs weeds classification for monitoring fields using convolutional neural networks. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 4:41, 2017.
Dinh Ho Tong Minh, Dino Ienco, Raffaele Gaetano, Nathalie Lalande, Emile Ndikumana, Faycal Osman, and Pierre Maurel. Deep recurrent neural networks for mapping winter vegetation quality coverage via multi-temporal sar sentinel-1. arXiv preprint arXiv:1708.03694, 2017.
Sharada P Mohanty, David P Hughes, and Marcel Salathé. Using deep learning for image-based plant disease detection. Frontiers in Plant Science, 7:1419, 2016.
Anders Krogh Mortensen, Mads Dyrmann, Henrik Karstoft, Rasmus Nyholm Jørgensen, René Gislum, et al. Semantic segmentation of mixed crops using deep convolutional neural network. In CIGR-AgEng Conference, 26-29 June 2016, Aarhus, Denmark. Abstracts and Full Papers, pages 1–6. Organising Committee, CIGR 2016, 2016.
Horea Mureşan and Mihai Oltean. Fruit recognition from images using deep learning. Acta Universitatis Sapientiae, Informatica, 10(1):26–42, 2018.
Sarah Taghavi Namin, Mohammad Esmaeilzadeh, Mohammad Najafi, Tim B Brown, and Justin O Borevitz. Deep phenotyping: deep learning for temporal phenotype/genotype classification. Plant Methods, 14(1):66, 2018.
Michael P Pound, Jonathan A Atkinson, Alexandra J Townsend, Michael H Wilson, Marcus Griffiths, Aaron S Jackson, Adrian Bulat, Georgios Tzimiropoulos, Darren M Wells, Erik H Murchie, et al. Deep machine learning provides state-of-the-art performance in image-based plant phenotyping. Gigascience, 6(10):gix083, 2017.
Maryam Rahnemoonfar and Clay Sheppard. Deep count: fruit counting based on deep simulated learning. Sensors, 17(4):905, 2017.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
Inkyu Sa, Zongyuan Ge, Feras Dayoub, Ben Upcroft, Tristan Perez, and Chris McCool. Deepfruits: A fruit detection system using deep neural networks. Sensors, 16(8):1222, 2016.
Mayanda Mega Santoni, Dana Indra Sensuse, Aniati Murni Arymurthy, and Mohamad Ivan Fanany. Cattle race classification using gray level co-occurrence matrix convolutional neural networks. Procedia Computer Science, 59:493–502, 2015.
Srdjan Sladojevic, Marko Arsenovic, Andras Anderla, Dubravko Culibrk, and Darko Stefanovic. Deep neural networks based recognition of plant diseases by leaf image classification. Computational Intelligence and Neuroscience, 2016, 2016.
Madeleine Stein, Suchet Bargoti, and James Underwood. Image based mango fruit detection, localisation and yield estimation using multiple view geometry. Sensors, 16(11):1915, 2016.
Hulya Yalcin. Plant phenology recognition using deep learning: Deep-pheno. In 2017 6th International Conference on Agro-Geoinformatics, pages 1–5. IEEE, 2017.
Yu-Dong Zhang, Zhengchao Dong, Xianqing Chen, Wenjuan Jia, Sidan Du, Khan Muhammad, and Shui-Hua Wang. Image based fruit category classification by 13-layer deep convolutional neural network and data augmentation. Multimedia Tools and Applications, 78(3):3613–3632, 2019.