
Overview of Image-based 3D Vision Systems for Agricultural Applications
Abhipray Paturkar, Gourab Sen Gupta, and Donald Bailey
School of Engineering and Advanced Technology,
Massey University, New Zealand
Correspondence: {A.Paturkar, G.SenGupta, D.G.Bailey}@massey.ac.nz

Abstract—Various agricultural robots exist which are intended to help farmers make decisions. Vision systems are an integral and essential part of these agricultural robots, used to perceive and evaluate the workspace, inspect crops, and detect objects. The use of 3D vision systems can provide a reconstructed 3D model which supplies detailed information, including depth estimates, about a crop, an object, or environmental structures. However, these vision systems have some drawbacks which limit their use in agriculture: dealing with lighting conditions, navigating unstructured environments, harvesting speed, occlusion and overlapping of fruits, and the accuracy of the system. This paper investigates the limitations and challenges of state-of-the-art 3D image reconstruction techniques in agricultural applications. Active and passive approaches have been reviewed, and factors to be considered while designing the vision system have been identified.

Index Terms—Computer vision; 3D vision system; Agricultural automation; Precision agriculture; Reconstruction techniques

978-1-5386-4276-4/17/$31.00 ©2017 IEEE

I. INTRODUCTION

Precision agriculture is all about collecting information about the crop and its environmental conditions, including the soil, to help farmers in the process of decision making [1]. As agriculture becomes increasingly modernised, the application of vision-based robots and intelligent machines is helping farmers and researchers in many ways. These robots have helped to reduce production and labour costs and, at the same time, have enhanced environmental control and helped to improve the quality of agricultural products [2]. Perceptive agricultural robots are designed to conduct various agricultural operations such as cultivating, transplanting, harvesting, spraying, and trimming. Vision-based robots are intended to sense the unstructured agricultural environment with different sensors, e.g. cameras, lidar, etc., and use this information to perform the required operations [3]. Vision-based systems certainly have a promising future, but currently an agricultural robot is unable to completely emulate manual operation because of the limitations of its vision system. The success of agricultural robotics depends heavily on the performance of the vision systems.

Vision systems exploit visual information to help an agricultural robot roam autonomously in an uncertain, unstructured, and changing environment. Vision systems for agricultural robots are implemented to resolve some of the challenges, such as:
• Handle natural objects with a huge variety in shape, colour, size, texture, and solidity.
• Navigate a complicated and barely structured workspace with substantial variations in illumination and a high degree of occlusion.
• Identify randomly located objects or fruit, which demands operating in three dimensions.

Vision-based systems are an appealing approach to meeting these challenges. One-dimensional (1D) and two-dimensional (2D) vision systems are already successful in the food production chain [4], and it is strongly believed that machine vision is on the cusp of making deep inroads into agricultural automation. Enhanced technology combined with low sensor prices has encouraged researchers to move to three-dimensional (3D) approaches, and in the last decade tremendous work has been done on agricultural 3D vision systems. There are two 3D modelling techniques, rule based and image based [5], but most of the work has used image-based techniques because of their realistic input (captured images) and practicality. As reconstruction techniques and algorithms develop, the state of the art is evolving rapidly.

In this paper, we review a range of image-based 3D vision systems in the context of agriculture, for both indoor and outdoor applications. After reviewing these systems, we investigate the limitations and challenges present in these image-based vision systems, compare active and passive approaches, and examine which methods are useful for reducing the challenges identified above. The rest of the paper is organised as follows: Section II discusses image-based 3D vision systems, their various applications in an agricultural context, and their limitations. Section III discusses challenges and limitations of existing techniques. Future directions to overcome the limitations are outlined in Section IV.

II. 3D VISION SYSTEMS

Image-based 3D vision systems broadly fall into two categories; the various techniques and their classification are depicted in Fig. 1. Image-based techniques use real-world data to model the complete 3D structure of plants or objects, mostly based on methods developed by computer vision researchers. While image-based techniques have made tremendous progress in achieving accurate models, constructing an exact representation is still a challenging research problem.
Fig. 1. Classification of 3D vision system
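The active and passive families in Fig. 1 recover depth through two different geometric principles that recur throughout this review: active sensors such as lidar measure the round-trip time of an emitted pulse, while stereo and structured-light systems infer depth by triangulation from the disparity between two known viewpoints. The following sketch is illustrative only and is not from the paper; the function names and numbers are hypothetical.

```python
# Illustrative sketch of the two depth-recovery principles behind
# active (time-of-flight) and triangulation-based 3D sensing.

C = 299_792_458.0  # speed of light, m/s


def tof_depth(round_trip_s: float) -> float:
    """Time of flight (lidar): the pulse travels to the target and
    back, so range = c * t / 2."""
    return C * round_trip_s / 2.0


def triangulation_depth(focal_px: float, baseline_m: float,
                        disparity_px: float) -> float:
    """Stereo / structured light: depth Z = f * B / d, where f is the
    focal length (pixels), B the baseline between the two viewpoints
    (camera-camera, or projector-camera), and d the disparity (pixels)."""
    if disparity_px <= 0:
        raise ValueError("no correspondence found (non-positive disparity)")
    return focal_px * baseline_m / disparity_px


# A 66.7 ns round trip corresponds to a target about 10 m away; a camera
# with a 700-pixel focal length and 10 cm baseline observing a 20-pixel
# disparity implies a point about 3.5 m away.
print(round(tof_depth(66.7e-9), 2))
print(round(triangulation_depth(700.0, 0.10, 20.0), 2))
```

The Z = fB/d relation also explains an entry in Table II below: the usable depth range of a stereo rig depends on its baseline, since a longer baseline improves depth resolution at distance but makes correspondence matching harder.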

Image-based vision systems are mainly divided into active and passive approaches.

A. Active Approaches

Lidar is basically an extension of the principles employed in radar technology. It calculates the distance between the scanner and the target object by illuminating the object with a laser and measuring the time taken for the reflected light to return [6]. There are two main ways in which lidar is used: airborne lidar, in which a scanning system is attached to an airplane, and terrestrial laser scanning, which is used on ground-based vehicles.

Underwood et al. [7] presented a terrestrial laser scanning system to gather information for almond orchard mapping and per-tree yield calculation. 580 trees were measured at three different times during the harvest season over approximately two years. The data collection in the first year was affected by lens flare and saturation as a result of sunlight. In the second year, they tried to keep the sun always behind the camera, which led to a time-consuming data collection process, as they needed to scan half of each row in the morning heading west and the remaining half in the afternoon heading east. During pre-harvest, once the foliage had developed, the fruits became occluded and many almonds were hidden during scanning. Since each side of a tree is scanned individually, misalignment occurs between the two halves due to GPS drift. Variation in fruit size was one of the main challenges during this experiment, as small almonds are more difficult to discern than larger ones. During data processing, the image processing part took around two days, including human and computer time, as the dataset was collected over a 2.3 hectare orchard. The lidar processing took less than 5 minutes to segment all the trees in the 2.3 hectares.

The use of structured light is an alternative method for depth estimation. In this method, the light source (either visible or near-infrared) is offset a known distance from an imaging device. The light from the emitter is reflected into the camera by the target scene, and knowledge of the light pattern enables the depth to be inferred through triangulation. The Microsoft Kinect is possibly the most common example of a structured light device; it projects a near-infrared dot pattern [8]. This sensor has the advantage that its software (Microsoft's KinectFusion) permits depth information from multiple views to be fused into a single model. Unfortunately, Kinect sensors suffer certain limitations when applied in outdoor scenarios. They can be hard to use in daylight, and highly reflective leaf surfaces can also act as mirrors, reflecting a large proportion of the emitted pattern away from the imaging device, which makes it difficult to detect.

Zhang et al. [9] proposed a conveyor-based quality grading system, mainly for apples. This system detects defects in apples by considering the stem-end and calyx, because it is difficult to differentiate between the calyx and a defect; it can be quite confusing even for human eyes to tell them apart at a glance when selecting an apple. Hence, it is important to have a system which can differentiate between the calyx and a defect easily. This grading system is limited by poor calibration between the projector and the camera, which leads to false depth information and lowers the accuracy of the system. The system takes on average around 80 ms per image using an industrial computer. The false acceptance rate is 3.8% and, in contrast, the false rejection
TABLE I
COMPARATIVE ANALYSIS OF ACTIVE APPROACHES

Lidar
  Advantages: able to get depth information in dim light; robust against interference; robust against sunlight; excellent accuracy.
  Disadvantages: bulky, with moving parts; costly to set up; problems under adverse weather conditions (rain, dust); weak performance in edge detection due to the spacing between the light beams.

Structured light
  Advantages: low-cost system; high speed of 3D capture, up to 30 fps; useful in animal inspection applications; directly measures in 3D at very high spatial resolution.
  Disadvantages: highly sensitive to natural light; small field of view; difficulty dealing with shiny surfaces; limited sensing range.

rate is 0.95%. Moreover, this system is unable to work as a real-time fruit grading system in industry.

Zhang et al. [10], in their apple grading system on a conveyor belt, used a single CCD camera combined with near-infrared linear-array structured light for 3D reconstruction. To identify the apple stem and calyx, they only considered the upper surface. For structured light systems, the distance between the CCD camera and the laser projector should be fixed (or at least known). This system's identification of the stem-end and calyx depends only on the detection of a concavity. Its limitation is that it can only be used to monitor the upper half of the apple, which must be visible at the time of inspection. Nevertheless, this system overcame several restrictions and disadvantages of existing technologies, with excellent accuracy of about 97.5% and good robustness.

Rosell-Polo et al. [11] describe the use of a Kinect sensor in two different areas: precision agriculture and precision livestock farming. They experimented with the sensors under different seasonal conditions. They found that the Kinect sensor is not efficient at detecting small or complex targets in daylight conditions, and for this reason they conducted their outdoor experiments during sunset and evening. Some experiments on the detection of trunks and branches reported a minimum noticeable branch diameter of around 6 mm, and this limitation is responsible for floating fruits or leaves in the resultant 3D model. For livestock farming, they carried out experiments in indoor environments where the lighting is structured. The animals move during inspection, and this movement had to be taken into account, as it is difficult to derive an accurate 3D model if there is any movement. To monitor the growth and body condition of the animals, they conducted experiments over four months. This study shows that the structured-light-based Kinect sensor is efficient for indoor applications only.

Overall, lidar copes better with variation in illumination and gives better accuracy for 3D reconstruction than structured light systems. A detailed comparative analysis of active approaches is given in Table I.

B. Passive Approaches

Although lidar can be powerful, it needs costly equipment, which makes the overall system expensive. Passive approaches are therefore getting tremendous attention, as they only need a standard off-the-shelf digital camera to capture overlapping images, which are then processed by computer to calculate depth or 3D structure. Passive approaches use the radiation already present in the scene. Stereo vision and structure-from-motion (SfM) are the most commonly used approaches.

Stereo vision has three main processing stages: camera calibration, feature extraction, and correspondence matching. A stereo camera captures a pair of images (left and right). From this stereo pair, the disparity can be calculated to obtain the camera coordinates of matching points in the scene. Structure-from-motion follows a similar process to stereo vision; however, whereas stereo vision captures two or more images concurrently, SfM captures images serially from different viewpoints as the camera moves. 3D data is then estimated either by matching pairs of images or by globally matching features across all the images [12].

A mobile platform might not have knowledge of its environment, yet it needs to operate in unstructured surroundings, which is a potential problem. Rovira-Más et al. [13] fused the output of a stereo vision system with a navigation system for precision agriculture. The system consists of a stereo camera, a GPS sensor, and an inertial measurement unit. For each stereo pair, the GPS sensor provides the global coordinates of the camera position; as a number of GPS satellites are used, this prevents fusing images with wrong coordinates. Image noise is the biggest challenge when creating local maps, as their quality is affected by the disparity values. If there is any mismatch between two images, for example as a result of limited texture within the scene, occlusion, or poor illumination, the system may end up with wrong coordinates: unfiltered pixels with erroneous stereo data lead to meaningless locations. Synchronisation between the moving speed of the mobile platform and the processing speed of the system is also very critical; if the processing is not fast enough, data may be lost.

Xiang et al. [14] investigated the recognition of clustered fruit on trees using stereo vision. In this system, depth information for clustered tomatoes was obtained from stereo vision. In a cluster, most of the back tomatoes are occluded by the front
TABLE II
COMPARATIVE ANALYSIS OF PASSIVE APPROACHES

Stereo vision
  Advantages: off-the-shelf cameras can be used; provides an efficient RGB stream; robust in controlled environments; cost effective; easy implementation.
  Disadvantages: correspondence problems because of low texture; depth range depends on the baseline distance; not effective in sunlight because of reflection; computationally costly; correspondence problems due to poor illumination.

Structure-from-motion
  Advantages: hand-held cameras can be used; suitable for any application; useful in animal inspection applications; cost effective.
  Disadvantages: generation of the point cloud is time consuming; image overlap is an issue; difficulty dealing with shiny surfaces; intrinsic and extrinsic camera parameters need to be calculated.

tomatoes, and the front tomatoes prevent the manipulator from reaching the back tomatoes because of incorrect recognition of the tomatoes in the cluster. Noise in the depth map is another issue, as a few pixels with unusual depth values caused by false matching lead to wrong recognition. Although this method is useful for recognising clustered tomatoes, its accuracy is not adequate if the occlusion is serious, and there is still scope for improving the accuracy when a fruit is occluded by leaves.

Si et al. [15] propose a vision system which is able to autonomously locate apples on trees using a stereo camera. Colour-based and shape-based analysis are two important methods for fruit recognition; colour-based analysis fails if there is shadow. Edge extraction should be performed before feature extraction; once an apple is identified, branches and leaves are recognised as unwanted noise points, so it is difficult to handle a high degree of occlusion. Feature extraction of an apple is mostly based on its contour; a Hough transform is commonly used for this, but it is time consuming. The accuracy of the proposed system was not satisfactory and there is further room for improvement.

Kaczmarek [16], [17] presents a method to improve the calculation of the distance between an autonomous harvesting robot and a tree with ripe fruit using five cameras. This system is called an equal baseline multiple camera set (EBMCS). For this set of five cameras, calibration and rectification are difficult tasks, because every side camera has to be calibrated against the reference (centre) camera. This has the potential to create confusion and introduce errors into the system. Ground truth is prepared for every set of images. Ground truth points correspond to points in the central image and are important in this scenario because, in the neighbourhood of every edge of the centre image, there must be an area which is not visible in one of the side images. Parameters such as ground truth, camera calibration, and rectification are important aspects when considering a stereo vision system with multiple cameras. An Exceptions Excluding Merging Method (EEMM) is responsible for enhancing the quality of the disparity maps by combining the maps obtained from the individual stereo cameras of the EBMCS. There are several limiting factors in this system, e.g. the wide range of real disparities in the set of images and the selection of the window size used in the stereo matching algorithm. The result of this method showed a significant reduction in distance estimation error of 26.55% in comparison with a single stereo camera.

Quan et al. [18] and Tan et al. [19] proposed a semi-automatic method for modelling plants completely from images, based on structure-from-motion. Although this method is easy to use, it has some limitations. Due to the large number of leaves on plants and trees, it is difficult to reconstruct each leaf perfectly; in such cases the user has to manipulate the data manually to obtain a realistic 3D model. Moreover, it is sometimes not possible to scan a plant or tree through 360 degrees because of neighbouring trees, which can lead to an improper 3D model. Furthermore, this method requires more effort to reconstruct unusual structural modifications caused by disease. There is also loss of data due to occlusion, which creates difficulties when building the 3D model. So there are still some problems with this method which need to be solved; nevertheless, it has provided some very good results, which are motivating its use in other applications such as 3D face reconstruction.

Jay et al. [20] proposed a method which builds a 3D model of a crop row to obtain the plant structural parameters. This 3D model is acquired using structure-from-motion from colour images captured by translating a single camera along the row. In this method, they assume the plants are in a linear row structure to avoid complexity. In an actual scenario, parameters such as ground geometry, illumination, occlusion, and, most importantly, scanning speed make a difference, and these must be considered in an implementation. In addition, the scanning of the plants is entirely manual, so there may be errors during scanning as well. These limitations are common in these types of systems and should be considered when developing an error-free system.

In consequence, passive techniques generally perform well in spite of some challenges. A detailed comparative analysis of passive approaches is given in Table II.

III. CURRENT CHALLENGES AND RESEARCH DIRECTIONS

From the literature, passive approaches appear to be the most promising approach for 3D vision because of their lower
TABLE III
LIST OF CHALLENGES IN STEREO VISION AND STRUCTURE-FROM-MOTION

Stereo vision
  - Field of view is limited.
  - Struggles with reflections.
  - Occlusion handling.
  - Extra light is required in dim-light implementations.
  - Shadows are responsible for poor depth estimation.
  - The baseline should be maintained.
  - Camera position is important.
  - Parallax and correspondence problems.

Structure-from-motion
  - Necessity of camera calibration.
  - Occlusion handling.
  - Object movement during the capture process.
  - High computation is required.
  - Image overlap is an issue.
  - Poor in dim light.

cost. However, there are a few challenges which need to be addressed, such as:

1) Illumination effects: For stereo vision, the matching problem is complicated by differences in illumination between viewpoints and by poorly textured surfaces [14]. Because of variation in lighting conditions, it is difficult to get good-quality results.
2) Poor hand-eye coordination: A lack of robust image processing algorithms is responsible for weak visual information, which the robotic manipulator needs in order to reach the fruit.
3) On-board processing: Current vision systems work off-line: the system collects images from a camera for processing later. This limits the commercialisation of agricultural robotics.
4) Fruit or obstacle recognition: The accuracy of fruit recognition is an important aspect. Current vision systems use inefficient fruit recognition algorithms, which are responsible for inaccurate results.
5) Time-consuming algorithms: Current techniques use time-consuming algorithms which take approximately 20-30 minutes to process the data; if the dataset is large, it might take around a day.
6) Occlusion handling: Because of the colour similarity between fruit and leaves, and because of the clustering of fruit, it is difficult to pick the properly ripened fruit. This is a daunting and much-studied aspect of automated harvesting, but there is still a long way to go to handle occlusion with proper efficiency.
7) Human-computer combination: Most vision-system methods are semi-automatic, needing user involvement to manipulate the data to get a proper result.

Although all the aforementioned techniques are challenging in outdoor conditions, experiments in controlled conditions have given some promising results. Nonetheless, all these challenges provide future research directions for resolving the issues with efficient solutions. Table III shows some further challenges in stereo vision and structure-from-motion techniques.

IV. CONCLUSION

Computer vision has been studied for more than 50 years, and in the last 5 years tremendous work has focused on the context of agriculture. Passive approaches have gained a lot of attention despite their limitations, as these approaches are reliable and less costly than active approaches. In this paper, we have investigated some of the different approaches used for image-based 3D vision. It is clear that, although image-based approaches are able to produce good results, they still need human interaction; a fully automated system is clearly a necessity. In addition, parameters like occlusion, scanning speed, identification of fruit, and the location of an obstacle or fruit make a difference in real-time implementations, and these must be considered while designing a system. Illumination problems can be minimised if the system uses lidar, as the review shows that lidar systems are robust against sunlight and able to retrieve depth data in dim light as well. Occlusion handling requires more efficient and effective algorithms. Efficient algorithms, advanced computing, and reductions in hardware costs are needed to attract more researchers to this area and to achieve a breakthrough; until then, semi-automatic systems must be used. We have reviewed active and passive approaches and have identified the various challenges and limitations of image-based 3D vision in an agricultural context. While much progress has been made, there is clearly scope for significant improvement.

REFERENCES

[1] S. Konam, "Agricultural aid for mango cutting AAM," in International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2014, pp. 1520-1524.
[2] M. Vázquez-Arellano, H. W. Griepentrog, D. Reiser, and D. S. Paraforos, "3-D imaging systems for agricultural applications: a review," Sensors, vol. 16, no. 5, p. 618, 2016.
[3] N. dos Santos Rosa, P. E. Cruvinel, and J. de Mendonça Naime, "Prerequisite study for the development of embedded instrumentation for plant phenotyping using computational vision," in 11th International Conference on Semantic Computing (ICSC), 2017, pp. 445-451.
[4] M. Bergerman, E. Van Henten, J. Billingsley, J. Reid, and D. Mingcong, "IEEE Robotics and Automation Society technical committee on agricultural robotics and automation," IEEE Robotics & Automation Magazine, vol. 20, no. 2, pp. 20-23, 2013.
[5] F. Remondino and S. El-Hakim, "Critical overview of image-based 3D modeling," in International Workshop on Recording, Modeling and Visualization of Cultural Heritage. CRC Press, 2005, p. 299.
[6] C. A. Northend, "Lidar, a laser radar for meteorological studies," Naturwissenschaften, vol. 54, no. 4, pp. 77-80, 1967.
[7] J. P. Underwood, C. Hung, B. Whelan, and S. Sukkarieh, "Mapping almond orchard canopy volume, flowers, fruit and yield using lidar and vision sensors," Computers and Electronics in Agriculture, vol. 130, pp. 83-96, 2016.
[8] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011, pp. 127-136.
[9] C. Zhang, L. Chen, W. Huang, Z. Guo, and Q. Wang, "Apple stem-end/calyx identification using a speckle-array encoding pattern," in 11th International Conference on Signal Processing (ICSP), vol. 2, 2012, pp. 1110-1114.
[10] B. Zhang, W. Huang, C. Wang, L. Gong, C. Zhao, C. Liu, and D. Huang, "Computer vision recognition of stem and calyx in apples using near-infrared linear-array structured light and 3D reconstruction," Biosystems Engineering, vol. 139, pp. 25-34, 2015.
[11] J. R. Rosell-Polo, F. A. Cheein, E. Gregorio, D. Andujar, L. Puigdomènech, J. Masip, and A. Escolà, "Advances in structured light sensors applications in precision agriculture and livestock farming," Advances in Agronomy, vol. 133, pp. 71-112, 2015.
[12] T. Jebara, A. Azarbayejani, and A. Pentland, "3D structure from 2D motion," IEEE Signal Processing Magazine, vol. 16, no. 3, pp. 66-84, 1999.
[13] F. Rovira-Más, Q. Zhang, and J. F. Reid, "Stereo vision three-dimensional terrain maps for precision agriculture," Computers and Electronics in Agriculture, vol. 60, no. 2, pp. 133-143, 2008.
[14] R. Xiang, H. Jiang, and Y. Ying, "Recognition of clustered tomatoes based on binocular stereo vision," Computers and Electronics in Agriculture, vol. 106, pp. 75-90, 2014.
[15] Y. Si, G. Liu, and J. Feng, "Location of apples in trees using stereoscopic vision," Computers and Electronics in Agriculture, vol. 112, pp. 68-74, 2015.
[16] A. L. Kaczmarek, "Improving depth maps of plants by using a set of five cameras," Journal of Electronic Imaging, vol. 24, no. 2, p. 023018, 2015.
[17] A. Kaczmarek, "Stereo vision with equal baseline multiple camera set (EBMCS) for obtaining depth maps of plants," Computers and Electronics in Agriculture, vol. 135, pp. 23-37, 2017.
[18] L. Quan, P. Tan, G. Zeng, L. Yuan, J. Wang, and S. B. Kang, "Image-based plant modeling," ACM Transactions on Graphics, vol. 25, no. 3, pp. 599-604, 2006.
[19] P. Tan, G. Zeng, J. Wang, S. B. Kang, and L. Quan, "Image-based tree modeling," ACM Transactions on Graphics, vol. 26, no. 3, p. 87, 2007.
[20] S. Jay, G. Rabatel, X. Hadoux, D. Moura, and N. Gorretta, "In-field crop row phenotyping from 3D modeling performed using structure from motion," Computers and Electronics in Agriculture, vol. 110, pp. 70-77, 2015.
