Reconstruction

By

Bo Wu
B.S. (Iowa State University) 2013

DISSERTATION

MASTER OF SCIENCE

UNIVERSITY OF CALIFORNIA, DAVIS

Approved:

Committee in Charge

2015
ProQuest Number: 1604080
Published by ProQuest LLC (2015). Copyright of the Dissertation is held by the Author.
ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346
Bo Wu
September 2015
Electrical & Computer Engineering
Abstract
This study is aimed at localizing fruits on trees. The research is intended for use by the UC Davis Biological and Agricultural Engineering Department for robotic localization of fruits in the field. The study includes three phases. In the first phase, a stereo vision system with only one camera and one rotation unit was constructed; normally, stereo vision systems contain more than one camera with a known distance between the cameras. In the second phase, the stereo vision system was tested in both indoor and outdoor areas with a known distance between the system and the objects. The last phase introduces manual fruit detection in the images. The output of the stereo vision system is the distance between the object and the camera. With enough tests from phase two, the system can be used in real-world applications. This study focused on constructing and testing the stereo vision system. The detection of the fruits is not automated at this time and is limited to their manual localization in the images. A fruit locator is required in the future to automatically detect fruits and return their locations.
Contents
Abstract
List of Figures
List of Tables
Acknowledgments
1. Introduction
1.1. Organization of Thesis
1.2. Problem Statement
1.3. Objectives of the Research
1.4. Scope of the Research
2. Background
2.1. Pinhole Camera Model
2.1.1. Projection Function of Pinhole Camera
2.1.2. Camera Calibration
2.2. Stereo Camera and Calibration
2.2.1. Stereo Camera
2.2.2. Operation of Stereo Camera
2.2.3. Stereo Camera Calibration
2.3. Epipolar Geometry
2.3.1. Epipolar Geometry
2.3.2. Fundamental Matrix
2.3.3. Essential Matrix
2.4. OpenCV Library
2.5. Devices Used
3. Mathematical Review
3.1. Solving Corresponding Points and the Fundamental Matrix between Two Images
3.1.1. Seven-Point Algorithm
3.1.2. Eight-Point Algorithm
3.1.3. The RANSAC Algorithm
3.2. Solving Rotation and Translation from the Fundamental Matrix
3.3. Rodrigues' Rotation Formula
4. Localization of the Objects in the Indoor Area
4.1. Camera Calibration
4.2. Constructing the Stereo Vision System
4.3. Finding Corresponding Points
4.4. Finding the Rotation Matrix and Translation Vector
4.5. Triangulation and Improvement
5. Labeling Images
5.1. Software Tool for Labeling
5.2. Labeled Image
5.2.1. Original Images
5.2.2. Labeled Images
5.3. Labeling Data
6. Experimental Result
6.1. Examples of Images
6.2. Finding the Corresponding Points
6.3. Triangulation
6.4. New Pair of Images Testing
7. Localization of Fruits
7.1. Simulation Environment Set-Up
7.2. Fruit Images
7.3. Testing Procedure
7.3.1. First Test
7.3.2. Multiple Images Testing
8. Conclusions and Recommendations
8.1. Conclusion
8.2. Recommendations for Future Study
References
List of Figures
3. Stereo Vision
4. Epipolar Geometry
5. Galileo
8. Recognized Chessboard
9. Chessboard
23. Corresponding Points
33. Fruits Set-up
List of Tables
Acknowledgments
Last but not least, I want to thank my parents for supporting me throughout my life.
1. Introduction
Chapter 1 introduces the problems that currently exist with the software application for fruit detection, as well as the objectives and scope of this study.
Chapter 3 reviews the mathematics behind the software functions used in this study.
Chapter 5 introduces the image-labeling method used for detecting certain types of fruits. The purpose and the significance of this procedure are presented.
Chapter 7 presents results from the stereo vision system used in an outside area
for triangulating fruits surrounded by leaves and branches.
Chapter 8 summarizes this research and presents recommendations for future study.
At the current time, fruit detection can be done using a robot. This kind of robot must have the ability to find fruits and localize them by showing their location in the camera frame.
To detect and localize fruits in the world, two major components are needed: labeling images for detection, and a stereo vision system for localization. The stereo camera introduced in the next chapter can offer a three-dimensional view in which the location of any object can be found. Finding a method for using only one camera to fulfill the function of a standard stereo camera became the problem that had to be solved for localization.
However, the stereo vision system cannot decide where the fruits are in the image. The system requires the image points of the fruits in order to compute the world coordinate results, so it is necessary to develop a method for finding the image points of the fruits. This method is used for fruit detection.
The objectives of the research are the following: using knowledge of computer vision and the pinhole camera model to build and evaluate a stereo vision system with only one pinhole camera to detect and localize fruits on trees; and finding methods for increasing the accuracy of the system so that the estimated distance between the fruits and the camera is as close to the real distance as possible.
This study also started a new phase: automatically finding all fruits in a single image. This can make the whole process automatic and much easier to run. However, this phase is in its infancy, and it will be improved in future work.
2. Background
This research study used images taken by a camera to detect and localize fruits on trees. Many areas of computer vision have been involved in this research study, including image processing, the pinhole camera model, the stereo camera model, and multiple view geometry. This background review introduces in detail how to use computer vision knowledge to detect and localize fruits. In the background below, basic and advanced camera types will be reviewed, and all the hardware devices and the software library will be presented and discussed. Most of the background knowledge is from Multiple View Geometry in Computer Vision [14].
A pinhole camera was used in this study as the image-capturing tool. The pinhole camera is the simplest camera model.
Figure 1 shows the pinhole camera model's geometry. C is the camera center, which is the center of the projection. The line starting from C and extending perpendicularly to the image plane is called the principal axis, and the intersection point is called the principal point (P in Figure 1). Under the pinhole camera model, a point X = (X, Y, Z, 1) in world coordinates maps to a point x = (x, y, 1) on the image plane.
These two kinds of coordinate points are both in homogeneous space, rather than being Euclidean coordinates; this is why there is a one at the end of each coordinate. The homogeneous points will be used in the pinhole camera projection function, which is presented in the next section. [14]
The pinhole camera model is based on a projection function. This projection function indicates how the pinhole camera works and what parameters the pinhole camera model has. Equations (2.1.1) and (2.1.2) show the different forms of the projection function. The function shows a scene formed by projecting all 3D points into 2D image points. On the left side of function (2.1.2) is the projected image point. On the right side are the camera matrix, the rotation-with-translation matrix, and the 3D world coordinate point. The rotation and translation in the function are between the camera coordinate frame and the world coordinate frame. [20]
This is the projection function of the pinhole camera model in two different forms:

(2.1.1) $s\,x = A\,[R \mid t]\,X$

or

(2.1.2) $s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$

where A is the camera matrix, [R | t] is the rotation-with-translation matrix, and s is a scale factor.
The parameters in the camera matrix are called the intrinsic parameters of the cam-
era or the internal orientation of the camera. The parameters of the rotation matrix
and the translation vector are called the extrinsic parameters of the camera.[14] The
rotation matrix and translation vector are the pose relationship between the cam-
era coordinate frame and world coordinate frame. Normally, the camera coordinate
frame is different from the world coordinate frame, which makes the rotation and
translation unknown variables. The camera matrix is also an unknown variable in
the pinhole camera model, and it depends on the properties of the camera and the
image size. The method for finding the parameters is called camera calibration. This is implemented by OpenCV [22], which will be introduced in Section 2.4. After calibrating the camera, all of the parameters can be computed.
In the projection function, the 3D world coordinate point and the 2D image point are both represented by homogeneous vectors. According to the rules of matrix multiplication, a 3 by 4 matrix must be multiplied by a vector with 4 components, and the result is a vector with 3 components. Normally, the last component of the image point resulting from the projection function is not 1. To compute the correct image point, the result is normalized by dividing all components by the last one, so that the image point has the same form as in the projection function.
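As a sketch, this projection-and-normalization step can be written in a few lines of numpy; the camera matrix and the world point below are placeholder values, not the calibrated parameters from this study:

```python
import numpy as np

# Placeholder intrinsic matrix A (focal lengths and principal point are made up)
A = np.array([[600.0, 0.0, 240.0],
              [0.0, 600.0, 320.0],
              [0.0, 0.0, 1.0]])

# Camera frame equal to the world frame: R = I, t = 0
Rt = np.hstack([np.eye(3), np.zeros((3, 1))])

# Homogeneous world point (X, Y, Z, 1), e.g. in millimeters
X = np.array([100.0, 50.0, 1000.0, 1.0])

x = A @ Rt @ X        # projected point; its last component is generally not 1
x = x / x[-1]         # normalize so the image point has the form (u, v, 1)
```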
If the camera coordinate frame is set the same as the world coordinate frame, the rotation matrix will be a 3 by 3 identity matrix and the translation vector will be 0, so that the distance of the objects can be easily found. This is the camera setup that was used in all procedures.
The camera matrix shows the properties of the calibrated camera. However, the camera matrix depends not only on the camera itself but also on the image size. If the sizes of images taken by the same camera are different, the camera matrices will be different. The principal point usually is the center of the image, which depends on the image size. All images therefore have to be the same size for all procedures in this study.
The basic purpose of camera calibration is to use the pinhole camera projection function to compute the intrinsic and extrinsic parameters of the camera. A camera with all parameters computed can be called a calibrated camera. In camera calibration, the image points and world coordinate points are known variables, and all camera parameters must be solved. In the projection matrix there is a total of sixteen unknown variables, which is impossible to solve with only one pair of points. Solving all unknown variables requires a great number of image points. [21]
After computing all parameters, it is necessary to check the accuracy of the results. This step is called reprojection, and it can be done by using the computed variables and the world coordinate points to compute the image points. Image points are always integers, while the results from the reprojection usually are not. If the difference between them is smaller than one, the parameters are considered correct and are used in the following steps.
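The reprojection check can be sketched as follows; the camera matrix, world points, and "measured" corner coordinates are made-up values for illustration only:

```python
import numpy as np

def reproject(A, Rt, world_pts):
    """Project Nx4 homogeneous world points to Nx2 pixel coordinates."""
    proj = (A @ Rt @ world_pts.T).T
    return proj[:, :2] / proj[:, 2:3]

# Placeholder calibration results and chessboard data
A = np.array([[600.0, 0.0, 240.0], [0.0, 600.0, 320.0], [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.zeros((3, 1))])
world = np.array([[0.0, 0.0, 1000.0, 1.0], [24.0, 0.0, 1000.0, 1.0]])
measured = np.array([[240.2, 319.8], [254.3, 320.1]])  # hypothetical detected corners

# RMS reprojection error; the parameters are accepted when it is below one pixel
err = np.sqrt(np.mean(np.sum((reproject(A, Rt, world) - measured) ** 2, axis=1)))
ok = err < 1.0
```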
Stereo cameras usually have two or more lenses, so they can capture 3D images. In some stereo cameras, all lenses have the same or a similar camera matrix. The distance between the lens centers determines the location relationship, which includes a rotation and a translation. This rotation and translation are different from those in the projection function. Normally, all lenses are fixed in the stereo camera, so the rotations and translations of all camera lenses are fixed and known.
Most of the stereo cameras sold on the market have lenses on the same horizontal or vertical level. With this property, the rotation matrix usually is close to the identity matrix, and the translation vector is the distance between the lenses. Figure 2 shows a stereo camera produced by Kodak; it has two lenses on the same horizontal level.
The purpose of using the stereo camera is to capture 3D images. Each camera in the stereo pair captures the scene from a different angle, and all images are used to construct the 3D scene. Figure 3 presents stereo vision.
From: http://www.ccs.neu.edu/course/cs5350sp12/
The image points of the same object in two different images are different due to the locations of the cameras. The two image points and the object form a triangle, from which the location of the object is solved by the stereo camera. This requires not only the two image points but also the properties of the images taken from different angles, which are captured by the projection matrix. The projection matrix is the product of the camera matrix and the matrix formed from the rotation matrix and the translation vector. All the stereo camera parameters can be computed from stereo camera calibration.
2.2.3. Stereo Camera Calibration.
The stereo camera is based on epipolar geometry, which involves the projective relationship between two views. This relationship only depends on the camera matrix and the rotation and translation between the two images. In the stereo camera situation, the lenses are fixed, so their relative position is fixed. However, epipolar geometry shows that the rotation and translation can even be modified.
Figure 4 illustrates the epipolar geometry (credit: Multiple View Geometry in Computer Vision [14]). There are two image views, and each has its own pose. Both of them have an image point mapping to an object X in the world. The corresponding image points satisfy:

(2.3.1) x′ᵀ F x = 0

where F is the fundamental matrix and x, x′ are the corresponding image points in the two views.
The fundamental matrix shows the relationship between two images. Every pair of corresponding points in the two images must satisfy function (2.3.1). The rotation and translation can be solved from the fundamental matrix. OpenCV has a function to find the fundamental matrix; before using it, it is necessary to find the corresponding points in the two images. Finding the fundamental matrix requires at least 7 pairs of corresponding points from a pair of images. [14]
The essential matrix also shows the relationship between two images. Unlike the fundamental matrix, the essential matrix can only be calculated with a calibrated camera, because it involves the camera matrix. The essential matrix is equal to:
(2.3.2) E = Aᵀ F A

where A is the camera matrix and F is the fundamental matrix.
Equation (2.3.2) shows that the essential matrix is the product of the fundamental matrix and the camera matrix. With certain mathematical methods, the rotation and translation can be computed from it; this will be further introduced in the next section.
The outputs of the application in OpenCV that calibrates a stereo camera include the camera matrix parameters, the distortion parameters, the rectification transformation parameters, and the projection matrix of each camera lens. The projection matrix is equal to:

(2.4.1) P = A[R|t],
and is also equal to:

(2.4.2) $P = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}$
After the calibration, all parameters can be computed. However, the calibration application in OpenCV is only designed for standard stereo cameras, which were not used for image capture in this study. The knowledge of the stereo camera calibration application was nonetheless helpful in accomplishing the study objectives.
The research process required a single pinhole camera and a rotation unit instead of a standard stereo camera, because the location of each lens in a standard stereo camera is fixed. A single pinhole camera with a rotation unit can achieve different rotations and translations, which makes the process and results more flexible and accurate.
Figure 5 shows the Galileo (credit: Motrr Inc., from http://motrr.com/ [17]). It has two parts: one for panning and one for tilting. On top is the tilt part, which can hold the mobile device and rotate it vertically. On the bottom is the pan part, which rotates horizontally. With these pan and tilt parts, the rotation can be 360 degrees and three-dimensional. Motrr provides a software development kit for iOS, and the Galileo's operations are programmable, so the Galileo can be controlled during processing.
Figure 6 is from the Motrr website. It shows that the panning and tilting can be controlled remotely from the mobile device via Bluetooth or the Internet. This is the basic operation with the Galileo in this study. The figure shows that the Galileo only needed to be controlled through Bluetooth or the Internet; it was not necessary to use more than one device if the rotation was not complex. With the remotely controlled Galileo and the iPhone 4s, a changeable stereo camera was constructed.
3. Mathematical Review
This mathematical review introduces the math background for this study. Some of the algorithms had already been implemented in OpenCV 2.4 before this study, and some of them had to be implemented during this research study. Finding the corresponding points between two images and then the fundamental matrix had been implemented, but solving the rotation matrix and translation vector from the fundamental matrix had not. In OpenCV 3.0, all of the algorithms have been implemented; however, OpenCV 3.0 was a beta version, so it was not considered for use in this study. Most of the mathematical methods are based on matrix operations, and a great number of matrix operations were used in this research study. All of them will be introduced in this section.
The corresponding points are the intermediate parameters for solving the fundamental matrix between two images. Three algorithms have been used for computing the fundamental matrix:

• 7-point algorithm
• 8-point algorithm
• The RANSAC algorithm

All three algorithms and the math behind them will be introduced in this section. As mentioned, the fundamental matrix is defined by the equation:

(3.1.1) x′ᵀ F x = 0,
where x and x′ are defined as x = (x, y, 1) and x′ = (x′, y′, 1). They represent the corresponding points between two images, so the equation written out with x and x′ is:

(3.1.2) x′x f11 + x′y f12 + x′ f13 + y′x f21 + y′y f22 + y′ f23 + x f31 + y f32 + f33 = 0

Collecting one such row per correspondence into a matrix A gives the linear system:

(3.1.3) A f = 0

Function (3.1.3) shows that matrix A must have rank eight for a unique (up to scale) solution of the fundamental matrix to exist. Normally, there is noise among the corresponding points, so more than eight pairs of points are required to solve the fundamental matrix.
The eight-point algorithm is the simplest method to solve the fundamental matrix, and if everything is taken care of, it performs extremely well. This algorithm requires at least eight pairs of corresponding points. The steps of this algorithm are: normalize the points, solve the linear system A f = 0 by SVD, enforce the rank-2 constraint on F, and denormalize the result.
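These steps can be sketched in numpy as follows; this is the standard normalized formulation, not necessarily the exact OpenCV implementation:

```python
import numpy as np

def eight_point(x1, x2):
    """Normalized eight-point algorithm. x1, x2: Nx2 arrays (N >= 8) of
    corresponding points. Returns the 3x3 fundamental matrix (unit norm)."""
    def normalize(pts):
        c = pts.mean(axis=0)
        s = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - c, axis=1))
        T = np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])
        h = np.column_stack([pts, np.ones(len(pts))])
        return (T @ h.T).T, T

    p1, T1 = normalize(np.asarray(x1, float))
    p2, T2 = normalize(np.asarray(x2, float))

    # One row of the linear system A f = 0 per correspondence
    A = np.array([[u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, 1.0]
                  for (u1, v1, _), (u2, v2, _) in zip(p1, p2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)

    # Enforce the rank-2 constraint by zeroing the smallest singular value
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt

    F = T2.T @ F @ T1      # undo the normalization
    return F / np.linalg.norm(F)
```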
Solving for rotation and translation is the part implemented in this study. There are methods that can solve for all of these parameters, and they have been used to fulfill this goal. Here are the steps for solving the rotation matrix and translation vector. [1]
The first step is to find the essential matrix, which is equal to:

(3.2.1) E = Aᵀ F A

where A is the camera matrix and F is the fundamental matrix.
The essential matrix is a 3 by 3 matrix with two singular values that are the same and a third that is zero. A matrix that fails this property can arise during image capture: if there is much noise between the two images, the singular values of the essential matrix will have different values.
Secondly, perform the Singular Value Decomposition (SVD) on the essential matrix, E = U diag(1, 1, 0) Vᵀ. The rotation and translation recovered from this decomposition are:

(3.2.3) R = UWVᵀ

or

(3.2.4) R = UWᵀVᵀ

(3.2.5) t = ±u₃

where W = [[0, −1, 0], [1, 0, 0], [0, 0, 1]] and u₃ is the third column of U. There are a total of four possible solutions, combining the two possible rotations and the two possible translations; only one of them is feasible:

P′ = [UWVᵀ | u₃] or P′ = [UWVᵀ | −u₃]
P′ = [UWᵀVᵀ | u₃] or P′ = [UWᵀVᵀ | −u₃]
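The decomposition into the four pose hypotheses can be sketched in numpy as follows; the determinant corrections are a standard safeguard, not something spelled out in the text:

```python
import numpy as np

def decompose_essential(E):
    """Return the two candidate rotations and the translation direction from
    an essential matrix; combined with the sign of t this gives four poses."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations (determinant +1)
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1 = U @ W @ Vt        # R = U W V^T
    R2 = U @ W.T @ Vt      # R = U W^T V^T
    t = U[:, 2]            # t = +/- u3 (third column of U); sign is ambiguous
    return R1, R2, t
```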
3.3. Rodrigues’ Rotation Formula.
The Rodrigues' rotation formula was used in this study to convert a rotation matrix to a rotation vector [20] in order to find the rotation angle around one axis. The rotation angle was used to estimate the translation vector. The Rodrigues' rotation formula is a two-way transformation. The steps for converting a rotation vector r to a rotation matrix R include:

• Let θ = norm(r)
• Normalize r by θ: r = r/θ
• Then

$R = \cos(\theta)\, I + (1 - \cos(\theta))\, r r^T + \sin(\theta) \begin{bmatrix} 0 & -r_z & r_y \\ r_z & 0 & -r_x \\ -r_y & r_x & 0 \end{bmatrix}$

The inverse transformation is much easier to perform:

$\sin(\theta) \begin{bmatrix} 0 & -r_z & r_y \\ r_z & 0 & -r_x \\ -r_y & r_x & 0 \end{bmatrix} = \frac{R - R^T}{2}$
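Both directions of the formula can be sketched in numpy; the angle-recovery helper below is an illustrative reading of the inverse relation and is valid for angles under 90 degrees:

```python
import numpy as np

def rodrigues(r):
    """Rotation vector -> rotation matrix, following the steps above."""
    theta = np.linalg.norm(r)
    if theta == 0.0:
        return np.eye(3)
    r = r / theta
    K = np.array([[0.0, -r[2], r[1]],
                  [r[2], 0.0, -r[0]],
                  [-r[1], r[0], 0.0]])
    return (np.cos(theta) * np.eye(3)
            + (1.0 - np.cos(theta)) * np.outer(r, r)
            + np.sin(theta) * K)

def rotation_angle(R):
    """Rotation matrix -> angle, via sin(theta)[r]x = (R - R^T) / 2.
    As written, valid for angles below 90 degrees."""
    K = (R - R.T) / 2.0
    s = np.linalg.norm([K[2, 1], K[0, 2], K[1, 0]])   # |sin(theta)|
    return np.arcsin(np.clip(s, -1.0, 1.0))
```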
4. Localization of the Objects in the Indoor Area
In this section, the purpose is to construct the stereo vision system and test it in an enclosed area where the distances between the objects and the camera are known by physical measurement, and there are no shadows from the fluorescent lights. The stereo vision system has errors; in this enclosed area the errors can be measured, which gives an opportunity to find methods to improve the system. This is the reason the system was tested in the enclosed area before being used in the open area.
In order to localize the objects, all pinhole camera and stereo camera parameters have to be computed. Those parameters include the camera matrix of the pinhole camera, the rotation matrix, and the translation vector. After computing all the parameters, the next step is to construct the stereo vision system, take plenty of images to test it, and find new methods to make the error as low as possible. With a very low error, the system can be used for fruit localization.
The first step is the camera calibration for the iPhone 4s. The camera calibration application from OpenCV has been used to fulfill this purpose. Normally, the OpenCV camera calibration application uses the computer or laptop camera to capture a video of a chessboard. The video is cut into frames, and capturing stops when the number of frames reaches twenty-five. In each frame, the application can recognize and draw the chessboard and save the image points of each corner.
Figure 8 shows a chessboard that has been recognized by the camera calibration application. This is an 8 by 8 chessboard; each square is 24 mm. To run the application, the chessboard information, including the chessboard size (width, height) and the individual square size, has to be input manually.

Figure 9. Chessboard

The application recognized the chessboard in a different position in each image. These five images are enough for calibration.
for calibration.
The camera calibration computes the pinhole camera model function with the image points and the world points as known variables. The application can find the image points of each corner, and it assumes the world coordinates of the corners are the same in every image. It starts from the top-left corner of the chessboard, whose world coordinate point is (0, 0, 0); the corner to its right is (24, 0, 0); and the next one is (48, 0, 0). The size of one square, 24 millimeters, is used as one unit of the world coordinate points.
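The corner numbering described above can be sketched as follows; the 7-by-7 inner-corner count is an assumption for an 8-by-8 board:

```python
import numpy as np

SQUARE_MM = 24.0      # size of one chessboard square
CORNERS = (7, 7)      # inner corners of an 8-by-8 board (assumed count)

# World coordinates of every corner: (0, 0, 0) at the top left, (24, 0, 0)
# to its right, (48, 0, 0) next, and so on; Z = 0 on the board plane.
objp = np.array([[x * SQUARE_MM, y * SQUARE_MM, 0.0]
                 for y in range(CORNERS[1])
                 for x in range(CORNERS[0])])
```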
Checking the accuracy of the camera matrix is important in this step. The reprojection error can provide this check. In fact, only a few points are needed to solve the camera matrix. However, solving it robustly requires a great number of points in different images, then averaging the results from all of those points. This is why the default setting of the application is twenty-five frames. With this averaged result, the reprojection error can be smaller than one.
The computed camera matrix is:

$\begin{bmatrix} 584.638 & 0 & 239.5 \\ 0 & 584.638 & 319.5 \\ 0 & 0 & 1 \end{bmatrix}$
The camera matrix plays a major part in the following steps of this research study. The camera matrix shows the properties of the camera; the principal point is the center of the image. The size of the images used for fruit localization should be the same as that of the calibration images.
After getting the correct iPhone 4s camera matrix, this iPhone 4s will be used as the stereo vision system. In order to make a stereo system with only one camera, the Galileo has been used in this process. With the Galileo, the iPhone can rotate 360 degrees around three axes. Using one camera and rotating it to take images in different locations makes the stereo vision system; in this case, two images are the minimum requirement. The first camera coordinate frame is set as the world coordinate frame, so its rotation matrix is an identity matrix and its translation vector is 0. The rotation matrix and translation vector of the second image are the actual movement of the Galileo.
The Galileo is programmable, and it can be controlled by an iOS application. The application for constructing the stereo vision system controls both the camera and the Galileo. The application can take images and save them in the photo album. Controlling the camera can be done through the OpenCV framework. The Galileo can be controlled by rotation angle or rotation speed. The rotation of the Galileo is known, so the rotation matrix and the translation vector can be estimated by geometry.
Normally, in a stereo camera there is only a translation relationship between the lenses. This special stereo camera, however, can rotate to any angle, and the rotation of the Galileo can be controlled by the application. The Galileo rotates around only one axis, which changes only one rotation angle. However, the translation is hard to trace. The simple way to find the translation is to estimate it by geometry. The rotation is set only about the y-axis, so the translation is only a 2D vector. It forms a triangle with two known sides and one known angle, from which the geometry-estimated translation can be calculated.
In order to estimate the translation, two variables have to be considered. The first is the value of the rotation angle, which is controlled by the application. The second is the distance between the rotation center and the camera center. These two variables form a triangle whose third side is the actual translation. With one angle and two sides, the third side is easy to find. The estimated value is supposed to be close to the real translation. However, the estimation is based on a perfect rotation, which is hard to accomplish in the experimental process.
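The triangle computation described above can be sketched as follows; the function name and the argument values in the usage are illustrative:

```python
import numpy as np

def estimated_translation(angle_deg, radius_mm):
    """Length of the chord travelled by the camera center when it rotates
    angle_deg about a rotation center radius_mm away (law of cosines:
    c^2 = r^2 + r^2 - 2 r^2 cos(theta))."""
    theta = np.radians(angle_deg)
    return np.sqrt(2.0 * radius_mm ** 2 * (1.0 - np.cos(theta)))
```

For example, a 60-degree rotation about a center 100 mm away moves the camera center by exactly 100 mm, since the chord of a 60-degree arc equals the radius.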
The estimation will be used for improvement of the stereo vision system. OpenCV has a function that can show the change in position between images taken by a stereo camera. However, this function returns multiple cases, and all of them are mathematically correct; only one is the real case. With the estimated rotation and translation, the real case can be decided. This is the purpose of the geometry estimation.
Corresponding points are used as a set of parameters to find the fundamental matrix. OpenCV has multiple algorithms to solve the fundamental matrix; to use one of them, at least seven pairs of corresponding points from a pair of images are needed. The corresponding points are the image points of the same objects in different images. With a large number of corresponding points, the relationship between the pair of images can be solved. Corresponding points can be found by human observation. However, human observation will cause a tiny error in the image points, and this tiny error will cause a huge error in the resulting fundamental matrix. Human observation is therefore not the best way to find the corresponding points.
The algorithm called Speeded-Up Robust Features (SURF) can find corresponding points between two images. SURF takes only two images as inputs and returns the corresponding points. The algorithm detects feature points on the first image and finds the corresponding points on the second image. Those points are floating-point numbers instead of integers, due to the calculation of the algorithm.
The rotation matrix and translation vector are the most important parameters of this stereo vision system. Neither of them is fixed or known, which requires calculating both of them for each image capture. Finding the rotation matrix and translation vector is the next step after solving the fundamental matrix, and it must be repeated every time new images are captured. However, OpenCV version 2.11 does not have any functions to solve them; it can only solve the fundamental matrix. Implementation of this method was therefore necessary. The method for finding the rotation matrix and translation vector was presented in the mathematical review.
There are a total of four combinations of the rotation and translation, and all of them are mathematically correct. Because the rotation is controlled by the Galileo, performing the Rodrigues' rotation formula on the rotation matrix can show the axis and angle of the rotation, so only two combinations are left. The translation vector can then be found from the rotation by the geometry measurement. The only difference between the two remaining translation vectors is the sign: the rotation shows the directions on the three axes, and these directions show the sign of the value on each axis. However, users have to manually check which combination is correct, which does not fit a robotic system. This will be improved in the future.
All results from the SVD are normalized: the magnitude of the translation vector is one, which is not the real movement of the camera. Finding the scale factor can make the translation vector close to the real movement. The geometry-estimated translation vector is used for finding this scale factor. The geometry estimation is based on a two-dimensional rotation about the y-axis, while the real rotation is not exactly about the y-axis; even so, the length of the estimated translation vector will be close to that of the real translation vector. The scale factor is used to make the length of the normalized vector equal to the length of the geometry-estimated vector. The real translation vector can then be found using the scale factor and used in the next step.
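Scaling the normalized SVD translation by the geometry estimate can be sketched as:

```python
import numpy as np

def scale_translation(t_unit, t_estimated):
    """Rescale the unit-norm translation from the SVD so its length matches
    the geometry-estimated translation vector."""
    return t_unit * (np.linalg.norm(t_estimated) / np.linalg.norm(t_unit))
```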
Triangulation is the method for finding the location of the objects with one pair of images. Triangulation uses one pair of corresponding points from two images and the projection matrices of the two images to solve for the third point, which is the world location of the corresponding points.
The triangulation uses the function from OpenCV, which requires exact corresponding points: with even a one-pixel error, the results may have an enormous error. Because the calculation methods are fixed, the improvement comes from the image capturing. Only two images are used during one triangulation, so the improvement is capturing multiple images of one object under different rotations. The first image frame is still set as the world coordinate frame. The next step is performing the triangulation between the first image and each of the others. The final step is averaging the distance from all of them.
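A minimal linear triangulation sketch follows (the standard DLT form, not necessarily identical to the internals of OpenCV's `triangulatePoints`):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one pair of corresponding points.
    P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel points.
    Returns the 3D point in world coordinates."""
    A = np.array([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]
```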
5. Labeling Images
In image labeling, all objects are labeled on a single image and all information about the labels is saved. The purpose of labeling images is to find all fruits in the image. The further goal is using the labeling data to create and train an image classifier that can automatically find fruits in images, which will be the future work of this research.
In this procedure, a thousand fruit images have been manually labeled. All of the images were downloaded freely from the Internet, from either ImageNet or Google Image. Images have been selected from five fruit categories: apple, orange, peach, pear, and mango, with two hundred images per category. The data of the labeled images has been saved as a dataset, and it is available to download.
Images are labeled manually by using the Training Image Labeler application from Matlab. The labeler application is based on a human observer manually labeling the objects in images. The application labels the objects with bounding boxes, which can be adjusted to completely cover the objects. Because most fruits are not square, the bounding box will contain noise from the leaves or other surroundings of the fruits. This is a reasonable error, and it will be considered in the future. After labeling all the fruits, the application can export the information of the labels. The information of a single label includes four numbers: the image point of the top-left corner of the label, plus the width and height of the bounding box. All of them form a 4 by N matrix, where N is the total number of bounding boxes.
Labeled images are saved by drawing all the bounding-box information on the
original images. These labeled images make it possible to check the veracity of the
labeling data: the edges of each bounding box should cover the fruit completely, and
the labeled image shows whether they do.
The original images were downloaded from either ImageNet or Google Images.
They show the diversity of fruit content: some images contain many fruits; some
contain only one or two; some fruits are fully visible; and some are hidden behind
other fruits or leaves, which makes them hard to find in the images.
Figure 11 shows a simple apple-tree example with only two apples in the image.
The diversity can be seen from such examples. Moreover, the fruits on a tree are in
different growing states, which is another form of diversity. All states of the fruits
are covered by the two hundred images per category.
Using Matlab and the label information, the bounding boxes can be drawn on the
images. The color of a bounding box can be changed through its RGB parameters.
Figure 12 to Figure 21 show one example for each fruit.
(Pear-tree image source: http://world-crops.com/wordpress/wp-content/uploads/029-Pear-tree.jpg)
The labeling data is the most significant result of the labeling procedure. The
output of the Training Image Labeler application is the bounding-box information,
which can be saved in Matlab-format files. This information will be used for training
the image classifier. To serve that purpose, the data has been saved in multiple forms,
which will make further work easier. A dataset has been created and is available for
download.
The dataset contains four folders: Images, URL, Mat, and Data. The original
images are saved in the Images folder, along with the labeled images, which are
created by drawing the bounding boxes from the labeling data on the originals.
Because the images were downloaded from the Internet, it is necessary to list the
URL of each image. The data from the application, the 4-by-N matrix, is saved in
the Mat folder. The Data folder holds text-format files. In each text file, the first
line gives the number of objects in the image, and each subsequent line describes one
occurrence of an object, a fruit in this context. The first four numbers on a line are
the same as in the Mat file. The last number indicates the type of fruit: 1 is apple;
2 is orange; 3 is peach; 4 is pear; and 5 is mango.
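A minimal parser for the text format just described might look as follows. The whitespace-separated field order (top-left x, y, width, height, fruit type) matches the description above, but the function name, the dictionary of fruit names, and the exact delimiter handling are assumptions of this sketch.

```python
FRUIT_NAMES = {1: "apple", 2: "orange", 3: "peach", 4: "pear", 5: "mango"}

def parse_annotation(text):
    """Parse one annotation file from the Data folder format described above.

    First line: number of labeled objects. Each following line:
    x y width height fruit_type, where (x, y) is the top-left corner
    of the bounding box. Whitespace-separated integer fields are assumed.
    """
    lines = text.strip().splitlines()
    n = int(lines[0])                      # object count from the first line
    boxes = []
    for line in lines[1:1 + n]:
        x, y, w, h, fruit = (int(v) for v in line.split())
        boxes.append({"bbox": (x, y, w, h), "fruit": FRUIT_NAMES[fruit]})
    return boxes
```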
6. Experimental Result
The images were taken in the computer lab, and all distances between the camera
and the objects were measured with a ruler and other measurement tools. Figure
22 shows two examples of the images. The first image frame is set as the world
coordinate frame. The second image was taken after a five-degree rotation about the
y-axis. Galileo allows the iPhone to rotate on three axes; however, that makes the
translation very hard to trace, so rotating about a single axis is the best situation for
image capturing.
6.3. Triangulation.
The purpose of the triangulation is to test the accuracy of the stereo vision system
in an indoor area. The first step of this procedure is choosing the points. The next
step is measuring the distance between the objects in the world and the camera. The
last step is triangulating with the two image points and comparing the result with
the real distance. However, it is hard to measure all (x, y, z) values; the easiest and
most accurate way is to check only one value along with the reprojection error.
The corresponding-points image shows all the corresponding points. In the trian-
gulation procedure, some of those points were chosen as target objects to triangulate,
in order to test the accuracy.
Figure 24 shows the first pair of corresponding points. The first image, which is
set up as the world coordinate frame, is on the left. The green dots on both images
mark the corresponding points.
7. Localization of Fruits
The last step is testing the stereo vision system in a real fruit field. This involves
setting up a simulation environment in which all fruits have known positions relative
to the camera.
The simulation environment was set up based on a few requirements, which are:
• Images taken in this environment should look the same as in a real fruit field.
• All fruits should have known positions relative to the camera.
In the simulation environment, fruits were hung on trees with tape, which meant the
chosen trees had to be short. However, in the images the trees had to look tall.
To accomplish this, the stereo vision system had to be close enough to the plants
that the images would show just leaves. Figure 32 shows one image taken in this
environment. Ninety-five percent of this image is leaves, making it look as if the
camera is surrounded by leaves. The purpose of this set-up is to reduce the noise
from non-tree objects.
After choosing the position of the camera, fruits were hung on the plant. Apples
were chosen as the target fruit. Every apple in this environment had to have a known
position relative to the camera. It was hard to find the positions of the fruits when
the camera center was the origin of the coordinate frame: the fruits were hung on
the tree and the camera was on a tripod, so both were in the air, which made it
difficult to measure the distance between them. To fulfill the requirement that "all
fruits should have known positions relative to the camera" efficiently, a point on the
ground was chosen as the origin. Not only was each apple's position relative to the
origin determined, but so was the camera center's. A rope was attached to every
apple and hung from the fruit to the ground, indicating the fruit's perpendicular
position relative to the ground. The length of the rope gives the y-axis value, and
with this value it is easy to measure the other two. Every apple in this set-up
therefore has a known position.
Figure 33 shows how all the apples were set up. In this figure, eight apples have
been attached to the bush. For convenience of processing, each apple was labeled
with an Arabic numeral in order.
After setting up all the apples, the position of each apple and of the camera relative
to the origin can be measured. Table 2 shows the values of all positions. All apples
were set up at the same distance from the camera except for apple 5: its branch
sagged a little under the weight of the apple, which reduced the distance by about
10 cm and also left it covering most of apple 6. However, this circumstance will also
occur on a real apple tree. The x and y values include the size of the apples, whereas
the z values are measured from the side of each apple closest to the camera. All
position values were measured with a steel tape, which could have introduced some
errors due to the properties of the tape. These errors will be considered in the
following steps.
The values in Table 3 are relative to the origin, while the triangulation results are
relative to the camera center. It is therefore necessary to translate the positions in
Table 3 into positions relative to the camera center by subtracting the camera
position. The coordinate frame is the next consideration in preparing the simulation
environment. To agree with the previous translation results, the coordinate frame
is chosen so that the x-axis points from left to right, the y-axis points from up to
down, and the z-axis points out from the camera. Under this convention, the
coordinates of all apples in the camera frame can be solved. These coordinates were
used for testing the accuracy of the stereo vision system in this simulation
environment.
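The translation and axis convention described above can be sketched as a small helper. The per-axis sign flips are an assumption of this sketch (here only y is flipped, for a ground frame whose y-axis points up while the camera's points down); they must match however the ground frame was actually laid out.

```python
import numpy as np

def ground_to_camera(p_ground, cam_ground, flip=(1, -1, 1)):
    """Translate a point measured from the ground origin into the
    camera-centered frame by subtracting the camera position.

    flip applies per-axis sign changes so the result follows the
    camera convention (x right, y down, z out from the camera).
    """
    p = np.asarray(p_ground, float) - np.asarray(cam_ground, float)
    return p * np.asarray(flip, float)
```

For example, an apple measured at (10, 120, 200) cm from the ground origin, with the camera center at (0, 100, 50) cm, lands at (10, -20, 150) cm in the camera frame under this convention.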
Figure 34 is a pair of fruit images that were used to localize the fruits. The images
were taken outside, so they were affected by sunlight and other natural factors, and
their resolution is lower than that of the images taken indoors. All of this is treated
as distraction factors for the triangulation.
The testing procedure includes multiple kinds of processing: triangulating different
fruits in the same pair of images, and triangulating the same fruit in different pairs
of images. It also includes finding the corresponding points, which look different
from the corresponding points in the previous indoor-image results. The images were
taken outside, where the sunlight differs from the indoor lamplight, and the back-
ground is much more complicated than in the indoor images. These are potential
reasons for the corresponding points looking different from before.
The rotation and translation are unknown during the image-capturing procedure.
The only way to estimate the translation geometrically is to use the rotation matrix
to determine the scale factor of the translation vector, and then scale the translation
vector to obtain the projection matrix.
The recovered rotation matrix is:
R =
⎡ −0.831575   0.008522  −0.555346 ⎤
⎢  0.008389  −0.999575  −0.027902 ⎥
⎣ −0.555348  −0.027862   0.831150 ⎦
After applying the Rodrigues rotation formula, the rotation angle is 2.91 degrees
about the y-axis, with small rotations about the x- and z-axes that can be neglected
in the following process. This angle will be used to estimate the scale factor of the
translation vector.
The translation vector from the fundamental matrix is:
t =
⎡  0.265790 ⎤
⎢  0.018153 ⎥
⎣ −0.963857 ⎦
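A minimal NumPy sketch of the two steps above, recovering a rotation angle via the trace form of the Rodrigues formula and rescaling the up-to-scale translation by an externally measured baseline, might look as follows; the function names and the example baseline length are assumptions of this sketch, not values from the thesis.

```python
import numpy as np

def rotation_angle_deg(R):
    """Rotation angle of a 3x3 rotation matrix via the Rodrigues formula:
    trace(R) = 1 + 2 cos(theta)."""
    c = (np.trace(R) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def scale_translation(t, baseline):
    """The translation recovered from the fundamental/essential matrix is
    known only up to scale; rescale its unit direction by a separately
    measured baseline length (the scale factor is an external measurement)."""
    t = np.asarray(t, float)
    return baseline * t / np.linalg.norm(t)
```

For example, a pure 2.91-degree rotation about the y-axis is recovered exactly by `rotation_angle_deg`, and `scale_translation` leaves the direction of the vector above unchanged while giving it the measured length.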
A fruit in an image spans more than one pixel, but the triangulation requires
exactly one pixel per image. The image points used for localization are therefore
single-pixel points chosen to correspond between the two images. Table 3 shows
the chosen image points, and figure 35 shows this pair of fruit images after plotting
all of them. The green dots are the chosen image points, and each dot is one pixel
in size.
Table 4 shows the image points of all eight apples. These eight pairs of image
points will be used for triangulation. However, some of the image points have errors
due to the manual labeling of the image; these errors will be reduced by adjusting
the image points so that the triangulation results become more accurate.
Figure 36 shows all image points plotted on both fruit images. All of them were
chosen by human observation, and all are close to real corresponding points. Whether
they truly correspond will be revealed by the triangulation results, and they will be
fixed in the following steps.
Table 5. Triangulation Results with Errors
Apple   Image 1 (pixel)   Image 2 (pixel)   Estimated Location in Camera Coordinates (cm)
1 (114, 176) (141, 175) (-10.0811; -11.5402; 46.81)
2 (169, 177) (195, 175) (-8.40724; -16.9499; 69.3489)
3 (150, 386) (178, 383) (-12.622; 9.59829; 82.8833)
4 (264, 221) (291, 218) (5.2889; -23.272; 138.295)
5 (258, 440) (285, 437) (4.94537; 31.7092; 152.981)
6 (295, 452) (323, 451) (13.125; 31.1513; 135.813)
7 (315, 123) (342, 118) (36.7107; -96.4751; 286.677)
8 (381, 346) (410, 343) (117.933; 22.0773; 487.318)
Table 5 shows the triangulation results from figure 35; the real distances are listed
in Table 2. Although the measured distances in Table 2 contain some measurement
error, several of the triangulation results are not close to the real distances. Only two
of them are close to correct; the others are smaller or larger than the real distance
even with small reprojection errors. These errors are caused by chosen points in the
two images that are not exactly corresponding points.
The improvement involves two methods: adjusting the corresponding points, and
using multiple images. After adjustment, the chosen points can be brought closer to
exact correspondences. However, for some fruits the corresponding points between
the two images were difficult to find; in those cases, more than two images were
required to improve the results.
Table 6. Improved Results with Adjusting Image Points
Apple   Image 1 (pixel)   Image 2 (pixel)   Estimated Location in Camera Coordinates (cm)
1 (113, 175) (141, 172) (-34.5178; -39.4676; 159.095)
2 (169, 177) (196, 174) (-18.3116; -36.9332; 151.175)
3 (147, 388) (175, 384) (-23.5082; 17.5518; 148.849)
4 (264, 221) (291, 218) (5.2889; -23.272; 138.295)
5 (258, 440) (285, 437) (4.6432; 25.7718; 143.634)
6 (295, 454) (322, 451) (15.7832; 38.1951; 165.615)
7 (314, 122) (342, 118) (18.6751; -49.3957; 146.263)
8 (382, 346) (410, 343) (36.8651; 6.86531; 151.192)
These new results show a great improvement. However, there are still deviations
between the two sets of results, so different fruit images are required to improve them
further. With multiple images involved, the location results will be much better.
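One minimal way to combine multiple images, as suggested above, is to average the triangulated positions of the same fruit across image pairs; this sketch assumes the per-pair estimates are already expressed in a common frame, and the function name is invented for this example.

```python
import numpy as np

def average_locations(estimates):
    """Average triangulated positions of the same fruit obtained from
    several image pairs; a simple per-axis mean damps the per-pair
    correspondence errors."""
    return np.mean(np.asarray(estimates, float), axis=0)
```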
Table 7 shows the measured locations in the world coordinate frame and the
estimated locations. The estimated results are in the world coordinate system
defined in figure 31. Table 8 shows the errors along the three axes between the
measured and estimated locations.
The largest error along the x-axis is 17.11%, and the smallest is 0. The largest
error along the y-axis is 12.54%, and the smallest is 0. The 17.11% error looks very
large; however, the actual error in that case is only about 4 cm over a distance of
25.5 cm, so the small denominator inflates the percentage. The largest error along
the z-axis is 10.41%, and the smallest is only 0.76%. The z-axis values are the largest
of the three axes, so the small z-axis error percentages indicate that the system is
more accurate when the target fruits are far from the camera.
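The per-axis percentage errors quoted above can be computed as follows; the helper name is an assumption of this sketch. Note how a small absolute error on a short axis yields a large percentage, as with the 4 cm error over 25.5 cm discussed above.

```python
import numpy as np

def percent_error(measured, estimated):
    """Per-axis percentage error between measured and estimated positions,
    relative to the measured (ground-truth) magnitude."""
    m = np.asarray(measured, float)
    e = np.asarray(estimated, float)
    return np.abs(e - m) / np.abs(m) * 100.0
```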
The second test uses the first image together with a new fruit image. Figure 37
presents this pair of fruit images. The process is the same as before. The rotation
between these two images was 0.768237 degrees; such a small rotation had not been
tested before. The image points for the first image were reused from the previous
section, and the second image had to be labeled.
Table 9 presents the new corresponding points between the two images, and figure
38 shows the plotted fruit images.
Table 10. Image Points of Apples from the New Pair of Images
Apple   Fruit Image 1 (pixel)   Fruit Image 3 (pixel)   Estimated Location in Camera Coordinates (cm)
1 (113, 176) (108, 175) (-34.2741; -39.0662; 159.367)
2 (169, 177) (163, 176) (-16.6177; -33.6226; 138.093)
3 (150, 386) (144, 385) (-24.2837; 17.8646; 158.276)
4 (264, 221) (258, 220) (-4.7426; -19.1805; 143.728)
5 (268, 440) (262, 439) (6.00727; 25.6438; 135.171)
6 (295, 452) (289, 451) (11.3152; 35.1012; 140.271)
7 (315, 123) (309, 122) (12.53838; -47.1414; 145.907)
8 (381, 346) (375, 345) (30.805; 4.75034; 147.034)
The errors from this pair of images are much more stable than the previous results.
The largest error along the x-axis is 4.21%, and the smallest is 0. The largest error
along the y-axis is 8.39%, and the smallest is 0. The largest error along the z-axis is
7.938%, and the smallest is 1.98%. Comparing the results from the two motions
shows that a smaller rotation angle between two images makes the estimated results
more stable.
8. Conclusions and Recommendations
8.1. Conclusion.
This research focused on the following: 1. creating and testing a stereo vision
system; 2. manually finding all fruits in images; 3. testing the stereo vision system
outdoors with real fruits and trees.
The results indicate that the stereo vision system has accurate output. The follow-
ing conclusions are based on the findings:
(1) In a stereo vision system based on the pinhole camera model and multiple-
view geometry, the rotation applied between two images during capture
affects the location results.
(2) In an indoor area, the stereo vision system works well for long distances
with a small rotation; on the other hand, errors for short distances with a
large rotation are large.
(3) In the real tree area, the stereo vision system had good results. The back-
ground trees made the corresponding points much more complicated than in
the indoor area; these additional correspondences could make the calculation
of the rotation and the translation more precise.
(4) The image-labeling procedure produced significant results for one thousand
fruit images. These results not only show the fruits' information, but also
provide enough data for creating and training a fruit image classifier.
8.2. Recommendations for Future Study.
Future work on this research will improve the stereo vision system in two respects:
1. automatically finding fruits in images; 2. automatically adjusting corresponding
points.
(1) An image classifier can automatically label certain objects in images. With
a classifier for a given kind of fruit, the stereo vision system can find all fruits
of that kind in each captured image.
References
[1] Adrien Bartoli and Soren Olsen. Camera motion from two-view relationships. http:
//isit.u-clermont1.fr/~ab/Classes/DIKU-3DCV2/Handouts/Lecture16.pdf. [Online; ac-
cessed January 10th, 2015].
[2] Calin and Roda. Applied geometric representation and computation. http://www.ccs.neu.
edu/course/cs5350sp12/, 2007. [Online; accessed April 27, 2015].
[3] flickr.com. Apple tree. http://farm1.static.flickr.com/38/101398959_0d51fb9c84.jpg.
[4] flickr.com. Apple tree. http://farm1.static.flickr.com/28/48406200_135f13a35b.jpg.
[5] flickr.com. Apple tree. http://farm1.static.flickr.com/1/3139450_9b2971cb51.jpg.
[6] flickr.com. Mango tree. http://farm1.static.flickr.com/51/116367124_c68d833987.jpg.
[7] flickr.com. Mango tree. http://farm2.static.flickr.com/1027/529190162_c6c53959a9.
jpg.
[8] flickr.com. Orange tree. http://farm1.static.flickr.com/100/367162686_eba027ce6f.
jpg.
[9] flickr.com. Orange tree. http://farm3.static.flickr.com/2194/1817554650_3bd6a0657d.
jpg.
[10] flickr.com. Peach tree. http://farm2.static.flickr.com/1074/1042414990_6dd8f3e7ba.
jpg.
[11] flickr.com. Peach tree. http://farm2.static.flickr.com/1224/770919709_9f84c45d39.
jpg.
[12] flickr.com. Pear tree. http://world-crops.com/wordpress/wp-content/uploads/
029-Pear-tree.jpg.
[13] flickr.com. Pear tree. http://farm4.static.flickr.com/3181/2796719945_5c910a233a.
jpg.
[14] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge
university press, 2003.
[15] Richard Hartley and Andrew Zisserman. Single image view camera geometry, epipolar geometry,
2003.
[16] Richard Hartley and Andrew Zisserman. Single image view camera geometry, pinhole camera
model. In Multiple view geometry in computer vision. Cambridge university press, Cambridge,
2003.
[17] Motrr inc. Galileo features. http://motrr.com/, 2015.
[18] Motrr inc. Two ways to operate galileo. http://motrr.com/, 2015.
[19] Bryan Lowa. The apple tree. http://brokenbelievers.com/2014/09/19/the-apple-tree/,
September 19, 2014.
[20] OpenCV. Camera calibration and 3d reconstruction. http://docs.opencv.org/modules/
calib3d/doc/camera_calibration_and_3d_reconstruction.html. [Online; accessed April
27, 2015].
[21] OpenCV. Camera calibration with opencv. http://docs.opencv.org/doc/tutorials/
calib3d/camera_calibration/camera_calibration.html. [Online; accessed April 27, 2015].
[22] OpenCV. Opencv, open source computer vision. http://opencv.org/. [Online; accessed April
27, 2015].
[23] Wikipedia, the free encyclopedia. Stereo camera. https://en.wikipedia.org/wiki/Stereo_
camera, 2015. [Online; accessed April 27, 2015].