
Fruit Detection with Camera Calibration and 3D Reconstruction
By

Bo Wu
B.S. (Iowa State University), 2013

THESIS

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE

in

Electrical and Computer Engineering

in the

OFFICE OF GRADUATE STUDIES

of the

UNIVERSITY OF CALIFORNIA,

DAVIS

Approved:

Professor Soheil Ghiasi, Chair

Professor Hussain Al-Asaad

Professor Bevan Baas

Committee in Charge

2015
Bo Wu
September 2015
Electrical & Computer Engineering

Abstract

This study is aimed at localizing fruits on trees. The research is intended for use by the UC Davis Biological and Agricultural Engineering Department for robotic localization of fruits in the field. The study includes three phases. In the first phase, a stereo vision system with only one camera and one rotation unit was constructed; normally, stereo vision systems contain more than one camera with a known distance between the cameras. In the second phase, the stereo vision system was tested in both indoor and outdoor areas with a known distance between the system and the objects. The last phase introduces manual fruit detection in the images. The output of the stereo vision system is the distance between the object and the camera. With enough tests from phase two, the system can be used in real-world applications. This study focused on constructing and testing the stereo vision system. The detection of the fruits is not automated at this time and is limited to their manual localization in the images. A fruit locator is required in the future to automatically detect fruits and return their locations.

Contents

Abstract ii
List of Figures vi
List of Tables viii
Acknowledgments ix
1. Introduction 1
1.1. Organization of Thesis 1
1.2. Problem Statement 2
1.3. Objectives of the Research 2
1.4. Scope of the Research 3
2. Background 4
2.1. Pinhole Camera Model 4
2.1.1. Projection Function of Pinhole Camera 5
2.1.2. Camera Calibration 7
2.2. Stereo Camera and Calibration 8
2.2.1. Stereo Camera 8
2.2.2. Operation of Stereo Camera 9
2.2.3. Stereo Camera Calibration 10
2.3. Epipolar Geometry 10
2.3.1. Epipolar Geometry 10
2.3.2. Fundamental Matrix 10
2.3.3. Essential Matrix 11
2.4. OpenCV Library 12
2.5. Devices Used 13
3. Mathematical Review 16
3.1. Solving Corresponding Points and the Fundamental Matrix between
Two Images 16
3.1.1. Seven-Point Algorithm 17
3.1.2. Eight-Point Algorithm 18
3.1.3. The RANSAC Algorithm 18
3.2. Solving Rotation and Translation from Fundamental Matrix 19
3.3. Rodrigues’ Rotation Formula 22
4. Localization of the Objects in the Indoor Area 23
4.1. Camera Calibration 23
4.2. Constructing the Stereo Vision System 27
4.3. Finding Corresponding Points 28
4.4. Finding the Rotation Matrix and Translation Vector 29
4.5. Triangulation and Improvement 30
5. Labeling Images 31
5.1. Software Tool for Labeling 31
5.2. Labeled Image 31
5.2.1. Original Images 32
5.2.2. Labeled Images 33
5.3. Labeling Data 39
6. Experimental Results 40
6.1. Examples of images 40
6.2. Finding the Corresponding Points 41
6.3. Triangulation 42
6.4. New Pair of Images Testing 46
7. Localization of Fruits 50
7.1. Simulation Environment Set-Up 50
7.2. Fruit Images 53
7.3. Testing Procedure 54
7.3.1. First Test 55
7.3.2. Multiple Images Testing 61
8. Conclusions and Recommendations 66
8.1. Conclusion 66
8.2. Recommendations for Future Study 67

References 68

List of Figures

1 Pinhole Camera Geometry 4

2 Kodak stereo camera 8

3 Stereo Vision 9

4 Epipolar Geometry 11

5 Galileo 14

6 Operation of the Galileo 14

7 Four Solutions of Rotation and Translation 21

8 Recognized Chessboard 24

9 Chessboard 25

10 Complicated Apple Tree Example 32

11 Simple Apple Tree Example 33

12 Complicated Labeled Apple Tree 34

13 Simple Labeled Apple Tree 34

14 Complicated Labeled Orange Tree 35

15 Simple Labeled Orange Tree 35

16 Complicated Labeled Peach Tree 36

17 Simple Labeled Peach Tree 36

18 Complicated Labeled Pear Tree 37

19 Simple Labeled Pear Tree 37

20 Complicated Labeled Mango Tree 38

21 Simple Labeled Mango Tree 38

22 First Two Images 40

23 Corresponding Points 41

24 First Corresponding Points 43


25 Second Corresponding Points 44

26 Third Corresponding Points 45

27 Fourth Corresponding Points 46

28 Fifth Corresponding Points 47

29 New Pair of Images 48

30 New Pair of Images Point 1 49

31 Fruit Environment Set-Up 50

32 Image from the Simulation Environment 51

33 Fruits Set-up 52

34 A Pair of Fruit Images 54

35 Corresponding Points of Fruit Images 55

36 Plotted Fruit Images 58

37 Second Fruit Test 62

38 Second Plotted Fruit Image 63

List of Tables

1 Corresponding Points for Triangulation 42

2 Results from The First Two Images 46

3 Position of Apples and the Camera 53

4 Image Points of The Fruit Image 57

5 Triangulation Results with Errors 59

6 Improved Results with Adjusting Image Points 60

7 Difference between Estimated and Measured Results 60

8 Relative Estimation Error per Axis 61

9 Image Points of Apple from the New Pair of Images 63

10 Image Points of Apple from the New Pair of Images 64

11 Relative Estimation Error per Axis 64

Acknowledgments

Foremost, I would like to express my sincere gratitude to my advisor Prof. Soheil Ghiasi for supporting my M.S. study and research with his patience, motivation, and extensive background knowledge. I would also like to thank Prof. Bevan Baas and Prof. Hussain Al-Asaad for reviewing my thesis and for their helpful comments.

I thank my fellow colleagues in the Laboratory for Embedded and Programmable Systems: Nicholas Hosein, Mohammad Foroozannejad, and Mohammad Motamedi for their help during the course of this research.

Last but not least, I want to thank my parents for supporting me throughout my life.

1. Introduction

1.1. Organization of Thesis.

This thesis is organized into eight chapters: 1. Introduction, 2. Background, 3. Mathematical Review, 4. Localization of the Objects in the Indoor Area, 5. Labeling Images, 6. Experimental Results, 7. Localization of Fruits, and 8. Conclusions and Recommendations.

Chapter 1 introduces the problems that currently exist with software applications for fruit detection, along with the objectives and scope of this study.

Chapter 2 introduces background knowledge in the different areas relevant to this thesis, including the mathematics, computer vision, and the devices used in the research process.

Chapter 3 reviews the mathematics behind the software functions used in this study.

Chapter 4 summarizes the materials and research method adopted in the experimental process of this study. The types, sampling process, and testing procedures are included.

Chapter 5 introduces the image-labeling method used for detecting certain types of fruits. The purpose and significance of this procedure are presented.

Chapter 6 presents the experimental results, including:

• The parameters for constructing the stereo vision system.


• The testing results for certain objects in the enclosed area.
• The locations of the fruit on the tree in real world coordinates.
• The labeled image results.

Chapter 7 presents results from the stereo vision system used in an outside area
for triangulating fruits surrounded by leaves and branches.

Chapter 8 summarizes this research and presents recommendations for future study.

1.2. Problem Statement.

At the current time, fruit detection can be done using robots. This kind of robot must have the ability to find fruits and localize them by showing their location in the camera frame.

To detect and localize fruits in the world, two major components are needed: labeling images for detection, and a stereo vision system for localization. The stereo camera introduced in the next chapter can offer a three-dimensional view in which the location of any object can be found. Finding a method for using only one camera to fulfill the function of a standard stereo camera has become the problem that must be solved for localization.

However, the stereo vision system cannot decide where the fruits are in the image. The system requires the image points of the fruits in order to compute the world-coordinate results, so it is necessary to develop a method for finding the image points of the fruits. This method is used for fruit detection.

1.3. Objectives of the Research.

The objectives of the research are the following: using the knowledge of computer vision and the pinhole camera model to build and evaluate a stereo vision system with only one pinhole camera to detect and localize the fruits on trees; and finding methods for increasing the accuracy of the system, so that the estimated distance between the fruits and the camera can be as close to the real distance as possible.

This study also started a new phase: automatically finding all fruits in a single image. This can make the whole process automatic and much easier to run. However, this phase is in its infancy, and it will be improved in future work.

1.4. Scope of the Research.

The scope of the research covers the following:

• Calibrating a pinhole camera from a mobile device.


• Constructing a stereo camera from the calibrated pinhole camera.
• Calculating the parameters of the constructed stereo camera.
• Testing the stereo camera in an enclosed area with a known distance.
• Using the stereo camera to test the distance of fruits in an outside area.
• Labeling images to find the image points of the fruits.

2. Background

This research study used the images taken by a camera to detect and localize the fruits on trees. Many areas of computer vision have been involved in this research study, including image processing, the pinhole camera model, the stereo camera model, and multiple view geometry. This background review introduces in detail how to use computer vision knowledge to detect and localize fruits. In the background below, basic and advanced camera types will be reviewed. All the hardware devices and the software library will be presented and discussed. Most of the background knowledge is from Multiple View Geometry in Computer Vision [14].

2.1. Pinhole Camera Model.

A pinhole camera was used in this study as the image-capturing tool. The pinhole
camera is the most specialized and simplest camera model.

Credit: Multiple view Geometry in Computer Vision

Figure 1. Pinhole Camera Geometry[16]

Figure 1 shows the pinhole camera model's geometry. C is the camera center, which is the center of projection. The line starting from C and extending perpendicularly to the image plane is called the principal axis, and the intersection point is called the principal point (P in Figure 1). Under the pinhole camera model, a point X = (X, Y, Z, 1) in world coordinates maps to a point x = (x, y, 1) on the image plane.

These two kinds of coordinate points are both in homogeneous space, instead of being Euclidean coordinates; this is why there is a one at the end of each coordinate. The homogeneous points will be used in the pinhole camera projection function, which will be presented in the next section.[14]

2.1.1. Projection Function of Pinhole Camera.

The pinhole camera model is based on a projection function. This projection function indicates how the pinhole camera works and what parameters the pinhole camera model has. Equations (2.1.1) and (2.1.2) show the different forms of the projection function. This function shows a scene formed by projecting all 3D points into 2D image points. On the left side of equation (2.1.2) is the projected image point. On the right side are the camera matrix, the rotation-with-translation matrix, and the 3D world coordinate point. The rotation and translation in the function are between the camera coordinate frame and the world coordinate frame.[20]

This is the function in the two different forms of the pinhole camera model:

(2.1.1) $s\,m = A[R|t]M$

or

(2.1.2) $s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$

where

• (X, Y, Z) are the coordinates of a 3D point in the world coordinate frame.
• (u, v) is the image point in pixels.
• A is the camera matrix, also called the intrinsic parameters of the camera.
• $(c_x, c_y)$ is the principal point, which is usually the center of the image.
• $f_x, f_y$ are the focal lengths in pixels.
• R is the rotation matrix.
• t is the translation vector.

The parameters in the camera matrix are called the intrinsic parameters of the camera, or the internal orientation of the camera. The parameters of the rotation matrix and the translation vector are called the extrinsic parameters of the camera.[14] The rotation matrix and translation vector describe the pose relationship between the camera coordinate frame and the world coordinate frame. Normally, the camera coordinate frame is different from the world coordinate frame, which makes the rotation and translation unknown variables. The camera matrix is also an unknown variable in the pinhole camera model, and it depends on the properties of the camera and the image size. The method for finding these parameters is called camera calibration. This is implemented by OpenCV[22], which will be introduced in section 2.4. After calibrating the camera, all of the parameters can be computed.

In the projection function, the 3D world coordinate point and the 2D image point are both represented by homogeneous vectors. According to the principle of matrix multiplication, a 3 by 4 matrix must be multiplied by a 4-row vector, and the result is a 3-row vector. Normally, the last component of the image point resulting from the projection function is not 1. To compute the correct image point, the result is normalized by dividing all components by the last one, so that the image point has the same form as in the projection function.
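As a minimal sketch of this projection-and-normalization step (written here in Python with NumPy; the matrix values and the world point below are illustrative placeholders, not the calibrated parameters of this study):

    import numpy as np

    # Illustrative intrinsic matrix A and pose [R|t]; not the values from this study.
    A = np.array([[600.0, 0.0, 240.0],
                  [0.0, 600.0, 320.0],
                  [0.0, 0.0, 1.0]])
    Rt = np.hstack([np.eye(3), np.zeros((3, 1))])  # camera frame equal to world frame

    M = np.array([10.0, -5.0, 150.0, 1.0])         # homogeneous world point (X, Y, Z, 1)
    m = A @ Rt @ M                                 # 3-vector; last entry is generally not 1
    u, v = m[0] / m[2], m[1] / m[2]                # normalize by the last component
    print(u, v)                                    # pixel coordinates of the projection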

If the camera coordinate frame is set to be the same as the world coordinate frame, the rotation matrix will be a 3 by 3 identity matrix and the translation vector will be 0, so that the distance of the objects can be easily found. This is the camera setup that was used in all procedures.

The camera matrix shows the properties of the calibrated camera. However, the camera matrix depends not only on the camera itself but also on the image size: if images taken by the same camera have different sizes, the camera matrices will be different. The principal point is usually the center of the image, which depends on the image size. All images therefore have to be the same size for all procedures in this study.

2.1.2. Camera Calibration.

The basic purpose of camera calibration is to use the pinhole camera projection function to compute the intrinsic and extrinsic parameters of the camera. A camera with all the computed parameters is called a calibrated camera. In camera calibration, the image points and world coordinate points are known variables, and all camera parameters must be solved. In the projection matrix there is a total of sixteen unknown variables, which is impossible to solve with only one pair of points. Solving all unknown variables requires a great number of image points.[21]

Camera calibration requires images of one chessboard in different positions, and the camera should not be moved while taking the images. Each corner of the chessboard is both a world coordinate point and an image point. OpenCV has functions that can detect the chessboard, find each chessboard corner in the images, and return all the image points. With enough image points, the camera matrix, rotation matrix, and translation vector can be computed.
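A sketch of this calibration loop using the OpenCV Python bindings (the board dimensions, square size, and file names here are assumptions chosen to match the description in Chapter 4, not the exact code used in the study):

    import cv2
    import glob
    import numpy as np

    pattern = (7, 7)     # inner corners of a board with 8x8 squares (assumption)
    square = 24.0        # square size in mm, as in the chessboard of Chapter 4

    # World coordinates of the corners: (0, 0, 0), (24, 0, 0), (48, 0, 0), ...
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

    obj_points, img_points = [], []
    for name in glob.glob('chessboard_*.png'):        # hypothetical file names
        gray = cv2.imread(name, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:                                     # keep only frames with a detected board
            obj_points.append(objp)
            img_points.append(corners)

    rms, A, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, gray.shape[::-1], None, None)
    print('average reprojection error:', rms)
    print('camera matrix:\n', A)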

After computing all parameters, it is necessary to check the accuracy of the results. This step is called reprojection. It can be done by using the computed variables and the world coordinate points to compute the image points. Detected image points are integers, while the results from the reprojection usually are not. If the difference between them is smaller than one, the parameters are considered correct and will be used in the following steps.

2.2. Stereo Camera and Calibration.

2.2.1. Stereo Camera.

Stereo cameras usually have two or more lenses, so they can capture 3D images. In some stereo cameras, all lenses have the same or similar camera matrix. The distance between lens centers determines the location relationship, which includes a rotation and a translation. This rotation and translation are different from those in the projection function. Normally, all lenses are fixed in the stereo camera, so the rotations and translations between the camera lenses are fixed and known.

Credit: Wikipedia, Stereo Camera

From: https://en.wikipedia.org/wiki/Stereo_camera

Figure 2. Kodak stereo camera[23]

Most stereo cameras sold on the market have lenses on the same horizontal or vertical level. With this property, the rotation matrix is usually close to the identity matrix, and the translation vector is the distance between the lenses. Figure 2 shows a stereo camera produced by Kodak. This stereo camera has two lenses on the same horizontal level.

2.2.2. Operation of Stereo Camera.

The purpose of using the stereo camera is to capture 3D images. Each lens of the stereo camera captures the scene from a different angle, and all images are used to construct the 3D scene. Figure 3 presents the stereo vision.

Credit: Calin and Roda

From: http://www.ccs.neu.edu/course/cs5350sp12/

Figure 3. Stereo Vision[2]

The image points for the same object in two different images are different due to the locations of the cameras. The two image points and the object form a triangle, from which the location of the object is solved by the stereo camera. This requires not only the two image points but also the property of each image taken from a different angle, which is called the projection matrix. The projection matrix is the product of the camera matrix and the matrix formed by the rotation matrix and the translation vector. All the stereo camera parameters can be computed from stereo camera calibration.
2.2.3. Stereo Camera Calibration.

Stereo camera calibration is similar to pinhole camera calibration: taking images of a chessboard in different positions. However, the results are different. The camera matrices are known variables in this process. The final results include the rotation matrix and translation vector between the two cameras, the essential matrix, and the fundamental matrix. The essential matrix and fundamental matrix play a major role in multiple view geometry, which will be presented in the next section.[20]

2.3. Epipolar Geometry.

The stereo camera is based on epipolar geometry, which involves the projective relationship between two views. This relationship depends only on the camera matrix and the relative rotation and translation between the two images. In the stereo camera situation, the lenses are fixed, so their relative position is fixed. However, epipolar geometry still applies even when the rotation and translation are modified.

2.3.1. Epipolar Geometry.

Figure 4 illustrates the epipolar geometry. There are two image views, and each has its own pose. Both of them have an image point mapping to an object X in the world.

2.3.2. Fundamental Matrix.

The fundamental matrix is the algebraic representation of epipolar geometry:

(2.3.1) $x'^{T} F x = 0$
Credit: Multiple view Geometry in Computer Vision

Figure 4. Epipolar Geometry[15]

where:

• x and x' are the corresponding points in the pair of stereo images.
• F is the fundamental matrix.

The fundamental matrix shows the relationship between two images. Every pair of corresponding points in the two images must satisfy equation (2.3.1). The rotation and translation can be solved from the fundamental matrix. OpenCV has a function to find the fundamental matrix. Before using this function, it is necessary to find the corresponding points in the two images. Finding the fundamental matrix requires at least 7 pairs of corresponding points in a pair of images.[14]
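A sketch of this step with the OpenCV Python bindings (the point arrays here are synthetic placeholders; in the actual pipeline they come from the SURF matching described in Chapter 4):

    import cv2
    import numpy as np

    # pts1, pts2: Nx2 float arrays of corresponding points, N >= 7.
    rng = np.random.default_rng(0)
    pts1 = (rng.random((20, 2)) * 480).astype(np.float32)                  # placeholder
    pts2 = (pts1 + [50.0, 0.0] + rng.normal(0, 0.5, (20, 2))).astype(np.float32)

    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
    inliers1 = pts1[mask.ravel() == 1]   # mask marks the RANSAC inliers
    print(F)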

2.3.3. Essential Matrix.

The essential matrix also shows the relationship between two images. Unlike the fundamental matrix, the essential matrix can only be calculated with a calibrated camera, since it involves the camera matrix. The essential matrix is equal to:

(2.3.2) $E = A^{T} F A$

where:

• E is the essential matrix.
• A is the camera matrix.
• F is the fundamental matrix.

Equation (2.3.2) shows that the essential matrix is obtained from the fundamental matrix and the camera matrix. With certain mathematical methods, the rotation and translation can be computed from it. This will be further introduced in the next section.
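In code, this is a single matrix product. A minimal check (using the camera matrix computed later in Section 4.1; F here is a placeholder standing in for the solved fundamental matrix):

    import numpy as np

    A = np.array([[584.638, 0.0, 239.5],
                  [0.0, 584.638, 319.5],
                  [0.0, 0.0, 1.0]])   # camera matrix from Section 4.1
    F = np.eye(3)                     # placeholder; substitute the solved F

    E = A.T @ F @ A
    # A valid essential matrix has two equal singular values and a third of zero:
    print(np.linalg.svd(E, compute_uv=False))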

2.4. OpenCV Library.

OpenCV (Open Source Computer Vision) is an open-source, cross-platform computer vision library with complete source code available online. It can run on computers, on mobile devices including iOS and Android, and also on GPUs. OpenCV provides functions for pinhole camera calibration, stereo camera calibration, image processing, and video processing. Most of the algorithms and functions used in this study are implemented by OpenCV. The details of these algorithms and functions will be presented in the Mathematical Review section.

The outputs of the OpenCV application that calibrates a stereo camera include the camera matrix parameters, distortion parameters, rectification transformation parameters, and the projection matrix of each camera lens. The projection matrix is equal to:

(2.4.1) $P = A[R|t]$

and is also equal to:

(2.4.2) $P = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}$
After the calibration, all parameters can be computed. However, the calibration application in OpenCV is only designed for standard stereo cameras, which were not used for image capture in this study. The knowledge of the stereo camera calibration application was nevertheless helpful in accomplishing the study objectives.

2.5. Devices Used.

The research process required a single pinhole camera and a rotation unit instead of a standard stereo camera. This is because the location of each lens in a standard stereo camera is fixed, whereas a single pinhole camera with a rotation unit can achieve different rotations and translations, which makes the process and results more flexible and accurate.

Given programming requirements and hardware limitations, an iPhone 4s was used as the pinhole camera. The operating system of the iPhone, iOS, has a complete software development kit. OpenCV has an iOS framework that allows software developers to create computer vision applications. The rotation device is called Galileo. Galileo is a rotation unit developed by Motrr, a company in Santa Cruz, California. Initially, Galileo was designed for Apple mobile devices. This is the hardware limitation that dictated the use of the iPhone 4s as the pinhole camera.

Credit: Motrr inc. From: http://motrr.com/

Figure 5. Galileo[17]

Figure 5 shows the Galileo. It has two parts: one for panning and one for tilting. On the top is the tilt part, which can hold the mobile device and rotate vertically. On the bottom is the pan part, which rotates horizontally. With these pan and tilt parts, the rotation can cover 360 degrees in three dimensions. Motrr provides a software development kit for iOS. Galileo's operations are programmable, so that Galileo can be controlled during the processing.

Credit: Motrr inc. From: http://motrr.com/

Figure 6. Operation of the Galileo[18]

Figure 6 is from the Motrr website. It shows that the panning and tilting can be controlled remotely from the mobile device via Bluetooth or the Internet. This is the basic operation with Galileo in this study. The figure shows that the Galileo only needed to be controlled through Bluetooth or the Internet; it was not necessary to use more than one device if the rotation was not complex. With the remotely controlled Galileo and the iPhone 4s, a changeable stereo camera was constructed.

3. Mathematical Review

This mathematical review introduces the math background for this study. Some of the algorithms had already been implemented by OpenCV 2.4 before this study, and some of them needed to be implemented during this research. Finding the corresponding points between two images and then the fundamental matrix had been implemented, but solving the rotation matrix and translation vector from the fundamental matrix had not. In OpenCV 3.0, all of these algorithms have been implemented; however, OpenCV 3.0 was a beta version at the time, so it was not considered for use in this study.

Most of the mathematical methods are based on matrix operations. A great number of matrix operations were used in this research study. All of them will be introduced in this section.

3.1. Solving Corresponding Points and the Fundamental Matrix between


Two Images.

The corresponding points are the intermediate parameters for solving the fundamental matrix between two images. Three algorithms were used for computing the fundamental matrix:

• 7-point algorithm
• 8-point algorithm
• The RANSAC algorithm

All three algorithms and the math behind them will be introduced in this section. As mentioned, the fundamental matrix is defined by the equation:

(3.1.1) $x'^{T} F x = 0,$

where x and x' are defined as x = (x, y, 1) and x' = (x', y', 1). They represent the corresponding points between the two images, so the equation in terms of x and x' is:

(3.1.2) $x'x f_{11} + x'y f_{12} + x' f_{13} + y'x f_{21} + y'y f_{22} + y' f_{23} + x f_{31} + y f_{32} + f_{33} = 0$

If f is a 9-vector made up of the entries of F in row-major order, then equation (3.1.2) can be rewritten as:

$(x'x,\; x'y,\; x',\; y'x,\; y'y,\; y',\; x,\; y,\; 1)\, f = 0$

For a set of n corresponding points, a linear system is obtained:

(3.1.3) $A f = \begin{bmatrix} x'_1 x_1 & x'_1 y_1 & x'_1 & y'_1 x_1 & y'_1 y_1 & y'_1 & x_1 & y_1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x'_n x_n & x'_n y_n & x'_n & y'_n x_n & y'_n y_n & y'_n & x_n & y_n & 1 \end{bmatrix} f = 0$

Equation (3.1.3) shows that for a unique solution (up to scale) of the fundamental matrix to exist, matrix A must have rank eight. Normally, there is noise among the corresponding points, so more than eight pairs of points are used to solve the fundamental matrix.

3.1.1. Seven-Point Algorithm.

As mentioned, it is possible to solve the fundamental matrix if there are eight pairs of points. However, if there are only seven, it is still possible. When only seven corresponding points are known, the matrix A is a 7 by 9 matrix, which has rank 7.

The solution of equation (3.1.2) is then a two-dimensional space of the form $\alpha F_1 + (1 - \alpha) F_2$, where α is a scalar variable. $F_1$ and $F_2$ are the matrices corresponding to the generators $f_1$ and $f_2$ of the right null-space of matrix A. Now impose $\det F = 0$, which may be written as $\det(\alpha F_1 + (1 - \alpha) F_2) = 0$. Since $F_1$ and $F_2$ are known, this leads to a cubic polynomial equation in α that can be solved for the fundamental matrix. There may be one or three real solutions. This is how the seven-point algorithm works.

3.1.2. Eight-Point Algorithm.

The eight-point algorithm is the simplest method to solve the fundamental matrix. If everything has been taken care of, it performs extremely well. This algorithm requires at least eight pairs of corresponding points. The steps of this algorithm are:

(1) Normalization: Transform the image coordinates according to $\hat{x}_i = T x_i$ and $\hat{x}'_i = T' x'_i$, where T and T' are normalizing transformations consisting of a translation and scaling.
(2) Find the fundamental matrix from the normalized points.
• Linear Solution: Determine $\hat{F}$ from the singular vector corresponding to the smallest singular value of $\hat{A}$, where $\hat{A}$ is composed from the matches $\hat{x}_i \leftrightarrow \hat{x}'_i$.
• Constraint Enforcement: Replace $\hat{F}$ by $\hat{F}'$, enforcing the rank-2 constraint using the SVD.
(3) Denormalization: Set $F = T'^{T} \hat{F}' T$. Matrix F is the fundamental matrix corresponding to the original data.

This is how the eight-point algorithm works.
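A compact NumPy sketch of these three steps (assuming pts1 and pts2 are Nx2 arrays of matched points with N >= 8; this is an illustrative implementation, not the OpenCV one used in the study):

    import numpy as np

    def normalizing_transform(pts):
        # Translate to zero mean and scale so the mean distance from the
        # origin is sqrt(2); returns the 3x3 transform T.
        mean = pts.mean(axis=0)
        scale = np.sqrt(2) / np.sqrt(((pts - mean) ** 2).sum(axis=1)).mean()
        return np.array([[scale, 0, -scale * mean[0]],
                         [0, scale, -scale * mean[1]],
                         [0, 0, 1]])

    def eight_point(pts1, pts2):
        T1, T2 = normalizing_transform(pts1), normalizing_transform(pts2)
        h1 = np.column_stack([pts1, np.ones(len(pts1))]) @ T1.T   # normalized x_i
        h2 = np.column_stack([pts2, np.ones(len(pts2))]) @ T2.T   # normalized x'_i
        x, y = h1[:, 0], h1[:, 1]
        xp, yp = h2[:, 0], h2[:, 1]
        # Linear system of equation (3.1.3).
        M = np.column_stack([xp*x, xp*y, xp, yp*x, yp*y, yp, x, y, np.ones(len(x))])
        _, _, Vt = np.linalg.svd(M)
        F = Vt[-1].reshape(3, 3)                  # singular vector of smallest value
        U, S, Vt = np.linalg.svd(F)
        F = U @ np.diag([S[0], S[1], 0.0]) @ Vt   # enforce the rank-2 constraint
        return T2.T @ F @ T1                      # denormalization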

3.1.3. The RANSAC Algorithm.

The RANSAC algorithm is used to automatically compute the fundamental matrix. The algorithm is used as a search engine. Here are the steps of the algorithm:

(1) Interest Points: Compute interest points in each image.
(2) Putative Correspondences: Compute a set of interest-point matches based on proximity and similarity of their neighborhoods.
(3) RANSAC Robust Estimation:
• Use the seven-point algorithm on a random sample to solve the fundamental matrix, which may have one solution or three.
• Compute the distance d of each putative correspondence.
• Count the number of inliers with respect to F, i.e., the number of correspondences for which d < t pixels.
• If there are three real solutions for the fundamental matrix, compute the number of inliers for each; the solution with the most inliers is kept.
(4) Non-linear Estimation: Re-estimate F from all correspondences classified as inliers by minimizing a cost function.
(5) Guided Matching: Further corresponding points are determined using the estimated fundamental matrix to define a search strip.

3.2. Solving Rotation and Translation from Fundamental Matrix.

Solving for the rotation and translation is the part implemented in this study. There are established methods that can solve for all the parameters, and they have been used to fulfill this goal. Here are the steps for solving the rotation matrix and translation vector.[1]

The first step is to find the essential matrix, which is equal to:

(3.2.1) $E = A^{T} F A,$

where:

• E is the essential matrix.


• A is the camera matrix.
• F is the fundamental matrix.

The essential matrix is a 3 by 3 matrix with two singular values that are the same and a third that is zero. A matrix that fails to be an essential matrix can arise during image capture: if there is much noise between the two images, the singular values of the computed essential matrix will differ from this pattern.

Secondly, perform the Singular Value Decomposition (SVD) on the essential matrix. The output from the SVD of the essential matrix is:

(3.2.2) $E = U \operatorname{diag}(1, 1, 0)\, V^{T}$

Matrix W must be used to solve the rotation and translation. It is defined as:

$W = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
Both the rotation and translation have two possible solutions.

The rotation matrix is equal to:

(3.2.3) $R = U W V^{T}$

or

(3.2.4) $R = U W^{T} V^{T}$

The translation vector is equal to:

(3.2.5) $t = \pm u_3$

where $u_3$ is the third column of U. There are a total of four possible solutions, combining the two possible rotations with the two possible translations. Among all of them, only one is feasible. Here are the four possible solutions:

$P' = [\,U W V^{T} \mid +u_3\,]$ or $P' = [\,U W V^{T} \mid -u_3\,]$
$P' = [\,U W^{T} V^{T} \mid +u_3\,]$ or $P' = [\,U W^{T} V^{T} \mid -u_3\,]$
3.3. Rodrigues’ Rotation Formula.

Rodrigues' rotation formula was used in this study to convert a rotation matrix to a rotation vector[20] in order to find the rotation angle around one axis. The rotation angle was used to estimate the translation vector. Rodrigues' rotation formula is a two-way transformation. The steps for converting a rotation vector to a rotation matrix are:

• Let $\theta = \lVert r \rVert$.
• Normalize r by θ: $r = r/\theta$.
• Then $R = \cos(\theta)\, I + (1 - \cos(\theta))\, r r^{T} + \sin(\theta) \begin{bmatrix} 0 & -r_z & r_y \\ r_z & 0 & -r_x \\ -r_y & r_x & 0 \end{bmatrix}$

The inverse transformation is much easier:

$\sin(\theta) \begin{bmatrix} 0 & -r_z & r_y \\ r_z & 0 & -r_x \\ -r_y & r_x & 0 \end{bmatrix} = \dfrac{R - R^{T}}{2}$
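In practice, this conversion is one OpenCV call in each direction. A small sketch (the 5-degree angle mirrors the indoor rotation used in Chapter 6):

    import cv2
    import numpy as np

    theta = np.deg2rad(5.0)
    rvec = np.array([[0.0], [theta], [0.0]])        # axis * angle: 5 degrees about y

    R, _ = cv2.Rodrigues(rvec)                      # rotation vector -> 3x3 matrix
    rvec_back, _ = cv2.Rodrigues(R)                 # 3x3 matrix -> rotation vector
    print(np.rad2deg(np.linalg.norm(rvec_back)))    # recovers the 5-degree angle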

4. Localization of the Objects in the Indoor Area

In this section, the purpose is to construct the stereo vision system and test it in an enclosed area where the distances between the objects and the camera are known by physical measurement, and where there are no shadows from the fluorescent lights. The stereo vision system has errors; in this enclosed area the errors can be found, which gives an opportunity to find methods to improve the system. This is the reason why the system is tested in the enclosed area before being used in the open area.

In order to localize the objects, all pinhole camera and stereo camera parameters have to be computed. Those parameters include the camera matrix of the pinhole camera, the rotation matrix, and the translation vector. After computing all the parameters, the next step is to construct the stereo vision system, take plenty of images to test it, and find methods to make the error as low as possible. With a very low error, the system can be used for fruit localization.

4.1. Camera Calibration.

The first step is the camera calibration of the iPhone 4s. The camera calibration application from OpenCV was used for this purpose. Normally, the OpenCV camera calibration application uses the computer or laptop camera to capture a video of a chessboard. The video is cut into frames, and capturing stops when the number of frames reaches twenty-five. In each frame, the application can recognize and draw the chessboard and save the image points of each corner.

Figure 8 shows a chessboard that has been recognized by the camera calibration application. This is an 8 by 8 chessboard, and each square is 24 mm. To run the application, the chessboard information, including the chessboard size (width, height) and the individual square size, has to be entered manually.

Figure 9. Chessboard

The application only recognized the chessboard in five of the images; the chessboard is in a different position in each of them. These five images are enough for calibration.

The camera calibration computes the pinhole camera model function with the image points and the world points as known variables. The application can find the image points of each corner, and it assumes the world coordinates of the corners are the same in every image. It starts from the top-left corner of the chessboard, whose world coordinate point is (0, 0, 0); the one to its right is (24, 0, 0); and the next one is (48, 0, 0). 24 millimeters, the size of one square, is used as one unit of the world coordinate points.

Checking the accuracy of the camera matrix is important in this step, and the reprojection error provides this check. In fact, only a few points are needed to solve the camera matrix. However, solving it well requires a great number of points from different images, averaging the results from all of those points. This is why the default setting of the application is twenty-five frames. With this averaged result, the reprojection error can be smaller than one.

The reprojection error is provided by the calibration application. However, it is the average error over the whole set of points. The reprojection error can also be obtained manually, using the pinhole camera model formula to check a single point instead of the whole set. Plugging the solved camera matrix, rotation matrix, translation vector, and world coordinate point into the formula gives an image point. Then two kinds of image points are compared: one from the chessboard-corner detection function and the other from reprojection. The difference between them is the reprojection error. This error-checking procedure will be used later for testing the stereo vision system.
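A sketch of this single-point check using cv2.projectPoints (the pose and the two-corner object array are placeholders for illustration; the camera matrix is the one computed below):

    import cv2
    import numpy as np

    objp = np.array([[0.0, 0.0, 0.0], [24.0, 0.0, 0.0]], np.float32)  # two corners
    rvec = np.zeros((3, 1))                     # placeholder pose from calibration
    tvec = np.array([[0.0], [0.0], [500.0]])    # placeholder pose from calibration
    A = np.array([[584.638, 0.0, 239.5],
                  [0.0, 584.638, 319.5],
                  [0.0, 0.0, 1.0]])
    dist = np.zeros(5)                          # assume negligible lens distortion

    reproj, _ = cv2.projectPoints(objp, rvec, tvec, A, dist)
    # Compare each reprojected point against the detected corner; a difference
    # below one pixel indicates the parameters are acceptable.
    print(reproj.reshape(-1, 2))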

The camera matrix is equal to:

$A = \begin{bmatrix} 584.638 & 0 & 239.5 \\ 0 & 584.638 & 319.5 \\ 0 & 0 & 1 \end{bmatrix}$

The camera matrix plays a major part in the following steps of this research study. The camera matrix shows the properties of the camera; the principal point is the center of the image. For fruit localization, the size of the images should be the same as that of the calibration images.

4.2. Constructing the Stereo Vision System.

After obtaining the correct iPhone 4s camera matrix, the iPhone 4s is used to build the stereo vision system. In order to make a stereo system with only one camera, the Galileo is used in this process. With the Galileo, the iPhone can rotate 360 degrees around three axes. Using one camera and rotating it to take images in different locations makes the stereo vision system; in this case, two images are the minimum requirement. The first camera coordinate frame is set as the world coordinate frame, so its rotation matrix is an identity matrix and its translation vector is 0. The rotation matrix and translation vector of the second image represent the actual movement of the Galileo.

Galileo is programmable and can be controlled by an iOS application. The application for constructing the stereo vision system controls both the camera and the Galileo. The application can take images and save them in the photo album. Controlling the camera can be done through the OpenCV framework. The Galileo can be controlled by rotation angle or rotation speed. The rotation of the Galileo is known, so the rotation matrix and the translation vector can be estimated by geometry.

Normally, in a stereo camera, there is only a translation between the lenses. This special stereo camera, by contrast, can rotate to any angle. The rotation of the Galileo can be controlled by the application. The Galileo only rotates around one axis, which changes only one rotation angle. However, the translation is hard to trace; the simple way to find it is to estimate it by geometry. The rotation is set only on the y-axis, so the translation is effectively a 2D vector, and it can be constructed as the third side of a triangle with two known sides and one known angle. In order to estimate the translation, two variables have to be considered. The first is the value of the rotation angle, which is controlled by the application. The second is the distance between the rotation center and the camera center. These two variables form a triangle whose third side is the actual translation. With one angle and two sides, the third side is easy to find. The estimated value is supposed to be close to the real translation. However, the estimation is based on a perfect rotation, which is hard to accomplish in the experimental process.
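This triangle is isosceles: both known sides have the length of the rotation-center-to-camera-center distance, with the rotation angle between them, so the third side is the chord $2r\sin(\theta/2)$. A small sketch (the 5 cm offset is an assumed value for illustration):

    import numpy as np

    def estimate_translation_length(radius_cm, angle_deg):
        # Two sides of length radius_cm (rotation center to camera center)
        # enclosing the rotation angle; the third side is the chord traced
        # by the camera center, i.e., the translation magnitude.
        theta = np.deg2rad(angle_deg)
        return 2.0 * radius_cm * np.sin(theta / 2.0)

    # Example: a 5-degree pan with the camera center 5 cm from the rotation axis.
    print(estimate_translation_length(5.0, 5.0))    # about 0.436 cm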

The estimate is used for improvement of the stereo vision system. OpenCV has a function that can show the position change between images taken by a stereo camera. However, this function returns multiple cases, all of which are mathematically correct; only one is the real case. With the estimated rotation and translation, the real case can be decided. This is the purpose of the geometric estimation.

4.3. Finding Corresponding Points.

Corresponding points are used to find the fundamental matrix as a set of parameters. OpenCV has multiple algorithms to solve the fundamental matrix. In order to use one of them, at least seven pairs of corresponding points from a pair of images are needed to compute the fundamental matrix. The corresponding points are the image points of the same objects in different images. With a large number of corresponding points, the relationship between a pair of images can be solved. Corresponding points can be found by human observation. However, human observation introduces a small error in the image points, and this small error causes a huge error in the resulting fundamental matrix. Human observation is therefore not the best way to find the corresponding points.

The algorithm called Speeded-Up Robust Features (SURF) can find corresponding points between two images. SURF takes only the two images as inputs and returns the corresponding points. The algorithm chooses feature points on the first image and finds the corresponding points on the second image. Those points are floating-point numbers instead of integers, due to the calculations of the algorithm.
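A sketch of this matching step in Python (SURF lives in the contrib modules of modern OpenCV builds, while OpenCV 2.4 exposed it as cv2.SURF; the file names are placeholders, and Lowe's ratio test is added here to filter matches, which the text above does not describe):

    import cv2

    img1 = cv2.imread('view1.png', cv2.IMREAD_GRAYSCALE)   # hypothetical file names
    img2 = cv2.imread('view2.png', cv2.IMREAD_GRAYSCALE)

    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(img1, None)
    kp2, des2 = surf.detectAndCompute(img2, None)

    # Match descriptors and keep only the clearly-best matches (ratio test).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]

    # Sub-pixel (floating-point) coordinates of the corresponding points.
    pts1 = [kp1[m.queryIdx].pt for m in good]
    pts2 = [kp2[m.trainIdx].pt for m in good]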

4.4. Finding the Rotation Matrix and Translation Vector.

The rotation matrix and translation vector are the most important parameters of this stereo vision system. Neither of them is fixed or known, which requires calculating both of them for each image capture.

Finding the rotation matrix and translation vector is the next step after solving the fundamental matrix, and it is needed every time new images are captured. However, the OpenCV 2.4 version used in this study does not have any functions to solve them; it can only solve the fundamental matrix. Implementing this method was therefore necessary. The method for finding the rotation matrix and translation vector was presented in the Mathematical Review.

There are a total of four combinations of rotation and translation, and all of them are mathematically correct. Because the rotation is controlled by the Galileo, applying Rodrigues' rotation formula to each rotation matrix shows the rotation axis and angle, which eliminates two combinations. The translation vector can then be checked against the rotation using the geometric measurement; the only difference between the two remaining translation vectors is sign. The rotation shows the directions on the three axes, and these directions show the sign of the value on each axis. However, users have to manually check which combination is correct, which does not fit a robotic system. This will be improved in the future.

All results from the SVD are normalized: the magnitude of the translation vector is one, which is not the real movement of the camera. Finding a scale factor makes the translation vector close to the real movement. The geometric estimate of the translation vector is used to find this scale factor. The geometric estimate is based on a two-dimensional rotation around the y-axis; the real translation is not confined to this plane, but the length of the estimated translation vector will be close to that of the real translation vector. The factor is used to scale the normalized vector so that its length equals the length of the geometric estimate. The real translation vector found via this scale factor is used in the next step.
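As a one-line sketch of this scaling (the unit vector below reuses the SVD translation reported in Chapter 7, and the geometric length is an assumed example value):

    import numpy as np

    t_unit = np.array([0.26579, 0.018153, -0.963857])   # unit-norm t from the SVD
    t_geom_len = 0.436                                  # geometric estimate (assumed)

    # Scale the normalized vector so its length matches the geometric estimate.
    t_real = t_unit * (t_geom_len / np.linalg.norm(t_unit))
    print(t_real)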

4.5. Triangulation and Improvement.

Triangulation is the method for finding the location of objects with one pair of images. Triangulation uses one pair of corresponding points from the two images, together with the projection matrices of both images, to solve for a third point: the world location of the corresponding points.

Triangulation therefore requires one pair of corresponding points and the projection matrices of the two images. The projection matrix is the product of the camera matrix and the rotation matrix with the translation vector. The first image frame is defined as the world coordinate frame, which sets its rotation matrix to a 3x3 identity matrix and its translation vector to 0. The rotation and translation of the second image are the results of the previous procedure. With the two projection matrices and a pair of corresponding points, the stereo vision system is able to find the location of the object.

The triangulation uses the function from OpenCV, which requires exact corresponding points; even a one-pixel error in the points may produce an enormous error in the results. Because the calculation methods are fixed, the improvement comes from the image capturing. Only two images are used for one triangulation, so the improvement is to capture multiple images of one object under different rotations. The first image frame is still set as the world coordinate frame. The next step is performing the triangulation between the first image and each of the others. The final step is taking the average distance from all of them.
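A sketch of one such triangulation with cv2.triangulatePoints, using the camera matrix from Section 4.1 and the first point pair from Table 1; the second pose here is an assumed placeholder, so the printed location is illustrative, not a result of this study:

    import cv2
    import numpy as np

    A = np.array([[584.638, 0.0, 239.5],
                  [0.0, 584.638, 319.5],
                  [0.0, 0.0, 1.0]])
    P1 = A @ np.hstack([np.eye(3), np.zeros((3, 1))])     # first frame = world frame
    P2 = A @ np.hstack([np.eye(3), np.array([[5.0], [0.0], [0.0]])])  # assumed pose

    def triangulate(P2, pt1, pt2):
        # pt1, pt2: one corresponding pixel pair; returns the Euclidean 3D point.
        X = cv2.triangulatePoints(P1, P2,
                                  np.float32(pt1).reshape(2, 1),
                                  np.float32(pt2).reshape(2, 1))
        return (X[:3] / X[3]).ravel()                     # de-homogenize

    print(triangulate(P2, (411.632, 274.722), (360.351, 275.352)))  # point 1, Table 1

For the averaging improvement, this call is repeated with the projection matrix of each additional image, and the mean of the resulting distances is taken.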
5. Labeling Images

In image labeling, all objects are labeled on a single image and all information about the labels is saved. The purpose of labeling images is to find all the fruits in each image. The further goal is to use the labeling data to create and train an image classifier that can automatically find fruits in images, which will be the future work of this research. In this procedure, a thousand fruit images were manually labeled. All of the images were downloaded freely from the Internet, either from ImageNet or Google Image. Images were selected from five fruit categories: apple, orange, peach, pear, and mango, with two hundred images per category. The data of the labeled images has been saved as a dataset, which is available for download.

5.1. Software Tool for Labeling.

Images are labeled manually using the Training Image Labeler application from Matlab. The labeler application relies on a human observer manually labeling the objects in the images. The application labels each object with a bounding box, and the boxes can be adjusted to completely cover the objects. Because most fruits are not square, a bounding box also includes some noise from the leaves or other surroundings of the fruit. This is a reasonable error, and it will be considered in the future. After all the fruits are labeled, the application can export the label information. The information for a single label consists of four numbers: the image point of the top-left corner of the bounding box, and the box's width and height. Together the labels form a 4 by N matrix, where N is the total number of bounding boxes.

5.2. Labeled Image.

Labeled images are saved by using all the bounding-box information to draw the boxes on the images. Labeled images can be used to verify the labeling data: the edges of each bounding box should cover the fruit completely, and the labeled image shows whether they do.

5.2.1. Original Images.

The original images were downloaded from either ImageNet or Google Image. Those images show the diversity of fruit content: some images have many fruits; some have only one or two; some fruits are clearly visible; and some are behind other fruits or leaves, which makes them hard to find in the images.

Credit: flickr.com From: http://brokenbelievers.com/2014/09/19/the-apple-tree/

Figure 10. Complicated Apple Tree Example[19]


Figure 10 shows a complicated apple tree image example. There are over 140 visible apples on the different trees. This kind of image is really hard to label. Some fruits show only a very small part in the image, which may be smaller than the minimum size of a bounding box; this type of fruit is ignored.

Credit: flickr.com From: http://farm1.static.flickr.com/38/101398959_0d51fb9c84.jpg

Figure 11. Simple Apple Tree Example[3]

Figure 11 shows a simple apple tree example. There are only two apples in the image. These two images illustrate the diversity of the set. In addition, all the fruits are on trees, which places them in different growing states; this is another dimension of diversity. All states of the fruits are included across the two hundred images per category.

5.2.2. Labeled Images.

Using Matlab and the label information, the bounding boxes can be drawn on the images. The color of the bounding box can be changed via RGB parameters. Figures 12 to 21 show one example for each fruit.

Credit: flickr.com From: http://farm1.static.flickr.com/28/48406200_135f13a35b.jpg

Figure 12. Complicated Labeled Apple Tree[4]

Credit: flickr.com From: http://farm1.static.flickr.com/1/3139450_9b2971cb51.jpg

Figure 13. Simple Labeled Apple Tree[5]


Credit: flickr.com From: http://farm1.static.flickr.com/100/367162686_eba027ce6f.jpg

Figure 14. Complicated Labeled Orange Tree[8]

Credit: flickr.com From: http://farm3.static.flickr.com/2194/1817554650_3bd6a0657d.jpg

Figure 15. Simple Labeled Orange Tree[9]


Credit: flickr.com From: http://farm2.static.flickr.com/1074/1042414990_6dd8f3e7ba.jpg

Figure 16. Complicated Labeled Peach Tree[10]

Credit: flickr.com From: http://farm2.static.flickr.com/1224/770919709_9f84c45d39.jpg

Figure 17. Simple Labeled Peach Tree[11]


Credit: flickr.com

From: http://world-crops.com/wordpress/wp-content/uploads/029-Pear-tree.jpg

Figure 18. Complicated Labeled Pear Tree[12]

Credit: flickr.com From: http://farm4.static.flickr.com/3181/2796719945_5c910a233a.jpg

Figure 19. Simple Labeled Pear Tree[13]


Credit: flickr.com From: http://farm1.static.flickr.com/51/116367124_c68d833987.jpg

Figure 20. Complicated Labeled Mango Tree[6]

Credit: flickr.com From: http://farm2.static.flickr.com/1027/529190162_c6c53959a9.jpg

Figure 21. Simple Labeled Mango Tree[7]


5.3. Labeling Data.

Labeling data is the most significant result of the labeling procedure. The output from the Training Image Labeler application is the bounding-box information, which can be saved in Matlab-format files. This information will be used for training the image classifier. To make the further work easier, the data has been saved in multiple forms. A dataset has been created and is available for download.

The dataset contains four folders: Images, URL, Mat, and Data. The original images are saved in the Images folder; the labeled images are also saved in this folder. The labeling data is used to draw the bounding boxes on the original images to create the labeled images. Since the images were downloaded from the Internet, the URL of each image is also listed. The data from the application, the 4 by N matrix, is saved in the Mat folder. The Data folder contains text-format files. In each text file, the first line gives the number of objects in the image. Each subsequent line describes one occurrence of an object, a fruit in this context, in the image. The first four numbers are the same as in the mat file. The last number on each line indicates the type of fruit: 1 is apple; 2 is orange; 3 is peach; 4 is pear; and 5 is mango.
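A small Python sketch of reading one such text file (assuming the five numbers on each line are whitespace-separated, which the description above does not pin down):

    def read_labels(path):
        # First line: object count. Each following line: "x y w h type",
        # where type is 1=apple, 2=orange, 3=peach, 4=pear, 5=mango.
        with open(path) as f:
            count = int(f.readline())
            boxes = []
            for _ in range(count):
                x, y, w, h, fruit = (float(v) for v in f.readline().split())
                boxes.append(((x, y, w, h), int(fruit)))
        return boxes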

The dataset is available for download at this website: http://labeledimagedataset.weebly.com

6. Experimental Results

The experimental results section presents the intermediate processing parameters of the stereo vision system, along with the testing and improvement results of the stereo vision system in the enclosed area. Many images were taken and tested.

6.1. Examples of images.

The images were taken in the computer lab, and all the distances between the camera and the objects were measured with a ruler and other measurement tools. Figure 22 shows two examples of the images. The first image frame is set as the world coordinate frame. The second image was taken after a five-degree rotation around the y-axis. Galileo allows the iPhone to rotate around three axes; however, this makes it really hard to trace the translation, so rotating around one axis is the best situation for image capturing.

Figure 22. First Two Images


6.2. Finding the Corresponding Points.

The corresponding points image (Figure 23) shows all of the corresponding points. In this procedure, most of those corresponding points are chosen as target objects to triangulate and test the accuracy of the stereo vision system. Table 1 shows all the corresponding points that were used as target objects. Those points are results of the SURF algorithm, so every point is a pair of floating-point numbers.

Table 1. Corresponding Points for Triangulation

Point Number Image 1 Image 2


1 (411.632, 274.722) (360.351, 275.352)
2 (115.635, 160.34) (64.5039, 157.872)
3 (352.989, 355.993) (305.365, 355.137)
4 (337.178, 366.111) (289.246, 365.329)
5 (224.645, 158.058) (176.645, 157.713)

6.3. Triangulation.

The purpose of the triangulation is to test the accuracy of the stereo vision system in an indoor area. In this procedure, choosing the points is the first step. The next step is measuring the distance between the objects in the world and the camera. The last step is triangulating with the two image points and comparing the result with the real distance. However, it is hard to measure all the (x, y, z) values; the easiest and most accurate way is to check only one value along with the reprojection error.

The corresponding points image shows all the corresponding points. In the triangulation procedure, some of those points were chosen as the target objects to triangulate for testing the accuracy.

Figure 24 shows the first pair of corresponding points. The first image, which is set up as the world coordinate frame, is on the left. The green dot on both images
7. Localization of Fruits

The last step is testing the stereo vision system in a real fruit field. This involves
setting up a simulation environment. All fruits in the set-up environment have known
positions relative to the camera.

7.1. Simulation Environment Set-Up.

Figure 31 shows the set-up simulation environment.

Figure 31. Fruit Environment Set-Up

The simulation environment was set up based on a few requirements, which are:

• Images taken in this environment should be the same as in a real fruit field.
• All fruits should have known positions relative to the camera.

In the simulation environment, fruits were hung on trees with tape, which meant the chosen trees had to be short. However, in the images the trees had to look like real, tall trees. To accomplish this, the stereo vision system had to be close enough to the plants so that the images would show just leaves. Figure 32 shows one image taken in this environment. Ninety-five percent of this image is leaves, making it look as if the camera is surrounded by leaves. The purpose of this set-up is to reduce the noise from non-tree objects.

Figure 32. Image from the Simulation Environment

After choosing the position of the camera, fruits were hung on the plant. Apples were chosen as the target fruits. Every apple in this environment had to have a known position relative to the camera. It was hard to find the position of the fruits with the camera center as the coordinate origin: the fruits hung on the tree and the camera sat on a tripod, and both were in the air, which made it difficult to measure the distance between them. To fulfill the requirement that "All fruits should have known positions relative to the camera" efficiently, a point on the ground was chosen as the origin of the coordinate frame. Each apple's position relative to the origin was determined, and so was the camera center's position relative to the origin. Every apple had a rope attached to it; the rope hung from the fruit to the ground, marking the fruit's perpendicular position relative to the ground. The length of the rope is the y-axis value, and with this value it is easy to measure the other two values, so every apple in this set-up has a known position. Figure 33 shows how all the apples were set up. In this figure, eight apples have been attached to the bush. For convenience of processing, each apple was labeled with an Arabic numeral.

Figure 33. Fruits Set-up

After setting up all the apples, the position of each apple and of the camera relative to the origin can be measured. Table 3 shows the values of all the positions. All apples were set up at the same distance from the camera except for apple 5: the branch of apple 5 dropped a little under the weight of the apple, which reduced the distance by about 10 cm. It also covers most of apple 6; however, this circumstance would also happen on a real apple tree. The x and y values include the size of the apples, while the z values are measured from the side of the apple closest to the camera. All position values were measured with a steel tape, which could have introduced some errors due to the properties of the steel tape. These errors will be considered in the following steps.

Table 3. Position of Apples and the Camera


Objects X-axis(cm) Y-axis(cm) Z-axis(cm) Location in Camera Coordinate(cm)
Apple 1 88 to 93 117 to 122 150 (−35 ∼ −30; −44 ∼ −39; 150)
Apple 2 74 to 79 114 to 119 150 (−21 ∼ −16; −41 ∼ −36; 150)
Apple 3 79 to 84 58 to 64 150 (−26 ∼ −21; 14 ∼ 20; 150)
Apple 4 63 to 68 96 to 101 150 (−10 ∼ −5; −23 ∼ −18; 150)
Apple 5 46 to 52 55 to 60 140 (6 ∼ 12; 18 ∼ 23; 140)
Apple 6 45 to 50 37 to 43 150 (8 ∼ 13; 35 ∼ 41; 150)
Apple 7 40 to 45 129 to 134 150 (13 ∼ 18; −56 ∼ −51; 150)
Apple 8 23 to 28 69 to 76 150 (30 ∼ 35; 2 ∼ 9; 150)
Camera 58 78 0 (0; 0 ; 0)

The values in Table 3 are relative to the origin, while the triangulation results are relative to the camera center. It is therefore necessary to translate the positions in Table 3 into positions relative to the camera center by subtracting the camera position. The coordinate frame is the next consideration in preparing the simulation environment. Consistent with the previous translation results, the coordinate frame is defined so that the x-axis points from left to right, the y-axis points from top to bottom, and the z-axis points out from the camera. Under these conventions, all coordinate points of the apples in the camera frame can be solved. These coordinate points were used for testing the accuracy of the stereo vision system in this simulation environment.
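A small sketch of this ground-frame-to-camera-frame conversion (following the numbers in Table 3, the measured x and y run opposite to the camera axes, so the camera-frame x and y are the camera value minus the apple value, while z is unchanged; the helper below is illustrative):

    import numpy as np

    camera = np.array([58.0, 78.0, 0.0])       # camera position from Table 3 (cm)

    def to_camera_frame(apple_xyz):
        x, y, z = apple_xyz
        # x and y flip about the camera position; z (depth) is unchanged.
        return np.array([camera[0] - x, camera[1] - y, z])

    print(to_camera_frame((88.0, 117.0, 150.0)))   # apple 1 -> about (-30, -39, 150)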

7.2. Fruit Images.

A total of eleven images were taken in this procedure. Unfortunately, due to an iOS development license issue, the Galileo control application did not work properly, so the rotation angle of the Galileo was unknown during the image capturing. The rotation angle, which is used for estimating the translation vector, can instead be calculated from the fundamental matrix in the following procedure. Every triangulation needs only one pair of images.

Figure 34. A Pair of Fruit Images

Figure 34 is the pair of fruit images that was used to localize the fruits. The images
were taken outside, so they were affected by sunlight and other natural factors, and
their resolution is lower than that of the images taken indoors. All of these are
considered distraction factors for the triangulation.

7.3. Testing Procedure.

The testing procedure includes multiple kinds of processing: triangulating different
fruits in the same pair of images, and triangulating the same fruits in different
pairs of images. The testing procedure also includes finding the corresponding points,
which differ from the corresponding points in the previous indoor-image results. The
images were taken outside, where they depend on sunlight rather than lamplight, and
the background is much more complicated than in the indoor images. These are potential
reasons why the corresponding points look different from the previous ones.

7.3.1. Single Pair of Images Testing.

The rotation and translation were unknown variables during the image capturing
procedure. The only geometric way to estimate the translation vector is to use the
rotation to derive a scale factor for the translation vector, and then scale the
translation vector to obtain the projection matrix.
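
In OpenCV, the chain from matched points to a pose can be sketched as follows. This is
a minimal sketch rather than the exact code used here; it assumes pts1 and pts2 are
the matched image points as N-by-2 float arrays and K is the intrinsic matrix obtained
in the earlier calibration chapter:

    import cv2
    import numpy as np

    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
    E = K.T @ F @ K                                 # essential matrix from F and the intrinsics
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)  # rotation, plus translation known only up to scale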

The rotation matrix from the fundamental matrix is:

R = \begin{bmatrix}
-0.831575 & 0.008522 & -0.555346 \\
0.008389 & -0.999575 & -0.027902 \\
-0.555348 & -0.027862 & 0.831150
\end{bmatrix}

Applying Rodrigues' rotation formula, the rotation is 2.91 degrees around the y-axis,
plus small rotations around the x-axis and z-axis that can be neglected in the
following process. This angle will be used to estimate the scale factor of the
translation vector.
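
In OpenCV this angle extraction is a one-liner; continuing the sketch above,
cv2.Rodrigues converts R to an axis-angle vector whose norm is the rotation angle:

    rvec, _ = cv2.Rodrigues(R)                      # axis-angle vector; its norm is the angle in radians
    angle_deg = float(np.degrees(np.linalg.norm(rvec)))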
The translation vector from the fundamental matrix is:

t = \begin{bmatrix}
0.26579 \\
0.018153 \\
-0.963857
\end{bmatrix}
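
One plausible way to fix the unknown scale, sketched here as a hypothetical: if the
camera center sits at a known offset d from Galileo's rotation axis (a quantity not
stated explicitly in this chapter), a pure rotation by angle theta moves the camera
center along a chord of length 2 d sin(theta/2), which can serve as the magnitude of
the translation:

    def scale_translation(t_unit, angle_deg, d_cm):
        # Hypothetical: chord traced by the camera center when the mount
        # rotates by angle_deg about an axis offset by d_cm from the center.
        theta = np.radians(angle_deg)
        chord = 2.0 * d_cm * np.sin(theta / 2.0)
        return t_unit / np.linalg.norm(t_unit) * chord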

A fruit in an image spans more than one pixel, but the triangulation requires a single
pixel per fruit. The image points used for localization are therefore one-pixel points
that correspond to each other across the two images. Table 4 shows the chosen image
points, and figure 35 shows this pair of fruit images after plotting all image points.
The green dots are the chosen image points, and the size of each dot is one pixel.

Table 4. Image Points of The Fruit Image


Apple Fruit Image 1 (pixel) Fruit Image 2 (pixel)
1 (114, 176) (141, 175)
2 (169, 177) (195, 175)
3 (150, 386) (178, 383)
4 (264, 221) (291, 218)
5 (258, 440) (285, 437)
6 (295, 452) (323, 451)
7 (315, 123) (342, 118)
8 (381, 346) (410, 343)

Table 4 shows the image points of all eight apples. These eight pairs of image points
will be used for triangulation. However, some of the image points contain errors due
to the manual labeling of the images. These errors will be reduced by adjusting the
image points so that the triangulation results become more accurate.
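
The triangulation step itself can be sketched with OpenCV's triangulatePoints; a
minimal sketch continuing from the pose recovery above, assuming K, R, and a scaled t:

    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at the origin of the camera frame
    P2 = K @ np.hstack([R, t.reshape(3, 1)])           # second camera after the rotation

    pt1 = np.array([[114.0], [176.0]])                 # apple 1 in fruit image 1 (Table 4)
    pt2 = np.array([[141.0], [175.0]])                 # apple 1 in fruit image 2 (Table 4)

    X_h = cv2.triangulatePoints(P1, P2, pt1, pt2)      # homogeneous 4x1 point
    X = (X_h[:3] / X_h[3]).ravel()                     # (x, y, z) in the camera frame, in cm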

Figure 36 shows all image points plotted on both fruit images. All of them were chosen
by human observation, and in figure 36 all image points are close to the real
corresponding points. Whether they truly correspond will be determined from the
triangulation results, and they will be adjusted in the following steps.

Table 5. Triangulation Results with Errors
Apple Image 1 (pixel) Image 2 (pixel) Estimated Location in Camera Coordinates (cm)
1 (114, 176) (141, 175) (-10.0811; -11.5402; 46.81)
2 (169, 177) (195, 175) (-8.40724; -16.9499; 69.3489)
3 (150, 386) (178, 383) (-12.622; 9.59829; 82.8833)
4 (264, 221) (291, 218) (5.2889; -23.272; 138.295)
5 (258, 440) (285, 437) (4.94537; 31.7092; 152.981)
6 (295, 452) (323, 451) (13.125; 31.1513; 135.813)
7 (315, 123) (342, 118) (36.7107; -96.4751; 286.677)
8 (381, 346) (410, 343) (117.933; 22.0773; 487.318)

Table 5 shows the triangulation results from figure 35; the real distances are listed
in Table 3. Although the measured distances in Table 3 contain some measurement error,
several of the triangulation results are not close to the real distances. Two of them
are close to the correct results, while others are noticeably smaller or larger than
the real distance even with small reprojection errors. These errors are caused by
chosen points in the two images that are not exactly corresponding points.
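
The reprojection error mentioned above measures how far the triangulated point,
projected back through a camera matrix, lands from the originally chosen pixel; a
small sketch under the same assumptions as above:

    def reprojection_error(P, X, pt):
        # Project the 3-D point X with the 3x4 camera matrix P and compare
        # with the chosen pixel pt (a length-2 array), in pixels.
        x_h = P @ np.append(X, 1.0)
        return float(np.linalg.norm(x_h[:2] / x_h[2] - pt))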

The improvement included two methods: adjusting the corresponding points, and using
multiple images. After adjustment, the chosen points can be brought closer to the
exact corresponding points. However, some corresponding points between the two images
of the fruits were difficult to find; in this case, more than two images were required
to improve the results.

Table 6 shows the improved location results after adjusting the corresponding points.
It also shows the new image points, new location results, and new reprojection errors.
Some of the results improved a lot; others still hardly produced the correct results.

Table 6. Improved Results with Adjusting Image Points
Apple Image 1 (pixel) Image 2 (pixel) Estimated Location in Camera Coordinates (cm)
1 (113, 175) (141, 172) (-34.5178; -39.4676; 159.095)
2 (169, 177) (196, 174) (-18.3116; -36.9332; 151.175)
3 (147, 388) (175, 384) (-23.5082; 17.5518; 148.849)
4 (264, 221) (291, 218) (5.2889; -23.272; 138.295)
5 (258, 440) (285, 437) (4.6432; 25.7718; 143.634)
6 (295, 454) (322, 451) (15.7832; 38.1951; 165.615)
7 (314, 122) (342, 118) (18.6751; -49.3957; 146.263)
8 (382, 346) (410, 343) (36.8651; 6.86531; 151.192)

These new results show a great improvement. However, there are still some deviations
between the two sets of results, so different fruit images are required to improve the
results further. With multiple images involved, the location results will be much
better.

Table 7. Difference between Estimated and Measured Results


Apple Measured Location in World Coordinates (cm) Estimated Location in World Coordinates (cm)
1 (88 ∼ 93; 117 ∼ 123; 150) (94.7642; 120.0361; 159.095)
2 (74 ∼ 79; 114 ∼ 119; 150) (76.3116; 114.9332; 151.175)
3 (79 ∼ 84; 58 ∼ 64; 150) (81.5082; 60.4482; 148.849)
4 (63 ∼ 68; 96 ∼ 101; 150) (63.2889; 101.272; 138.295)
5 (46 ∼ 52; 54 ∼ 61; 140) (53.05463; 50.2908; 143.634)
6 (45 ∼ 50; 37 ∼ 43; 150) (42.2168; 39.8049; 165.615)
7 (40 ∼ 45; 129 ∼ 134; 150) (39.3249; 127.3957; 146.263)
8 (23 ∼ 28; 69 ∼ 76; 150) (21.1349; 71.13469; 151.192)

Table 7 shows the measured locations in the world coordinate system together with the
estimated locations. The estimated results are in the world coordinate system defined
in figure 31. Table 8 shows the errors along all three axes between the measured and
estimated locations.

Table 8. Relative Estimation Error per Axis


Apple x-axis (%) y-axis (%) z-axis (%)
1 4.71 0 6.06
2 0 0 0.78
3 0 0 0.76
4 0 2.81 7.8
5 8.27 12.54 2.59
6 11.12 0 10.41
7 7.47 3.12 2.49
8 17.11 0 0.79
Average Error 6.085 2.30875 3.96
The relative errors show the accuracy of the results. Because the measured x-axis and
y-axis values are ranges of distances, the error is taken to be 0 if the estimate
falls within the range; otherwise, the midpoint of the range is used to compute the
error.
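
Written out, the rule is a short function; a minimal sketch (the function name is
illustrative):

    def relative_error(estimate, lo, hi):
        # Zero inside the measured range; otherwise the percent error against
        # the midpoint of the range, as described above.
        if lo <= estimate <= hi:
            return 0.0
        mid = (lo + hi) / 2.0
        return abs(estimate - mid) / mid * 100.0

    print(round(relative_error(94.7642, 88.0, 93.0), 2))  # 4.71, matching apple 1's x-axis error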

The largest x-axis error is 17.11% and the smallest is 0. The largest y-axis error is
12.54% and the smallest is 0. The 17.11% error looks large, but in this case the
absolute error is only about 4 cm against a distance of 25.5 cm, which inflates the
relative error. The largest z-axis error is 10.41% and the smallest is only 0.76%.
Since the z-axis values are the largest of the three axes, the small z-axis percentage
errors suggest that the system is more accurate when the target fruits are far from
the camera.

7.3.2. Multiple Images Testing.

The second test uses the first image and a new fruit image. Figure 37 presents this
pair of fruit images. The process is the same as before. The rotation between these
two images was 0.768237 degrees, which is much smaller than in any previous test. The
image points from the first image were reused from the previous section; the new image
had to be labeled.

Figure 37. Second Fruit Test

Table 10 presents the new corresponding points between the two images together with
the triangulation results, and figure 38 shows the plotted fruit images.

Table 10. Image Points of Apple from the New Pair of Images
Apple Fruit Image 1 (pixel) Fruit Image 3 (pixel) Estimated Location in Camera Coordinates (cm)
1 (113, 176) (108, 175) (-34.2741; -39.0662; 159.367)
2 (169, 177) (163, 176) (-16.6177; -33.6226; 138.093)
3 (150, 386) (144, 385) (-24.2837; 17.8646; 158.276)
4 (264, 221) (258, 220) (-4.7426; -19.1805; 143.728)
5 (268, 440) (262, 439) (6.00727; 25.6438; 135.171)
6 (295, 452) (289, 451) (11.3152; 35.1012; 140.271)
7 (315, 123) (309, 122) (12.53838; -47.1414; 145.907)
8 (381, 346) (375, 345) (30.805; 4.75034; 147.034)

Table 11. Relative Estimation Error per Axis


Apple Estimated Location in World Coordinates (cm) x-axis (%) y-axis (%) z-axis (%)
1 (92.2741; 117.0662; 159.367) 0 0 6.24
2 (74.6177; 111.6226; 138.093) 0 4.19 7.938
3 (82.2837; 60.1354; 158.276) 0 0 5.517
4 (62.7426; 97.1805; 143.728) 4.21 0 4.18
5 (51.99273; 52.3562; 135.171) 0 8.14 3.57
6 (46.6848; 43.8988; 140.271) 0 8.39 6.486
7 (45.4616; 125.1414; 145.907) 2.99 4.83 2.72
8 (27.195; 73.24696; 147.034) 0 0 1.98
Average 0.9625 3.1938 4.829

The errors from this pair of images are much more stable than the previous results.
The largest x-axis error is 4.21% and the smallest is 0. The largest y-axis error is
8.39% and the smallest is 0. The largest z-axis error is 7.938% and the smallest is
1.98%. Comparing the results across the different motions shows that a smaller
rotation angle between the two images makes the estimated results more stable.

8. Conclusions and Recommendations

8.1. Conclusion.

This research focused on the following: 1. creating and testing a stereo vision
system; 2. manually finding all fruits in images; 3. testing the stereo vision system
outdoors with real fruits and trees.

The results indicate that the stereo vision system has accurate output. The following
conclusions are based on the findings:

(1) In a stereo vision system based on the pinhole camera model and multiple
view geometry, the adjustable rotation applied between the two images during
capture affects the location results.

(2) In an indoor area, the stereo vision system works well at long distances with
a small rotation; conversely, the errors at short distances with a large
rotation are large.

(3) In the real tree area, the stereo vision system produced good results. The
background trees made finding corresponding points much more complicated than
in the indoor area, but these additional corresponding points could make the
calculation of the rotation and the translation more precise.

(4) The image labeling procedure produced significant results for one thousand
fruit images. These results not only show the fruits' information, but also
provide enough data for creating and training a fruit image classifier.

66
8.2. Recommendations for Future Study.

Future work on this research will improve the stereo vision system in two
aspects: 1. automatically finding fruits in images; 2. automatically adjusting
corresponding points:

(1) The image classifier can automatically label certain objects in images. With
a certain kind of fruit classifier, the stereo vision system can find all of that
kind of fruit in each captured image.

(2) The triangulation procedure requires exact pairs of corresponding points. In
this study, image points were adjusted manually. As an improvement, an
automatic adjusting mechanism could be fitted into this stereo vision system.
