Object Segmentation and Modelling Through Robotic

Manipulation and Observing Bilateral Symmetry
by
Wai Ho Li
BE (Hons) in Computer Systems Engineering (2004)
Monash University, Australia
Thesis
Submitted by Wai Ho Li
in fulfillment of the requirements for the Degree of
Doctor of Philosophy
Supervisor: Associate Professor Lindsay Kleeman
Associate Supervisor: Associate Professor R. Andrew Russell
Intelligent Robotics Research Centre
Department of Electrical and Computer Systems Engineering
Monash University, Australia
December, 2008
In memory of my father
Object Segmentation and Modelling Through Robotic
Manipulation and Observing Bilateral Symmetry
Declaration
I hereby declare that this submission is my own work and that, to the best of my knowledge
and belief, it contains no material previously published or written by another person nor
material which has been accepted for the award of any other degree or diploma of the
university or other institute of higher learning, except where due acknowledgment has been
made in the text. Similarly, software and hardware systems described in this submission
have been designed and implemented without external help unless otherwise stated in the
text.
Wai Ho Li
December 3, 2008
© Copyright
by
Wai Ho Li
2008
Object Segmentation and Modelling Through Robotic
Manipulation and Observing Bilateral Symmetry
Wai Ho Li
waiholi@gmail.com
Intelligent Robotics Research Centre
Department of Electrical and Computer Systems Engineering
Monash University, Australia, 2008
Supervisor: Associate Professor Lindsay Kleeman
Lindsay.Kleeman@eng.monash.edu.au
Associate Supervisor: Associate Professor R. Andrew Russell
Andy.Russell@eng.monash.edu.au
Abstract
Robots are slowly making their way into our lives. Over two million Roomba robots have
been purchased by consumers to vacuum their homes. With the aging populations of the
USA, Europe and Japan, demand for domestic robots will inevitably increase. Everyday
tasks such as setting the dining table require a robot that can deal with household objects
reliably. The research in this thesis develops the visual sensing, object manipulation and
autonomy necessary for a robot to deal with household objects intelligently.
As many household objects are visually symmetric, it is worth exploring bilateral sym-
metry as an object feature. Existing methods of detecting bilateral symmetry have high
computational cost or are sensitive to noise, making them unsuitable for robotic appli-
cations. This thesis presents a novel detection method targeted specifically at real time
robotic applications. The detection method is able to rapidly detect the symmetries of
multi-colour, transparent and reflective objects.
The fast symmetry detector is applied to two static visual sensing problems. Firstly,
detected symmetry is used to guide object segmentation. Segmentation is performed
by identifying near-symmetric edge contours using a dynamic programming approach.
Secondly, three dimensional symmetry axes are found by triangulating pairs of symmetry
lines in stereo. Symmetry axes are used to localize objects on a table and are especially
useful when dealing with surface of revolution objects such as cups and bottles.
The symmetry detector is also applied to the dynamic problem of real time object tracking.
The tracker contains a Kalman filter that uses object motion and symmetry synergistically
to track an object in real time. An extensive quantitative analysis of the tracking error is
performed. By using a pendulum to generate predictable object trajectories, the tracking
error is measured against reliable ground truth data. The performance of colour and
symmetry as tracking features is also compared qualitatively.
Using the newly developed visual symmetry toolkit, an autonomous robotic system is
implemented. This begins with giving the robot the ability to autonomously segment
new objects. The robot performs segmentation by applying a gentle nudge to an object
and analysing the induced motion. The robot is able to robustly and accurately segment
objects, including transparent objects, against cluttered backgrounds. The segmentation
process is performed without any human guidance.
Finally, the robot’s newfound segmentation ability is leveraged to perform autonomous
object learning. After performing object segmentation, the robot grasps and rotates the
segmented object to gather training images. These robot-collected images are used to
produce reusable object models that are collated into an object recognition database. Es-
sentially, the robot learns new symmetric objects through physical interaction. Given that
most households contain too many unique objects to model exhaustively, the autonomous
learning approach shifts the burden of object model construction from the human user to
the tireless robot.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation and Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Key Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 Chapter 2: Background Information . . . . . . . . . . . . . . . . . . 10
1.3.2 Chapter 3: Symmetry Detection . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Chapter 4: Sensing Objects in Static Scenes . . . . . . . . . . . . . . 10
1.3.4 Chapter 5: Real Time Object Tracking . . . . . . . . . . . . . . . . 11
1.3.5 Chapter 6: Autonomous Object Segmentation . . . . . . . . . . . . . 11
1.3.6 Chapter 7: Object Learning by Robotic Interaction . . . . . . . . . . 11
1.3.7 Chapter 8 : Conclusion and Future Work . . . . . . . . . . . . . . . 12
1.3.8 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.9 Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 Background Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Visual Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Symmetry in Human Visual Processing . . . . . . . . . . . . . . . . 13
2.1.2 Types of Symmetry . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Related Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Accurate Shape Representation . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Towards Symmetry Detection . . . . . . . . . . . . . . . . . . . . . . 17
2.2.3 Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.4 Skew Symmetry Detection . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.5 Perceptual Organization . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.6 Multi-Scale Detection Approaches . . . . . . . . . . . . . . . . . . . 20
2.2.7 Applications of Detection . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.8 SIFT-Based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Symmetry Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Fast Bilateral Symmetry Detection . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 Novel Aspects of Detection Method . . . . . . . . . . . . . . . . . . 25
3.2.2 Algorithm Description . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.3 Extensions to Detection Method . . . . . . . . . . . . . . . . . . . . 31
3.3 Computational Complexity of Detection . . . . . . . . . . . . . . . . . . . . 35
3.4 Detection Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.1 Synthetic Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Video Frames of Indoor Scenes . . . . . . . . . . . . . . . . . . . . . 37
3.4.3 Computational Performance . . . . . . . . . . . . . . . . . . . . . . . 46
3.5 Comparison with Generalized Symmetry Transform . . . . . . . . . . . . . . 48
3.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5.2 Comparison of Detection Results . . . . . . . . . . . . . . . . . . . . 48
3.5.3 Comparison of Computational Performance . . . . . . . . . . . . . . 52
3.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4 Sensing Objects in Static Scenes . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 Monocular Object Segmentation Using Symmetry . . . . . . . . . . . . . . 56
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.2 The Symmetric Edge Pair Transform . . . . . . . . . . . . . . . . . . 57
4.2.3 Dynamic Programming and Contour Refinement . . . . . . . . . . . 59
4.2.4 Segmentation Results . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2.5 Computational Performance . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Stereo Triangulation of Symmetric Objects . . . . . . . . . . . . . . . . . . 67
4.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.3.2 Camera Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3.3 Triangulating Pairs of Symmetry Lines . . . . . . . . . . . . . . . . . 69
4.3.4 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3.5 Accuracy of Symmetry Triangulation . . . . . . . . . . . . . . . . . . 75
4.3.6 Qualitative Comparison with Dense Stereo . . . . . . . . . . . . . . 77
4.3.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5 Real Time Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Real Time Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.1 Improving the Quality of Detected Symmetry . . . . . . . . . . . . . 85
5.2.2 Block Motion Detection . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2.3 Object Segmentation and Motion Mask Refinement . . . . . . . . . 89
5.2.4 Kalman Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Object Tracking Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.1 Discussion of Tracking Results . . . . . . . . . . . . . . . . . . . . . 96
5.3.2 Real Time Performance . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4 Bilateral Symmetry as a Tracking Feature . . . . . . . . . . . . . . . . . . . 99
5.4.1 Obtaining Ground Truth . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4.2 Quantitative Analysis of Tracking Accuracy . . . . . . . . . . . . . . 105
5.4.3 Qualitative Comparison Between Symmetry and Colour . . . . . . . 114
5.5 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6 Autonomous Object Segmentation . . . . . . . . . . . . . . . . . . . . . . 121
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.1.3 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.1.4 System Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.2 Detecting Interesting Locations . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2.1 Collecting Symmetry Intersects . . . . . . . . . . . . . . . . . . . . . 129
6.2.2 Clustering Symmetry Intersects . . . . . . . . . . . . . . . . . . . . . 129
6.3 The Robotic Nudge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3.1 Motion Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.3.2 Obtaining Visual Feedback by Stereo Tracking . . . . . . . . . . . . 133
6.4 Object Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.5 Autonomous Segmentation Results . . . . . . . . . . . . . . . . . . . . . . . 136
6.5.1 Cups Without Handles . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.5.2 Mugs With Handles . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.5.3 Beverage Bottles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.6 Discussion and Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . 139
7 Object Learning by Interaction . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.2 Autonomous Object Grasping After a Robotic Nudge . . . . . . . . . . . . 146
7.2.1 Robot Gripper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.2.2 Determining the Height of a Nudged Object . . . . . . . . . . . . . . 148
7.2.3 Object Grasping, Rotation and Training Data Collection . . . . . . 150
7.3 Modelling Objects using SIFT Descriptors . . . . . . . . . . . . . . 151
7.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.3.2 SIFT Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
7.3.3 Removing Background SIFT Descriptors . . . . . . . . . . . . . . . . 152
7.3.4 Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
7.4 Autonomous Object Learning Experiments . . . . . . . . . . . . . . . . . . 157
7.4.1 Object Recognition Results . . . . . . . . . . . . . . . . . . . . . . . 157
7.5 Discussion and Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . 160
8 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Appendix A Multimedia DVD Contents . . . . . . . . . . . . . . . . . . . . 177
A.1 Real Time Object Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
A.2 Autonomous Object Segmentation . . . . . . . . . . . . . . . . . . . . . . . 177
A.3 Object Learning by Interaction . . . . . . . . . . . . . . . . . . . . . . . . . 178
Appendix B Building a New Controller for the PUMA 260 . . . . . . . . 179
B.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.3 Kinematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
B.3.1 PUMA 260 Physical Parameters . . . . . . . . . . . . . . . . . . . . 181
B.3.2 Direct Kinematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
B.3.3 Inverse Kinematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
List of Figures
2.1 Bilateral and skew symmetry. . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Rotational and radial symmetry. . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Research related to bilateral symmetry detection. . . . . . . . . . . . . . . . 22
3.1 Fast symmetry – Convergent voting. . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Fast symmetry – Edge pixel rotation and grouping. . . . . . . . . . . . . . . 29
3.3 Fast skew symmetry detection. . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Unwanted symmetry detected from a horizontal line. . . . . . . . . . . . . . 33
3.5 Symmetry detection results – Synthetic images. . . . . . . . . . . . . . . . . 38
3.6 Symmetry detection result – Single symmetric object. . . . . . . . . . . . . 39
3.7 Symmetry detection result – Multiple symmetric objects. . . . . . . . . . . 41
3.8 Detecting non-object symmetry lines. . . . . . . . . . . . . . . . . . . . . . . 42
3.9 Rejecting unwanted symmetry with angle limits. . . . . . . . . . . . . . . . 44
3.10 Symmetry detection results – Challenging objects. . . . . . . . . . . . . . . 45
3.11 Detection execution time versus angle range. . . . . . . . . . . . . . . . . . 47
3.12 Fast symmetry versus gen. symmetry – Test image 1. . . . . . . . . . . . . 49
3.13 Fast symmetry versus gen. symmetry – Test image 2. . . . . . . . . . . . . 51
4.1 Overview of object segmentation steps. . . . . . . . . . . . . . . . . . . . . . 60
4.2 Object contour detection and contour refinement. . . . . . . . . . . . . . . . 63
4.3 Segmentation of a multi-colour mug. . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Object segmentation on a scene with multiple objects. . . . . . . . . . . . . 65
4.5 Stereo vision hardware setup. . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6 Test objects used in the triangulation experiments. . . . . . . . . . . . . . . 73
4.7 Example stereo data set – Multi-colour mug. . . . . . . . . . . . . . . . . . 74
4.8 Triangulation results for reflective metal can. . . . . . . . . . . . . . . . . . 75
4.9 Dense stereo disparity result – Textured bottle. . . . . . . . . . . . . . . . . 78
4.10 Dense stereo disparity result – Transparent bottle. . . . . . . . . . . . . . . 79
4.11 Dense stereo disparity result – Reflective can. . . . . . . . . . . . . . . . . . 80
5.1 System diagram of real time object tracker. . . . . . . . . . . . . . . . . . . 84
5.2 Using angle limits to reject non-object symmetry. . . . . . . . . . . . . . . . 86
5.3 Motion mask object segmentation – White bottle. . . . . . . . . . . . . . . 90
5.4 Motion mask object segmentation – White cup. . . . . . . . . . . . . . . . . 91
5.5 Generating rotated bounding boxes – Transparent bottle. . . . . . . . . . . 94
5.6 Generating rotated bounding boxes – Multi-colour mug. . . . . . . . . . . . 95
5.7 Pendulum hardware. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.8 Automatically extracted symmetry line. . . . . . . . . . . . . . . . . . . . . 102
5.9 Pendulum video – White background. . . . . . . . . . . . . . . . . . . . . . 103
5.10 Pendulum video – Red background. . . . . . . . . . . . . . . . . . . . . . . . 103
5.11 Pendulum video – Edge background. . . . . . . . . . . . . . . . . . . . . . . 104
5.12 Pendulum video – Mixed background. . . . . . . . . . . . . . . . . . . . . . 104
5.13 Example of symmetry detection under edge noise. . . . . . . . . . . . . . . . 108
5.14 White background – Sym. tracking error plots. . . . . . . . . . . . . . . . . 109
5.15 Red background – Sym. tracking error plots. . . . . . . . . . . . . . . . . . 110
5.16 Edge background – Sym. tracking error plots. . . . . . . . . . . . . . . . . . 111
5.17 Mixed background – Sym. tracking error plots. . . . . . . . . . . . . . . . . 112
5.18 White background – Histograms of sym. tracking errors. . . . . . . . . . . . 113
5.19 Red background – Histograms of sym. tracking errors. . . . . . . . . . . . . 113
5.20 Edge background – Histograms of sym. tracking errors. . . . . . . . . . . . 113
5.21 Mixed background – Histograms of sym. tracking errors. . . . . . . . . . . . 113
5.22 Hue-saturation histogram back projection. . . . . . . . . . . . . . . . . . . . 115
5.23 Effects of different backgrounds on colour centroid. . . . . . . . . . . . . . . 116
5.24 White background – Colour tracking error plot. . . . . . . . . . . . . . . . . 118
5.25 Red background – Colour tracking error plot. . . . . . . . . . . . . . . . . . 118
5.26 Edge background – Colour tracking error plot. . . . . . . . . . . . . . . . . 119
5.27 Mixed background – Colour tracking error plot. . . . . . . . . . . . . . . . . 119
6.1 Robotic system hardware components. . . . . . . . . . . . . . . . . . . . . 126
6.2 Autonomous object segmentation flowchart. . . . . . . . . . . . . . . . . . 127
6.3 The robotic nudge – Side view. . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.4 The robotic nudge – Overhead view. . . . . . . . . . . . . . . . . . . . . . 131
6.5 Video images of robotic nudge. . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.6 Workspace visualization of robotic nudge. . . . . . . . . . . . . . . . . . . . 133
6.7 Motion segmentation using symmetry. . . . . . . . . . . . . . . . . . . . . . 135
6.8 Autonomous segmentation results – Cups. . . . . . . . . . . . . . . . . . . . 137
6.9 Autonomous segmentation results – Mugs with handles. . . . . . . . . . . . 138
6.10 Autonomous segmentation results – Beverage bottles. . . . . . . . . . . . . 140
6.11 Autonomous segmentation results – Beverage bottles (continued). . . . . . . 141
7.1 Robot gripper and angle bracket. . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2 Photos of robot gripper. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.3 Detecting the top of a nudged object. . . . . . . . . . . . . . . . . . . . . . 148
7.4 Uncertainty of stereo triangulated object height. . . . . . . . . . . . . . . . 149
7.5 Autonomously collected training data set – Green bottle. . . . . . . . . . . 150
7.6 SIFT detection example – White bottle training image. . . . . . . . . . . . 153
7.7 Removing background SIFT descriptors. . . . . . . . . . . . . . . . . . . . . 154
7.8 Object recognition using learned SIFT descriptors. . . . . . . . . . . . . . . 156
7.9 Bottles used in object learning and recognition experiments. . . . . . . . . . 158
7.10 Object recognition result – White bottle (match01.png). . . . . . . . . . . 161
7.11 Object recognition result – Yellow bottle (match02.png). . . . . . . . . . . 162
7.12 Object recognition result – Green bottle (match02.png). . . . . . . . . . . 163
7.13 Object recognition result – Brown bottle (match03.png). . . . . . . . . . . 164
7.14 Object recognition result – Glass bottle (match03.png). . . . . . . . . . . . 165
7.15 Object recognition result – Cola bottle (match03.png). . . . . . . . . . . . 166
7.16 Object recognition result – Transparent bottle (match02.png). . . . . . . . 167
B.1 Overview of new robot arm controller. . . . . . . . . . . . . . . . . . . . . . 180
B.2 New stand-alone controller for the PUMA 260. . . . . . . . . . . . . . . . . 180
List of Tables
3.1 Execution time of fast bilateral symmetry detection . . . . . . . . . . . . . 46
3.2 Execution time of generalized symmetry on test images . . . . . . . . . . . 52
3.3 Execution time of fast symmetry on test images . . . . . . . . . . . . . . . . 52
4.1 Execution time of object segmentation . . . . . . . . . . . . . . . . . . . . . 66
4.2 Triangulation error at checkerboard corners . . . . . . . . . . . . . . . . . . 76
5.1 Object tracker execution times and frame rate . . . . . . . . . . . . . . . . . 99
5.2 Pendulum ground truth data regression residuals . . . . . . . . . . . . . . . 102
5.3 Pendulum symmetry tracking error statistics . . . . . . . . . . . . . . . . . 106
7.1 Object recognition results – SIFT descriptor matches . . . . . . . . . . . . . 159
B.1 PUMA 260 link and joint parameters . . . . . . . . . . . . . . . . . . . . . . 182
List of Algorithms
1 Fast bilateral symmetry detection . . . . . . . . . . . . . . . . . . . . . . . . 27
2 Symmetric edge pair transform (SEPT) . . . . . . . . . . . . . . . . . . . . . 58
3 Score table generation through dynamic programming . . . . . . . . . . . . . 61
4 Contour detection by backtracking . . . . . . . . . . . . . . . . . . . . . . . . 62
5 Block motion detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Acknowledgments
Marston Bates said that “research is the process of going up alleys to see if they are blind”.
Many thanks to my supervisor Lindsay Kleeman, who steered my research away from
the blind alleys with a constant supply of practical advice and constructive criticism.
I would like to thank my associate supervisor Andy Russell for invaluable discussions
about robotics and biological organisms. Thanks goes to Ray Jarvis, who regularly shared
his deep insights on robotics research. During our travels to conferences near and far,
Lindsay, Andy and Ray were constant sources of interesting stories spanning robotics,
biology, philosophy and comedy.
I was fortunate enough to study at the Intelligent Robotics Research Centre, which is
full of helpful and competent people. Geoff Taylor eased my transition into postgraduate
studies by answering numerous questions covering both theory and practice. In the lab,
Albert Diosi constantly humbled and inspired me with his tireless work ethic. Alan M.
Zhang was always available for lengthy discussions about robotics and algorithm design.
Konrad Schindler was ever willing to share his encyclopedic knowledge of computer vision
research. Steve Armstrong lent his time and expertise to help debug hardware issues.
Nghia, Damien, Dennis, Jay and a long list of others were always available for lunch,
which included obligatory newspaper quizzes and freeform current events discussions. I
am grateful to be amongst such friends.
Without the support of my family, my PhD studies and the completion of this dissertation
would be impossible. To my sister, Joey, thank you for reminding me that rest is an
important reprieve from thesis writing. To my parents, Peter and Mary, thank you for
nurturing my curiosity. My achievements, past, present and future, are a product of your
hard work.
The research in this thesis was supported by an Australian Postgraduate Award and the
Intelligent Robotics Research Centre at Monash University. Funding for conference travel
was also provided by the Monash Research Graduate School and the Department of Electrical
and Computer Systems Engineering. I gratefully acknowledge these sources of financial
support.
Finally, I sincerely thank everyone that offered their help and condolences after they
learned about my father’s death. You made finishing this dissertation in the wake of
tragedy much easier.
And as I look at the trends that are now starting to con-
verge, I can envision a future in which robotic devices
will become a nearly ubiquitous part of our day-to-day
lives.
Bill Gates, January 2007
1
Introduction
It seems that all the right doors are opening for intelligent robots to make their entry
into the household. This chapter’s quote comes from Bill Gates’ A Robot in Every Home
article in Scientific American magazine, where he suggests that the availability of low cost
computational power and better sensors are clearing the way for affordable and diverse
domestic robotic systems. A European survey of public opinion [Ray et al., 2008] also
observes strong consumer interest in domestic robots. The survey indicates a generally
positive attitude towards domestic robots, especially those that will help alleviate the
tedium of repetitive daily tasks such as vacuum cleaning, setting the table and cleaning
the dishes.
Roboticists are also noticing the emerging consumer demand for domestic robots. In
a 2008 workshop on Robot Services in Aging Societies [Buss et al., 2008], prominent
robotics researchers discussed the role of technology in the face of the developed world’s
aging societies. According to World Bank statistics presented by Henrik Christensen
[Christensen, 2008], the current ratio of 0.2 retired versus working individuals in the USA
will increase to 0.45 by 2040. This problem is worse in Europe and worst in Japan, with the
latter having an expected ratio of retirees versus workers of 0.6 in 2040. During the closing
discussions, a Japanese researcher also mentioned an industrial estimate of a US$23 billion
domestic robotics market in Japan by 2030. Additionally, the workshop highlighted the
need for robotic systems that can compensate for the physical and mental limitations of
the elderly and the disabled, such as helping a person dress in the morning and reminding
the user if they forget to take their medicine.
The aforementioned robotics workshop [Buss et al., 2008] also touched on the forms that
future domestic robots will take. Unlike the current crop of mobile robots designed to
vacuum the house or mow the lawn, the next generation of domestic robots will likely
incorporate robot manipulators. For example, [Christensen, 2008] shows a robot arm at-
tachment for a motorized wheelchair that can grasp common household objects. However,
many technological obstacles stand in the way of large scale consumer adoption of such
domestic robots. One of the most challenging is the ability to deal with household objects
reliably, including objects for which the robot has no prior models. For example, tasks
such as cleaning dishes and fetching a beverage demand the robust sensing and reliable
manipulation of objects, including new objects that the robot has never encountered be-
fore. The research presented in this thesis develops the visual sensing methods, object
manipulation techniques and autonomy needed by domestic robots to deal with common
household objects such as cups and bottles.
1.1 Motivation and Challenges
A domestic robot that can perform object interaction with the same grace and agility
as a person will leave roboticists in awe. The simple task of fetching a drink from a
table highlights the complex sensing and manipulations humans make on a regular basis.
The following is an example of the steps a robot will take in order to perform the same
drink fetching task. Adhering to the popular Sense-Plan-Act paradigm, the robot begins
by recognizing the target beverage amongst the objects on the table. After localizing
the beverage in three dimensions, the robot generates an action plan for grasping the
object. The planning will need to take into account the robot arm’s workspace and the
locations of obstacles in the environment. Physically stable locations on the object where
the robot’s fingers can be placed to ensure a sturdy grasp must also be calculated during
planning. Finally, the robot will perform the grasp by positioning its gripper around the
target object. This usually requires inverse kinematics calculations to find robot arm joint
angles or closed-loop servoing of the end effector. In general, the sensors external to the
robot arm are also monitored during the grasp to assess whether the object manipulation
is carried out successfully. Possible problems such as the spillage of the beverage container’s
contents, temporal changes in the environment and occlusions of the target beverage have
been ignored in the above scenario.
The complicated series of steps the robot must take to perform the drink fetching task
illustrates a fundamental challenge of robotics: the challenge of robotic naivety. A robot
knows little about anything. Robotic systems rely on human-provided knowledge to in-
terpret the sensor data they receive during online operation. For example, a mobile robot
can localize itself by comparing laser range finder readings with a metric map of the
environment. More subtly, a robot performing Simultaneous Localization and Mapping
(SLAM) relies on human-provided algorithms and constraints to build a map, to remove
inconsistencies and to close the loop. Similarly, a robot relies on human guidance in the
form of pre-programmed actions or kinematics algorithms to move its actuators in order to
perform useful actions. A major challenge addressed by this thesis is the minimization of
the a priori knowledge the robot requires while maximizing its flexibility and robustness
when performing practical domestic tasks.
Robots that deal with graspable household objects generally use vision as their primary
sensing modality. A video camera provides spatially dense information at high refresh
rates, which is ideal for cluttered indoor scenes such as the household. Also, cameras
cost less than active range finders and are able to operate at short distances where time-
of-flight sensors are unable to operate. A vision-based robotic system performing the
aforementioned drink fetching task requires a massive quantity of a priori information. For
example, an object recognition system requires object models such as 3D surface meshes
or visual features extracted from training images. As such, the drink fetching robot will
require object models for every beverage it needs to fetch in order to ensure reliable object
recognition performance. Considering studies showing that an average person encounters
several thousand unique objects in the household [Buss et al., 2008], exhaustive model
building will be very labour intensive and probably intractable. As such, the development
of model-free methods to sense household objects is an important challenge that will help
reduce a domestic robot’s dependence on a priori knowledge.
Sensing household objects for which the robot has no model appears to be a chicken-and-
egg problem. How can a robot sense an object when it does not know what the object
looks like? The trick is to tell the robot what kind of features to extract. For example,
a robot searching for tennis balls can be instructed to visually sense round objects with
a yellow hue. A useful observation for domestic environments is that many household
objects are bilaterally symmetric. This observation is especially applicable to surface of
revolution objects such as cups and bottles, which are bilaterally symmetric from many
viewing orientations. Statistically, it is rare for symmetry to occur by chance. Visual
symmetry usually indicates an object or a manually arranged symmetric constellation
of objects. Both kinds of symmetry provide useful and reliable information. However,
bilateral symmetry is rarely employed in robotic systems. This appears to be due to the
lack of a robust and computationally fast method of detection. Therefore, the author is
motivated to design and implement a fast method of bilateral symmetry detection that
can function robustly on real world images. By using symmetry, the user is freed from
having to explicitly provide training images or manually constructed object models. This is
what is meant by model-free vision. The symmetry detector also addresses the challenge
of developing model-free vision methods that will help a robot robustly deal with new
objects without a priori models.
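As a concrete illustration of the tennis ball example above, the following C++/OpenCV sketch instructs a robot to look for round, yellow-hued regions by combining an HSV colour threshold with a circular Hough transform. The hue band, blur size and Hough parameters are illustrative assumptions rather than values used in this thesis.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hypothetical helper: find roughly round, yellow-hued blobs (e.g. tennis balls).
// All thresholds below are illustrative guesses, not values from this thesis.
std::vector<cv::Vec3f> findYellowBalls(const cv::Mat& bgr)
{
    cv::Mat hsv, mask, blurred;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);

    // Keep pixels whose hue lies in a yellow-green band with reasonable
    // saturation and brightness (OpenCV hue range is 0-179).
    cv::inRange(hsv, cv::Scalar(25, 80, 80), cv::Scalar(45, 255, 255), mask);

    // Smooth the binary mask so the circular Hough transform sees clean edges.
    cv::GaussianBlur(mask, blurred, cv::Size(9, 9), 2.0);

    // Detect circular shapes in the masked image.
    std::vector<cv::Vec3f> circles;   // each circle is (x, y, radius)
    cv::HoughCircles(blurred, circles, cv::HOUGH_GRADIENT,
                     1 /*dp*/, 40 /*min centre distance*/,
                     100 /*Canny high threshold*/, 20 /*accumulator threshold*/,
                     10 /*min radius*/, 80 /*max radius*/);
    return circles;
}
```

A robot given only this colour-and-shape description can flag candidate tennis balls without a stored model of any particular ball, which is the sense of model-free sensing intended above.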
Robust object recognition methods such as Boosted Haar Cascades [Viola and Jones, 2001]
require large quantities of training data, in the order of several hundred to a thousand
images for each object. Approaches such as SIFT [Lowe, 2004] require fewer training
images but rely on manual segmentation of the target objects. Limiting the scope to
surface of revolution objects such as cups, an object is defined as a physical entity that
moves coherently when actuated. This definition allows a robot to detect and segment
objects by applying physical manipulation in parallel with model-free visual sensing meth-
ods. This approach is interesting as it departs from the norm of moving the camera while
keeping the object stationary. By actuating an object instead of the camera, a robot can
autonomously obtain visual segmentations that are representative of the physical world.
This opens up the possibility of additional autonomy such as training data collection and
object learning. As such, the visual sensing and manipulator control challenges posed by
autonomous object segmentation are inherently interesting and worth addressing.
The above challenges are further detailed below.
Fast and Robust Detection of Bilateral Symmetry
Gestalt theorists observed that symmetry is an important visual property that people use
to represent and detect objects. It is a visual property that rarely occurs by chance and
can be used for local feature extraction as well as global object description. In the field
of computer vision, symmetry detection has been an active area of research for the last
three decades. Chapter 2 provides a taxonomy of the different types of symmetry as well
as a comprehensive survey of existing detection methods. While a fast radial symmetry
detection method is available [Loy and Zelinsky, 2003], the literature survey revealed the
need for a fast and robust bilateral symmetry detection method. More specifically, many
robotic applications demand a symmetry detector that operates in real time on videos of
real world scenes captured using a low cost colour camera.
Development of Model-Free Object Sensing Methods
Given the successful development of a fast bilateral symmetry detector, model-free meth-
ods of object sensing can be designed and implemented. These methods should provide
a robot with a visual sensing toolbox to deal with symmetric objects without having to
rely on a priori information such as the colour and shape of a target object. In partic-
ular, these methods will be applicable to surface of revolution objects such as cups and
bottles that are bilaterally symmetric from many points of view. Ideally, these model-free
methods will be fast and robust so that they can meet real time requirements, which are
plentiful in robotic applications.
For a domestic robot dealing with symmetric household objects, several sensing functions
are especially important. Firstly, a model-free object segmentation method is needed.
Segmentation allows for the detection of objects as well as obtaining useful size and shape
information. Secondly, a stereo vision method is needed to localize symmetric objects in
three dimensional space. Object localization allows the use of robotic manipulation to
actuate and grasp objects. Finally, real time object tracking should be developed. Real
time tracking is needed to identify moving objects in the robot’s environment and to
determine the effects of robotic action on a target object. Overall, the model-free sensing
methods should be fast, robust and general so that they can be applied to different types
of robotic problems.
Autonomous Object Segmentation
Inspired by the work of Fitzpatrick [Fitzpatrick, 2003a], the concept of autonomous object
segmentation is further explored. Fitzpatrick’s robotic system sweeps its one-fingered
end effector across a scene to simultaneously detect and segment objects using a poking
action. Objects are discovered by visually monitoring for a sharp jump in motion caused
by a collision between the end effector and the object. Object segmentation is performed
offline using a graph cuts approach by analysing the motion in the video frames near the
time of effector-object impact. Due to the harsh nature of the object poking action, his test
objects need to be durable and unbreakable. Also, his approach sometimes includes the
end effector in object segmentations and is prone to producing near-empty segmentations.
Unlike Fitzpatrick’s accidental approach to object discovery and segmentation, the aim here
is to leverage the visual symmetry toolkit to generate a plan before applying robotic action.
Essentially, the robot produces hypotheses of object locations and then tests the validity
of these hypotheses using robotic actions. This enables the application of gentle and
controlled robotic action, which allows the manipulation of fragile objects. Additionally,
poor segmentations can be prevented by monitoring the actuated object during robotic
manipulation. This allows the robot to detect failed manipulations such as the target
object being tipped over by the robot. Also, object segmentation should be performed
online so that the robot can quickly resume visual sensing of its environment.
Object Learning by Robotic Interaction
A domestic robot requires autonomy in order to perform household tasks intelligently. The
ability to autonomously segment objects allows greater autonomy. Information gained
from autonomous object segmentation can be used to guide more advanced manipula-
tions such as object grasping. This effectively leverages a simple robotic manipulation to
perform more complex manipulations. Once an object has been grasped, training data
collection and object learning becomes possible. Object learning allows a robot to adapt
to changing environments by learning new objects autonomously. Autonomous object
learning is useful for domestic robots that have to deal with the large number of unique
objects in the average home. Ideally, the burden of object modelling will be shifted from
the researcher and user to the robot.
1.2 Key Contributions
Five conference papers have been published on the research described in this thesis. An
IJRR journal article [Li et al., 2008] covers research on bilateral symmetry detection as well
as using detected symmetry to segment objects and to perform real time object tracking.
The research contributions made by the work in this thesis are detailed below.
Fast and Robust Bilateral Symmetry Detection
The core novelty of the proposed fast bilateral symmetry detection method is its quick
execution time. Running on a 1.73GHz Pentium M laptop, the C++ implementation of the
fast symmetry detector only requires 45ms to find symmetry lines across all orientations
from a 640 × 480 input image. The detection time can be reduced linearly by restricting
the angular range of detection. At the time of writing, the proposed method is the fastest
bilateral symmetry detector available. As a point of comparison, timing trials documented
in this thesis show that the generalized symmetry transform [Reisfeld et al., 1995] is
roughly 8000 times slower than the proposed detection method. The proposed symmetry
detection method is also faster than SIFT-based approaches such as [Loy and Eklundh,
2006].
As detailed in Section 2.2, the majority of symmetry detection methods are unable to
operate on real world images. For example, experiment results presented in this thesis
show that the generalized symmetry transform [Reisfeld et al., 1995] is unable to deal
with background intensity changes. This restricts the ability of the generalized symmetry
transform to operate on real world images, which regularly have backgrounds with non-
uniform intensity. SIFT approaches rely on texture symmetry as opposed to contour
symmetry, which means physically asymmetric objects with symmetric surface patterns
are considered symmetric.
The fast symmetry detection method is able to operate on real world images by leveraging
the noise robustness of Hough transform and Canny edge detection. As fast symmetry uses
edge pixels as input, it is able to detect the symmetry lines of multi-colour, reflective or
transparent objects. The fast symmetry algorithm was first published in [Li et al., 2005].
An updated version of the detection method, with lower computational cost and greater
noise robustness is available from the author’s IJRR article [Li et al., 2008]. Details of
both publications are as follows.
• Wai Ho Li, Alan M. Zhang and Lindsay Kleeman. Fast Global Reflectional Sym-
metry Detection for Robotic Grasping and Visual Tracking. In Proceedings of Aus-
tralasian Conference on Robotics and Automation, Sydney, December, 2005.
• Wai Ho Li, Alan M. Zhang and Lindsay Kleeman. Bilateral Symmetry Detection
for Real-time Robotics Applications. International Journal of Robotics Research
(IJRR), 2008, Volume 27, Number 7, pages 785 to 814
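To make the convergent voting idea concrete, the sketch below shows a naive O(n²) version of the underlying Hough scheme: every pair of Canny edge pixels votes for its perpendicular bisector in an (r, θ) accumulator, and peaks in the accumulator correspond to candidate symmetry lines. This is only an illustration of the voting principle, not the published algorithm; the fast symmetry detector of Chapter 3 avoids the quadratic pairing cost through edge pixel rotation and grouping, and the bin counts and Canny thresholds used here are arbitrary assumptions.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Naive pairwise voting for bilateral symmetry lines, parameterized as
// x*cos(theta) + y*sin(theta) = r. Quadratic in the number of edge pixels,
// so this is for illustration only; bin sizes are arbitrary.
cv::Mat symmetryAccumulator(const cv::Mat& grey, int thetaBins = 180, int rBins = 400)
{
    cv::Mat edges;
    cv::Canny(grey, edges, 50, 150);

    std::vector<cv::Point> edgePixels;
    cv::findNonZero(edges, edgePixels);

    const double rMax = std::hypot(grey.cols, grey.rows);
    cv::Mat acc = cv::Mat::zeros(thetaBins, rBins, CV_32S);

    for (size_t i = 0; i < edgePixels.size(); ++i) {
        for (size_t j = i + 1; j < edgePixels.size(); ++j) {
            const cv::Point& a = edgePixels[i];
            const cv::Point& b = edgePixels[j];

            // The mirror line of the pair (a, b) is their perpendicular bisector:
            // its normal points along b - a and it passes through their midpoint.
            double theta = std::atan2(b.y - a.y, b.x - a.x);
            if (theta < 0) theta += CV_PI;                  // fold into [0, pi)
            double mx = 0.5 * (a.x + b.x), my = 0.5 * (a.y + b.y);
            double r = mx * std::cos(theta) + my * std::sin(theta);

            int ti = std::min(thetaBins - 1, (int)(theta / CV_PI * thetaBins));
            int ri = std::min(rBins - 1, (int)((r + rMax) / (2 * rMax) * rBins));
            acc.at<int>(ti, ri) += 1;                       // convergent vote
        }
    }
    return acc;   // peaks in acc are candidate (theta, r) symmetry lines
}
```

Because only edge pixels vote, the scheme is indifferent to the colours and textures inside an object, which is what allows multi-colour, reflective and transparent objects to be handled.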
Real Time Object Segmentation using Bilateral Symmetry
Segmentation using symmetry has been previously investigated in [Gupta et al., 2005].
Their method uses symmetry to augment the affinity matrix of a normalized cuts approach.
Normalized cuts produces accurate segmentations but has a very high computational cost,
making it unsuitable for real time applications. Additionally, the approach of Gupta et
al. assumes symmetric pixel values within an object’s contour. This assumption does not
hold for symmetric objects with asymmetric textures or when non-uniform illumination
causes shadows and specular reflections on symmetric objects.
The proposed symmetry-guided object segmentation method is fast, requiring an average
of 35ms for 640×480 pixel images. Also, the assumption of symmetric internal pixel values
is removed by using a Dynamic Programming (DP) approach that operates on edge pixels
to find near-symmetric object contours. Unlike traditional DP approaches [Lee et al., 2001;
Mortensen et al., 1992; Yu and Luo, 2002] that rely on manually specified control points
or curves, the proposed segmentation method is initialized automatically using an object’s
detected symmetry line. The symmetry-guided segmentation approach is published in the
following paper.
• Wai Ho Li, Alan M. Zhang and Lindsay Kleeman. Real Time Detection and Seg-
mentation of Reflectionally Symmetric Objects in Digital Images. In Proceedings of
IEEE/RSJ Conference on Intelligent Robots and Systems (IROS06), Beijing, Octo-
ber, 2006, pages 4867 to 4873.
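As a rough illustration of how a detected symmetry line constrains the contour search, the sketch below reflects each Canny edge pixel across the line x cos θ + y sin θ = r and keeps the pixel only if another edge pixel lies near its reflection. This is not the symmetric edge pair transform or the dynamic programming search described in Chapter 4, merely a simplified pairing test; the matching tolerance is an arbitrary assumption.

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Keep edge pixels that have a near-symmetric partner about the line
// x*cos(theta) + y*sin(theta) = r. A simplified illustration only.
std::vector<cv::Point> symmetricEdgePixels(const cv::Mat& edges,
                                           double r, double theta,
                                           int tolerance = 2)
{
    const double nx = std::cos(theta), ny = std::sin(theta);
    std::vector<cv::Point> kept;

    for (int y = 0; y < edges.rows; ++y) {
        for (int x = 0; x < edges.cols; ++x) {
            if (!edges.at<uchar>(y, x)) continue;

            // Signed distance to the symmetry line, then reflect the pixel.
            double d = x * nx + y * ny - r;
            int rx = cvRound(x - 2.0 * d * nx);
            int ry = cvRound(y - 2.0 * d * ny);

            // Search a small window around the reflected location for an edge pixel.
            bool matched = false;
            for (int dy = -tolerance; dy <= tolerance && !matched; ++dy)
                for (int dx = -tolerance; dx <= tolerance && !matched; ++dx) {
                    int px = rx + dx, py = ry + dy;
                    if (px >= 0 && px < edges.cols && py >= 0 && py < edges.rows &&
                        edges.at<uchar>(py, px))
                        matched = true;
                }
            if (matched) kept.push_back(cv::Point(x, y));
        }
    }
    return kept;   // candidate contour pixels for the symmetric object
}
```

Because only paired, near-symmetric edge pixels survive, most asymmetric background clutter is discarded before any contour is traced.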
Stereo Triangulation using Symmetry
Traditional stereo methods rely on matching corresponding features on an object’s surface
to obtain three dimensional information. The proposed symmetry triangulation approach
differs from the norm by using a structural object feature for matching. Symmetry trian-
gulation can deal with transparent and reflective objects. Stereo methods such as those
surveyed in [Scharstein and Szeliski, 2001] are unable to deal with these objects due to
their unreliable surface pixel information across stereo views.
Additionally, unlike stereo methods that triangulate features on the surface of objects,
symmetry triangulation returns an axis that passes through the inside of an object. The
triangulated symmetry axis is especially useful when dealing with surface of revolution
objects such as cups and bottles, as their symmetry axes are equivalent to their axes of
revolution. A symmetry axis can also be used to localize an object by looking for the
intersection between the axis and the table on which the object rests. The work on stereo
symmetry triangulation has led to the following publication.
• Wai Ho Li and Lindsay Kleeman. Fast Stereo Triangulation using Symmetry. In
Proceedings of Australasian Conference on Robotics and Automation, Auckland, De-
cember, 2006.
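The geometry behind symmetry triangulation can be sketched as follows, assuming calibrated 3 × 4 projection matrices are available for both cameras. Each detected 2D symmetry line back-projects through its camera to a scene plane, and the 3D symmetry axis is the intersection of the two back-projected planes. The plain C++ sketch below shows this textbook two-view construction; it is a simplified illustration rather than the exact procedure of Chapter 4.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Vec4 = std::array<double, 4>;
using Mat34 = std::array<std::array<double, 4>, 3>;   // 3x4 projection matrix

// Back-project an image line l = (a, b, c), with a*u + b*v + c = 0,
// through camera P into the scene plane pi = P^T * l.
Vec4 backProjectLine(const Mat34& P, const Vec3& l)
{
    Vec4 pi{};
    for (int j = 0; j < 4; ++j)
        pi[j] = P[0][j] * l[0] + P[1][j] * l[1] + P[2][j] * l[2];
    return pi;   // plane: pi[0]*X + pi[1]*Y + pi[2]*Z + pi[3] = 0
}

// Intersect two planes to obtain the 3D symmetry axis as point + direction.
// Returns false if the back-projected planes are (nearly) parallel.
bool intersectPlanes(const Vec4& p1, const Vec4& p2, Vec3& point, Vec3& dir)
{
    const Vec3 n1{p1[0], p1[1], p1[2]}, n2{p2[0], p2[1], p2[2]};

    // Axis direction is perpendicular to both plane normals.
    dir = {n1[1] * n2[2] - n1[2] * n2[1],
           n1[2] * n2[0] - n1[0] * n2[2],
           n1[0] * n2[1] - n1[1] * n2[0]};

    // Find a point x = a*n1 + b*n2 satisfying n1.x = -d1 and n2.x = -d2.
    double s11 = n1[0]*n1[0] + n1[1]*n1[1] + n1[2]*n1[2];
    double s22 = n2[0]*n2[0] + n2[1]*n2[1] + n2[2]*n2[2];
    double s12 = n1[0]*n2[0] + n1[1]*n2[1] + n1[2]*n2[2];
    double det = s11 * s22 - s12 * s12;
    if (std::fabs(det) < 1e-12) return false;

    double a = (-p1[3] * s22 + p2[3] * s12) / det;
    double b = (-p2[3] * s11 + p1[3] * s12) / det;
    for (int i = 0; i < 3; ++i) point[i] = a * n1[i] + b * n2[i];
    return true;
}
```

Intersecting the recovered axis with the table plane then gives the object location used for subsequent manipulation.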
Real Time Object Tracking using Symmetry
The author was the first to implement a real time visual tracking system using bilateral
symmetry as the primary object feature. By using the author’s fast bilateral symmetry
detector, the Kalman filter tracker is able to operate at 40 frames per second on 640×480
video. The high tracking speed is achieved by feeding back the tracker prediction to the
fast symmetry detector in order to limit the angular range of detection. Experiments on
ten real world videos that include transparent objects suggest high tracking robustness.
The tracker also provides a real time symmetry-refined motion segmentation of the tracked
object.
A custom-built pendulum is used to quantitatively analyse the tracking error against
reliable ground truth data. The tracking error is measured for each video frame, allowing
for a deeper understanding of the tracker’s behaviour with respect to object speed and
background noise. This departs from the normal computer vision practice of evaluating
trackers based on their success rate at maintaining a convergent tracking estimate over
a set of test videos. Additionally, a qualitative comparison is performed between HSV
colour centroid and bilateral symmetry as tracking features.
The symmetry tracker was first published as a conference paper in IROS [Li and Kleeman,
2006b]. The work is also documented in the IJRR article [Li et al., 2008], with the addition
of the quantitative error analysis and colour versus symmetry comparison.
• Wai Ho Li and Lindsay Kleeman. Real Time Object Tracking using Reflectional
Symmetry and Motion. In Proceedings of IEEE/RSJ Conference on Intelligent
Robots and Systems (IROS06), Beijing, October, 2006, pages 2798 to 2803.
• Wai Ho Li, Alan M. Zhang and Lindsay Kleeman. Bilateral Symmetry Detection
for Real-time Robotics Applications. International Journal of Robotics Research
(IJRR), 2008, Volume 27, Number 7, pages 785 to 814
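The following sketch indicates the structure of a constant-velocity Kalman filter loop of the kind described above, with the state carrying the symmetry line parameters (r, θ) and their rates, and with the orientation prediction fed back to narrow the detector's angular search range. The noise covariances, the ±10° window and the detectSymmetry() interface are illustrative assumptions, not the tracker's actual tuning or API.

```cpp
#include <opencv2/opencv.hpp>

// Hypothetical detector interface: returns the (r, theta) symmetry line found
// within [thetaMin, thetaMax]. Stands in for the fast symmetry detector.
cv::Vec2f detectSymmetry(const cv::Mat& frame, float thetaMin, float thetaMax);

void trackSymmetryLine(cv::VideoCapture& video)
{
    // State: [r, theta, r_dot, theta_dot]; measurement: [r, theta].
    cv::KalmanFilter kf(4, 2, 0);
    const float dt = 1.0f / 30.0f;                        // assumed frame period
    kf.transitionMatrix = (cv::Mat_<float>(4, 4) <<
        1, 0, dt, 0,
        0, 1, 0, dt,
        0, 0, 1,  0,
        0, 0, 0,  1);
    kf.measurementMatrix = (cv::Mat_<float>(2, 4) <<
        1, 0, 0, 0,
        0, 1, 0, 0);
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-3));
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
    cv::setIdentity(kf.errorCovPost, cv::Scalar::all(1.0));
    // In practice statePost would be initialized from an initial full-range detection.

    cv::Mat frame;
    while (video.read(frame)) {
        // Predict, then search only a narrow angular window about the predicted
        // orientation. This feedback is what keeps detection fast.
        cv::Mat prediction = kf.predict();
        float thetaPred = prediction.at<float>(1);
        const float window = static_cast<float>(10.0 * CV_PI / 180.0);

        cv::Vec2f measured = detectSymmetry(frame, thetaPred - window,
                                            thetaPred + window);
        cv::Mat z = (cv::Mat_<float>(2, 1) << measured[0], measured[1]);
        kf.correct(z);                                     // fuse the measurement
    }
}
```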
Autonomous Object Segmentation
Most active vision methods focus on actuating the camera to obtain multiple views of
a static object. Departing from the norm, the work of [Fitzpatrick, 2003a] used robotic
action to actuate objects in order to perform motion segmentation. The proposed approach
uses the same ideology of actuating objects instead of moving the camera, but differs from
Fitzpatrick’s approach as follows.
Firstly, instead of Fitzpatrick’s accidental approach to object discovery, the proposed
approach finds interesting locations to explore prior to any robotic action. This allows the
generation of an action plan that can incorporate higher level strategies such as exploring
objects nearest the camera first. Additionally, the actuation trajectory can be chosen such
that changes in object scale and orientation are minimized.
Secondly, a short and gentle robotic nudge is used to actuate a target object. This departs
from the fast sweeping action of Fitzpatrick’s robot, which requires unbreakable test ob-
jects as collisions are unpredictable and high impact. Experiments show that the robotic
nudge is able to actuate fragile and top heavy objects. The robotic nudge also has a small
workspace footprint, which allows for easy path planning and collision avoidance.
Thirdly, Fitzpatrick’s approach is prone to poor segmentations where the result is nearly
empty or contains the robot’s end effector. Examples of these poor segmentations are
shown in Figure 11 of [Fitzpatrick and Metta, 2003]. The proposed approach uses a series
of visual checks during object manipulation to prevent poor segmentations. Near-empty
segmentations are prevented by initiating stereo tracking upon detecting object motion.
This ensures that insufficient object motion or the object being tipped over does not result
in a segmentation attempt. Additionally, the proposed approach never includes the end
effector in the segmentation result as the input video images used for motion segmentation
are taken when the end effector is out of view.
Finally, the author uses a different approach to perform motion segmentation. Fitzpatrick
uses a computationally expensive graph cuts approach, which requires several seconds
of offline processing to perform object segmentation. The proposed symmetry-based ap-
proach performs motion segmentations online, taking only 80ms for a temporal pair of
1280 × 960 pixel images. This makes the approach more suited to real time applications
where the robot cannot afford to pause its visual sensing. The work on autonomous object
segmentation resulted in the following publication.
• Wai Ho Li and Lindsay Kleeman. Autonomous Segmentation of Near-Symmetric
Objects through Vision and Robotic Nudging. In Proceedings of IEEE/RSJ Con-
ference on Intelligent Robots and Systems (IROS08), Nice, September, 2008, pages
3604 to 3609
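For illustration, a heavily simplified version of motion segmentation from a temporal image pair might difference the views captured before and after the nudge (both taken with the end effector out of view), threshold the result and clean it with morphology. The sketch below shows only this baseline step; the proposed method additionally refines the motion evidence using the object's symmetry and operates on stereo imagery, and the threshold and kernel size here are arbitrary assumptions.

```cpp
#include <opencv2/opencv.hpp>

// Segment the nudged object from greyscale images taken before and after the
// nudge, both captured with the end effector withdrawn. Simplified sketch only.
cv::Mat motionMask(const cv::Mat& beforeGrey, const cv::Mat& afterGrey)
{
    cv::Mat diff, mask;
    cv::absdiff(beforeGrey, afterGrey, diff);               // per-pixel change
    cv::threshold(diff, mask, 25, 255, cv::THRESH_BINARY);  // assumed threshold

    // Morphological opening then closing to remove speckle and fill small holes.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN, kernel);
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE, kernel);
    return mask;   // non-zero where the scene changed, i.e. the moved object
}
```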
Object Learning by Robotic Interaction
The majority of robotic vision systems model objects visually by actuating the camera.
The proposed autonomous object learning approach departs from the norm by grasping
and rotating an object while the camera remains static. The following research contribu-
tions are made by the work.
Firstly, the work shows that a simple object manipulation, the aforementioned robotic
nudge, can enable the use of more advanced manipulations such as object grasping. Ex-
periments show that bottles of various shapes, both heavy and light, can be grasped
autonomously after performing object segmentation using the robotic nudge.
Secondly, a contribution is made by showing that a robot can autonomously collect useful
training data. After a successful grasp, the robot rotates the grasped object in front of its
camera to collect training images at fixed angular intervals. These images allow the entire
360 degrees of the grasped object to be modelled visually.
Finally, SIFT descriptors [Lowe, 2004] are used to build object models from robot-collected
training images. Non-object SIFT descriptors are automatically pruned to prevent their
inclusion into object models. The resulting descriptor sets are used as object models in a
recognition database. Experiments confirm that it is possible to use robot-learned object
models to perform robust object recognition.
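A plausible minimal sketch of descriptor-based model building is shown below, assuming OpenCV 4.4 or later where SIFT is part of the main module. The autonomous segmentation mask is passed to the detector so that background descriptors never enter the model, and recognition counts brute-force matches that pass Lowe's ratio test. The mask-based pruning and the 0.7 ratio are assumptions for illustration; the thesis's actual pruning and matching procedure is detailed in Chapter 7.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Build a SIFT model from one training image, keeping only descriptors that
// fall inside the object mask produced by autonomous segmentation.
cv::Mat buildObjectModel(const cv::Mat& trainingImage, const cv::Mat& objectMask)
{
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();           // OpenCV >= 4.4 assumed
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat descriptors;
    // The mask restricts detection to object pixels, pruning background SIFTs.
    sift->detectAndCompute(trainingImage, objectMask, keypoints, descriptors);
    return descriptors;                                     // one row per descriptor
}

// Count ratio-test matches between a scene image and a stored object model.
int countMatches(const cv::Mat& sceneImage, const cv::Mat& modelDescriptors)
{
    cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
    std::vector<cv::KeyPoint> keypoints;
    cv::Mat sceneDescriptors;
    sift->detectAndCompute(sceneImage, cv::noArray(), keypoints, sceneDescriptors);

    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(sceneDescriptors, modelDescriptors, knn, 2);

    int good = 0;
    for (const auto& m : knn)                               // Lowe's ratio test
        if (m.size() == 2 && m[0].distance < 0.7f * m[1].distance)
            ++good;
    return good;
}
```

At recognition time, the stored model with the most surviving matches is the natural candidate, which is the behaviour the countMatches() helper supports.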
1.3 Thesis Outline
The thesis chapters are organized in approximate chronological order. A multimedia
DVD containing images and videos of experimental results is attached to this thesis. The
contents of the multimedia DVD are detailed in Appendix A.
1.3.1 Chapter 2: Background Information
This chapter contains two sections. The first section provides an overview of visual sym-
metry and explains the different types of symmetry. This prevents confusion in future
discussions concerning symmetry detection. As symmetry is the core feature of the visual
sensing methods to follow, the second section of this chapter provides a literature survey of
bilateral symmetry detection methods. Figure 2.3 provides a timeline of research progress
on symmetry detection and how past research influences more recent endeavours. The
figure also shows how the author’s novel symmetry detection approach relates to existing
research. To prevent the reader from having to move backwards and forwards between
chapters, related work for other chapters is discussed in their respective introductions.
1.3.2 Chapter 3: Symmetry Detection
The fast bilateral symmetry detection approach finds lines of symmetry from image edge
pixels using the Hough transform and convergent voting. After discussing the algorith-
mic details of the fast symmetry approach, detection results on synthetic and real world
images are provided with accompanying discussion. The chapter also investigates the
computational performance of the detection method. Additionally, this chapter includes
a comparison between the generalized symmetry transform [Reisfeld et al., 1995] and the
fast symmetry method that evaluates their noise robustness and detection characteristics.
The chapter concludes with a comparison of the computational costs between generalized
symmetry and fast symmetry using timing trials.
1.3.3 Chapter 4: Sensing Objects in Static Scenes
This chapter details two model-free methods that rely on detected bilateral symmetry to
sense objects in static scenes. The first section details a method that uses symmetry to
guide object segmentation, automatically extracting the edge contours of objects for which
symmetry lines have been detected. The dynamic programming segmentation algorithm
and a novel image preprocessing step called the symmetric edge pair transform are detailed
in this section. Several segmentation results on real world images are also provided. The
section also investigates the computational performance of the segmentation method.
The second section of the chapter details a stereo triangulation method that makes use
of detected symmetries across two cameras to obtain three dimensional information. The
triangulation process produces three dimensional symmetry axes that are used to localize
objects on a table plane. The section also documents experiments that quantitatively
measure the accuracy of symmetry triangulation. The section concludes with a qualita-
tive comparison between dense stereo disparity and the proposed symmetry triangulation
approach.
1.3.4 Chapter 5: Real Time Object Tracking
This chapter covers research on real time object tracking using symmetry and motion. The
chapter details the Kalman filter tracker, a block motion algorithm used to reject unwanted
symmetries and a symmetry-refined motion segmentation method. The symmetry tracker
is tested on ten real world videos. Additionally, a custom-built pendulum rig is used to
quantitatively measure symmetry tracking error against predictable ground truth data.
The same pendulum rig is also used to qualitatively compare the performance of colour
and symmetry as object tracking features.
Multimedia Content
Videos of the tracking results can be found in the tracking folder of the multimedia DVD.
1.3.5 Chapter 6: Autonomous Object Segmentation
This is the first of two system integration chapters. This chapter details the use of a precise
robotic nudge to actuate an object in order to obtain its segmentation autonomously. The
robot makes use of stereo symmetry triangulation to localize symmetric objects. The real
time object tracker is applied in stereo to monitor robot-actuated objects. The chapter
includes segmentation results from twelve experiments conducted on ten different test
objects set against different backgrounds. A discussion of the experimental results is
provided at the end of the chapter.
Multimedia Content
Autonomous object segmentation results are available alongside corresponding videos of
stereo tracking and the robotic nudge from the nudge folder of the multimedia DVD.
1.3.6 Chapter 7: Object Learning by Robotic Interaction
This chapter details an autonomous robot that learns about new objects through inter-
action. By leveraging the autonomous segmentation approach from the previous chapter,
the robot gathers training data and builds object models autonomously. After object seg-
mentation, training data is collected by grasping and rotating an object to gather images
covering the entire 360-degree view of the grasped object. SIFT-models are built using the
robot-collected training images. Experiments on seven bottles show that object models
learned autonomously by the robot allow robust and reliable object recognition.
Multimedia Content
Videos and images documenting the autonomous learning process are available from the
multimedia DVD in the learning folder.
1.3.7 Chapter 8: Conclusion and Future Work
This chapter provides a summary of how the motivating challenges brought forth in the
introduction are addressed by the presented research. As future work is only briefly discussed in previous chapters, a more thorough treatment is provided in this chapter.
1.3.8 Appendix A
This appendix details the contents of the multimedia DVD and provides online URLs where some of the multimedia content is available for download.
1.3.9 Appendix B
This appendix details the design and implementation of a stand-alone motion controller for the PUMA 260 robot manipulator. The appendix includes the direct and inverse kinematic calculations for the PUMA 260 manipulator.
Research is to see what everybody else has seen, and
to think what nobody else has thought
Albert Szent-Gyorgyi
2
Background Information
2.1 Visual Symmetry
2.1.1 Symmetry in Human Visual Processing
Symmetry is one of many structural relationships humans use to interpret visual informa-
tion. Three dimensional structures, such as surfaces of revolution like cylinders and cones,
are often inferred from line drawings using symmetry. Gestalt theorists in the early 20th
century suggested a set of laws, the Pragnanz, which model the way humans group low
level visual entities, such as lines and dots, into objects. Symmetry is one of the features
suggested by these theorists as being essential for grouping low level visual entities into
objects.
The Pragnanz law of symmetry grouping states that symmetrical entities are seen as belonging together regardless of their distance. This implies that the human vision system tends to cluster symmetric entities together regardless of scale. Computer vision research, especially in areas dealing with producing human-like detection behaviour, often draws on Gestalt theories and the Pragnanz laws as motivation in the design phase. This, coupled with the fact that many man-made objects are symmetric, provides further motivation for the use of symmetry as a visual feature in robotic applications.
2.1.2 Types of Symmetry
To avoid future confusion, the taxonomy of symmetry must be explored before continuing
further. Figures 2.1 and 2.2 provide an illustrated summary of the common types of
symmetry encountered in images.
Bilateral symmetry, sometimes called reflectional symmetry, is described by a symmetry
line. Image data, such as pixel values or contour shape, is equal when reflected across
the symmetry line. Note that the terms symmetry line, mirror line, line of reflection and
reflection plane are used interchangeably in literature. Figure 2.1(a) is an example of a
bilaterally symmetric shape.
(a) Bilateral symmetry (b) Skew symmetry
Figure 2.1: Bilateral and skew symmetry. Symmetry lines are solid black. The skew symmetry shape is produced by horizontally skewing the bilateral symmetry shape by π/4. The black dots are point pairs that are symmetric about a shape's symmetry line.
Skew symmetry occurs when a pattern with bilateral symmetry is skewed by a constant
angle. In practice, skew symmetry tends to be found in images where a planar shape
with bilateral symmetry is viewed at an angle under weak perspective projection. This is
illustrated by the shapes in Figure 2.1. The black dots mark points in each shape that are
symmetric about the shape’s symmetry line. Notice that bilateral symmetry is a special
case of skew symmetry, where the skew angle is zero such that a line joining the symmetric
point pair is perpendicular to the symmetry line. Therefore, skew symmetry detectors can
also detect bilateral symmetry.
A shape has rotational symmetry if its appearance remains the same after rotation. The
order of rotational symmetry is defined as the number of times the repetition occurs during
one complete revolution. This is sometimes abbreviated to C_N, where N is the order of rotational symmetry. Mathematically, a pattern with rotational symmetry of order N will remain invariant under a rotation of 2π/N. Figure 2.2(a) is an example of a shape with rotational symmetry of order 4.
Radial symmetry, also known as floral symmetry, describes the kind of symmetry found in
actinomorphic biological structures such as flowers. Radial symmetry is the special case
where a shape has both bilateral symmetry and rotational symmetry. A shape with radial
symmetry of order N will have N symmetry lines as well as rotational symmetry of order
N. An example of a shape with radial symmetry of order 8 is shown in Figure 2.2(b).
Note that the symmetry lines of the shape are drawn as dashed lines. In essence, radial
symmetry can be seen as a special case of rotational symmetry, where the spokes are
bilaterally symmetric. Radial symmetry of order N can be abbreviated as D_N.
Circular structures are described as having radial and rotational symmetry of order infinity.
An example of this is shown in Figure 2.2(c). Because of this property, radial symmetry
detectors are often applied as circle detectors in computer vision applications such as eye
tracking.
(a) Rotational – Order 4 (b) Radial – Order 8 (c) Radial – Order ∞
Figure 2.2: Rotational and radial symmetry. Radially symmetric shapes also exhibit bilateral
symmetry. Bilateral symmetry axes are shown as dashed lines.
2.2 Related Research
This section provides an overview of research related to the detection of symmetry in
images. Emphasis is placed on bilateral and skew symmetry detection methods as well as
their shape representation predecessors. For brevity’s sake, the term detection will refer
specifically to the task of finding bilateral symmetry in images. For the sake of readability,
research related to specific applications of the author’s symmetry detector, such as object
segmentation and real time tracking, will be covered in their respective chapters.
Figure 2.3 provides a summary of bilateral symmetry detection research over the past few
decades. Arrows in the figure indicate the adaptation or application of ideas from previous
research. For example, along the right hand side of the figure, the arrow from Brooks to
Ponce highlights the fact that Ponce’s skew symmetry detection method uses the Brooks
ribbon. The detection method developed by the author is shown in bold towards the
bottom of the figure.
2.2.1 Accurate Shape Representation
Much of the pioneering work for symmetry detection was carried out by researchers from
digital image processing and medical imaging fields. These researchers were concerned
with the concise description of shapes for the automation of visual tasks such as detecting
and matching biological entities in medical images. While topology can be used to describe
the internal structure of such objects, the lack of uniqueness in the description means that
tasks such as shape matching will produce many false positives.
The inadequacy of topology for shape representation was succinctly described in [Blum
and Nagel, 1978]. In the paper, Blum and Nagel stated that "topology is so general that all silhouettes without holes are equivalent". Also, as biological structures tend to vary in shape
and size between samples, template-based matching will be inherently inaccurate. As such,
early symmetry research was focused on providing methods of shape description with
high accuracy and uniqueness while being computationally compact in terms of storage.
Methods for the detection of symmetry were simply a positive side effect from shape
representation research.
Ribbons
The medial axis transform [Blum, 1964], later republished as [Blum, 1967], was the first of
many shape representation methods proposed as an alternative to traditional topological
approaches. This transform generates a Blum ribbon that represents the internal structure
of a shape as well as allowing accurate regeneration of the shape’s contour. The Blum
ribbon is generated by sliding a circle of variable size along the interior of a shape, making
sure that the circumference is in contact with two points of the contour at all times.
Computationally, the ribbon is recorded as the loci of circle centers along with a series of
radii, one for each locus. These loci of circle centers are also known as the medial axis or skeleton of a shape.
Subsequently, a ribbon method based on sweeping a line of constant angle relative to a
shape’s skeleton, instead of a circle, was proposed as the Brooks ribbon [Brooks, 1981].
Another approach using a line touching two points on the shape’s contour with mirrored
local tangents was suggested in [Brady and Asada, 1984]. This paper by Brady and
Asada is also the first of the ribbon papers that explicitly defines a method for symmetry
detection. A summarizing analysis of the three aforementioned ribbons can be found in
[Ponce, 1990]. A multi-scale version of the Blum ribbon and the medial axis transform has also been patented [Makram-Ebeid, 2000].
In ribbons literature, the ribbon’s internal structure, such as the loci of circle centers for
the Blum ribbon, is also called a spine. The stencil used in the sweep, such as the circle
used for Blum ribbons, is called the generator. Note that, unlike bilateral symmetry lines,
the spine is allowed to have tree-like branches within an object and does not have to be
a straight line. This is why the ribbon spine is also referred to as a shape’s skeleton and
the process of obtaining it is also called skeletonisation.
Distance Transform Skeletonisation
Skeletonisation approaches using distance transform [Rosenfeld and Pfaltz, 1966] appeared
in literature near the time of Blum ribbons. Additional research on the storage efficiency
of shape representation using distance transform [Pfaltz and Rosenfeld, 1967] and the
parallelization of wavefront computations [Rosenfeld and Pfaltz, 1968] arrived in subse-
quent years. Unlike ribbons, which deal with smooth, continuous contours, the distance transform approaches deal specifically with digital images and provide discrete shape representations.
In its early incarnation, even with the parallel wavefront propagation extension, the computational cost of the distance transform was very high. The more efficient kernel-based implementation, used commonly in robotics and computer vision, was mathematically formalized two decades later in [Borgefors, 1986].
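To make the kernel-based distance transform concrete, the following minimal sketch (not part of the original thesis software) computes a chamfer-style distance map with a modern OpenCV build and marks its ridges, which approximate the shape's skeleton; the file name and the ridge test are illustrative assumptions.

    #include <opencv2/imgcodecs.hpp>
    #include <opencv2/imgproc.hpp>

    int main()
    {
        // Hypothetical input: a white shape on a black background.
        cv::Mat shape = cv::imread("shape.png", cv::IMREAD_GRAYSCALE);
        cv::Mat binary;
        cv::threshold(shape, binary, 127, 255, cv::THRESH_BINARY);

        // Kernel-based (chamfer) distance transform: each shape pixel receives
        // its approximate Euclidean distance to the nearest background pixel,
        // computed with a 3x3 mask in two passes over the image.
        cv::Mat dist;
        cv::distanceTransform(binary, dist, cv::DIST_L2, 3);

        // Ridges (local maxima) of the distance map approximate the shape's
        // skeleton, analogous to the loci of circle centres of the Blum ribbon.
        cv::Mat dilated;
        cv::dilate(dist, dilated, cv::Mat());
        cv::Mat ridges = (dist >= dilated) & (dist > 0);
        return 0;
    }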
Generalized Cones
Generalized cones [Nevatia and Binford, 1977] are a computationally efficient alternative
to the Blum ribbon for shape representation. The generalized cones approach extracts
shape skeletons by first producing piecewise linear skeletal sections at discrete orienta-
tions. Through a series of refinement and fusion steps, the skeletal sections are then
combined to form a complete skeleton. By using discrete orientations and distances in
its calculation, the method lowers computational requirements at the cost of decreased
accuracy of representation. It is conceptually similar to a piecewise implementation of the
Brooks ribbon.
2.2.2 Towards Symmetry Detection
Early detection methods assume near-ideal registration of the symmetric shape to the
image center. The input data used in experiments has low noise and is generally preprocessed by hand to enhance or extract object contours prior to detection.
Essentially, object segmentation or recognition is a mandatory preprocessing step to ensure
successful detection.
Ribbon-based Approaches
As mentioned earlier, symmetry detection came about as a byproduct of ribbons and shape
representation. [Brady and Asada, 1984] describes a ribbon generation method called
smoothed local symmetries (SLS), which can be used directly for symmetry detection by
producing a locally symmetric skeleton. Earlier, [Blum and Nagel, 1978] also suggested
a method called the symmetric axis transform (SAT). The SAT uses the medial axis of
Blum ribbons to segment objects into symmetric portions but does not explicitly detect
symmetry.
Ribbon-based shape representation methods operate on perfectly extracted contours, with
no discontinuities such as gaps. Most methods also require the tangent or gradient along
the contour. The ribbon-based methods all make use of hand labelled data, in the form
of manually extracted contours. Also note that as ribbons are generated using a param-
eterized structure swept along a curve, both SAT and SLS can produce branches in the
object’s symmetry skeleton. Of all ribbon-based approaches, the skeleton produced by
SLS is the most similar to an object’s bilateral symmetry line as detected by the author’s
method.
Hough Transform Approach
First proposed as a US patent [Hough, 1962], the Hough transform was originally designed
as a noise robust method of line detection. The method was improved by introducing a
radius-angle parameterization and extended to detect curves in [Duda and Hart, 1972].
The algorithm was further generalized in [Ballard, 1981] to detect arbitrary shapes. Bal-
lard’s method can also deal with scaling and rotation of the target shape.
Soon after Ballard’s paper, a Hough transform approach to symmetry detection was pro-
posed [Levitt, 1984]. Levitt’s method performs line detection using Hough transform on
the midpoints between pairs of input points, as well as the original input points. The
use of the Hough transform provides additional noise robustness not present in other ap-
proaches, allowing the method to handle broken contours and operate on sparse data.
Coupled with the ease of implementation and simple parameterization, many detection
methods, including the author’s, stem from Levitt’s seminal work.
Extending Levitt’s Hough transform approach, Ogawa proposed a method to find bilateral
symmetry in line drawings [Ogawa, 1991]. Apart from applying Levitt’s approach to
digital images, Ogawa also suggested the use of microsymmetry, local symmetries between
contour segments similar to those found in ribbons, to find larger scale global symmetry.
In essence, Ogawa’s work is the earliest attempt at a multi-scale detection method.
2.2.3 Other Approaches
Research targeting bilateral symmetry detection in digital images began with Atallah's seminal work [Atallah, 1985]. The paper proposed a mathematical approach, with O(N log N) minimum complexity, that can find lines of symmetry in an image with points, line seg-
ments and circles. This global detection method assumes that the input data is symmetric
and does not provide discriminatory detection to determine whether symmetry exists.
Marola’s work provides a more general detection method [Marola, 1989a], robust to slight
asymmetries in the input data. This paper also provided an important insight into sym-
metry detection. Marola suggested that symmetry lines of near-symmetric shapes will
tend to deviate from passing through the center of mass of the shape. This is certainly
true for averaging approaches based around ribbons, but fortunately does not hold true
for voting approaches such as those using Hough transform. In a separate paper, Marola
proposed a symmetry-based object detection method [Marola, 1989b]. The method func-
tioned by convolving a mirrored template of the target object with the input image. A
symmetry score is used to judge the correctness of the detection location and orientation.
The method requires prior knowledge of the object’s symmetry line.
2.2.4 Skew Symmetry Detection
Ribbon Approach
Revisiting ribbon-based approaches, a detection method for skew symmetry was proposed
in [Ponce, 1990]. Ponce’s method uses Brooks ribbons with a straight line skeleton. Neva-
tia and Binford’s method for finding generalized cones [Nevatia and Binford, 1977] was
applied to improve computational performance over ribbon-based techniques of the past.
In terms of computational complexity, the ribbon-based SLS proposed by Brady has a complexity of O(n²), whereas Ponce's method only has a complexity of O(kn), where k is the number of discrete orientations for which skeletons are produced. The variable n is the number of input data points in both cases, and the reduction in complexity is partially due to the use of midpoints in the generalized cones algorithm. However, Ponce's skew symmetry detection scheme requires manual pruning of input edge pixels.
Hough Transform Approaches
On the left of Figure 2.3, skew symmetry detection methods using Hough transform can
be found. Note that the author’s fast symmetry detection method, detailed in the next
chapter, can also be extended to detect skew symmetry. The method of [Yip et al., 1994]
uses pairs of midpoints, each formed from an edge pair, in a Hough voting process with a complexity of O(N⁴), where N is the number of input edge pixels.
The method of [Cham and Cipolla, 1994] takes a different approach based on edge groups with attached orientations, called edgels. The Hough transform is applied as a rough initial detection step, performed on the intersections of edgel gradients. Note that the detection process requires the manual fitting of B-spline curves to edge data so that edgels and their tangents are accurately discovered. This is difficult to automate, especially when the input image produces many noisy edge pixels.
These Hough transform detection methods led to an improved skew symmetry detector [Lei
and Wong, 1999], targeting bilateral planar objects under weak perspective projection.
This method combined aspects of previous work to provide an automatic detection scheme
operating on edge pixels instead of edgels, while also having better computational efficiency than the method of Yip et al. The complexity of this improved method is O(MN²), where N is the number of input edge pixels. Symmetry is detected across M discrete skew angles.
2.2.5 Perceptual Organization
The detection methods covered so far are generally limited to operating on low noise data,
in the form of hand segmented object contours or pruned edge images. Ponce was the first
to depart from manual object segmentation by using hand picked edges from a Canny edge
detector [Canny, 1986]. The first completely data driven detector was proposed for the
purposes of perceptual organization [Mohan and Nevatia, 1992]. This detector is capable
of finding local symmetries in the form of parallel edge segments without requiring any
manual preprocessing of the input data.
2.2.6 Multi-Scale Detection Approaches
With the ever increasing popularity of scale space theory, a multi-scale symmetry detection
approach called the generalized symmetry transform (GST) was proposed in [Reisfeld
et al., 1995]. This method detects bilateral symmetry using a combination of factors,
including image gradient magnitude, image gradient orientation and a distance weighting
function that adjusts the scale of detection. This scheme also provides a way to find radial
symmetry, which can be used as a corner-like feature detector at low detection scale,
similar to Harris corners [Harris and Stephens, 1988]. Section 3.5 contains a comparison
between the author’s detection approach and GST. An overview of the GST detection
steps and its computational requirements are also provided in the same section.
The symmetry distance approach [Zabrodsky et al., 1995] produces a continuous measure
of symmetry. The symmetry distance is described as the minimum distance required to
make a set of points symmetric. The difficulty of using this detection method lies in the
selection of points representative of a shape. The paper documents successful application
of this symmetry measure to the tasks of occluded shape completion and the detection of
locally symmetric regions in an image.
A multi-scale detection approach making use of probabilistic genetic algorithms for global
optimization of symmetry parameters has been proposed [Kiryati and Gofman, 1998]. This
method treats bilateral symmetry, namely the location and orientation of the symmetry
line along with the scale of symmetry, as parameters in a global optimization problem. A
United States patent [Makram-Ebeid, 2000] also proposes a multi-scale detection approach.
The method uses Blum ribbons at multiple scales to obtain a median axis for strip-shaped
objects.
While the multi-scale approaches described above are an improvement over older detection
schemes, they operate under the assumption of low input noise. The lack of noise robust-
ness exhibited by these methods makes them unsuitable for applications that suffer from sensory noise. Also, the time-critical nature of many robotic applications prohibits the use
of multi-scale methods due to their high computational costs. As such, these multi-scale
methods are rarely applied to the domain of robotics.
2.2.7 Applications of Detection
In the realm of mobile robotics, bilateral symmetry can be used to generate image signa-
tures in conjunction with dynamic programming [Westhoff et al., 2005]. The symmetry
image signature allows a mobile robot to compare panoramic images of its surroundings
with a database of images collected in the past to perform place recognition. Huebner has
since extended this approach to allow multi-scale detection [Huebner, 2007]. The author's
approach to object segmentation, described in Section 4.2, also makes use of symmetry-
guided dynamic programming to achieve a different goal.
Not included in the full-page figure due to limited space, symmetry detection can also be
used to complete occluded shapes [Zabrodsky et al., 1993] and for robust model fitting
[Wang and Suter, 2003]. Radial symmetry has been applied to the problems of road
sign detection [Barnes and Zelinsky, 2004] and eye detection [Loy and Zelinsky, 2003;
Loy, 2003]. In the domain of robotics and object manipulation, bilateral symmetry has
also been applied to the modelling of cutlery [Ylä-Jääski and Ade, 1996].
2.2.8 SIFT-Based Approaches
Detection methods for bilateral symmetry [Loy and Eklundh, 2006] and symmetry under
perspective projection [Cornelius and Loy, 2006] using mirrored SIFT features [Lowe, 2004]
have been proposed. By exploiting the affine and lighting invariance of SIFT features,
which are also highly unique, symmetry is detected robustly for noisy real world images.
These two robust detection methods have the potential to be applied to a variety of robotics applications. However, the high computational costs of SIFT detection and matching make them unsuitable for time-critical applications, although at the time of writing these costs are slowly being reduced by graphics processing unit (GPU) implementations of SIFT.
[Full-page figure: a timeline of symmetry detection research from 1964 to 2006, spanning ribbon and skeleton methods (Blum; Rosenfeld and Pfaltz; Blum and Nagel; Nevatia and Binford; Brooks; Brady and Asada; Ponce), Hough transform approaches (Levitt; Ogawa; Yip et al.; Cham and Cipolla; Lei and Wong), other detectors (Atallah; Marola; Mohan and Nevatia; Sun and Si; Shen et al.), multi-scale methods (Reisfeld et al.; Zabrodsky et al.; Kiryati and Gofman), recent applications (Westhoff et al.; Loy and Eklundh; Cornelius and Loy) and the author's fast bilateral symmetry detection using the Hough transform (2005).]
Figure 2.3: Research related to bilateral symmetry detection.
Symmetry is what we see at a glance
Blaise Pascal
3
Symmetry Detection
3.1 Introduction
In Section 2.2, a plethora of symmetry detection methods was surveyed. Most of these
methods are designed for non-robotic applications, with the majority being from the do-
mains of computer vision and medical image processing. In the proposed robotic system,
the symmetry detection results will be used to perform tracking, segmentation and stereo
triangulation of new objects, enabling robust object manipulation and autonomous learn-
ing. The following issues must be addressed by a symmetry detection method applied in
such a robotic system.
Detectability
First and foremost, bilateral symmetry must be detectable for the target objects. Ac-
cording to Nalwa’s work on line drawing interpretation [Nalwa, 1988a; Nalwa, 1988b] and
the bilateral symmetry of line drawings [Nalwa, 1989], all drawings of an orthographically
projected surface of revolution will exhibit bilateral symmetry. The bilateral symmetry
line will also coincide with the object’s projected axis of revolution. Moreover, as detailed
on pages 7 and 517–573 of [Forsyth and Ponce, 2003], an orthographic projection is
simply a special case of weak perspective projection where scaling has been normalized to
unity (or negative unity).
In practical terms, Nalwa’s work implies that an object with a surface of revolution, such
as a cup, has visually detectable bilateral symmetry when viewed from many directions. If
the symmetry line is measured from two or more view points, stereo triangulation should
be possible. The resulting three-dimensional symmetry axis will be the surface’s axis of
revolution. Deviations from the ideal surface of revolution can be treated as visual noise.
In effect, this means that bilateral symmetry can be detected as long as the robot's cameras are not too close to the test objects, so that the weak perspective assumption holds.
Real Time Operation
For robotic sensing applications such as object tracking, real time operation is essential.
Real time performance is especially important during tasks that require immediate sensory
feedback, such as object manipulation. Existing methods for bilateral symmetry detection
are unable to operate in real time on large images that have a million or more pixels
due to their high computational costs. As larger images generally provide more sensory
information and improve the upper limit of accuracy in tasks such as tracking and stereo
triangulation, symmetry detection methods applied to time-critical robotic applications
should be able to operate quickly on large input images.
Robustness to Noise and Asymmetry
In robotic applications, sensor data tend to be noisier than test image sets encountered
in computer vision. The detection method must be able to handle images taken with
robotic sensors under real world conditions. The accuracy of posterior estimates from
high level information filters, such as a Kalman filter, depends on the quality of their
input measurements. Accurate symmetry detection reduces the tracking estimate error
and limits the chance of filter divergence. Shadows and specular reflections as well as
asymmetric portions of partially symmetric objects must be dealt with robustly. For
example, the detection method should be able to detect the symmetry of a mug under
non-uniform lighting while ignoring the asymmetric mug handle.
3.2 Fast Bilateral Symmetry Detection
While existing methods of detection are capable of addressing some of the issues detailed in
the introduction, none of them can address all the issues simultaneously. For example, the
SIFT-based method of [Loy and Eklundh, 2006] can detect symmetry in real world images
very robustly but does not operate quickly enough for many real time applications. The
high computational costs of the existing bilateral symmetry detection methods appear to
be a common hindrance when trying to apply them to time-critical robotic applications.
The fast bilateral symmetry detection method was developed to remedy this situation.
Herein referred to as fast symmetry, this detection method was initially developed in
collaboration with Alan M. Zhang, who participated in the early design discussions that
led to the use of an edge pixel pairing approach. The majority of the algorithm design as
well as the entirety of implementation and experiments were performed by the author.
The fast symmetry detection method was first published as [Li et al., 2005]. Following
this publication, the author further refined the detection process in order to reduce the
computational cost of detection and to improve detection robustness. These refinements,
including an edge pixel rotation step and allowing angular limits on detection orientations,
are detailed in Section 3.2.2 and Algorithm 1. The updated version of the detection
algorithm is also available in press [Li et al., 2008].
3.2.1 Novel Aspects of Detection Method
High Detection Speed
The primary novelty of the detection method is the speed of detection. Gains in com-
putational speed are achieved at several stages of the detection method. Firstly, instead
of operating on all pixels in an image, only high gradient locations, found using an edge
detector, are used as input data. In empirical tests, the Canny edge detector [Canny, 1986]
was found to greatly reduce the input data size. For 640 × 480 images, the edge detection
step typically reduces the image data to around 10000 edge pixels.
Secondly, a novel Hough transform method using a convergent voting scheme helps reduce
the computational cost of detection. A rotation step before Hough voting further reduces
computational cost by eliminating trigonometric calculations within the algorithm’s inner
loop. The rotation step also provides a way to limit the orientation range of detection,
which linearly reduces the computational cost of detection.
Noise Robustness
Many existing detection methods rely on local image information such as gradient intensity
and orientation. While useful for synthetic images, factors such as object texture and
non-uniform lighting can severely disrupt these local features in real world situations. For
example, specular reflections can generate large gradient changes on object surfaces that
do not represent any structural symmetry present in the physical world. Also, surface
texture, such as logos on a soft drink bottle, can introduce symmetric gradients that are
independent of the symmetry of an object’s contour. To improve noise robustness, fast
symmetry only uses the location of edge pixels as input data. This has the added benefit
of reducing input data size, thereby reducing the computational cost of detection.
The use of the Hough transform to find lines of symmetry further improves the noise ro-
bustness of fast symmetry. The voting process of the Hough transform is able to ignore
asymmetric portions of roughly symmetric objects, such as a cup handle. By using a con-
vergent voting scheme, pairs of edge pixels cast single votes during Hough accumulation.
This convergent voting scheme produces sharper peaks in the Hough accumulator than
the traditional approach of casting multiple votes for each edge pixel. The quantization
of parameter space inherent to the Hough transform also provides additional robustness
against small errors in edge pixel localization.
Additional noise robustness can be gained by adjusting the parameters of the detection
method. A pair of distance thresholds govern the scale of detected symmetry. The allow-
able pairing distance between edge pixels is controlled by these thresholds. For example,
the upper threshold can be lowered to prevent the detection of inter-object symmetry. A
pair of orientation parameters can further improve detection robustness by rejecting sym-
metries with orientations outside a specified range. Further details about these parameters
can be found in Section 3.2.2 and Algorithm 1.
Application-Specific Features
As fast symmetry is targeted at robotic applications, specifically that of object segmenta-
tion, real time object tracking and stereo triangulation, application-specific features have
been added to enhance detection performance. Firstly, as mentioned earlier, orientation
limits can be applied to the detection process. Apart from providing a way to include
prior knowledge of object orientation to improve detection accuracy and robustness, these
orientation limits also provide another advantage. The method has a complexity directly
proportional to the number of discrete detection orientations. It follows that by limit-
ing detection to a small range of angles, computational efficiency of detection is vastly
improved. These orientation limits are used in the real time object tracker detailed in
Chapter 5.
The detection method also produces a global symmetry line. This is different from many
existing methods that produce local symmetries at a particular scale, such as a shape’s
structural skeleton or an analogue measure of local symmetry. In this respect, fast sym-
metry is global in that it detects bilateral symmetry as a feature representative of an entire
object contour. By detecting global symmetry lines, the method can operate on objects of
vastly different visual appearances under difficult lighting conditions. Also, fast symmetry
can detect the symmetry lines of transparent objects, textureless objects, multi-colour ob-
jects and objects with dense surface texture. Detection methods that represent symmetry
as a local feature, such as SIFT-based approaches, have difficulty with transparent and
reflective objects due to the unreliable pixel information of the object’s surface.
3.2.2 Algorithm Description
Since the first publication of the fast symmetry detection algorithm [Li et al., 2005], the
detection method has undergone many changes. The version detailed in Algorithm 1 is the
most current incarnation. It is the version used in all of the robotic experiments presented
in this thesis. The current version differs primarily in the addition of an edge pixel rotation
and grouping step. Also, the original version uses a weighting function based on local
gradient orientation to determine the voting contribution of edge pixel pairs. Empirical
tests showed that this weighting is not robust to non-uniform scene lighting. As such, it
has been discarded from the detection method. Its removal also reduces the computational
cost of detection.
The detection process is described programmatically in Algorithm 1. The parameter pair θ_lower and θ_upper is used to limit the orientations of symmetry lines detected by fast symmetry.
Algorithm 1: Fast bilateral symmetry detection
Input: I – Source image
Output: sym – Array of symmetry line parameters (R, θ)
Parameters:
    D_min, D_max – Minimum and maximum pairing distance
    θ_lower, θ_upper – Orientation limits (Hough indices)
    N_lines – Number of symmetry lines returned
 1: edgePixels ← (x, y) locations of edge pixels in I
 2: Hough accumulator H[ ][ ] ← 0
 3: for θ_index ← θ_lower to θ_upper do
 4:     θ ← θ_index in radians
 5:     Rot ← Rotate edgePixels by angle θ (see Figure 3.2)
 6:     for each row in Rot do
 7:         for each possible pair (x_1, x_2) in the current row do
 8:             dx ← |x_2 − x_1|
 9:             if dx < D_min OR dx > D_max then
10:                 continue to next pair
11:             x_0 ← (x_2 + x_1)/2
12:             Increment H[x_0][θ_index] by 1
13: for i ← 1 to N_lines do
14:     sym[i] ← max (R, θ) ∈ H
15:     Neighbourhood around sym[i] in H ← 0
The thresholds D_min and D_max control the scale of detection by placing limits on the minimum and maximum distance allowed between edge pixel pairs. In practice, D_min is used to reject small scale symmetry, such as that caused by edge contours with multi-pixel thickness. D_max is used to reduce the effects of large scale symmetry, which tends to be caused by background edge noise and inter-object symmetry. The parameter N_lines controls the number of symmetry lines returned by the detection method.
Edge Detection and Sampling
Fast symmetry operates on the edge pixels of an input image. The Canny edge detector
[Canny, 1986] is used to generate the edge image. The detection method is given the
locations of edge pixels as input. Note that no preprocessing of edge pixels is performed
before detection. The Canny thresholds are set to ensure reasonable edge detection results
for indoor scenes. These thresholds are fixed across multiple experiments. The thresholds
are only modified when a change occurs in the camera’s gain or exposure settings. The
C++ implementation of fast symmetry uses the Canny edge detection function provided
by Intel OpenCV 1.0 [Intel, 2006] with an aperture of 3 pixels.
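A minimal sketch of this preprocessing step is shown below. It assumes a modern OpenCV release rather than the Intel OpenCV 1.0 library used in the original implementation, and it uses the Canny thresholds of 30 and 90 quoted in Section 3.4; the function name is illustrative.

    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Extract the (x, y) locations of edge pixels that serve as the
    // input data for fast bilateral symmetry detection.
    std::vector<cv::Point> extractEdgePixels(const cv::Mat& grey)
    {
        // Canny thresholds of 30 and 90 with a 3-pixel aperture, as used
        // for the indoor test images in this thesis.
        cv::Mat edges;
        cv::Canny(grey, edges, 30, 90, 3);

        // Collect the coordinates of all non-zero (edge) pixels.
        std::vector<cv::Point> edgePixels;
        cv::findNonZero(edges, edgePixels);
        return edgePixels;
    }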
Hough Transform using Convergent Voting
The Hough transform, first described in a patent [Hough, 1962] and later refined to model
lines using polar parameters [Duda and Hart, 1972], is a method used to detect param-
eterizable curves in a set of points. It is commonly employed to find straight lines and
circles in edge images. It has also been generalized to find arbitrary shapes in images [Bal-
lard, 1981]. A vast collection of Hough transform methods and parameterizations is summarized in a survey paper [Illingworth and Kittler, 1988].
Fast symmetry uses a modified Hough transform approach to find symmetry lines. Unlike
traditional Hough methods, which require multiple votes for each edge pixel, fast symmetry
uses a convergent voting scheme. This modified scheme greatly reduces the total number
of votes cast. In exchange, additional computation is required to perform edge pixel
pairing prior to voting. The detected symmetry line is parameterized by its radius (R)
and orientation (θ) relative to the image center.
Figure 3.1 illustrates the symmetry line parameterization and the Hough transform con-
vergent voting scheme. As a bilateral symmetry line is bidirectional, the angle θ is limited to −π/2 < θ ≤ π/2. Edge pixels, shown as dots, are paired up and each pair contributes a single vote. For example, the edge pixel pair in black, linked by a solid line, contributes one vote to the dashed symmetry line.
Figure 3.1: The Hough transform convergent voting scheme in fast bilateral symmetry detection.
In their work on randomized Hough transform [Xu and Oja, 1993], Xu and Oja observed
that convergent voting reduces Hough accumulation noise and improves the sharpness of
peaks in parameter space. Convergent voting also reduces the computational cost of voting
in exchange for additional processing to group edge pixels into pairs. This reduction in
computational cost is first described in [Ballard, 1981]. Exhaustively pairing N edge pixels has a computational complexity of O(N²). With large N, this will have detrimental effects on a method's real time performance. To overcome this, fast symmetry employs an edge pixel rotation and grouping step before edge pairing to reduce the effective size of N.
Edge Pixel Rotation and Grouping
Figure 3.2 contains a simple example of edge pixel rotation and grouping. Edge pixels,
drawn as black dots, are rotated by angle θ. After this rotation, edge pixels with similar
y coordinates, belonging to the same scanline, are grouped into rows of a two-dimensional
array named Rot. By only pairing values within each row of Rot, the resulting edge pixel
pairs will all vote for symmetry lines at the current rotation orientation θ. The example
in Figure 3.2 will produce five votes for the symmetry line with orientation θ and radius
R = 2.
Figure 3.2: The edge pixel rotation and grouping step in fast bilateral symmetry detection.
This process of edge pixel rotation and grouping provides two benefits. Firstly, random
memory access across the θ dimension of the Hough accumulator is removed. This means
that only a single row of the accumulator needs to be cached during voting. Secondly,
the arithmetic to calculate symmetry line parameters during voting is greatly simplified.
The polar radius, R, is found by taking the average of x coordinates in an edge pair. The
orientation does not require any calculation as it is simply the current angle of rotation. By
doing this, computationally expensive calculations are avoided during the O(N²) voting process.
As pixel rotation has a complexity of O(N), the edge pixel rotation step is computationally
cheap. In addition, the rotation step explicitly allows the use of orientation limits in
detection, which can greatly reduce the computational cost of detection. Note also that
the effective size of N during the O(N²) edge pairing and voting is reduced by the grouping
process as pairing only occurs between the subset of edge pixels within each row of Rot, as
opposed to all edge pixels. The edge pixel rotation and grouping step significantly reduces
detection time for large input images.
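The rotation, grouping and voting loops of Algorithm 1 can be sketched in C++ as follows. This is an illustration rather than the implementation used in the experiments: the accumulator layout, the rotation sign convention and the helper names are assumptions made for the example.

    #include <opencv2/core.hpp>
    #include <cmath>
    #include <vector>

    // Convergent Hough voting for fast bilateral symmetry detection.
    // Each valid edge pixel pair casts a single vote for the symmetry
    // line with orientation theta and radius (x1 + x2) / 2.
    cv::Mat houghVote(const std::vector<cv::Point>& edgePixels,
                      const cv::Size& imageSize,
                      int thetaBins,              // e.g. 180 (1 degree per bin)
                      int thetaLower, int thetaUpper,
                      double dMin, double dMax)
    {
        const int diag = static_cast<int>(
            std::ceil(std::hypot(imageSize.width, imageSize.height)));
        // Accumulator H[radius][theta]; radius is measured from the image
        // centre and offset by diag/2 so that indices are non-negative.
        cv::Mat H = cv::Mat::zeros(diag, thetaBins, CV_32S);
        const cv::Point2d centre(imageSize.width / 2.0, imageSize.height / 2.0);

        for (int t = thetaLower; t <= thetaUpper; ++t) {
            const double theta = (t - thetaBins / 2) * CV_PI / thetaBins;
            const double c = std::cos(theta), s = std::sin(theta);

            // Rotate edge pixels and group them into 1-pixel scanlines (rows).
            std::vector<std::vector<double>> rot(diag);
            for (const cv::Point& p : edgePixels) {
                const double x = (p.x - centre.x) * c + (p.y - centre.y) * s;
                const double y = -(p.x - centre.x) * s + (p.y - centre.y) * c;
                const int row = static_cast<int>(std::lround(y)) + diag / 2;
                if (row >= 0 && row < diag)
                    rot[row].push_back(x);
            }

            // Pair x coordinates within each row and cast convergent votes.
            for (const auto& row : rot) {
                for (size_t i = 0; i < row.size(); ++i) {
                    for (size_t j = i + 1; j < row.size(); ++j) {
                        const double dx = std::abs(row[j] - row[i]);
                        if (dx < dMin || dx > dMax)
                            continue;
                        const int r = static_cast<int>(
                            std::lround((row[i] + row[j]) / 2.0)) + diag / 2;
                        if (r >= 0 && r < diag)
                            H.at<int>(r, t) += 1;
                    }
                }
            }
        }
        return H;
    }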
Peak Finding and Sub-pixel Refinement
Lines 13 to 15 of Algorithm 1 details the peak finding process that returns the (R, θ)
parameters of detected symmetry lines. The peak finding process consists of a maxima
search in the Hough accumulator followed by non-maxima suppression. Non-maxima
suppression is performed after locating a maximum by zeroing the neighbourhood of bins
around the maximum, including the maximum itself. The non-maxima suppression step
prevents the detection of multiple symmetry lines with near-identical parameters. In
the C++ implementation, a threshold proportional to the highest peak in the Hough
accumulator is also used to prevent the detection of very weak symmetries.
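A sketch of lines 13 to 15, together with the relative threshold mentioned above, is given below. The suppression neighbourhood and threshold values are illustrative defaults, not the parameters used in the experiments.

    #include <opencv2/core.hpp>
    #include <algorithm>
    #include <vector>

    struct SymmetryLine { int r; int theta; int votes; };

    // Return up to nLines symmetry lines as peaks of the Hough accumulator H
    // (rows index R, columns index theta). A neighbourhood around each accepted
    // peak is zeroed (non-maxima suppression) so that near-identical lines are
    // not reported twice. H is modified in place.
    std::vector<SymmetryLine> findPeaks(cv::Mat& H, int nLines,
                                        int suppressR = 5, int suppressTheta = 5,
                                        double relativeThreshold = 0.2)
    {
        std::vector<SymmetryLine> lines;
        double globalMax;
        cv::minMaxLoc(H, nullptr, &globalMax);

        for (int i = 0; i < nLines; ++i) {
            double maxVal;
            cv::Point maxLoc;
            cv::minMaxLoc(H, nullptr, &maxVal, nullptr, &maxLoc);
            // Reject very weak symmetries relative to the strongest peak.
            if (maxVal < relativeThreshold * globalMax)
                break;
            lines.push_back({maxLoc.y, maxLoc.x, static_cast<int>(maxVal)});

            // Non-maxima suppression: zero the bins around the accepted peak.
            const int r0 = std::max(0, maxLoc.y - suppressR);
            const int r1 = std::min(H.rows, maxLoc.y + suppressR + 1);
            const int t0 = std::max(0, maxLoc.x - suppressTheta);
            const int t1 = std::min(H.cols, maxLoc.x + suppressTheta + 1);
            H(cv::Range(r0, r1), cv::Range(t0, t1)) = 0;
        }
        return lines;
    }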
Hough accumulation is inherently susceptible to aliasing due to the quantization of (R, θ)
parameter space into discrete bins. To improve detection accuracy, sub-pixel refinement is
performed to approximate the true peak in parameter space. The term sub-pixel is used
as the resulting peak is allowed to lie between Hough accumulator bins.
After locating the maximum value in the Hough accumulator, prior to non-maxima sup-
pression, the 3 × 3 neighbourhood of values centered at the maximum is used to refine
the peak location. A Hessian fit of a two dimensional quadratic is used to calculate the
sub-pixel offsets. The fitting process is performed as follows:
V = \begin{bmatrix}
V_{-1,-1} & V_{0,-1} & V_{1,-1} \\
V_{-1,0}  & V_{0,0}  & V_{1,0}  \\
V_{-1,1}  & V_{0,1}  & V_{1,1}
\end{bmatrix}

D = \begin{bmatrix}
\frac{V_{-1,0} - V_{1,0}}{2} \\
\frac{V_{0,-1} - V_{0,1}}{2}
\end{bmatrix}
\qquad
H = \begin{bmatrix}
V_{1,0} + V_{-1,0} - 2V_{0,0} & \frac{V_{1,1} + V_{-1,-1} - V_{-1,1} - V_{1,-1}}{4} \\
\frac{V_{1,1} + V_{-1,-1} - V_{-1,1} - V_{1,-1}}{4} & V_{0,1} + V_{0,-1} - 2V_{0,0}
\end{bmatrix}
\tag{3.1}
First, the local neighbourhood values around the peak are defined as matrix V such that the maximum is located at V_{0,0}. Treating the 3 × 3 neighbourhood of values as a coarsely sampled surface, the local differences and Hessian are calculated and placed into vector D and matrix H respectively. The sub-pixel offsets are found by solving for X_off in the following equation.

H X_{off} = D \tag{3.2}
The offsets in X_off are added to the maximum location to produce a refined sub-pixel location. Brief experiments on synthetic images suggest that the use of sub-pixel refinement
provides noticeable improvement to detection accuracy. Accuracy is especially improved
in cases where the true symmetry line of a shape lies at the quantization boundary be-
tween Hough accumulator bins. As sub-pixel refinement is only executed once for each
detected symmetry line, it does not contribute significantly to the total computational
cost of detection.
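The Hessian fit of Equations 3.1 and 3.2 reduces to solving a 2 × 2 linear system. A minimal sketch is shown below, assuming the 3 × 3 neighbourhood around the peak has already been copied out of the accumulator and converted to floating point; the function name is illustrative.

    #include <opencv2/core.hpp>

    // Sub-pixel refinement of a Hough peak. V is the 3x3 neighbourhood of
    // accumulator values centred on the peak, so V.at<float>(1, 1) holds the
    // maximum. Returns the (column, row) offset within the neighbourhood to
    // be added to the integer peak location.
    cv::Point2f refinePeak(const cv::Mat& V)
    {
        // v(dx, dy) addresses the neighbourhood relative to the central peak.
        auto v = [&](int dx, int dy) { return V.at<float>(dy + 1, dx + 1); };

        // Local differences (vector D of Equation 3.1).
        const cv::Matx21f D((v(-1, 0) - v(1, 0)) / 2.0f,
                            (v(0, -1) - v(0, 1)) / 2.0f);

        // Local Hessian (matrix H of Equation 3.1).
        const float dxx = v(1, 0) + v(-1, 0) - 2.0f * v(0, 0);
        const float dyy = v(0, 1) + v(0, -1) - 2.0f * v(0, 0);
        const float dxy = (v(1, 1) + v(-1, -1) - v(-1, 1) - v(1, -1)) / 4.0f;
        const cv::Matx22f H(dxx, dxy,
                            dxy, dyy);

        // Solve H * X_off = D (Equation 3.2) for the sub-pixel offsets.
        const cv::Matx21f offset = H.solve(D, cv::DECOMP_LU);
        return cv::Point2f(offset(0), offset(1));
    }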
3.2.3 Extensions to Detection Method
During the course of research, numerous modifications have been made to the basic de-
tection method. These changes were made in an attempt to improve detection accuracy,
robustness, flexibility and performance. Due to various reasons, such as the need to re-
duce detection execution time, these extensions are not included in the core of the fast
symmetry method detailed in Algorithm 1. For the sake of completeness and to encourage
future work, this section will briefly describe these extensions.
Skew Symmetry Detection
Bilateral symmetry, as discussed in Section 2.1.2, is a subset of skew symmetry where the
symmetry line and the line joining symmetric point pairs intersect at a right angle. By
modifying the convergent voting portion of the algorithm, fast symmetry can be extended
to detect skew symmetry. In the modified scheme, a discrete range of skew orientations is specified, for which symmetry lines can be detected. Instead of casting single votes,
every edge pixel pair casts multiple votes, one for each skew orientation.
As multiple votes are cast for each edge pair, the computational cost increases by a constant
factor equal to the number of skew orientations. The skew symmetry detection complexity is O(MN²), where N is the number of edge pixels and M is the number of discrete skew angles for which symmetry can be detected. In order to recover the skew angle, two additional matrices with the same size as the Hough accumulator are needed. The skew angle is recovered using the method suggested in [Lei and Wong, 1999]. An example result of skew symmetry detection is shown in Figure 3.3(b). The detected symmetry line is shown as a solid red line overlaid on top of the source image's edge pixels.
Reducing Vote Aliasing
Early experiments on synthetic images exposed an aliasing problem where votes belonging
to the same symmetry line can be split across multiple Hough accumulator bins. For
example, if the orientation of a shape’s symmetry line is at 0.5 degrees, using a Hough
orientation quantization of 1 degree, votes can be spread between bins of orientations 0
and 1 within the Hough accumulator. This aliasing effect means that a symmetric shape
and its rotation can produce peaks of different heights in the Hough accumulator.
(a) Source image
(b) Detection result
Figure 3.3: Fast skew symmetry detection.
To overcome this, the votes cast can be spread to reduce the amount of aliasing. Instead of
incrementing a single accumulator bin, the values of surrounding bins are also incremented
by using a voting kernel. Votes can be spread using a variety of kernels. A 3 × 3 integer
approximation of a symmetric Gaussian is a good compromise between computational cost
and effective anti-aliasing. Instead of vote spreading, Gaussian blurring of the accumulator
after voting also reduces aliasing.
However, due to the quantization of parameter space, Hough transform accumulation is
inherently susceptible to vote aliasing. The vision systems presented in this thesis do
not employ these anti-aliasing improvements as they never rely on the quantity of Hough
votes as a direct measure of symmetry strength. However, in applications where a reliable
measure of symmetry strength is needed, spread voting schemes or Hough accumulator
blurring should be employed to combat vote aliasing.
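Where such a measure is needed, the accumulator blurring variant can be realized with a single smoothing pass; the kernel size below is an illustrative choice rather than a value used in the thesis.

    #include <opencv2/imgproc.hpp>

    // Reduce vote aliasing by smoothing the Hough accumulator after voting,
    // re-merging votes that were split across adjacent (R, theta) bins.
    void blurAccumulator(cv::Mat& H)
    {
        cv::Mat Hf;
        H.convertTo(Hf, CV_32F);                    // GaussianBlur needs float here
        cv::GaussianBlur(Hf, Hf, cv::Size(3, 3), 0.0);
        Hf.convertTo(H, H.type());                  // back to the accumulator type
    }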
Preventing the Detection of Non-Object Symmetry
As Canny edge detection and Hough transform are methods with high noise robustness,
fast symmetry is inherently very robust to noisy input data. However, symmetry lines
can be detected due to coincidental pairings of edge pixels belonging to different objects
or between object and background contours. These noisy symmetries can also occur in
images with large quantities of high frequency texture, which results in the generation of
many edge pixels that do not belong to any object contour. The visual structure of a
scene’s background, such as long edge contours from table tops, can also encourage the
detection of non-object symmetry.
Figure 3.4 is a simple example of the latter kind of unwanted symmetry. The black dots
represent edge pixels, which are overlaid on grey lines representing the contours from
which they were detected. As the vertical symmetry line has θ = 0, no edge pixel rotation
is needed. Grouping the edge pixels for pairing produces the values in Rot. Notice that the
values in Rot will vote for two symmetry lines at the current orientation. The symmetry
line at R = −1 will receive votes from the pairs (2, −4), (1, −3) and (0, −2). Three pairs
of (3, 1) will contribute votes to the other symmetry line at R = 2. Both symmetry lines
will have three votes at the end of the Hough transform.
Figure 3.4: Unwanted symmetry detected from a horizontal line.
The symmetry line at R = 2 is expected as the cup-like U-shaped contour suggests sym-
metry. However, the symmetry line at R = −1 has the same number of convergent votes
but is simply a straight line, not a symmetric shape from a human vision point of view.
Technically, the horizontal line does have strong bilateral symmetry as it has many sym-
metric portions across the symmetry line. In fact, a bilaterally symmetric V-shape will
result if the line is bent upwards, using its intersection with the symmetry line as a pivot.
Similarly, an attempt to widen and flatten the U-shaped cup contour will arrive at the
horizontal line after passing through shapes that look like the cross-section of a bowl.
The lack of prior assumptions concerning the kind of shapes fast symmetry should target
is the primary reason why unwanted symmetries are detected. To
detect symmetries that are likely to belong to objects and reduce the chance of non-object
symmetries being found, the edge grouping and voting processes can be modified. Two
approaches can be used to steer detection towards symmetry lines of object-like shapes.
Firstly, the specific problem of detecting straight lines as being bilaterally symmetric can
be addressed by preprocessing the values in the rows of Rot prior to pairing. Horizontal
lines that occur after edge pixel rotation will contribute many values to a single row of
Rot. By ignoring the rows of Rot that have too many values, long straight lines will be
rejected before pairing, effectively eliminating the problem. For example, the horizontal
line’s symmetry at R = −1 will not be detected if rows with more than three values in
Figure 3.4 are ignored.
However, setting a fixed threshold is difficult as the number of edge pixels and their
distribution in an image are scene-dependent. Even with a finely tuned threshold, it
is possible to reject rows with edge pairs that belong to a symmetric object contour.
Rejecting too many edge pixels from object contours will lead to failed detection. If
the threshold is set too low, rows with many edge pixels caused by visual clutter or
high frequency texture will be wrongly rejected. Overall, this method of straight line
rejection should be employed with a large threshold in situations where the additional
noise robustness is sorely needed.
Secondly, instead of excluding rows of edge pixels from pairing, the convergent voting
process can be modified to address the underlying problem. Looking again at the U-
shaped contour in Figure 3.4, it appears more symmetric than the horizontal line because
it is taller along the direction of its symmetry line. By exploiting this qualitative property
of symmetric objects, the convergent voting process can be reformulated to favour tall
symmetric contours instead of flat, wide contours.
The voting process is modified by imposing the rule that any Hough accumulator bin
is only allowed to be incremented once by each row of Rot. Applying this rule to the
example in Figure 3.4, the three edge pixel pairs of (3, 1) continue to vote for the U-shape
contour’s symmetry line at R = 2. However, for the row with edge pixel values taken from
the horizontal line, the voting is very different. Recall that the edge pixel pairs (2, −4),
(1, −3) and (0, −2) all vote for the same symmetry line. Following the new voting rule,
two of the three identical votes are ignored, reducing the strength of the symmetry line
at R = −1 to a single vote. The strength of the non-object symmetry line is now a third of its original value. It can be seen that this method is very effective in reducing the quantity
of votes contributed by straight line contours as well as dense patches of edge pixels from
high frequency texture.
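A sketch of this modified voting rule is shown below; it mirrors the pairing loop of Algorithm 1 and uses an illustrative per-row record of the radius bins already incremented.

    #include <opencv2/core.hpp>
    #include <cmath>
    #include <unordered_set>
    #include <vector>

    // Convergent voting for one row of Rot, with the added rule that each
    // accumulator bin may only be incremented once per row. Long horizontal
    // contours therefore contribute a single vote rather than one vote per
    // symmetric pixel pair.
    void voteRowOncePerBin(const std::vector<double>& row, int thetaIndex,
                           double dMin, double dMax, int radiusOffset,
                           cv::Mat& H)
    {
        std::unordered_set<int> votedBins;          // radius bins hit by this row
        for (size_t i = 0; i < row.size(); ++i) {
            for (size_t j = i + 1; j < row.size(); ++j) {
                const double dx = std::abs(row[j] - row[i]);
                if (dx < dMin || dx > dMax)
                    continue;
                const int r = static_cast<int>(
                    std::lround((row[i] + row[j]) / 2.0)) + radiusOffset;
                if (r < 0 || r >= H.rows || !votedBins.insert(r).second)
                    continue;                       // outside H or already voted
                H.at<int>(r, thetaIndex) += 1;
            }
        }
    }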
In practice, it is fairly difficult to set up a situation where unwanted symmetries oc-
cur regularly. The use of a checkerboard pattern as background can consistently inject
non-object symmetry lines into the detection results, lowering the rankings of object sym-
metries. Higher level processes such as Kalman filtering are generally able to ignore the
noisy detection results. Where additional noise robustness is needed, one of the methods
described here should be applied based on the demands of the target application. The
row rejection method is suggested for applications where robustness against straight lines
is needed. Note that as the row rejection method reduces the number of edge pixels paired, the computational cost of detection is also reduced. However, in applications where additional
computation costs can be afforded, the latter method of modified voting will be much more
effective in rejecting non-object symmetries.
3.3 Computational Complexity of Detection
The fast symmetry algorithm consists of two loops in series. The first, beginning on
line 3 in Algorithm 1, performs the edge pixel rotation, grouping and convergent voting
steps. The second loop begins on line 13. This loop performs peak finding on the Hough
accumulator to return the parameters of detected symmetry lines. Peak finding has a
complexity of O(N), where N is the number of Hough accumulator bins. The peak finding loop is repeated N_lines times. As N_lines is small, less than 10 in all detection scenarios, peak finding contributes a very small portion of the overall computational cost of detection.
The voting loop occurs on lines 3 to 12 of the algorithm. As this loop is carried out for θ_upper − θ_lower iterations, the computational cost of detection is reduced in a linear manner by reducing the orientation range of detection. For example, limiting the orientation range of detection to ±10 degrees of vertical, which is one-ninth of the maximum range of 180 degrees, will reduce detection time by a factor of nine. This linear reduction in execution time is predicated on the assumption that the distribution of edge pixels in the rows of Rot after edge pixel rotation and grouping is similar for all orientations.
Each cycle of the voting loop contains two major steps. The first step is edge pixel rotation and grouping, which occurs on lines 4 and 5 of Algorithm 1. The rotation and grouping process has a complexity of O(N_edge), where N_edge is the number of edge pixels in the input image. As the edge pixels are rotated by a set of angles that are fixed at compile time, rotation matrices are calculated and placed into a lookup table prior to detection.
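A minimal sketch of such a lookup table is given below, assuming the 180 one-degree orientation bins used in the experiments; building it once before detection removes all trigonometric calls from the voting loop.

    #include <array>
    #include <cmath>

    // Precomputed sine/cosine pairs for 180 one-degree orientation bins,
    // built once so that no trigonometry is needed inside the voting loop.
    struct RotationTable {
        std::array<double, 180> cosTheta, sinTheta;
        RotationTable() {
            const double pi = std::acos(-1.0);
            for (int i = 0; i < 180; ++i) {
                const double theta = (i - 90) * pi / 180.0;  // -90 to +89 degrees
                cosTheta[i] = std::cos(theta);
                sinTheta[i] = std::sin(theta);
            }
        }
    };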
The second step of the voting loop begins on line 6 of the algorithm. This step iterates
through the rows of Rot, pairing the x coordinates of rotated edge pixels and performing convergent voting. For each edge pixel pair, the distance of separation is checked against the thresholds D_min and D_max. These checks are formulated as if statements on lines 8 to 10. The x coordinates of edge pixel pairs that satisfy the thresholds are averaged to
35
CHAPTER 3. SYMMETRY DETECTION
find the radius of the symmetry vote. The radius is labelled as x
0
on line 11. On line 12,
the convergent vote is cast by incrementing the Hough accumulator.
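
To make the pairing and voting step concrete, a minimal C++ sketch of this inner loop is given below. It assumes Rot is stored as a vector of rows, each holding the rotated x coordinates of the edge pixels grouped into that scanline, and that the caller supplies the accumulator column for the current orientation; the names, data structures and radius offset are illustrative assumptions rather than the thesis implementation.

#include <cmath>
#include <vector>

// Edge pairing and convergent voting for a single orientation (lines 6 to 12 of
// Algorithm 1). rotRows[r] holds the rotated x coordinates of the edge pixels
// grouped into scanline r. accumulatorColumn is the slice of the Hough
// accumulator for the current theta. radiusOffset maps a signed radius to a
// non-negative bin index and is an assumed convention.
void convergentVoting(const std::vector<std::vector<float>>& rotRows,
                      std::vector<int>& accumulatorColumn,
                      float dMin, float dMax, float radiusOffset)
{
    for (const auto& row : rotRows) {
        for (std::size_t i = 0; i + 1 < row.size(); ++i) {
            for (std::size_t j = i + 1; j < row.size(); ++j) {
                const float separation = std::fabs(row[j] - row[i]);
                // Reject pairs closer than Dmin or further apart than Dmax.
                if (separation < dMin || separation > dMax)
                    continue;
                // The radius of the symmetry vote is the midpoint of the pair (x0).
                const float x0 = 0.5f * (row[i] + row[j]);
                const int bin = static_cast<int>(std::lround(x0 + radiusOffset));
                if (bin >= 0 && bin < static_cast<int>(accumulatorColumn.size()))
                    ++accumulatorColumn[bin];   // cast one convergent vote
            }
        }
    }
}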
Edge pairing and convergent voting consume the majority of computational time. Therefore,
the computational complexity of detection is primarily dependent on the efficiency
of the pairing and voting processes. As such, the complexity of detection can be approximated
as follows.
COMPLEXITY ∝ (N_edge / Rows_Rot)² × Rows_Rot × (θ_upper − θ_lower)    (3.3)
N_edge, as used earlier, is the number of edge pixels extracted from the input image.
Rows_Rot is the number of rows in Rot. Assuming uniformly distributed edge pixels across
the rows of Rot, the number of edge pixels per row is N_edge / Rows_Rot. As the edge pairing
process has N² complexity, squaring this fraction gives the leftmost term in the multiplication.
Performing edge pairing for each row of Rot requires Rows_Rot repetitions, which results
in the middle term. The rightmost term represents the number of angles for which edge
rotation and voting take place, represented as a for-loop on line 3 of the algorithm.
Simplifying Equation 3.3 gives
COMPLEXITY ∝ N_edge² × (θ_upper − θ_lower) / Rows_Rot    (3.4)
Equation 3.4 shows that both the orientation range of detection and the number of rows in
Rot directly affect the computational cost of detection. As suggested earlier, the detection
orientation range can be reduced to improve performance. Note that the orientation
range can be changed at runtime, which is used by the real time object tracker detailed
in Chapter 5 to reduce detection execution times. With the scanline height fixed, Rows_Rot
depends solely on the size of the input image. For the sake of computational efficiency,
scanlines are 1-pixel high. This allows the grouping of edge pixels into Rot to be performed
using computationally cheap rounding operations. To ensure that all edge pixels are grouped,
Rows_Rot is set to match the diagonal length of the input image.
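
As a brief worked illustration of Equation 3.4 (back-of-envelope ratios implied by the equation, not additional measurements), restricting detection to ±10 degrees of vertical and halving the number of edge pixels give the following expected reductions, with all other factors held constant:

COMPLEXITY(±10°) / COMPLEXITY(180°) = 20° / 180° = 1/9
COMPLEXITY(N_edge/2) / COMPLEXITY(N_edge) = (1/2)² = 1/4

The first ratio is the factor-of-nine reduction mentioned above; the second shows that reducing the number of edge pixels pays off quadratically.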
3.4 Detection Results
Before applying the fast bilateral symmetry detection method to robotics applications, it
is tested offline to gain a better understanding of the quality and speed of detection. In
this section, the results of these tests are partitioned into three parts. The first subsection
details the detection of bilateral symmetry in synthetic images. The second subsection
shows detection results of the method operating on images of real world scenes containing
symmetric objects. The final subsection investigates the computational cost of detection
in practice using a series of timing trials.
The same Hough quantization is used across all experiments. The Hough space is quantized
into 180 orientation divisions, giving a θ bin size of 1 degree. The number of radius divisions
is equivalent to the diagonal of the input image. As such, the R bin size is set to 1 pixel.
Canny edge detection thresholds of 30 and 90 are used to extract edge pixels for all test
images.
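
For reference, a minimal C++ sketch of how an accumulator with this quantization might be sized and indexed is given below. The mapping of a signed radius R to a non-negative bin index depends on the chosen image origin, so the offset used here is an assumed convention rather than the thesis implementation.

#include <cmath>
#include <vector>

// Hough accumulator with 1 degree theta bins over 180 degrees and 1 pixel
// radius bins spanning the image diagonal, as described above. The radius
// offset assumes the radius is measured from the image centre, which is an
// illustrative assumption only.
struct SymmetryAccumulator {
    int thetaBins;
    int radiusBins;
    double radiusOffset;
    std::vector<std::vector<int>> bins;   // bins[theta][radius]

    SymmetryAccumulator(int imageWidth, int imageHeight)
        : thetaBins(180),
          radiusBins(static_cast<int>(std::ceil(std::hypot(imageWidth, imageHeight)))),
          radiusOffset(0.5 * radiusBins),
          bins(thetaBins, std::vector<int>(radiusBins, 0)) {}

    // Quantize a symmetry line (theta in degrees, R in pixels) to bin indices.
    void vote(double thetaDegrees, double R) {
        const int t = static_cast<int>(thetaDegrees) % thetaBins;
        const int r = static_cast<int>(std::lround(R + radiusOffset));
        if (t >= 0 && r >= 0 && r < radiusBins)
            ++bins[t][r];
    }
};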
3.4.1 Synthetic Images
An old idiom suggests that one should learn to crawl before attempting to run. As such,
experimental evaluation of fast symmetry begins with synthetic images. These images are
less challenging than the kind of images a domestic robot will encounter. The results of
symmetry detection on these synthetic images are shown in Figure 3.5. The symmetry
lines detected using fast symmetry are drawn in green on top of the source image. The
detection results are organized in columns based on the parameter N_lines, which controls
the number of symmetry lines returned by detection. Note that the rightmost column of
the figure contains detection results obtained using different N_lines values. Going from
top to bottom, 4, 5 and 6 symmetry lines are detected respectively. Apart from N_lines,
the same detection parameters are used for all images. The minimum pairing threshold is
set to D_min = 10 to prevent the detection of small scale symmetries that occur along thick
lines, such as the triangle surrounding the poison symbol. No maximum pairing threshold
is used, nor any orientation limits.
Overall, the detection results in Figure 3.5 appear similar to the kind of bilateral symmetry
perceived by the human visual system. The exceptions are the diagonal symmetry lines of
the triangular poison symbol and the wide rectangle. In the poison symbol, located at the
bottom of the N_lines = 3 column of Figure 3.5, the vertical symmetry line received more
Hough votes than the two diagonal lines. The same is true for the horizontal symmetry
line of the rectangle, which has more votes than the vertical and diagonal lines. While
Section 3.2.3 warns against directly using the Hough accumulator vote count as an indication
of symmetry strength, for these synthetic figures, the votes cast for a symmetry line
seem to correlate with human perception of symmetry strength.
3.4.2 Video Frames of Indoor Scenes
After convincing ourselves that the detection method is effective on synthetic images,
the next step is to attempt symmetry detection on more difficult data. As the final
robot system will operate on graspable objects imaged at arm’s reach, images of indoor
scenes containing symmetric household objects are used as test data. The test images
are 640 × 480 video images captured using a colour IEEE1394 camera. Due to aliasing
caused by hardware sub-sampling of the camera’s image, which has a native resolution of
1280 ×960, Gaussian blurring is applied to the test images prior to edge detection.
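
A minimal sketch of this preprocessing stage is shown below, using OpenCV purely for illustration; the thesis does not state which image processing library was used, and the blur kernel size is an assumed value (the Canny thresholds of 30 and 90 are those quoted above).

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Preprocess a captured 640x480 frame before symmetry detection: Gaussian
// blurring to suppress the sub-sampling aliasing, then Canny edge detection.
// The 5x5 kernel is an assumption; the Canny thresholds are those stated above.
cv::Mat extractEdges(const cv::Mat& frameBgr)
{
    cv::Mat grey, blurred, edges;
    cv::cvtColor(frameBgr, grey, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(grey, blurred, cv::Size(5, 5), 0.0);
    cv::Canny(blurred, edges, 30.0, 90.0);
    return edges;
}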
Figure 3.5: Symmetry detection results – Synthetic images. Columns show results for N_lines = 1, N_lines = 3 and N_lines = 4, 5, 6 respectively.
As the assumption that a symmetric shape will occupy the majority of the image is no
longer valid, the distance thresholds are adjusted to improve detection robustness. First,
D_min is increased to 25 to limit the effects of small scale symmetry, which is more prevalent
in real world scenes due to noisy edge pixels. The upper threshold D_max is set to 250, which
is roughly half the image width, to reduce the probability of detecting non-object symmetry.
In practice, given the location of the robot’s cameras, the width of objects within reach
of the robot’s manipulator never exceeds 200 pixels.
Figure 3.6 shows the symmetry line detected by fast symmetry when it is applied to an
image of a scene containing a single symmetric object. The detected symmetry line is
drawn in red. The edge pixels of the image, which are the detection method’s input data,
are coloured black. Note that the edge pixels have been dilated to improve their visibility.
The symmetry line is detected correctly despite the large quantity of non-object edge pixels,
which suggests that fast symmetry is highly robust to asymmetric edge pixel noise.
(a) Input image
(b) Detection result
Figure 3.6: Symmetry detection result – Single symmetric object. The detected symmetry line is
drawn in red over the input image’s edge pixels, which are dilated and drawn in black.
Moving on to a scene of greater complexity, the detected symmetry lines of an image
containing three symmetric objects are shown in Figure 3.7. By setting N_lines = 3, the
three symmetry lines with the most Hough votes are returned by detection. Again, edge
pixels have been dilated and are drawn in black. This scene has fewer background edge
pixels than the last, the presence of which may cause non-object symmetries to be detected
ahead of object symmetries. Note that only a single detection pass is needed to recover
all three symmetry lines. The detection result suggests that fast symmetry can operate
on images with multiple symmetric objects effectively. Also, the green symmetry line
shows that fast symmetry can operate on roughly symmetric objects. In this case, the
method detected the mug’s bilateral symmetry while ignoring the asymmetric edge pixels
contributed by its handle.
Next, a failure scenario where non-object symmetries overshadow object symmetry is
examined. The threshold D_min is decreased to 5 pixels and the Canny thresholds are also
reduced to increase the number of noisy edge pixels contributing to Hough voting. After
applying these changes, Figure 3.8 shows the top five symmetry lines returned by fast
symmetry for a scene containing multiple symmetric objects. The numbers next to the
symmetry lines indicate their ranking in terms of Hough accumulator vote count, with one
being the highest ranked and five the lowest.
While technically not a failure of fast symmetry, the detection of non-object symmetries
ahead of symmetries emanating from objects is unwanted in many situations. Moreover,
if bilateral symmetry is used as an object feature, this problem needs to be addressed.
The middle and bottom images of Figure 3.8 can be used to examine the cause of this
problem. Edge pixels that voted for a non-object symmetry line are coloured red. The high
frequency surface texture of the multi-colour mug contributes the majority of noisy edge
pixels to the non-object symmetries. As fast symmetry does not know what an object is,
the problem of detecting non-object symmetry lines cannot be fully resolved. Therefore,
higher level processes making use of symmetry detection results must be robust against
non-object symmetry lines.
The above problem can be partially overcome through three approaches. Firstly, D_min
can be increased to prevent the pairing of noisy edge pixels within high texture patches.
Reverting to the original suggested value of 25 will reduce the strength of symmetry lines
2 and 4. Secondly, increasing the Canny thresholds will reduce the edge pixel noise,
especially from faint high frequency texture as seen on the multi-colour mug in Figure 3.8.
However, increasing the Canny thresholds too much may result in missing edge pixels
in object contours that can cause missed detection of symmetric objects. Finally, the
orientation range of detection can be limited if some prior knowledge of object pose is
known. This method is further detailed below.
(a) Input image
(b) Detection result
Figure 3.7: Symmetry detection result – Scene containing multiple symmetric objects. The
detected symmetry lines are drawn over the input image’s edge pixels, which are dilated and drawn
in black.
(a) Detection result
(b) Symmetry line 2
(c) Symmetry line 4
Figure 3.8: Detection of non-object symmetry lines due to edge pixel noise. Edge pixels have
been dilated and are coloured red if they voted for symmetry lines 2 or 4.
In situations where some a priori information concerning an object’s orientation is known,
the orientation range of detection can be restricted to reduce the impact of non-object
symmetries. For example, objects with surfaces of revolution, such as cups and bottles,
are expected to have near-vertical lines of symmetry when placed upright on a table. By
using this knowledge, the θ_lower and θ_upper orientation limits can be adjusted to restrict
the range of orientations for which symmetries are detected.
Figure 3.9 provides an example where non-object symmetries are successfully rejected
by limiting detection orientation. In the case where no orientation limits are used, the
object symmetry line is ranked fifth in terms of Hough votes. This low ranking is due
to a combination of strong background symmetry and weak object symmetry caused by
disruptions in the object’s edge contour due to the experimenter’s hand. By restricting the
orientation range of detection to ±15 degrees of vertical, the object’s symmetry line now
receives the most votes. Reducing the orientation range also improves detection speed,
which is further discussed in Section 3.4.3.
Finally, the detection method is tested on a set of visually challenging objects. Figure 3.10
contains the fast symmetry detection results for these objects with N_lines = 1 so that only
one symmetry line is returned. Also, the orientation range of detection is restricted to
±25 degrees of vertical.
The multi-colour mug image poses several challenges. Firstly, the mug is slightly asymmet-
ric due to its handle, which provides a test for the asymmetry robustness of fast symmetry.
Secondly, the handle occludes the left side of the mug’s contour, reducing the quantity
of symmetric edge pixels provided to fast symmetry as input. The reflective can has a
shiny surface which may contain symmetries due to reflections of its surroundings. The
textured bottle has a lot of edge noise within its object contour due to high frequency
surface texture. The transparent bottle has weak contrast with the background, which
results in gaps along the object’s edge contour.
The successful detection of symmetry lines for these visually difficult objects illustrates the
robustness of fast symmetry. The detection results also show that fast symmetry is highly
general, capable of dealing with objects of vastly different visual appearances. Another
point of note is that fast symmetry can detect the symmetries of reflective and transparent
objects. Due to their unreliable surface pixel information, these objects are difficult to
deal with using traditional computer vision approaches.
(a) No orientation limit
(b) ±15 degrees of vertical
Figure 3.9: Rejecting unwanted symmetry lines by limiting detection orientation range. The
detected symmetry lines are drawn over the input image’s edge pixels, which are dilated and drawn
in black.
(a) Multi-colour mug (b) Reflective can
(c) Textured bottle (d) Transparent bottle
Figure 3.10: Symmetry detection results – Objects with challenging visual characteristics.
3.4.3 Computational Performance
As many robotic applications demand rapid feature detection due to real time constraints,
the computational cost of the proposed detection method must be examined. The effects
of input data size and changes in algorithm parameter values on computational cost are
examined using detection execution time as a metric. A set of eleven video frames of
indoor scenes, which includes the test images in Figures 3.6, 3.7, 3.8 and 3.9, are used
as test data. Note that these are the same test images used in previous timing trials
documented as Table I in the author’s IROS paper [Li et al., 2006].
In the new timing trials, the distance thresholds and edge detection parameters are the
same as those used to obtain the results in Section 3.4.2. The number of symmetry lines
detected is fixed by setting N_lines = 5. Detection is performed on the full orientation
range of 180 degrees. The detection execution time is averaged over 1000 trials. The test
platform is an Intel 1.73GHz Pentium M laptop PC. The detection method is coded using
C++ and compiled with the Intel C Compiler 9.1.
The results of the timing trials are recorded in Table 3.1. The execution times of the
voting and peak finding portions of the code are measured separately. The voting time
covers lines 1 to 12 of Algorithm 1 inclusive. The peak
finding time measures lines 13 to 15 of the algorithm, including the sub-pixel refinement
step described at the end of Section 3.2.2. Note that Canny edge detection, which takes
around 8ms to perform, is not included in the trial times. The time required for the
extraction of edge locations from the edge image is included in the voting portion of the
execution times.
Table 3.1: Execution time of fast bilateral symmetry detection over 1000 trials
Image number   Number of edge pixels   Voting (ms)   Peak find (ms)   Total (ms)
1 6170 24.78 3.62 28.40
2 10444 61.75 3.72 65.48
3 6486 37.70 3.88 41.59
4 8700 52.20 4.22 56.42
5 9026 56.03 3.83 59.86
6 8365 48.38 3.24 51.63
7 5859 35.50 3.71 39.21
8 6350 35.44 4.24 39.68
9 7471 43.40 4.63 48.03
10 8396 47.48 4.04 51.51
11 3849 18.49 4.02 22.51
OVERALL MEAN 7374.18 41.92 3.92 45.85
A major point of note is the vast improvement in detection speed over the previous timing
trials conducted in 2006. The overall mean execution time has been reduced to a third
of the previous mean of 150 ms. This improvement in execution time is especially significant
considering that edge sampling was employed in the previous trials to reduce the quantity
of input data. The edge sampling process produced a randomly selected subset of the
original edge pixels, effectively reducing the input data size by a factor of four. This
massive boost in computational performance can be attributed to a variety of coding
improvements such as pointer arithmetic and streamlining of mathematical computations.
The use of aggressive optimization settings during compilation also decreased detection
time significantly.
Section 3.3 hypothesized that reducing the orientation range will linearly improve the
computational performance of detection. This hypothesis hinges on the assumption that
the distribution of edge pixels across the rows of Rot remains fairly constant for different
rotation angles. To test the validity of this hypothesis and the accompanying assumption,
the execution time of detection carried out with different orientation limits is measured.
The mean execution time is plotted against the orientation range of detection in Fig-
ure 3.11. Each data point represents the mean execution time of eleven 1000-trial detec-
tion runs across all the test images, carried out with a specific orientation range. The
least squares fit of a linear model to the timing data is shown as a dashed line on the
plot. The highly linear relationship between execution time and orientation range confirms
the hypothesis, and with it the underlying assumption of similar edge pixel distributions
across different rotation angles. The timing results also experimentally validate the prior
assertion that using a ±10 degree orientation limit will lower detection time by a factor of
nine. Note that when narrow orientation limits are used, the peak finding process may
occupy a fairly large portion of the overall detection time.
[Plot: mean execution time of voting versus orientation range of detection, with orientation range in degrees (0 to 180) on the horizontal axis and time in milliseconds on the vertical axis. The least squares linear fit is y = 0.2277x + 1.1086 with R² = 0.9998.]
Figure 3.11: Plot of mean execution time of voting versus detection orientation range.
The timing trial results clearly show that fast symmetry is well suited to time-critical
applications, which are abundant in robotics. The overall mean execution time suggests
that detection can be carried out at 20 frames per second, and higher frame rates can
be achieved by reducing the orientation range of detection. Apart from limiting orientation
range, choosing reasonable Canny thresholds and increasing D_min provides the most
noticeable performance gains in practice. The former can reduce the number of input
edge pixels, which reduces the quantity of input data supplied to detection. The latter
limits the number of edge pairs formed, thereby reducing the number of voting operations
performed.
3.5 Comparison with Generalized Symmetry Transform
3.5.1 Introduction
The generalized symmetry transform [Reisfeld et al., 1995], herein referred to as generalized
symmetry, is a popular computer vision approach for detecting symmetry in images. This
method was designed as a context free attentional operator that can find regions of interest
without high level knowledge about the image. Apart from detecting bilateral symmetry,
this method can also find skew and radial symmetry. No object segmentation or recognition
is required prior to detection. Two symmetry maps are generated by the transform, one
representing the magnitude of symmetry, the other the phase. Of the two, the magnitude
map is of particular interest. Performing line detection on the
binary result produced by thresholding the magnitude symmetry map will yield similar
symmetry lines to those detected using fast symmetry.
The magnitude symmetry map is an image with pixel values representing the symmetry
of the input image at a particular location. The value at each pixel of the map is found by
summing the symmetry contributions from all possible pairs of pixels in the input image.
The symmetry contribution of each pixel pair is calculated by multiplying several weights,
each of which represents a different property of symmetry. An adjustable distance weight,
containing a Gaussian function whose width depends on a parameter σ, controls the scale
of detection. Mirror symmetry in the orientation of local gradients between a pair of pixels
is enforced by a phase-based weighting function. The weighting scheme also includes the
logarithm of gradient intensity, which favours shapes with strong contrast against their
background.
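
The structure of this weighting scheme can be summarized schematically as follows. This is a paraphrase of the description above rather than the exact expressions, which follow [Reisfeld et al., 1995]. The magnitude symmetry map value at a point p sums, over all pixel pairs whose midpoint is p, the product of a distance weight, a phase weight and the gradient intensity terms:

M(p) = Σ over pairs (i, j) with midpoint p of  D_σ(||p_i − p_j||) × P(θ_i, θ_j) × r_i × r_j

where r_i denotes the logarithm of the gradient magnitude at p_i, D_σ is the Gaussian distance weight of width σ, and P(θ_i, θ_j) is the phase weight, which is largest when the local gradient orientations θ_i and θ_j mirror each other.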
3.5.2 Comparison of Detection Results
Using two synthetic images, the detection results of fast symmetry are qualitatively compared
with those of the generalized symmetry transform. The test images are 101 × 101
pixels in size so that the shape’s symmetry lines do not lie between pixels. The detection
results are displayed in Figure 3.12 and Figure 3.13. In the upper left image, the top
two symmetry lines detected using fast symmetry are visualized as solid red lines. The
edge pixels found using Canny edge detection are shown as green pixels. The bottom two
images in the figures are magnitude symmetry maps generated using generalized symme-
try with different σ values. Recall that the σ parameter gives control over the scale of
detection, with lower values favouring small scale symmetry. No distance thresholds are
used for either algorithm. Additional generalized symmetry transform results, including a
symmetry map generated from an image of an indoor scene, can be found in the author’s
IJRR paper [Li et al., 2008].
(a) Input image (b) Fast symmetry N_lines = 2
(c) Generalized symmetry σ = 2 (d) Generalized symmetry σ = 200
Figure 3.12: Test image 1 – Dark rectangle over uniform light background.
Beginning the analysis with Figure 3.12, several conclusions can be drawn from the de-
tection results. Firstly, comparing Figures 3.12(b) and 3.12(d), the bilateral symmetries
found using fast symmetry are similar to the large scale symmetry map returned by generalized
symmetry with σ = 200. This result agrees with theoretical expectations as fast
symmetry is designed to find global, object-like symmetries, especially when no upper dis-
tance threshold is applied. By decreasing σ, contributions from large scale symmetries are
reduced. Figure 3.12(c) shows small scale symmetry detection results using generalized
symmetry. The corners of the rectangular shape are highlighted in the symmetry map
because they have strong local symmetry. These symmetric local features are useful for
applications such as eye-detection and object recognition.
Binary thresholding followed by Hough transform can be applied in series to obtain sym-
metry lines from the symmetry map in Figure 3.12(d), similar to those detected using
fast symmetry. However, as the horizontal line in the symmetry map has much lower
intensity than the vertical one, it is difficult to perform thresholding automatically. This
large difference in symmetry intensity is due primarily to the distance weighting function
in generalized symmetry. The distance function is a Gaussian centered at distance zero,
which biases detection towards small scale symmetries. Increasing the σ value spreads the
Gaussian, reducing this favouritism, but never entirely reversing it. As such, the general-
ized symmetry method relies on the large numbers of contributing pixels present in large
scale symmetries to overcome this inherent bias in the distance weighting.
The generalized symmetry transform is designed to detect solid uniformly shaded shapes
against a contrasting background of an opposing intensity. The test image in Figure
3.13 violates this assumption. The fast symmetry detection results show that the rect-
angle’s main symmetry lines are found despite the edge noise caused by the variation in
background intensity. However, the symmetry maps show that the violation of general-
ized symmetry’s foreground-background assumption noticeably deteriorates its detection
results.
In Figure 3.13(c), the small scale symmetries of the rectangle’s corners are recovered,
which shows that the local feature detection behaviour is still functional. However, two
horizontal lines above and below the rectangle are also present in the symmetry map.
These lines are due to local gradient symmetry between pixels paired along the horizontal
sides of the rectangle. Because of the changing background intensity, the dark-to-light
gradient orientation points inward on the left side of the rectangle but points outward
on the right side. The orientation reverses direction about a vertical axis bisecting the
rectangle, passing through the top and bottom sides of the rectangle where the intensity
inside and outside the shape is the same. As such, pixels on the top and bottom sides
that are equidistant from this vertical axis, where the gradient orientation flips from
inward to outward, will appear symmetric to generalized symmetry.
Moving on to a larger scale of detection, Figure 3.13(d) shows that the rectangle’s sym-
metry is no longer found in the symmetry map. While a horizontal band can be seen, the
locations occupied by the symmetry lines detected by fast symmetry are not brightly lit in
the symmetry map. Due to the orientation flip of local gradients around the border of the
rectangle described in the last paragraph, the gradient orientations of pixels on the left
and right sides of the rectangle are parallel, facing the same direction. The phase weight-
ing function of generalized symmetry increases the symmetry contributions of pixel pairs
that have local gradient orientations that mirror each other. As local gradients on the left
and right border of the rectangle no longer mirror each other, the vertical symmetry has
effectively been destroyed in the eyes of generalized symmetry.
(a) Input image (b) Fast symmetry N_lines = 2
(c) Generalized symmetry σ = 2 (d) Generalized symmetry σ = 200
Figure 3.13: Test image 2 – Grey rectangle over background with non-uniform intensity.
These comparisons of detection results are not quantitative. Instead, the comparisons have
qualitatively highlighted the main differences between the generalized symmetry transform
and fast symmetry. Generalized symmetry is designed for computer vision applications
where objects and background obey a dark-on-light or light-on-dark intensity contrast.
As fast symmetry is designed to tolerate non-uniform background intensity, it does not
suffer from the violation of this assumption. Generalized symmetry is capable of finding
local features using a tunable scale parameter whereas fast symmetry cannot. Generalized
symmetry also relies on local gradient intensity to determine symmetry strength. Echoing
the comments found in the introduction of Kovesi’s paper on phase-based symmetry detec-
tion [Kovesi, 1997], the use of gradient intensity will bias detection towards high contrast
shapes, which may not be more symmetric than a low contrast shape. Assuming the low
contrast shape generates equal numbers of edge pixels, fast symmetry does not suffer from
this problem.
Overall, the differences between the two methods stem from their assumptions and under-
lying motivations. Generalized symmetry is designed to be a highly general context-free
method to locate local features by leveraging bilateral symmetry and radial symmetry.
Symmetric shapes in the input image must have high contrast against a uniform back-
ground. As fast symmetry is designed to recognize symmetric objects in real world images
quickly and robustly, fewer assumptions are made with regards to a scene’s foreground
and background. As a trade off, fast symmetry does not detect local features. Also, fast
symmetry does not return the same quantity of information as a symmetry map. The
symmetry map contains dense symmetry information that may be useful for applications
such as image signature generation. On the other hand, as confirmed by the comparison
results, fast symmetry can detect symmetry that eludes generalized symmetry. The fast
bilateral symmetry detection method also operates much faster than generalized symme-
try. The computational costs of fast symmetry and generalized symmetry are further
examined in the next section.
3.5.3 Comparison of Computational Performance
The computational complexities of both algorithms are O(N²), where N is the input data
size. However, for the same input image, the size of N is vastly different for each algorithm.
Whereas the generalized symmetry transform operates on every image pixel, fast symmetry
only operates on the edge pixels of the input image. The practical complexity of generalized
symmetry can be reduced to O(Nr²) by limiting the pixel pairing distance to 2r. The
parameter r is set so that any pixel pair with nearly zero symmetry contribution due to
a large distance of separation is never paired. Given that r < √N, computational cost
is reduced. The r parameter is similar to the fast symmetry algorithm’s upper distance
threshold D_max, which also limits pairing distance. To maintain fairness during the timing
trials, no distance thresholds are used for either algorithm.
The results of the timing trials are shown in Table 3.2 and Table 3.3. The two test images
in Figures 3.12 and 3.13 are used as input data in these timing trials. Both algorithms are
Table 3.2: Execution time of generalized symmetry transform on test images over 10 trials
Image number Number of image pixels Mean execution time (ms)
1 10201 23505
2 10201 24489
OVERALL MEAN 10201 23997
Table 3.3: Execution time of fast symmetry detection on test images over 1000 trials
Image number   Edge pixel count   Canny (ms)   Voting (ms)   Peak find (ms)   Total (ms)
1 280 0.555 1.447 0.673 2.674
2 363 0.565 2.052 0.672 3.289
OVERALL MEAN 321.5 0.560 1.750 0.673 2.982
implemented using C++ and compiled with the Intel C Compiler 9.1. The test platform
is an Intel 1.73GHz Pentium M laptop PC.
Comparing the mean execution times, fast symmetry is shown to operate over 8000 times
faster than the generalized symmetry method. The large difference in detection time can
be attributed to two main factors. Firstly, fast symmetry reduces input data size by the
use of Canny edge detection prior to voting. For the test images, this reduced the number
of input pixels from over ten thousand to around 300. For more cluttered images, the
edge pixel to image pixel ratio is generally less favourable, but still significantly reduces
execution time. Secondly, the voting operation of fast symmetry is much more efficient to
compute than the weight-based contribution calculations of generalized symmetry. The
simple voting procedure used in fast symmetry, as well as the edge pixel rotation and
grouping step prior to voting, greatly reduces the computational cost in the O(N²) portion
of the algorithm.
The low computational cost of the fast symmetry voting loop can be confirmed by com-
paring the execution times of both methods when operating on similar quantities of input
data. Recall that in Section 3.4.3, the execution times of fast symmetry on 640 × 480
images were measured over many trials. Referring back to the results in Table 3.1, Canny
edge detection on test image 2 produced 10444 edge pixels, all of which are processed by
fast symmetry. Fast symmetry required a mean execution time of 65.5ms, with Canny
edge detection requiring another 8ms. As 101 × 101 test images are used for the current
execution time comparison, 10201 image pixels are processed by generalized symmetry.
Table 3.2 shows that generalized symmetry took over 23 seconds to generate its result.
The massive difference in execution time for similar quantities of input data confirms the
low computational cost of the fast symmetry voting method when compared with the
symmetry contribution calculations of generalized symmetry.
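
Putting the two sets of numbers side by side gives a rough sense of the difference in per-element cost (a back-of-envelope comparison of the figures already reported above, not an additional experiment):

23997 ms ÷ (65.5 ms + 8 ms of Canny edge detection) ≈ 330

That is, even when both methods process a comparable number of input elements (10201 image pixels versus 10444 edge pixels), fast symmetry finishes more than two orders of magnitude sooner, which is attributable to the far cheaper per-pair voting operation.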
3.6 Chapter Summary
This chapter introduced the fast bilateral symmetry detection method. This method can
rapidly and robustly detect symmetry lines in large images. It can operate directly on
video frames of indoor scenes without requiring any manual preprocessing. The bilateral
symmetry of near-symmetric objects, such as a mug with a handle, can be detected using
fast symmetry. Leveraging the noise robustness of Canny edge detection and Hough
transform, fast symmetry has been shown to execute robustly at 20 frames per second on
640 ×480 images. The experimental results also show that fast symmetry can operate on
visually difficult objects including those with transparent and reflective surfaces.
The tunable parameters in fast symmetry allow for great flexibility in detection without
significantly affecting detection behaviour. For example, distance thresholds can be used
to lower the impact of inter-object symmetry. Unlike the σ scale parameter in generalized
symmetry, the distance thresholds act like a bandpass filter, capable of limiting symmetry
detection to a range of object widths. The execution time of detection can be dramatically
reduced by setting limits on the orientation range of detection.
On the other hand, this chapter also highlighted the limitations of fast symmetry. Due
to the lack of high level knowledge prior to detection, fast symmetry will detect both
object and non-object symmetries. Adjusting the distance thresholds can help remedy
this problem. Higher level processes, such as Kalman filtering, should also be equipped
with the means to disambiguate the origins of detected symmetry lines. Also, a priori
expectations of an object’s orientation can be used to limit the angular range of detection
to improve robustness and speed.
The global nature of fast symmetry was compared against the multi-scale nature of the
generalized symmetry transform. The comparison showed that fast symmetry is able to
find symmetry lines in images that confuse generalized symmetry. Execution times of the
two methods showed that fast symmetry deserves its titular adjective fast, performing
symmetry detection thousands of times faster than generalized symmetry. However, the
symmetry lines returned by fast symmetry carry less information than the symmetry maps
produced using generalized symmetry, which contain spatial symmetry strengths and local
attentional features. Overall, fast symmetry seems more suited to robotic applications,
where detection speed and robustness to non-uniform lighting are crucial to operational
success.
As mentioned at the beginning of this chapter, real time detection is a major motivation
behind the development of fast symmetry. The C++ implementation of the fast symme-
try algorithm was the first bilateral symmetry detector to practically demonstrate real
time performance on high resolution videos of real world scenes. Since its development,
fast symmetry has been applied to a variety of time-critical applications. The results of
applying fast symmetry to the problems of object segmentation, stereo triangulation and
real time object tracking are presented in upcoming chapters.
Our notion of symmetry is derived from the human
face. Hence, we demand symmetry vertically and in
breadth only, not horizontally nor in depth
Blaise Pascal
4
Sensing Objects in Static Scenes
4.1 Introduction
Chapter 3 introduced the fast bilateral symmetry detection method. This symmetry de-
tection method is the backbone of the robot’s vision system, providing low-level symmetry
features to other vision methods. This chapter details two methods that make use of de-
tected symmetry to segment and triangulate objects in static scenes. These higher level
methods are model-free, requiring no offline training prior to operation. Both methods
rely solely on edge information, which is visually orthogonal to colour and intensity in-
formation. As such, other vision approaches using orthogonal visual information can be
applied synergetically with the methods presented in this chapter.
The proposed object segmentation method automatically segments the contours of bilater-
ally symmetric objects from a single input image. The contour segmentation is performed
using a dynamic programming approach. The segmentation method is able to operate on
new objects as it does not rely on existing models of target objects. The segmentation
method has low computational cost, operating at 20 frames per second on 640×480 input
images. The segmentation research was carried out in collaboration with Alan M. Zhang,
who designed and implemented the dynamic programming portion of the segmentation
method. The segmentation research, available in press as [Li et al., 2006] and [Li et al.,
2008], is detailed in Section 4.2.
The triangulation method uses stereo vision to localize object symmetry lines in three
dimensions. By detecting symmetry in images obtained using a calibrated stereo vision
system, pairs of symmetry lines are triangulated to form symmetry axes. These three
dimensional symmetry axes are especially useful when dealing with surfaces of revolution.
The symmetry triangulation method does not assume reliable surface pixel information
across stereo views. This allows the method to triangulate objects that are difficult to
handle with traditional approaches, such as those with transparent and reflective surfaces.
The symmetry triangulation method is available in press as [Li and Kleeman, 2006a]. The
triangulation approach is detailed in Section 4.3.
4.2 Monocular Object Segmentation Using Symmetry
4.2.1 Introduction
Object segmentation methods attempt to divide data into subsets representing useful
physical entities. Methods using monocular vision try to group spatially coherent image
pixels into objects. In terms of prior knowledge requirements, object segmentation lies
between object recognition and image segmentation. Object recognition methods require
object models to find instances of objects in the input image. These object models are
acquired offline prior to recognition, usually from training conducted on positive and
negative images of target objects. For example, multiple boosted Haar cascades [Viola and
Jones, 2001], individually trained for each target object, can be used to robustly detect and
recognize objects in a multi-scale manner. Image segmentation is on the opposite end of the
prior knowledge spectrum. It uses context-independent similarity between image pixels,
such as colour and intensity, to segment image regions. Image segmentation methods are
surveyed in [Pal and Pal, 1993] and [Skarbek and Koschan, 1994], the latter focusing on
approaches using colour. Object segmentation provides useful mid-level information about
an image, bridging the gap between low level and high level knowledge.
To improve the generality and flexibility of segmentation, the proposed approach assumes
minimal background knowledge, leaning towards image segmentation methods. Similar
to the generalized Hough transform [Ballard, 1981], the core assumption is that target
objects all share a common parameterizable feature. For the case of the generalized Hough
transform, the parameterizable feature may be straight lines, circles or more complicated
shapes. Bilateral symmetry is the main feature used to guide segmentation in the proposed
approach. This allows a robot using the symmetry-based segmentation method to operate
without relying on object models. In fact, the segmentation results can potentially be
used as training data to build models for new objects.
Symmetry-aided segmentation has been investigated in [Gupta et al., 2005]. Their method
uses symmetry to augment the affinity matrix in a normalized cuts segmentation. Nor-
malized cuts produces accurate segmentations but has a high computational cost. As their
approach assumes symmetry of pixel values within an object’s contour, the segmentation
of transparent objects and objects with asymmetric texture is impossible. Also, their ap-
proach is not robust against strong shadows or specular reflections commonly found in real
world images due to non-uniform lighting. The drawbacks of requiring symmetry between
local image gradients are similar to those previously encountered with the generalized
symmetry transform in Section 3.5.2.
Object segmentation is defined as the task of finding contours in an input image belonging
to objects in the physical world. This definition implicitly removes the aforementioned
assumption of symmetric image gradients, an assumption which is problematic in real
world images. An object contour is defined as the most continuous and symmetric contour
about a detected symmetry line. This definition of object segmentation allows a robust
and low computational cost solution to the segmentation problem at the expense of limiting
segmentation targets to those with visual symmetry.
Object contours can be detected from an image’s edge pixels. However, simply identi-
fying all edge pixels that voted for a detected symmetry line will produce broken con-
tours that may include noisy edge pixel pairs. As such, a more robust contour detection
method is required. Contour detection has a wide literature base in computer vision and
medical imaging. Existing methods generally make use of energy-minimizing snakes [Yan
and Kassim, 2004] or dynamic programming [Lee et al., 2001; Mortensen et al., 1992;
Yu and Luo, 2002]. The proposed approach departs from the norm by removing the need
for manual initialization, such as the specification of control points or hand-drawn curves.
This added level of autonomy in segmentation is achieved by using dynamic programming
in combination with weights based on the bilateral symmetry of edge pixel pairs.
A single pass technique is used so that the segmentation method maintains a stable and
predictable execution time. The segmentation method consists of three steps. Firstly, a
preprocessing step, detailed in Section 4.2.2, rejects asymmetric edge pairs and weights
the remaining edge pairs according to their level of symmetry. This is followed by a
dynamic programming step to produce a continuous contour. Finally, this contour is
refined, allowing for slight asymmetry, so that the contour passes over the object’s edge
pixels. The latter two steps are detailed in Section 4.2.3.
4.2.2 The Symmetric Edge Pair Transform
The symmetric edge pair transform, herein referred to as SEPT, is a preprocessing step
applied to the input edge pixels prior to dynamic programming contour detection. To save
computational cost, the edge pixels used by fast symmetry detection are reused as input
data. The edge image is rotated so that the object’s symmetry line becomes vertical prior
to applying SEPT. The SEPT is detailed in Algorithm 2.
The SEPT performs three main functions. Firstly, it removes edge pixel pairs that are
unlikely to belong to an object’s contour. Secondly, the remaining roughly symmetric edge
pixel pairs are transformed based on their level of symmetry, parameterizing them into the
SeptBuf buffer. A dynamic programming step then operates on this buffer. Thirdly, and
quite subtly, the SEPT also resolves ambiguities in overlapping edge pixel pairs. These
major functions are detailed below.
Rejecting Edge Pixel Pairs
Two criteria are used to reject edge pixels after pairing. Firstly, edge pixel pairs that
are too far apart are removed. This is performed on line 6 of Algorithm 2 by comparing
the pairing distance against a threshold MAX_w. Recall that the fast symmetry detection
method, detailed in Algorithm 1, has a threshold D_max governing the maximum width
Algorithm 2: Symmetric edge pair transform (SEPT)
Input:
E – Edge image
Sym – Object symmetry line x-coordinate
Output:
SeptBuf – SEPT buffer
Parameters:
MAX_w – Maximum expected width of symmetric objects
MAX_mid – Maximum distance between midpoint of edge pair and Sym
W(d) = 1 − d / (2 × MAX_mid)

1:  SeptBuf[ ][ ] ← −1
2:  foreach row r in E do
3:      foreach edge pixel pair in r do
4:          p ← x-coordinates of edge pixel pair
5:          w ← distance between pixels in p
6:          if w > MAX_w then
7:              SKIP TO NEXT PAIR
8:          d ← distance between the midpoint of p and Sym
9:          if d < MAX_mid then
10:             if SeptBuf[r][CEIL(w/2)] < W(d) then
11:                 SeptBuf[r][CEIL(w/2)] ← W(d)
of edge pairs. Edge pixel pairs wider than D_max do not contribute any votes to detected
symmetry. As such, MAX_w is usually set to D_max to prevent the inclusion of non-object
edge pixels in the final contour.
Secondly, edge pixel pairs that are not roughly symmetric about an object’s symmetry
line are removed. This is done by comparing the deviation between their midpoints and
the symmetry line against a threshold MAX_mid. This threshold is in the order of several
pixels so that small deviations in the object contour from perfect symmetry are tolerated
by SEPT.
Edge Pixel Weighting and SEPT Buffer
Apart from using the midpoint deviation d to reject asymmetric edge pixels, this deviation
value is also used to calculate the symmetry weight of the remaining edge pixel pairs. The
weighting function W(•) is monotonically decreasing so that large deviations from perfect
symmetry result in low weights. After calculating the weight of an edge pixel pair, it is
placed into the SEPT buffer SeptBuf at indices (r, w/2), as described by lines 8 to 11
of the algorithm. The vertical coordinate, r, is simply the current row the algorithm is
operating on. The horizontal coordinate is taken as the half-width of an edge pixel pair,
rounded towards the higher integer using a ceiling function CEIL().
Figure 4.1(b) is an image visualization of the SEPT buffer. The floating-point weights
in SeptBuf have been converted to pixel intensities. Buffer cells with strong symmetry
weights are given bright pixel values. The reverse is true for low symmetry cells, which
are assigned dark values. Buffer cells with −1 weight, indicating edge pixel pair rejection,
are coloured black. To improve visibility, only the portion of the SEPT buffer containing
the object is shown. Note that there remain many edge pixel pairs that do not belong
to the bottle’s object contour. The object contour returned by dynamic programming is
shown in Figure 4.1(c).
Resolving Ambiguity of Overlapping Edge Pixel Pairs
The monotonically decreasing weighting function W(•) serves another purpose. Due to
the asymmetry MAX_mid allowed in d, multiple edge pixel pairs with equal separation
distance will all parameterize into the same SEPT buffer cell. This ambiguity is best
illustrated using a simple numerical example. Let us assume the object’s symmetry line has
an x-coordinate of Sym = 5. Also, two edge pixel pairs have x-coordinates of p0 = (2, 8)
and p1 = (3, 9). Notice that both edge pixel pairs are separated by 6 pixels, which means
that both pairs will attempt to place their weights into the same cell of the SEPT buffer.
This ambiguity is resolved by only keeping the larger weight. This favours the edge
pixel pair with the minimum midpoint-to-symmetry-line deviation. In this example, p0 has
a deviation of d = (2 + 8)/2 − 5 = 0 and p1 has d = (3 + 9)/2 − 5 = 1. Therefore, the final
value in the SEPT buffer will be the weight calculated using edge pixel pair p0, as it is more
symmetric. Algorithmically, this is performed by the if statement on line 10.
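
A compact C++ sketch of the SEPT is given below, assuming the edge image has already been rotated so that the object symmetry line (at x = sym) is vertical and is supplied as per-row lists of edge x coordinates; the container choices and names are illustrative assumptions rather than the thesis implementation.

#include <cmath>
#include <vector>

// Symmetric edge pair transform (Algorithm 2). edgeRows[r] holds the x
// coordinates of edge pixels in image row r. Cells left at -1 mark (row,
// half-width) combinations with no accepted edge pixel pair.
std::vector<std::vector<float>> computeSept(
    const std::vector<std::vector<int>>& edgeRows,
    float sym, float maxW, float maxMid)
{
    const int halfWidthBins = static_cast<int>(std::ceil(maxW / 2.0f)) + 1;
    std::vector<std::vector<float>> septBuf(
        edgeRows.size(), std::vector<float>(halfWidthBins, -1.0f));

    for (std::size_t r = 0; r < edgeRows.size(); ++r) {
        const auto& row = edgeRows[r];
        for (std::size_t i = 0; i + 1 < row.size(); ++i) {
            for (std::size_t j = i + 1; j < row.size(); ++j) {
                const float w = std::fabs(static_cast<float>(row[j] - row[i]));
                if (w > maxW)                       // reject pairs that are too wide
                    continue;
                const float d = std::fabs(0.5f * (row[i] + row[j]) - sym);
                if (d >= maxMid)                    // reject overly asymmetric pairs
                    continue;
                const float weight = 1.0f - d / (2.0f * maxMid);   // W(d)
                const int c = static_cast<int>(std::ceil(w / 2.0f));
                // Overlapping pairs: keep only the most symmetric (largest) weight.
                if (septBuf[r][c] < weight)
                    septBuf[r][c] = weight;
            }
        }
    }
    return septBuf;
}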
4.2.3 Dynamic Programming and Contour Refinement
The object contour is extracted from the SEPT buffer using a dynamic programming (DP)
method. Using the SEPT buffer as input, the dynamic programming algorithm generates
a table of contour continuity scores. The table is the same size as the SEPT buffer. High
scoring cells of the table indicate high continuity in the SEPT buffer. As the SEPT buffer is
basically a symmetry-weighted edge image, detecting the most continuous contour within
the DP score table also implicitly enforces contour symmetry. Note that this approach
differs from traditional DP convention as it uses a table of rewards instead of costs.
The details of the DP method are presented in Algorithm 3. Step 1 of the algorithm
calculates the score of a current cell from the cells in the row above it by scanning the SEPT
buffer vertically. Allowing for 8-connected contours, the maximum vertical continuity
score across three neighbour cells is retained. Step 2 of the algorithm performs the same
8-connected scan horizontally, calculating horizontal continuity scores from left to right.
Step 3 is a repeat of Step 2 but moving in the opposite direction.
(a) Input image and detected symmetry
(b) SEPT buffer (c) DP contour
Figure 4.1: Overview of object segmentation steps. The top image shows the symmetry line
returned by the fast symmetry detector. In the SEPT buffer image, higher weights are visualized as
brighter pixel values. The object contour extracted by dynamic programming (DP) has been dilated
and is shown in red.
Algorithm 3: Score table generation through dynamic programming
Input:
SeptBuf – SEPT buffer
Output:
sTab – Table of continuity scores (same size as SeptBuf)
backPtrV – Back pointer along vertical direction
backPtrH – Back pointer along horizontal direction
Parameters:
MAX_w – Maximum expected width of symmetric objects
{P_ver, R_ver} – Penalty and reward for vertical continuity
{P_hor, R_hor} – Penalty and reward for horizontal continuity

1:  sTab[ ][ ] ← 0
2:  for r ← 1 to ImageHeight do
3:      STEP 1: Vertical Continuity
4:      for c ← 1 to MAX_w / 2 do
5:          if SeptBuf[r][c] is not −1 then
6:              cost ← SeptBuf[r][c] ∗ R_ver
7:          else
8:              cost ← P_ver
9:          vScore[c] ← MAX{ 0, sTab[r−1][c−1] + cost, sTab[r−1][c] + cost, sTab[r−1][c+1] + cost }
10:         if vScore[c] > 0 then
11:             backPtrV[r][c] ← index of cell with max score
12:             backPtrH[r][c] ← backPtrV[r][c]
13:     STEP 2: Horizontal Continuity – Left to Right
14:     prevScore ← −∞
15:     for c ← 1 to MAX_w / 2 do
16:         if SeptBuf[r][c] is not −1 then
17:             cost ← SeptBuf[r][c] ∗ R_hor
18:         else
19:             cost ← P_hor
20:         hScore ← prevScore + cost
21:         if vScore[c] ≥ hScore then
22:             prevScore ← vScore[c]
23:             columnPtr ← c
24:         else
25:             prevScore ← hScore
26:         if prevScore > sTab[r][c] then
27:             sTab[r][c] ← prevScore
28:             backPtrV[r][c] ← {r, columnPtr}
29:     STEP 3: Horizontal Continuity – Right to Left
30:     Repeat the STEP 2 for-loop, with c ← MAX_w / 2 down to 1
Only the highest continuity score from all three steps is recorded in the score table. The
neighbour cell contributing the maximum continuity is recorded in the backPtrV array.
Due to multiple horizontal scans, it is possible for cycles to form within the rows of
backPtrV . This is resolved in the algorithm by making a copy of backPtrV prior to the
horizontal scans. This horizontal continuity pointer is used during backtracking when
travelling along horizontal portions of the contour.
Recall that in Figure 3.4 and its accompanying discussion, it was suggested that humans
generally consider bilaterally symmetric objects as those with contours roughly parallel
to the line of symmetry. To steer contour detection towards object-like contours, the
horizontal continuity reward is lower than the vertical reward in order to discourage wide
and flat contours. A lower horizontal reward also prevents high frequency zigzag patterns
in the final contour.
After generating the score table, the most continuous object contour is found by backtrack-
ing from the highest scoring cell to the lowest. The details of the backtracking method are
described in Algorithm 4. An example of a contour produced by backtracking is shown in
the left image of Figure 4.2. Notice that the segmentation approach is able to generate
a reasonable object contour in spite of the occlusion caused by the human hand covering
the bottle. Due to the tolerance for asymmetries introduced in the SEPT step, the DP
contour produced by backtracking does not correspond exactly to the locations of the
input edge pixels. For rough segmentations, the contour obtained thus far is sufficient.
However, to improve the accuracy of segmentation, refinement is performed to minutely
shift the contour onto actual edge pixels. The refinement step snaps the contour to the
nearest edge pixel within a distance threshold. The threshold is set to MAX_mid, which
governs the level of allowable asymmetry during the SEPT step. Note that the refined
contour, unlike the original, is allowed to have small asymmetries between its left and right
portions. The result of contour refinement is shown in the right image of Figure 4.2.
Algorithm 4: Contour detection by backtracking
Input:
sTab, backPtrV, backPtrH – Output of Algorithm 3
Output:
{R_c, C_c} – {Row, Column} indices of contour

1:  {R_c, C_c} ← indices of MAX(sTab)
2:  while sTab[r][c] ≠ 0 do
3:      {r, c} ← {R_c, C_c}
4:      {R_c, C_c} ← backPtrV[r][c]
5:      if R_c = r then
6:          {R_c, C_c} ← backPtrH[r][c]
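
The contour refinement step described before Algorithm 4 can be sketched as follows. This is one plausible interpretation of the nearest-edge snapping, assuming the backtracked contour is a list of (row, half-width) cells and the edge x coordinates are available per image row; it is illustrative rather than the thesis implementation.

#include <cmath>
#include <utility>
#include <vector>

// Snap a nominal x coordinate to the nearest edge x in the row, provided that
// edge lies within maxMid pixels; otherwise keep the nominal position.
static int snapToNearestEdge(const std::vector<int>& rowEdges, float nominalX, float maxMid)
{
    int best = static_cast<int>(std::lround(nominalX));
    float bestDist = maxMid;
    for (int x : rowEdges) {
        const float dist = std::fabs(x - nominalX);
        if (dist <= bestDist) {
            bestDist = dist;
            best = x;
        }
    }
    return best;
}

// contour[k] = {row, half-width} as produced by backtracking through the SEPT
// buffer. The left and right portions are refined independently, so the final
// contour may contain small asymmetries, as noted in the text.
void refineContour(const std::vector<std::pair<int, int>>& contour,
                   const std::vector<std::vector<int>>& edgeRows,
                   float sym, float maxMid,
                   std::vector<std::pair<int, int>>& leftPoints,
                   std::vector<std::pair<int, int>>& rightPoints)
{
    for (const auto& cell : contour) {
        const int r = cell.first;
        const float halfWidth = static_cast<float>(cell.second);
        leftPoints.emplace_back(r, snapToNearestEdge(edgeRows[r], sym - halfWidth, maxMid));
        rightPoints.emplace_back(r, snapToNearestEdge(edgeRows[r], sym + halfWidth, maxMid));
    }
}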
Figure 4.2: Object contour detection and contour refinement. The object outline returned by
dynamic programming backtracking is shown on the left. The refined contour is shown on the
right.
4.2.4 Segmentation Results
Figure 4.3 shows the segmentation result of a multi-colour mug. The symmetry line of
the object is detected using the fast symmetry detector. The entire segmentation process
is carried out automatically without any human intervention. The segmentation result
demonstrates the method’s robustness against non-uniform object colour. Note that the
edge image is quite noisy due to texture on the mug surface and high contrast text on the
book. This edge noise does not have any noticeable impact on the quality of segmentation.
However, due to the symmetry constraint placed on the segmentation method, the object
contour does not include the mug’s handle.
(a) Input image
(b) Object contour
Figure 4.3: Segmentation of a multi-colour mug. The refined object contour has been dilated
and is shown in purple overlayed on top of the edge image. The object’s detected symmetry line is
shown in yellow.
Next, the segmentation method is tested against a scene with multiple symmetric objects.
Reusing the test image in Figure 3.8, the segmentation method is applied to symmetry
lines 1, 3 and 5. The results of the segmentation are shown in Figure 4.4. Notice that
all three objects are segmented successfully. However, the contour of the multi-colour
mug is smaller than expected, containing only the mug’s opening. This is due to shadows
and specular reflections, which cause large gaps and asymmetric distortions in the edge
contour of the mug. Therefore, the highly symmetric and continuous elliptical contour
of the mug’s opening is returned by the segmentation method. Note also that there is a
slight distortion in the mug’s elliptical contour. This is caused by gaps in the outer rim
edge contour of the mug’s opening. These gaps reduced the continuity of the outer rim
contour, causing the object contour to include some portions of the inner rim.
Figure 4.4: Object segmentation on a scene with multiple objects. The images in the periphery
each show the detected contour overlayed on top of the edge image. The contours are rotated so
that the object’s symmetry line is vertical.
4.2.5 Computational Performance
The object segmentation software is implemented using C++. The experimental platform
is a desktop PC with an Intel Xeon 2.2GHz CPU. No processor-specific optimizations
such as MMX and SSE2 are used. No distance thresholds are applied to the segmentation
method. The maximum expected object width, MAX_w, is set to the image width. These
trials were conducted prior to obtaining a copy of the Intel C compiler. As such, unlike the
timing trials performed in Section 3.4.3, aggressive optimization settings were not used
during compilation of the segmentation source code.
Table 4.1 contains execution times of the segmentation method operating on 640 × 480
images. Note that the test images are the same as those used in the fast symmetry
detection time trials recorded in Table 3.1. The execution times include the contour
refinement step, which takes around 6ms on average. Fast symmetry detection execution
times are not included. The values in the column titled Number of edge pixel pairs are
the quantity of edge pixel pairs formed during SEPT processing. Given that further
optimizations are possible during compilation and the MAX_w distance threshold can be
lowered, the mean execution time suggests that the segmentation method can operate
comfortably at 30 frames per second.
Table 4.1: Execution time of object segmentation
Image number Number of edge pixel pairs Execution time (ms)
1 77983 30
2 142137 29
3 65479 22
4 68970 25
5 67426 43
6 44901 44
7 90104 32
8 133784 32
9 121725 48
10 177077 39
11 51475 38
OVERALL MEAN 94642 35
4.2.6 Discussion
The proposed segmentation method is a mid-level approach to obtain additional informa-
tion about symmetric objects. As the method does not require any object models prior
to operation, new symmetric objects can be segmented. This may allow the use of object
contours to gather training data so that recurring objects can be recognized. The
proposed method can segment multiple objects from a scene, assuming background sym-
metry lines are rejected prior to segmentation. It may also be possible to use the results
of segmentation to reject background symmetries by examining the length and width of
detected contours. The fast execution time of the segmentation method suggests that it
is well suited for time-critical situations.
The lack of prior information about objects is a double-edged sword. A recognition-based
system trained on hand-labelled and manually segmented images will outperform the proposed
method in terms of segmentation accuracy and robustness. However, the proposed
model-free approach frees a robotic system from mandatory offline training, which is very
time consuming for large object databases. Also, exhaustive modelling of all objects is
impractical for real world environments such as the household, where new objects may
appear sporadically without warning.
Asymmetric objects cannot be segmented using the proposed approach. This is accept-
able as visually orthogonal segmentation cues such as colour and shape can be used syn-
ergetically with symmetry. However, asymmetric portions of roughly symmetric objects,
such as cup handles, are also excluded from the final contour. Also, the disambiguation
of object and background symmetries remains unresolved. Additional visual information
and background knowledge are required to solve these problems robustly. In Chapter 6,
both problems are resolved by the careful and strategic application of autonomous robotic
manipulation to simultaneously reject background symmetries and segment objects. By
actively moving objects to perform segmentation, asymmetric portions of near-symmetric
objects are included in the segmentation results.
4.3 Stereo Triangulation of Symmetric Objects
4.3.1 Introduction
A plethora of stereo algorithms have been developed in the domain of computer vision.
Despite their many differences, the algorithmic process of stereo methods designed to
obtain three dimensional information can be broadly generalized into the following steps.
Firstly, the intrinsic and extrinsic parameters of the stereo camera pair are found through
a calibration step. Note that some stereo approaches do not require calibration or obtain
the camera calibration online during stereo triangulation.
After calibration, the next stage of most stereo algorithms can be described as correspon-
dence. This stage tries to pair portions of the left and right images that belong to the same
physical surface in a scene. The 3D surface location is usually assumed to be a Lambertian
surface that appears similar in both camera images. Once corresponding portions have
been found, their distance from the camera can be triangulated using the intrinsic and
extrinsic parameters.
Sparse or feature-based stereo approaches, more commonly used in wide baseline and
uncalibrated systems, triangulate feature points to obtain 3D information. Recent sparse
stereo approaches generally make use of affine invariant features such as maximally stable
extremal regions (MSER) [Matas et al., 2002]. Dense stereo methods attempt to find
correspondences for every pixel. Local patches along epipolar lines are matched between
the left and right camera images using a variety of distance metrics. Common matching
metrics include sum of squared differences (SSD) and normalized cross correlation (NCC).
Depending on the time available for processing, dense stereo approaches may also utilize a
variety of optimization methods to improve the pixel correspondences. Global optimization
approaches make use of algorithms such as dynamic programming or graph cuts to optimize
across multiple pixels and across multiple epipolar lines. Dense stereo approaches are
surveyed in [Scharstein and Szeliski, 2001; Brown et al., 2003].
The proposed stereo triangulation approach uses bilateral symmetry found using the fast
symmetry detector. As symmetry is a structural feature, not a surface feature, the infor-
mation returned by the proposed stereo method is different from existing approaches. In
contrast to the surface 3D information returned by dense and sparse stereo methods, the
proposed method returns a symmetry axis passing through the interior of the object. By
looking for intersections between a symmetry axis and a known table plane, bilaterally
symmetric objects can be localized in three dimensions. Also, the symmetry axis can be
used to bootstrap model-fitting algorithms by providing information about an object’s
pose.
Symmetry triangulation can also deal with objects that have unreliable surface pixel infor-
mation, such as those with reflective and transparent surfaces. These objects are difficult
to deal with using traditional stereo methods as these methods rely on the assumption
that a surface appears similar across multiple views. Symmetry triangulation makes use
of edge pixel information only, so this assumption is not necessary. As such, the approach
elegantly generalizes across symmetric objects of different visual appearances without re-
quiring any prior knowledge in the form of object models.
In the context of robotics, the method is especially useful when dealing with surfaces of
revolution. As discussed at the beginning of Chapter 3, the triangulated symmetry axis
of a surface of revolution is the same as its axis of revolution. Assuming uniform mass
distribution, the symmetry axis of a surface of revolution object will pass through its center
of mass. As such, robotic grasping force should be applied perpendicular to the symmetry
axis to ensure stability. The structural information implied by bilateral symmetry may
also be useful for determining object pose before and during robotic manipulation.
4.3.2 Camera Calibration
The stereo camera pair is calibrated using the MATLAB calibration toolbox [Bouguet,
2006]. Both intrinsic and extrinsic parameters are estimated during calibration. The
intrinsic parameters model camera-specific properties, such as focal length, pixel offset of
the optical center and radial lens distortion. The extrinsic parameters model the geometric
pose of the camera pair, providing the necessary translation and rotation matrices to
map one camera’s coordinate frame into the other. Note that there are some stereo
methods that do not require any prior camera calibration. Instead, these stereo methods
use triangulated features to recover camera calibration parameters during operation.
Figure 4.5 shows the extrinsics of the stereo cameras and the physical setup of the ex-
perimental rig. The stereo cameras on the robot’s head are positioned at roughly arm’s
length from the checkerboard pattern. At the top of the figure, the red triangles in the
plot each represent a camera’s field of view. The plot’s origin is located at the focal point
of the left camera. The cameras are verged towards each other to provide a larger overlap
between image pairs when viewing objects at arm’s reach. The vergence angle is roughly
15 degrees and the right camera is rotated slightly about its z axis due to mechanical
errors introduced by the mounting bracket.
4.3.3 Triangulating Pairs of Symmetry Lines
Due to the reduction from three dimensions down to a pair of 2D images, stereo corre-
spondence is not a straightforward problem. Apart from the obvious issue of occlusions,
where only one camera can see the point being triangulated, other problems can arise.
For example, specular reflections and non-Lambertian surfaces can cause the same physi-
cal location to appear differently in each stereo image, causing incorrect correspondences.
The proposed method attempts to provide a robust solution by using symmetry lines as
the primary feature for stereo matching. By using symmetry, which does not rely on
object surface information, reflective and transparent objects are able to be triangulated
successfully.
The 3D location of an object’s symmetry axis is triangulated using the following method.
Firstly, the symmetry line is projected out from the camera’s focal point. The projection
forms a semi-infinite triangular plane in 3D space. This projection is done for both camera
images using their respective detected symmetry lines. After this, the line of intersection
between the triangular planes emanating from each camera is calculated. Assuming the
result is not undefined, the triangulation result is simply this line of intersection.
The projection of a symmetry line into a triangular plane in 3D space is performed as
follows. The first step is to locate the end points of the symmetry line in the camera image.
This begins by finding the pivot point on the symmetry line. A line joining the image
center with the pivot point has length equal to the symmetry line’s radius parameter. The
adjoining line is perpendicular to the symmetry line. Note that the following mathematics
assume that pixel indices begin at zero, not one.
(a) Extrinsics of verged stereo camera pair
(b) Experimental rig
Figure 4.5: Stereo vision hardware setup.
x_r = r cos θ + (w − 1)/2        (4.1a)
y_r = r sin θ + (h − 1)/2        (4.1b)
x_r and y_r are the horizontal and vertical pixel coordinates of the symmetry line’s pivot
point. r and θ are the polar parameters of the symmetry line. w and h are the width
and height of the camera image in pixels. The nearest image borders intersecting with the
symmetry line are found using a vector between the pivot point and the image center.
d_A = MIN{ x_r / sin θ,  (x_r − w + 1) / sin θ,  −y_r / cos θ,  (h − 1 − y_r) / cos θ }        (4.2a)
d_B = MIN{ −x_r / sin θ,  (w − x_r − 1) / sin θ,  y_r / cos θ,  (1 + y_r − h) / cos θ }        (4.2b)
d_A and d_B are the distances from the pivot point to the nearest image borders along the
symmetry line, which locate its end points. The special cases where division by zero occurs
due to the sin θ and cos θ denominator terms are dealt with programmatically by removing
them from the MIN function.
p_A = { −d_A sin θ + x_r ,  d_A cos θ + y_r }        (4.3a)
p_B = {  d_B sin θ + x_r , −d_B cos θ + y_r }        (4.3b)
The end points of the symmetry line are found by calculating p_A and p_B. Note that this
process is repeated twice for each symmetry line pair, to obtain two pairs of end points,
one for the left camera and one for the right camera. Note that the end point calculations
described here are also used in the visualization code for drawing detected symmetry lines
onto input images.
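To make the above concrete, the end point calculation of Equations 4.1 to 4.3 can be sketched in a few lines of C++. This is an illustrative sketch rather than the thesis implementation; it assumes θ is given in radians and skips negative or non-finite border distance candidates, which is one way of realizing the programmatic handling of the zero-denominator cases described above.

#include <cmath>
#include <initializer_list>
#include <limits>

struct Pt { float x, y; };

// Sketch only: end points of a symmetry line with polar parameters (r, theta)
// in a w-by-h image, following Equations 4.1 to 4.3. Pixel indices start at 0.
void symmetryLineEndPoints(float r, float theta, int w, int h, Pt& pA, Pt& pB)
{
    const float s = std::sin(theta);
    const float c = std::cos(theta);

    // Pivot point (Equation 4.1): offset r from the image centre,
    // perpendicular to the symmetry line.
    const float xr = r * c + 0.5f * (w - 1);
    const float yr = r * s + 0.5f * (h - 1);

    // Smallest positive, finite candidate distance; zero-denominator and
    // negative candidates are skipped (an interpretation of the MIN in
    // Equation 4.2).
    auto nearest = [](std::initializer_list<float> candidates) {
        float best = std::numeric_limits<float>::max();
        for (float d : candidates)
            if (std::isfinite(d) && d > 0.0f && d < best)
                best = d;
        return best;
    };

    const float dA = nearest({ xr / s, (xr - w + 1) / s,
                               -yr / c, (h - 1 - yr) / c });
    const float dB = nearest({ -xr / s, (w - xr - 1) / s,
                               yr / c, (1 + yr - h) / c });

    // End points (Equation 4.3): pivot plus/minus a distance along the line.
    pA = { -dA * s + xr,  dA * c + yr };
    pB = {  dB * s + xr, -dB * c + yr };
}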
Next, the symmetry line end points are normalized according to the camera’s intrinsics be-
fore projecting them to form a triangular plane. This process removes image centering off-
sets and normalizes the focal length of both cameras to unity. Radial lens distortion is also
taken into account in the normalization. The normalization code, implemented in C++, is
based on the MATLAB Calibration Toolbox routine found in comp_distortion_oulu.m.
The normalized end points of the symmetry lines are then projected out into 3D space to
form a plane. The projection of a single point is performed as follows.
pp = z_max [ p_norm  1 ]^T        (4.4)
The constant z_max governs the depth of the triangular plane projected from the camera.
In the experiments presented in this section, z_max is set to 2.0m so that symmetry axes
of objects well out of arm’s reach of the robot do not appear in the triangulation results.
A ‘1’ is appended to the normalized symmetry end point location, p_norm, to produce the
homogeneous coordinate vector necessary for projection. The result of the projection, pp,
is a point in 3D space.
The end points of the left and right camera symmetry lines are projected into 3D space.
Note that the projected end points of the left symmetry line have a different frame of
reference to those projected from the right symmetry line. The projected points of the
left symmetry line are brought into the right camera coordinate frame using the following
matrix equation.
pp_right = R_c pp_left + T_c        (4.5)
The 3 × 3 rotation matrix R_c and the translation vector T_c are the extrinsic parameters
between the left and right camera coordinate frames obtained from stereo camera calibration. Now
that all projected points are in the right camera coordinate frame, symmetry triangulation
can be performed. Two triangular planes are formed by linking each camera’s focal points
with the projected end point pairs of their respective symmetry lines.
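A minimal sketch of Equations 4.4 and 4.5 is given below, assuming the end points have already been normalized using the camera intrinsics (unit focal length, optical centre removed, lens distortion compensated). The function and type names are illustrative only and do not correspond to the thesis source code.

#include <array>

using Vec3 = std::array<float, 3>;
using Mat3 = std::array<std::array<float, 3>, 3>;

// Equation 4.4: project a normalized image point out to depth zMax (2.0 m in
// the experiments) by scaling its homogeneous coordinates.
Vec3 projectPoint(float xNorm, float yNorm, float zMax = 2.0f)
{
    return { zMax * xNorm, zMax * yNorm, zMax };
}

// Equation 4.5: map a projected point from the left camera frame into the
// right camera frame using the extrinsic rotation Rc and translation Tc.
Vec3 leftToRight(const Vec3& ppLeft, const Mat3& Rc, const Vec3& Tc)
{
    Vec3 ppRight{};
    for (int i = 0; i < 3; ++i) {
        ppRight[i] = Tc[i];
        for (int j = 0; j < 3; ++j)
            ppRight[i] += Rc[i][j] * ppLeft[j];
    }
    return ppRight;
}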
Tomas Moller’s triangle intersection method [Moller, 1997] is used to find the intersection
between the two triangular planes. The entire symmetry triangulation method is imple-
mented using C++. The compiled binary runs at 5 frame-pairs-per-second on a Pentium M
1.73GHz laptop PC when operating on pairs of 640×480 images. This frame rate includes
Canny edge detection and fast bilateral symmetry detection for both stereo images. Note
that the Intel C Compiler was not available during the development of symmetry triangu-
lation software, so higher frame rates will be achieved with more aggressive optimization
during compilation.
4.3.4 Experiments and Results
Triangulation experiments were carried out on six symmetric objects with different visual
appearances. All six test objects can be seen in Figure 4.6. The test objects include
low texture, reflective and transparent objects, all of which are challenging for traditional
stereo methods.
For each test object, 4 image pairs are used as test data, resulting in a total of 24 image
pairs. The data set of the multi-colour mug is shown in Figure 4.7. The test object’s
symmetry line is centered on each of the 4 outer corners of the checkerboard pattern. Test
images are taken with the stereo cameras looking down at the checkerboard, with the
object roughly at arm’s length. This is to simulate the kind of object views a humanoid
robot would encounter when interacting with objects on a table using its manipulator.
(a) White cup (b) Multi-colour mug
(c) Textured bottle (d) White bottle
(e) Reflective can (f) Transparent bottle
Figure 4.6: Test objects used in the triangulation experiments.
(a) Location 1
(b) Location 2
(c) Location 3
(d) Location 4
Figure 4.7: Example stereo data set – Multi-colour mug. The left camera image is shown on the
left for each location’s stereo pair.
Figure 4.8 shows the stereo triangulated symmetry axes of the reflective metal can. The
red lines are the triangulated symmetry axes of the can when it is placed at the four outer
corners of the checkerboard pattern as demonstrated in Figure 4.7. The blue dots are the
corners of the checkerboard. The stereo camera pair can be seen in the upper left of the
figure.
Figure 4.8: Triangulation results for the reflective metal can in Figure 4.6(e).
4.3.5 Accuracy of Symmetry Triangulation
To examine the accuracy of symmetry triangulation, each triangulated symmetry axis is
compared against a known ground truth location. To obtain ground truth, an additional
image pair is taken with no objects on the checkerboard. This is done for each data set,
to ensure that small movements of the checkerboard when changing test objects does not
adversely affect the accuracy of ground truth data. The corner locations of the checker-
board in 3D space are found by standard stereo triangulation using the camera calibration
data. The four outer corner locations are used as ground truth data in the triangulation
accuracy measurements.
The table geometry is approximated by fitting a Hessian plane model to all triangulated
corner locations. Standard least mean squares fitting is used. The plane Hessian provides
the 3D location of the table on which the test objects are placed. Using the plane model,
an intersection between the object’s triangulated symmetry line and the table plane is
found for each test image pair. The Euclidean distance between this point of intersection
and the ground truth corner location is used as the error metric.
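The error metric can be expressed compactly as a line-plane intersection followed by a Euclidean distance. The following sketch assumes the fitted table plane is given in Hessian form (unit normal n with n·x = d) and that the symmetry axis is parameterized by a point p and a direction v; it is illustrative only, not the thesis implementation.

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

static double dot(const Vec3& a, const Vec3& b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// Intersection of the symmetry axis x = p + t*v with the table plane n.x = d.
// Assumes the axis is not parallel to the table plane.
Vec3 axisPlaneIntersection(const Vec3& p, const Vec3& v, const Vec3& n, double d)
{
    const double t = (d - dot(n, p)) / dot(n, v);
    return { p[0] + t * v[0], p[1] + t * v[1], p[2] + t * v[2] };
}

// Euclidean distance between the intersection point and the ground truth
// checkerboard corner, in the same units as the inputs (millimetres here).
double triangulationError(const Vec3& intersection, const Vec3& corner)
{
    const Vec3 e = { intersection[0] - corner[0],
                     intersection[1] - corner[1],
                     intersection[2] - corner[2] };
    return std::sqrt(dot(e, e));
}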
The following steps are used to measure the triangulation accuracy using the error metric.
Firstly, the top five symmetry lines are found for each image in a stereo pair. Next,
all possible pairings between symmetry lines from the left and right camera images are
generated. Stereo triangulation is performed on all possible symmetry line pairs. A
triangulated symmetry axis is ignored if it is too far from the camera pair, as it cannot
be reached by the robot. Triangulated symmetry axes that are not orientated within ±15
degrees of the checkerboard’s surface normal are also ignored. If no valid symmetry axes
are found, the triangulation is considered to have failed. Only the intersection points
between the remaining valid symmetry lines and the table plane are recorded. After
obtaining a list of intersection points for all test image pairs, the triangulation error is
measured using the aforementioned error metric. In the case where multiple symmetry
axes are found for a single ground truth datum, the symmetry axis closest to ground truth
is used to calculate the triangulation error.
Table 4.2 shows the mean triangulation error for the test images. The mean error is
calculated across four triangulation attempts, one at each outer corner of the checkerboard
pattern. There is only a single triangulation failure among the 24 test image pairs. The
failed triangulation occurred for Location 3 in Figure 4.7. The failed triangulation is due
to self occlusion caused by the mug’s handle, which severely disrupted the object contour
in the left camera image. This disruption resulted in failed symmetry detection for the
left camera image.
Table 4.2: Triangulation error at checkerboard corners
Object               Mean error (mm)
White cup            13.5
Multi-colour mug     6.8 *
White bottle         10.7
Textured bottle      12.4
Reflective can       4.5
Transparent bottle   14.9
* Triangulation failed for location 4
The mean error of the successful triangulations is 10.62mm, with a standard deviation of
7.38mm. On average, only 1.5 valid symmetry axes are generated per test image pair. As
five symmetry lines are detected for each test image, 25 pairing permutations are given to
the triangulation method for each image pair. The small number of valid symmetry axes
returned by symmetry triangulation suggests that it is possible to reject most non-object
symmetry axes by using the aforementioned distance and orientation constraints.
4.3.6 Qualitative Comparison with Dense Stereo
Dense stereo approaches provide 3D information about the surface of an object. This is
different from the symmetry axes returned by symmetry triangulation, which is a structural
feature of an object. A symmetry axis is always inside an object, and usually passes near
its center of volume. In this section, a qualitative comparison is performed between dense
stereo disparity and symmetry axes returned by symmetry triangulation. Sparse stereo
approaches are excluded from this comparison as they generally do not rely on camera
calibration, which unfairly biases the comparison in favour of symmetry triangulation.
Disparity maps are generated using the dense stereo C++ code from the Middlebury
Stereo Research Lab [Scharstein and Szeliski, 2001]. The input images are rectified using
the MATLAB calibration toolbox prior to disparity calculations. After testing multiple
cost functions and optimization approaches, the sum of squared differences (SSD) using
15×15 windows is found to produce the best disparity results for the test image set. Global
optimization methods, such as dynamic programming, did not provide any significant
improvements. Figures 4.9 to 4.11 contain the disparity results for several test objects.
Darker pixels have lower disparity, meaning that the locations they represent are further
from the camera. The object’s location in the disparity map is marked with a red rectangle.
Note that grayscale images are used to generate the disparity maps.
In Figure 4.9, the textured bottle’s curvature and glossy plastic surface, combined with
non-uniform lighting, can cause the same surface to appear differently across camera im-
ages due to specular reflection. The transparent bottle appears as a distorted version of
its background in Figure 4.10, with its appearance changing between viewpoints. There
are also many specular reflections on the surface of the bottle. These visual issues violate
the similarity assumption employed in dense stereo. Therefore, the disparity results for
the object’s surface are very inconsistent. Similarly, Figure 4.11 shows that the reflective
can acts as a curved mirror that reflects its surroundings. This again results in different
surface appearances in the left and right camera images, which leads to poor disparity
results. All three objects can be triangulated using bilateral symmetry. Looking at the
dense stereo disparity maps, it is difficult to imagine a method that can recover the object
location or structure.
(a) Left image
(b) Right image
(c) Dense Stereo Disparity
Figure 4.9: Dense stereo disparity result – Textured bottle. The bottle’s location is enclosed by a
red rectangle in the disparity map.
(a) Left image
(b) Right image
(c) Dense Stereo Disparity
Figure 4.10: Dense stereo disparity result – Transparent bottle. The bottle’s location is enclosed
by a red rectangle in the disparity map.
(a) Left image
(b) Right image
(c) Dense Stereo Disparity
Figure 4.11: Dense stereo disparity result – Reflective can. The can’s location is enclosed by a
red rectangle in the disparity map.
4.3.7 Discussion
The proposed stereo triangulation approach is conceptually different from dense and sparse
stereo methods. Firstly, the approach does not rely on an object’s surface visual appear-
ance. Instead, a structural feature, bilateral symmetry, is used to perform triangulation.
Recall that the fast symmetry detector only relies on edge pixels as input. As a result, sym-
metry triangulation is able to operate on objects with unreliable surface pixel information,
such as those that are transparent and reflective.
Secondly, in the context of detecting and locating objects on a table, dense and sparse
stereo approaches only provide 3D location estimates of surfaces. Further model fitting is
needed to obtain structural information of objects. By only targeting bilaterally symmetric
objects, symmetry triangulation is able to obtain some structural information without
relying on object models. While geometric primitives can be used to represent objects
for the purpose of robotic manipulation [Taylor, 2004], objects such as the tall white
cup in Figure 4.6(a) are difficult to model as a single primitive. In contrast, symmetry
triangulation is able to localize this object accurately, assuming the geometry of the table
plane is known. The table plane geometry can be recovered dynamically by performing
robust plane fitting to stereo disparity information.
Triangulated symmetry axes also provide useful information with regards to the planning
and execution of robotic manipulations. Recall that the triangulated symmetry axis of
an object always lies within the volume enclosed by its surface. For solid objects, a symmetry axis provides useful
hints for robotic manipulation without resorting to higher level knowledge. For example,
grasping force should generally be applied radially inward, in a manner perpendicular to
an object’s symmetry axis. This grasping strategy is especially applicable for surface of
revolution objects such as cups and bottles.
Symmetry triangulation using stereo has two main limitations. Firstly, triangulation can
only be performed on objects exhibiting bilateral symmetry in two camera views. This re-
striction suggests that the method will be most reliable when operating on objects that are
surfaces of revolution. Even with this limitation, a large variety of household objects such
as cups, bottles and cans, can be triangulated by their symmetry. Also, the edge features
used by the fast symmetry detector are visually orthogonal to the pixel information used
in dense stereo and most sparse stereo approaches. As such, symmetry triangulation can
be applied synergetically with other stereo methods to improve a vision system’s ability
to deal with symmetric objects, especially those with transparent or reflective surfaces.
Secondly, symmetry triangulation does not explicitly address the problem of correspon-
dence between symmetry lines in the left and right camera. As the experiments only dealt
with scenes containing a single symmetric object, this was a non-issue. However, in a scene
with multiple symmetric objects, pairings between symmetry lines belonging to different
objects can occur. Triangulation of these pairs may produce phantom symmetry axes that
do not correspond to any physical object.
The majority of phantom symmetries can be rejected by examining their location and ori-
entation. For a robot dealing with objects on a table, symmetry axes outside its reachable
workspace can be safely ignored. Symmetry axes that do not intersect the table plane in a
roughly perpendicular manner can also be rejected as they are unlikely to originate from
an upright bilaterally symmetric object. The remaining phantom symmetries are difficult
to reject without using additional high level information such as object models or expected
object locations. However, robotic action can be used to reject phantom symmetry axes
without resorting to the use of object models. Chapter 6 shows that a robotic nudge ap-
plied to a symmetry axis can detect the presence of an object and simultaneously obtain
a segmentation of the detected object.
Since the triangulation experiments documented in this section, the accuracy of symmetry
triangulation has been improved. The current implementation operates with an error of
around 5mm, half of the previous average. This reduction in error is due primarily to
the use of sub-pixel refinement in the fast symmetry detection process. Also, taking
the average result of multiple triangulations will produce a centroid with better error
characteristics.
4.4 Chapter Summary
This chapter detailed two methods that make use of detected bilateral symmetry to sense
objects in static scenes. The first method uses dynamic programming to perform segmen-
tation on a single image, extracting contours of symmetric objects from the image’s edge
pixels. Timing trials of the segmentation method suggest comfortable operation at 30
frames per second. The second method uses two images obtained from a pair of calibrated
stereo cameras as input data. Symmetry lines detected in the left and right camera im-
ages are paired and triangulated to form symmetry axes in three dimensional space. Neither
method requires any prior object models before operation. Also, both methods are
able to robustly deal with objects that have reflective and transparent surfaces.
This chapter also highlighted several ways to disambiguate between object and background
symmetries. For example, the angle between a symmetry axis and the table plane can be
used to reject unwanted symmetries. The next chapter details a real time object tracker.
The tracker is able to estimate the symmetry line of a moving object while rejecting static
background symmetries by using motion and symmetry synergetically.
Time is an illusion, lunchtime doubly so.
Douglas Adams
5
Real Time Object Tracking
5.1 Introduction
The previous chapter introduced two computer vision methods that exploit detected
symmetry to segment and triangulate static objects. Many robotic tasks, such as object
manipulation, demand the sensing of moving objects. These tasks require real time sensing
in order to keep track of the target object while providing time-critical feedback to the
robot’s control algorithms. This chapter details research on real time object tracking. The
proposed tracking method makes use of motion and detected symmetry to estimate an
object’s location rapidly and robustly. The symmetry tracker also returns a segmentation
of the tracked object in real time. Additionally, the C++ implementation of the proposed
method is able to operate at over 40 frames per second on 640×480 images. The symmetry
tracker was first published in [Li and Kleeman, 2006b], with additional experiments and
analysis later published in [Li et al., 2008].
As with the segmentation and symmetry triangulation methods, the proposed tracking
method is model-free. This allows the tracking of new objects, as long as the target has
a stable symmetry line. In order to perform tracking in real time, model-free approaches
use features that are computationally inexpensive to extract and match. For example,
[Huang et al., 2002] uses region templates of similar intensity. [Satoh et al., 2004] utilizes
colour histograms. Both approaches can tolerate occlusions, but are unable to handle
shadows and colour changes caused by variations in lighting. As symmetry measurements
are detected from edge pixels, the proposed tracker does not suffer from this limitation.
The main contribution of the research presented in this chapter is the use of bilateral
symmetry as an object tracking feature. This was previously impossible due to the pro-
hibitive execution times of symmetry detection methods, which has been overcome by
using the fast symmetry detector. Additionally, Kalman filter predictions are used to
limit the detection orientation range in order to further reduce execution time. On the
surface, using bilateral symmetry for object tracking seems very restrictive. However,
there are many objects with bilateral symmetry. Many man-made objects are symmetric
by design. This is especially true for container objects such as cups and bottles, which
are generally surfaces of revolution. Being visually orthogonal to tracking features such as
colour, the use of symmetry allows the tracker to operate on objects that are difficult for
traditional tracking methods. For example, the proposed method can track transparent
objects, which are difficult to deal with using other tracking methods.
5.2 Real Time Object Tracking
The system diagram in Figure 5.1 provides an overview of the proposed object tracker.
The tracker uses two time-sequential video images as input data. These images are la-
belled as Image_{T−1} and Image_T in the system diagram. The Block Motion Detector
module creates a Motion Mask representing the moving portions of the scene. The mo-
tion mask is used to reject edge pixels from static parts of the video, which helps prevent
the detection of background symmetries. The remaining edge pixels are given to the Fast
Symmetry Detector as input. Detected symmetry lines are provided to the Kalman Filter
as Measurements. The Posterior Estimate of the Kalman filter is the tracking result.
Figure 5.1: System diagram of real time object tracker.
Once tracking begins, the Kalman filter provides angle limits to the symmetry detection
module. These angle limits represent the orientations of symmetry lines the Kalman
filter considers to be valid measurements. These angle limits are generated based on the
filter’s prediction and the prediction covariance. The speed and robustness of symmetry
detection are greatly improved by using these prediction-based angle limits to restrict
detection orientation.
The symmetry tracker also returns a motion-based segmentation of the tracking target in
real time. The segmentation is performed by using the Kalman filter’s posterior estimate
to refine the motion mask produced by the block motion detector. The refinement process
results in a near-symmetric segmentation of the tracked object. The object segmentation
is returned as a Refined Motion Mask and a Bounding Box that encapsulates the mask.
5.2.1 Improving the Quality of Detected Symmetry
Recall from previous discussions that the fast symmetry detector can return non-object
symmetry lines due to inter-object symmetry or background features. Non-object sym-
metries can overshadow object symmetries, especially when the target object has a weak
edge contour. An example of this is provided in Figure 5.2(a). The top three symmetry
lines are labelled according to the number of Hough votes they received, with line 1 having
received the most votes. The weak edge contour of the transparent bottle results in its
symmetry line being ranked lower than a background symmetry line.
During tracking, the Kalman filter uses a validation gate to help reject unwanted sym-
metries, labelled as lines 1 and 3 in Figure 5.2(a). However, a more cluttered scene may
introduce so many additional symmetry lines that the target object’s symmetry line is not
detected at all. The repeated use of these noisy symmetry lines as measurements may
lead to Kalman filter divergence. To overcome this, the tracker employs several methods
to prevent the detection of noisy symmetry lines.
Firstly, the state prediction of the Kalman filter is used to restrict the orientation range
of detection. The filter’s θ prediction and its prediction covariance define the allowed
orientation range. This prevents the detection of symmetry lines
with orientations that are vastly different from the filter’s prediction. In the experiments
presented here, the angle limits are set at three standard deviations from the filter predic-
tion.
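As a rough sketch, the angle limits can be computed directly from the filter’s θ prediction and its predicted variance. The names below are illustrative, and angle wrap-around at the ends of the detection range is ignored for brevity.

#include <cmath>

struct AngleLimits { double lower, upper; };

// Three-standard-deviation window around the predicted symmetry line
// orientation, used to restrict the fast symmetry detector.
AngleLimits detectionAngleLimits(double thetaPredicted, double thetaVariance)
{
    const double margin = 3.0 * std::sqrt(thetaVariance);
    return { thetaPredicted - margin, thetaPredicted + margin };
}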
Figure 5.2(b) shows a visualization of the detection angle limits during typical tracker
operation. Using these angle limits to restrict detection orientation will remove symmetry
lines 1 and 3 in Figure 5.2(a), pushing the transparent bottle’s symmetry line to number
one in terms of Hough votes. Additionally, reducing the angular range of symmetry
detection also lowers the execution time linearly as previously shown in Figure 3.11.
(a) Detection of non-object symmetry
(b) Tracking angle limits
Figure 5.2: Using angle limits to reject non-object symmetry. The top figure shows a case where
non-object symmetries are detected ahead of object symmetries. The bottom figure visualizes the
angle limits generated from the Kalman filter prediction covariance.
However, restricting detection orientations using angle limits based on the Kalman filter’s
prediction will not reject noisy symmetries with similar orientations to the symmetry line
of the target object. These unwanted symmetries can arise from background features,
inter-object symmetries or from other symmetric objects. It is possible for the Kalman
filter to latch onto these noisy symmetry lines if the tracked object moves over them at
a low speed. This problem is difficult to overcome without relying on an object model of
the tracking target. To maintain a model-free approach, the tracker takes advantage of
object motion to remove edge pixels from static portions of an image. By doing this, the
majority of edge pixels contributing to non-target symmetries are rejected. Details of the
motion masking method are provided in the next section.
5.2.2 Block Motion Detection
Many methods of motion detection are available from computer vision literature. Optical
flow approaches, such as the Lucas-Kanade method [Lucas and Kanade, 1981], are able
to find spatial and orientation information of motion. Generally, they are employed in
situations where there is camera movement or large areas of background motion. However,
optical flow is computationally expensive to calculate for large images. Also, as the tracker
only needs to reject edge pixels from static portions of a video, the orientation information
provided by optical flow is superfluous. Given that the camera system is stationary during
operation and that real time operation is desired, optical flow techniques are not used
to perform motion detection.
Background subtraction approaches monitor image pixels over time to construct a model
of the scene. Pixels that significantly alter their values with respect to the model are
labelled as moving. Mixture of Gaussian models [Wang and Suter, 2005] are fairly robust
to small changes in background statistics such as the repetitive movement of tree branches
in the wind. However, background subtraction methods require a training period with
no object motion to build a background model. This will require idle training periods
between tracking sessions, which is problematic for situations where object motion cannot
be prevented or controlled. Another point of note is that background modelling approaches
are not suitable for dealing with transparent and reflective objects, as such objects take on the
visual characteristics of their surroundings.
The block motion detector uses a frame difference approach to obtain motion information.
The classic two-frame absolute difference [Nagel, 1978] is used as it has a very low compu-
tational cost. Colour video images are converted to grayscale before performing the frame
difference. The difference image is calculated by taking the absolute difference of pixel
values between time-adjacent frames. The block motion detection method is described in
Algorithm 5.
The block motion detection algorithm produces a blocky motion mask that segments object
motion. The algorithm processes the difference image I_DIFF using S_B × S_B square blocks.
The choice of block size is based on the smallest scale of motion that needs to be detected.
Algorithm 5: Block motion detection
Input:
  I_0, I_1 – Input images from time t−1, t
Output:
  I_MASK – Motion mask, same size as input images
Parameters:
  T_M – Motion threshold
  S_B – Block size
  I_DIFF – Difference image, same size as input images
  I_RES, I_SUM – Buffer images with sides 1/S_B the length of the input images

 1:  I_DIFF ← |I_1 − I_0|
 2:  I_SUM[ ][ ] ← 0
 3:  i ← 0
 4:  for ii ← 0 to height of I_SUM do
 5:      m ← i
 6:      i ← i + S_B
 7:      for increment m until m == i do
 8:          j ← 0
 9:          for jj ← 0 to width of I_SUM do
10:              n ← j
11:              j ← j + S_B
12:              for increment n until n == j do
13:                  I_SUM[ii][jj] ← I_SUM[ii][jj] + I_DIFF[m][n]
14:  I_RES ← THRESHOLD(I_SUM, MEAN(I_SUM) × T_M)
15:  Median filter I_RES
16:  Dilate I_RES
17:  I_MASK ← I_RES resized by a factor of S_B
In the tracker implementation, the block motion detector uses 8 × 8 pixel blocks. The
algorithm proceeds by summing the pixel values of the difference image in each 8 × 8
block. This is carried out on lines 3 to 13 of Algorithm 5.
Each block’s difference sum is compared against the global mean sum of all blocks. A
block whose sum is significantly larger than the global mean is classified as moving. This
is performed on line 14 of Algorithm 5 by thresholding the I_SUM image. The threshold
is a multiple of the global mean. The motion threshold multiplier T_M is determined
empirically by increasing it from an initial value of 1 until camera noise and small object
movements are excluded by the threshold. The motion threshold of 1.5 is used in all
tracking experiments presented in this section.
The result of the thresholding is stored in I_RES, which is essentially a low resolution binary
representation of the motion found between the input video frames I_0 and I_1. In the C++
implementation, a binary one in I_RES is used to represent detected motion. Static blocks
are represented by a binary zero. Median filtering is performed after thresholding to
remove noisy motion blocks, which can arise from small movements in the background or
slight changes in camera pose. The filtered result is dilated to ensure that all edge pixels
belonging to the contour of a moving object are included by the motion mask. Finally,
the I_RES image is resized by a factor of S_B to produce the output motion mask I_MASK.
As the motion mask I_MASK is the same size as the input images, a simple logical AND
operation is used to reject edge pixels contributed by static parts of the scene.
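For illustration, Algorithm 5 can be approximated quite compactly with OpenCV, as in the sketch below. This is not the thesis implementation: INTER_AREA resizing computes per-block means rather than sums, which is equivalent up to a constant factor, and the structuring element sizes are assumptions.

#include <opencv2/opencv.hpp>

// Returns a CV_8U mask the same size as the inputs, with 255 marking moving
// blocks. prevGray and currGray are time-adjacent grayscale frames.
cv::Mat blockMotionMask(const cv::Mat& prevGray, const cv::Mat& currGray,
                        int blockSize = 8, double motionThreshold = 1.5)
{
    cv::Mat diff;
    cv::absdiff(currGray, prevGray, diff);                          // line 1

    // Per-block mean of the difference image (lines 2 to 13 of Algorithm 5).
    cv::Mat blockMeans;
    cv::resize(diff, blockMeans, cv::Size(), 1.0 / blockSize, 1.0 / blockSize,
               cv::INTER_AREA);

    // Threshold relative to the global mean (line 14).
    const double globalMean = cv::mean(blockMeans)[0];
    cv::Mat res;
    cv::threshold(blockMeans, res, motionThreshold * globalMean, 255,
                  cv::THRESH_BINARY);

    cv::medianBlur(res, res, 3);                                    // line 15
    cv::dilate(res, res, cv::Mat());                                // line 16

    // Expand back to full resolution to obtain the motion mask (line 17).
    cv::Mat mask;
    cv::resize(res, mask, diff.size(), 0, 0, cv::INTER_NEAREST);
    return mask;
}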
5.2.3 Object Segmentation and Motion Mask Refinement
Apart from being used to reject unwanted edge pixels, the motion mask is also used to
obtain a segmentation of the tracked object. Motion segmentation is performed in real time
by zeroing image pixels labelled as static in the motion mask. Motion mask segmentation
results are shown in the Figures 5.3(a) and 5.4(a). Motion masking is implemented using
a logical AND operation. This allows real time operation as the logical AND has a low
computational cost. However, the raw motion mask tends to produce segmentations with
gaps, which can be seen in both segmentations. Also, pixels that do not belong to the
target object, such as the arm actuating the object, can be included in the segmentation
result. To overcome this problem, a refinement step is applied to improve the segmentation
result.
Refinement is performed on the binary motion blocks image, named I_RES in Algorithm 5.
This dramatically reduces the computational cost of refinement as the number of pixels
in I_RES is 1/S_B² of that of the motion mask (I_MASK). The refinement is performed as follows.
Firstly, motion blocks that do not belong to the target object are rejected by enforcing a symmetry
constraint. Each block with motion is reflected across the object’s symmetry line. If a
local window centered at the reflected location contains no motion, the motion block is
rejected. This results in a segmentation that is roughly symmetric about the tracked
object’s symmetry line. Secondly, holes and gaps in the mask are removed using a local
approach. A small window is passed over the I_RES image. If a block has many moving
neighbours within the window, it is labelled as moving. This two-step refinement process
is very computationally efficient as it does not require multiple iterations.
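A possible realization of the symmetry constraint is sketched below, operating on the low resolution block image: each moving block is kept only if some motion is found within a small window around its reflection across the symmetry line. The pivot point and θ are assumed to be expressed in block units; this is illustrative rather than the thesis code.

#include <opencv2/opencv.hpp>
#include <cmath>

cv::Mat enforceSymmetry(const cv::Mat& blocks,      // CV_8U, non-zero = moving
                        cv::Point2f pivot,           // a point on the symmetry
                        float theta,                 // line, in block units
                        int window = 1)
{
    const cv::Point2f n(std::cos(theta), std::sin(theta));  // unit normal
    cv::Mat refined = cv::Mat::zeros(blocks.size(), CV_8U);

    for (int y = 0; y < blocks.rows; ++y)
        for (int x = 0; x < blocks.cols; ++x) {
            if (!blocks.at<uchar>(y, x)) continue;

            // Reflect the block centre across the symmetry line.
            const cv::Point2f p(x + 0.5f, y + 0.5f);
            const float dist = (p - pivot).dot(n);
            const cv::Point2f q = p - 2.0f * dist * n;

            // Keep the block only if motion exists near the reflection.
            bool mirrored = false;
            for (int dy = -window; dy <= window && !mirrored; ++dy)
                for (int dx = -window; dx <= window; ++dx) {
                    const int qx = cvRound(q.x) + dx, qy = cvRound(q.y) + dy;
                    if (qx >= 0 && qy >= 0 && qx < blocks.cols &&
                        qy < blocks.rows && blocks.at<uchar>(qy, qx)) {
                        mirrored = true;
                        break;
                    }
                }
            if (mirrored) refined.at<uchar>(y, x) = 255;
        }
    return refined;
}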
Figures 5.3(b) and 5.4(b) show the improved segmentations obtained by using symme-
try to refine the motion mask. Again, a logical AND operation is used to perform the
segmentation. The posterior symmetry line estimate used by the refinement process is
shown as a red line. Notice that the majority of the experimenter’s arm is removed in
both symmetry-refined segmentations.
(a) Motion mask segmentation
(b) Symmetry-refined motion segmentation
Figure 5.3: Motion mask object segmentation – White bottle. The tracker’s posterior estimate of
the symmetry line is drawn in red. This symmetry line is used to produce the refined segmentation.
(a) Motion mask segmentation
(b) Symmetry-refined motion segmentation
Figure 5.4: Motion mask object segmentation – White cup. The tracker’s posterior estimate of
the symmetry line is drawn in red. This symmetry line is used to produce the refined segmentation.
5.2.4 Kalman Filter
After removing edge pixels from static portions of the input image using the motion mask,
fast symmetry detection is performed using the remaining edge pixels. The (R, θ) parame-
ters of the detected symmetry lines are provided to a Kalman filter as measurements. The
Kalman filter combines input measurements and a state prediction to maintain a posterior
estimate of the target object’s symmetry line. The filter state contains the position, veloc-
ity and acceleration of the symmetry line’s parameters. The Kalman filter implementation
is based on [Bar-Shalom et al., 2002]. The filter plant, measurement and state matrices
are as follows.
A = [ 1  0  1  0  1/2  0
      0  1  0  1   0  1/2
      0  0  1  0   1   0
      0  0  0  1   0   1
      0  0  0  0   1   0
      0  0  0  0   0   1 ]

H = [ 1  0  0  0  0  0
      0  1  0  0  0  0 ]

x = [ R   θ   Ṙ   θ̇   R̈   θ̈ ]^T        (5.1)
The process and measurement covariances are determined empirically. The matrices as-
sume that the noise variables are independent, with no cross correlation between R and
θ values. The diagonal elements of the process covariance matrix are 1, 0, 1, 10, 1, 10, 1.
The remaining elements in the process covariance matrix are zero. The measurement
covariance for R is 9 pixels² and the θ covariance is 9 degrees².
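For reference, a filter with the structure of Equation 5.1 could be set up as follows using OpenCV’s cv::KalmanFilter, assuming a unit time step between frames. The covariance values below are placeholders rather than the empirically tuned values used in the experiments.

#include <opencv2/opencv.hpp>

cv::KalmanFilter makeSymmetryLineFilter()
{
    // State: [R, theta, dR, dtheta, ddR, ddtheta], measurement: [R, theta].
    cv::KalmanFilter kf(6, 2, 0, CV_32F);

    kf.transitionMatrix = (cv::Mat_<float>(6, 6) <<
        1, 0, 1, 0, 0.5f, 0,
        0, 1, 0, 1, 0, 0.5f,
        0, 0, 1, 0, 1, 0,
        0, 0, 0, 1, 0, 1,
        0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 1);

    kf.measurementMatrix = (cv::Mat_<float>(2, 6) <<
        1, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0);

    // Independent process and measurement noise (placeholder magnitudes).
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1.0f));
    kf.measurementNoiseCov = (cv::Mat_<float>(2, 2) << 9, 0,
                                                       0, 9);
    return kf;
}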
For each video image, multiple symmetry lines are given to the Kalman filter as measure-
ments. As tracking is performed on a single target object, the Kalman filter must choose
the best measurement to update its tracking estimate. In other words, the problem of data
association must be addressed. Data association is performed using a validation gate. The
validation gate is based on the mathematics presented in [Kleeman, 1996].
Measurements are validated using the following procedure. Symmetry line parameters that
generate an error above 9.2, which equates to the 2-DOF Chi-square value at P = 0.01,
are rejected by the validation gate. If no symmetry line passes through the gate, the
next state will be estimated using the state model alone. If multiple valid symmetry lines
pass through the validation gate, the symmetry line with the lowest validation error
is used to update the Kalman filter estimate. Hence, the validation gate performs both
measurement validation and data association.
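A minimal sketch of the gating computation is given below: the squared Mahalanobis distance of each (R, θ) measurement from the predicted measurement, using the innovation covariance S, is compared against the 2-DOF chi-square value of 9.2. The matrix types are assumed to be CV_32F; names are illustrative only.

#include <opencv2/opencv.hpp>

// Squared Mahalanobis distance of measurement z from the predicted
// measurement zPred, given the 2x2 innovation covariance S (all CV_32F).
double gateDistance(const cv::Mat& z, const cv::Mat& zPred, const cv::Mat& S)
{
    const cv::Mat nu = z - zPred;              // innovation
    const cv::Mat d = nu.t() * S.inv() * nu;   // 1x1 matrix
    return d.at<float>(0, 0);
}

// A measurement is accepted by the validation gate if its distance is below
// the 2-DOF chi-square value at P = 0.01.
bool insideGate(double squaredDistance) { return squaredDistance < 9.2; }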
Traditionally, Kalman filter initialization is performed manually by specifying the starting
state of the tracking target at a particular video image. However, as the eventual goal is to
use the symmetry tracker to estimate the location of new objects, automatic initialization
is needed. The initialization method begins with a crude specification of the object’s initial
state to ensure filter convergence. The video image for which tracking begins does not
have to be specified manually. Instead, the tracker monitors the number of moving blocks
returned by the block motion detector. By looking for a sharp jump in the quantity of
detected motion, the video frame where the target object begins to move is determined.
Symmetry lines detected from the three video images after the time of first object move-
ment are used to automatically initialize the Kalman filter. All possible data associations
are generated across these three time-consecutive video images. In the tracking exper-
iments, the N_lines parameter in Algorithm 1 is set to three. This limits the maximum
number of detected symmetry lines to three, which constrains the number of possible
symmetry line associations to 27 permutations.
The tracking error of each permutation is examined using the Kalman filter. For each per-
mutation, the first measurement is used to set the filter’s initial state. The filter is updated
in the usual manner using the second and third measurement of the permutation. The
validation gate error is accumulated across all three measurements for each permutation.
The permutation resulting in the smallest accumulated validation error is used to initialize
the Kalman filter for actual object tracking. This automatic initialization method is used
to bootstrap tracking in all experiments presented in this section.
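The initialization search itself is a small exhaustive loop over the 27 possible associations. The sketch below assumes a caller-supplied scoring function that seeds a filter with the first line, updates it with the next two, and returns the accumulated validation gate error; all names are hypothetical.

#include <array>
#include <functional>
#include <limits>
#include <vector>

struct SymLine { float r, theta; };

// f0, f1 and f2 hold the symmetry lines detected in the three frames after
// the first object movement (at most three lines each).
std::array<SymLine, 3> bestInitialization(
    const std::vector<SymLine>& f0,
    const std::vector<SymLine>& f1,
    const std::vector<SymLine>& f2,
    const std::function<double(const std::array<SymLine, 3>&)>& scoreAssociation)
{
    std::array<SymLine, 3> best{};
    double bestErr = std::numeric_limits<double>::max();
    for (const auto& a : f0)
        for (const auto& b : f1)
            for (const auto& c : f2) {
                const double err = scoreAssociation({a, b, c});
                if (err < bestErr) { bestErr = err; best = {a, b, c}; }
            }
    return best;   // association used to seed the Kalman filter
}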
5.3 Object Tracking Results
The proposed symmetry tracker is tested against ten videos each containing at least one bi-
laterally symmetric object under motion. The test videos were captured with an IEEE1394
CMOS camera in an indoor environment without controlled lighting. The videos were
captured at 25 frames per second with a resolution of 640 × 480 pixels. Image sequences
from several tracking videos can be found at the end of the author’s IROS paper [Li and
Kleeman, 2006b].
Videos of the tracking results can be found in the tracking folder of the multimedia
DVD included with this thesis. The videos are available as WMV and H264 files in their
respective folders. The H264 videos have better image quality than the WMV videos but
require more processing power for playback decoding. If video playback problems occur,
the author recommends the cross platform and open source video player VLC. A Windows
XP installer for VLC is available in the root folder of the multimedia DVD. Note that the
tracking videos are also available online:
• www.ecse.monash.edu.au/centres/irrc/li_iro2006.php
In the video results, the posterior estimate of the tracked object’s symmetry line is drawn
as a thick red line. The green bounding box encloses the refined motion mask of the
tracked object. Note that the rectangular bounding box is automatically rotated so that
two of its sides are parallel to the symmetry line. Figures 5.5 and 5.6 contain two examples
of how rotated bounding boxes are generated from refined motion masks.
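One way of producing such a box is to project every moving pixel of the refined mask onto the symmetry line direction and its normal and take the extents, as in the hedged sketch below. The pivot point, angle convention and helper names are assumptions, not the thesis code.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>

cv::RotatedRect symmetryAlignedBox(const cv::Mat& mask,    // CV_8U refined mask
                                   cv::Point2f pivot, float theta)
{
    const cv::Point2f axis(-std::sin(theta), std::cos(theta));  // along the line
    const cv::Point2f norm(std::cos(theta), std::sin(theta));   // across the line

    float minA = 1e9f, maxA = -1e9f, minN = 1e9f, maxN = -1e9f;
    for (int y = 0; y < mask.rows; ++y)
        for (int x = 0; x < mask.cols; ++x) {
            if (!mask.at<uchar>(y, x)) continue;
            const cv::Point2f p = cv::Point2f((float)x, (float)y) - pivot;
            minA = std::min(minA, p.dot(axis)); maxA = std::max(maxA, p.dot(axis));
            minN = std::min(minN, p.dot(norm)); maxN = std::max(maxN, p.dot(norm));
        }

    // Box centre and extents expressed back in image coordinates; two sides
    // of the returned box are parallel to the symmetry line.
    const cv::Point2f centre = pivot + 0.5f * (minA + maxA) * axis
                                     + 0.5f * (minN + maxN) * norm;
    return cv::RotatedRect(centre, cv::Size2f(maxN - minN, maxA - minA),
                           theta * 180.0f / (float)CV_PI);
}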
(a) Refined motion mask
(b) Rotated bounding box
Figure 5.5: Generating a rotated bounding box from the refined motion mask of a transparent
bottle. The bounding box is drawn in green and the tracker’s posterior symmetry line estimate is
drawn in red.
(a) Refined motion mask
(b) Rotated bounding box
Figure 5.6: Generating a rotated bounding box from the refined motion mask of a multi-colour
mug. The bounding box is drawn in green and the tracker’s posterior symmetry line estimate is
drawn in red.
The bounding box may appear unstable or overly large in some videos. This does not
imply tracker divergence, as the bounding box is not being estimated by the tracker. These
jitters in the size and location of the bounding box are due to temporal changes in the
refined motion mask. Temporal filtering can be applied to the bounding box parameters to
remove these jitters. For the sake of visual clarity, separate videos without the bounding
box are available for test videos 1, 2 and 9. These video files are labelled with the suffix
symline only.
5.3.1 Discussion of Tracking Results
For all ten test videos, the proposed tracker remained convergent during operation, never
diverging from the target object’s symmetry line. The test videos include a variety of
visually challenging situations that may be encountered by a tracking system employed in
the domain of robotics. This section discusses these challenges and how they are met by
the proposed tracking method. Note that the test videos will be referred to by the number
at the end of their filenames.
Large Changes in Object Pose
An object’s pose refers to its location and orientation in three-dimensional space. Due to
the projective nature of visual sensing, scale changes can result from object pose changes
if the object moves towards or away from the camera. Similarly, a change in object
orientation can alter the visual appearance of an object dramatically. For a model-free
tracking approach such as the proposed method, the lack of a priori object knowledge
further increases the difficulty of pose changes as no predictions can be made concerning
future object appearances.
Test videos 3, 5, 6 and 7 all contain large changes of the object pose. Video 3 shows
the successful tracking of a symmetric cup over a range of orientations, well over 120
degrees. Also, the cup is tilted towards the camera during the video, which alters its visual
appearance. Video 5 shows a textured bottle being tracked across different orientations
and locations. The tracker is able to successfully follow the bottle through various pose
changes, including a large tilt away from the camera that occurs at the end of the video.
Video 6 increases the difficulty by including large changes in object scale. The video is
also much longer and includes a larger variety of object poses. The tracker is able to
maintain track of the cup during the entire video without any jitters in the posterior
estimate. Video 7 is similar to video 6, but the difficulty is greater as a multi-colour
mug is now the tracking target. The mug has more noisy edge pixels and produces a
weaker symmetry line due to its relatively short edge contour. Also, near the middle
of the video, a large horizontal jump in the object’s location can be observed. This is
caused by an operating system lag during video capture. The tracker is able to cope with
these additional challenges, maintaining a convergent posterior estimate of the object’s
symmetry line during the entire video. Overall, the results suggest that the proposed
symmetry tracker is capable of tracking objects across large pose changes.
Object Transparency
Transparent objects, as shown in previous chapters, are difficult to deal with using tra-
ditional computer vision approaches due to their unreliable visual appearance. Video 1
shows the convergent tracking of a transparent bottle across different orientations. The
bottle produces very few edge pixels, resulting in several jitters of the posterior estimate
during tracking. However, the tracker did not diverge and the object’s symmetry line is
tracked successfully. Tracking methods that make use of an object’s internal pixel infor-
mation will most likely diverge during this tracking video. For example, a colour-based
approach will encounter vastly different hue histograms during tracking as the observed
colour of the bottle changes from a combination of brown and green to nearly entirely
green when it arrives at a horizontal orientation.
Video 10 also shows a successful tracking result for a transparent bottle. In this test
video, the bottle is moved in an L-shaped manner. This motion trajectory was chosen in
order to cause tracker divergence by constantly introducing sharp changes in the velocity
and acceleration of the symmetry line parameters. Also, small changes in the object scale
can be observed during the portions of the video where the bottle is actuated vertically.
Note that the appearance of the bottle changes drastically as it moves over backgrounds
of different hues and intensities. Again, this will cause trackers that rely on an object’s
surface appearance to diverge as the object sharply changes colour and intensity during
the horizontal portion of the L-shaped movement. The convergent tracking achieved on
this difficult video further confirms the proposed tracker’s robustness against changes of
object appearance. The results also show that the tracking method is able to robustly
track transparent objects.
Occlusion of Tracked Object
Similar to object transparency, occlusions of the tracking target change its visual ap-
pearance. However, occlusions also remove visual information that may provide features
sorely needed to maintain tracking convergence. In video 2, a green bottle is occluded by
another green object. The similar colours of the tracking target and the occluding object
are difficult for colour-based tracking methods to resolve without relying on object models.
However, the proposed model-free symmetry-based method is able to keep track of the
green bottle’s symmetry line despite the occlusion. The tracker’s success is due to the
use of edge pixel information by the fast symmetry detector, which is more reliable in
situations where the occluding object has a similar colour to the tracked object.
In video 4, the occlusion problem is made more difficult by using a bilaterally symmetric
object to block the target. Note also that the mug reverses direction near the point of
occlusion, which may cause tracker divergence if the posterior estimate latches onto the
symmetry of the occluding cup. The convergent tracking result suggests that the block
motion detector is able to generate a motion mask that rejects the static occlusion’s edge
pixels while retaining enough edge pixels from the tracking target to allow for reliable
symmetry detection.
Two Objects with Bilateral Symmetry
In the remaining test videos, two moving objects with bilateral symmetry are present.
The tracker is initiated on one object while the other object pollutes the Kalman filter
measurements by contributing noisy symmetry lines. In video 8, the tracker follows a
bottle as it collides with a symmetric white cup, knocking the cup over in the process.
The bottle comes to a sudden stop after the collision. Due to the inclusion of velocity and
acceleration in the Kalman filter state model, the collision can cause tracking divergence
due to a large difference between the filter prediction and the bottle’s symmetry line
measurement. Also, the cup’s symmetry line is near the filter’s post-collision prediction,
which can cause the filter posterior estimate to latch onto the cup. The convergent tracking
result achieved for video 8 indicates that the combination of motion masking and Kalman
filtering is robust and flexible enough to handle object collision scenarios.
In video 9, two symmetric objects move in opposite directions. Their symmetry lines pass
over each other near the middle of the video. As both objects are moving simultaneously,
edge pixels from both objects are included by the motion mask. Due to the model-free
nature of the proposed tracking approach, the symmetry lines of both objects are given
to the Kalman filter as measurements. As such, this test video examines the Kalman
filter’s ability to correctly associate symmetry line measurements with the tracked object.
The convergent tracking result suggests that the Kalman filter’s linear acceleration model,
coupled with the careful choice of covariance values, is able to disambiguate symmetry
lines from different moving objects.
Note that the tracker’s motion-based segmentation generates an overly large bounding
box when the two objects pass over each other. This temporary enlargement of the
bounding box is due to symmetry between the motion of the blue cup and the hand
actuating the white cup. As the motion mask does not contain orientation information,
it is unable to distinguish between different sources of motion. As the primary goal is to
maintain a convergent estimate of a tracked object’s symmetry line, sporadic enlargement
of the bounding box has no impact on tracking performance. The application of temporal
filtering to the bounding box parameters will improve the stability of the bounding box.
The use of a motion detection method that provides orientation information, such as
optical flow, will further improve the quality and reliability of the motion segmentation
result.
5.3.2 Real Time Performance
The entire tracking system is implemented using C++, with no platform specific optimiza-
tions. A notebook PC with a 1.73GHz Pentium M processor is used as the test computer
platform. As previously stated, the test videos are all 640 × 480 pixels in size and cap-
tured at 25 frames per second. Note that the timing results presented in this section are
from trials carried out prior to the use of the Intel C Compiler. As such, the current
implementation has even shorter execution times. The current tracker is able to operate
comfortably at 40 frames per second.
Table 5.1 contains the execution times of the tracker. Note that the video number matches the number in the video's filename. The longest video contains 400 frames. Symmetry detection, block motion detection, motion mask refinement and Kalman filtering are each timed independently. Only symmetry detection and Kalman filtering are absolutely necessary for object tracking, but block motion detection greatly improves the tracker's robustness to static symmetry lines. The column labelled Init contains the time taken to perform automatic initialization of the Kalman filter, as discussed at the end of Section 5.2.4. The mean frame rates during tracking are recorded in the FPS column.
Table 5.1: Object tracker execution times and frame rate

Video     Mean execution time (ms)                       Init     FPS
number    Sym detect    Motion    Refine    Kalman       (ms)     (Hz)
1         37.87         4.84      0.86      0.09         10.41    22.91
2         16.76         4.76      0.75      0.06          9.74    44.77
3         17.95         4.85      0.85      0.04         10.69    42.22
4         18.31         4.74      0.75      0.04         11.90    41.96
5         33.69         4.87      0.87      0.05         11.38    25.33
6         20.84         4.94      0.85      0.04         13.18    37.50
7         35.29         5.01      0.87      0.13         11.32    24.22
8         34.48         4.94      0.79      0.14         11.14    24.79
9         18.19         4.91      0.79      0.06         11.83    41.75
10        27.01         4.89      0.82      0.06         12.50    30.51
MEAN      26.04         4.88      0.82      0.07         11.41    33.60
5.4 Bilateral Symmetry as a Tracking Feature
In computer vision, the performance of a tracking method is usually gauged by its success
rate at maintaining convergent estimates over a set of test videos. The last section ex-
amined the proposed tracker’s performance on ten test videos. The experiments showed
that the tracker is fast and robust, achieving convergent real time tracking in all ten
videos. The results showed that the tracker is able to deal with partial occlusions, object
transparency, large pose changes of the tracking target and scenes with multiple symmet-
ric objects. However, robotic tasks employing the proposed tracker may have accuracy requirements that are not addressed by the experiments thus far.
Firstly, the accuracy of tracking that can be achieved using detected symmetry should be
evaluated quantitatively. The evaluation should be performed using different backgrounds
to effectively gauge the flexibility and robustness of symmetry as a tracking feature. Sec-
ondly, symmetry and colour should be compared qualitatively as object tracking features.
Colour is chosen as the feature to compare against as it uses the pixel information within
an object’s contour, which is visually orthogonal to the edge pixels the fast symmetry
detector relies upon.
In order to meaningfully perform the evaluation and the comparison, tracking results must
be compared against a reliable and accurate measure of the tracked object’s symmetry
line. This is difficult as ground truth symmetry lines must be found for every video frame.
Obtaining ground truth data for every video frame is also time consuming, especially
considering that manual processing is usually required for real world video sequences.
However, by exploiting the predictable motion of a pendulum, ground truth data can be
obtained automatically.
5.4.1 Obtaining Ground Truth
To evaluate the accuracy of symmetry tracking under various background conditions, the
tracker’s symmetry line estimate must be compared against the ground truth symmetry
line of the tracked object. Ground truth data is generally troublesome to obtain due to
the lack of constraints on the target object’s trajectory. Also, manually extracting the
object’s symmetry line in long videos can introduce human errors into the ground truth
data. As such, a combined hardware-software solution is used to extract ground truth
data automatically.
On the hardware side, a pendulum rig provides predictable oscillatory object motion.
This custom-built pendulum is shown in Figure 5.7. The pendulum pivot has 1 degree of
freedom, which constrains the object’s motion to a plane. A hollow carbon fiber tube is
used as the pendulum arm to limit flex during motion. The test object is a red plastic
squeeze bottle. The bottle is mounted at the end of the pendulum by passing the pendulum
arm through its axis of symmetry, which is also its axis of revolution.
Colour markers are placed above and below the object on the pendulum arm. These
markers provide stable colour features for the ground truth extraction software. A simple
colour filter is used to extract pixels belonging to the colour markers. The centroid of
each marker is found by calculating the centers of mass of its extracted pixels. The polar
parameters of the line passing through both markers' centroids are recorded as the ground
truth symmetry line. This symmetry line extraction process is done without any human
Figure 5.7: Pendulum hardware used to produce predictable object motion. (Annotated components: 1 degree-of-freedom pivot, carbon fiber tube, ground truth markers.)
assistance. An example of an automatically extracted ground truth symmetry line is
shown in Figure 5.8.
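To make the extraction step concrete, the following C++ sketch computes the polar (R, θ) parameters of the line through two marker centroids. The image-centre origin and the normal-angle convention are assumptions made here to match the symmetry detector's line parameterization, and the centroid coordinates in main() are placeholders; the sketch is illustrative rather than a reproduction of the original implementation.

// Sketch: polar (R, theta) line parameters from two marker centroids.
// Assumes theta is the angle of the line's normal and R is the signed
// perpendicular distance of the line from the image centre (cx, cy).
#include <cmath>
#include <cstdio>

struct PolarLine { double R; double theta; };

PolarLine lineThroughCentroids(double x1, double y1, double x2, double y2,
                               double cx, double cy)
{
    const double kPi = 3.14159265358979323846;
    // Direction of the line joining the two centroids.
    const double dx = x2 - x1;
    const double dy = y2 - y1;
    // The line's normal is perpendicular to its direction.
    const double theta = std::atan2(dy, dx) - kPi / 2.0;
    // Signed distance of the line from the image centre along the normal.
    const double R = (x1 - cx) * std::cos(theta) + (y1 - cy) * std::sin(theta);
    return {R, theta};
}

int main()
{
    // Hypothetical centroids of the upper and lower markers (pixels).
    const PolarLine gt = lineThroughCentroids(310.0, 120.0, 330.0, 400.0,
                                              320.0, 240.0);
    std::printf("R = %.2f px, theta = %.4f rad\n", gt.R, gt.theta);
    return 0;
}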
Before proceeding to use the automatically extracted symmetry lines as ground truth
data, their accuracy must be verified. As a pendulum moves in a predictable manner, the automatically extracted symmetry lines are compared against theoretical expectations.
Equations 5.2 and 5.3 describe the R and θ parameters of the pendulum.
θ(t) = A e^(−αt) cos(ω(t − t_0)) + B                                   (5.2)

R(t) = L θ(t) + L_0                                                    (5.3)
Figure 5.8: The automatically extracted ground truth symmetry line is shown in black. The
centroids of the colour markers are shown as red and green dots.
As the angular magnitude of the oscillations during the experiments is small, the pendulum equations use the small-angle approximation sin θ ≈ θ. Note that R(t) is a function of θ(t). The damping is modelled as an exponential decay whose rate is governed by the parameter α.
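As an illustration, the damped pendulum model of Equations 5.2 and 5.3 can be evaluated as in the C++ sketch below. The parameter values in main() are placeholders rather than fitted results, and ω is written out explicitly even though it is not listed among the regression parameters in the following paragraph.

// Sketch of the damped-pendulum model of Equations 5.2 and 5.3.
#include <cmath>
#include <cstdio>

struct PendulumParams {
    double A;      // initial angular amplitude (radians)
    double alpha;  // exponential damping rate (1/s)
    double omega;  // oscillation frequency (rad/s)
    double t0;     // time offset (s)
    double B;      // angular offset (radians)
    double L;      // scale from theta to the radius parameter (pixels)
    double L0;     // radius offset (pixels)
};

// Equation 5.2: theta(t) = A exp(-alpha t) cos(omega (t - t0)) + B
double pendulumTheta(const PendulumParams& p, double t)
{
    return p.A * std::exp(-p.alpha * t) * std::cos(p.omega * (t - p.t0)) + p.B;
}

// Equation 5.3: R(t) = L theta(t) + L0
double pendulumR(const PendulumParams& p, double t)
{
    return p.L * pendulumTheta(p, t) + p.L0;
}

int main()
{
    const PendulumParams p{0.2, 0.05, 3.1, 0.0, 0.0, 400.0, 10.0};  // placeholders
    for (int frame = 0; frame < 5; ++frame) {
        const double t = frame / 25.0;  // test videos run at 25 frames per second
        std::printf("t=%.2fs  theta=%.4f rad  R=%.2f px\n",
                    t, pendulumTheta(p, t), pendulumR(p, t));
    }
    return 0;
}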
To evaluate the accuracy of the extracted symmetry lines, non-linear regression is per-
formed using Equations 5.2 and 5.3. MATLAB’s nlinfit function is used to perform the
non-linear regression, simultaneously estimating A, α, t_0, B, L and L_0. The mean absolute regression residuals of the extracted symmetry lines are examined across four test
videos, each containing 1000 images. Example images from the test videos are available
from Figures 5.9 to 5.12. All four sets of 1000 test video images are available from the
pendulum folder of the multimedia DVD. The readme.txt text file can be consulted for
additional information. Note that the same test videos are used in Sections 5.4.2 and 5.4.3.
Table 5.2 contains the mean regression errors for the automatically extracted ground truth
symmetry lines.
Table 5.2: Pendulum ground truth data – Mean of absolute regression residuals

Video background    R (pixels)    θ (radians)
White               0.39          0.0014
Red                 0.76          0.0021
Edge                1.82          0.0025
Mixed               0.51          0.0014
OVERALL MEAN        0.87          0.0019
Figure 5.9: Pendulum video images – White background.
Figure 5.10: Pendulum video images – Background with red distracters.
Figure 5.11: Pendulum video images – Background with edge noise.
Figure 5.12: Pendulum video images – Mix of red distracters and edge noise.
The regression residuals for the extracted symmetry lines are very small across all four
test videos. The mean error of R is less than 1 pixel and the mean θ error is roughly
0.1 degrees. These low residuals suggest that the proposed marker-based method of
extracting symmetry lines is capable of providing reliable ground truth data. The small
residuals indicate that the extracted symmetry lines behave according to the damped
pendulum described by Equations 5.2 and 5.3. More importantly, the regression results
imply that the ground truth symmetry lines are good estimates of the object’s actual
symmetry line.
5.4.2 Quantitative Analysis of Tracking Accuracy
After establishing a way to obtain reliable ground truth data, quantitative examination
of the symmetry tracker can proceed. Four test videos, each containing 1000 images, are
used to thoroughly evaluate the symmetry tracker’s accuracy. The test videos each show
around 10 oscillations of the pendulum in front of different static backgrounds. Example
images from the four test videos are displayed in Figures 5.9 to 5.12.
The following four backgrounds are used in the test videos. In the first test video, a
white background is used as a control experiment to examine the tracker under near-ideal
background conditions. Example images from the first test video are shown in Figure 5.9.
However, specular reflections and shadows are still quite prominent for some object poses.
Red distracters, similar in colour to the tracked object, are added to the background in
the second test video as can be seen in Figure 5.10. To increase input edge noise to the
symmetry detector, high-contrast line features are present in the background of the third
video. Images from this video can be found in Figure 5.11. The fourth video contains both red distracters and edge noise in the background, as shown in Figure 5.12.
To measure tracking accuracy, the tracker result is compared against ground truth for
each frame of a video. As the main concern is to investigate the performance of detected
symmetry as a tracking feature, the focus is on looking at the quality of measurements
provided to the Kalman filter. Also, the Kalman filter motion model may provide uneven
reductions in tracking error during the swing. Therefore, Kalman filtering is not applied
during the tracking accuracy trials. However, to simulate actual tracking operation, the
tracker continues to use the motion mask to reject edge pixels from static portions of
the video image before applying symmetry detection. As little motion is experienced at
the extremities of the pendulum swing, motion masking cannot be applied to these video
images. To remove any bias introduced by the lack of motion masking, the accuracy
analysis ignores video images within five frames of the pendulum’s turning point.
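A minimal C++ sketch of the per-frame error computation is given below. The detector and ground truth values are placeholders, and wrapping the orientation difference to handle the 180-degree ambiguity of an undirected line is an assumption made for this sketch; the error measure described above is simply the difference of the polar parameters.

// Sketch: per-frame tracking error as the difference of polar parameters.
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

struct PolarLine { double R; double theta; };

// Wrap an angle difference into (-pi/2, pi/2] to handle line ambiguity.
double wrapLineAngle(double a)
{
    const double kPi = 3.14159265358979323846;
    while (a > kPi / 2.0)   a -= kPi;
    while (a <= -kPi / 2.0) a += kPi;
    return a;
}

int main()
{
    // Hypothetical detector output and ground truth for a few frames.
    const std::vector<PolarLine> detected = {{12.3, 0.051}, {11.8, 0.047}};
    const std::vector<PolarLine> truth    = {{12.0, 0.050}, {12.1, 0.044}};

    for (std::size_t i = 0; i < detected.size(); ++i) {
        const double eR     = detected[i].R - truth[i].R;
        const double eTheta = wrapLineAngle(detected[i].theta - truth[i].theta);
        std::printf("frame %zu: R error %+.2f px, theta error %+.4f rad\n",
                    i, eR, eTheta);
    }
    return 0;
}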
Table 5.3 provides a statistical summary of the pendulum symmetry tracking error. The
table columns from left to right are the mean of absolute errors, standard deviation of
errors and the median of absolute errors. The difference between the polar parameters of
the tracker’s result and ground truth data is used as the error measure. Error plots for R
and θ are shown in Figures 5.14 to 5.17. The tracking errors are coloured blue. Due to the
length and cyclic nature of the test videos, each error plot only shows the first 400 images
of a test video. The mean-subtracted ground truth data is plotted as a black dotted line
against a different vertical axis for visualization purposes. Histograms of tracking errors
are shown in Figures 5.18 to 5.21.
Table 5.3: Pendulum symmetry tracking error statistics

Video background    Polar parameter    Abs. mean    Std.      Abs. median
White               R (pixels)         1.1256       1.1350    0.5675
                    θ (radians)        0.0057       0.0048    0.0043
Red                 R (pixels)         2.0550       1.7955    2.8732
                    θ (radians)        0.0134       0.0110    0.0129
Edge                R (pixels)         1.2118       1.0529    0.8765
                    θ (radians)        0.0078       0.0053    0.0079
Mixed               R (pixels)         3.4147       1.6186    7.4565
                    θ (radians)        0.0192       0.0099    0.0375
White Background
The error plots for the white background video are contained in Figure 5.14. Example
images from the video are shown below the two error plots. The left histogram in Fig-
ure 5.18 shows a small DC offset in the radius error. From qualitative examination of the
edge images, this offset appears to be caused by a small drift in edge pixel locations due
to uneven lighting and object motion blur.
The θ error plot shows that the orientation error tends to increase in magnitude near ground truth zero crossings. This increase in tracking error appears to be correlated with object speed, which is highest near the middle of the pendulum swing where θ = 0. Higher object velocities increase motion blur, which introduces noise into the locations of detected edge pixels and thereby also reduces the accuracy of detected symmetry.
The error plots show that both radius and θ errors are small, despite specular reflections
on the right side of the bottle and fairly strong background shadows. The application of
a temporal filter, such as the Kalman filter employed in the proposed tracker, will further
reduce tracking errors. The low-error tracking results suggest that the fast symmetry
detection method provides accurate measurements to a tracker when the tracked object is
set against a plain background.
Background with Red Distracters
The error plots in Figure 5.15 show that a background littered with distracters, similar in
colour to the tracked object, increases tracking errors. The increase in error appears to be
due to missing edge pixels along the object’s contour. These missing edges are caused by
the lack of intensity contrast between the bottle and the red background distracters. The
reduction in object-background contrast also reduces the size of the motion mask, causing
some edge pixels in the tracked object’s contour to be rejected. Again, the errors appear
to be greater in magnitude near ground truth zero crossings, where the object is moving
at a greater speed.
Overall, the symmetry tracking errors are still very small. In the radius error plot, the
magnitude of the error rarely exceeds 4 pixels, which is less than 1 percent of the image
width. The two peaks in the θ error plot are roughly 3 degrees in magnitude. The
histograms in Figure 5.19 show that over the entire 1000 frames of the test video there
are a few errors of larger magnitude. Due to their low frequency of occurrence, temporal
filters, such as a Kalman filter, can easily deal with these error spikes. As such, it appears
that even with similarly coloured distracters in the background, detected symmetry can
be used as a reliable source of measurements for the purpose of object tracking.
Background with Edge Noise
The error plots in Figure 5.16 suggest that an increase in input edge noise has little impact
on tracking accuracy. Figure 5.13 contains an example of the motion-masked edge pixels
given to the symmetry detector during the test video. Notice that despite the rejection
of static edge pixels using the motion mask, the remaining edge pixels are still very noisy.
However, the fast symmetry detection method is still able to accurately recover the bottle’s
symmetry line.
The histograms in Figure 5.20 confirm that background edge noise has little impact on tracking accuracy. Comparing error magnitudes, it appears that background distracters of similar colour to the tracked object degrade tracking accuracy more than background
edge noise. This observation implies that missing edge pixels, caused by low object-
background contrast, are more harmful to symmetry detection than the addition of random
edge pixel noise.
Mixed Background
The error plots for the final pendulum test video are shown in Figure 5.17. The example
video frames show that the background contains a combination of high-contrast edges
and red distracters. The error plots contain several large spikes near zero crossings of
the ground truth data. These error spikes are due to a combination of motion blur and
missing edge pixels caused by the low object-background contrast provided by the red
background distracters. The tracking results presented in Section 5.3, especially videos
with low object-background contrast such as those involving transparent objects, suggest
that the Kalman filter is able to handle these sparse error spikes.
(a) Input edge pixels
(b) Detected symmetry line
Figure 5.13: Example symmetry detection result from test video with background edge pixel noise.
The top image shows the motion-masked edge pixels given to the fast symmetry detector as input.
In the bottom image, the symmetry line returned by the detector is shown in blue.
Figure 5.14: White background – Symmetry tracking error plots. (Two panels versus frame number: fast symmetry radius error in pixels and fast symmetry θ error in radians, each overlaid on the mean-subtracted ground truth plotted against a secondary axis.)
Figure 5.15: Background with red distracters – Symmetry tracking error plots. (Two panels versus frame number: fast symmetry radius error in pixels and fast symmetry θ error in radians, each overlaid on the mean-subtracted ground truth plotted against a secondary axis.)
Figure 5.16: Background with edge noise – Symmetry tracking error plots. (Two panels versus frame number: fast symmetry radius error in pixels and fast symmetry θ error in radians, each overlaid on the mean-subtracted ground truth plotted against a secondary axis.)
Figure 5.17: Mixed background – Symmetry tracking error plots. (Two panels versus frame number: fast symmetry radius error in pixels and fast symmetry θ error in radians, each overlaid on the mean-subtracted ground truth plotted against a secondary axis.)
Figure 5.18: White background – Histograms of symmetry tracking errors. (a) R error (pixels); (b) θ error (radians).
Figure 5.19: Background with red distracters – Histograms of symmetry tracking errors. (a) R error (pixels); (b) θ error (radians).
Figure 5.20: Background with edge noise – Histograms of symmetry tracking errors. (a) R error (pixels); (b) θ error (radians).
Figure 5.21: Mixed background – Histograms of symmetry tracking errors. (a) R error (pixels); (b) θ error (radians).
5.4.3 Qualitative Comparison Between Symmetry and Colour
Colour is regularly used to perform object tracking. For example, the back projection
of a hue histogram is used in Camshift [Bradski, 1998] to perform human face tracking.
The same approach can also be used to track inanimate objects such as tennis balls and
uniformly coloured cups. As discussed previously, the edge pixels that the fast symmetry
detector uses as input data are visually orthogonal to value-based features such as colour.
In this section, an attempt is made to compare the performance of bilateral symmetry and
colour as tracking features.
Colour Blob Centroid
Before proceeding further, a colour-based feature must be chosen for the comparison
against bilateral symmetry. As a symmetry line provides object pose information, the
colour blob centroid seems to be a valid colour-based counterpart, as it also provides pose
information. The colour blob centroid is extracted by finding the center of mass of a similarly coloured blob of pixels, computed from the blob's image moments. Note however that the pose information
provided by a colour centroid is different to that provided by a symmetry line. A colour
centroid provides an (x, y) pixel location of an object's center of mass. A detected sym-
metry line provides object orientation and constrains the object location to the loci along
the symmetry line. As the test object is symmetric, tracking accuracy is skewed favourably
towards symmetry. As such, only a qualitative comparison is performed between colour and symmetry as tracking features.
The colour tracker uses the hue-saturation-value (HSV) colour space to represent pixel
information. The target object’s colour model is represented as a two-dimensional his-
togram, which stores hue and saturation information. The two-dimensional histogram
quantizes hue and saturation into 45 and 8 bins respectively. The value component of
HSV is only used to reject pixels that are very dark or very bright, which have unreliable
hue information. The object’s colour histogram is built offline and manually optimized
prior to colour blob centroid detection. This is different from using symmetry as an object
tracking feature, which requires no manual initialization.
The back projection image is obtained using the tracked object’s hue-saturation histogram,
according to the method described in [Swain and Ballard, 1991]. The back projection
image represents the probability that a pixel in the input image belongs to the target object
based on the pixel’s colour. To provide a fair comparison against symmetry tracking, the
colour tracker also uses the motion mask provided by the block motion detector to zero
static portions of the back projection image. As with the symmetry tracking experiments,
video images near the turning points of the pendulum are ignored. An example back
projection result is shown in Figure 5.22. In the back projection image, darker pixels
represent higher object probability.
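A C++ sketch of the back projection step, using OpenCV's calcHist and calcBackProject with the 45 × 8 hue-saturation binning described above, is given below. The file names and the value-channel rejection thresholds are placeholders; also note that calcBackProject outputs brighter values for higher object probability, whereas the back projection images in this section are displayed inverted.

// Sketch: hue-saturation histogram back projection with OpenCV.
#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat objectBgr = cv::imread("object_sample.png");  // cropped object patch
    cv::Mat frameBgr  = cv::imread("video_frame.png");    // current video image
    if (objectBgr.empty() || frameBgr.empty()) return 1;

    cv::Mat objectHsv, frameHsv;
    cv::cvtColor(objectBgr, objectHsv, cv::COLOR_BGR2HSV);
    cv::cvtColor(frameBgr,  frameHsv,  cv::COLOR_BGR2HSV);

    // Reject very dark and very bright pixels, whose hue is unreliable.
    cv::Mat valueMask;
    cv::inRange(objectHsv, cv::Scalar(0, 0, 30), cv::Scalar(180, 255, 230),
                valueMask);

    // Two-dimensional hue-saturation histogram: 45 x 8 bins.
    const int channels[] = {0, 1};
    const int histSize[] = {45, 8};
    float hueRange[] = {0, 180};
    float satRange[] = {0, 256};
    const float* ranges[] = {hueRange, satRange};
    cv::Mat hist;
    cv::calcHist(&objectHsv, 1, channels, valueMask, hist, 2, histSize, ranges);
    cv::normalize(hist, hist, 0, 255, cv::NORM_MINMAX);

    // Back projection: per-pixel likelihood of belonging to the object.
    cv::Mat backProj;
    cv::calcBackProject(&frameHsv, 1, channels, hist, backProj, ranges);
    cv::imwrite("back_projection.png", backProj);
    return 0;
}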
(a) Hue-saturation histogram of object
(b) Input
(c) Back projection
Figure 5.22: Hue-saturation histogram back projection. In the back projection image, darker
pixels have a high probability of belonging to the object according to the hue-saturation histogram.
(a) White background
(b) Red distracters in background
Figure 5.23: Effects of different backgrounds on colour centroid. The back projection object blob
is shown in yellow and the centroid is shown as a black dot.
A binary image is produced by thresholding the back projection image. The largest 8-
connected blob is labelled as belonging to the target object. The colour blob centroid is found by taking the center of mass of the pixels belonging to the object blob, computed from the blob's image moments. Example object blobs are shown in Figure 5.23. The
centroid location is shown as a black dot. In Figure 5.23(b), the large gap in the object
blob is due to low object-background contrast, which causes gaps in the motion mask and
subsequently the object blob.
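The blob and centroid extraction can be sketched in C++ as follows. The threshold value is a placeholder, and cv::connectedComponentsWithStats is used purely for brevity; it is not necessarily how the original system labels the largest 8-connected blob.

// Sketch: threshold the back projection, keep the largest 8-connected blob,
// and take its centre of mass as the colour blob centroid.
#include <opencv2/opencv.hpp>
#include <cstdio>

int main()
{
    cv::Mat backProj = cv::imread("back_projection.png", cv::IMREAD_GRAYSCALE);
    if (backProj.empty()) return 1;

    cv::Mat blobMask;
    cv::threshold(backProj, blobMask, 64, 255, cv::THRESH_BINARY);

    cv::Mat labels, stats, centroids;
    const int n = cv::connectedComponentsWithStats(blobMask, labels, stats,
                                                   centroids, 8);

    // Find the largest component, skipping label 0 (the background).
    int best = -1, bestArea = 0;
    for (int i = 1; i < n; ++i) {
        const int area = stats.at<int>(i, cv::CC_STAT_AREA);
        if (area > bestArea) { bestArea = area; best = i; }
    }

    if (best >= 0) {
        const double cx = centroids.at<double>(best, 0);
        const double cy = centroids.at<double>(best, 1);
        std::printf("colour blob centroid: (%.1f, %.1f), area %d px\n",
                    cx, cy, bestArea);
    }
    return 0;
}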
Error Metric for Colour Tracking
The ideal error measure is the distance between the detected centroid and a ground truth
centroid location. However, ground truth requires manual segmentation of the object in all
video images, including images where the test object is in front of other red objects. As the
test object is symmetric, its colour centroid is located on its symmetry line. Therefore,
a low centroid to symmetry line distance implies accurate centroid detection. As the
tracking errors are used for a qualitative comparison, this sub-optimal metric is sufficient.
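The centroid-to-line distance follows directly from the polar line parameters, as in the short C++ sketch below. The image-centre origin is again an assumption made to match the detector's convention, and the numeric values are placeholders.

// Sketch: perpendicular distance from the colour blob centroid to the ground
// truth symmetry line x*cos(theta) + y*sin(theta) = R, with coordinates taken
// relative to the image centre (cx, cy).
#include <cmath>
#include <cstdio>

double centroidToLineDistance(double x, double y,      // blob centroid (pixels)
                              double R, double theta,  // ground truth line
                              double cx, double cy)    // image centre
{
    return std::fabs((x - cx) * std::cos(theta) + (y - cy) * std::sin(theta) - R);
}

int main()
{
    const double d = centroidToLineDistance(322.5, 251.0, 10.0, 0.05,
                                            320.0, 240.0);
    std::printf("centroid displacement error: %.2f px\n", d);
    return 0;
}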
Error plots of the colour blob centroid are available from Figures 5.24 to 5.27. The mean-
subtracted radius of the ground truth symmetry line is provided as a visual reference of
the pendulum’s motion. As the radius parameter is measured from the image center, zero
crossings of the dotted line represent the middle of the pendulum swing, where the object
has the highest speed.
Comparison Results
Figures 5.24 and 5.26 suggest that colour blob centroid detection is accurate and reliable for the white background and noisy edge background test videos. In Figure 5.26, the lack of jumps
in error magnitude during ground truth zero crossings agrees with expectations as the
colour blob centroid does not rely on edge information. While edge information becomes
less reliable due to motion blur when the object speed is high, hue and saturation are not
affected by motion blur. Note that the cyclic nature of the centroid error in Figure 5.24
is caused by a flip in the sign of the ground truth symmetry line’s radius parameter. The
absolute magnitude of error is fairly stable across the entire plot.
The error plot for the video with red background distracters is shown in Figure 5.25.
The plot shows a mean centroid error that is 4 to 5 times larger than that of the white
background video. This decrease in tracking accuracy is due to the background distracters
distorting the shape of the object blob, an example of which can be seen in Figure 5.23(b).
The distortion is caused by the inability of the hue-saturation histogram to distinguish
between object and background. The red distracters also cause gaps in the motion mask
due to low object-background contrast, which further increases the centroid tracking error.
Figure 5.24: White background – Colour blob centroid tracking error plot. (HSV tracking centroid displacement error in pixels versus frame number, overlaid on the mean-subtracted ground truth radius plotted against a secondary axis.)
Figure 5.25: Background with red distracters – Colour blob centroid tracking error plot. (HSV tracking centroid displacement error in pixels versus frame number, overlaid on the mean-subtracted ground truth radius plotted against a secondary axis.)
Figure 5.26: Background with edge noise – Colour blob centroid tracking error plot. (HSV tracking centroid displacement error in pixels versus frame number, overlaid on the mean-subtracted ground truth radius plotted against a secondary axis.)
Figure 5.27: Mixed background – Colour blob centroid tracking error plot. (HSV tracking centroid displacement error in pixels versus frame number, overlaid on the mean-subtracted ground truth radius plotted against a secondary axis.)
The detrimental effects of background distracters are further confirmed by the error plot of the mixed background video in Figure 5.27. The plot shows periodic increases in error magnitude. The uneven nature of these noise cycles is due to the object swinging
over red distracters in the background, causing a shift in the location of the colour blob
centroid.
Comparing the colour tracking error plots against those of the symmetry tracker, it is
clear that symmetry and colour each have their own strengths and weaknesses as tracking
features. Each feature should only be applied to tracking after carefully considering the
visual characteristics of the target objects, the background and the requirements of higher
level methods making use of the tracking results. For example, colour tracking does not
assume object rigidity, which allows the tracking of deformable objects. Also, colour
tracking is less susceptible to motion blur than symmetry.
On the other hand, the comparison shows that colour blob centroids are difficult to employ
in situations where portions of the background have similar colour to the tracking target.
Also, colour tracking requires a reasonably accurate histogram of the target’s colour statis-
tics, which may be difficult to obtain automatically. Colour models of all possible target
objects are needed before tracking. Moreover, transparent and reflective objects take on
the colour of their surroundings, which makes colour a poor choice as a tracking feature
for these objects.
5.5 Chapter Summary
This chapter detailed a novel symmetry-based approach to object tracking. Experiments
on ten challenging test videos show that the proposed method is able to robustly and
rapidly track objects, operating at over 40 frames per second on 640 × 480 images. The
experimental results also show that convergent tracking can be maintained for multi-colour
objects and transparent objects. The symmetry tracker successfully dealt with difficult
situations such as large object pose changes, occlusions and the presence of non-target
symmetric objects. The tracker also produces a symmetry-refined motion segmentation of
the tracked object in real-time.
The discussion at the end of the previous chapter mentioned the problem of distinguish-
ing between object and background symmetry lines. Due to the lack of object models,
this is difficult without relying on prior knowledge such as table plane geometry or the
expected orientation of a target object’s symmetry line. This chapter has shown that
motion is a useful visual cue for separating an object’s symmetry line from static back-
ground symmetries. In the next chapter, the visual sensing repertoire developed so far
is combined synergetically with robotic manipulation to actuate new objects in order to
perform motion segmentation autonomously.
Vision without action is a dream. Action without vision
is simply passing the time. Action with vision is making
a positive difference.
Joel Barker
6
Autonomous Object Segmentation
6.1 Introduction
This chapter details a robust and accurate object segmentation method for near-symmetric
objects placed on a table of known geometry. Here, object segmentation is defined as the
problem of isolating all portions of an image that belong to a physically coherent object.
The term near-symmetric is used as the proposed method can segment objects with some
non-symmetric parts, such as a coffee mug and its handle. The proposed approach does
not require prior models of target objects and assumes no previously collected background
statistics. Instead, the approach relies on a precise robotic nudge to generate predictable
object motion. Object motion and symmetry are combined to produce an accurate seg-
mentation. The use of physical manipulation provides a completely autonomous solution
to the problem of object segmentation. Experiments show that the resulting autonomous
robotic system produces accurate segmentations, even when operating in cluttered scenes
on multi-colour and transparent objects. A paper detailing the work in this chapter is
available in press [Li and Kleeman, 2008].
6.1.1 Motivation
The work presented in this chapter is intended for use in domestic robotics applications as
there are many objects with bilateral symmetry in most households. However, the sensing
parts of the process, namely locating points of interest using symmetry triangulation
and the symmetry-guided motion segmentation method, are applicable to other robotic
tasks. The overall aim is to provide the robot with a general method to detect and
segment common household objects such as cups, bottles and cans, without the burden
of mandatory offline training for every new object. As the proposed approach assumes
nothing about the appearance of the robot manipulator, the actuation of target objects
can be provided by any manipulator capable of performing a robotic nudge as described
in Section 6.3.
Object segmentation is an important sensory process for robots using vision. It allows a
robot to build accurate internal models of its surroundings by isolating regions of an image
that belong to objects in the real world. For domestic robots, the ability to quickly and
robustly segment man-made objects is highly desirable. For example, a robot designed
to clean and tidy desks will need to locate and segment common objects such as cups.
Section 4.2.4 highlighted several limitations of symmetry-based static object segmentation.
The proposed model-free approach was unable to recover asymmetric portions of near-
symmetric objects, such as the handle of a mug. Also, segmentation results may not
encompass the entire object due to distortions and gaps in the object’s edge contour.
The problems above can be solved passively by using a high level method and object
models, essentially converting the problem to one of object recognition. Robust multi-scale
object recognition methods, such as SIFT [Lowe, 2004] and Haar boosted cascades [Viola
and Jones, 2001] can imbue a robot with the ability to recognize previously modelled
objects. However, training such methods requires large quantities of positive and negative
sample images for each object. The number of training images is in the order of a
thousand or greater if high accuracy is required. Manually labelling and segmenting
images to produce training data for such methods is time consuming, especially for large
sets of objects. Considering the large number of objects that are present in most real world
environments, exhaustive training is impossible. Also, the introduction of novel objects
into the operating environment will require offline training before the resumption of visual
sensing.
The segmentation process described in this chapter attempts to address these problems by
obtaining accurate object segmentations autonomously. This shifts the burden of training
data collection from the human user to the robot. Also, the ability to detect and segment
objects quickly, without having to pause for offline training, allows the robot to rapidly
adapt to changing environments. Departing from the norm, a model-free approach to
visual sensing is retained by physically manipulating objects using a precise robotic nudge.
Returning to the desk cleaning robot example, in the case where it encounters a cup lacking
a model, the robot will generate its own training data by nudging the cup over a short
distance. In order to do this, the robot uses the arsenal of model-free sensing methods
detailed in previous chapters.
6.1.2 Contributions
The sensing of objects using physical manipulation has been explored several times in the
past. Tactile approaches use force sensors on the end effector to obtain information about
an object’s shape during a probing action [Cole and Yap, 1987] or a pushing motion [Jia and
Erdmann, 1998; Moll and Erdmann, 2001]. The majority of active vision methods focus
on moving the camera to obtain multiple views of the scene. Departing from the norm,
the work of Fitzpatrick et al. [Fitzpatrick, 2003a; Fitzpatrick and Metta, 2003] uses the
visual feedback during physical manipulation to investigate objects. Their approach uses
a poking action, which sweeps the end effector across the workspace. The presence of an
object is detected by monitoring the scene visually, looking for a sharp jump in the amount of
detected motion that occurs when the robot’s end effector makes contact with an object.
When the effector-object collision is detected, object segmentation is performed using
a graph cuts approach. This section details the contributions of the proposed approach
towards active vision. The proposed approach is also compared against that of Fitzpatrick
et al., which partly inspired the work in this chapter.
Detecting Interesting Locations before Object Manipulation
By limiting the scope to near-symmetric objects, interesting locations can be found prior
to robotic action. Similar to Section 4.3.5, objects are localized by looking for the inter-
section between each object’s triangulated axis of symmetry and the table plane. These
intersection points are clustered over multiple frames to obtain a stable estimate of ob-
ject locations. Due to the low likelihood of two background symmetry lines triangulating
to an axis that passes through the table plane within the robot manipulator's reachable
workspace, the majority of clusters represent locations that will yield useful information
when physically explored. In essence, the interesting locations provide the robot with a
set of expectations to test using physical manipulation.
Finding interesting locations prior to robotic action provides several advantages. Firstly,
the detected interesting locations provide a subset of workspace areas that have high
probability of containing an object. This is different from the approach of Fitzpatrick
et al., which assumes that all parts of the unexplored workspace have equal probability
of containing an object. For scenes with sparse spatial distribution of objects, the pre-
action planning provided by the sensing of interesting locations will drastically reduce
exploration time. Secondly, interesting locations allow the use of exploration strategies.
For example, locations near the camera can be explored first, as they are less likely to
be occluded by other objects. In combination with dense stereo, which uses visually
orthogonal information to stereo symmetry triangulation, obstacles on the table top should
be identifiable. This will allow path planning with obstacle avoidance prior to robotic
action.
Note that the author is not claiming superiority over the approach of Fitzpatrick et al.
Their object exploration method is designed to be highly general for the purposes of
investigating the visual-motor response and measuring object affordances. Fitzpatrick’s
PhD thesis [Fitzpatrick, 2003b] also makes use of the same exploration method to measure
object affordances. Due to the lack of constraints on object shape imposed by their
method, no pre-action sensing can be performed to identify locations that are likely to
contain objects.
Precise Manipulation of Objects
Experiments show that the robotic nudge action does not tip over tall objects such as
empty bottles. The robotic nudge also does not damage fragile objects such as ceramic
mugs. This level of gentleness in object manipulation is not demonstrated in the work
of Fitzpatrick et al., which uses durable test objects such as toy cars and rubber balls.
Also, their poking action actuates objects at a relatively high speed. This is probably
due to their use of graph cuts segmentation at the moment of effector-object contact,
which requires a significant amount of object motion between time-adjacent video images.
The proposed segmentation approach does not rely on fast object motion, instead relying
on the small object displacement caused by the robotic nudge. As such, the speed of the
robotic nudge can be altered depending on the weight and fragility of the expected objects.
Due to the elastic actuators in the manipulator used by Fitzpatrick et al., their poking
action is inherently imprecise. In contrast, the proposed method uses a short and accu-
rate robotic nudge, only applied at locations of interest. While neither method directly
addresses the problem of end effector obstacle avoidance, the small workspace footprint
of the robotic nudge, together with the short length of its trajectory, will make path planning easier. As the robotic manipulator uses tradi-
tional DC motors and high resolution encoders, the robotic nudge motion is inherently
very precise. This allows the robotic nudge to be applied in cluttered scenes. The higher
probability of glancing blows, together with the imprecise execution of trajectories when
using elastic actuators, suggests that the poking approach of Fitzpatrick et al. is less
suitable for cluttered environments.
Prevention of Poor Segmentations
The proposed segmentation approach incorporates several tests to prevent poor segmen-
tation. Figure 6.2 contains a flowchart representing the segmentation process. The dia-
mond shaped blocks represent stages where segmentation can be restarted if undesirable
situations occur. Firstly, if the target location is not nudgable, the robot restarts its seg-
mentation process. A location is nudgable if it is within the robot manipulator’s reachable
workspace and there is sufficient clearance to nudge the target object without any collisions
with other objects.
Secondly, the robot looks for object motion during the nudge. If no object motion is
detected during the robotic nudge, the segmentation process is restarted. This ensures
that a segmentation is not attempted if the robotic nudge does not cause sufficient object
motion. The motion detection step also prevents the segmentation of phantom objects
caused by the triangulation of two background symmetry lines. As such, the robotic
nudge simultaneously tests for the existence of an object at the target location and allows
motion segmentation if an object is present. Unlike Fitzpatrick et al., a sharp jump in scene
motion is not used as an indicator for a moving object. Instead, motion is detected only
at the location being nudged, using the target object’s symmetry line as a barrier. This
is possible because the robotic nudge is a planned motion targeting a specific location,
allowing robust rejection of robot arm motion. The proposed object motion detection
method is further detailed in Section 6.3.2.
Thirdly, the robot tracks the object’s symmetry line in stereo. If either tracker diverges, the
robot abandons the current segmentation attempt and the process is restarted. Also, the
segmentation process is restarted if the orientation of the object’s symmetry line changes
dramatically before and after the robotic nudge. These checks ensure that segmentation is
not performed if an object is tipped over or pushed behind an occlusion. As the approach
of Fitzpatrick et al. does not forecast the time and location of object-manipulator contact,
they do not apply any prevention schemes against insufficient object motion. This can
result in near-empty segmentations due to robot arm motion being interpreted as object
motion. Also, the robot arm is sometimes included in their object segmentation result.
Robust Segmentation using Symmetry and Motion
While appearing similar at a glance, the proposed approach to visual segmentation is very
different to that of Fitzpatrick et al. Their approach uses video frames during robotic
action, around the time of contact between the end effector and object. Due to their
motion jump initiation, unlucky frame timing with respect to the time of contact can
produce poor segmentations. Several incorrect segmentations are highlighted in Figure 11
of [Fitzpatrick and Metta, 2003], which shows the inclusion of the end effector in some
segmentation results. Also, near-empty segmentations can be returned by their approach.
These problems never occur in the proposed approach as the video images used for motion
segmentation are temporally further apart, captured before and after the nudge but never
during a robotic action. This explicitly prevents the inclusion of the robot’s end effector
in the motion segmentation results. The motion detection approach also prevents the
production of near-empty segmentations. While not explicitly documented, it seems that
their choice of motion threshold will depend on object size and the speed of impact. As
the proposed approach only looks for any object motion, not a change in the quantity of
motion, the motion threshold is fairly arbitrary.
The proposed object segmentation approach is novel. The use of symmetry combined
with robot-generated object motion produces accurate object segmentations robustly and
autonomously. This model-free approach is able to segment multi-colour and transparent
objects in clutter. High resolution 1280 × 960 pixel images are used to produce more
accurate segmentations. These images each have four times the pixel data of the 640 × 480
images used in the static segmentation experiments presented in Section 4.2. However, as
the proposed segmentation method only requires a single pass and uses low computational
cost operations, the robot is able to obtain segmentations very rapidly. Timing trials
conducted on a 1.73GHz Pentium M laptop measured a mean execution time of 80ms
over 1000 segmentations, which includes time taken to calculate the frame difference.
The segmentation’s execution time is much lower than the many seconds required by the
graph cuts method used in the system of Fitzpatrick et al. This suggests that the proposed
symmetry-based segmentation approach is well suited for time-critical applications where
low execution times are essential.
6.1.3 System Overview
The hardware components of the robot are shown in Figure 6.1. The stereo cameras
consist of two Videre Design IEEE1394 CMOS cameras verged together at around 15 degrees from parallel. These cameras capture 640 × 480 images at 25Hz during all parts of the segmentation process, except for high resolution 1280 × 960 snapshots of the scene
taken before and after the robotic nudge. The cameras are mounted on a tripod and
positioned to emulate the eye-manipulator geometric relationship of a humanoid robot.
Figure 6.1: Robotic system hardware components.
The PUMA 260 robot manipulator has six degrees of freedom. The PUMA arm and stereo
cameras are set up to roughly simulate the manipulator-camera relationship of a humanoid
robot. Note that the PUMA arm is driven using a custom-built controller so that CPU
processing can be focused towards visual sensing. Details of the new controller are available
from Appendix B. The calibration grid is used to perform camera-arm calibration and to
estimate the geometry of the table plane. Details of the calibration and table plane
estimation are described in Section 6.1.4.
The autonomous segmentation process is summarized in Figure 6.2. The robot begins
by surveying the scene for interesting locations to explore. The details of this process
are described in Section 6.2. Once an interesting location has been found, the robot
manipulator nudges the target location. If motion is detected during the nudge, stereo
tracking is initiated to keep track of the moving object. Section 6.3 describes the robotic
nudge and stereo tracking. If tracking converges, the object is segmented using the method
described in Section 6.4. As discussed in previous chapters, bilateral symmetry is the
backbone feature of the visual processing.
Figure 6.2: Autonomous object segmentation flowchart. (Blocks: START, Find Interesting Locations, Location nearest Camera, Location Nudgable?, Nudge Location, Motion Detected?, Stereo Tracking, Tracking Successful?, Segment Object, RESULT; a failed check returns the process to the start.)
Note that while the flowchart shows a serial progression through the segmentation process,
several parallel threads of execution take place in practice. Firstly, two threads return
video images from the stereo camera pair in real time at 25 frames per second. The
visual processing to detect interesting locations and perform stereo tracking takes place
in parallel, alongside the camera threads. During the nudge, the robot arm controller
commands the manipulator in parallel with the visual processing and the camera threads,
making sure that the planned motion trajectories are being followed. This control thread
executes on its own processing unit on a PCI servo controller card. The controller card
commands the arm motors and monitors the joint encoders in order to execute smooth
trajectories.
6.1.4 System Calibration
The robot requires two pieces of prior knowledge in order to perform autonomous seg-
mentation. Firstly, the robot requires the geometry of the table plane. This allows the robot to move its end effector parallel to the table when performing the nudge.
Secondly, the robot needs the geometric relationship between its stereo cameras and its
manipulator. This allows object locations obtained via visual sensing to be used for robotic
manipulation. Both pieces of prior knowledge are obtained during system calibration. As
the robot manipulator is mounted on the table, calibration only needs to be repeated if
the stereo cameras’ pose changes relative to the table.
System calibration is performed as follows. Firstly, the stereo cameras are calibrated using
the MATLAB camera calibration toolbox [Bouguet, 2006]. The intrinsic parameters of
each camera in the stereo pair are obtained individually. This is followed by a separate
calibration to obtain the extrinsics of the stereo system. After calibrating the stereo
camera pair, the robot can triangulate locations in 3D relative to the camera’s coordinate
frame. The geometry of the table is found by fitting a plane to the corners of the checker
pattern, the locations of which are found using standard stereo triangulation.
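A least-squares plane fit of this kind can be sketched as follows, with Eigen used for illustration. The triangulated corner coordinates are placeholders; the plane normal is taken as the direction of smallest variance of the centred corners.

// Sketch: fit the table plane n . x = d to triangulated checkerboard corners.
#include <Eigen/Dense>
#include <iostream>
#include <vector>

int main()
{
    // Hypothetical corner locations in the camera frame (metres).
    const std::vector<Eigen::Vector3d> corners = {
        {0.10, 0.20, 0.80}, {0.15, 0.20, 0.81}, {0.10, 0.25, 0.80},
        {0.15, 0.25, 0.81}, {0.20, 0.20, 0.82}, {0.20, 0.25, 0.82}};

    Eigen::Vector3d centroid = Eigen::Vector3d::Zero();
    for (const auto& p : corners) centroid += p;
    centroid /= static_cast<double>(corners.size());

    Eigen::MatrixXd centred(static_cast<Eigen::Index>(corners.size()), 3);
    for (int i = 0; i < static_cast<int>(corners.size()); ++i)
        centred.row(i) = (corners[i] - centroid).transpose();

    // The singular vector with the smallest singular value is the plane normal.
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(centred, Eigen::ComputeThinV);
    const Eigen::Vector3d normal = svd.matrixV().col(2);
    const double d = normal.dot(centroid);

    std::cout << "table normal: " << normal.transpose() << ", d = " << d << "\n";
    return 0;
}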
To find the manipulator-camera transformation, the robot uses the calibration checker-
board pattern on the table. The corners of the checkerboard pattern are placed at a
known location in the robot manipulator’s coordinate frame. This is achieved by draw-
ing a grid of points on the table using a marker pen end effector with the manipulator.
The checkerboard pattern is then placed on the table with its corners aligned with the
manipulator-drawn grid points. The corners of the checkerboard are then triangulated us-
ing the stereo camera pair, obtaining their locations in the camera coordinate frame. The
manipulator-camera frame transformation is then found by solving the absolute orienta-
tion problem, which returns the transformation that will map points between the camera
and manipulator frames. The PCA solution proposed by [K. S. Arun and Blostein, 1987]
is used to obtain the transformation.
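The SVD-based solution of [K. S. Arun and Blostein, 1987] can be sketched as follows, again with Eigen used for illustration and placeholder point correspondences. The determinant check guards against the reflection solution that can arise with coplanar points such as checkerboard corners.

// Sketch: absolute orientation. Recover R, t such that q ~ R p + t from
// corresponding points P (camera frame) and Q (manipulator frame).
#include <Eigen/Dense>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    const std::vector<Eigen::Vector3d> P = {{0.1, 0.2, 0.8}, {0.2, 0.2, 0.8},
                                            {0.1, 0.3, 0.8}, {0.2, 0.3, 0.8}};
    const std::vector<Eigen::Vector3d> Q = {{0.30, 0.10, 0.0}, {0.30, 0.20, 0.0},
                                            {0.40, 0.10, 0.0}, {0.40, 0.20, 0.0}};

    Eigen::Vector3d pBar = Eigen::Vector3d::Zero(), qBar = Eigen::Vector3d::Zero();
    for (std::size_t i = 0; i < P.size(); ++i) { pBar += P[i]; qBar += Q[i]; }
    pBar /= static_cast<double>(P.size());
    qBar /= static_cast<double>(Q.size());

    // Cross-covariance of the centred point sets.
    Eigen::Matrix3d H = Eigen::Matrix3d::Zero();
    for (std::size_t i = 0; i < P.size(); ++i)
        H += (P[i] - pBar) * (Q[i] - qBar).transpose();

    Eigen::JacobiSVD<Eigen::Matrix3d> svd(H, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Matrix3d R = svd.matrixV() * svd.matrixU().transpose();
    if (R.determinant() < 0.0) {            // reject the reflection solution
        Eigen::Matrix3d V = svd.matrixV();
        V.col(2) *= -1.0;
        R = V * svd.matrixU().transpose();
    }
    const Eigen::Vector3d t = qBar - R * pBar;  // maps camera points into the arm frame

    std::cout << "R =\n" << R << "\nt = " << t.transpose() << "\n";
    return 0;
}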
6.2 Detecting Interesting Locations
6.2.1 Collecting Symmetry Intersects
The process of detecting interesting locations begins with the collection of symmetry
intersects across multiple video images. The use of multiple images ensures that the
detected interesting locations are temporally stable. Symmetry lines are detected in the
left and right camera image over 25 time-contiguous video images. As the stereo cameras
record at 25 frames per second, data collection only requires one second.
All possible pairings of symmetry lines between the left and right images are triangulated
to form 3D axes of symmetry using the method described previously in Section 4.3.3. In
the experiments, three symmetry lines are detected for each camera image, resulting in a
maximum of nine triangulated axes of symmetry. Symmetry axes that are more than 10
degrees from being perpendicular to the table plane are rejected. The intersection points
between the remaining symmetry axes and the table plane are calculated. To prevent the
detection of interesting locations outside the robot’s reach, only symmetry intersections
within the workspace of the robot manipulator are collected for clustering. The clustering
method is given the 2D locations of the symmetry intersects on the table plane.
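As an illustration of the axis filtering just described, the sketch below checks whether a triangulated symmetry axis lies within 10 degrees of the table normal and, if so, intersects it with the table plane. The plane representation (a point on the table plus a unit normal), the vector type and the function names are assumptions made for this sketch and are not taken from the thesis implementation.

#include <algorithm>
#include <array>
#include <cmath>
#include <optional>

using Vec3 = std::array<double, 3>;

static double dot(const Vec3& a, const Vec3& b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

// Axis: point p and unit direction d. Table plane: point q and unit normal n.
std::optional<Vec3> intersectIfNearPerpendicular(const Vec3& p, const Vec3& d,
                                                 const Vec3& q, const Vec3& n,
                                                 double maxTiltDeg = 10.0)
{
    const double kPi = 3.14159265358979323846;

    // Tilt of the axis away from the table normal (0 degrees = perfectly upright).
    double cosTilt = std::fabs(dot(d, n));                 // |cos| handles flipped directions
    double tiltDeg = std::acos(std::min(1.0, cosTilt)) * 180.0 / kPi;
    if (tiltDeg > maxTiltDeg)
        return std::nullopt;                               // axis is not upright enough

    // Line-plane intersection: p + s*d lies on the plane when dot(p + s*d - q, n) = 0.
    Vec3 pq{q[0] - p[0], q[1] - p[1], q[2] - p[2]};
    double s = dot(pq, n) / dot(d, n);
    return Vec3{p[0] + s*d[0], p[1] + s*d[1], p[2] + s*d[2]};
}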
It is possible for two background symmetry lines to produce a valid symmetry axis through
triangulation. In practice, the perpendicular-to-table constraint rejects the majority of
non-object symmetry axes. The limited workspace of the robot manipulator also implicitly
removes many background symmetry axes. In the rare case where a phantom symmetry
axis results in the detection of an interesting location, the robotic nudge will quickly
confirm that the location is empty. Note that the robot will not attempt to segment an
empty location as no object motion will be induced by the robotic nudge.
6.2.2 Clustering Symmetry Intersects
The intersections between valid symmetry axes and the table plane are collected over 25
pairs of video frames and recorded as 2D locations on the table plane. These intersect
locations are grouped into clusters using the QT algorithm [Heyer et al., 1999], which has been modified to deal with 2D input data. The QT algorithm works by
looking for clusters that satisfy a quality threshold. Unlike K-means clustering, the QT
clustering algorithm does not require any prior knowledge of the number of actual clusters.
This is important as it frees the robot from making an assumption about the number of
objects on the table.
Several modifications are made to the QT algorithm. Firstly, the recursion in the algorithm
is removed to reduce its computational cost. The original QT method is designed for
clustering genetic information where an upper limit for the number of clusters is unknown.
However, as the maximum number of object symmetry lines the robot can detect is limited
to three per video image, the recursion is replaced with loop iterations. Secondly, a
minimum cluster size threshold is used to reject temporally unstable symmetry intersects.
As the maximum number of symmetry axes per stereo image pair is nine, a temporally
stable symmetry axis will contribute at least 1/9 of the intersects used in clustering. As such, the minimum cluster size is set to one-ninth of the total number of intersects.
The QT clustering is performed as follows. The algorithm iterates through each symmetry
intersect, adding all other intersects within a threshold distance. The cluster with the largest number of intersects is returned. The algorithm repeats on the remaining intersects until the required number of clusters has been returned. In the experiments, the clusters are
limited to 10mm in radius. This prevents the formation of clusters that include symmetry
intersects from multiple objects. The geometric centroids of the clusters are used by the
robot as a list of interesting locations to explore. A nudge is performed on the valid
location closest to the right camera. A location is deemed invalid if the end effector will
collide with other interesting locations during the robotic nudge.
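A minimal sketch of the modified QT clustering described above is given below. The 10mm cluster radius and the one-ninth minimum cluster size follow the text; the cap of three clusters (matching the three symmetry lines detected per image), the data structures and the names are assumptions made for this sketch rather than the thesis implementation.

#include <cmath>
#include <cstddef>
#include <vector>

struct Point2D { double x, y; };

std::vector<Point2D> clusterIntersects(std::vector<Point2D> pts,
                                       double radius = 10.0,        // mm, on the table plane
                                       std::size_t maxClusters = 3) // assumed cap for this sketch
{
    std::vector<Point2D> centroids;
    const std::size_t minClusterSize = pts.size() / 9;   // temporal stability threshold

    while (!pts.empty() && centroids.size() < maxClusters) {
        // For every intersect, gather all other intersects within the radius.
        std::vector<std::size_t> best;
        for (std::size_t i = 0; i < pts.size(); ++i) {
            std::vector<std::size_t> cluster;
            for (std::size_t j = 0; j < pts.size(); ++j) {
                double dx = pts[i].x - pts[j].x, dy = pts[i].y - pts[j].y;
                if (std::sqrt(dx*dx + dy*dy) <= radius)
                    cluster.push_back(j);
            }
            if (cluster.size() > best.size())
                best = cluster;                            // keep the largest candidate cluster
        }
        if (best.empty() || best.size() < minClusterSize)
            break;                                         // remaining intersects are unstable

        // The centroid of the winning cluster becomes an interesting location.
        Point2D c{0.0, 0.0};
        for (std::size_t idx : best) { c.x += pts[idx].x; c.y += pts[idx].y; }
        c.x /= best.size();
        c.y /= best.size();
        centroids.push_back(c);

        // Remove the clustered intersects and repeat on the remainder.
        std::vector<char> inBest(pts.size(), 0);
        for (std::size_t idx : best) inBest[idx] = 1;
        std::vector<Point2D> rest;
        for (std::size_t j = 0; j < pts.size(); ++j)
            if (!inBest[j]) rest.push_back(pts[j]);
        pts.swap(rest);
    }
    return centroids;
}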
6.3 The Robotic Nudge
After selecting an interesting location on the table to explore, a robotic nudge is applied
to the target location. The robotic nudge is used to detect the presence of an object as
well as to generate the necessary object motion to perform segmentation. The robotic
nudge is designed to actuate objects across the table in a controlled manner. It is able to
move fragile objects such as ceramic mugs as well as objects with high center of gravity
such as empty beverage bottles. Visual sensing occurs in parallel with the robotic nudge.
Once object motion is detected, stereo tracking is used to monitor the nudge to ensure
sufficient and consistent object actuation.
6.3.1 Motion Control
The motion of the robot’s end effector during a nudge is shown in Figures 6.3 and 6.4.
The L-shaped protrusion is made of sponge to provide damping during contact, which is
especially important when nudging brittle objects such as ceramic cups. The L-shaped
sponge also allows object contact to occur very close to the table plane. By applying force
to the bottom of objects, nudged objects are less likely to tip over.
Figure 6.3 shows the side view of the robotic nudge motion. The nudge begins by lowering
the gripper from P0 to P1. The height of the gripper at location P0 is well above the
height of the tallest expected object. Dmax is set to ensure that the L-shaped sponge will not hit the largest expected object during its descent. In the experiments, Dmax is set to allow at least 20mm of clearance.
After arriving at P1, the gripper travels towards P2. Dmin is selected such that the gripper will make contact with the smallest expected object before arriving at P2. Dmin is set to 10mm in the experiments. The gripper then retreats back through P1 to P0. In early
Figure 6.3: The robotic nudge – Side view.
tests, the gripper was moved directly from P2 back to P0. This knocked over tapered
objects such as the blue cup in Figure 6.8(a) due to friction between the soft sponge and
the object’s outer surface. Note that the end effector never visually crosses an object’s
symmetry line during any part of the robotic nudge.
An overhead view of the nudge is provided in Figure 6.4. The nudge motion is perpendic-
ular to the line formed between the focal point of the right camera and the location being
nudged. If an object is present at the location, the robotic nudge will actuate the object
horizontally across the camera’s image. This reduces the change in object scale caused
by the nudge and also lowers the probability of glancing contact, improving the quality of
segmentation.
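The nudge direction itself reduces to a simple geometric construction: a unit vector lying in the table plane and perpendicular to the line joining the right camera's focal point and the target location. The sketch below illustrates one way to compute it; the vector type, the names and the assumption of a unit table normal are illustrative rather than taken from the thesis implementation.

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

static Vec3 cross(const Vec3& a, const Vec3& b)
{
    return Vec3{a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]};
}

static Vec3 normalized(const Vec3& v)
{
    double n = std::sqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
    return Vec3{v[0]/n, v[1]/n, v[2]/n};
}

// target: interesting location on the table; camera: right camera focal point;
// tableNormal: unit normal of the table plane.
Vec3 nudgeDirection(const Vec3& target, const Vec3& camera, const Vec3& tableNormal)
{
    // Viewing line from the camera to the target location.
    Vec3 view{target[0]-camera[0], target[1]-camera[1], target[2]-camera[2]};
    // A vector perpendicular to the viewing line and lying in the table plane.
    return normalized(cross(tableNormal, view));
}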
Figure 6.4: The robotic nudge – Overhead view.
A robotic nudge captured by the right camera is shown in Figure 6.5. The end effector
appears mirrored in the video images because the diagram in Figure 6.3 is drawn from
the experimenter’s point of view. Notice that the nudge only moves the transparent
cup a short distance. This prevents large object pose changes that can negatively affect
the segmentation results. Also, the small workspace footprint of the nudge reduces the
probability of object collisions when the robot operates on cluttered scenes.
Figure 6.5: Consecutive right camera video images taken during the P1–P2–P1 portion of the
robotic nudge.
After selecting an interesting location, P0, P1 and P2 are determined based on the camera’s
location. Using inverse kinematics, linearly-interpolated encoder values are generated at
runtime to move the end effector smoothly between these three points. In the robotic
experiments, the above information is displayed to the user before the nudge. Figure 6.6
shows an example of the nudge visualization provided to the user. The target location is
shown as a blue circle, with the cluster centroid drawn as a blue dot. The motion of the
planned nudge is drawn as a green arrow. The camera location is shown as a red rectangle
on the left of the image. The workspace of the robot manipulator is coloured white. The
inner and outer borders of the workspace lie at radii of roughly 150mm and 350mm from the base joint of the robot arm.
Figure 6.6: Workspace visualization of robotic nudge.
6.3.2 Obtaining Visual Feedback by Stereo Tracking
During the nudge, the robot monitors the right camera image for object motion. Motion
detection is performed using the block motion method described previously in Section 5.2.2.
The robot only monitors for object motion in the green region in Figure 6.3. As the end
effector never crosses the symmetry line during the nudge, the ego motion of the robot will
not be detected as object motion. If no object motion is found, the entire segmentation
process can be restarted on the next interesting location. The detection of object motion
during the nudge allows the robot to identify empty interesting locations, which prevents
the production of near-empty segmentations as encountered by Fitzpatrick et al. Similarly,
poor segmentations due to insufficient object motion caused by glancing contact are also
avoided.
Once object motion is detected, the robot begins tracking on the target object’s symme-
try line. Tracking is performed in stereo using two independent symmetry trackers, one
for each camera. Each tracker is identical to the one described previously in Chapter 5.
Tracking stops after the nudge is complete and sufficient time has been given to allow the
nudged object to cease all motion. If both trackers converge, the final tracking estimates
from the left and right trackers are triangulated to form a 3D symmetry axis. If the sym-
metry axis is roughly perpendicular to the table plane, object segmentation is performed.
If either tracker fails or the resulting symmetry axis is no longer perpendicular to the
table, object segmentation is not performed. This prevents poor segmentations due to
unwanted object motion such as the target object being tipped over.
6.4 Object Segmentation
Segmentation is performed using the object motion generated by the robotic nudge. Fig-
ure 6.7 illustrates the major steps of the proposed object segmentation method. Fig-
ure 6.7(a) and Figure 6.7(b) are images taken with the right camera before and after the
robotic nudge. The absolute frame difference between the before and after images is shown
in Figure 6.7(c). The object's symmetry lines are overlaid on top of the frame difference
image. The symmetry line of the object before the nudge is drawn in green and the line
after the nudge is coloured red.
Thresholding the raw frame difference will produce a binary mask that includes many
background pixels. The mask will also have a large gap at the center due to the low frame
difference in the interior of the nudged object. By using the object’s symmetry line, these
problems can be overcome. Figure 6.7(d) shows the compressed frame difference. This
image is produced by removing the pixels between the two symmetry lines in the frame
difference image. The region of pixels on the left of the red symmetry line is rotated so
that the symmetry line is vertical. Similarly, the region on the right of the green symmetry
line is rotated until the line is vertical. The compression is performed by merging the
left and right regions so that the red and green symmetry lines lie on top of each other.
After compression, a small gap still remains in the frame difference image. This can
be seen in Figure 6.7(d) as a dark vertical V-shaped wedge inside the cup-like shape.
To remedy this, the robot exploits object symmetry to its advantage. Recall that the
compression step merges the symmetry lines of the object in the before and after frames.
Using this merged symmetry line as a mirror, the robot searches for motion on either side
of it. If symmetric motion is found on the pixel pair on either side of the symmetry line,
the pixels between the pair are filled in. This is similar to the motion mask refinement
approach previously described in Section 5.2.3, except single pixels are used as opposed
to 8 ×8 blocks. The resulting symmetry-refined image is shown in Figure 6.7(e). Finally,
a segmentation binary mask is produced by thresholding the symmetry-refined difference
image. The segmentation result in Figure 6.7(f) is produced by a simple pixel-wise logical
AND between the binary mask and the after nudge image.
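The symmetry refinement step can be illustrated with the following sketch, which operates on a compressed frame difference image that has already been rotated so that the merged symmetry line is a vertical column. For every row, if a mirrored pixel pair on both sides of the symmetry line exhibits motion, the pixels between the pair are filled in; thresholding the refined image afterwards yields the segmentation mask. The image container, the column-based line representation and the motion threshold value are assumptions made for this sketch.

#include <algorithm>
#include <cstdint>
#include <vector>

struct GrayImage {
    int width = 0, height = 0;
    std::vector<std::uint8_t> data;                       // row-major, width * height pixels
    std::uint8_t& at(int x, int y) { return data[y * width + x]; }
};

// diff: compressed frame difference; xSym: column of the merged symmetry line.
void refineWithSymmetry(GrayImage& diff, int xSym, std::uint8_t motionThreshold = 30)
{
    for (int y = 0; y < diff.height; ++y) {
        // Offsets from the symmetry line, stopping at the nearest image border.
        int maxOffset = std::min(xSym, diff.width - 1 - xSym);
        for (int off = 1; off <= maxOffset; ++off) {
            bool leftMoving  = diff.at(xSym - off, y) >= motionThreshold;
            bool rightMoving = diff.at(xSym + off, y) >= motionThreshold;
            if (leftMoving && rightMoving) {
                // Symmetric motion found: fill everything between the mirrored pair.
                for (int x = xSym - off; x <= xSym + off; ++x)
                    diff.at(x, y) = 255;
            }
        }
    }
}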
(a) Before nudge (b) After nudge
(c) Frame difference (d) Compressed difference
(e) Symmetry refinement (f) Segmentation result
Figure 6.7: Motion segmentation using symmetry. Note that the compressed difference and
symmetry refinement images are rotated so that the object’s symmetry line is vertical.
6.5 Autonomous Segmentation Results
Segmentation experiments were carried out on ten objects of different size, shape, texture
and colour. Transparent, multi-coloured and near-symmetric objects are included in the
test object set. Both plain and cluttered scenes were used in the experiments, with some
objects set against two different backgrounds. While the experiments are carried out
indoors, four bright fluorescent ceiling light sources provide uneven illumination in the
scenes. In total, twelve experiments were carried out. The robot was able to autonomously
segment the test object in all twelve experiments without any human aid.
Segmentation results are available from the nudge folder of the accompanying multimedia
DVD. The segmentation results are tabulated in index.html. Each row of the table
contains an image of the segmentation result alongside corresponding videos of stereo
symmetry tracking and the robotic nudge as filmed from an external video camera. Note
that the full resolution image is displayed by clicking on an image in the table. For safety
reasons, a warning beacon is active when the robot manipulator is powered, periodically
casting red light on the table. The beacon flash can be observed sporadically in the
tracking videos.
The segmentation results and videos are also available online:
• www.ecse.monash.edu.au/centres/irrc/li_iro08.php
6.5.1 Cups Without Handles
The test object set includes several types of common household objects. The first subset
contains cups without handles. The segmentation is accurate and includes the entire cup. Figure 6.8(a) places a blue cup against background clutter. Again, the segmentation is accurate. This pair of segmentations illustrates the robustness of the autonomous approach
when operating against different backgrounds.
The white cup in Figure 6.8(b) poses a different challenge to the segmentation process.
Apart from its imperfect symmetry, the narrow stem of the cup results in a very small
shift in object location after the nudge. This creates a narrow motion contour in the frame
difference. The resulting segmentation shows that the proposed method only requires
minute object actuation to function. Lastly, Figure 6.8(c) shows that the robot is able to
autonomously obtain a very clean segmentation of a transparent cup against background
clutter. This highlights the flexibility and robustness of the symmetry-based approach to
motion segmentation, which is able to operate across objects of varying visual appearances.
(a) Blue cup in clutter
(b) Near-symmetric white cup
(c) Transparent cup in clutter
Figure 6.8: Autonomous segmentation results – Cups.
6.5.2 Mugs With Handles
Unlike the dynamic programming segmentation approach described in Section 4.2, the new
motion-based segmentation approach is able to include asymmetric portions of objects
in the segmentation results. This capability is tested by using mugs with handles as
test objects. The white mug in Figure 6.9(a) contains an asymmetric handle, which is
successfully included in the segmentation result.
The multi-colour ceramic mug in Figure 6.9(b) is a brittle near-symmetric test object. The
robotic nudge is able to gently actuate this fragile object while generating enough motion
to initiate stereo tracking. The segmentation result includes the mug but also a small portion of the
background. The reason for this can be seen in the stereo tracking videos. During the
nudge, the L-shaped foam protrusion made contact with the table cloth. This was due
to a mechanical error in the PUMA 260 wrist joint that caused the end effector to dip
towards the table.
(a) White mug
(b) Multi-colour mug
Figure 6.9: Autonomous segmentation results – Mugs with handles.
6.5.3 Beverage Bottles
The remaining test objects are various bottles designed to store liquids. Due to their
elongated shape, they generally have high centers of gravity, which means they are easy to
tip over. This is especially true for the textured and transparent bottles as they are empty.
These empty bottles also tend to wobble when actuated, which provides a challenge for
the stereo symmetry tracking. Figures 6.10(a) and 6.10(b) show the segmentation results
for two textured bottles against a plain background. Figure 6.10(c) shows the segmentation
result of a textured bottle against background clutter. These results suggest that the robot
is able to nudge and segment drink bottles robustly and accurately.
Next, a small water-filled bottle is used to test the strength and accuracy of the robotic
nudge. Due to its size and weight, the nudge must be accurate and firm to produce
sufficient object motion to initiate tracking. The successful segmentation in Figure 6.11(a)
shows that the robotic nudge is capable of actuating small and dense objects.
Finally, the proposed segmentation approach is tested on an empty transparent bottle.
Apart from the inherent visual difficulties posed by transparency, the top-heavy and light
bottle also tests the robustness of the robotic nudge. Figures 6.11(b) and 6.11(c) show
two segmentation results for the empty transparent bottle.
6.6 Discussion and Chapter Summary
This chapter has shown that object motion generated by robotic manipulation can be
used to produce accurate and physically true segmentations. The robot autonomously
and robustly carried out the entire segmentation process, including the robotic nudge
used to actuate objects. By using the robotic nudge, the robot was able to segment
a brittle multi-colour mug with an asymmetric handle as well as transparent objects.
Experiments on household objects have shown that the robotic nudge is a robust and
gentle method of actuating objects for the purpose of motion segmentation. The accurate
segmentation results confirm that the implicit assumption that object scale and orientation are not significantly altered by the robotic nudge holds true in practice.
All twelve segmentation experiments were carried out successfully. However, it is possible
for the robot to fail in obtaining an object segmentation. The most obvious issue is the
lack of bilateral symmetry in the target object, which means the object will be invisible
to the robot’s symmetry-based vision system. As previously discussed, the fast symmetry
detector operates on edge pixels as input. As such, visually orthogonal approaches operat-
ing on features such as colour and image gradients can be used synergetically to deal with
asymmetric objects. However, the use of other vision methods will probably be predicated
on object opaqueness. For example, stereo optical flow and graph cuts segmentation may
fail due to the unreliable surface pixel information of transparent objects when they are
under actuation.
(a) Green textured bottle
(b) Brown textured bottle
(c) White textured bottle in clutter
Figure 6.10: Autonomous segmentation results – Beverage bottles.
(a) Small water-filled bottle
(b) Transparent bottle
(c) Transparent bottle in clutter
Figure 6.11: Autonomous segmentation results – Beverage bottles (continued).
In early experimental trials, the robot had a propensity to knock over test objects as
its end effector did not have the L-shaped sponge protrusion. This led to the use of
stereo tracking to ensure stable object motion before attempting object segmentation.
The stereo trackers are able to prevent object segmentation if unstable object motion is
induced by the robotic nudge. Trials where no object is present at the nudged location were
also conducted during the prototyping stage. The lack of object motion prevented any
object segmentation attempt, which essentially allows the robot to detect clear portions
of the workspace using the robotic nudge. Also, the accidental actuation of asymmetric
objects will not lead to an object segmentation attempt. This is because of the use
of stereo tracking during the nudge, which will receive no symmetry measurements as
the nudged object is not symmetric. The stereo tracker will diverge, resulting in the
segmentation process being restarted before object segmentation can take place. Overall,
the many checks and balances in the proposed autonomous approach ensure that poor
segmentations rarely occur.
The proposed system does not explicitly address end effector path planning and obsta-
cle avoidance. These problems are difficult to tackle without the use of 3D sensing or
resorting to simple environments where the scene and test objects are previously mod-
elled. As the symmetry-based sensing methods are unable to localize asymmetric objects
in three dimensions, other stereo methods are required to construct reliable occupancy
maps. Given the robot’s knowledge of the table plane, a dense stereo approach should
be able to extract enough surface information to allow for end effector path planning and
obstacle avoidance. The problem of multiple symmetric objects within the robot's workspace is left to future work, although it seems straightforward to use the robotic nudge
to clear up possible triangulation ambiguities that occur. As the robotic nudge is very
compact in terms of its workspace usage, it is well suited for use in cluttered scenes where
path planning must consider multiple obstacles and symmetric objects.
Now that the robot can autonomously obtain segmentations of new objects, many sensing
and manipulation possibilities open up for exploration. For example, the segmentations
obtained can be used as the positive images of a data set for the purposes of training up
a boosted Haar cascade [Viola and Jones, 2001]. This will allow the robot to recognize
previously segmented objects without requiring manual collection and labelling of training
images. The next chapter shows that segmentations obtained using the robotic nudge
enable the autonomous grasping and modelling of new objects.
Intelligence is the ability to adapt to change.
Stephen Hawking
7
Object Learning by Interaction
7.1 Introduction
In the previous chapter, the robot was imparted with the ability to detect and segment
near-symmetric objects autonomously. The robot used a precise nudge to extract object
information, removing the traditional reliance on prior knowledge such as shape and colour
models. The previous chapter also suggested that autonomously obtained segmentations
can be used as training data for an object recognition system. The work presented in
this chapter confirms this suggestion experimentally. The robot learns and recognizes
objects autonomously by integrating the work presented in previous chapters with new
object manipulation and visual sensing methods. Training data is collected by physically
grasping an object and rotating it in view of the robot’s cameras. Experiments show
that the robot is able to autonomously learn object models and then use these models to
perform robust object recognition.
7.1.1 Motivations
Autonomous learning is an interesting concept as humans use it regularly to adapt to new
environments. A robot with the ability to learn new objects without human guidance will
be more flexible and robust across different operating environments. Instead of relying
on manually collected training data or object models provided by humans, the robot can
learn about objects all by itself. This shifts the burden of training data collection and
model construction from human users to the robot. By doing so, the robot is now able to
operate in environments such as the household where the large number of unique objects
makes exhaustive modelling and training intractable.
The autonomous use of robotic actions to nudge, grasp and rotate objects in order to
produce reusable models is novel. The experimental platform allows us the unique oppor-
tunity to examine the robustness and accuracy of an object recognition system trained
using images obtained autonomously by a robot. David Lowe’s scale invariant feature
transform (SIFT) [Lowe, 2004] is used to model and recognize objects. As such, these
experiments also provide an insight into the performance of SIFT for modelling grasped
objects in a robotics context.
Overall, the primary goal remains the development of model-free approaches to robotics
and computer vision. In essence, the wish is to provide the robot with as little prior
knowledge as possible while still having it perform useful tasks such as object learning and
recognition. The autonomous approach presented in this chapter will hopefully encourage
more future work that focuses on the use of robotic action to perform recognition sys-
tem training, as opposed to the currently accepted norm of offline training on manually
obtained data.
7.1.2 Contributions
The work presented here makes several contributions to research in the area of object
manipulation, object learning and object recognition. On the global level, the completely
autonomous nature of the proposed approach is novel. The autonomous use of object
manipulation to generate training data for object learning has not been attempted in the
past, only suggested as a future possibility. Experiments show that a robot can obtain
training data by itself and use the robot-collected data to learn reusable visual models
that allow robust object recognition. Research contributions are made in the following
areas.
Performing Advanced Object Manipulations Autonomously
Existing approaches to object manipulation generally require object models or the fitting
of geometric primitives to 3D data prior to robotic action. In his paper on object seg-
mentation using a robotic poking action [Fitzpatrick, 2003a], Fitzpatrick proposed that
it may be possible to perform what he termed advanced manipulations by using object
information obtained via a simple manipulation. Probably due to the one-fingered end
effector of his robot arm, Fitzpatrick did not address this suggestion any further in his
research.
The work presented in this chapter confirms Fitzpatrick’s suggestion. The robotic nudge
is successfully used to obtain the necessary object information to perform a grasping
operation. The autonomously obtained object segmentation is used in stereo to find the
height of the nudged object. By using the relatively simple robotic nudge, the robot is able
to grasp near-symmetric objects without requiring any prior geometric knowledge of the
target object such as width and location. Upon a successful grasp, the robot rotates the
object about its longitudinal axis to obtain views of the object at different orientations.
After rotation, the object is replaced at its pre-nudge location. The robot’s ability to
leverage a simple object manipulation to perform more advanced manipulations is novel
and useful in situations where the robot has to deal with new objects.
Autonomous Training Data Collection for Object Learning
The previous chapter suggested that autonomously collected object segmentations can be
used as training data for an object recognition system. While the autonomously produced
segmentations are accurate, they only present a single view of the nudged object. By
grasping the nudged object, the robot now has access to multiple views of the object.
Once an object has been grasped, the robot rotates the object over 360 degrees. Images
are captured at 30-degree intervals, resulting in twelve training images for each object.
This allows the construction of detailed object models that include the object’s surface
information across multiple orientations.
Traditionally, object modelling is done offline using a turntable-camera or scanning rangefinder
system. The concept of autonomous training data collection using a robot manipulator
for the purpose of object learning is relatively new. The quality of the robot-collected
training data is evaluated experimentally. Images of grasped objects are used to train an
object recognition system. The performance of the object recognition system provides an
indication of the usefulness of the autonomously collected training data. Object recog-
nition experiments also indirectly measure the effectiveness of the proposed autonomous
approach to object learning.
Robust Object Recognition using SIFT Descriptors
SIFT descriptors [Lowe, 2004] are used to perform object recognition. After collecting
training images of the grasped object from different orientations, SIFT descriptors are
detected for each image. SIFT descriptor sets from different views of the same object are
collected together to form an object model. Object recognition is performed by detecting
descriptors in the input image and matching them against a database of object descriptors.
The descriptor modelling process is further detailed in Section 7.3.
Traditionally, object recognition is performed using SIFT descriptors extracted from im-
ages where the target object has been segmented manually. The proposed approach de-
parts from the norm as the training data is obtained autonomously by the robot. While
the robotic nudge produces object segmentations that have similar accuracy to manual
segmentations, object grasping is inherently error prone. Due to the lack of object mod-
els, the robotic grasp is simply designed to firmly hold the object for rotation. The large
concave foam pads on the robot's gripper, as seen in Figure 7.2(a), ensure a firm grasp
but may displace the object or change the object’s orientation. Once an object is grasped,
the robot is no longer able to maintain an accurate estimate of the object’s pose.
The imprecise grasping operation is a problem as the robot no longer has the ability to
accurately distinguish between object and background features. Without a way to prune
away background features, robust object recognition is impossible. This is especially
crucial when using SIFT descriptors, as the inclusion of background descriptors may lead to
repeated false positives. To overcome this problem, a pruning method has been developed
to remove background SIFT descriptors by cross-examining the descriptors detected from
all image views of the object. The proposed pruning method does not require bilateral
symmetry to operate, which prevents background symmetries from disrupting the pruning
process. As such, the pruning method can be used to remove background SIFT descriptors
in any situation where an object is being rotated in front of a static scene. The pruning
method is fully detailed in Section 7.3.3.
7.2 Autonomous Object Grasping After a Robotic Nudge
The object learning process begins with a robotic nudge as described previously in Section 6.3.
After a successful robotic nudge, the motion segmentation method detailed in Section 6.4
is performed twice to produce an object segmentation for each camera image. The three
dimensional information needed to perform object grasping is obtained by triangulating
the top of the object segmentations. Grasping is performed using a two-fingered robot
gripper mounted at the end of a PUMA 260 manipulator. The robot manipulator and grip-
per are controlled using the custom-built controller detailed in Appendix B. This section
details the method the robot uses to grasp objects after the robotic nudge.
7.2.1 Robot Gripper
Object grasping requires an end effector with two or more fingers. A gripper with more
fingers generally provides a more stable grasp, as more points of contact are made with an
object. However, more fingers also require more complex kinematics planning and the need
for self collision avoidance. As the robot does not have detailed three dimensional models
of target objects, little benefit is gained from having more fingers on the gripper. Instead,
the gripper and grasping approach are chosen to ensure a stable grasp. Concave foam
pads are added to a two-fingered gripper to increase the area of gripper-object contact to
ensure grasp stability.
Figure 7.1 shows the gripper and angle bracket before the addition of foam pads. The
gripper is an Otto Bock prosthetic hand, which has a wide maximum grasp width. The
gripper has two fingers, with the left finger having two grey tips that make contact when
grasping an object. As the gripper is a human prosthesis, the wrist's longitudinal axis is
at an angle to the gripper’s opening. A custom-made angle bracket is used to correct this
misalignment, so that the gripper opening is in line with the wrist axis about which the
gripper rotates. The angle bracket can be seen at the top left of Figure 7.1.
Figure 7.1: Robot gripper and angle bracket.
Figure 7.2 shows the modifications that have been made to the gripper. Blue foam pads
have been added to the gripper to increase the contact area between the gripper fingers
and a grasped object. The concave shape of the pads helps guide the grasped object
towards the center of the gripper, which increases grasp stability. The blue foam is rigid,
unlike the soft foam that is used to construct the L-shaped nudging protrusion. A small
piece of metal was added to the bottom finger to compensate for the fact that the upper
finger has two tips. Figure 7.2(b) shows the gripper at maximum opening width.
(a) Front view of gripper (b) Side view of gripper
Figure 7.2: Photos of robot gripper.
7.2.2 Determining the Height of a Nudged Object
Recall from the previous chapter that after a successful robotic nudge, stereo tracking
provides the three dimensional location of the nudged object’s symmetry axis. The location
of the object on the table can be found by calculating the intersection between its symmetry
axis and the table plane. However, grasping also requires the height of the target object.
This prevents the gripper from grasping air or hitting the top of the object during its
descent. In addition, the grasp can be planned so that the gripper only occludes a small
portion of the object, which allows more visual features to be identified during model
building.
The motion segmentation method detailed previously in Section 6.4 is used to obtain two
segmentations, one for each camera. In each segmentation, the top of the object in the
camera image is detected using the following method. The largest 8-connected blob in
the binary segmentation mask is found, while all other blobs are removed. This prevents
background motion along the object’s symmetry line from affecting the detected top of
object location. The CvBlobs library [Borras, 2006] is used to perform the binary blob
analysis. After identifying the object binary blob, the intersection between the post-nudge
symmetry line estimate of the tracker and the top of the object binary blob is found. This
intersection point is returned as the top of the object in each camera image. Figure 7.3
shows the top of a nudged object detected using this symmetry intersection approach.
(a) Left camera image (b) Right camera image
Figure 7.3: Detecting the top of a nudged object. The top of the object’s binary blob is shown
as a black dot. The post-nudge symmetry estimate of the tracker within the object’s binary blob is
shown in red.
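The top-of-object detection can be sketched as follows: label the 8-connected blobs in the binary segmentation mask, keep the largest one, and walk down the post-nudge symmetry line until it enters that blob. The thesis uses the CvBlobs library for the blob analysis; the sketch below substitutes a simple BFS flood fill, and the mask container and the line parameterization (x = x0 + slope * y) are illustrative assumptions.

#include <cmath>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

struct BinaryMask {
    int width = 0, height = 0;
    std::vector<unsigned char> data;                     // non-zero = foreground
    bool on(int x, int y) const {
        return x >= 0 && y >= 0 && x < width && y < height && data[y * width + x] != 0;
    }
};

// Returns the topmost pixel of the largest 8-connected blob along the symmetry
// line x = x0 + slope * y, or nothing if the mask is empty or the line misses it.
std::optional<std::pair<int,int>> topOfLargestBlobOnLine(const BinaryMask& mask,
                                                         double x0, double slope)
{
    // Label connected components with a BFS flood fill (8-connectivity).
    std::vector<int> label(mask.width * mask.height, 0);
    int nextLabel = 0, bestLabel = 0, bestSize = 0;
    for (int y = 0; y < mask.height; ++y) {
        for (int x = 0; x < mask.width; ++x) {
            if (!mask.on(x, y) || label[y * mask.width + x] != 0) continue;
            int size = 0;
            ++nextLabel;
            std::queue<std::pair<int,int>> q;
            q.push({x, y});
            label[y * mask.width + x] = nextLabel;
            while (!q.empty()) {
                auto [cx, cy] = q.front(); q.pop();
                ++size;
                for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                        int nx = cx + dx, ny = cy + dy;
                        if (mask.on(nx, ny) && label[ny * mask.width + nx] == 0) {
                            label[ny * mask.width + nx] = nextLabel;
                            q.push({nx, ny});
                        }
                    }
            }
            if (size > bestSize) { bestSize = size; bestLabel = nextLabel; }
        }
    }
    if (bestSize == 0) return std::nullopt;

    // Walk down the symmetry line from the top of the image until the largest
    // blob is reached; that pixel is reported as the top of the object.
    for (int y = 0; y < mask.height; ++y) {
        int x = static_cast<int>(std::lround(x0 + slope * y));
        if (x >= 0 && x < mask.width && label[y * mask.width + x] == bestLabel)
            return std::make_pair(x, y);
    }
    return std::nullopt;
}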
Now that the top of the object has been visually located in each camera image, stereo
triangulation can be performed to determine the height of the object. However, there is
an inherent uncertainty when using stereo triangulation to obtain an object’s height. This
is because the top of the object in the camera image physically belongs
to the rear of the object. This results in a triangulated height that is always greater than
the actual height of the object.
Figure 7.4 represents the pertinent geometry of a single camera from the stereo pair, the
table plane and an object that is having its height triangulated. Notice that the blue line
joining the camera’s focal point and the top of the object in the camera view actually
intersects the rear of the object. Performing stereo triangulation on the blue ray using
verged stereo cameras will result in the location marked as a black dot in the figure. This
location is higher than the object's height for any graspable object.
Figure 7.4: Uncertainty of stereo triangulated object height.
The object height error is labelled as d in the figure. The object radius is labelled as r. In
cases where the object deviates from a surface of revolution, r represents the horizontal
distance between the object symmetry axis and the point on the top of the object that
is furthest from the camera. The angle between the camera’s viewing direction and the
table plane is labelled as θ. Using similar triangles, the height error d is described by the
following equation.
d = r tan θ (7.1)
For the experiment platform, θ is roughly thirty degrees. Humanoid robots dealing with
objects on a table at arm’s length will have a similar θ angle. At thirty degrees, d is
roughly 0.60 × r. As the radii of the top of the test objects range from 30mm to 90mm,
d will be between 18mm and 54mm. Therefore, the gripper grasping coordinates are
vertically offset by 36mm, which allows the robot to grasp all the test objects as long as
the gripper maintains a vertical tolerance of ±18mm. The large foam pads attached to
the gripper’s fingers well exceed the vertical tolerance. In a situation where the object
widths are unknown, the gripper’s maximum opening width can be used in place of the
maximum object width to determine the vertical offset.
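The offset calculation above can be reproduced with a few lines of arithmetic. The worked sketch below evaluates Equation 7.1 at the two extreme object radii and derives the vertical grasp offset and tolerance; note that the text rounds tan 30 degrees up to 0.60, so its quoted values of 18mm and 54mm are slightly larger than the exact ones computed here.

#include <cmath>
#include <cstdio>

int main()
{
    const double kPi = 3.14159265358979323846;
    double thetaDeg = 30.0;                      // camera elevation above the table plane
    double rMin = 30.0, rMax = 90.0;             // radii of the test objects, in mm

    // Equation 7.1: d = r * tan(theta) is the height overestimate.
    double dMin = rMin * std::tan(thetaDeg * kPi / 180.0);   // ~17 mm (18 mm with tan 30 ~ 0.60)
    double dMax = rMax * std::tan(thetaDeg * kPi / 180.0);   // ~52 mm (54 mm with tan 30 ~ 0.60)
    double offset    = 0.5 * (dMin + dMax);                  // vertical grasp offset, ~35 mm
    double tolerance = 0.5 * (dMax - dMin);                  // required tolerance, ~+/-17 mm

    std::printf("offset = %.1f mm, tolerance = +/- %.1f mm\n", offset, tolerance);
    return 0;
}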
7.2.3 Object Grasping, Rotation and Training Data Collection
After determining the object height, grasping is performed by moving the gripper directly
above the intersection between the object’s symmetry axis and the table plane. The
opened gripper is lowered vertically until the gripper fingers reach the object height.
The gripper is closed and then raised vertically to lift the grasped object above the table.
The gripper is raised such that the majority of the gripper is no longer visible in the right
camera image. This prevents the inclusion of gripper descriptors in the object model.
The grasped object is rotated about the vertical axis of the robot manipulator wrist.
Images of the object are taken at 30-degree intervals using the right camera in the stereo
pair, resulting in a total of 12 images per grasped object. The 30-degree angle increment
is chosen due to the use of SIFT descriptors for object modelling, which are reported by
[Lowe, 2004] to tolerate viewing orientation changes of ±15 degrees. An example training
data set is displayed in Figure 7.5.
Figure 7.5: Autonomously collected training data set – Green bottle.
Notice that in Figure 7.5, the green bottle’s symmetry line changes its orientation across
different training images. This is due to a slightly off-center grasp of the object, which
produces a tilt in the bottle’s longitudinal axis. The off-center grasp is due to symmetry
axis triangulation error as well as mechanical offsets between the gripper and robot manip-
ulator wrist. After rotating the grasped object to obtain a set of twelve training images,
the object is returned to the location it occupied on the table before the robotic nudge. This
allows for future revisits if further training data is needed.
Two videos explaining the autonomous grasping of nudged objects are available in the
learning/explain folder of the multimedia DVD. Both videos include audio commentary
by the thesis author. In combination, these two explanation videos demonstrate that the
robot can autonomously grasp symmetric objects of different heights. The blue cup video
contains a demonstration where a blue cup is nudged and then grasped autonomously by
the robot. The video also shows the graphical user interface of the robotic system, which
provides real time visualization of the planned nudge, object segmentation and other
pertinent information. The white bottle video shows the robot nudging, grasping and
rotating a white bottle. The saving of experiment data such as tracking images has been
minimized in this video, which allows the robot to perform the learning process at full
speed. The object segmentation obtained using the robotic nudge is also shown in real
time during this video.
7.3 Modelling Objects using SIFT Descriptors
7.3.1 Introduction
The scale invariant feature transform (SIFT) [Lowe, 2004] is a multi-scale method that
extracts descriptors from an image. SIFT descriptors are 128-value vectors that encode the
image gradient intensity and orientation information within a Gaussian-weighted window.
SIFT descriptors are highly unique, which makes them ideal for object recognition as the
probability of confusing the descriptors of different objects is very low. The SIFT process
is invariant against translation, rotation and scaling. SIFT descriptors are also invariant
to illumination changes. The combination of affine and illumination invariance allows for
very robust object recognition on real world images.
SIFT is a two step process. The first step detects stable regions in the input image.
A comprehensive survey of region detectors is provided by [Mikolajczyk et al., 2005].
Lowe’s recommended region detector finds stable locations in scale space, which essentially
represent high contrast blob features within the input image. Interest point detectors such
as the Harris corner [Harris and Stephens, 1988] and maximally stable extremal regions
(MSER) [Matas et al., 2002] can also be used to produce stable regions.
After finding a list of stable regions in the image, a descriptor is extracted from each region.
A survey of descriptors is provided in [Mikolajczyk and Schmid, 2005]. Descriptors are
designed to encode local pixel information into an easily matchable vector. Lowe’s SIFT
descriptor encodes local gradient intensity and orientation information into a 128-value
vector. The descriptor building process removes illumination information and limits the
impact of image noise. The Euclidean distance between descriptor vectors is used to
measure their similarity, with shorter distances representing better matches.
7.3.2 SIFT Detection
Grasped objects are modelled by performing SIFT detection on all twelve images of the
object’s training data set. The 640 × 480 pixel colour training images are converted to
grayscale prior to SIFT detection. David Lowe’s SIFT binary is used to perform the
detection. The binary outputs a plain text file that contains the descriptors returned from
detection. The author’s own C++ code is used to match and visualize descriptors.
The SIFT descriptors detected in Figure 7.6(a) are visualized in Figure 7.6(b). Each
descriptor is drawn as a blue circle. A circle’s radius is proportional to the scale of the
descriptor. The location of a descriptor is shown as a blue dot at the center of the circle.
Note that the circles are merely a visualization and are not the same size as the Gaussian
windows used during descriptor building. The SIFT detection produced 268 descriptors,
providing dense coverage of the bottle’s pixels. However, numerous background descriptors
have also been detected. This is especially noticeable in the top right of the image. To
overcome this problem, an automatic pruning step is performed to remove descriptors that
do not belong to the grasped object.
7.3.3 Removing Background SIFT Descriptors
The problem of background SIFT descriptors is illustrated by Figure 7.7(a). The blue
dots in the figure show the location of descriptors returned by SIFT detection. Notice that
apart from the grasped bottle, many background descriptors are detected. The inclusion of
these descriptors in the object model will result in false positives during object recognition,
especially when the robot is operating on objects set against the same background.
Before using the detected SIFT descriptors for object recognition, background descriptors
are removed using an automated pruning process. Figure 7.7(b) shows the remaining SIFT
descriptors after pruning away background descriptors. Notice that the dense distribution
of descriptors in the upper right and lower right of Figure 7.7(a) are no longer present in
the refined result. The descriptor belonging to the object’s shadow has also been removed.
Note however that several descriptors belonging to the L-shaped foam protrusion remain
in the refined result.
(a) Input image
(b) SIFT descriptors
Figure 7.6: SIFT detection example – White bottle training image.
(a) All SIFT descriptors
(b) Background descriptors removed
Figure 7.7: Removing background SIFT descriptors.
The automatic removal of background descriptors is performed as follows. Firstly, de-
scriptors that are very far away from the grasped object are removed. This is achieved by
placing a bounding box around the grasped object and searching for descriptors outside
this bounding box. The bounding box is large enough to allow for the object tilt and dis-
placement caused by imperfect grasping. This first step removes the majority of non-object
descriptors, such as the dense cluster of descriptors on the right side of Figure 7.7(a).
Secondly, the robot takes advantage of the fact that twelve training images are collected for
each object. As the grasped object is rotated in front of a static background, background
descriptors should occur much more frequently than object descriptors in the training
image set. Generally, a SIFT descriptor remains detectable for one forward and one
backward object rotation increment. Therefore, an object descriptor should only match
with a maximum of two other descriptors from the training image set, one from the image
recorded at the previous rotation step and one from the image recorded at the next rotation
step.
Programmatically, this constraint is applied by searching for descriptor matches between
the training images. A descriptor with two or fewer matches to descriptors of other training images is identified as an object feature. The remaining descriptors that have
three or more matches are considered to be background features and rejected before object
modelling. The ratio-based method described in Section 7.3.4 is used to find descriptor
matches.
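A sketch of this match-count pruning rule is given below. After the bounding box filter, each descriptor is compared against the descriptor sets of all other training images of the same object; if it matches three or more of them it is treated as a background feature, otherwise it is kept as an object feature. The inline matcher mirrors the ratio test of Section 7.3.4, but the containers, the names and the exact matcher are assumptions made for this sketch rather than the thesis implementation.

#include <array>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Descriptor = std::array<float, 128>;   // SIFT descriptor vector

static float euclidean(const Descriptor& a, const Descriptor& b)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) { float d = a[i] - b[i]; sum += d * d; }
    return std::sqrt(sum);
}

// Lowe-style ratio test against one training image's descriptor set.
static bool matchesSet(const Descriptor& d, const std::vector<Descriptor>& set, float N = 0.6f)
{
    float d1 = std::numeric_limits<float>::max(), d2 = d1;   // closest, second-closest
    for (const Descriptor& c : set) {
        float dist = euclidean(d, c);
        if (dist < d1) { d2 = d1; d1 = dist; } else if (dist < d2) { d2 = dist; }
    }
    return set.size() >= 2 && d1 < N * d2;
}

// views: one descriptor set per training image (twelve per object in the text).
std::vector<Descriptor> pruneBackground(const std::vector<std::vector<Descriptor>>& views,
                                        std::size_t viewIndex, std::size_t maxMatches = 2)
{
    std::vector<Descriptor> kept;
    for (const Descriptor& d : views[viewIndex]) {
        std::size_t matches = 0;
        for (std::size_t v = 0; v < views.size(); ++v)
            if (v != viewIndex && matchesSet(d, views[v]))
                ++matches;
        if (matches <= maxMatches)
            kept.push_back(d);      // seen in at most two neighbouring views: object feature
        // otherwise the descriptor is discarded as a background feature
    }
    return kept;
}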
The proposed pruning method for removing background SIFT descriptors can be applied
to any situation where an object is rotated at fixed angular increments in front of a static
background. The method does not require the rotated object to have any detectable lines
of symmetry. The pruning method does not use bilateral symmetry for SIFT descriptor
rejection as the robot’s gripper may disrupt or shift the grasped object’s symmetry axis.
Also, the human prosthesis used as the robot’s gripper tends to tilt the grasped object due
to the uneven closing speeds of its fingers. As such, the object’s symmetry line will require
tracking during the rotation if it is used for descriptor rejection.
The number of descriptor matches allowed within a training image set should be ad-
justed depending on the angular size of the rotation increment. Note that the removal
of background descriptors reduces the total number of descriptors in the object recogni-
tion database. For example, the removal of background descriptors shown in Figure 7.7
reduced the number of descriptors from 268 to 163. Reducing the quantity of descriptors
directly reduces the computational cost of object recognition and helps improve recognition
robustness by lowering the probability of false positives during descriptor matching.
7.3.4 Object Recognition
Figure 7.8 provides an overview of the robot’s object recognition process. The robot per-
forms object recognition by matching the descriptors detected in an input image with
descriptors in an object database. SIFT detection is performed on each of the twelve
robot-collected training images, resulting in twelve descriptor sets for each object. Back-
ground descriptors are removed from an object’s descriptor sets using the method detailed
previously in Section 7.3.3 prior to their insertion into the object recognition database.
Figure 7.8: Object recognition using learned SIFT descriptors.
SIFT descriptor matching is performed using the ratio-based method suggested by Lowe.
Descriptor matching occurs between the descriptors of the input image against all descrip-
tor sets in the object database, one set at a time. Matches are computed between the input
descriptors and each descriptor set in the database as follows. The Euclidean distances
between each input descriptor and the descriptor set in the recognition database are cal-
culated. The descriptor in the database set closest to the input descriptor according to its
Euclidean distance is recorded as a possible match. To ensure matching uniqueness, the
second-closest descriptor is also found. The closest descriptor is returned as a match if it
is much closer than the second-closest descriptor. This criterion is evaluated using the inequality d1 < (N × d2), where d1 and d2 are the Euclidean distances from the input descriptor to the closest and second-closest descriptors. N is set to 0.6 in the C++ implementation.
Object recognition is performed by exhaustively matching the input SIFT descriptors
against the descriptor sets in the object database. The set with the largest number of
descriptor matches with the input descriptors is considered best. The object label of the
best descriptor set is returned as the recognized object. The recognition system also
returns the training image that produces the best descriptor set for visual verification
purposes. According to [Lowe, 2004], a minimum of three correct matches is needed for
object recognition and localization. Therefore, object recognition will only return a result
if three or more descriptor matches are found between the input image and the best
matching descriptor set in the database.
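The recognition procedure can be summarized in the following sketch, which combines the ratio-based matching of this section with the best-set selection rule and the minimum of three matches. The descriptor and database containers, the function names and the brute-force nearest-neighbour search are assumptions made for this sketch; the thesis implementation may differ.

#include <array>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

using Descriptor = std::array<float, 128>;

struct DescriptorSet {
    std::string objectLabel;              // e.g. "White Bottle"
    std::vector<Descriptor> descriptors;  // one set per training image of the object
};

static float euclideanDistance(const Descriptor& a, const Descriptor& b)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) { float d = a[i] - b[i]; sum += d * d; }
    return std::sqrt(sum);
}

// Lowe-style ratio test: accept the nearest neighbour only if it is much closer
// than the second-nearest (N = 0.6, as in the text).
static bool ratioTestMatch(const Descriptor& d, const std::vector<Descriptor>& set,
                           float N = 0.6f)
{
    float d1 = std::numeric_limits<float>::max();   // closest distance
    float d2 = std::numeric_limits<float>::max();   // second-closest distance
    for (const Descriptor& c : set) {
        float dist = euclideanDistance(d, c);
        if (dist < d1) { d2 = d1; d1 = dist; }
        else if (dist < d2) { d2 = dist; }
    }
    return set.size() >= 2 && d1 < N * d2;
}

// Returns the label of the best matching set, or an empty string when fewer
// than three matches are found (no recognition result).
std::string recognize(const std::vector<Descriptor>& input,
                      const std::vector<DescriptorSet>& database)
{
    std::string bestLabel;
    std::size_t bestMatches = 0;
    for (const DescriptorSet& set : database) {
        std::size_t matches = 0;
        for (const Descriptor& d : input)
            if (ratioTestMatch(d, set.descriptors))
                ++matches;
        if (matches > bestMatches) { bestMatches = matches; bestLabel = set.objectLabel; }
    }
    return bestMatches >= 3 ? bestLabel : std::string();
}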
7.4 Autonomous Object Learning Experiments
The proposed autonomous object learning approach is evaluated using seven test objects.
The test objects are shown in Figure 7.9. The test objects are beverage bottles, including
two transparent bottles and a reflective glass bottle. Each object was grasped and rotated
by the robot to collect training images for object recognition. The entire learning process
is performed autonomously by the robot. The only human intervention required is the
placing of individual test objects within the robot’s reachable workspace at the beginning
of each experiment. SIFT detection and background descriptor removal are performed
automatically after the conclusion of each learning experiment.
The robot performed autonomous learning on all seven test objects. Videos of the robotic
nudge, object grasping and object rotation are available from the multimedia DVD in the
learning/grasp folder. The videos are named after the object captions in Figure 7.9.
There is a visible mechanical issue in the grasping videos. The object’s symmetry line in
the camera image is generally no longer vertical after the grasp. This is due to the use
of a human prosthesis hand as the robot’s gripper, which has an angled wrist. Also, the speed at which the fingers of the gripper close is uneven. The custom-built bracket shown
in Figure 7.1 only partially corrects this issue. The autonomous learning process is not
affected by this mechanical issue.
The long pause after the robotic nudge in the object grasping videos is due to the saving of
image data to document the experiment. This includes the writing of 200 tracking images of 640 × 480 pixels to the host PC’s hard drive, which takes a considerable amount of time. The
data collected by the robot during the learning experiments, including the training images,
are available from the learning/data folder. The readme.txt text file in the folder pro-
vides further details about the experiment data. Without the saving of experiment data,
the grasping can be planned and performed 160ms after the robotic nudge. A video walk-
through of the autonomous learning process, where the saving of experiment data has been
reduced, is available from the multimedia DVD as learning/explain/white bottle.avi.
The video walkthrough includes a verbal narration of the entire learning process by the
thesis author.
7.4.1 Object Recognition Results
After the completion of autonomous learning on all seven test objects shown in Figure 7.9,
object recognition experiments are performed using the learned object database. The
recognition system is tested using 28 input images, four for each of the seven test objects.
(a) White (b) Yellow (c) Green (d) Brown
(e) Glass (f) Cola (g) Transparent
Figure 7.9: Bottles used in object learning and recognition experiments.
Each quartet of input images contains the test object placed against different backgrounds,
ranging from plain to cluttered. The orientation of the test object is varied between the
four input images. Some input images also include large partial occlusions of the test
object.
The object recognition results are available from the learning/sift folder of the attached
multimedia DVD. The results are organized into folders according to the test object in the
input image. The input images are named inputXX.png where XX is a number from
00 to 03. In general, a higher image number implies greater recognition difficulty. The
training image that produced the best matching SIFT descriptor set is named databaseXX.png, where XX corresponds to the input image number.
The object recognition results are visualized in the images named matchXX.png. Again,
the XX at the end of the image file name is the same as the corresponding input image
number. These images are a vertical concatenation of the input image and the best
matching training image. The input image is shown above the matching training image.
SIFT descriptor matches are linked using red lines in the image. The label of the object
identified in the input image by the robot’s object recognition system is shown as green
text at the bottom of the image.
Table 7.1 contains the number of good and bad matches obtained during the object recog-
nition experiments. A good match is defined as one where the descriptors are at similar
locations on the object in the input image and the matching training image. Similarity
in location is determined visually via manual observation. Matches that do not belong to
the same part of the object are labelled as bad. For example, Figure 7.14 contains a bad
descriptor match between the feature on the bottle cap in the input image and the glossy
label in the training image.
Table 7.1: Object recognition results – SIFT descriptor matches

                Image 00     Image 01     Image 02     Image 03
Bottle          Good  Bad    Good  Bad    Good  Bad    Good  Bad
White             16    0       6    0      17    0       7    0
Yellow            14    0      11    0      24    0       4    0
Green             23    1      21    1      11    0       9    1
Brown             15    0      16    0      16    0       8    0
Glass              5    0       6    1       4    1       4    1
Cola               7    0       4    0       9    0      11    0
Transparent        6    0       7    1      11    0       6    0
As only three correct SIFT matches are needed for object recognition and pose estimation,
the results in Table 7.1 suggest that SIFT descriptors extracted from autonomously
collected training data are sufficient for robust object recognition. The correct object labels
returned by the recognition system for all 28 input images further confirm this sugges-
tion. The remainder of this section discusses some of the object recognition results in the
learning/sift folder of the multimedia DVD.
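To make the match-counting decision rule concrete, the following C++ sketch recognizes an
object by counting SIFT descriptor matches against each training view and requiring at least
three matches, as stated above. The Descriptor and ObjectModel types, the nearest/second-nearest
ratio test threshold and all function names are illustrative assumptions of this sketch, not
details of the thesis implementation.

#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

// A 128-dimensional SIFT descriptor (dimension per [Lowe, 2004]).
using Descriptor = std::vector<float>;

// Hypothetical learned object model: a label plus the descriptor sets of its training views.
struct ObjectModel {
    std::string label;
    std::vector<std::vector<Descriptor>> views;  // one descriptor set per training image
};

static float squaredDistance(const Descriptor& a, const Descriptor& b)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Count descriptor matches that pass Lowe's nearest/second-nearest ratio test.
static std::size_t countMatches(const std::vector<Descriptor>& input,
                                const std::vector<Descriptor>& view,
                                float ratio = 0.8f)
{
    std::size_t matches = 0;
    for (const Descriptor& q : input) {
        float best = std::numeric_limits<float>::max();
        float second = std::numeric_limits<float>::max();
        for (const Descriptor& t : view) {
            float d = squaredDistance(q, t);
            if (d < best) { second = best; best = d; }
            else if (d < second) { second = d; }
        }
        if (second > 0.0f && std::sqrt(best) < ratio * std::sqrt(second))
            ++matches;
    }
    return matches;
}

// Return the label of the object whose best training view yields the most matches,
// requiring at least three matches (the minimum stated for recognition and pose estimation).
std::string recognizeObject(const std::vector<Descriptor>& input,
                            const std::vector<ObjectModel>& database)
{
    const std::size_t kMinMatches = 3;
    std::string bestLabel = "unknown";
    std::size_t bestCount = 0;
    for (const ObjectModel& model : database)
        for (const std::vector<Descriptor>& view : model.views) {
            std::size_t count = countMatches(input, view);
            if (count > bestCount) { bestCount = count; bestLabel = model.label; }
        }
    return (bestCount >= kMinMatches) ? bestLabel : "unknown";
}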
Textured Bottles
Figures 7.10 to 7.13 contain several object recognition results for the textured bottle test
objects. Figure 7.10 shows the affine invariant nature of SIFT descriptors. The white
bottle is successfully recognized in the input image despite its orientation being inverted. In
Figure 7.11, the input image shows the yellow bottle in the middle of several other objects.
Notice that a textured yellow box and another textured yellow bottle are present in the
scene. The object recognition system is able to identify the yellow bottle correctly, with
an abundance of descriptor matches.
Figure 7.12 contains a difficult object recognition scenario. The green bottle is heavily
occluded by other objects in the scene. Due to its shiny surface, bright white specular
reflections are also present on the bottle. Despite these challenges, the object recognition
system is able to find numerous descriptor matches and thereby correctly identify the
object in the input image as the green bottle. Figure 7.13 contains a similar scenario,
where the recognition system is able to identify the partially occluded brown bottle.
Glass and Transparent Bottles
Figures 7.14 to 7.16 contain several recognition results for the glass bottle and the two
partially transparent bottles. These objects are difficult to recognize due to their unre-
liable surface information. The glass bottle and its shiny label produce many specular
reflections. The glass bottle was also chosen to show that the robot can autonomously
manipulate and model fragile objects with high centers of gravity. The transparent bottles
are prone to specular reflections and also change their appearance depending on the scene
background.
Figure 7.14 shows an object recognition result for the glass bottle partially occluded by
clutter. The object recognition system successfully identifies the glass bottle and returns
five SIFT descriptor matches. However, the descriptor match between the bottle cap in
the input image and the bottle label in the training image is incorrect. This matching
error is probably due to a specular reflection on the bottle cap, which appears similar to
the specular reflection on the shiny label. Note that as noise robust methods such as the
Hough transform are generally used to recover object pose from SIFT descriptor matches,
single matching errors will not adversely affect object localization.
Figure 7.15 shows the half-transparent cola bottle being successfully recognized amongst
background clutter. The bottle’s liquid content has been purposefully removed to produce
a large change in its appearance. Despite emptying the cola bottle, many correct SIFT
descriptor matches are found for the object. The recognition system is able to identify
the correct object without being adversely affected by the change in object appearance.
Finally, the result in Figure 7.16 shows that a mostly transparent bottle can be recognized
using the proposed method. The texture on the bottle label appears to be highly distinctive,
producing many correct matches.
Figure 7.10: Object recognition result – White bottle (match01.png).
Figure 7.11: Object recognition result – Yellow bottle (match02.png).
Figure 7.12: Object recognition result – Green bottle (match02.png).
Figure 7.13: Object recognition result – Brown bottle (match03.png).
Figure 7.14: Object recognition result – Glass bottle (match03.png).
Figure 7.15: Object recognition result – Cola bottle (match03.png).
Figure 7.16: Object recognition result – Transparent bottle (match02.png).

7.5 Discussion and Chapter Summary
This chapter demonstrated that a robot can autonomously learn new objects through the
careful use of object manipulation. The robot was able to roughly determine the height of
a nudged object by performing segmentation in stereo. Most importantly, the autonomous
grasping of objects has clearly demonstrated that a relatively simple manipulation, the
robotic nudge described in the previous chapter, can enable the use of more advanced
manipulations.
The robot autonomously collected training data by rotating a grasped object. Experiments
show that the robot-collected training data is sufficient to produce reusable object models.
By automatically pruning away background descriptors, the robot is able to generate visual
models that describe multiple views of a grasped object. Object recognition experiments
have clearly demonstrated that the object models produced autonomously by the robot
allow robust object recognition. Moreover, the robotic system is able to interact with
a fragile glass bottle and partially transparent objects in order to build object models
autonomously.
However, there are some issues that should be addressed by future work. Firstly, the
proposed learning approach is not designed to deal with asymmetric objects. If a suitable
replacement method can be found to autonomously produce object segmentations, the
grasping and SIFT model building parts of the approach can be applied without significant
change. Secondly, small asymmetric parts of symmetric objects, such as cup handles, may
cause grasping failure. This problem requires additional visual sensing designed to locate
small object asymmetries. Thirdly, the problem of learning duplicate object models can
be addressed by performing object recognition prior to the robotic nudge. If the target
object already exists in the robot’s database, the robot can simply move on to investigate
the next object.
The system in this chapter has integrated fast bilateral symmetry detection, symmetry
triangulation, real time object tracking, autonomous segmentation via the robotic nudge,
autonomous object grasping and SIFT-based object modelling to produce an autonomous
learning system. The general nature of the proposed learning approach should allow the
use of other robotic manipulators to perform the necessary object manipulations. It may
also be possible for a human teacher to actuate the object. Overall, the proposed approach
takes an important step towards greater robot autonomy by shifting the laborious task of
object modelling from the human user to the tireless robot.
At bottom, robotics is about us. It is the discipline of
emulating our lives, of wondering how we work.
Rod Grupen
8
Conclusion and Future Work
8.1 Summary
Biological optimization through evolution by natural selection can achieve stunningly el-
egant and intelligent autonomous systems. For example, the tiny honey bee, possessing
a brain of only a million neurons, is able to visually distinguish between multiple human
faces [Dyer et al., 2005]. Human intelligence combines offline optimization in the
form of evolution over millions of years with online learning through the interpretation of
sensory data into experiences during a person’s life. Our adaptive intelligence is difficult
to mimic with a robotic system as it incorporates functions such as reliable motor control,
robust visual sensing and real time decision making, all of which are intertwined and ever
changing.
Philosophically, human mimicry is an interesting challenge for roboticists. Pragmatically, a
domestic robot performing household tasks will benefit from having human-like sensing and
sensibilities. While tactile sensors with the density of receptors, robustness and flexibility
of human skin are yet unavailable, the same is not true for visual sensors. A study of
human optical signals [Koch et al., 2006] indicates that the human eye produces around
8.75 megabits of data per second. This is a fraction of the data rate of a colour webcam
capturing 640 × 480 pixel images at 25 frames per second. Ignoring the high dynamic
range of the human eye, it appears that current vision sensors are capable of providing
the quantity of data needed for human-like visual processing.
Given the emerging market for domestic robots caused by aging populations in the de-
veloped world, as previously discussed at the beginning of Chapter 1, the stage seems set
for robotic systems that rely on vision to perform household tasks. A statistical survey of
papers in the IROS 2008 conference [Merlet, 2008] provides evidence for this hypothesis.
The survey shows that papers with the keywords computer vision and humanoid robot are
ranked in the top three in terms of total submissions and accepted papers. While popu-
larity does not necessarily indicate quality or usefulness, it does indicate massive research
interest in visual sensing and humanoid robots, with the latter generally equipped with
manipulators capable of object interaction.
The need for adaptive and intelligent object interaction motivates the research presented
in this thesis. The presented research includes a novel bilateral symmetry detector, model-
free object sensing methods, an autonomous object segmentation approach and a robotic
system that performs object learning autonomously. Each piece of research has been
evaluated using real world images or robotic experiments on common household objects.
Recall that the motivating challenges were previously detailed in Section 1.1. The research
presented in this thesis addresses these challenges as follows.
Fast and Robust Detection of Bilateral Symmetry
Bilateral symmetry is a visual property that rarely occurs by chance. Symmetry generally
arises from symmetric objects or a symmetric constellation of object features, both of which
are useful from the standpoint of object interaction. The literature survey of symmetry
detection research provided in Chapter 2 revealed an unexplored niche for fast and robust
detectors, which are sorely needed for real time robotic applications. The novel bilateral
symmetry detection method proposed in Chapter 3 is designed to fill this niche. Overall,
the proposed detection method fully addresses both the speed and robustness aspects of
the motivating challenge.
The proposed fast bilateral symmetry detection method is the fastest detector at the time of
writing. The C++ implementation of the algorithm running on an Intel 1.73GHz Pentium
M laptop requires only 45ms to detect symmetry lines of all orientations in a 640×480 pixel
image. Timing trials conducted against the highly cited generalized symmetry transform
[Reisfeld et al., 1995] show that the proposed detection method is roughly 8000 times
faster. Additionally, narrowing the angle limits in fast symmetry linearly reduces detection
time. These angle limits allow the use of fast symmetry in demanding real time systems,
such as the Kalman filter object tracker detailed in Chapter 5.
Some existing methods, such as the SIFT-based method described in [Loy and Eklundh,
2006], are able to operate on camera images of real world scenes. However, these methods
rely on symmetry in local gradients and pixel values, which break down in the presence of
shadows and specular reflections due to unfavourable lighting. Additionally, pixel values
are inherently unreliable for objects with transparent or reflective surfaces. Fast symme-
try is able to robustly deal with a wide variety of objects under non-uniform illumination
by solely relying on edge pixel locations as input data. Additionally, the use of Canny
edge detection [Canny, 1986] in conjunction with convergent voting using Hough trans-
form [Hough, 1962] provides a high level of robustness against input noise.
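The following C++ sketch illustrates the convergent voting idea in a much simplified form:
each pair of edge pixels votes for the perpendicular bisector of the segment joining them,
parameterized in (θ, r) Hough space, with angle limits restricting which pairs vote. The
exhaustive pairing loop, parameter names and bin layout are illustrative only; the angular
bucketing, vote weighting and peak finding used by the actual detector are omitted.

#include <cmath>
#include <vector>

struct EdgePixel { float x, y; };

// Simplified convergent voting for bilateral symmetry lines. A mirror pair of edge
// pixels is reflected across the perpendicular bisector of the segment joining them,
// so that bisector, written as r = x*cos(theta) + y*sin(theta), receives one vote.
// Peaks in the (theta, r) accumulator are symmetry line candidates.
std::vector<std::vector<int>> voteForSymmetryLines(const std::vector<EdgePixel>& edges,
                                                   int thetaBins, int rBins, float rMax,
                                                   float thetaMin, float thetaMax)
{
    if (thetaBins < 1 || rBins < 1 || !(thetaMax > thetaMin))
        return {};
    std::vector<std::vector<int>> accumulator(thetaBins, std::vector<int>(rBins, 0));
    for (std::size_t i = 0; i < edges.size(); ++i) {
        for (std::size_t j = i + 1; j < edges.size(); ++j) {
            float dx = edges[j].x - edges[i].x;
            float dy = edges[j].y - edges[i].y;
            // The symmetry line is perpendicular to the segment joining the pair,
            // so its normal direction points along (dx, dy).
            float theta = std::atan2(dy, dx);
            if (theta < thetaMin || theta > thetaMax) continue;  // angle limits
            float mx = 0.5f * (edges[i].x + edges[j].x);          // pair midpoint
            float my = 0.5f * (edges[i].y + edges[j].y);
            float r = mx * std::cos(theta) + my * std::sin(theta);
            int tBin = static_cast<int>((theta - thetaMin) / (thetaMax - thetaMin) * (thetaBins - 1));
            int rBin = static_cast<int>((r + rMax) / (2.0f * rMax) * (rBins - 1));
            if (rBin >= 0 && rBin < rBins)
                ++accumulator[tBin][rBin];
        }
    }
    return accumulator;  // peak bins correspond to detected symmetry lines
}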
Development of Model-Free Object Sensing Methods
The speed and robustness of fast symmetry allows it to be applied to a variety of object
sensing problems. Object sensing methods can be formulated in a model-free manner by
using visual symmetry as an object feature. A robot with model-free sensing is able to
function without relying on a priori object models. This allows the robot to operate in
environments where new objects are present with minimal prior training. Moreover, the
large number of symmetric objects in the average household makes bilateral symmetry a
useful object feature. The proposed model-free object sensing methods are able to operate
quickly and robustly on the kind of sensor data encountered by a domestic robot.
A visual sensing toolbox consisting of model-free object sensing methods has been de-
veloped. The static sensing problems of object segmentation and stereo triangulation are
addressed in Chapter 4. The proposed segmentation approach uses dynamic programming
to extract near-symmetric edge contours. This symmetry-guided segmentation method is
fast, requiring an average of 35ms on 640 × 480 test images. Experiments on real world
images show that multiple objects can be segmented but the resulting contours may not
encompass the entire object. This weakness of the segmentation method, caused by the
lack of assumed object models, motivates the robotic nudge approach to segmentation
detailed in Chapter 6.
The proposed stereo triangulation method generates a three dimensional symmetry axis
from a pair of symmetry lines, one from each camera in a stereo pair. Departing from
traditional stereo methods, the proposed approach does not rely on matching local corre-
spondences and does not return 3D information about surfaces. Instead, the triangulated
axis of symmetry represents structural information, similar to a three dimensional medial
axis. Experiments show that symmetry triangulation is able to operate on objects that
confuse traditional stereo methods, such as transparent and reflective objects. Symmetry
triangulation is especially useful for surface of revolution objects as the resulting symme-
try axis is identical to the axis of revolution. This allows accurate localization of common
household objects such as cups and bottles. Additionally, the symmetry axis provides
useful information with regards to an object’s orientation and provides structural cues that
are useful for object manipulation.
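As a geometric illustration of symmetry triangulation, each camera's detected symmetry line
together with that camera's centre of projection spans a plane, and the three dimensional
symmetry axis can be recovered as the intersection of the two planes. The C++ sketch below
performs only this final plane intersection; the construction of the planes from the camera
calibration and the handling of noisy or unmatched symmetry lines are not shown, and the
vector types and function names are illustrative assumptions.

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

static Vec3 cross(const Vec3& a, const Vec3& b)
{
    return { a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0] };
}
static double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }

// A plane n.x = d and a 3D line given by a point and a direction.
struct Plane { Vec3 n; double d; };
struct Line3D { Vec3 point; Vec3 dir; };

// Intersect the two symmetry planes (each spanned by a camera centre and the symmetry
// line detected in that camera's image) to obtain the 3D symmetry axis. The direction is
// the cross product of the plane normals; a point on the axis is found by solving the two
// plane equations together with a third plane through the origin normal to the axis.
Line3D intersectSymmetryPlanes(const Plane& a, const Plane& b)
{
    Line3D axis;
    axis.dir = cross(a.n, b.n);  // parallel planes (degenerate geometry) give a zero vector

    Vec3 c0 = cross(b.n, axis.dir);   // n_b x dir
    Vec3 c1 = cross(axis.dir, a.n);   // dir x n_a
    double det = dot(a.n, cross(b.n, axis.dir));
    if (std::fabs(det) < 1e-12) { axis.point = {0.0, 0.0, 0.0}; return axis; }  // ill-conditioned

    // Three-plane intersection via Cramer's rule (third plane has d = 0).
    axis.point = { (a.d * c0[0] + b.d * c1[0]) / det,
                   (a.d * c0[1] + b.d * c1[1]) / det,
                   (a.d * c0[2] + b.d * c1[2]) / det };
    return axis;
}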
To deal with dynamic objects that move within the robot’s environment, a real time
object tracker is proposed in Chapter 5. The C++ implementation of the tracker is able
to operate at 40 frames per second on 640 × 480 video. The tracker also provides a
symmetry-refined motion segmentation of the tracked object in real time. Prior to this
work, object tracking using bilateral symmetry has not been attempted. With regards to
object tracking in general, the level of robustness against affine transformations, occlusions
and object transparency exhibited by the symmetry tracker also contributes to the state
of the art. A quantitative analysis of symmetry tracking errors indicates that bilateral
symmetry is a robust and accurate object feature. Additionally, a qualitative comparison
between symmetry and colour suggests that the edge-based bilateral symmetry can be
used in a complementary manner with other tracking features.
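To illustrate the predict and update cycle that underlies such a tracker, the following C++
sketch runs a two-state constant-velocity Kalman filter on a single symmetry line parameter.
Treating the line parameters independently, the simple noise model and the omission of angle
wrap-around are simplifications of this sketch, not properties of the actual tracker, which
fuses symmetry and motion measurements.

#include <array>

// A 2-state (value, velocity) constant-velocity Kalman filter for one symmetry line
// parameter (for example the line's r or theta in polar form).
class ScalarConstantVelocityKF {
public:
    ScalarConstantVelocityKF(double q, double r) : q_(q), r_(r) {}

    void predict(double dt)
    {
        // State transition: x += v*dt, v unchanged.
        x_[0] += x_[1] * dt;
        // Covariance: P = F P F^T + Q, with Q modelled crudely as diag(q*dt, q*dt).
        double p00 = P_[0][0] + dt * (P_[1][0] + P_[0][1]) + dt * dt * P_[1][1] + q_ * dt;
        double p01 = P_[0][1] + dt * P_[1][1];
        double p10 = P_[1][0] + dt * P_[1][1];
        double p11 = P_[1][1] + q_ * dt;
        P_[0][0] = p00; P_[0][1] = p01; P_[1][0] = p10; P_[1][1] = p11;
    }

    void update(double z)  // z: detected line parameter from fast symmetry
    {
        double s = P_[0][0] + r_;   // innovation covariance (H = [1 0])
        double k0 = P_[0][0] / s;   // Kalman gain
        double k1 = P_[1][0] / s;
        double innovation = z - x_[0];
        x_[0] += k0 * innovation;
        x_[1] += k1 * innovation;
        // Covariance update: P = (I - K H) P.
        double p00 = (1.0 - k0) * P_[0][0];
        double p01 = (1.0 - k0) * P_[0][1];
        double p10 = P_[1][0] - k1 * P_[0][0];
        double p11 = P_[1][1] - k1 * P_[0][1];
        P_[0][0] = p00; P_[0][1] = p01; P_[1][0] = p10; P_[1][1] = p11;
    }

    double value() const { return x_[0]; }

private:
    std::array<double, 2> x_{0.0, 0.0};                                  // [parameter, rate]
    std::array<std::array<double, 2>, 2> P_{{{1e3, 0.0}, {0.0, 1e3}}};   // state covariance
    double q_;  // process noise intensity
    double r_;  // measurement noise variance
};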
Autonomous Object Segmentation
Chapter 6 proposed an autonomous system that makes use of the object motion induced by
a precise robotic nudge to perform object segmentation. While most active vision systems
actuate an eye-in-hand camera to gain multiple views of a static scene, the proposed robotic
system actuates a target object to obtain segmentations that are true to the physical world.
This active approach allows the segmentation of new objects by using bilateral symmetry
to triangulate and track objects.
The proposed autonomous segmentation approach is partly inspired by [Fitzpatrick, 2003a],
which describes an accidental approach to object discovery and segmentation using a pok-
ing action. The proposed approach differs from previous work by intentionally causing object
motion through planning prior to robotic action. The planning enables the use of a
gentle nudge action that generates predictable object motion, which allows the segmenta-
tion and actuation of fragile and top-heavy objects. This differs from previous work that
requires the use of unbreakable test objects due to the high speed of contact between the
end effector and the test objects.
Fitzpatrick’s approach sometimes produces poor segmentation results that are near-empty
or include the robot’s end effector. In the proposed approach, such results are prevented
by the use of stereo object tracking initiated upon detection of object motion. The pro-
posed motion segmentation method uses video images before and after object actuation to
prevent the inclusion of the robot’s end effector in the result. Additionally, the proposed
motion segmentation approach is fast, requiring only 80ms to perform a 1280 × 960 pixel
motion segmentation. This allows the robot to continue its online functions immediately
after object actuation, removing temporal gaps in sensing caused by offline processing.
Object Learning by Robotic Interaction
The robot described in Chapter 7 performs autonomous object learning by integrating the
research in previous chapters with new visual sensing and object manipulation techniques.
By leveraging the robotic nudge to perform autonomous object segmentation, the robot
collects training data on its own by grasping and rotating a nudged object. Object learning
is performed by building a 360-degree visual model of grasped objects using SIFT [Lowe,
2004]. Experiments show that the challenge of shifting the burden of object modelling
from the human user to the robot has been met for bilaterally symmetric objects.
In addition to meeting the challenge of autonomous object learning, the research also
provides the following contributions. Firstly, the work shows that it is possible to leverage
a simple object manipulation to perform a more advanced manipulation. This is embodied
by the use of the robotic nudge to obtain the necessary information for object grasping,
which is a more complex and difficult manipulation. Secondly, the work shows that a robot
can collect training data of sufficient quality and quantity to perform object modelling.
Specifically, the robot is able to build a panoramic visual model of grasped objects using
SIFT descriptors without having to rely on symmetry or any other trackable feature.
Finally, experiments show that the robot is able to perform robust object recognition using
autonomously learned object models. This finding confirms that the robot is able to learn
useful object models on its own, which is a first step towards intelligent robotic systems
that can adapt to dynamic domestic environments by learning new objects autonomously.
8.2 Future Work
The work covered in this thesis spans the areas of symmetry detection, model-free object
sensing, autonomous object manipulation and object learning through robotic interaction.
Given the wide scope of the thesis, there are many avenues for future research in each of
these areas. The future work covered here is primarily focused on encouraging more
research on robotic systems that learn by acting.
Fast Bilateral Symmetry Detection
The bilateral symmetry detector proposed in Chapter 3 is fast and robust enough to oper-
ate on real world video. However, the old adage of Garbage in, Garbage out still applies.
Without edge pixels that correspond closely to the physical boundaries of objects, sym-
metry detection will invariably fail to find the symmetry lines of objects. In practice, this
problem usually manifests as broken edge contours. It may be possible to use approaches
such as gradient vector flow (GVF) [Xu and Prince, 1998] to fill in empty space between
segments of an edge contour. As GVF has been used successfully to detect symmetry
in contours of dots [Prasad and Yegnanarayana, 2004], it seems likely that a vector field
approach can be used to improve symmetry detection robustness against broken contours.
As vector fields and similar approaches are computationally expensive, the real time per-
formance of detection will probably be reduced by this kind of preprocessing.
Similarly, an excess of edge pixels that do not belong to object contours may lower the
signal-to-noise ratio to the point of detection failure. This is especially problematic when
the scene contains large patches of high frequency texture. Blurring the image prior to edge
detection will help alleviate the problem. However, blurring is a double-edged sword as it
will also prevent the detection of edge pixels at low contrast boundaries. It may be possible
to remedy the situation by using an edge detection approach that performs structure-
preserving noise reduction such as SUSAN [Smith and Brady, 1997]. Additionally, the
edge thinning approach employed in SUSAN [Smith, 1995] will help equalize the number
of votes cast by edge contours of different thicknesses.
Symmetry-Based Model-Free Object Sensing
While the model-free object sensing methods proposed in Chapters 4 and 5 are self-
contained, they also highlight several issues worthy of further investigation. Firstly, the
pendulum experiments in Section 5.4.3 have provided some insight into colour and symmetry
as tracking features. The observation that object blurring negatively affects symmetry
detection suggests the possibility of using colour features to improve symmetry tracking
performance. On the other hand, background distracters similar in hue to the tracking
target are severe problems for colour tracking. It may be possible to apply symmetry
synergetically to improve colour tracking accuracy and reliability.
Secondly, the comparison results in Section 4.3.6 highlight the different natures of tradi-
tional stereo approaches and the proposed symmetry triangulation method. While dense
and sparse stereo methods localize three dimensional points on an object’s surface, the pro-
posed symmetry method returns a symmetry axis that passes through the object. Apart
from the benefit of being able to deal with objects with transparent or reflective surfaces,
fusing the results of symmetry and traditional stereo methods may provide additional ben-
efits. For example, a symmetry axis may provide information with regards to an object’s
orientation that helps improve the search for correspondences by narrowing geometric con-
straints. Similarly, an object’s symmetry axis may benefit higher level model fitting by
constraining the geometry of possible solutions.
Autonomous Object Segmentation using the Robotic Nudge
As discussed in Chapter 6, the use of robotic action to actuate objects for the purposes of
discovery and segmentation is rare amongst existing literature. Hopefully, the proposed
autonomous segmentation method will motivate future work that departs from the de facto
approach of camera actuation to approaches that physically interact with objects. With
regards to individual steps of the proposed segmentation process, many future directions
are available.
Recall that the proposed segmentation process begins with the sensing of interesting loca-
tions by stereo triangulation of symmetry axes over multiple video images. This step relies
on finding the intersection point between a symmetry axis and a table plane to localize
objects. Instead of requiring a priori knowledge of the table plane, a more adaptive ap-
proach would be to estimate the geometry of the table online by fitting horizontal planes
to dense stereo disparity.
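One simple way to realize such an online estimate, sketched below in C++ under the
assumption that the table is horizontal in the robot frame and dominates the reconstructed
points, is to histogram the heights of the dense stereo points and take the dominant bin as
the table height. A robust plane fit such as RANSAC would be a more general alternative;
the data types and parameters here are illustrative.

#include <algorithm>
#include <vector>

struct Point3D { double x, y, z; };  // z is height in the robot frame (assumed)

// Estimate a horizontal table plane from dense stereo points by histogramming point
// heights and returning the centre of the most populated bin as the table height.
double estimateTableHeight(const std::vector<Point3D>& points,
                           double zMin, double zMax, double binSize)
{
    if (points.empty() || binSize <= 0.0 || zMax <= zMin) return 0.0;
    const std::size_t bins = static_cast<std::size_t>((zMax - zMin) / binSize) + 1;
    std::vector<std::size_t> histogram(bins, 0);
    for (const Point3D& p : points) {
        if (p.z < zMin || p.z > zMax) continue;
        ++histogram[static_cast<std::size_t>((p.z - zMin) / binSize)];
    }
    std::size_t peak = std::max_element(histogram.begin(), histogram.end()) - histogram.begin();
    return zMin + (peak + 0.5) * binSize;  // centre of the dominant height bin
}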
As the test scenes used in the autonomous segmentation experiments only contain a sin-
gle symmetric object, multi-object scenarios should be explored further. The proposed
method performs a planned action in the form of the robotic nudge, which allows the
implementation of exploration strategies such as the previously suggested approach of in-
vestigating locations near the camera first. The evaluation of exploration strategies can
be based on their efficiency as measured by the number of nudges required to completely
disambiguate the correspondences between symmetry axes and actual physical objects
in a scene. Additionally, the object nudging strategy can be optimized for maintaining
maximum object spacing and to keep objects within the robot manipulator’s workspace.
Implementing these kinds of exploration strategies will require lower level processes such
as obstacle avoidance and end effector path planning. These low level processes must deal
with the perpetual 2.5D problem of stereo vision where objects are occluded in one camera
view but not the other. Also, in scenes where objects are not spatially sparse enough to
allow robust visual sensing, it may be possible to use the robotic nudge primarily as a tool
to remove occlusions by pushing the occluding object out of the way.
The robotic nudge relies solely on visual feedback to determine whether an object has been
actuated. Force sensors incorporated directly into the robot’s end effector will provide ad-
ditional sensory confirmation upon effector-object contact. This will allow greater control
over object actuation. For example, the robotic nudge can be stopped a short time after
the force sensors indicate that the end effector has made contact with an object instead
of the current fixed-length nudge trajectory. Moreover, a maximum force threshold will
help prevent damage to the robot manipulator by aborting the robotic nudge when the
robot encounters an immovable or heavy object. Additionally, the forces measured during
a nudge can be used as an object feature and as a feedback signal to modulate the strength
of the nudge.
On a global level, the generalization of the segmentation process to asymmetric objects
appears to be a highly useful future direction. Other structural features, such as geometric
primitives extracted from dense stereo disparity, may allow the use of a robotic nudge
approach to motion segmentation. Instead of using the symmetry line to measure object
movement for motion segmentation, other features should be evaluated. For example, sets
of SIFT features may allow the recovery of the object’s pose in the video images taken
before and after the nudge. As the motion segmentation approach will have to be modified
when dealing with asymmetric objects, it may be worth departing from monocular methods
in favour of stereo methods such as stereo optical flow.
Object Learning by Robotic Interaction
Chapter 7 proposed an autonomous object learning system specifically targeted at sym-
metric objects, with experiments carried out on beverage bottles of various shapes and
appearances. The system leverages the robotic nudge to grasp an object and then visually
model the grasped object. Grasping is performed by applying force towards an object’s
symmetry axis in a perpendicular direction. This produces a stable grasp for a surface of
revolution object as its axis of revolution is the same as its symmetry axis. An avenue
for future work is to extend the object grasping step to other symmetric objects such as
boxes, followed by further generalization to handle asymmetric objects. Additionally, as
with the robotic nudge, object grasping will benefit from manipulator path planning that
addresses the problem of obstacle avoidance.
After grasping an object, the robot builds an object model by rotating the grasped object
while taking images at fixed angular intervals. Instead of the current approach where
SIFT descriptors are rejected by examining their frequency of occurrence in an object’s
set of training images, other approaches may provide greater flexibility. For example, a
new training image of the grasped object can be captured when most descriptors from the
previous training image no longer match descriptors in the current video image.
Additionally, work can be done to improve the quality of robot-collected training data by
performing intelligent manipulation of the grasped object. This concept is exemplified by
a paper published during the writing of this thesis [Ude et al., 2008]. The work focused
on the sensory-motor task of maintaining an object at the centre of the camera image
while minimizing change in visual scale during object modelling, which is carried out by
rotating the object in front of a camera.
Overall, as autonomous object learning via robotic interaction is a new area of research,
there are many possibilities for future work. With regards to object interaction, the
problem of nudging and grasping objects that contain liquid, such as a cup of coffee, is
particularly pertinent to domestic robotics. Also, the approach of learning via object
interaction can be generalized to other actors, both robotic and human. For example, a
human teacher can perform a nudge to actuate objects placed in front of a passive vision
system. Extending this concept further, it may be possible to have a robot that learns
about objects by observing the way humans interact with objects, such as during dinner.
This will provide the robot with many opportunities to perform motion segmentation
of moving objects. After sufficient observations, the robot can proceed to interact with
previously segmented objects autonomously.
A
Multimedia DVD Contents
A multimedia DVD containing videos, images and experiment data accompanies this dis-
sertation. The contents of the DVD are related to the research in Chapters 5, 6 and 7.
Detailed discussions of the multimedia content can be found in their respective chapters.
If video playback problems occur, the author recommends the cross platform and open
source video player VLC. A Windows XP installer for VLC is available in the root folder
of the multimedia DVD. Users of non-Windows operating systems can download VLC for
their platform from www.videolan.org/vlc/. The multimedia content provided on the
DVD is as follows.
A.1 Real Time Object Tracking
Videos of the tracking results discussed in Section 5.3 are available from the tracking
folder. The tracking videos are available as WMV and H264 files from their respective
folders. The H264 videos have better image quality than the WMV videos but require
more processing power to decode.
Note that the tracking videos are also available from the following website:
• www.ecse.monash.edu.au/centres/irrc/li_iro2006.php
Additionally, all four sets of 1000 pendulum test video images are available from the
pendulum folder of the multimedia DVD. Consult the readme.txt text file in the folder
for additional information.
A.2 Autonomous Object Segmentation
The autonomous object segmentation results discussed in Section 6.5 are located in the
nudge folder. The object segmentation results are tabulated in index.html. Each row of
the table contains an image of the segmentation result alongside corresponding videos of
stereo symmetry tracking and the robotic nudge as filmed from an external video camera.
Note that the full resolution images are displayed by clicking on an image.
The segmentation results and videos are also available online:
• www.ecse.monash.edu.au/centres/irrc/li_iro08.php
A.3 Object Learning by Interaction
The experimental results of the research presented in Chapter 7 are available from the
learning folder of the multimedia DVD. Object interaction videos showing the robotic
nudge followed by a grasp and rotate of the nudged object are located in the
learning/grasp folder. These videos are named after the object captions in Figure 7.9. The
long pause between the nudge and the grasp in these videos is due to the saving of exper-
iment data to document the learning process. This includes the writing of 200 640 × 480
tracking images to disc, which takes a considerable amount of time. The data collected by
the robot during the learning experiments, including the training images, are available
from the learning/data folder. The readme.txt text file in the folder provides further
details about the experiment data.
The object recognition results are available from the learning/sift folder. The results are
organized into folders according to the test object in the input image. The input images
are named inputXX.png where XX is a number from 00 to 03. The training image that
produced the best matching SIFT descriptor set is named databaseXX.png, where XX
corresponds to the input image number. The object recognition results are visualized in
the images named matchXX.png. Again, the XX at the end of the image file name is the
same as the corresponding input image number. These images are a vertical concatenation
of the input image and the best matching training image. The input image is shown above
the matching training image. SIFT descriptor matches are linked using red lines in the
image. The label of the object identified by the robot’s object recognition system is shown
as green text at the bottom of the image.
Two videos explaining the autonomous grasping of nudged objects are available in the
learning/explain folder of the multimedia DVD. Both videos include audio commentary
by the thesis author. In combination, these two explanation videos demonstrate that
the robot can autonomously grasp symmetric objects of different heights. The blue cup
video contains a demonstration where a blue cup is nudged and then grasped autonomously
by the robot. The video also shows the graphical user interface of the robotic system,
which provides real time visualization of the planned nudge, object segmentation and
other pertinent information. The white bottle video shows the robot nudging, grasping
and rotating a white bottle. The saving of experiment data such as tracking images has
been reduced in this video, which allows the robot to perform the learning process with
minimal delay. The object segmentation obtained using the robotic nudge is shown in real
time during this video.
B
Building a New Controller for the PUMA 260
B.1 Introduction
The author designed and implemented a stand-alone motion controller for the PUMA 260
robot manipulator during the course of thesis research. The new controller is a complete
replacement for the default Unimate controller. The controller drives the servo motors,
controls the magnetic brake, reads the encoders and also interfaces with external hardware
such as the two-fingered gripper used for object grasping. The new controller differs from
the PUMA 560 controller detailed in [Moreira et al., 1996], which only attempts to replace
the control logic while retaining the analog servo and amplifier modules of the Unimate
controller.
The new controller was motivated by two factors. Firstly, the aging Unimate controller
in the author’s laboratory is prone to overheating, which can lead to unreliable operation
during robotic experiments, especially in hot weather. Secondly, the Unimate controller
allows real time PC control of the PUMA arm but demands the constant sending of
encoder positions via a serial port. This removes vital CPU cycles from visual processing,
especially when context switching between the arm and vision threads is considered. The
new controller allows the specification of long motion trajectories that are carried out on
a PCI motion control card so that no CPU cycles are taken away from time-critical tasks
such as real time object tracking during a robotic nudge.
B.2 System Overview
Figure B.1 shows the software and hardware components of the robot used for the exper-
iments in Chapters 6 and 7. Components of the new controller are shown in blue. The
two kinematics modules in bold are described in Section B.3.
Figure B.1: Overview of new robot arm controller. [Block diagram: the visual processing
modules (fast symmetry detection, object segmentation, real time object tracking and
stereo triangulation) and the direct and inverse kinematics modules on the desktop PC
send robotic actions such as the nudge and grasp to the Galil PCI servo controller, which
drives the PUMA 260 manipulator and Ottobock gripper through the servo amplifier pairs
and interconnect modules using PWM signals; encoder counts and stereo camera video
frames are fed back to the PC.]

Advance Motion Controls (AMC) Z6A6DDC PWM servo amplifiers are used to drive the
40V servo motors in the PUMA 260 robot arm. Six amplifiers are mounted in pairs on
AMC MC2XZQD 2-axis interface boards, shown one above the other on the center-right
of Figure B.2. Each green amplifier interface board is roughly the size of two credit cards.
The two long black modules on the left of Figure B.2 are interconnects that interface the
servo amplifiers, arm encoder signals, magnetic brake and robot gripper with the PCI servo
controller card. The amplifiers and brake are powered using a 40V 12A power supply.
Figure B.2: New stand-alone controller for the PUMA 260.
A Galil six-axis DMC1860 servo motion control card acts as the brains of the controller.
The robot’s software interfaces with the card via a C API by giving commands written in
Galil’s proprietary motion code. The PCI card provides a hardware PID control system
for each axis, allowing individual and collective motion control for all six robot arm axes.
Single motor actions such as moving the robot arm out of its cradle position are given using
the position absolute (PA) command, which drives motors using a trapezoidal velocity
profile to a desired location. More complex movements such as the robotic nudge are
performed by generating a list of waypoints that specify the desired motion trajectory,
storing this list on the controller’s onboard memory and then issuing a linear interpolation
(LM) command, which tells the controller to drive the robot arm through the waypoints.
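The fragment below sketches how such commands might be composed and sent from the robot
software. The sendCommand() wrapper standing in for the Galil C API is hypothetical, and the
auxiliary mnemonics used to begin motion and to load the waypoint list (BG, LI, LE, BGS) are
assumptions of this sketch; only the PA and LM commands are taken from the description above.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the Galil C API call that sends one line of proprietary
// motion code to the DMC1860 card. Here it simply prints the command.
void sendCommand(const std::string& cmd)
{
    std::cout << cmd << '\n';
}

struct Waypoint { long a, b, c, d, e, f; };  // encoder-count increments for the six axes

// Drive a single axis to an absolute encoder position with a trapezoidal velocity
// profile using the position absolute (PA) command.
void moveAxisAbsolute(char axis, long encoderCounts)
{
    std::ostringstream cmd;
    cmd << "PA" << axis << "=" << encoderCounts;
    sendCommand(cmd.str());
    sendCommand(std::string("BG") + axis);  // begin motion (mnemonic assumed)
}

// Load a waypoint list into the controller and execute it in linear interpolation (LM)
// mode, so that the trajectory runs on the card without consuming host CPU cycles.
void runTrajectory(const std::vector<Waypoint>& waypoints)
{
    sendCommand("LM ABCDEF");  // enter linear interpolation mode for all six axes
    for (const Waypoint& w : waypoints) {
        std::ostringstream seg;
        seg << "LI " << w.a << ',' << w.b << ',' << w.c << ','
            << w.d << ',' << w.e << ',' << w.f;
        sendCommand(seg.str());           // one interpolation segment (mnemonic assumed)
    }
    sendCommand("LE");   // end of segment list (assumed)
    sendCommand("BGS");  // begin the stored interpolation sequence (assumed)
}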
B.3 Kinematics
Software written by the author in C++ is used to perform direct and inverse kinematics.
Direct kinematics is used to determine the position and orientation, also called the pose, of
the wrist based on the joint angles of the robot manipulator. Conversely, inverse kinematics
is used to generate the joint angles necessary to achieve a target wrist pose. An additional
homogenous transformation between the wrist and the end effector is used to achieve
a target end effector pose for tasks such as the robotic nudge or object grasping. As
the robotic experiments presented in this thesis do not require high manipulator speed,
velocity control using Jacobians and the modelling of rigid body dynamics are left for
future work.
A gentle introduction to robot manipulator kinematics is available from Chapter 4 of
[Craig, 2005]. Chapter 2 of [Tsai, 1999] details the mathematics of serial manipulator
kinematics that includes the popular Denavit-Hartenberg method. Inverse kinematics
and motion control for manipulators with redundant degrees of freedom is surveyed in
[Siciliano, 1990]. The kinematics described here are developed by modifying the link and
joint parameters of a PUMA 560 solution [Paul and Zhang, 1986] to adapt the solution
to the PUMA 260. Simulations were carried out using Peter Corke’s Robotics Toolbox
[Corke, 1996] prior to implementing the kinematics as C++ software on the robot.
B.3.1 PUMA 260 Physical Parameters
Kinematic calculations require the physical parameters of the robot manipulator. While
these parameters are readily available for the PUMA 560 from publications such as [Lee
and Ziegler, 1984] and [Lee, 1982], published records for the PUMA 260 are quite rare. The
distances between joints are provided in a book about real time vision and manipulator
control [Andersson, 1988], where a PUMA 260 is adapted to play ping pong. Thanks
to Giorgio Metta of LIRA-Lab at Genova University, Italy, the author was also directed
towards a real time control library from McGill University [Lloyd, 2002]. The physical
parameters of the PUMA 260 used in the kinematic calculations to follow are taken from
the latter source.
The distance between the shoulder joint and the elbow joint is 0.2032 metres. The ma-
nipulator’s forearm also has the same length. The distance offset between the shoulder
joint and the wrist axis along the forearm is 0.12624 metres. In Denavit-Hartenberg (DH)
notation, with the same link numbering convention used in Table 1 of [Paul and Zhang,
1986], the PUMA 260 link and joint parameters are shown in Table B.1. Notice that unlike
the PUMA 560 DH table, the PUMA 260 has a zero d value for link 3.
Table B.1: PUMA 260 link and joint parameters

Link number   α (degrees)   a (metres)   d (metres)
1                  90          0            0
2                   0          0.2032       0
3                 -90          0            0.12624
4                  90          0            0.2032
5                 -90          0            0
6                   0          0            0
B.3.2 Direct Kinematics
Also known as forward kinematics, direct kinematics is used to determine the position and
orientation of the manipulator wrist from known joint angles. As mentioned earlier, the
direct kinematics solution is based on the PUMA 560 kinematics described in [Paul and
Zhang, 1986]. PUMA 560 direct kinematics is also briefly discussed in a tutorial format
on pages 78 to 83 of [Craig, 2005].
Direct kinematics performs 3D homogeneous transformations from one joint to the next,
moving from the shoulder to the wrist. The matrix T_i represents the transformation
across a single link. The α_i, a_i and d_i values are the link and joint parameters for the
ith link in Table B.1. θ_i is the joint angle for link i, which is the input data to direct
kinematics.

T_i = \begin{bmatrix}
\cos\theta_i & -\sin\theta_i \cos\alpha_i & \sin\theta_i \sin\alpha_i & a_i \cos\theta_i \\
\sin\theta_i & \cos\theta_i \cos\alpha_i & -\cos\theta_i \sin\alpha_i & a_i \sin\theta_i \\
0 & \sin\alpha_i & \cos\alpha_i & d_i \\
0 & 0 & 0 & 1
\end{bmatrix} \quad (B.1)
The T_i matrix is calculated for all six links of the PUMA 260. After this, the position and
orientation of the manipulator wrist can be calculated using the transformation matrix T_w,
which gives the transformation from the manipulator shoulder joint to the wrist. T_w is
the output of direct kinematics.

T_w = T_1 T_2 T_3 T_4 T_5 T_6 \quad (B.2)
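A compact C++ transcription of Equations B.1 and B.2 with the Table B.1 parameters is
given below as an illustration. The matrix type, helper functions and the conversion of α
to radians are choices of this sketch rather than details of the thesis software.

#include <array>
#include <cmath>

using Mat4 = std::array<std::array<double, 4>, 4>;

const double kPi = 3.14159265358979323846;

// PUMA 260 link and joint parameters from Table B.1 (alpha in radians, a and d in metres).
struct LinkParameters { double alpha, a, d; };
static const std::array<LinkParameters, 6> kPuma260 = {{
    {  kPi / 2, 0.0,    0.0     },
    {  0.0,     0.2032, 0.0     },
    { -kPi / 2, 0.0,    0.12624 },
    {  kPi / 2, 0.0,    0.2032  },
    { -kPi / 2, 0.0,    0.0     },
    {  0.0,     0.0,    0.0     }
}};

// Single-link transformation T_i of Equation B.1.
static Mat4 linkTransform(const LinkParameters& p, double theta)
{
    double ct = std::cos(theta), st = std::sin(theta);
    double ca = std::cos(p.alpha), sa = std::sin(p.alpha);
    return {{{ ct, -st * ca,  st * sa, p.a * ct },
             { st,  ct * ca, -ct * sa, p.a * st },
             { 0.0,      sa,       ca,      p.d },
             { 0.0,     0.0,      0.0,      1.0 }}};
}

static Mat4 multiply(const Mat4& a, const Mat4& b)
{
    Mat4 c{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Direct kinematics (Equation B.2): chain the six link transforms to obtain the
// shoulder-to-wrist transformation T_w from the six joint angles (in radians).
Mat4 directKinematics(const std::array<double, 6>& jointAngles)
{
    Mat4 tw = linkTransform(kPuma260[0], jointAngles[0]);
    for (int i = 1; i < 6; ++i)
        tw = multiply(tw, linkTransform(kPuma260[i], jointAngles[i]));
    return tw;
}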
B.3.3 Inverse Kinematics
Inverse kinematics is the problem of calculating the robotic manipulator joint angles nec-
essary to get the wrist to a target position and orientation, also known as the wrist pose.
As opposed to the relatively straightforward mathematics of direct kinematics, inverse
kinematics can be difficult for manipulators with high degrees of freedom such as the
PUMA 260. Firstly, the existence of solutions must be tested mathematically to ensure
that the required wrist pose is reachable within the manipulator’s workspace. Secondly,
due to the redundancies in joint actuation provided by the six-jointed PUMA 260, the
problems of multiple correct solutions and singularities when moving between solutions
must be addressed.
The spherical wrist of the PUMA 260 helps simplify the inverse kinematics calculations,
as the first three joints control the position of the wrist while the wrist joints control the
orientation. The input to inverse kinematics is the desired 3D homogeneous transformation
from the base shoulder joint of the manipulator to its wrist. Defining the desired
transformation matrix as T_d, elements within the matrix are labelled as follows.

T_d = \begin{bmatrix}
t_{11} & t_{21} & t_{31} & t_{41} \\
t_{12} & t_{22} & t_{32} & t_{42} \\
t_{13} & t_{23} & t_{33} & t_{43} \\
t_{14} & t_{24} & t_{34} & t_{44}
\end{bmatrix} \quad (B.3)
The desired (P_x, P_y, P_z) position of the manipulator wrist is represented by t_{41}, t_{42}
and t_{43} respectively. Two three-element orientation vectors are defined as
O = [t_{21} t_{22} t_{23}]^T and A = [t_{31} t_{32} t_{33}]^T. The elements of O from top to
bottom are named O_x, O_y and O_z. The same naming convention is used for the elements
of A.

The following inverse kinematics solution puts the manipulator in a right handed, elbow up
and wrist no-flip configuration as described in [Paul and Zhang, 1986]. A similar naming
convention to the solution of Paul and Zhang is used below. Again, α_i, a_i and d_i are the
physical parameters of the ith link taken from Table B.1.
Joint 1 Angle (θ_1)

r = \sqrt{P_x^2 + P_y^2}

\theta_1 = \tan^{-1}\left(\frac{P_y}{P_x}\right) + \sin^{-1}\left(\frac{d_3}{r}\right) \quad (B.4)
Joint 2 Angle (θ_2)

V_{114} = P_x \cos\theta_1 + P_y \sin\theta_1

\psi = \cos^{-1}\left(\frac{a_2^2 - d_4^2 + V_{114}^2 + P_z^2}{2 a_2 \sqrt{V_{114}^2 + P_z^2}}\right)

\theta_2 = \tan^{-1}\left(\frac{P_z}{V_{114}}\right) + \psi \quad (B.5)
Joint 3 Angle (θ_3)

\theta_3 = \tan^{-1}\left(\frac{V_{114} \cos\theta_2 + P_z \sin\theta_2 - a_2}{P_z \cos\theta_2 - V_{114} \sin\theta_2}\right) \quad (B.6)
Joint 4 Angle (θ_4)

V_{113} = A_x \cos\theta_1 + A_y \sin\theta_1

V_{323} = A_y \cos\theta_1 - A_x \sin\theta_1

V_{313} = V_{113} \cos(\theta_2 + \theta_3) + A_z \sin(\theta_2 + \theta_3)

\theta_4 = \tan^{-1}\left(\frac{V_{323}}{V_{313}}\right) \quad (B.7)
Joint 5 Angle (θ_5)

\theta_5 = \tan^{-1}\left(\frac{V_{313} \cos\theta_4 + V_{323} \sin\theta_4}{V_{113} \sin(\theta_2 + \theta_3) - A_z \cos(\theta_2 + \theta_3)}\right) \quad (B.8)
Joint 6 Angle (θ_6)

V_{112} = O_x \cos\theta_1 + O_y \sin\theta_1

V_{132} = O_x \sin\theta_1 - O_y \cos\theta_1

V_{312} = V_{112} \cos(\theta_2 + \theta_3) + O_z \sin(\theta_2 + \theta_3)

V_{332} = -V_{112} \sin(\theta_2 + \theta_3) + O_z \cos(\theta_2 + \theta_3)

V_{412} = V_{312} \cos\theta_4 - V_{132} \sin\theta_4

V_{432} = V_{312} \sin\theta_4 + V_{132} \cos\theta_4

\theta_6 = \tan^{-1}\left(\frac{V_{412} \cos\theta_5 + V_{332} \sin\theta_5}{V_{432}}\right) \quad (B.9)
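The closed-form solution above translates directly into code. The C++ sketch below
transcribes Equations B.4 to B.9, using atan2 to keep the correct quadrant; the WristPose
structure is an illustrative container for (P_x, P_y, P_z), O and A, and no reachability or
singularity checks are performed.

#include <array>
#include <cmath>

// Wrist pose input to inverse kinematics: position (Px, Py, Pz) and the two
// orientation vectors O and A extracted from the desired transformation T_d.
struct WristPose {
    double px, py, pz;
    std::array<double, 3> o;  // (Ox, Oy, Oz)
    std::array<double, 3> a;  // (Ax, Ay, Az)
};

// Inverse kinematics of Equations B.4 to B.9 for the right handed, elbow up,
// wrist no-flip configuration. a2, d3 and d4 follow Table B.1 (metres).
std::array<double, 6> inverseKinematics(const WristPose& w)
{
    const double a2 = 0.2032, d3 = 0.12624, d4 = 0.2032;

    // Joint 1 (Equation B.4).
    double r = std::sqrt(w.px * w.px + w.py * w.py);
    double theta1 = std::atan2(w.py, w.px) + std::asin(d3 / r);

    // Joint 2 (Equation B.5).
    double v114 = w.px * std::cos(theta1) + w.py * std::sin(theta1);
    double radius2 = v114 * v114 + w.pz * w.pz;
    double psi = std::acos((a2 * a2 - d4 * d4 + radius2) / (2.0 * a2 * std::sqrt(radius2)));
    double theta2 = std::atan2(w.pz, v114) + psi;

    // Joint 3 (Equation B.6).
    double theta3 = std::atan2(v114 * std::cos(theta2) + w.pz * std::sin(theta2) - a2,
                               w.pz * std::cos(theta2) - v114 * std::sin(theta2));

    // Joint 4 (Equation B.7).
    double s23 = std::sin(theta2 + theta3), c23 = std::cos(theta2 + theta3);
    double v113 = w.a[0] * std::cos(theta1) + w.a[1] * std::sin(theta1);
    double v323 = w.a[1] * std::cos(theta1) - w.a[0] * std::sin(theta1);
    double v313 = v113 * c23 + w.a[2] * s23;
    double theta4 = std::atan2(v323, v313);

    // Joint 5 (Equation B.8).
    double theta5 = std::atan2(v313 * std::cos(theta4) + v323 * std::sin(theta4),
                               v113 * s23 - w.a[2] * c23);

    // Joint 6 (Equation B.9).
    double v112 = w.o[0] * std::cos(theta1) + w.o[1] * std::sin(theta1);
    double v132 = w.o[0] * std::sin(theta1) - w.o[1] * std::cos(theta1);
    double v312 = v112 * c23 + w.o[2] * s23;
    double v332 = -v112 * s23 + w.o[2] * c23;
    double v412 = v312 * std::cos(theta4) - v132 * std::sin(theta4);
    double v432 = v312 * std::sin(theta4) + v132 * std::cos(theta4);
    double theta6 = std::atan2(v412 * std::cos(theta5) + v332 * std::sin(theta5), v432);

    return {{ theta1, theta2, theta3, theta4, theta5, theta6 }};
}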
References
[Andersson, 1988] Russell L. Andersson. A Robot Ping-Pong Player: Experiment in Real-
Time Intelligent Control. The MIT Press, 1988.
[Atallah, 1985] M.J. Atallah. On symmetry detection. IEEE Transactions on Computers,
34:663–666, 1985.
[Ballard, 1981] Dana H. Ballard. Generalizing the hough transform to detect arbitrary
shapes. Pattern Recognition, 13(2):111–122, 1981.
[Bar-Shalom et al., 2002] Yaakov Bar-Shalom, Thiagalingam Kirubarajan, and X.-Rong
Li. Estimation with Applications to Tracking and Navigation. John Wiley & Sons, Inc.,
2002.
[Barnes and Zelinsky, 2004] N. Barnes and A. Zelinsky. Fast radial symmetry speed sign
detection and classification. In Proceedings of the IEEE Intelligent Vehicles Symposium,
pages 566–571, Parma, Italy, June 2004.
[Blum and Nagel, 1978] Harry Blum and Roger N. Nagel. Shape description using
weighted symmetric axis features. Pattern Recognition, 10:167–180, 1978.
[Blum, 1964] Harry Blum. A transformation for extracting descriptors of shape. In Pro-
ceedings of Meeting Held. MIT Press, November 1964.
[Blum, 1967] Harry Blum. A transformation for extracting descriptors of shape. In Pro-
ceedings of Symposium on Models for the Perception of Speech and Visual Form, pages
153–171, Cambridge, MA, USA, November 1967.
[Borgerfors, 1986] Gunilla Borgefors. Distance transformations in digital images. Com-
puter Vision, Graphics, and Image Processing, 34:344–371, 1986.
[Borras, 2006] Ricard Borras. Opencv cvblobslib binary blob extraction library. Online,
November 2006.
URL: http://opencvlibrary.sourceforge.net/cvBlobsLib.
[Bouguet, 2006] Jean-Yves Bouguet. Camera calibration toolbox for matlab. Online, July
2006.
URL: http://www.vision.caltech.edu/bouguetj/calibdoc/.
[Bradski, 1998] Gary R. Bradski. Computer vision face tracking for use in a perceptual
user interface. Intel Technology Journal, Q2:214–219, 1998.
[Brady and Asada, 1984] Michael Brady and Haruo Asada. Smoothed local symmetries
and their implementation. Technical report, MIT, Cambridge, MA, USA, 1984.
[Brooks, 1981] Rodney A. Brooks. Symbolic reasoning among 3-d models and 2-d images.
Artificial Intelligence, 17(1-3):285–348, 1981.
[Brown et al., 2003] M.Z. Brown, D. Burschka, and G.D. Hager. Advances in compu-
tational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence,
25(8):993–1008, 2003.
[Buss et al., 2008] Martin Buss, Henrik Christensen, and Yoshihiko Nakamura, editors.
IROS 2008 Workshop on Robot Services in Aging Society, Nice, France, September
2008.
[Canny, 1986] J. Canny. A computational approach to edge detection. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.
[Cham and Cipolla, 1994] Tat-Jen Cham and Roberto Cipolla. Skewed symmetry de-
tection through local skewed symmetries. In Proceedings of the conference on British
machine vision (BMVC), volume 2, pages 549–558, Surrey, UK, 1994. BMVA Press.
[Christensen, 2008] Henrik I. Christensen. Robotics as an enabler for aging in place. In
Robot Services in Aging Society IROS 2008 Workshop, Nice, France, September 2008.
[Cole and Yap, 1987] Richard Cole and Chee-Keng Yap. Shape from probing. Journal of
Algorithms, 8(1):19–38, 1987.
[Corke, 1996] Peter I. Corke. A robotics toolbox for matlab. IEEE Robotics and Automa-
tion Magazine, 3(1):24–32, March 1996.
URL: http://petercorke.com/Robotics%20Toolbox.html.
[Cornelius and Loy, 2006] Hugo Cornelius and Gareth Loy. Detecting bilateral symmetry
in perspective. In Proceedings of Conference on Computer Vision and Pattern Recogni-
tion Workshop, page 191, Los Alamitos, CA, USA, 2006. IEEE Computer Society.
[Craig, 2005] John J. Craig. Introduction to Robotics: Mechanics and Control. Pearson
Prentice Hall, 2005.
[Duda and Hart, 1972] Richard O. Duda and Peter E. Hart. Use of the hough transforma-
tion to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15,
1972.
[Dyer et al., 2005] Adrian G. Dyer, Christa Neumeyer, and Lars Chittka. Honeybee (apis
mellifera) vision can discriminate between and recognise images of human faces. Journal
of Experimental Biology, 208:4709–4714, 2005.
[Fitzpatrick and Metta, 2003] Paul Fitzpatrick and Giorgio Metta. Grounding vision
through experimental manipulation. In Philosophical Transactions of the Royal So-
ciety: Mathematical, Physical, and Engineering Sciences, pages 2165–2185, 2003.
[Fitzpatrick, 2003a] Paul Fitzpatrick. First contact: an active vision approach to seg-
mentation. In Proceedings of Intelligent Robots and Systems (IROS), volume 3, pages
2161–2166, Las Vegas, Nevada, October 2003. IEEE.
[Fitzpatrick, 2003b] Paul Fitzpatrick. From First Contact to Close Encounters: A Devel-
opmentally Deep Perceptual System for a Humanoid Robot. PhD thesis, Massachusetts
Institute of Technology, 2003.
[Forsyth and Ponce, 2003] David A. Forsyth and Jean Ponce. Computer Vision - A Mod-
ern Approach. Alan Apt, 2003.
[Gupta et al., 2005] Abhinav Gupta, V. Shiv Naga Prasad, and Larry S. Davis. Extracting
regions of symmetry. In IEEE International Conference on Image Processing (ICIP),
volume 3, pages 133–136, Genova, Italy, September 2005.
[Harris and Stephens, 1988] C. Harris and M. Stephens. A combined corner and edge de-
tector. In Proceedings of The Fourth Alvey Vision Conference, pages 147–151, Manch-
ester, UK, September 1988.
[Hough, 1962] P.V.C. Hough. Method and means for recognizing complex patterns. United
States Patent, December 1962. 3,069,654.
[Huang et al., 2002] Yu Huang, Thomas S. Huang, and Heinrich Niemann. A region-
based method for model-free object tracking. In International Conference on Pattern
Recognition, pages 592–595, Quebec, Canada, August 2002.
[Huebner, 2007] Kai Huebner. Object description and decomposition by symmetry hier-
archies. In International Conferences in Central Europe on Computer Graphics, Visu-
alization and Computer Vision, Bory, Czech Republic, January 2007.
[Illingworth and Kittler, 1988] J. Illingworth and J. Kittler. A survey of the hough trans-
form. Computer Vision, Graphics, and Image Processing, 44(1):87–116, 1988.
[Intel, 2006] Intel. Opencv: Open source computer vision library. Online, November 2006.
URL: http://www.intel.com/technology/computing/opencv/.
[Jia and Erdmann, 1998] Yan-Bin Jia and Michael Erdmann. Observing pose and motion
through contact. In Proceedings of IEEE International Conferene on Robotics and
Automation, volume 1, pages 723–729, Leuven, Belgium, May 1998.
[K. S. Arun and Blostein, 1987] K. S. Arun, T. S. Huang, and S. D. Blostein. Least-squares
fitting of two 3-d point sets. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 9:698–700, 1987.
[Kiryati and Gofman, 1998] Nahum Kiryati and Yossi Gofman. Detecting symmetry in
grey level images: The global optimization approach. International Journal of Computer
Vision, 29(1):29–45, 1998.
[Kleeman, 1996] Lindsay Kleeman. Understanding and applying kalman filtering. In Pro-
ceedings of the Second Workshop on Perceptive Systems, Curtin University of Technology,
Perth, Western Australia, January 1996.
[Koch et al., 2006] Kristin Koch, Judith McLean, Ronen Segev, Michael A. Freed,
Michael J. Barry, Vijay Balasubramanian, and Peter Sterling. How much the eye tells
the brain. Current Biology, 16:1428–1434, 2006.
[Kovesi, 1997] Peter Kovesi. Symmetry and asymmetry from local phase. In Tenth Aus-
tralian Joint Conference on Artificial Intelligence, pages 185–190, Perth, Australia,
December 1997.
[Laurie J. Heyer and Yooseph, 1999] Laurie J. Heyer, Semyon Kruglyak, and Shibu
Yooseph. Exploring expression data: Identification and analysis of coexpressed genes.
Genome Research, 9:1106–1115, 1999.
[Lee and Ziegler, 1984] C.S.G. Lee and M. Ziegler. Geometric approach in solving inverse
kinematics of puma robots. IEEE Transactions on Aerospace and Electronic Systems,
6:695–706, 1984.
[Lee et al., 2001] Bin Lee, Jia-Yong Yan, and Tian-Ge Zhuang. A dynamic programming
based algorithm for optimal edge detection in medical images. In Proceedings of the
International Workshop on Medical Imaging and Augmented Reality, pages 193–198,
Hong Kong, China, June 2001.
[Lee, 1982] C S G Lee. Robot arm kinematics, dynamics, and control. Computer,
15(12):62–80, 1982.
[Lei and Wong, 1999] Y. Lei and K.C. Wong. Detection and localisation of reflectional and
rotational symmetry under weak perspective projection. Pattern Recognition, 32(2):167–
180, February 1999.
[Levitt, 1984] Tod S. Levitt. Domain independent object description and decomposition.
In National Conference on Artificial Intelligence (AAAI), pages 207–211, Austin, Texas,
USA, August 1984.
[Li and Kleeman, 2006a] Wai Ho Li and Lindsay Kleeman. Fast stereo triangulation using
symmetry. In Australasian Conference on Robotics and Automation, Auckland, New
Zealand, December 2006. Online.
URL: http://www.araa.asn.au/acra/acra2006/.
[Li and Kleeman, 2006b] Wai Ho Li and Lindsay Kleeman. Real time object tracking using
reflectional symmetry and motion. In IEEE/RSJ Conference on Intelligent Robots and
Systems, pages 2798–2803, Beijing, China, October 2006.
[Li and Kleeman, 2008] Wai Ho Li and Lindsay Kleeman. Autonomous segmentation of
near-symmetric objects through vision and robotic nudging. In 2008 IEEE/RSJ Inter-
national Conference on Intelligent Robots and Systems, pages 3604–3609, Nice, France,
September 2008.
[Li et al., 2005] Wai Ho Li, Alan M. Zhang, and Lindsay Kleeman. Fast global reflectional
symmetry detection for robotic grasping and visual tracking. In Claude Sammut, editor,
Australasian Conference on Robotics and Automation, Sydney, December 2005. Online.
URL: http://www.cse.unsw.edu.au/~acra2005/.
[Li et al., 2006] Wai Ho Li, Alan M. Zhang, and Lindsay Kleeman. Real time detection
and segmentation of reflectionally symmetric objects in digital images. In IEEE/RSJ
Conference on Intelligent Robots and Systems, pages 4867–4873, Beijing, China, October
2006.
[Li et al., 2008] Wai Ho Li, Alan M. Zhang, and Lindsay Kleeman. Bilateral symmetry
detection for real-time robotics applications. International Journal of Robotics Research,
27(7):785–814, July 2008.
[Lloyd, 2002] John Lloyd. Robot control c library. Online, 2002.
URL: http://www.cs.ubc.ca/~lloyd/rccl.html.
[Lowe, 2004] David G. Lowe. Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision, 60(2):91–110, November 2004.
[Loy and Eklundh, 2006] Gareth Loy and Jan-Olof Eklundh. Detecting symmetry and
symmetric constellations of features. In Proceedings of European Conference on Com-
puter Vision (ECCV), Graz, Austria, May 2006.
[Loy and Zelinsky, 2003] Gareth Loy and Alexander Zelinsky. Fast radial symmetry for
detecting points of interest. IEEE Transactions on Pattern Analysis and Machine In-
telligence, 25(8):959–973, 2003.
[Loy, 2003] Gareth Loy. Computer Vision to See People: a basis for enhanced human
computer interaction. PhD thesis, Australian National University, January 2003.
[Lucas and Kanade, 1981] Bruce D. Lucas and Takeo Kanade. An iterative image regis-
tration technique with an application to stereo vision. In International Joint Conference
on Artificial Intelligence, pages 674–679, Vancouver, British Columbia, Canada, April
1981.
[Makram-Ebeid, 2000] Sherif Makram-Ebeid. Digital image processing method for auto-
matic extraction of strip-shaped objects. United States Patent 6134353, October 2000.
URL: http://www.freepatentsonline.com/6134353.html.
[Marola, 1989a] G. Marola. On the detection of the axes of symmetry of symmetric and
almost symmetric planar images. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 11(1):104–108, 1989.
[Marola, 1989b] Giovanni Marola. Using symmetry for detecting and locating objects in
a picture. Computer Vision, Graphics and Image Processing, 46(2):179–195, 1989.
[Matas et al., 2002] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline
stereo from maximally stable extremal regions. In Proceedings of the British Machine
Vision Conference (BMVC), pages 384–393, Tübingen, Germany, November 2002.
[Merlet, 2008] J.P. Merlet. Statistics for IROS 2008 and proposals for improving the review
process. Online PDF, August 2008.
[Mikolajczyk and Schmid, 2005] Krystian Mikolajczyk and Cordelia Schmid. A perfor-
mance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 27(10):1615–1630, October 2005.
[Mikolajczyk et al., 2005] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman,
J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region
detectors. International Journal of Computer Vision, 65(1/2):43–72, 2005.
[Mohan and Nevatia, 1992] Rakesh Mohan and Ramakant Nevatia. Perceptual organiza-
tion for scene segmentation and description. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 14:616–635, 1992.
[Moll and Erdmann, 2001] Mark Moll and Michael A. Erdmann. Reconstructing shape
from motion and tactile sensors. In Proceedings of International Conference on Intelli-
gent Robots and Systems (IROS), Maui, HI, USA, October 2001.
[Moller, 1997] Tomas Moller. A fast triangle-triangle intersection test. Journal of Graphics
Tools, 2(2):25–30, 1997.
URL: http://www.cs.lth.se/home/Tomas_Akenine_Moller/.
[Moreira et al., 1996] Nuno Moreira, Paulo Alvito, and Pedro Lima. First steps towards
an open control architecture for a PUMA 560. In Proceedings of CONTROLO Conference,
1996.
[Mortensen et al., 1992] E. Mortensen, B. Morse, W. Barrett, and J. Udupa. Adaptive
boundary detection using live-wire two-dimensional dynamic programming. In IEEE
Proceedings of Computers in Cardiology, pages 635–638, Durham, North Carolina,
October 1992.
[Nagel, 1978] H. H. Nagel. Formation of an object concept by analysis of systematic
time variations in the optically perceptible environment. Computer Graphics and Image
Processing, 7(2):149–194, April 1978.
[Nalwa, 1988a] Vishvjit S. Nalwa. Line-drawing interpretation: A mathematical frame-
work. International Journal of Computer Vision, 2(2):103–124, September 1988.
[Nalwa, 1988b] Vishvjit S. Nalwa. Line-drawing interpretation: Straight lines and conic
sections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(4):514–
529, 1988.
[Nalwa, 1989] Vishvjit S. Nalwa. Line-drawing interpretation: Bilateral symmetry. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 11(10):1117–1120, 1989.
[Nevatia and Binford, 1977] Ramakant Nevatia and Thomas O. Binford. Description and
recognition of curved objects. Artificial Intelligence, 8:77–98, 1977.
[Ogawa, 1991] Hideo Ogawa. Symmetry analysis of line drawings using the Hough trans-
form. Pattern Recognition Letters, 12(1):9–12, 1991.
[Pal and Pal, 1993] N.R. Pal and S.K. Pal. A review on image segmentation techniques.
Pattern Recognition, 26(9):1277–1294, September 1993.
[Paul and Zhang, 1986] Richard P. Paul and Hong Zhang. Computationally efficient kine-
matics for manipulators with spherical wrists based on the homogeneous transformation
representation. International Journal of Robotics Research, 5:32–44, 1986.
[Pfaltz and Rosenfeld, 1967] John L. Pfaltz and Azriel Rosenfeld. Computer representa-
tion of planar regions by their skeletons. Communications of the ACM, 10(2):119–122,
1967.
[Ponce, 1990] Jean Ponce. On characterizing ribbons and finding skewed symmetries.
Computer Vision, Graphics, and Image Processing, 52(3):328–340, December 1990.
[Prasad and Yegnanarayana, 2004] V. Shiv Naga Prasad and B. Yegnanarayana. Find-
ing axes of symmetry from potential fields. IEEE Transactions on Image Processing,
13(12):1559–1566, December 2004.
[Ray et al., 2008] Celine Ray, Francesco Mondada, and Roland Siegwart. What do people
expect from robots? In IEEE/RSJ International Conference on Intelligent Robots and
Systems, pages 3816–3821, Nice, France, September 2008.
[Reisfeld et al., 1995] D. Reisfeld, H. Wolfson, and Y. Yeshurun. Context-free attentional
operators: The generalized symmetry transform. International Journal of Computer
Vision, Special Issue on Qualitative Vision, 14(2):119–130, March 1995.
[Rosenfeld and Pfaltz, 1966] Azriel Rosenfeld and John L. Pfaltz. Sequential operations
in digital picture processing. Journal of the Association for Computing Machinery,
13:471–494, 1966.
[Rosenfeld and Pfaltz, 1968] Azriel Rosenfeld and John L. Pfaltz. Distance functions on
digital images. Pattern Recognition, 1:33–61, 1968.
[Satoh et al., 2004] Yoshinori Satoh, Takayuki Okatani, and Koichiro Deguchi. A color-
based tracking by Kalman particle filter. In Proceedings of International Conference on
Pattern Recognition, pages 502–505, Cambridge, UK, August 2004.
[Scharstein and Szeliski, 2001] Daniel Scharstein and Richard Szeliski. A taxonomy and
evaluation of dense two-frame stereo correspondence algorithms. Technical Report MSR-
TR-2001-81, Microsoft Research, Microsoft Corporation, November 2001.
[Siciliano, 1990] Bruno Siciliano. Kinematic control of redundant robot manipulators: A
tutorial. Journal of Intelligent and Robotic Systems, 3:202–212, 1990.
[Skarbek and Koschan, 1994] Wladyslaw Skarbek and Andreas Koschan. Colour image
segmentation — a survey. Technical report, Institute for Technical Informatics, Tech-
nical University of Berlin, October 1994.
[Smith and Brady, 1997] Stephen M. Smith and J. Michael Brady. SUSAN: A new approach
to low level image processing. International Journal of Computer Vision, 23:45–78,
1997.
[Smith, 1995] S.M. Smith. Edge thinning used in the SUSAN edge detector. Technical
Report TR95SMS5, Oxford Centre for Functional Magnetic Resonance Imaging of the
Brain (FMRIB), 1995.
[Swain and Ballard, 1991] Michael J. Swain and Dana H. Ballard. Color indexing. Inter-
national Journal of Computer Vision, 7(1):11–32, 1991.
[Taylor, 2004] Geoffrey Taylor. Robust Perception and Control for Humanoid Robots in
Unstructured Environments using Vision. PhD thesis, Monash University, Melbourne,
Australia, 2004.
[Tsai, 1999] Lung-Wen Tsai. Robot Analysis: The Mechanics of Serial and Parallel Ma-
nipulators. John Wiley and Sons Inc, 1999.
[Ude et al., 2008] Ales Ude, Damir Omrcen, and Gordon Cheng. Making object learning
and recognition an active process. International Journal of Humanoid Robotics, 5:267–
286, 2008. Special Issue: Towards Cognitive Humanoid Robots.
[Viola and Jones, 2001] Paul Viola and Michael J. Jones. Rapid object detection using
a boosted cascade of simple features. In Proceedings of IEEE Computer Society Con-
ference on Computer Vision and Pattern Recognition, Kauai Marriott, Hawaii, USA,
December 2001.
[Wang and Suter, 2003] Hanzi Wang and David Suter. Using symmetry in robust model
fitting. Pattern Recognition Letters, 24(16):2953–2966, 2003.
[Wang and Suter, 2005] Hanzi Wang and David Suter. A re-evaluation of mixture-of-
Gaussian background modeling. In IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), pages 1017–1020, Pennsylvania, USA, March 2005.
[Westhoff et al., 2005] D. Westhoff, K. Huebner, and J. Zhang. Robust illumination-
invariant features by quantitative bilateral symmetry detection. In Proceedings of the
IEEE International Conference on Information Acquisition (ICIA), Hong Kong, June
2005.
[Xu and Oja, 1993] L. Xu and E. Oja. Randomized Hough transform (RHT): Basic mech-
anisms, algorithms, and computational complexities. Computer Vision, Graphics, and
Image Processing, 57(2):131–154, March 1993.
[Xu and Prince, 1998] Chenyang Xu and Jerry L. Prince. Snakes, shapes, and gradient
vector flow. IEEE Transactions on Image Processing, 7(3):359–369, March 1998.
[Yan and Kassim, 2004] P. Yan and A.A. Kassim. Medical image segmentation with min-
imal path deformable models. In Proceedings of the International Conference on Image
Processing (ICIP), volume 4, pages 2733–2736, Singapore, October 2004.
[Yip et al., 1994] Raymond K. K. Yip, Wilson C. Y. Lam, Peter K. S. Tam, and Dennis
N. K. Leung. A Hough transform technique for the detection of rotational symmetry.
Pattern Recognition Letters, 15(9):919–928, 1994.
[Ylä-Jääski and Ade, 1996] Antti Ylä-Jääski and Frank Ade. Grouping symmetrical structures
for object segmentation and description. Computer Vision and Image Understanding,
63(3):399–417, 1996.
[Yu and Luo, 2002] Ting Yu and Yupin Luo. A novel method of contour extraction based
on dynamic programming. In Proceedings of the 6th International Conference on Signal
Processing (ICSP), pages 817–820, Beijing, China, August 2002.
[Zabrodsky et al., 1993] H. Zabrodsky, S. Peleg, and D. Avnir. Completion of occluded
shapes using symmetry. In Proceedings of IEEE Conference on Computer Vision and
Pattern Recognition, pages 678–679, New York, NY, USA, June 1993.
[Zabrodsky et al., 1995] H. Zabrodsky, S. Peleg, and D. Avnir. Symmetry as a continuous
feature. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17:1154–
1166, 1995.
