
Proceedings of the 5th European Conference on Mobile Robots, ECMR 2011

September 7-9, 2011, Örebro, Sweden

Editors: Achim J. Lilienthal, Tom Duckett

September 5, 2011

Achim J. Lilienthal
AASS Research Centre, School of Science and Technology
Örebro University, SE-70182 Örebro, Sweden
achim.lilienthal@oru.se

Tom Duckett
Lincoln School of Computer Science
University of Lincoln, LN6 7TS Lincoln, United Kingdom
tduckett@lincoln.ac.uk

ECMR'11: Proceedings of the 5th European Conference on Mobile Robots, September 7-9, 2011, Örebro, Sweden


Committee
General Chair: Achim J. Lilienthal
Program Chair: Tom Duckett
Conference Board: Adam Borkowski, Wolfram Burgard, Primo Zingaretti
Local Organization: Monica Wettler (Conference Coordinator), Barbro Alvin, Per Sporrong, Marcello Cirillo, Branko Arsov (Finance Organization)
Program Committee: Kai Arras, Alexander Bahr, Juan Andrade Cetto, Antonio Chella, Marcello Cirillo, Mark Cummins, David Filliat, Udo Frese, Javier Gonzalez, Horst-Michael Gross, Dirk Holz, Patric Jensfelt, Maarja Kruusmaa, Lino Marques, Stefan May, Ivan Petrović, Libor Přeučil, Alessandro Saffiotti, Antonio Sgorbissa, Cyrill Stachniss, Rudolph Triebel, Jasmin Velagić, Tamim Asfour, Nicola Bellotto, Raja Chatila, Grzegorz Cielniak, Andreu Corominas Murtra, Robert Cupec, Thierry Fraichard, Emanuele Frontoni, Roderich Gross, Dongbing Gu, Luca Iocchi, Dermot Kerr, Martin Magnusson, Tomás Martínez-Marín, Emanuele Menegatti, Cédric Pradalier, Dario Lodi Rizzini, Thomas Schön, Piotr Skrzypczyński, Adriana Tapus, Andrzej Typiak, Markus Vincze


Sponsors

ABB, Volvo Construction Equipment, Robotdalen, Atlas Copco


Welcome Message
We are honoured to welcome you to the 5th European Conference on Mobile Robots, ECMR 2011, held in the city of Örebro, Sweden, on September 7-9, 2011. ECMR is a biennial European forum, internationally open, allowing researchers throughout Europe to become acquainted with the latest accomplishments and innovations in advanced mobile robotics and mobile human-robot systems. The first ECMR meeting was held in September 2003 in Radziejowice, Poland, followed by ECMRs in September 2005 in Ancona, Italy; in September 2007 in Freiburg, Germany; and in September 2009 in Mlini/Dubrovnik, Croatia. A priority of ECMR is to attract young researchers to present their work to an international audience. ECMR is organized in single-track mode to favour discussions. ECMR 2011 will continue this policy and will provide panel sessions and original presentations about research in progress.

ECMR 2011 has continued to build on the success of the previous conferences, reaching a level of maturity that is reflected in the quality of the technical program. Each of the 71 papers submitted was evaluated by three reviewers, and 51 of them (by 170 authors from 19 different countries) have been accepted by the Program Committee. These papers are included in these proceedings and will be presented at the conference. They cover a wide spectrum of research topics in mobile robotics: 3D perception, navigation, path planning and tracking, SLAM, mapping and exploration, human-robot cooperation, various service applications, etc.

We are especially proud to welcome our distinguished keynote speakers Professor Per-Erik Forssén from Linköping University, Sweden, who will give the talk "Dynamic and Situated Visual Perception", and Professor Markus Vincze from Vienna University of Technology, Austria, who will give the talk "Robot Object Classification". The Technical Program also includes a special invited talk by Professor Wolfram Burgard of Freiburg University, Germany, on "Robot Control Based on System Identification", a presentation in memoriam of Professor Ulrich Nehmzow. Ulrich was a keen supporter of ECMR (as well as EUROBOT, one of the predecessor conferences to ECMR) and will be deeply missed by many colleagues in the European mobile robotics community. A further invited talk will be given by Torbjörn Martinsson of Volvo CE on "Construction Equipment is a Mobile Robotic Market", reflecting the strong connections between the academic community and its industrial partners at ECMR.

Our sincere thanks are due to all people whose cooperation and hard work made this conference possible. First and foremost, we would like to thank the members of the Organizing Committee and Program Committee for their outstanding work, and our sponsors, whose support is particularly appreciated. Our special thanks go to the authors for submitting their work to ECMR 2011 and to the reviewers for their time and effort in evaluating the submissions. The results of their joint work are visible in the Program and Proceedings of ECMR 2011. It is now up to all of us to make ECMR 2011 a great success and a memorable event by participating in the technical program and by supporting our younger colleagues, especially students, as they will shape the future of mobile robotics research.

Achim J. Lilienthal Tom Duckett


CONFERENCE PROGRAM
Day 1

Session 1: Shared Environments
1    Cipriano Galindo, Javier González, Juan-Antonio Fernández-Madrigal, Alessandro Saffiotti
     Robots that Change Their World: Inferring Goals from Semantic Knowledge
7    Arne Kreutzmann, Immo Colonius, Lutz Frommberger, Frank Dylla, Christian Freksa, Diedrich Wolter
     On Process Recognition by Logical Inference
13   Alper Aydemir, Moritz Göbelbecker, Andrzej Pronobis, Kristoffer Sjöö, Patric Jensfelt
     Plan-based Object Search and Exploration using Semantic Spatial Knowledge in the Real World

Session 2: Comparative Evaluation
19   Todor Stoyanov, Athanasia Louloudi, Henrik Andreasson, Achim J. Lilienthal
     Comparative Evaluation of Range Sensor Accuracy in Indoor Environments
25   Dirk Holz, Nicola Basilico, Francesco Amigoni, Sven Behnke
     A Comparative Evaluation of Exploration Strategies and Heuristics to Improve Them

Session 3: Tracking
31   Srećko Jurić-Kavelj, Ivan Marković, Ivan Petrović
     People Tracking with Heterogeneous Sensors using JPDAF with Entropy Based Track Management
37   Aamir Ahmad, Pedro Lima
     Multi-Robot Cooperative Object Tracking Based on Particle Filters

Session 4: Navigation
43   Bernd Kitt, Jörn Rehder, Andrew Chambers, Miriam Schönbein, Henning Lategahn, Sanjiv Singh
     Monocular Visual Odometry using a Planar Road Model to Solve Scale Ambiguity
49   Boris Lau, Christoph Sprunk, Wolfram Burgard
     Incremental Updates of Configuration Space Representations for Non-Circular Mobile Robots with 2D, 2.5D or 3D Obstacle Models
55   Maximilian Beinhofer, Jörg Müller, Wolfram Burgard
     Landmark Placement for Accurate Mobile Robot Navigation


61   Francesco Capezio, Fulvio Mastrogiovanni, Antonello Scalmato, Antonio Sgorbissa, Paolo Vernazza, Tullio Vernazza, Renato Zaccaria
     Mobile Robots in Hospital Environments: an Installation Case Study

Day 2

Session 5: Visual SLAM
69   John McDonald, Michael Kaess, Cesar Cadena, José Neira, John J. Leonard
     6-DOF Multi-session Visual SLAM using Anchor Nodes
77   Gerardo Carrera, Adrien Angeli, Andrew J. Davison
     Lightweight SLAM and Navigation with a Multi-Camera Rig

Poster Spotlight Session 1: Shared Environments, Navigation
83   Abir B. Karami, Abdel-Illah Mouaddib
     A Decision Model of Adaptive Interaction Selection for a Robot Companion
89   Jonas Firl, Quan Tran
     Probabilistic Maneuver Prediction in Traffic Scenarios
95   Jens Kessler, Andrea Scheidig, Horst-Michael Gross
     Approaching a Person in a Socially Acceptable Manner using Expanding Random Trees
101  Amir Aly, Adriana Tapus
     Speech to Head Gesture Mapping in Multimodal Human-Robot Interaction
109  Hatice Kose-Bagci, Rabia Yorganci, Hatice Esra Algan
     Evaluation of the Robot Sign Language Tutoring Assistant using Video-based Studies
115  Agustin Ortega, Juan Andrade-Cetto
     Segmentation of Dynamic Objects from Laser Data
121  Erik Einhorn, Markus Filzhuth, Christof Schröter, Horst-Michael Gross
     Monocular Detection and Estimation of Moving Obstacles for Robot Navigation
127  Robert Cupec, Emmanuel K. Nyarko, Damir Filko
     Fast 2.5D Mesh Segmentation to Approximately Convex Surfaces
133  Janusz Bedkowski
     Data Registration Module - A Component of Semantic Simulation Engine
139  Ernesto H. Teniente, Juan Andrade-Cetto
     FaMSA: Fast Multi-Scan Alignment with Partially Known Correspondences

Session 6: Perception
145  Sebastian Scherer, Daniel Dube, Philippe Komma, Andreas Masselli, Andreas Zell
     Robust Real-Time Number Sign Detection on a Mobile Outdoor Robot


153  Marcel Häselich, Marc Arends, Dagmar Lang, Dietrich Paulus
     Terrain Classification with Markov Random Fields on fused Camera and 3D Laser Range Data
159  Andrzej Pronobis, Patric Jensfelt
     Hierarchical Multi-Modal Place Categorization
165  Luís Osório, Gonçalo Cabrita, Lino Marques
     Mobile Robot Odor Plume Tracking using Three Dimensional Information

Session 7: Planning
171  Jan Faigl, Vojtěch Vonásek, Libor Přeučil
     A Multi-Goal Path Planning for Goal Regions in the Polygonal Domain
177  Jörg Stückler, Ricarda Steffens, Dirk Holz, Sven Behnke
     Real-Time 3D Perception and Efficient Grasp Planning for Everyday Manipulation Tasks

Poster Spotlight Session 2: Perception, Planning, Visual Mapping, SLAM, Applications
183  Gonçalo Cabrita, Pedro Sousa, Lino Marques
     Odor Guided Exploration and Plume Tracking - Particle Plume Explorer
189  Miriam Schönbein, Bernd Kitt, Martin Lauer
     Environmental Perception for Intelligent Vehicles Using Catadioptric Stereo Vision Systems
195  Dominik Belter, Przemysław Łabęcki, Piotr Skrzypczyński
     On-Board Perception and Motion Planning for Legged Locomotion over Rough Terrain
201  Vojtěch Vonásek, Jan Faigl, Tomáš Krajník, Libor Přeučil
     A Sampling Schema for Rapidly Exploring Random Trees using a Guiding Path
207  Rainer Palm, Abdelbaki Bouguerra
     Navigation of Mobile Robots by Potential Field Methods and Market-based Optimization
213  Feras Dayoub, Grzegorz Cielniak, Tom Duckett
     A Sparse Hybrid Map for Vision-Guided Mobile Robots
219  Marco A. Gutiérrez, Pilar Bachiller, Luis J. Manso, Pablo Bustos, Pedro Núñez
     An Incremental Hybrid Approach to Indoor Modeling
227  Hemanth Korrapati, Youcef Mezouar, Philippe Martinet
     Efficient Topological Mapping with Image Sequence Partitioning
233  Thomas Féraud, Roland Chapuis, Romuald Aufrère, Paul Checchin
     Kalman Filter Correction with Rational Non-linear Functions: Application to Visual-SLAM
239  Paolo Raspa, Emanuele Frontoni, Adriano Mancini, Primo Zingaretti, Sauro Longhi
     Helicopter Safe Landing using Vision and 3D Sensing
245  Klas Hedenberg, Björn Åstrand
     Safety Standard for Mobile Robots - A Proposal for 3D Sensors
253  Wajahat Kazmi, Morten Bisgaard, Francisco Garcia-Ruiz, Karl D. Hansen, Anders la Cour-Harbo
     Adaptive Surveying and Early Treatment of Crops with a Team of Autonomous Vehicles


Day 3

Session 8: New Design Concepts
259  Shahriar Asta, Sanem Sariel-Talay
     A Differential Steering System for Humanoid Robots
265  Christian Mandel, Udo Frese
     Annelid - a Novel Design for Actuated Robots Inspired by Ringed Worms Locomotion

Session 9: SLAM
271  Andreas Nüchter, Seyedshams Feyzabadi, Deyuan Qiu, Stefan May
     SLAM à la carte - GPGPU for Globally Consistent Scan Matching
277  Andreja Kitanov, Ivan Petrović
     Generalization of 2D SLAM Observability Condition
283  Anssi Kemppainen, Janne Haverinen, Ilari Vallivaara, Juha Röning
     Near-optimal Exploration in Gaussian Process SLAM: Scalable Optimality Factor and Model Quality Rating
291  Eduardo Lopez, Caleb De Bernardis, Tomas Martinez-Marin
     An Active SLAM Approach for Autonomous Navigation of Nonholonomic Vehicles

Session 10: Localization
297  John Folkesson
     Robustness of the Quadratic Antiparticle Filter for Robot Localization
303  Stéphane Bazeille, Emmanuel Battesti, David Filliat
     Qualitative Localization using Vision and Odometry for Path Following in Topo-metric Maps
309  Keisuke Matsuo, Jun Miura, Junji Satake
     Stereo-Based Outdoor Localization using a Line Drawing Building Map
315  Alessandro Benini, Adriano Mancini, Emanuele Frontoni, Primo Zingaretti, Sauro Longhi
     Adaptive Extended Kalman Filter for Indoor/Outdoor Localization using a 802.15.4a Wireless Network

321  List of Authors


Robots that Change Their World: Inferring Goals from Semantic Knowledge
Cipriano Galindo*, Javier González, Juan-Antonio Fernández-Madrigal
Dept. of System Engineering and Automation, University of Málaga, Spain

Alessandro Saffiotti
AASS Mobile Robotics Lab, Örebro University, Sweden

Abstract: A growing body of literature shows that endowing a mobile robot with semantic knowledge, and with the ability to reason from this knowledge, can greatly increase its capabilities. In this paper, we explore a novel use of semantic knowledge: we encode information about how things should be, or norms, to allow the robot to infer deviations from these norms and to generate goals to correct these deviations. For instance, if a robot has semantic knowledge that perishable items must be kept in a refrigerator, and it observes a bottle of milk on a table, this robot will generate the goal to bring that bottle into a refrigerator. Our approach provides a mobile robot with a limited form of goal autonomy: the ability to derive its own goals to pursue generic aims. We illustrate our approach in a full mobile robot system that integrates a semantic map, a knowledge representation and reasoning system, a task planner, as well as standard perception and navigation routines.

Index Terms: Semantic Maps, Mobile Robotics, Goal Generation, Goal Autonomy, Knowledge Representation, Proactivity.

I. INTRODUCTION

Mobile robots intended for service and personal use are being increasingly endowed with the ability to represent and use semantic knowledge about the environment where they operate [13]. This knowledge encodes general information about the entities in the world and their relations, for instance: that a kitchen is a type of room which typically contains a refrigerator, a stove and a sink; that milk is a type of perishable food; and that perishable food is stored in a refrigerator. Once this knowledge is available to a robot, there are many ways in which it can be exploited to better understand the environment or plan actions [10], [18], [19], [21], [23], [25], assuming of course that this knowledge is a faithful representation of the properties of the environment.

There is, however, an interesting issue which has received less attention so far: what happens if this knowledge turns out to be in conflict with the robot's observations? Suppose for concreteness that the robot observes a milk bottle lying on a table. This observation conflicts with the semantic knowledge that milk is stored in a refrigerator. The robot has three options to resolve this contradiction: (a) to update its semantic knowledge base, e.g., by creating a new subclass of milk that is not perishable; (b) to question the validity of its perceptions, e.g., by looking for clues that may indicate that the observed object is not a milk bottle; or (c) to modify the environment, e.g., by bringing the milk into a refrigerator. While some works have addressed the first two options [6], [11], the last one has not received much attention so far. Interestingly, the last option leverages a unique capability of robots: the ability to modify the physical environment. The goal of this paper is to investigate this option.

We propose a framework in which a mobile robot can exploit semantic knowledge to identify inconsistencies between the observed state of the environment and a set of general, declarative descriptions, or norms, and to generate goals to modify the state of the environment in such a way that these inconsistencies would disappear. When given to a planner, these goals lead to action plans that can be executed by the robot. This framework can be seen as a way to enable a robot to proactively generate new goals, based on the overall principle of maintaining the world consistent with the given declarative knowledge. In this light, our framework contributes to the robot's goal autonomy. Although behavioral autonomy has been widely addressed in the robotic arena by developing deliberative architectures and robust algorithms for planning and executing tasks under uncertainty, goal autonomy has received less attention, being explored in recent years in the theoretical field of multi-agent systems [4], [8] and implemented through motivational architectures [1], [7].

Our framework relies on a hybrid semantic map, which combines semantic knowledge based on description logics [2] with traditional robot maps [11], [18], [21]. Semantic maps have already been shown to increase the robot's behavioral autonomy, by improving their basic skills (planning, navigation, localization, etc.) with deduction abilities. For instance, if a robot is commanded to fetch a milk bottle but does not know the target location, it can deduce that milk is supposed to be in fridges which, in turn, are located in kitchens. We now extend our previous works on these issues [10], [11] to also include partial goal autonomy through the proactive generation of goals based on the robot's internal semantic model. More specifically, we consider a robot with the innate objective of keeping its environment in good order with respect to a given set of norms, encoded in a declarative way in its internal semantic representation.

*Corresponding author. System Engineering and Automation Dept., University of Málaga, Campus de Teatinos, E-29071 Málaga, Spain. Email: cipriano@ctima.uma.es. This work has been partially supported by the Spanish Government under contract DPI2008-03527.


Incoherences between the sensed reality and the model, i.e., the observation of facts that violate a particular norm, will lead to the generation of the corresponding goal that, when planned and executed, will re-align the reality to the model, as in the milk bottle example discussed above. It should be emphasized that in this work we only focus on the goal inference mechanism: the development of the required sensorial system, and the possible use of semantic knowledge in that context, are beyond the scope of this paper.

Our approach to goal autonomy can be seen as a case of normative goals applied to agents which act based on beliefs and intentions [4], [8]. However, normative goals are often considered as simple if-then rules triggered when particular stimuli are given in the environment [1], [20]. Other works have used the term maintenance goals to represent innate goals that are aimed to satisfy a particular state of the world over time, e.g., the battery level should always be over a certain value [3], [12]. Our approach substantially diverges from those works, since it is not based on procedural rules, i.e., motivation-action pairs, nor if-then rules. Instead, we rely on a declarative representation of the domain, using the LOOM description logic language [17], from which the robot infers what should be done according to the current factual information in order to maintain the consistency between its environment and its representation.

This manuscript is structured as follows. In the next section we present our semantic map. Section III formalizes the use of semantic knowledge for goal generation. In Section IV a real experiment is described. Finally, some conclusions and future work are outlined.

II. A SEMANTIC MAP FOR MOBILE ROBOT OPERATION

The semantic map considered in this work, derived from [10], comprises two different but tightly interconnected parts: a spatial box, or S-Box, and a terminological box, or T-Box. Roughly speaking, the S-Box contains factual knowledge about the state of the environment and of the objects in it, while the T-Box contains general semantic knowledge about the domain, giving meaning to the entities in the S-Box in terms of concepts and relations. For instance, the S-Box may represent that Obj-3 is placed at Area-2, while the T-Box may represent that Obj-3 is a stove, which is a type of appliance. By combining the two sides, one can infer, for instance, that Area-2 is a kitchen, since it contains a stove. This structure is reminiscent of the structure of hybrid knowledge representation (KR) systems [2], which are now dominant in the KR community. Our semantic map extends the assertional component to be more than a list of facts about individuals by also associating these individuals to sensor-level information with a spatial structure, hence the name S-Box. Please refer to [10] for more detail.

Figure 1 shows a simple example of a semantic map of a home-like environment where both the S-Box and the T-Box have a hierarchical structure. The hierarchy in the T-Box reflects the fact that the represented semantic knowledge forms a taxonomy. For the S-Box, the use of a hierarchical spatial representation is a convenient and common choice in the robotic literature [9], [15].

Fig. 1. An example of semantic map for a home-like environment. S-Box is on the left and T-Box on the right. See explanation in the text.

This hierarchical arrangement largely helps in reducing the computational burden in large-scale scenarios when spatial information is involved, i.e., robot localization, as well as when symbolic information is required, i.e., goal generation or task planning [10]. Of course one could also consider a flat representation in the S-Box: in fact, in our framework, the S-Box can be substituted by any other spatial representation.

III. INFERRING GOALS FROM SEMANTICS

The semantic map described above provides two different points of view of the robot workspace. On the one hand, the spatial part (S-Box) enables the robot to generate plans from basic skills, striving for behavioral autonomy. On the other hand, the terminological part (T-Box) provides an abstract model of the robot environment which includes general knowledge, e.g., books are located on shelves, which can be exploited for the automatic generation of robot goals. First we give an informal description of the proposed mechanism for goal generation. Then, Section III-B formalizes our approach under description logic. Finally, Section III-C illustrates the process with two intuitive examples.

A. Informal Description

In the field of knowledge representation, semantic knowledge is usually interpreted as being descriptive of a specific domain: for example, the item of knowledge "beds are located in bedrooms" is used to partially describe beds. This knowledge is most useful to infer implicit properties from a few observed facts. For example, if the robot perceives a bed in a room it can infer that the room is a bedroom; conversely, if it is commanded to find a bed it can restrict its search to bedrooms. Galindo et al. [10] offer examples of applications of these inferences in the robotic domain.

Interestingly, semantic knowledge can also be interpreted as being normative: under this interpretation, the above item of knowledge prescribes where a bed must be located. The difference becomes apparent when considering how a robot should react to an observation that contradicts this knowledge. Consider again the milk box example in the Introduction, and the three possible options to resolve the contradiction


discussed there. Options (a) (update the model) and (b) (update the perceived state) correspond to modifying the robot's beliefs to recover from a contradiction, and are related to execution monitoring and uncertain state estimation. These options have been explored previously [6], [11]. The third option (c) (update the world) involves goal generation, and it is the one addressed here.

Informally, our approach defines a subset of concepts and relations stored in the T-Box as normative, i.e., they are involved in norms that should be fulfilled, by defining a special class normative-concept and a special relation normative-relation. Items of knowledge to be treated as normative will derive from them. For instance, we can define that the normative concept towel should be related to the concept bathroom through the normative relation place, that is, towels should be located in a bathroom.

When a given instance violates a norm in the T-Box, the system derives the items of knowledge involved in the norm, and hence the goal that should be posted in order to satisfy that norm. In our example, suppose that an instance of a towel is perceived in a room which is not a bathroom. Then the given definition of a towel is violated, a circumstance that can be detected by most knowledge representation systems, including the LOOM [17] system used in our experiments. Since the above definition of towel is normative, the system yields a goal to satisfy the constraint, that is, to make the place of this towel be an instance of a bathroom. If the robot knows that, say, room-3 is a bathroom, this means that the goal "bring the towel to room-3" is generated.

B. Description Logic Representation for Inferring Normative Goals

Let I be a description logic interpretation on a particular domain D. Let Π = {P1, ..., Pn} be a set of pairwise disjoint concepts, i.e., ∀a, a ⊑ Pi ⇒ ¬∃ j ≠ i : a ⊑ Pj, where x ⊑ y denotes that x is subsumed by the concept y. Let Nr, called a normative relation, be a function defined as:

Nr : NC → Π

where NC represents the so-called normative concepts, that is, concepts which ought to be properly related to those from Π. Nr actually defines the norms to be kept. Normative relations are defined as one-to-one functions: ∀b ⊑ NC, ∃Pj ∈ Π, b ⊑ [FILLS : Nr Pj], where a ⊑ [FILLS : B C] denotes that the instance a is related to an instance derived from concept C through the relation B. The NC set is further divided into two disjoint sets: the set Φ of all normative concepts that fulfill the imposed norms, and the set Φ̄ of those that fail to fulfill some of the norms (see Figure 2). Within this structure of the domain, constraint violations are automatically inferred when instances of the defined partitions are deduced to belong to a number of disjoint concepts.

Fig. 2. Knowledge representation for detecting inconsistencies. Boxes represent concepts while instances are represented as circles. The concept C is defined as a normative concept related to Pi through the normative relation Nr1. See explanation in the text.

Let us see an example. Let C ⊑ Φ be a normative concept (and therefore C ⊑ NC by definition) which is related to the concept Pi through the normative relation Nr. That is,

∀c ⊑ C, c ⊑ [FILLS : Nr x], x ⊑ Pi

If in a given interpretation I, ∃k ⊑ C, k ⊑ [FILLS : Nr y], y ⊑ Pj, Pj ≠ Pi, then I ⊨ y ⊑ Pj ∧ y ⊑ Pi, hence (Incoherent y). That is, if the normative relation is not met for a particular instance of a normative concept, the filler of such an instance, in this case y, becomes incoherent. Moreover, since k is defined as k ⊑ C ⊑ Φ, it is also inferred that k ⊑ Φ̄, which also makes k incoherent.

Goal Inference. Given an incoherent instance of a normative concept, k ⊑ C, and the normative relation Nr, Nr(k) = x, x ⊑ Pi, the inferred goal to recover the system from the incoherence is:

(exists ?z (Pi z) (Nr k z))

That is, in the goal state, there should exist an instance of Pi related to k through the normative relation Nr.¹

¹It is not necessary to add the negation of (Nr k z) to the goal state, since the Nr function is defined as one-to-one.
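To make the inference above concrete, the following minimal Python sketch (not the paper's LOOM-based implementation; names such as Norm, Instance and generate_goals are illustrative) checks, for every instance of a normative concept, whether the filler of its normative relation is an instance of the required range concept Pi, and emits a planner goal of the form (exists ?z (Pi z) (Nr k z)) otherwise. Subsumption is reduced to flat concept labels here, which LOOM would instead resolve through its taxonomy.

from dataclasses import dataclass

@dataclass
class Norm:
    concept: str        # normative concept C, e.g. "milk-bottle"
    relation: str       # normative relation Nr, e.g. "place"
    range_concept: str  # required range Pi, e.g. "fridge"

@dataclass
class Instance:
    name: str
    concept: str
    fillers: dict       # relation name -> filler instance name

def generate_goals(norms, instances):
    """Return planner goals for every instance that violates a norm."""
    by_name = {i.name: i for i in instances}
    goals = []
    for norm in norms:
        for inst in instances:
            if inst.concept != norm.concept:
                continue
            filler = by_name.get(inst.fillers.get(norm.relation))
            # Incoherence: the filler is missing or not an instance of Pi.
            if filler is None or filler.concept != norm.range_concept:
                goals.append(f"(exists ?z ({norm.range_concept} ?z) "
                             f"({norm.relation} {inst.name} ?z))")
    return goals

# Example: a milk bottle observed on a table violates "place must be a fridge".
norms = [Norm("milk-bottle", "place", "fridge")]
instances = [Instance("mb", "milk-bottle", {"place": "t1"}),
             Instance("t1", "table", {}),
             Instance("f1", "fridge", {})]
print(generate_goals(norms, instances))
# -> ['(exists ?z (fridge ?z) (place mb ?z))']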


Fig. 3. Part of the domain definition for the milk-inside-fridges example (LOOM definitions of, among others, object-places, fridge, table, the normative relation place, fulfilling-norm, non-fulfilling-norm, and milk-bottle). For clarity's sake, fulfilling-norm is used instead of Φ and non-fulfilling-norm instead of Φ̄. Among the definitions shown:

(defconcept milk-bottle
  :is (and beverage fulfilling-norm (:the place fridge) (:the color white)))

Fig. 4. General scheme for representing multiple norms (normative relations place and has-humidity, with range partitions for rooms and humidity levels). A particular partition has to be defined for each normative relation.

C. Sample Scenarios

In this section we describe two illustrative examples.

1) Milk should be inside fridges: Consider a home assistant robot taking care of an apartment. Among other norms, the robot might watch milk bottles so they are always kept inside the fridge (see an implementation in Section IV). The semantic map for this scenario will entail information about the different rooms, i.e., kitchen, living room, bathroom, etc., the objects found inside, i.e., tables, chairs, fridges, books, bottles, etc., and their relations. Following the formalization given in III-B, part of the description of this scenario includes the partition of the different places where bottles of milk could be found, e.g., Π = {fridge, table, shelf}, with milk-bottle being a normative concept, i.e., milk-bottle ⊑ Φ (see Figure 3). Note that this definition implicitly provides certain restrictions that any bottle of milk should fulfill. Precisely, milk-bottle is assumed to be a beverage which has to meet at least one norm imposed by a normative relation, since it is subsumed by the fulfilling-norm concept. Through the definitions given in Figure 3, the expression (:the place fridge) indicates that every bottle of milk ought to be located in one location that must be a fridge. Notice that in this example the other restriction, (:the color white), is not defined as a normative relation, and thus, if it is not fulfilled in the scenario, it will simply be deduced that the object is not a bottle of milk and no incoherences or robot goals will be generated.

Let us now consider the following situation in which the information gathered by the robot contradicts the definitions in the domain:

{(table t1) (milk-bottle mb) (fridge f1) (place mb t1)}

Under this interpretation, LOOM infers that the instance t1 should be a fridge since there is a bottle of milk placed on it. Such an inference produces an incoherence in the model given that the instance t1 is deduced to belong to two concepts, i.e., table and fridge, which have been defined as members of a partition. In this situation t1 is marked by LOOM as incoherent. Moreover, it is also deduced that the instance mb, initially defined as mb ⊑ Φ, also belongs to Φ̄ since the normative relation (:the place fridge) is filled with an incoherent instance. Again the system detects that mb belongs to two concepts defined in a partition, and thus mb is also marked as incoherent. The result is that the instances involved in the violated norm are detected and marked as incoherent. By checking the domain definition of such incoherent instances, the following goal is deduced:

(exist ?x (fridge ?x) (place mb ?x))²

That is, the robot has to put the bottle of milk represented by mb inside any object ?x which is known to be a fridge. Since in the robot's domain there is a single fridge f1, the above goal is instantiated as (place mb f1).

2) Plant should be properly watered: We now describe a more general example in which two norms are imposed on the same normative concept. Consider we impose that plants should be placed in the garden and have a normal humidity level. In this case we need two normative relations, place and has-humidity, and two partitions of concepts representing the possible, disjoint values for such relations. Figure 4 depicts part of the T-Box for this example. Let us consider the following situation:

{(kitchen k) (bathroom b) (garden g) (plant p) (place p k) (humidity-value dry) (has-humidity p dry)}

As in the previous example, the process for detecting norm violations checks for incoherent instances. In this case instances k and dry become incoherent since they are deduced to belong to {kitchen, garden} and {dry, normal-hum} respectively. Besides, the instance p is also incoherent and therefore the following goal is generated:

(and (exist ?x (garden ?x) (place p ?x))
     (exist ?y (normal-hum ?y) (has-humidity p ?y)))

²This goal is expressed in the goal language of the planner used in our experiment (see below), which is a subset of FOL.

IV. AN ILLUSTRATIVE EXPERIMENT


Fig. 5. The test environment. Left: layout. Right: the robot Astrid.

Fig. 6. Sketch of the software architecture used in our experiments. Only the modules and connections relevant to goal generation are shown.

We now illustrate the applicability of our goal generation technique to a real robotic application by showing an illustrative experiment run in a home environment. The experiment is inspired by the milk scenario in Sec. III above.

A. Physical setup

We have used a physical test-bed facility, called the PEIS-Home [24], that looks like a bachelor apartment of about 25 m² and consists of a living room, a bedroom and a small kitchen; see Fig. 5. The PEIS-Home is equipped with a communication infrastructure, and with a number of sensing and actuating devices, including a few mobile robots. Relevant to the experiments reported here are: a refrigerator equipped with a computer, some gas sensors, a motorized door, and an RFID tag reader; an RFID tag reader mounted under the kitchen table; a set of RFID-tagged objects, including a milk carton; a set of webcams mounted on the ceiling; and Astrid, a PeopleBot mobile robot equipped with a laser scanner, a PTZ camera, and a simple gripper.

A few provisions have been introduced to simplify execution. In particular, since Astrid does not have a manipulator able to pick up an object and place it somewhere else, these operations have been performed with the assistance of a human who puts the object in and out of Astrid's gripper. These simplifications are acceptable here, since the purpose of our experiments is not to validate the execution system but to illustrate our goal generation algorithm in the context of a full robotic application.

B. Software setup

The software system used in our experiment is schematically shown in Fig. 6. The block named PEIS Ecology contains all the robotic components and devices distributed in the PEIS-Home. These are integrated through a specific middleware, called the PEIS-Middleware, that allows them to be dynamically activated and connected in different ways in order to perform different tasks [5]. A set of activations and connections is called a configuration of the PEIS Ecology. For instance, the configuration in which the ceiling cameras are connected via an object recognition component to the navigation controller on board Astrid can be used to let the robot reach a given object.

The semantic map is based on a simple metric-topological map attached to the LOOM knowledge representation system [17]. Newly observed facts are asserted in LOOM using the tell primitive. The goal generation system interacts with

LOOM as described in Sec. III above. Newly generated goals are passed to the planning system. This consists of three parts: an action planner, called PTLplan [14], that generates a sequence of actions to satisfy the goal; a sequencer, that selects those actions one by one; and a configuration planner [16], that generates the configuration needed to perform each action. When the current plan is completed, the goal generation system is re-activated.

C. Execution

Before the execution started, the semantic map contained a metric-topological map of the PEIS-Home, and the considered semantic knowledge in LOOM. In particular, the following statement was included in the LOOM knowledge base:
(defconcept MilkBox :is (:and Container FulfillingNorm (:the place Fridge) ))

This encodes the normative constraint that any instance of the normative class MilkBox must have a single filler for the place relation, and that this filler must be of the class Fridge. An RFID tag has been attached to a milk box, containing an encoding of the following information:
id: mb-22
type: MilkBox
color: white-green
size: 1-liter

At start, the milk is put on the kitchen table, called table-1 in the map. The RFID tag reader under the table detects the new tag, and reports the information that mb-22 is a MilkBox and it is at table-1; see Fig. 7. This information is entered into LOOM by:
(tell (MilkBox mb-22)) (tell (place mb-22 table-1))

As discussed in Sec. III, this information renders both the instances mb-22 and table-1 incoherent. The goal generation algorithm identifies mb-22 as the normative instance. The algorithm then searches through all the relations that constrain mb-22 to find a violated normative one, and it finds place. Since this relation should be filled by an instance of Fridge, it generates the following goal:
(exists ?x (and (Fridge ?x)(place mb-22 ?x)))

PTLplan uses the knowledge in the semantic map, together with its domain knowledge about the available actions, to generate the following action plan (simplified):
((MOVE astrid table-1) (PICKUP astrid mb-22)
 (OPEN fridge-1) (MOVE astrid fridge-1)
 (PLACE astrid mb-22) (CLOSE fridge-1))

where the variable ?x has been instantiated by fridge-1. The sequencer passes each action in turn to the configuration planner, which connects and activates the devices in the PEIS Ecology needed to execute it. For example, the first two actions only require devices which are on-board Astrid, while the third action requires the activation of the fridge door device. (The details of this ecological execution are not relevant here: see [16] for a comprehensive account.)

Fig. 7. RFID tagged objects and RFID tag readers used in our experiments. Left: in the fridge. Right: under the kitchen table.

After the milk is removed from the table, the RFID tag reader under the table detects its absence and signals it to the semantic map. When the milk is placed into the fridge, it is detected by the reader in the fridge. Corresponding to these two events, the following assertions are made in LOOM:

(forget (place mb-22 table-1))
(tell (place mb-22 fridge-1))

After execution is completed, the sequencer re-activates the goal generator. Since the place of mb-22 is now an instance of a fridge, no incoherence is detected and no new goal is generated.
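The cycle just described can be summarized in a compact, self-contained sketch. Everything below is a hypothetical stand-in written for illustration only: the real system asserts facts in LOOM, plans with PTLplan and dispatches actions through the PEIS configuration planner, none of whose interfaces are reproduced here.

class KnowledgeBase:
    """Toy stand-in for the semantic map plus LOOM assertions."""
    def __init__(self):
        self.types = {"mb-22": "MilkBox", "table-1": "Table",
                      "fridge-1": "Fridge"}   # from the semantic map / RFID tags
        self.place = {}                        # object -> location

    def tell(self, obj, loc):
        self.place[obj] = loc

    def forget(self, obj):
        self.place.pop(obj, None)

    def violated(self):
        # Objects of class MilkBox whose place is not (yet) a Fridge.
        return [o for o, t in self.types.items()
                if t == "MilkBox" and self.types.get(self.place.get(o)) != "Fridge"]

def plan_for(kb, obj):
    """Stand-in for PTLplan: a fixed plan template for the milk scenario."""
    return [("MOVE", "astrid", kb.place[obj]), ("PICKUP", "astrid", obj),
            ("OPEN", "fridge-1"), ("MOVE", "astrid", "fridge-1"),
            ("PLACE", "astrid", obj), ("CLOSE", "fridge-1")]

kb = KnowledgeBase()
kb.tell("mb-22", "table-1")                    # RFID reader under the table fires
for obj in kb.violated():                      # goal generation (Sec. III)
    for action in plan_for(kb, obj):           # sequencer dispatches each action
        print(action)
    kb.forget(obj)                             # reader under the table: tag gone
    kb.tell(obj, "fridge-1")                   # reader in the fridge: tag appears
print(kb.violated())                           # [] -> no new goal is generated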

V. DISCUSSION AND CONCLUSIONS

This paper has explored a novel use of semantic knowledge: recognizing and correcting situations in the world that do not comply with the given semantic model, by generating appropriate goals for the robot. A distinctive feature of our approach is that the normative model is provided in a declarative way, rather than by exhaustive violation-action rules. Simple experiments demonstrate the conceptual viability of this approach.

The work reported here is a first step in an interesting direction, and many extensions can and should be considered. For instance, in our work we assume that the robot should always enforce consistency with the semantic knowledge. However, there are cases where norm violation might be allowed. Going back to the milk example, it would be reasonable to allow that the milk bottle stays out of the fridge while the user is having breakfast. Dealing with this type of situation would require the ability to reason about both the robot's plans and the user's activities in an integrated way [22].

REFERENCES

[1] R.C. Arkin, M. Fujita, T. Takagi, and R. Hasegawa. An ethological and emotional basis for human-robot interaction. Robotics and Autonomous Systems, 42(3-4), 2003.
[2] F. Baader, D. Calvanese, D.L. McGuinness, and D. Nardi, editors. The Description Logic Handbook. Cambridge University Press, 2007.
[3] C. Baral, T. Eiter, M. Bjäreland, and M. Nakamura. Maintenance goals of agents in a dynamic environment: Formulation and policy construction. Artificial Intelligence, 172(12-13):1429-1469, 2008.
[4] G. Boella and R. Damiano. An architecture for normative reactive agents. In Proc. of the 5th Pacific Rim Int. Workshop on Multi-Agents, pages 1-17, London, 2002. Springer.
[5] M. Bordignon, J. Rashid, M. Broxvall, and A. Saffiotti. Seamless integration of robots and tiny embedded devices in a PEIS-ecology. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), San Diego, CA, 2007.
[6] A. Bouguerra, L. Karlsson, and A. Saffiotti. Semantic knowledge-based execution monitoring for mobile robots. In Proc. of the IEEE Int. Conf. on Robotics and Automation, Rome, Italy, pages 3693-3698, 2007.
[7] A.M. Coddington, M. Fox, J. Gough, D. Long, and I. Serina. Madbot: A motivated and goal directed robot. In Proc. of the 20th Nat. Conf. on Artificial Intelligence, Pennsylvania, USA, pages 1680-1681, 2005.
[8] M. Dastani and L.W.N. van der Torre. What is a normative goal?: Towards goal-based normative agent architectures. In Proc. of the Int. Workshop on Regulated Agent-Based Social Systems, Bologna, Italy, pages 210-227, 2002.
[9] C. Galindo, J.A. Fernandez-Madrigal, and J. Gonzalez. Multiple Abstraction Hierarchies for Mobile Robot Operation in Large Environments. Studies in Computational Intelligence, Vol. 68. Springer, 2007.
[10] C. Galindo, J.A. Fernandez-Madrigal, J. Gonzalez, and A. Saffiotti. Robot task planning using semantic maps. Robotics and Autonomous Systems, 56(11):955-966, 2008.
[11] C. Galindo, A. Saffiotti, S. Coradeschi, P. Buschka, J.A. Fernandez-Madrigal, and J. Gonzalez. Multi-hierarchical semantic maps for mobile robotics. In Int. Conf. on Intelligent Robots and Systems (IROS), pages 3492-3497, Edmonton, Alberta (Canada), 2005.
[12] M.V. Hindriks and M.B. van Riemsdijk. Satisfying maintenance goals. In Proc. of the Int. Workshop on Declarative Agent Languages and Technologies (DALT'07), Honolulu, HI, USA, pages 86-103, 2007.
[13] J. Hertzberg and A. Saffiotti, editors. Special issue on semantic knowledge in robotics. Robotics and Autonomous Systems, 56(11), 2008.
[14] L. Karlsson. Conditional progressive planning under uncertainty. In Proc. of the Int. Joint Conf. on Artificial Intelligence (IJCAI), pages 431-438, Seattle, USA, 2001.
[15] B.J. Kuipers. Modeling Spatial Knowledge, chapter in Advances in Spatial Reasoning, Vol. 2, pages 171-198. The University of Chicago Press, 1990.
[16] R. Lundh, L. Karlsson, and A. Saffiotti. Autonomous functional configuration of a network robot system. Robotics and Autonomous Systems, 56(10):819-830, 2008.
[17] R. MacGregor and R. Bates. The Loom knowledge representation language. Technical report, DTIC Research Report ADA183415, 1987.
[18] D. Meger, P. Forssén, K. Lai, S. Helmer, S. McCann, T. Southey, M. Baumann, J. Little, D. Lowe, and B. Dow. Curious George: An attentive semantic robot. In IROS Workshop: From Sensors to Human Spatial Concepts, pages 390-404, 2007.
[19] O.M. Mozos, P. Jensfelt, H. Zender, M. Kruijff, and W. Burgard. From labels to semantics: An integrated system for conceptual spatial representations of indoor environments for mobile robots. In ICRA Workshop: Semantic Information in Robotics, 2007.
[20] T.J. Norman and D. Long. Alarms: An implementation of motivated agency. In Proc. of the IJCAI Workshop on Agent Theories, Architectures, and Languages (ATAL), Montreal, Canada, pages 219-234, 1995.
[21] A. Nüchter, O. Wulf, K. Lingemann, J. Hertzberg, B. Wagner, and H. Surmann. 3D mapping with semantic knowledge. In RoboCup Int. Symposium, pages 335-346, 2005.
[22] F. Pecora and M. Cirillo. A constraint-based approach for plan management in intelligent environments. In Scheduling and Planning Applications Workshop at ICAPS'09, 2009.
[23] A. Ranganathan and F. Dellaert. Semantic modeling of places using objects. In Robotics: Science and Systems Conf., 2007.
[24] A. Saffiotti, M. Broxvall, M. Gritti, K. LeBlanc, R. Lundh, J. Rashid, B.S. Seo, and Y.J. Cho. The PEIS-Ecology project: vision and results. In Int. Conf. on Intelligent Robots and Systems, Nice, France, 2008.
[25] M. Tenorth, U. Klank, D. Pangercic, and M. Beetz. Web-enabled robots: Robots that use the web as an information resource. Robotics and Automation Magazine, 18(2), 2011.


On Process Recognition by Logical Inference


Arne Kreutzmann, Immo Colonius, Lutz Frommberger, Frank Dylla, Christian Freksa, Diedrich Wolter
SFB/TR 8 Spatial Cognition, University of Bremen, Germany

Abstract: The ability to recognize and to understand processes allows a robot operating in a dynamic environment to rationally respond to dynamic changes. In this paper we demonstrate how a mobile robot can recognize storage processes in a warehouse environment, solely using perception data and an abstract specification of the processes. We specify processes symbolically in linear temporal logic (LTL) and pose process recognition as a model verification problem. The key feature of our logic-based approach is its ability to infer missing pieces of information by logic-based reasoning. The evaluation demonstrates that this approach is able to reconstruct histories of goods movements in a lab-simulated warehouse.

Index Terms: plan recognition, temporal reasoning, spatio-temporal logic

Fig. 1. Conceptual overview of our software architecture (modules shown include odometry, feature detection, mapping and localization, symbol grounding, place identification and qualitative locations, and symbolic reasoning for place inference and process recognition, driven by an abstract process specification).

I. INTRODUCTION

Mastering dynamic environments is a demanding challenge in autonomous robotics. It involves recognizing and understanding processes in the environment [7]. Recent advances in simultaneous localization and mapping (SLAM) [20, 21, 22] build the basis for sophisticated navigation in dynamic environments, but our aim of understanding processes goes beyond navigation. In this paper we indicate how the problem of recognizing processes can be tackled on a conceptual level in the domain of warehouse logistics.

In a warehouse, there is a constant flow of goods which are moved through space, establishing functional zones that are connected with certain types of storage processes (for example, admission of goods into a warehouse makes use of a buffer zone to temporarily store goods for quality assurance). Knowing about the in-warehouse processes and their whereabouts enables warehouse optimization. Hildebrandt et al. argue for using autonomous robots as a minimally invasive means to observe in-warehouse processes [10]. However, the sensory system of the robot provides uncertain and incomplete knowledge about the environment and the observed spatio-temporal patterns. Thus the challenge is to interpret the observations sensibly.

Many approaches to process recognition rely on statistical data to train probabilistic classifiers such as Markov networks [6, 13], Bayesian networks [23], or supervised learning [5]. Approaches based on statistical data perform very well in terms of recognition rate, but, aside from the need for training, they do not support flexible queries about processes and they have to be re-trained if new elements or processes are introduced in the domain. Symbolic approaches have none of these downsides, but require a model of the observable processes, which is given in our environment.
This paper presents work done in the project R3-[Q-Shape] of the Transregional Collaborative Research Center SFB/TR 8 Spatial Cognition. Funding by the German Research Foundation (DFG) is gratefully acknowledged.

Additionally, a well-constructed model allows for efficient use of heuristics to speed up query processes [8]. Usually, symbolic approaches are used to tackle plan recognition, which is closely related to process recognition; see [2, 3] for an overview.

In the following we present a logic-based approach that allows us to recognize activities purely from qualitative process descriptions without prior training. By integrating and abstracting sensory information we are able to answer queries about observed spatio-temporal activities (such as "How often have goods been relocated within the storage zone?") as well as about regions in space (e.g., "Which areas in the warehouse have been used as a buffer zone?"). Answering such queries is an important step towards logistic optimization. The contribution of this paper is to demonstrate how processes and their whereabouts can be inferred in a previously unknown environment.

Referring to the decomposition of process detection by Yang [23], we propose a multi-step approach to get from low-level sensory observations to high-level symbolic representations (see Fig. 1). In our scenario, a robot performs a surveillance task in the warehouse. Object recognition is outside the scope of this paper, but in many logistics scenarios goods can easily be identified by unique labels attached to them (such as barcodes or RFID tags). Thus, we assume that the robot is able to uniquely identify goods in the warehouse. The integration of position estimates for the goods in itself presents a feature-based SLAM problem. Uncertain and incomplete position estimates of entities gathered by a probabilistic mapping procedure must be transferred into a symbolic representation in a symbol grounding process to allow for high-level descriptions of the system dynamics. What has been an uncertain position estimate in the mapping process must become a stable qualitative notion of location. Based on correspondence of features and locations in time, we are able to specify processes of interest in an abstract formal language and, in a third step, tackle the process recognition problem by model verification.

The formal language we choose to formalize processes and to state queries is linear temporal logic (LTL) [17, see Sect. III-A]. LTL was proposed earlier as a tool for mobile robotics [1], especially for robot motion planning from high-level specifications [11, 18]. Recently, this approach has also been applied to real robotic systems [12]. In the domain of smart environments, an approach to process detection by LTL model verification has been presented in [14]. LTL not only allows for queries about processes, but also about spatial relations of regions. This approach covers a wide range of reasoning tasks adequately. In particular, it allows us to query the occurrence of processes operating on spatial regions and the concrete whereabouts of those regions at the same time, in one and the same reasoning process.

II. THE WAREHOUSE SCENARIO

We address the problem of understanding so-called chaotic or random-storage warehouses, characterized by a lacking predefined spatial structure; that is, there is no fixed assignment of storage locations to specific goods. Thus, storage processes are solely in the responsibility of the warehouse operators and basically not predictable: goods of the same type may be distributed over various locations and no database keeps track of these locations. This makes it a hard problem for people aiming at understanding the internal storage processes.

On a conceptual level, storage processes are defined by a unique pattern [19]: On their way into and out of the warehouse, goods are (temporarily) placed into functional zones which serve specific purposes (see Fig. 2). All goods arrive at the entrance zone (E). From there, they are picked up and temporarily moved to a buffer zone (B) before they are finally stored in the storage zone (S). Within the storage zone, redistribution of goods can occur arbitrarily often. When taking out goods, they are first moved to the picking zone (P), from where they are taken to the outlet zone (O) before being placed on a truck.

Fig. 2. A warehouse and its functional zones: entrance zone (E), buffer zone (B), storage zone (S), picking zone (P), and outlet zone (O).

A mobile robot observing such a warehouse is not able to directly perceive these zones, as they are not marked. For all zones we know that they exist (that is, that such regions are used within the storage operations), but not their concrete spatial extents or their number of occurrences, as they appear as a result of dynamic in-warehouse storage processes. The robot can detect and identify goods, and estimate their position. So when observing this kind of environment, we face the challenge that for detecting concrete storage processes we rely on the existence of certain zones, but we do not know their whereabouts.

III. IN-WAREHOUSE PROCESS DETECTION WITH LINEAR TEMPORAL LOGIC

To interpret raw sensory data such that we achieve a symbolic representation of the processes of interest, we first introduce linear temporal logic and the axiomatization of our domain. All queries are stated as LTL formulas and can be answered by model verification. Following this, we describe the symbolic grounding. Then, we specify the in-warehouse processes in linear temporal logic and demonstrate the inference process by an example.

A. Linear Temporal Logic (LTL)

LTL [17] is a modal logic that extends propositional logic by a sequential notion of time. A formula in LTL is defined over a finite set of propositions with a set of the usual logic operators (¬, ∧, ∨, →). The temporal component is established by an accessibility relation R that connects worlds (or states) and a set of modal operators, of which we use the following:

- ○φ (next): the formula φ holds in the following world.
- □φ (always): the formula φ holds now and in all future worlds.
- ◇φ (eventually): φ will hold in some world in the future (◇φ ≡ ¬□¬φ).

B. Axiomatizing the Warehouse Scenario

1) Propositions: We define the propositions that model the desired processes in our logic with the help of the following atomic observables:

- a set G = {G1, ..., Gn} of uniquely identifiable goods,
- a set L = {L1, ..., Lm} of locations in space at which goods have been perceived by the robot,
- a set Z = {E, B, S, P, O} of functional zones as described in Sect. II.

The following atoms need to be defined over G, L, and Z, such that we obtain a finite set of atoms:

- at(G, L) holds iff a good G is known to be at location L,
- in(L, Z) holds iff a location L lies within zone Z,
- close(L1, L2) holds iff two locations L1, L2 are close to one another.

2) Axioms: Based on constraints of space and general knowledge about our domain, we axiomatize our domain. One constraint is that we disregard continuous motion and therefore only deal with snapshots of the world. This means that all observed goods are temporarily fixed at their positions.

A good G can only be at one location at a time. We introduce the following axioms for all G ∈ G and Li, Lj ∈ L, i ≠ j:

A1_{G,Li,Lj} = □ ¬(at(G, Li) ∧ at(G, Lj))    (1)

Any object is always located within a zone Z ∈ Z. We have for all G ∈ G and L ∈ L:

A2_{G,L} = □ (at(G, L) → ⋁_{Z∈Z} in(L, Z))    (2)

Locations in different zones are not close to each other; that is, zones are at least some minimum distance apart. We have for all Zk, Zl ∈ Z (k ≠ l) and Li, Lj ∈ L (i ≠ j):

A3_{Li,Lj,Zk,Zl} = □ (close(Li, Lj) → ¬(in(Li, Zk) ∧ in(Lj, Zl)))    (3)

Zones are static. We have for all Zk, Zl ∈ Z (k ≠ l) and L ∈ L:

A4_{L,Zk,Zl} = □ (in(L, Zk) → □ ¬in(L, Zl))    (4)

A set A subsumes all axioms (1)-(4).

C. Grounding Symbols at(G, Lj ) in(Lj , B) at(G, Lk ) in(Lk , S) So far, we have formal descriptions of the high-level in Take-out a good G is moved from the storage zone S warehouse observables on one hand, and sensory perceptions to the outlet zone O via a picking zone P . For all G G from the robot, on the other hand. These need to be connected and Li , Lj , Lk L: to each other in order to perform reasoning on real world data. That is, we need to transform the sensory information to our TakeoutG,Li ,Lj ,Lk = at(G, Li ) in(Li , S) (7) logical propositions at(G, L), close(L1 , L2 ), and in(L, Z). at(G, Lj ) in(Lj , P ) at(G, Lk ) in(Lk , O) Mapping a perceived good to a symbol G is trivial in this task due to the unique identiers. However, for the goods Redistribution a good G is moved within the storage location we will only have an uncertain position estimate zone S. For all G G and Li , Lj L, i = j: 2 (x, y) R for the entity observed from the mapping process. These estimates are subject to noise and thus will vary over RedistributionG,Li ,Lj = at(G, Li ) in(Li , S) (8) time although the observed object remains static. A location at(G, Lj ) in(Lj , S) is a qualitative abstraction from positional measurements that abstracts from uncertainty emerging from sensory perceptions Process detection can be posed as a model checking problem: and the mapping process. Therefore, we need to transform An in-warehouse process is detected when we can nd a model position estimates to a discrete and nite set of symbols, i.e., to (based on the sensory observations from the robot) that satises subsume similar or comparable positions. This transformation the corresponding formula. The history of a good is the chain is a function f : R2 L, that is, every position estimate is of processes that the good is part of and can also be stated as mapped to a single location (see Axiom (1)). To this end, a a formula. A history for a good would be admission, zero or clustering method can be applied to map estimates to a set more redistributions and its takeout. of prototypical positionsthe locations (see Section IV-B). We ground close(L1 , L2 ) by applying a metric and checking E. Example whether the distance between L1 and L2 is below a certain A good G entered the warehouse and was stored in the threshold. To ground in(L, Z), we need to identify the functional entrance zone E at position L1 at time t0 . At t1 , it was moved zones in the warehouse. These zones are constituted by sets of to a location L2 and at t2 it was moved to L3 . All these locations. For zones Z whose extents are known a-priori by locations are not close to one another. Let us assume that we introducing the respective in-atoms the corresponding locations observe the following from this process: We perceived G to LZ L can be assigned directly. All remaining locations be at L1 at t0 , at L2 at t1 and at L3 at t4 . See Fig. 3 for a Li L\LZ are known to be not a part of Z, i.e., in(Li , Z), depiction and the logical interpretationsto ease understanding but (according to (2)) must be part of one of the other zones: the worlds are labeled just like the time points. These observations constitute a model that satises (6), such in(Li , Z ) with Z Z\Z. In addition to the axioms A, the propositions close and in that the observed process is an admission, starting in world t1 and ending in world t4 , and also deduces that location are persistent over all worlds. The set L2 is in the buffer zone and L3 is in of the storage zone. 
B =A close(Li , Lj ) in(L, Z) (5) Note that deduced start and end times differ from the real Li ,Lj L LL,ZZ admission times: While the admission took place from t0 to t3 , is called background knowledge. The only proposition that we detect it from observations t1 to t4 ; this is due to incomplete changes over different worlds is at(G, L). We traverse through observation of the world.
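To make the model-checking view of process detection concrete, the following minimal sketch checks a sequence of observation worlds against the Admission pattern (6). It is not the Prolog-based implementation described later in Sect. IV-C; the data layout and the simple forward scan are illustrative assumptions, and the zone membership of locations is assumed to be given as background knowledge rather than inferred.

```python
# Minimal sketch: detecting an Admission (entrance -> buffer -> storage)
# over a sequence of "worlds". Each world maps a good to the location it
# was observed at; in_zone encodes the (assumed known) zone of each location.

def detect_admission(worlds, in_zone, good):
    """Return (start_index, end_index) of the first admission of `good`,
    or None. `worlds` is a list of dicts {good: location}."""
    stages = ['E', 'B', 'S']          # zones the good must pass through, in order
    stage, start = 0, None
    for t, world in enumerate(worlds):
        loc = world.get(good)
        if loc is None:
            continue
        if in_zone.get(loc) == stages[stage]:
            if stage == 0:
                start = t
            stage += 1
            if stage == len(stages):
                return start, t       # a model of formula (6) has been found
    return None

# Toy data in the spirit of Fig. 3: G observed at L1 (entrance), L2, L3.
worlds = [{'G': 'L1'}, {'G': 'L2'}, {'G': 'L3'}]
in_zone = {'L1': 'E', 'L2': 'B', 'L3': 'S'}   # assumed background knowledge
print(detect_admission(worlds, in_zone, 'G'))  # -> (0, 2)
```

In the approach described here the zone membership of locations such as L2 and L3 is itself partly deduced during model checking; the sketch assumes it is given only for brevity.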


[Fig. 3 sketch: worlds t0–t3 showing good G moving from the entrance zone (E) through L2 and L3, with the observation (the at-atom for G) and the background knowledge about the zone membership of L1, L2 and L3 listed for each world.]

Fig. 3. Example: Model verification for an admission process of good G (only the relevant assertions for each world t0–t3 are shown). in(L1, E) is background knowledge; it is also known that locations L2 and L3 are either part of the buffer zone (B) or the storage zone (S), but not close to one another, so that they cannot belong to the same zone. From this admission, refined knowledge about the buffer and storage zones can be inferred: in(L2, B) ∧ in(L3, S).

IV. IMPLEMENTATION

A. Mapping of Positions of Goods

We use visual tags to represent our observable features. To ease the evaluation, some tags are known to be static throughout the experiments. This allows the map constructed by the robot to be easily aligned with the ground truth for evaluation. The positions of the tags relative to the camera are estimated using the tag detection routine provided by the ARToolkit software library1, for which we determined a measurement model. Positions of detected tags with a sufficient quality as well as odometry of the robot are fed into the TreeMap SLAM algorithm [9]. In contrast to [22] we deal with dynamic objects by using only one map layer, in which we handle the movement of a good by simply comparing its current position measurement with its expected position. If the positions are too far apart (in our experiments >1 meter), the good is treated as having been moved and is added as a new feature into the SLAM algorithm. This results in a sequence of maps that contain position estimates and a mapping of goods to positions at each time step.

B. From Positions to Locations

Measured positions are clustered after each step and the generated cluster centroids are used as qualitative locations. Therefore, the mapping of positions to clusters needs to stay fixed even when new centroids are generated by added data. We implemented two clustering methods for later comparison: The first clustering method assigns position estimates to predefined locations (shown in Fig. 4(a)). We used this method for evaluation purposes. The second clustering method computes locations automatically by employing a straightforward greedy algorithm: positions are clustered together if their surrounding circle is below a certain size; otherwise a new cluster is created (shown for one test run in Fig. 4(b)). Each observation of a good is now attributed with a location and a time step (obs(t, G, L)), which is the starting point for the symbol grounding as described in Section III-C.
1 http://artoolkit.sourceforge.net/
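A minimal sketch of the automatic clustering strategy just described. The radius threshold, the fixed-once-created centroids and the nearest-first assignment are assumptions made for illustration; the text only specifies that positions are grouped while their surrounding circle stays below a certain size.

```python
import math

def assign_location(position, centroids, max_radius=0.5):
    """Greedy, order-dependent clustering: return the index of an existing
    cluster whose centroid lies within max_radius of `position`, otherwise
    create a new cluster. Centroids stay fixed once created, so the mapping
    from positions to locations does not change over time."""
    x, y = position
    for i, (cx, cy) in enumerate(centroids):
        if math.hypot(x - cx, y - cy) <= max_radius:
            return i                      # reuse existing location L_i
    centroids.append((x, y))              # new prototypical position
    return len(centroids) - 1

centroids = []                            # shared across all time steps
for t, pos in enumerate([(0.1, 0.2), (0.15, 0.25), (2.0, 1.9)]):
    print(t, assign_location(pos, centroids))   # 0 -> L0, 1 -> L0, 2 -> L1
```

Because assignment is greedy and order-dependent, noisy early measurements can shift where a location ends up, which is one reason the predefined-centroid variant performs better in the evaluation below.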

C. From LTL Worlds to Histories

As described at the end of Section III-D, histories of goods are also LTL formulas and as such can also be used during model verification. It is straightforward to implement the rules as Prolog clauses and let Prolog try to prove them. The connection of the worlds is realized by an ordered list, i.e., two worlds Wi and Wj are connected if Wj directly follows Wi in the list. We then use Prolog to constructively prove the existence of a history for each good by model verification. The history construction includes the deduction of zones as demonstrated in the example shown in Fig. 3.

In general, many histories can be verified for the same observations, e.g., moving a good from A to B to C verifies the model Redistribution_{G,A,B} ∧ Redistribution_{G,B,C}, but also Redistribution_{G,A,C}. In the latter case the observation that the good was at location B is ignored. Therefore, in ambiguous cases we select the maximal model, i.e., the history involving the largest number of observations.

V. EXPERIMENTS AND EVALUATION

In our experimental setting we simulated warehouse processes in our lab in order to measure to which extent histories can be identified correctly.2

A. Experimental Setup

Our robot platform is an ActivMedia Pioneer 3-AT controlled by a top-mounted laptop and equipped with a SONY DFW-SX900 camera (approx. 160° FOV) that delivers 7.5 frames per second. In our lab we simulate a warehouse that consists of five dedicated zones (entrance, buffer, storage, picking, outlet), as seen in Fig. 4(a) and 4(c). 15 tags are distributed within the environment as static landmarks. Goods are represented by boxes with visual tags attached to all sides. An experiment consists of a series of movements of goods between the zones while our robot is monitoring the environment.

For each of the 10 test runs, the robot was placed in the lab and driven around until each landmark had been seen at least twice to ensure a fully mapped environment. Then, we
2 Video material of a test run is available at http://www.sfbtr8.uni-bremen.de/project/r3/ECMR/index.html


[Fig. 4 panels: (a) Floor plan with zones; (b) Automatically generated clusters; (c) The lab (picture taken from the outlet zone).]

Fig. 4. Experimental Setup: The warehouse (6.12 m × 7 m) with 5 zones (Entrance, Buffer, Storage, Picking, and Outlet). In (a) the fixed cluster centers are shown as red stars and the landmarks as blue dots with their corresponding letters. In (b) measurements of a single experimental run are shown in black and the automatically generated clusters as blue circles.

[Fig. 5 bar chart. Rate of correct history detections by prior knowledge and clustering method:

Known regions      | predefined centroids | automatically generated centroids
All regions        | 0.829                | 0.485
Entrance, Outlet   | 0.754                | 0.431
No region          | 0.581                | 0.256 ]

Fig. 5. Evaluation results showing the rate of correct history detections. We evaluated 10 test runs which included a total number of 21 full histories and 18 partial histories.

steered the robot in arbitrary round courses, while we moved boxes through the lab, simulating the previously defined logistic processes (Sect. III-D). The duration of a test run was between approx. 11 and 28 minutes, in which we moved 3 to 8 goods through the warehouse, resulting in 4 to 19 detectable processes per run, including runs with only partial histories. Goods were moved between zones while not covered by sensor surveillance, to comply with axiom 2 in Section III-B. The robot was driven by hand in the experiments.

As mentioned in Section IV-B, we evaluated our approach with two different clustering methods, each one with 3 different settings of background knowledge. In the first setting all zones are previously known, in the second setting only entrance and outlet are known, and in the third one the whereabouts of no zone are known.

B. Evaluation

We evaluate our approach based on correctly identified histories. For each good we query its history, i.e., running the model verification to generate it. A history is correctly identified if temporal order and number of processes match the ground truth.
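The correctness criterion can be stated directly in code; this tiny sketch is an illustration rather than the authors' evaluation script, and counts a detected history as correct when the process types appear in the same order and number as in the ground truth.

```python
def history_correct(detected, ground_truth):
    """Correct iff order and number of processes match exactly."""
    return list(detected) == list(ground_truth)

runs = [(['admission', 'redistribution', 'takeout'],
         ['admission', 'redistribution', 'takeout']),
        (['admission', 'takeout'],
         ['admission', 'redistribution', 'takeout'])]
rate = sum(history_correct(d, g) for d, g in runs) / len(runs)
print(rate)   # 0.5 for this toy pair of runs
```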

Fig. 5 shows the result of our evaluation. In the most favorable case of knowing all zones and predefined cluster centroids we achieve an average recognition rate of 83%. The experiments comprise 21 full histories and 18 partial ones. In partial histories, a good either started in the warehouse or, after its admission, never left it again. Our current interpretation prefers full histories over partial histories and is biased towards an empty starting warehouse, i.e., if the observations verify both admission and take-out we prefer the admission. Especially in the case of having no prior knowledge we found that partial histories reduced the detection rate. In particular, with automatically generated centroids and no prior knowledge about the zones, 37.9% of the full histories were correctly found, but only 13.3% of the partial histories were correctly found.

A significant difference can be observed between the two clustering methods, but both follow the same pattern: additional prior knowledge results in more correctly identified histories. If no zone is known, i.e., all zones needed to be inferred, the results show that the approach is still capable of correctly identifying histories. This clearly demonstrates the utility of inference in process recognition.

VI. DISCUSSION

Our work targets online process detection and online queries while the robot is operating. Thus, we rely on observations of goods as soon as we detect them, even if the position estimate is still uncertain. Over time, stability of positions is achieved by clustering them into locations. Every (new) perception of a good at a different location (immediately) triggers the creation of a new world. Poor position estimates (for example, when few tags are detected due to motion blur while the robot turns) can easily be mapped to locations that incorrectly induce movement of a good or lie outside of the zone the good is in. Such cases result in incorrectly detected histories. The results in Fig. 5 confirm this: when providing stable, predefined cluster centers, detection rates are significantly higher, especially when more domain knowledge is included. Thus, excluding estimates with too much uncertainty would improve the detection rate. Using uncertainty estimates for measured positions will also improve


the robustness of geometrical shape estimation for the zones. However, the current implementation of the TreeMap SLAM algorithm3 does not provide uncertainty estimates.

In a real-world environment it is reasonable to assume knowledge about entrance and outlet zones (e.g., by placing tags to mark the end of the warehouse). The observable difference between knowing all zones and knowing only entrance and outlet is relatively small, especially when predefined clusters are used (83% and 75%, respectively). These results illustrate the feasibility of our approach.

In this work, we currently restrict ourselves to using inference only on sensory observations. As stated before, the detection of correct histories improves with better clustering (e.g., by using outlier detection). To query more complex information it would be reasonable to also include knowledge gained within the mapping process, that is, information on goods we have observed before and included into the map, but that we are not able to perceive at the very moment. For these objects, we have a strong belief in their existence and position, but this belief can, according to the actual observation, not be validated. A possibility to include reasoning on such beliefs is to use a logic that provides a modal belief operator, such as the logic for BDI agents presented in [16]. Another source of information for more complex queries could be provided by an ontology, as shown in [15]. Our logic foundation also supports multiple instances of the same type of good, e.g., splitting or merging packages for delivery. However, due to limited size in our lab, we did not include this feature in our experiments.

VII. SUMMARY

In this paper we propose an approach to process detection based on a specification of processes as temporal logic formulas. We show in our evaluation that our approach is applicable using real sensory data from a mobile robot. One strength of our approach is that it can fill in missing pieces of information by reasoning about processes and spatial configurations in the same formalism. It is also possible to query about previously unspecified processes as well as about spatial facts, such as functional zones. Basing our approach on the well-established linear temporal logic not only works for passive process detection but would also allow us to incorporate so-called search control knowledge and perform high-level planning [4], i.e., doing active process detection in the sense of planning where to go for more information. This is the objective of our future research.

REFERENCES
[1] Marco Antoniotti and Bud Mishra. Discrete event models + temporal logic = supervisory controller: Automatic synthesis of locomotion controllers. In Proceedings of the IEEE Conference on Robotics and Automation (ICRA), volume 2, pages 1441-1446, 1995.
[2] D. Avrahami-Zilberbrand, G. A. Kaminka, and H. Zarosim. Fast and complete symbolic plan recognition: Allowing for duration, interleaved execution, and lossy observations. In Proceedings of the AAAI Workshop on Modeling Others from Observations (MOO), 2005.
[3] Dorit Avrahami-Zilberbrand. Efficient Hybrid Algorithms for Plan Recognition and Detection of Suspicious and Anomalous Behavior. PhD thesis, Bar Ilan University, 2009.
[4] Fahiem Bacchus and Froduald Kabanza. Using temporal logics to express search control knowledge for planning. Artificial Intelligence, 116(1-2):123-191, 2000.
[5] Maria-Florina Balcan and Avrim Blum. A discriminative model for semi-supervised learning. Journal of the ACM (JACM), 57(3):1-46, 2010.
[6] Maren Bennewitz, Wolfram Burgard, Grzegorz Cielniak, and Sebastian Thrun. Learning motion patterns of people for compliant robot motion. The International Journal of Robotics Research (IJRR), 24(1):39-41, 2005.
[7] Marcello Cirillo, Lars Karlsson, and Alessandro Saffiotti. A human-aware robot task planner. In Alfonso Gerevini, Adele E. Howe, Amedeo Cesta, and Ioannis Refanidis, editors, Proceedings of the 11th International Conference on Automated Planning and Scheduling (ICAPS). AAAI, 2009.
[8] Christophe Dousson and Pierre Le Maigat. Chronicle recognition improvement using temporal focusing and hierarchization. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 324-329, 2007.
[9] Udo Frese. An O(log n) Algorithm for Simultaneous Localization and Mapping of Mobile Robots in Indoor Environments. PhD thesis, University of Erlangen-Nürnberg, 2004.
[10] Torsten Hildebrandt, Lutz Frommberger, Diedrich Wolter, Christian Zabel, Christian Freksa, and Bernd Scholz-Reiter. Towards optimization of manufacturing systems using autonomous robotic observers. In Proceedings of the 7th CIRP International Conference on Intelligent Computation in Manufacturing Engineering (ICME), June 2010.
[11] Marius Kloetzer and Calin Belta. LTL planning for groups of robots. In Proceedings of the IEEE International Conference on Networking, Sensing and Control (ICNSC), pages 578-583, 2006.
[12] Marius Kloetzer and Calin Belta. Automatic deployment of distributed teams of robots from temporal logic motion specifications. IEEE Transactions on Robotics, 26(1):48-61, 2010.
[13] Lin Liao, Donald J. Patterson, Dieter Fox, and Henry Kautz. Learning and inferring transportation routines. Artificial Intelligence, 171(5-6):311-331, 2007.
[14] Tommaso Magherini, Guido Parente, Christopher D. Nugent, Mark P. Donnelly, Enrico Vicario, Frederico Cruciani, and Cristiano Paggetti. Temporal logic bounded model-checking for recognition of activities of daily living. In Proceedings of the 10th IEEE International Conference on Information Technology and Applications in Biomedicine (ITAB), Corfu, Greece, November 2010.
[15] Fulvio Mastrogiovanni, Antonio Sgorbissa, and Renato Zaccaria. Context assessment strategies for ubiquitous robots. In IEEE International Conference on Robotics and Automation (ICRA), pages 2717-2722, 2009.
[16] John-Jules Meyer and Frank Veltman. Intelligent agents and common sense reasoning. In Patrick Blackburn, Johan Van Benthem, and Frank Wolter, editors, Handbook of Modal Logic, volume 3 of Studies in Logic and Practical Reasoning, chapter 18, pages 991-1029. Elsevier, 2007.
[17] Amir Pnueli. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on Foundations of Computer Science (FOCS), pages 46-57, 1977.
[18] Stephen L. Smith, Jana Tůmová, Calin Belta, and Daniela Rus. Optimal path planning under temporal logic constraints. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3288-3293, Taipei, Taiwan, October 2010.
[19] Michael Ten Hompel and Thorsten Schmidt. Management of Warehouse Systems, chapter 2, pages 13-63. Springer, Berlin Heidelberg, 2010.
[20] Trung-Dung Vu, Julien Burlet, and Olivier Aycard. Grid-based localization and local mapping with moving object detection and tracking. Information Fusion, 12:58-69, January 2011.
[21] Chieh-Chih Wang, Charles Thorpe, Sebastian Thrun, Martial Hebert, and Hugh Durrant-Whyte. Simultaneous localization, mapping and moving object tracking. The International Journal of Robotics Research, 26:889-916, September 2007.
[22] Denis F. Wolf and Gaurav S. Sukhatme. Mobile robot simultaneous localization and mapping in dynamic environments. Autonomous Robots, 19:53-65, 2005. doi:10.1007/s10514-005-0606-4.
[23] Qiang Yang. Activity recognition: Linking low-level sensors to high level intelligence. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), pages 20-25, Pasadena, CA, July 2009.

3 As provided at http://openslam.org/treemap.html


Plan-based Object Search and Exploration Using Semantic Spatial Knowledge in the Real World
Alper Aydemir, Moritz Göbelbecker, Andrzej Pronobis, Kristoffer Sjöö, Patric Jensfelt
Centre for Autonomous Systems, Royal Institute of Technology, Stockholm, Sweden
Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Germany

Abstract: In this paper we present a principled, planner-based approach to the active visual object search problem in unknown environments. We make use of a hierarchical planner that combines the strengths of decision theory and heuristics. Furthermore, our object search approach leverages conceptual spatial knowledge in the form of object co-occurrences and semantic place categorisation. A hierarchical model for representing object locations is presented with which the planner is able to perform indirect search. Finally we present real world experiments to show the feasibility of the approach.

Index Terms: Active Sensing, Object Search, Semantic Mapping, Planning

I. INTRODUCTION

Objects play an important role when building a semantic representation and an understanding of the function of space [14]. Key tasks for service robots, such as fetch-and-carry, require a robot to successfully find objects. It is evident that such a system cannot rely on the assumption that all objects relevant to the current task are already present in its sensory range. It has to actively change its sensor parameters to bring the target object into its field of view. We call this problem active visual search (AVS). Although researchers began working on the problem of visually finding a relatively small-sized object in a large environment as early as 1976 at SRI [4], the issue is often overlooked in the field. A commonly stated reason for this is that underlying problems such as reliable object recognition and mapping pose hard enough challenges. However, as the field furthers in its aim to build robots acting in realistic environments, this assumption needs to be relaxed. The main contribution of this work is a method to relinquish the above-mentioned assumption.

A. Problem Statement

We define the active visual object search problem as an agent localizing an object in a known or unknown 3D environment by executing a series of actions with the lowest total cost. The cost function is often defined as the time it takes to complete the task or the distance traveled. Let the environment be E and let Ψ be the search space with Ψ ⊆ E. Also let Po(Ψ) be the probability distribution for the position of the center of the target object o, defined as a function over Ψ. The agent can execute a sensing action s in
This work was supported by the SSF through its Centre for Autonomous Systems (CAS), and by the EU FP7 project CogX.

the reachable space of E. In the case of a camera as the sensor, s is characterised by the camera position (xc, yc, zc), pan-tilt angles (p, t), focal length f and a recognition algorithm a; s = s(xc, yc, zc, p, t, f, a). The part of Ψ covered by s is called a viewcone. In practice, a has an effective region in which reliable recognition or detection is achieved. For the i-th viewcone we call this region Vi.

Depending on the agent's level of a priori knowledge of Ψ and Po(Ψ), there are three extreme cases of the AVS problem. If both Ψ and Po(Ψ) are fully known, then the problem is that of sensor placement and coverage maximization given a limited field of view and cost constraints. If both Ψ and Po(Ψ) are unknown, then the agent has an additional explore action as well. An exhaustive exploration strategy is not always optimal, i.e., the agent needs to select which parts of the environment to explore first depending on the target object's properties. Furthermore, the agent needs to trade off between executing a sensing action and exploration at any given point. That is, should the robot search for the object o in the partially known Ψ or explore further? This is classically known as the exploration vs. exploitation problem.

When Po(Ψ) is unknown (i.e., uniformly distributed) but Ψ is known (i.e., acquired a priori), the agent needs to gather information about the environment similar to the above case. However, in this case the exploration is for learning about the target-object-specific characteristics of the environment. Knowing Ψ also means that the robot can reason whether or not to execute a costly search action at the current position, or move to another, more promising region of space. The rare case where Po(Ψ) is fully known but Ψ is unknown is not practically interesting to the scope of this paper.

So far, we have examined the case where the target object is an instance. The implication of this is that Po(Vi) + Po(Ψ \ Vi) = 1 for any viewcone Vi; therefore observing Vi has an effect on Po(Ψ \ Vi). However, this is not necessarily true if instead the agent is searching for any member of an object category and the number of category members is not known in advance. Therefore, knowing whether the target object is a unique instance or a member of an object category is an important factor in the search behavior.

Recently there has been an increasing amount of work on acquiring semantic maps. Semantic maps have parts of the environment labeled, representing various high-level concepts and functions of space. Exploring and building a semantic map while performing AVS contributes to the estimation of Po(Ψ). The semantic map provides information that can be exploited by leveraging common-sense conceptual knowledge about


indoor environments. This knowledge describes, for example, how likely it is that plates are found in kitchens, that a mouse and a computer keyboard occur in the same scene, and that corridors typically connect multiple rooms. Such information offers valuable help in limiting the search space. The sources for this knowledge can be online common-sense databases or the world wide web, among others.

Acknowledging the need to limit the search space and integrate various cues to guide the search, [4] proposed indirect search. Indirect search as a search strategy is a simple and powerful idea: it is to find another object first and then use it to facilitate finding the target object, e.g., finding a table first while looking for a landline phone. Tsotsos [13] approached the problem by analyzing the complexity of the AVS problem and showed that it is NP-hard. Therefore we must adhere to a heuristics-based solution. Ye [15] formulated the problem in a probabilistic framework.

In this work we consider the case where Ψ and Po(Ψ) are both unknown. However, the robot is given probabilistic default knowledge about the relation between objects and the occurrence of objects in different room categories, following our previous work [1, 6].

B. Contributions

The contributions of this work are fourfold. First, we provide the domain adaptation of a hierarchical planner to address the AVS problem. Second, we show how to combine semantic cues to guide the object search process in a more complex and larger environment than found in previous work. Third, we start with an unknown map of the environment and provide an exploration strategy which takes into account the object search task. Fourth, we present real world experiments searching for multiple objects in a large office environment, and show how the planner adapts the search behavior depending on the current conditions.

C. Outline

The outline of this paper is as follows. First we present how the AVS problem can be formulated in a principled way using a planning approach (Section II). Section III provides the motivation for and structure of various aspects of our spatial representation. Finally we showcase the feasibility of our approach in real world experiments (Section IV).

II. PLANNING

For a problem like AVS, which entails probabilistic action outcomes and world state, the robot needs to employ a planner to generate flexible and intelligent search behavior that trades off exploitation versus exploration. In order to guarantee optimality a POMDP planner can be used, i.e., a decision theoretic planner that can accurately trade different costs against each other and generate the optimal policy. However, this is only tractable when a complex problem like AVS is applied to very small environments. Another type of planner are the classical AI planners, which require perfect knowledge about the environment. This is not the case since both Ψ and Po(Ψ) are unknown. A variation of the classical planners are the so-called continual planners that interleave planning and plan monitoring in order to deal with uncertain or dynamic environments [3]. The basic

idea behind the approach is to create a plan that might reach the goal and to start executing that plan. This initial plan takes into account success probabilities and action costs; however, it is optimistic in nature. A monitoring component keeps track of the execution outcome and notifies the planner in the event of the current plan becoming invalid (either because the preconditions of an action are no longer satisfied or the plan does not reach the goal anymore). In this case, a new plan is created with the updated current state as the initial state and execution starts again. This will continue until either the monitoring component detects that the goal has been reached or no plan can be found anymore.

In this paper we will make use of a so-called switching planner. It combines two different domain-independent planners for different parts of the task: a classical continual planner to decide the overall strategy of the search (which objects to search for in which location) and a decision theoretic planner to schedule the low-level observation actions using a probabilistic sensing model. Both planners use the same planning model and are tightly integrated. We first give a brief description of the switching planner. We focus on the use of the planner in this paper and instead refer the reader to [5] for a more detailed description. We will also present the domain modeling for the planner, and give further details on various aspects of knowledge that the planner makes use of.

A. Switching Planner

1) Continual Planner (CP): We build our planning framework on an extended SAS+ [2] formalism. As a base for the continual planner, we use Fast Downward [7]. Because our knowledge of the world and the effects of our actions are uncertain, we associate a success probability p(a) with every action a. In contrast to more expressive models like MDPs or even POMDPs, actions don't have multiple possible outcomes; they can just succeed with probability p(a) or fail with probability 1 − p(a). The goal of the planner is then to find a plan that reaches the goal with a low cost. In classical planning the cost function is usually either the number of actions in a plan or the sum of all action costs. Here we chose a function that resembles the expected reward, adjusted to our restricted planning model. With p(π) = ∏_{a∈π} p(a) as the plan's total success probability and cost(π) = Σ_{a∈π} cost(a) as the total cost, we get for the optimal plan π*:

π* = argmin_π cost(π) + R(1 − p(π))

where a is an action and the constant R is the reward the planner is given for achieving the goal. For small values of R the planner will prefer cheaper but more unlikely plans; for larger values, more expensive plans will be considered.

Assumptions: The defining feature of an exploration problem is that the world's state is uncertain. Some planning frameworks such as MDPs allow the specification of an initial state distribution. We choose not to do this for two different reasons: a) having state distributions would be a too strong departure from the classical planning model, and b) the typical exploration problems we deal with have too many possible


states to express explicitly. We therefore use an approach we call assumptive actions that allows the planner to construct parts of the initial state on the fly, and which allows us to map the spatial concepts to the planning problem in an easy way.

2) Decision Theoretic (DT) Planner: When the continual planner reaches a sensing action (e.g. search location1 for object2), we create a POMDP that only contains the parts of the state that are relevant for that subproblem. This planner can only use the MOVE and PROCESSVIEWCONE actions explained in Section II-B.2. The DT planner operates in a closed-loop manner, sending actions to be executed and receiving observations from the system. Once the DT planner either confirms or rejects a hypothesis, it returns control back to the continual planner, which treats the outcome of the DT session like the outcome of any other action.

B. Domain Modeling

We need to discretize the involved spaces (object location, spatial model and actions) to make a planner approach applicable to the AVS problem. Most methods make use of discretizations as a way to handle the NP-hard nature of the problem.

1) Representing space: For the purposes of obstacle avoidance, navigation and sensing action calculation, E is represented as a 3D metric map, discretised into volumetric cells c0 ... ci. Each cell represents the occupancy with the attributes OCCUPIED, FREE or UNKNOWN, as well as the probability of the target object's center being in that cell. However, further abstraction is needed to achieve reliable and fast plan calculation, as the number of cells can be high. For this purpose we employ a topological representation of E called the place map, see Fig. 1(a). In the place map, the world is represented by a finite number of basic spatial entities called places, created at equal intervals as the robot moves. Places are connected using paths, which are discovered by traversing the space between places. Together, places and paths represent the topology of the environment. This abstraction is also useful for a planner, since metric space would result in a largely intractable planning state space.

The places in the place map are grouped into rooms. In the case of indoor environments, rooms are usually separated by doors or other narrow openings. Thus, we propose to use a door detector and perform reasoning about the segmentation of space into rooms based on the doorway hypotheses. We use a template-based door detection algorithm which matches a door template to each acquired laser scan. This creates door hypotheses which are further verified by the robot passing through a narrow opening. In addition, unexplored space is represented in the place map using hypothetical places called placeholders, defined on the boundary between free and unknown space in the metric map.

We represent object locations not in metric coordinates but in relation to other known objects or rooms to achieve further abstraction. The search space is considered to be divided into locations L. A location is either a room R or a related space. Related spaces are regions connected with a landmark object o, either in or on the landmark (see [1] for more details). The related space in o is termed Io and the space on o is termed Oo.
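As a brief aside, the plan-selection criterion π* = argmin_π cost(π) + R(1 − p(π)) from Sect. II-A.1 is easy to state in code. The following is a minimal sketch with made-up action costs and success probabilities; it is an illustration only, not the Fast Downward-based implementation used here.

```python
import math

def plan_score(plan, reward):
    """cost(pi) + R * (1 - p(pi)), with p(pi) the product of per-action
    success probabilities and cost(pi) the sum of per-action costs."""
    p = math.prod(prob for _, _, prob in plan)
    cost = sum(c for _, c, _ in plan)
    return cost + reward * (1.0 - p)

# each action is (name, cost, success probability) -- illustrative numbers
safe_plan  = [('move', 5.0, 0.95), ('create_viewcones', 2.0, 0.9),
              ('search_for_object', 4.0, 0.7)]
risky_plan = [('move', 1.0, 0.6), ('search_for_object', 4.0, 0.7)]

for R in (10.0, 100.0):
    best = min([safe_plan, risky_plan], key=lambda pl: plan_score(pl, R))
    print(R, 'prefers', 'risky' if best is risky_plan else 'safe')
```

With these toy numbers the cheap but unreliable plan wins for R = 10 and the reliable one for R = 100, mirroring the behaviour of R described above.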

2) Modeling actions: The planner has access to three physical actions: MOVE can be used to move to a place or placeholder, CREATEVIEWCONES creates sensing actions for an object label in relation to a specified location, and PROCESSVIEWCONE executes a sensing action. Finally, the virtual SEARCHFOROBJECT action triggers the decision theoretic planner.

3) Virtual objects: There are two aspects of exploration in the planning task: we are searching for an (at that moment) unknown object, which may include the search for support objects as an intermediate step. But the planner may also need to consider the utility of exploring its environment in order to find new rooms in which finding the goal object is more likely. Because the planners we use employ the closed world assumption, adding new objects as part of the plan is impossible. We therefore add a set of virtual objects to the planning problem that can be instantiated by the planner as required by the plan. This approach will fail for plans that require finding more objects than pre-allocated, but this is not a problem in practice. The monitoring component tries to match new (real) objects to virtual objects that occur in the plan. This allows us to deliver the correct observations to the DT planner and avoid unnecessary replanning.

4) Probabilistic spatial knowledge: The planner makes use of the following probabilistic spatial knowledge in order to generate sensible plans:
- Pcategory(roomi) defines the distribution over room categories that the robot has a model for, for a given room, integrated over the places that belong to roomi. The planner uses this information to decide whether to plan for a SEARCHFOROBJECT action or explore the remaining placeholders.
- Pcategory(placeholderi) represents the probability distribution of a placeholder turning into a new room of a certain category upon exploration. Using this distribution, the planner can choose to explore one placeholder instead of another, or plan to launch the search altogether.
- P(ObjectAt L) gives the probability of an object o being at location L.

More details about the calculation of these probabilities are explained in Section III.

III. SPATIAL REPRESENTATION

5) Conceptual Map: All higher-level inference is performed in the so-called conceptual map, which is represented by a graphical model. It integrates conceptual knowledge (food items are typically found in kitchens) with instance knowledge (the rice package is in room4). We model this in a chain graph [8], whose structure is adapted online according to the state of the underlying topological map. Chain graphs provide a natural generalisation of directed (Bayesian Networks) and undirected (Markov Random Fields) graphical models, allowing us to model both directed causal as well as undirected symmetric or associative relations.

The structure of the chain graph model is presented in Fig. 2. Each discrete place is represented by a set of random variables connected to variables representing the semantic category of a room. Moreover, the room category variables are connected


Fig. 1. (a) A place map with several places and 3 detected doors shown in red. (b) Two placeholders with different probabilities of turning into new rooms: one of them is behind a door hypothesis and therefore has a higher probability of leading into a new room. The colors on the circular discs indicate the probability of the room categories as in a pie chart, i.e., the larger the colored segment, the higher the probability. Here green is corridor, red is kitchen and blue is office.

Fig. 2. Schematic image of the chain graph.

by undirected links to one another according to the topology of the environment. The potential functions φrc(·, ·) represent the type knowledge about the connectivity of rooms of certain semantic categories. To compute Pcategory(roomi), each place is described by a set of properties such as size, shape and appearance of space. These are based on sensory information as proposed in [12]. We extend this work by also including the presence of a certain number of instances of objects, as observed from each place, as properties (due to space limitations we refer to [11] for more details). This way, object presence or absence in a room also affects the room category. The property variables can be connected to observations of features extracted directly from the sensory input. Finally, the functions ps(·|·), pa(·|·), poi(·|·) utilise the common-sense knowledge about object, spatial property and room category co-occurrence to allow for reasoning about other properties and room categories.

For planning, the chain graph is the sole source of belief-state information. In the chain graph, belief updates are event-driven. For example, if an appearance property, or an object detection, alters the probability of a relation, inference proceeds to propagate the consequences throughout the graph. In our work, the underlying inference is approximate, and uses the fast Loopy Belief Propagation [9] procedure.

A. Object existence probabilities

To compute the P(ObjectAt L) value used in active visual search in this paper, objects are considered to be occurring:
1) independently in different locations L
2) independently of other objects in the same location
3) as Poisson processes over cells c0 ... ci per location L.

In other words, each location has the possibility of containing, independently of all other locations, a number nc of objects of a class c with probability

P(nc = k) = (λ_{L,c}^k / k!) e^{−λ_{L,c}}   (1)

where λ_{L,c} is the expected number of objects of class c in the location L. The probability of at least one object in a location is

P(nc > 0) = 1 − P(nc = 0) = 1 − e^{−λ_{L,c}}   (2)

Because of the independence assumptions, the values for a location and all its subordinate locations can simply be added together to obtain the distribution of the number of objects of that class occurring in that whole hierarchy.

1) Exploration: In addition to making inferences about explored space, the conceptual map can provide predictions about unexplored space. To this end, we extend the graph by including the existence of placeholders. For each placeholder a set of probabilities is generated that the placeholder will lead to a room of a certain category. This process is repeated for each placeholder and consists of three steps. In the first step, a set of hypotheses about the structure of the unexplored space is generated. In the case of our implementation, we evaluate 6 hypotheses: (1) the placeholder does not lead to new places, (2) the placeholder leads to new places which do not lead to a new room, (3) the placeholder leads to places that lead to a single new room, (4) the placeholder leads to places that lead to a room which is further connected to another room, (5) the placeholder leads to a single new room directly, and (6) the placeholder leads to a new room directly which leads to another room. In the second step, the hypothesized rooms are added to the chain graph just like regular rooms and inference about their categories is performed. Then, the probability of any of the hypothesized rooms being of a certain category is obtained. Finally, this probability is multiplied by the likelihood of occurrence of each of the hypothesized worlds, estimated based on the amount of open space behind the placeholder and the proximity of gateways. A simple example is shown in Fig. 1(b).
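The object-existence computation described above reduces to a few lines. The sketch below uses invented λ values and location names purely for illustration; it only demonstrates the Poisson relation (2) and the additivity of the rates over a location hierarchy.

```python
import math

def prob_at_least_one(lam):
    """P(n_c > 0) = 1 - exp(-lambda) for a Poisson-distributed object count."""
    return 1.0 - math.exp(-lam)

# expected object counts lambda_{L,c} for class 'cerealbox' (assumed values)
lambdas = {
    'kitchen':      0.6,   # the room itself
    'on_counter':   0.8,   # related space On(counter)
    'in_cupboard':  0.4,   # related space In(cupboard)
}

# Independent Poisson counts per location: the count over the whole
# hierarchy is again Poisson, with the summed rate.
lam_total = sum(lambdas.values())
print({loc: round(prob_at_least_one(l), 3) for loc, l in lambdas.items()})
print('whole hierarchy:', round(prob_at_least_one(lam_total), 3))
```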


IV. EXPERIMENTS

Experiments were carried out on a Pioneer III wheeled robot, equipped with a Hokuyo URG laser scanner and a camera mounted at 1.4 m above the floor. Experiments took place in a 12 × 8 m environment with 3 different rooms, kitchen, office1 and office2, connected by a corridor. The robot had models of all objects it searches for before each search run. 3 different objects (cerealbox, stapler and whiteboardmarkers) were used during the experiments. The BLORT framework was used to detect objects [10]. To highlight the flexibility of the planning framework, we evaluated the system with 6 different starting positions and tasked it with finding different objects in an unknown environment. We refer the reader to http://www.csc.kth.se/~aydemir/avs.html for videos.

Each sub-figure in Fig. 3 shows the trajectory of the robot. The color-coded trajectory indicates the room category as perceived by the robot: red is kitchen, green is corridor and blue is office. The two green arrows denote the current position and the start position of the robot. In the following we give a brief explanation of what happened in the different runs.

Fig. 3(a). Start: corridor. Target: cerealbox in kitchen. The robot starts by exploring the corridor. The robot finds a doorway on its left and the placeholder behind it has a higher probability of leading into a kitchen, so the robot enters office1. As the robot acquires new observations, the CP's kitchen assumption is violated. The robot returns to exploring the corridor until it finds the kitchen door. Here the CP's assumptions are validated and the robot searches this room. The DT planner plans a strategy of first finding a table and then the target object on it. After finding a table, the robot generates view cones for the O_table,cornflakes location. The cerealbox object is found.

Fig. 3(b). Start: office2. Target: cerealbox in kitchen. Unsatisfied with the current room's category, the CP commits to the assumption that exploring placeholders in the corridor will result in a room with category kitchen. The rest proceeds as in Fig. 3(a).

Fig. 3(c). Start: corridor. Target: cerealbox in kitchen. The robot explores until it finds office2. Upon entry the robot categorises office2 as kitchen, but after further exploration office2 is categorised correctly. The robot switches back to exploration and, since the kitchen door is closed, it passes the kitchen and finds office1. Not satisfied with office1, the robot gives up, since all possible plans' success probabilities are smaller than a given threshold value.

Fig. 3(d). Start: office1. Target: stapler in office2. After failing to find the object in office1, the robot notices the open door but, finding that it is kitchen-like, decides not to search the kitchen room. This time the stapler object is found in office2.

Fig. 3(e). Start: kitchen. Target: cerealbox in kitchen. As before it tries locating a table, but in this case all table objects have been eliminated beforehand; failing to detect a table, the robot switches to looking for a counter. Finding no counter either, it finally goes out in the corridor to look for another kitchen and, upon failing that, gives up.

Fig. 3(f). Start: corridor. Target: whiteboardmarker in office1. The robot is started in the corridor and driven to the kitchen by a joystick; thus in this case the environment is largely explored already when the planner is activated and asked to find a whiteboardmarker object. The part of the corridor leading to office2 has been blocked. The robot immediately finds its way to office1 and launches a search, which results in a successful detection of the target object.

In the following, we describe the planning decisions in more detail for a run similar to the one described in Fig. 3(a), with the main difference being that the cereals could not be found in the end due to a false negative detection. The first plan, with the robot starting out in the middle of the corridor, looks as follows:

ASSUME-LEADS-TO-ROOM place1 kitchen
ASSUME-OBJECT-EXISTS table IN new-room1 kitchen
ASSUME-OBJECT-EXISTS cerealbox ON new-object1 table kitchen
MOVE place1
CREATEVIEWCONES table IN new-room1
SEARCHFOROBJECT table IN new-room1 new-object1
CREATEVIEWCONES cerealbox ON new-object1
SEARCHFOROBJECT cerealbox ON new-object1 new-object2
REPORTPOSITION new-object2

Here we see several virtual objects being introduced: the first action assumes that place1 leads to a new room new-room1 with category kitchen. The next two assumptions hypothesize that a table exists in the room and that cornflakes exist on that table. The rest of the plan is rather straightforward: create view cones and search for the table, then create view cones and search for the cereal box. Execution of that plan leads to frequent replanning, as the first assumption is usually too optimistic: most placeholders do not directly lead to a new room, but require a bit more exploration. After following the corridor, the robot does find the office, and returns to the corridor to explore in the other direction. It finally finds a room which has a high likelihood of being a kitchen.
ASSUME-CATEGORY room3 kitchen
ASSUME-OBJECT-EXISTS table IN room3 kitchen
ASSUME-OBJECT-EXISTS cerealbox ON new-object1 table kitchen
MOVE place17
MOVE place18
MOVE place16
CREATEVIEWCONES table IN room3
SEARCHFOROBJECT table IN room3 new-object1
CREATEVIEWCONES cerealbox ON new-object1
SEARCHFOROBJECT cerealbox ON new-object1 new-object2

The new plan looks similar to the first one, except that we do not assume the existence of a new room but the


[Fig. 3 panels (a)–(f).]

Fig. 3. Trajectories taken by the robot in multiple experiments.

category of an existing one. Also, the robot cannot start creating view cones immediately, because a precondition of the CREATEVIEWCONES action is that the room must be fully explored, which involves exploring all remaining placeholders in the room. After view cones are created, the decision theoretic planner is invoked. We used a relatively simple sensing model, with a false negative probability of 0.2 and a false positive probability of 0.05; these are educated guesses, though. The DT planner starts moving around and processing view cones until it eventually detects a table and returns to the continual planner. At this point the probability of the room being a kitchen is so high that it is considered to be certain by the planner. With lots of the initial uncertainty removed, the final plan is straightforward:
ASSUME-OBJECT-EXISTS cerealbox ON object1 table kitchen
CREATEVIEWCONES cerealbox ON object1
SEARCHFOROBJECT cerealbox ON object1 new-object2
REPORTPOSITION new-object2

During the run, the continual planner created 14 plans in total, taking 0.2-0.5 seconds per plan on average. The DT planner was called twice, and took about 0.5-2 seconds per action it executed.

V. CONCLUSIONS AND FUTURE WORK

In this paper we have presented a planning approach to active object search. We made use of a switching planner, combining a classical continual planner with a decision theoretic planner. We provide a model for the planning domain appropriate for the planner and show by experimental results that the system is able to search for objects in a real world office environment, making use of both low-level sensor percepts and high-level conceptual and semantic information. Future work includes incorporating 3D shape cues to guide the search and a specialized planner for the AVS problem.

REFERENCES

[1] Alper Aydemir, Kristoffer Sjöö, John Folkesson, and Patric Jensfelt. Search in the real world: Active visual object search based on spatial relations. In IEEE International Conference on Robotics and Automation (ICRA), May 2011.
[2] C. Bäckström and B. Nebel. Complexity results for SAS+ planning. Computational Intelligence, 11(4):625-655, 1995.
[3] Michael Brenner and Bernhard Nebel. Continual planning and acting in dynamic multiagent environments. Journal of Autonomous Agents and Multiagent Systems, 19(3):297-331, 2009.
[4] Thomas D. Garvey. Perceptual strategies for purposive vision. Technical Report 117, AI Center, SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, Sep 1976.
[5] Moritz Göbelbecker, Charles Gretton, and Richard Dearden. A switching planner for combined task and observation planning. In Twenty-Fifth Conference on Artificial Intelligence (AAAI-11), August 2011.
[6] Marc Hanheide, Charles Gretton, Richard W. Dearden, Nick A. Hawes, Jeremy L. Wyatt, Andrzej Pronobis, Alper Aydemir, Moritz Göbelbecker, and Hendrik Zender. Exploiting probabilistic knowledge under uncertain sensing for efficient robot behaviour. In Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI), 2011.
[7] Malte Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research, 26:191-246, 2006.
[8] S. L. Lauritzen and T. S. Richardson. Chain graph models and their causal interpretations. J. Roy. Statistical Society, Series B, 64(3):321-348, 2002.
[9] J. M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. J. Mach. Learn. Res., 11:2169-2173, August 2010.
[10] T. Mörwald, J. Prankl, A. Richtsfeld, M. Zillich, and M. Vincze. BLORT - The blocks world robotic vision toolbox. In Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation at ICRA 2010, 2010.
[11] Andrzej Pronobis and Patric Jensfelt. Hierarchical multi-modal place categorization. In submitted to ECMR'11, 2011.
[12] Andrzej Pronobis, Oscar M. Mozos, Barbara Caputo, and Patric Jensfelt. Multi-modal semantic place classification. The International Journal of Robotics Research (IJRR), Special Issue on Robotic Vision, 29(2-3):298-320, February 2010.
[13] J. K. Tsotsos. On the relative complexity of active vs. passive visual search. International Journal of Computer Vision, 7(2):127-141, 1992.
[14] S. Vasudevan and R. Siegwart. Bayesian space conceptualization and place classification for semantic maps in mobile robotics. Robot. Auton. Syst., 56:522-537, June 2008.
[15] Yiming Ye and John K. Tsotsos. Sensor planning for 3D object search. Comput. Vis. Image Underst., 73(2):145-168, 1999.


Comparative Evaluation of Range Sensor Accuracy in Indoor Environments


Todor Stoyanov, Athanasia Louloudi, Henrik Andreasson and Achim J. Lilienthal
Center of Applied Autonomous Sensor Systems (AASS), Örebro University, Sweden

Abstract: 3D range sensing is one of the important topics in robotics, as it is often a component in vital autonomous subsystems like collision avoidance, mapping and semantic perception. The development of affordable, high frame rate and precise 3D range sensors is thus of considerable interest. Recent advances in sensing technology have produced several novel sensors that attempt to meet these requirements. This work is concerned with the development of a holistic method for accuracy evaluation of the measurements produced by such devices. A method for comparison of range sensor output to a set of reference distance measurements is proposed. The approach is then used to compare the behavior of three integrated range sensing devices to that of a standard actuated laser range sensor. Test cases in an uncontrolled indoor environment are performed in order to evaluate the sensors' performance in a challenging, realistic application scenario.

I. INTRODUCTION

In recent years, a multitude of range sensing devices have become available at more affordable costs. Notably, 2D laser range sensors have demonstrated reliable performance and therefore have been widely used for both research and industrial applications. As the complexity of application scenarios considered in mobile robotics increases, so does the use of 3D range sensors. Although precise commercial 3D laser sensors are available, their prohibitively high cost has limited their use. Actuated laser range finders (aLRF) have thus been the most widely used 3D range sensors in the mobile robotics community. Usually aLRF sensors utilize commercially available 2D laser sensors of high accuracy and well-known precision, resulting in a reliable measurement system. Nevertheless, aLRF sensors have several disadvantages: the lack of commercial availability of an integrated system, slow refresh rates on the order of 0.1 Hz, a high weight and a high number of moving parts.

Several competing sensor technologies that attempt to solve the problems of aLRF systems have been proposed. Recently, time-of-flight (ToF) and structured light cameras have become more available and more widely used. ToF sensors operate by emitting modulated infrared light and measuring the phase shift of the reflected signal. Typically, ToF cameras can deliver dense range measurements at high frame rates of up to 50 Hz. Structured light cameras can produce similar measurements, using a projected pattern that is observed by a CCD camera with a known baseline distance. Although the ToF and structured light sensor technologies have a lot of potential for use in mobile robotics, both are affected by several error sources. It is thus very important to evaluate the accuracy of these emerging sensors, compared to that of the actuated LRF.
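For intuition, the standard phase-shift relation behind such ToF cameras can be written down directly. The snippet below is a generic illustration with made-up modulation parameters; it is not specific to the sensors evaluated in this paper.

```python
import math

C = 299_792_458.0  # speed of light [m/s]

def tof_range(phase_shift_rad, f_mod_hz):
    """Range from the measured phase shift of the modulated signal:
    d = c * dphi / (4 * pi * f_mod). Unambiguous only up to c / (2 * f_mod)."""
    return C * phase_shift_rad / (4.0 * math.pi * f_mod_hz)

f_mod = 30e6                                  # 30 MHz modulation (assumed)
print('ambiguity interval [m]:', C / (2 * f_mod))                     # ~5.0 m
print('range for 90 deg shift [m]:', tof_range(math.pi / 2, f_mod))   # ~1.25 m
```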

Over the course of development of ToF sensors, several works have evaluated their accuracy in the context of sensor calibration. Kahlmann [1] proposes several calibration routines for the SwissRanger SR-2 and SR-3000 cameras. In order to evaluate their effect on the range measurement quality, he proposes to scan a flat wall and compare offsets from the expected distance. Later, several precisely engineered optical calibration setups, as well as a calibration track line, are used. Lindner et al. also use standard optical calibration setups (checkerboard images) in order to calibrate PMD ToF cameras [2]. Fuchs et al. [3] evaluate distance accuracy using a modified checkerboard pattern and a pair of ToF cameras mounted on an industrial manipulator. Chiabrando et al. [4] perform a comprehensive evaluation of the SR-4000 ToF camera, also using a pre-fixed optical calibration pattern and a tilted calibration board. While these set-ups constitute an important test case for ToF cameras, they do not capture the complexity of typical uncontrolled environments encountered by mobile robots. May et al. [5] consider several real-world environments and measure the distance to well known objects in these set-ups, resulting in a more complete accuracy evaluation. Several works have also implicitly evaluated the accuracy of ToF ranging systems by considering their utility in the context of mapping [6], [7], obstacle avoidance [8] and object modeling [9]. Prior work dealing with the accuracy of ToF systems has uncovered complex error sources. Features of the environment, such as dark textures, sharp edges, foreground objects and distant bright objects, all introduce measurement errors in the obtained ranges. Although investigations into structured light camera accuracy are lacking, similar complexity of the error sources may be expected. Thus, the evaluation of both ToF and structured light cameras in strictly controlled environments and engineered test scenarios may not properly reflect their performance in a real world setting. It is therefore important to develop a procedure for a holistic evaluation of novel range sensing devices for the purposes of mobile robotics applications. This work extends a recently proposed method for spatial representation accuracy evaluation [10] to perform a comparison of range sensor measurements. Three integrated 3D range sensors (the SR-4000 and Fotonic B70 ToF cameras and the Microsoft Kinect structured light camera) are compared to a standard aLRF sensor. This article proceeds with a description of the accuracy evaluation approach. Section III describes the test setup and evaluation environments considered. Section IV presents an analysis of the obtained results, followed by a summary of the major contributions.

II. ACCURACY EVALUATION

In order to objectively compare the performance of different 3D range sensors, a well-defined framework is needed. It is also important to define well-formed comparison criteria and obtain meaningful, easy to interpret statistical results. The main objective of this work is to propose a method to compare the accuracy of range measurements in an uncontrolled environment, closely related to the application scenario in which the sensors are to be used. The information returned by 3D range sensors consists of discrete samples from a real world environment. Due to differences in sensing methodology, position and orientation of the devices, the discrete points obtained by each sensor do not correspond to precisely the same physical locations. Thus, a simple, per-point distance error would not be an accurate measure of the discrepancies between different sensors. Rather, in order to compare a test point set P_t to a reference measurement P_r, a continuous model M(P_r) should be used. The model M should be capable of evaluating the occupancy of any arbitrary point in the field of view of the sensor. Several alternative probabilistic modeling techniques could be used for constructing M. Building on previous results in spatial representation accuracy evaluation [10], the Three-Dimensional Normal Distributions Transform (3D-NDT) is chosen for constructing the reference model M(P_r).

A. The Three-Dimensional Normal Distributions Transform (3D-NDT)

The Normal Distributions Transform was originally developed in the context of 2D laser scan registration [11]. The central idea is to represent the observed range points as a set of Gaussian probability distributions. In order to better account for outliers in the data, a mixture of a Gaussian distribution and a uniform distribution may also be used [12]. NDT has been extended to three dimensions [13] and applied to the domains of 3D scan registration [14] and loop detection [15], as well as change detection [16] and path planning [17]. One of the key advantages of the 3D-NDT is the fact that it forms a piecewise smooth spatial representation, resulting in the existence of analytic derivatives. Consequently, standard optimization methods can be employed to produce state-of-the-art registration of 3D point clouds using the 3D-NDT representation [14]. Assuming that a set of n point samples P = {(x_i, y_i, z_i)} has been drawn from a Gaussian distribution N(μ, Σ), the maximum likelihood estimates of the mean and covariance can be obtained from the observations:

μ = (1/n) ∑_{i=1}^{n} (x_i, y_i, z_i)^T,  (1)

M = [ x_1 - μ_x  ...  x_n - μ_x
      y_1 - μ_y  ...  y_n - μ_y
      z_1 - μ_z  ...  z_n - μ_z ],  (2)

Σ = (1/(n-1)) M M^T.  (3)

The 3D-NDT partitions space into a set of disjoint voxels and fits a Gaussian distribution to the points in each cell. In order to obtain a well-fitting estimate, it is important to choose a discretisation strategy that would permit for each distribution to be well supported by the observations in P. The most typical implementation of 3D-NDT utilizes a regular grid, though the use of an OcTree data structure of irregular cell size is also possible [15]. As shown previously, the spatial indexing used has an effect on the representation accuracy of the 3D-NDT [10], as well as on the speed and accuracy of the registration algorithms that utilize it [15].

B. Comparing Range Sensor Accuracy using the 3D-NDT

Prior work in evaluating the discriminative capabilities of the 3D-NDT [10] has demonstrated that it is a robust spatial representation with better accuracy than the standard occupancy grid map. Thus, the 3D-NDT is used in this work to represent the ground truth models of the environment. An important feature of the 3D-NDT is that, due to its probabilistic foundations, it can be used to evaluate the occupancy likelihood of any point in space. Thus, given a 3D-NDT model M_ndt and a query point x, a probability of occupancy can be estimated. The PDFs in each cell of M_ndt can be treated as surface generation processes, thus the probability of observing x, generated from N(μ, Σ), is:

p(x | N(μ, Σ)) = 1 / ((2π)^{3/2} |Σ|^{1/2}) · exp( -(x - μ)^T Σ^{-1} (x - μ) / 2 ).  (4)
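As a concrete illustration of Eqs. (1)-(4), the following NumPy sketch (our own illustration, not the authors' implementation) fits a single 3D-NDT cell to the points falling into one voxel and evaluates the Gaussian likelihood of a query point under that cell; the voxel assignment itself and the handling of degenerate covariances are omitted.

```python
# Sketch: fit one 3D-NDT cell (Eqs. 1-3) and evaluate the likelihood of
# a query point under that cell (Eq. 4). Illustration only.
import numpy as np

def fit_ndt_cell(points):
    """points: (n, 3) array of samples falling into one voxel."""
    mu = points.mean(axis=0)                    # Eq. (1): sample mean
    M = (points - mu).T                         # Eq. (2): centered points as columns
    cov = M @ M.T / (points.shape[0] - 1)       # Eq. (3): sample covariance
    return mu, cov

def occupancy_likelihood(x, mu, cov):
    """Eq. (4): Gaussian density of query point x under N(mu, cov)."""
    d = x - mu
    norm = (2.0 * np.pi) ** 1.5 * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / norm

# Example: points sampled near an (almost) planar patch
rng = np.random.default_rng(0)
pts = rng.normal([1.0, 2.0, 0.0], [0.3, 0.3, 0.01], size=(200, 3))
mu, cov = fit_ndt_cell(pts)
print(occupancy_likelihood(np.array([1.0, 2.0, 0.0]), mu, cov))
```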

As the PDFs in the model M_ndt are largely disjoint, the probability p(x|M_ndt) can be evaluated by only using the closest 3D-NDT component and directly applying Equation 4. In order to declare the point in space as occupied, the probability p(x|M_ndt) can be tested against a prefixed threshold. Previous work has empirically determined that a conservative threshold value of 0.1 results in a very accurate model. For the purposes of the current evaluation, an OcTree version of the 3D-NDT is used, as it offers stable performance over varying discretisation sizes. The proposed evaluation procedure assumes a reference 3D sensor, which produces a ground truth point cloud P_r. First, the 3D-NDT model of the reference point cloud M_ndt(P_r) is constructed. The point set output of a test sensor P_t is then considered as a set of positive examples pos_t, produced by a binary classifier associated with the sensor. Next, a set of corresponding negatively classified points neg_t can be generated, using P_t. For each point in P_t, a random offset of between 0.1 and 1 meters along the beam direction is subtracted. This method of generating free space samples is consistent with the sensor output, as each point in neg_t is located between the sensor origin and a corresponding reported obstacle point from pos_t. Next, all points in pos_t and neg_t are tested for occupancy, using the ground truth model M_ndt. The samples in pos_t correctly labeled positive represent the true positives. The negatively labeled samples from pos_t are false negatives. Similarly, the positively labeled points from neg_t constitute

false positives, while the remaining points are true negatives. The true positive rate (tpr) can then be calculated as the number of true positives over all samples labeled positive by the classifier (true positives and false negatives). The false positive rate (fpr) is defined as the number of false positives over all negatively labeled points (false positives and true negatives). Thus, the point set P_t can be represented by a corresponding point p = (tpr, fpr) on a Receiver Operating Characteristic (ROC) plot, a standard method of evaluating and comparing binary classifiers in machine learning.

III. EVALUATION METHODOLOGY

Using the evaluation procedure proposed in Section II-B, any number of sensors can be compared against a ground truth reference scan. In this work, four different 3D range sensors were mounted on a mobile platform. The sensor setup and sample output point sets are shown in Figure 1.

Fig. 1. Sensor setup used for the presented evaluation. An actuated SICK LMS-200, a SwissRanger SR-4000, a Fotonic B70 and a Microsoft Kinect camera were all mounted on a mobile platform. The resulting point sets were registered in a common reference system and used for evaluation.

In order to use the evaluation methodology previously presented, the output point clouds have to be placed in a common reference system. One possible method for extrinsic sensor calibration would be to consult the data sheets and then precisely measure the relative displacements and orientations. The main disadvantages of this approach are that it lacks robustness to changing the sensor configuration and that small errors in the orientation can be made easily, resulting in large discrepancies of the obtained measurements. A common strategy from photogrammetry is the use of a calibration pattern. Due to the large amount of noise present and the low resolution of some of the sensors, this is also not a feasible alternative. Thus, the sensors were registered together directly using the obtained range measurements and the 3D-NDT registration algorithm [15]. The advantage of this approach is that using several test scans rich in features minimizes the effect of sensor noise, resulting in a more accurate calibration. Several procedures for intrinsic parameter calibration of ToF sensors have been proposed [1]-[3]. Prior work on the SR-4000 sensor [4] has however concluded that most of these procedures have been performed by the manufacturer and are not necessary. Notably, the evaluation scenario with known ground-truth information (discussed in Section IV-A) could be further used to improve the calibration parameters and internal noise models of the evaluated sensors. While such an investigation could be of interest in correcting the internal sensor errors, it would not offer insights into environment-induced discrepancies and will not be further pursued in the current work. Thus, an out-of-the-box testing approach was used in this work, with no further sensor-specific filtering or noise removal techniques employed. Factory settings were used for all sensors, with automatic parameter estimation enabled whenever possible. The Fotonic camera was used in 25 Hz mode, with four samples used to estimate each pixel depth and two alternating modulation frequencies. Finally, in order to provide for a more objective evaluation, only the overlapping parts of the sensors' fields of view were used.
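The comparison procedure of Section II-B can be summarized by a short sketch. The following Python fragment is our own illustration, not the authors' code: is_occupied is an assumed callable that wraps the reference 3D-NDT model (for example, comparing the likelihood of Equation 4 against the 0.1 threshold), and the free-space samples are drawn 0.1 to 1 meter back along each beam as described above.

```python
# Sketch of the Section II-B evaluation: one (fpr, tpr) point per test scan.
import numpy as np

def roc_point(test_points, sensor_origin, is_occupied, rng=None):
    """test_points: (n, 3) array P_t; is_occupied: callable mapping a 3D
    point to True/False using the reference NDT model (assumed helper)."""
    rng = rng or np.random.default_rng()
    rays = test_points - sensor_origin
    dists = np.linalg.norm(rays, axis=1, keepdims=True)
    dirs = rays / dists
    # negative (free-space) samples: pull each point 0.1-1.0 m back along
    # its beam; the min() is a safeguard so samples stay in front of the sensor
    offsets = rng.uniform(0.1, 1.0, size=(len(test_points), 1))
    neg_points = test_points - dirs * np.minimum(offsets, 0.9 * dists)

    pos_labels = np.array([is_occupied(p) for p in test_points])
    neg_labels = np.array([is_occupied(p) for p in neg_points])
    tpr = pos_labels.mean()   # true positives over all positive examples
    fpr = neg_labels.mean()   # false positives over all negative examples
    return fpr, tpr
```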

Once all sensor data has been reliably registered in a common reference frame, the accuracy of the range measurements can be evaluated. As ground truth geometrical models of uncontrolled indoor environments are generally not available, this work uses the actuated laser points as the reference measurement set. In order to demonstrate the feasibility of this assumption, an initial test of the evaluation procedure was performed in a controlled environment. Figure 2 shows the environment used to perform an evaluation with a known ground truth. Three standardized boxes of known dimensions were placed in front of a white wall. In order to increase the complexity of the scene and test for robustness to texture, a black and white calibration board was also placed on the wall. The sensor setup was placed at three meters from the wall and used to collect data from all sensors. A ground truth model of the environment was created by hand using 3D modeling software and sampled uniformly to obtain a reference point cloud P_r. Next, a more extensive real world scenario data set was collected. Point clouds from an indoor laboratory environment were gathered at fifty distinct positions. The resulting ROC plots obtained are discussed in the next section.

IV. RESULTS

A. Evaluation with Known Ground-Truth Information

Ten point clouds from each of the four sensors were collected and compared to the reference point cloud P_r, resulting in the ROC plot in Figure 3(a). Evidently, the point sets obtained by the actuated laser are spread with the largest variance of all four sensors. This behavior is unexpected, as a large spread indicates a big difference between individual point clouds. The variance can be explained by some inaccuracies in the laser sensor system, namely an inconsistency in the timing of messages originating from the laser and from the tilting actuator. The problem is further complicated by the large amount of data coming from the three high frame rate 3D sensors, resulting in a high computational load on the data collection computer. The three integrated range sensing devices all exhibit highly repeatable measurements with low variance in the scan scores. The Microsoft Kinect sensor has the best performance in this test scenario, with results very close

Fig. 2. Setup for Evaluation with Known Ground Truth. Figure 2(a) shows a picture of the controlled testing environment. Figure 2(b) shows the output from the SwissRanger SR-4000 camera (red), registered together with the ground truth point set (white). Finally, Figure 2(c) shows the SwissRanger point set along with the 3D-NDT distributions constructed from the ground truth model. The normal distributions are drawn as ellipsoids at three standard deviations, with color coding indicating the dominant orientations of the distributions.

[Figure 3 consists of six ROC plots, panels (a)-(f), each plotting the True Positive Rate (tpr) against the False Positive Rate (fpr) for the Laser, Kinect, Swiss Ranger and Fotonic sensors. Panel titles: (a) ROC Plot Evaluation with Known Ground Truth; (b) ROC Plot Evaluation with Known Ground Truth (Laser); (c) ROC Plot for all Scans; (d) ROC Plot for Low Range Scans; (e) ROC Plot for Mid Range Scans; (f) ROC Plot for High Range Scans.]
Fig. 3. Receiver Operating Characteristic (ROC) plots for all the evaluation scenarios discussed. Figure 3(a) shows the ROC points for the test scans in the ground truth scenario, for all four sensors. Figure 3(b) shows the ROC points in the same scenario, when considering the actuated laser points as ground truth data. Figure 3(c) shows the ROC plots for all sensors in the lab environment scenario. Finally, Figures 3(d)-3(f) show partitions of the lab environment test set, according to the average point distances from the sensor. The figures show point sets at 0-3 m, 3-5 m and above 5 m ranges respectively.


to those of the laser sensor. The two time-of-flight cameras exhibit similar performance, with the SwissRanger slightly outperforming the Fotonic camera. Figure 3(b) shows results from the same test scenario, with the laser sensor point clouds used as a reference sensor, providing the point cloud P_r. Comparing these with Figure 3(a) gives a good sense of the expected accuracy of the results presented later. First, due to the inaccuracies in the laser system discussed earlier, measurements are considerably more spread across the ROC plot. Notably though, the comparative performance of the evaluated sensors remains the same, with the Kinect sensor producing the best results. The average values of the sensor accuracy also remain close to their ground truth values, making a strong case for the proposed comparison strategy, especially so in the absence of reliable ground truth information.

B. Evaluation in an Uncontrolled Environment

Fifty sets of test examples collected from an indoor laboratory environment were used to generate the ROC plot shown in Figure 3(c). The ROC points associated with the measurements from the actuated laser are, as expected, concentrated in the top-left area of the plot. The obtained points are spread over a small area, featuring a high rate of true positives and a negligible rate of false positives. As the laser points were used to construct the ground truth 3D-NDT models, this behavior is normal and only serves as a reference for the best possible results on the data set. The points from the other three sensors are spread over a large area of the ROC plot and do not seem to offer a clear distinction between the sensors. A closer look at the collected data can however offer some insights about the sensor performance. It is important to note that in the operational mode used (millimeter precision), the SICK LMS-200 has a maximum range of eight meters. The operational ranges of the three integrated 3D sensors, as declared by the manufacturers, are between 0.8 and 3.5 meters for the Kinect, up to 5 meters for the SR-4000 and between 0.1 and 7 meters for the Fotonic B70. Thus, a separate evaluation, depending on the average distances measured in the collected point sets, is necessary. Figures 3(d)-3(f) show the results obtained for partitions of the evaluation set into point clouds containing ranges of up to 3, between 3 and 5, and above 5 meters respectively. The distributions of points on the ROC plots obtained are notably more concentrated and easier to interpret. When observing objects that are closer than three meters, the Microsoft Kinect structured light camera performs the best, delivering point clouds with a high likelihood conditioned on the laser measurements. The Fotonic and SwissRanger cameras have slightly worse performance, with a slightly higher rate of false positives. Table I summarizes the average performance of each sensor on the full data set, as well as the partitioned sets. As also testified by Figure 3(d), the performance of the Kinect sensor is very close to that of the laser, for short range environments. The two time-of-flight cameras did not show significant differences in the short range test set. In the indoor environment tested, the

TABLE I
AVERAGE AND STANDARD DEVIATION OF TRUE AND FALSE POSITIVE RATES FOR DIFFERENT SENSORS AND SCAN TYPES

Data Set            TPR    σ(TPR)   FPR    σ(FPR)
aLRF         full   0.91   0.02     0.01   0.01
Kinect       full   0.77   0.15     0.22   0.20
SR-4000      full   0.87   0.05     0.26   0.14
Fotonic B70  full   0.86   0.04     0.24   0.11
Kinect       low    0.89   0.05     0.05   0.05
SR-4000      low    0.89   0.05     0.14   0.10
Fotonic B70  low    0.89   0.04     0.16   0.09
Kinect       mid    0.71   0.13     0.34   0.18
SR-4000      mid    0.87   0.02     0.30   0.07
Fotonic B70  mid    0.85   0.04     0.25   0.07
Kinect       high   0.61   0.12     0.44   0.07
SR-4000      high   0.87   0.04     0.41   0.04
Fotonic B70  high   0.83   0.02     0.35   0.05

two cameras performed in a similar fashion over the whole test set, with the SwissRanger naturally degrading on scans containing longer distances. In the evaluation performed, no sensor achieved performance comparable to that of the laser sensor at the full distance range. Depending on the application scenario and intended use of the sensors, additional evaluation might be needed to assess the feasibility of using an integrated 3D range device in place of an actuated laser sensor. For example, if the primary use of the sensor is collision avoidance, distances of less than three meters from the robot are usually a sufficient look-ahead interval. Thus, any of the three evaluated 3D range sensors would perform well. If a mapping application is envisioned, the uncertainty in faraway objects' positions would have to be explicitly taken into account if one of the evaluated 3D range cameras is to be used. Amplitude thresholding for the time-of-flight cameras, or a distance cutoff for the structured light sensor, might be necessary to ensure better performance. In order to demonstrate the precision of the evaluated sensors at shorter ranges, an additional test was performed. The point sets obtained by all four sensors were filtered, eliminating all obstacle points found at ranges above 3.5 meters. Thus, when only close points are considered both for model building and testing, the ROC plots in Figure 4 are obtained. The improvement in performance is evident, with the average accuracy of the Kinect sensor increasing from (0.77, 0.22) TPR/FPR (standard deviations (0.20, 0.15)) to (0.90, 0.06) TPR/FPR (standard deviations (0.03, 0.01)). A similar performance gain can be observed for the Fotonic B70 camera. The SwissRanger SR-4000 is not influenced by the relaxed conditions, due to the severe backfolding effects that occur in high range scans because of the modulation frequency used: bright objects at more than five meters cannot be distinguished from faint objects at very close range. This effect was not observed in the Fotonic ToF sensor, due to the use of multiple modulation frequencies for range disambiguation.

Fig. 4. ROC plot for the full indoor environment data set, considering only ranges of less than 3.5 meters. The performance of the Kinect and Fotonic cameras improves drastically.

V. DISCUSSION

This work presented a comparative evaluation of three integrated 3D range cameras. A novel evaluation methodology,

based on previous work on spatial representation evaluation, was used to compare the outputs of the three cameras. The proposed approach offers easy to interpret statistics for the descriptive power of different sensors, compared to a set of reference range observations. In simple environments, handcrafted precise geometric models can be used to evaluate absolute sensor performance. In addition, a sensor with known performance can be used to collect reference measurements in arbitrarily complex application scenarios. A direct comparison of the ranges obtained with a different sensor can then be performed, allowing for further analysis into the usability of novel range sensors. The proposed methodology does not make any assumptions about the operational principle of the evaluated range sensors and can thus be used for generic benchmarking of newly available devices. The accuracy of three integrated 3D range sensors (the SwissRanger SR-4000 and Fotonic B70 ToF cameras and a Microsoft Kinect structured light camera) was compared to that of an actuated laser range sensor. The three devices evaluated can deliver high frame rates and dense range measurements, but none of them features an accuracy similar to that of the laser system. A notably different conclusion can be reached when explicitly considering environments of constrained size. When all range measurements are concentrated within a sphere of radius 3.5 meters, the Microsoft Kinect sensor has an accuracy similar to the one obtained by the actuated laser sensor. In conclusion, when considering smaller environments, the Kinect and Fotonic B70 sensors can be directly used as a high data rate substitute for an aLRF system. Future work in sensor evaluation will focus on a more in-depth analysis of the acquired data set and a more extensive comparison of the expressive power of the sensors considered. One option to be explored is the direct comparison of the 3D-NDT models that can be constructed from the range measurements returned by each sensor. Other directions to be further explored are a diversification of the set of sensors evaluated and the collection of a larger test sample from different environments. Emphasis will also be placed on outdoor environments and some application specific scenarios, in particular the scenario of grasping deformable objects stacked in a container.

ACKNOWLEDGMENTS

This work has been partially supported by the European Union FP7 Project RobLog: Cognitive Robot for Automation of Logistic Processes.

REFERENCES

[1] T. Kahlmann, Range imaging metrology: Investigation, calibration and development, Ph.D. dissertation, 2007.
[2] M. Lindner, I. Schiller, A. Kolb, and R. Koch, Time-of-Flight sensor calibration for accurate range sensing, Computer Vision and Image Understanding, 2010.
[3] S. Fuchs and G. Hirzinger, Extrinsic and Depth Calibration of ToF-cameras, in Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 2008, pp. 1-6.
[4] F. Chiabrando, R. Chiabrando, D. Piatti, and F. Rinaudo, Sensors for 3D Imaging: Metric Evaluation and Calibration of a CCD/CMOS Time-of-Flight Camera, Sensors, vol. 9, no. 12, 2009.
[5] S. May, B. Werner, H. Surmann, and K. Pervolz, 3D Time-of-Flight Cameras for Mobile Robotics, in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2006, pp. 790-795.
[6] S. May, D. Dröschel, S. Fuchs, D. Holz, and A. Nüchter, Robust 3D-mapping with Time-of-Flight Cameras, in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2009, pp. 1673-1678.
[7] K. Pathak, A. Birk, and J. Poppinga, Sub-pixel Depth Accuracy with a Time of Flight Sensor using Multimodal Gaussian Analysis, in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2008, pp. 3519-3524.
[8] D. Dröschel, D. Holz, J. Stückler, and S. Behnke, Using Time-of-Flight Cameras with Active Gaze Control for 3D Collision Avoidance, in IEEE Int. Conf. on Robotics and Automation, ICRA 2010, pp. 4035-4040.
[9] Y. Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt, 3D Shape Scanning with a Time-of-Flight Camera, 2010, pp. 1173-1180.
[10] T. Stoyanov, M. Magnusson, H. Almqvist, and A. J. Lilienthal, On the Accuracy of the 3D Normal Distributions Transform as a Tool for Spatial Representation, in Proc. of IEEE Int. Conf. on Robotics and Automation, ICRA 2011.
[11] P. Biber and W. Straßer, The Normal Distributions Transform: A new Approach to Laser Scan Matching, in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2003, 2003.
[12] P. Biber, S. Fleck, and W. Straßer, A Probabilistic Framework for Robust and Accurate Matching of Point Clouds, in 26th Pattern Recognition Symposium (DAGM 04), 2004.
[13] M. Magnusson, T. Duckett, and A. J. Lilienthal, 3D scan registration for autonomous mining vehicles, Journal of Field Robotics, vol. 24, no. 10, pp. 803-827, Oct. 2007.
[14] M. Magnusson, A. Nüchter, C. Lörken, A. J. Lilienthal, and J. Hertzberg, Evaluation of 3D Registration Reliability and Speed - A Comparison of ICP and NDT, in Proc. IEEE Int. Conf. on Robotics and Automation, ICRA 2009, pp. 3907-3912.
[15] M. Magnusson, The Three-Dimensional Normal-Distributions Transform - an Efficient Representation for Registration, Surface Analysis, and Loop Detection, Ph.D. dissertation, Örebro University, Dec. 2009.
[16] H. Andreasson, M. Magnusson, and A. J. Lilienthal, Has Something Changed Here? Autonomous Difference Detection for Security Patrol Robots, in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2007, 2007.
[17] T. Stoyanov, M. Magnusson, H. Andreasson, and A. J. Lilienthal, Path Planning in 3D Environments Using the Normal Distributions Transform, in Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, IROS 2010, Taipei, Taiwan, 2010.


A comparative evaluation of exploration strategies and heuristics to improve them


Dirk Holz Nicola Basilico Francesco Amigoni Sven Behnke Autonomous Intelligent Systems Group, University of Bonn, Bonn, Germany Dipartimento di Elettronica e Informazione, Politecnico di Milano, Milano, Italy

Abstract: Exploration strategies play an important role in influencing the performance of an autonomous mobile robot exploring and mapping an unknown environment. Although several exploration strategies have been proposed in recent years, their experimental evaluation and comparison are still largely unaddressed. In this paper, we quantitatively evaluate exploration strategies by experimentally comparing, in a simulation setting, a representative sample of techniques taken from the literature. From a broader perspective, our work also contributes to the development of good experimental methodologies in the field of autonomous mobile robotics by promoting the principles of comparison, reproducibility, and repeatability of experiments.

Index Terms: Robotic exploration, exploration strategies.

I. INTRODUCTION

Exploration and mapping are fundamental tasks for autonomous mobile robots operating in unknown environments. Recent work [1] showed that exploration strategies largely influence the efficiency with which a robot performs exploration. Broadly speaking, an exploration strategy drives a robot within a partially known environment, determining where to acquire new spatial information. In this work, we focus on the mainstream approach of greedy Next-Best-View (NBV) exploration strategies. When employing a NBV strategy, exploration comes down to a sequence of steps where, at each step, a number of candidate observation locations are evaluated according to some objective function and the best one is selected for the robot to reach. Several exploration strategies [2], [3], [8], [16], [19] have been proposed, but their experimental evaluation and comparison constitute a topic that is still largely unaddressed, with few exceptions (e.g., [1] and [12]). In our opinion, filling this gap is an important issue in mobile robot exploration. In this paper, we aim at contributing to the assessment of the experimental comparison between exploration strategies. In particular, we compare three exploration strategies for a single robot [5], [8], [11] that are a representative sample of the current state of the art and we investigate the reasons for their different performance and the ways in which they can be improved. The original contribution of this paper is not in the proposal of new exploration strategies, but in presenting some insights derived from the quantitative experimental evaluation of both some strategies and some general heuristics that can be used to improve them. These insights can be intended as enabling factors for more complex exploration applications and for developing better exploration strategies. Our work extends the results of [1] by comparing a different set of

strategies within a more realistic simulation framework and by presenting new insights. Furthermore, we extend the work in [9] by also evaluating the heuristic improvements when applied to different exploration strategies. Our work can also be viewed from the general perspective of the definition of good experimental methodologies for autonomous mobile robotics (for instance, see [6] and [14]). Recent efforts have recognized that experimentation in this field has not yet reached a level of maturity comparable with that reached in other engineering and scientific fields [4]. Among the elements that define a good experimental methodology is the comparison of experimental results. With this paper, we contribute toward the definition of a framework for evaluating exploration strategies in different setups. We conduct our comparison in simulation, since it enables performing reproducible and repeatable experiments [4]. Reproducibility is the possibility to verify, in an independent way, the results of a given experiment. Other experimenters, different from the one claiming the validity of some results, should be able to achieve the same results, by starting from the same initial conditions, using the same type of instruments, and adopting the same experimental techniques. Repeatability concerns the fact that a single result is not sufficient to ensure the success of an experiment. A successful experiment must be the outcome of a number of trials, performed at different times and in different places. These requirements guarantee that the result has not been achieved by chance, but is systematic. Performing experiments using a standard and publicly available simulation platform (like Player/Stage) is a way to promote comparison, reproducibility, and repeatability of experiments.

II. RELATED WORKS

The definition of strategies for autonomous exploration of environments has been addressed by several works in the literature. Besides exploration strategies that make the robots move along predefined trajectories [13] and that attempt to close loops for localization purposes [17], the mainstream approach considers exploration as an incremental process in which the next observation location is selected among a set of candidates on the basis of available information. These Next-Best-View (NBV) systems evaluate the candidate observation locations according to some criteria. Usually, in NBV systems, candidate locations are on the frontier between the known free space and the unexplored part of the environment and are reachable from the current position of the robot [21]


(an exception is the feature-based approach of [15]). The exploration strategies analyzed in this paper follow the NBV approach. NBV problems have also been studied in Computer Vision and Graphics. However, the proposed techniques do not apply well to mobile robots [8]. In evaluating a candidate location, single or multiple criteria can be used. For example, [21] presents a strategy that uses a single criterion, the traveling cost, according to which the best observation location is the nearest one. Other approaches combine traveling cost with different criteria, for example with expected information gain [8]. This criterion is related to the expected amount of new information about the environment obtainable from a candidate location. It is estimated by measuring the area of the portion of unknown environment potentially visible from the candidate location, taking into account the so-far built map and the robot's sensing range. Other examples of combining different criteria are [16], in which the traveling cost is linearly combined with the expected reduction of the uncertainty of the map after the observation, and [2], in which a technique based on relative entropy is used. In [19], several criteria are employed to evaluate a candidate location: traveling cost, uncertainty in landmark recognition, number of visible features, length of visible free edges, rotation and number of stops needed to follow the path to the location. They are combined in a multiplicative function to obtain a global utility value. The above strategies are based on ad hoc aggregation functions (linear combination, multiplication, ...) that combine criteria. In [3], the authors dealt with this problem and proposed a more theoretically grounded approach based on multi-objective optimization, in which the best candidate location is selected on the Pareto frontier. In [3], besides traveling cost and expected information gain, also overlap is taken into account. This criterion is related to the amount of already known features that are visible from a candidate location. It accounts for the precision of self-localization of the robot: the larger the overlap, the better the localization of the robot.

III. EXPERIMENTAL SETTING

We now introduce our experimental setting in which we compared the three exploration strategies described in the next section. The strategies have been integrated into a robot control architecture [11] and simulated runs have been performed in Player/Stage to assess and compare their performance. The system represents a class of widely used wheeled mobile robots and consists of a differential-drive robot platform equipped with a SICK LMS 200 laser range scanner with 180 degree field of view and 1 degree angular resolution. The goal of the robot is to fully explore an initially unknown environment. Robot localization and mapping are performed by incrementally registering raw 2D laser range scans as described in [10]. The robot continuously updates the map as it moves. The map is represented as an unordered point cloud where duplicate storage of measurements is avoided by adding to the map only points that provide new information. They are determined according to a minimum distance from the already stored

points. In addition, we update a grid map that represents, for each cell c_[xy], its reflection probability

p(c_[xy]) = #hits / (#hits + #misses),

where #hits is the number of range beams that have been reflected by an object in the corresponding region and #misses is the number of range beams that have passed through the cell without being reflected. Initially, a value 0.5 is assigned to each cell, i.e., a cell's reflection probability is initially unknown. Path planning is accomplished by computing a reachability map which stores, for every cell, both the length of the shortest path to reach it from the current location of the robot and the preceding cell along this path. It is built by iteratively applying Dijkstra's algorithm on the grid map without specifying any goal location, so as to fully explore the reachable workspace. Therefore, once a candidate location is selected, the shortest obstacle-free path for navigating to it can be recursively looked up in the reachability map. To guarantee safe navigation, we consider as traversable only cells c_[xy] such that p(c_[xy]) ≤ 0.25 and whose distance to the closest obstacle is not less than 30 cm. Finally, note that running all the exploration strategies in the same experimental setting provides a fair way to compare them. Furthermore, using the architecture described above allows for directly applying the implemented exploration strategies on real mobile robots.
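A minimal sketch of these two data structures, assuming a simple 2D grid and ignoring details such as ray casting and the obstacle-clearance check, could look as follows (our own illustration, not the code used in the paper):

```python
# Sketch: reflection-probability grid updated from hit/miss counts, and a
# reachability map computed with Dijkstra's algorithm over traversable cells.
import heapq
import numpy as np

class ReflectionGrid:
    def __init__(self, shape):
        self.hits = np.zeros(shape)
        self.misses = np.zeros(shape)

    def p(self):
        total = self.hits + self.misses
        out = np.full(self.hits.shape, 0.5)           # 0.5 = unknown reflection
        known = total > 0
        out[known] = self.hits[known] / total[known]  # p = #hits / (#hits + #misses)
        return out

def reachability_map(traversable, start):
    """Dijkstra over a boolean traversability grid; returns path lengths and
    predecessors, so the shortest path to any cell can be looked up recursively."""
    dist = np.full(traversable.shape, np.inf)
    pred = {start: None}
    dist[start] = 0.0
    queue = [(0.0, start)]
    steps = [(-1, 0, 1.0), (1, 0, 1.0), (0, -1, 1.0), (0, 1, 1.0),
             (-1, -1, 2 ** 0.5), (-1, 1, 2 ** 0.5), (1, -1, 2 ** 0.5), (1, 1, 2 ** 0.5)]
    while queue:
        d, (x, y) = heapq.heappop(queue)
        if d > dist[x, y]:
            continue
        for dx, dy, cost in steps:
            nx, ny = x + dx, y + dy
            if (0 <= nx < traversable.shape[0] and 0 <= ny < traversable.shape[1]
                    and traversable[nx, ny] and d + cost < dist[nx, ny]):
                dist[nx, ny] = d + cost
                pred[(nx, ny)] = (x, y)
                heapq.heappush(queue, (d + cost, (nx, ny)))
    return dist, pred
```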


IV. EXPLORATION STRATEGIES

In this section, we present the three exploration strategies that we compared in our experiments. These methods constitute a representative sample of different classes of NBV exploration strategies proposed in the literature. The first one, a closest-frontier strategy [11], is simple, both in its definition and computation, and considers a single criterion for candidate selection: the traveling cost. The second technique [8] combines traveling cost and information gain with an ad hoc exponential function. The third technique [5] is based on a more principled way of aggregating multiple criteria in a global utility function.

A. Closest-Frontier Exploration Strategy

The idea of frontier-based exploration strategies is to detect borders between already explored regions of the environment and those regions where the robot has not yet acquired information. Hence, the robot searches for regions that are traversable in the map built so far and that are adjacent to unexplored regions and holes in the map. A simple frontier-based exploration strategy is the closest-frontier strategy (CF). It has been proposed in [21] and can be briefly described according to the following steps:

1) determine the set T of traversable cells;
2) determine the set R of reachable cells, i.e., compute a reachability map (see Section III);
3) determine the set C of cells that are both reachable and traversable: C = T ∩ R;
4) determine the set of frontier cells F by checking for every cell in the set C if it is adjacent to a cell with unknown reflection probability:

F = {c_[xy] | c_[xy] ∈ C, ∃ c_[(x+m)(y+n)] : p(c_[(x+m)(y+n)]) = 0.5, m ∈ {-1, 1}, n ∈ {-1, 1}};  (1)

5) determine n = (n_x, n_y)^T as the frontier cell lying closest to the robot's current position r = (r_x, r_y)^T:

n = arg min_{c_[xy] ∈ F} L((x, y)^T, r),  (2)

where L(p, r) is the length of the shortest path from p to r. Finally, n is chosen as the next best observation location and the robot is guided towards it following the minimum path.
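A compact sketch of the CF selection, assuming the reflection-probability grid, the traversability mask and the reachability distances from Section III are already available, is given below. This is our own illustration: the paper's Eq. (1) writes the neighbourhood offsets as m, n in {-1, 1}, while the sketch simply checks the full 8-neighbourhood.

```python
# Sketch of closest-frontier selection (steps 1-5, Eqs. 1-2). Illustration only.
import numpy as np

def closest_frontier(p_grid, traversable, dist):
    """p_grid: reflection probabilities; traversable: boolean grid;
    dist: path lengths from the reachability map (np.inf if unreachable)."""
    unknown = np.isclose(p_grid, 0.5)
    # a cell is a frontier candidate if any neighbour is unknown
    # (np.roll wraps at the borders; a real implementation would guard the map edge)
    neighbour_unknown = np.zeros_like(unknown)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            neighbour_unknown |= np.roll(np.roll(unknown, dx, axis=0), dy, axis=1)
    frontier = traversable & np.isfinite(dist) & neighbour_unknown   # Eq. (1)
    if not frontier.any():
        return None                                    # exploration finished
    costs = np.where(frontier, dist, np.inf)
    return np.unravel_index(np.argmin(costs), costs.shape)  # Eq. (2)
```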

B. González-Baños and Latombe's Exploration Strategy

The second exploration strategy we decided to evaluate is the strategy by González-Baños and Latombe (GBL) presented in [8]. It selects the next best observation location according to traveling cost and information gain. Given the current partial map of the environment, this strategy generates a set of candidate locations by randomly sampling cells in the vicinity of frontier cells F. Then, given a candidate location p, the corresponding utility u(p) is computed according to two criteria: the traveling cost L(p, r) for reaching p and the estimated information gain I(p) when performing a sensing action at p. The global utility is then computed as

u(p) = I(p) · e^{-λ L(p, r)},  (3)

and the candidate n that maximizes u(·) is selected as the next observation position (in our experiments we used λ = 0.2, as suggested by the authors). Whereas the traveling cost is estimated in the same way as above (using the reachability map), the information gain is estimated as the expected relative change in map entropy. That is, we simulate range scans and corresponding map updates at all candidate locations p. The information gain I(p) is estimated as the difference between the map's entropy before (H) and after (H') the simulated update, I(p) = H - H'. Since the probabilistic reflection maps we used represent, in principle, two probabilities for each cell (being occupied and being free), we estimate the map entropy by

H = - ∑_{c_[xy]} [ p(c_[xy]) log p(c_[xy]) + (1 - p(c_[xy])) log(1 - p(c_[xy])) ],  (4)

where the first term in the sum corresponds to the entropy of the occupancy probability, H_p(occupied), and the second to that of the free-space probability, H_p(free).
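To illustrate Eqs. (3) and (4) (again as our own sketch rather than the authors' code), the GBL utility of a candidate can be computed as follows; simulate_update stands for an assumed helper that returns the reflection map after a simulated scan at the candidate, which the paper does not specify in code form.

```python
# Sketch of the GBL utility: entropy-based information gain discounted
# exponentially by the traveling cost. Illustration only.
import numpy as np

LAMBDA = 0.2   # value used in the experiments

def map_entropy(p_grid):
    """Eq. (4): summed binary entropy of the reflection probabilities."""
    p = np.clip(p_grid, 1e-6, 1.0 - 1e-6)        # avoid log(0)
    return float(-np.sum(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)))

def gbl_utility(p_grid, simulate_update, candidate, travel_cost):
    """simulate_update: assumed helper returning the grid after a simulated
    scan taken at the candidate location."""
    h_before = map_entropy(p_grid)
    h_after = map_entropy(simulate_update(p_grid, candidate))
    info_gain = h_before - h_after                      # I(p) = H - H'
    return info_gain * np.exp(-LAMBDA * travel_cost)    # Eq. (3)
```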

C. MCDM-based Exploration Strategy

This exploration strategy has been introduced in [5] for maps of line segments. Here we summarize it and show its extension to grid maps. This exploration strategy exploits a decision theoretic technique called Multi-Criteria Decision Making (MCDM), which constitutes a more principled way to combine the criteria that evaluate a candidate location. Given a candidate location p, we consider three criteria for its evaluation. The first one is the traveling cost L(p, r), computed as the length of the path connecting the current position of the robot with p. Then, we consider the estimated information gain I(p) and the overlap O(p). These two last criteria should be maximized in order to select good observation locations. Both I(p) and O(p) are computed according to a standard entropy measure. Given the set of cells V_p that are visible from the candidate location p, i.e., cells falling within the sensing range area centered at p, we distinguish between old and new cells using a threshold k over the reflection probability. In particular, a cell c_[xy] ∈ V_p is considered as old if p(c_[xy]) ≤ k or if p(c_[xy]) ≥ 1 - k, otherwise c_[xy] is considered as new. In our experiments we set k = 0.2. Then, maximizing I(p) corresponds to maximizing the total entropy over the new cells of V_p (p provides a potentially large amount of new information), while maximizing O(p) corresponds to minimizing the total entropy over the old cells of V_p (p provides a good localization). We call N the set of three criteria that are considered, N = {L(·), I(·), O(·)}. Given a criterion i ∈ N and a candidate location p, a utility value u_i(p) in the [0, 1] interval is computed in order to evaluate on a common scale p's goodness according to every criterion. The utility is normalized over all the candidates in the current exploration step. For example, considering the traveling cost L(p, r) and calling C the set of (current) candidate locations, the utility u_L(p) (with p ∈ C) is computed with the following linear mapping function:

u_L(p) = 1 - (L(p, r) - min_{q ∈ C} L(q, r)) / (max_{q ∈ C} L(q, r) - min_{q ∈ C} L(q, r)).  (5)

Analogous normalization functions are used for other criteria, preserving the idea that the larger the utility, the better the satisfaction of the criterion. In order to select an observation location, the robot computes a global utility value measuring the overall goodness of each candidate. For every pair p ∈ C and i ∈ N a utility value u_i(p) is computed. MCDM uses an aggregation technique called the Choquet fuzzy integral. Let us introduce this concept. We call a function¹ μ : P(N) → [0, 1] a fuzzy measure on the set of criteria N when it satisfies the following properties: 1) μ(∅) = 0, μ(N) = 1; 2) if A ⊆ B ⊆ N, then μ(A) ≤ μ(B). Given A ∈ P(N), μ(A) represents the weight of the set of criteria A. In this way, weights are associated not only to single criteria, but also to their combinations. Global utility u(p) for a candidate location p is computed by means of the Choquet
¹ P(N) is the power set of N.


TABLE I
DEFINITION OF μ(·) FOR THE MCDM-BASED STRATEGY

μ(L) = 0.2        μ(I) = 0.4        μ(O) = 0.4
μ({L, I}) = 0.9   μ({L, O}) = 0.6   μ({I, O}) = 0.8

integral with respect to the fuzzy measure μ:

u(p) = ∑_{i=1}^{|N|} (u_{(i)}(p) - u_{(i-1)}(p)) · μ(A_{(i)}),  (6)

where (i) indicates the indices after a permutation that changed their order to have, for a given p, u_{(1)}(p) ≤ ... ≤ u_{(|N|)}(p) ≤ 1 (it is supposed that u_{(0)}(p) = 0) and A_{(i)} = {j ∈ N | u_{(i)}(p) ≤ u_j(p) ≤ u_{(|N|)}(p)}. Different aggregation functions can be defined by changing the definition of μ. For example, weighted average is a particular case of the Choquet integral when μ is additive (i.e., μ(A ∪ B) = μ(A) + μ(B)). Most importantly, μ can model dependence relationships between criteria. Formally, criteria belonging to a group G ⊆ N are:

Fig. 1. Example trajectories with and without repetitive re-checking (70 m and 73 m, respectively) (a) and examples of map segmentation, showing Voronoi vertices, edges and critical points (b).

- redundant, if μ(G) < ∑_{g ∈ G} μ(g);
- synergic, if μ(G) > ∑_{g ∈ G} μ(g);
- independent, otherwise.

In summary, what MCDM provides is a sort of distorted weighted average, which takes into account the dependency between criteria. The next observation location is the candidate location that maximizes u(·) in Eq. (6). The MCDM-based strategy used in the experiments has been defined according to the weights reported in Table I. Such weights have been manually chosen in order to model a synergy relation between the information gain I(·) and the traveling cost L(·), thus favoring candidates that satisfy those criteria in a balanced way. Finally, we note that the computational time of employing MCDM, although longer than that of employing CF, has a negligible impact on the time required to map an environment.
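As an illustration of Eq. (6) (our own sketch, not the authors' implementation), the following fragment aggregates the normalized utilities of Eq. (5) with the Choquet integral, using the fuzzy measure of Table I; μ(∅) = 0 and μ({L, I, O}) = 1 follow from the definition of a fuzzy measure.

```python
# Sketch of the Choquet-integral aggregation (Eq. 6) with the Table I weights.
MU = {
    frozenset(): 0.0,
    frozenset("L"): 0.2, frozenset("I"): 0.4, frozenset("O"): 0.4,
    frozenset("LI"): 0.9, frozenset("LO"): 0.6, frozenset("IO"): 0.8,
    frozenset("LIO"): 1.0,
}

def choquet(utilities, mu=MU):
    """utilities: dict criterion -> normalized utility in [0, 1],
    e.g. {'L': 0.8, 'I': 0.5, 'O': 0.7}."""
    # sort criteria by increasing utility: u_(1) <= ... <= u_(|N|)
    order = sorted(utilities, key=utilities.get)
    total, prev = 0.0, 0.0
    for i, crit in enumerate(order):
        a_i = frozenset(order[i:])            # A_(i): criteria with utility >= u_(i)
        total += (utilities[crit] - prev) * mu[a_i]
        prev = utilities[crit]
    return total

# Example: a candidate with a short path, modest information gain, good overlap
print(choquet({"L": 0.8, "I": 0.5, "O": 0.7}))
```

Because μ({L, I}) = 0.9 exceeds μ(L) + μ(I) = 0.6, candidates that balance traveling cost and information gain receive a higher aggregate than a plain weighted average would give them, which is exactly the synergy discussed above.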

D. Heuristics to Improve the Strategies

The three exploration strategies we considered (and most of those presented in the literature) have two main limitations: 1) the decision of reaching a selected location is not changed until the location is actually reached; 2) evaluation of candidate locations is based only on information relative to the single locations, without considering their relation with other locations. In the next section, we provide an experimental answer to the question of how much these limitations affect the performance of exploration strategies. Here, we describe two simple heuristics that can be applied to exploration strategies in order to cope with these limitations and to obtain a better performance (thereby extending the initial results from [9]).

1) Repetitive Re-checking: During navigation to a selected location, the map is continuously updated. As a result, the robot might have fully explored an unknown region before actually reaching the selected frontier location. Hence, continuing to travel to the selected location is unnecessary. We address this problem by using Repetitive Re-checking (RR), i.e., the robot checks whether or not the currently approached frontier location n is still adjacent to at least one cell with unknown reflection probability. As soon as n is no longer a valid frontier, the robot stops traveling towards it and selects the next best location according to the employed exploration strategy. In Fig. 1(a) we report an example that shows that the robot's trajectory is shortened by repetitive re-checking, especially when approaching frontiers in the vicinity of corners.

2) Map Segmentation: Sometimes it can happen that a single room gets visited multiple times if successively selected locations lie in different rooms. To reduce the number of multiple visits, we applied Map Segmentation (SEG), which splits the map built so far into segments representing individual rooms and makes the robot prefer candidates lying in the segment of its current location. We use an approach based on [18] and [20] that splits map regions at local minima in the Voronoi diagram (critical points) of the map's free space. We define critical points to be local minima with respect to the distance to the closest Voronoi site, to be nodes of degree 2, and to be themselves adjacent to a junction node or adjacent to another node that is adjacent to a junction node. Using critical points we split previously unassigned map regions into two parts. We assign cells to segments with respect to their distances to critical points. That is, we form clusters of cells being closest to a common split point. This can be performed efficiently by computing a Euclidean distance transform (EDT) for the critical points. For the actual assignment we compute and store both the distance to the closest critical point (as for the EDT) and the closest critical point itself, thus computing a nearest neighbor transform. Then, in an iterative refinement step, we merge segments that are adjacent to each other but not split by the same critical point. An example of the segmentation algorithm is reported in Fig. 1(b). The map segmentation can be used in the exploration strategies to restrict the set of candidate locations within the scope of the robot's current segment. If the set of candidates


Fig. 2. The two environments: (a) AVZ, provided by K. Lingemann, A. Nüchter, and J. Hertzberg, and (b) Hospital, provided by R. Vaughan.

belonging to the robot's current segment is not empty, then the exploration strategy will choose the next best location from that set. In this sense, we will say that, when SEG is used, a depth-first-like or room-by-room exploration is favored.

V. EXPERIMENTAL RESULTS

We compared the performance of exploration strategies in two office-like indoor environments composed of several rooms and corridors (Fig. 2). Indoor environments present interesting challenges to exploration strategies, mainly due to their intricate structure that makes the selection of the next observation position non-trivial. We compared the three exploration strategies of Section IV with and without the RR and SEG heuristics. As a baseline for comparison, we also report results obtained with a Random Frontier (RF) exploration strategy. It chooses the next observation location according to a uniform probability distribution over the current candidate locations. For every configuration in which a particular exploration strategy was tested within an environment, we performed 50 simulation runs with the same initial position for the robot. Each run is considered completed when no more frontiers can be determined in the current map, i.e., when there does not exist any reachable cell adjacent to another cell with unknown reflection probability. To compare the performance obtained in different configurations, we report the mean of the length of trajectories covered by the robot (as in [1]-[3], [5], [16], [19]). Results obtained in the two environments are reported in Fig. 3. All the strategies perform better than RF, as expected, with more evident differences in the more complex hospital environment. The first interesting comparison worth doing is between the CF and MCDM strategies. The good performance of CF means that minimizing the traveled distance at every exploration step produces a small global traveled distance in the indoor environments we considered. This fact, together with the fact that the robot acquires data during its movements, explains the good performance of CF. Although CF performs slightly better, MCDM achieves comparable performance with respect to CF. This is not obvious, since in MCDM other criteria (I()

and O()) are given more importance than traveling cost (see Table I), which is the only criterion adopted by CF. In fact, the MCDM strategy provides, by means of synergy, a good tradeoff between I() and L(). The close performance of CF and MCDM can also be explained by saying that the latter strategy compensates the potential performance worsening, due to the fact that distance is not minimized, with good information gains. Moreover, we observed that MCDM maps most of the environment following a short path and then travels a relatively long path to complete the map (e.g., filling holes close to corners). A reduction in the total traveled distance of the three strategies can be observed when enabling Repetitive Re-checking (RR). A strategy with the RR heuristic outperforms the corresponding basic strategy, which needs to reach every selected observation location independently of the sensorial data acquired along the path. Using map segmentation (SEG) reduces the traveled distance especially in the hospital environment, where SEG provides a good quality segmentation. Enabling SEG prevents leaving out corners and occlusions and exploring them only in the last steps of the exploration. Without SEG, multiple visits to the same room can be necessary, e.g., when the current robot's room is not completely explored and the best frontier location happens to be outside that room. Interestingly, the MCDM strategy showed a depth-first behavior with respect to unknown regions, even without using SEG. The main reason is the presence of the overlap criterion, which leads to a more conservative exploration by imposing to have a certain amount of old information in each sensorial acquisition. Comparing the two heuristics, SEG appears to reduce the traveled distance slightly more than RR. This result can be explained by considering that RR only stops following an already made decision, while SEG helps in making a better decision. The GBL strategy (with and without RR and SEG) is outperformed by MCDM and CF in both environments. This means that using more criteria does not by itself guarantee a better exploration strategy and suggests that the way in which criteria are combined is fundamental. In this sense, general aggregation techniques such as MCDM appear more suitable to design multi-criteria exploration strategies. This is in accordance with the results of [5], where exploration strategies defined with MCDM and GBL are compared using maps composed of line segments. Finally, we considered a variant of the GBL strategy, in which the information gain is computed as in MCDM, i.e., by using the entropy only over the new cells visible from a candidate location (data are not shown here). With this different I(), GBL shows a slightly better performance, but the above considerations still hold. This suggests that the way in which criteria are combined could be even more important than the methods used to compute the criteria themselves. An evaluation criterion that has been largely neglected so far is the completeness of the maps after exploration. In our experiments, we compared the map entropies resulting from exploration with those computed on (manually) fully explored maps (c = (1 - H_full / H_expl) · 100, to obtain the completeness c in percent). By having the same termination criterion, all basic


Fig. 3. Traveled distances in m (average and standard deviation). (a) AVZ. (b) Hospital.

strategies (with and without SEG) achieve a map completeness of roughly 99%. RR lowers the completeness to 98% in the AVZ environment, and 96% in the hospital environment. This is primarily caused by taking fewer close-range measurements in corners, which can be seen in Fig. 1(a). That is, there is a trade-off in RR between shortening the robot's trajectory and increasing its map uncertainty.

VI. CONCLUSIONS

In this paper, we addressed the experimental comparison of frontier-based exploration strategies for an autonomous mobile robot that maps an unknown environment. A representative sample of three strategies proposed in the literature has been evaluated in combination with two improvement techniques in a common simulated experimental setting. Some insights obtained from our analysis, like the influence of the function used to combine criteria in evaluating candidate locations, can help in developing better exploration strategies. Our work is intended to constitute another step toward the definition of good experimental methodologies for exploration strategies. In particular, in our experimental framework we used a standard simulation platform in order to support the comparison, reproducibility, and repeatability of experiments.

Several additional issues can be considered to improve the experimental framework of this paper. For example, it would be interesting to compare the performance of exploration strategies with that of optimal offline coverage strategies (in which the map is known) [7]. Another issue worth considering is the metric used to measure performance. In this paper, we have considered the traveled distance to account for the energy and time effort, but also the number of map updates and the entropy of the final map can be considered to account for, respectively, the computational effort and the quality of the produced map (which could also involve loop closures). Moreover, the relationships between the performance of the strategies and the particular setting (robot locomotion, speed, etc.) deserve more attention. Finally, extensions to 3D and flying robots are a matter of future work.

REFERENCES
[1] F. Amigoni. Experimental evaluation of some exploration strategies for mobile robots. In Proc. ICRA, pages 2818-2823, 2008.

[2] F. Amigoni and V. Caglioti. An information-based exploration strategy for environment mapping with mobile robots. Robotics and Autonomous Systems, 58(5):684-699, 2010.
[3] F. Amigoni and A. Gallo. A multi-objective exploration strategy for mobile robots. In Proc. ICRA, pages 3861-3866, 2005.
[4] F. Amigoni, M. Reggiani, and V. Schiaffonati. An insightful comparison between experiments in mobile robotics and in science. Autonomous Robots, 27(4):313-325, 2009.
[5] N. Basilico and F. Amigoni. Exploration strategies based on multi-criteria decision making for an autonomous mobile robot. In Proc. ECMR, pages 259-264, 2009.
[6] EURON GEM Sig. http://www.heronrobots.com/EuronGEMSig/, 2007.
[7] Y. Gabriely and E. Rimon. Competitive complexity of mobile robot online motion planning problems. International Journal of Computational Geometry and Applications, 20(3):255-283, 2010.
[8] H. González-Baños and J.-C. Latombe. Navigation strategies for exploring indoor environments. International Journal of Robotics Research, 21(10-11):829-848, 2002.
[9] D. Holz, N. Basilico, F. Amigoni, and S. Behnke. Evaluating the efficiency of frontier-based exploration strategies. In Proc. ISR 2010, pages 36-43, 2010.
[10] D. Holz and S. Behnke. Sancta simplicitas: on the efficiency and achievable results of SLAM using ICP-based incremental registration. In Proc. ICRA, pages 1380-1387, 2010.
[11] D. Holz, G. K. Kraetzschmar, and E. Rome. Robust and Computationally Efficient Navigation in Domestic Environments. In RoboCup 2009: Robot Soccer World Cup XIII, volume 5949/2010 of Lecture Notes in Computer Science, pages 104-115. Springer, Germany, 2009.
[12] D. Lee and M. Recce. Quantitative evaluation of the exploration strategies of a mobile robot. International Journal of Robotics Research, 16:413-447, 1997.
[13] J. Leonard and H. Feder. A computationally efficient method for large-scale concurrent mapping and localization. In Proc. Int'l Symposium on Robotics Research, pages 169-176, 1999.
[14] R. Madhavan, C. Scrapper, and A. Kleiner. Special issue on characterizing mobile robot localization and mapping. Autonomous Robots, 27(4):309-481, 2009.
[15] P. Newman, M. Bosse, and J. Leonard. Autonomous feature-based exploration. In Proc. ICRA, pages 1234-1240, 2003.
[16] C. Stachniss and W. Burgard. Exploring unknown environments with mobile robots using coverage maps. In Proc. IJCAI, pages 1127-1134, 2003.
[17] C. Stachniss, D. Haehnel, and W. Burgard. Exploration with active loop-closing for FastSLAM. In Proc. IROS, pages 1505-1510, 2004.
[18] S. Thrun. Learning Metric-Topological Maps for Indoor Mobile Robot Navigation. Artificial Intelligence, 99(1):21-71, 1998.
[19] B. Tovar, L. Munoz-Gomez, R. Murrieta-Cid, M. Alencastre-Miranda, R. Monroy, and S. Hutchinson. Planning exploration strategies for simultaneous localization and mapping. Robotics and Autonomous Systems, 54(4):314-331, 2006.
[20] K. Wurm, C. Stachniss, and W. Burgard. Coordinated Multi-Robot Exploration using a Segmentation of the Environment. In Proc. IROS, pages 1160-1165, 2008.
[21] B. Yamauchi. A frontier-based approach for autonomous exploration. In Proc. CIRA, pages 146-151, 1997.


People Tracking with Heterogeneous Sensors using JPDAF with Entropy Based Track Management
Srećko Jurić-Kavelj, Ivan Marković, Ivan Petrović
Department of Control and Computer Engineering, University of Zagreb Faculty of Electrical Engineering and Computing, Croatia

Abstract: In this paper we study the problem of tracking an arbitrary number of people with multiple heterogeneous sensors. To solve the problem, we start with a Bayesian derivation of the multiple-hypothesis tracking (MHT), and, under certain assumptions, we arrive at the joint probabilistic data association filter (JPDAF). In their original derivation, both the MHT and JPDAF assume a multiple sensor scenario, which enables us to fuse the sensors' measurements by asynchronously updating the tracking filters. To solve the data association problem, instead of using the optimal MHT with complex hypothesis branching, we choose the JPDAF, since we are interested only in local observations by a mobile robot for people detection, tracking, and avoidance. However, the JPDAF assumes a constant and known number of objects in the scene, and therefore, we propose to extend it with an entropy based track management scheme. The benefits of the proposed approach are that all the required data come from a running filter, and that it can be readily utilized for an arbitrary type of filter, as long as such a strong mathematical principle like entropy is tractable for the underlying distribution. The proposed algorithm is implemented for the Kalman and particle filter, and the performance is verified by simulation and experiment. For the simulation purposes, we analyze two generic sensors, a location and a bearing sensor, while in the experiments we use a laser range scanner, a microphone array and an RGB-D camera.

Index Terms: Multi-sensor fusion, 3D sensing, JPDAF, Entropy

I. INTRODUCTION

The prospects of utilizing measurements from several sensors to infer a system state are manifold. To begin with, the use of multiple sensors results in increased sensor measurement accuracy, and moreover, additional sensors should never reduce the performance of the optimal estimator [24]. However, in order to ensure this performance, special care must be taken when choosing the process model [29]. Furthermore, system reliability increases with additional sensors, since the system itself becomes more resilient to sensor failure [14]. Therefore, by combining data from multiple sensors, and perhaps related information from associated databases, we can achieve improved accuracies and more specific inferences than by using only a single sensor [9]. With such an approach, we increase the chances of a mobile robot operating intelligently in a dynamic environment. Sensor measurements may be combined, or fused, at a variety of levels: from the raw data level to the state vector level, or at the decision level [9]. Raw sensor data can be directly combined if the sensor data are commensurate (i.e., if the sensors are measuring the same physical phenomena),

while if the sensor data are noncommensurate, then the sensor data, i.e., sensor information, must be fused at a feature/state vector level or a decision level. Some sensors, like monocular cameras and microphone arrays, can only measure the angle and not the range of the detected objects. Moreover, some sensors can provide measurements at higher rates, thus making sensor fusion an even more challenging problem.

A large body of work exists on tracking moving objects with mobile robots. As discussed in [21], two major approaches can be identified, both defined by the sensors. The first approach stems from the field of computer vision and implies a camera as the major sensor, while the second utilizes a laser range sensor (LRS), whose measurements are similar to those of radars and sonars. Since the field of tracking and surveillance (where radars and sonars are commonly used) was well established before mobile robotics, a lot of results [8], [22] from that field were applied to the problem of people tracking with an LRS. The LRS approach can be further subdivided according to data association techniques into deterministic and probabilistic [1], [11], [12], [25] approaches. Additionally, these two sensors can also be used conjointly. E.g., in [4], the nearest neighbour approach and an unscented Kalman filter are used for tracking people with a laser and a camera, while in [16] the authors used Euclidean distance and the covariance intersection method for fusing laser, sonar and camera measurements.

When considering multitarget tracking, data association is the fundamental problem. A detailed overview of probabilistic data association techniques is given in [6]. Our previous work [11] was heavily influenced by [1], [25], where the authors use the joint probabilistic data association filter (JPDAF) to solve the data association problem. In [19] the JPDAF is extended to handle multiple data sources (sensors). Such a rigorous approach is questioned when looking at the JPDAF seminal paper [8, Section V, Fig. 2], since the target-sensor geometry indicates that three sonar sensors were used to obtain the measurements. Since in our case the data acquisition happens asynchronously across sensors, we prefer the approach in [8]. The idea is as follows. When the new sensory inputs arrive, predictions about track states are made, and then the JPDAF is used to solve the data association problem. Finally, the track states are updated according to the association probabilities, where the final steps use the likelihood function of the reporting sensor, and that is the only thing required by the JPDAF to handle the multisensor case.

Another approach to probabilistic data association is the multiple hypothesis tracking (MHT) developed in a seminal


paper [22] by Reid. It is an optimal solution to the data association problem, unlike the JPDAF. As discussed by Reid himself [22, Section I], the JPDAF is a special case of the MHT, in which only one hypothesis remains after data processing. To be clear, Reid is referring to [3], which is the initial derivation of the JPDAF. Reid already solves the multisensor problem for two different generic types of sensors. He accomplishes that by describing sensors with their detection and false-alarm statistics. Thanks to such an approach, we can use any type of sensor, provided we have its probabilistic description. The downside of the MHT is its high memory and processing requirements (which grow exponentially with the number of tracks). However, an efficient implementation of the MHT is discussed in [7] and some recent applications are presented in [2], [13].

Instead of using the optimal MHT with complex hypothesis branching, we choose the simpler, although not optimal, JPDAF, as it is a very convenient solution for people tracking by a mobile robot for its local navigation [26]. However, the JPDAF assumes a constant and known number of objects in the scene, and therefore we propose to extend it with an entropy based track management algorithm. The proposed approach is tested for, but not limited to, the Kalman and particle filter. Furthermore, both trackers were tested in simulation and experiment.

II. PROBLEM STATEMENT

We consider initialized tracks at time $k$ described by a set of continuous random variables $X_k = \{x_k^1, x_k^2, \dots, x_k^{T_k}\}$, where $T_k$ denotes the number of tracks. At time $k$ we receive a set $Z^{s_k} = \{z_1^{s_k}, z_2^{s_k}, \dots, z_{m_k}^{s_k}\}$ of measurements from sensor $s_k$, where $m_k$ denotes the number of measurements. A set of all measurements received until and including $k$ is denoted as $Z^k = \{Z^{s_0}, \dots, Z^{s_k}\}$.

Let $\Theta^{k-1}$ denote a set of hypotheses about measurement-to-track assignments at time $k-1$. Considering a specific hypothesis $\theta_{p(h)}^{k-1}$ at time $k-1$, we can construct a legal assignment $\theta_h(k)$ for the set of measurements $Z^{s_k}$. The resulting hypothesis at time $k$ is denoted as $\theta_h^k = \{\theta_{p(h)}^{k-1}, \theta_h(k)\}$, where $\theta_{p(h)}^{k-1}$ denotes the parent of the $\theta_h^k$ hypothesis.

We can now consider the probability of a hypothesis given the sensors' measurements, $P(\theta_h^k \mid Z^k)$, and calculate it using Bayes' theorem:

$$P(\theta_h^k \mid Z^k) = \frac{P(\theta_{p(h)}^{k-1}, \theta_h(k), Z^{s_k}, Z^{k-1})}{P(Z^{s_k}, Z^{k-1})} = \frac{P(Z^{s_k} \mid \theta_{p(h)}^{k-1}, \theta_h(k), Z^{k-1})\, P(\theta_h(k) \mid \theta_{p(h)}^{k-1}, Z^{k-1})\, P(\theta_{p(h)}^{k-1} \mid Z^{k-1})}{P(Z^{s_k} \mid Z^{k-1})}. \quad (1)$$

The JPDAF can be viewed as a special case of the MHT, where after each iteration all the considered hypotheses are reduced to a single hypothesis. We denote that hypothesis at time $k$ as $\Theta(k)$. Again, the JPDAF considers a possible data assignment hypothesis relative to the hypothesis in the previous time instant $k-1$. Calculation of the probability of such a hypothesis reduces to

$$P(\theta_h^k \mid Z^k) = \frac{1}{c}\, P(Z^{s_k} \mid \theta_h(k), \Theta(k-1), Z^{k-1})\, P(\theta_h(k) \mid \Theta(k-1), Z^{k-1})\, P(\Theta(k-1) \mid Z^{k-1}), \quad (2)$$

where $c$ is a normalizing constant. Since we stated that the JPDAF is a zero-scan data association algorithm (only one hypothesis is left after the measurement processing), the probability of the previous hypothesis $P(\Theta(k-1) \mid Z^{k-1})$ is equal to one. If no new track hypotheses are considered, then this formulation coincides with the one in [8].

Each hypothesis $\theta_h^k$ contains track states $X_k$ updated using the considered association $\theta_h(k)$. Since the JPDAF flattens the hypothesis tree to a single branch $\Theta(k)$, it contains track states $X_k$ updated using all the measurements, given their association probabilities

$$\beta_j^t = \sum_{\theta_{jt}^k} P(\theta^k \mid Z^k), \quad (3)$$

where $\theta_{jt}^k$ denotes all the hypotheses that associate measurement $j$ with track $t$.

III. JOINT PROBABILISTIC DATA ASSOCIATION FILTER

As stated before, the JPDAF considers a possible data assignment hypothesis $\theta_h(k)$ relative to the singular hypothesis from the previous time instant $k-1$, which has unit probability. The probability of a specific hypothesis is given in (2) (with $P(\Theta(k-1) \mid Z^{k-1}) = 1$). This leaves us to describe the other two terms in (2). The second term, $P(\theta_h(k) \mid \Theta(k-1), Z^{k-1})$, can be modeled as a constant, as in [11], [25]. A more precise derivation of this term can be found in [7], [8], [22]. Now, we only need to develop the first term

$$P(Z^{s_k} \mid \theta_h(k), \Theta(k-1), Z^{k-1}) = P(Z^{s_k} \mid \theta_h(k)) = \prod_{j=1}^{m_k} P(z_j^{s_k} \mid \theta_h(k)), \quad (4)$$

where $P(z_j^{s_k} \mid \theta_h(k))$ depends on the measurement-to-track associations made by hypothesis $\theta_h(k)$:

$$P(z_j^{s_k} \mid \theta_h(k)) = \begin{cases} P_D^{s_k}\, P(z_j^{s_k} \mid x_k^t), & z_j^{s_k} \text{ associated with existing track } t \\ P_F^{s_k}, & z_j^{s_k} \text{ a false alarm} \end{cases} \quad (5)$$

where $P_D^{s_k}$ is the detection and $P_F^{s_k}$ the false alarm probability for sensor $s_k$. By inserting (5) in (2) and then inserting the result in (3), we get an expression for the measurement-to-tracked-object association probabilities

$$\beta_j^t = \frac{1}{c} \sum_{\theta_{jt}^k} \prod_{j=1}^{m_k} P(z_j^{s_k} \mid \theta^k). \quad (6)$$
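To make the use of (4)-(6) concrete, the following is a minimal illustrative sketch in Python of how the association probabilities could be computed by enumerating the feasible joint association events; it is not the authors' implementation. It follows the simplified event model of (5), so undetected tracks carry no explicit (1 - P_D) factor, the per-measurement likelihoods P(z_j | x_t) are assumed precomputed (e.g., Gaussian likelihoods of the innovations, with gated pairs set to zero), and the function name is ours.

```python
# Illustrative sketch (not the authors' code): JPDAF association probabilities
# following the simplified event model of Eqs. (4)-(6).
import numpy as np

def association_probabilities(likelihood, p_detect, p_false):
    """likelihood[j, t] = P(z_j | x_t) for m measurements and T tracks.
    Returns beta[j, t], the probability that measurement j originates from track t."""
    m, T = likelihood.shape
    beta_unnorm = np.zeros((m, T))
    total = 0.0
    event = []                                   # event[j] = track index or None (clutter)

    def enumerate_events(j, used_tracks, weight):
        nonlocal total
        if j == m:                               # all measurements assigned in this event
            total += weight
            for jj, tt in enumerate(event):
                if tt is not None:
                    beta_unnorm[jj, tt] += weight
            return
        # option 1: measurement j is a false alarm
        event.append(None)
        enumerate_events(j + 1, used_tracks, weight * p_false)
        event.pop()
        # option 2: measurement j originates from a not-yet-used track t
        for t in range(T):
            if t not in used_tracks:
                event.append(t)
                enumerate_events(j + 1, used_tracks | {t},
                                 weight * p_detect * likelihood[j, t])
                event.pop()

    enumerate_events(0, frozenset(), 1.0)
    return beta_unnorm / total                   # the 1/c normalization of Eq. (6)
```

For each measurement j, the returned row sums to at most one; the remainder is the probability that z_j is a false alarm.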


The aforementioned association probabilities are used to update the track states (the filtering part). Since the track states are updated with all the measurements (weighted by their respective probabilities) from their cluster, the hypothesis tree is essentially flattened. In other words, after the update, only one hypothesis remains, the one about the current track states. The actual track state update and implementation of (6) depend on the used state estimator (filter). In this paper, we present a particle filter, used in our previous work [11], and a Kalman filter, used in the original JPDAF formulation [8].

A. Kalman JPDAF

Given any state estimator, a process model is required. We use a quite general constant velocity model for motion in the 2D plane, where the state

$$x = [x\ \dot{x}\ y\ \dot{y}]^T \quad (7)$$

is described by position $(x, y)$ and velocity $(\dot{x}, \dot{y})$ in the $xy$ plane. The model itself is given by

$$x_{k+1} = F x_k + G w_k = \begin{bmatrix} 1 & T_k & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & T_k \\ 0 & 0 & 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} \tfrac{T_k^2}{2} & 0 \\ T_k & 0 \\ 0 & \tfrac{T_k^2}{2} \\ 0 & T_k \end{bmatrix} w_k, \quad (8)$$

where $w_k$ is the process noise and $T_k$ is the update interval. Prediction is calculated using the standard Kalman filter equations

$$\hat{x}_k^t = F \hat{x}_{k-1}^t, \qquad \bar{P}_k^t = F P_{k-1}^t F^T + G Q G^T. \quad (9)$$

Given the innovation vector and its covariance matrix

$$\nu_j^t = z_j^{s_k} - H \hat{x}_k^t, \quad (10)$$
$$S_k^t = H \bar{P}_k^t H^T + R_{s_k}, \quad (11)$$

we can define (5) in the case of an existing track association as $P(z_j^{s_k} \mid x_k^t) = \mathcal{N}(\nu_j^t; 0, S_k^t)$. The innovation vector and covariance matrix can be used for measurement gating. Since $\nu_j^{tT} (S_k^t)^{-1} \nu_j^t$ has a $\chi^2$ distribution, by using tables we can select an upper limit which includes valid measurements with, e.g., 99% probability. The update is done by using all the validated measurements, i.e., the weighted innovation is used for the state update

$$\nu^t = \sum_{j=1}^{m_k} \beta_j^t \nu_j^t, \qquad \hat{x}_k^t = \hat{x}_k^t + K_k \nu^t. \quad (12)$$

Given $\beta_0^t = 1 - \sum_{j=1}^{m_k} \beta_j^t$ and $\tilde{P}^t = \sum_{j=1}^{m_k} \beta_j^t \nu_j^t \nu_j^{tT} - \nu^t \nu^{tT}$, the covariance update is calculated as in [5]

$$P_k^t = \beta_0^t \bar{P}_k^t + (1 - \beta_0^t)\,[I - K_k H]\,\bar{P}_k^t + K_k \tilde{P}^t K_k^T. \quad (13)$$

An important implementation note is that instead of the standard Kalman filter covariance update $[I - K_k H]\bar{P}_k^t$ we use Joseph's stabilized form $[I - K_k H]\bar{P}_k^t[I - K_k H]^T + K_k R_{s_k} K_k^T$, since the standard form caused numerical problems.
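As an illustration only, a compact sketch of the prediction and JPDA update of (9)-(13) is given below; it is not the authors' code. The Kalman gain is computed in the usual way, K = P H^T S^{-1}, which the text leaves implicit, the covariance update already uses Joseph's stabilized form as in the implementation note above, and the association probabilities are assumed to be given (e.g., by (6)).

```python
# Minimal sketch of the JPDA Kalman prediction/update of Eqs. (9)-(13),
# with the association probabilities beta (Eq. (6)) assumed precomputed.
import numpy as np

def kf_predict(x, P, F, G, Q):
    return F @ x, F @ P @ F.T + G @ Q @ G.T                      # Eq. (9)

def jpda_kf_update(x, P, Z, beta, H, R, gate=9.21):
    """x, P: predicted state/covariance; Z: list of measurements for this track;
    beta: their association probabilities (same order). gate=9.21 is the 99%
    chi-square limit for a 2-D measurement."""
    S = H @ P @ H.T + R                                          # Eq. (11)
    Sinv = np.linalg.inv(S)
    K = P @ H.T @ Sinv                                           # standard Kalman gain
    nu_j, b_j = [], []
    for z, b in zip(Z, beta):
        nu = z - H @ x                                           # Eq. (10)
        if nu @ Sinv @ nu <= gate:                               # chi-square gating
            nu_j.append(nu); b_j.append(b)
    if not nu_j:                                                 # nothing validated
        return x, P
    nu_j, b_j = np.array(nu_j), np.array(b_j)
    nu = b_j @ nu_j                                              # combined innovation, Eq. (12)
    b0 = 1.0 - b_j.sum()                                         # probability of no valid measurement
    x_new = x + K @ nu
    P_spread = (b_j[:, None, None] * nu_j[:, :, None] * nu_j[:, None, :]).sum(0) \
               - np.outer(nu, nu)                                # spread-of-innovations term
    IKH = np.eye(len(x)) - K @ H
    P_joseph = IKH @ P @ IKH.T + K @ R @ K.T                     # Joseph stabilized form
    P_new = b0 * P + (1.0 - b0) * P_joseph + K @ P_spread @ K.T  # Eq. (13)
    return x_new, P_new
```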

B. Particle JPDAF

If a particle filter is to be used as a state estimator, besides the prediction and update of the filter, we also have to modify the gating and hypotheses probability calculation procedures. We model the estimated state as a set of tuples $(\hat{x}_k^t(i), w^t(i))$, where $\hat{x}_k^t(i)$ is a particle containing a possible track state, and $w^t(i)$ is its associated weight. If a particle falls within the valid measurement region (based on the squared Mahalanobis distance $\nu_j^{tT}(i) R_{s_k}^{-1} \nu_j^t(i)$), then we consider the association of measurement $j$ to track $t$ when generating the hypothesis. As for the hypothesis probability calculation, we use the average likelihood of the particles

$$P(z_j^{s_k} \mid x_k^t) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(\nu_j^t(i); 0, R_{s_k}), \quad (14)$$

where $\nu_j^t(i)$ is the individual particle's innovation. After obtaining the association probabilities, we update the particle weights according to $w^t(i) = \sum_{j=1}^{m_k} \beta_j^t \mathcal{N}(\nu_j^t(i); 0, R_{s_k})$. After the weights calculation, and only if there are measurements in the current time step, we resample the particles.

IV. TRACK MANAGEMENT

When tracking multiple targets, track management is practically as important as the association itself. A solution for the Kalman filter, described in [5], is based on a logarithmic hypothesis ratio and the innovation matrix. In [25] a Bayesian estimator of the number of objects for an LRS is proposed. This approach requires learning the probability of how many features are observed under a presumed number of objects in the perceptual field of the sensor, while the tracking performance is monitored by an average of the sum of unnormalized sample weights of the particle filter.

We propose to use an entropy measure as a feature in track management. If such a strong mathematical principle is tractable for the underlying probability distribution, then it can be readily utilized for track management independently of the filtering approach. Furthermore, all the information required for the entropy calculation is already available in the running filter and sensor model, and, as will be presented, threshold setting is quite convenient. A practical entropy measure for this task is the quadratic Rényi entropy [23]

$$H_2(x^t) = -\log \int p^2(x^t)\, dx^t. \quad (15)$$

For the Kalman filter, i.e., Gaussian distributions, an analytical solution exists and is given by $H_2(x^t) = \frac{n}{2}\log 4\pi + \frac{1}{2}\log|P^t|$, where $n$ is the state dimension.

Entropy calculation of continuous random variables is based on the probability density functions (pdfs) of these variables. In order to calculate the entropy of a particle filter, which rather represents the density and not the function, we need a nonparametric method to estimate the pdf.


Fig. 1. (a) Trajectories. (b) KF entropies. Simulation results for the KF with modeled detection probability, false alarms and silent periods: true (dashed) and estimated (solid) track states, and tentative but not confirmed tracks (red + marker). The first two objects were detected and a KF was initialized for each one at 0 s. At 5.9 s they went out of the range of the sensors and the entropy of their KF kept rising until they were deleted at 6.3 s. The third object was detected at 3.6 s, and at 9.4 s it went behind the robot, thus causing a rise in entropy. At 11 s it was detected again by the bearing sensor, causing an effective drop in entropy. At 15.1 s it moved in front of the robot again and was detected by the location sensor, which significantly lowered the entropy.

One such method is the Parzen window method [20], which involves placing a kernel function on top of each sample and then evaluating the density as a sum of the kernels. We continue this approach as proposed in [17], [18], and convert each sample to a kernel

$$K_h(x^t) = h^n K(x^t), \quad (16)$$

where $K(\cdot)$ is the particle set covariance, and $h > 0$ is the scaling parameter. For the kernel, we choose $h = \left(\frac{4}{n+2}\right)^{e} N^{-e}$, where $e = \frac{1}{n+4}$, and $N$ is the number of particles. At this point, each track is described as a sum of Gaussian kernels, $p(x^t) = \frac{1}{N}\sum_{i=1}^{N} \mathcal{N}(x^t; x^t(i), 2K_h(x^t))$, for which an analytical solution for the quadratic Rényi entropy exists [27]

$$H_2(x^t) = -\log \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \mathcal{N}(x^t(i) - x^t(j); 0, 2K_h(x^t)). \quad (17)$$

Due to symmetry, only half of these kernels need to be evaluated in practice.

The track management logic is as follows. When the tracks are initialized, they are considered tentative and the initial entropy is stored. When the entropy of a tentative track drops by 50%, it becomes a confirmed track. If and when the entropy gets 20% larger than the initial entropy, the track is deleted. This logic reflects the fact that if the entropy is rising, we are becoming less and less confident that the track is informative. Furthermore, since no entropy should be greater than the one calculated at the point of the track initialization, we can use this initialization entropy as an appropriate deletion threshold.
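The entropy computations and the thresholding logic above can be sketched as follows; this is an illustrative Python example under our own naming, not the authors' implementation. The closed-form Gaussian entropy follows the expression given for the Kalman filter, the particle-set entropy follows (16)-(17) with the bandwidth quoted in the text, and the 50%/20% thresholds assume positive entropy values, as in the plots of Fig. 1(b).

```python
# Sketch of the entropy-based track management: quadratic Renyi entropy of a
# Gaussian (Kalman) track and of a particle set via Gaussian kernels, plus the
# tentative/confirmed/deleted logic described above.
import numpy as np

def renyi_entropy_gaussian(P):
    n = P.shape[0]
    return 0.5 * n * np.log(4 * np.pi) + 0.5 * np.log(np.linalg.det(P))

def renyi_entropy_particles(X):
    """X: (N, n) array of particles (equal weights assumed after resampling)."""
    N, n = X.shape
    e = 1.0 / (n + 4)
    h = (4.0 / (n + 2)) ** e * N ** (-e)          # bandwidth given in the text
    Kh = h ** n * np.cov(X.T)                     # kernel covariance, Eq. (16)
    C = 2.0 * Kh                                  # pairwise kernels use 2*Kh, Eq. (17)
    Cinv = np.linalg.inv(C)
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(C))
    diff = X[:, None, :] - X[None, :, :]          # all pairwise differences
    quad = np.einsum('ijk,kl,ijl->ij', diff, Cinv, diff)
    return -np.log(np.mean(norm * np.exp(-0.5 * quad)))

class TrackStatus:
    """Confirm on a 50% entropy drop, delete 20% above the initial entropy."""
    def __init__(self, initial_entropy):
        self.h0 = initial_entropy
        self.confirmed = False
    def update(self, entropy):
        if entropy < 0.5 * self.h0:
            self.confirmed = True
        return 'delete' if entropy > 1.2 * self.h0 else \
               ('confirmed' if self.confirmed else 'tentative')
```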

V. SIMULATIONS

In order to test the performance of the algorithm, we generated three intersecting circular trajectories. The robot was at (0, 0, 90°) m, the first object started at (2, 1) m and finished at (0.8, 10) m, the second object started at (2, 1) m and finished at (0.8, 10) m, while the third object started at (3, 0) m and finished at (1.6, 2.5) m, thus making more than one revolution around the mobile robot (Fig. 1a). Each object was tracked in an alternating manner by the location and the bearing sensor, while the maximum range for both was kept at 6 m. The location sensor can only track objects in front of the mobile robot, i.e., from 0 to $\pi$, and was corrupted with white Gaussian noise given by $\mathcal{N}([x\ y]^T; 0, 0.03\,I)$. The bearing sensor, on the other hand, can only measure the bearing angle of the object, but in the full range around the mobile robot, i.e., from 0 to $2\pi$, and was also corrupted with white Gaussian noise given by $\mathcal{N}(\theta; 0, 3^\circ)$. Furthermore, for both sensors each measurement had a detection probability of $P_D = 0.9$, and the probability of a false alarm was $P_F = 0.01$.

Since the bearing sensor models a microphone array, it is logical to assume that the speaker will have pauses while talking, thus resulting in longer periods of absent measurements. This was modeled by placing a random number of pauses of a maximum length of 2 seconds at random time instances. Although the assumption about the talking speaker might not be realistic for every-day scenarios, we find it important to analyze the performance of a bearing-only sensor in such a multisensor system. The tracks can only be initialized by the location sensor, but existing tracks should be kept by the bearing sensor when the object moves behind the robot. Since in this case the entropy is substantially larger, it requires the calculation of a bearing-only initialization entropy in order to efficiently manage the case when the object is behind the robot.

In Fig. 1 we show KF simulation results with added detection probability, false alarms, and silent speaker periods, from which we can see that there were several false tracks initiated but never confirmed. Furthermore, we made 100 Monte Carlo (MC) runs for the Kalman filter: on average there were 11.43 initialized, 7.95 tentative, and 3.48 confirmed tracks. In an ideal situation we would have had three confirmed tracks, but taking the scenario into account we can conclude that the


Fig. 2. (a) Trajectories. (b) PF entropies. Simulation results for the PF with modeled detection probability, false alarms and silent periods: true (dashed) and estimated (solid) track states, and tentative but not confirmed tracks (red + marker). The objects' trajectories were the same as in the case of the KF. The main difference is in the third object's estimated trajectory: when the object moved behind the robot, there were bearing measurements up to 13.8 seconds, when a silent period started. The entropy kept rising and the object was deleted before new measurements appeared. The object moved in front of the robot at 15.1 s and consequently a new filter was initialized.

algorithm performs well when it comes to tracking, association and track management. The results of the simulation for the PF with added detection probability, false alarms, and silent speaker periods are shown in Fig. 2. Furthermore, we also made 100 Monte Carlo (MC) runs for the particle filter: on average there were 8.17 initialized, 3.22 tentative, and 4.95 confirmed tracks. Although the average number of confirmed tracks was larger than in the case of the Kalman filter, we still find it to be of acceptable performance. Simulations were performed on a machine running at 2.33 GHz with an unoptimized Matlab implementation. The average computational time of each iteration was 1.9 ms and 137.2 ms for the KF and PF, respectively. The time spent on the entropy calculation was 0.02 ms and 88.6 ms for the KF and PF, respectively.

VI. EXPERIMENTS

To further test the proposed approach, we conducted experiments with our Pioneer 3-DX robot. The laser sensor was the Sick LMS 200 model, while the microphone array is of our own design. Furthermore, since the proposed framework is easily extended to multiple sensors, we also used the Kinect time-of-flight camera with a face recognition algorithm based on [28] to yield a set of measurements in 3D. In the experiment two people were walking in intersecting trajectories in front of the robot (a snapshot of the experiment is shown in Fig. 3). The results are shown in Fig. 4, from which we can see that the first person (blue line) started at (1.2, 2.3) m and finished at (0.9, 2.3) m, while the second person (green line) started at (0.7, 0) m and finished at (0.6, 0) m. The first person was in the field-of-view (FOV) of all three sensors and was talking throughout the experiment, while the second person entered the LRS FOV at a later time, kept quiet and was facing the robot only in the second half of the trajectory. Tracks were
Fig. 3. A snapshot of the data acquisition and signal processing for the experiments. The measurements were classified and collected based on our previous work [10], [11], [15], with only the signal processing stage done, i.e., no tracking was performed on the sensor level.

correctly initialized and maintained, despite the large number of false alarms. The second track was deleted shortly after the second person left the LRS FOV.

VII. CONCLUSION

In the present work we addressed the problem of tracking multiple objects with multiple heterogeneous sensors, specifically an LRS, a microphone array, and an RGB-D camera. The integration of multiple sensors is solved by asynchronously updating the tracking filters as new data arrive. We solved the data association problem by applying the JPDAF, which is a suboptimal zero-scan derivation of the MHT, but which in effect assumes a known number of objects. To circumvent this assumption, we proposed an entropy based track management


Fig. 4. (a) People trajectories. (b) KF entropies. Experimental results for the KF: estimated (solid) track states, and tentative but not confirmed tracks (red + marker). The first object was in the scene from the beginning, while the second object entered the scene at 7.5 s. At 15 s the second object got occluded by the first, which caused an increase in entropy, while at 30 s the second object occluded the first shortly before exiting the scene. The false alarms were caused by tiles on the wall and leg-like features in the room (chairs and tables).

scheme, and demonstrated its performance for the Kalman and particle filter both in simulation and experiment. The results showed that the proposed algorithm is capable of maintaining a viable number of filters with correct association and accurate tracking.

ACKNOWLEDGMENT

This work was supported by the Ministry of Science, Education and Sports of the Republic of Croatia under grant No. 036-0363078-3018.

REFERENCES
[1] A. Almeida, J. Almeida, and R. Araújo. Real-time Tracking of Multiple Moving Objects Using Particle Filters and Probabilistic Data Association. AUTOMATIKA Journal, 46(1-2):39-48, 2005.
[2] K.O. Arras, S. Grzonka, M. Luber, and W. Burgard. Efficient people tracking in laser range data using a multi-hypothesis leg-tracker with adaptive occlusion probabilities. In IEEE ICRA, pages 1710-1715. IEEE, 2008.
[3] Y. Bar-Shalom. Extension of the probabilistic data association filter to multi-target environment. In Proceedings of the 5th Symposium on Nonlinear Estimation, pages 16-21, San Diego, CA, 1974.
[4] N. Bellotto and H. Hu. Vision and Laser Data Fusion for Tracking People with a Mobile Robot. In Proceedings of the IEEE International Conference on Robotics and Biomimetics, pages 7-12, 2006.
[5] S. Blackman and R. Popoli. Design and Analysis of Modern Tracking Systems (Artech House Radar Library). Artech House Publishers, 1999.
[6] I.J. Cox. A review of statistical data association techniques for motion correspondence. International Journal of CV, 10(1):53-66, 1993.
[7] I.J. Cox and S.L. Hingorani. An Efficient Implementation of Reid's Multiple Hypothesis Tracking Algorithm and its Evaluation for the Purpose of Visual Tracking. Trans. on PAMI, 18(2):138-150, 1996.
[8] T. Fortmann, Y. Bar-Shalom, and M. Scheffe. Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journal of Oceanic Engineering, 8(3):173-184, July 1983.
[9] D.L. Hall and J. Llinas. An Introduction to Multisensor Data Fusion. Proceedings of the IEEE, 85(1):6-23, 1997.
[10] S. Jurić-Kavelj and I. Petrović. Experimental comparison of AdaBoost algorithms applied on leg detection with different range sensor setups. In 19th International Workshop on RAAD, pages 267-272. IEEE, 2010.
[11] S. Jurić-Kavelj, M. Seder, and I. Petrović. Tracking Multiple Moving Objects Using Adaptive Sample-based Joint Probabilistic Data Association Filter. In Proceedings of 5th International Conference on CIRAS, pages 99-104, Linz, Austria, 2008.

[12] A. Kräuling and D. Schulz. Tracking extended targets: a switching algorithm versus the SJPDAF. In Proc. of FUSION, pages 1-8, Florence, Italy, 2006.
[13] M. Luber, G.D. Tipaldi, and K.O. Arras. Spatially grounded multi-hypothesis tracking of people. In Workshop on People Detection and Tracking, 2009 IEEE ICRA, Kobe, Japan, 2009.
[14] R. Luo and M. Kay. Multisensor Integration and Fusion in Intelligent Systems. IEEE Transactions on Systems, Man, and Cybernetics, 19(5):901-931, 1989.
[15] I. Marković and I. Petrović. Applying von Mises Distribution to Microphone Array Probabilistic Sensor Modelling. In International Symposium on Robotics (ISR10), München, Germany, 2010. In print.
[16] C. Martin, E. Schaffernicht, A. Scheidig, and H.-M. Gross. Multi-Modal Sensor Fusion using a Probabilistic Aggregation Scheme for People Detection and Tracking. Robotics and Autonomous Systems, 54(9):721-728, September 2006.
[17] C. Musso, N. Oudjane, and F. Le Gland. Improving Regularised Particle Filters, pages 247-272. Springer-Verlag, 2001.
[18] L.-L.S. Ong. Non-Gaussian Representations for Decentralised Bayesian Estimation. PhD thesis, The University of Sydney, 2007.
[19] L.Y. Pao and S.D. O'Neil. Multisensor fusion algorithms for tracking. In Proceedings of the ACC, pages 859-863, San Francisco, CA, 1993.
[20] E. Parzen. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.
[21] E. Prassler, J. Scholz, and A. Elfes. Tracking multiple moving objects for real-time robot navigation. Autonomous Robots, 8(2):105-116, 2000.
[22] D. Reid. An algorithm for tracking multiple targets. IEEE Transactions on Automatic Control, 24(6):843-854, December 1979.
[23] A. Rényi. Probability Theory. North-Holland, London, 1970.
[24] J. Richardson and K. Marsh. Fusion of Multisensor Data. The International Journal of Robotics Research, 7(6):78-96, 1988.
[25] D. Schulz, W. Burgard, D. Fox, and A.B. Cremers. People Tracking with Mobile Robots Using Sample-Based Joint Probabilistic Data Association Filters. The International JRR, 22(2):99-116, February 2003.
[26] M. Seder and I. Petrović. Dynamic window based approach to mobile robot motion control in the presence of moving obstacles. In Proceedings of IEEE ICRA, pages 1986-1991, Roma, Italy, 2007.
[27] K. Torkkola. Feature Extraction by Non-Parametric Mutual Information Maximization. Journal of Machine Learning Research, 3(7-8):1415-1438, October 2003.
[28] P. Viola and M. Jones. Robust real-time object detection. International Journal of Computer Vision, 2002.
[29] G.A. Watson and W.D. Blair. Tracking Maneuvering Targets with Multiple Sensors Using the Interacting Multiple Model Algorithm. Proceedings of SPIE, 1954:438-449, 1993.


Multi-Robot Cooperative Object Tracking Based on Particle Filters


Aamir Ahmad, Pedro Lima
Institute for Systems and Robotics, Instituto Superior Técnico, Lisboa, Portugal

Abstract: This article presents a cooperative approach for tracking a moving object by a team of mobile robots equipped with sensors, in a highly dynamic environment. The tracker's core is a particle filter, modified to handle, within a single unified framework, the problem of complete or partial occlusion for some of the involved mobile sensors, as well as inconsistent estimates in the global frame among sensors, due to observation errors and/or self-localization uncertainty. We present results supporting our approach by applying it to a team of real soccer robots tracking a soccer ball.

I. INTRODUCTION AND RELATED WORK

This paper deals with cooperative tracking of an object by a team of robots. Object tracking is a field of research with multiple techniques being researched and developed extensively [1]. In recent years RoboCup Soccer has laid down a common platform for various research areas in robotics, object tracking being a predominant and crucial one. This involves tracking the soccer ball by the robots during game play. The complexity of tracking has risen from small, orange colored balls to standard sized, random/multi-colored balls and from 2D to 3D [2]. The problem can be formulated as tracking a moving object of known dimensions by a moving robot. We use RoboCup Soccer as an ideal testbed for novel methods that can be used outside soccer.

Particle Filters (PF) are one of the most popular methods employed for tracking [7]. The PF is a non-parametric filter. Non-parametric filters can efficiently handle multi-modal beliefs. In a generic tracker, the motion model of the object being tracked can be completely unknown and might change over time; hence, using a parametric filter can lead to failures quite often. This is because if we use any standard motion model for the object, the tracker can quickly result in low confidence in the posterior when the object motion changes to a different model or switches randomly. This makes it essential to have beliefs with multiple modes scattered over the whole state space, which makes the use of a non-parametric filter appropriate. In the RoboCup scenario, PF based trackers are the dominant tools currently being used by most of the teams. An interesting approach of fusing the Extended Kalman Filter (EKF) and a Monte Carlo PF has been described in [4], where an integrated self-localization and ball tracking method is presented. In [5] a method for simultaneously estimating ball position and velocity using Monte Carlo Localization (MCL) is developed. An efficient implementation of a Rao-Blackwellised PF, which was successfully demonstrated on Sony AIBO robots in the four-legged league of RoboCup, is presented in [7]. None
This work was supported by FCT (ISR/IST plurianual funding) through the PIDDAC Program funds, and by project FCT PTDC/EEA-CRO/100692/2008 PCMMC.

of these works use the information from more than one sensor/robot, therefore being less robust to occlusion and very dependent on the relative state of the robot and the tracked object.

The field of sensor fusion, including its use for single and multiple target tracking [12,13], is now very mature. However, it frequently addresses situations where the sensors are static, know their location in a global frame with no uncertainty, and occlusions occur rarely. When sensors are mobile, e.g., mounted on top of mobile robots, their knowledge of their own localization may degrade over time and/or during time periods due to a number of reasons (e.g., absence of known environment features, bad odometry), and this impacts the uncertainty in the determination of the target position in the global frame, where it is fused with the estimates from the other sensors. Furthermore, occlusions can occur more frequently, as they are due not only to the target(s) path but also to the motion of the different sensors/robots. Therefore, the problem of cooperatively tracking a moving object by a team of mobile sensors is an extension of sensor fusion, designated here as cooperative perception, in which one has to handle occlusions, disagreements between sensors, and dynamic changes of the observation models due to frequent spatial changes. Efficient solutions for multiple static platforms and a moving target [14] or a single moving platform and moving target(s) [15] have been introduced. Our approach to combine both challenges, i.e., tracking a moving target using multiple moving platforms, is novel.

In [10] the relationship between fixed world objects and moving objects is explored for global object localization. These relationships are communicated to teammates where they form a set of constrained relations, solving which gives object location estimates. The authors in [6] present a cooperative PF based tracker for Sony AIBO robots, where the fusion of information involves communicating a reduced set of particles between the robots over the wireless network, which still remains a huge data set, causing inefficient communication. Our approach overcomes this problem, as explained further in this paper. Outside the RoboCup scenario, in [11] a new cooperative perception architecture is developed and tested on multiple UAVs for forest fire detection and localization. A substantial effort is put into developing the fire detector and the fusion of data from various sensors used on-board single and multiple UAVs. The errors that creep in due to the self-localization of the UAVs themselves are unaccounted for, which we address in our work.

In [12,13] a decentralized PF for multiple target tracking is developed and deployed on flight vehicles. The communication bandwidth problem is solved by transforming the particle


set into a Gaussian Mixture Model (GMM), which seems to be an efficient way. In our work we communicate a single parametrized observation probability density function between two robots. This not only further reduces the bandwidth usage but also prevents the prediction model errors of the PF from being propagated to team-mates, which happens when sharing of particles (or of a parametrized form of them) is done. Our work builds mainly upon [2] and [3], carried out in the direction of object tracking and sensor fusion among teammates, respectively. In [2], a PF based tracker is presented with a unique and novel 3-D observation model based on color histogram matching. Each robot has an individual tracker, and its most notable feature is that the tracking could be performed in 3-D space without the object color information, but at the cost of computational expense. In [3] a sensor fusion technique for cooperative object localization using particle filters is presented. Parameters of a GMM approximating a teammate's tracker's particles are communicated to the other robots. Particles at a robot's tracker are then sampled using its own belief and the received GMM.

In this paper we introduce an approach to cooperative perception where we implement a PF-based tracker. For each observing robot (i.e., a mobile robot with a sensor), we determine confidence factors associated to the tracked target from two origins: i) the confidence in the observation itself and ii) the confidence in the self-localization estimate of the observing robot. Note that the self-localization method itself is completely decoupled from the PF-based tracker. The observation model of each mobile sensor is a parametrized probability density function (e.g., a Gaussian centered on the observation). The probability density functions associated to the observations of the team robots are shared by all of them in a pool. Each robot selects the best function, i.e., the one with higher confidence factors, from the pool, and uses it to assign weights to the particles in the traditional PF update step. The parametrization of the observation models intends to significantly reduce the amount of data communicated to teammates, since the probability density function can be univocally represented by its communicated parameters. The method handles, within a single unified framework, inconsistencies (disagreements) between sensors due to observation errors and/or self-localization uncertainty. In order to achieve near real-time tracking, we track the object in 2-D space only and use the object color information. These constraints will be relaxed in future work, as they depend mostly on the available computing power.

II. PARTICLE FILTER BASED TRACKER

In this section we briefly explain a standard PF and how it is used to construct a tracker. Our tracker will estimate the 2-D pose of an object assumed to be moving on a known ground plane. The state vector of the object will be denoted by $x_t$. We assume that the object's state evolution over time is an unobserved Markov process with a uniform initial distribution $p(x_0)$ and a transition distribution $p(x_t \mid x_{t-1})$. The observations $\{y_t;\ t \in \mathbb{N}\}$ are conditionally independent given the process $\{x_t;\ t \in \mathbb{N}\}$ with distribution $p(y_t \mid x_t)$.

For the estimation of the a posteriori distribution, i.e., the belief of the state given all observations $p(x_t \mid y_{1:t})$, the problem under the Markov assumption is formulated as:

$$p(x_t \mid y_{1:t}) \propto p(y_t \mid x_t) \int p(x_t \mid x_{t-1}, u_t)\, p(x_{t-1} \mid y_{1:t-1})\, dx_{t-1}. \quad (1)$$

In the rest of the paper we refer to $p(x_t \mid x_{t-1})$ as the motion model and to $p(y_t \mid x_t)$ as the observation model. $u_t$ is the odometry input to the motion model. For solving the problem as formulated above we use a PF-based tracker. A PF is a non-parametric Bayesian filter where, contrary to a parametric Bayesian filter, the state's probability distribution is represented by a set of $M$ weighted particles $X_t = \{x_t^{[i]}, w_t^{[i]}\}_{i=1}^{M}$, where each particle is an instantiation of the state itself [8].

III. THE COOPERATIVE TRACKER

In this section we present our PF-based cooperative tracker algorithm (see Algorithm 1), which involves the classical PF augmented with the fusion step, which we introduce as a novel contribution.

Algorithm 1 PF Cooperative Tracker($X_{t-1}$, $u_t$, $M$, $k$, $N$)
  $\bar{X}_t = X_t = \emptyset$
  for $m = 1$ to $M$ do
    $x_t^{[m]}$ = sample_motion_model($u_t$, $x_{t-1}^{[m]}$)
  end for
  {The Fusion step starts here}
  Perform sensor observation and generate Observation Model $\mathcal{M}_k^{local}$
  $\mathcal{M}_k^{local} \rightarrow \mathcal{M}_k^{world}$ {Frame transformation using self posture. In the rest of the algorithm the observation model's parameters are in the world frame and are denoted by $\mathcal{M}_k$}
  Compute and send $\mathcal{M}_k$, $C(\mathcal{M}_k)$, $C_{LOC_k}$ to teammates.
  $\gamma_k = C(\mathcal{M}_k)$ {Observation confidence for robot $k$}
  for $r = 1$ to $N$ do
    if $r \neq k$ then
      receive $\mathcal{M}_r$, $C(\mathcal{M}_r)$, $C_{LOC_r}$ from $R_r$
      $\gamma_r = C(\mathcal{M}_r) \cdot C_{LOC_r}$
    end if
  end for
  for $r = 1$ to $N$ do
    $\gamma_r = \gamma_r / \sum_{r=1}^{N} \gamma_r$ {Weight normalization step}
  end for
  for $m = 1$ to $M$ do
    $i = r$; $r \in [1:N]$ sampled with probability $\gamma_r$
    $w_t^{[m]} \leftarrow \mathcal{M}_i$
    $\bar{X}_t = \bar{X}_t + \langle x_t^{[m]}, w_t^{[m]} \rangle$
  end for
  {The Fusion step ends here}
  $X_t$ = Low_Variance_Sampler($\bar{X}_t$)
  return $X_t$

The prediction step and the resampling step of the PF-based cooperative tracker inherit directly from the original PF [8]. We introduce a fusion step which modifies the observation model mechanism, which in turn modifies the way in which the particles are assigned weights.
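The last step of Algorithm 1 calls a low-variance (systematic) resampler. A standard sketch of it, following [8], is shown below for reference; the function name is ours and the weights are assumed to be normalized.

```python
# Sketch of the low-variance (systematic) resampler used in Algorithm 1,
# following the standard formulation of [8]; weights must be normalized.
import numpy as np

def low_variance_sampler(particles, weights):
    """particles: (M, n) array, weights: (M,) normalized weights."""
    M = len(particles)
    r = np.random.uniform(0.0, 1.0 / M)     # single random offset
    c = weights[0]
    i = 0
    resampled = np.empty_like(particles)
    for m in range(M):
        u = r + m / M                        # evenly spaced pointers
        while u > c and i < M - 1:
            i += 1
            c += weights[i]
        resampled[m] = particles[i]
    return resampled
```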


The first input to the algorithm, $X_{t-1}$, is the set of particles, which initially could be distributed on the state space according to any choice. Here we consider a uniform distribution. $u_t$ denotes the input to the motion model of the tracked object. $M$ is the total number of particles, which depends on the available computational resources. The larger the value of $M$, the better is the approximation of the target probability density function (PDF). A standard practice is to keep $M$ close to the order of 500-1000. $k$ is the robot number on which the algorithm is running in a team of $N$ robots. We denote the $r$-th robot as $R_r$. First, we assume that each robot can communicate with every other robot in the team.

After the prediction step of the PF, the robot makes an observation and generates an observation PDF, denoted by $\mathcal{M}$, approximating the observation over the whole state space. This PDF could be any parametrized function in general, based on the model chosen. $\mathcal{M}_k$ is the observation model PDF of robot $k$, which can be reduced to the parameters of the PDF representing the observation model, e.g., mean and covariance matrix for the Gaussian case. $C(\mathcal{M}_k)$ represents robot $k$'s confidence in its observation, which can be calculated in various ways depending on the implementation and scenario. We calculate $C(\mathcal{M}_k)$ as:

$$C(\mathcal{M}_k) = \frac{A_{obs}}{A_{exp}}, \quad (2)$$

where $A_{obs}$ is the area of the tracked object observed in the camera image and $A_{exp}$ is the expected area of the object, since we know a priori the real dimensions of the tracked object. $C_{LOC_k}$ represents robot $k$'s confidence in its own localization. The self-localization confidence factor is determined from the particle filter set. One good approach to do this is to consider the number of effective particles $n_{eff}^k$ [8]:

$$C_{LOC_k} = n_{eff}^k = \frac{1}{\sum_{m=1}^{M} (w_t^{[m]})^2}, \quad (3)$$

which assumes normalized weights of the particles. The observation PDF's parameters are converted into the world frame at this point, using the robot's self-localization estimate.

Next, we receive $\mathcal{M}_r$, $C(\mathcal{M}_r)$, $C_{LOC_r}$ from every other robot $r$ in the team. It is important to note here that these parameters obtained from the teammate robots are already expressed in the world frame. This leads us to form a set $P_k = \{\mathcal{M}_r \mid 1 \leq r \leq N\}$, which we call an observation model pool (OMP) for robot $k$ in the world frame, whose elements represent each robot's individual observation PDF. We now associate a weight to each element in $P_k$ as mentioned in Algorithm 1. For robot $k$'s OMP, only the elements due to other teammates are weighed using their self-localization confidence, except the element due to $k$'s own observation, which is weighed only by its observation confidence. The reason for this will become clear later.

$$\gamma_k = C(\mathcal{M}_k) \quad (4)$$
$$\gamma_r = C(\mathcal{M}_r) \cdot C_{LOC_r}, \quad 1 \leq r \leq N;\ r \neq k \quad (5)$$

Furthermore, we normalize the weights to:

$$\gamma_r = \frac{\gamma_r}{\sum_{r=1}^{N} \gamma_r}, \quad 1 \leq r \leq N. \quad (6)$$

The most crucial step of the fusion is performed here. For every particle that we have after the prediction step we do the following:
Step 1: Sample an element $\mathcal{M}_i$ from the OMP $P_k$ with probability $\gamma_i$. Recall that $\mathcal{M}_i$ is the parametrized observation PDF generated by robot $i$'s observation of the target.
Step 2: Calculate the weight of the particle using the PDF represented by $\mathcal{M}_i$ sampled in the previous step. For instance, if $\mathcal{M}_i$ is parametrized as a Gaussian over the target space, the closer the particle lies to the mean of this Gaussian, the higher is its weight.
These two steps are further clarified in Fig. 1 with the help of an example OMP.

Fig. 1. For the $m$-th particle $x_t^{[m]}$ at time step $t$, $\mathcal{M}_3$ is sampled with a probability $\gamma_3$ from the OMP, which consists of the observation model PDFs $\mathcal{M}_1$ to $\mathcal{M}_4$, shown in this figure as Gaussians for the sake of an example (but in practice any parametrized PDF can be used). $\mathcal{M}_3$ is then used to generate the weight $w_t^{[m]}$ of the particle $x_t^{[m]}$. Sampling $\mathcal{M}$ and then using it to generate the particle's weight is done for every particle at each time step.
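For concreteness, the two fusion steps can be sketched as follows for Gaussian observation models (the case used in Section IV). This is an illustrative Python example, not the authors' code; the dictionary layout of the OMP elements is our own choice, and the weights are assumed to be already normalized as in (6).

```python
# Illustrative sketch of the fusion step: each OMP element carries a mean, a
# covariance and its normalized confidence weight gamma (Eqs. (4)-(6)); per
# particle we sample one element (Step 1) and evaluate its pdf at the particle
# to obtain the particle weight (Step 2).
import numpy as np

def gaussian_pdf(x, mean, cov):
    d = x - mean
    n = len(x)
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / \
           np.sqrt((2 * np.pi) ** n * np.linalg.det(cov))

def fusion_step(particles, omp):
    """particles: (M, n) array after the prediction step.
    omp: list of dicts {'mean', 'cov', 'gamma'} with the gammas summing to one.
    Returns the particle weights."""
    gammas = np.array([e['gamma'] for e in omp])
    weights = np.empty(len(particles))
    for m, x in enumerate(particles):
        i = np.random.choice(len(omp), p=gammas)                      # Step 1
        weights[m] = gaussian_pdf(x, omp[i]['mean'], omp[i]['cov'])   # Step 2
    return weights
```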

Since the above two steps are performed for every particle at a given iteration of Algorithm 1, it fuses the information from all the elements of the OMP according to their respective importance. After this, resampling is performed, which can be done by any established method. The association of OMP elements to particles is done in every iteration of the PF-based tracker. Due to the sampling process in Step 1 as mentioned above, a few particles get associated with low-weight elements of the OMP to maintain the spread of the particles.

The most relevant problem which this PF-based cooperative tracker solves is that of partial or complete occlusion. A partially occluded object's observation by a robot $k$ may lead to a low confidence in its own observation PDF, but in case the same object is being fully observed by another teammate robot $r$, it will lead to a high confidence in the element contributed by it to $P_k$, and hence a greater chunk of particles will get associated to it. This way robot $k$ would still be able to make a good approximation of the target distribution.

The OMP elements refer to the world frame. If a robot is wrongly localized, its observation of the object in the world frame will be incoherent with another correctly localized robot's observation of the same object in the world frame, although both might be observing the object correctly in their respective local frames. The incoherency here is due to the frame transformation, carried out by the wrongly


localized robot, of its observation in the local frame to the global frame. In our approach this problem is solved by weighing the OMP elements by the associated robot's self-localization confidence multiplied by its observation confidence. Also, this is done differently for the recipient robot and the sender robots (4), (5). By doing this we ensure two major advantages: i) a wrongly localized robot's good observation hardly influences the other robots' OMPs; ii) a wrongly localized robot would still have a high confidence in its own good observation, which is necessary if we want to carry out visual tracking in its local frame. It will not be affected by the observations of well localized teammates, which are incoherent with the wrongly localized robot's local frame. This enables the robot to keep tracking the object with its local information during visual tracking, without relying on incorrect global frame information.

We do not have to deal with the robot's motion directly. This is taken care of in situ when we construct the OMP in the world frame, using the self-posture of the robot, which inherently involves the odometry or motion update of the robot, assuming that the robot's acceleration is not very high. Otherwise the frame transformation functions should include the non-inertial terms.

IV. IMPLEMENTATION AND RESULTS

A. Test Bed

We applied the proposed algorithm to the robot soccer scenario. Our testbed is the RoboCup Middle Sized League (MSL) robot team. In robot soccer, one of the basic necessities is to continuously track the ball. A major concern here is that the robots tend to lose their localization on the soccer field because of low-range vision and field symmetry. Moreover, since the field is very large compared to a robot's camera vision field, a ball far from the robot (> 5 m) is scarcely visible, and very often a nearby lying ball is occluded by a teammate or an opponent robot. Thus it becomes a very interesting and appropriate testbed for our proposed algorithm.

B. Implementation In our implementation, we model individual robots observation of the ball as a bivariate Gaussian distribution over the soccer eld. The condence of the observation is calculated based on the observed ball size and expected ball size as mentioned in (2). This is because in the image frame the expected ball size is a xed function of the distance of ball (xed radius = 10cm) from the image center. Images from the camera are streamed at 15 fps. For the prediction step of the tracker we use the same approach (7) as in [2]
Xt+1 = I 0 (t)I I Xt + ( t )I 2 (t)I
2

at ,

(7)

which is a constant velocity model with normally distributed acceleration noise about zero mean. I is a 33 identity matrix, t = 1, and at is a 3 1 white zero mean random vector corresponding to an acceleration disturbance. For resampling we apply the Low Variance Sampling method [8]. C. Robot Localization Since the PF based tracker is closely interlaced with the localization uncertainties of individual robots, it is worth mentioning that the localization method implemented here is Monte Carlo Localization, and is independent of the ball tracking particle lter. D. Results In this sub-section we present a set of plots for each experiment we made using 4 robots represented as R1 R4 . In Figs. 3 and 5 robots self localization condence and individual robots ball observation condence is shown. The plots in Figs. 4 and 6 represent the spatial variance and the variance of weights of the particles at every iteration step for the tracker running on R3 in each experiment which are calculated as in (8) and (9). Every iteration of the PF-based tracker consisting of the prediction step and the observation-fusion-update step takes 0.1 second. Note that comparing with the ground truth is rather difcult in scenarios like this because an overhead eld camera will have tracking errors in itself.
2 spatial =

1 M

m=1

[m ] xt xt

(8)

Fig. 2.

Omni directional robots of MSL on the soccer eld

\sigma^2_{\mathrm{weight}} = \frac{1}{M} \sum_{m=1}^{M} \left( w_t^{[m]} - \frac{1}{M} \sum_{m=1}^{M} w_t^{[m]} \right)^2 \qquad (9)

Each of our 5 fully autonomous robots (Fig. 2) in the team is characterized by an omni-directional drive and a dioptric vision system consisting of a fish-eye lens facing downwards. All high-level algorithms are coded in the C/C++ programming language and run on a Pentium IV 1.6 GHz CPU laptop mounted on the robot itself. All implementations and results presented further involve tracking the ball by 4 of these robots in one half of the MSL field.
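For illustration, a minimal C++ sketch of the prediction step (7) and of the particle diagnostics (8) and (9) is given below. The particle structure, parameter names, and helper functions are assumptions made for this sketch, not the authors' implementation.

#include <array>
#include <random>
#include <vector>

// A particle holds the tracked ball's 3-D position and velocity plus a weight.
struct Particle {
  std::array<double, 3> pos, vel;
  double weight;
};

// Prediction step of Eq. (7): constant velocity model with zero-mean Gaussian
// acceleration noise a_t (dt = 1 as in the paper).
void predict(std::vector<Particle>& particles, double accelStdDev, std::mt19937& rng) {
  std::normal_distribution<double> noise(0.0, accelStdDev);
  const double dt = 1.0;
  for (Particle& p : particles)
    for (int d = 0; d < 3; ++d) {
      const double a = noise(rng);
      p.pos[d] += dt * p.vel[d] + 0.5 * dt * dt * a;
      p.vel[d] += dt * a;
    }
}

// Spatial variance of the particles, Eq. (8): mean squared distance to the mean position.
double spatialVariance(const std::vector<Particle>& particles) {
  std::array<double, 3> mean{0.0, 0.0, 0.0};
  for (const Particle& p : particles)
    for (int d = 0; d < 3; ++d) mean[d] += p.pos[d] / particles.size();
  double var = 0.0;
  for (const Particle& p : particles) {
    double sq = 0.0;
    for (int d = 0; d < 3; ++d) sq += (p.pos[d] - mean[d]) * (p.pos[d] - mean[d]);
    var += sq / particles.size();
  }
  return var;
}

// Variance of the particle weights, Eq. (9).
double weightVariance(const std::vector<Particle>& particles) {
  double mean = 0.0;
  for (const Particle& p : particles) mean += p.weight / particles.size();
  double var = 0.0;
  for (const Particle& p : particles)
    var += (p.weight - mean) * (p.weight - mean) / particles.size();
  return var;
}

In such a sketch the two variance values would be logged at every iteration, exactly as plotted in Figs. 4 and 6, to indicate whether the tracker has converged or has lost the ball.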

In the accompanying video [http://www.youtube.com/watch?v=m_a4DuWagYI], the left side of the frame shows the self-localization estimates of all robots and R3's estimate of the tracked ball's global position. The right side of the frame shows the simultaneously shot overhead footage of the experiment, in which R3 can be seen reaching for the ball.

Fig. 3. Confidence plots of experiment 1 (each iteration is performed every 0.1 second).

Fig. 4. Spatial variance and variance of the weights of the particles of the cooperative tracker at R3 in experiment 1 (panels: Spatial Variance and Weight Variance over the iteration number).

1) Experiment 1: In this experiment we have 3 robots which are forced to stay at their initial positions on the field, and R3 is supposed to reach for the ball. Tracking by R3 has been analyzed in the plots of Fig. 4. Confidence plots for R3 in Fig. 3 show that, intermittently between iterations 100 and 400, the ball was not directly observed by it. During this period, the ball was directly observed by either R1, R2 or R4 (see the other plots in Fig. 3). As a result of the fusion of the observation models from the teammates, as explained previously in this paper, the cooperative tracker on R3 was able to track the ball consistently during this period, with only a few breaks where it momentarily lost the ball. This is supported by the low spatial variance of R3's tracker particles and their converged weight variance in Fig. 4, with a few small periods of peaks in the variances denoting instances when the ball was completely lost. Also, the robot kept following the correct ball, as visible in the video accompanying this paper. The spatial variance of the particles in R3's tracker reaches a high peak for a small duration after iteration 300 in Fig. 4, which indicates that the ball was not tracked during that period. This was not only because the ball was lost from R2, R3 and R4's vision fields, but also because only R1 was observing the ball, with a low observation confidence (approx. 0.4). During most of this experiment all four robots stay well localized.

2) Experiment 2: Similar to experiment 1, in this setup too there are 3 static robots (R1, R2, R4) and one dynamic robot (R3). The major difference here is that R1, R2 and R3 do not localize well for most of the time. The first visible conclusion from the confidence plots in Fig. 5 is that the robot R3 directly observes the ball only for very short periods (iterations 300-350, near 450 and near 700). Up to iteration 300, the spatial variance of R3's tracker particles remains low and their weight variance has also converged, indicating that R3 was able to track the ball, which is supported by the accompanying video. This is because R4 (see Fig. 5) was directly observing the ball, and since it was well localized, R3 was relying on its observation model and hence tracking the ball in the correct position. Close to iteration 300 the ball is lost by all the robots for a very short instance (peak in spatial variance in Fig. 6), and then R3 and R4 observe the ball alternately between iterations approx. 300 and 420 (Fig. 5), leading to a converged spatial variance during that time, supported by the video showing that R3 is consistently following the ball. During iterations approx. 420-650 the ball is seen directly only by R1 (Fig. 5), which has an almost zero confidence in its self-localization estimate (Fig. 5). The cooperative tracker at R3 takes this into account and prevents itself from tracking the ball in a wrong position, hence the high spatial variance in Fig. 6 during that time period. Another notable instance is around iteration 400, where R4 was directly observing the ball and was well localized (Fig. 5), R1 and R3 were not seeing the ball, and R2 was observing the ball but was not localized (low self-localization confidence in Fig. 5). At that instance, R3 was following the correct ball (supported by the low spatial variance in Fig. 6 and the video), indicating that the fusion in R3's cooperative tracker diminished the influence of R2's observation model because of its low self-localization confidence, hence protecting itself from tracking the ball in a wrong position.

V. CONCLUSION AND FUTURE WORK

After performing a considerable number of tests on real robots, we conclude that our algorithm provides a robust approach for tracking an object cooperatively by a team of robots, and we support this with a set of real robot experimental

Fig. 5. Confidence plots of experiment 2 (each iteration is performed every 0.1 second).

Fig. 6. Spatial variance and variance of the weights of the particles of the cooperative tracker at R3 in experiment 2 (panels: Spatial Variance and Weight Variance over the iteration number).

data. However, a few major points can be enumerated: the approach is a solution for tracking an object which is likely to be occluded or partially occluded quite often. If the object is confidently located by a wrongly localized robot, after fusion that robot would still track it correctly in its own local frame and affect the other teammates' fused observation models quite insignificantly. By sharing an observation PDF (for which only the model's parameters are enough to reconstruct the whole PDF), we significantly reduce the use of bandwidth and communication time, which leads to real-time tracking of the object. We are working further to extend our model of cooperative tracking to include multiple sensors on the same robot as well as to relax the 2-D space and colored-object tracking assumptions. Tracking an arbitrarily colored ball in 3-D space will require a new observation model and an extension of the state space, without any changes to our current tracker's algorithm, because it uses a generalized observation model and is independent of the state space dimensions. Furthermore, we intend to extend our current approach to active cooperative tracking, where we dynamically change the geometry of the robot formation so as to reduce the uncertainty of the belief in the object recognition.

REFERENCES

[1] Alper Yilmaz, Omar Javed and Mubarak Shah, "Object Tracking: A Survey", ACM Computing Surveys, Vol. 38, No. 4, Article 13, December 2006.
[2] M. Taiana, J. Santos, J. Gaspar, J. Nascimento, A. Bernardino and P. Lima, "Tracking objects with generic calibrated sensors: an algorithm based on color and 3D shape features", Robotics and Autonomous Systems, special issue on Omnidirectional Robot Vision, Vol. 58, Issue 6, pp. 784-795, June 2010.
[3] J. Santos and Pedro Lima, "Multi-Robot Cooperative Object Localization: Decentralized Bayesian Approach", Proc. of RoboCup 2009 Symposium, Graz, Austria, 2009.
[4] Raúl A. Lastra, Paul A. Vallejos and Javier Ruiz-del-Solar, "Integrated Self-Localization and Ball Tracking in the Four-Legged Robot Soccer League", 1st IEEE Latin American Robotics Symposium, Mexico City, 2004.
[5] Jun Inoue, Akira Ishino and Ayumi Shinohara, "Ball tracking with velocity based on Monte-Carlo localization", 9th International Conference on Intelligent Autonomous Systems, Tokyo, Japan, March 2006.
[6] Walter Nisticò, Matthias Hebbel, Thorsten Kerkhof and Christine Zarges, "Cooperative Visual Tracking in a Team of Autonomous Mobile Robots", RoboCup 2006: Robot Soccer World Cup X, Lecture Notes in Artificial Intelligence, Springer, 2007.
[7] Cody Kwok and Dieter Fox, "Map-based Multiple Model Tracking of a Moving Object", RoboCup 2004: Robot Soccer World Cup VIII, Lecture Notes in Computer Science, 2005.
[8] Sebastian Thrun, Wolfram Burgard and Dieter Fox, Probabilistic Robotics, The MIT Press, 2005.
[9] Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning, Springer, 2001.
[10] Daniel Göhring and Hans-Dieter Burkhard, "Multi Robot Object Tracking and Self Localization Using Visual Percept Relations", Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, October 9-15, 2006, Beijing, China.
[11] Luis Merino, Fernando Caballero, J. R. Martínez-de Dios, Joaquín Ferruz and Aníbal Ollero, "A Cooperative Perception System for Multiple UAVs: Application to Automatic Detection of Forest Fires", Journal of Field Robotics, 23(3/4), pp. 165-184, 2006.
[12] L.-L. Ong, B. Upcroft, T. Bailey, M. Ridley, S. Sukkarieh and H. Durrant-Whyte, "A decentralised particle filtering algorithm for multi-target tracking across multiple flight vehicles", IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4539-4544, 2006.
[13] L.-L. Ong, B. Upcroft, M. Ridley, T. Bailey, S. Sukkarieh and H. Durrant-Whyte, "Consistent methods for decentralised data fusion using particle filters", IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 85-91, 2006.
[14] Xiaohong Sheng, Yu-Hen Hu and Parameswaran Ramanathan, "Distributed particle filter with GMM approximation for multiple targets localization and tracking in wireless sensor network", Proceedings of the 4th International Symposium on Information Processing in Sensor Networks (IPSN), 2005.
[15] Michael D. Breitenstein, Fabian Reichlin, Bastian Leibe, Esther Koller-Meier and Luc Van Gool, "Online Multi-Person Tracking-by-Detection from a Single, Uncalibrated Camera", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.


Monocular Visual Odometry using a Planar Road Model to Solve Scale Ambiguity
Bernd Kitt, Jörn Rehder, Andrew Chambers, Miriam Schönbein, Henning Lategahn, Sanjiv Singh
Institute of Measurement and Control Systems, Karlsruhe Institute of Technology, Karlsruhe, Germany
The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Hamburg University of Technology, Hamburg, Germany

Abstract: Precise knowledge of a robot's ego-motion is a crucial requirement for higher level tasks like autonomous navigation. Bundle adjustment based monocular visual odometry has proven to successfully estimate the motion of a robot for short sequences, but it suffers from an ambiguity in scale. Hence, approaches that only optimize locally are prone to drift in scale for sequences that span hundreds of frames. In this paper we present an approach to monocular visual odometry that compensates for drift in scale by applying constraints imposed by the known camera mounting and assumptions about the environment. To this end, we employ a continuously updated point cloud to estimate the camera poses based on 2d-3d-correspondences. Within this set of camera poses, we identify keyframes which are combined into a sliding window and refined by bundle adjustment. Subsequently, we update the scale based on robustly tracked features on the road surface. Results on real datasets demonstrate a significant increase in accuracy compared to the non-scaled scheme.
Index Terms: Localization, Navigation, Robot Vision

I. INTRODUCTION

Ego-motion estimation is an important prerequisite in robotics applications. Many higher level tasks like obstacle detection, collision avoidance or autonomous navigation rely on an accurate localization of the robot. All of these applications make use of the relative pose of the current camera with respect to the previous camera frame or a static world reference frame. Usually, this localization task is performed using GPS or wheel speed sensors. In recent years, camera systems became cheaper and the performance of computing hardware increased dramatically. Hence, high resolution images can be processed in real-time on standard hardware. It has been proven that the information given by a camera system is sufficient to estimate the motion of a moving camera in a static environment, called visual odometry [16]. Compared to the abovementioned sensors, camera systems have different advantages. It is well known that the accuracy of GPS localization depends on the number of satellites used. This number drops in urban environments with large buildings on either side of the road. The accuracy of wheel speed sensors depends mainly on the slip between wheel and road, which can be high depending on the terrain. Obviously, the localization results based on these sensors are highly affected by the environment. Further drawbacks of GPS and inertial measurement units (IMUs) are the low accuracy and the high cost, respectively. The local drift rates for visual odometry are mostly smaller than the drift rates of IMUs, except for expensive high accuracy integrated navigation systems

which fuse inertial measurements with GPS data [10]. As for all incremental motion estimation techniques, long term drift can only be mitigated by applying loop-closure based on (visual) place recognition (e.g. [20]) or by fusing absolute localization data. In this work, we make use of a fully calibrated monocular camera to estimate the pose of the current camera with respect to a global world reference frame. The used datasets were captured in urban environments using a vehicle moving at a speed of approximately 15 m/s on average. The six degrees of freedom (6 DoF) motion of the vehicle is estimated based merely on the visual information. No additional sensor measurements such as GPS or IMU data are used, in contrast to [1] or [5]. The remainder of this paper is organized as follows: The following section describes work already done in the field of vision-based motion estimation. In Section III, the monocular motion estimation approach is described, which is extended in Section IV to cope with drift in scale. We close the paper with experimental results in Section V, a short conclusion and an outlook on future work.

II. RELATED WORK

In recent years, many algorithms have been developed that estimate the ego-motion of a robot. These algorithms can roughly be classified into two main categories, namely algorithms using monocular camera systems (e.g. [24], [18]) and binocular approaches (e.g. [15], [12]). A further subdivision is possible into algorithms using only feature matches between consecutive frames (e.g. [10], [21]) and algorithms using feature tracks over a couple of images (e.g. [11], [17]). Each class of algorithms has different benefits and drawbacks. Monocular algorithms suffer from the scale ambiguity in the translational camera movement, which is usually resolved using measurements from IMUs (e.g. [5]) or a combination of wheel speed sensors and GPS as in Agrawal et al. (e.g. [1], [2]). Compared to algorithms which use feature tracks, algorithms making use only of feature matches usually suffer from higher drift rates, since the used information incorporates only two images. The entire trajectory is then computed by accumulating the relative camera motions between two consecutive frames. This drift can be reduced using feature tracks over a sequence of images combined with a bundle adjustment scheme [23]. The drawback of bundle adjustment is the computational burden of the optimization process. To


relax this, most algorithms are based on a bundle adjustment which performs the optimization only over a limited number of images, i.e. a sliding window. Other algorithms make use of additional sensors, like IMU or GPS systems (e.g. [1]), to increase the accuracy of the estimation. Obviously, the use of GPS information reduces drift significantly because of the global nature of the system. Furthermore, approaches applying assumptions about the observer's motion have been developed. Scaramuzza et al. [18] make use of a planar motion model and the non-holonomic constraints of wheeled vehicles to reduce the parameter space and increase the accuracy. Good localization results have also been achieved using visual SLAM techniques (e.g. [4]), which simultaneously estimate a map of the environment jointly with the camera pose inside this map. Besides the computational complexity of these approaches, most of the monocular visual SLAM techniques perform well only in well-structured environments and at low speed. Hence, these approaches are mainly applicable to indoor environments. Recently, Strasdat et al. [20] developed a monocular SLAM algorithm applicable to large-scale environments, which resolves the drift in scale when loop-closure occurs. Since we focus on open-loop scenarios, where a robot is travelling from A to B, this approach is not suitable in our case. Compared to the abovementioned approaches which combine monocular vision with additional sensors, our algorithm uses only visual inputs. To solve translational scale drift, we make use of some assumptions about the mounting pose of the camera and the planarity of the road surface in the vicinity of the vehicle. Given these reasonable assumptions, we can continuously resolve the ambiguity in scale and reduce the scale drift significantly. We prefer to use a monocular camera setup because the effort spent in the calibration process is much less than in the binocular case. Furthermore, calibration errors directly affect the motion estimation process. Hence, the use of a monocular camera mitigates the effect of an erroneous calibration on the motion estimation. Compared to Scaramuzza et al. [18], our assumptions are less restrictive since we assume only a locally planar surface but estimate the robot's movement in 6 DoF.

III. MONOCULAR VISUAL ODOMETRY

The proposed algorithm for performing monocular visual odometry assumes a fully calibrated camera with known and fixed intrinsic calibration parameters K. Given a sequence of images, the goal is to estimate the camera pose at each time step solely based on these images. To this end, we track salient image features x_j^k (e.g. corners [7], [19]) over a series of frames. Here, j describes the index of the feature track and k denotes the index of the frame, respectively. Based on these feature tracks, we reconstruct the poses of the moving camera with respect to a predefined world reference frame and the locations of the scene points corresponding to the tracks. In the following sections we will give a detailed description of the different steps of the pose estimation algorithm. Since the feature tracks are short, the (arbitrary) scale of the estimated camera poses drifts over time. In Section IV we propose

a technique to reduce the scale drift and recover scale from a moving monocular camera.

A. Pose Initialization
The pose initialization step is based on a set of three preselected keyframes K = {K1, K2, K3} and corresponding feature points visible in all of these images. Based on the feature points, we compute the epipolar geometry between K1-K2 and K1-K3, which describes the relative pose of the keyframes with respect to the world reference frame. Note that the world reference frame coincides with the camera coordinate frame of K1. To this end, we use the normalized eight-point algorithm [9] wrapped in a RANSAC framework [6] to reject outliers caused by independently moving objects or false feature matches. Outliers are detected using the pairwise Euclidean distance between the observed features and their corresponding epipolar lines. All features with a distance larger than a predefined threshold are classified as outliers. Based on the essential matrix E between K1 and K3, we recover the pose R_{K3}, t_{K3} of K3 with respect to the world reference frame [8]. Here, R and t describe the orientation and translation of the camera, respectively. Note that t_{K3} can only be recovered up to an arbitrary scale factor. Using the recovered pose of the third keyframe and the feature correspondences, we compute the corresponding scene points for all features which are inliers in both pairwise epipolar geometry estimations. Based on the triangulated scene points and their corresponding features in K2, we can compute its pose with respect to the world reference frame using the algorithm proposed in [13]. Afterwards, we perform a bundle adjustment [23] over these three keyframes to get the best estimate of the camera poses and the scene points. The bundle adjustment minimizes the reprojection error, i.e.
\min_{P^k, X_j} \sum_{j,k} d\!\left( P^k X_j,\; x_j^k \right)^2 \qquad (1)

making it the maximum likelihood estimate, assuming a Gaussian distribution of the measurement errors [23]. Here, P^k = K [R^k | t^k] denotes the 3x4 projection matrix of camera k and d(.,.) denotes the geometric distance between two image points.

B. Pose Estimation
Using the previously initialized scene structure, the estimation of the camera pose R^k, t^k corresponding to image I^k is based on already initialized scene points visible in the current image, i.e. a set of 2d-3d-correspondences. Given a set of N already known scene points X = {X_1, ..., X_j, ..., X_N} and feature tracks in the current image I^k, we use all feature tracks x_j^k in the current image with an already associated scene point X_j to estimate the current camera pose in the global coordinate frame, using the algorithm proposed in [13]. Since we would like to reject features lying on independently moving objects or false feature matches, we use a RANSAC scheme to get a robust pose estimate. Hence, we use randomly chosen subsets of the 2d-3d-correspondences and create a pose hypothesis for each subset. Given the pose hypothesis and


the set of visible scene points, we can compute the expected image points. Subsequently, we evaluate the quality of each hypothesis using the Euclidean reprojection error between the expected image points based on the current hypothesis and the measured image features. All correspondences with an error larger than a predefined threshold are classified as outliers. The final estimation uses only the correspondences of the largest set of inliers.

C. Keyframe Selection and Pose Refinement
Since the pose estimation scheme proposed in the previous section does not update the scene structure, the number of visible scene points decreases over time. Hence, if none of the initialized scene points is visible in the current camera image, no pose estimation is possible. To this end, new keyframes are selected when the number of visible scene points in the current image drops below a threshold. Whenever a new frame I^k is selected as keyframe K_{M+1}, where M denotes the number of already initialized keyframes, a bundle adjustment is performed. To keep the computational complexity moderate, we perform the bundle adjustment only over a sliding window of L keyframes. Furthermore, we optimize only the camera poses of the m most recent keyframes inside this window (m < L), keeping the remaining camera poses fixed. This is reasonable, since the other keyframes have been optimized multiple times in previous optimization steps. Note that the reprojection error in the fixed camera poses is taken into consideration during the optimization process. We use the well established implementation of sparse bundle adjustment proposed by Lourakis et al. [14]. After the bundle adjustment, we use the most recent keyframe K_{M+1} and the two previous keyframes {K_{M-1}, K_M} to triangulate all feature points visible in these keyframes using a three-view triangulation scheme [3]. Note that only those features are triangulated that do not correspond to a previously initialized scene point. To reject outliers, we check the reprojection error of the triangulated scene points in all keyframes used for the triangulation. If the Euclidean reprojection error in any of these keyframes exceeds a certain threshold, the triangulated scene point is rejected. The following section describes the approach to recover scale from a monocular image sequence based on assumptions about the camera mounting and the road environment. Furthermore, the proposed approach allows for a reduction of the drift in scale. This part is the main contribution of the paper and consists of a robust triangulation scheme for feature points lying on the weakly textured asphalt as well as the recovery of the translational scale from monocular imagery.

IV. RECOVERING SCALE

A. Planar Assumption
In automotive applications, cameras are often rigidly mounted in a fixed position and at constant orientation. Furthermore, streets in urban environments may reasonably be assumed to be approximately planar in the vicinity of the vehicle (see figure 2).

Fig. 1: This figure illustrates the mounting of the camera. h denotes the height of the camera above ground and R_0 denotes the orientation of the camera with respect to the ground plane.

Fig. 2: This figure depicts the region of interest (ROI) in the image that is assumed to be approximately planar.

Additionally, we assume that the roll and pitch movement of the vehicle is negligible, since highly dynamic maneuvers are not in the scope of this contribution. Nevertheless, roll and pitch of the camera can be taken into account based on the estimated pose of the vehicle. In this case, only the initial pose of the camera w.r.t. the ground plane has to be known. Employing these assumptions provides a clue that may be exploited to upgrade the monocular visual odometry to a metric reconstruction and to compensate for the inevitable drift in scale. For this purpose, we estimate the ground plane based on correspondences of features that may safely be assumed to lie on this plane. The motion of the camera is then scaled in a way that recovers its known height above ground.

B. Patch Matching
While we employ corner-like features (e.g. [19]) for visual odometry, this approach fails to robustly detect features on asphalt due to the lack of strong texture. Furthermore, the images of patches on the street plane close to the camera undergo severe perspective distortions in successive frames caused by the camera movement, which makes it particularly challenging to match them by means of simple block-matching algorithms. However, the known mounting position and orientation of the camera as well as the assumption about the planarity of the area in front of the vehicle may be exploited to overcome these challenges and robustly match features on the street. Let R_0 denote a rotation matrix that accounts for the pitch and roll angle of the camera as depicted in figure 1, and t = R_0 (0, h, 0)^T denote the translation in its reference frame, where h is the height above the road. Further, let the road plane be defined by Y = 0 as shown by the gray coordinate frame in figure 1, and a point on this plane by X = (X, Z, 1)^T in


projective coordinates. Then image points x = (x, y, 1)^T are related to points X on the road plane through a projective transformation x = HX. This transformation is calculated as (see also [8])

x = K [R_0 \mid t]\, (X, 0, Z, 1)^T \qquad (2)
  = K [r_1 \; r_3 \; t]\, (X, Z, 1)^T \qquad (3)
  = H X, \qquad (4)
where r_1 and r_3 denote the first and third column of R_0. Our approach uses the inverse mapping H^{-1} to generate a synthetic fronto-parallel view of a region of interest (ROI) on the street plane, as depicted in figure 3. As the ROI is in general evenly textured, it is difficult to identify significant features that are well suited for tracking. Instead, a predefined set of square patches is used in frame k, located at the edge of the rectangular ROI closest to the vehicle. This choice ensures that a region with high resolution is used for matching as well as the maximum flow for triangulation. The patches are matched against the ROI at k-1, using the sum of absolute differences (SAD) and choosing the minimum value as match. In most cases, the matching exhibits a distinct minimum. In the fronto-parallel projection, patches in successive frames are related by a Euclidean transformation, which improves the robustness of patch based matching schemes significantly. In our experiments, the rotational component of this transformation was moderate and thus did not affect the performance of the SAD based matching negatively. Using the projective transformation H, the correspondences are mapped onto image points as

x_j^k = H X_j^k \qquad (5)

and

x_j^{k-1} = H X_j^{k-1}. \qquad (6)

Note that the purpose of all steps described in this section is to robustly match features on the road surface. Only the positions x_j^{k-1} and x_j^k of these features in I^{k-1} and I^k will be used in the following.

Fig. 3: This figure depicts the synthetic fronto-parallel view of the ROI. The white square marks a single patch that is matched in the preceding frame.

C. Scale Update
In order to relate the height of the camera above the estimated ground plane to the distance of translation, we triangulate the points based on the relative positions of the cameras at frame k-1 and frame k as estimated by the visual odometry. Let

P^{k-1} = K [R_0 \mid 0] \qquad (7)
P^k = K \left[ R^k (R^{k-1})^{-1} R_0 \;\middle|\; t^k - R^k (R^{k-1})^{-1} t^{k-1} \right] \qquad (8)

be projection matrices, where R_0 accounts for tilting as depicted in figure 1. P^k compensates for the poses of both frames in the global world reference frame, i.e. P^k expresses the relative pose of frame k w.r.t. frame k-1. Since we compensate for the mounting orientation of the camera at frame k-1, the X-Z-plane of the camera coordinate frame is parallel to the ground plane and its height coincides with the mounting height h of the camera. Then the points X_j, triangulated with P^{k-1} and P^k and the patch correspondences x_j^{k-1} and x_j^k, should roughly have the same Y-coordinate (expressed in the coordinate frame of image k-1), which corresponds to the height above the road plane in the current scale, i.e. the estimated height of these points depends on the actual translation between the frames. Since we know the height h of the camera and the height of these points, we can rescale the translation vector such that the camera height and the height of the points coincide. To robustly estimate the scale factor, we implemented an outlier rejection based on the reprojection error |x_j^k - P^k X_j| to cope with mismatches, as well as a simple planarity check to discard estimates of the plane that exceed a threshold in the variance of the Y-coordinates of the triangulated points. The scale factor s^k is determined by

s^k = (1 - \gamma)\, s^{k-1} + \gamma\, \frac{h}{\langle Y_j^k \rangle} \qquad (9)

with \gamma \in [0, 1] and \langle Y_j^k \rangle being the mean over all Y-coordinates of the X_j triangulated from the image pair k and k-1. The factor \gamma incorporates the previous scale factor for temporal smoothing of the scale as well as spatial averaging. This smoothing seems reasonable as the drift in scale occurs on a large time-scale, thus making noise rejection a priority over a fast response. The scale factor s^k is used to scale all camera positions i within the current window as well as the structure observed by the frames in the window. The window for which the scaling is applied consists of all frames i \in \{S, ..., k\}, where S is the index corresponding to the oldest keyframe inside the sliding window. As the scale of the motion only affects the translations t^i, the scaled translations may be calculated as \tilde{t}^i = s^k (t^i - t^S) + t^S, where t^S is the translation of the oldest camera pose in the optimization window, which is held constant to ensure a smooth trajectory. Analogously, the point cloud is scaled as

\tilde{X}_j = s^k \left( X_j + (R^i)^{-1} t^i \right) - (R^i)^{-1} \tilde{t}^i. \qquad (10)

The proposed scaling of both the translational camera movements and the scene points guarantees a consistently estimated trajectory of the camera as well as the structure of the scene. Obviously, the structure of the scene is too sparse for obstacle avoidance or path planning purposes, but it is sufficient for a robust motion estimation.
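As a concrete illustration of the scale update in Eq. (9) and the translation rescaling described above, a minimal C++ sketch is given below. Variable names, the data layout, and the simple median-based outlier handling are assumptions made for this sketch and not the authors' MATLAB implementation.

#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

using Vec3 = std::array<double, 3>;

// Exponentially smoothed scale factor, Eq. (9):
// s_k = (1 - gamma) * s_{k-1} + gamma * h / <Y>.
// meanY is the mean Y-coordinate (height) of the triangulated road points in
// the frame of image k-1; h is the known mounting height of the camera.
double updateScale(double prevScale, double meanY, double h, double gamma) {
  return (1.0 - gamma) * prevScale + gamma * (h / meanY);
}

// Rescale the translations of all frames i in {S, ..., k} about the oldest
// keyframe S:  t~^i = s_k * (t^i - t^S) + t^S  (rotations are unchanged).
void rescaleWindowTranslations(std::vector<Vec3>& translations, double sk) {
  if (translations.empty()) return;
  const Vec3 tS = translations.front();            // oldest pose, held constant
  for (Vec3& t : translations)
    for (int d = 0; d < 3; ++d)
      t[d] = sk * (t[d] - tS[d]) + tS[d];
  // The scene points would be rescaled analogously via Eq. (10) so that their
  // reprojections in the rescaled cameras remain consistent.
}

// Robust estimate of <Y>: reject points whose height deviates strongly from
// the median before averaging (a simple stand-in for the planarity check).
double robustMeanY(std::vector<double> ys, double maxDeviation) {
  if (ys.empty()) return 0.0;
  std::nth_element(ys.begin(), ys.begin() + ys.size() / 2, ys.end());
  const double median = ys[ys.size() / 2];
  double sum = 0.0;
  int n = 0;
  for (double y : ys)
    if (std::fabs(y - median) < maxDeviation) { sum += y; ++n; }
  return n > 0 ? sum / n : median;
}

Under these assumptions, one scale correction per frame pair would amount to calling updateScale with the robustly averaged point heights and then rescaling the translations of the current sliding window.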


Fig. 4: This figure depicts the results of the proposed algorithm (axes: X-coordinate [m], Z-coordinate [m]; legend: INS, VO (unscaled), VO (scaled)). The red trajectory is the ground truth given by the INS system. The green and blue trajectories are estimated using the visual odometry algorithm. Blue depicts the trajectory with a single scaling at the beginning of the sequence, green is the trajectory with continuously updated scale.

V. EXPERIMENTAL RESULTS

For our experiments, we used different datasets captured in real environments. To this end, we used our experimental vehicle, which is equipped with a camera and a high accuracy integrated navigation system (INS), which combines inertial measurements with a GPS receiver and wheel speed sensors for measuring motion, pose and orientation of the vehicle. Therefore, the INS yields accurate measurements for the motion of the vehicle along its roll axis and the yaw rate. In the following, these measurements are used as ground truth for our experiments. The used camera is mounted on top of the car and yields images at a resolution of 1344 x 391 pixels at 10 Hz. As features, we used Harris corners [7] in combination with a block matching on the image derivatives to get feature correspondences between two adjacent frames. These feature matches are accumulated to get feature tracks over a series of images. The results for two challenging datasets with different length and speed can be seen in figures 4 and 5. The first sequence (figure 4) includes 600 frames and was captured at an average speed of approximately 10 m/s. The second dataset (figure 5), with a loopy trajectory, consists of 1405 frames; the speed was approximately 15 m/s on average. The demonstrated results were computed offline on a standard PC, using an implementation in MATLAB. The length of the sliding window was set to L = 10, optimizing only the m = 5 most recent camera poses inside this window. These parameters have been chosen to account for the average track length of approximately four frames. The number of

used scene points for the bundle adjustment over the sliding window is approximately 900. This and the good initialization of both the scene points and the camera poses yield a fast convergence of the bundle adjustment. Per frame, five patches of size 50 x 50 pixels were matched on the road plane. For the vast majority of patches, the SAD exhibited a distinct minimum, which gave rise to accurately triangulated points on the plane. The resulting scale was smoothed with a factor of \gamma = 0.8. Figure 4 displays the results of the proposed approach compared to the ground truth trajectory given by the high accuracy INS (red). Compared to the trajectory which has been scaled only once at the beginning of the sequence (blue), the trajectory using the proposed approach (green) exhibits no significant drift in scale. Obviously, there is an increasing deviation from the ground truth position due to the local nature of the algorithm. Since the drift in scale does not affect the rotational component of the motion estimate, the angular error of the scaled and unscaled trajectories is similar. Only the length of the trajectory is affected by the scaling. Note that the trajectories are manually translated and rotated to align them with the ground truth trajectory. Currently, the proposed algorithm (without feature tracking) takes approximately 1.7 s/frame due to the implementation in MATLAB. An implementation in C++, which is planned for future work, should increase the framerate significantly.

VI. CONCLUSION AND FUTURE WORK

In this paper we presented an approach for estimating the 6 DoF ego-motion of a vehicle solely from monocular image sequences. By exploiting constraints induced by the known camera mounting and a reasonable assumption about the planarity of the road surface in the vicinity of the vehicle, the algorithm is capable of continuously recovering the scale of the motion. Furthermore, we can significantly reduce the drift in scale. Our experiments have shown that the scale ambiguity in monocular approaches leads to a large drift in scale. As a result, distant sections of the same estimated trajectory are scaled differently, causing a distortion of the path. Based on feature tracks over a series of images, we estimate the motion of the camera. To this end, we use a continuously updated scene structure to estimate the pose of the camera based on 2d-3d-correspondences between scene points and image features. Whenever the number of visible scene points in the current image drops below a threshold, we select a new keyframe and refine the previous scene structure and the camera poses in a sliding window based on bundle adjustment. Subsequently, we update the scene structure based on newly triangulated feature tracks. The proposed keyframe selection criterion has proven to be sufficient for the proposed algorithm. Nevertheless, more sophisticated approaches for keyframe selection, as proposed e.g. in [22], may be used. The algorithm employs knowledge about the mounting of the camera and a planar assumption to generate a synthetic fronto-parallel view of the road. Despite the lack of strong features on asphalt, patches in successive frames can be matched robustly in this view. Using these matches and the motion


Fig. 5: This figure illustrates the results of the proposed algorithm (green) in comparison to the unscaled version (blue) for a sequence consisting of 1405 frames. As can be seen, the unscaled version suffers from a significant drift in scale. The trajectory exploiting knowledge about the camera mounting exhibits a slight drift due to the local nature of the algorithm. Nevertheless, there is no significant drift in scale.

hypothesis of the visual odometry, we triangulate points to acquire an estimate of the distance to the ground plane as measured by the visual odometry. Based on the deviation of this measurement from the known mounting height, the algorithm continuously scales the motion and structure of the current window. The experimental results suggest that the proposed algorithm is capable of accurately reconstructing the trajectory of the vehicle solely based on visual inputs. Furthermore, continuously correcting the scale results in a significant improvement of the accuracy. Due to the use of only a single camera, the algorithm fails in the presence of dominant independent motion. This may be the case when moving traffic accounts for a significant share of the image. To improve the robustness of the approach, we are working on a more reliable outlier rejection scheme. The planar assumption is sufficiently fulfilled in most urban scenarios. Nevertheless, large changes in the slope of the road cause our approach to estimate a wrong scaling factor. Although it will recover from such perturbations, the resulting trajectory will be incorrectly scaled around this slope. Another case where the algorithm fails to correctly compute the scale factor is the presence of objects inside the ROI. Future work will include the detection and handling of these conditions as well as a real-time implementation.

ACKNOWLEDGEMENTS
The first author gratefully acknowledges the support from the Karlsruhe House of Young Scientists.

REFERENCES
[1] Motilal Agrawal and Kurt Konolige. Real-time localization in outdoor environments using stereo vision and inexpensive GPS. In Proceedings of the International Conference on Pattern Recognition, pages 1063-1068, 2006.
[2] Motilal Agrawal and Kurt Konolige. Rough terrain visual odometry. In Proceedings of the International Conference on Advanced Robotics, August 2007.
[3] Martin Byröd, Klas Josephson, and Kalle Åström. Fast optimal three view triangulation. In Proceedings of the Asian Conference on Computer Vision, pages 549-559, 2007.
[4] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1-16, June 2007.
[5] Christian Dornhege and Alexander Kleiner. Visual odometry for tracked vehicles. In Proceedings of the IEEE International Workshop on Safety, Security and Rescue Robotics (SSRR 2006), 2006.
[6] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.
[7] Chris Harris and Mike Stephens. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, pages 147-151, 1988.
[8] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2008.
[9] Richard I. Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6):580-593, June 1997.
[10] Andrew Howard. Real-time stereo visual odometry for autonomous ground vehicles. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2008), pages 3946-3952, September 2008.
[11] Andrew Edie Johnson, Steven B. Goldberg, Yang Cheng, and Larry H. Matthies. Robust and efficient stereo feature tracking for visual odometry. In IEEE International Conference on Robotics and Automation, pages 39-46, May 2008.
[12] Bernd Kitt, Frank Moosmann, and Christoph Stiller. Moving on to dynamic environments: Visual odometry using feature classification. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5551-5556, Taipei, Taiwan, October 2010.
[13] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2):155-166, February 2009.
[14] Manolis I. A. Lourakis and Antonis A. Argyros. SBA: A software package for generic sparse bundle adjustment. ACM Transactions on Mathematical Software, 36(1):1-30, March 2009.
[15] Annalisa Milella and Roland Siegwart. Stereo-based ego-motion estimation using pixel-tracking and iterative closest point. In Proceedings of the IEEE International Conference on Computer Vision Systems, 2006.
[16] David Nistér, Oleg Naroditsky, and James Bergen. Visual odometry. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 652-659, 2004.
[17] Frank Pagel. Robust monocular egomotion estimation based on an IEKF. In Canadian Conference on Computer and Robot Vision, pages 213-220, Kelowna, BC, Canada, May 2009.
[18] Davide Scaramuzza, Friedrich Fraundorfer, and Roland Siegwart. Real-time monocular visual odometry for on-road vehicles with 1-point RANSAC. In Proceedings of the IEEE Conference on Robotics and Automation, May 2009.
[19] Jianbo Shi and Carlo Tomasi. Good features to track. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 1994), pages 593-600, June 1994.
[20] Hauke Strasdat, J. M. M. Montiel, and Andrew J. Davison. Scale drift-aware large scale monocular SLAM. In Proceedings of the Robotics: Science and Systems Conference, Zaragoza, Spain, June 2010.
[21] Ashit Talukder and Larry Matthies. Real-time detection of moving objects from moving vehicles using dense stereo and optical flow. In IEEE International Conference on Intelligent Robots and Systems (IROS 2004), volume 4, pages 3718-3725, September 2004.
[22] Thorsten Thormählen, Hellward Broszio, and Axel Weissenfeld. Keyframe selection for camera motion and structure estimation from multiple views. In Proceedings of the European Conference on Computer Vision, pages 523-535, Prague, Czech Republic, May 2004.
[23] Bill Triggs, Philip McLauchlan, Richard Hartley, and Andrew Fitzgibbon. Bundle adjustment - a modern synthesis. Lecture Notes in Computer Science, 1883:298-372, 2000.
[24] Koichiro Yamaguchi, Takeo Kato, and Yoshiki Ninomiya. Vehicle ego-motion estimation and moving object detection using a monocular camera. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), pages 610-613, 2006.


Incremental Updates of Configuration Space Representations for Non-Circular Mobile Robots with 2D, 2.5D, or 3D Obstacle Models
Boris Lau Christoph Sprunk Wolfram Burgard

Abstract: This paper presents techniques to incrementally update collision maps, distance maps, and Voronoi diagrams in the configuration space of non-circular mobile robots. Compared to previous work, our approach only updates the cells affected by changes in the environment. Thus, it is applicable in large workspaces and in the presence of unknown or moving obstacles. The c-space collision maps allow for performing highly efficient collision checks in dynamically changing environments, for 2D, 2.5D, and 3D representations of robots and obstacles. By using the proposed c-space distance maps, long trajectories can efficiently be checked for collisions. Finally, our c-space Voronoi diagrams allow effective and complete path planning in narrow spaces. Our algorithms are easy to implement, benefit from parallelization on multi-core CPUs, and can be integrated with state-of-the-art path planners. Experiments demonstrate the effectiveness of our methods and show their applicability to real-world scenarios.

I. INTRODUCTION

Efficient collision checks are a crucial ability for many online systems in autonomous mobile robotics: simulators, path planners, and trajectory optimizers alike need to check every considered pose for collisions. For plain 2D obstacle representations and circular approximations of the robot's footprint, these collision checks are easy to perform. For non-circular robots in passages narrower than their circumcircle, however, circularity is too crude an assumption, and collisions have to be checked for in the three-dimensional configuration space (c-space) of robot poses. Also, even for robots moving on a plane as considered in this paper, 3D obstacles and collisions can be important: applications like robotic transporters, wheelchairs, or mobile manipulators can require the robot to partially move underneath or above obstacles as shown in Fig. 1. In these cases, collision checks can easily become a dominant part of the computational effort. Naive collision checks on 2D occupancy grid maps require one lookup per grid cell in the area occupied by the robot. For 3D obstacle representations like multi-layer surface maps or full 3D maps [1], the effort depends on the robot's volume. However, by convolving a map with the discretized shape of the robot, one can precompute a collision map that marks all colliding poses. With such a map, a collision check requires just a single lookup, even for 3D obstacle representations. From these c-space maps one can derive c-space distance maps that encode the distance to the closest colliding robot pose. Based on this, checking a trajectory for collisions can be made more efficient by extending the intervals of collision
All authors are with the Department of Computer Science at the University of Freiburg, Germany, {lau, sprunkc, burgard}@informatik.uni-freiburg.de. This work has partly been supported by the European Commission under FP7-248258-First-MM, FP7-248873-RADHAR, and FP7-260026-TAPAS.

Fig. 1. For some applications, representing obstacles and robots by their 2D footprints can be sufficient (top-left). For overhanging parts of robots, their load, or obstacles, 2.5D representations are needed (bottom), whereas interaction tasks can also require actual 3D obstacle and robot models (top-right). Robot discretizations as used in our experiments are depicted in blue.

checks from one cell to the respective local collision distance. Similarly, in contrast to regular 2D Voronoi maps that can be used for complete motion planning of circular robots, c-space Voronoi maps can be employed for non-circular robots as well. Early work on c-space obstacle representations assumes static environments [2]. More recent approaches are often motivated by dynamic obstacles, but mostly try to improve the efficiency with the goal to execute the entire computation on every sensor update. This paper presents methods to incrementally update collision, distance, and Voronoi grid maps in the configuration space of non-circular mobile robot poses. Our algorithms only update cells affected by changes, which greatly reduces the computational costs. We consider all obstacle types shown in Fig. 1 and thus provide means for efficient collision checking in real-world applications. The algorithms are easy to implement and benefit from parallelization on multiple cores. After discussing related work, we describe dynamic c-space collision, distance, and Voronoi maps in Sects. III and IV, and their application to path planning in Sect. V. Finally, we present extensive experiments in Sect. VI.

II. RELATED WORK

Algorithms for efficient collision checking in 3D continue to be an active area of research. For example, Tang et al. [3] recently proposed a connection collision query algorithm that


detects collisions of triangle meshes moving between given states. Thus, it can be used for sampling-based path planning. For online feasibility, Pan et al. use multi-core GPUs for collision queries [4]. Still, the cost per collision check depends on the number of polygons used to represent the tested objects. Schlegel precomputes collision distances for circular arcs as a function of relative obstacle location and curvature [5]. Thus, the kinematic analysis is done offline, and collision distances can be obtained with one lookup per obstacle. Precomputing c-space representations instead further reduces the online effort for collision checks to a single lookup. Since Lozano-Pérez's 1983 paper on c-space planning among static polyhedral obstacles, many approaches were proposed to reduce the cost for computing c-space obstacles [2]. Because of the relevance of this problem, researchers still work on improving the efficiency [6]. Convolving a gridmap of a robot's environment with an image of its footprint yields a discrete c-space map. To always reflect the current state of previously unknown or moving obstacles, these maps need to be updated regularly. Kavraki proposed to use the fast Fourier transform (FFT) to reduce the computational cost of the convolution [7], and Therón et al. added parallelization as a further speed-up [8]. Later, the same group proposed a multi-resolution approach to reduce memory and computational load in large workspaces [9]. To speed up path planning for autonomous cars, Ziegler and Stiller decompose the shape of the robot into circular discs [10]. As a first dynamic approach for changing environments, Wu et al. precompute colliding robot poses for each potentially occupied cell in the work space of a manipulator [11]. Taking the union of the colliding poses for a given set of occupied cells yields the c-space obstacle map without further recomputation. For mobile robots, however, the size of the operational area can make the database storage and the online computation of the union infeasible. In contrast, our method for updating c-space collision maps is truly incremental: it executes a regular map convolution in an offline phase, and during online application only updates the cells affected by changes in the environment. In our previous work we presented algorithms to incrementally update two-dimensional Euclidean distance maps and Voronoi diagrams [12]. For path planning with circular robots, 2D Voronoi diagrams are appealing roadmaps since they cover all topologically different paths in a map with a small number of cells. For rectangular robots, however, 2D Voronoi planning loses its completeness property, and one has to repair paths in narrow areas where following the Voronoi diagram leads to collisions, e.g., by using RRTs as proposed by Foskey et al. [13]. In this paper we combine the dynamic distance and Voronoi maps proposed in [12] with our novel incrementally updatable c-space collision maps. In this way, we overcome the aforementioned problem and can perform complete Voronoi planning in the configuration space of non-circular robots. Although we use A* planning as an example application, our approach can be combined with other planners, e.g., D* Lite [14], or rapidly-exploring random trees (RRT) with Voronoi-biased sampling [15].

Fig. 2. Convolving a map M(x, y) with a representation of the robot's shape S_theta(i, j) for a given orientation theta yields a collision map C_theta(x, y), according to Eq. (1). Each cell <x, y> in C_theta counts the number of cells in the robot shape that collide with occupied cells in M, given the robot is at pose <x, y, theta>.

Fig. 3. A newly occupied cell in the map M (red) increments the collision count C for all robot poses that cause a collision with the obstacle in the updated cell. In this way, the collision map is updated (red cells) without recomputing the values for unaffected (gray) cells. Alg. 1 implements this procedure as well as the corresponding case for newly emptied cells.

III. DYNAMIC C-SPACE COLLISION MAPS

This section presents a method to incrementally update c-space collision maps for non-circular mobile robots moving on a plane. It can be applied to all obstacle types shown in Fig. 1. For the sake of clarity we start with the 2.5D representation with overhanging obstacles shown in Fig. 1 (bottom-right), and discuss the adaptation to the other obstacle models later. Let M(x, y) be a grid map that represents the vertical clearance, i.e., the height of free space above the floor, with zeros for completely occupied cells. Consider a robot moving on the floor with continuous orientation theta with respect to the map coordinate system. We represent the discretized shape of the robot for a given orientation theta by a map S_theta(i, j) that stores the height of the robot for every cell of its footprint. S_theta has the same resolution and orientation as the grid map M, whereas its origin S_theta(0, 0) is located at the robot's center. A convolution-type conjunction of M and S_theta yields a count map C_theta(x, y) as shown in Fig. 2. Each cell <x, y> in C_theta stores the number of cells the robot collides with when located there:

C_\theta(x, y) = \sum_{i,j} \mathrm{eval}\big( M(x+i,\, y+j) < S_\theta(i, j) \big), \qquad (1)

where eval(true) = 1 and eval(false) = 0. If we discretize theta and stack the C_theta(x, y) for all discrete theta, we obtain the robot's c-space map C(x, y, theta) for M. Clearly, by testing C(x, y, theta) > 0 we can check if the discretized pose <x, y, theta> is colliding.

A. Discretization of Orientations
An appropriate discretization of theta ensures that if two adjacent poses <x, y, theta_i> and <x, y, theta_{i+1}> are collision-free according to C, intermediate orientations theta in [theta_i, theta_{i+1}] are collision-free as well. Under this constraint we seek to discretize theta as coarsely as possible to keep the number of theta-layers in C small. In occupancy grid maps, the actual location of obstacles can be anywhere in the cells they occupy. Therefore, one usually


assumes an additional safety margin m around the robot, e.g., of m = 1 pixel unit. With this, we can formulate a bound on the angular resolution for the discretization of theta as follows: if the robot rotates from theta_i to theta_{i+1}, each point on the robot moves along an arc. The maximum arc length occurs at the outmost point of the robot, which is at the radius r (in pixels) of the circumcircle around its center. By choosing a resolution of |theta_i - theta_{i+1}| = m/r, we ensure that even in the worst case an obstacle collides only with the safety margin but not with the actual robot. Depending on the shape of the robot, less conservative bounds on the discretization can be formulated.

Alg. 1 Dynamic Update of C-space Collision Map
function UpdateVerticalClearance(x, y, v_new)
 1: v_old <- M(x, y)
 2: M(x, y) <- v_new
 3: for all theta do
 4:   for all <x', y'> in {<x - i, y - j> | S_theta(i, j) != 0} do
 5:     if v_new < S_theta(i, j) and v_old >= S_theta(i, j) then
 6:       C(x', y', theta) <- C(x', y', theta) + 1
 7:       if C(x', y', theta) = 1 then NewOccupied(x', y', theta)
 8:     else if v_new >= S_theta(i, j) and v_old < S_theta(i, j) then
 9:       C(x', y', theta) <- C(x', y', theta) - 1
10:       if C(x', y', theta) = 0 then NewEmpty(x', y', theta)

B. Incremental Update of the C-space Map
Unknown or moving obstacles cause changes in the environmental representation of a robot. For the 2.5D obstacle model, a change is given by an updated vertical clearance v_new for a cell <x, y> in M. To refresh C incrementally rather than computing it from scratch, we only update the affected parts of the sum in Eq. (1) according to Alg. 1. See Fig. 3 in comparison to Fig. 2 for an illustration. The algorithm separately updates the theta-layers of C, and can thus be parallelized (line 3). For each cell <i, j> of the robot shape S_theta(i, j) we visit the robot position <x', y'> that lets <i, j> fall on <x, y> (line 4). These cells can be efficiently selected using standard drawing algorithms for rasterized images. If the new vertical clearance v_new in <x, y> causes a collision with S_theta(i, j) while v_old did not, the collision counter of <x', y'> is incremented (line 5), since this represents a new collision candidate cell. Vice versa, if v_new is collision-free and v_old collides, the counter is decremented (line 8), since a collision candidate was removed. Whenever the count changes from 0 to 1 or from 1 to 0, the pose <x', y', theta> is newly occupied (line 7) or emptied (line 10), respectively. These events can be used to trigger further computation, e.g., to update the c-space distance map and Voronoi diagram discussed in Sect. IV.

C. Adaptation to Other Kinds of Obstacles and Robots
Until here, we assumed overhanging obstacles and a robot on the floor that can move underneath obstacles as in Fig. 1 (bottom-right). By reversing the comparisons of robot height and vertical clearance in Eq. (1) and Alg. 1 (lines 5 and 8), this can easily be adapted to obstacles elevating from the floor and robots with overhanging load or parts as in Fig. 1 (bottom-left).
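For concreteness, the following is a minimal C++ sketch of how the update routine of Alg. 1 could be organized for the 2.5D model. The class layout, member names, and event callbacks are illustrative assumptions and not the authors' implementation; the theta-loop is independent per layer and could, for instance, be parallelized with OpenMP.

#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// One occupied cell <i, j> of the discretized robot shape S_theta with its height.
struct ShapeCell { int i, j; float height; };

class CSpaceCollisionMap {
public:
  CSpaceCollisionMap(int w, int h, std::vector<std::vector<ShapeCell>> shapes)
      : width(w), height(h), shape(std::move(shapes)),
        M(w * h, 1e9f), C(shape.size() * w * h, 0) {}

  // Callbacks triggered when a pose <x', y', theta> becomes occupied or free.
  std::function<void(int, int, int)> newOccupied, newEmpty;

  // Incremental update of C for a changed vertical clearance (Alg. 1).
  void updateVerticalClearance(int x, int y, float vNew) {
    const float vOld = M[idx2(x, y)];
    M[idx2(x, y)] = vNew;
    for (int t = 0; t < (int)shape.size(); ++t) {     // theta-layers, parallelizable
      for (const ShapeCell& s : shape[t]) {
        const int xp = x - s.i, yp = y - s.j;         // affected robot pose
        if (xp < 0 || yp < 0 || xp >= width || yp >= height) continue;
        const bool collNew = vNew < s.height;         // clearance below robot height
        const bool collOld = vOld < s.height;
        if (collNew && !collOld) {                    // new collision candidate
          if (++C[idx3(xp, yp, t)] == 1 && newOccupied) newOccupied(xp, yp, t);
        } else if (!collNew && collOld) {             // collision candidate removed
          if (--C[idx3(xp, yp, t)] == 0 && newEmpty) newEmpty(xp, yp, t);
        }
      }
    }
  }

  // Single-lookup collision check for a discretized pose <x, y, theta>.
  bool isColliding(int x, int y, int t) const { return C[idx3(x, y, t)] > 0; }

private:
  int idx2(int x, int y) const { return y * width + x; }
  int idx3(int x, int y, int t) const { return (t * height + y) * width + x; }
  int width, height;
  std::vector<std::vector<ShapeCell>> shape;  // per-theta discretized robot shape
  std::vector<float> M;                       // vertical clearance map
  std::vector<uint16_t> C;                    // per-pose collision counts
};

For the plain 2D and elevating-obstacle cases mentioned above, only the two height comparisons in this sketch would change, mirroring the adaptation of lines 5 and 8 of Alg. 1.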

Fig. 4. C-space distance map (top) and Voronoi diagram (bottom) for a rectangular robot, both obtained by stacking layers computed in 2D for different robot orientations . For readability, only half of the layers are shown, the other half is identical due to the robots symmetry. In the visualization at the top, cells above the bottom layer have a different color scaling, and were removed when exceeding a distance threshold.

robots with overhanging load or parts as in Fig. 1 (bottom-left). For plain 2D robot and obstacle models, the heights vnew and vold are binary values that encode occupied and free. The other techniques presented in this paper apply without modications. For some applications, the obstacles and the robot have to be represented in full 3D as in Fig. 1 (top-right). The height comparisons in Alg. 1, lines 5 and 8, then have to consider lists of obstacle heights. If the robot shape is approximated by a set of vertical columns with a given upper and lower end as in Fig. 1, one can also use a separate shape map for each column. By only considering the affected columns in line 4, the c-space collision map can be efciently updated. Robots with a symmetric shape with respect to their center cause a part of the -layers in C px, y, q to be redundant. For example, a rectangular robot at orientation 180 causes the same c-space obstacles as at 0 . Omitting the respective layers when iterating over in Alg. 1 (line 3) saves a substantial part of the computational effort and memory consumption. IV. C- SPACE D ISTANCE M APS AND VORONOI D IAGRAMS The cells of a distance map D measure the distance to their closest obstacle in the map M . Given a three-dimensional cspace collision map C px, y, q as dened above, one could consider a 3D distance map that uses a 3D distance measure including the angle. As discussed by Canny [16], it is however more desirable to employ 2D Euclidean distances per -layer. Thus, we stack Euclidean distance maps D px, y q computed for every c-space map layer C px, y q, yielding the c-space distance map Dpx, y, q as shown in Fig. 4 (top).
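To make the layered storage and the counter update of Alg. 1 concrete, the following is a minimal C++ sketch; the dense std::vector layout, the class and function names, and the callback hooks towards the distance-map update of Sect. IV are illustrative assumptions rather than the authors' implementation.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of the c-space collision map C(x,y,theta) from Sect. III, stored as
// one dense counter layer per discretized orientation.  The number of layers
// would be chosen according to the bound |theta_i - theta_{i+1}| <= m/r from
// Sect. III-A.
struct CSpaceCollisionMap {
  int width = 0, height = 0, numThetas = 0;
  std::vector<std::vector<uint16_t>> counters;    // counters[theta][y*width + x]

  // Rotated robot shape per layer: offsets (i,j) relative to the robot center
  // and the robot height S_theta(i,j) at that cell (0/1 in the plain 2D case).
  struct ShapeCell { int i, j; float height; };
  std::vector<std::vector<ShapeCell>> shape;      // shape[theta]

  // Hooks corresponding to NewOccupied/NewEmpty in Alg. 1 (used in Sect. IV).
  std::function<void(int, int, int)> onNewOccupied;
  std::function<void(int, int, int)> onNewEmpty;

  // Single-lookup collision check for the whole robot at pose (x,y,theta).
  bool colliding(int x, int y, int theta) const {
    return counters[theta][y * width + x] > 0;
  }

  // Counter-based update of Alg. 1 for a changed vertical clearance at (x,y);
  // vOld is the previous clearance stored in M, vNew the new one.
  void updateVerticalClearance(int x, int y, float vOld, float vNew) {
    for (int theta = 0; theta < numThetas; ++theta) {          // line 3, parallelizable
      for (const ShapeCell& c : shape[theta]) {                // line 4
        const int xp = x - c.i, yp = y - c.j;                  // pose that puts (i,j) on (x,y)
        if (xp < 0 || yp < 0 || xp >= width || yp >= height) continue;
        uint16_t& cnt = counters[theta][yp * width + xp];
        const bool collidesNew = vNew < c.height;              // clearance below robot height
        const bool collidesOld = vOld < c.height;
        if (collidesNew && !collidesOld) {                     // lines 5-7
          if (++cnt == 1 && onNewOccupied) onNewOccupied(xp, yp, theta);
        } else if (!collidesNew && collidesOld) {              // lines 8-10
          if (--cnt == 0 && onNewEmpty) onNewEmpty(xp, yp, theta);
        }
      }
    }
  }
};
```

With this layout, a full-robot collision check during planning reduces to the single counter lookup in colliding(), which is the property exploited in the benchmarks of Sect. VI-A.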


Fig. 6. Maps of the environments we used to benchmark collision checks (FR079 and FR101).
Fig. 5. Connecting start and goal to the Voronoi graph (green) during planning: using the shortest connection (top), the planned path (blue) can change abruptly for small changes of the start configuration, even for sightline-pruned paths (dashed). We create Voronoi bubbles around start and goal, and use goal-directed search therein, which yields more stable paths (bottom).

In 2D, Voronoi graphs are the union of points whose two closest obstacles are at the same distance. Similar to distance maps, we compute a Voronoi diagram V_θ(x, y) for every C_θ(x, y). Stacking these Voronoi diagrams results in the c-space Voronoi diagram V(x, y, θ) as shown in Fig. 4 (bottom). If θ is discretized according to Sect. III-A, Voronoi lines in neighboring layers connect to unbroken surfaces. We update the layers of the c-space distance map D and Voronoi diagram V incrementally as proposed for 2D in our previous work [12]. The functions SetObstacle(x, y) and RemoveObstacle(x, y) described therein are used to specify newly occupied or freed cells. UpdateDistanceMap() performs the update with a dynamic brushfire algorithm. The event NewOccupied(x', y', θ) in Alg. 1 calls SetObstacle(x', y') for the dynamic distance map D_θ(x, y), and NewEmpty(x', y', θ) calls RemoveObstacle(x', y'). After the update of C(x, y, θ), we execute UpdateDistanceMap() in parallel for every D_θ, which completes the update of D(x, y, θ) and V(x, y, θ).

V. C-SPACE VORONOI PATH PLANNING

Given a c-space Voronoi map as described above, a goal-directed graph search on the Voronoi cells is no different from regular planning on 3D grids, except for the cyclic nature of the orientation dimension. This section details important aspects of Voronoi planning in dynamic environments. In general, the start and goal locations of a planning problem are not on the Voronoi graph. Trivial approaches search for the closest Voronoi cell at both locations, and connect them with straight lines to the graph [17]. This is problematic in practice, since a small change of the start pose can substantially change the planned path as shown in Fig. 5 (top row). Our approach tackles this problem by applying the following steps (step 3 is sketched in code after this list):
1) insert a virtual obstacle at the start and goal location,
2) update the Voronoi map for all layers,
3) use brushfire expansion to mark cells in Voronoi bubbles,
4) plan a path from start to goal,
5) undo the changes made by 1) and 2).
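As an illustration of step 3, the following C++ sketch marks the start or goal bubble in a single θ-layer with a 4-connected brushfire expansion that stops at obstacles and at the enclosing Voronoi lines; the grid representation and all names are assumptions made for this example, not the paper's code.

```cpp
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Marks the "bubble" around a seed cell (start or goal) in one theta-layer.
// occupied[idx] encodes C_theta(x,y) > 0, voronoi[idx] encodes V_theta(x,y).
std::vector<uint8_t> markBubble(int width, int height, int seedX, int seedY,
                                const std::vector<uint8_t>& occupied,
                                const std::vector<uint8_t>& voronoi) {
  std::vector<uint8_t> inBubble(static_cast<size_t>(width) * height, 0);
  std::queue<std::pair<int, int>> frontier;
  inBubble[seedY * width + seedX] = 1;
  frontier.push({seedX, seedY});
  const int dx[4] = {1, -1, 0, 0};
  const int dy[4] = {0, 0, 1, -1};
  while (!frontier.empty()) {
    const auto [x, y] = frontier.front();
    frontier.pop();
    for (int k = 0; k < 4; ++k) {                  // 4-connected wavefront
      const int nx = x + dx[k], ny = y + dy[k];
      if (nx < 0 || ny < 0 || nx >= width || ny >= height) continue;
      const int idx = ny * width + nx;
      if (inBubble[idx] || occupied[idx]) continue;
      inBubble[idx] = 1;                           // cell may be used by the A* in step 4
      if (!voronoi[idx]) frontier.push({nx, ny});  // do not expand past Voronoi lines
    }
  }
  return inBubble;
}
```

The A* in step 4 then only expands cells that are either on the Voronoi graph or marked in one of the two bubbles, as explained in the following paragraph.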

Through the virtual obstacles inserted at start and goal, these locations get enclosed by the Voronoi graph, which generates a bubble-like area as shown in Fig. 5 (bottom row). For all θ-layers in parallel, the brushfire expansion starts at the start and at the goal position, and marks all visited cells while expanding like wavefronts up to the enclosing Voronoi lines. Then, a goal-driven A* search is initiated at the start location. It is restricted to only consider Voronoi cells and the ones in the start and goal bubble, marked by the brushfire expansion. In this way, the search expands from the start onto the Voronoi graph, follows Voronoi lines, and then connects to the goal when reaching the goal bubble. Since the whole path is the result of goal-directed graph search, the consecutive paths planned for a moving robot are very similar to each other and do not change abruptly (see Fig. 5). The brushfire expansion has to be run for each θ-layer separately using a 4-connected neighborhood. This ensures that the expansion is contained in the start and goal bubbles. Since the update of the c-space representations is run separately for the layers as well, the whole procedure can be parallelized except for the actual planning routine in step 4).

VI. EXPERIMENTS

This section presents application examples and benchmarks for our incrementally updatable c-space representations. We performed tests on two sequences of 2D laser range data with 200 frames each. The data was recorded by a robot moving in areas where walking people heavily affected the traversable space, namely an office building (FR079) and a large foyer (FR101). The algorithms were initialized with the gridmaps shown in Fig. 6, using a resolution of 0.05 m per grid cell. The maximum range of the laser scanner was limited to 5 m. To get 2.5D and 3D obstacles, we augmented the laser data with random height values between 0 m and the robot height. In 2D, we assumed a medium-sized rectangular robot (0.85x0.45 m) and a large one (1.75x0.85 m). In 2.5D, we modeled a wheelchair with a low front and a high rear part, as in Fig. 1 (bottom-right). In 3D, the robot was modeled like a Willow Garage PR2, with a frontal extension for the base and the fixed arms (see Fig. 1 top-right). The C++ implementation of our algorithms was executed on an Intel Core i7 2670 MHz, using OpenMP for parallelization with up to 6 threads.

A. Collision Checks for Non-Circular Robots

The c-space collision map presented in Sect. III requires computation of the incremental update in every time step, but then, each collision check for the whole robot takes only a single lookup. In the 2D model, we exploit the symmetry of the rectangular robot as described in Sect. III-C (see the sketch below).
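The symmetry exploitation mentioned above can be realized as a simple index mapping; the helper below is an assumption about one possible implementation for a 180°-symmetric shape, not code taken from the paper.

```cpp
// For a robot shape that is symmetric under a 180-degree rotation (e.g., a
// rectangle), the layers for theta and theta + 180 deg are identical
// (Sect. III-C), so only numThetas/2 layers need to be stored and updated.
// Example lookup: counters[canonicalThetaIndex(t, K)][y * width + x] > 0.
inline int canonicalThetaIndex(int thetaIdx, int numThetas) {
  return thetaIdx % (numThetas / 2);
}
```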


Fig. 7. Computation time for different collision checking routines for two sequences and two robot models. The update required in every frame for the c-space collision map pays off starting from 10,000 collision checks per frame. The plot shows mean and standard deviations for 10 runs.
Fig. 9. Map of a factory floor (9.5x15.4 m) with start location (A) and three goals (B), (C), and (D). Example paths from (A) to (D) are shown for two different planners. The sampling-based planner (right) is challenged by narrow passages, while the performance of the Voronoi planner (left) is unaffected.

Fig. 10. Planning time and path length for three planners and the three planning tasks in Fig. 9. The plot shows mean and min/max for 20 runs. In contrast to the Voronoi planner, the sampling-based planners require several orders of magnitude more planning time for each narrow passage in the path.

Fig. 8. Collision check performance for different robot and obstacle models, using our updatable c-space collision map (left) vs. the straight-forward occupancy gridmap approach (right). The costs for updating the c-space map are remedied by the faster collision checks for 10,000 checks or more per frame. The plot shows mean and standard deviations for 20 runs.

We compare our method to a previous collision checking approach for rectangular robots that uses incrementally updatable 2D distance maps [18]. As a baseline, we also include a straight-forward approach that checks every cell of the robot's footprint for collision using an up-to-date 2D occupancy map. The results of this benchmark are shown in Fig. 7. The time required for updating the distance and c-space maps is shown by the first data point of each plot (zero collision checks). The slopes of the curves depend on the cost per collision check. In contrast to the distance map approach, the update time for the c-space map grows with the size of the robot (right vs. left column), but does not suffer from the open area in FR101 (bottom vs. top row). The update for the c-space collision map pays off for 10,000 or more collision checks, which can easily be required during path planning or trajectory optimization. In comparison, the break-even point for a single disc-shaped

object was at 22,400 for the disc-decomposition method by Ziegler et al., and 5·10^6 for the full c-space [10]. We repeat the experiment, but with 2.5D and 3D obstacles and robots this time. Compared to the 2D rectangular robot (dashed), the costs for the c-space update with 2.5D and 3D are higher, since the robots are not symmetric anymore and consist of two and three parts, respectively, see Fig. 8 (left). However, the costs per collision check (slope of the plots) are the same, as opposed to the curves for the straight-forward occupancy grid map approach (right). In all cases, the update of the c-space map takes less than 15 ms per frame. Performing 150,000 collision checks per frame additionally requires at most another 15 ms. This corresponds to 10·10^6 collision checks per second for arbitrary robot shapes, which clearly outperforms even modern GPU-based approaches with 0.5·10^6 collision checks per second for simple polygons [4].

B. Path Planning using C-space Voronoi Maps

The c-space Voronoi maps presented in this paper provide means for complete grid map planning for non-circular omnidirectional robots using standard graph search algorithms


like A* or D* Lite [14]. With our algorithms for incremental updates they are applicable in dynamic environments. This experiment uses the Voronoi bubble technique proposed in Sect. V. We use A* to plan paths for the large robot model (see above) on the grid map of the factory floor shown in Fig. 9. The start pose is given by (A), and three possible goal poses by (B), (C), and (D). Each of the consecutive goals requires traversing another narrow passage. For comparison, we test our method against the KPiece and RRT implementations available in the Open Motion Planning Library [19]. All planners use our c-space map for collision checking. The average resulting planning times and path lengths for 20 runs per start-goal combination are shown in Fig. 10. Each additional narrow passage requires several orders of magnitude more planning time for the sampling-based planners, while the time taken by the Voronoi planner grows roughly linearly with the path length. Using per-cell collision checking rather than the c-space collision maps for the sampling-based planners increases the computation times by a factor of 3. As another application example, we plan the path of a PR2 robot using a c-space Voronoi map generated from real 3D point cloud data (see Fig. 11). After a precomputation phase of 0.5 s, planning a path on the incrementally updatable c-space Voronoi map takes less than 2.5 ms. Clearly, Voronoi planning is of advantage in narrow areas as long as the grid resolution is fine enough. Our incrementally updatable c-space Voronoi representation allows to apply this idea to non-circular robots in dynamic environments, and could also be used in Voronoi sampling routines of other path planners [15].

VII. CONCLUSION

This paper presents incrementally updatable collision maps, distance maps, and Voronoi diagrams for non-circular robots. During initialization, these representations are computed using a given grid map or point cloud. During online application, our methods only update cells that are affected by changes in the environment, and can thus be used in real-world scenarios with unexpected or moving obstacles. We consider different obstacle representations, namely a robot moving on a plane with overhanging obstacles, or vice versa, obstacles elevating from the ground, and a robot with overhanging parts. Our approach is also applicable to 2D and full 3D obstacle representations, and can exploit symmetries in the robot shape. In our experiments we showed that the presented methods allow for a high number of collision checks and reliable path planning at frame rates required for online real-world applications. To further increase the efficiency of our algorithms, e.g., in higher-dimensional c-spaces of manipulators, one could combine them with hierarchical decomposition as proposed by Blanco et al. [9].

Fig. 11. Table-docking with a PR2 robot in a 3D map using Voronoi planning.

REFERENCES

[1] K. Wurm, A. Hornung, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: A probabilistic, flexible, and compact 3D map representation for robotic systems," in ICRA Workshop on Best Practice in 3D Perception and Modeling for Mobile Manipulation, Anchorage, 2010.
[2] K. D. Wise and A. Bowyer, "A survey of global configuration-space mapping techniques for a single robot in a static environment," International Journal of Robotics Research, vol. 19, no. 8, pp. 762-779, 2000.
[3] M. Tang, Y. J. Kim, and D. Manocha, "CCQ: Efficient local planning using connection collision query," in Algorithmic Foundations of Robotics IX, ser. Springer Tracts in Advanced Robotics (STAR), 2011, vol. 68.
[4] J. Pan and D. Manocha, "GPU-based parallel collision detection for real-time motion planning," in Algorithmic Foundations of Robotics IX, ser. Springer Tracts in Advanced Robotics (STAR), 2011, vol. 68.
[5] C. Schlegel, "Fast local obstacle avoidance under kinematic and dynamic constraints for a mobile robot," in Intl. Conf. on Intelligent Robots and Systems (IROS), Victoria, Canada, 1998.
[6] E. Behar and J.-M. Lien, "A new method for mapping the configuration space obstacles of polygons," Department of Computer Science, George Mason University, Tech. Rep. GMU-CS-TR-2011-11, 2010.
[7] L. E. Kavraki, "Computation of configuration-space obstacles using the fast Fourier transform," IEEE Transactions on Robotics and Automation, vol. 11, no. 3, pp. 408-413, June 1995.
[8] R. Therón, V. Moreno, B. Curto, and F. J. Blanco, "Configuration space of 3D mobile robots: Parallel processing," in 11th Intl. Conf. on Advanced Robotics, vol. 13, 2003.
[9] F. J. Blanco, V. Moreno, B. Curto, and R. Therón, "C-space evaluation for mobile robots at large workspaces," in IEEE Intl. Conf. on Robotics and Automation (ICRA), Barcelona, Spain, April 2005.
[10] J. Ziegler and C. Stiller, "Fast collision checking for intelligent vehicle motion planning," in IEEE Intelligent Vehicles Symposium, San Diego, CA, USA, June 2010.
[11] X. J. Wu, J. Tang, and K. H. Heng, "On the construction of discretized configuration space of manipulators," Robotica, vol. 25, pp. 1-11, 2006.
[12] B. Lau, C. Sprunk, and W. Burgard, "Improved updating of Euclidean distance maps and Voronoi diagrams," in IEEE Intl. Conf. on Intelligent Robots and Systems (IROS), Taipei, Taiwan, October 2010.
[13] M. Foskey, M. Garber, M. C. Lin, and D. Manocha, "A Voronoi-based hybrid motion planner," in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Maui, HI, USA, October 2001.
[14] S. Koenig and M. Likhachev, "D* Lite," in Eighteenth National Conference on Artificial Intelligence (AAAI), 2002, pp. 476-483.
[15] L. Zhang and D. Manocha, "An efficient retraction-based RRT planner," in IEEE Intl. Conf. on Robotics and Automation (ICRA), Pasadena, 2008.
[16] J. Canny, "A Voronoi method for the piano-movers problem," in IEEE Intl. Conf. on Robotics and Automation (ICRA), March 1985.
[17] R. Geraerts and M. Overmars, "A comparative study of probabilistic roadmap planners," in Algorithmic Foundations of Robotics V, ser. Springer Tracts in Advanced Robotics, 2004, vol. 7, pp. 43-58.
[18] C. Sprunk, B. Lau, P. Pfaff, and W. Burgard, "Online generation of kinodynamic trajectories for non-circular omnidirectional robots," in IEEE Intl. Conf. on Robotics and Automation (ICRA), Shanghai, 2011.
[19] I. A. Sucan, M. Moll, and L. E. Kavraki, "The open motion planning library (OMPL)," 2010. [Online]. Available: http://ompl.kavrakilab.org/


Landmark Placement for Accurate Mobile Robot Navigation


Maximilian Beinhofer, Jörg Müller, Wolfram Burgard
Department of Computer Science, University of Freiburg, Germany

Abstract: Being able to navigate accurately is one of the basic capabilities a mobile robot needs to effectively execute a variety of tasks, including collision avoidance, docking, manipulation, and path following. One popular approach to achieve the desired accuracy is to use artificial landmarks for a sufficiently accurate localization. In this paper, we consider the problem of optimally placing landmarks for robots navigating repeatedly along a given set of trajectories. Our method aims at minimizing the number of landmarks for which a bound on the maximum deviation of the robot from its desired trajectory can be guaranteed with high confidence. The proposed approach incrementally places landmarks utilizing linearized versions of the system dynamics of the robot, thus allowing for an efficient computation of the deviation guarantee. To verify this guarantee for the real and possibly non-linear system dynamics of the robot, our method then performs a Monte Carlo simulation. We evaluate our algorithm in extensive experiments carried out both in simulation and with a real robot. The experiments demonstrate that our method requires substantially fewer landmarks than other approaches to achieve the desired accuracy.
Index Terms: Localization, Autonomous Navigation, Service Robots

Fig. 1. The miniature e-puck robot we use in the experiments. Mounted on top is a wireless webcam detecting uniquely identifiable visual markers attached to the ceiling.

I. INTRODUCTION

One of the main challenges for mobile service robots is autonomous navigation. Especially in industrial applications, a high degree of accuracy is required to avoid collisions and, e.g., to ensure reliable docking maneuvers. However, robots navigating autonomously tend to deviate from their desired trajectory due to uncertainty in motion and position. Typically, they are equipped with on-board sensors to estimate the deviation and react according to a feedback control law. In service tasks, robots usually have to repeatedly execute a fixed number of trajectories. As real-world environments often contain dynamic and ambiguous areas, many practical applications rely on artificial landmarks placed along these trajectories to allow for accurate self-localization [6, 7]. In these applications, placing the artificial landmarks is often expensive or the computational power of the robot is limited. Therefore, it is beneficial to select the smallest possible number of landmark positions which still guarantees an accurate navigation. In this paper, we present a novel algorithm for landmark placement. It computes a landmark configuration which guarantees that the deviation of the robot from its desired trajectory stays below a user-defined threshold d_max with high confidence. To check if the guarantee holds, we use the specific properties of the robot and its navigation task in the landmark placement algorithm. The algorithm works in two stages. In a

first stage, it uses a linearized motion model and observation model of the robot to efficiently place landmarks in an incremental fashion. This stage aims at placing the smallest number of landmarks for which the deviation guarantee holds. In a second stage, the algorithm employs a Monte Carlo simulation for the computed landmark configuration to check if the guarantee also holds for the possibly non-linear models. Our approach has several characteristics which make it especially useful for mobile robot navigation. It can deal with arbitrary trajectories, and the maximum allowed deviation of the robot can be defined individually for every part of the trajectories. Taking into account the properties of the individual robotic system results in customized landmark sets: while high-precision robots only need few landmarks for reaching the deviation guarantee, low-cost systems typically require more landmarks. As our placement algorithm efficiently evaluates the guarantee using linear models, it can deal even with large instances of the landmark placement problem (i.e., long trajectories). Note that our incremental method simultaneously determines the number of landmarks needed and their positions to meet the desired guarantee. This paper is organized as follows. After discussing related work in the following section, we formalize the problem definition in Section III. In Section IV we describe the prediction of the deviation from the trajectory in linearized systems. Afterwards, in Section V, we present our incremental landmark placement algorithm. Finally, we provide extensive experiments in which we evaluate the algorithm in simulation as well as with the real e-puck robot [9] depicted in Fig. 1.

II. RELATED WORK

In the past, the problem of finding an optimal set of landmark positions has been addressed from several points


of view. Sala et al. [12] select landmark positions so that at every position in the map, at least k landmarks are observable. Jourdan and Roy [8] consider a set of possible target positions. They place sensors on the walls of buildings to minimize the average position error bound in the sensor network. Unlike these methods, our approach takes into account the full specification of the robot and its navigation task. Vitus and Tomlin [15] also consider the full problem specification to place sensors in the environment. They approximate the a-priori covariances with the a-posteriori covariances of the most likely run of the robot. Similar to our approach, van den Berg et al. [3] evaluate sensor positions using the exact a-priori distributions in a linearized system. As they focus mainly on path planning, they restrict themselves to randomly sampled positions of a single sensor. Our previous work [1] selects landmarks maximizing the mutual information between the sensor readings and the states of the robot. It solely applies Monte Carlo simulations to estimate the a-priori distributions, which makes it computationally more demanding. While all of the approaches above place artificial landmarks or sensors before operation of the robot, the following approaches decide whether to utilize observed landmarks during operation. In contrast to our method, their decisions are based on a-posteriori distributions, i.e., the information already gathered by the robot. Thrun [14] selects the subset of the observed landmarks for localization which minimizes the average posterior localization error. Strasdat et al. [13] and Zhang et al. [17] both consider landmark selection in the context of the simultaneous localization and mapping (SLAM) problem. Strasdat et al. use reinforcement learning to create a landmark selection policy whereas Zhang et al. minimize the entropy of the a-posteriori distributions. In contrast to the above-mentioned approaches, our method optimizes the positions of the landmarks so that the robot stays within a user-defined region around the trajectory with high a-priori probability.

III. PROBLEM DEFINITION

We consider the problem of placing landmarks for localization and control of a mobile robot. We assume the time to be discretized into steps of equal duration. At each time step t the state of the robot is defined by a vector x_t ∈ X, which changes over time according to the stochastic motion model

x_t = f(x_{t-1}, u_{t-1}, m_t),   (1)

where u_t ∈ U is the control command at time t. Thereby, the motion is disturbed by Gaussian noise m_t ~ N(0, M_t). For self-localization, we assume that the robot is equipped with a sensor taking measurements of a set of landmarks A = {l_1, ..., l_n} according to the measurement function

z_t = h(x_t, n_t, A),   (2)

where the sensor signal is disturbed by Gaussian noise n_t ~ N(0, N_t). The covariances M_t and N_t model the uncertainty in the motion and the measurements, respectively. We define a navigation task as a trajectory that the robot should follow. A trajectory T = (x*_0, u*_0, ..., x*_T, u*_T) can be considered as a series of states and desired controls the robot should execute to reach these states. In this navigation task, we assume that the trajectory will be executed using a linear-quadratic regulator (LQR) [4] feedback controller. At each time step t the LQR controller selects the control command u_t which minimizes the quadratic cost function

E[ Σ_{τ=t}^{T} ( (x_τ − x*_τ)^T C (x_τ − x*_τ) + (u_τ − u*_τ)^T D (u_τ − u*_τ) ) ],   (3)

where C and D are positive definite weight matrices. The localization uncertainty and, as a result, also the deviation from the desired trajectory strongly depend on the specific configuration of landmarks A = {l_1, ..., l_n} which are observed during operation. Our approach selects landmarks l_i from a continuous space of possible landmark locations L. We evaluate the quality of a landmark configuration based on the deviation of the (real) state x_t from the desired state x*_t at each time step t (ignoring the control part u*_{0:T} of the trajectory). In particular, we consider the Euclidean distance between the part of the state x^pos_t describing the position of the robot and x*^pos_t. We focus on limiting the deviation

d_pos(x_t, x*_t) = ||x^pos_t − x*^pos_t||_2   (4)

of the robot from its trajectory at all time steps t ∈ [0, T]. Our approach aims at finding the landmark configuration A with the fewest elements for which the deviation guarantee

∀t ∈ [0, T] : p(d_pos(x_t, x*_t) ≤ d_max(x*_t) | A) ≥ p_min   (5)

holds. This guarantee ensures that the probability of deviating at most d_max from the desired trajectory is at least p_min. Note that d_max can be either a globally constant value or depend on the position or time.

IV. PREDICTING THE DEVIATION FROM THE TRAJECTORY

To validate the guarantee (5) for a certain landmark configuration A, we need to compute p(d_pos(x_t, x*_t) ≤ d_max(x*_t) | A). For this, we consider the a-priori probability distribution

p(x_t − x*_t | A) = ∫ p(x_t − x*_t | u_{0:t−1}, z_{1:t}, A) p(u_{0:t−1}, z_{1:t} | A) du_{0:t−1} dz_{1:t},   (6)

which averages over the observations z_{1:t} and controls u_{0:t−1} that are not yet available during landmark placement. For general, non-linear systems, the a-posteriori distributions p(x_t − x*_t | u_{1:t−1}, z_{1:t}, A) can be used to estimate the a-priori distributions (6) via Monte Carlo simulation by sampling observations and controls and averaging over numerous runs [1]. However, this is computationally expensive for large instances of the landmark placement problem.

A. A-Priori State Estimation in Linearized, Gaussian Systems

In the main part of our landmark placement algorithm, we locally linearize the system and approximate all distributions by Gaussians. This allows for an analytical evaluation of the guarantee (5), making it more efficient than Monte Carlo simulations. Linearizing the motion model (1) and the sensor


model (2) around the desired trajectory (x*_{0:T}, u*_{0:T}) by first order Taylor expansion leads to

x_t ≈ f(x*_{t−1}, u*_{t−1}, 0) + A_t (x_{t−1} − x*_{t−1}) + B_t (u_{t−1} − u*_{t−1}) + V_t m_t,   (7)
z_t ≈ h(x*_t, 0, A) + H_t (x_t − x*_t) + W_t n_t,   (8)

with the Jacobians

A_t = ∂f/∂x (x*_{t−1}, u*_{t−1}, 0),  B_t = ∂f/∂u (x*_{t−1}, u*_{t−1}, 0),  V_t = ∂f/∂m (x*_{t−1}, u*_{t−1}, 0),
H_t = ∂h/∂x (x*_t, 0, A),  W_t = ∂h/∂n (x*_t, 0, A).   (9)

In this linearized system, the Gaussian a-posteriori distribution p(x_t − x*_t | u_{1:t−1}, z_{1:t}, A) ~ N(μ_t − x*_t, P_t) of the deviation from the trajectory can be computed recursively using a Kalman filter [16]. The Kalman filter propagates a given initial Gaussian distribution p(x_0 − x*_0 | A) ~ N(μ_0 − x*_0, P_0) according to the actual control commands in the motion update

μ̄_t − x*_t = A_t (μ_{t−1} − x*_{t−1}) + B_t (u_{t−1} − u*_{t−1}),
P̄_t = A_t P_{t−1} A_t^T + V_t M_t V_t^T,   (10)

and according to the measurements in the observation update

K_t = P̄_t H_t^T (H_t P̄_t H_t^T + W_t N_t W_t^T)^{−1},   (11)
μ_t − x*_t = μ̄_t − x*_t + K_t (z_t − h(x*_t, 0, A) − H_t (μ̄_t − x*_t)),
P_t = (I − K_t H_t) P̄_t.   (12)

Note that the covariance P_t and the Kalman gain K_t depend, via the Jacobians, on x*_{0:t} and u*_{0:t−1} but not on the actual values of u_{0:t−1} and z_{1:t} (see (10), (11), (12)). Therefore they can be calculated before the robot starts operation (a-priori). The minimization of the expected deviation from the desired trajectory (3) in the LQR controller can also be solved a-priori, linearly relating the control command u_t to the estimated state μ_t via a feedback matrix L_t:

u_t − u*_t = L_t (μ_t − x*_t).   (13)

L_t depends on the a-priori known Jacobians (9) and the weight matrices (3) and is derived explicitly in [4]. As described above, we express the whole navigation algorithm, which consists of executing a motion command, making an observation, localizing, and selecting the next motion command depending on the localization, by linear functions. For this linear navigation system, van den Berg et al. [2] proved that the a-priori joint distribution of x_t and μ_t is a Gaussian

[x_t − x*_t; μ_t − x*_t] ~ N([0; 0], R_t),   R_t = [S_t, Cov(x_t, μ_t); Cov(x_t, μ_t)^T, U_t],

and that its covariance R_t can be computed recursively by

R_0 = [P_0, 0; 0, 0],   R_t = F_t R_{t−1} F_t^T + G_t [M_t, 0; 0, N_t] G_t^T,

with

F_t = [A_t, B_t L_{t−1}; K_t H_t A_t, A_t + B_t L_{t−1} − K_t H_t A_t],   (14)
G_t = [V_t, 0; K_t H_t V_t, K_t W_t].   (15)

R_t only depends on a-priori known variables, namely the Jacobians (9), the Kalman gain (11), and the feedback matrix (13). These variables can be computed a-priori since we linearize the models around the (a-priori known) desired states x*_{0:T} and not around the (a-priori unknown) estimates μ_{0:T} as it is done for example in the extended Kalman filter [16].

B. Evaluation of the Deviation Guarantee

In the linearized system we can efficiently check whether the deviation guarantee (5) holds. Let S^pos_t be the part of the a-priori covariance S_t of p(x_t − x*_t | A) corresponding to the position of the robot. The length a_t(A) of the major semi-axis of the p_min-confidence ellipsoid of S^pos_t can be calculated using

a_t(A) = c √λ_t,   (16)

where λ_t is the largest eigenvalue of S^pos_t and c is a scaling factor corresponding to p_min via the regularized Gamma function as described in [2]. If a_t(A) ≤ d_max, then the p_min-ellipsoid of S^pos_t is inside a circle with radius d_max and guarantee (5) holds for the linearized system. Note that this test is a conservative approximation and is exact if the p_min-ellipsoid is a sphere.
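The recursion (10)-(16) can be implemented directly; the following C++ sketch, which assumes the Eigen linear-algebra library and a planar robot whose first two state components are its position, propagates R_t and evaluates the per-step ellipsoid test. All names and interfaces are illustrative assumptions.

```cpp
#include <Eigen/Dense>
#include <cmath>

// A-priori known quantities for one time step: Jacobians (9), noise
// covariances, Kalman gain K_t (11) and feedback matrix L_{t-1} (13).
struct StepModel {
  Eigen::MatrixXd A, B, V, H, W, M, N, K, Lprev;
};

// R_t = F_t R_{t-1} F_t^T + G_t diag(M_t, N_t) G_t^T with F_t, G_t from (14), (15).
Eigen::MatrixXd propagateR(const Eigen::MatrixXd& Rprev, const StepModel& s) {
  const int n = s.A.rows();          // state dimension
  const int p = s.V.cols();          // motion-noise dimension
  const int r = s.W.cols();          // measurement-noise dimension
  const Eigen::MatrixXd KHA = s.K * s.H * s.A;
  const Eigen::MatrixXd BL = s.B * s.Lprev;

  Eigen::MatrixXd F(2 * n, 2 * n);
  F << s.A, BL,
       KHA, s.A + BL - KHA;                                     // Eq. (14)

  Eigen::MatrixXd G(2 * n, p + r);
  G << s.V, Eigen::MatrixXd::Zero(n, r),
       s.K * s.H * s.V, s.K * s.W;                              // Eq. (15)

  Eigen::MatrixXd Q = Eigen::MatrixXd::Zero(p + r, p + r);
  Q.topLeftCorner(p, p) = s.M;
  Q.bottomRightCorner(r, r) = s.N;
  return F * Rprev * F.transpose() + G * Q * G.transpose();
}

// Test a_t(A) <= d_max from Eq. (16): S_t is the top-left n x n block of R_t,
// and S_t^pos its leading 2 x 2 sub-block for a planar position.
bool deviationBoundHolds(const Eigen::MatrixXd& Rt, double c, double dMax) {
  const Eigen::Matrix2d Spos = Rt.topLeftCorner<2, 2>();
  const double lambdaMax =
      Eigen::SelfAdjointEigenSolver<Eigen::Matrix2d>(Spos).eigenvalues().maxCoeff();
  return c * std::sqrt(lambdaMax) <= dMax;
}
```

Here R_0 would be initialized as the block matrix [P_0, 0; 0, 0], and c is the scaling factor for p_min; for the two-dimensional position error this reduces to c = sqrt(-2 ln(1 - p_min)).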


C. Visibility of Landmarks

For robots with a limited sensor range, the non-linearity of the sensor model at the border of the field of view of the robot induces a large discrepancy between the real model and its linearization. To avoid this, we conservatively estimate for every landmark if the robot will observe it at time t, following the approach of Vitus and Tomlin [15]: We consider a landmark as visible at time t only if it is p_min-visible, i.e., it is visible from every position inside the p_min-ellipsoid of S^pos_t around x*^pos_t. If a landmark configuration satisfies the guarantee (5) when only p_min-visible landmarks are observed, it also satisfies it when using all visible landmarks. Note that for general state spaces and fields of view (like in the framework proposed in [11]), checking if a certain landmark is p_min-visible is highly non-trivial and could for example be done using a sampling-based approach. However, for the two-dimensional case (i.e., x^pos_t = [x y]) and for a robot with a circular field of view with radius r, there exists a closed form solution to checking if a given landmark l is p_min-visible. Consider the p_min-ellipse of S^pos_t centered at the origin of the coordinate system. We apply a principal axis transformation on the ellipse so that afterwards its semi-axes lie on the axes of the coordinate system. Applying this transformation on S^pos_t yields a diagonal matrix Ŝ^pos_t = diag(λ_1, λ_2) with the diagonal elements identical to the eigenvalues of the matrix. Any point x̃ = [x̃_1, x̃_2]^T on the transformed ellipse E can then be described as x̃ = c diag(√λ_1, √λ_2) x for a point x = [x_1, x_2]^T on the unit circle C and the scaling factor c from (16). To check if landmark l is p_min-visible given an a-priori estimate (x*_t, S_t), we apply the principal axis transformation also on the relative position (l − x*_t) of the landmark, resulting in l̃ = [l̃_1, l̃_2]. Checking if l̃ is p_min-visible means checking if

max_{x̃ ∈ E} ||x̃ − l̃||_2 ≤ r
⟺ max_{x ∈ C} ||c diag(√λ_1, √λ_2) x − l̃||_2 ≤ r
⟺ max_{x_1 ∈ [−1,1], sgn ∈ {−1,1}} (c √λ_1 x_1 − l̃_1)² + (c √λ_2 sgn √(1 − x_1²) − l̃_2)² − r² ≤ 0.

Applying a distinction of cases for sgn, we set the derivative with respect to x_1 of the function inside the max-operator to 0 and reorder the resulting equation. This yields a quartic term, which can be solved analytically in an efficient way.

V. INCREMENTAL LANDMARK PLACEMENT ALGORITHM

Our landmark placement approach aims at minimizing the number of landmarks that have to be placed for the deviation guarantee to hold. Since the dimensionality of the search space grows with the length of the trajectory, in general, globally searching for the optimal landmark configuration is computationally intractable. However, using an incremental placement algorithm, we can efficiently find an approximate solution to the landmark placement problem.

A. Landmark Placement for the Linearized System

In a first stage, our algorithm employs the linearized system to incrementally place landmarks. Considering linearized Gaussian models has the advantage that the a-priori distributions can be efficiently calculated analytically. The objective of our approach is to minimize the number of landmarks needed for the deviation guarantee to hold on the whole trajectory x*_{0:T}. We approximate this minimum by maximizing the number of time steps every single landmark guarantees (5). Let

t_max(A, x*_{0:T}) = max{t | a_s(A) ≤ d_max ∀s ≤ t}   (17)

be the maximum time step for which the landmark set A guarantees (5) in the linearized system for the first part of the trajectory x*_{0:t_max}. In every iteration our algorithm adds the landmark l* which maximizes t_max to the already selected set of landmarks A. In some cases, one additional landmark is not enough to increase t_max. This can happen for example if d_max(x*_{t_max+1}) is chosen considerably smaller than d_max(x*_{t_max}). In these cases, the algorithm selects the landmark which minimizes a_{t_max}(A) instead. Reducing a_{t_max}(A) increases the likelihood that in the next step a landmark can be found which increases t_max again (see (17)). Algorithm 1 describes the incremental landmark placement for the linearized system.

Algorithm 1 Landmark Placement for the Linearized System
Input: trajectory x*_{0:T}, space of landmark locations L
Output: landmark configuration A
  A ← ∅
  t ← 0
  while t < T do
    l* ← argmax_{l ∈ L} t_max(A ∪ {l}, x*_{0:T})
    t' ← t_max(A ∪ {l*}, x*_{0:T})
    if t' = t then
      l* ← argmin_{l ∈ L} a_{t_max}(A ∪ {l})
    end if
    A ← A ∪ {l*}
    t ← t'
  end while
  return A

B. Monte Carlo Validation

In a second stage, we check the computed landmark configuration A for the deviation guarantee via Monte Carlo simulation using the real (possibly non-linear, non-Gaussian) models. This is necessary to account for approximation errors due to the linearization and the Gaussian assumption. The Monte Carlo simulation samples robot states x_{0:T}, controls u_{0:T}, and observations z_{1:T} of the landmarks in A and counts the number of time steps t in which d_pos(x_t, x*_t) ≤ d_max(x*_t), as required in guarantee (5). Averaging over numerous runs yields an estimate p_MC of p_min for which the deviation guarantee in the real system holds. If p_MC < p_min, one can use arbitrary heuristics to place additional landmarks. For example, one could run our algorithm for increased values of p_min or decreased values of d_max.
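Algorithm 1 above maps onto a small search loop; the C++ sketch below shows its structure, with the guarantee evaluation t_max and the semi-axis length a_{t_max} passed in as callbacks and the continuous space L replaced by a finite candidate set (in the paper this optimization is a grid search with Powell refinement, Sect. VI-B). All names and signatures are assumptions for illustration.

```cpp
#include <functional>
#include <limits>
#include <vector>

struct Landmark { double x, y; };

// Structure of Algorithm 1: greedily add the landmark that extends the prefix
// of the trajectory on which guarantee (5) holds; if no landmark extends it,
// fall back to the landmark that minimizes a_{t_max} (Eq. (16)).
std::vector<Landmark> placeLandmarks(
    int T, const std::vector<Landmark>& candidates,
    const std::function<int(const std::vector<Landmark>&, const Landmark&)>& tMax,
    const std::function<double(const std::vector<Landmark>&, const Landmark&)>& aTmax) {
  std::vector<Landmark> A;
  if (candidates.empty()) return A;
  int t = 0;
  while (t < T) {
    const Landmark* best = nullptr;
    int bestT = t;
    for (const Landmark& l : candidates) {          // argmax over t_max(A u {l})
      const int tl = tMax(A, l);
      if (tl > bestT) { bestT = tl; best = &l; }
    }
    if (best == nullptr) {                          // t_max did not increase:
      double bestA = std::numeric_limits<double>::infinity();
      for (const Landmark& l : candidates) {        // argmin over a_{t_max}(A u {l})
        const double a = aTmax(A, l);
        if (a < bestA) { bestA = a; best = &l; }
      }
    }
    A.push_back(*best);                             // A <- A u {l*}
    t = bestT;                                      // t <- t'
  }
  return A;
}
```

As argued in Sect. V-A, the fallback branch relies on the shrinking a_{t_max} eventually allowing some landmark to extend t_max again.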

VI. EXPERIMENTAL RESULTS

We evaluated our landmark placement algorithm and compared it to other landmark placement approaches in extensive experiments both in simulation and with a real robot.

A. Setup of the Simulation Experiments

In the simulation experiments, we considered a wheeled robot with a differential drive. For self-localization we simulated three different kinds of sensors detecting uniquely identifiable landmarks: a range-only sensor, measuring only the distance to the landmarks, a bearing-only sensor, measuring only the relative angle between the robot and the landmarks, and a range-and-bearing sensor, measuring both. In this setup, the differential drive motion model and all sensor models have non-linear components. For all sensors, we assumed a circular field of view around the robot with radius 2 m. We evaluated our algorithm on five navigation tasks T1-T5 for all three sensor models, resulting in 15 experiments. Fig. 2 shows the landmarks our algorithm computed for the three sensor models in the first task T1, together with a-priori and a-posteriori distributions. Fig. 3 depicts the landmark configurations and a-priori distributions for the other four tasks T2-T5 for a range-only sensor. For all trajectories, we set p_min = 99% and d_max = 0.5 m. For the pick-and-place task T5, we changed d_max to 0.2 m in the pick up zone and in the deposit zone (gray rectangles). Because of the high accuracy necessary to fulfill this task, we simulated a more precise robotic system than for the other tasks, i.e., we scaled down the noise values.



B. Restriction of the Search Space

To implement the argmax and argmin operators in Algorithm 1, we used an insight gained from empirical evaluations. In the experiments, l* typically lay inside the field of view of x*_t. Therefore, we restricted the search space for the optimizations to this area. As t_max and a_{t_max} usually had several local optima inside the field of view of x*_t, we did a search on a discrete grid covering this region followed by a fine search with Powell's method [10] around the optimum on the grid. To evaluate this strategy, we also ran grid searches on the full environment of the robot followed by Powell optimization steps. In all 15 experiments, Algorithm 1 with the full searches did not select fewer landmarks than our implementation with the restricted search spaces. On an Intel Core i7 2.8 GHz with 12 GB RAM, the execution time of our algorithm averaged over the 15 experiments was 138 sec utilizing the restricted search spaces and 617 sec using the full searches.

C. Influence of the Sensor Model

As can be seen in Fig. 2, the number of landmarks our algorithm computes and their configuration strongly depend on the chosen sensor model. For the range-only sensor, the landmarks tend to be further away from the trajectory than for the other two sensor models. The numbers of landmarks needed are stated in the first row of Table I. Also the results of the Monte Carlo simulations in our algorithm varied strongly for the different sensor models. In every Monte Carlo simulation, we performed 1000 simulated runs of the robot to get an estimate p_MC of p_min. For all trajectories and all sensor models, the values of p_MC for the landmark sets our approach computed are stated in the fifth row of Table I. For the range-only sensor, p_MC is considerably above the intended value of 99% in all tasks. For the range-and-bearing sensor, p_MC is slightly below 99% in the pick-and-place task, and for the bearing-only sensor, p_MC is below 99% in three of the five tasks. These results indicate that the non-linear components of the range measurements are less critical for landmark placement than the ones of the bearing measurements.

D. Comparison to Other Landmark Placement Strategies

For comparison, we evaluated three other landmark placement methods. Each method starts with a minimum number of landmarks and successively increases the number (or density) of landmarks until it finds a set for which the guarantee in the linearized system holds. On trajectory places landmarks equidistantly on the desired trajectory. On grid places a landmark in the center of each cell of a regular grid. Starting with one cell covering the whole environment, the cell size is decreased at every iteration until the deviation guarantee holds. For efficiency, landmark positions which are outside the field of view of all states x*_{0:T} on the desired trajectory are not used. Random successively places landmarks at randomly chosen positions observable from the desired trajectory. The numbers of selected landmarks and the values of p_MC for all landmark placement strategies are stated in Table I.

Fig. 2. The landmark configurations (red triangles) our algorithm computed for a figure eight trajectory T1 for three different sensor models: range-only (left), bearing-only (middle) and range-and-bearing (right). The blue points and ellipses in the upper row correspond to the means and 99% covariance ellipses of the a-priori distributions, and in the lower row to the a-posteriori distributions of simulated sample runs. The true positions of the robot in the sample runs are depicted as black lines.
Fig. 3. The landmark configurations (red triangles) our algorithm computed for four sample trajectories T2-T5 using a range-only sensor. T2 is a square, T3 a curved shape, T4 a sweeping trajectory, and T5 a pick-and-place task. The blue points and ellipses correspond to the means and 99% covariance ellipses of the a-priori distributions. In T5, the pick up zone and the deposit zone are marked as gray rectangles.


TABLE I
NUMBERS OF SELECTED LANDMARKS AND RESULTS OF MONTE CARLO SIMULATIONS


T1 Our approach On trajectory On grid Random Our approach On trajectory On grid Random 14 48 108 0.999 0.999 0.999 Range-only sensor T2 T3 T4 12 32 63 0.998 0.999 0.999 11 25 38 62 0.999 0.996 0.999 0.999 18 56 138 0.999 0.999 0.999 T5 10 58 23 62 0.999 0.962 0.998 0.998 T1 Number 11 41 26 75 0.979 0.353 0.997 0.996 Bearing-only sensor T2 T3 T4 of landmarks 9 7 16 12 19 17 30 66 51 88 pMC 0.978 0.991 0.994 0.955 0.981 0.995 0.999 0.996 0.999 0.999 T5 7 23 18 31 0.826 0.773 0.931 0.999 T1 9 12 20 38 0.999 0.996 0.996 0.999 Range-and-bearing sensor T2 T3 T4 8 9 20 29 0.997 0.999 0.999 0.999 6 10 17 38 0.998 0.983 0.995 0.996 13 17 25 37 0.994 0.999 0.999 0.999 T5 5 7 16 15 0.986 0.980 0.999 0.999

Dashes in the table indicate that no valid landmark configuration could be found. For all experiments, our approach placed fewer landmarks than the other approaches. The on trajectory method is the best method after ours for the range-and-bearing sensor, measured in the number of landmarks placed. However, for the other two sensor models, the on trajectory method was not always able to find a landmark configuration which satisfied the guarantee in the linearized system. For this method, especially the non-linearities in the bearing-only sensor model resulted in low values for p_MC.

E. Experiments with a Real Robot

To further validate the simulation experiments, we evaluated a landmark set our algorithm generated also on the real e-puck robot [9] depicted in Fig. 1. As a range-and-bearing sensor, we utilized a webcam pointing upwards detecting uniquely identifiable ARToolkit markers [5] attached to the ceiling. We considered the navigation task T1 scaled down to suit the miniature size of our robot (diameter 75 mm) and the lower ceiling. Scaling the task by the factor 0.08 yields d_max = 0.04 m. To evaluate the deviation d_pos(x_t, x*_t) of the e-puck robot from its desired trajectory, we obtained the reference positions x_t from a MotionAnalysis motion capture system with four digital Raptor-E cameras. During 20 autonomous runs, d_pos(x_t, x*_t) was below d_max in 99.7% of the time steps.

VII. CONCLUSIONS

In this paper, we presented a landmark placement method guaranteeing a bound on the maximum deviation of the robot from its trajectory with high confidence. During a placement stage, our approach approximates the real motion model and sensor model by their linearizations to efficiently evaluate the guarantee. In a subsequent validation stage, we apply a Monte Carlo simulation using the real system dynamics to check if the selected landmark set satisfies the deviation guarantee also for the possibly non-linear models. In contrast to other approaches, our algorithm is customizable to specific robotic systems and navigation tasks and inherently chooses the number of landmarks needed. In extensive experiments, we demonstrated that our method outperforms other approaches.

ACKNOWLEDGMENTS

This work has partly been supported by the German Research Foundation (DFG) within the Research Training Group

1103. The authors would like to thank Axel Rottmann for his implementation of the Powell algorithm.

REFERENCES
[1] M. Beinhofer, J. Müller, and W. Burgard. Near-optimal landmark selection for mobile robot navigation. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2011.
[2] J. van den Berg, P. Abbeel, and K. Goldberg. LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information. In Proc. of Robotics: Science and Systems (RSS), 2010.
[3] J. van den Berg, S. Patil, R. Alterovitz, P. Abbeel, and K. Goldberg. LQG-based planning, sensing, and control of steerable needles. In Proc. of the Int. Workshop on the Algorithmic Foundations of Robotics (WAFR), 2010.
[4] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2000.
[5] M. Billinghurst and H. Kato. Collaborative augmented reality. Communications of the ACM, 45(7):64-70, 2002.
[6] H.F. Durrant-Whyte, D. Pagac, B. Rogers, M. Stevens, and G. Nelmes. Field and service applications - an autonomous straddle carrier for movement of shipping containers. IEEE Robotics & Automation Magazine, 14:14-23, 2007.
[7] J.-S. Gutmann, E. Eade, P. Fong, and M. Munich. A constant-time algorithm for vector field SLAM using an exactly sparse extended information filter. In Proc. of Robotics: Science and Systems (RSS), 2010.
[8] D.B. Jourdan and N. Roy. Optimal sensor placement for agent localization. ACM Trans. Sen. Netw., 4(3):1-40, 2008.
[9] F. Mondada, M. Bonani, X. Raemy, J. Pugh, C. Cianci, A. Klaptocz, S. Magnenat, J.-C. Zufferey, D. Floreano, and A. Martinoli. The e-puck, a Robot Designed for Education in Engineering. In Proc. of the Conf. on Autonomous Robot Systems and Competitions, 2009.
[10] M. Powell. An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, 7(2):155-162, 1964.
[11] C. Pradalier and S. Sekhavat. Localization Space: a framework for localization and planning, for systems using a sensor/landmarks module. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2002.
[12] P. Sala, R. Sim, A. Shokoufandeh, and S. Dickinson. Landmark selection for vision-based navigation. IEEE Transactions on Robotics and Automation, 22(2):334-349, 2006.
[13] H. Strasdat, C. Stachniss, and W. Burgard. Which landmark is useful? Learning selection policies for navigation in unknown environments. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 2009.
[14] S. Thrun. Finding landmarks for mobile robot navigation. In Proc. of the IEEE Int. Conf. on Robotics & Automation (ICRA), 1998.
[15] M. Vitus and C. Tomlin. Sensor placement for improved robotic navigation. In Proc. of Robotics: Science and Systems (RSS), 2010.
[16] G. Welch and G. Bishop. An introduction to the Kalman filter. Technical report, 1995.
[17] S. Zhang, L. Xie, and M.D. Adams. Entropy based feature selection scheme for real time simultaneous localization and map building. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2005.


Mobile Robots in Hospital Environments: an Installation Case Study


F. Capezio, F. Mastrogiovanni, A. Scalmato, A. Sgorbissa, P. Vernazza, T. Vernazza, R. Zaccaria

Abstract: This paper discusses a number of issues that arose during the deployment of a commercial robotic platform in a hospital scenario. In particular, a realistic case study and installation process at the Polyclinic of Modena (Italy) is described, where robots have to accomplish two different types of tasks, namely drugs delivery and waste transportation. The lesson learned is that the design of a robot to be used in civilian environments must not only take into account state-of-the-art and dependable algorithms, but also consider real-world issues (i.e., not strictly scientific or technological) when adopting techniques for navigation, obstacle avoidance or localization. The paper discusses the most important features of the overall system architecture that have been modified and the solutions that have been selected to deal with this realistic scenario.
Index Terms Add up to 4 keywords here

I. INTRODUCTION

During the past few years, service mobile robotics for object transportation in semi-structured environments and warehouses has developed into a mature research field. A number of systems and algorithms have been presented, which are able in principle to provide customers with specifications about their use. Industrial applications, and especially Automated Guided Vehicles (AGVs in short), are nowadays very common. However, a number of limitations are usually associated with these systems, which are related to the need of operating in a workspace that is specifically modified, or even built on purpose, for the robots. Among the many functional or non-functional requirements that robots must be able to adhere to, a service robot must be dependable. From a customer's perspective, failures imply costs: a service robot must be less expensive and more efficient than a corresponding team of human workers when performing the same task. It is necessary to analyse how customers' or contingent requirements can be guaranteed (or at least enforced) by the robot control framework and architecture, both at the software and hardware levels. In particular, it is necessary to take into account the impact of such requirements on the robot design, both at the hardware and software levels. This paper describes the lesson learned in deploying a commercial service robot in a real-world environment. In particular, the paper discusses how a robotic platform has been designed to better comply with a number of issues that arose during the deployment process, as well as to guarantee the execution of the assigned tasks, namely drugs delivery and waste transportation. These design choices take into account both hardware and software issues. The deployment of robots in similar scenarios is not new. The first robot operating in hospital environments is HelpMate

[1], which appeared in 1991 and was used for the transportation of small-sized objects at Danbury Hospital. After this first example, other robots followed the same philosophy and strategies adopted for HelpMate. I-Merc [2], which is specialized in hospital meal transportation, and Care-o-Bot [3, 4], which has been developed to provide assistance to the elderly and people with special needs, are examples of such systems. In recent years, mobile robots for hospital transportation started to become a commercial business. The most known system in Europe is TransCar [6] by Swisslog, which is a laser-guided AGV that can transport payloads up to 450 kg. The vehicle is able to move in order to lift specifically designed boxes where the material to be handled is assumed to be located. Many robots are able to operate in the same workspace or storage area. However, the system assumes that the environment is very well-structured: such requirements include, for instance, wide spaces, flat floors and elevators that precisely align with the floor in order to allow robots to enter them. The TransCar solution seems feasible in new buildings, although it is difficult to adopt it in existing (and old) hospital buildings. Recently, Swisslog produced a small autonomous robot called RoboCourier [7], which is designed for inter-department material transportation, with a payload of about 25 kg. TUG [8] from Aethon is similar to RoboCourier. It has been designed for the transportation of various materials such as dietary trays, medications, linens, blood samples, medical records and pharmacy drugs. It can be interfaced with most of the standard carts used in hospitals, which can store up to 250 kg. FMC Technologies develops ATLIS [9], a robot for the transportation of heavy objects that is very similar to TransCar, but characterized by a heavier payload. However, it requires major changes to the workspace. Recently, the Muratek Keio Robot [5] has been presented, which is not yet commercially available but seems very promising. It is an omni-directional autonomous robot, developed for the transportation of light payloads. The paper describes the experiences gained during a real-world installation of Merry Porter, a service mobile robot from Genova Robot®. The paper is organized as follows: Section II describes the considered real-world case study; Sections III and IV introduce the robot architecture that has been designed to accomplish the required tasks; Section V discusses the installation procedure in a hospital environment. Conclusions follow.

II. CASE STUDY

In this Section we describe the two types of mission that have been required by the hospital service management unit.


Fig. 1. Different robot tasks. Left: waste transportation. Right: drug delivery.

The case study discussed in this article is a hospital environment, the Polyclinic of Modena. The hospital needs a high level of automation to accomplish two tasks: waste transportation and drug delivery. Each task is characterized by a specific path for the robots to follow and by specific needs. As depicted in Figure 1 (right), the drug delivery mission starts from the pharmacy, which is located on the basement floor, and arrives at the sixth floor, where various pediatric departments are located. The total path length is about 50 m. The delivery is scheduled three times per day (early morning, midday and evening), subject to the constraint that deliveries occur before patient visits. Waste transportation (Figure 1, left) has a different and longer path covering about 500 m. The robot starts from the fourth floor, where the operating rooms of general surgery are located, and transports waste to a dump placed outside the hospital, passing through the basement floor. Without robots, this task requires two workers: one to transport waste to the basement floor and another to move the waste to the external garbage dump with a small electric vehicle. As can be noticed, robots need to use and control: (i) elevators; (ii) additional environmental devices, such as those controlling automated doors (e.g., the pharmacy has an access control system for security reasons). To accomplish these two tasks, robots must be equipped with different components. On the one hand, a robot performing drug delivery must be provided with a security box to prevent drug theft; the box should be opened only with a predefined code known to authorized personnel. On the other hand, a robot performing waste transportation must be provided with a conveyor to load waste (so that human personnel are not needed) and with a GPS receiver for localization, since it has to transport its payload outside the hospital. A commercial robot system must maximize reliability because, as previously pointed out, every inconvenience or delay has a cost; maintenance effort is likewise a cost that must be minimized. After analysing these missions, we found the following weak points:

- Hardware structure: real, general-purpose environments can require a number of hardware features on the robot side in order to overcome physical problems such as slopes or heavy payloads.
- Navigation and safety: robots must navigate in different environments that can present obstacles. Areas can be crowded, requiring particular trajectories; as a consequence, the navigation system must be adaptable. Even more critical is safety: robots must avoid contact with objects or people. The safety system must have higher priority than the navigation system and it must fulfil the rules defined in European law, in particular EN 1525:1997.
- Localization: localization failures are not acceptable, since when robot localization fails the manual intervention of a technician is necessary.
- User interface: in order for the system to be used by non-specialists, a simple interface is required that allows users to assign tasks to robots and to monitor their execution. Interfaces can differ, ranging from simple monitoring to advanced commands for qualified personnel.
The following Sections consider each problem and present the solution developed for the presented system.

III. HARDWARE ARCHITECTURE
Merry Porter has a base architecture that can be extended with different modules, both hardware and software. The hardware modules allow the robot to be configured with different devices, such as GPS or conveyors, while software modules allow each component to be managed separately with its own characteristics. The base architecture of each robot fulfils the following specifications: maximum speed of 1.5 m/s; maximum slope of 30% without payload and 7% with 100 kg of payload; 180 kg of payload on level ground; ability to overcome 3 cm steps. These characteristics are required by a standard hospital, where there can be very long corridors (hence high speed is required to avoid wasting time), the floors can be rugged and sloping, and the alignment between the floors and the elevators may not be perfect (hence the robot must overcome steps). The base robot architecture has the following characteristics (Figure 2):
- unicycle structure with a front steering wheel;
- a telescopic turret for placing sensors at a variable height between 1.40 and 2.10 meters, to avoid occlusions due to surrounding people;
- 4 motors and drives (2 for the rear wheels, 1 for the front steering wheel, and 1 for the telescopic turret) connected by CAN bus;
- 1 Hokuyo laser scanner for localization only, placed on top of the turret;
- 1 Sick S3000 safety laser scanner for obstacle avoidance, placed 30 cm from the floor;


- 2 IP cameras for payload and environmental surveillance, placed on the turret;
- 1 DLPS active beacon localization system, placed on top of the turret;
- 16 ultrasound sensors placed around the robot body for obstacle detection;
- 1 flashing light to warn people of the robot's presence;
- 1 embedded fanless PC with an Intel Celeron 1 GHz CPU and a Linux-based operating system.

Fig. 2. Merry Porter base architecture at the Polyclinic of Modena.

Fig. 3. Agents present in the robot software architecture.

Robots are equipped with two lithium polymer batteries allowing 4 hours of continuous work after a 30-minute charge. Consequently, robots can work for more than 21 hours per day. The two cameras prevent or record material theft and, at the same time, can improve the security system of the hospital, since their input streams can be visualized in the central room of the security personnel.

IV. SOFTWARE ARCHITECTURE
Since the Merry Porter software architecture comprises different modules, these are integrated within a multi-agent system called ETHNOS [10], where each agent has a specific task to carry out, such as managing missions, localization, or trajectory planning and following (Figure 3). ETHNOS provides a soft real-time extension to the Linux kernel supporting different communication, execution, and representation requirements. It is built on top of a Linux kernel and a dedicated network communication protocol that allows communication not only between the different parts of the robot, but also with other software agents running on board devices distributed throughout the environment.

A. Navigation
Requirements: The navigation subsystem must provide an efficient trajectory-following algorithm allowing the robot to navigate in different kinds of environment. It must be characterized by good obstacle avoidance, and any kind of collision must be prevented at all costs: as a consequence, the associated priority is the maximum one. This is necessary in any civilian environment, where people as well as other obstacles, such as hospital carts, are often left in the robot workspace.
Approach: The robot navigates on a predefined graph, following a node sequence that is either chosen by the user or automatically generated by a trajectory planner. In the absence of obstacles, node-to-node navigation is simply performed by moving along the connecting straight lines. The navigation subsystem can be observed in Figure 3: the Trajectory Generator computes the linear and angular speeds to be fed to the actuators, depending on the current position returned by the Localization Agent and on the next point to be reached, determined by the Task&Path Manager. Controls are then sent to the Motor Manager.
Obstacle detection and avoidance: In order to guarantee the safest possible behaviour, obstacle detection is managed by a safety laser scanner (the previously cited SICK S3000) with two different safety mechanisms, one in hardware and one in software. The hardware mechanism consists of an internal circuit that, when the laser detects something within a given (short) range, automatically stops the robot by cutting the motor power, which causes the motor brakes to engage. This system is required by European regulations, and it is activated only when strictly necessary. Standard obstacle detection is performed by a software agent, which subsumes the hardware mechanism in most cases. The software analyses all the data returned by the safety laser and feeds them to a simple rule-based engine: the final output also depends on the robot speed and on the width of corridors and passages.


For example, the higher the speed, the greater the distance at which the robot begins to slow down; the narrower the passage (such as when entering an elevator), the smaller the minimum allowed distance from obstacles before the robot is stopped. Simple stop (or slow-down) strategies guarantee a safe approach, avoiding collisions with people and objects. Furthermore, they guarantee very repeatable behaviour and therefore maximize system reliability. As a consequence of a safety standard check on the robots performed by an external expert company, another hardware mechanism has been installed. This mechanism prevents lateral collisions and the shearing effect that can occur when the robot, while performing a high-curvature trajectory or moving very close to walls, comes very close to people. To address this issue, two photocells have been placed on rods of about 15 cm length on the lateral part of the robot: when someone is located between the photocells or hits the rods, the motor power is cut. Other 15 cm rods with photocells are placed on the back of the robot to prevent collisions when the robot moves backward. This solution complies with the European standard: reverse motion is used only when exiting the elevator and when approaching conveyor belts.
When the robot is in cluttered and very crowded environments, it does not avoid obstacles but simply stops, because avoidance could cause it to become trapped in very narrow areas. In wider areas, an obstacle avoidance algorithm can instead be used (a simplified sketch is given at the end of this subsection). The algorithm is based on the concept of roaming stripes [11]. A roaming stripe is a diamond-shaped region connecting two nodes; stripes are designed in such a way that their intersection with the free space is always a convex area. When the obstacle avoidance algorithm is active, the avoiding trajectory is computed through a standard Artificial Potential Field, by storing laser data in a local probabilistic map centered on the robot reference frame. During navigation, the robot is allowed to deviate from its trajectory to avoid obstacles, but it is never allowed to exit the roaming stripe. This guarantees that the robot always moves within a convex area containing the target node. Assuming that, sooner or later, all dynamic obstacles (e.g., people, pieces of furniture or other objects) move or are moved away, this in turn guarantees that the robot will be able to reach the target by following a straight line.
Pros and Cons: The adopted solution guarantees that robots navigate in selected areas, because the roaming stripes algorithm always selects a trajectory that lies within the stripe. This is very important in a real-world case, because medical personnel must always be guaranteed a free path for emergencies. The on-board safety system (made up of a laser range finder and a number of sonars) is aimed at preventing collisions in a very conservative way. However, this approach increases the time necessary to complete the task.
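To make the roaming-stripe idea concrete, the following minimal sketch shows one way the deviation clamping could be implemented. It is illustrative only: the class and parameter names (RoamingStripe, the gains, the step size) are our own assumptions and are not taken from the Merry Porter software.

```python
import numpy as np

class RoamingStripe:
    """Diamond-shaped region connecting two path nodes a and b.

    In a frame aligned with the segment a->b the diamond is
    |along| / half_len + |across| / half_width <= 1.
    """
    def __init__(self, a, b, half_width):
        self.a, self.b = np.asarray(a, float), np.asarray(b, float)
        d = self.b - self.a
        self.half_len = np.linalg.norm(d) / 2.0
        self.axis = d / (2.0 * self.half_len)
        self.center = (self.a + self.b) / 2.0
        self.half_width = float(half_width)

    def contains(self, p):
        r = np.asarray(p, float) - self.center
        along = abs(np.dot(r, self.axis))
        across = abs(self.axis[0] * r[1] - self.axis[1] * r[0])
        return along / self.half_len + across / self.half_width <= 1.0

def potential_field_step(pos, goal, obstacles, stripe, step=0.05,
                         k_att=1.0, k_rep=0.3, d0=1.0):
    """One step of a standard attractive/repulsive potential field,
    clamped so that the commanded position never leaves the stripe."""
    pos, goal = np.asarray(pos, float), np.asarray(goal, float)
    force = k_att * (goal - pos)                      # attractive term
    for ob in obstacles:                              # repulsive terms
        diff = pos - np.asarray(ob, float)
        d = np.linalg.norm(diff)
        if 1e-6 < d < d0:
            force += k_rep * (1.0 / d - 1.0 / d0) * diff / d ** 3
    nxt = pos + step * force / max(np.linalg.norm(force), 1e-6)
    # The robot may deviate to avoid obstacles but never exits the convex stripe.
    return nxt if stripe.contains(nxt) else pos

# Example: deviate around an obstacle while staying inside the stripe.
stripe = RoamingStripe(a=(0.0, 0.0), b=(10.0, 0.0), half_width=1.5)
p = np.array([1.0, 0.0])
for _ in range(300):
    p = potential_field_step(p, goal=(10.0, 0.0), obstacles=[(5.0, 0.2)], stripe=stripe)
```

The key design point mirrored here is that the stripe test, not the potential field, has the final word on where the robot may go.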

B. Localization
Requirements: A robot required to navigate in an environment for very long periods of time and with high reliability must be provided with a robust and redundant localization system. With respect to the introduced case study, it must be noted that localization must also provide a very precise estimate of the robot position. The workspace is characterized by many narrow passages, such as elevators and doors. Furthermore, the robot must be able to approach conveyors with a positioning error lower than 2 cm; if this accuracy cannot be met, the payload could fall.
Approach: Merry Porter is provided with two different localization systems for indoor use and a GPS-based system for outdoor use. All three systems communicate positioning information to an Extended Kalman Filter (EKF).
1) Indoor - Map-Based Localization: Usual localization approaches based on available maps periodically compare observations taken from the environment with a map representation, with the aim of correcting the a priori estimate provided by odometry. Among the most common approaches, Kalman filter based localization assumes that a unimodal probability distribution is adequate to represent the pose estimate, and is thus a simple and efficient solution to local position tracking. Merry Porter exploits the EKF with a priori known environment maps. The design choices associated with the whole localization subsystem have been described in [12], where the use of line-based features (extracted from a laser rangefinder) is discussed, together with a technique to improve the corrective power of each extracted line feature. In particular, [12] introduces the concept of a low-frequency cross section: in typical indoor environments, above a given height from the ground, the environment is more regular and simple. This has obvious consequences when computing line segments from range data: each laser scan produces a smaller number of lines, each one being longer and more reliable. However, the hospital environment turns out to be more complex. For instance, on the basement floor there are areas where the ceiling is particularly low, thereby requiring the telescopic turret (where the localization laser is mounted) to lower its position if necessary. This is triggered by the robot control architecture depending on the area the robot is in.
2) Indoor - Active Beacon Localization: The active beacon localization system (called DLPS) has been described in [13]. The basic idea starts from classic active beacon methods, where beacons have a unique identifier and a specific position in the map. When a beacon is in line of sight from the robot's perspective, the robot can send it specific commands; as a consequence, the active beacon can reply with information useful for localization (different commands issued by the robot are possible). In particular, such systems usually adopt the classic trilateration principle: when the robot detects three beacons, it is able to compute its precise position. The DLPS system instead introduces a different approach, where beacon-based localization is formalized as a position-tracking problem. The information sent by the beacons is processed by the EKF, because of the high non-linearity of the measurement model. At every time instant t_i, the DLPS system provides the robot with the identifiers of the detected beacons, their positions (x_bk, y_bk) and the angle θ_k for each detected beacon k. The angle θ_k is interpreted as the estimate of the direction of the beacon. More precisely, two values are considered, θ_k and σ²_k, which correspond respectively to the mean value and the variance of the position estimate of beacon k.
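As a rough illustration of how a bearing-only beacon observation of this kind can correct the pose estimate, the sketch below performs a single EKF update step. It assumes a planar robot state (x, y, theta) and textbook EKF equations; it is not taken from the DLPS implementation.

```python
import numpy as np

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def ekf_beacon_update(x, P, beacon_xy, phi_meas, var_phi):
    """EKF correction with one beacon bearing measurement.

    x        : state estimate [x, y, theta] (robot pose in the map frame)
    P        : 3x3 state covariance
    beacon_xy: known beacon position (x_bk, y_bk) from the map
    phi_meas : measured bearing to the beacon, in the robot frame
    var_phi  : variance of the bearing measurement (sigma_k^2)
    """
    dx = beacon_xy[0] - x[0]
    dy = beacon_xy[1] - x[1]
    q = dx * dx + dy * dy
    # Predicted bearing of the beacon as seen from the robot.
    phi_pred = wrap(np.arctan2(dy, dx) - x[2])
    # Jacobian of the bearing measurement with respect to the state.
    H = np.array([[dy / q, -dx / q, -1.0]])
    S = H @ P @ H.T + var_phi          # innovation covariance (1x1)
    K = P @ H.T / S                    # Kalman gain (3x1)
    innov = wrap(phi_meas - phi_pred)
    x_new = x + (K * innov).ravel()
    P_new = (np.eye(3) - K @ H) @ P
    return x_new, P_new
```

Each detected beacon contributes one such bearing correction; with several beacons in view, the filter recovers the effect of classic trilateration while still working as a tracking filter.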


This estimate is fed to the EKF to correct the estimate of the robot position.
3) Outdoor - GPS: The outdoor localization system derives from a similar system previously tested in the context of an airport surveillance robot project, described in [14], where a very large space with few line features in the environment required an alternative localization method. A common non-differential GPS localization system has been developed, which is integrated with the laser scanner data. Unfortunately, GPS measurements are corrupted by error sources introducing highly coloured noise with significant low-frequency components, which can be modelled as a non-zero mean value (i.e., a bias) in the GPS errors that varies slowly in time. The analysis of longitude and latitude data collected for 24 hours at a fixed location clearly shows this effect: the Fast Fourier Transform (FFT) of the GPS longitude and latitude data exhibits low-frequency components, corresponding to a slow variation of the signal over time. By estimating this bias in the GPS measurements, one can expect to improve GPS data precision and therefore to make robot localization more accurate. To this purpose, an augmented state vector x is defined, comprising the x, y, and θ components of the robot pose as well as the x_gps and y_gps components of the low-frequency bias. Since the coloured components are then separated from the additive white Gaussian components of the GPS noise (i.e., they are included in the state vector), the EKF can be applied. When moving outdoors, the detected line features of the a priori map mostly correspond to external walls of buildings. In this situation, a smaller number of features is usually available, since the robot mostly traverses areas where no lines are detected at all. However, when line features are available, their contribution is sufficient to estimate the full state vector, i.e., both the robot pose and the GPS bias.
Pros and Cons: In indoor areas, the combined use of map-based and beacon-based localization is complementary. The map-based algorithm does not need any external devices but, in a few parts of the robot paths, it does not provide the necessary precision; even worse, the variance associated with the localization estimate can become very high, so that no guarantees about localization can be given. From a commercial perspective, the beacon-based localization system adds costs to the overall system set-up; however, it turns out to be a good choice for enforcing system reliability. Finally, GPS-based localization has good performance and is mandatory for outdoor localization.
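The following fragment sketches the augmented-state idea described above: the robot pose is stacked with two slowly varying GPS bias states, the bias evolves as a random walk in the prediction step, and a GPS fix is predicted as the true position plus the bias. It is a minimal illustration under our own noise assumptions, not the filter tuned in [14].

```python
import numpy as np

# Augmented state: [x, y, theta, b_x_gps, b_y_gps]
# The bias components model the low-frequency (coloured) part of the GPS error.

def predict(x, P, v, w, dt, q_pose=0.01, q_bias=1e-4):
    """Unicycle prediction; the GPS bias is propagated as a slow random walk."""
    F = np.eye(5)
    F[0, 2] = -v * dt * np.sin(x[2])
    F[1, 2] = v * dt * np.cos(x[2])
    x = x.copy()
    x[0] += v * dt * np.cos(x[2])
    x[1] += v * dt * np.sin(x[2])
    x[2] += w * dt
    Q = np.diag([q_pose, q_pose, q_pose, q_bias, q_bias]) * dt
    return x, F @ P @ F.T + Q

def gps_update(x, P, z_gps, r_gps=4.0):
    """GPS fix measurement model: z = position + bias + white noise."""
    H = np.array([[1, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1]], float)
    R = np.eye(2) * r_gps
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.asarray(z_gps, float) - H @ x)
    P = (np.eye(5) - K @ H) @ P
    return x, P

# Line-feature updates (not shown) observe x, y and theta directly, which is what
# makes the bias states observable by separating them from the true position.
```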

Fig. 4. Robomap Interface: Map definition starting from CAD maps.

Fig. 5. Robomap Interface: On-line monitoring and visualization of the robot map, mission nodes and active beacon positions.

C. User Interface - Robomap
Requirements: A commercial robot system must present a simple and user-friendly interface.
Approach: The main user interface is provided by a software tool called Robomap, an integrated management system that allows maps and missions to be defined and the state of the robots to be monitored in real time. Map definition is restricted to qualified personnel and provides all the elements necessary to describe the fundamental parts of the environment used by the robot localization system. As a matter of fact, a map comprises both lines and active beacons. Map lines can be specified directly by importing and refining already available CAD maps (Figure 4). An alternative method for populating the map with lines suitable for localization is to trace out the lines produced by the line extraction algorithm applied to the laser scanner data; to do this, it is necessary to move the robot to the location of interest. The task or mission definition mode allows the creation of nodes and of links between nodes, which compose the path. Each link can be associated with a number of parameters, such as the desired speed of the robot on the link or the preferred localization method. Furthermore, for each node it is possible to declare a set of actions that the robot must perform in order for the mission to complete or continue, such as requesting the presence of an elevator at the floor, opening an automated door or loading the payload (by communicating with the conveyor base station); a possible encoding of such a mission is sketched below. During robot missions, Robomap allows on-line monitoring of the robot (Figure 5). This mode makes it possible to monitor different robot properties, such as the position with respect to the map, speed, jog, obstacles revealed by the sonars, battery status and detected beacons, to name but a few. Furthermore, a human supervisor can issue commands to the robot for movement control and mission management (e.g., starting or changing a mission).
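To illustrate the kind of information such a mission definition has to carry (per-link parameters and per-node actions), here is a hedged sketch of a possible encoding; the field names, action strings and values are hypothetical and are not the Robomap file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Link:
    """Edge of the mission graph with per-link navigation parameters."""
    to_node: str
    max_speed: float          # m/s, bounded by the platform limit of 1.5 m/s
    localization: str         # e.g. "map", "beacon" or "gps"

@dataclass
class Node:
    """Graph node with the actions the robot must trigger on arrival."""
    name: str
    xy: Tuple[float, float]
    actions: List[str] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)

# Hypothetical fragment of the drug delivery mission.
pharmacy = Node("pharmacy", (0.0, 0.0), actions=["open_automated_door", "load_payload"])
lift_b = Node("elevator_basement", (42.0, 3.5), actions=["call_elevator:6"])
ward_6 = Node("pediatric_ward", (47.0, 18.0), actions=["unlock_security_box"])

pharmacy.links.append(Link(to_node="elevator_basement", max_speed=1.2, localization="map"))
lift_b.links.append(Link(to_node="pediatric_ward", max_speed=0.6, localization="beacon"))

mission = [pharmacy, lift_b, ward_6]   # node sequence followed by the Task&Path Manager
```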


V. APPLICATIONS
This Section describes the system set-up realized at the Polyclinic of Modena, Italy. We explain the main environment requirements and the device installation process, and give an overview of performance issues and improvements on typical hospital tasks.
A. Environment and Devices
Robots have to control a number of devices distributed in the environment in order to accomplish their tasks. For example, moving between two floors requires an elevator call; accessing an area may require control of automated doors. Environment automation is essentially based on custom devices that communicate via an Echelon LonWorks network. The network guarantees a very flexible topology and is easily adapted to any environment. In order for all the robot parameters to be accessible to human supervisors, the whole workspace is provided with a wireless network communication system. This system is also used for communication between robots and devices, specifically through a monitoring system located remotely on a workstation, which is connected both to the wireless network and to the Echelon fieldbus. To explain how the environment automation works, we briefly discuss the elevator control. Custom devices are connected to the elevator control system, requiring a simple installation that must be performed by a qualified technician. The interface device is wired directly to the elevator buttons, so all the actions provided by the elevator panel can be performed. The interface device is connected to the Echelon fieldbus like all other devices and the supervisor workstation; the supervisor can therefore query each device and issue commands to it. Robots interact with the supervisor simply by using the wireless network provided by the hospital and the multi-agent framework used to develop both the robot and the supervisor applications. The described building automation alone is not sufficient to complete the required tasks. In particular, the described hospital application needs other components, such as a conveyor where the personnel leave waste waiting for robot transportation. The conveyor must be connected to the automation network and is controlled by the supervisor application. In the described set-up, a number of conveyors have been installed, which can be controlled directly by the robots through ZigBee commands.
B. DLPS implementation
The DLPS system is composed of a rotating laser placed on top of the robot telescopic turret and a number of active beacons distributed in the environment. Active beacons are battery powered and communicate through the ZigBee protocol; these two characteristics minimize the time and cost of the set-up. The laser beam carries information about the encoder associated with the motor responsible for rotating the laser itself. When the laser hits a beacon, the beacon sends a message to the robot containing the beacon identifier and the encoder value, from which the robot position can be estimated as previously described.
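As a small illustration of how such a beacon report could be turned into the bearing used by the localization filter, the snippet below converts an encoder value into an angle in the robot frame. The encoder resolution and the message layout are assumptions for the sake of the example, not DLPS specifications.

```python
import math

ENCODER_TICKS_PER_REV = 4096   # assumed resolution of the turret laser encoder

def bearing_from_encoder(tick, zero_offset_tick=0):
    """Convert the encoder value reported when the rotating laser hit a beacon
    into a bearing angle (rad) in the robot frame, wrapped to (-pi, pi]."""
    angle = 2.0 * math.pi * ((tick - zero_offset_tick) % ENCODER_TICKS_PER_REV) / ENCODER_TICKS_PER_REV
    return math.atan2(math.sin(angle), math.cos(angle))

# A beacon report then amounts to (beacon_id, bearing); together with the known
# beacon position from the map it can feed an EKF update such as the one
# sketched in the localization section above.
report = {"beacon_id": 17, "encoder": 1234}
phi = bearing_from_encoder(report["encoder"])
```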

Fig. 6. The robot takes the elevator from floor -1 to floor 6. 1,2,3,4: the robot calls the elevator and enters. 5,6: the robot reverses out, turns, and continues its mission on the 6th floor.

Battery duration is very important in real-world scenarios, because human personnel cannot be frequently dedicated to replacing beacons. The beacon electronics essentially comprise two parts, namely the ZigBee communication and the laser management part, which interprets the message contained in the laser light. The following strategy is implemented to maximise beacon life:
- beacons wake up every 28 seconds and check whether any robot is near, otherwise they return to a sleep state;
- if a robot finds a beacon, it keeps the beacon active and the beacon turns on its laser management part;
- after 160 seconds without receiving any signal from the robot, the beacon returns to a sleep state.
With this scheme, the beacons only need to be recharged about every three months.
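A minimal state-machine sketch of this duty-cycling strategy is given below, using the 28 s wake-up period and 160 s timeout quoted above; the structure and names are illustrative assumptions, not the beacon firmware.

```python
WAKE_PERIOD_S = 28.0      # beacon polls for nearby robots every 28 s
ACTIVE_TIMEOUT_S = 160.0  # beacon sleeps again after 160 s without robot signals

class Beacon:
    def __init__(self):
        self.state = "sleep"
        self.next_wake = WAKE_PERIOD_S
        self.last_robot_signal = None

    def step(self, t, robot_nearby):
        """Advance the beacon state machine to time t (seconds)."""
        if self.state == "sleep":
            if t >= self.next_wake:
                self.next_wake += WAKE_PERIOD_S
                if robot_nearby:                 # a robot keeps the beacon active
                    self.state = "active"        # laser management part switched on
                    self.last_robot_signal = t
        elif self.state == "active":
            if robot_nearby:
                self.last_robot_signal = t
            elif t - self.last_robot_signal >= ACTIVE_TIMEOUT_S:
                self.state = "sleep"             # no robot signal: back to sleep
        return self.state
```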


C. System evaluation and automation improvements
The described system reduces the number of manual transportation trips and improves the management of the described tasks. Workers and nurses can delegate a number of low-level tasks to the robots, thereby focusing on tasks where their know-how is essential. Drug delivery is not a very frequent task; waste transportation, instead, requires a huge amount of effort from human personnel. The current battery power cycle allows the robots to operate for more than 21 hours a day, i.e., a period of time that is normally covered by three workers, since the standard working time is 8 hours a day. Another consideration is related to medical waste regulations. With the introduction of robots, the hospital management wants to separate the different paths of material distribution (i.e., medicine, linen and food) from waste disposal, as medical regulations impose. In particular, operating room waste is subject to stricter rules requiring that this waste come into contact with personnel as infrequently as possible; from this perspective, automated robot-based transportation enforces this requirement. As can be seen in Figure 1, the two task types are characterized by very different paths and distances, and hence by different completion times. The robot performing waste transportation has to cover about 500 meters and take an elevator, with a mean overall mission time of about 15 minutes, which is quite constant because there are not many people in those hospital areas. The distance to traverse for drug delivery is much shorter, only about 50 meters; however, the path presents a large number of obstacles and people, so the time to accomplish the two deliveries in the two pediatric departments is about 8-10 minutes. Currently, waste transportation presents a number of difficult issues, to the extent that it may not be convenient for the hospital management. It is evident that human personnel are more flexible and provide better performance in some cases. On the one hand, the existing waste transportation process has proven to work reasonably well: the structures for waste storage and transportation are suitable for the task, if performed by humans. On the other hand, using an automated robot-based system requires the installation of a number of ad hoc conveyors, and the existing large carts for waste transportation must be replaced by smaller containers. Switching from human to automated transportation implies an important change in hospital habits that is difficult to implement.

VI. CONCLUSIONS
An example of a commercial robot for hospital environments has been presented. Merry Porter robots use different, consolidated approaches to navigate, avoid obstacles and localize. All of these aspects have been designed to improve reliability, with the aim of complying not only with state-of-the-art research results, but above all with real-world regulations and the functional needs of a typical hospital environment. The current set-up at the Polyclinic of Modena has been described, where two robots are used for drug delivery and waste transportation. The article addresses a number of topics that played a major role in the deployment process. In particular, it focuses on the architectural and system-level choices that were deemed necessary for a real-world installation.

REFERENCES
[1] J.M. Evans. HelpMate: an autonomous mobile robot courier for hospitals. In Proceedings of Intelligent Robots and Systems '94: Advanced Robotic Systems and the Real World (IROS '94), 1994.
[2] F. Carreira et al. i-Merc: A Mobile Robot to Deliver Meals inside Health Services. In Proceedings of the 2006 IEEE Conference on Cybernetics & Intelligent Systems & Robotics, Automation & Mechatronics, 2006, pp. 1-8.
[3] B. Graf, M. Hans, and R. D. Schraft. Care-O-bot II: Development of a next generation robotic home assistant. Autonomous Robots 16, no. 2 (2004): 193-205.
[4] Care-O-bot 3: http://www.care-o-bot-research.org/care-o-bot-3

[5] M. Takahashi, T. Suzuki, H. Shitamoto, T. Moriguchi, K. Yoshida. Developing a mobile robot for transport applications in the hospital domain. Robotics and Autonomous Systems, Vol. 58 (July 2010), pp. 889-899.
[6] Swisslog TransCar - http://www.swisslog.com/index/hcs-index/hcssystems/hcs-agv/hcs-agv-transcar.htm
[7] Swisslog RoboCourier - http://www.swisslog.com/index/hcs-index/hcssystems/hcs-mobile-robot-speciminder.htm
[8] Aethon TUG - http://www.aethon.com/products/default.php
[9] FMC Technologies ATLIS - http://www.fmcsgvs.com/content/products/atlis.htm
[10] M. Piaggio, A. Sgorbissa, and R. Zaccaria. A programming environment for real-time control of distributed multiple robotic systems. Advanced Robotics 14(1): 75-86 (2000).
[11] A. Sgorbissa and R. Zaccaria. Roaming Stripes: Smooth Reactive Navigation in a Partially Known Environment. In Proceedings of the 2003 IEEE International Workshop on Robot and Human Interactive Communication, Millbrae, California, USA, Oct. 31 - Nov. 2, 2003.
[12] F. Mastrogiovanni, A. Sgorbissa and R. Zaccaria. Learning to Extract Line Features: beyond Split and Merge. In Proceedings of the 10th International Conference on Intelligent Autonomous Systems (IAS-10), Baden-Baden, Germany, July 23-25, 2008.
[13] M. Piaggio, A. Sgorbissa, R. Zaccaria. Beacon Based Navigation and Localization for Service Mobile Robots. In Proceedings of the 2nd International Symposium on Robotics and Automation (ISRA 2000), Monterrey, N.L., Mexico, November 10-12, 2000.
[14] F. Capezio, A. Sgorbissa, R. Zaccaria. An Augmented State Vector Approach to GPS-Based Localization. In Proceedings of the 7th IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA-07), Jacksonville, Florida, June 21-23, 2007.
[15] F. Mastrogiovanni, A. Sgorbissa and R. Zaccaria. Context Assessment Strategies for Ubiquitous Robots. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA '09), Kobe, Japan, May 12-17, 2009.


6-DOF Multi-session Visual SLAM using Anchor Nodes


John McDonald, Michael Kaess, Cesar Cadena, José Neira, John J. Leonard
Department of Computer Science, National University of Ireland Maynooth, Maynooth, Co. Kildare, Ireland
Instituto de Investigación en Ingeniería de Aragón (I3A), Universidad de Zaragoza, Zaragoza 50018, Spain
Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA

Abstract This paper describes a system for performing multisession visual mapping in large-scale environments. Multi-session mapping considers the problem of combining the results of multiple Simultaneous Localisation and Mapping (SLAM) missions performed repeatedly over time in the same environment. The goal is to robustly combine multiple maps in a common metrical coordinate system, with consistent estimates of uncertainty. Our work employs incremental Smoothing and Mapping (iSAM) as the underlying SLAM state estimator and uses an improved appearance-based method for detecting loop closures within single mapping sessions and across multiple sessions. To stitch together pose graph maps from multiple visual mapping sessions, we employ spatial separator variables, called anchor nodes, to link together multiple relative pose graphs. We provide experimental results for multi-session visual mapping in the MIT Stata Center, demonstrating key capabilities that will serve as a foundation for future work in large-scale persistent visual mapping. Index Terms multi-session visual SLAM, lifelong learning, persistent autonomy

Fig. 1: Internal architecture of windowed and multi-session visual SLAM (vSLAM) processes.

I. INTRODUCTION
Despite substantial recent progress in visual SLAM [17], many issues remain to be solved before a robust, general visual mapping and navigation solution can be widely deployed. A key issue in our view is that of persistence: the capability for a robot to operate robustly for long periods of time. As a robot makes repeated transits through previously visited areas, it cannot simply treat each mission as a completely new experiment, making no use of previously built maps. However, nor can the robot treat its complete lifetime experience as one big mission, with all data considered as a single pose graph and processed in a single batch optimisation. We seek to develop a framework that achieves a balance between these two extremes, enabling the robot to leverage the results of previous missions, while still adding new areas as they are uncovered and improving its map over time. The overall problem of persistent visual SLAM involves several difficult challenges not encountered in the basic SLAM problem. One issue is dealing with dynamic environments, requiring the robot to correct its internal representation for long-term changes, such as furniture and other objects being moved; this issue is not addressed in this paper. Another critical issue, which is addressed in this paper, is how to pose the state estimation problem for combining the results of multiple mapping missions efficiently and robustly. Cummins defines the multi-session mapping problem as the task of aligning two partial maps of the environment collected by the robot during different periods of operation [3].

We consider multi-session mapping in the broader context of life-long, persistent autonomous navigation, in which we would anticipate tens or hundreds of repeated missions in the same environment over time. As noted by Cummins, the kidnapped robot problem is closely related to multi-session mapping: in the kidnapped robot problem, the goal is to estimate the robot's position with respect to a prior map given no a priori information about the robot's position. Also closely related to the multi-session mapping problem is the multi-robot mapping problem. In fact, multi-session mapping can be considered a more restricted case of multi-robot mapping in which there are no direct encounters between robots (only indirect encounters, via observations made of the same environmental structure). Kim et al. presented an extension to iSAM to facilitate online multi-robot mapping based on multiple pose graphs [11]. This work utilised anchor nodes, equivalent to the base nodes introduced by Ni and Dellaert for the decomposition of large pose graph SLAM problems into submaps for efficient batch optimisation [18], in an approach called Tectonic Smoothing and Mapping (T-SAM). Our work extends the approach of Kim et al. [11] to perform multi-session visual mapping by incorporating a stereo odometry frontend in conjunction with a place-recognition system for identifying inter- and intra-session loop closures.


II. RELATED WORK
Several vision researchers have demonstrated the operation of visual mapping systems that achieve persistent operation in a limited environment. Examples of recent real-time visual SLAM systems that can operate persistently in a small-scale environment include Klein and Murray [12], Eade and Drummond [5], and Davison et al. [4, 8]. Klein and Murray's system is highly representative of this work, and is targeted at the task of facilitating augmented reality applications in small-scale workspaces (such as a desktop). In this approach, the processes of tracking and mapping are performed in two parallel threads. Mapping is performed using bundle adjustment. Robust performance was achieved in an environment as large as a single office. While impressive, these systems are not designed for multi-session missions or for mapping of large-scale spaces (e.g., the interior of a building). There have also been a number of approaches reported for large-scale visual mapping. Although a comprehensive survey is beyond the scope of this paper, we draw attention to the more relevant stereo-based approaches. Perhaps the earliest of these was the work of Nistér et al. [19] on stereo odometry. In the robotics literature, large-scale multi-session mapping has been the focus of recent work of Konolige et al. on view-based mapping systems [14, 13]. Our research is closely related to this work, but has several differences. A crucial new aspect of our work in relation to [14] is the method we use for joining the pose graphs from different mapping sessions. Konolige and Bowman join pose graphs using weak links, which are used to connect disjoint sequences. The weak links are added with a very high covariance and subsequently deleted after place recognition is used to join the pose graphs [14]. In our approach, which extends [11] to full 6-DOF, we use anchor nodes as an alternative to weak links; the use of anchor nodes provides a more efficient and consistent way to stitch together the multiple pose graphs resulting from multiple mapping sessions. In addition, our system has been applied to hybrid indoor/outdoor scenes, with hand-carried (full 6-DOF) camera motion.

III. SYSTEM OVERVIEW
In this section we describe the architecture and components of a complete multi-session stereo visual SLAM system. This includes a stereo visual SLAM frontend, a place recognition system for detecting single- and multi-session loop closures, and a multi-session state-estimation system. A schematic of the system architecture is shown in Figure 1. The system uses a sub-mapping approach in conjunction with a global multi-session pose graph representation. Optimisation is performed by applying incremental and batch SAM to the pose graph and the constituent submaps, respectively. Each submap is built up over consecutive sets of frames, where both the motion of the sensor and a feature-based map of the scene are estimated. Once the current submap reaches a user-defined maximum number of poses, 15 in our system, the global pose graph is augmented with the resultant poses. In parallel to the above, as each frame is processed, the visual SLAM frontend communicates with a global place

recognition system for intra- and inter-session loop closure detection. When a loop closure is detected, pose estimation is performed on the matched frames, with the resultant pose and frame ids passed to the multi-session pose graph optimisation module.

IV. STEREO ODOMETRY
Within each submap the inter-frame motion and associated scene structure are estimated via a stereo odometry frontend. The most immediate benefit of using stereo vision is that it avoids issues associated with monocular systems, including the inability to estimate scale and indirect depth estimation. The stereo odometry approach we use is similar to that presented in [19]. Our stereo odometry pipeline tracks features using a standard robust approach followed by a pose refinement step. For each pair of stereo frames we first track a set of Harris corners in the left frame using the KLT tracking algorithm. The resulting tracked feature positions are then used to compute the corresponding feature locations in the right frame. Approximate 6-DOF pose estimation is performed through the use of a RANSAC-based 3-point algorithm [6]. The input to the motion estimation algorithm consists of the set of tracked feature positions and disparities within the current frame and the current estimates of the 3D locations of the corresponding landmarks. In our work we have found that ensuring that approximately 50 features are tracked between frames results in a reliable pose estimate from the 3-point RANSAC procedure. Finally, accurate pose estimation is achieved by identifying the inliers of the estimated pose and using them in a Levenberg-Marquardt optimisation that minimises the reprojection error in both the left and right frames. In our implementation of the above stereo odometry pipeline we use a GPU-based KLT tracker [25]. This minimises the load on the CPU (by delegating feature detection and tracking to the GPU) and exploits the GPU's inherently parallel architecture to permit processing at high frame rates. In parallel to this we compute a disparity map for the frame, which is then combined with the results of the feature tracker, resulting in a set of stereo features. In order to maintain an adequate number of features we detect new features in every fifth frame, or when the number of feature tracks in the current frame drops below a certain threshold. A consequence of keeping the number of features in a given frame high, whilst at the same time setting a minimum inter-feature distance in the KLT tracker, is that it helps to ensure a good distribution of the resulting feature set over the image.
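The outline below sketches the per-frame structure of such a track / robust-pose / refine pipeline using OpenCV building blocks. It is an illustrative stand-in under our own parameter choices (thresholds, feature counts), not the authors' GPU-based code, and OpenCV's PnP RANSAC is used here in place of the dedicated 3-point solver.

```python
import cv2
import numpy as np

def stereo_odometry_step(prev_left, cur_left, prev_pts, prev_landmarks, K):
    """One step of a simple stereo odometry pipeline.

    prev_pts       : Nx1x2 float32 corner positions tracked in the previous left image
    prev_landmarks : Nx3 3D points (left camera frame of the previous pose),
                     triangulated from the previous stereo pair
    K              : 3x3 intrinsics of the (rectified) left camera
    Returns (R, t) mapping previous-frame coordinates into the current frame.
    """
    # 1. Track the existing corners into the current left image (KLT).
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_left, cur_left, prev_pts, None)
    good = status.ravel() == 1
    obj = prev_landmarks[good].astype(np.float32)
    img = cur_pts[good].reshape(-1, 2).astype(np.float32)

    # 2. Robust 6-DOF pose from 3D-2D correspondences (RANSAC), followed by an
    #    iterative Levenberg-Marquardt style refinement on the inliers only.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, None,
                                                 reprojectionError=2.0)
    if not ok:
        raise RuntimeError("pose estimation failed (e.g. low texture)")
    rvec, tvec = cv2.solvePnPRefineLM(obj[inliers.ravel()], img[inliers.ravel()],
                                      K, None, rvec, tvec)
    R, _ = cv2.Rodrigues(rvec)

    # 3. Top up the feature set when tracks run low (here: a fixed threshold of 50).
    if good.sum() < 50:
        new = cv2.goodFeaturesToTrack(cur_left, maxCorners=200,
                                      qualityLevel=0.01, minDistance=10)
        # New corners would be matched into the right image via the disparity map
        # and triangulated to extend prev_landmarks (omitted in this sketch).
    return R, tvec
```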


V. SINGLE SESSION VISUAL SLAM
Deriving a pose graph representation from the stereo odometry system involves two levels of processing. The first of these optimises over the poses, features and structure within a local window. As each new frame is added, a full batch optimisation is performed. The second step transfers optimised poses to the pose graph after a fixed maximum number of frames is reached. The resulting pose graph structure contains no point features and can be optimised efficiently even for a large number of poses.
We apply smoothing in combination with a homogeneous point parameterisation to the local window to improve the pose estimates obtained from visual odometry. In contrast to visual odometry, smoothing takes longer range constraints into account, which arise from a single point being visible in multiple frames. The homogeneous point parameterisation p = (x, y, z, w) allows dealing with points at infinity [24]. Points close to or at infinity cannot be represented correctly by the conventional Euclidean formulation. Even for points that are not at infinity, convergence of the smoothing optimisation is typically improved.
We use exponential maps based on Lie group theory to deal with overparameterised representations. In particular we use quaternions to represent orientations in 3D space. Quaternions use four parameters to represent three degrees of freedom, therefore causing problems for conventional least-squares algorithms. Using an exponential map, as described for example in [7], reduces the local updates during optimisation to three parameters. The homogeneous point parameterisation suffers from the same problem, and indeed the same solution can be applied as for quaternions after realising that both are equivalent to the 3-sphere S^3 in R^4 if normalised.
With overparameterisations removed, the optimisation problem can now be solved with standard least-squares solvers. We use the iSAM library [9] to perform batch smoothing with Powell's Dog Leg algorithm. iSAM represents the optimisation as a factor graph, a bipartite graph containing variable nodes, factor nodes and links between those. Factor nodes, or factors for short, represent individual probability densities

f_i(\theta_i) = f_i(x_{j_i}, p_{k_i}) \propto \exp\left( -\tfrac{1}{2} \left\| \Pi(x_{j_i}, p_{k_i}) - z_i \right\|^2_{\Sigma_i} \right)    (1)

where \Pi(x, p) is the stereo projection of a 3D point p into a camera of given 3D pose x, yielding the predicted stereo projections (u_L, v) and (u_R, v), z_i = (u_L, u_R, v) is the actual stereo measurement, and \Sigma_i represents the Gaussian image measurement noise. iSAM then finds the least-squares estimate \Theta^* of all variables (camera poses and scene structure combined) as

\Theta^* = \arg\max_{\Theta} \prod_i f_i(\theta_i)    (2)

When the smoothing window reaches a maximum size, all poses and associated odometry are transferred to the current session's pose graph, and a new local window is initialised. By including all poses from a window, as opposed to just the first or the first and last pose (as is the case in other approaches), we ensure that loop closures between arbitrary frames can be dealt with within the pose graph. Full details of the loop closure handling are provided in Section VII. To initialise a new window we use the last pose of the previous window in conjunction with all landmarks that correspond to features that are tracked into the current frame.
The pose graph is again optimised using the iSAM library [9], but this time using the actual incremental iSAM algorithm [10] to efficiently deal with large pose graphs. In contrast to the stereo projection factors f_i in the smoothing formulation above, we now use factors g_i

g_i(\theta_i) = g_i(x_{j_i}, x_{j'_i}) \propto \exp\left( -\tfrac{1}{2} \left\| (x_{j_i} \ominus x_{j'_i}) - c_i \right\|^2_{\Sigma_i} \right)    (3)

that represent constraints c_i with covariances \Sigma_i between pairs of poses, as obtained by local smoothing or by loop closure detection. We use the notation x_d = x_a \ominus x_b from Lu and Milios [16] for representing pose x_a in the local frame of pose x_b (so that x_a = x_b \oplus x_d).

VI. PLACE RECOGNITION
Place recognition is an important component in the context of large-scale, multi-robot and multi-session SLAM, where algorithms based on visual appearance are becoming more popular for detecting locations already visited, also known as loop closures. In this work we have implemented a place recognition module based on the recent work of [1, 2], which demonstrated robust and reliable performance. The place recognition module has the following two components.
The first component is based on the bag-of-words method (BoW) [23], implemented in a hierarchical way [20]. This implementation enables quick comparisons of an image at time t with a database of images in order to find those that are similar according to a score s. There are then three possibilities: if s \geq \alpha^+ s_t the match is considered highly reliable and accepted; if \alpha^- s_t < s < \alpha^+ s_t the match is checked by conditional random field (CRF) matching in the next step; otherwise the match is ignored. In our implementation, s_t is the BoW score computed between the current image and the previous one in the database. The minimum confidence expected for a loop closure candidate is \alpha^- = 0.15, and the confidence for a loop closure to be accepted directly is \alpha^+ = 0.8. The images from one session are added to the database at one frame per second and only with the sensor in motion, i.e. during the last second the sensor's motion according to the visual odometry module must be greater than 0.2 m or 0.2 rad.
The second component consists of checking the previous candidates with CRF-Matching in 3D space. CRF-Matching is an algorithm based on Conditional Random Fields (CRF), proposed by Lafferty et al. [15]; it has been applied to matching 2D laser scans [21] and image features [22]. CRF-Matching is a probabilistic model that is able to jointly reason about the association of features. In [1], CRF-Matching was extended to reason in 3D space about the association of data provided by a stereo camera system. We compute the negative log-likelihood \Lambda_{t,t'} of the maximum a posteriori (MAP) association between the current scene at time t and the candidate scene at time t'. We accept the match only if \Lambda_{t,t'} \leq \Lambda_{t,t-1}. This module exploits the efficiency of BoW to detect revisited places in real time. CRF-Matching is a more computationally demanding data association algorithm because it uses much more information than BoW; for this reason, only the positive results of BoW are passed to CRF-Matching.
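The loop-closure gating just described reduces to a small amount of logic; the sketch below encodes it with the alpha- = 0.15 and alpha+ = 0.8 thresholds quoted above (equivalently applied to the score normalised by s_t). The CRF verification is abstracted behind a callback, since its implementation is beyond this sketch.

```python
ALPHA_MINUS = 0.15   # minimum normalised score to consider a candidate
ALPHA_PLUS = 0.80    # normalised score above which a match is accepted outright

def gate_loop_closure(score, score_prev, crf_check):
    """Decide what to do with a BoW loop-closure candidate.

    score      : BoW similarity between the current image and a database image
    score_prev : BoW similarity between the current image and the previous one (s_t)
    crf_check  : callable returning True if CRF-Matching confirms the candidate
    Returns "accept", "accept_after_crf" or "reject".
    """
    if score_prev <= 0.0:
        return "reject"                      # degenerate normaliser, be conservative
    s = score / score_prev                   # normalise by the expected self-similarity
    if s >= ALPHA_PLUS:
        return "accept"                      # highly reliable match
    if s > ALPHA_MINUS:
        return "accept_after_crf" if crf_check() else "reject"
    return "reject"

# Example: a mid-confidence candidate that CRF-Matching confirms.
print(gate_loop_closure(0.30, 0.60, crf_check=lambda: True))   # accept_after_crf
```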


VII. MULTI-SESSION VISUAL SLAM
For multi-session mapping we use one pose graph for each robot/camera trajectory, with the multiple pose graphs connected to one another with the help of anchor nodes, as introduced in Kim et al. [11] and Ni and Dellaert [18]. In this work we distinguish between intra-session and inter-session loop closures. Processing of loop closures is performed by first passing each candidate frame to the place recognition system described above. Candidate frames are matched against previously input frames from all sessions. On successful recognition of a loop closure, the place recognition system returns the matched frame's session and frame identifier in conjunction with a set of stereo feature correspondences between the two frames. These feature sets consist of lists of SURF feature locations and stereo disparities. Note that since these features are already computed and stored during the place recognition processing, their use here does not place any additional computational load on the system. These feature sets serve as input to the same camera orientation estimation system described in Section IV. Here the disparities of one of the feature sets are used to perform 3D reconstruction of their preimage points. These 3D points are passed, with their corresponding 2D features from the second image, into a 3-point-algorithm-based RANSAC procedure. Finally, the estimated orientation is iteratively refined through a non-linear optimisation procedure that minimises the reprojection error in conjunction with the disparity.
Inter-session loop closures introduce encounters between pose graphs corresponding to different visual SLAM sessions. An encounter between two sessions s and s' is a measurement that connects two robot poses x^s_j and x^{s'}_{j'}. This is in contrast to measurements between poses of a single trajectory, which are of one of two types: the most frequent type connects successive poses, and is derived from visual odometry and the subsequent local smoothing; a second type is provided by intra-session loop closures.
The use of anchor nodes [11] allows multiple pose graphs that have previously been optimised independently to be combined at any time. The anchor node \Delta_s for the pose graph of session s specifies the offset of the complete trajectory with respect to a global coordinate frame. That is, we keep the individual pose graphs in their own local frames. Poses are transformed to the global frame by pose composition \Delta_s \oplus x^s_i with the corresponding anchor node. In this relative formulation, pose graph optimisation remains the same; only the formulation of encounter measurements involves the anchor nodes. The factor describing an encounter between two pose graphs also involves the anchor nodes associated with each pose graph. The anchor nodes are involved because the encounter is a global measurement between the two trajectories, while the pose variables of each trajectory are specified in the session's own local coordinate frame. The anchor nodes are used to transform the respective poses of each pose graph into the global frame, where a comparison with the measurement becomes possible.

Fig. 2: Multi-session visual SLAM processing

The factor h describing an encounter c_i is given by

h(x^s_j, x^{s'}_{j'}, \Delta_s, \Delta_{s'}) \propto \exp\left( -\tfrac{1}{2} \left\| \left( (\Delta_{s'} \oplus x^{s'}_{j'}) \ominus (\Delta_s \oplus x^s_j) \right) - c \right\|^2_{\Sigma_c} \right)    (4)

where the index i was dropped for simplicity. The concept of relative pose graphs generalises well to a larger number of robot trajectories: the number of anchor nodes depends only on the number of robot trajectories.
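To make the role of the anchor nodes concrete, the following sketch evaluates the residual of such an encounter factor for planar poses (x, y, theta); the full system works on 6-DOF poses, so this 2D version is purely illustrative and the function names are ours.

```python
import numpy as np

def wrap(a):
    return (a + np.pi) % (2 * np.pi) - np.pi

def oplus(a, b):
    """Pose composition a (+) b: express pose b, given in a's frame, in a's parent frame."""
    x, y, t = a
    return np.array([x + b[0] * np.cos(t) - b[1] * np.sin(t),
                     y + b[0] * np.sin(t) + b[1] * np.cos(t),
                     wrap(t + b[2])])

def ominus(a, b):
    """Inverse composition a (-) b: express pose a in the frame of pose b (Lu and Milios)."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    c, s = np.cos(b[2]), np.sin(b[2])
    return np.array([c * dx + s * dy, -s * dx + c * dy, wrap(a[2] - b[2])])

def encounter_residual(x_s, x_sp, delta_s, delta_sp, c):
    """Residual of an encounter factor between session s and session s'.

    x_s, x_sp         : poses in their sessions' local frames
    delta_s, delta_sp : anchor nodes (offsets of each session w.r.t. the global frame)
    c                 : measured relative pose between the two frames (the encounter)
    """
    g_s = oplus(delta_s, x_s)     # session-s pose lifted to the global frame
    g_sp = oplus(delta_sp, x_sp)  # session-s' pose lifted to the global frame
    r = ominus(g_sp, g_s) - np.asarray(c, float)
    r[2] = wrap(r[2])
    return r

# Example: two sessions with different anchors; a perfect measurement gives a zero residual.
delta_s, delta_sp = np.array([0.0, 0.0, 0.0]), np.array([5.0, 2.0, np.pi / 2])
x_s, x_sp = np.array([1.0, 0.0, 0.0]), np.array([0.5, -0.5, 0.1])
c = ominus(oplus(delta_sp, x_sp), oplus(delta_s, x_s))
print(encounter_residual(x_s, x_sp, delta_s, delta_sp, c))  # approximately [0, 0, 0]
```

Because only the residual involves the anchors, each session's internal factors stay in its own local frame, which is what lets previously optimised pose graphs be combined at any time.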

VIII. EXPERIMENTS AND RESULTS

In this section we present results on the performance of our system for both single- and multi-session processing. The dataset that we use was collected at the Ray and Maria Stata Center at MIT over a period of months. This building is known for its irregular architecture and provides a good testing ground for visual SLAM techniques in general. The dataset includes indoor, outdoor and mixed sequences captured both from a wheeled platform and using a handheld camera with full 6-DOF movement (e.g. ascending and descending stairs). All image sequences were captured using a Point Grey Bumblebee colour stereo camera with a baseline of 11.9 cm, where both lenses had a focal length of 3.8 mm. The wheeled platform also included a horizontally mounted 2D SICK laser scanner and a spinning LiDAR. Although we do not use the LiDAR sensors in our system, the accompanying laser data allows us to compare the performance of our technique to that of a laser-based scan matcher in restricted 3D scenarios (i.e. 2D + rotational movement). The complete multi-session visual SLAM system follows the architecture shown in Fig. 1, and is implemented as a set of loosely coupled processes that communicate via the Lightweight Communications and Marshalling (LCM) robot middleware. This permits straightforward parallelism between the components of the system, hence minimising the impact on other modules of fluctuations in the load of a particular module (e.g. due to place recognition deferring to CRF processing). Furthermore, the overall architecture can be transparently reconfigured for different setups (e.g. from a single CPU to multi-core or distributed processing).


Fig. 3: Single-session visual SLAM processing including full 6-DOF motion, panels (a) and (b).

A. Single-Session Visual SLAM Results
In this section we provide results from a number of single-session SLAM experiments. We have applied the system in single-session mode (i.e. running only a single frontend) across a variety of sequences from the Stata Center dataset described above. The system is capable of operating over extended sequences in indoor, outdoor and mixed environments with full 6-DOF motion. Two example feature-based maps from outdoor sequences are shown in Fig. 3. Here, for (a), the underlying grid is at a scale of 10 m, and the trajectory is approximately 100 m in length. An example image from the sequence is shown in the inset, with the GPU KLT feature tracks overlaid on the left frame. Fig. 3 (b) shows a sequence of similar scale that includes full 6-DOF motion, where the user has carried a handheld camera up a staircase. In the absence of loop closing we have found the system to exhibit drift of approximately 1%-3% in position during level motion (i.e. without changes in pitch angle). To demonstrate this, Fig. 4 shows two maps with two trajectories, both taken from the same sequence. The yellow contour shows a 2D LiDAR-based map computed by applying a scan-matching algorithm to the output of the horizontal LiDAR scanner attached to the cart. The scan matcher's estimated pose is shown by the dark blue trajectory, which can be seen more clearly in the lower right-hand inset. The distance between grid lines in the figure is 2 m. From the figure, the horizontal displacement of the final poses is approximately 60 cm over a total trajectory of approximately 20 m. An example of the accumulated error in position due to drift is shown in Fig. 5. Here the dataset consists of an image sequence taken over an indoor area within the Stata Center. The grid is at a scale of 5 m, with the sequence taken by travelling on a large loop over a space of approximately 35 m x 15 m. The image at the top shows the result of the motion estimate in the absence of a loop closure. The majority of the drift here is due to the tight turn at the right-hand end of the

sequence, where the divergence between each traversal of the hallway can be clearly seen. The centre figure shows the result of the correction applied to the pose graph due to a sequence of loop closures occurring in the area highlighted by the red box. Here it can be seen that the pose graph sections corresponding to the traversals of the hallway are much more coincident and that the misalignment in corresponding portions of the map is reduced considerably. The figure also shows the accuracy of the map relative to the ground truth CAD floorplan. Although the odometry system has shown itself to be robust over maps of the order of hundreds of meters, two failure modes for the system are low-texture or low-contrast environments, and cases where the disparity estimation fails over a large set of features, e.g. due to aliasing. We do not address this situation in the current system; however, the standard approach of incorporating inertial sensors is a natural solution to this problem. An alternative approach that we are currently investigating is the possibility of using multi-session SLAM as a solution, whereby odometry failure results in the creation of a new session with a weak prior on the initial position. This disjoint session is treated the same as any other session: when a new encounter does occur, the session can be reconnected to the global pose graph. A future paper will present results of this approach.

B. Multi-Session Visual SLAM Results
To test the full multi-session visual SLAM system, we took two sequences from the same area as shown in Fig. 5 and processed each through a separate instance of the visual SLAM frontend. The results of the separate sessions are shown in Fig. 6 (a) and 6 (b), with the combined multi-session results shown in Fig. 6 (c). Again, loop closure occurred in the same area as shown in Fig. 5 (b). Finally, Fig. 6 (d) shows a textured version of the same map. The scale of the grid is 2 m for Figures (a) & (b), and 5 m for Figures (c) & (d).


Fig. 4: Comparison of drift in single-session visual SLAM against a 2D LiDAR scan matcher over a 20 m trajectory. Grid scale is 2 m.

IX. CONCLUSIONS

In this paper we have presented a 6-DOF multi-session visual SLAM system. The principal contribution of the paper is to integrate all of the components required for a multi-session visual SLAM system using iSAM with the anchor node formulation [11]. In particular, this is the first example of an anchor node based SLAM system that (i) uses vision as the primary sensor, (ii) operates in general 6-DOF motion, (iii) includes a place recognition module for identifying encounters in general environments, and (iv) derives 6-DOF pose constraints from those loop closures within these general environments (i.e. removing the need for fiducial targets as was used in [11]). We have demonstrated this system in both indoor and outdoor environments, and have provided examples of single- and multi-session pose graph optimisation and map construction. We have also shown the effects of loop closures within single-session mapping in reducing drift and correcting map structure. Multi-session visual mapping provides a solution to the problem of large-scale persistent localisation and mapping. In the future we plan to extend the results published here to incorporate the entire Stata dataset described in Section VIII. Furthermore, we intend to evaluate the approach in online collaborative mapping scenarios over extended timescales.

ACKNOWLEDGMENTS

Research presented in this paper was funded by a Strategic Research Cluster grant (07/SRC/I1168) by Science Foundation Ireland under the Irish National Development Plan, and by the Dirección General de Investigación of Spain under projects DPI2009-13710 and DPI2009-07130. The authors gratefully acknowledge this support. The authors would like to thank Hordur Johannsson and Maurice Fallon for their assistance in the collection of the Stata datasets.

REFERENCES
[1] C. Cadena, D. Gálvez, F. Ramos, J.D. Tardós, and J. Neira. Robust place recognition with stereo cameras. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), Taipei, Taiwan, October 2010.



Fig. 5: Single-session dataset containing a large loop. Here the grid scale is at 5 m. (a) Map and pose graph prior to loop closure showing drift in position and structure. (b) Map and pose graph showing correction in the position and structure due to a series of loop closures in the area shown by the red square. Background image shows ground truth CAD floorplans of the environment. (c) Textured version of figure (b).
[2] C. Cadena, J. McDonald, J. Leonard, and J. Neira. Place recognition using near and far visual information. In Proceedings of the 18th IFAC World Congress, August 2011. To appear.
[3] M. Cummins. Probabilistic Localization and Mapping in Appearance Space. PhD thesis, University of Oxford, 2009.
[4] A.J. Davison. Real-time simultaneous localisation and mapping with a single camera. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on, pages 1403-1410, 2003.
[5] E. Eade and T. Drummond. Unified loop closing and recovery for real time monocular SLAM. In British Machine Vision Conference, 2008.
[6] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381-395, 1981.
[7] F.S. Grassia. Practical parameterization of rotations using the exponential map. J. Graph. Tools, 3:29-48, Mar 1998.
[8] H. Strasdat, J.M.M. Montiel, and A.J. Davison. Real-time monocular SLAM: Why filter? In IEEE Intl. Conf. on Robotics and Automation (ICRA), May 2010.

Fig. 6: Stata Center second floor dataset with two separate sessions captured over an 850 m² area. (a) Map and poses for session 1. (b) Map and poses for session 2. (c) Multi-session pose graphs after inter-session loop closure showing transformed maps. (d) Textured version of figure (c).
[9] M. Kaess, H. Johannsson, and J.J. Leonard. Incremental smoothing and mapping (iSAM) library. http://people.csail.mit.edu/kaess/isam, 2010-2011.
[10] M. Kaess, A. Ranganathan, and F. Dellaert. iSAM: Incremental smoothing and mapping. IEEE Trans. Robotics, 24(6):1365-1378, Dec 2008.
[11] B. Kim, M. Kaess, L. Fletcher, J.J. Leonard, A. Bachrach, N. Roy, and S. Teller. Multiple relative pose graphs for robust cooperative mapping. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 3185-3192, Anchorage, Alaska, May 2010.
[12] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. In Proceedings of the 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 1-10. IEEE Computer Society, 2007.
[13] K. Konolige, J. Bowman, J.D. Chen, P. Mihelich, M. Calonder, V. Lepetit, and P. Fua. View-based maps. Intl. J. of Robotics Research, 29(10), 2010.
[14] K. Konolige and J. Bowman. Towards lifelong visual maps. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), pages 1156-1163, 2009.
[15] J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, pages 282-289. Morgan Kaufmann, San Francisco, CA, 2001.
[16] F. Lu and E. Milios. Globally consistent range scan alignment for environmental mapping. Autonomous Robots, 4:333-349, April 1997.
[17] J. Neira, A.J. Davison, and J.J. Leonard. Guest editorial special issue on visual SLAM. IEEE Trans. Robotics, 24(5):929-931, Oct 2008.
[18] K. Ni, D. Steedly, and F. Dellaert. Tectonic SAM: Exact, out-of-core, submap-based SLAM. In IEEE Intl. Conf. on Robotics and Automation (ICRA), pages 1678-1685, Apr 2007.
[19] D. Nister, O. Naroditsky, and J. Bergen. Visual odometry for ground vehicle applications. J. of Field Robotics, 23(1):3-20, 2006.
[20] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, volume 2, pages 2161-2168, 2006.
[21] F. Ramos, D. Fox, and H. Durrant-Whyte. CRF-Matching: Conditional Random Fields for feature-based scan matching. In Robotics: Science and Systems (RSS), 2007.
[22] F. Ramos, M.W. Kadous, and D. Fox. Learning to associate image features with CRF-Matching. In Intl. Sym. on Experimental Robotics (ISER), pages 505-514, 2008.
[23] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Intl. Conf. on Computer Vision (ICCV), volume 2, page 1470, Los Alamitos, CA, USA, 2003. IEEE Computer Society.
[24] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment - a modern synthesis. In W. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, LNCS, pages 298-375. Springer Verlag, Sep 1999.
[25] C. Zach, D. Gallup, and J.-M. Frahm. Fast gain-adaptive KLT tracking on the GPU. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1-7, June 2008.


Lightweight SLAM and Navigation with a Multi-Camera Rig


Gerardo Carrera, Adrien Angeli, Andrew J. Davison
Department of Computing, Imperial College London, 180 Queen's Gate, London SW7 2AZ, UK
Abstract: An interesting recent branch of SLAM algorithms using vision has taken an appealing approach which can be characterised as simple, robust and lightweight when compared to the more established and complex geometrical methods. These lightweight approaches typically comprise mechanical odometry or simple visual odometry for local motion estimation; appearance-based loop closure detection using either whole image statistics or invariant feature matching; and some type of efficient pose graph relaxation. However, such algorithms have so far been proven only as localisation systems, since they have not offered the semantic demarcation of free space and obstacles necessary to guide fully autonomous navigation and exploration. In this paper we investigate how to apply and augment such approaches with other lightweight techniques to permit fully autonomous navigation and exploration around a large and complicated room environment. Our approach uses only odometry and visual sensing, the latter being provided by a rig of multiple standard cameras without the need for precise inter-camera calibration. A particular focus of our work is to investigate the camera configurations which are most valuable to permit the capabilities needed by autonomous navigation, and our solution neatly assigns each camera a well-defined specialist role.

Index Terms: Computer Vision, Robotics, SLAM, Multi-Camera Rig.

I. INTRODUCTION

Computer vision is an increasingly appealing sensing modality for mobile robotics, and a number of successful SLAM systems involving vision as the primary outward-looking sensor have been presented recently. In particular, besides the more standard feature-based reconstruction approaches, there have been several systems which have solved the visual SLAM problem using much simpler lightweight techniques which combine local trajectory estimation (via either wheel or simple visual odometry), visual place recognition, and pose graph optimisation (e.g. [13]). However, in either paradigm few systems have gone further than tackling localisation, and few real visual SLAM systems have been proven in autonomous navigation. In this paper we present a method based on a combination of lightweight vision techniques which permits robust, automatic and repeatable mapping and navigation around a large room. In this scenario, there are several demands placed on the robot's sensing systems:
- Local trajectory estimation.
- Place recognition to detect re-visits (loop closures) and permit global map optimisation.
- Detection and avoidance of obstacles in the robot's close proximity.
- Global free-space mapping for path planning.
Here we describe a novel combination of lightweight methods to provide all of these capabilities using odometry together

with multi-camera visual sensing. Our use of a rig of multiple standard cameras is based on the observation that while a wide field of view is often desirable for SLAM and relocalisation, the other requirements placed on a vision system when the goal is full autonomous navigation are not easily satisfied by a single omnidirectional camera. These cameras often have a limited and low resolution view of the downward angles below the horizontal necessary for free space and obstacle detection, which, looking further to the future, could be a limitation for tasks like object recognition or human interaction. Pre-calibrated multi-camera rigs in a single unit which offer good resolution across a wide field of view are available, such as Point Grey's Ladybug, but they are expensive, as well as inflexible since they cannot be reconfigured. The advantage of multiple specialised sensors in high performance autonomous systems has been proven in robots such as the DARPA Grand Challenge vehicles or Willow Garage's PR2. In our work, we choose to use just cameras rather than other outward-looking sensor types, but retain the idea of multiple cameras mounted in a weakly coupled way which provide specific functions as part of a whole navigation solution, without requiring a tedious inter-camera calibration procedure. We demonstrate automatic localisation and mapping, and autonomous navigation in a goal-directed scenario where the robot is able to move repeatably between any pair of points indicated in the map. Further, we demonstrate full autonomous exploration: the robot is dropped into a large cluttered room with no prior knowledge and is able to explore autonomously to build a consistent map of the whole area suitable for autonomous navigation.

II. RELATED WORK

A. Lightweight Vision-Based SLAM

The lightweight approaches we study in this work have at their core a coarse metric-topological environment model, but they can still enable accurate and autonomous navigation. The single most important function provided by vision within such a system boils down to image retrieval, where the most recent image is compared against all previous locations in the environment model. The use of colour and texture information encoded in global histograms for image representation has proven successful for topological robot localisation [16, 18]. More recently, purely appearance-based topological place recognition approaches based on the bag-of-words paradigm and inspired by information retrieval techniques have been proposed [2, 4]. These are however more computationally demanding, as they require expensive feature descriptor calculations and matching, and we consider


them unnecessary in the context of our robot's local navigation problem. Although a rig of multiple standard cameras is a seemingly good alternative to a single omnidirectional device, the use of such a rig for SLAM has been the subject of little research, presumably because of the difficulty of extrinsic calibration. In our previous work [3] we showed that autonomous external calibration of a camera rig with no overlapping views was indeed possible, and [10] demonstrated SLAM using an eight-camera rig. While such geometric work on fused multi-camera SLAM will doubtless continue, in the present paper we aim to use each of the cameras for one or more specialised tasks without the need for global extrinsic calibration.

B. Integrated Visual SLAM and Autonomous Navigation

There have only been a few visual SLAM systems using standard cameras which enable fully autonomous navigation. Davison and Murray's early visual SLAM system based on fixating active stereo [6] was used for real-time position-based navigation, but the feature map generated was too sparse to permit reliable reasoning about free and occupied areas of space. With the advent of feature-based robot SLAM systems with much increased feature density, there have recently been some attempts at performing free-space mapping based on semi-dense point clouds. Notably, systems like [15] offer good possibilities for free-space detection (at least in highly textured areas with many feature points), and enable autonomous navigation and exploration.

C. 2D Free Space Mapping using Vision

On the assumption that our robot moves on a ground plane, 2D free space mapping is critical to allow obstacle avoidance and path planning. This problem has been studied not only for robot navigation but also in road detection for vehicles to aid autonomous driving. Some approaches attempt to define the drivable area either using offline machine learning techniques such as Support Vector Machines [17] or by a combination of geometric information and offline learning [1]. Such techniques cannot however be directly transposed to the indoor 2D free-space detection problem, because in this context there is no such thing as a general geometric pattern for the boundary of the drivable area that can be retrieved in each image based on (potentially learned) a priori information. Successful methods for obstacle avoidance and free-space mapping for mobile robot navigation rely solely on the use of colour information [9], or infer a planar homography mapping image pixels to 2D floor coordinates under a ground plane assumption [19, 11]. The authors of [12] propose to make use of multiple cues to calculate horizontal lines defining the boundary between floor and walls, which is well suited to corridors, but presumably not adapted to more complicated structure.

III. METHOD

We divide our whole framework into five main sections which are all interconnected:

- Rig camera placement.
- Map representation and loop-closure detection.
- Global map relaxation.
- Free space mapping.
- Autonomous navigation, obstacle avoidance and path planning.

A. Rig Camera Placement

Our robot's vision capability is provided by a rig of up to four standard cameras, each with a lens offering approximately 80° horizontal field of view. There are many possible choices for the configuration of these cameras, since they must support the various tasks required by loop closure detection, free-space mapping and exploration, and we have examined the trade-offs of different set-ups. Notably, the final configuration chosen is adapted to the characteristics of the office environment used, where movements can be quite restricted and the distance to objects relatively short (e.g., when traversing narrow corridors between two desks). In principle, an extremely wide field of view, up to the maximum full cylindrical field of view offered by a single omnidirectional camera, is well suited to relocalisation: not only does this enable the capture of a good variety of appearance data to act as a fingerprint for a location, but it also permits recognition of previously-visited places independent of the orientation of the robot. However, when this wide field of view is provided not by a single omnidirectional camera but by a rig, we found that additional difficulties arose. We experimented extensively with histogram-based place recognition (see the next section) with an ad-hoc four-camera rig designed such that the cameras were mounted horizontally with maximum angular spacing (i.e. at the corners of a square). However, the performance in recognising locations with different robot orientations was disappointing. This was partly due to the fact that the four cameras used left gaps or blind spots in the cylindrical field of view which would not necessarily align when the robot had made a rotation. Another significant point was that the actual distance between the camera centres on the robot was often significant compared to the distance to objects and furniture in the indoor scene, so unmodelled parallax effects came into play. We found that a good pragmatic solution to camera placement for loop closure detection in an indoor scene is to have one horizontal camera facing forwards, and one backwards (see Figure 1). This configuration is able to detect the vast majority of loop closure events the robot will encounter, because in restricted spaces, when a robot revisits a place it is very likely to do so either moving in the same or exactly the opposite direction to before. Yes, there will be possible loop closure events sometimes missed when the robot crosses a previous path at right angles. But in fact, such a crossing might well be difficult to reliably detect in any case, as it may correspond to a wide open area where several significantly distant positions at the centre of it have similar appearance, making recognition ambiguous and localisation inaccurate.


Fig. 1. Robot camera configuration consisting of a three-camera rig where camera C is used for obstacle avoidance and free space mapping and cameras A and B are used for loop closure detection.

Together with the front/back camera combination for loop closure detection, we added a third camera pointing down at the ground in front of the robot for free space mapping. As detailed later, we experimented with the downward angle of this camera, where there is a trade-off between immediate, local obstacle detection capability and a more forward view suitable for planning. There is an interesting feedback in play here between the image matching for relocalisation required by SLAM and the free space mapping which permits autonomous robot guidance. A free space detection solution which is working well will enable the robot, when exploring or revisiting, to navigate in a precise way, for instance by moving repeatably half-way between gaps or along corridor passages. This makes loop closure detection easier, since the robot is likely to revisit locations very precisely.

B. Map Representation and Loop Closure Detection

Our approach does not assume any prior knowledge about the environment except that it is traversable by a wheeled robot and that its visual appearance is descriptive enough. As the robot autonomously explores or is being driven through an environment, it builds a topological or graph representation of the area. A topological map is a very practical and desirable model for navigation, since it imposes a discrete structure on a continuous space, and because it easily enables the use of different low level (graph relaxation) and high level (path planning) algorithms. In this undirected graph G = (V, E), a vertex V_t = (I_t^N, X_t) represents a physical location in the environment which stores all the images I from the N cameras of the rig at time t as well as the global position of the robot X_t = (x_t, y_t, θ_t) (2D position plus heading direction). Each edge E_t in the graph stores the relative 2D transformation between nodes X_t and X_{t+1}. A new vertex is initialised when the distance traveled from the previous position is greater than some threshold (using the internal odometry of the robot), or when the dissimilarity between consecutive images is above a threshold. Our approach for image comparison relies on a global descriptor implemented in the form of a 1D grey-level histogram. Such minimalistic single-signature place characterisation enables reasonable discrimination, while on the other hand only requiring frugal computational resources. The signature of a location is obtained by sticking the images of both

forward and backward cameras next to each other into a single mosaic which serves as the support for the calculation of the descriptor. Different methods for comparison were tested, as well as multi-dimensional histograms including cues such as gradients, colour or intensity, with the best results obtained using EMD [14] over 1D histograms of grey intensities. Loop-closure detection is achieved by comparing the most recent location signature against all the previously visited locations. Thanks to the compactness and simplicity of place characterisation, such an exhaustive retrieval procedure can be executed efficiently. Once a candidate for loop closure is found, time consistency is ensured to cope with perceptual aliasing. To this end, we perform a comparison of the appearance of 7 locations L_recent = {V_{i-3}, ..., V_i, ..., V_{i+3}} centred in time around the location of interest V_i, with 7 locations L_previous = {V_{j-3}, ..., V_j, ..., V_{j+3}} similarly sampled around the potentially loop-closing location V_j. This yields a 7 × 7 matrix C = {C_ij}, i,j = 1,...,7, where each entry corresponds to the distance (in appearance space) between locations V_i in L_recent and V_j in L_previous. Note that the procedure has to be postponed until the locations V_{i+1}, V_{i+2} and V_{i+3} are visited and added to the map. Asserting the time-consistency of a potential loop-closure requires an evaluation of the entries of C which takes care of the heading direction of the robot (see Figure 2). However, because of the configuration of the rig retained in our approach, we only need to distinguish between situations where the relative orientation from one passing to another is either 0 deg or 180 deg. Therefore, we review the scores of all the elements on both diagonals of C, and only if they are all below some threshold for one diagonal is the loop-closure accepted. This enforces the consistency of the appearance over neighbouring locations in time around both the location of interest and the potential loop-closing location, in such a way that the relative orientation of the robot between the two passings is properly taken into consideration.

C. Graph Relaxation and Map Optimization

Due to the topological nature of our map representation we can optimise the poses by minimising the error between robot positions every time a loop closure is found. Every time a new graph relaxation is applied, the map becomes more consistent with the real metric map and therefore becomes usable for navigation and path planning. In our approach we used TORO [8], which provides a highly efficient gradient descent-based error minimisation for constrained graphs. In order to enable accurate obstacle avoidance and path planning, a global free space map M of the environment is built incrementally as the robot navigates (see Section III-D for details about free space detection). To this end, we associate to every keyframe of the robot trajectory a relative local free space map M_t, every vertex of the graph G now being represented as V_t = (I_t^N, X_t, M_t), where M_t is a simple 2D occupancy grid with a resolution of 1 cm² per cell and whose origin is anchored according to the position and orientation of X_t. To ensure the consistency between topological and global


Fig. 2. Loop closures using two cameras: forward and backward. a) represents a loop closure found with a difference in robot orientation of 180 deg with respect to the matching node. b) represents a loop closure found with the same robot orientation as the matching node. The top row in figures a) and b) indicates the current frame and the bottom row the matching node or keyframe. The single image at the left of figures a) and b) shows the matrix consistency check, with the bright red indicating strong matching along either the forward or backward diagonal.
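As a rough illustration of the time-consistency test described above, the sketch below builds the 7 × 7 appearance-distance matrix C and accepts a loop closure only if every entry along one of its two diagonals falls below a threshold. The signature distance function and the threshold are placeholders, not the exact quantities used by the authors.

import numpy as np

def consistency_check(recent_sigs, previous_sigs, distance, threshold):
    """recent_sigs / previous_sigs: appearance signatures of the 7 locations
    centred on V_i and V_j respectively; distance: e.g. an EMD-style
    histogram distance. Returns (accepted, same_orientation)."""
    assert len(recent_sigs) == 7 and len(previous_sigs) == 7
    C = np.array([[distance(r, p) for p in previous_sigs] for r in recent_sigs])

    forward = np.diag(C)                  # same heading as the previous visit (0 deg)
    backward = np.diag(np.fliplr(C))      # opposite heading (180 deg)
    if np.all(forward < threshold):
        return True, True
    if np.all(backward < threshold):
        return True, False
    return False, None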

free space maps, M is updated based on the optimised vertex positions whenever the topological graph is relaxed. It is true that with our featureless technique for loop-closure detection, an accurate position of the robot is not obtained right away. Yet, such precision is important, as it improves the graph of poses after relaxation. Therefore, to obtain a good relative position estimate at loop-closure between the current X_t and past X_p poses, we approximate the displacement ΔP = (Δx, Δy, Δθ) between them by aligning the contours of their respective local free space maps M_t and M_p, which can be simply done by solving a cost function minimising the distance between points in those contours:

F(C1, C2) = min( Σ_{i,j} dist(P_i^{C1}, P_j^{C2}) ),    (1)

where the function dist(·,·) is evaluated by obtaining the 2D Euclidean distance between all the points in P_i^{C1} and P_j^{C2}. When the robot revisits a location in the opposite direction to that of the previous passing, it is difficult to match the free space maps, as they do not overlap very much. In such a situation, though, we can still approximate the relative orientation between the two views by calculating the mean horizontal optical flow between the corresponding images (obtained by the loop closure detection), and this relative orientation is used in the loop closure constraint.

D. Free Space Mapping

We have developed an accurate algorithm which incorporates geometry and colour information. Our solution assumes that the robot is moving on a plane which is the same over the whole room, and that the floor is made of large regions with homogeneous colour. Under the floor planarity assumption, it is possible to map the image pixels of a downward looking camera to the 2D cells of a local free space map of the visible floor in front of the robot. This transformation H_f is a homography which is calibrated once each time the camera is fixed to the robot, by the simple procedure of matching four artificial landmarks on the floor with the corresponding pixel positions in the image. The inverse mapping H_f^{-1} can be employed to retrieve from the image the RGB value X_i associated to any floor cell i visible in front of the robot, so as to determine if this cell is free of obstacles or not. This is done here by calculating the log-likelihood ratio of occupancy, as follows (statistical dependency on the model is omitted for simplicity):

ln( P(C_i = F | X_i) / P(C_i = O | X_i) ) = ln( P(X_i | C_i = F) / P(X_i | C_i = O) ) + ln( P(C_i = F) / P(C_i = O) ),    (2)

where C_i is the class of cell i (F for floor, O for obstacle). We assume constant empirically fixed priors and a uniform obstacle likelihood model, while the floor likelihood model is a mixture of L Gaussians similar to the one used by Thrun et al. [5] for road detection:

P(X_i | C_i = F) = (1 / Σ_{j=1}^{L} w_j) Σ_{j=1}^{L} w_j (2π)^{-3/2} |Σ_j|^{-1/2} exp( -0.5 (X_i - μ_j)^T Σ_j^{-1} (X_i - μ_j) ),    (3)

where μ_j, Σ_j and w_j respectively are the mean, covariance matrix and mixing coefficient parametrising Gaussian j. When the occupancy ratios have been calculated for every cell, the model parameters are updated according to the procedure proposed in [5], using only those cells whose ratio is above some threshold. Initially, the Gaussians are learned in the very first frame using only a small region at the bottom-centre of the image (that is, assuming that at startup the corresponding area on the floor is free of obstacles and is a good representative sample of the overall appearance of the floor). For better robustness and more efficient planning during navigation, the local occupancy grids corresponding to several consecutive robot poses are fused together (see Figure 3): not only does this provide a medium-sized free space map around the robot which is more suited to navigation, but it also makes it possible to filter out incorrect obstacle detections due to inaccurate probability scores in the presence of noise in the image or illumination changes in the scene. Similarly, as already mentioned, a global free space map M of the explored


environment can be computed at any time by fusing all the individual local occupancy grids (see Figures 5 and 6).
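To make the per-cell classification of Eqs. (2) and (3) concrete, the sketch below evaluates the log-likelihood ratio of a floor cell being free using a Gaussian-mixture colour model; the uniform obstacle likelihood and the prior values are illustrative placeholders rather than the parameters used in the paper.

import numpy as np

def floor_likelihood(x, means, covs, weights):
    """Mixture-of-Gaussians colour model for floor cells, Eq. (3)."""
    x = np.asarray(x, dtype=float)
    total, w_sum = 0.0, float(np.sum(weights))
    for mu, sigma, w in zip(means, covs, weights):
        diff = x - mu
        norm = (2 * np.pi) ** 1.5 * np.sqrt(np.linalg.det(sigma))
        total += w * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm
    return total / w_sum

def log_odds_free(x, means, covs, weights,
                  p_obstacle_colour=1.0 / 256 ** 3,   # uniform obstacle likelihood (assumption)
                  prior_free=0.7, prior_obstacle=0.3):  # fixed priors (assumption)
    """Log-likelihood ratio of a cell being free (floor) versus obstacle, Eq. (2)."""
    lik_free = floor_likelihood(x, means, covs, weights)
    return (np.log(lik_free / p_obstacle_colour)
            + np.log(prior_free / prior_obstacle))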

Fig. 4. Office environment with strong perceptual aliasing and challenging navigable spaces.

Fig. 3. The image on the left shows part of a map which has been incrementally built, the region inside the marked area is the current frame being mapped which corresponds to the image on the right.
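A minimal sketch of how local occupancy grids attached to optimised keyframe poses could be fused into the global free space map, as in Figure 3; the grid resolution, the local-grid origin convention and the additive evidence accumulation are assumptions made for illustration.

import numpy as np

def fuse_local_grids(keyframes, global_shape, resolution=0.01, origin=(0.0, 0.0)):
    """keyframes: list of ((x, y, theta), local_grid, (ox, oy)) where the pose is the
    optimised vertex position and local_grid holds per-cell occupancy evidence."""
    global_map = np.zeros(global_shape)
    for (x, y, theta), local_grid, (ox, oy) in keyframes:
        c, s = np.cos(theta), np.sin(theta)
        for (i, j), value in np.ndenumerate(local_grid):
            # cell centre in the keyframe's local metric frame
            lx, ly = ox + j * resolution, oy + i * resolution
            gx = x + c * lx - s * ly
            gy = y + s * lx + c * ly
            u = int((gx - origin[0]) / resolution)
            v = int((gy - origin[1]) / resolution)
            if 0 <= v < global_shape[0] and 0 <= u < global_shape[1]:
                global_map[v, u] += value       # accumulate evidence in the global map
    return global_map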

TABLE I: Evaluation of localisation accuracy (ground truth comparison, cm).
Robot Pos      Ground truth    Robot Pos       Ground truth
(182, 0)       (180, 0)        (96, 445)       (90, 450)
(212, 145)     (210, 150)      (-246, 487)     (-240, 480)
(425, 143)     (420, 140)      (89, -325)      (90, -330)
(512, -271)    (510, -270)     (-280, -326)    (-270, -330)

E. Autonomous Navigation and Exploration

We use the Dynamic Window Approach (DWA) [7] to plan smooth motions of the robot around nearby obstacles. Having estimated the free space around the robot based on a local window of a fixed number of recent keyframes, the local occupancy map is binarised to obtain boundaries between free space, obstacles and unexplored areas. Every cell detected as an obstacle boundary in our map is used in DWA to plan the next velocity commands. DWA plans the best route from the current robot position to a target location, avoiding obstacles, and therefore requires the definition of robot goals. We have considered it outside the scope of this paper to investigate ideal exploration strategies, finding that a heuristic technique was sufficient in our application. A goal is randomly selected relative to the robot position within a square window of 3 m × 3 m. If the robot has spent much time around the same area, then a goal is selected further away. It is also important to obtain a balance between mapping unexplored areas and obtaining accurate maps. Every time the robot has mapped an area completely, it revisits previous places within the mapped area to find potential loop closures. By doing this we try to correct drift in the odometry, and every cell is corrected according to the new robot position.

IV. EXPERIMENTS AND RESULTS

We have performed our experiments in an office of size 10 m × 10 m using a Pioneer robot platform. As can be seen in Figure 4, this environment represents a challenging scenario with strong perceptual aliasing and multiple narrow spaces (88 cm), making both localisation and autonomous navigation difficult. The quality of our approach is demonstrated by performing manual and autonomous navigation with incremental and real-time free space mapping and loop closure detection. Our first experiments consider the effect of loop closure detection on map precision: it is well known that odometry-based robot position estimation will continuously drift over time, until a loop-closure is detected and the inconsistency is compensated for. To navigate safely, it is therefore extremely important to correct the map whenever possible. As can be observed in the top left image of Figure 5,

a constructed map using only odometry information with no loop-closure detection is highly inaccurate, and it would be almost impossible to navigate autonomously or to use the map for further high-level tasks. When loop-closures are detected with only one camera (top right image), both the robot trajectory and the map are significantly more accurate. When using two cameras, the number of loop closure detections increases, and so the map becomes even more accurate (see Figure 5, bottom). We have investigated the impact of the downward looking angle of the camera used for free space detection on the quality of the map (Figure 5, bottom images). We have found that if the camera is oriented such that it covers only a very restricted region (i.e., within 20 cm to 110 cm) in front of the robot, then the obtained map is very detailed, enabling very accurate obstacle localisation. However, such a steep downward looking angle penalises motion planning, as it only provides very local information, with obstacles being detected very late, only when the robot is very close to them (see the bottom right part of the figure). When the camera covers a larger region (i.e., within 60 cm to 450 cm) in front of the robot, obstacles further away can be detected, enabling more efficient motion planning over a wider time window and leading to smoother robot trajectories (bottom left of the figure). To measure the metric quality of our free space map, we have picked 8 random robot positions distributed over the whole room, and compared their coordinates in the map with ground truth obtained by manually measuring the coordinates of these points on the floor. This comparison, presented in Table I, demonstrates the accuracy of our solution, with a mean 2D error of 6.39 cm and a max error of 10.77 cm. With our system the robot was able to successfully achieve several autonomous navigation runs of approximately 20 min. Figure 6 shows examples of global free space maps, providing


a comparison of the two downward looking angles already used earlier, again proving the superior accuracy of the proximal sensing configuration. The purpose of this experiment is to demonstrate the reliability of our complete solution comprising simultaneous localisation, mapping, free space detection and navigation: the exploration strategy here is deliberately simplistic, and used only as a proof of applicability of our solution (more advanced policies would certainly result in more efficient exploration).

Fig. 5. Free Space Map built by driving the robot manually (700 keyframes). Top left: map without loop closures; top right: map with loop closures coming from the camera facing forward; bottom left: map with loop closures from both cameras; bottom right: map built using a camera facing the floor.

Fig. 6. Free Space Map built autonomously using 2500 keyframes.

V. CONCLUSIONS

We have shown that lightweight vision-based techniques, previously shown to be effective for surprisingly high performance localisation, can be augmented to be similarly effective for fully autonomous robot navigation. Our system puts together odometry-based trajectory estimation, front/back vision-based loop closure detection, free space identification from a local obstacle detection camera and a forward-looking camera, and local and global occupancy mapping. Our approach relies on an ad-hoc rig of multiple standard cameras, and we have investigated the role each camera should effectively play for the best performance, but in future work it may be worth considering taking the specialisation route further and creating a system with heterogeneous camera types with even more elaborate configurations and roles.

ACKNOWLEDGMENTS

This work was supported by a CONACYT scholarship from the Mexican government to G. Carrera and European Research Council Starting Grant 210346.

REFERENCES

[1] Y. Alon, A. Ferencz, and A. Shashua. Off-road path following using region classification and geometric projection constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 689-696. IEEE, 2006.
[2] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer. Real-time visual loop-closure detection. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2008.
[3] G. Carrera, A. Angeli, and A. J. Davison. SLAM-based automatic extrinsic calibration of a multi-camera rig. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2011.
[4] M. Cummins and P. Newman. Highly scalable appearance-only SLAM: FAB-MAP 2.0. In Proceedings of Robotics: Science and Systems (RSS), 2009.
[5] H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, and G. Bradski. Self-supervised monocular road detection in desert terrain. In Proc. of Robotics: Science and Systems (RSS), 2006.
[6] A. J. Davison and D. W. Murray. Mobile robot localisation using active vision. In Proceedings of the European Conference on Computer Vision (ECCV), 1998.
[7] D. Fox, W. Burgard, and S. Thrun. The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine, 4(1):23-33, 1997.
[8] G. Grisetti, C. Stachniss, S. Grzonka, and W. Burgard. A tree parameterization for efficiently computing maximum likelihood maps using gradient descent. In Proceedings of Robotics: Science and Systems (RSS), 2007.
[9] J. Hoffmann, M. Jungel, and M. Lotzsch. A vision based system for goal-directed obstacle avoidance. In RoboCup 2004 Symposium, Instituto Superior Técnico, Lisboa, Portugal, 2004.
[10] M. Kaess and F. Dellaert. Probabilistic structure matching for visual SLAM with a multi-camera rig. Computer Vision and Image Understanding (CVIU), 2009.
[11] Y. Kim and H. Kim. Layered ground floor detection for vision-based mobile robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), volume 1, pages 13-18. IEEE, 2004.
[12] Y. Li and S.T. Birchfield. Image-based segmentation of indoor corridor floors for a mobile robot. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2010.
[13] M. Milford and G. Wyeth. Persistent navigation and mapping using a biologically inspired SLAM system. International Journal of Robotics Research (IJRR), 29(9):1131-1153, 2009.
[14] Y. Rubner, C. Tomasi, and L.J. Guibas. A metric for distributions with applications to image databases. In Proceedings of the 1998 IEEE International Conference on Computer Vision, pages 59-66, 1998.
[15] R. Sim and J. J. Little. Autonomous vision-based exploration and mapping using hybrid maps and Rao-Blackwellised particle filters. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), 2006.
[16] I. Ulrich and I. Nourbakhsh. Appearance-based place recognition for topological localization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2000.
[17] Q. Wu, W. Zhang, T. Chen, V. Kumar, et al. Camera-based clear path detection. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 1874-1877. IEEE, 2010.
[18] C. Zhou, Y. Wei, and T. Tan. Mobile robot self-localization based on global visual appearance features. In IEEE International Conference on Robotics and Automation, volume 1, pages 1271-1276. Citeseer, 2003.
[19] J. Zhou and B. Li. Robust ground plane detection with normalized homography in monocular sequences from a robot platform. In Image Processing, 2006 IEEE International Conference on, pages 3017-3020. IEEE, 2007.


A Decision Model of Adaptive Interaction Selection for a Robot Companion


Abir B. Karami, Abdel-Illah Mouaddib
GREYC-CNRS Laboratory, UMR 6072, 14032 Caen CEDEX, France. {akarami, abdel-illah.mouaddib}@info.unicaen.fr

Abstract: This paper is motivated by two ideas. First, establishing a system for a robot companion that is capable of switching between different types of interaction according to the human's needs. Second, representing the system with a model that outperforms Partially Observable Markov Decision Processes (POMDPs) for large-scale problems. To this end, we propose a unified model of Human-Robot Multi-Type Interaction (HRMTI). The objective is to observe the human's behavior, try to predict/estimate the human's intention/need, and therefore react appropriately (assist, cooperate, collaborate, ...). POMDPs are largely used by the community of collaborative and assistive Human-Robot Interaction (HRI). However, solving an HRMTI problem for large-scale applications by employing POMDPs is intractable. We present an approach to overcome this limitation by dividing the HRMTI problem into three levels: first, estimate the human intention; second, select the appropriate type of interaction; last, choose one of the pre-calculated policies with respect to the human intention and the needed type of interaction. We present the unified model in detail; it includes: a belief update model concerning the intention of the human, using multiple MDP models defined by empathy for each specific human task; an MDP model that supports the robot's decision making towards the human needs by selecting the appropriate interaction type; and finally, an algorithm that first defines the task to be accomplished by the robot, then chooses the pre-calculated policy to achieve it with respect to the type of interaction. We present some interesting performance and scalability analysis and, with a representative scenario inspired from robocup-at-work, we present an implementation on a real robot showing the feasibility of the approach.

Index Terms: Human-Robot Adaptive Interaction, Robot Companions, Hidden Markov Models, Markovian Decision Models.

I. INTRODUCTION

Nowadays, a vast research interest is progressing towards robot companions that share environments with people, such as elders in their homes or in assisted living facilities. A robot companion can help a human in different ways, depending on the human's possible incapacities or preferences. More information about the human's desires and constraints will more probably lead the robot companion to offer the right help when needed. For example, if the human has the intention of doing a certain task, like calling his friend, he might be able to achieve the task himself; however, he might need assistance in finding or remembering the phone number. In the first case, the robot will be able to help the human by achieving some other task (like cleaning the room), and in the second case the robot should offer assistance to the human to find the phone number.

We consider a Human-Robot Multi-Type Interaction (HRMTI) model where the robot companion observes a human and acts appropriately to provide the best help. Indeed, to best achieve this mission, the companion must be able to: estimate the human's intention (what task is he trying to achieve?), then infer the kind of help (interaction) that he might need, then reason about the best action to help the human. In addition to accurately and quickly inferring the human's intentions, the companion should adapt to the human's possible change in desire over time. A possible idea for solving such problems would be using POMDPs while representing the human's intention as a hidden part of the POMDP state. This solution was used in many studies in the HRI domain [2] [5] [10]. POMDPs have proved to be a natural solution for handling uncertainty about the environment and humans and for adapting to any possible change. However, as is known, solving POMDPs is PSPACE-complete, and finding a tractable approximate solution for real-size applications is highly complicated. We are interested in a framework for a robot companion that is able to interact with a human using different types of interaction (assisting, collaborating, cooperating, ...) depending on the recognized intention and preference of the human. Instead of using POMDPs, we propose an outperforming model that solves the problem on three levels without losing the adaptability and the uncertainty-handling performance of POMDPs. First, the system observes the human action and estimates his intention using multiple simulated Human-MDPs and a Hidden Markov Model (HMM) where the hidden state corresponds to the human intention. Second, an MDP model helps in inferring the kind of help the human needs, using the belief state updated from the HMM. Third, knowing the kind of interaction needed to help the human and his intention, the robot applies an action from previously computed policies (a policy for each task achieved under a certain type of interaction).

II. DEFINITIONS

In this section, we will present some definitions that are related to the HRMTI model. We will present a few types of interaction and try to define the differences between them. We also present our definition of a task and how it can be related to one or more types of interaction. Finally, knowing all that, we define what a human intention might be. We define I as a set of interaction classes I = {IC1, IC2, IC3, ...}. Each class of interaction differs in the


way the robot companion helps the human. We distinguish in the HRI literature three types of interaction that we will concentrate on in this paper: Cooperation, Assistance and Collaboration, I = {CP, AS, CL}.
Cooperation CP: a robot can cooperate with a human by doing a task on his behalf to save him time or effort. The robot should be capable of doing the task alone (cleaning the carpet or the dinner table), and should not disturb the human, respecting the human's preferences and interests [7].
Assistance AS: a robot companion can assist a human by guidance (typically spoken guidance). For example, by accessing a manual of how to run the dishwasher, the robot can assist the human by reading out the steps. This type of interaction also covers cases like robot assistants for people with dementia [2].
Collaboration CL: a task that needs the participation of both the human and the robot to achieve it (typically a physical action) requires a collaboration between them [6] [10]. A robot companion should not keep the human waiting when he shows interest in a collaborative task.
The robot holds a list of all possible tasks that can be done by the human or/and the robot. A task tk ∈ TK can be defined as tk = ⟨context, agent, policy, time, ...⟩, where: context holds information about the task regarding the context of the problem; it might include relations and dependencies between tasks, conditions to achieve, and related databases to access or update information; agent ∈ {R, H, H∨R, H∧R} is the possible agent that can achieve the task, which can be achieved uniquely by the robot, uniquely by the human, by any of them, or by both of them; policy ∈ {true, false} indicates whether the robot holds a policy or a manual of how to assist the human if he needs assistance to do this task; and time is the number of time steps normally needed to achieve the task. A task can be achieved by one or more types of interaction following the rules given in Table I (a compact coding of these rules is sketched below):

TABLE I: Relevance between tasks and different interaction classes. (*) represents any possible value.
tk ∈ CP : tk = ⟨*, {R, H∨R}, *, *⟩
tk ∈ AS : tk = ⟨*, {H, H∨R}, true, *⟩
tk ∈ CL : tk = ⟨*, {H∧R}, *, *⟩
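The mapping of Table I from a task definition to the interaction classes it admits can be expressed directly; the field names of the task tuple follow the definition above, while the string encodings of the agent values are illustrative choices.

# Interaction classes a task admits, following Table I.
# Agent values: "R" (robot only), "H" (human only), "H_or_R", "H_and_R".

def interaction_classes(task):
    """task: dict with keys 'context', 'agent', 'policy', 'time'."""
    classes = set()
    if task["agent"] in ("R", "H_or_R"):
        classes.add("CP")                 # robot can do it on the human's behalf
    if task["agent"] in ("H", "H_or_R") and task["policy"]:
        classes.add("AS")                 # robot can guide the human step by step
    if task["agent"] == "H_and_R":
        classes.add("CL")                 # both must participate
    return classes

# Example: a task the human can do alone but for which the robot holds a manual.
print(interaction_classes({"context": None, "agent": "H", "policy": True, "time": 5}))
# -> {'AS'}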

Fig. 1. The HRMTI Decision Control Model. (Level 1: Human Intention Estimator (belief update) using the Library of Human Action Values (Q-values); Level 2: Interaction Class Selector (MDP); Level 3: Task & Policy Selection drawing on the policy database and sending the robot action to the environment.)


We define TK = TKh ∪ TKr, where for TKh the domain of the variable agent is {H∨R, H, H∧R}, and for TKr the variable agent ∈ {H∨R, R, H∧R}. The human's intention can be one of the human's tasks or nothing at all: intention ∈ TKh ∪ {do nothing}.

III. THE MULTI INTERACTION MODEL

We will present in this section the multi-type interaction HRMTI decision control model that allows a robot to observe a human, detect his possible intentions (intended tasks) and infer any possible need of assistance or collaboration. We concentrate in this paper on a robot companion system that puts the human needs first (assistance, collaboration), otherwise helping the human by doing cooperative tasks. Figure 1 presents the three levels of the system. In the first level, at each time step, the Human Intention Estimator

observes the human action z^ie and updates the estimation over the possible human intentions, HIE(b'^ie | b^ie, z^ie). It uses a library of human action values (Q-values) calculated offline. The second level receives these estimations and uses them to efficiently switch the type of interaction in a way that corresponds to the human desires. The third level uses all information from the first and second levels to choose a task for the robot, respecting the belief over the human intentions and the chosen class of interaction. By selecting the policy of the robot task from the policy database, the robot applies the appropriate action from this policy. Afterwards, the system passes the control back to the Human Intention Estimator and, along with the newly observed human action, a new time step begins. We note that levels 2 and 3 adapt their decisions following the possible changes in the human intention estimation. The high-level algorithm of the system is as follows (a sketch of this loop appears later in this section):
1) the system observes the human action z^ie,
2) the belief update function updates the belief over the human intention b'^ie,
3) the Interaction Class Selector creates the state s^ic from b'^ie,
4) the Interaction Class Selector calls the policy π^ic to choose an interaction class,
5) the Task Selector algorithm chooses a task tk ∈ TKr to be achieved by the robot,
6) the appropriate policy from the database is called, and a robot action is sent to the system,
7) the robot applies its action and a new time step starts.
In the remainder of this section, we explain how we build the Library of Q-values and detail the three levels of the system.

A. The Human-MDPs

To be able to link an observed human action to a possible human intention, we propose a solution inspired from the simulation theory of empathy [11], which suggests that with a similar brain structure to simulate the underlying mental states of a person, it is possible to infer his beliefs and goals and to anticipate and understand his behavior. Our solution consists of building a number of Human-MDPs simulated by empathy:


MDP^h_tk = ⟨S^h_tk, A^h, T^h_tk, R^h_tk⟩, for each tk ∈ TKh. Each of the Human-MDPs simulates a rational human acting towards a task tk. After solving the Human-MDPs (using Value Iteration [1] or another approximate MDP solver), a library of Q-values is filled using the MDP^h_tk model and the corresponding calculated value function V_tk:

Q_tk(s^h, a^h) = R^h_tk(s^h, a^h) + Σ_{s'^h ∈ S^h} T^h_tk(s^h, a^h, s'^h) V_tk(s'^h)

The simulated Human-MDP policies are calculated offline by the system but are not necessarily followed by the human. Figure 2 shows an example library of Q-values. The Q-values hold a value for every human action in any possible state, and therefore help in evaluating any rational human action towards any of his intentions. It is possible to personalise the system towards a certain human behavior by updating the values online while observing the real human actions [5]. We notice from the figure that observing the human doing a^h_1 from state s^h_1, for example, indicates a higher estimation for intention tk1 rather than tk2, since Q_tk1(s^h_1)(a^h_1) > Q_tk2(s^h_1)(a^h_1). We will explain later how we integrate the Q-values in (Eq. 1) in order to update the belief over the possible human intentions in (Eq. 2).
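A compact sketch of how the library of Q-values could be filled offline: one value iteration per Human-MDP, followed by a one-step look-ahead to obtain Q_tk(s^h, a^h). The dictionary-based MDP representation and the discount factor (which the formula above omits) are assumptions made for brevity.

def value_iteration(S, A, T, R, gamma=0.95, eps=1e-6):
    """T[s][a] -> list of (s_next, prob); R[s][a] -> reward."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(R[s][a] + gamma * sum(p * V[sn] for sn, p in T[s][a])
                       for a in A)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V

def q_library(human_mdps, gamma=0.95):
    """human_mdps: {task: (S, A, T, R)}; returns library[task][state][action]."""
    library = {}
    for tk, (S, A, T, R) in human_mdps.items():
        V = value_iteration(S, A, T, R, gamma)
        library[tk] = {s: {a: R[s][a] + gamma * sum(p * V[sn] for sn, p in T[s][a])
                           for a in A}
                       for s in S}
    return library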
Fig. 2. The Q-values concluded from the Human-MDPs.

B. Level 1: The Human Intention Estimator (HIE)

The HIE is the part of the system that observes the human actions and tries to translate those actions into intentions using the Library of Q-values. A belief update function is used to calculate the belief over the human intentions after observing his actions. This function is defined using an HMM where the human intention is represented as the hidden part of the model. The HIE^ie = ⟨S^ie, Z^ie, T^ie, O^ie, b^ie_0⟩, where S^ie = s^ie_contexth × s^ie_intenh; the human intention (s^ie_intenh ∈ TKh) is non-observable and the human context (s^ie_contexth) includes the human's information regarding the problem context. The possible observations are the possible perceived actions of the human that are related to the set of tasks TK: Z^ie ⊆ A^h. We introduce into the transition function T^ie the probabilities that the human maintains or changes his intention (p, 1-p). These probabilities can be learned according to the human personality.

pr(s'^ie | s^ie) = p · pr(s'^ie_contexth | s^ie)                          if s'^ie_intenh = s^ie_intenh
pr(s'^ie | s^ie) = (1-p) · pr(s'^ie_contexth | s^ie) / |S^ie_intenh|      if s'^ie_intenh ≠ s^ie_intenh

The observation function O^ie is calculated using the Library of Q-values:

O^ie = pr(z^ie | s^ie, s'^ie) = η Q_{s'^ie_intenh}(s^ie_contexth)(z^ie),    (1)

where η is a normalizing factor. If the human is rational in his actions, then the probability of observing the human doing action (z^ie) while his intention is (s'^ie_intenh) can be estimated from the library of Q-values, where we have a value for each possible human action towards each possible human intention. The Human Intention Estimator receives at each time step the perceived human action z^ie and uses the belief update function (Eq. 2) to create the new belief state b'^ie and passes it to the Interaction Class Selector. Any possible change in the human intention is usually reflected in his actions and is therefore reflected in the new belief over the human intentions.

b'^ie(s'^ie) = Σ_{s^ie ∈ S^ie} pr(z^ie | s^ie, s'^ie) pr(s'^ie | s^ie) b^ie(s^ie)    (2)

C. Level 2: The Interaction Class Selector (ICS)

The ICS is responsible for choosing the appropriate type of interaction. It starts by analyzing the received belief state b'^ie and concludes the intention of the human. Then it uses a pre-calculated policy of an MDP (MDP^ic) to choose one of the different interaction classes: Cooperation CP; Assistance AS; Collaboration CL. We will explain two points before detailing the ICS model, and then we will explain how this model allows the robot to detect a human's need of assistance.
Confirming with the human: In order to avoid an unneeded gesture of assistance or collaboration deduced from a false intention estimation, we propose to add a Confirmation policy which allows the robot to ask the human for a confirmation when the estimated human intention is of type CL. On the other hand, the fact that the human intention is of type AS does not necessarily prove that the human needs assistance to achieve his task (he is possibly capable of achieving it by himself). We will explain later how the human's need of assistance is detected. The confirmation is simple, for example: (Do you need my help in doing task x?), and the human answer can be of type (yes, no). There is no need for any kind of confirmation if the human intention is of type CP. In this case, the robot will choose another task of type CP.
The Flag: The Flag represents the current status of interaction, or information that directly affects the decided type of interaction between the robot and the human. It is also a way to communicate between the different levels of the system. At each time step the Flag is assigned a value by one of the system models. The value of the Flag is used in the ICS to decide the needed class of interaction. Examples of possible Flag values:
- PossibleNeed : a need of assistance is detected.
- DoingX : the robot is processing a task of type X, where X ∈ I.
- Confirm : the robot is confirming with the human his need of help.
- Yes : the human answer is (yes).
- No : the human answer is (no).
1) The ICS model: The ICS is a Markov Decision Process MDP^ic = ⟨S^ic, A^ic, T^ic, R^ic⟩:
S^ic = s^ic_contexth × s^ic_intenh × s^ic_Flag, where: (s^ic_contexth) represents human context related information,


(s^{ic}_{inten_h} ∈ TK_h) is the human intended task, and (s^{ic}_{Flag} ∈ Flag) holds the Flag value.
A^{ic} = {doCP, doAS, doCL, doConfirm}: the decision is one of the possible interaction classes, or to confirm the need of one of them.
T^{ic} = pr(s'^{ic} | s^{ic}, a^{ic}): we calculate the ICS transition function in two parts:

T^{ic} = pr(s'^{ic}_{Flag} | s^{ic}, a^{ic}) · pr({s'^{ic}_{context_h}, s'^{ic}_{inten_h}} | s^{ic}).

The first part is responsible for the probability of changing the Flag variable given the action a^{ic} ∈ A^{ic}. The second part is responsible for the transition of the human variables s^{ic}_{context_h} and intention s^{ic}_{inten_h}, where we also introduce the probabilities of the human maintaining or changing his intention (p, 1-p). The changes in the other human information (s^{ic}_{context_h}) depend on the specific context of the problem.

pr({s'^{ic}_{context_h}, s'^{ic}_{inten_h}} | s^{ic}) =
  pr(s'^{ic}_{context_h} | s^{ic}) · p                                if s'^{ic}_{inten_h} = s^{ic}_{inten_h}
  pr(s'^{ic}_{context_h} | s^{ic}) · (1-p) / |S^{ic}_{inten_h}|       if s'^{ic}_{inten_h} ≠ s^{ic}_{inten_h}

R^{ic}(s^{ic}, a^{ic}): a confirmation of collaboration is rewarded if the estimated human intention is of type CL, while a confirmation of assistance is only rewarded if Flag = PossibleNeed. DoCL and DoAS are rewarded when the estimated human intention is of type CL or AS, respectively, and Flag = Yes. DoCP is rewarded otherwise.

The ICS policy π^{ic} is calculated offline. Once the system is online, the policy is called to choose the best interaction class. As the state of the human is not totally observable, the state of the ICS is created online as described in Algorithm 1.

Algorithm 1 The creation of the ICS state
1: INPUT: Flag, b^{ie}(s): the passed Flag and belief state.
2: s_t = argmax(b^{ie}(s))        // argmax in S^{ie}
3: s_t2 = argmax(b^{ie}(s))       // argmax in {S^{ie} \ s_t}
4: if (b^{ie}(s_t) - b^{ie}(s_t2) > k) then          // distinct intention
5:   s^{ic} = ⟨s_{t,context_h}, s_{t,inten_h}, Flag⟩
6: else                                              // ambiguity in b^{ie}
7:   for (tk ∈ AS) do                                // check for need of assistance
8:     if (s_{t,context_h} relates with tk_{context}) then
9:       s^{ic} = ⟨s_{t,context_h}, tk, PossibleNeed⟩
10:    end if
11:  end for
12: end if
13: if s^{ic} still not defined then                 // no clear intention
14:   s^{ic} = ⟨s_{t,context_h}, s_{t,inten_h}, Flag⟩
15: end if

Using the belief state b^{ie}(s) that the ICS receives from the HIE at each time step, the algorithm chooses the dominant state s_t in this belief state and uses its variables to build the ICS state s^{ic}. We define:

s_t = argmax_{s ∈ S^{ie}} b^{ie}(s),    s_t2 = argmax_{s ∈ S^{ie} \ {s_t}} b^{ie}(s).

Line (4) of Algorithm 1 tests if the dominant state s_t is dominant enough to recognize a distinct intention. The value of k has an inverse relationship with the number of tasks tk ∈ TK_H. The condition in Line (4) covers the cases where (s^{ic}_{inten_h} ∈ {CP ∪ CL}). However, if (s^{ic}_{inten_h} ∈ AS) it does not necessarily mean that the human needs assistance.

2) Detecting a need of assistance: When the human actions are indistinct (waiting, for example), this leads to a certain ambiguity in the belief state over his intention. This ambiguity might be a sign of a lack of interest in all intentions, or of a need of assistance. Line (7:) tests if the human possibly needs help in doing one of the (AS) tasks. This test checks for a relation between the human context variable s_{t,context_h} and the context variables of one of the assistance tasks (AS). This relation might be a shared location or any variable that indicates a human interest in this task. In case of a possible need of assistance, a state is created with the related assistance task and a Flag of value (PossibleNeed). In Line (13:) we define the case where there is a lot of ambiguity in the belief state (b^{ie}) and no need of assistance. The created state includes the intention of the dominant state s_t even if it is not dominant enough for a clear type of interaction. From here we pass (s^{ic}) to the policy (π^{ic}), which gives the appropriate ICS action.

D. Level 3: Choosing a Task and Achieving a Type of Interaction

Algorithm 2 describes the selection of the task that the robot should achieve. We notice that the task is easily chosen if the human intention is of type AS or CL and he confirmed his need of help. Otherwise, in Line (11) the algorithm looks for tasks that only the robot can do. In Lines (17:23), if none of the prior cases is true, the method looks for a task that is doable by the robot and is least intended by the human. The algorithm sets the Flag to pass information about the type of interaction the robot is intending to do. As the system passes through Level 2 at each time step to choose the interaction class, any change of the human intention will be processed directly. This allows the robot to adapt quickly to the human's possible need of assistance or collaboration. If there are dependency conditions for achieving a task (defined in tk_{context}), tests over such conditions should be integrated into Algorithm 2. We assume a database of pre-calculated policies, one for each of the tasks in the mission respecting a certain type of interaction. Once the task is chosen, the appropriate pre-calculated policy is called.
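As an illustration of Algorithm 1, here is a minimal Python sketch of the ICS state creation; the data layout and the relates_with() test are hypothetical placeholders, not the authors' code.

```python
def create_ics_state(flag, belief, k, assistance_tasks, relates_with):
    """Build the ICS state (context, intention, Flag) from the HIE belief (Algorithm 1).
    belief: dict mapping (context, intention) -> probability."""
    ranked = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)
    (ctx_t, int_t), b_t = ranked[0]                      # dominant state s_t
    b_t2 = ranked[1][1] if len(ranked) > 1 else 0.0      # runner-up s_t2
    if b_t - b_t2 > k:                                   # distinct intention (Line 4)
        return (ctx_t, int_t, flag)
    for tk in assistance_tasks:                          # ambiguity: need of assistance? (Line 7)
        if relates_with(ctx_t, tk):
            return (ctx_t, tk, "PossibleNeed")
    return (ctx_t, int_t, flag)                          # no clear intention (Line 13)
```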


Algorithm 2 Task Selection
1: INPUT: s^{ic}_{inten_h}, b^{ie}(s).
2: OUTPUT: tk ∈ TK, Flag.
3: if (s^{ic}_{inten_h} ∈ AS and Flag ∈ {Yes, PossibleNeed}) then
4:   tk = s^{ic}_{inten_h}
5:   Flag = DoingAS
6: else if (s^{ic}_{inten_h} ∈ CL and Flag ∈ {Yes}) then
7:   tk = s^{ic}_{inten_h}
8:   Flag = DoingCL
9: else
10:  for (t ∈ TK_r) do
11:    if (t_{agent} = R) then                 // priority: robot-only tasks
12:      tk = t
13:      Flag = DoingCP
14:    end if
15:  end for
16: end if
17: while (tk is not defined) do
18:   get next argmin: s_t0 = argmin(b^{ie}(s))
19:   if (s_{t0,inten_h} ∈ TK_r AND s_{t0,inten_h} ∉ CL ∪ AS) then
20:     tk = s_{t0,inten_h}
21:     Flag = DoingCP
22:   end if
23: end while

IV. EXPERIMENTS

The results we show in this section concern problems with three classes of interaction. We present some performance results, including offline calculation times registered for different sizes of the problem. For the scalability analysis we show in Table II the results of 6 experiments. We consider the tasks' contexts to be their positions in the human-robot shared area. The HMDPs are calculated to plan the human path to each possible task. Task types are divided equally and randomly into (CP, CL, AS). In those experiments we represented the areas with (10 to 100) rooms, each with (10 to 20) accessible positions, which is large enough to cover scenarios of a house, an apartment or a hotel. Table II presents the number of positions |POS|, the number of tasks |TK|, the number of ICS states and the offline calculation time. We consider that the presented HRMTI calculation times allow us to solve real-world robot companion scenarios, contrary to a POMDP model (Table III)¹.

TABLE II
EXPERIMENTS FROM THE HRMTI MODEL

Exp. | |POS| | |TK| | |S^{ic}|    | total time (minutes)
1    | 200   | 25   | 4.5 · 10^4  | < 1
2    | 200   | 50   | 9 · 10^4    | < 1
3    | 400   | 50   | 1.8 · 10^5  | 2
4    | 400   | 100  | 3.6 · 10^5  | 5
5    | 1200  | 100  | 1.08 · 10^6 | 21
6    | 1200  | 400  | 4.32 · 10^6 | 480 (8 hours)

¹ Experiments done on a Linux machine with a 6-core Intel(R) 2.13 GHz CPU and 20 GB of memory.

In previous work [7], we presented a POMDP model to solve a cooperative human-robot interaction. The model was presented for a robot that shares a mission with the human, where the mission is divided into tasks (of type CP only). The robot is supposed to choose the tasks that the human is least interested in doing. Table III shows the incapability of a POMDP model to solve large-scale problems. The approach we present in this paper solves problems more than 1000 times the size of the problems solved using the POMDP model, and with acceptable calculation time. This makes it clear that it would be impossible to solve HRMTI using a POMDP model.

TABLE III
EXPERIMENTS FROM THE COOPERATIVE-INTERACTION POMDP MODEL

Exp. | |POS| | |TK| | |S^{ic}| | total time (minutes)
1    | 10    | 4    | 2765     | 2
2    | 12    | 4    | 4385     | 10
3    | 11    | 6    | 15112    | 51
4    | 16    | 6    | 40897    | 315 (5 hours)

We remind the reader that the parts calculated offline are: the HMDP policies, the HIE transition and observation functions, and the MDP^{ic} policy. The more complicated the scenario, the more time is needed to create the MDP models and to solve them. In our results, the majority of the calculation time was spent on the creation of the transition function T^{ic} of the MDP^{ic} model. We also remind the reader that once the system is online, the robot's decisions are taken instantly.

Figure 3 includes screen-shots of a video demonstrating three types of interactions. After some human hesitation in front of the bookcase, in 3(a) the robot offers assistance in finding a book (we suppose the robot has a database with the placement of books in the bookcase). In 3(b), Confirm for a CL task when the human started to show interest in the collaborative task: filling the printer with paper. In 3(c), DoingCL by bringing a pack of paper to the human. In 3(d), a DoingCP task: cleaning the window. The complete video can be seen at: http://users.info.unicaen.fr/akarami/demohri/demo_multi_interaction.avi

Fig. 3. Screen-shots of a video demonstrating three types of interactions: (a) Confirm Assistance, (b) Confirm Collaboration, (c) Collaboration, (d) Cooperation.

V. RELATED WORK

To our knowledge, there is no prior work on adaptive multi-interaction systems in HRI. However, assistive and collaborative individual models have received large interest in the literature. Most prior work on decision making for assistant systems used approaches based on POMDP models. Some of them were for assistant-interaction systems. For example, hierarchical POMDPs for an autonomous mobile robotic assistant


that provides elderly people in nursing homes with reminders about their daily activities [9]. A cognitive assistive system to help people with advanced dementia in their daily living activities is presented in [2]: the authors describe a model for a hand-washing assistant which monitors the person's progress in the activity and suggests guidance in case of an unusual observed behavior. Markovian models have also been described for human intention/activity recognition problems. A switching hidden semi-Markov model [4] is used for a system that learns a model of the house occupant's daily activities through observation; then, by monitoring the person's current activity, the model detects any abnormality and alerts the caregiver. Assistant robots are supposed to offer the information/help that facilitates the human's task. This class of approaches includes detecting the human's need of help and assisting him. However, these systems are limited to one type of interaction (assistive/informative) and, in some cases, the system is specialized for only one task [2]. This is why we consider it interesting to have a higher-level system that is able to detect which of the pre-calculated task/interaction policies is needed by the human. A POMDP model for selective assistive actions [5] estimates the possible human goals by using the Q-function values of a representative optimal human MDP. They use these values in a heuristic to approximately solve the POMDP, while in our approach we use the Q-values in a belief update function over the human intention and his context variables. They describe a solution for human-robot interaction different from the one described in this paper: the assistant robot in [5] continues doing assistive actions from the calculated policy until no more actions can be done or high ambiguity is reached (no operation), at which point the human can participate by doing only one action. Our approach, in case of ambiguity, checks if special help is needed and otherwise switches to the cooperation interaction (CP). An additional line of research on robot companions includes robots that collaborate with the human to do their tasks. A wheelchair navigation system [10] uses a POMDP model that can predict, with minimal joystick input, the target position of the wheelchair's user and drive him there. We consider this approach as collaborative, as the system (the chair) and the human are collaborating to do the same task. Other problems suggested in the literature discuss robot companions that cooperate with their human partners without acting jointly with them: a robot companion plans its actions with awareness of the existence of its human partner in the same environment, using HMMs in [3] and POMDPs in [7]. In those problems the robot should recognize the human plan in order to cooperate by achieving its own tasks without disturbing the human plan. Some approaches use speech to help recognize the human plan or preferences for cooperative missions [8]. The previously mentioned approaches are specialized in only one type of interaction. This paper concentrated on how we can enrich companion robots with systems that are able to switch from one type of interaction to another depending on the human's needs. Our approach includes an online intention recognition for both the human intention and the type of help needed. It includes

an MDP model that helps the robot to make the best decision for each possible situation. Solving the problem this way helps to enrich the system without exploding the memory or the calculation time. In addition, we are able to solve large-scale applications of HRI problems, contrary to POMDP-based approaches.

VI. CONCLUSION

In this paper, we presented a multi-type human-robot interaction model for a robot companion. We explained the difference between three types of interaction: Cooperation, Assistance and Collaboration. We described how the human's behavior leads to an estimate of his need for one of those interactions. Then we presented the succession of processes in our model that gives the robot companion a good reaction. Our results show that by dividing the decision model into different components, we were able to solve real-life problem sizes with acceptable calculation time, and to compute policies that act reasonably, as expected. Those policies handle the problem of uncertainty over the human intention and also overcome the state-space limitation of POMDPs. An important amount of time is dedicated to manually creating the MDP models (HMDPs, ICS, HIE and a database of policies for each couple (task, interaction)). Nevertheless, this is a normal difficulty when solving HRI problems with a large task space. In addition, the policy database can be reused in different robot companion scenarios.

REFERENCES
[1] Richard Bellman. A markovian decision process. Indiana Univ. Math. J., 6:679-684, 1957.
[2] Jennifer Boger, Pascal Poupart, Jesse Hoey, Craig Boutilier, Geoff Fernie, and Alex Mihailidis. A decision-theoretic approach to task assistance for persons with dementia. In Leslie Pack Kaelbling and Alessandro Saffiotti, editors, IJCAI, pages 1293-1299, 2005.
[3] Marcello Cirillo, Lars Karlsson, and Alessandro Saffiotti. A human-aware robot task planner. In Alfonso Gerevini, Adele E. Howe, Amedeo Cesta, and Ioannis Refanidis, editors, ICAPS, 2009.
[4] Thi V. Duong, Hung Hai Bui, Dinh Q. Phung, and Svetha Venkatesh. Activity recognition and abnormality detection with the switching hidden semi-markov model. In CVPR (1), pages 838-845. IEEE Computer Society, 2005.
[5] Alan Fern, Sriraam Natarajan, Kshitij Judah, and Prasad Tadepalli. A decision-theoretic model of assistance. In Manuela M. Veloso, editor, IJCAI, pages 1879-1884, 2007.
[6] Guy Hoffman and Cynthia Breazeal. Cost-based anticipatory action selection for human-robot fluency. IEEE Transactions on Robotics, 23(5):952-961, 2007.
[7] Abir-Beatrice Karami, Laurent Jeanpierre, and Abdel-Illah Mouaddib. Human-robot collaboration for a shared mission. In Pamela J. Hinds, Hiroshi Ishiguro, Takayuki Kanda, and Peter H. Kahn Jr., editors, HRI, pages 155-156, 2010.
[8] L. Matignon, A. Karami, and A. I. Mouaddib. A model for verbal and non-verbal human-robot collaboration. In Dialog with Robots: AAAI 2010 Fall Symposium Technical Reports, pages 62-67, 2010.
[9] Joelle Pineau, Michael Montemerlo, Martha E. Pollack, Nicholas Roy, and Sebastian Thrun. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems, 42(3-4):271-281, 2003.
[10] Tarek Taha, Jaime Valls Miró, and Gamini Dissanayake. POMDP-based long-term user intention prediction for wheelchair navigation. In ICRA, pages 3920-3925, 2008.
[11] Wikipedia. http://en.wikipedia.org/wiki/Simulation_theory_of_empathy


Probabilistic Maneuver Prediction in Traffic Scenarios


Jonas Firl, Quan Tran
Adam Opel AG, 65423 Rüsselsheim, Germany
Department of Measurement and Control, KIT, 76131 Karlsruhe, Germany

Abstract: Traffic scene understanding is an important part of future advanced driver assistance systems (ADAS). In this paper we present a probabilistic approach to this problem. We use Hidden Markov Models (HMMs) for modeling the spatio-temporal dependencies of traffic situations. The parameters of the individual models are learned from a database generated by a professional driving simulation software. New oncoming situations are first perceived by an on-board sensor system including radar and camera. After the parameters of the traffic participants are estimated, they are used for recognition and prediction of the traffic maneuver by evaluating the likelihood for the learned models. Experimental results in real-world environments have shown that the method can robustly recognize driving maneuvers. As an example, the prediction of lane change maneuvers shows the benefit for different ADAS functions.
Index Terms: traffic scene understanding, probabilistic model, pattern recognition

I. MOTIVATION

Traffic scene understanding is a key concept for future advanced driver assistance systems (ADAS) and a fundamental requirement for autonomous driving as well. Real traffic is a complex phenomenon involving multiple traffic participants, which are time dynamic and interact with each other. To act intelligently in this dynamic scene, an ADAS should be able to perceive and interpret its environment. In particular, recognizing and predicting the maneuvers of traffic participants in advance allows improving the passengers' comfort and safety. Information perception in this context refers to the detection and tracking of relevant objects based on raw sensory data. At this low level the system must deal with the uncertainty in sensory data. Generally, to improve the reliability of the sensory data perception, data from separate sources are combined, namely by using sensor fusion. At the high level, the system uses the estimated information from the low level to recognize what is happening in the driving environment. At this level the system must be able to deal with the ambiguities of the scene. Although much effort has been devoted recently, traffic scene understanding remains challenging. One of the crucial difficulties is the complexity of the dynamic traffic scene, which usually involves the spatio-temporal dependencies of multiple objects. Most of the existing approaches can be divided into two groups: logical approaches and probabilistic approaches. In the logical approaches, common knowledge is used to model space

and time of situations. This approach was first introduced in [11]. Prior knowledge is formulated in logical languages such as propositional time logic [10], modal logic [1] or description logic [7] to achieve a logical representation of traffic situations. In another system, introduced in [5], a conceptual representation of elementary vehicle activities is formulated by fuzzy metric-temporal Horn logic. Despite the advantage in scene representation and reasoning, these logic-based approaches have difficulties handling the uncertainties and ambiguities of real situations. Recently there have been some approaches based on the probabilistic logic formalism introduced in [14]. In the work [6] Markov Logic is used for modeling and understanding the object relations in a traffic scene. However, these approaches are still not capable of capturing the temporal dependency of the scene. The other group of approaches is based on probabilistic models. In [4] Dynamic Bayesian Networks (DBNs) are chosen because of their capability to capture spatio-temporal dependencies. Whereas using DBNs for modeling traffic situations is intuitive and elegant, parameter learning and doing inference remain challenging. To overcome these problems some other approaches use specific structures of DBNs. Hidden Markov Models (see [13]) are special cases of DBNs with linear structures, which are widely used in many areas of pattern recognition. The main advantages of HMMs in comparison with DBNs are the efficient algorithms for learning model parameters as well as for doing inference. Some variants of HMMs like coupled HMMs and factorial HMMs have been applied successfully for modeling human activities, as in [2] and [8]. Our work is mainly motivated by the idea of using HMMs to model traffic maneuvers as in [12], where two laser scanners are used for data acquisition, which are not applicable for current ADAS. Furthermore, additional scenario information like road geometry and boundaries is not taken into account there. The main contribution of this paper is to show that the maneuvers of traffic participants can be practically modeled and robustly recognized with a platform using production-oriented sensors. This paper is organized as follows. In the next section, we review Hidden Markov Models with the important algorithms for parameter learning and probabilistic inference. Next, in section III we describe our system using HMMs as the framework for modeling and recognition of traffic situations. In section IV we present the experimental results evaluated in the real-world environment. The paper is finally concluded in section V.


Fig. 1. Example of an HMM structure. Left: Markov chain of hidden system states q_t and observations o_t. Right: Model structure with 3 system states and 2 (discrete) observation symbols v_1, v_2 (transition probabilities a_ij and emission probabilities b_i(v_j)).

II. THEORETICAL FRAMEWORK

Hidden Markov Models (HMMs) are one of the most popular methods for the probabilistic modeling of sequential data. HMMs have been used successfully in many areas such as speech, gesture or human activities. In [13] Rabiner gives an excellent tutorial on HMMs with applications in speech recognition. An HMM is specified by the following:
- The set of states S = {s_1, s_2, ..., s_N} defines the unobserved, or hidden, discrete state of the model. In an HMM, the sequence of hidden states is a Markov chain Q = (q_1, q_2, ..., q_T), in which the probability of each state depends only on the state immediately preceding it. The probabilities to go from state i to state j, a_ij = P(q_{t+1} = s_j | q_t = s_i), are summarized in the transition matrix A.
- The observation model or emission probabilities B consist of b_i(o) = p(o | s_i), the probabilities of an observation o if the model is in state s_i. For practical reasons, the observation distributions are continuous and are usually parameterized, e.g., by Mixtures of Gaussians (MoG).
- The initial state distribution π = {π_i}, where π_i = P(q_1 = s_i), gives the probabilities of s_i being the first state of the state sequence.
In summary, the parameter set of an HMM can be indicated by the compact notation λ = (A, B, π). An example of the structure of an HMM can be seen in Figure 1.
Using HMMs for modeling and recognition of traffic situations is concerned with two questions: evaluating the likelihood of an observation sequence with respect to an HMM, and learning the parameters of the HMMs. These two problems can be efficiently solved by standard algorithms, which are briefly described in the following subsections.

A. Recognition with HMMs

The first problem concerns the question: given a new observation sequence O = (o_1, o_2, ..., o_T) and a set of models {λ_1, λ_2, ..., λ_K}, which model explains the sequence best? To solve this problem, it is necessary to compute the likelihood of an observation sequence O given a model λ_i. This likelihood expands as follows:

p(O | λ_i) = Σ_{all possible Q} p(O, Q | λ_i)        (1)

It is the sum over every possible assignment of Q. Because q_t can take one of N possible values at each time step, evaluation of this sum requires O(N^T) operations. In practice p(O | λ) can be computed via a dynamic programming algorithm called the forward algorithm, described in [13]. In the forward algorithm, at every time step, the likelihood of the partial observation sequence that ends in state j is recursively computed as follows.
Initialization:

α_1(i) = a_{1,i} · b_i(o_1),   for i = 1, ..., N        (2)

Recursion:

α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) · a_ij ] · b_j(o_{t+1})        (3)

The likelihood of the whole sequence is finally calculated by:

p(O | λ) = Σ_{i=1}^{N} α_T(i)        (4)

The algorithm computes α_t(i) efficiently: at each time step only O(N^2) operations are needed, so computing the total probability of an observation of length T has complexity O(N^2 T).

B. Parameter Learning

The second problem of HMMs related to situation recognition is how to learn the model parameters. Given a set of observations {O_training}, the model parameters λ = (A, B, π) need to be adjusted so that they make the observation data most likely. In other words, we have to solve the following optimization problem:

λ* = argmax_λ P(O_training | λ)        (5)

This optimization problem can be solved efficiently by using the Baum-Welch algorithm, described in [13] in more detail. In this work we use the Baum-Welch algorithm to learn, for every situation HMM λ_i, the initial distribution π_i, the transition matrix A_i and the vector B_i of emission probability density functions, which are modeled by Gaussian mixtures. The Baum-Welch algorithm is a derivation of the Expectation Maximization algorithm for Hidden Markov Models, which uses an iterative numerical technique to maximize the likelihood function greedily. Like other applications of EM algorithms, parameter learning for HMMs is a non-convex problem with many local maxima. The Baum-Welch algorithm converges to a maximum depending on its initial parameters; in order to get stable and robust results, multiple runs of the algorithm are needed.
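As an illustration of the forward procedure of Eqs. (2)-(4), here is a minimal NumPy sketch that also applies the per-step scaling mentioned later in Section IV-B to avoid numerical underflow. It is a generic sketch, not the authors' implementation; the emission model is passed in as a function (e.g. evaluating a Mixture of Gaussians per state), and the initial distribution is assumed to be given as a vector.

```python
import numpy as np

def forward_log_likelihood(A, pi, obs_lik, O):
    """Scaled forward algorithm: returns log p(O | lambda).
    A: (N, N) transition matrix, pi: (N,) initial state distribution,
    obs_lik(o): returns the (N,) vector of emission likelihoods b_j(o)."""
    alpha = pi * obs_lik(O[0])              # initialization, Eq. (2)
    scale = alpha.sum()
    log_lik = np.log(scale)
    alpha /= scale                          # scaling keeps alpha numerically well-behaved
    for o in O[1:]:
        alpha = (alpha @ A) * obs_lik(o)    # recursion, Eq. (3)
        scale = alpha.sum()
        log_lik += np.log(scale)
        alpha /= scale
    return log_lik                          # log of Eq. (4)
```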


III. SYSTEM DESCRIPTION

In this section the system for maneuver recognition and prediction in traffic scenarios is described. It is based on the HMM approach introduced in the last section, to take advantage of its beneficial properties, such as the capability to handle dynamic, complex systems, uncertainties and ambiguities, the availability of effective training and evaluation algorithms, and the readability of the model. As a first step, relevant traffic maneuvers have to be identified, which may influence an improved traffic prediction.

Fig. 2. Semantic modeling of driving maneuvers using HMM and MoG. (Figure: a five-state model s_1, ..., s_5 with MoG observation densities.)

As this work focuses on extra-urban traffic scenarios, following, overtaking and flanking maneuvers are selected, which are the most frequently occurring interactions on country roads and highways. To develop a recognition system for these traffic maneuvers, it is necessary to identify characteristic quantities. Keeping in mind the available sensor setups of current ADAS, the information about relative position, velocity and acceleration of each traffic participant pair is appropriate. Each of the mentioned maneuvers is modeled by one HMM. The system states of the models thereby correspond to the different physical stages of the maneuvers. The number of states of every model is set to 5. For a given observation sequence O, recognition is performed by applying the forward procedure to compute the likelihood P(O | λ_i) for each model λ_i. The continuous observations are implemented by using a MoG observation model, as already mentioned in section II, since MoGs have good approximation capabilities for arbitrary probability distributions. The number of mixture components is set to 3, as a higher number of mixture components leads to an extremely increasing number of model parameters. Figure 2 presents the semantic modeling of driving maneuvers using Hidden Markov Models with Mixtures of Gaussians as observation models.
Although this approach works well in simple traffic environments, it shows some disadvantages in more complex real-world traffic scenarios. These are characterized, e.g., by different road types with different numbers of lanes, relations of more than two objects, and uncertainty about the possible free space for the maneuvers. In addition, the probability P(O | λ) of an observation sequence given a model is not the best quantity for making decisions; the probability P(λ | O) of a model given an observation sequence is. Hence, a more feasible approach was developed in this work. Using the basic equation of Bayes' theorem, the likelihood P(O | λ) can be rewritten, and the assumption of a constant observation probability P(O) yields:

P(λ_i | O) = P(O | λ_i) · P(λ_i) / P(O)  ∝  P(O | λ_i) · P(λ_i)        (6)

The a-priori probability P(λ_i) has an effect on the resulting likelihood and depends on the following factors:
- Type of road: The a-priori probabilities for different maneuvers depend on the type of road, e.g. overtaking maneuvers occur more often on highways than on country roads.
- Past recognized maneuvers: Individual driver behavior, like an aggressive or defensive way of driving, has an influence on oncoming maneuvers. Therefore previously recognized maneuvers are also taken into account.
- Influence of other vehicles: The intended maneuver does not only depend on the relative information between the two involved vehicles, but also on other vehicles in the traffic situation.
- Road geometry: Road information like boundaries or oncoming lanes also has to be taken into account.
As an example for the last two points, the overtaking maneuver of Figure 6 is considered. In stage 1 (before the first lane change) the probability of an overtaking maneuver is directly influenced by possible road boundaries on the left or by other vehicles on the left lane.

Fig. 3. System overview for traffic maneuver recognition using HMMs. (Figure: offline training provides the models; online, the current object list is fetched and, for all traffic participant pairs and all maneuvers, the prior is obtained, the forward procedure is applied and the likelihood is computed at every time step, feeding the maneuver prediction.)

Figure 3 shows the diagram of the maneuver recognition and prediction system. New traffic participant pairs are initialized every time a new vehicle enters the sensor field of view. For every participant pair the likelihood P(λ_i | O) of each maneuver λ_i is computed. This information is used for maneuver prediction. The dominating maneuver model is therefore identified by determining:

argmax_i P(λ_i | O).

Results of single maneuver prediction are presented in the next section.
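A minimal sketch of the decision rule of Eq. (6) together with the argmax above, computed in log space; the maneuver names and prior values are those used in the experiments of Section IV, and the likelihood routine (e.g. the scaled forward algorithm sketched after Section II) is passed in as a parameter. This is an illustration, not the evaluated implementation.

```python
import numpy as np

def classify_maneuver(models, priors, O, log_likelihood):
    """Select the maneuver HMM maximizing P(lambda_i | O) ~ P(O | lambda_i) * P(lambda_i).
    models: dict name -> model parameters; priors: dict name -> a-priori probability;
    log_likelihood(model, O): e.g. a scaled forward algorithm."""
    scores = {name: log_likelihood(model, O) + np.log(priors[name])
              for name, model in models.items()}
    return max(scores, key=scores.get), scores

# Example priors for a highway-like scenario (cf. Section IV-B):
priors = {"following": 0.9, "overtaking": 0.05, "flanking": 0.05}
```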


IV. EXPERIMENTAL RESULTS

In this section the implementation of the described system is presented and some results on real-world data for traffic maneuver recognition and prediction are given.

A. Scenario Setup

For data acquisition an experimental vehicle of the Adam Opel AG, an Opel Insignia (see Figure 4), is used, which is equipped with a sensor fusion system consisting of a forward-looking radar sensor (Continental ARS300) and a mono camera (Continental CSF200). With this sensor platform it is possible to detect, classify and track all objects in front of the car. The opening angles of the sensor setup are shown on the right side of Figure 4. Both sensors have complementary properties that make sensor fusion reasonable:
- Radar: Good position and velocity estimation of all detected objects. High detection range using far range scans.
- Camera: Good classification capability of detected objects; provides scenario information (lanes and road boundaries).

Fig. 5. Lane-fixed coordinate system (X, Y) and sensor-fixed coordinate system (X_S, Y_S) of the ego vehicle (red), with dist_x = X_1 - X_2 and dist_y = Y_1 - Y_2.

Fig. 6. Overtaking maneuver for real data evaluation.

After selecting the sensor platform, the next step is to choose an appropriate coordinate system. Whereas current approaches use vehicle- or sensor-fixed coordinates, our work uses a lane-fixed coordinate system to get more significant information, especially on winding streets (see Figure 5). One axis is directed along the road, which is normally not a straight line, and the other axis is orthogonal to it (the resulting coordinate system is not Cartesian). To transform the positions from the given sensor-fixed coordinate system (X_S, Y_S) to the lane coordinate system (X, Y), the geometry of the road has to be known. In this work a clothoid model is used (see [3]), where the parameters of the model are extracted from the lane marker and road boundary information given by the sensor fusion system (see [3]). After that, the observation vector for a traffic participant pair can be defined:

o = (dist_x, dist_y, vel, acc),        (7)

where dist_x and dist_y are the relative distances of the objects in X- and Y-direction (lane coordinate system), and vel and acc are the absolute values of the relative velocity and acceleration of the objects (see Figure 5). The reason for using only one-dimensional velocity and acceleration is the inaccurate lateral velocity resolution of radar sensors.
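A small sketch of assembling the observation vector of Eq. (7) for one traffic participant pair, assuming both objects' states are already expressed in the lane coordinate system (the clothoid-based transform of [3] is not reproduced here, and the field names are illustrative):

```python
import numpy as np

def observation_vector(obj1, obj2):
    """o = (dist_x, dist_y, vel, acc) for a traffic participant pair.
    obj1, obj2: dicts with lane-frame position 'x', 'y' and longitudinal
    velocity 'v' and acceleration 'a'."""
    dist_x = obj1["x"] - obj2["x"]
    dist_y = obj1["y"] - obj2["y"]
    vel = abs(obj1["v"] - obj2["v"])   # absolute value of the relative velocity
    acc = abs(obj1["a"] - obj2["a"])   # absolute value of the relative acceleration
    return np.array([dist_x, dist_y, vel, acc])
```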


Fig. 4. Experimental vehicle of the Adam Opel AG: the sensor system consists of a mono camera and a radar sensor performing a far and a short range scan. Left: Opel Insignia experimental vehicle. Right: Opening angles of the used sensor setup (ARS300 radar far and near range scans, CSF200 mono camera system; detection ranges of 200 m and 60 m).

The details of the algorithmic implementation in the demonstration vehicle regarding object detection, tracking and classification are presented in [9]. After defining the observation vector, two kinds of input data have to be generated to implement the proposed maneuver recognition system:
- Training data: data for model parameter estimation using the Baum-Welch algorithm.
- Evaluation data: data for maneuver recognition using the forward algorithm.
As HMMs have many model parameters, the number of training sequences has to be large enough. Because the acquisition of real data with the current sensor setup shown in Figure 4 is extremely time-consuming, data simulated with the tool IPG CarMaker is used for the training stage. CarMaker is a popular simulation tool for many applications in the automotive industry. When simulating traffic maneuvers with CarMaker, two main models are used: the vehicle model, which models all characteristic properties of the simulated cars, and the driver model, which includes driver-specific inaccuracies when following a predefined path. With a parametrized definition of traffic scenarios, CarMaker allows the user to rapidly build up a maneuver database (here consisting of 500 maneuvers). To evaluate the trained models and carry out prediction steps, the test scenario shown in Figure 7 was set up. With this scenario 60 overtaking maneuvers were recorded to evaluate the maneuver recognition system described in the last section and to show exemplarily the benefit for lane change prediction. Figure 6 shows the different stages of the maneuvers.

Fig. 7. Screenshots of the scenario setup for model evaluation with real data. From left to right: different stages of the overtaking maneuver.

B. Recognition Results

In the following, results of the maneuver recognition are presented. This is done by using real-world data as described in section IV-A, to guarantee the functionality of the system in real ADAS applications. After acquiring the data, relative dynamic information about the interacting vehicle pair is extracted according to (7). Maneuver recognition is done by applying the procedure shown in Figure 3 to classify the three maneuvers overtaking, following and flanking for the complete set of approx. 60 recorded sequences. The calculation of the likelihoods P(O | λ_i) is performed efficiently by applying the forward algorithm (see section II). Because the complexity of the algorithm is only O(N^2 T), the system has real-time capability. One main problem when computing the forward variables is underflow due to the multiplication of significantly small terms. Therefore in this work we use the scaling procedure presented in [13], which normalizes the forward variables at each time step. When applying this procedure the likelihood functions are no longer monotonically decreasing. In this highway-like test scenario, the a-priori probabilities for all maneuvers were set manually to P(following) = 0.9 and P(overtaking) = P(flanking) = 0.05, due to the general occurrence of these maneuvers in real traffic situations on highways. That means that only the dependency on the type of road (see section III) is taken into account in the a-priori probabilities. The overtaking maneuvers are detected correctly in all sequences, at the latest when the first lane change is performed. At the beginning of all sequences (stage 1 of Figure 6) a clear distinction between the maneuvers overtaking and following is hard to make or sometimes impossible. This corresponds to the general ambiguities of traffic maneuvers in real-world scenarios.

Fig. 8. Results of the maneuver recognition system for three different models (using the scaling procedure of [13]). (Figure: log P(λ_i | O) plotted over frames 0 to 600 for the overtaking, following and flanking models.)

Figure 8 presents exemplarily the results for the sequence shown in Figure 7, where the likelihoods P(λ_i | O) corresponding to the three HMMs are plotted logarithmically. At the beginning of the sequence, when vehicle number 1 is following vehicle number 2 (left image in Figure 7), the likelihood for the following maneuver is slightly above that of the overtaking maneuver. At frame 250 vehicle number 1 begins the overtaking maneuver with the first lane change (middle image in Figure 7): the likelihood of the following maneuver decreases and the likelihood of the flanking maneuver increases abruptly. At frame 520 the second lane change of vehicle number 1 is performed (right image in Figure 7), which is represented by the decreasing flanking probability.

C. Lane Change Prediction

The results of the maneuver recognition of the last section can be used in a wide field of ADAS applications. In the following we present a single driving maneuver prediction in the form of lane changes. The basic idea is to predict a lane change maneuver by recognizing an overtaking maneuver while the vehicle is still in stage 1 of Figure 6. Therefore we observe the ratio r = P(O | following) / P(O | overtaking). For example, an increasing relative velocity vel will lead to a rapidly decreasing likelihood P(O | following) and makes a lane change more probable to occur. If we set the critical ratio value to r = 1, lane changes are predicted when the probability for the following maneuver falls below that of the overtaking maneuver (see e.g. Figure 8, frame 250). To decrease the resulting false positive rate, the critical ratio has to be decreased too. In Table I the results of the lane change prediction described above are shown for the complete set of 60 recorded maneuvers.
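The lane-change trigger described above can be sketched as follows, with the ratio evaluated in log space to avoid the underflow issues mentioned in Section IV-B (a simplified illustration, not the evaluated implementation):

```python
import math

def predict_lane_change(log_lik_following, log_lik_overtaking, critical_ratio=1.0):
    """Predict an imminent lane change when r = P(O|following) / P(O|overtaking)
    drops below the critical ratio; lowering the ratio reduces false positives."""
    log_r = log_lik_following - log_lik_overtaking
    return log_r < math.log(critical_ratio)
```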


TABLE I
LANE CHANGE PREDICTION RESULTS BASED ON OVERTAKING MANEUVER RECOGNITION.

Time before lane change [s] | 2.0 | 1.8 | 1.6  | 1.4  | 1.2  | 1.0
correct predictions [%]     | 6.7 | 10  | 18.3 | 28.3 | 33.3 | 40
Time before lane change [s] | 0.8 | 0.6 | 0.4  | 0.2  | 0.0  | -0.2
correct predictions [%]     | 55  | 65  | 76.7 | 85   | 98.3 | 100

The predicted lane changes are listed for different times before the vehicle actually starts to perform the lane change (left tires of the vehicle pass the lane markers). The ground truth is thus determined manually by analyzing the given images of the camera system. It can be seen that at this point in time already 98.3% of the maneuvers are predicted correctly, and that 0.4 seconds before the true lane change happens approx. 80% of the maneuvers are predicted correctly.

V. CONCLUSION

In this paper we presented a probabilistic framework for vehicle maneuver recognition and prediction in real-world traffic scenarios. In particular, our system is based only on state-of-the-art sensor systems of current ADAS. Furthermore we considered different factors which have an influence on traffic situations. Our system provides robust and reliable results for the recognition of driving maneuvers in extra-urban scenarios. One application of the maneuver recognition system is lane change prediction, which can be used by subsequent ADAS functions, like lane departure warning. For future work we plan to study more detailed modeling of the prior information. We believe that, if the prior information is modeled more accurately, the recognition results will be improved. Moreover it will be possible to handle more different and complex scenarios. We would also like to use the result of the recognition stage for a path prediction stage.

REFERENCES
[1] B. Bennett, A.G. Cohn, F. Wolter, and M. Zakharyaschev. Multi-dimensional modal logic as a framework for spatio-temporal reasoning. Applied Intelligence, 17(3):239-251, 2002.
[2] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page 994, 1997.
[3] M. Darms, M. Komar, and S. Lueke. Map based road boundary estimation. In Intelligent Vehicles Symposium (IV), pages 609-614, 2010.
[4] T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(2):142-150, 1989.
[5] R. Gerber and H.H. Nagel. Representation of occurrences for road vehicle traffic. Artificial Intelligence, 172(4-5):351-391, 2008.
[6] I. Hensel, A. Bachmann, B. Hummel, and Q. Tran. Understanding object relations in traffic scenes. International Conference on Computer Vision Theory and Applications (VISAPP), pages 389-395, 2010.
[7] B. Hummel, W. Thiemann, and I. Lulcheva. Scene understanding of urban road intersections with description logic. Logic and Probability for Scene Interpretation, Dagstuhl Seminar Proceedings, Dagstuhl, Germany, 2008.
[8] N. Landwehr. Modeling interleaved hidden processes. In Proceedings of the 25th International Conference on Machine Learning, pages 520-527, 2008.
[9] S. Lüke, P. Rieth, and M. Darms. From brake assist to autonomous collision avoidance. International Federation of Automotive Engineering Societies (FISITA), 2008.

[10] Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems: Specification. Springer, 1992.
[11] J. McCarthy. Situations, actions, and causal laws. Technical report, Stanford Department of Computer Science, 1963.
[12] D. Meyer-Delius, C. Plagemann, and W. Burgard. Probabilistic situation recognition for vehicular traffic scenarios. In Robotics and Automation, 2009. ICRA'09. IEEE International Conference on, pages 459-464. IEEE, 2009.
[13] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[14] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1):107-136, 2006.


Approaching a Person in a Socially Acceptable Manner Using Expanding Random Trees


Jens Kessler, Andrea Scheidig, Horst-Michael Gross
Neuroinformatics and Cognitive Robotics Lab, Ilmenau University of Technology, Ilmenau, Germany
Abstract: In real-world scenarios for mobile robots, socially acceptable navigation is a key component to interact naturally with other persons. On the one hand this enables a robot to behave more human-like, and on the other hand it increases the user's acceptance of the robot as an interaction partner. As part of this research field, we present in this paper a strategy for approaching a person in a socially acceptable manner. To this end, we use the theory of personal space and present a method of modeling this space to enable a mobile robot to approach a person from the front. We use a standard Dynamic Window Approach (DWA) to control the robot motion and, since the personal space model cannot be used directly, a graph planner in configuration space to plan an optimal path by expanding the graph with the DWA's update rule. Additionally, we give a proof of concept with first preliminary experiments.
Index Terms: socially acceptable navigation, approaching strategy, expanding random trees, dynamic window approach


I. INTRODUCTION

In recent years, mobile robotics has been developing towards fields of application with direct interaction with persons. There are several prototypical systems that aim to help elderly people to improve cognitive abilities [1], to assist caregivers in hospitals [2, 3], to be an intelligent video-conferencing system [4], to guide people in supermarkets and home improvement stores [5, 6], or simply to improve well-being by providing an easy-to-use communication platform. All these scenarios have to consider persons interacting with the robot system. Psychologists and gerontologists showed in the 90s that technical devices are treated and observed as social beings, for example cars, television sets and computers [7, 8]. A robot system is also recognized as a social being and has to behave like one. One important part of the robot's behavior is socially acceptable navigation. Navigation commonly includes tasks like mapping, motion control, obstacle avoidance, localization and path planning. Socially acceptable navigation focuses on these tasks while keeping in mind that humans are within the operation area of the robot and that an extra treatment is needed. We are contributing to the ALIAS (Adaptable Ambient LIving ASsistant) project. ALIAS has the goal of developing a mobile robot system to interact with elderly users, monitor and provide cognitive assistance in daily life, and promote social inclusion by creating connections to people and events in the wider world [9].

A. The ALIAS robot and the navigation system

The ALIAS project provides a variety of services, like auto-collecting and searching the web for specific events, a calendar function to remind the user, and, most importantly, a

Fig. 1. The ALIAS robot, a SCITOS G5 platform of MetraLabs GmbH, with cameras, Kinect 3D sensor and laser range finder. It interacts with the user by touch-display and speech output.

service to communicate by e-mail, social networks and voice or video telephony, particularly adapted to the needs of the target group. All these tasks are provided by a mobile robot system (see Fig. 1). The benefit of a mobile system is the capability to move: the robot can be requested by the user and should autonomously drive to the user and approach him/her. Navigation has to be smooth and exact; therefore our motion control system is based on the Dynamic Window Approach [10]. Based on this approach, we present here how to approach a person with a known pose while considering the personal space of the interaction partner. This provides a more natural, polite and unobtrusive approaching behavior of the robot. The personal space itself is not appropriate to use directly inside the DWA, so we need to apply a planning strategy to find an optimal approaching behavior. The robot we use is a SCITOS G5 platform and is equipped with sonar-based and laser-based distance sensors, a high-resolution front camera, a Kinect 3D camera from Microsoft (see Fig. 1), and a dual-core PC. With these sensors, obstacle recognition and also person detection are done to provide the navigation system with all needed information. The navigation system is only a small part of the overall ALIAS system


TABLE I
PSYCHOLOGICAL DEFINITION OF THE PERSONAL SPACE AS DEFINED IN [11]. THIS SPACE CONSISTS OF 5 ZONES, EACH SUPPORTING DIFFERENT ACTIVITIES AND DIFFERENT COMMUNICATION INTENTIONS.

zone           | interval      | example situation
close intimate | 0.0m - 0.15m  | lover or close friend touching
intimate zone  | 0.15m - 0.45m | lover or close friend talking
personal zone  | 0.45m - 1.2m  | conversation between friends
social zone    | 1.2m - 3.6m   | conversation with a non-friend
public zone    | from 3.6m     | no private interaction

architecture, which also consists of the dialog controller, the person recognition and detection system, speech recognition and speaker identification, and a set of applications presented to the user, which are enhanced with various web services.

II. STATE OF THE ART

Psychologists have investigated human-to-human interaction in public areas very carefully since the 70s of the last century. One of the foundations and most recognized publications is the work of Hall [11], [12], who first introduced the concept of different spaces around a human being to support different modes of interaction. There is a space for non-interaction, public interaction, interaction with friends and also an intimate space for interaction with very close relatives (see Table I). Since the formulation of the theory that interaction is also coupled to spatial configurations between interaction partners, many investigations on this matter have taken place, and it could be shown that the configuration depends on many aspects like cultural background, age, sex, social status and the person's character [13, 14, 15, 16, 17, 18]. But is the personal space a valid description for human-robot interaction? As Reeves and Nass [8, 7] showed, complex technical devices are indeed seen as social beings and treated as such. So, we can assume that a robot with a person-like appearance is treated like a person. Additional proof is given by exhaustive experiments done within the COGNIRON project, where Wizard-of-Oz methods showed that a spatial configuration between robots and humans exists [19] and that this configuration also changes depending on the task of interaction (e.g. talking, handing over an object) [20], or on constraints like sex or experience with robots [21]. However, none of these works tried to autonomously approach a person in a socially acceptable manner. But the Wizard-of-Oz experiments could determine useful spatial parameters for autonomously approaching a person. Despite the thorough psychological background work, only few publications exist that describe an actual autonomous approaching behavior. Often a simple control policy is used, where a fuzzy controller [22], a PID controller [23, 24], or a similar technique is used to keep the robot at a certain distance to the person. The used distance thresholds or fuzzy rules are always hand-crafted and set by the designer without sufficient psychological justification. Some approaches can only approach a person from the front [23], since face detection is needed, and some simply do not consider the upper body orientation of the person and approach the person from any direction [22]. There are only a few works, more aware of the concept of personal space, which use this space to approach a person or drive around a person without intruding into the person's personal zone. For example Pacchierotti [25] uses an elliptical region

around a tracked person in a corridor to signal avoidance towards the person by changing the robot's driving lane at an early stage of approaching, where collision avoidance alone would not have suggested such a driving behavior. The distance of the lane change was tuned by hand, and the distance threshold for driving by was determined by evaluating a questionnaire. A hand-made approaching scenario was also presented by Hoeller [26], where different approaching regions were defined, each with a different priority. At least one of these regions had to be free from obstacles, and the region with the highest priority was the current target region. Hoeller also uses expanding random trees [26] to plan the next motion step in an optimal fashion. The work of Svenstrup and Andersen [27] models the personal space explicitly and without the need of any thresholds, so they could create a dense representation of the personal space and approach a person by using a potential field method. Although their results do not consider any obstacles and could get stuck in local minima, they were the first with an actual mathematical model of the personal space. Sisbot [28] investigates in his work other aspects of planning a path towards a person: the robot has to be visible, should not hide behind walls and also should not drive behind a person. He uses an adapted A* planner to derive a planned path but does not show how to include these results into the motion planning concept. Other authors do not consider the personal space, but also have the need to approach a walking person from the front to catch the customer's attention [29]. Here, the trajectory of the person is predicted, and a point on that trajectory is chosen as the goal, to give the robot enough time to turn towards that person and approach her from the front.

A. The Dynamic Window Approach

To move a robot, decisions must be taken on which action to execute next. Here, two parts are important. First, the robot has to know to which position it has to drive, and second, which trajectory it has to drive to reach a good position. As mentioned before, we use the Dynamic Window Approach [10] for motion planning and can therefore only support physically plausible paths towards the target. We can assume two things when deciding upon the next action: first, we can measure the robot's position and speed, and second, we know the current obstacle situation. The Dynamic Window Approach's key idea is to select a rectangular region of rotation and translation speeds around the current rotation and translation speed, and to decide which next speed pair is the best by evaluating different so-called objectives. Each objective focuses on one aspect of navigation like avoiding obstacles, heading towards the target, driving at a certain speed, and so on. The window's outer bounds are only based on physical constraints, like the robot's acceleration capabilities and maximum allowed speeds. The voting values of the objectives are summed up with weights, and the minimum vote of the current speed window is chosen to be the next valid action. Our goal is to design an objective for the DWA which uses a personal space model to approach a person. The model of the personal space is described in the next section. After that section we show how to include the model into the DWA's objective.
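To illustrate the Dynamic Window Approach described above, the following is a minimal sketch of one decision step. The objective functions, their weights and the sampling resolution are placeholders and not the actual ALIAS implementation; clamping to maximum speeds and the trajectory prediction inside the objectives are omitted for brevity.

```python
import numpy as np

def dwa_step(v, w, accel_limits, dt, objectives, weights, n_samples=11):
    """Evaluate a window of (translation, rotation) speed pairs around the current
    speeds (v, w) and return the pair with the minimal weighted objective vote."""
    dv, dw = accel_limits[0] * dt, accel_limits[1] * dt
    best, best_vote = (v, w), float("inf")
    for v_c in np.linspace(v - dv, v + dv, n_samples):
        for w_c in np.linspace(w - dw, w + dw, n_samples):
            # each objective votes on one aspect (obstacles, heading, speed, ...)
            vote = sum(wgt * obj(v_c, w_c, dt) for obj, wgt in zip(objectives, weights))
            if vote < best_vote:
                best, best_vote = (v_c, w_c), vote
    return best
```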


Fig. 2. Two regions of our personal space model. The front region is within a ±45° interval around the person's upper body direction (in red); the back region is the rest (in blue). Note that the regions are not limited in radial extension, as the illustration might suggest.

Fig. 3. Two example configurations for different approaching directions (φ = 40.1° and φ = 22.9°).

III. MODEL OF THE PERSONAL SPACE
As described in section II, the model of the personal space is the key component for approaching a person. Similar to the work of Dautenhahn [19], we also want the robot to approach a person from the front, but with a slight deviation from the direct front, since most users perceive such a behavior as more comfortable. For this purpose we obviously need the position and viewing direction of the person to calculate the configuration of the personal space model. The space configuration should enable the robot to drive around the person at a comfortable distance and to turn towards the person when a frontal position is reached. Like in [27], we model the personal space with a sum of Gaussians. The space relative to the person's upper body direction is separated into two regions: a front region, which is considered to be within ±45° around the person's upper body direction, and a back region, which is the rest (see Fig. 2). In both areas we define a distance function to keep the robot out of the user's personal zone but within his/her social zone while approaching the person. The function is defined relative to the person's upper body direction:

a(x, y) = α e^(−(x² + y²) / (2σ1²)) − β e^(−(x² + y²) / (2σ2²))    (1)

The variables α, β, σ1, σ2 describe a classical difference-of-Gaussians function and are set in our case (see Fig. 2) to α = 0.6, β = 0.3, σ1 = 2 m, σ2 = 7 m to form a minimum-cost region at a distance of 3.5 meters around the person. The front region is additionally treated with an intrusion function i(x, y). This is also a Gaussian function and is simply added to a(x, y):

i(x, y) = γ e^(−( x̃² / (2σx²) + ỹ² / (2σy²) )),   with   x̃ = x cos φ + y sin φ,   ỹ = −x sin φ + y cos φ    (2)

Here the variables σx and σy define an elliptical region that is rotated towards the required approaching direction φ, as seen from the person's perspective; the vector x is simply the column vector (x, y)ᵀ. The variables are set to γ = 0.5, σx = 2.9 and σy = 1.1. Only φ and x need to be set at runtime to regulate the approaching distance and direction. These parameters defining the form of the personal space could be obtained by investigating the familiarity of the user with robots, but for the sake of simplicity they have been chosen manually for our first trials. All other parameters are constant and are chosen to reflect the properties of the personal space definition in [11]. So, the final definition of the personal space p(x, y), relative to the person coordinates x = 0, y = 0 and an upper body pose towards the x-axis, is:

p(x, y) = a(x, y),                if (x, y) is in the back region
p(x, y) = a(x, y) + i(x, y),      if (x, y) is in the front region    (3)
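A minimal sketch of this cost model, using the parameter values quoted above, is shown below. The front/back test (±45° around the +x axis in the person-centered frame) and the rotation convention for the intrusion ellipse follow our reading of eqs. (1)-(3) and are assumptions, not the authors' code.

```python
import numpy as np

# Parameter values as quoted in the text (person-centered frame, person faces +x).
ALPHA, BETA, SIG1, SIG2 = 0.6, 0.3, 2.0, 7.0   # difference of Gaussians, eq. (1)
GAMMA, SIG_X, SIG_Y = 0.5, 2.9, 1.1            # intrusion Gaussian, eq. (2)

def a(x, y):
    r2 = x**2 + y**2
    return ALPHA * np.exp(-r2 / (2 * SIG1**2)) - BETA * np.exp(-r2 / (2 * SIG2**2))

def intrusion(x, y, phi):
    # coordinates rotated into the ellipse frame aligned with the approach direction phi
    xr = np.cos(phi) * x + np.sin(phi) * y
    yr = -np.sin(phi) * x + np.cos(phi) * y
    return GAMMA * np.exp(-(xr**2 / (2 * SIG_X**2) + yr**2 / (2 * SIG_Y**2)))

def personal_space(x, y, phi):
    # front region: within +/- 45 degrees of the upper-body direction (+x axis)
    in_front = abs(np.arctan2(y, x)) <= np.pi / 4
    return a(x, y) + (intrusion(x, y, phi) if in_front else 0.0)   # eq. (3)
```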

To compute the personal space in a real-world application, each point (x̃, ỹ)ᵀ has to be transformed into the person-centered coordinate system (x, y)ᵀ presented here. In our trials we use the given person's upper body pose, representing the most likely pose. Figure 3 shows an example of two configurations of the personal space with two different approaching directions.

IV. PLANNING WITH EXPANDING RANDOM TREES AND THE DYNAMIC WINDOW APPROACH
Up to this point, we have shown how the personal space can be described if the upper body pose of a person is known. We also stated that this space is used within an objective for the DWA. The basic idea of the DWA is to decide which next action is best in a locally optimal fashion. The local driving command is only valid for a certain Δt; then the next window configuration is evaluated. The model of the personal space could be used directly within a Dynamic Window objective: it is possible to predict, for every speed pair (V_rot, V_trans), the trajectory within the interval Δt and simply evaluate the value of the personal space at the end point of each trajectory. This is shown in Fig. 4. The minimal value results in the most supported driving decision. However, by using the personal space directly, multiple driving decisions may lead to the same minimal value, and a unique local optimum cannot be guaranteed. So we have to reformulate the search problem to guarantee a function with a unique local minimum and, by sequentially following the local minima, a function that leads towards the global minimum (i.e., the target position). It is known that planning algorithms can provide such functions. We choose a random tree planner [30] for two reasons.


First, classical planning approaches like A*, D* and E* are defined only on metric grid-based maps (and not in the configuration space the DWA is defined in) and have to explore a large area of the grid to finish the plan. The second reason is that random trees need to touch only a small area of the planning grid. Here, computation time can be saved by computing only the needed cells of the personal space grid and by covering only a sparse portion of the planning space. In the following sections we describe how the random tree graph is constructed and how it fits the Dynamic Window Approach. The basic idea is simply to use the globally optimal pose extracted from the personal space, and to use the mentioned planning algorithm to overcome local minima in the personal space while also finding a cost-optimal path to the globally optimal pose.

Fig. 4. No distinct speed decision is possible when the personal space model is used directly: several actions can lead to the same minimal value.

A. Expanding Random Trees
For planning purposes we use so-called expanding random trees [30]. These trees are used to generate a path towards a target pose, so one aspect of planning is to define the target pose to reach (see IV-C), the graph's state definition (this section), and the directed expansion of existing graph nodes towards the target (section IV-B). If the target is reached by one node of the graph, it is guaranteed that a cost-minimal path towards the target has been found. The benefit of an expanding random tree is that only a sampled set of possible actions is used to expand each node. This makes the tree efficient and still suitable for complex planning tasks. Formally, a random tree is quite simple: it consists of a set of nodes S = (s_1, s_2, ..., s_n), each representing a state s_i of the system. Our tree uses a five-dimensional state space consisting of the rotational speed V_rot of the robot, the translational speed V_trans, and the position and orientation of the robot x, y, θ. What makes this approach useful is the creation of successive states by using a random transition function tr(s_i) and the state update equation from the DWA. This function generates a set of next states by considering the current node's state s_i and applying a set of random actions to that state (see Fig. 5 b)). This process is also called the expansion of a node. As transition function we use a motion model for a differential drive robot with left and right wheel speeds. Given a pair of these speeds, we can create the trajectory for a given time interval Δt. Since translational and rotational speed are convertible to left and right wheel speeds, we can sample a set of speed pairs from a virtual dynamic window centered at the current speed state of the given node (see Fig. 5 b)).

B. Expanding the graph
To expand the graph, the method of A* [31] is used. A* uses a heuristic to implement a directed search (unlike other planners such as E* or Dijkstra) and can significantly speed up the search for the optimal path. Each node of the planning graph also carries a cost value c_i, which is incremented from its parent's cost c_{i-1} by the real cost of travelling from node s_{i-1} to node s_i (denoted by the cost function C(s_{i-1}, s_i)) and the heuristic for node s_i. So, a cost update is:

c_i = c_{i-1} + C(s_{i-1}, s_i) + h(s_i)    (4)

The travelling cost function C(s_{i-1}, s_i) is described in more detail in section IV-D. The heuristic is quite simple: we use the 5D Euclidean distance from the node's state vector to the minimum cell (x̃, ỹ) of p_min(x̃, ỹ), with target speeds V_rot = 0 and V_trans = 0. All nodes with updated costs are put on the list of active nodes. From that list, the node with the lowest cost c_i is selected, expanded and removed from the list of active nodes. If a node reaches the target cell with the correct speed and viewing direction, the planning task is complete. The graph is initialized by using the current configuration of the dynamic window: the root node is the current robot position, viewing direction and rotational and translational speed. The dynamic window is used to give a fully specified set of next actions, which are applied to that node, and the graph expands. All subsequent nodes are expanded by using only a sampled subset of the corresponding dynamic window, valid only for each node (see Fig. 5 c)). Then the sequence of best motion actions is applied to the robot's driving system. The deviation from the best path is measured, and if the difference rises above a threshold, complete replanning is done. The same is done when the person changes his/her position too much.

C. Extracting the target region
To navigate with the Dynamic Window, we use local occupancy maps to represent the obstacle situation around the robot. In this grid representation, we also have to rasterize the personal space values p(x̃, ỹ) in order to merge the costs of the personal space with the costs of obstacles and create an optimal path. Every planning algorithm has to know the target state the system has to drive to. Since we have a rasterized personal space, we can easily extract the minimum value p_min(x̃, ỹ). For a grid-based planner this would be sufficient as the target region, but for expanding random trees it is very unlikely to hit exactly that single cell. So the target region has to grow to increase the probability of hitting a target cell. We do so by adding a value ε to the minimum, and each cell with a value below p_min(x̃, ỹ) + ε is called a target cell (see Fig. 5 a)). For each target cell we also store the required orientation of the robot towards the person. Planning is complete when the first lattice of the graph hits a target cell, the speed of the robot at that cell is near zero, and the viewing direction of the robot is nearly towards the person to approach.
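A sketch of this target-region extraction is given below. It assumes the rasterized personal space is a numpy array, that ε is a hand-chosen threshold, and that the stored orientation is simply the heading towards the person; none of these details are specified in the text.

```python
import numpy as np

def extract_target_region(ps_grid, cell_size, origin, person_xy, eps=0.05):
    """Return target cells of a rasterized personal space (section IV-C sketch).

    ps_grid:   2-D array of personal-space costs p(x, y) on the local map.
    cell_size: edge length of a grid cell in metres.
    origin:    world coordinates of cell (0, 0); axis convention is assumed.
    eps:       threshold added to the minimum value (assumed, tuned by hand).
    """
    p_min = ps_grid.min()
    targets = []
    for ix, iy in zip(*np.where(ps_grid <= p_min + eps)):
        x = origin[0] + ix * cell_size
        y = origin[1] + iy * cell_size
        # orientation the robot should have at this cell: facing the person
        theta = np.arctan2(person_xy[1] - y, person_xy[0] - x)
        targets.append((ix, iy, theta))
    return targets
```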



Fig. 5. Different stages of creating the expanding random tree. In a) the root node at the current robot state is created and expanded with the actions from the full dynamic window; the target region is also defined, and each node gets new state variables. In b) a new individual dynamic window is constructed in each node to define a set of new possible trajectories. In c) only a subset of these trajectories is used per node to expand the graph towards the target region.

TABLE II. The variance of the end position of the robot vs. the variance of the person's upper body position.

Robot       scen. 1(I)    scen. 1(II)   scen. 2(I)    scen. 2(II)
σ_pers      (0.3, 0.1)    (0.2, 0.1)    (0.2, 0.1)    (0.2, 0.2)
σ_rob       (0.3, 0.1)    (0.4, 0.2)    (0.2, 0.2)    (0.2, 0.4)

Fig. 6. To estimate the personal space costs, the trajectory is rasterized and the personal space costs along the trajectory are summed up.

D. The cost function
The last piece (and core component) needed to understand the graph structure is the calculation of the travelling cost function C(s_{i-1}, s_i) from one node to the next. It consists of two components: one is the cost component from the personal space, and the second is the travelling time. In a differential drive system the robot can only drive straight lines or piecewise defined circles. The radius of a circle is simply V_trans / V_rot; so, when V_rot approaches zero, the radius becomes infinitely large. Given V_rot and the prediction time interval Δt, one can easily calculate the rotation angle (V_rot · Δt) and the length of the line segment l_ij. To compute the travelling time t_{si,sj} we calculate t_{si,sj} = l_ij / V_trans. The costs of the personal space are harder to calculate. We use the rasterization of the trajectory and sum up all costs on the rasterized trajectory (see Fig. 6). For each cell (x_n, y_n) which is part of the trajectory, the costs k(s_i, s_j) are:

k(s_i, s_j) = Σ_n p(x_n, y_n),   for all (x_n, y_n) ∈ traj(s_i, s_j)    (5)

If the trajectory hits an obstacle, the travelling costs are set to infinity. The resulting cost is the sum of both values: C(s_{i-1}, s_i) = t(s_{i-1}, s_i) + k(s_{i-1}, s_i).
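A minimal sketch of this edge cost is shown below. It assumes the trajectory has already been rasterized into grid cells and that the grids are numpy arrays; these interface details are not given in the text.

```python
import math

def travel_cost(v_trans, v_rot, dt, traj_cells, ps_grid, occ_grid):
    """Cost of one graph edge: travel time plus summed personal-space costs (eq. 5).

    traj_cells: (ix, iy) cells obtained by rasterizing the predicted straight or
                circular-arc trajectory for the interval dt.
    ps_grid:    rasterized personal space p(x, y).
    occ_grid:   boolean occupancy grid; True marks an obstacle cell.
    """
    # arc length of the piecewise circular trajectory of a differential drive
    if abs(v_rot) < 1e-6:
        length = abs(v_trans) * dt            # straight line
    else:
        radius = v_trans / v_rot
        length = abs(radius * (v_rot * dt))   # circular arc of angle v_rot * dt
    t_travel = length / abs(v_trans) if abs(v_trans) > 1e-6 else dt

    k = 0.0
    for ix, iy in traj_cells:
        if occ_grid[ix, iy]:
            return math.inf                   # trajectory hits an obstacle
        k += ps_grid[ix, iy]
    return t_travel + k
```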

V. EXPERIMENTS
A problem when approaching a person is the estimation of the person's position and the associated measurement noise. We plan to detect the upper body pose by fusing two standard tracking methods, namely the leg-pair detector of [32] using the laser range scanner and the OpenNI full-body pose tracker using the Kinect. To test the stability and robustness of the approach against this noise, we investigated two scenarios, one in a narrow space and one in a large room of our lab. We use a simulator to avoid the problems of person detection and to control the (simulated) measurement noise of the person's and robot's poses. First tests also showed that the approach runs well on the real robot, but there one has to face the challenging task of upper body pose estimation. To investigate the stability of the approaching behavior, the positions of the person and of the robot were chosen randomly within a circle around a marked position. The robot and the person should each face towards a given direction. For each of the two locations, we defined two person positions with different viewing angles and performed ten runs for each position. So, we had a set of four trials with a total of 40 single runs. The variance of the final positions of the robot and the variance of the person positions are shown in Table II. From the experimental setup we have uncertainties of 0.1 to 0.3 meters in the person position. The question to be answered by our experiments is how the uncertainty of the target position of the robot increases when approaching a person. To answer it, we record the trajectory of the robot and calculate the mean and standard deviation of the final robot poses. The results are also shown in Table II. The average distance from the person is 0.7 meters, and the variance is usually within the same magnitude as the variance of the person's pose. In two cases, the variance in one direction is increased by 0.2 m, which is a result of the target region calculation with its simple threshold heuristic.


Figure 7 shows the paths and the mean person position with variance for all four test cases. The quality of the trajectories also gives an impression of the stability of the method. Scenario 2 shows how the upper body pose heavily influences the trajectory of the robot. Scenario 1 shows that in narrow spaces the trajectory has to follow the physical restrictions and only the upper body pose is considered; the personal space has to be intruded if there is no other option.

REFERENCES
[1] C. Pastor et al, Affective robotics for assisting elderly people, in Conf. Proc. Assistive Technology From Adapted Equipment to Inclusive Environments: AAATE 2009, Florence, Italy, 2009, pp. 153258. [2] M.E. Pollack et al, Pearl: A mobile robotic assistant for the elderly, in AAAI Workshop on Automation as Eldercare, Munich, Germany, 2002. [3] F. Weisshardt et al, Making high-tech service robot platforms available, in Joint Int. Conf. ISR/ROBOTIK2010, 2010, pp. 11151120. [4] Ch. Schroeter, M. Hoechemer, St. Mueller, and H.-M. Gross, Autonomous robot cameraman - observation pose optimization for a mobile service robot in indoor living space, in Proc. ICRA, Kobe, Japan, 2009, pp. 424429. [5] T. Kanda, M. Shiomi, Z. Miyashita, H. Ishiguro, and N. Hagita, A communication robot in a shopping mall, IEEE Transactions on Robotics, vol. 26, no. 5, pp. 897913, 2010. [6] H.-M. Gross, H.-J. Boehme, Ch. Schroeter, St. Mueller, A. Koenig, E. Einhorn, Ch. Martin, M. Merten, and A. Bley, Toomas: Interactive shopping guide robots in everyday use - nal implementation and experiences from long-term eld trials, in Proc. IROS, St. Louis, 2009, pp. 20052012. [7] C. Nass, J. Steuer, and E.R. Tauber, Computers are social actors, in Proc. the Conference on Human Factors in Computing, 1994, pp. 7278. [8] B. Reeves and C. Nass, The Media Equation: How People Treat Computers, Television, and New Medial Like Real People and Places, CSLI Press, Stanford, 1996. [9] F. Walhoff and E. Bourginion, Alias project description, ALIAS home page, http://www.aal-alias.eu/content/project-overview. [10] D. Fox, W. Burgard, and S. Thrun, The dynamic window approach to collision avoidance, IEEE Robotics and Automation Magazine, vol. 4, no. 1, pp. 2333, 1997. [11] E.T. Hall, The hidden dimension, Doubleday, NY, 1966. [12] E.T. Hall, Proxemics, Current Anthropology, vol. 9, no. 2, pp. 83+, 1968. [13] D.L. Gillespie and A. Lefer, Theories of nonverbal behaviour: A critical review of proxemics research, Sociological Theory, vol. 1, pp. 120154, 1983. [14] A. Lefer, D.L. Gillespie, and J.C. Conaty, The effects of status differentiation on nonverbal behaviour, Social Psychology Quaterly, vol. 45, pp. 153161, 1982. [15] H.W. Smith, Territorial spacing on a beach revisited: A cross-national exploration, Social Psychology Quaterly, vol. 44, pp. 132137, 1981. [16] C. Albas and D. Albas, Aligning actions: The case of subcultural proxemics, Canadian Ethnic Studies, vol. 21, pp. 7482, 1989. [17] K.A. Krail and G. Leventhal, The sex variable in the intrusion of personal space, Sociometry, vol. 39, pp. 170173, 1976. [18] J.L. Williams, Personal space and its relation to extraversionintroversion, Canadian Journal of Behavioural Science, vol. 3, pp. 156160, 1971. [19] K. Dautenhahn et al, How may i serve you? a robot companion approaching a seated person in a helping context, in Proc. HRI, 2006, pp. 172179. [20] K. Koay et al, Exploratory study of a robot approaching a person in the context of handing over an object, in AAAI Spring Symposia, 2007. [21] L Takayama and C. Pantofaru, Inuences on proxemic behaviours in human-robot interaction, in Proc. IROS, 2009, pp. 54955502. [22] Ch. Hu, X. Ma, and X. Dai, Reliable person following approach for mobile robot in indoor environment, in Proc. 8th IEEE Int. Conf. on Machine Learning and Mechatronics, 2009, pp. 18151821. [23] Z. Chen and S.T. Bircheld, Person following with a mobile robot using binocular feature-based tracking, in Proc. IROS, 2007, pp. 815820. [24] X. Ma, C. Hu, X. Dai, and K. 
Qian, Sensor integration for person tracking and following with mobile robot, in Proc. IROS, 2008, pp. 32543259. [25] E. Pacchierotti, H.I. Christensen, and P. Jensfelt, Evaluation of passing distance for social robots, in Proc. RO-MAN, 2006. [26] F. Hoeller, D. Schulz, M. Moors, and F.E. Schneider, Accompanying persons with a mobile robot using motion prediction and probabilistic roadmaps, in Proc. IROS, 2007, p. 1260 1265. [27] M. Svenstrup, S. Tranberg, H.J Andersen, and T. Bak, Pose estimation and adaptive robot behaviour for human-robot interaction, in Proc. ICRA, 2009, p. 3571 3576. [28] E.A. Sisbot, Towards Human-Aware Robot Motions, PHD Thesis, Univ. of Toulouse, Toulouse, 2006. [29] S. Satake et al, How to approach humans?- strategies for social robots to initiate interaction, in Proc. Of HRI, 2009, pp. 109116. [30] J.M Phillips, L.E. Kavraki, and N. Bedrossian, Spacecraft rendezvous and docking with real-time, randomized optimization, in AIAA Guidance,Navigation, and Control, 2003. [31] E.P. Hart, N.J Nilsson, and B. Raphael, A formal basis for the heuristic determination of minimum cost paths, IEEE Transactions on Systems, Science and Cybernetics, vol. 4, pp. 100107, 1968. [32] K. Arras, O.M. Mozos, and W. Burgard, Using boosted features for the detection of people in 2d range data, in Proc. ICRA, 2007.


Fig. 7. Resulting trajectories of the two tested scenarios. For each scenario, two different poses of the user are evaluated (I and II). The mean positions of the user are shown as black dots, the mean upper body poses as arrows. In each scenario the blue lines denote the robot trajectories corresponding to the first person setup, while the red lines show the trajectories of the second setup. Red circles denote the mean starting position of the robot. Both scenarios show how the upper body pose influences the approaching trajectory. Scenario 2 also shows that the social zone is respected if there is room to navigate.

VI. CONCLUSIONS
In this paper we presented a method, working within the Dynamic Window Approach, to approach a person while considering his/her personal space. We could demonstrate that, by using a planning strategy, a stable and reliable solution can be achieved. Nevertheless, the method of extracting the target region could be improved in future work. We also want to include obstacles in the personal space model to improve the planning quality, and to focus on the task of real-time replanning when the person changes his/her pose while the robot approaches. And, of course, we plan to couple the planning approach with data from a real person tracker to show these results at the conference.

ACKNOWLEDGMENT
This work was financed by the project AAL-2009-2-049 Adaptable Ambient Living Assistant (ALIAS), co-funded by the European Commission and the Federal Ministry of Education and Research (BMBF) in the Ambient Assisted Living (AAL) program.


Speech to Head Gesture Mapping in Multimodal Human-Robot Interaction


Amir Aly and Adriana Tapus Cognitive Robotics Lab, ENSTA-ParisTech, France {amir.aly, adriana.tapus}@ensta-paristech.fr

Abstract: In human-human interaction, para-verbal and non-verbal communication are naturally aligned and synchronized. The difficulty encountered during the coordination between speech and head gestures concerns the conveyed meaning, the way of performing the gesture with respect to speech characteristics, their relative temporal arrangement, and their coordinated organization in the phrasal structure of the utterance. In this research, we focus on the mechanism of mapping head gestures and speech prosodic characteristics in natural human-robot interaction. Prosody patterns and head gestures are aligned separately as a parallel multi-stream HMM model. The mapping between speech and head gestures is based on Coupled Hidden Markov Models (CHMMs), which can be seen as a collection of HMMs, one for the video stream and one for the audio stream. Experimental results with the Nao robot are reported.
Index Terms: Coupled HMM, audio-video signal synchronization, signal mapping

I. INTRODUCTION
Robots are more and more present in our daily lives, and the new trend is to make them behave more naturally so as to obtain an appropriate social behavior and response. The work described in this paper presents a new methodology that allows the robot to automatically adapt its head gestural behavior to the user's profile (e.g., the user's prosodic patterns) and therefore to produce a personalizable interaction. This work is based on findings in the linguistic literature showing that head movements (e.g., nodding, the turn-taking system) support the verbal stream. Moreover, in human-human communication, prosody expresses the rhythm and intonation of speech and reflects various features of the speaker. These two communication modalities are strongly linked together and synchronized. Humans use gestures and postures as a communicative act. McNeill [1] defines a gesture as a movement of the body synchronized with the flow of speech. The mechanism of the natural human alignment of verbal and non-verbal characteristic patterns, based on the work described in [2], shows a direct relationship between prosody features and gestures/postures, and constitutes an inspiration for our work. Recently, there has been a growth of interest in socially intelligent robotic technologies featuring flexible and customizable behaviors. Based on the literature in linguistics and psychology, which suggests that prosody and gestural kinematics are synchronous and therefore strongly linked together, we posit that it is important to have a robot behavior that integrates this element.

Therefore, in this paper, we describe a new methodology for speech prosody and head gesture mapping for human-robot social interaction. The gesture/prosody modeled patterns are aligned separately as a parallel multi-stream HMM model, and the mapping between speech and head gestures is based on Coupled Hidden Markov Models (CHMMs). A specific gestural behavior is estimated according to the prosody of the incoming voice signal of the human interacting with the robot. This permits the robot to adapt its behavior to the user's profile (e.g., here, the user's prosodic patterns) and therefore to produce a personalizable interaction. To the best of our knowledge, very little research has been dedicated to this area. An attempt is described by the authors of [3], who present a robotic system that uses dance to explore the properties of rhythmic movement in general social interaction. Most of the existing works are related to computer graphics and interactive techniques. A general correlation between head gestures and voice prosody has been discussed in [4], [5]. The emotional content of speech can also be correlated to some bodily gestures. The relation between voice prosody and hand gestures is discussed in [6], while [7] discusses the relation between the verbal and semantic content and the gesture. The work in [8], which is somewhat close to the topic discussed in this research, presents the relation between prosody changes and the orientation of the head (Euler angles). Moreover, the authors of [9] proposed a mechanism for driving head gestures from speech prosody. Our work presents a framework for head gesture and prosody correlation for automatic robot gesture production from the speech of the interacting human user. The system is validated with the Nao robot in order to find out how naturalistic the head gestures driven from a test voice signal are with respect to an interacting human speaker. The rest of the paper is organized as follows: section II presents the applied algorithm for extracting the pitch contour of a voice signal; section III illustrates the detection of head poses and Euler angles; section IV describes speech and gesture temporal segmentation; section V presents the speech-to-head-gesture coupling using CHMMs; section VI summarizes the obtained results; and finally, section VII concludes the paper.

II. PROSODIC FEATURES EXTRACTION
In human-robot interaction applications, the human voice signal can convey many messages and meanings, which should be understood appropriately by the robot in order to interact properly. Next, we describe the methodology used for pitch extraction.

Fig. 2: Detecting the face rectangle that contains all salient points

Fig. 1: Pitch tracking.

Talkin [10] defined pitch as the auditory percept of tone, which is not directly measurable from a signal; moreover, it is a nonlinear function of the signal's spectral and temporal energy distribution. Instead, another vocal characteristic, the fundamental frequency F0, is measured, as it correlates well with the perceived pitch. Voice processing systems that estimate the fundamental frequency F0 often have 3 common processes: (1) signal conditioning, (2) candidate period estimation, and (3) post-processing. The signal pre-conditioning process is concerned with removing interfering signal components like noise and DC offset, while the post-processing process chooses the most likely candidate period in order to precisely estimate the fundamental frequency F0. Talkin [10] developed the traditional normalized cross-correlation (NCC) method further in order to reliably estimate the voicing periods and the fundamental frequency F0 by considering all candidates simultaneously in a large temporal context. This methodology uses a two-pass normalized cross-correlation calculation to search for the fundamental frequency F0, which reduces the overall computational load with respect to the traditional NCC method. The procedure of the algorithm is illustrated in Figure 1. In this work, we choose to express the characterizing vector of the voice signal in terms of the pitch and the intensity of the signal.
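The sketch below illustrates only the basic single-pass NCC candidate search behind such F0 estimators; it is not the two-pass algorithm of [10], and the frame length, lag range and voicing decision are assumed for illustration.

```python
import numpy as np

def estimate_f0_ncc(frame, fs, f0_min=60.0, f0_max=400.0):
    """Single-frame F0 estimate from the normalized cross-correlation (NCC).

    frame: 1-D array of samples (assumed already conditioned, e.g. DC removed).
    Returns (f0_hz, ncc_peak); a low peak value suggests an unvoiced frame.
    """
    lag_min = int(fs / f0_max)
    lag_max = int(fs / f0_min)
    n = len(frame) - lag_max                  # correlation window length
    ref = frame[:n]
    ref_energy = np.sqrt(np.dot(ref, ref)) + 1e-12
    best_lag, best_ncc = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        seg = frame[lag:lag + n]
        ncc = np.dot(ref, seg) / (ref_energy * (np.sqrt(np.dot(seg, seg)) + 1e-12))
        if ncc > best_ncc:
            best_lag, best_ncc = lag, ncc
    return fs / best_lag, best_ncc
```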

III. HEAD POSE ESTIMATION
During social human-robot interaction, robots should be able to estimate the human head pose. This can help the robot to understand the human focus of attention and/or the meaning of the spoken message. The authors of [11] present a survey of the different existing algorithms for head pose estimation and detection. Our methodology for the detection of head poses is the Viola and Jones algorithm [12]. After extracting the head region, the eyes are detected by the valley-point detection algorithm [13]. Once the location of the eyes has been detected, it is possible to detect the location of the other salient points of the face using the geometry of the face [14]. For example, if the distance between the two eye points (1 & 3) equals D (see Figure 2), and point 2 is the midpoint between the eyes, then the mouth point 4 is located at a distance of 1.1D downwards from point 2. The X-Y coordinates of the rectangle surrounding the salient points of the face (points 5, 6, 7, and 8) (see Figure 2) can be specified as follows:
- the difference between the Y-coordinates of points (5 & 1, or 3 & 7) is 0.2 · 1.8D;
- the difference between the X-coordinates of points (5 & 1, or 3 & 7) is 0.225 · 1.8D.
After calculating the coordinates of points (5, 7), the coordinates of points (6, 8) are directly calculated based on the vertical distance between points (7 & 8, or 5 & 6), which is equal to 1.8D. One of the problems that may appear when detecting the surrounding rectangle of the facial salient points is the rotation of the head clockwise or counterclockwise (see Figure 3). Therefore, the (X, Y) coordinates of the eyes have to be rotated first to (X', Y') before following the previous steps in order to locate the points of the surrounding rectangle, because the above-mentioned relations are only valid when the eye coordinates are in the same plane as the face (i.e., if the face is rotated, the coordinates of the eyes also have to be located in the rotated plane). The direction of rotation is detected by calculating the slope (i.e., the rotation angle θ) of the line passing through the two eyes, using their (X, Y) coordinates. The rotation of the axes is described by the following equations:

X' = X cos θ − Y sin θ    (1)


Y' = X sin θ + Y cos θ    (2)

Fig. 3: Tracking salient face details in different cases of rotation: (a) rotation clockwise; (b) rotation counterclockwise.

After calculating the coordinates of a face salient point in the rotated (X', Y') plane while the head is rotating clockwise or counterclockwise, it is important to apply the inverse rotation each time to detect the location of this point in the (X, Y) plane. The pose of the head is calculated in terms of Euler angles: Pitch, Yaw, and Roll. These angles are calculated using the previously detected 8 salient facial points and their relative positions, based on the geometry and symmetry of faces (Figure 2), as follows:

Yaw_i = b1 [ ((P2x,i − C1x,i) + (P2x,i − C2x,i)) / (2 D_eyes,0) − ((P2x,0 − C1x,0) + (P2x,0 − C2x,0)) / (2 D_eyes,0) ]
      + b2 [ ((P4x,i − C1x,i) + (P4x,i − C2x,i)) / (2 D_eyes,0) − ((P4x,0 − C1x,0) + (P4x,0 − C2x,0)) / (2 D_eyes,0) ]    (3)

Pitch_i = b3 [ ((P2y,i − C3y,i) + (P2y,i − C4y,i)) / (2 D_eyes,0) − ((P2y,0 − C3y,0) + (P2y,0 − C4y,0)) / (2 D_eyes,0) ]
        + b4 [ ((P4y,i − C3y,i) + (P4y,i − C4y,i)) / (2 D_eyes,0) − ((P4y,0 − C3y,0) + (P4y,0 − C4y,0)) / (2 D_eyes,0) ]    (4)

where:
P2x,i, P4x,i: the x coordinates of the midpoint between the eyes and of the mouth point, respectively (see Figure 2), in frame i of the video;
P2y,i, P4y,i: the y coordinates of the midpoint between the eyes and of the mouth point, respectively (see Figure 2), in frame i of the video;
P2x,0, P4x,0: the x coordinates of the midpoint between the eyes and of the mouth point, respectively (see Figure 2), in frame 0, which is the reference (first) frame of the video;
P2y,0, P4y,0: the y coordinates of the midpoint between the eyes and of the mouth point, respectively (see Figure 2), in frame 0, which is the reference (first) frame of the video;
C1x,i: the x coordinate of the center point between point 5 and point 6 (Figure 2), in frame i;
C2x,i: the x coordinate of the center point between point 7 and point 8 (Figure 2), in frame i;
C3y,i: the y coordinate of the center point between point 5 and point 7 (Figure 2), in frame i;
C4y,i: the y coordinate of the center point between point 6 and point 8 (Figure 2), in frame i.

TABLE I: Yaw and Pitch initial angles (frames 1-3) used for the calculation of the regression values.

         Frame 1 (reference)   Frame 2    Frame 3
Yaw      0                     0.0016     0.0034
Pitch    0                     0.00255    0.0075

The regression values b1, b2, b3, and b4 are constants throughout all the video frames. They are calculated by fixing the absolute values of the Yaw and Pitch angles in the second and third frames (according to empirical tests), as shown in Table I. The substitution of the second and third values of Pitch and Yaw in equations 3 and 4 leads directly to the computation of the values of the constants b1, b2, b3, and b4. The calculation of the Roll angle is straightforward; it depends on the coordinates of the midpoint between the eyes (point 2) in frame i with respect to the reference frame [15], and it is clear that the value of the Roll angle in the first (reference) frame equals 0.

Roll_i = tan⁻¹(P2y,i / P2x,i) − tan⁻¹(P2y,0 / P2x,0)    (5)
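A minimal sketch of these angle computations is given below; the variable names, the packaging of the bracketed terms and the use of atan2 are assumptions made for illustration, not the authors' code.

```python
import math

def roll_angle(p2_i, p2_0):
    # Eq. (5): roll of frame i relative to the reference frame, from the eye
    # midpoint (atan2 replaces the plain arctangent of the ratio for robustness).
    return math.atan2(p2_i[1], p2_i[0]) - math.atan2(p2_0[1], p2_0[0])

def norm_offset(p_i, c1_i, c2_i, p_0, c1_0, c2_0, d_eyes_0):
    # One bracketed term of eqs. (3)/(4): offset of a facial point from the two
    # face-rectangle centre points, normalized by 2*D_eyes,0, relative to frame 0.
    cur = ((p_i - c1_i) + (p_i - c2_i)) / (2.0 * d_eyes_0)
    ref = ((p_0 - c1_0) + (p_0 - c2_0)) / (2.0 * d_eyes_0)
    return cur - ref

def yaw_angle(b1, b2, eye_term_args, mouth_term_args):
    # Eq. (3): yaw as a weighted sum of the eye-midpoint and mouth-point terms;
    # pitch (eq. 4) is computed analogously with b3, b4 and the y coordinates.
    return b1 * norm_offset(*eye_term_args) + b2 * norm_offset(*mouth_term_args)
```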

IV. SPEECH AND HEAD GESTURE SEGMENTATION
The mapping between speech and head gestures is done by using Coupled Hidden Markov Models (CHMMs), which can be seen as a collection of HMMs, one for the video stream and one for the audio stream. The advantage of this model over many other topologies is its ability to capture the dual influence of each stream on the other across time (see Figure 6). In the beginning, the speech and head gesture streams are aligned separately as a parallel multi-stream HMM model. The mapping between speech and head gestures is performed in 2 main steps: (1) the first is modeling the gesture sequence and the associated voice prosody sequence (in terms of their characteristic vectors) as two separate HMMs; (2) then, after training both models, a correlation between the two HMMs is necessary so as to estimate a final head gesture state sequence given a speech test signal. The HMM structure used for analyzing gestures (and similarly voice prosody) is indicated in Figure 4. It is composed of N parallel states, where each one represents a gesture composed of M observations. The goal of the transition between states S_END and S_START is to continue the transition between states from 1 to N (e.g., after performing gesture state 1, the model transfers from the transient end state to the start state to perform any gesture state from 2 to N in a sequential way, and so on). In order to be able to model gestures/prosody, it is necessary to make a temporal segmentation of the video content to detect the number of observations M in each state and the total number of states N.

Fig. 4: HMM structure for gesture and prosody analysis.

A. Speech Temporal Segmentation
Speech is segmented into syllables, represented by the states from 1 to N as indicated in Figure 4. The segmentation is performed by intersecting the inflection points (zero crossings of the rate of change of the curve) of both the pitch and intensity curves, besides the points that separate the voiced and unvoiced segments of the signal (see Figure 5 for an example of pitch and intensity curves). When comparing the two curves together, 5 different trajectory states can result [16] (see Table II). The goal is to code each segment of the signal with its corresponding pitch-intensity trajectory class (e.g., a voice signal segment coding could be: 5, 3, 4, 2, etc.). This segmental coding is used as labels for CHMM training. The next step corresponds to segmenting the voice signal, with its corresponding trajectory labeling, into syllables. Arai and Greenberg [17] defined the average duration of a syllable as 200 ms, and this duration can increase or decrease according to the nature of the syllable as being short or long. Practical tests showed that within a syllable of duration varying from 180 ms to 220 ms, the average number of trajectory classes in its corresponding pitch and intensity curves is around 5. Therefore, given the voice signal with its segments coded by the corresponding pitch-intensity trajectory labels, each 5 segments of the signal create a syllable state (from 1 to N) and the corresponding 5 labels are the observations M within the syllable state.

TABLE II: Voice signal segmentation labels.

Trajectory class   Trajectory state
1-4                rising/falling pitch & intensity (the four combinations)
5                  unvoiced segment

B. Gesture Temporal Segmentation
The average duration for making gestures in general varies between 0.1 and 2.5 seconds, according to the speed and the type of the performed gesture (e.g., a pointing gesture or a head gesture). In the case of head gestures, the average duration of performing a gesture is limited to about 0.4 seconds [18, 19]. In our case, the camera used to capture the gestures captures 30 frames/second, and therefore we can estimate 12 frames as the average number of frames sufficient to characterize a gesture. Similarly to the speech temporal segmentation (see Section IV-A), gesture temporal segmentation is performed by comparing the 9 trajectory classes according to the sinusoidal evolution of the extracted angle curves. However, the mechanical characteristics of our platform (the Nao robot) are limited to pitch and yaw movements only, therefore introducing only 5 trajectory classes (see Table III). In the context of the CHMM model, each group of 12 frames forms a complete gesture state from 1 to N, and the corresponding coding labels constitute the observations within the gesture state.

TABLE III: Gesture segmentation labels.

Trajectory class   Trajectory state (rate of change)
1-4                rising/falling Yaw & Pitch (the four combinations)
5                  no change

V. SPEECH TO HEAD GESTURE COUPLING
A typical CHMM structure is shown in Figure 6, where the circles represent the discrete hidden nodes/states while the rectangles represent the observable continuous nodes/states, which contain the observation sequences of the voice and gesture characteristics.

Fig. 6: Coupled Hidden Markov Model (CHMM) lag-1 structure.


Fig. 5: Speech, pitch and intensity curves. (The red parts of the voice signal are the unvoiced parts, while the blue parts are the voiced parts of the signal. The black points depict the inflection points of the signal, while the green points represent the separating points between the unvoiced and the voiced segments.)

According to the sequential nature of gestures and speech, the CHMM structure is of type lag-1, in which couple (backbone) nodes at time t are conditioned on those at time t−1 [20, 21, 22]. A CHMM model λ_C is defined by the following parameters:

π_0^C(i) = P(q_1^C = S_i)    (6)
a^C_{i|j,k} = P(q_t^C = S_i | q_{t-1}^audio = S_j, q_{t-1}^video = S_k)    (7)
b_t^C(i) = P(O_t^C | q_t^C = S_i)    (8)

where C ∈ {audio, video} denotes the audio and visual channels respectively, and q_t^C is the state of the coupling node in stream C at time t [23, 24]. The training of this model is based on the maximum likelihood form of the expectation maximization (EM) algorithm. Suppose there are 2 observable sequences of audio and video states O = {A_{1..N}, B_{1..N}}, where A_{1..N} = {a_1, ..., a_N} is the set of observable states of the first (audio) sequence and, similarly, B_{1..N} = {b_1, ..., b_N} is the set of observable states of the second (visual) sequence, and S = {X_{1..N}, Y_{1..N}} is the set of states of the couple nodes of the first (audio) chain and the second (visual) chain, respectively [21, 22]. The expectation maximization algorithm finds the maximum likelihood estimates of the model parameters by maximizing the following function [22]:

f(λ_C) = P(X_1) P(Y_1) ∏_{t=1}^{T} P(A_t | X_t) P(B_t | Y_t) P(X_{t+1} | X_t, Y_t) P(Y_{t+1} | X_t, Y_t),   1 ≤ T ≤ N    (9)

where:
P(X_1) and P(Y_1) are the prior probabilities of the audio and video chains, respectively;
P(A_t | X_t) and P(B_t | Y_t) are the observation densities of the audio and video chains, respectively;
P(X_{t+1} | X_t, Y_t) and P(Y_{t+1} | X_t, Y_t) are the couple-node transition probabilities of the audio and video chains.

The training of the CHMM differs from the standard HMM in the expectation step (E), while they are identical in the maximization step (M), which tries to maximize equation 9 in terms of the expected parameters [25]. The expectation step of the CHMM is defined in terms of the forward and backward recursions. For the forward recursion we define a variable α for the audio and video chains at t = 1:

α_{t=1}^audio = P(A_1 | X_1) P(X_1)    (10)
α_{t=1}^video = P(B_1 | Y_1) P(Y_1)    (11)

Then the variable α is calculated incrementally at any arbitrary moment t as follows:

α_{t+1}^audio = P(A_{t+1} | X_{t+1}) ∫∫ α_t^audio α_t^video P(X_{t+1} | X_t, Y_t) dX_t dY_t    (12)
α_{t+1}^video = P(B_{t+1} | Y_{t+1}) ∫∫ α_t^audio α_t^video P(Y_{t+1} | X_t, Y_t) dX_t dY_t    (13)

Meanwhile, for the backward direction there is no split in the calculated recursions, which can be expressed as follows:

β_t^{audio,video} = P(O_{t+1}^N | S_t) = ∫∫ P(A_{t+1}^N, B_{t+1}^N | X_{t+1}, Y_{t+1}) P(X_{t+1}, Y_{t+1} | X_t, Y_t) dX_{t+1} dY_{t+1}    (14)
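For concreteness, a discrete joint-state version of the forward recursion (10)-(13) is sketched below. The paper writes the recursion with integrals over the chain states and keeps separate per-chain α variables; the joint array used here is a common discrete simplification, and all array shapes and names are assumptions.

```python
import numpy as np

def chmm_forward(obs_a, obs_v, pi_a, pi_v, A_a, A_v, B_a, B_v):
    """Forward pass of a discrete lag-1 CHMM (sketch of eqs. (10)-(13)).

    A_a[i, j, k] = P(x_t = i | x_{t-1} = j, y_{t-1} = k)   audio coupled transitions
    A_v[m, j, k] = P(y_t = m | x_{t-1} = j, y_{t-1} = k)   video coupled transitions
    B_a[i, o], B_v[m, o]: observation probabilities; obs_a, obs_v: index sequences.
    """
    T, Na, Nv = len(obs_a), len(pi_a), len(pi_v)
    alpha = np.zeros((T, Na, Nv))            # joint forward variable over (x_t, y_t)
    alpha[0] = np.outer(pi_a * B_a[:, obs_a[0]], pi_v * B_v[:, obs_v[0]])
    for t in range(1, T):
        for i in range(Na):
            for m in range(Nv):
                # sum over all previous coupled states (x_{t-1} = j, y_{t-1} = k)
                trans = A_a[i] * A_v[m]                       # shape (Na, Nv)
                alpha[t, i, m] = (B_a[i, obs_a[t]] * B_v[m, obs_v[t]]
                                  * np.sum(alpha[t - 1] * trans))
    return alpha                             # alpha[-1].sum() gives P(obs_a, obs_v)
```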

After combining both the forward and backward recursion parameters, an audio signal can be tested on the trained model, generating a synthesized equivalent gesture that most likely fits the model. The generated gesture sequence is determined when the change in the likelihood falls below a threshold.

VI. EXPERIMENTAL RESULTS
The experimental testbed used in this study is the humanoid robot Nao developed by Aldebaran Robotics. For training and testing, we used the MVGL-MASAL gesture-speech Turkish database [9].

TABLE IV: Confusion matrix of the original and synthesized trajectory classes.

Synthesized / Real    1    2    3    4    5
1                    25    3    2    4   20
2                    13   29    6    2   18
3                    13    5   20    5   30
4                     6    3    8   40   43
5                    36   28   25   33  351

The database is composed of 4 videos of different durations, ranging from 6 to 8 minutes. It contains the audiovisual information of different subjects instructed to tell stories to a children's audience. We use one part of the database for training the models and the other part for testing. The audio signals are extracted and then processed in order to extract the relevant prosodic characteristics. The proposed speech-to-gesture mapping methodology was tested on the database using a cross-validation algorithm. The system was trained on the audio/visual sequences of 3 videos from the database, and then tested on the audio sequence of the 4th video. The corresponding generated gestures are compared to the natural gesture sequence in the test video, and an average score of 62% was found in terms of the similarity of trajectory classes. Table IV depicts the confusion matrix between the original and synthesized gesture label trajectories. The confusion matrix reveals that trajectory state 5, in which there is no change in the Yaw and Pitch angles, is the dominant trajectory class. This can be a result of the smoothing processes and/or of the precision of the Euler angle extraction algorithm; however, this will not cause unnaturalness when the robot and the human are interacting in long conversations. After calculating the score of similarity between the trajectory labels of the original and the synthesized signals, it is important to generate the corresponding Yaw and Pitch curves for the head motion and compare them to the original curves by calculating the total average root mean square (RMS) error between the corresponding curve points. The RMS errors found between the generated Yaw and Pitch curves with respect to the original curves are 10% and 12%, respectively. In fact, the obtained score of 62% and the RMS errors between the original and the synthesized curves can be considered a reasonable result, because the duration and the surrounding environment conditions of the test video and the training video set were similar. Also, the speakers' tonality in all training and test videos was similar. However, we do not yet know the score we will obtain in real applications, where the robot will be tested under different conditions. The performed head gestures can differ in amplitude or direction from one person to another without hindering the transfer of the meaning of the gesture message between interacting humans and, similarly, between the interacting robot and human. Figures 7 and 8 show a comparison between a part of the original and synthesized pitch and yaw curves (after being smoothed by a median filter) of the test video from the database.
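The two measures quoted above can be computed as in the short sketch below; the exact framing and normalization used by the authors (the RMS errors are reported as percentages) are not specified, so this is only an assumed, illustrative form.

```python
import numpy as np

def label_similarity(real_labels, synth_labels):
    # fraction of segments whose trajectory class matches (reported as a percentage)
    real, synth = np.asarray(real_labels), np.asarray(synth_labels)
    return float(np.mean(real == synth))

def rms_error(real_curve, synth_curve):
    # root mean square error between corresponding points of two angle curves
    real = np.asarray(real_curve, dtype=float)
    synth = np.asarray(synth_curve, dtype=float)
    return float(np.sqrt(np.mean((real - synth) ** 2)))
```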

Fig. 7: View of the original (blue curve) and synthesized (red curve) Pitch angles of a part of the test video

Fig. 8: View of the original (blue curve) and synthesized (red curve) Yaw angles of a part of the test video.

A video of the speech-gesture mapping system with the Nao robot is available at: http://www.ensta-paristech.fr/tapus/HRIAA/media.

VII. CONCLUSIONS
This research focuses on synthesizing head gestures based on speech characteristics (e.g., the pitch and intensity of the signal). Our mapping system is based on the Coupled Hidden Markov Model (CHMM), which tries to find a coupling joint between audio and visual sequences. The audio sequence is composed of parallel states representing the syllables, and each syllable is composed of a specific number of observations (M = 5 in our case). Meanwhile, the video sequence has the same parallel construction, where the states represent the gestures and each state is composed of another specific number of observations determined experimentally (M = 12 in our case). After training the CHMM on audio-visual sequences from a database, when a test audio signal is provided, the system tries to find a corresponding sequence of gestures based on its own experience learnt during the training phase. The generated gesture sequence is the sequence that achieves the maximum likelihood estimation with the speech test signal. Our system shows a score of 62%, which measures the similarity between the original gesture sequence labels and the synthesized gesture sequence labels, over a test video of 8 minutes. This can be considered a good score. The proposed system is able to generate appropriate robot head gestures from speech input, which allows it to produce an automatic natural robot behavior that is almost completely absent from present-day

human-robot interactions. Further work will focus on creating a triadic alignment between speech, head gestures, and hand gestures in different human-robot interaction contexts, which will allow the robot to interact naturally under different conditions.

REFERENCES
[1] D. McNeill, Hand and mind : what gestures reveal about thought. Chicago, USA: Chicago : University of Chicago Press, 1992. [2] F. P. Eyereisen and J. D. D. Lannoy, Gestures and Speech: Psychological Investigations. Cambridge University Press, 1991. [3] M. P. Michalowski, S. Sabanovic, and H. Kozima, A dancing robot for rhythmic social interaction, in Proceedings of the Human-Robot Interaction Conference, Arlington, USA, mar 2007, pp. 8996. [4] K. Munhall, J. A. Jones, D. E. Callan, T. Kuratate, and E. VatikiotisBateson, Visual prosody and speech intelligibility: Head movement improves auditory speech perception, Psychological Science, vol. 15, no. 2, pp. 133137, 2004. [5] T. Kuratate, K. G. Munhall, P. E. Rubin, E. Vatikiotis-Bateson, and H. Yehia, Audio-visual synthesis of talking faces from speech production correlates, in Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH), 1999, pp. 12791282. [6] L. Valbonesi, R. Ansari, D. McNeill, F. Quek, S. Duncan, K. E. McCullough, and R. Bryll, Multimodal signal analysis of prosody and hand motion: Temporal correlation of speech and gestures, in Proceedings of the European Signal Processing Conference(EUSIPCO), vol. 1, 2005, pp. 7578. [7] F. Quek, D. McNeill, R. Ansari, X. Ma, R. Bryll, S. Duncan, and K. McCullough, Gesture cues for conversational interaction in monocular video, in Proceedings of the ICCV, 1999, pp. 6469. [8] H. P. Graf, E. Cosatto, V. Strom, and F. J. Huang, Visual prosody: facial movements accompanying speech, in Proceedings of IEEE Int. Conf. Automatic Face and Gesture Recognition, 2002, pp. 381386. [9] M. E. Sargn, Y. Yemez, E.Erzin, and A. M. Tekalp, Analysis of head gesture and prosody patterns for prosody-driven head-gesture animation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 8, pp. 13301345, 2008. [10] D. Talkin, A robust algorithm for pitch tracking, in Speech Coding and Synthesis. W B Kleijn, K Paliwal eds, Elsevier, 1995, pp. 497518. [11] E. M.Chutorian and M.M.Trivedi, Head pose estimation in computer vision: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 607626, 2009. [12] P. Viola and M. J. Jones, Robust real-time face detection, International Journal of Computer Vision, vol. 57, pp. 137154, 2004. [13] K. Wong, K. Lam, and W. Siu, A robust scheme for live detection of human faces in color images, Signal Processing: Image Communication, vol. 18, no. 2, pp. 103114, 2003. [14] K. W. Wong, K. I. Lam, and W. Siu, An efcient algorithm for human face detection and facial feature extraction under different conditions, Pattern Recognition, vol. 34, no. 10, pp. 19932004, 2000. [15] B. Yip, W. Y. Siu, and S. Jin, Pose determination of human head using one feature point based on head movement, in Proceedings of IEEE Int. Conf. on Multimedia and Expo (ICME), vol. 2, 2004, pp. 11831186. [16] F. Ringeval, J. Demouy, G. S. andM. Chetouani, L. Robel, J. Xavier, and D. C. Plaza, Automatic intonation recognition for the prosodic assessment of language impaired children. IEEE Transactions on Audio, Speech and Language Processing, vol. 99, pp. 115, 2010. [17] T. Arai and S. Greenberg, The temporal properties of spoken japanese are similar to those of english, in Proceedings of Eurospeech, Rhodes, Greece, 1997, pp. 10111114. [18] K. Nickel and R. Stiefelhagen, Real-time recognition of 3d-pointing gestures for human-machine-interaction, in Proceedings of DAGMSymposium, Magdeburg, Germany, 2003, pp. 557565. 
[19] S. A. Moubayed and J. Beskow, Effects of visual prominence cues on speech intelligibility, in Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), Norwich, UK, 2009. [20] L. R. Rabiner, A tutorial on hidden markov models and selected applications in speech recognition, in Proceedings of the IEEE, vol. 77, no. 2, 1989, pp. 257286. [21] I.Rezek, P. Sykacek, and S. Roberts, Coupled hidden markov models for biosignal interaction modelling, in Proceedings of the International Conference on Advances in Medical Signal and Information Processing (MEDSIP), 2000.

[22] I. Rezek and S. J. Roberts, Estimation of coupled hidden markov models with application to biosignal interaction modelling, in Proceedings of the IEEE International Workshop on Neural Networks for Signal Processing (NNSP), Sydney, Australia, 2000. [23] A. V. Nean, L. Liang, X. Pi, X. Liu, and C. Mao, A coupled hidden markov model for audio-visual speech recognition, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, Orlando, USA, 2002, pp. 20132016. [24] L. Liang, X. Liu, X. Pi, Y. Zhao, and A. V. Nean, Speaker independent audio-visual continuous speech recognition, in Proceedings of the International Conference on Multimedia and Expo (ICME), vol. 2, Lausanne, Switzerland, 2002, pp. 2528. [25] W. Penny and S. Roberts, Gaussian observation hidden markov models for eeg analysis, in Technical Report TR-98-12, Imperial College London, UK, 1998.


Evaluation of the Robot Sign Language Tutoring Assistant Using Video-Based Studies
H. Kose-Bagci, R. Yorganci, H. E. Algan Social Robotics Lab, Faculty of Computer and Informatics, Istanbul Technical University, Istanbul, Turkiye Email: {hatice.kose, yorganci, algan}@itu.edu.tr
AbstractThere is an on-going study which aims to assist in teaching Turkish Sign Language (TSL) to hearing impaired children by means of non-verbal communication and imitation based interaction games between a humanoid robot and the child. In this study, the robot will be able to express a word in the TSL among a set of chosen words using hand movements, body and face gestures and having comprehended the word, the child will give relevant feedback to the robot. This paper aims to perform an evaluation on a subset of sample words chosen from TSL via the comparison of their different video representations carried out by human teachers and the simulated Nao robot. Several surveys and user studies have been realized to reveal the resemblance between the two types of videos involving the performance of the robot simulator and the human teacher for each chosen word. In order to investigate the perceived level of similarity between human and robot behavior, a number of participants with different sign language acquaintance have been asked to complete questionnaires. The results of these surveys have been summarized and the most significant factors affecting the comprehension of TSL words have been discussed. Index Terms Humanoid Robots, Interaction games, Nonverbal communication, Sign Language

I. INTRODUCTION
Turkish Sign Language (TSL) is a visual language that is comprised of hand movements and body and face gestures. Hearing-impaired children have the chance to learn this language as their native language even before they learn the written Turkish language, on condition that their parents are hearing impaired as well. Language acquisition, which is an extremely crucial process for brain development and intelligence, is completed at the age of 2 or 3 years. Hence the existence of sufficient native language materials and their employment during education is of great importance for preschool training. The utilization of television and computers has become widespread for the purpose of teaching sign language to hearing-impaired people. Comprehensive research is being carried out at Boğaziçi University, Turkey, to teach TSL through videos and to understand sign language words performed by students. This study involves research on the recognition and classification of hand and face gestures [1-4].

This work was supported by the Istanbul Technical University Scientific Research Projects foundation under contract BAP 34255. Contact address: hatice.kose@itu.edu.tr

In addition, a TSL tutoring software [5] has been implemented. This study has been realized as part of an on-going research effort which aims to utilize humanoid robots to aid sign language tutoring, due to the shortcomings of the 2-D instructional tools developed for this goal and the lack of sufficient educational material. In the proposed system, it is intended that a child-sized humanoid robot will perform various elementary TSL words so as to assist in teaching these words to hearing-impaired children. This will be achieved through interaction games based on non-verbal communication, turn-taking and imitation that are designed specifically for the robot and the child to play together. In this study, a survey has been carried out to investigate the similarity between the behaviors of human teachers and the Nao robot simulator during the representation of TSL words. For each selected word, a video which displays the Nao robot simulator performing the sign language expression has been prepared. The corresponding videos of sign language representations by human teachers are available within the TSL tutoring software [5]. For the test study, following the demonstration by video of the robot's and the human teachers' performances of several selected TSL words, a number of participants were asked to fill out a questionnaire regarding the success of the robot's performance in matching the correct human implementation. The results of the survey have been summarized and the most significant factors regarding the realization of TSL words are discussed in the experiments and discussion section.

II. LITERATURE SURVEY
Various studies have been carried out for the recognition of different sign languages. A system that recognizes 40 words, each of which is a pronoun, noun, verb or adjective in American Sign Language, with 90% precision has been implemented [6]. Another study states an 80% success rate for detecting 95 words taken from the Australian Sign Language [7]. Regarding the Japanese Sign Language, a total of 52 words, 42 of which are represented by the finger alphabet, have also been adequately identified [8]. There is also a study that recognizes 19 words from American Sign Language by utilizing Hidden Markov models [1].


These words are expressed by the movements of the head and the two hands. The program flow commences with the user repeating a sign language word after watching its corresponding video. The recorded movements of the user are analyzed to determine whether the sign language word has been performed appropriately. A success rate of 99% has been reported for the recognition of words expressed solely by the hands, and a success rate of 85% was achieved for the classification of words performed both by the hands and by the head. Various other studies on sign languages, such as the recognition of hand shapes and movements or the classification of facial gestures, have been carried out in order to analyze and help teach sign languages [1-4]. There are several successful user studies on non-verbal communication through imitation-based interaction games with humanoid robots and human participants [9, 10]. To achieve interaction, drumming games with gestures provided by the robot to motivate successful turn-taking were studied in [10-13]. There is no previous study that reports the usage of humanoid robots for helping to teach sign language to hearing-impaired children. The proposed system aims to compensate for this lack while being supportive of hearing-impaired children.

III. PROPOSED WORK
This study is part of an on-going research effort that aims to help teach TSL to hearing-impaired children via interaction-based games between the robot and the child. The H-25 Nao robots will be used in the user studies of this research, as they have hands and fingers to implement most of the sign language words, and they are appropriate for use in interactive children's games due to their expressive face, small size, compact shape and toy-like appearance. The Nao robot, which has a height of 0.57 m and a weight of 4.5 kg, is a system with 21-25 degrees of freedom, a 500 MHz processor, two cameras, sonar sensors and force-sensitive resistors [14]. In this study, a subset of the most appropriate words that can be demonstrated by the movements of the Nao robot simulator software has been determined. The physical limitations of the Nao robot make it hard to implement some of the words. One of the reasons for this is the fact that the Nao robot has only 3 dependent fingers, while most of the words of TSL are performed by using 5 fingers (mostly independently, e.g., one still while the other 2 are curled). With the aid of the proposed system, distinct success rates for robot and human tutors will be extracted. Regarding TSL, no matter how professional the teacher might be, each individual has a differentiated style, which causes difficulty in learning in case the instructor is replaced. With the proposed system, it is intended to show that the robot tutor achieves a similar success rate to human teachers. In this way, it is planned that tutor-independent TSL teaching will be achieved.

proposed system is to develop a fast, simple, motivating tool with an easy update facility that allows children to test their knowledge. The proposed system is expected to assist human teachers rather than to form a substitute for them. IV. EXPERIMENTS & DISCUSSION A survey, which intends to evaluate the performance of the Nao robot simulator [15] for demonstrating a number of chosen TSL words, has been carried out in this study. The survey covers both people who have little or no acquaintance with TSL and second-grade students of the TSL courses held at Bilgi and Boğaziçi Universities, Istanbul, Turkey. A. Adult participants with no acquaintance with Sign Language In the first part of the survey, 40 people (16 male, 24 female), 36 of whom are students at different universities in Istanbul, participated. The remaining four people have distinct professions. Table 1 exhibits the acquaintance of the participants with TSL. The subset of TSL words chosen for the survey is given with their English meanings in Table 2.
TABLE 1. TSL ACQUAINTANCE OF PARTICIPANTS
                        # of participants
Familiar                4
Little acquaintance     6
No acquaintance         27
No answer               3

TABLE 2. CHOSEN TSL WORDS
Turkish Word    English Meaning
Araba           Car
Arkadaş         Friend
Baba            Father
Masa            Table
Üç              Three

During the execution process throughout the survey, firstly the video of the robot performing the sign language word has been displayed to the participant. Subsequently, a number of human teacher videos taken from the TSL software have been shown for the participant to identify the video that mostly resembles that of the robot. In each of the TSL software videos, a specific TSL word is performed by a human teacher. Finally, the participants were asked to reveal which word was most successfully expressed by the robot. Figure 1 to Figure 4 display video screen shots of sample TSL word representations performed by Nao robot simulator and a human teacher.


total number of answers, and the number of wrong choices for each video are presented in Table 3 as well.

Figure 3 Videos corresponding to the realization of the TSL word "baba" (father) by the Nao robot simulator and a human teacher.

Figure 1 Videos corresponding to the realization of the TSL word "masa" (table) by the Nao robot simulator and a human teacher.

Figure 4 Videos corresponding to the realization of the TSL word "araba" (car) by the Nao robot simulator and a human teacher.

TABLE 3. ACCURACY OF CHOICES FOR NON-FAMILIAR PARTICIPANTS
Word        CORRECT CHOICE    WRONG CHOICE    TOT.
Araba       36                4               40
Arkadaş     23                14              37
Baba        19                18              37
Masa        33                6               39
Üç          31                5               36

Figure 2 Videos corresponding to the realization of the TSL word "arkadaş" (friend) by the Nao robot simulator and a human teacher.

Table 3 displays the accuracy of the choices made by the participants who are not familiar with TSL. More explicitly, 36 people identified the corresponding human teacher TSL video after watching the video of the Nao robot performing the TSL word "araba" (car). A choice refers to the human teacher TSL video selected by the participant, who attempts to find the human teacher TSL video that correlates best with the robot video. For each robot video that demonstrates a particular TSL word, there is exactly one corresponding human teacher TSL video and at most three irrelevant other human teacher TSL videos. In other words, there is one correct choice along with several wrong choices that can be selected by the participant. Wrong choices were generally those that most closely resemble the correct video. Some of the participants did not make any choice at all, since they admitted that they could not find a resemblance between the available choices and the corresponding robot video. The

In this survey, having chosen the most likely human teacher TSL video for the robots expression, the participant will then assess a score of resemblance for the corresponding two videos of human and robot performance (Table 4). Hence, if the participant is able to come up with the correct choice, he/she will determine a score for the corresponding TSL video. Otherwise, he/she will be evaluating a wrong choice, thus endeavoring to assess the resemblance between the robot video and an irrelevant human teacher TSL video. In Table 4, each column represents the score of resemblance for the related choice. Also the distribution of scores among different wrong choices, and the total scores of the wrong choices are given.
TABLE 4. ASSESSMENT OF RESEMBLANCE FOR ROBOT AND HUMAN TEACHER VIDEOS.
Word        Correct Choice   1st Wrong   2nd Wrong   3rd Wrong   Total of Wrong   # of Ungraded
                             Choice      Choice      Choice      Choices          Words
Araba       139              4           11          -           15               2
Arkadaş     57               3           28          -           31               4
Baba        56               5           13          32          50               2
Masa        121              6           16          -           22               1
Üç          121              16          0           -           16               2

word. It can be inferred from the results that the rates of appropriate association between robot and human teacher TSL videos are above 50% for all the TSL words chosen. The word Baba has the least successful robot performance among the TSL words. However, it is conferred that the wrong choices made by the participants exhibit a distribution rather than getting stuck in a single wrong answer.
TABLE 8. AVERAGE AND STANDARD DEVIATION VALUES FOR DIFFERENT CHOICE SCORES
Word        Correct Choice    Wrong Choice
Araba       3,86±0,82         -
Arkadaş     2,47±0,85         2,15±1,21
Baba        2,95±0,97         2,75±1,52
Masa        3,67±0,82         -
Üç          3,90±1,11         3,2±1,48
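For reference, the averages and standard deviations reported in Tables 7 and 8 can be derived from the raw questionnaire entries along the following lines. The data layout, the function name, and the pooling of all wrong choices per word are our own simplifications of the procedure described in the text (which discards individual choices selected by fewer than five participants); this is an illustrative sketch, not the authors' analysis code.

import numpy as np

def summarize_scores(responses, min_count=5):
    # `responses` maps a TSL word to a list of (is_correct, score) tuples,
    # where score is the 5-point resemblance grade or None if ungraded.
    summary = {}
    for word, answers in responses.items():
        correct = [s for ok, s in answers if ok and s is not None]
        wrong = [s for ok, s in answers if not ok and s is not None]
        entry = {"correct_rate": 100.0 * sum(ok for ok, _ in answers) / len(answers)}
        if len(correct) > 1:
            entry["correct"] = (np.mean(correct), np.std(correct, ddof=1))
        if len(wrong) >= min_count:
            entry["wrong"] = (np.mean(wrong), np.std(wrong, ddof=1))
        summary[word] = entry
    return summary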

Lastly, the questionnaire asks about the most and the least successful robot/human teacher TSL video pair according to their similarity with each other. The numbers in parentheses display the number of participants that selected the correct TSL video for the robot video of that TSL word (Table 5). The participants are asked to evaluate the rate of success according to the resemblance between the two videos for the same TSL word. In practice, however, the participants perform this evaluation for the robot video that they have watched and the TSL video choice that they have come up with.
TABLE 5. NUMBER OF PARTICIPANTS EVALUATING THE MOST AND THE LEAST SUCCESSFUL ROBOT/HUMAN TEACHER VIDEO PAIR
Word                Successful    Unsuccessful
Araba (Car)         17 (16)       -
Arkadaş (Friend)    2 (1)         11 (6)
Baba (Father)       3 (2)         14 (6)
Masa (Table)        5 (5)         1 (0)
Üç (Three)          8 (8)         8 (5)

Since problems have been faced whilst trying to move fingers and rotate elbows on the Nao simulation program, robot expressions for the words arkada(friend), baba(father) and (three) tend to be rather problematic. Therefore, the results in the previous tables were anticipated. The word (three) has the best performance among the other problematic words. Moreover, it is realized to have one of the most successful representations. The reason for this can be traced to the fact that the corresponding expression of the TSL word is widely used among Turkish people. Therefore, the participants were unable to respond in an unbiased way. B. Adult participants with Sign Language acquaintance As a test case, the survey is also done on a class of second degree sign language students (11 females and 1 male). Although the words and the videos are same with the first survey, the way we ask the questions slightly changed. Before giving the choices to the sign language professionals, we first wanted to get their unbiased opinions and guesses about the videos. The second questionnaire is comprised of three phases. Firstly, the participant is asked to watch a video containing a Nao simulator representation of a TSL word and then to guess the word that the simulated Nao attempts to perform. Secondly, the participant is expected to choose the correct video, which resembles the most to the Naos demonstration of a given TSL word, among a number of videos each of which displays the human teacher performance of a specific TSL word. In addition, the resemblance ratio among these two similar videos is required to be graded by the participant based on a 5-scale. This part is same with the test applied in the first survey. Thirdly, the participant is requested to select a certain word that he/she thinks to correspond to the video of the Nao simulator representation among a number of given words and to rank the resemblance ratio as well. Finally, the participants were asked to evaluate the videos of the robots as the most and least successful word implementation of the robot.

The questionnaires also interrogate the reasons for the failure of robot performance for expressing a TSL word such as; a) physical appearance of the robot b) physical incompetence pertaining to the robot c) software inadequacy of the Nao robot simulator d) other reasons (Table 6).
TABLE 6. POSSIBLE REASONS FOR THE LACK OF RESEMBLANCE BETWEEN ROBOT AND TID VIDEOS (counts per reason (a)-(d)).
Araba (Car): 10, 1, 2
Arkadaş (Friend): 2, 9, 2
Baba (Father): 1, 13, 2, 2
Masa (Table): 1, 2, 3
Üç (Three): 6, 8, 1, 2

TABLE 7. DISTRIBUTION OF CHOICE PERCENTILES
Word        Correct Choice    Wrong Choice(s)
Araba       %90               -
Arkadaş     %62,16            %35,14
Baba        %51,35            %43,24
Masa        %84,62            -
Üç          %86,11            %13,89

Table 7 demonstrates the distribution of percentiles for the choices made by the participants. Choices that have been selected by less than 5 people, and the unanswered questions have been discarded. Table 8 shows the mean scores and standard deviations for different choice scores for each TSL


TABLE 9. THE RESULT DISTRIBUTION FOR PHASE 1
Word        Correct Choice    Wrong Choice    No Choice
Araba       7                 5               -
Arkadaş     1                 8               3
Baba        -                 10              2
Masa        -                 11              1
Üç          -                 5               7

Table 9 displays the results of the first phase of the second questionnaire. From the table, it can be observed that only the word "Araba" could be guessed correctly from the Nao simulator video, with a success rate slightly above 50%, when no human teacher TSL videos are shown to the participants.
TABLE 10. DISTRIBUTION OF CHOICES MADE BY THE PROFESSIONAL PARTICIPANTS IN PHASE 2
Word        CORRECT CHOICE    1ST    2ND    3RD    TOT.
Araba       11                -      -      -      11
Arkadaş     8                 -      -      -      8
Baba        8                 1      1      -      10
Masa        4                 3      4      -      11
Üç          2                 -      -      -      2

meaning of the words, since there are many words which look very similar and are difficult to distinguish even for sign language professionals. In the third phase of the second questionnaire, the participants are given certain words instead of videos and are asked to choose the word that they think best fits the one demonstrated in the Nao simulator video. Table 14 to Table 17 display the corresponding results.
TABLE 14. DISTRIBUTION OF CHOICES MADE BY THE PROFESSIONAL PARTICIPANTS IN PHASE 3
Word        CORRECT CHOICE    1ST    2ND    3RD    TOT.
Araba       10                2      -      -      12
Arkadaş     10                -      -      -      10
Baba        7                 2      1      -      10
Masa        8                 4      -      -      12
Üç          1                 3      2      -      6

TABLE 11. ASSESSMENT OF RESEMBLANCE FOR ROBOT AND TID VIDEOS.
(Columns: Correct Choice, 1st Wrong Choice, 2nd Wrong Choice, 3rd Wrong Choice, # of Ungraded Words)
Araba:    31, 2
Arkadaş:  11
Baba:     17, 4, 2, 1
Masa:     13, 8, 9, 2, 4

TABLE 12. DISTRIBUTION OF PERCENTILES FOR THE SCORES MADE BY THE PARTICIPANTS.
Word        Correct Choice    Wrong Choice(s)
Araba       %91,66            -
Arkadaş     %66,67            -
Baba        %66,67            %16,66
Masa        %33,33            %58,33
Üç          %16,67            -

TABLE 13. AVERAGE AND STANDARD DEVIATION VALUES OF SCORES FOR EACH TSL WORD
Word        Correct Choice
Araba       3,45±0,87
Arkadaş     1,125±0,35
Baba        2,125±0,64
Masa        3,2±1,10
Üç          1±0

TABLE 15. ASSESSMENT OF RESEMBLANCE FOR ROBOT AND TID VIDEOS.
(Columns: Correct Choice, 1st Wrong Choice, 2nd Wrong Choice, 3rd Wrong Choice, # of Ungraded Words)
Araba:    36, 7
Arkadaş:  14
Baba:     16, 1, 2, 2
Masa:     20, 10, 7, 3

TABLE 16. DISTRIBUTION OF PERCENTILES FOR THE CHOICES MADE BY THE PROFESSIONAL PARTICIPANTS.
Word        Correct Choice    Wrong Choice(s)
Araba       %83,34            %16,67
Arkadaş     %83,34            -
Baba        %58,33            %25,00
Masa        %66,67            %33,34
Üç          %8,33             %41,67

TABLE 17. AVERAGE AND STANDARD DEVIATION VALUES FOR DIFFERENT CHOICE SCORES FOR EACH TURKISH SIGN LANGUAGE WORD
Word        Correct Choice    Wrong Choice
Araba       3,45±1,13         3,5±0,71
Arkadaş     1,4±0,97          -
Baba        2,29±0,76         3±0
Masa        2,5±1,07          2,5±0,58
Üç          -                 3,543

Wrong Choice 60 5,251,41 -

From Table 10 to Table 13, we can observe the results for the evaluation of the Nao simulator video by the participants. In Table 11, each column represents the score of resemblance for the related choice. The participants that are capable of recognizing a much wider range of TSL words have more difficulty in finding the right TSL video corresponding to the robot video. The reason for this can be traced to the fact that the physical deficiencies of the simulated robot might change the

Lastly, Table 18 presents the number of participants evaluating the most and the least successful robot/tid video pair according to their similarity with each other. The numbers in parentheses display the number of participants that selected the correct TSL video for the robot video of that turkish sign language word.
TABLE 18. EVALUATION OF THE MOST AND THE LEAST SUCCESSFUL ROBOT/TID VIDEO PAIR
Word                Successful    Unsuccessful
Araba (Car)         9 (9)         -
Arkadaş (Friend)    1 (1)         3 (1)
Baba (Father)       -             -
Masa (Table)        -             -
Üç (Three)          -             8 (1)

working on extending the study to American Sign Language (ASL). REFERENCES


[1] Aran, O., Keskin, C., Akarun, L., "Sign Language Tutoring Tool", European Signal Processing Conference, EUSIPCO'05, Antalya, Türkiye, 2005.
[2] A. Caplier, S. Stillittano, O. Aran, L. Akarun, G. Bailly, D. Beautemps, N. Aboutabit, and T. Burger, "Image and video for hearing impaired people", EURASIP Journal on Image and Video Processing, Special Issue on Image and Video Processing for Disability, 2007.
[3] Keskin, C. and L. Akarun, "Input-output HMM based 3D hand gesture recognition and spotting for generic applications", Pattern Recognition Letters, vol. 30, no. 12, pp. 1086-1095, September 2009.
[4] Aran, O., I. Ari, A. Benoit, P. Campr, A. H. Carrillo, F. Fanard, L. Akarun, A. Caplier, M. Rombaut, and B. Sankur, "Signtutor: An Interactive System for Sign Language Tutoring", IEEE Multimedia, vol. 16, no. 1, pp. 81-93, Jan-March 2009.
[5] Turkish Sign Language Dictionary v1.0. [Online]. Available: http://www.cmpe.boun.edu.tr/pilab/tidsozlugu
[6] Starner, T., Pentland, A., "Real-Time American Sign Language Recognition from Video using Hidden Markov Models", Technical Report TR-306, Media Lab, MIT.
[7] Kadous, Waleed, "GRASP: Recognition of Australian Sign Language Using Instrumented Gloves", MSc Thesis, University of New South Wales, 1995.
[8] Murakami, Kouichi, Taguchi, Hitomi, "Gesture Recognition Using Recurrent Neural Networks", Proceedings of CHI'91 Human Factors in Computing Systems, 1991, pp. 237-242.
[9] Shen, Q., J. Saunders, H. Kose-Bagci, K. Dautenhahn, "An Experimental Investigation of Interference Effects in Human-Humanoid Interaction Games", IEEE RO-MAN 2009, accepted.
[10] Kose-Bagci, H., K. Dautenhahn, C. L. Nehaniv, "Emergent Dynamics of Turn-Taking Interaction in Drumming Games with a Humanoid Robot", Proc. IEEE RO-MAN 2008, Technische Universitat Munchen, Munich, Germany, 1-3 August 2008.
[11] Kose-Bagci, H., E. Ferrari, K. Dautenhahn, D. S. Syrdal, and C. L. Nehaniv, "Effects of Embodiment and Gestures on Social Interaction in Drumming Games with a Humanoid Robot", Special issue on Robot and Human Interactive Communication, Advanced Robotics, vol. 24, no. 14, 2009.
[12] Kose-Bagci, H., K. Dautenhahn, D. S. Syrdal, and C. L. Nehaniv, "Drummate: Interaction dynamics and gestures in human-humanoid drumming experiments", Connection Science, 2009, accepted.
[13] Dautenhahn, K., C. L. Nehaniv, M. L. Walters, B. Robins, H. Kose-Bagci, N. A. Mirza, M. Blow (in press), "KASPAR - A Minimally Expressive Humanoid Robot for Human-Robot Interaction Research", Special Issue on "Humanoid Robots" for Applied Bionics and Biomechanics, Taylor and Francis.
[14] Graf, C., Härtl, A., Röfer, T., Laue, T., "A robust closed-loop gait for the standard platform league humanoid", in Zhou, C., Pagello, E., Menegatti, E., Behnke, S., Röfer, T., eds.: Proceedings of the Fourth Workshop on Humanoid Soccer Robots in conjunction with the 2009 IEEE-RAS International Conference on Humanoid Robots, Paris, France, 2009, pp. 30-37.
[15] Aldebaran Robotics Choregraphe. [Online]. Available: http://www.aldebaran-robotics.com/en/programmable

The answers of the participants with adequate TSL knowledge confirm the results from the nave users. In this study, the students were also asked to suggest alternative answers for the words embodied by the robots actions. Although they were able to find to words with high success rates, they comment on how the little cues realized by slight motion of fingers or finger tips which are currently out of Nao robots physical capabilities might change the meaning of the words, since there are many words which look very similar and difficult to distinguish even for the sign language professionals. As expected although word (three), got a high success rate among the nave users, it was not rated as successful by the professionals since the robots fingers are not capable of showing the right action and it is not easy to distinguish it from the next similar word iki, without the fingers. V. CONCLUSIONS In this study, a survey has been carried out to compare the performance of Nao robot simulator with that of a human teacher for expressing certain words from the TSL. This study has been realized as part of a more comprehensive research, which aims to help teach Turkish Sign language to hearing impaired children by generating interaction based games between the robot and the child. The videos in which a human TSL tutor performs the chosen words have been accessed from the TSLsoftware developed at Boazii University[3]. The Nao robot simulator has been utilized to prepare corresponding videos for the chosen words as well. The participants were asked to fill out a questionnaire in order to assess the resemblance between the robot and the human teacher so as to determine whether the Nao robot has near-human capability for expressing TSL words. Although the physical limitations of the robot does not allow the implementation of all words in the domain (i.e. the ones that require 5 fingers or some complex upper torso motion), and most of the participants were chosen from nave people who do not know robots or sign language, all the words were recognized with big success. There was bias related to the familiarity of some words, or the fact that there are several words whose implementation looks very familiar due to their close meaning (i.e. father -uncle). We enlarged our domain of words and repeated this experiment with experienced sign language users, and we are still working on the analysis of the results. Our next step is to convert the survey into an online interaction game and test with primary school students to see if age and experience will have an effect on these results. Currently, more than 40 words have been implemented on real/simulated Nao H25 robot. There is also an ongoing work to use a robotic platform with 5 fingers and more DOF on arms to realize the signs more successfully. Also we are



Segmentation of Dynamic Objects from Laser Data


Agustin Ortega and Juan Andrade-Cetto
Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain

Abstract We present a method to segment dynamic objects from high-resolution low-rate laser scans. Data points are tagged as static or dynamic based on the classification of pixel data from registered imagery. Per-pixel background classes are adapted online as Gaussian mixtures, and their matching 3D points are classified accordingly. Special attention is paid to the correct calibration and synchronization of the scanner with the accessory camera. Results of the method are shown for a small indoor sequence with several people following arbitrarily different trajectories. Index Terms Segmentation, 3D sensing, calibration, sensor synchronization.

I. I NTRODUCTION 2D and 3D lidar scanning are popular sensing methodologies for robotics applications. They are used for robot navigation [5], trajectory planning [20], scene reconstruction [17], and even object recognition [1]. Aside from pricey devices such as the Velodyne HDL-64E, high resolution 3D lidar scanning is only possible at low frame rates. As an example, we have built an omnidirectional lidar sensing device for outdoor mobile robotics applications that scans with resolutions and acquisition times that range from 0.5 degrees at 9 seconds per revolution to ner point clouds sampled at 0.1 degrees resolution at a more demanding processing time of 45 seconds per revolution. This sensor has been devised for low cost, dense 3d mapping. The removal of dynamic and spurious data from the laser scan is a prerequisite to dense 3d mapping. In this paper we address this problem by synchronizing the laser range sensor with a color camera, and using the high frame-rate image data to segment out dynamic objects from the point clouds. Per-pixel class properties of image data are adapted online using Gaussian mixtures. The result is a synchronized labeling of foreground/background corresponding laser points and image data as shown in Fig. 1. The paper is organized as follows. In the next section we present related work in background segmentation using computer vision methods and 3D laser range data. Section III gives our custom built sensor specications, and details the methods developed for sensor synchronization and sensor calibration. Section IV details the background segmentation algorithm. Results of the method are shown in Section V on a real indoor scenario with several people moving with random patterns. Conclusions and future work are detailed in Section VI.

Fig. 1. Several laser scans of a dynamic object reprojected on their corresponding image frame.

II. R ELATED W ORK Methods that study the segmentation of 3D laser data usually focus on the extraction of valuable geometric primitives such as planes or cylinders [10] with applications that vary from map building, to object classication [2], road classication [6], or camera network calibration [9]. All these methods however are designed to work on static data only and do not consider the temporal information. For outdoor map building applications, the removal of dynamic objects from the laser data is desirable. Furthermore, for low-rate scanning devices such as ours, moving items in the scene would appear as spurious 3D data; hence the need to segment them out. Background segmentation is a mature topic in computer vision, and is applied specially to track objects in scenarios that change illumination over time but keep the camera xed to a given reference frame. The most popular methods adapt the probability of each image pixel to be of background class using the variation of intensity values over time. Such adaptation can be tracked with the aid of a Kalman lter [14] taking into account illumination changes and cast shadows. These methods can be extended to use multimodal density functions [18, 19] in the form of Gaussian mixture models, whose parameters are updated depending on the membership degree to the background class. The classication of 3D range data fusing appearance information has been addressed in the past, again for static scene analysis. Posner et al. [11, 12, 13] propose an unsupervised


Fig. 2. Camera to laser rigid body pose estimation using a planar calibration pattern.

Fig. 4. Laser-camera pose refinement using line primitives. The green dotted lines show the image features. Red lines show reprojection prior to pose refinement, and blue lines correspond to refined reprojected estimates (best viewed in color).

Fig. 3. Our custom built 3D range sensing device and a rigidly attached color camera.

Fig. 5. Camera and laser synchronization.

III. S ENSOR SYNCHRONIZATION AND CALIBRATION method that combines 3D laser data and monocular images to classify image patches to belong to a set of 8 different object classes. The technique oversegments images based on texture and appearance properties, and assigns geometric attributes to these patches using the reprojected 3D point correspondences. Each patch is then described by a bag of words and classied using a Markov random eld to model the expected relationship between patch labels. These methods (and ours) have as a prerequisite the accurate calibration of both sensors, the laser and the camera. The computation of the rigid body transformation between 2D and 3D laser scanners and a camera are common procedures in mobile robotics and are usually solved with the aid of a calibration pattern. The techniques vary depending on the type of sensor to calibrate, and on the geometric motion constraints between the two sensor reference frames [23, 22, 7, 9]. Sensor synchronization on the other hand has received less attention. Sensor synchronization and occlusions are studied in [16] for the case of the Velodyne HDL-64 sensor. A more general method to synchronize sensors with varying latency is proposed in [8]. A. Sensor specications and data acquisition Our 3D range sensing device consists of a Hokuyo UTM30LX laser mounted on a slip-ring, with computer-controlled angular position via a DC brushless motor and a controller. For the experiments reported in this paper, laser resolution has been set to 0.5 degrees in azimuth with 360 degree omnidirectional eld of view, and 0.5 degrees resolution in elevation for a range of 270 degrees. Each point cloud contains 194,580 range measurements of up to 30 meters with noises varying from 30mm for distances closer to 10m, and up to 50mm for objects as far as 30m. The color camera used is a Pointgray Flea camera with M1214-MP optics and 40.4 degree eld of view. Fig. 3 shows a picture of the entire unit. B. Sensor Calibration We are interested in the accurate registration of laser range data with intensity images. Registration can be possible by rst calibrating the intrinsic camera parameters and then, nding the relative transformation between the camera and laser reference frames. Intrinsic camera calibration is computed using Zhangs method and a planar calibration pattern [24],


although other methods could also be used [3, 21]. Extrinsic calibration between the laser and camera is initialized by selecting correspondences of the calibration plane corners on both sensing modalities with the aid of a graphical user interface, and using Hager's method for pose estimation [4], as shown in Fig. 2. The method is subject to the resolution of the laser scanner for the selection of the four 3D to 2D corner matches in the pattern. Pose estimation is further refined by minimizing the reprojection error of line primitives. Lines in the 3D point cloud are obtained by growing and intersecting planar patches as in [10]. Their corresponding matches in the images are manually selected using the graphical user interface. Line reprojection error is computed as the weighted sum of angular and midpoint location reprojection errors as shown in Fig. 4,

ε = (θ_i − θ_p)^2 + w (m_i − m_p)^T (m_i − m_p)    (1)
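For illustration, the error of Eq. (1) for a single line correspondence can be written in a few lines of Python; the function and symbol names (theta for the line angle, m for the line midpoint, w for the weight) are our own and not taken from the authors' implementation.

import numpy as np

def line_reprojection_error(theta_i, m_i, theta_p, m_p, w=1.0):
    # Eq. (1): squared angular difference between the measured image line
    # (theta_i, m_i) and the reprojected model line (theta_p, m_p), plus the
    # weighted squared distance between the two line midpoints.
    d = np.asarray(m_i, dtype=float) - np.asarray(m_p, dtype=float)
    return (theta_i - theta_p) ** 2 + w * float(d @ d)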

session, when a pixel RGB value x falls within 2.5 standard deviations of any of the distributions in the sum (in the Mahalanobis sense), evidence in the matching distributions is stored by recursively updating their sample weight, mean, and variance with

ω_k(t+1) = (1 − α) ω_k(t) + α    (4)
μ_k(t+1) = (1 − ρ) μ_k(t) + ρ x    (5)
Σ_k(t+1) = (1 − ρ) Σ_k(t) + ρ (x − μ(t)) (x − μ(t))^T    (6)

and

ρ = α N(x | μ_k, Σ_k).    (7)

Note that after updating with Eq. 4, the weights need to be renormalized. Just as in [19], we also consider during the training session that, when a pixel value x falls outside 2.5 standard deviations of all distributions, the least probable distribution of the Gaussian sum is replaced by a new one that takes the current RGB pixel value as its mean, an initially high variance, and a low prior weight.

B. Background Class

The mixture model on each pixel encodes the distribution of colors for the full image sequence acquired per full 3D scan (about 153 images). The static portion of the data, i.e., the background, is expected to have large frequency and low variance. By ordering the Gaussians in each sum by the value ω/det Σ, the distributions with the largest probability of belonging to the background class are aggregated at the top of the list. Static items might however be multimodal in their color, for instance a flickering screen or a blinking light. As a result, we choose as background class the first B < K ordered distributions which add up to a factored weight T_B, where

B = argmin_b ( Σ_{i=1}^{b} ω_i ≥ T_B ).    (8)
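As an informal illustration of the per-pixel model of Eqs. (3)-(8), the following Python sketch maintains one mixture for a single pixel. It simplifies the formulation by using diagonal covariances, and the numeric values (K = 4, alpha, T_B, the initial variance and weight of replaced components) are assumptions chosen for the example rather than the authors' settings.

import numpy as np

class PixelMixture:
    # Per-pixel Gaussian mixture background model, in the spirit of Eqs. (3)-(8).
    def __init__(self, K=4, alpha=0.3, T_B=0.3, init_var=900.0):
        self.K, self.alpha, self.T_B, self.init_var = K, alpha, T_B, init_var
        self.w = np.ones(K) / K                  # mixture weights omega_k
        self.mu = np.random.rand(K, 3) * 255.0   # RGB means mu_k
        self.var = np.full((K, 3), init_var)     # diagonal variances (simplification)

    def _mahalanobis2(self, x, idx):
        return np.sum((x - self.mu[idx]) ** 2 / self.var[idx], axis=-1)

    def update(self, x):
        # Eqs. (4)-(7): update components matched within 2.5 standard deviations.
        x = np.asarray(x, dtype=float)
        matched = self._mahalanobis2(x, slice(None)) < 2.5 ** 2
        if matched.any():
            self.w = (1.0 - self.alpha) * self.w + self.alpha * matched
            for k in np.where(matched)[0]:
                rho = self.alpha * self._pdf(x, k)
                diff = x - self.mu[k]            # old mean, as in Eq. (6)
                self.var[k] = (1.0 - rho) * self.var[k] + rho * diff ** 2
                self.mu[k] = (1.0 - rho) * self.mu[k] + rho * x
        else:
            k = int(np.argmin(self.w))           # replace the least probable component
            self.mu[k], self.var[k], self.w[k] = x, self.init_var, 0.05
        self.w /= self.w.sum()                   # renormalize (note after Eq. 4)

    def is_background(self, x):
        # Eq. (8): check x against the first B ordered components whose
        # weights add up to the factored weight T_B.
        x = np.asarray(x, dtype=float)
        order = np.argsort(-self.w / np.sqrt(self.var.prod(axis=1)))
        B = int(np.searchsorted(np.cumsum(self.w[order]), self.T_B)) + 1
        return bool((self._mahalanobis2(x, order[:B]) < 2.5 ** 2).any())

    def _pdf(self, x, k):
        q = np.sum((x - self.mu[k]) ** 2 / self.var[k])
        norm = np.sqrt((2.0 * np.pi) ** 3 * self.var[k].prod())
        return float(np.exp(-0.5 * q) / norm)

In the paper each component carries a full covariance matrix and the images of a complete scan (about 153 frames) serve as the training sequence; the diagonal form above merely keeps the sketch short.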

where the subscript i corresponds to measured image features, and the subscript p indicates projected model features. The weight w is a free tuning parameter to account for the difference between angular and Cartesian coordinates. C. Synchronization At 0.5 degree resolution, our 3D scanner takes about 9 seconds to complete a scan, which is made of a 180 degree turn of the sensor. Camera frame rate is set to 17 fps, thus we have roughly 153 images per full 3D image. The timestamps of consecutive laser slices, t_slice_i, and grabbed images, t_frame_j, are compared and set to lie within a reasonable threshold T_s in milliseconds.

|t_slice_i − t_frame_j| ≤ T_s    (2)
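The matching rule of Eq. (2) amounts to a simple timestamp comparison. The following sketch is one possible way to implement it; the function and variable names are ours and not taken from the system described here.

def match_slices_to_frames(slice_stamps, frame_stamps, Ts):
    # Assign each laser slice to all image frames whose timestamps lie
    # within Ts of the slice timestamp (Eq. 2).
    matches = {}
    for i, ts in enumerate(slice_stamps):
        matches[i] = [j for j, tf in enumerate(frame_stamps) if abs(ts - tf) <= Ts]
    return matches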

With T_s = 1/17, each laser slice is uniquely assigned to its corresponding image frame, roughly two to three slices per image. Increasing this threshold allows matching each laser slice to more than one image at a time (see Fig. 5).

IV. BACKGROUND SUBTRACTION

Once we have time correspondences between 3D laser slices and image frames, we can use background segmentation results on the image sequence to classify the corresponding 3D points in each time slice as belonging to a dynamic or a static object. The method we implemented is based on [19].

A. Mixture Model

For each pixel in the image, the probability of its RGB coordinates x to be of the background class is modeled as a mixture of K Gaussian distributions,

p(x) = Σ_{k=1}^{K} ω_k N(x | μ_k, Σ_k)    (3)


C. Point classification

Each point on each scan slice is reprojected to its matching image frames according to Eq. 2. Ideally, for tight bounds on T_s, only one image will be assigned to each scan slice. Robustness to noise is possible however, if this bound is relaxed and we allow for larger values of T_s, so that more than one image can be matched to the same scan slice. We call this set of images I. Thus, for each point in a slice, the corresponding pixel values x from the whole set I are visited and checked for inclusion in the set B of distributions. Class assignment is made if x belongs to B for all the images in the set I.

V. EXPERIMENTS

Results are shown for a series of indoor sequences with moderate dynamic content. For background segmentation, the multimodal distribution is set to contain 4 Gaussians, the learning rate is set at α = 0.3, and the background class is set to one third of the frequency in the distributions, i.e.,


with ω_k the weight of the k-th Gaussian, and K a user-selected number of distributions. This classification scheme assumes that the RGB values of neighboring pixels are independent. During the training


(a) Ts = 1/fps.

(b) Ts = 0.5sec

(c) Foreground segmentation

(d) Segmented dynamic object Fig. 6. Segmentation results for a sequence with one moving person and varying values of the synchronization threshold.

T_B = 0.3. The synchronization threshold T_s is varied from the minimal 1/17 to a more conservative value of 0.5 seconds. The first analyzed sequence corresponds to a single person moving in front of the laser and camera. Frames (a) and (b) in Figure 6 show final results of point classification for different values of T_s; frame (c) shows the image pixel classification results; and frame (d) shows the 3D reconstruction of both the segmented dynamic object and the entire 3D scene. The second sequence contains a more challenging scenario with three people with slow random walking trajectories. Given the slow motion rate of the people, laser range readings hitting on them are difficult to categorize as being dynamic. The background segmentation algorithm proposed in this paper helps to alleviate this issue. Figure 7 shows results of background segmentation in this new sequence for varying values of the synchronization parameter. Setting this parameter slightly above the camera acquisition rate accounts for synchronization errors and produces better segmentation results. Frames (a-c) in the image show the segmentation results for T_s = 1/fps, whereas frames (d-f) show segmentation results for T_s = 0.5 sec. Figure 8 shows 3D reconstruction results of the segmented data and of the full 3D scene. The results shown are for a synchronization threshold of 0.5 sec. We appreciate the suggestion during the peer review phase

of this work to compare our method with other approaches. Unfortunately, as far as we know, the system presented is unique, and there are no other methods in the literature that take low-rate 3d scans and remove dynamic content from them using high-rate imagery. To validate the approach, we can report however an empirical comparison with ground truth image difference. Assuming a clean background scan is available (without people), image difference to a full dynamic cloud was computed with the Point Cloud Library [15] using a distance threshold of 3mm. Fig. 9 shows results of such image difference computation. The results of our method are visually comparable to such ground truth experiment. VI. C ONCLUSIONS We present a method to segment low-rate 3D range data as static or dynamic using multimodal classication. The technique classies fast-rate image data from an accessory camera as background/foreground adapting at frame rate a per-pixel Gaussian mixture distribution. The results of image classication are used to tag reprojected laser data. Special attention is paid to the synchronization and metric calibration of the two sensing devices. Sensor synchronization is of paramount importance as it allows to match high frame rate imagery with their corresponding low rate laser scans. The method is tested for indoor sequences with moderate dynamics.
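The ground-truth check described above (point cloud difference against a clean background scan with a 3 mm distance threshold) can be approximated in a few lines. The sketch below uses a KD-tree from SciPy rather than the PCL routine actually employed by the authors, so it only illustrates the idea; the names and the 3 mm value mirror the text.

import numpy as np
from scipy.spatial import cKDTree

def cloud_difference(dynamic_cloud, background_cloud, threshold=0.003):
    # Keep the points of the dynamic scan that lie farther than `threshold`
    # metres from every point of the clean background scan.
    dist, _ = cKDTree(background_cloud).query(dynamic_cloud, k=1)
    return dynamic_cloud[dist > threshold]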


(a)

(b)

(c)

(d)

(e)

(f)

Fig. 7. Segmentation results for a sequence with three people moving randomly and varying values of the synchronization threshold. Frames (a-c) show three sequence instances segmented at Ts = 1/fps. Frames (d-f) show the same sequence instances segmented at Ts = 0.5sec.

Fig. 8. Segmentation results for a sequence with three slowly moving people with random walking trajectories.

The proposed method was designed to remove spurious data or dynamic objects from low acquisition rate lidar sensors. The result is a cleaner 3d picture of static data points. These point clouds could then be aggregated into larger datasets with the guarantee that dynamic data and noise will not jeopardize point cloud registration. The intended application of the technique is robotic 3d mapping.

VII. ACKNOWLEDGMENTS This work has been partially supported by the Mexican Council of Science and Technology with a PhD Scholarship to A. Ortega, by the Spanish Ministry of Science and Innovation under projects PAU (DPI2008-06022) and MIPRCV Consolider Ingenio (CSD2007-018), and by the CEEDS (FP7-ICT2009-5-95682) and INTELLACT (FP7-ICT2009-6-269959) projects of the EU. The authors thank M. Morta for the development of the 3D laser used for this research.


Fig. 9. Result of applying point cloud difference using PCL.

R EFERENCES
[1] K.O. Arras, O.M. Mozos, and W. Burgard. Using boosted features for the detection of people in 2d range data. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 3402 3407, Rome, April 2007. [2] F. Endres, C. Plagemann, C. Stachniss, and W. Burgard. Unsupervised discovery of object classes from range data using latent Dirichlet allocation. In Robotics: Science and Systems V, Seattle, USA, June 2009. [3] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2nd edition, 2004. [4] C.P. Lu, G.D. Hager, and E. Mjolsness. Fast and globally convergent pose estimation from video images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:610622, 2000. [5] F. Maurelli, D. Droeschel, T. Wisspeintner, S. May, and H. Surmann. A 3D laser scanner system for autonomous vehicle navigation. In Proceedings of the 14th International Conference on Advanced Robotics, Munich, June 2009. [6] F. Moosmann, O. Pink, and C. Stiller. Segmentation of 3D lidar data in non-at urban environments using a local convexity criterion. In IEEE Intelligent Vehicles Symposium, pages 215220, 2009. [7] P. N nez, P. Drews Jr, R. Rocha, and J. Dias. Data fusion calibration for u a 3D laser range nder and a camera using inertial data. In Proceedings of the European Conference on Mobile Robotics, Dubrovnik, September 2009. [8] E. Olson. A passive solution to the sensor synchronization problem. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 10591064, Taipei, October 2010. [9] A. Ortega, B. Dias, E. Teniente, A. Bernardino, J. Gaspar, and Juan Andrade-Cetto. Calibrating an outdoor distributed camera network using laser range nder data. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 303308, Saint Louis, October 2009. [10] A. Ortega, I. Haddad, and J. Andrade-Cetto. Graph-based segmentation of range data with applications to 3d urban mapping. In Proceedings of the European Conference on Mobile Robotics, pages 193198, Dubrovnik, September 2009. [11] I. Posner, M. Cummins, and P. Newman. Fast probabilistic labeling of city maps. In Robotics: Science and Systems IV, Zurich, June 2008. [12] I. Posner, D. Schroeter, and P. Newman. Describing composite urban workspaces. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 49624968, Rome, April 2007. [13] I. Posner, D. Schroeter, and P. Newman. Online generation of scene descriptions in urban environments. Robotics and Autonomous Systems, 56(11):901914, 2008. [14] C. Ridder, O. Munkelt, and H. Kirchner. Adaptive background estimation and foreground detection using kalman-ltering. In Proceedings of [15] [16]

[17]

[18]

[19] [20]

[21] [22] [23]

[24]

the IASTED International Conference on Robotics and Manufacturing, pages 193199, Istanbul, August 1995. R.B. Rusu and S. Cousins. 3D is here: Point Cloud Library (PCL). In Proceedings of the IEEE International Conference on Robotics and Automation, Shanghai, May 2011. S. Schneider, M. Himmelsbach, T. Luettel, and H.J. Wuensche. Fusing vision and lidar - synchronization, correction and occlusion reasoning. In Proceedings of the IEEE Intelligent Vehicles Symposium, pages 388 393, San Diego, June 2010. I. Stamos and P.K. Allen. 3-d model construction using range and image data. In Proceedings of the 14th IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 1531, Hilton Head, SC, June 2000. C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Proceedings of the 13th IEEE Conference on Computer Vision and Pattern Recognition, pages 246252, Fort Collins, June 1999. C. Stauffer and W. Grimson. Learning patterns of activity using realtime tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747757, 2000. H. Surmann, A. Nuchter, and J. Hertzberg. An autonomous mobile robot with a 3D laser range nder for 3D exploration and digitalization of indoor environments. Robotics and Autonomous Systems, 45(3-4):181 198, 2003. R. Tsai. A versatile camera calibration technique for high accuracy 3D machine vision metrology using off-the-shelf TV cameras. IEEE Journal of Robotics and Automation, 3(4):323344, August 1987. R. Unnikrishnan and M. Hebert. Fast extrinsic calibration of a laser rangender to a camera. Technical Report CMU-RI-TR-05-09, Robotics Institute, Pittsburgh, July 2005. Q. Zhang and R. Pless. Extrinsic calibration of a camera and laser range nder (improves camera calibration). In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 23012306, Sendei, September 2004. Z. Zhang. A exible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330 1334, 2000.


Monocular Detection and Estimation of Moving Obstacles for Robot Navigation


Erik Einhorn  Markus Filzhuth  Christof Schröter  Horst-Michael Gross
Neuroinformatics and Cognitive Robotics Lab, Ilmenau University of Technology, Germany

Abstract The detection of motion and moving objects or persons with stationary monocular cameras has been extensively studied. However, those techniques fail if the camera is moving itself. In this paper, we present a method for detecting and estimating the position of moving objects using a monocular camera that is mounted in front of a mobile robot platform. The position estimates are used for obstacle avoidance and robot navigation. We apply image warping to compensate the egomotion of the camera. This allows us to use standard techniques for motion detection. The nal position and velocity estimates are obtained using Extended Kalman Filters. Combined with a monocular scene reconstruction our approach allows the robust detection and avoidance of both static and moving obstacles by using a single monocular camera as the only sensor. Index Terms visual obstacle detection, motion detection, monocular scene reconstruction

I. I NTRODUCTION With steadily increasing processing power and newly evolving hardware, approaches for visual navigation gain more and more importance in mobile robotics. For obstacle detection vision based sensors can obtain a larger amount of information about the structure of the local surroundings compared to traditional range-measuring sensors like laser range nders or sonar sensors that operate in a single plane only. There is a large variety of vision sensors that are suitable for obstacle detection. Time-of-ight cameras are able to measure the distance between camera and obstacles directly by emitting short light pulses. Due to their still high costs, these cameras may be suitable for robot prototypes but are no option for our purposes. Microsofts Kinect depth camera uses structured light patterns to obtain the depth of objects in the scene. Due to the low costs and the precise depth measurements, this technique will surely have a huge impact in mobile robotics and robot navigation. However, in several experiments we came across a few disadvantages of that camera. Even for smaller robots, its eld of view is too narrow to cover the whole area in front of the robot. Moreover, the camera can be used indoors only. In outdoor environments, the emitted infrared light pattern is outshone by the sunlight and cannot be detected by the camera. Beside these depth cameras and stereo cameras, monocular approaches are an adequate alternative for obstacle detection. The majority of such approaches use feature-based techniques that reconstruct the depth or the entire 3D position of each feature. In our previous works, we have developed such an approach for visual obstacle detection that utilizes images captured by a single monocular camera mounted in front of a mobile robot [7]. Our Structure-from-Motion approach

employs Extended Kalman Filters (EKF) to reconstruct the 3D position of the image features in real-time in order to identify potential obstacles in the reconstructed scene. We have shown that this method can signicantly improve the reliability of obstacle detection when combined with laser range nders [6]. Similar to other related approaches [2, 5] our previous approach assumes that the scene is static. Moving objects could not be estimated and were ignored. However, since our mobile robots operate at home and in public environments [8] that are populated by walking people and other dynamic objects, a proper handling of those obstacles is required. In this paper, we present an extension of our previous approach that is now able to detect moving objects and allows the estimation of their position and velocity. The presented method is the rst approach for monocular obstacle detection that is able to detect both static and dynamic obstacles. II. R ELATED W ORK For detecting moving objects or persons with stationary cameras a large number of approaches exists that use difference images, motion history images [3] or background subtraction. The detection of moving objects using images of a monocular camera mounted in front of a mobile robot is a more difcult problem since the ego-motion of the robot induces an inherent optical ow in the images that must be distinguished from the ow of the moving objects. In [11] a feature-based method is presented that uses image correspondences over two and three frames. Different constraints are applied to differentiate between features located on static objects and those located on moving objects that violate the constraints. In a comprehensive analysis Klappstein et al. [11] show, that objects moving in parallel to the camera with a lower velocity can hardly be detected. Moreover, detecting objects moving anti-parallelly towards the camera is only possible if an additional heuristic is used [10]. A problem that is related to the problem of detecting moving objects in a monocular image sequence is motion segmentation, i.e. the segmentation of the image into regions of pixels moving coherently across the image sequence. In [13] a feature-based approach for real-time motion segmentation is presented that uses an expectation-maximization algorithm to group neighboring features with similar trajectories. However, the problem of detecting objects moving parallel to the camera remains. Those objects would most likely be assigned to the segments of background motion.


Fig. 1. The architecture of our approach. The upper part shows the scene reconstruction for static parts of the environment. In the captured image sequence, features are tracked, and their 3D positions in the scene are estimated using EKFs. The resulting features with their 3D positions are used for creating a volumetric 3D occupancy map of the static scene. The tracked features and their estimated 3D position are also used for the detection and estimation of dynamic objects in the scene. As shown in the lower part, the features are used for ego-motion compensation in the input image sequence. Afterwards, the dynamic objects are detected, segmented, and tracked to provide information that is used to recover their positions and velocities.

Beside the detection of dynamic objects, the estimation of their 3D positions and trajectories is even more difcult. In addition to an overall scale ambiguity, further difculties arise due to the ego-motion of the camera making it impossible to estimate the object positions and trajectories if no additional assumption about the object movements are made. There are approaches for Nonrigid Structure-from-Motion [15, 12, 4, 1, 16] that tackle these difculties by using shape priors [4] or shape bases [16] to constrain the problem. Other authors [12, 1] apply an orthogonal approach and try to solve the problem in the trajectory space instead of the shape space. In those feature-based methods the position and movement of each point is represented by a linear combination of basis trajectories, e.g. the discrete cosine transform basis. The reconstruction is then performed by estimating the coefcients of such a combination of basis trajectories. The approach presented in [12] is most related to our problem. However, the authors analyze the reconstructibility of the point trajectories and come to the conclusion that the reconstruction of the points and their trajectories is poor if the trajectories are correlated to the cameras movement and if the camera is moving slowly and smoothly. Since the movement of our camera mounted in front of the robot performs this kind of movement this method is not applicable in our case. III. OVERVIEW We use a single calibrated camera that is mounted in front of the robot. During the robots locomotion, the camera is capturing a sequence of images (. . . , It1 , It , It+1 , . . .) that are rectied immediately according to the intrinsic camera parameters in order to correct the lens distortions. Using the robots odometry data, the corresponding camera position, expressed by its projection matrix Pt = KRt [I | ct ], can be computed for each image It , containing the orientation Rt , the position ct , and the intrinsic calibration matrix K of the camera (see [9] for details). Both the cameras position and its orientation are expressed with respect to a global reference coordinate frame. The complete architecture of the software system that processes this input data is shown in Figure 1. The monocular Structure-from-Motion approach [7] for reconstructing the 3D shape of the static scene is shown in the upper part of that gure. In each captured and preprocessed image (frame), distinctive image features are selected using the Shi-Tomasi corner detector [14]. These features are tracked over the acquired image sequence while their 3D positions are estimated

using EKFs. Similar to the cameras pose, these 3D positions of the features are computed with respect to the same global reference coordinate frame [7]. For static obstacle detection, we perform this monocular scene reconstruction for 200-300 salient features of the scene simultaneously. Before the reconstructed features are used to build a map of the environment, they have to undergo some post-processing where unreliable estimates with large covariance and uncertainty are removed. The lower part of Fig. 1 shows the data ow for the detection of dynamic obstacles, which is the main scope of this paper. The tracked features processed by the above static scene reconstruction are used to perform an ego-motion compensation: the input images are warped in order to eliminate the effect of the robots movement in the images. After this step, the images can be treated as if they were captured by a static camera. Hence, standard operations such as the computation of difference images can be applied to detect and segment the moving obstacles within the images. Such objects are then tracked and their positions in the images are used to estimate their positions and velocities in the scene. These processing steps are described in more detail in the following two sections. IV. E GO -M OTION C OMPENSATION One major problem for detecting moving objects is the inherent optical ow that is induced by the movement of the robot or camera. We try to eliminate this effect in the images by virtually correcting the viewpoint of the images. We will see, that the ego-motion compensation nally allows us to apply methods known from motion detection with static cameras. The ego-motion compensation is done by image warping. We will show that this image warping can be easily computed using the image features that are tracked by the approach for static scene reconstruction. However, the egomotion compensation is successful only if features on static scene objects are used. For simplicity we call these features static features, while features on moving objects are called dynamic features in the following. The usage of dynamic features for the ego-motion compensation leads to inferior results and would disallow the detection of moving objects in later processing steps of our approach. A. Feature Filtering In the next paragraphs, we will describe several constraints and criteria that allow us to classify the features that are


extracted and tracked during the static scene reconstruction into static features and dynamic ones. As one constraint one could use the epipolar constraint. For a static feature at position xt in frame It its corresponding image point xt in frame It will be located along the epipolar line that can be described by l = Ft , where xt denotes x the homogeneous coordinate of xt and F is the fundamental matrix that describes the geometry between the two involved camera poses (see [9] for details). A dynamic feature in comparison will have a larger distance from that epipolar line. Though, instead of the epipolar constraint we use a different constraint for long-term stability that implicitly contains the epipolar constraint but uses more than two frames to classify the feature. As stated in section III, the 3D positions of the tracked features are estimated iteratively using EKFs. Beside the 3D positions, the EKFs additionally compute the covariance of the position estimates. During the scene reconstruction the algorithm assumes that the scene is static. Hence, dynamic features are estimated with large covariances since they violate the assumptions of the estimation algorithm. This is where the epipolar constraint is implicitly taken into account. The dynamic features can then be ltered out by ignoring all features whose covariance is above a certain threshold. Moreover, the reconstruction algorithm uses the feature estimates to guide the feature tracking process by predicting the position of the corresponding image points in the next frame. Dynamic features that move quickly and perpendicular to their epipolar line, will be too far away from the predicted image location, and the approach for static scene reconstruction will lose track of them. Hence, those dynamic features are ltered out at a very early stage already during the static scene reconstruction. Unfortunately, there are still dynamic features that pass through the stage of static scene reconstruction and that have 3D position estimates with a low variance. This happens for features located on objects that move in parallel to the camera. Those features primarily do not violate the assumption of a static scene. Instead the estimation algorithm is able to nd static and stable 3D position estimates that correspond to the observed 2D feature locations in the image sequence. Of course, those position estimates are not correct for dynamic features. The error depends on the movement of the object. Dynamic features on objects moving faster than the camera and hence away from it are reconstructed behind the camera with a negative depth. For features on objects that are moving in the same direction as the camera but with a lower speed the estimated depth is too large. Since our camera is tilted towards the ground, those features are reconstructed below the ground plane. Features that move with the same speed as the camera are reconstructed at innity and hence also below the ground plane. We apply simple plausibility tests and classify those features with a negative depth and a negative height as dynamic features in order to lter them out. It should be noted, that the detection of dynamic features is done only to exclude them from the ego-motion compensation. They are not used to detect moving obstacles at this point. The above criteria are able to reveal a large number of features on moving objects. However, features on objects that

are moving towards the camera cannot be detected as dynamic features. They are reconstructed as static obstacles closer to the camera than their real distance. In fact, they are reconstructed near the position where a potential collision with the moving object would occur. Hence, for obstacle detection and avoidance, adding those features into the static obstacle map results in an acceptable behavior of the robot. Nevertheless, our processing step for detecting moving objects in the warped images is also able to detect most of these moving objects. This will be described in section V.

B. Image Warping

The static features that were classified by the above criteria can now be used for the actual ego-motion compensation, which is done by image warping. Image warping allows us to morph any image I_t taken at camera position P_t to the perspective P_t' of any other frame I_t'. The warp W depends on the two positions of the camera and hence - in our case of a single continuously moving camera - on the two time stamps t and t':

I_{t->t'} = W_{t->t'}(I_t)    (1)

The warped image Itt corresponds to It being taken from the same camera position where It was captured from. Moreover, if the warp was perfect and the scene was static, the warped image Itt would be equal or similar to the image It : It Itt (2) The warp is approximated as piecewise afne transform of a triangle mesh. As vertexes of the mesh we use the static features F = f (1) , . . . , f (n) that are tracked between (i) the two frames It and It . Let xt denote the 2D image (i) position of the feature f (i) within image It , while xt denotes its image position within image It . The triangle mesh is generated by computing a Delaunay triangulation of the 2D (1) (n) points xt , . . . , xt within the image It as shown in Fig. 2. The mesh is then used to warp each image point in It . (i) The image features xt are warped according to their tracked image position in the image It : xt = Wtt (xt ). For all the other pixels in between, their position is bilinearly interpolated within the triangle that is spanned by the surrounding three image features according to the Delaunay triangulation. When generating the mesh, one must take into account that the positions of the features and hence the vertices of the mesh move in image It . This can result in triangles that fold over, i.e. their orientation ips and their back-faces become visible as shown in Fig. 3. Such triangles result in inferior results of the warp. Hence, we identify the vertices that cause the foldover in order to remove them from the mesh. Afterwards, the mesh is triangulated again. The described warp can be efciently computed on contemporary graphics hardware by using the image It as a texture for rendering the triangle mesh. The features positions in image


Fig. 2. Generated triangulation for both images I_t and I_t'. While the same mesh topology is used in both images, the positions of the vertices are different in image I_t' since the features have moved due to the ego-motion of the camera.

Fig. 3. left: The two yellow triangles fold over in image I_t' since two of the features that are used for the triangulation moved significantly differently. The vertices/features that caused the fold-over are marked with red circles. right: The two vertices were removed from the mesh to resolve the fold-over.
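The per-triangle warp and the fold-over test can be sketched as follows. This is a minimal illustration assuming OpenCV, with the matched feature positions and the triangle index list (e.g. obtained from cv::Subdiv2D) given as inputs; it simply skips flipped triangles instead of re-triangulating as described above, and all names are illustrative.

```cpp
// Sketch: piecewise-affine warp of image I_t into the view of I_t' using a
// triangle mesh over matched static features.
#include <opencv2/imgproc.hpp>
#include <vector>

static double signedArea(const cv::Point2f& a, const cv::Point2f& b, const cv::Point2f& c) {
    return 0.5 * ((b.x - a.x) * (c.y - a.y) - (c.x - a.x) * (b.y - a.y));
}

cv::Mat warpByMesh(const cv::Mat& imgT,                      // image I_t (texture source)
                   const std::vector<cv::Point2f>& xT,       // feature positions in I_t
                   const std::vector<cv::Point2f>& xTp,      // tracked positions in I_t'
                   const std::vector<cv::Vec3i>& triangles)  // 3 vertex indices per triangle
{
    cv::Mat warped = cv::Mat::zeros(imgT.size(), imgT.type());
    for (const cv::Vec3i& tri : triangles) {
        cv::Point2f src[3] = { xT[tri[0]],  xT[tri[1]],  xT[tri[2]]  };
        cv::Point2f dst[3] = { xTp[tri[0]], xTp[tri[1]], xTp[tri[2]] };

        // Fold-over check (cf. Fig. 3): if the triangle orientation flips between
        // the two images, its back-face would become visible; skip it here.
        if (signedArea(src[0], src[1], src[2]) * signedArea(dst[0], dst[1], dst[2]) <= 0.0)
            continue;

        // Affine transform mapping the source triangle onto the destination triangle.
        cv::Mat A = cv::getAffineTransform(src, dst);
        cv::Mat warpedFull;
        cv::warpAffine(imgT, warpedFull, A, imgT.size(), cv::INTER_LINEAR);

        // Copy only the pixels inside the destination triangle.
        cv::Mat mask = cv::Mat::zeros(imgT.size(), CV_8UC1);
        cv::Point pts[3] = { dst[0], dst[1], dst[2] };
        cv::fillConvexPoly(mask, pts, 3, cv::Scalar(255));
        warpedFull.copyTo(warped, mask);
    }
    return warped;
}
```

On graphics hardware the same effect is obtained by rendering the mesh with I_t as a texture, as described in the text; the CPU sketch above only serves to make the geometry of the warp explicit.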

The upper row in Fig. 4 shows the warping for an image sequence consisting of 5 frames. The warping was computed using the triangle mesh that was built from the features tracked by the approach for static scene reconstruction as described above. Dynamic features were removed by the aforementioned criteria. The two leftmost and the two rightmost images are warped to the perspective of the middle image. While the images were captured, the robot and the blue ball moved forward. The blue ball was slightly faster than the robot. Due to the ego-motion compensation, the static scene remains still in the images although the robot moves forward. Additionally, it is apparent that the blue ball moves forward. V. DETECTION AND ESTIMATION OF MOVING OBJECTS Using the described ego-motion compensation, we can implement the actual detection of moving objects using methods known from motion detection with static cameras. For performance reasons, we use difference images, which can be computed efficiently. According to Eq. 2 the difference image of two input images is close to zero if one image is warped into the perspective of the other image and if the scene is static: I_{t→t'} − I_t' ≈ 0. Moving objects on the other hand will produce large differences and, therefore, can be detected in the difference images. To reduce the influence of image noise and to increase the signal-to-noise ratio, we do not only use two images for generating the difference image. Instead, we use 5 consecutive images from the captured image sequence. Let I_t denote the reference image where the motion should be detected. Then we use two images I_{t−2}, I_{t−1} that were captured before and two images I_{t+1}, I_{t+2} that were captured after the reference image. All images (except the reference image) are warped into the perspective of the reference image

I_t. Afterwards, the difference images between the four warped images and the reference image I_t are computed as shown in the second row of Fig. 4. Moreover, the difference images are binarized using a threshold that was chosen experimentally. As shown in Fig. 4 the binarized difference images are then combined in pairs using an AND operation, i.e. in the resulting binary image a pixel is set only if the pixel was set in both binarized difference images. The resulting images are combined using an OR operation. Finally, remaining noise is removed by applying an opening and a closing morphological operator. This procedure has yielded the best results in our experiments. It reduces the noise in the difference images and preserves the outline of the moving object even if the size of the object increases or decreases when the object's distance changes. As seen in the above figure, some smaller image regions, especially along occlusion boundaries, were also classified as moving objects although they belong to the static environment. These regions would also be treated as moving objects in the next processing steps of our approach. However, their position and velocity estimates would finally reveal that those objects are in fact static. In the final processed binary difference image, bounding boxes are generated around each connected image segment. These bounding boxes are used as features of the extracted moving objects. They provide the approximate position and size of each object within the image and are used for tracking the moving objects within the image sequence. At the moment, we use a single hypothesis tracker, i.e. we assume that there is only one moving object within the field of view of the camera. Currently, the bounding box with the largest area is chosen as the hypothesis of the single moving object. The other bounding boxes are ignored. The position and velocity estimation is done using an EKF. As stated in the first sections, it is not possible to reconstruct the position of a moving object from a moving camera without additional constraints. We therefore assume that all moving objects touch the ground and hence the lower border of the extracted bounding box is located on the ground plane in the scene. Consequently, the state of the moving object can be modeled by y = (p, v)^T, where p ∈ R^2 represents the position of the object on the ground plane in world coordinates and v ∈ R^2 denotes the velocity of the object along the ground plane. The resulting state transition function is given as y_k = A y_{k−1}, with:

A = [ 1  0  Δt  0
      0  1  0   Δt
      0  0  1   0
      0  0  0   1  ]

As measurement for the EKF update, the midpoint of the bounding box's bottom edge in the image is used. During the update, the image position of the midpoint is projected onto the ground plane. This projection can be described by a homography H = P_t G, where P_t is again the projection matrix of the camera and

G = [ e1  e2  c
      0   0   1 ]

describes the orientation and location of the ground plane that is spanned by the two vectors e1, e2 and goes through the point c.
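A minimal sketch of this constant-velocity filter is given below, assuming Eigen. The ground-plane projection of the bounding-box midpoint (the homography H = P_t G above) is assumed to be applied beforehand, which makes the update step linear; in the paper the projection is part of the EKF update itself. All names and the noise matrices are illustrative.

```cpp
// Constant-velocity tracker for the moving object on the ground plane.
// State y = (p, v): p = ground-plane position, v = ground-plane velocity.
#include <Eigen/Dense>

struct GroundPlaneTracker {
    Eigen::Vector4d y;      // state: [px, py, vx, vy]
    Eigen::Matrix4d P;      // state covariance
    Eigen::Matrix4d Q;      // process noise
    Eigen::Matrix2d Rm;     // measurement noise

    void predict(double dt) {
        Eigen::Matrix4d A = Eigen::Matrix4d::Identity();
        A(0, 2) = dt;       // p_k = p_{k-1} + dt * v_{k-1}
        A(1, 3) = dt;
        y = A * y;
        P = A * P * A.transpose() + Q;
    }

    // z: measured object position on the ground plane (world coordinates),
    // obtained by projecting the bounding-box bottom midpoint with H = Pt*G.
    void update(const Eigen::Vector2d& z) {
        Eigen::Matrix<double, 2, 4> H = Eigen::Matrix<double, 2, 4>::Zero();
        H(0, 0) = 1.0;      // the measurement observes the position part of the state
        H(1, 1) = 1.0;
        Eigen::Vector2d innovation = z - H * y;
        Eigen::Matrix2d S = H * P * H.transpose() + Rm;
        Eigen::Matrix<double, 4, 2> K = P * H.transpose() * S.inverse();
        y += K * innovation;
        P = (Eigen::Matrix4d::Identity() - K * H) * P;
    }
};
```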


Fig. 4. Processing chain for the detection of moving objects. first row: Input sequence consisting of 5 frames. second row: Two images before and two images after the reference frame I_t are warped into the perspective of the reference frame. third row: Binarized difference images between the warped images and the reference image. fourth row: Difference images are combined in pairs using an AND operation. fifth row: Resulting images are combined using an OR operation. sixth row: Final detection of moving objects after opening and closing operations with extracted bounding boxes.
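The combination scheme of the third to sixth rows can be sketched as follows, assuming OpenCV and grayscale inputs. The threshold value and the 5x5 structuring element are illustrative placeholders for the experimentally chosen parameters mentioned in the text.

```cpp
// Sketch of the motion mask computation from Fig. 4.
#include <opencv2/imgproc.hpp>
#include <array>

cv::Mat motionMask(const cv::Mat& refGray,                       // reference image I_t
                   const std::array<cv::Mat, 4>& warpedGray,     // the four warped neighbours
                   double threshold)
{
    std::array<cv::Mat, 4> bin;
    for (int i = 0; i < 4; ++i) {
        cv::Mat diff;
        cv::absdiff(warpedGray[i], refGray, diff);                       // difference image
        cv::threshold(diff, bin[i], threshold, 255, cv::THRESH_BINARY);  // binarization
    }
    // Combine in pairs with AND (suppresses uncorrelated noise), then with OR
    // (keeps motion visible both before and after the reference frame).
    cv::Mat pastAnd, futureAnd, combined;
    cv::bitwise_and(bin[0], bin[1], pastAnd);
    cv::bitwise_and(bin[2], bin[3], futureAnd);
    cv::bitwise_or(pastAnd, futureAnd, combined);

    // Morphological opening and closing remove remaining speckle noise while
    // preserving the outline of the moving object.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(combined, combined, cv::MORPH_OPEN, kernel);
    cv::morphologyEx(combined, combined, cv::MORPH_CLOSE, kernel);
    return combined;
}
```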

Using the above stated transition and projection functions, the usual EKF prediction and update steps are iteratively processed for each new image and hence for each extracted bounding box to continuously estimate the position and velocity of the moving object. The final position estimate is used for navigation and obstacle avoidance in addition to the reconstructed features of the static scene. VI. RESULTS We have tested our approach with numerous data sets of real data that were recorded while the robot was moving through an indoor environment. As moving obstacles we used an untextured ball as shown in Fig. 4 and a person who walked in front of the robot as shown in Fig. 5 and Fig. 6. We tested different kinds of movements relative to the robot. The detection and estimation of obstacles that moved perpendicular to the robot's trajectory was easily managed by our approach (Fig. 5). More difficult is the detection of objects that move parallel to the robot. Nevertheless, those obstacles were successfully detected and estimated correctly by the proposed approach. An example is shown in Fig. 6, where a person is walking in front of the robot in the same direction with a slightly higher velocity. In the left column of Fig. 6 the frames 10, 20 and 30 of the captured image sequence taken by the robot's front camera are shown. Image regions where motion was detected using our proposed method are labeled with red color. Besides some smaller artifacts near occlusion boundaries, the walking person was correctly detected. The

blue circle denotes the estimated position of the person on the ground plane reconstructed using an EKF as described in the previous section. The green circles indicate the reconstructed positions of the person within the next ten frames that are not shown as separate images. The sizes of the circles indicate the uncertainties of the estimates. In the right column a bird's-eye view of the scene is shown, where the reconstructed static features are additionally visualized as black dots. Together with the estimated position of the moving object, such a 2D map can be used for obstacle avoidance and navigation. As ground truth, the range measurements of a laser range finder are indicated in all images by a thin black line. In the images of the front camera, the dashed line shows a projection of the range measurements at the height where the laser is mounted, while the solid line shows a projection of the range measurements onto the ground plane in order to allow for a better comparison with the position estimates computed by our approach, which are located on the ground plane, too.

Fig. 5. Detection of a moving person that is walking from left to right while the robot is moving forward. Image regions, where motion was detected, are highlighted in red. Our approach correctly detected the motion of the leg that is moving forward, while the leg that stands on the floor is classified as static.


Fig. 6. left column: Three frames of the input sequence: image regions where motion was detected are highlighted in red. The position estimate of the moving object is indicated by the blue circle. The green circles show the reconstructed positions for 10 further frames where no image is shown. For comparison, measurements of a laser range finder are shown as a black line. right column: Bird's-eye view of the scene. The reconstructed features of the static scene are additionally shown as black dots.

public environments, like tour guides and shopping assistants [8]. Those environments are populated by many persons and other moving obstacles (e.g. shopping carts). In several experiments, we have tested the robustness of our approach. Even in the difficult case of an object that is moving parallel to the camera, our technique is able to detect the motion and correctly estimates the position of that object. One example of this kind of motion is given in the paper (Fig. 6). At the moment our algorithm uses a single hypothesis tracker only. Hence, it assumes that there is only one moving object in the camera's field of view. However, it can easily be extended to a multi-hypothesis tracking algorithm that can handle several moving objects. The presented method for ego-motion compensation using image warping allows the use of difference images for motion detection. In other applications, the same ego-motion compensation technique could be used to make image processing algorithms that were primarily developed for static cameras usable for moving cameras as well. REFERENCES
[1] Ijaz Akhter, Yaser Ajmal Sheikh, Sohaib Khan, and Takeo Kanade. Nonrigid Structure from Motion in Trajectory Space. In Neural Information Processing Systems, pages 41-48, 2008.
[2] J. Civera, A. J. Davison, and J. Montiel. Inverse Depth Parametrization for Monocular SLAM. IEEE Trans. on Robotics, pages 932-945, 2008.
[3] J. W. Davis. Hierarchical Motion History Images for Recognizing Human Motion. In Proceedings IEEE Workshop on Detection and Recognition of Events in Video, pages 39-46. IEEE Comput. Soc.
[4] A. Del Bue. A Factorization Approach to Structure from Motion with Shape Priors. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1-8, 2008.
[5] E. Eade and T. Drummond. Unified Loop Closing and Recovery for Real Time Monocular SLAM. In Proc. of the BMVC, 2008.
[6] E. Einhorn, Ch. Schröter, and H.-M. Gross. Monocular Scene Reconstruction for Reliable Obstacle Detection and Robot Navigation. In Proc. of the 4th ECMR, pages 156-161, 2009.
[7] E. Einhorn, Ch. Schröter, and H.-M. Gross. Attention-Driven Monocular Scene Reconstruction for Obstacle Detection, Robot Navigation and Map Building. Robotics and Autonomous Systems, 59(5):279-292, 2011.
[8] H.-M. Gross, H.-J. Böhme, Ch. Schröter, St. Müller, A. König, E. Einhorn, Ch. Martin, M. Merten, and A. Bley. TOOMAS: Interactive Shopping Guide Robots in Everyday Use - Final Implementation and Experiences from Long-term Field Trials. In Proc. of IROS, pages 2005-2012, 2009.
[9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0-521-54051-8, second edition, 2006.
[10] Jens Klappstein. Optical-Flow Based Detection of Moving Objects in Traffic Scenes. PhD thesis, Ruprecht-Karls-Universität Heidelberg, 2008.
[11] Jens Klappstein, Fridtjof Stein, and Uwe Franke. Detectability of Moving Objects Using Correspondences over Two and Three Frames. In Proc. of the 29th DAGM conference on Pattern recognition, pages 112-121, 2007.
[12] Hyun Soo Park, Takaaki Shiratori, Iain Matthews, and Yaser Sheikh. 3D Reconstruction of a Moving Point from a Series of 2D Projections. In ECCV, pages 158-171, 2010.
[13] Shrinivas J. Pundlik and Stanley T. Birchfield. Real-time motion segmentation of sparse feature points at any speed. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 38(3):731-742, 2008.
[14] J. Shi and C. Tomasi. Good features to track. In 9th IEEE Conference on Computer Vision and Pattern Recognition, pages 593-600, 1994.
[15] J. Taylor, A. D. Jepson, and K. N. Kutulakos. Non-Rigid Structure from Locally-Rigid Motion. In Proc. Computer Vision and Pattern Recognition Conf., pages 2761-2768, 2010.
[16] R. Vidal and D. Abretske. Nonrigid Shape and Motion from Multiple Perspective Views. In ECCV, pages 205-218, 2006.

Especially in the bird's-eye view it becomes apparent that our approach estimates the position of the moving person correctly. Qualitatively, the precision is comparable with the precision of the laser range finder. In Fig. 7, the velocity of the person that is additionally estimated by the EKF is plotted for each frame of the image sequence. The person comes into the field of view in the third frame. After a short initialization phase of 2 frames, the velocity is correctly estimated at around 1 m/s. The small oscillation in the graph is caused by the non-uniform leg movement of the person.
Fig. 7. Estimated velocity of the moving object for each frame of the sequence.

VII. CONCLUSIONS AND FUTURE WORK In this paper, we presented a method for detecting and estimating the position and velocity of moving objects in monocular image sequences that were captured from a moving robot. Combined with our existing approach for monocular static scene reconstruction, the presented technique allows us to detect both static and dynamic obstacles for robot navigation and obstacle avoidance. Thus, it increases the robustness of our obstacle detection system for real-world robot applications in


Fast 2.5D Mesh Segmentation to Approximately Convex Surfaces


Robert Cupec, Emmanuel K. Nyarko and Damir Filko Faculty of Electrical Engineering, J. J. Strossmayer University of Osijek, Croatia

Abstract: A range image segmentation approach is presented. A range image obtained by a 3D sensor is transformed to a 2.5D triangle mesh. This mesh is then segmented into approximately convex surfaces by an iterative procedure based on the incremental convex hull algorithm and region growing. The proposed approach is designed as a fast mid-level tool whose task is to provide suitable geometric primitives for a high-level object recognition or robot localization algorithm. The presented approach is experimentally evaluated using 3D data obtained by a Kinect sensor. Index Terms: 2.5D Mesh, Kinect, Range Image Segmentation, Convex Surfaces

and sensors based on structured light can provide a depth image of an observed scene and thus information about the shape of geometric structures which appear in the scene. The images obtained by the discussed sensors are called range images since every image point is assigned a depth value encoding the distance of the observed point from the camera. Since every image point is assigned a unique depth value, such an image is also called a 2.5D model of the scene.

I. INTRODUCTION
ENVIRONMENT PERCEPTION capabilities are crucial for achieving a highly autonomous operation of a mobile robot. The robot must be capable of localizing itself in an unstructured environment as well as of recognizing certain objects of interest for a particular task. Human-like robot localization strategies rely on visual recognition of landmarks, i.e. objects with fixed position in the environment. A common approach is to use point features with assigned descriptors as landmarks [1]. Alternatively, line features [2] can be used. More complex structures such as generic objects could be very useful for robot localization [3]. Apart from the purpose of localization, a robot's ability to recognize objects is necessary for flexible object manipulation. A useful preprocessing step in an object recognition process is image segmentation. The image of a scene is segmented into several regions potentially representing different objects. A common image segmentation approach is to partition the image into a number of approximately uniformly colored regions [4], [5]. These regions can be grouped into more complex structures which can be matched with objects in a database. A human can easily recognize separate objects in an image such as the one shown in Fig. 1. For a computer vision system, segmentation of a scene from a single 2D image is still an open problem. 3D information can be of great help in scene analysis. 3D sensors such as laser range finder, stereo vision, time-of-flight camera

Fig. 1 A human can easily segment such an image into objects and recognize them. In this paper a computationally efficient approach for segmenting a 2.5D image is proposed. First, the point cloud obtained by a 3D sensor is transformed to a 2.5D triangle mesh. Then an iterative procedure based on the incremental convex hull algorithm and region growing is applied to detect approximately convex surfaces in the scene. The proposed approach is designed as a mid-level processing stage in a vision-based perception system of a robot which should provide suitable geometric primitives for a high-level object recognition or robot localization algorithm. The experiments reported in this paper are performed using a commercially available low-cost 3D sensor, the Kinect, and give some insight into the applicability of this device for robot perception tasks. This paper is structured as follows. In Section II an overview of the related work is given. The proposed range image segmentation algorithm is described in Section III and its experimental evaluation in Section IV.


2 II. RELATED WORK Range image segmentation is the process of dividing, or segmenting, a range image such that all the points of the same surface belong to the same region. The most common range image segmentation methods found in literature can be grouped into two main categories: region-based (surfacebased) methods and edge-based methods. In [6], an overview of several such algorithms are presented and evaluated. Edge-based methods often apply an edge detector to extract edges from the range image. Even though such methods produce well defined boundaries between different regions, they often produce gaps between boundaries. These methods are also prone to under-segmenting the range image especially on curved surfaces where the discontinuities are smooth and hard to determine. Examples of edge-based methods are given in [7], [8] and [9]. With region-based methods, the image can be segmented either by using only planar surfaces or a combination of planar and curved surfaces. In [10], [11], [12], [13] and [14] range images are segmented entirely using only planar surfaces. A procedure to detect connected planar, convex, and concave surfaces of 3-D objects is given in [15]. The range image is initially segmented into "surface patches" by applying a square error criterion clustering algorithm based on the surface points and the associated surface normals. These patches are then classified as planar, convex or concave using a non-parametric statistical trend test, curvature values and eigenvalue analysis. Finally, the boundaries between adjacent surface patches are classified as crease or noncrease edges, and based on this information, compatible patches are merged to produce reasonable faces of the objects. In [16], a method for segmenting range images using curved (convex) surfaces is given. The authors propose a method of detecting 3D convex surfaces using the signs of simplified Gaussian and mean curvatures based on classical differential geometry. By combining edge-based methods and region-based methods, more accurate segmentation results can be obtained [8]. An approach to multi-scale edge and planar surface detection in range images is also given in [9]. Computational geometry approaches to range image segmentation often utilize Voronoi diagrams and Delaunay triangulations to determine connections among data points and subsequently construct triangulated surfaces by connecting adjacent vertices [14], [17] and [18]. These triangulated surfaces form a topological structure often referred to as 3D mesh. A survey in [19] provides an overview of 3D mesh segmentation methodologies. With the aid of this 3D mesh, 3D objects can be modeled by using the convex hull algorithm. The use of convex hulls is well-suited for rapid 3D object modeling since: such volumes are inherently conservative due to their convex nature; the resulting hull is bounded by planar faces, thereby enabling fast computation of distances; and its use in the generation of bounding models to represent the spatial shapes of 3D objects is compact, fast, and relatively efficient [20] and [21]. A technique for approximating range images with adaptive triangular meshes ensuring a user-defined approximation error is presented in [18]. The technique is based on an efficient coarse to-fine refinement algorithm that avoids iterative optimization stages. 
The algorithm first maps the pixels of the given range image to 3D points defined in a curvature space (the curvature space can be thought of as a curvature image in which the range originally associated with each pixel is substituted for a measure of its local curvature). These points are then tetrahedralized with a 3D Delaunay algorithm. Finally, an iterative process starts sculpturing (digging up) the convex hull of the obtained tetrahedralization by progressively removing the triangles that do not fulfill the specified userdefined approximation error with respect to the original 3D triangular mesh. The introduction of the aforementioned curvature space makes it possible for both convex and nonconvex object surfaces to be approximated with adaptive triangular meshes, improving thus the behavior of previous coarse-to-fine sculpturing techniques. In [22], the authors developed a system for automated 3D object detection and modeling using range images in a construction work environment. The range image is initially filtered to reduce noise. Then the image is segmented such that the ground-floor or plane is detected. The data corresponding to the ground-floor of the scene is removed in order to ease the detection of the individual objects located on the ground. The resulting image is then segmented in order to detect the objects in the scene. As a result of this object segmentation process, the surface points on the various objects of interest are segmented. Using a convex hull algorithm, the surface points for each object are then connected to generate a 3D model. We propose a fast and computationally efficient method of segmenting a range image into approximately convex surfaces. We initially create a 2.5D mesh of the range image using iterative Delaunay triangulation. Approximately convex surfaces are then obtained by implementing the incremental convex hull algorithm combined with region growing. III. SEGMENTATION TO APPROXIMATELY CONVEX SURFACES Triangle mesh is a common representation of 3D objects in computer graphics and computer vision, where 3D objects are approximated by polyhedral surfaces up to a given precision. In this section, 2.5D mesh is defined and the term of approximately convex surface is introduced. Let us start with the notation used throughout this section. Let M = {F1, F2, ... F} be a set of triangular surfaces Fk representing a triangle mesh. The vertices of the triangles Fk M are referred to in the following as the vertices of M. The set of all vertices of M is denoted by M. A mesh M such that for every two triangles Fi , F j M there

is a sequence F_{s1}, F_{s2}, ..., F_{sr}, where s1 = i, sr = j and every two triangles F_{sp} and F_{s(p+1)} in that sequence share a common side, is referred to in the following as a connected mesh.


Let us consider a mesh M such that for every two points P, P' ∈ V_M, where V_M denotes the set of all vertices of M, it holds

[ 1 0 0 ]
[ 0 1 0 ] (p − p') ≠ 0,

where p = [x, y, z]^T denotes the vector defining the position of a point P. A mesh with these properties is referred to in the following as a 2.5D mesh. The convex hull of any 2.5D mesh M is also a polyhedral surface, i.e. a 3D mesh which can be represented by a set of triangles H_M = {T_1, T_2, ..., T_m}. Each triangle T_k is assigned a unit normal n_k directed out of the hull and a parameter η_k, referred to in the following as offset. The plane in which T_k lies is defined by the equation

∀P ∈ T_k:  n_k^T p − η_k = 0.   (1)

Let H_M^+ be a subset of H_M defined by

H_M^+ = {T_k ∈ H_M : n_k^T z > 0}.   (2)

In the case where M is obtained by a 3D sensor, H_M^+ represents the subset of triangles T_k ∈ H_M which are visible to the sensor. A connected 2.5D mesh M such that

max_{P ∈ V_M} { min_{T_k ∈ H_M^+} { s (η_k − n_k^T p) } } ≤ ε,   (3)

where s = 1 and ε is a given tolerance, is referred to in the following as an approximately convex surface with tolerance ε. An approximately concave surface can be defined by the same equation for s = −1.

Detection of approximately convex surfaces in a 2.5D mesh can be achieved by employing a region growing approach. Starting with a single triangle, which is by itself a convex surface, region growing proceeds by successively appending adjacent triangles while preserving (3). Algorithm 1 presented in the following can be used for such an incremental growing of an approximately convex surface.

Algorithm 1: Incremental Growing of an Approximately Convex Surface
Input: M, H_M, ε, F
Output: M', H_M', result
1:  M' ← M ∪ {F}
2:  P ← vertex of F which is not a vertex of any triangle of M
3:  H_M,1 ← {T_k ∈ H_M : n_k^T p ≥ η_k}
4:  Find all pairs of adjacent triangles T_k, T_l ∈ H_M such that T_k ∈ H_M,1 and T_l ∉ H_M,1.
5:  For every such pair, insert into a set H_M,2 a new triangle defined by the common side of T_k and T_l and the point P.
6:  H_M' ← (H_M \ H_M,1) ∪ H_M,2
7:  if (3) is satisfied for M' and H_M' then
8:    return M', H_M', SUCCESS
9:  else
10:   return M, H_M, FAILURE
11: end if

The algorithm is based on the incremental convex hull algorithm [23], where after updating the convex hull of the surface M by appending a new triangle F, evaluation of (3) is performed. Evaluating (3) for all vertices in V_M can be avoided. Let us assign to each triangle T_k a set Θ_M,k containing all vertices P ∈ V_M for which T_k is the closest triangle in H_M^+, i.e.

Θ_M,k = { P ∈ V_M : k = argmin_{T_l ∈ H_M^+} (η_l − n_l^T p) }.

It is easy to show that instead of evaluating (3) for all vertices in V_M it is sufficient to evaluate only those in the set

Θ*_M = V_M \ ∪_{T_k ∈ H_M \ H_M,1} Θ_M,k.

Additional computational saving can be achieved by evaluating (3) only for the triangles from the set

H_M,2^+ = {T_k ∈ H_M,2 : n_k^T z > 0}.

Hence, in line 7 of Algorithm 1, instead of (3), the following condition can be evaluated:

max_{P ∈ Θ*_M} { min_{T_k ∈ H_M,2^+} (η_k − n_k^T p) } ≤ ε.   (4)

However, this second simplification can result in rejecting some valid approximately convex surfaces in certain cases. This can happen if there is a vertex P ∈ Θ*_M for which a triangle T_k ∈ H_M,1^+ exists such that (η_k − n_k^T p) ≤ ε and at the same time min_{T_k ∈ H_M,2^+} (η_k − n_k^T p) > ε. In that case, condition (4) is not satisfied although (3) is. Nevertheless, such cases are expected to be rather rare.

The segmentation algorithm discussed in this section starts with a connected 2.5D mesh M. The triangle with the greatest area of the image projection is selected from M as a seed for the region growing procedure. Region growing proceeds by incrementally appending adjacent triangles to the expanding surface, as explained in Algorithm 1. In every iteration, there are multiple candidates which can be merged with the growing surface.


There are various possible criteria for selecting the next triangle to be appended. In the experiments reported in Section IV, the angle between the normals of adjacent triangles is used as the selection criterion, where appending of triangles with similar normals is favored. The growing of the surface stops when no adjacent triangle satisfies (4). All triangles grouped in the obtained approximately convex surface are labeled and thereby excluded from the further segmentation process. After one surface is completed, the procedure continues by taking the triangle with the greatest area of the image projection from the remaining unlabeled triangles in M as a seed for the next surface. The procedure stops when all or a predefined percentage of triangles are labeled. IV. EXPERIMENTAL EVALUATION The range image segmentation approach described in Section III is experimentally evaluated using 3D data provided by a Kinect sensor. Microsoft Kinect is a game console peripheral with an integrated multi-microphone array for sound direction tracking, an RGB camera and a depth sensor. The depth sensor combines an infrared laser projector with a monochrome CMOS sensor. The depth camera interprets 3D scene information based on continuously projected infrared structured light, i.e. this 3D scanner uses patented technology from Primesense called LightCoding. The Kinect specification claims its video and depth refresh rate to be 30 Hz with 8-bit VGA resolution, while the depth sensor produces a 640 x 480 depth image with maximum 11-bit values, which corresponds to 2048 levels of depth. In the experiments reported in this section, the depth images are subsampled to the resolution 320 x 240. The effective scanning range of the Kinect depth sensor is between 0.7 m and approximately 5 m, with an angular field of view of 58° horizontally and 45° vertically. The depth resolution is approximately 1 cm at a distance of 2 m, while at the same distance the spatial resolution is 3 mm [24]. At the time of writing this paper no official drivers for Kinect existed for the Windows platform, therefore we used unofficial drivers and the libfreenect framework from the OpenKinect community [25], which provided satisfactory results. The Kinect sensor provides a 3D point cloud. This point cloud is transformed to a 2.5D triangle mesh by iterative refinement of the Delaunay triangulation proposed in [14]. The obtained mesh is then segmented into approximately convex surfaces by applying the procedure described in Section III. The algorithm is implemented in the C++ programming language and executed on a 3.40 GHz Intel Pentium 4 Dual Core CPU with 2 GB of RAM. The results are presented in Figs. 2 and 3. The objects are mostly correctly segmented. However, in some cases over-segmentation occurs. This is mostly due to the choice of the tolerance ε. Furthermore, in some cases, almost identical scenes result in different segmentations. In general, the smaller the value of ε, the greater the probability of over-segmentation. On the other hand, setting ε to greater values could result in merging of separate objects into a single segment. Nevertheless, the proposed method is designed as a mid-level tool which requires appropriate grouping of the obtained segments on a higher level. This grouping is simplified by the fact that a relatively small number of dominant segments must be considered for grouping on the higher level.

Fig. 2 Camera image (top left); depth image obtained by Kinect (top right); triangle mesh (bottom left) and segmentation (bottom right).

Fig. 3 Camera image (top left); depth image obtained by Kinect (top right); triangle mesh (bottom left) and segmentation (bottom right).

In order to evaluate the segmentation obtained by the proposed method, we used V-measure [26], a cluster evaluation measure based on the conditional entropy. This measure combines two criteria for a segmentation result: homogeneity and completeness. A segmentation satisfies homogeneity if every segment is a subset of a ground truth segment, while satisfying completeness assumes that every ground truth segment is a subset of a segment belonging to the evaluated segmentation. Both homogeneity and completeness are assessed by a value between 0 and 1 and these two values


are combined in the V-measure. The approach proposed in [26] allows different weighting of homogeneity and completeness. In this work, we use equal weights for both properties. The evaluation is performed by applying the proposed algorithm to 12 depth images representing scenes with several objects positioned on a table. Two such scenes are shown in Figs. 4 and 5. The segmentation results obtained by the proposed approach are compared to a ground truth segmentation that is obtained by hand-labeling. segmentation according to V-measure is achieved with ε = 7.
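For reference, the V-measure with equal weights can be computed from a contingency table between ground-truth segments and obtained segments as sketched below (plain C++, following the definitions of [26]); the table itself would be built by counting co-assigned points, and the entry names are illustrative.

```cpp
// V-measure from a contingency table a[c][k]: number of points with
// ground-truth segment c that were assigned to obtained segment k.
#include <vector>
#include <cmath>

double vMeasure(const std::vector<std::vector<double>>& a) {
    const size_t C = a.size(), K = a.empty() ? 0 : a[0].size();
    double n = 0.0;
    std::vector<double> rowSum(C, 0.0), colSum(K, 0.0);
    for (size_t c = 0; c < C; ++c)
        for (size_t k = 0; k < K; ++k) {
            rowSum[c] += a[c][k]; colSum[k] += a[c][k]; n += a[c][k];
        }
    auto plogp = [](double p) { return p > 0.0 ? p * std::log(p) : 0.0; };

    double HC = 0.0, HK = 0.0, HCgivenK = 0.0, HKgivenC = 0.0;
    for (size_t c = 0; c < C; ++c) HC -= plogp(rowSum[c] / n);   // entropy of ground truth
    for (size_t k = 0; k < K; ++k) HK -= plogp(colSum[k] / n);   // entropy of segmentation
    for (size_t c = 0; c < C; ++c)
        for (size_t k = 0; k < K; ++k)
            if (a[c][k] > 0.0) {
                HCgivenK -= (a[c][k] / n) * std::log(a[c][k] / colSum[k]);
                HKgivenC -= (a[c][k] / n) * std::log(a[c][k] / rowSum[c]);
            }
    const double h   = (HC > 0.0) ? 1.0 - HCgivenK / HC : 1.0;   // homogeneity
    const double cpl = (HK > 0.0) ? 1.0 - HKgivenC / HK : 1.0;   // completeness
    return (h + cpl > 0.0) ? 2.0 * h * cpl / (h + cpl) : 0.0;    // equal-weight V-measure
}
```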
We compared our method (Segmentation to Approximately Convex Surfaces, SACS) to the segmentation obtained by the tools SACSegmentation and EuclideanClusterExtraction of the Point Cloud Library (PCL). These two tools are used as explained in the following. First, SACSegmentation is used to detect two dominant planar surfaces: the table surface and the wall. This detection is performed using RANSAC with 100 iterations and the distance threshold parameter set to 0.02. Then, EuclideanClusterExtraction is applied with the cluster tolerance set to 0.02, the minimum cluster size set to 10 and the maximum cluster size set to 250000. The obtained results are compared to the segmentation obtained by our method with ε = 7 using the V-measure. This comparison is presented in Fig. 7. It indicates that our method outperforms PCL for all 12 test samples. One of the most important advantages of the proposed approach is its computational efficiency, which makes it suitable for application in robotics. The execution time of our method (SACS) for the considered 12 test samples is shown in Tab. 1, where it is compared to the execution time of PCL.

Fig. 4 Camera image (top left); ground truth (top right); segmentation obtained by PCL (bottom left) and segmentation obtained by our method (bottom right).

Fig. 6 Homogeneity, completeness and V-measure for different values of ε.


Fig. 5 Camera image (top left); ground truth (top right); segmentation obtained by PCL (bottom left) and segmentation obtained by our method (bottom right). In order to explore the influence of the threshold ε on the segmentation quality, we performed a series of experiments by changing ε from 0, which corresponds to the segmentation into strictly convex surfaces, to 20. The average homogeneity, completeness and V-measure over the 12 considered depth images for different values of ε are presented in Fig. 6. The best


Fig. 7 Comparison of the proposed method (SACS) and PCL according to the V-measure.


Tab. 1 Execution time of the proposed method (SACS) and PCL in seconds.

            SACS exec. time                            PCL
            mesh building   segmentation   total
min.        0.119           0.036          0.155       1.293
max.        0.176           0.128          0.278       2.492
average     0.146           0.066          0.212       1.626
V. CONCLUSION
An approach for the segmentation of range images into approximately convex surfaces is presented. From the original range image a 2.5D triangle mesh is created. This mesh is then segmented into approximately convex surfaces using a region growing technique based on the incremental convex hull algorithm. The algorithm requires the user to specify the tolerance up to which the surfaces in the scene can be considered as convex surfaces. The method is experimentally evaluated and compared to the 3D segmentation tools provided by the Point Cloud Library (PCL). For the scenes used in this evaluation, our method achieved better segmentation according to the V-measure in significantly shorter time. We believe that the output of the proposed method could provide a good basis for a high-level object recognition or robot localization algorithm. Nevertheless, the critical aspect of the considered algorithm is the tolerance, which significantly influences the result. Possible extensions of the presented research are the search for an appropriate tolerance adaptation strategy as well as the design of a robot localization or object recognition system which would rely on the convex surfaces provided by the proposed approach.

REFERENCES
[1] S. Se, D. G. Lowe and J. J. Little, Vision-Based Global Localization and Mapping for Mobile Robots, IEEE Trans. on Robotics, vol. 21, no. 3, June 2005, pp. 364-375.
[2] X. Lebègue and J. K. Aggarwal, Significant Line Segments for an Indoor Mobile Robot, IEEE Trans. on Robotics and Automation, vol. 9, no. 6, December 1993.
[3] D. Kim and R. Nevatia, Recognition and localization of generic objects for indoor navigation using functionality, Image and Vision Computing, vol. 16, no. 11, August 1998, pp. 729-743.
[4] D. Comaniciu and P. Meer, Mean Shift: A Robust Approach Toward Feature Space Analysis, IEEE Trans. on Pattern Analysis and Machine Intelligence, May 2002.
[5] P. F. Felzenszwalb and D. P. Huttenlocher, Efficient Graph-Based Image Segmentation, International Journal of Computer Vision, vol. 59, no. 2, September 2004.
[6] A. Hoover et al., An Experimental Comparison of Range Image Segmentation Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence, July 1996, pp. 673-689.
[7] S. Suganthan, S. Coleman and B. Scotney, Single-step Planar Surface Extraction from Range Images, Proceedings of the Fourth International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT 08), Georgia Institute of Technology, Atlanta, GA, USA, 2008, pp. 363-370.
[8] S. M. Bhandarkar and A. Siebert, Integrating edge and surface information for range image segmentation, Pattern Recognition, vol. 25, no. 9, September 1992, pp. 947-962.
[9] X. Jiang, Qualitative Decomposition of Range Images into Convex Parts/Objects, Proceedings of the International Workshop on Machine Vision Applications (MVA), The University of Tokyo, Japan, November 2000.
[10] X. Yu, T. D. Bui and A. Krzyzak, Robust Estimation for Range Image Segmentation and Reconstruction, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, May 1994, pp. 530-538.
[11] K. M. Lee, P. Meer and R. H. Park, Robust Adaptive Segmentation of Range Images, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 2, February 1998, pp. 200-205.
[12] H. Wang and D. Suter, Robust Adaptive-Scale Parametric Model Estimation for Computer Vision, IEEE Transactions on Pattern Analysis and Machine Intelligence, November 2004, pp. 1459-1474.
[13] R. Cupec et al., Detection of Planar Surfaces Based on RANSAC and LAD Plane Fitting, Proceedings of the 4th European Conference on Mobile Robots (ECMR), Mlini/Dubrovnik, Croatia, September 2009, pp. 37-42.
[14] F. Schmitt and X. Chen, Fast Segmentation of Range Images into Planar Regions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Maui, HI, USA, June 1991, pp. 710-711.
[15] R. Hoffman and A. K. Jain, Segmentation and classification of range images, IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 5, 1987, pp. 608-620.
[16] C. Zhao, D. Zhao and Y. Chen, Simplified Gaussian and mean curvatures to range image segmentation, Proceedings of the 13th International Conference on Pattern Recognition (ICPR), Vienna, Austria, vol. 2, August 1996, pp. 427-431.
[17] N. Amenta, M. Bern and M. Kamvysselis, A new Voronoi-based surface reconstruction algorithm, Proceedings of the 25th annual conference on Computer graphics and interactive techniques (SIGGRAPH 98), 1998, pp. 415-421.
[18] A. D. Sappa and M. A. Garcia, Coarse-to-fine approximation of range images with bounded error adaptive triangular meshes, Journal of Electronic Imaging, vol. 16, no. 2, 023010, April 2007.
[19] A. Agathos et al., 3D Mesh segmentation methodologies for CAD applications, Computer Aided Design & Applications, vol. 4, no. 6, 2007, pp. 827-841.
[20] C. Kim, C. T. Haas, K. A. Liapi and C. H. Caldas, Human-assisted obstacle avoidance system using 3D workspace modeling for construction equipment operation, Journal of Computing in Civil Engineering, vol. 20, no. 3, 2006, pp. 177-186.
[21] J. McLaughlin, S. V. Sreenivasan, C. Haas and K. Liapi, Rapid human-assisted creation of bounding models for obstacle avoidance in construction, Computer-Aided Civil and Infrastructure Engineering, vol. 19, no. 1, 2004, pp. 3-15.
[22] H. Son, C. Kim and K. Choi, Rapid 3D object detection and modeling using range data from 3D range imaging camera for heavy equipment operation, Automation in Construction, vol. 19, no. 7, November 2010, pp. 898-906.
[23] T. Lambert, Incremental Algorithm, http://www.cse.unsw.edu.au/~lambert/java/3d/incremental.html
[24] Primesense reference design, http://www.primesense.com/
[25] OpenKinect, http://openkinect.org/wiki/Main_Page
[26] A. Rosenberg and J. Hirschberg, V-Measure: A conditional entropy-based external cluster evaluation measure, Proceedings of EMNLP-CoNLL 2007, pp. 410-420.
[27] R. B. Rusu and S. Cousins, 3D is here: Point Cloud Library (PCL), Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 2011.


Data registration module - a component of semantic simulation engine


Janusz Bedkowski, Institute of Automation and Robotics, Warsaw University of Technology, Poland, and Industrial Research Institute for Automation and Measurements, Warsaw, Poland

Abstract: In this paper the data registration module, a component of the semantic simulation engine, is presented. An improved implementation of the ICP (Iterative Closest Point) algorithm based on GPGPU (general-purpose computing on graphics processing units) is proposed. The main achievement is on-line alignment of two data sets composed of up to 262144 3D points, therefore it can be used during robot motion. The algorithm uses GPGPU NVIDIA CUDA for NNS (nearest neighborhood search) computation and, to improve the performance, all ICP steps are also implemented on the GPU. Experiments performed in INDOOR and OUTDOOR environments show the benefits of parallel computation applied to on-line 3D map building. Empirical validation using the new-generation CUDA architecture, named Fermi, is shown. Index Terms: Data registration, parallel computing, iterative closest point

I. INTRODUCTION The semantic simulation engine [12] is a project whose main idea is to improve the State of the Art in the area of semantic robotics [5] [16] with the focus on practical applications such as robot supervision and control, semantic map building, robot reasoning and robot operator training using augmented reality techniques [14] [11]. Data registration using the Iterative Closest Point algorithm is a very important component of the semantic simulation engine. The semantic simulation engine combines a semantic map with rigid body simulation to perform supervision of its entities, such as robots moving in INDOOR or OUTDOOR environments composed of floor, ceiling, walls, doors, trees, etc. The semantic simulation engine is composed of data registration modules, semantic entities identification (data segmentation) modules and a semantic simulation module. It provides tools to implement mobile robot simulation based on real data delivered by the robot and processed on-line using parallel computation. The semantic entities identification modules can classify several objects in INDOOR and OUTDOOR environments. Data can be delivered by robot observation based on modern sensors such as 3D laser measurement systems and RGB-D cameras. Real objects are associated with virtual entities of the simulated environment. The concept of semantic simulation is a new idea, and its strength lies in the integration of the semantic map with a mobile robot simulator. Alignment and merging of two 3D scans, which are obtained from different sensor coordinates, with respect to a reference coordinate system is called 3D registration [10] [6] [15]. Park [18] proposed a real-time approach for 3D registration using the GPU, where the registration technique is based on the Iterative Projection Point (IPP) algorithm. The IPP technique is a combination of point-to-plane and point-to-projection

registration schemes [19]. The processing time for this approach is about 60 ms for aligning two 3D data sets of 76800 points during 30 iterations of the IPP algorithm. Fast searching algorithms such as the k-d tree algorithm are usually used to improve the performance of the closest point search [17] [23]. GPU-accelerated nearest neighbor search for 3D registration is proposed in the work of Qiu [22], where the advantage of Arya's priority search algorithm described in [4] to fit NNS into the SIMD (Single Instruction Multiple Data) model was used for GPU acceleration purposes. Purcell suggested that k-d tree and priority queue methods are efficient but difficult to implement on the GPU [21]. Garcia proves that a brute-force NNS approach using the NVIDIA Compute Unified Device Architecture (CUDA) is 400 times faster than a CPU k-d tree implementation [8]. GPU-based NNS with advanced search structures is also used in the context of ray tracing [7], where the NNS procedure builds trees in a different manner from a triangle soup and takes these triangles as the objects of interest. To convert a k-d tree into a serialized flat array that can be easily loaded into a CUDA device, a left-balanced k-d tree was proposed [13] [22]. Another technique for 3D registration using Fast Point Feature Histograms (FPFH) is shown in the work of Rusu [24]. Rusu also proposed a way of characterizing the local geometry of 3D points using persistent feature histograms, where the relationships between the neighbors of a point are analyzed and the resulting values are stored in a 16-bin histogram [25]. The histograms are pose and point cloud density invariant and cope well with noisy datasets. An alternative concept to the ICP algorithm, which relies on instantaneous kinematics and on the geometry of the squared distance function of a surface, is shown in the work of Pottmann [20]. The proposed algorithm exhibits faster convergence than ICP, which is supported both by the results of a local convergence analysis and by experiments. The ICP algorithm is used in SLAM 6D (Simultaneous Localization and Mapping), where the 6 DOF (degrees of freedom) robot position is computed based on the alignment of 3D clouds of points and a loop-closing technique [28]. The paper shows an improved implementation and empirical validation of a GPGPU ICP algorithm. This algorithm offers on-line computation. The main difference compared to State-of-the-Art approaches is the NNS procedure, where the 3D space is divided into a regular grid of buckets, therefore there is no need to build complex data structures such as a k-d tree, and the time of ICP is decreased. The proposed solution is efficient since it performs nearest neighbor search using a bucket data structure (sorted using a CUDA primitive) and computes the correlation


IV. 3D DATA REGISTRATION Aligning two-view range images with respect to the reference coordinate system is performed by the ICP (Iterative Closest Points) algorithm. Range images are defined as a model set M and a data set D; Nm and Nd denote the number of elements in the respective sets. The alignment of these two data sets is solved by minimization with respect to R, t of the following cost function:

E(R, t) = Σ_{i=1}^{Nm} Σ_{j=1}^{Nd} w_ij ||m_i − (R d_j + t)||²   (1)

Fig. 1. Robot PIONEER 3AT equipped with 3D laser measurement system 3DLSN (rotated SICK LMS 200).

matrix using the parallel CUDA all-prefix-sum instruction. The paper is organized as follows: section II gives an overview concerning the hardware used in the experiments, section III concerns NVIDIA GPGPU processor characteristic features. The ICP algorithm is presented in section IV, where all details concerning parallel computing are given. The results of data registration in INDOOR and OUTDOOR environments are shown in section V. II. ROBOT In the experiments the commercially available robot PIONEER 3AT was used (figure 1). The robot is equipped with the 3D laser measurement system 3DLSN based on a rotated SICK LMS 200. The measurement system delivers a 3D cloud of points in stop-scan fashion. Each scan contains 361 (horizontal) x 498 (vertical) data points. III. GPU ARCHITECTURE NVIDIA GPGPUs are programmable multi-core chips built around an array of processors working in parallel. The GPU is composed of an array of streaming multiprocessors (SM), where each of them can launch up to 1024 co-resident concurrent threads. Currently available graphics units are in the range from 1 SM up to 30 SMs for the high-end products. Each single SM contains 8 scalar processors (SP), each with 1024 32-bit registers. A total of 64 KB of register space is available for each SM. Each SM is also equipped with a 16 KB on-chip memory that is characterized by low access latency and high bandwidth. It is important to realize that all thread management (creation, scheduling, synchronization) is performed in hardware and the overhead is extremely low. The SM works in the SIMT scheme (Single Instruction, Multiple Thread), where threads are executed in groups of 32 called warps. The CUDA programming model defines the host and the device. The host executes CPU sequential procedures while the device executes parallel GPU programs - kernels. A kernel works in an SPMD scheme (Single Program, Multiple Data). CUDA gives the advantage of using massively parallel computation for several applications. The detailed GPU architecture can be found in the original documentation [3]. Useful additional programming issues are published in the best practices guide [2].

w_ij is assigned 1 if the i-th point of M corresponds to the j-th point in D; otherwise w_ij = 0. R is a rotation matrix, t is a translation vector. m_i and d_i correspond to the i-th point from the model set M and the data set D, respectively. The ICP algorithm using CUDA parallel programming is given as:

Algorithm 1 ICP - parallel computing approach
  allocate the memory
  copy data from the host (Mhost, Dhost) to the device (Mdevice, Ddevice)
  for iter = 0 to max_iterations do
    select closest points between Mdevice and Ddevice
    calculate (R, t) that minimizes equation 1
    transform Ddevice by (R, t) and put the results into DdeviceRt
    copy DdeviceRt to Ddevice
  end for
  copy Ddevice to Dhost
  free memory
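The per-iteration computation of (R, t), detailed in Section IV-C, reduces to the standard closed-form SVD solution for matched point pairs. A CPU sketch using Eigen is given below; the ordering of the correlation matrix and the reflection check via det(R) are implementation choices of this sketch, not spelled out in the paper, and the GPU version instead accumulates the correlation matrix with parallel reductions.

```cpp
// CPU sketch of one (R, t) computation for matched pairs (m_i, d_i),
// minimizing eq. (1) with one-to-one correspondences.
#include <Eigen/Dense>
#include <vector>

void computeRt(const std::vector<Eigen::Vector3d>& M,   // model points m_i
               const std::vector<Eigen::Vector3d>& D,   // matched data points d_i
               Eigen::Matrix3d& R, Eigen::Vector3d& t)
{
    const size_t N = M.size();
    Eigen::Vector3d cm = Eigen::Vector3d::Zero(), cd = Eigen::Vector3d::Zero();
    for (size_t i = 0; i < N; ++i) { cm += M[i]; cd += D[i]; }
    cm /= double(N); cd /= double(N);                    // centroids, cf. eq. (5)-(6)

    Eigen::Matrix3d C = Eigen::Matrix3d::Zero();         // correlation of centred pairs,
    for (size_t i = 0; i < N; ++i)                       // ordering chosen so that R = V U^T
        C += (D[i] - cd) * (M[i] - cm).transpose();

    Eigen::JacobiSVD<Eigen::Matrix3d> svd(C, Eigen::ComputeFullU | Eigen::ComputeFullV);
    R = svd.matrixV() * svd.matrixU().transpose();       // optimal rotation
    if (R.determinant() < 0.0) {                         // guard against a reflection
        Eigen::Matrix3d V = svd.matrixV();
        V.col(2) *= -1.0;
        R = V * svd.matrixU().transpose();
    }
    t = cm - R * cd;                                     // optimal translation, cf. eq. (16)
}
```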

A. Memory allocation on device and copy from host Memory allocation on the device and the copy from the host is a main bottleneck of the proposed implementation. Data transfer between the host and the device is limited by memory bandwidth and should be minimized. The proposed ICP algorithm is implemented on the GPU (all ICP steps are performed using the GPU), therefore the need for data transfer is limited to copying the cloud of points to/from the GPU. Figure 2 and algorithm 2 show the main data used in the presented approach. The table of found buckets, the table of sorted buckets and the table of sorted points consist of 512 x 512 integer elements; the table of amount of points in bucket and the table of bucket indexes consist of 256 x 256 x 256 integer elements. The M (reference data) and D (data to be aligned) data sets are stored in six 512 x 512 tables of float values stored in a one-dimensional array. For sorting the table of buckets (the routine described in section IV-B.2 and used in algorithm 2), the CUDA Radix Sort class available in [1], briefly described in [26] and [27], is used. The method initialize() called by the constructor of the Radix Sort class allocates temporary storage for the sort and the prefix sum that it uses. The temporary storage is (2*NUMBER_OF_POINTS + 3*8*M/CTA_SIZE) unsigned ints, with a default CTA_SIZE of 256 threads and


Fig. 2. Initial steps of the selection of closest points procedure.

Fig. 3. Tree structure used for XYZ space decomposition into 64 buckets. From the root to the leaf/bucket, the choice of the left or right branch depends on the current separation line.

NUMBER_OF_POINTS = 512 x 512. The amount of data is large, therefore the procedure of memory allocation is done initially. B. Selection of closest points The distance between two points in the Euclidean distance metric for points p1 = {x1, y1, z1} and p2 = {x2, y2, z2} is:

distance(p1, p2) = ((x1 − x2)² + (y1 − y2)² + (z1 − z2)²)^(1/2)   (2)

To find pairs of closest points between the model set M and the data set D, the decomposition of the XYZ space, where x, y, z ∈ <−1, 1>, into 2^8 x 2^8 x 2^8 buckets is proposed. The idea of the decomposition will be discussed for the 2^2 x 2^2 x 2^2 case. Figure 3 shows the decision tree that decomposes the XYZ space into 64 buckets. Each node of the decision tree includes a boundary decision, therefore points are categorized into the left or right branch. Nodes that do not have branches assign buckets. Each bucket has a unique index and is related to a cubic subspace with length, width, height = 2/2^2, 2/2^2, 2/2^2. Each bucket that does not belong to the border has 26 neighbors. The 27 neighboring cubic subspaces are shown in figure 4, where also the way of indexing is given. Figure 5 demonstrates the idea of the nearest neighbor (NN) search technique on a 2-dimensional example. Assuming that we are looking for a nearest neighbor that satisfies the condition R < 2/2^8 (circle of radius R = 2/2^8 in the figure), the NN will be found in the same bucket or in a neighboring bucket (in this example the NN of point d is m5). Algorithm 2 describes the procedure of selection of closest points. For better explanation, figure 2 shows the initial steps of this algorithm, where the set M of 10 points from figure 5 is used for NN search. The details of the algorithm will be discussed in the next subsections. 1) Find bucket: Figure 3 shows the tree structure used for indexing of 2^2 x 2^2 x 2^2 buckets. The concept of finding the bucket index for a point m_xyz is shown in scheme 6, where x corresponds to the border for the current level in the tree and 0, 1, 2, 3, ..., 14, ... correspond to the actual bucket index during its computation. The bucket indexing procedure is executed in parallel, where each CUDA kernel computes the bucket index for a different point_xyz. 2) Sort buckets: The Radix sort class [1] is used to sort unsigned integer key-value pairs. Keys correspond to the elements of the table of buckets and values correspond to elements from the table

Fig. 4. Cubic subspaces - neighboring buckets, the way of indexing is explained in section IV-B.1.

Fig. 5. 2-dimensional example of NN search in neighboring buckets.

Algorithm 2 Selection of closest points
  for all points m_xyz in parallel do
    find bucket_m
    update table of found buckets
  end for
  in parallel sort table of found buckets {radix sort}
  in parallel count points in each bucket
  for all points d_xyz in parallel do
    find bucket_d
    for all neighbors of bucket_d do
      find NN for d_xyz {nearest neighbor is one of the m_xyz}
    end for
  end for
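A sequential C++ sketch of the bucket logic behind Algorithm 2 is given below: the bucket index computation and the 27-bucket neighborhood search. The CUDA implementation instead runs one thread per query point and replaces the per-bucket vectors with the sorted bucket table described in the text; grid size and names are illustrative.

```cpp
// Bucket-based nearest-neighbour search on a regular 2^B x 2^B x 2^B grid,
// assuming coordinates normalized to <-1, 1> (B = 8 in the paper).
#include <vector>
#include <array>
#include <limits>

constexpr int B = 8, GRID = 1 << B;
using Point = std::array<float, 3>;

static int axisCell(float v) {                       // map [-1, 1) to 0..GRID-1
    int c = int((v + 1.0f) * 0.5f * GRID);
    return c < 0 ? 0 : (c >= GRID ? GRID - 1 : c);
}
static int bucketIndex(const Point& p) {             // "find bucket" step
    return (axisCell(p[0]) * GRID + axisCell(p[1])) * GRID + axisCell(p[2]);
}

// Nearest neighbour of d among the model points, searching only the bucket of d
// and its (up to) 26 neighbours; valid for matches closer than one bucket width.
int nearestNeighbour(const Point& d,
                     const std::vector<Point>& model,
                     const std::vector<std::vector<int>>& buckets)  // model indices per bucket
{
    int best = -1;
    float bestDist2 = std::numeric_limits<float>::max();
    const int cx = axisCell(d[0]), cy = axisCell(d[1]), cz = axisCell(d[2]);
    for (int dx = -1; dx <= 1; ++dx)
      for (int dy = -1; dy <= 1; ++dy)
        for (int dz = -1; dz <= 1; ++dz) {
            const int x = cx + dx, y = cy + dy, z = cz + dz;
            if (x < 0 || y < 0 || z < 0 || x >= GRID || y >= GRID || z >= GRID) continue;
            for (int idx : buckets[(x * GRID + y) * GRID + z]) {
                const Point& m = model[idx];
                const float dist2 = (m[0]-d[0])*(m[0]-d[0]) + (m[1]-d[1])*(m[1]-d[1])
                                  + (m[2]-d[2])*(m[2]-d[2]);
                if (dist2 < bestDist2) { bestDist2 = dist2; best = idx; }
            }
        }
    return best;                                     // -1 if no model point found nearby
}
```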

Fig. 6. The scheme of the bucket indexing procedure.


of points. The procedure outputs the sorted table of buckets. Figure 2 shows an example of the sorting result. Radix sort is a well known sorting algorithm, very efficient on sequential machines for sorting small keys. It assumes that the keys are d-digit numbers and sorts on one digit of the keys at a time, starting from the least and finishing with the most significant one. The complexity of sorting n keys is O(n). The details of the GPU based radix sort implementation can be found in [26], [27]. It is important to emphasize that the implementation of GPU based radix sort is robust, therefore it can be used for on-line computation.

3) Count points in bucket and find index of bucket: The counting of points that belong to the same bucket is based on the table of sorted buckets (see figure 2). It is important to notice that the index of the found bucket is also computed. This index, along with the information concerning the amount of points in the bucket, is used for searching the nearest neighbor in algorithm 2.

C. Calculation of (R,t)

The calculation of the rotation and translation (R, t) is performed using the reduced form of equation 1:

$E(R,t) = \frac{1}{N} \sum_{i=1}^{N} \| m_i - (R d_i + t) \|^2$    (3)

where

$N = \sum_{i=1}^{N_m} \sum_{j=1}^{N_d} w_{ij}$    (4)

Rotation R is decoupled from the computation of translation t using the centroids $c_m$ and $c_d$ of the points:

$c_m = \frac{1}{N} \sum_{i=1}^{N} m_i$    (5)

$c_d = \frac{1}{N} \sum_{i=1}^{N} d_i$    (6)

and the modified data sets:

$M' = \{ m'_i = m_i - c_m \}_{1,\dots,N}$    (7)

$D' = \{ d'_i = d_i - c_d \}_{1,\dots,N}$    (8)

After applying equations 5, 6, 7 and 8 to the mean square error function E(R, t), equation 3 takes the following form:

$E(R,t) = \frac{1}{N} \sum_{i=1}^{N} \| m'_i - R d'_i - (t - c_m + R c_d) \|^2$    (9)

Assuming that:

$\tilde{t} = t - c_m + R c_d$    (10)

equation 9 takes the following form:

$E(R,t) = \frac{1}{N} \sum_{i=1}^{N} \| m'_i - R d'_i \|^2 - \frac{2\,\tilde{t}}{N} \cdot \sum_{i=1}^{N} ( m'_i - R d'_i ) + \frac{1}{N} \sum_{i=1}^{N} \| \tilde{t} \|^2$    (11)

To minimize (11), the algorithm has to minimize the following term:

$\sum_{i=1}^{N} \| m'_i - R d'_i \|^2$    (12)

since the middle sum of (11) vanishes for the centered sets, $\sum_{i=1}^{N} (m'_i - R d'_i) \cdot \tilde{t} = 0$, and the last term is minimal with the assumption:

$\tilde{t} = 0$    (13)

The optimal rotation is calculated by $R = V U^T$, where the matrices V and U are derived from the singular value decomposition of a correlation matrix $C = U S V^T$ given by:

$C = \sum_{i=1}^{N} m'_i \, {d'_i}^T = \begin{pmatrix} c_{xx} & c_{xy} & c_{xz} \\ c_{yx} & c_{yy} & c_{yz} \\ c_{zx} & c_{zy} & c_{zz} \end{pmatrix}$    (14)

where:

$c_{xx} = \sum_{i=1}^{N} m'_{ix} d'_{ix}, \quad c_{xy} = \sum_{i=1}^{N} m'_{ix} d'_{iy}, \quad \dots, \quad c_{zz} = \sum_{i=1}^{N} m'_{iz} d'_{iz}$    (15)

The correlation matrix elements are computed using the optimized parallel reduction described in section IV-C.1. The optimal translation t is derived from equations 13 and 10, therefore

$t = c_m - R c_d$    (16)

1) Optimized parallel reduction: The parallel computation of the correlation matrix (equation 14) is performed using the prefix sum [9] available in CUDA. The all-prefix-sums operation takes a binary associative operator $\oplus$ with identity I and an array of n elements

$[a_0, a_1, \dots, a_{n-1}]$    (17)

and returns the following array:

$[I, a_0, (a_0 \oplus a_1), \dots, (a_0 \oplus a_1 \oplus \dots \oplus a_{n-2})]$    (18)

The all-prefix-sums operation on an array of data is commonly known as scan. The parallel implementation uses multiple thread blocks for processing an array of 512 x 512 data points stored in a one-dimensional array. The strategy is to keep all multiprocessors on the GPU busy to increase the performance. Each thread block is responsible for reducing a portion of the array. To avoid the problem of global synchronization, the computation is decomposed into multiple kernel invocations. The optimized kernel available in CUDA is used in the parallel computation.
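As a sequential reference for the formulas above, the following C++/Eigen sketch computes the centroids (5)-(6), a correlation matrix of the centered sets, and the resulting (R, t) via SVD. It is a CPU illustration only: in the paper the sums are produced by the CUDA parallel reduction, Eigen's JacobiSVD is our choice of SVD routine, and the outer-product order below follows the standard SVD-based closed-form solution, which may differ from the paper's index convention by a transpose.

```cpp
#include <Eigen/Dense>
#include <utility>
#include <vector>

// Compute (R, t) aligning data set D onto model set M (matched, equal-size sets),
// i.e. minimizing sum_i || M[i] - (R * D[i] + t) ||^2.
std::pair<Eigen::Matrix3d, Eigen::Vector3d>
computeRt(const std::vector<Eigen::Vector3d>& M,
          const std::vector<Eigen::Vector3d>& D) {
    const double N = static_cast<double>(M.size());

    // Centroids, Eqs. (5) and (6).
    Eigen::Vector3d cm = Eigen::Vector3d::Zero(), cd = Eigen::Vector3d::Zero();
    for (std::size_t i = 0; i < M.size(); ++i) { cm += M[i]; cd += D[i]; }
    cm /= N; cd /= N;

    // Correlation matrix of the centered sets (cf. Eqs. (14)-(15)).
    // Note the outer-product order d' * m'^T used here; conventions differ
    // by a transpose across papers.
    Eigen::Matrix3d C = Eigen::Matrix3d::Zero();
    for (std::size_t i = 0; i < M.size(); ++i)
        C += (D[i] - cd) * (M[i] - cm).transpose();

    // SVD, optimal rotation R = V * U^T, and translation t = c_m - R * c_d, Eq. (16).
    Eigen::JacobiSVD<Eigen::Matrix3d> svd(C, Eigen::ComputeFullU | Eigen::ComputeFullV);
    Eigen::Matrix3d R = svd.matrixV() * svd.matrixU().transpose();
    // A determinant check (flipping the sign of the last column of V when det(R) < 0)
    // is commonly added to avoid reflections; it is not discussed in the text above.
    Eigen::Vector3d t = cm - R * cd;
    return {R, t};
}
```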

V. EXPERIMENTS

Experiments were performed in the INDOOR and OUTDOOR environments shown in figure 7, where the robot was acquiring observations in a stop-scan fashion in one meter steps. The goal was to align all scans iteratively, therefore the odometry error was decreased. The INDOOR data set is composed of 142


Fig. 7. Robot paths during INDOOR and OUTDOOR 3D data acquisition.

Fig. 9. Comparison between the robot trajectory based on combined odometry with gyroscope correction system and GPGPU based ICP with 10 and 100 iterations - OUTDOOR. Robot initial and final positions are shown as in figure 7.

Fig. 8. Comparison between the robot trajectory based on combined odometry with gyroscope correction system and GPGPU based ICP with 10 and 100 iterations - INDOOR. Robot initial and final positions are shown as in figure 7.

scans of 361 x 498 3D data points. The OUTDOOR data set is composed of 61 3D scans. The computation is performed using an NVIDIA GF 580 GTX GPGPU. The comparison between the robot trajectory based on combined odometry with gyroscope correction system and GPGPU based ICP with 10 and 100 iterations is shown in figure 8 for INDOOR and figure 9 for OUTDOOR. The error E(R,t) (see eq. 1) for each robot position as a function of ICP algorithm iterations is shown in figure 10. The error E(R,t) is, as expected, decreasing monotonically during the performed ICP iterations. The time of ICP for each robot position as a function of ICP iterations is shown in figure 11. The worst, best and average ICP performance measured during the ICP alignments is shown in figure 12. It is observed that 30 ICP iterations (300 ms on average) for the INDOOR environment and 50 ICP iterations (400 ms on average) for the OUTDOOR environment satisfy the expected performance of scan alignment. The total time of GPGPU ICP computation is very promising and guarantees on-line execution.

VI. CONCLUSIONS

The data registration module, being a component of the semantic simulation engine, is implemented using GPGPU. The implementation of the classic ICP algorithm is improved by parallel computation applied for fast nearest neighbor search (NNS). The proposed solution is efficient since it performs nearest neighbor search using a bucket data structure (sorted using a CUDA primitive) and computes the correlation matrix using the parallel CUDA all-prefix-sum instruction.

Fig. 10. Error E(R,t) for each robot position as a function of ICP algorithm iterations: (a) INDOOR, (b) OUTDOOR.

The on-line capability of the GPGPU based ICP computation is very promising and provides a potential application to correct the robot position during motion and, in consequence, obtain an aligned 3D cloud of points. The next step will be the integration of a loop closing technique to improve 3D map building.

REFERENCES
[1] http://www.nvidia.com/cuda, 2010.
[2] CUDA C Best Practices Guide 3.2. http://www.nvidia.com/cuda, August 2010.
[3] NVIDIA CUDA C Programming Guide 3.2. http://www.nvidia.com/cuda, October 2010.
[4] Sunil Arya and David M. Mount. Algorithms for fast vector quantization. In Proc. of DCC '93: Data Compression Conference, pages 381-390. IEEE Press, 1993.


Fig. 11. Time of ICP for each robot position as a function of ICP algorithm iterations: (a) INDOOR, (b) OUTDOOR.

Fig. 12. The worst, best and average ICP performance measured during the ICP alignments: (a) INDOOR - 142 scans, (b) OUTDOOR - 61 scans.

[5] Minoru Asada and Yoshiaki Shirai. Building a world model for a mobile robot using dynamic semantic constraints. In Proc. 11th International Joint Conference on Artificial Intelligence, pages 1629-1634, 1989.
[6] Andrew W. Fitzgibbon. Robust registration of 2d and 3d point sets. In British Machine Vision Conference, pages 411-420, 2001.
[7] Tim Foley and Jeremy Sugerman. Kd-tree acceleration structures for a gpu raytracer. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS '05, pages 15-22, New York, NY, USA, 2005. ACM.
[8] Vincent Garcia, Eric Debreuve, and Michel Barlaud. Fast k nearest neighbor search using gpu. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1-6, 2008.
[9] Mark Harris, Shubhabrata Sengupta, and John D. Owens. GPU Gems 3, Parallel Prefix Sum (Scan) with CUDA, chapter 39, pages 851-876. Addison-Wesley, 2007.
[10] Daniel Huber and Martial Hebert. Fully automatic registration of multiple 3d data sets. Image and Vision Computing, 21(1):637-650, July 2003.
[11] A. Maslowski, J. Bedkowski. Methodology of control and supervision of web connected mobile robots with cuda technology application. Journal of Automation, Mobile Robotics and Intelligent Systems, 5(2):3-11, 2011.
[12] A. Maslowski, J. Bedkowski. Semantic simulation engine for mobile robotic applications. Pomiary, Automatyka, Robotyka 2/2011, Automation 2011, 6-8 April, Warsaw, pages 333-343, 2011.
[13] Henrik Wann Jensen. Realistic image synthesis using photon mapping. A. K. Peters, Ltd., Natick, MA, USA, 2001.
[14] G. Kowalski, J. Bedkowski, P. Kowalski, and A. Maslowski. Computer training with ground teleoperated robots for de-mining. Using robots in hazardous environments: Landmine detection, de-mining and other applications, pages 397-418, 2011.
[15] Martin Magnusson and Tom Duckett. A comparison of 3d registration algorithms for autonomous underground mining vehicles. In Proc. ECMR, pages 86-91, 2005.
[16] Andreas Nüchter and Joachim Hertzberg. Towards semantic maps for mobile robots. Robot. Auton. Syst., 56(11):915-926, 2008.
[17] Andreas Nüchter, Kai Lingemann, and Joachim Hertzberg. Cached k-d tree search for icp algorithms. In Proceedings of the Sixth International Conference on 3-D Digital Imaging and Modeling, pages 419-426, Washington, DC, USA, 2007. IEEE Computer Society.
[18] Soon-Yong Park, Sung-In Choi, Jun Kim, and Jeong Chae. Real-time 3d registration using gpu. Machine Vision and Applications, pages 1-14, 2010. 10.1007/s00138-010-0282-z.
[19] Soon-Yong Park and Murali Subbarao. An accurate and fast point-to-plane registration technique. Pattern Recogn. Lett., 24:2967-2976, December 2003.
[20] Helmut Pottmann, Stefan Leopoldseder, and Michael Hofer. Registration without icp. Computer Vision and Image Understanding, 95:54-71, 2002.
[21] Timothy J. Purcell, Craig Donner, Mike Cammarano, Henrik Wann Jensen, and Pat Hanrahan. Photon mapping on programmable graphics hardware. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, pages 41-50. Eurographics Association, 2003.
[22] Deyuan Qiu, Stefan May, and Andreas Nüchter. Gpu-accelerated nearest neighbor search for 3d registration. In Proceedings of the 7th International Conference on Computer Vision Systems: Computer Vision Systems, ICVS '09, pages 194-203, Berlin, Heidelberg, 2009. Springer-Verlag.
[23] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the ICP algorithm. In Third International Conference on 3D Digital Imaging and Modeling (3DIM), June 2001.
[24] Radu Bogdan Rusu, Nico Blodow, and Michael Beetz. Fast point feature histograms (fpfh) for 3d registration. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, ICRA '09, pages 1848-1853, Piscataway, NJ, USA, 2009. IEEE Press.
[25] Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, and Michael Beetz. Persistent Point Feature Histograms for 3D Point Clouds. In Proceedings of the 10th International Conference on Intelligent Autonomous Systems (IAS-10), Baden-Baden, Germany, 2008.
[26] Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore gpus. NVIDIA Technical Report NVR-2008-001, NVIDIA Corporation, September 2008.
[27] Nadathur Satish, Mark Harris, and Michael Garland. Designing efficient sorting algorithms for manycore gpus. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pages 1-10, Washington, DC, USA, 2009. IEEE Computer Society.
[28] Jochen Sprickerhof, Andreas Nüchter, Kai Lingemann, and Joachim Hertzberg. An explicit loop closing technique for 6d slam. In 4th European Conference on Mobile Robots ECMR '09, September 23-25, 2009, Mlini/Dubrovnik, Croatia, pages 229-234.


FaMSA: Fast Multi-Scan Alignment with Partially Known Correspondences


Ernesto H. Teniente and Juan Andrade-Cetto
Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Barcelona, Spain

Abstract: This paper presents FaMSA, an efficient method to boost 3D scan registration from partially known correspondence sets. This situation is typical at loop closure in large laser-based mapping sessions. In such cases, scan registration for consecutive point clouds has already been made during open loop traverse, and the point match history can be used to speed up the computation of new scan matches. FaMSA allows to quickly match a new scan with multiple consecutive scans at a time, with the consequent benefits in computational speed. Registration error is shown to be comparable to that of independent scan alignment. Results are shown for dense 3D outdoor scan matching. Index terms: ICP, 3D scan registration, scan alignment.
Fig. 1. Dense point cloud registration during loop closure at the Barcelona Robot Lab.

I. I NTRODUCTION The Iterative Closest Point (ICP) algorithm is the de-facto standard for range registration in 3D mapping. It is used to compute the relative displacement between two robot poses by pairwise registration of the point clouds sensed from them. In a typical mapping session, consecutive pairwise registration is performed during open loop traverse, and accumulates drift error. This error is corrected by closing loops, i.e., matching point clouds with large temporal deviation (see Fig. 1). Most SLAM algorithms keep probabilistic estimates of the robot location that can be used to determine whether or not a loop closure test is advisable. For instance, by considering not only pose uncertainty but information content as well [12]. But, once a loop closure test is deemed necessary, an algorithm that can compute it expeditiously is needed. Typically loop closure tests are checked not only from the current cloud to a query cloud in the past, but instead, to a consecutive set of query clouds in the past, which in turn have already been registered among them. Using this knowledge, we can expedite multiple registrations at a time. In this paper we propose FaMSA, a technique for fast multi-scan point cloud alignment at loop closure that takes advantage of the asserted point correspondences during sequential scan matching. The paper is organized as follows. A description of related work is given in Section II. Section III details some implementation details of our ICP algorithms; and Section IV elaborates on the particularities of the method. Experiments that validate the viability of the method are given in Section VI, and Section VII contains some concluding remarks. II. R ELATED WORK The most popular scan matching methods are based on the Iterative Closest Point algorithm [5]. The objective of this

algorithm is to compute the relative motion between two data sets partially overlapped by minimizing the mean squared error of the distance between correspondences in the two sets. In the original algorithm, a point-to-point metric is used to measure the distance between correspondences in the set. Point-to-plane metrics are also common practice [8], which make the method less susceptible to local minima. Furthermore, point-to-projection metrics are also possible [7], by matching points to ray indexes directly, inverting the ray casting process. A thorough account of these metrics and their properties is given in [21]. More recently, an error metric that weights unevenly rotation and translation was proposed for 2D [15], [16], and later extended to 3D [6]. The method uses point-to-projection minimization using triangles as the projection surface, and performs nearest neighbor search in the new metric space. FaMSA uses this metric for optimization. ICPs computational bottleneck is in correspondence search. Most strategies to accelerate this search rely on some prior ordering of points within each point cloud, and use treebased search methods such as the Approximate Nearest Neighbor [19], [3] that uses balanced kd-trees; kd-trees with caching mechanisms [24]; or parallelized kd-trees [18]. A method for fast NN search that competes with kd trees for execution speed is based on the spherical triangle constraint [10]. Like in [24], point caching is maintained from one iteration to the next, and ordering and triangle constraints are used to quickly identify correspondences. Aside from tree structures, other space partitioning mechanisms that allow for fast NN search include double z-buffering [4] and grid decomposition [26]. Point sampling is also a common strategy used to accel-


erate the matching process. Sampling, however, only reduces the asymptotic computational complexity by a constant factor. It is common practice to use hierarchical coarse-to-fine sampling methods to avoid missing fine resolution correspondences [25], [27]; and sampling can be either uniform [25], [7], random [14], or with ad-hoc reduction heuristics related to the sensing mechanism [20]. Outlier removal is also a major concern in most modern ICP implementations. Point rejection can be based on statistical point distributions [27], [14], [20], using fixed or dynamic distance thresholds [22], or using topological heuristics [25], [27], [22]. The idea of multi-scan alignment has been addressed as a bundle adjustment problem for 2D range scans [13] using force field simulation. The work that most relates to ours is the latent map [11], a multi-scan matching technique for 2D range matching.

III. RANGE IMAGE REGISTRATION

A. Notation

The objective of the classic ICP algorithm is to compute the relative rotation and translation (R, t) between two partially overlapped point clouds P and Q, iteratively minimizing the mean square error over point matches. For a given set of point match indexes Y, ICP's cost function is

$\arg\min_{R,t} \sum_{(i,j) \in Y} \| p_i - R q_j - t \|^2$    (1)

FAMSA(P, P', Q, Y, R, t, R0, t0)
  INPUTS:
    P, P': Two consecutive query point clouds.
    Q: Current point cloud.
    Y: Correspondences between P and P'.
    R, t: Relative displacement between P and P'.
    R0, t0: Initial displacement between P and Q.
  OUTPUTS:
    R_P, t_P: Relative displacement between P and Q.
    R_P', t_P': Relative displacement between P' and Q.
   1: R_P, t_P ← R0, t0
   2: R_P', t_P' ← (R0, t0) ⊕ (R, t)
   3: while not convergence do
   4:    Z ← NNSEARCH(P, Q, R_P, t_P)
   5:    Z' ← LINK(Z, Y)
   6:    R_P, t_P ← ICPUPDATE(P, Q, R_P, t_P, Z)
   7:    R_P', t_P' ← ICPUPDATE(P', Q, R_P', t_P', Z')
   8:    convergence ← (ε < T) and (ε' < T)
   9: end while

Algorithm 1: FaMSA: Fast multi-scan alignment with partially known correspondences

This minimization is solved iteratively, revising at each iteration the list of point matches, using, for instance, NN search.

B. Implementation details and computational complexity

Correspondence search is the most expensive step in the ICP algorithm. Finding the NN to a given query point relies on the ability to discard large portions of the data with simple tests. Brute force correspondence search would take O(n), with n the size of the point cloud. The preferred data structures used to solve the NN problem in low multidimensional spaces are kd-trees [9], with O(n log n) construction complexity and O(log n) search complexity. Box structures, on the other hand, take polynomial time to build [2] and constant time to search. Box structures are possible in ICP only when the initial and final poses do not change significantly, so that NNs remain in the originally computed box. We implement Akca's box search structure with some modifications. The box structure in [2] assigns to empty boxes the index of the last occupied box. We instead leave empty boxes out of the search. This serves effectively as a fixed distance filter with significant savings in computational load. Our method is faster than using the optimized Approximate Nearest Neighbor (ANN) library [3] with fixed radius search, as shown in the experiments section. The original ICP algorithm of Besl and McKay [5] assumes that for each point in the reference set there must be a correspondence in the query set. In most applications this is not

the case and adequate similarity tests must be implemented. Using point distance as the only criterion for point similarity usually leads to wrong data association and local minima. We use, as in [22], oriented normal similarity constraints, together with statistical constraints [14], i.e., points at distances larger than a multiple of their standard deviation are rejected. These filtering strategies are time consuming and should be used with discretion, since they require sorting and binary search. Correspondence uniqueness is also enforced, and its implementation needs appropriate bookkeeping of matches at each iteration.

Several metrics exist to find the closest point during correspondence search [23], [21]. We adopt in this work the metric proposed in [6], but use point-to-point matching instead of point-to-triangle matching, and avoid the computational burden of computing the corresponding triangle mesh. The metric is an approximated distance that penalizes rotations with a user defined weight L,

$d(p_i, q_j) = \| p_i - R q_j - t \|^2 - \frac{\| q_j \times (p_i - R q_j - t) \|^2}{\| q_j \|^2 + L^2}$    (2)

and a point norm $\| q \| = \sqrt{x^2 + y^2 + z^2 + L^2}$. The metric d substitutes the Euclidean distance in Eq. 1 and, as $L \to \infty$, this measure tends to the Euclidean distance.
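For concreteness, a minimal C++/Eigen sketch of the approximate, rotation-penalizing distance of Eq. (2) is given below. It is an illustration under our reading of the reconstructed formula (a cross product in the correction term); the function name and the use of Eigen are our assumptions, not the authors' implementation.

```cpp
#include <Eigen/Dense>

// Approximate distance of Eq. (2): a point-to-point error that penalizes
// rotations with a user-defined weight L (larger L -> closer to the squared
// Euclidean distance of the residual).
double metricDistance(const Eigen::Vector3d& p,   // point p_i of cloud P
                      const Eigen::Vector3d& q,   // matched point q_j of cloud Q
                      const Eigen::Matrix3d& R,
                      const Eigen::Vector3d& t,
                      double L) {
    const Eigen::Vector3d r = p - R * q - t;   // residual p_i - R q_j - t
    const double correction =
        q.cross(r).squaredNorm() / (q.squaredNorm() + L * L);
    return r.squaredNorm() - correction;
}
```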

IV. FAST MULTI-SCAN ALIGNMENT WITH PARTIALLY KNOWN CORRESPONDENCES

Given that correspondence search is the most expensive part of any ICP implementation, we propose FaMSA to boost multiple scan alignment using previously known correspondences. That is, given two previously aligned point clouds P and P', the relative transformation R, t between the two, and a list Y of correspondences, we want to find the registration between the current point cloud Q and the two query scans P and P'.
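The constant-time bookkeeping FaMSA relies on (the LINK step of Algorithms 1 and 2) amounts to rerouting each match (i, j) between P and Q through the stored correspondence list Y to obtain a match between P' and Q, without any new nearest neighbor search. A minimal C++ sketch follows, with data structures of our own choosing (index vectors, where -1 marks a point of P without a stored partner in P').

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Z:  matches between P and Q, as pairs (index in P, index in Q).
// Y:  stored correspondences from the previous alignment; Y[i] is the index in P'
//     of the point matched to P[i], or -1 if P[i] had no match.
// Returns Z': matches between P' and Q, obtained in constant time per match.
std::vector<std::pair<int, int>> link(const std::vector<std::pair<int, int>>& Z,
                                      const std::vector<int>& Y) {
    std::vector<std::pair<int, int>> Zprime;
    Zprime.reserve(Z.size());
    for (const auto& m : Z) {
        const int iP = m.first, iQ = m.second;
        if (iP >= 0 && static_cast<std::size_t>(iP) < Y.size() && Y[iP] >= 0)
            Zprime.emplace_back(Y[iP], iQ);   // reroute the match through Y
    }
    return Zprime;
}
```

Only matches that survive the stored list Y are kept, which is why the paper reports roughly half as many correspondences when registering this way (Fig. 5), at little cost in accuracy.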


FAMSA2(P, P', Q, Y, R, t, R0, t0)
  INPUTS:
    P, P': Two consecutive query point clouds.
    Q: Current point cloud.
    Y: Correspondences between P and P'.
    R, t: Relative displacement between P and P'.
    R0, t0: Initial displacement between P and Q.
  OUTPUTS:
    R_P, t_P: Relative displacement between P and Q.
    R_P', t_P': Relative displacement between P' and Q.
   1: R_P, t_P ← R0, t0
   2: while not convergence do
   3:    Z ← NNSEARCH(P, Q, R_P, t_P)
   4:    R_P, t_P ← ICPUPDATE(P, Q, R_P, t_P, Z)
   5:    convergence ← (ε < T)
   6: end while
   7: R_P', t_P' ← (R_P, t_P) ⊕ (R, t)
   8: while not convergence do
   9:    Z' ← LINK(Z, Y)
  10:    R_P', t_P' ← ICPUPDATE(P', Q, R_P', t_P', Z')
  11: end while

Algorithm 2: FaMSA2: Very fast multi-scan alignment with partially known correspondences

Fig. 2. Our mobile robotic platform.

The method proceeds as follows. Standard correspondence search is implemented between clouds P and Q, and for each match between points p_i and q_j, a link to P' is read from Y, and consequently the distance from q_j to p'_k is immediately established, avoiding the computation of similarity search and filters. Aside from the previous alignment of P and P', the method needs, as any other iterative ICP algorithm, an initial estimation of the relative displacement between the query cloud Q and P. Algorithm 1 shows the approach. In the algorithm, Z and Z' indicate the correspondence sets between P and Q, and between P' and Q, respectively. Appropriate index bookkeeping links one set to the other in constant time. The threshold T is used to indicate the maximum error allowed for the registration of both point clouds. The method also limits the search to a maximum number of iterations, typically set to 100. The method is suboptimal in the sense that no new matches are sought between point clouds P' and Q. For sufficiently close reference clouds P and P' it does not impose a limitation on the quality of the final correspondence. In the same way that FaMSA takes advantage of the point correspondences between P and P' to boost the computation of the relative displacement between P' and Q, one can also defer the estimation of the pose between P' and Q until all iterations for P have finished and use the result as a starting point for the second optimization. This method is shown in Algorithm 2. Extensive experimentation shows that only one iteration of ICP update suffices to revise the pose of P' with respect to Q, once the relative transformation between P and Q has been optimized. We call this method FaMSA2.

V. EXPERIMENT SETUP

Our experimental data was acquired in the Barcelona Robot Lab, located at the Campus Nord of the Universitat Politècnica de Catalunya. The point clouds were captured using a Pioneer 3AT mobile robot and a custom built 3D laser with a Hokuyo UTM-30LX scanner mounted in a slip-ring. Each scan has 194,580 points with a resolution of 0.5 deg azimuth and 0.25 deg elevation. Figure 2 shows the coordinate frames of all of our robot sensors. For the work reported here, only 39 scans from this dataset were used. Figure 7(a) shows a partial view of the mapped environment. The entire dataset is available in [1]. Each scan was uniformly sampled for faster convergence using voxel space discretization with a voxel size of 0.35 meters. During sampling, we also computed surface normals and enforced a minimum voxel occupancy restriction of 4 points. Random sampling with set sizes of 20 points was used for those boxes exceeding such number of points. Normal orientations are computed after random sampling. This has been shown to produce better orientation estimates, especially around corners, when compared to other strategies such as k-NNs with density filtering. ICP is executed in open loop for 39 consecutive scans, storing all relative pose displacements as well as the correspondence indexes. Then, a number of possible loop closure locations were selected manually. FaMSA was executed on these loop closure candidates. The specific parameters of the ICP implementation include: maximum angle between normals of 35 deg; upper and lower bounds of sigma rejection at 0.25 and 5, respectively; and maximum number of iterations at 100. For the execution times reported, experiments were run in MATLAB using mex files of C++ routines in an Intel Core 2



Fig. 3. A path with 39 poses around the FIB plaza of the Barcelona Robot Lab. (a) Dense point cloud registration. Color indicates height. (b) Robot trajectory. In green the initial pose, in red the final pose.


Fig. 4. Algorithm performance. (a) Time required to match Q and P', when the correspondences between P and P' are known. (b) Time required to match Q with both P and P'.

Quad CPU Q9650 3.0 GHz system, with 4 GB RAM, running Ubuntu 10.04 32 bits.

VI. RESULTS

First, we compare the execution time in seconds for various implementations of multi-scan ICP. To this end, 10 loop closure locations Q are selected in the trajectory, and each is compared against its query clouds P and P'. Figure 4(a) shows the time it takes to align the current cloud Q to the second query cloud P', given that the correspondences between Q and the first cloud P are known. The methods BNN, ANN-FR and ANN refer to our implementation of voxel NNs; ANN with fixed radius, the size of the voxels; and conventional ANN. FaMSA and FaMSA2 stand for the methods presented in this paper that make use of previous point correspondence indexes to speed up registration. Note that FaMSA2 is the

fastest of the methods, requiring only one iteration in the minimization. Extensive experimentation showed that further refinement in the case of FaMSA2 does not significantly improve the registration. Figure 4(b) plots the time it takes to register the current point cloud Q against both query clouds P and P'. The plot shows individual registration using BNN and combined registration using the proposed schemes BNN+FaMSA and BNN+FaMSA2. The advantages in computational load of using the proposed mechanism are significant. One might think that using only the correspondences in Y would yield suboptimal estimation. As a matter of fact, when using only this set to compute the relative displacement between P and Q, the number of correspondences effectively halves (see Fig. 5), but pose estimation accuracy does not suffer significantly.



E.H. Teniente, by the Spanish Ministry of Science and Innovation under projects PAU (DPI2008-06022) and MIPRCV Consolider Ingenio (CSD2007-018), and by the CEEDS (FP7-ICT2009-5-95682) and INTELLACT (FP7-ICT2009-6-269959) projects of the EU. R EFERENCES


Fig. 5. Number of correspondences between P and Q running a full BNN compared to using the stored set Y .

Figure 6 plots proportional translation and rotational errors as compared with the full ICP estimation using BNN, computed as follows [17]: using as ground truth the relative pose between Q and P as computed with BNN, $(R_{BNN}, t_{BNN})$, we measure the relative error of the estimated rotation (R, t) as $E_R(\%) = \| q_{BNN} - q \| / \| q \|$, where $q_{BNN}$ and $q$ are the normalized quaternions of the corresponding orientation matrices $R_{BNN}$ and R, respectively. Similarly, the relative error of the estimated translation is computed with $E_t(\%) = \| t_{BNN} - t_P \| / \| t_P \|$. Translation error turns out to be less than 0.7% for FaMSA and for all cloud pairs, and less than 0.2% for FaMSA2. Rotation error is barely noticeable for both methods. Figure 7 shows a sample of the point cloud match (best viewed in color). In blue, the current pose. In green and red, the query poses. A safe percentage of point cloud overlap in our method is roughly 50%. This is achieved with displacements of about 4 meters.

VII. CONCLUSIONS

This paper presents a novel ICP variation for simultaneous multiple scan registration that benefits from prior known correspondences. The speed up gain is substantial when compared with other methods. The method uses a voxel structure to efficiently search for correspondences to the first cloud in the set, and a metric that unevenly weights rotation and translation. The method was devised to search for loop closures after long sequences in open loop traverse but could be used for other configurations, provided the correspondences on the query set are known.

VIII. ACKNOWLEDGMENTS

This work has been partially supported by the Mexican Council of Science and Technology with a PhD Scholarship to

[1] Barcelona Robot Lab data set. [online] http://www.iri.upc. edu/research/webprojects/pau/datasets/BRL/php/ dataset.php, 2011. [2] D. Akca and A. Gruen. Fast correspondece search for 3D surface matching. In Proc. ISPRS Workshop on Laser Scanning, pages 186 191, Enschede, Sep. 2005. [3] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching xed dimensions. J. ACM, 45(6):891923, Nov. 1998. [4] R. Benjemaa and F. Schmitt. Fast global registration of 3d sampled surfaces using a multi-z-buffer technique. Image Vision Comput., 17:113123, 1999. [5] P.J. Besl and N.D. McKay. A method for registration of 3D shapes. IEEE Trans. Pattern Anal. Machine Intell., 14(2):239256, Feb. 1992. [6] L. Biota, L. Montesano, J. Minguez, and F. Lamiraux. Toward a metric-based scan matching algorithm for displacement estimation in 3d workspaces. In Proc. IEEE Int. Conf. Robot. Automat., pages 4330 4332, Orlando, May 2006. [7] G. Blais and M.D. Levine. Registering multiview range data to create 3D computer objects. IEEE Trans. Pattern Anal. Machine Intell., 17(8):820 824, Aug. 1995. [8] Y. Chen and G. Medioni. Object modeling by registration os multiples ranges images. In Proc. IEEE Int. Conf. Robot. Automat., volume 3, pages 27242729, Sacramento, Apr. 1991. [9] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for nding best matches in logarithmic expected time. ACM T. Math. Software, 3(3):209226, Sep. 1977. [10] M. Greenspan and G. Godin. A nearest neighbor method for efcient icp. In Proc. 3rd Int. Conf. 3D Digital Imaging Modeling, pages 161 168, Quebec, May 2001. [11] Q-X Huang and D. Anguelov. High quality pose estimation by aligning multiple scans to a latent map. In Proc. IEEE Int. Conf. Robot. Automat., pages 13531360, Anchorage, May 2010. [12] V. Ila, J. M. Porta, and J. Andrade-Cetto. Information-based compact Pose SLAM. IEEE Trans. Robot., 26(1):7893, Feb. 2010. [13] R. Lakaemper, N. Adluru, and L.J. Latecki. Force eld based n-scan alignment. In Proc. European Conf. Mobile Robotics, Freiburg, Sep. 2007. [14] T. Masuda. Registration and integration of multiple range images by matching signed distance elds for object shape modeling. Comput. Vis. Image Und., 87(1):5165, 2002. [15] J. Minguez, F. Lamiraux, and L. Montesano. Metric-based scan matching algorithms for mobile robot displacement estimation. In Proc. IEEE Int. Conf. Robot. Automat., pages 35683574, Barcelona, Apr. 2005. [16] J. Minguez, L. Montesano, and F. Lamiraux. Metric-based iterative closest point scan matching for sensor displacement estimation. IEEE Trans. Robot., 22(5):10471054, Oct. 2006. [17] F. Moreno-Noguer, V. Lepetit, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. Int. J. Comput. Vision, 81(2):155166, 2009. [18] A. N chter. Parallelization of scan matching for robotic 3d mapping. In u Proc. European Conf. Mobile Robotics, Freiburg, Sep. 2007. [19] A. Nuchter, K. Lingemann, J. Hertzberg, and H. Surmann. 6D SLAM with approximate data association. In Proc. 12th Int. Conf. Advanced Robotics, pages 242249, Seattle, Jul. 2005. [20] A. Nuchter, H. Surmann, K. Lingemann, J. Hertzberg, and S. Thrun. 6D SLAM with an application in autonomous mine mapping. In Proc. IEEE Int. Conf. Robot. Automat., pages 19982003, New Orleans, Apr. 2004. [21] S-Y. Park and M. Subbarao. A fast point-to-tangent plane technique for multi-view registration. In Proc. 4th Int. Conf. 3D Digital Imaging Modeling, pages 276283, Banff, Oct. 2003. [22] K. Pulli. 
Multiview registration for large data sets. In Proc. 2nd Int. Conf. 3D Digital Imaging Modeling, pages 160168, Ottawa, Oct. 1999.



(a) Relative translational error.

(b) Relative rotational error.

Fig. 6. Proportional translation and rotation errors for the registration between Q and P with the proposed methods. BNN is used for ground truth comparison.

Fig. 7. A loop closure location between clouds 3, 4, and 28 in the BRL dataset (best viewed in color). (a) P in yellow, P' in red, and Q in blue. (b) Cenital view.

[23] S. Rusinkiewicz and M. Levoy. Efcient variants of the ICP algorithm. In Proc. 3rd Int. Conf. 3D Digital Imaging Modeling, pages 145152, Quebec, May 2001. [24] D.A. Simon, M. Hebert, and T. Kanade. Real-time 3D pose estimation using a high-speed range sensor. In Proc. IEEE Int. Conf. Robot. Automat., volume 3, pages 22352241, New Orleans, Apr. 2004. [25] G. Turk and M. Levoy. Zippered polygon meshes from range images. In Computer Graphics. Proc. ACM SIGGRAPH Conf., pages 311318, Orlando, Jul. 1994. ACM Press. [26] S.M. Yamany, M.N. Ahmed, and A.A. Farag. A new genetic-based technique for matching 3D curves and surfaces. Pattern Recogn., 32(10):18271820. [27] Z. Zhang. Iterative point matching for registration of free-form curves and surfaces. Int. J. Comput. Vision, 13:119152, 1994.


Robust Real-Time Number Sign Detection on a Mobile Outdoor Robot


Sebastian A. Scherer, Daniel Dube, Philippe Komma, Andreas Masselli, Andreas Zell
Department of Computer Science, University of Tuebingen, Tuebingen, Germany

Abstract: We present the evolution of a highly-efficient system for visually detecting number signs in real-time on a mobile robot with limited computational power. The system was designed for use on a robot participating in the 2010 SICK robot day robotics challenge, in which the robot needed to autonomously find 10 number signs and drive to all of them in a given order as fast as possible. Our most efficient method has proven to be robust and successful, since our robot won the competition using the final vision systems described in this paper. Index Terms: number detection, number classification, real-time, robotic vision.

I. I NTRODUCTION Automatic sign detection is an essential task for robots working in environments built for humans: Signs can serve as landmarks and convey messages for both humans and robots. In this work we try to nd the best system for detecting a set of number signs used for marking subgoals in the 2010 SICK robot day robotics challenge. The goal of this challenge was to build a robot that autonomously nds and drives to 10 number signs in a given order within at most 10 minutes. Hence the system had to run on an on-board processor of a mobile robot and had to provide real-time performance. Even though the system presented here solves a very specic task, there are several domains of research which are related to the problem at hand: recognition of text on certain objects such as license plates [5, 11, 9] and containers [12], video OCR [16], and text detection in natural images [15, 23, 20]. A few examples for the latter eld of research are Liang et al. [14], who give an overview of document analysis from cameracaptured images. Yamaguchi et al. in [22] address the problem of recognizing phone numbers in natural images. Both take advantage of the fact that there is more than a single character and that characters usually arranged in lines. In contrast, de Campo et al. [7] focus on recognizing single characters in natural images, which is more similar to our problem at hand. Another related area of research is the detection of signs, e.g. trafc signs. In comparison with video OCR, sign detection is a more challenging problem due to the requirement of coping with highly dynamic settings in real-time. Note, however, that the task of sign detection is facilitated by the fact that almost all trafc signs are either colored in or at least framed in an easy-to-detect signal color to make them easier to spot. This also implies that there is usually a distinct edge around their boundary, which renders their detection less demanding and makes color and shape information popular features for detecting trafc signs [8].

Fig. 1. Our robot driving towards sign number 5 during its winning run at the SICK robot day robotics challenge 2010.

For the SICK robot day competition, it was not totally clear whether there would be an always visible boundary that could be used for sign detection. Hence, we could not apply car number extraction techniques based on color segmentation [13] or rectangular area detection [11]. The contribution of this work is to provide a framework which addresses these problems in real-time. As shown in Sect. V the proposed method allows for robust number detection and classication in a highly dynamic environment. The remainder of this paper is organized as follows: In Sect. II we describe the 2010 SICK robot day robotics challenge. Sect. III provides details of our experimental setup. In Sect. IV we describe four different number detection and classication approaches we implemented. The results are presented and discussed in Sect. V. Finally, Sect. VI gives conclusions. II. T HE 2010 SICK ROBOT DAY C HALLENGE The 2010 SICK robot day challenge was an international robotic race for universities, colleges and other research groups. Participating teams were required to build and program an autonomous mobile robot that is able to explore and navigate an unknown environment, nd and recognize landmarks given by number signs, and reach all of them in a given order as fast as possible. The organizers originally announced that the challenge could take place either indoors or outdoors in


course 30 40 meters in size with obstacles that can easily be detected by laser range nders. As the challenge had to take place indoors in the end, however, the actual course was circular-shaped with a diameter of only approx. 15 meters. In addition to static obstacles that were moved by the organizers between consecutive runs, there was always a second robot competing on the course at the same time. Collisions with static objects would lead to a penalty of 5% of the total time and collisions with the other robot to disqualication. We also knew in advance that the number signs to be found consisted of black numbers from zero to nine printed in size 37 58 cm on white rectangular boards. The robots had to nd and reach these number signs in a given order. The quantity of number signs reached in the correct order and the time required for this determined the ranking of all participating robots. Note that it would be fatal to miss a number sign early in the given order: Signs correctly reached after one was missed would not count towards a robots score. III. E XPERIMENTAL S ETUP A. The Mobile Robot The mobile robots used for these experiment are custombuilt based on a remote-controlled monster truck E-MAXX by Traxxas. Among several other sensors they contain three cameras: The main camera is a forward-looking AVT Marlin mounted on a pan unit so it can keep a number sign focused while driving around corners. It captures grayscale images sized 780 580 at a rate of 30 Hz. We additionally employed two Firey MV cameras by Point Grey Research looking left and right, hoping to nd more signs due to the larger total area of view. These two cameras grab grayscale images sized 640480 at a rate of 7.5 Hz. All processing is done on a MiniITX computer featuring a 2.26 GHz Core 2 Duo Mobile CPU mounted on the robot. The number sign recognition module is only one of many computationally demanding modules running on the computer at the same time. B. Data We have a total of 29 logles available of our robot driving in different environments with number signs recorded on 7 different days in different locations, indoors and outdoors. We split these logles into a training set (21 logles), used to train the classiers of sections IV-D and IV-A, and a validation set (8 logles). From each logle we consider only one camera image per second to avoid processing large numbers of similar images. This still leaves us with 6179 and 2292 images, respectively, in the training and validation set. We semiautomatically labeled all images in both sets as we needed correctly labeled training data for learning our classiers and validation data for the performance analysis in Sect. V. IV. N UMBER S IGN D ETECTION A LGORITHMS In this section we present four iterations of our sign recognition software evolved during the preparation of the SICK robot day challenge of 2009 and 2010. The problem to be solved is detecting black number signs printed on an otherwise

Fig. 2. Image as seen by the robot (left) and binarized image (right) with blobs classied as numbers (marked in green) and as non-numbers (marked in red).

white board in images captured by a camera mounted on a moving robot. We rst describe the application of the object detection framework by Viola et al. [19] to our problem in Sect. IVA: A general-purpose approach, which is well-known for its good performance detecting mainly faces, but can also be trained to detect arbitrary objects like soccer balls [18]. For the remaining three approaches presented in this work, we decompose the task at hand into three subproblems: Thresholding or binarization, connected component labeling, and classication of the connected components. We deem this decomposition reasonable as we try to nd black numbers printed on white background: Proper binarization retains this property in the resulting binary image and the numbers printed onto white signs become black blobs which can be easily extracted using connected component labeling. Thresholding is the task of classifying image pixels as either dark (represented by the value 0) or light (represented by 1). The obtained binary image provides the basis of the following connected component labeling process. Connected component labeling identies connected regions and assigns each connected region a unique label. In the context of number detection, region detection is important since each number is represented by connected components in the image. After the labeling process we obtain a set of connected regions (or image patches) containing both numbers and non-numbers. For identifying the respective number and ltering out non-numbers, a classier has to be established. The approach described in Sect. IV-B relies on many readily-available methods, e.g. by OpenCV [2] and pattern matching for classication. The one described in Sect. IV-C speeds up the binarization step by using an integral image, connected component labeling using ood ll, and classication using 1-nearest neighbor. In our nal approach described in Sect. IV-D, we again try to speed up connected labeling using a run-based approach and improve classication by training an articial neural network. A. The Boosted Cascade Classier (Viola&Jones) Approach This approach is based on the boosted cascade detector by Viola et al. [19] used for real-time face detection. They use the AdaBoost algorithm to train a classier which then decides in a the recall phase whether an image patch shows a face or not. An image is scanned for faces by applying the classier on a window that is moved across the image



Fig. 4. Structure of the classier composed from multiple cascade detectors. All image patches must pass a common stage (C) to get examined by the classiers specic for each single number, going from 0 to 9. Patches that are rejected by either the C stage or the 9 classier will be considered as non-numbers.

Fig. 3. Patches used for training the classifier.

in different scales. Window positions at which the classier predicts a face are memorized to produce the nal detection result for the image. One crucial idea in [19] is the introduction of a so called attentional cascade, which rapidly improves the speed of the overall detector. A classier consists of a cascade of simpler classiers, or layers, which are applied successively. It starts with the simplest layer, which is designed to be quickly evaluated and highly sensitive, allowing a high false positive rate. This layer can already reject many non-faces, so the remaining layers focus on the image patches accepted by the rst layer. The whole classier only accepts an image patch as a face if it has been predicted positive by all layers. Originally published as a fast and robust face detector, the system can be trained with arbitrary objects. We created a number detector by training the system with patches of number signs that were cropped out of our log les (see also section III-B). Examples of the image patches are shown in gure 3. Since a single detector can only distinguish between two classes, usually objects and non-objects, we arrange multiple detectors to create a classier which is capable of telling not only if a number is found, but also which number is predicted (see Figure 4). The rst stage has been designed to quickly select patches which are candidates for a number and reject the majority of image patches, which do not have to be processed further. Only patches that pass the rst stage are examined more sophistically by the classiers sensitive for a certain number. By using a rst stage common for all numbers we extended the idea of the attentional cascade described in [19], overcoming the burden of applying 10 different classiers, one for each number, to all window positions and scales when scanning an image. As shown in 4, the structure of the additional processing is linear and relies on a robust response of the detectors that are specic for each number, e.g. if the one-classier erroneously predicts a one instead of a nine, the nine-classier never gets a chance to correct this. However, due to processing speed and ease of implementation this structure has been chosen to be used as opposed to a more sophisticated approach. 1) Training: The stage common for all number candidates consists of a 10 layer cascade trained with an initial set of 1472

images of all 10 numbers and 12252 images of non-numbers, i.e. the images were used for training the first layer of the cascade. For the last layers the bootstrapping method described in [19] is used, taking only those images for training that have been predicted positive by the first layers, and refilling the non-number set by gathering false positives from large images that do not contain numbers. Gathering is done by scanning these images with the classifier created so far, consisting of the former layers. The detection rate is set to 0.97 per cascade layer, the false positive rate is set to 0.4 per layer, which means that the common stage already rejects about $1 - 0.4^{10} \approx 99.99\%$ of all non-numbers. The classifiers for a certain number are trained with 112 to 193 positive examples and initially 12252 negative examples (non-numbers), also using a detection rate and false positive rate of 0.97 and 0.4 per layer, respectively. Bootstrapping is applied for images that do not contain the certain number, i.e. all other numbers are present. First training results show that this is necessary to ensure a high specificity of the classifier. Training is done until the non-number training set cannot be refilled with false positives, which results in classifiers consisting of 13 to 22 cascade layers.

B. The OpenCV Approach

For our second approach, we focused on achieving real-time number detection by using basic and fast image processing methods. We therefore used standard algorithms, e.g. adaptive thresholding and pattern matching, as implemented in the OpenCV library.

1) Thresholding: Depending on the lighting conditions, the setup of the camera and the captured scene, the detected gray value of the white board background or the black number is highly variable. Methods separating pixels into fore- or background by globally adjusted thresholds are therefore not applicable. In an adaptive approach, this threshold is determined dynamically, taking spatial variations in illumination into account. For example, Wellner [21] employs a moving average technique in which the mean m of the last s visited pixels is considered. If the value of the current pixel is t percent less than the average m, then this pixel is assigned the value of 0 (dark) and 1 (light) otherwise. Since this approach depends on the scanning order of the pixels and the neighborhood samples are not evenly distributed during the average calculation, Bradley et al. [1] adopt an alternative


approach for determining the threshold. They compute the average of an s x s window of pixels centered around the pixel under consideration and define a certain percentage of this average as the threshold value. We used the implementation of the described adaptive thresholding approach from the open source image library OpenCV with a dynamic window size of 15 x 15 pixels to assign each pixel to either light background or dark foreground. Although this setup works fast and stable for most conditions, it has major difficulties detecting very close numbers. There, the dynamic window is smaller than the line width of the number, hence the inner number area is randomly chosen depending on the image noise. The computation time of the adaptive threshold, however, increases with the window size, thus a larger window is no option to deal with this problem.

2) Component Labeling: We applied a contour-based approach [4]¹ to obtain the connected components of the binary image. The algorithm identifies the components by tracing a contour chain for each connected region. Since we were only interested in blobs which are likely to be numbers, we applied several filters with experimentally determined thresholds to discard some of the extracted components, which saves computation time in the classification step. In our case, numbers are higher than wide, so we removed all components with an aspect ratio greater than 1. Since numbers usually have straight contours, we only kept blobs with a ratio of perimeter to diagonal between 2 and 10. Furthermore, the relative line width of the numbers is equivalent to a certain value, so we discarded blobs with a ratio between blob area and bounding box area lower than 0.1 and greater than 0.5.

3) Classification: We based the classification task on a k-nearest-neighbor classifier (KNN) with k = 1. Since the image patches are two dimensional and vary in size, whereas the KNN classifier requires one dimensional inputs of constant length, the connected regions had to be preprocessed. For this we scaled every contour chain to fit in a 16 x 16 pixels bounding box. Now the OpenCV method cvWarpImage was used to crop this bounding box from the image. Then we set all non-blob related areas of this patch to the background color and scaled the pixel values of the patch linearly to fill the interval between 0 and 255. Finally, we determined the nearest neighbor of the extracted patch by calculating a normalized distance between the patch and every patch of a labeled reference set. In our case, the reference set consists of images of the numbers from zero to nine rendered with rotation angles between -15° and 15° in steps of 5°. Given two image patches p1 and p2, the L2-normalized distance is calculated by $d = \| p_1 - p_2 \|_{L2} / (255 \cdot A)$, where $\|\cdot\|_{L2}$ is the L2 norm, calculated by the OpenCV method norm, and A is the area of the patch. If the minimal distance between the extracted patch and a training patch is lower than a threshold of 0.33, the label of the training patch is assigned to the patch. In the other case, the patch is classified as a non-number. The threshold was experimentally chosen to yield a good ratio between correct detections and false positives.
1 We used the open source library cvBlobsLib which implements [4] and is based on OpenCV: http://opencv.willowgarage.com/wiki/ cvBlobsLib.
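To make the pipeline of Sect. IV-B concrete, here is a hedged C++/OpenCV sketch of the thresholding, blob extraction and geometric filtering steps. It uses cv::findContours in place of cvBlobsLib and cv::adaptiveThreshold's constant offset in place of the percentage criterion, so it only approximates the configuration described above; the numeric thresholds are the ones quoted in the text.

```cpp
#include <cmath>
#include <vector>
#include <opencv2/imgproc.hpp>

// Extract dark blobs that are plausible number candidates from a grayscale image.
std::vector<std::vector<cv::Point>> numberCandidates(const cv::Mat& gray) {
    // 1) Adaptive thresholding with a 15x15 neighborhood; numbers become black (0).
    //    The constant offset (last argument) stands in for the t% criterion.
    cv::Mat bin;
    cv::adaptiveThreshold(gray, bin, 255, cv::ADAPTIVE_THRESH_MEAN_C,
                          cv::THRESH_BINARY, 15, 10);

    // 2) Connected components via contours; invert so dark blobs become white regions.
    std::vector<std::vector<cv::Point>> contours, kept;
    cv::findContours(255 - bin, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);

    // 3) Geometric filters from the text: aspect ratio, perimeter/diagonal, fill ratio.
    for (const auto& c : contours) {
        const cv::Rect bb = cv::boundingRect(c);
        const double diag = std::sqrt(double(bb.width * bb.width + bb.height * bb.height));
        const double perRatio = cv::arcLength(c, true) / diag;
        const double fill = cv::contourArea(c) / double(bb.area());
        const bool tallerThanWide = bb.height >= bb.width;          // aspect ratio <= 1
        const bool straightish = perRatio >= 2.0 && perRatio <= 10.0;
        const bool lineWidthOk = fill >= 0.1 && fill <= 0.5;
        if (tallerThanWide && straightish && lineWidthOk) kept.push_back(c);
    }
    return kept;   // candidates to be rescaled and passed to the 1-NN classifier
}
```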

C. The Accelerated k-Nearest Neighbor (kNN) Approach

The third approach aimed at reducing the run-time complexity of each individual step by means of several modifications: an integral image-based adaptive thresholding approach, connected-component labeling using flood fill, and accelerated k-nearest neighbor classification.

1) Thresholding: Using integral images [6, 19], the average calculation for adaptive thresholding can be accomplished in constant time, rendering the approach of [1] practical for real-time applications. Given an image I with pixel intensities I(x, y) at pixel locations (x, y), an entry within the integral image Ī(x, y) stores the sum of all intensities contained in the rectangle defined by the upper left corner (0, 0) and the lower right corner (x, y). Then, the average intensity i(x, y) of an arbitrary rectangular area of I given by (x_ur, y_ur) and (x_ll, y_ll) can be calculated as:

$i(x,y) = \frac{\bar{I}(x_{ur}, y_{ur}) - \bar{I}(x_{ur}, y_{ll}) - \bar{I}(x_{ll}, y_{ur}) + \bar{I}(x_{ll}, y_{ll})}{(x_{ur} - x_{ll})(y_{ur} - y_{ll})}$    (1)

where x_ur, y_ur and x_ll, y_ll denote the upper-right and lower-left x- and y-coordinates of the given rectangle, respectively. Equation (1) shows how the average can be computed efficiently. Further acceleration was achieved by choosing a power of two as the window size, which enables the use of a right-shift instead of a division operation. Adaptive thresholding using an integral image is a two-stage process. At first, the integral image has to be calculated, which can be accomplished in linear time. In the second stage, the original image is binarized by comparing each pixel intensity of I with the average intensity of the rectangular area which surrounds the pixel under consideration.

2) Component Labeling: For the labeling task, we employed a simple seed fill algorithm: Given a certain location (x, y) in the binary image, its neighbors are recursively tested whether they are assigned the same binary value as the pixel under consideration. If this is the case, the respective neighbor (x', y') is assigned the same label as the pixel (x, y). Further, this process is recursively repeated for all non-visited neighbors of pixel (x', y'). Recursion stops if the binary value of the neighboring pixel (x', y') differs from the one of (x, y). All pixels with the same label assigned define a set of connected components. We further represented the area covered by the minimum bounding box surrounding a single connected component set by a binary matrix. Here, a matrix element was assigned the value of 1 if it represented a connected component and 0 otherwise. Finally, the set of all binary matrices established the basis of the succeeding classification step.

3) Classification: Following the classification scheme of the previous approach, we also employed k-nearest neighbor classification with k = 1 given rescaled image regions as inputs. In contrast to the previous approach, however, these image regions did not originate from the acquired grayscale image but from their respective connected components extracted in the previous step. Further, resampling was achieved by nearest neighbor interpolation, yielding rescaled connected components of size 32 x 32 pixels. In the following step, the
32 × 32 matrix of binary values was transformed into a 1024-bit one dimensional vector by concatenating the individual rows of the matrix. Here, the one dimensional vector was represented by a byte array of length 128, F_n = {f_n,1, . . . , f_n,128}, each byte containing 8 pixels of the binary vector. Classification was performed by calculating the Hamming distance between a candidate input vector and each entry of the number training set. Similar to the first approach, the training set consisted of unmodified and transformed numbers rotated by angles between −15° and +15° with a step size of 5°, yielding 70 training patterns in total. The image in the training set with the minimum distance to the candidate vector determined the classification result. If the Hamming distance exceeded a certain threshold, the considered image patch was defined to be a non-number. We determined this threshold experimentally by choosing the value which resulted in the smallest classification error. Adopting our representation of connected areas, the Hamming distance between two input vectors F1 and F2, H(F1, F2), can be efficiently computed using addition and XOR operations only: H(F1, F2) = Σ_{i=1..128} bitcount(f_{1,i} ⊕ f_{2,i}), where bitcount determines the number of 1-bits in the resulting bitstring after the XOR operation. In our implementation, the function bitcount was realized using a lookup table covering all 2^8 possible configurations of an input vector entry f_{n,i}. We disregarded further acceleration measures such as using 16 bit up to 64 bit integers instead of 8 bit integers as the base data type for the input vector representation (cf. [17]) or employing specialized instructions of the SSE4.2 instruction set as proposed in [3], because the classification step already proved to be fast enough.
D. The Artificial Neural Network (ANN) Approach
In this approach we reused the algorithm from the previous approach that computes an adaptive threshold using the integral image. We notably changed, however, how connected components are extracted and classified.
1) Extracting Suitable Connected Components: We based our connected component labeling on the run-based two-scan labeling algorithm by He et al. [10]. This algorithm regards runs, i.e. continuous horizontal line segments with the same pixel value (illustrated by the sketch below), instead of pixels as the smallest elements of connected components, which promises a significant speedup for binary images with large areas of constant value. It requires extra space for a label image, which assigns the same unique label to all pixels that belong to the same connected component upon completion of the algorithm, and for three arrays of limited size to store temporary information about so-called provisional label sets. The original algorithm by He traverses the image twice: In the first scan, the algorithm traverses the image pixel-wise to extract runs. If a run is not connected to any previous run, a new label is assigned to all of its pixels. Otherwise it inherits the label of the connected run. In case two or more runs with different labels are connected to each other, they are merged to form a provisional label set. This merging operation can be implemented efficiently since all information about provisional label sets is kept in compact data structures.
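To illustrate the notion of runs referenced above, a minimal sketch of extracting the runs of one row of a binary image could look as follows; the function name and output format are ours, not taken from [10] or the authors' implementation.

```python
import numpy as np

def extract_runs(row):
    """Return (start, end) index pairs of consecutive foreground pixels
    in one row of a binary image; these runs are the smallest elements
    handled by a run-based labeling scheme."""
    runs = []
    start = None
    for x, v in enumerate(row):
        if v and start is None:            # a run begins
            start = x
        elif not v and start is not None:  # the current run ends
            runs.append((start, x - 1))
            start = None
    if start is not None:                  # run touching the right border
        runs.append((start, len(row) - 1))
    return runs

# Example: one row containing two runs.
print(extract_runs(np.array([0, 1, 1, 0, 0, 1, 1, 1, 0], dtype=bool)))
# -> [(1, 2), (5, 7)]
```

On rows with large uniform regions, a labeling pass then only has to handle a handful of runs instead of every single pixel.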

In the second scan, the original algorithm traverses the whole image again to compute the final label image, replacing all provisional labels with their final ones. In our case, however, we do not need the complete final label image. As we usually only need to extract a small number of connected components, we modified He's algorithm in the following way: In addition to the data structures proposed by He, we also store the bounding box of each provisional label set. Thus, we can disregard connected components depending on the dimensions of their bounding box and extract only connected components which are likely to be numbers. Once we have selected a small number of connected components as candidates, we extract these efficiently based on their bounding boxes instead of traversing the whole label image a second time.
2) Classification of Patches: To further improve classification performance over the previous kNN classifier without requiring much more computation time, we employ an artificial neural network (ANN) to classify image patches. The neural network has to decide to which of 11 classes a patch belongs: It is either one of the 10 numbers or not a number. We implemented this using the 1-of-11 target coding scheme, which requires one output neuron per class: For each input, we expect the output neuron corresponding to its class to return 1 and all others to return 0. The structure of our neural network is a multilayer perceptron with 160 input units (one per pixel of patches resized to 10 × 16) and only one hidden layer consisting of 20 hidden units. All units of our neural network use the logistic activation function, except the output layer, which uses the softmax activation function, so we can interpret the output of each output unit as the posterior probability of the input patch belonging to its corresponding class. We used the Stuttgart Neural Network Simulator (SNNS) [24] to train and evaluate the neural network because of its ability to automatically generate efficient C code for evaluating a previously trained network using its snns2c module. We trained the network using standard backpropagation on a large dataset containing real-world patches extracted from log files in which a robot moved around environments that contained all 10 signs. The classifier of the previous approach supplied us with a preliminary labeling of this dataset, and we only needed to manually verify or correct this labeling afterwards. Since these log files contain a relatively small number of patches that correspond to actual numbers compared to a large number of non-number patches, we synthetically generated more training patches by warping images of numbers using different perspective transformations. We relied on early stopping to prevent overfitting to the training data and employed a 10-fold cross validation on the training data to find that the average test error was minimal after 53 epochs of training. By adding a large number of synthetic training patches, we significantly changed the ratio between number and non-number patches and thus the prior class probabilities assumed by the network. We need to correct this by scaling the posteriors returned by the network with the ratio of relative frequencies of each class before and after adding synthetic


TABLE I
DETECTION PERFORMANCE OF THE OPENCV, KNN, ANN AND VIOLA/JONES APPROACHES. THE FIRST TWO VALUES ARE RELATIVE TO THE NUMBER OF TRUE NUMBER SIGNS IN THE VALIDATION SET.

                             OpenCV    kNN      ANN      Viola/Jones
true positives in %          35.0      34.6     49.7     64.0
false negatives in %         65.0      65.4     50.3     36.0
false positives per frame    0.012     0.011    0.012    0.007
TABLE II
CLASSIFICATION PERFORMANCE OF THE OPENCV, KNN, ANN AND VIOLA/JONES APPROACHES.

                            OpenCV    kNN     ANN     Viola/Jones
correct detections in %     100.0     99.8    99.6    97.6
wrong detections in %       0.0       0.2     0.4     2.4


Fig. 5. Detection performance depending on the visible size of numbers, plotted as the correct detection rate over the height interval of the number signs in pixels ([10, 25), [25, 40), [40, 55), [55, 70), [70, 170]) for the OpenCV, kNN, ANN and Viola/Jones approaches. The correct detection rate in this plot is equivalent to the true positive rate from Table I times the correct detection rate from Table II.

data. Since we were more concerned about false positives and wrong detections than about false negatives, we applied a loss function that assigns a loss of 1 to false negatives and a loss of 10 to false positives and wrong detections. Classes are then chosen to minimize the expected loss.
V. EXPERIMENTAL RESULTS
We evaluate all four of the approaches described in chapter IV using the validation dataset described in III-B with regard to both classification performance and computation time.
A. Detection Performance
In order to compare the detection performance, we decouple the two problems of detecting a general number sign and classifying the detected signs. Each true number sign is either correctly detected (true positive) or missed (false negative). Detections that do not correspond to a true number sign are counted as false positives. Since numbers as found by the described detectors might differ slightly in position and size from the ones labeled by humans, we consider two labels to coincide if the intersection of their areas covers at least one fourth of each label's individual area. The resulting detection rates can be seen in Table I. The rates of the OpenCV and kNN approaches are almost identical because they mainly differ in details of their implementation. The ANN approach achieves many more true positives while keeping the number of errors at a similar rate due to the better classification performance of the artificial neural network. The Viola/Jones-based detector performs significantly better than all other methods. The classification performance, i.e. the percentage of correctly detected signs that were also assigned the correct label, is shown in Table II. It is consistently high with the exception of the Viola/Jones approach, which tends to confuse number signs more often than the others. The overall ratios of correctly recognized numbers as reported in Table I might seem low to the reader. This is because many of the number signs visible and tagged in our

TABLE III
PROCESSING TIME PER FRAME IN MILLISECONDS (thresholding / segmentation / classification / overall).

OpenCV:        3.5   9.3   0.4   13.2     0.4   5.3   0.3   5.5     2.5   9.3   0.2   12.0
kNN:           0.4   1.3   0.4   1.6      2.4   2.3   0.1   4.8
ANN:           0.3   0.7   0.2   1.0
Viola/Jones:   193.5   73.3

log files suffer from motion blur or are far away from the robot. In combination with the artifacts introduced by the discretization in the thresholding step of the first three approaches, this makes classifying very small numbers correctly a hard problem. Another important issue to keep in mind is that for our application, low ratios of false positives or incorrectly classified numbers are much more important than high recognition rates: As long as our robot has not seen the next number it needs to reach, it can simply keep exploring until it eventually finds it. If, on the other hand, it drives towards an incorrectly recognized (false positive) sign and decides it has reached it, there is no way to recover from this bad decision and the competition would be lost. Fig. 5 shows how the detection performance of the different approaches depends on the size in pixels of the number sign as seen in the image. While the performance is comparable for large numbers, the most significant differences can be observed for small numbers. Good performance for small numbers, however, is essential, as we want to find number signs early on when the robot is still far away. Once the robot is close to a sign, we hope to have an accurate estimate of its position and can afford to not detect it from time to time.
B. Computation Time
Table III lists the computation time required by the described methods. We measured these times by running the number sign recognition systems on our robot in batch mode, i.e. with no other jobs running at the same time and utilizing only one of the two cores. It is worth noting that our own implementation of adaptive thresholding using the integral image is significantly faster
than the implementation provided by OpenCV with a window size of 15 × 15 pixels. The most important improvement with respect to computation time, however, comes from using a run-based algorithm for segmentation as described in the ANN approach. While using runs might not help much when processing images with many very small connected components, e.g. documents with very small letters, binarized images of natural scenes in our experience always contain at least some uniform regions, in which using runs improves performance considerably.
VI. CONCLUSIONS
We have implemented and compared the performance of four methods for detecting number signs on a mobile robot and managed to create a robust system that requires less than 5 ms per image. Using this system, our robot is able to detect number signs in real time even when processing images from three cameras at the same time, while still leaving enough computational resources for other modules running on the robot. Specialized algorithms using efficient simplifications, e.g. binarization, are key to image processing on mobile robots in high-speed environments like robot races. Unfortunately, they usually require easily exploitable properties. For the task described here, we were able to exploit the fact that there is a white region around the black numbers to be detected. This allowed us to implement a robust number sign detection algorithm that is several orders of magnitude faster than general-purpose object detection algorithms. Our robot described in III-A won first place in the SICK robot day robotics challenge 2010 using the artificial neural network-based number sign detection system. A video of our first run at the SICK robot day challenge 2010 can be found at http://www.youtube.com/watch?v=tpILYAUGxfU.
ACKNOWLEDGMENT
The authors would like to thank all other colleagues of our team who participated in the SICK robot day robotics challenge: Karsten Bohlmann, Yasir Niaz Khan, Stefan Laible and Henrik Marks.
REFERENCES
[1] D. Bradley and G. Roth. Adaptive thresholding using the integral image. Journal of Graphics Tools, 12(2):13–21, 2007.
[2] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[3] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary robust independent elementary features. In Kostas Daniilidis, Petros Maragos, and Nikos Paragios, editors, Computer Vision – ECCV 2010, volume 6314 of Lecture Notes in Computer Science, chapter 56, pages 778–792. Springer Berlin / Heidelberg, Berlin, Heidelberg, 2010.
[4] F. Chang. A linear-time component-labeling algorithm using contour tracing technique. Computer Vision and Image Understanding, 93(2):206–220, February 2004.
[5] X. Chen, J. Yang, J. Zhang, and A. Waibel. Automatic detection and recognition of signs from natural scenes. IEEE Transactions on Image Processing, 13(1):87–99, 2004.
[6] F. C. Crow. Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 84), pages 207–212, New York, NY, USA, 1984. ACM.

[7] T. E. de Campos, B. R. Babu, and M. Varma. Character recognition in natural images. In Proceedings of the International Conference on Computer Vision Theory and Applications, Lisbon, Portugal, February 2009.
[8] A. de la Escalera, J. M. Armingol, and M. Mata. Traffic sign recognition and analysis for intelligent vehicles. Image and Vision Computing, 21(3):247–258, 2003.
[9] K. Deb, H.-U. Chae, and K.-H. Jo. Vehicle license plate detection method based on sliding concentric windows and histogram. Journal of Computers, 4(8):771–777, 2009.
[10] Lifeng He, Yuyan Chao, and K. Suzuki. A run-based two-scan labeling algorithm. IEEE Transactions on Image Processing, 17(5):749–756, May 2008.
[11] C. G. Keller, C. Sprunk, C. Bahlmann, J. Giebel, and G. Baratoff. Real-time recognition of U.S. speed signs. In IEEE Intelligent Vehicles Symposium, pages 518–523, Eindhoven, June 2008.
[12] S. Kumano, K. Miyamoto, M. Tamagawa, H. Ikeda, and K. Kan. Development of a container identification mark recognition system. Electronics and Communications in Japan, Part 2, 87(12):38–50, 2004.
[13] M. Lalonde and Y. Li. Detection of road signs using color indexing. Technical Report CRIM-IT-95, Centre de Recherche Informatique de Montreal, 1995.
[14] J. Liang, D. Doermann, and H. Li. Camera-based analysis of text and documents: a survey. International Journal on Document Analysis and Recognition, 7:84–104, 2005. doi:10.1007/s10032-004-0138-z.
[15] J. Ohya, A. Shio, and S. Akamatsu. Recognizing characters in scene images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:214–220, February 1994.
[16] T. Sato, T. Kanade, E. K. Hughes, and M. A. Smith. Video OCR for digital news archive. In Proceedings of the 1998 International Workshop on Content-Based Access of Image and Video Databases (CAIVD 98), pages 52–60, Washington, DC, USA, 1998. IEEE Computer Society.
[17] S. Taylor, E. Rosten, and T. Drummond. Robust feature matching in 2.3 µs. In IEEE CVPR Workshop on Feature Detectors and Descriptors: The State Of The Art and Beyond, June 2009.
[18] A. Treptow, A. Masselli, and A. Zell. Real-time object tracking for soccer-robots without color information. In European Conference on Mobile Robotics (ECMR 2003), pages 33–38, Radziejowice, Poland, 2003.
[19] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57:137–154, May 2004.
[20] K. Wang and J. A. Kangas. Character location in scene images from digital camera. Pattern Recognition, 36(10):2287–2299, October 2003.
[21] P. D. Wellner. Adaptive thresholding for the DigitalDesk. Technical Report EPC-1993-110, 1993.
[22] T. Yamaguchi, Y. Nakano, M. Maruyama, H. Miyao, and T. Hananoi. Digit classification on signboards for telephone number recognition. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, pages 359–363, 2003.
[23] J. Yang, X. Chen, J. Zhang, Y. Zhang, and A. Waibel. Automatic detection and translation of text from natural scenes. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP 02), Orlando, FL, USA, May 2002.
[24] A. Zell, N. Mache, R. Hübner, G. Mamier, M. Vogt, M. Schmalzl, and K.-U. Herrmann. SNNS (Stuttgart Neural Network Simulator). In Josef Skrzypek, editor, Neural Network Simulation Environments, volume 254 of The Springer International Series in Engineering and Computer Science, chapter 9. Kluwer Academic Publishers, Norwell, MA, USA, February 1994.


Terrain Classification with Markov Random Fields on Fused Camera and 3D Laser Range Data
Marcel Häselich, Marc Arends, Dagmar Lang, Dietrich Paulus
Active Vision Group, AGAS Robotics, University of Koblenz-Landau, 56070 Koblenz, Germany
{mhaeselich, marends, dagmarlang, paulus}@uni-koblenz.de

Abstract: In this paper we consider the problem of interpreting the data of a 3D laser range finder. The surrounding terrain is segmented into a 2D grid where each cell can be an obstacle or a negotiable region. A Markov random field models the relationships between neighboring terrain cells and classifies the whole surrounding terrain. This allows us to add context-sensitive information to the grid cells where sensor noise or uncertainties could lead to false classifications. Camera images provide a perfect complement to the laser range data because they add color and texture features to the point cloud. Therefore, camera images are fused with the 3D points and the features from both sensors are considered for classification. We present a novel approach for online terrain classification from fused camera and laser range data by applying a Markov random field. In our experiments we achieved a recall ratio of about 90% for detecting streets and obstacles and show that our approach is fast enough to be used on an autonomous mobile robot in real time.
Index Terms: Multi-sensor fusion, terrain negotiability

This paper is organized as follows. First we present the deployed sensors and their layout in Sec. II. After discussing related work in Sec. III, we briefly describe the Markov random field basics in Sec. IV. The algorithm is described in detail in Sec. V, followed by the experiments in Sec. VI. Finally, we present the conclusions in Sec. VII.
II. HARDWARE

I. INTRODUCTION
Autonomous navigation in unstructured environments is a current and challenging task in robotics. The mobile system needs a detailed interpretation of the surrounding terrain to avoid obstacles and to take the negotiability of the surface into account. A modern 3D Laser Range Finder (LRF) provides a rich and thorough picture of the environment in the form of 3D distance measurements. For a path planning algorithm, the data of the 3D LRF cannot be used directly because of their vast amount. Therefore, as a first step, a reduction of the huge point cloud is necessary and an efficient data structure is essential. Our work provides a direct extension to the terrain analysis by Neuhaus et al. [14], where a data structure was introduced that we use in our work. The laser range measurements alone allow no differentiation between different surfaces. Therefore, we calibrated three cameras to the LRF and can then access the fused data in one coordinate system. This allows us to determine the color and texture of the 3D points in the field of view of each camera. In unstructured environments, the classification of the terrain can be a problem due to sensor noise, varying density of the data, egomotion or shocks on rough terrain. For that reason we apply a Markov random field, which models the relationships in our data structure, to add context-sensitive information to the terrain classification. Our goal is to determine the negotiability of the surrounding terrain with a Markov random field in real time based on the sensors described in the following section.

Fig. 1. Deployed sensors: A 3D laser Velodyne HDL-64E S2 (left) and two different commercially available cameras (Logitech HD Pro Webcam C910 on the upper right and Philips SPC1300NC on the lower right) are used to perceive the environment.

As shown in Fig. 1, we use a Velodyne HDL-64E S2 [8] and two different camera models. The head of the Velodyne consists of 64 lasers which permanently gather data of the environment as the head spins at a frequency of up to 15 Hz around the upright axis. In doing so, the sensor produces a rich dataset of 1.8 million distance measurements per second. The data of one full rotation are accumulated into one point cloud. A Logitech HD Pro Webcam C910 [18] is installed at the front and two Philips SPC1300NC [15] cameras are fixed on the left and the right side of the construction. The sensors are either mounted on top of a 500 kg robot (autonomous driving) or on a car (recording of sensor data).
III. RELATED WORK
There exist various approaches to classify the terrain surrounding an autonomous mobile robot platform. Especially image or laser based strategies are widespread when terrain negotiability information is needed.

The image based strategies either use a single, a stereo or a combined setup of digital and infrared cameras. Konolige et al. [10] and Alberts et al. [1] both use stereo vision approaches to maneuver a vehicle through unstructured environments. Stereo vision allows them to extract traversable regions from the camera video streams. Furthermore, Vernaza et al. [23] present a camera based terrain classification approach. Their approach uses a Markov random field that classifies the image data of a stereo system into obstacle or ground regions for an autonomous robot. Negative obstacles (non-negotiable regions underneath the ground level) present a difficult challenge in non-urban environments. Thermal infrared images have the characteristic that negative obstacles remain warmer than the surrounding terrain during the night. Rankin et al. [17] therefore combine thermal signatures and stereo range data to determine the terrain negotiability. Laser based approaches either work with a 2D LRF, a self-built rotating 2D LRF or a 3D LRF, like the Velodyne HDL-64E S2. Wurm et al. [26] use the laser remission value of a 2D LRF on a pan-tilt unit to classify the surface terrain based on the resulting 3D scans. In this way, they can detect grass-like vegetation and prefer paved routes with their robot. Another approach for terrain classification is presented by Wolf et al. [25]. Their robot uses a 2D LRF oriented towards the ground, records data while driving and produces 3D maps using hidden Markov models. The authors are able to differentiate flat areas from grass, gravel or other obstacles. Vandapel et al. [22] segment 3D distance measurements and classify the segments into three different classes for terrain surface, clutter or wires. Their approach worked with a detailed stationary 3D sensor as well as on a mobile platform with a rotating 2D scanning mount. Ye and Borenstein [27] present an algorithm for terrain mapping with a 2D LRF. Their LRF is mounted at a fixed angle to the ground in front of their robot and creates an elevation map while driving. A color stereo camera and a 2D LRF are used by Manduchi et al. [12] to detect obstacles. The authors present a color-based classification system and an algorithm that analyses the laser data to discriminate between grass and obstacles. Schenk and Csatho [20] fuse aerial images with 3D point clouds to construct surfaces of urban scenes. The surfaces are represented in a 3D object space coordinate system by patches that store the shape and the boundary of the corresponding surface region. A stereo pair of digital cameras, an infrared camera and two 2D LRFs mounted on scanning mounts are used by Wellington et al. [24]. The authors apply multiple Markov random fields that interact through a hidden semi-Markov model that enforces a prior on vertical structures. Their results showed that including the neighborhood structure significantly improved obstacle classification. Beyond terrain classification, another research topic covers the simultaneous localization and mapping (SLAM) problem, where LRFs are used to process the 2D and 3D distance measurements to build maps [5, 13] of the environment. In contrast to mapping, we want to process the sensor data as

fast as possible online on the robot so that a path planning algorithm can directly access these data and perform tasks autonomously in real time. All presented approaches have in common that unstructured environments are described as challenging by the authors. Sensor noise, manifold and complex vegetation, rough terrain and shocks while driving require a solid approach to realize an adequate classification of the terrain. Under the circumstances described, we want to achieve the best classification possible with our sensor data. A statistical approach can handle sensor noise and uncertainties quite well, and for this reason we chose a Markov random field, which will be described in the following section.
IV. MARKOV RANDOM FIELD BASICS
Markov random fields can be regarded as a 2D expansion of Markov chains. This implies that a random state ω_ij at position (i, j) has a neighborhood N_ij including states ω_kl with

ω_ij ∉ N_ij,   (1)
ω_kl ∈ N_ij  ⟺  ω_ij ∈ N_kl.   (2)

The neighboring states can be located at different distances, forming different cliques of neighbors. In our work, we examine the 8 nearest neighbors surrounding a random state in a 2D data representation. Assuming that each state takes a value results in a configuration ω of the random field. A random field can be called a Markov random field if the probability of a configuration takes a positive value,

P(ω) > 0,   (3)

and if a random state depends only on its neighbor states,

P(ω_ij | ω_−ij) = P(ω_ij | ω_kl),  with ω_kl ∈ N_ij,   (4)

where ω_−ij stands for ω \ ω_ij, the set of states without ω_ij. For easier computation and modeling, Markov random fields are often used as Gibbs random fields. Markov random fields and Gibbs random fields can be regarded as equivalent, which is described by the Hammersley-Clifford theorem. In a Gibbs random field the distribution of the random states corresponds to the Gibbs distribution, which is defined as

P(ω) = (1/Z) · exp(−U(ω)/T),   (5)

where T is a temperature parameter, U(ω) is a function which calculates the energy E of a configuration, and Z is a normalizing constant

Z = Σ_ω exp(−U(ω)/T)   (6)

that sums over all possible configurations. The modeling of U(ω) depends on the problem which needs to be solved. Maximizing P(ω) can be achieved by finding a configuration that minimizes U(ω). For this purpose several sampling techniques exist, as it is too time consuming to test every possible configuration.
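As an example of such a sampling technique, a minimal sketch of one Gibbs-sampling sweep over a grid of labels is shown below; the local energy function, the temperature T and all names are placeholders for illustration and not the specific model or implementation used in this paper.

```python
import numpy as np

def gibbs_sweep(labels, num_classes, local_energy, T=1.0, rng=np.random):
    """One sweep of Gibbs sampling over a 2D grid of discrete labels.
    Each site is re-sampled from P(label | neighbors), which is
    proportional to exp(-U_local / T); U_local is supplied by the caller."""
    rows, cols = labels.shape
    for i in range(rows):
        for j in range(cols):
            # Local energy of every candidate label at site (i, j).
            energies = np.array([local_energy(labels, i, j, c)
                                 for c in range(num_classes)], dtype=float)
            p = np.exp(-(energies - energies.min()) / T)  # shift for stability
            p /= p.sum()
            labels[i, j] = rng.choice(num_classes, p=p)
    return labels
```

Lowering T over successive sweeps (simulated annealing) makes the sampler concentrate on low-energy, i.e. high-probability, configurations.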

An introduction to Markov random fields for classification tasks is given by Theodoridis and Koutroumbas [21], and a detailed description together with several possibilities for modeling with Markov random fields is provided by Li [11].
V. MARKOV RANDOM FIELD APPLICATION
After the introduction of Markov random fields, the following section details their application to our sensor data and their use in our framework. The section starts with the data structure and pre-classification and continues with a description of the features. Our Markov model of the terrain and the details of the camera and LRF fusion follow right after. The section ends with a description of the feature parameter estimation for the Markov random field classes.
A. PRINCIPAL COMPONENT ANALYSIS
As a first step, the large point cloud delivered by our sensor is subdivided into smaller chunks, which can then be analyzed with the help of a principal component analysis (PCA) [14]. We use a 2D equidistant grid, centered around the origin of the sensor, to subdivide the point cloud. Our grid has a size of 40 × 40 m and each of the cells measures approximately 50 × 50 cm. The output of the PCA is a pre-classified grid, where each cell can either be flat, obstacle or unknown. A detailed description of the PCA in the form of a hierarchical implementation is provided by Neuhaus et al. [14].
B. FEATURES
Features are essential for classification tasks. We define a set of features which describe the different terrain classes we use in our approach. This set can be divided into two categories. The first type of features is acquired from the data gained by the LRF mounted on our robot. The second type consists of features computed from the camera images of the terrain. The first laser-based feature we use is the roughness f_r of a terrain cell. The roughness can be obtained by calculating the local height disturbance [14]. The roughness feature offers valuable clues about how uneven a terrain cell is. This information is helpful for distinguishing between roads and grass or acres, for example. The second laser-based feature, the height difference f_h, is capable of differentiating between obstacles and negotiable terrain. This feature simply expresses the distance between the height of the lowest and the highest laser point of a terrain cell and was presented by Happold et al. [6]. The usage of this feature is based on our assumption that the main characteristic of an obstacle is its height. Under the assumption that texture differs for diverse types of terrain, we apply features acquired from image data. We use the second angular moment f_sam, variance f_v and inverse difference moment f_idm texture features. These are 3 easy to compute features out of the 14 texture calculations proposed by Haralick et al. [7]. We use another fast to calculate feature, expressing the homogeneity f_fh of a texture, proposed by Knauer et al. [9]. This is based on the assumption that rough terrain has an inhomogeneous texture in contrast to the texture

of even terrain. In addition to the texture features, the color f_c of a terrain cell is also used, which is fast and easy to compute. The combination of all features yields the feature vector f = (f_r, f_h, f_sam, f_v, f_idm, f_fh, f_c) that enters the terrain classification process.
C. MARKOV MODEL OF THE TERRAIN
Regarding the fact that flat terrain cells can either be easy or hard to pass for a robot, the two classes road and rough are used for differentiation. This leads to a total of four possible terrain classes {unknown, road, rough, obstacle}. A Markov model of the terrain should respect the context-sensitivity of neighboring cells and the acquired features. For this reason we chose a Markov random field model used for image segmentation tasks, described by Deng et al. [2]. This model is able to fulfill both requirements and can easily be adapted to the terrain classification problem by regarding each terrain cell as a random state in the Markov random field. Here the energy E of a Gibbs random field is calculated piecewise by splitting it up into a component E_N that models the relationship of the neighboring cells and another component E_f which models the computed features of a cell. The neighborhood component E_N_ij of a cell C_ij at position (i, j) in the grid is defined as

E_N_ij = Σ_{C_kl ∈ N_ij} β · δ(c_ij, c_kl),   (7)

where c_ij is the terrain class assigned to the cell C_ij and c_kl is the terrain class assigned to a cell C_kl which is part of the neighborhood of C_ij. The function δ(·,·) returns a value of −1 for c_ij = c_kl and +1 for c_ij ≠ c_kl. β is used to weight a neighbor's impact according to its distance to the cell. In our approach, we observe the 8 closest neighbors of a cell, and the value of β is fixed and adjusted to our environment. The corresponding feature component E_f_ij of a cell C_ij is based on the assumption that the features of a cell are Gaussian distributed. As an energy, its computation is defined as

E_f_ij = Σ_k [ (f_ijk − μ_ijk)² / (2σ_ijk²) + log(√(2π) · σ_ijk) ],   (8)

where f_ijk is the k-th feature of C_ij, and μ_ijk and σ_ijk are the mean and the standard deviation, respectively, of the k-th feature of the class c_ij assigned to the cell C_ij. Combining the two calculations, the complete energy E_ij of a cell C_ij can be calculated as follows:

E_ij = λ · E_N_ij + E_f_ij,   (9)

where λ is a weighting constant used to control the influence of the two energy types. For classification, the sum of all computed energies E_ij needs to be minimized, which leads to a maximization of the a posteriori probability of the labeling of the terrain. These energies can be minimized by finding a label for each cell which fits best to the computed features and the labels of the neighboring cells. We apply the Gibbs sampler described by Geman and Geman [3] to solve this task.
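For illustration, a minimal sketch of the per-cell energy defined in Eqs. (7), (8) and (9) could look as follows. The variable names are ours, the class statistics mu and sigma are placeholders, and the assignment of the constants 0.4 and 0.8 reported in Sec. VI to β and λ is our assumption, not stated explicitly by the authors.

```python
import numpy as np

def cell_energy(grid_labels, features, i, j, c, mu, sigma, beta=0.4, lam=0.8):
    """Energy of assigning class c to cell (i, j): a neighborhood term over
    the 8 closest cells (Eq. 7) plus a Gaussian feature term (Eq. 8),
    combined as in Eq. 9. features has shape (rows, cols, n_features);
    mu[c] and sigma[c] hold the per-class feature statistics."""
    rows, cols = grid_labels.shape
    e_n = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                # delta returns -1 for equal labels and +1 otherwise (Eq. 7).
                e_n += beta * (-1.0 if grid_labels[ni, nj] == c else 1.0)
    f = features[i, j]
    e_f = np.sum((f - mu[c]) ** 2 / (2.0 * sigma[c] ** 2)
                 + np.log(np.sqrt(2.0 * np.pi) * sigma[c]))
    return lam * e_n + e_f
```

A sampler such as the Gibbs sweep sketched in Sec. IV could then use this function (e.g. bound via functools.partial) as its local energy.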

Fig. 2. Example scene of an unstructured environment with three different representations (best viewed in color). In each representation the images from the left, front and right camera are shown in the upper part. The upper image shows the 3D point cloud provided by the Velodyne HDL-64E S2. The color of a point is determined by its height. On the image in the middle the LRF data are fused with all three cameras. This way color, texture and 3D position are available in one coordinate system. The lower image shows the result of the MRF application: classified terrain cells with negotiability information. The classes unknown, road, rough and obstacle are respectively visualized in black, gray, brown and red.


D. CAMERA AND LIDAR FUSION
Our framework provides the possibility to fuse laser points and image pixels. This is achieved by a scenegraph representation of the used hardware, which provides the transformation matrices between the different sensors and the corresponding intrinsic parameters. Thus we can build a lookup table which returns, for each laser point, the corresponding pixel. Using this table, finding the image region spanned by the measured laser points is straightforward. Further information on how to calibrate a camera to an LRF is available in the literature [19].
E. ESTIMATION OF FEATURE PARAMETERS
The application of the described Markov random field model depends on the knowledge of the means and standard deviations of each feature for each class. Thus, we estimate these parameters by annotating terrain cells by hand with labels out of the used set of classes. From the annotated terrain it is possible to calculate the needed parameters of the features. The laser feature parameters can be applied in different environments, in contrast to the image feature parameters, which need to be estimated separately for each scenario because of the different appearances of the terrain classes. This has the advantage that the estimated parameters are based on real data acquired from the environment, which are carefully classified by a human observer.
VI. EXPERIMENTAL RESULTS
In our experiments we compared our terrain classification results with a ground truth. The ground truth is acquired by a human observer, who annotates the terrain cells by hand. By comparing the results with their respective ground truth, we calculate the true positive rate and the false positive rate for each terrain class. We evaluated the classification for different scenarios in which the relative appearances of the terrain classes differ. An example scene is shown in Fig. 2, where the 3D data, the fused data and the Markov random field result are visualized. Table I shows the results for a scenario where the robot drives on a road through fields and small to medium sized vegetation, and Fig. 3 shows the corresponding ROC curves. In this table the usage of laser based features is denoted by L, selected Haralick features by H, the homogeneity feature by FH and color by C. Furthermore, TPR is the true positive rate, or recall rate, and FPR the corresponding false positive rate. Table II shows the results for a forest scenario, where the class rough appears only at single terrain cells and not in larger regions. In this case, the results show that the detection of the rarely appearing class does not work properly. The reason is the modeling of the terrain, in which we assume that terrain cells of one class show a tendency to group and not to exist as single cells. Furthermore, we observed that the usage of image features does not improve the quality of the results in our approach. A reason for this may be that texture and color vary more than we assumed, which leads to high values for the standard deviations of these features. Thus a high deviation from the mean value of such a feature does not change the computed energy value sufficiently to have an impact on the classification.
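As an illustration of the camera-LRF fusion lookup described in Sec. V-D above, the following sketch projects a single 3D laser point into a camera image using a pinhole model; the transformation matrix, the intrinsic matrix and all names are placeholders and not the calibration of the sensors described in Sec. II.

```python
import numpy as np

def project_laser_point(p_laser, T_cam_laser, K, image_size):
    """Project a 3D LRF point into camera pixel coordinates.
    T_cam_laser: 4x4 transform from the laser to the camera frame
    (e.g. taken from a scenegraph), K: 3x3 intrinsic camera matrix.
    Returns integer pixel coordinates (u, v) or None if not visible."""
    p = T_cam_laser @ np.append(p_laser, 1.0)   # point in the camera frame
    if p[2] <= 0.0:                             # behind the camera
        return None
    uvw = K @ p[:3]
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    w, h = image_size
    if 0 <= u < w and 0 <= v < h:
        return int(u), int(v)                   # index into the color image
    return None
```

A per-point lookup table can be precomputed by applying this projection to every point of an accumulated cloud for each of the calibrated cameras.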

The runtime evaluation, shown in Table III, shows that the usage of image features multiplies the needed calculation time. In this table, PCA denotes the usage of the PCA based algorithm without the application of our Markov random field, and MRF stands for the application of our Markov random field. Using laser features exclusively leads to a fast runtime and good classification results. In all experiments the fixed values β and λ were set to 0.4 and 0.8, respectively.
TABLE I

Class/value      L           L+H         L+FH        L+C
Road/TPR         91.049 %    91.914 %    90.432 %    91.049 %
Road/FPR         1.523 %     1.146 %     1.158 %     1.372 %
Rough/TPR        74.618 %    73.828 %    75.158 %    75.293 %
Rough/FPR        1.190 %     1.103 %     1.306 %     1.422 %
Obstacle/TPR     92.826 %    91.785 %    92.881 %    92.552 %
Obstacle/FPR     3.565 %     4.426 %     3.221 %     3.118 %

TABLE II

Class/value      L           L+H         L+FH        L+C
Road/TPR         95.197 %    91.747 %    95.021 %    94.981 %
Road/FPR         4.336 %     5.571 %     4.312 %     4.499 %
Rough/TPR        5.680 %     0.406 %     2.840 %     3.854 %
Rough/FPR        0.292 %     0.101 %     0.259 %     0.281 %
Obstacle/TPR     96.234 %    95.365 %    96.076 %    96.339 %
Obstacle/FPR     6.900 %     6.686 %     7.276 %     6.793 %

Table I and Table II show our results for a rural and a forest scenario, respectively. The true positive rate (TPR) and the false positive rate (FPR) are presented for the three classes Road, Rough and Obstacle. The columns are separated by the used features: laser based (L), Haralick features (H), the homogeneity feature (FH) and the color information (C).

TABLE III

Mode           Mean          Std. dev.     Max           Min
PCA            9.832 ms      1.757 ms      16.145 ms     5.871 ms
L (MRF)        62.667 ms     7.357 ms      82.240 ms     48.787 ms
L+H (MRF)      744.917 ms    131.507 ms    993.987 ms    421.465 ms
L+FH (MRF)     141.379 ms    11.233 ms     188.420 ms    121.733 ms
L+C (MRF)      144.439 ms    12.566 ms     187.013 ms    117.326 ms

Table III displays the runtime results for our approach. All values represent long term measurements and result from the direct application to real sensor data, without preprocessing or any form of data reduction. The terms L, H, FH and C are equivalent to Table I and II. The runtime of the PCA is compared to the Markov random field application with different features from single sensor and fused sensor data.

VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented an approach to process the fused data of a modern 3D LRF and three cameras with a Markov random field. We thereby integrate context sensitivity and different features from both laser and image data. The application of the Markov random field results in a classified 2D grid with terrain negotiability information for each cell. We annotated terrain cells in different scenarios to learn the class parameters of the features. The resulting grid forms a well-suited data structure for an autonomous system in unstructured environments. We achieved a recall ratio of about 90% for detecting streets and obstacles. Limitations of our approach become apparent at higher speeds. The robot platform we use has a maximum speed of 14 km/h, but

Fig. 3. ROC curves corresponding to the values of Table I. The three curves show the true positive rate in relation to the false positive rate for the three classes Road (cyan), Rough (brown) and Obstacle (blue)

the sensor data were also recorded while driving around with a car. For the usage of our approach on our autonomous system, the presented fusion is sufficient. In order to deal with very fast proper motion, the sensor calibration needs to take into account the instant of time at which the data were recorded. A system of precise time stamps should be able to solve this issue. The computation time of the image features is another difficult aspect. The image features need to be chosen wisely, or a solution for speeding up this cost intensive part is needed. In future work, we will work on an integration of time into the Markov random fields. By accessing previous classifications from the current one, we hope to fill in spots with missing or sparse data, for example in regions with negative obstacles or in the immediate vicinity around our sensor. Another quite promising extension is a self-supervised learning of the feature parameters. In doing so, our approach can adapt to various situations and is able to improve during runtime if the features are relearned under certain circumstances. In addition, we will investigate the application of graph cuts [4] to the Markov random field data structure. The Markov random field and the fused data provide the opportunity to extract semantic information in real time from the sensor readings. This information can be used to create further information for 3D maps [16] and result in a semantic 3D map.
ACKNOWLEDGMENT
This work was partially funded by Wehrtechnische Dienststelle 51 (WTD), Koblenz, Germany.
REFERENCES
[1] J. Alberts, D. Edwards, T. Soule, M. Anderson, and M. O'Rourke. Autonomous navigation of an unmanned ground vehicle in unstructured forest terrain. In Proceedings of the ECSIS Symposium on Learning and Adaptive Behaviors for Robotic Systems, pages 103–108, 2008.
[2] H. Deng and D. A. Clausi. Unsupervised image segmentation using a simple MRF model with a new implementation scheme. Pattern Recognition, 2004.
[3] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.

[4] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, Series B (Methodological), 1989.
[5] J. Guivant, E. Nebot, and S. Baiker. Autonomous navigation and map building using laser range sensors in outdoor applications. Journal of Robotic Systems, 2000.
[6] M. Happold, M. Ollis, and N. Johnson. Enhancing supervised terrain classification with predictive unsupervised learning. In Proceedings of Robotics: Science and Systems, 2006.
[7] R. M. Haralick, I. Dinstein, and K. Shanmugam. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, pages 610–621, 1973.
[8] Velodyne Lidar Inc. Velodyne Lidar. http://www.velodyne.com/lidar, April 2011.
[9] U. Knauer and B. Meffert. Fast computation of region homogeneity with application in a surveillance task. In Proceedings of the ISPRS Commission V Mid-Term Symposium on Close Range Image Measurement Techniques, 2010.
[10] K. Konolige, M. Agrawal, R. C. Bolles, C. Cowan, M. Fischler, and B. Gerkey. Outdoor mapping and navigation using stereo vision. In Proceedings of the International Symposium on Experimental Robotics, pages 179–190, 2006.
[11] S. Z. Li. Markov Random Field Modeling in Computer Vision. Springer, 2009.
[12] R. Manduchi, A. Castano, A. Talukder, and L. Matthies. Obstacle detection and terrain classification for autonomous off-road navigation. Autonomous Robots, 2004.
[13] M. Montemerlo and S. Thrun. Large-scale robotic 3-D mapping of urban structures. In Experimental Robotics IX, pages 141–150. Springer, 2006.
[14] F. Neuhaus, D. Dillenberger, J. Pellenz, and D. Paulus. Terrain drivability analysis in 3D laser range data for autonomous robot navigation in unstructured environments. In Proceedings of the IEEE International Conference on Emerging Technologies and Factory Automation, pages 1686–1689, 2009.
[15] Philips Electronics N.V. SPC1300NC. http://www.p4c.philips.com/cgi-bin/dcbint/cpindex.pl?ctn=SPC1300NC/00, April 2011.
[16] J. Pellenz, D. Lang, F. Neuhaus, and D. Paulus. Real-time 3D mapping of rough terrain: A field report from Disaster City. In Proceedings of the IEEE International Workshop on Safety, Security and Rescue Robotics, 2010.
[17] A. Rankin, A. Huertas, and L. Matthies. Negative obstacle detection by thermal signature. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 906–913, 2003.
[18] Logitech Europe S.A. HD Pro C910. http://www.logitech.com/de-de/webcam-communications/webcams/devices/6816, April 2011.
[19] D. Scaramuzza, A. Harati, and R. Siegwart. Extrinsic self calibration of a camera and a 3D laser range finder from natural scenes. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4164–4169, 2007.
[20] T. Schenk and B. Csatho. Fusing imagery and 3D point clouds for reconstructing visible surfaces of urban scenes. In Proceedings of the IEEE GRSS/ISPRS Joint Workshop on Remote Sensing and Data Fusion over Urban Areas, 2007.
[21] S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 2009.
[22] N. Vandapel, D. Huber, A. Kapuria, and M. Hebert. Natural terrain classification using 3-D LADAR data. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 5117–5122, 2004.
[23] P. Vernaza, B. Taskar, and D. D. Lee. Online, self-supervised terrain classification via discriminatively trained submodular Markov random fields. In IEEE International Conference on Robotics and Automation, 2008.
[24] C. Wellington, A. Courville, and A. Stentz. Interacting Markov random fields for simultaneous terrain modeling and obstacle detection. In Proceedings of Robotics: Science and Systems, 2005.
[25] D. F. Wolf, G. Sukhatme, D. Fox, and W. Burgard. Autonomous terrain mapping and classification using hidden Markov models. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 2026–2031, 2005.
[26] K. M. Wurm, R. Kümmerle, C. Stachniss, and W. Burgard. Improving robot navigation in structured outdoor environments by identifying vegetation from laser data. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1217–1222, 2009.
[27] C. Ye and J. Borenstein. A new terrain mapping method for mobile robots obstacle negotiation. In Proceedings of the UGV Technology Conference at the SPIE AeroSense Symposium, pages 21–25, 2003.


Hierarchical Multi-Modal Place Categorization


Andrzej Pronobis    Patric Jensfelt
Centre for Autonomous Systems, Royal Institute of Technology, Stockholm, Sweden

Abstract: In this paper we present a hierarchical approach to place categorization. Low level sensory data is processed into more abstract concepts, named properties of space. The framework allows for fusing information from heterogeneous sensory modalities and a range of derivatives of their data. Place categories are defined based on the properties, which decouples them from the low level sensory data. This allows for better scalability, both in terms of memory and computation. The probabilistic inference is performed in a chain graph which supports incremental learning of the room category models. Experimental results are presented where the shape, size and appearance of the rooms are used as properties along with the number of objects of certain classes and the topology of space.
Index Terms: place categorization; graphical models; semantic mapping; machine learning

I. INTRODUCTION
The topic of this paper is place categorization, denoting the problem of assigning a label (kitchen, office, corridor, etc.) to each place in space. To motivate why this is useful, consider a domestic service robot. Such a robot should be able to speak the language of the operator/user to minimize training efforts and to be able to understand what the user is saying. That is, the robot should be able to make use of high level concepts such as rooms when communicating with a person, both to verbalize spatial knowledge and to process information received from the human in an efficient way. Besides robustness and speed, there are a number of additional desirable characteristics of a place categorization system:
C1: Categorization. The system should support true categorization and not just recognition of room instances. That is, it should be able to classify an unknown room as a kitchen and not only recognize the kitchen.
C2: Spatio-temporal integration. The system should support integration over space and time, as the information acquired at a single point rarely provides enough evidence for reliable categorization.
C3: Multiple sources of information. No single source of information will be enough in all situations and it is thus important to be able to make use of as much information as possible.
C4: Handles input at various levels of abstraction. The system should not only be able to use low level sensor data but also higher level concepts such as objects.
C5: Automatically detect and add new categories. The system should be able to augment the model with new categories identified from data.
C6: Scalability and complexity. The system should be scalable both in terms of memory and computations. That is, for example, it should not be a problem to double the number of

room categories.
C7: Automatic and dynamic segmentation of space. The system should be able to segment space into areas (such as rooms) automatically and should be able to revise its decision if new evidence suggesting another segmentation is received.
C8: Support life-long incremental learning. The robot system cannot be supplied with all the information at production time; it needs to learn along the way in an incremental fashion throughout its life.
C9: Measure of certainty. There are very few cases where the categorization can be made without uncertainty, due to imperfections in sensing but also to model ambiguities. Ideally the system should produce a probability distribution over all categories, or at least say something about the certainty of the result.
In our previous work we have designed methods that meet C1, C3, C7 and partly C2, C4 and C9. In this paper we will improve on C4 and C9 and add C6 and C7. The main contribution of the paper relates to C4, C6 and C9.
A. Outline
Section II presents related work and describes our contribution with respect to it. Section III describes our method and Section IV provides implementation details. Finally, Section V describes the experimental evaluation and Section VI draws some conclusions and discusses future work.
II. RELATED WORK
In this section we give an overview of the related work in the area of place recognition and categorization. Place categorization has been addressed both by the computer vision and the robotics community. In computer vision the problem is often referred to as scene categorization. Although also related, object categorization methods are not covered here. However, we believe that objects are key to understanding space and we will include them in our representation, but we will make use of standard methods for recognizing/categorizing them. Table I maps some of the methods presented below to the desired characteristics presented in the previous section. In computer vision one of the first works to address the problem of place categorization is [19], based on the so-called gist of a scene. One of the key insights in the paper is that context is very important for recognition and categorization of both places and objects and that these processes are intimately connected. Place recognition is formulated in the context of localization and information about the connectivity of space is utilized in an HMM. Place categorization is also addressed using an HMM. In [23] the problem of grouping images into semantic categories is addressed. It is pointed out that many


TABLE I
CHARACTERIZING SOME OF THE PLACE CATEGORIZATION WORK ([19], [23], [20], [10], [12], [14], [9, 16], [13], [26], [15], [24], [18], [17], [22], [21] AND THIS WORK) BASED ON THE DESIRABLE CHARACTERISTICS C1–C9 FROM SECTION I.

natural scenes are ambiguous and the performance of the system is often quite subjective. That is, if two people are asked to sort the images into different categories, they are likely to come up with different partitions. [23] argue that typicality is a key measure to use in achieving meaningful categorizations. Each cue used in the categorization should be assigned a typicality measure to express the uncertainty when used in the categorization, i.e. the saliency of that cue. The system is evaluated on natural outdoor scenes. In [4] another method is presented for categorization of outdoor scenes, based on representing the distribution of codewords in each scene category. In [25] a new image descriptor, PACT, is presented and shown to give superior results on the datasets used in [19, 4]. In robotics, one of the early systems for place recognition is [20], where color histograms are used to model the appearance of places in a topological map and place recognition is performed as a part of the localization process. Later, [10] uses laser data to extract a large number of features used to train classifiers using AdaBoost. This system shows impressive results based on laser data alone. The system is not able to identify and learn new categories: adding a new category required offline re-training, there was no measure of certainty, and it segmented space only implicitly by providing an estimate of the category for every point in space. In [12] this work is extended to also incorporate visual information in the form of object detections. Furthermore, this work also adds an HMM on top of the point-wise classifications to incorporate information about the connectivity of space and make use of information such as that offices are typically connected to corridors. In [14] a vision-only place recognition system is presented. Support Vector Machines (SVMs) are used as classifiers. The characteristics are similar to those of [10]: it cannot identify and learn new categories on-line, it only works with data from a

single source, and classification is done frame by frame. In [9, 16] a version of the system supporting incremental learning is presented. The other limitations remain the same. In [13] a measure of confidence is introduced as a means to better fuse different cues and also provide the consumer of the information with some information about the certainty of the end result. In [15] the works in [10, 14] are combined using an SVM on top of the laser and vision based classifiers. This allows the system to learn what cues to rely on in what room category. For example, in a corridor the laser based classifier is more reliable than vision, whereas in rooms the laser does not distinguish between different room types. Segmentation of space is done based on detecting doors, which are assumed to delimit the rooms. Evidence is accumulated within a room to provide a more robust and stable classification. It is also shown that the method supports categorization and not only recognition. In [24] the work from [25] is extended with a new image descriptor, CENTRIS, and a focus on visual place categorization in indoor environments for robotics. A database, VPC, for benchmarking of vision based place categorization systems is also presented. A Bayesian filtering scheme is added on top of the frame based categorization to increase robustness and give smoother category estimates. In [17] the problem of place categorization is addressed in a drastically different and novel way. The problem is cast in a fully probabilistic framework which operates on sequences rather than individual images. The method uses change point detection to detect abrupt changes in the statistical properties of the data. A Rao-Blackwellized particle filter implementation is presented for the Bayesian change point detection to allow for real-time performance. All information deemed to belong to the same segment is used to estimate the category for that segment using a bag-of-words technique. In [27] a system for clustering panoramic images into convex regions of space indoors is presented. These regions correspond roughly to the human concept of rooms and are defined by the similarity between the images. In [21] panoramic images from indoor and outdoor scenes are clustered into topological regions using incremental spectral clustering. These clusters are defined by appearance and the aim is to support localization rather than human-robot interaction. The clusters therefore have no obvious semantic meaning. As mentioned above, [12] makes use of object observations to perform the place categorization. In [6] objects play a key role in the creation of semantic maps. In [18] a 3D model centered around objects is presented as a way to model places and to support place recognition. In [22] a Bayesian framework for connecting objects to place categories is presented. In [26] the work in [12] is combined with detections of objects to deduce the specific category of a room in a first-order logic way.
A. Contributions
In this paper we contribute a method for hierarchical categorization of places. The method can make use of a very diverse set of input data, potentially also including spoken dialogue. We make use of classical classifiers (an SVM in our

[Table/figure residue: capability labels C1: Categorization, C2: Spatio/temporal, C3: Multi source, C4: Multi levels, C6: Scalability, C7: Segmentation, C8: Incremental, C9: Uncertainty.]

case, building on the work in [15]) and a graphical model to fuse information at a higher level. The categorical models for rooms are based on so-called properties of space, rather than on the low-level sensor characteristics used in most of the other work presented above. This also means that a new category can be defined without the need to retrain from the sensor-data level. The properties decouple the system. The introduction of properties also makes the system more scalable, as the low-level resources (memory for models and computations for classifiers) can be shared across room categorizers. The system we present still relies on the detection of doors, as in [15], but the graphical model allows us to add and remove these doors and thus change the segmentation of space. The system will automatically adjust the category estimates for each room, taking into account the new topology of space.

III. HIERARCHICAL MULTI-MODAL CATEGORIZATION

We pose the problem of place categorization as that of estimating the probability distribution of category labels, ci, over places, pj. That is, we want to estimate p(ci, pj). We consider a discrete set of places rather than a continuous space. In our implementation the places are spread out over space like bread crumbs, every one meter [26]. The places become nodes (representing free space) in a graph covering the environment. Edges are added when the robot has traveled directly between two nodes. In our previous work [26] we performed place categorization by combining a room/corridor classifier (based on [10]) with an ontology that related objects to specific room types. For example, we inferred being in a living room if the classification system reported a room and a sofa and a TV set were found (objects associated with a living room according to the ontology). This method had some clear and severe shortcomings that made it appropriate only for illustrating ideas rather than being a real-world categorization system in anything but simple and idealized test scenarios. Furthermore, because the system was unable to retract inferred information, any categorization was crisp and set in stone. Conceptually, the solution has several appealing traits. It allowed us to teach the system, at a symbolic level, to distinguish different room categories simply by assigning specific objects to them. It combined information from low-level sensor data (to classify room/corridor) with high-level concepts such as objects. The place categorization system in this paper provides a principled way to maintain the advantages mentioned above even in natural environments. Our approach is based on the insight that what made the previous system easy to re-train was that the categorization was based on high-level concepts rather than on low-level sensor data. For this purpose, we introduce what we call properties of space, where in the previous system the properties corresponded to the existence of certain types of objects. In general, these properties could be related to, for example, the size, shape and appearance of a place. The introduction of properties decomposes our approach hierarchically. The categories are defined based on the properties and the properties are defined based on sensor data, either directly or in further hierarchies. This is closely related to the

work on part-based object recognition and categorization [3]. The property-based decomposition buys us better scalability in several ways. Instead of having to build a model from the level of sensor data for every new category, we can reuse the low-level concepts. This saves memory (models for SVMs can be hundreds of megabytes in size) and saves computations (calculations are shared across categories). The introduction of properties also makes training easier. Once we have the models for the properties, training the system for a new category is decoupled from low-level sensor data. The properties can be seen as high-level basis functions on which the categories are defined, providing a significant dimensionality reduction. The graph made up of the free-space nodes can be used to impose topological constraints on the places as well, and helps lay the foundation for the segmentation process.
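As an illustration of this decomposition, the sketch below shows one possible representation of such a place graph, in which places carry property estimates and crisp door places split the graph into areas. It is a minimal sketch under our own assumptions (class names, example property values and the segmentation routine are hypothetical), not the authors' implementation.

```python
# Illustrative place graph: places carry property beliefs, door places separate
# the graph into areas. Names and values are hypothetical, for illustration only.
from collections import defaultdict

class PlaceGraph:
    def __init__(self):
        self.edges = defaultdict(set)   # place id -> neighbouring place ids
        self.is_door = {}               # place id -> bool (crisp doorway property)
        self.properties = {}            # place id -> dict of property beliefs

    def add_place(self, pid, is_door=False, properties=None):
        self.is_door[pid] = is_door
        self.properties[pid] = properties or {}
        self.edges.setdefault(pid, set())

    def add_edge(self, a, b):           # robot travelled directly between a and b
        self.edges[a].add(b)
        self.edges[b].add(a)

    def areas(self):
        """Connected components of non-door places; door places act as separators."""
        seen, result = set(), []
        for pid in self.edges:
            if pid in seen or self.is_door[pid]:
                continue
            stack, comp = [pid], set()
            while stack:
                p = stack.pop()
                if p in seen or self.is_door[p]:
                    continue
                seen.add(p)
                comp.add(p)
                stack.extend(self.edges[p])
            result.append(comp)
        return result

# Usage: three places, the middle one a doorway, gives two areas.
g = PlaceGraph()
g.add_place(0, properties={"shape": {"elongated": 0.8, "square": 0.2}})
g.add_place(1, is_door=True)
g.add_place(2, properties={"appearance": {"office-like": 0.7, "kitchen-like": 0.3}})
g.add_edge(0, 1); g.add_edge(1, 2)
print(g.areas())   # -> [{0}, {2}]
```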

Fig. 1. Structure of the graphical model for the places showing the influence of the properties and the topology on the categorization and segmentation.

We use a graphical model to structure the problem, starting from the place graph. More precisely, we will use a probabilistic chain graph model [8]. Chain graphs are a natural generalization of directed (Bayesian Networks) and undirected (Markov Random Fields) graphical models. As such, they allow for modelling both directed causal as well as undirected symmetric or associative relationships, including circular dependencies. Figure 1 shows our graphical model. The structure of the model depends on the topology of the environment. Each discrete place is represented by a set of random variables connected to variables representing the semantic category of areas. Moreover, the category variables are connected by undirected links to one another according to the topology of the environment. The potential functions \phi_{rc}(\cdot,\cdot) represent the knowledge about the connectivity of areas of certain semantic categories (e.g. kitchens are typically connected to corridors). The remaining variables represent properties of space. These can be connected to observations of features extracted directly from the sensory input. Finally, the functions p_{p_1}(\cdot|\cdot), p_{p_2}(\cdot|\cdot), \ldots, p_{p_N}(\cdot|\cdot) model spatial properties. The joint density f of a distribution that satisfies the Markov property associated with a chain graph can be written as [8]:

    f(x) = \prod_{\tau \in T} f(x_\tau \mid x_{\mathrm{pa}(\tau)}),


where \mathrm{pa}(\tau) denotes the set of parents of the vertices \tau. This corresponds to an outer factorization which can be viewed as a directed acyclic graph with vertices representing the multivariate random variables X_\tau, for \tau in T (one for each chain component). Each factor f(x_\tau \mid x_{\mathrm{pa}(\tau)}) factorizes further into:

    f(x_\tau \mid x_{\mathrm{pa}(\tau)}) = \frac{1}{Z(x_{\mathrm{pa}(\tau)})} \prod_{\alpha \in A(\tau)} \phi_\alpha(x_\alpha),

where A(\tau) represents sets of vertices in the moralized undirected graph G_{\tau \cup \mathrm{pa}(\tau)}, such that in every set there exist edges between every pair of vertices in the set. The factor Z normalizes f(x_\tau \mid x_{\mathrm{pa}(\tau)}) into a proper distribution. In order to perform inference on the chain graph, we first convert it into a factor graph representation [1]. To meet the real-time constraints posed by most robotics applications, we then use an approximate inference engine, namely Loopy Belief Propagation [11].

IV. IMPLEMENTATION

In our implementation, each object class results in one property, encoding the expected/observed number of such objects. In addition, we use the following properties:
- shape (e.g. elongated, square), extracted from laser data
- size (e.g. large, compared to other typical rooms), extracted from laser data
- appearance (e.g. office-like appearance), extracted from visual data
- doorway (is this place in a doorway), extracted from laser data
In indoor environments, rooms tend to share similar functionality and semantics. In this work we cluster places into areas based on the door property of places (using the door detector from [15]). The doorway property is considered to be crisp. The door places are not part of the chain graph but rather act as edges between areas. However, the graphical model allows us to easily change the topology if new information becomes available. The overall system therefore performs segmentation automatically, and its dynamic nature is based on re-evaluating the existence of doors. Figure 2 illustrates how the places (small circles) are segmented into areas (ellipses) by the existence of doors (small red circles) and how this defines the topology of the areas. We build on the work in [15] when defining the property categorizers for shape, size and appearance (see [15] for details). The categorizers are based on Support Vector Machines (SVMs) and the models are trained on features extracted directly from the robot's sensory input. A set of simple geometrical features [10] is extracted from laser range data in order to train the shape and size models. The appearance models are built from two types of visual cues: a global one, Composed Receptive Field Histograms (CRFH), and a local one based on SURF features discretized into visual words [2]. The two visual features are further integrated using the Generalized Discriminative Accumulation Scheme (G-DAS [15]). The models are trained from sequences of images and laser range data recorded in multiple instances

Fig. 2. The set of places, {p_i}, is segmented into areas based on the door places. The doors form the edges in the topological area graph.

of rooms belonging to different categories and under various illumination settings (during the day and at night). By including several different room instances in the training, the acquired model can generalize sufficiently to provide categorization rather than instance recognition. The estimate of the uncertainty in the categorization results is based on the distances between the classified samples and the discriminative model hyperplanes (see [13] for details). To learn the probabilities associated with the relations between rooms, objects, shapes, sizes and appearances, we analyzed common-sense resources available online (for details see [7]) and the annotated data in the COLD-Stockholm database1. The relations between rooms and objects were bootstrapped from part of the Open Mind Indoor Common Sense database2. The object-location pairs found through this process were then used to form queries of the form "obj in the loc" that were fed to an online image search engine. The number of hits returned was used as a basis for the probability estimate. Relations that were not found this way were assigned a low default probability so as not to rule them out completely.
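As a hedged illustration of this bootstrapping step, the sketch below turns raw hit counts into object-presence probabilities with a low default floor for relations that returned no hits; the counts, the floor value and the function name are invented for the example.

```python
# Hypothetical sketch of turning image-search hit counts into P(object | room).
# The example counts and the default floor are illustrative only.
def presence_probabilities(hit_counts, default_floor=0.01):
    """hit_counts: dict mapping object name -> number of search-engine hits
    for the query "<object> in the <room>". Returns normalized probabilities,
    assigning a small default to objects with no hits."""
    total = sum(hit_counts.values())
    probs = {}
    for obj, hits in hit_counts.items():
        if hits == 0 or total == 0:
            probs[obj] = default_floor          # do not rule the relation out
        else:
            probs[obj] = max(hits / total, default_floor)
    return probs

print(presence_probabilities({"computer": 1200, "sofa": 30, "oven": 0}))
```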

Fig. 3. The Poisson distributions modelling the existence of a certain number of objects in a room, on the example of computers present in a double office and a professor's office.

The conditional probability distributions p_{p_i}(\cdot|\cdot) for the object properties are represented by Poisson distributions. The parameter \lambda of the distribution allows us to set the expected number of object occurrences. This is exemplified in Fig. 3, which shows two distributions corresponding to the relation between the number of computers in a double office and a professor's office.
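A minimal sketch of such an object-count property is given below. The value λ = 2 mirrors the double-office example of Fig. 3, and the helper that recovers λ from the probability of observing no objects reflects the matching rule described in the following paragraph; the function names are ours.

```python
# Illustrative Poisson object-count property (not the authors' code).
# lam = 2 mirrors the double-office example in Fig. 3; the recovery of lam
# from the no-object probability follows the matching rule described below.
import math

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam ** n / math.factorial(n)

def lam_from_no_object_prob(p_zero):
    # p(n = 0) = exp(-lam)  =>  lam = -ln p(n = 0)
    return -math.log(p_zero)

print([round(poisson_pmf(n, 2.0), 3) for n in range(5)])   # double office, lam = 2
print(round(lam_from_no_object_prob(0.4), 3))              # lam for p(no object) = 0.4
```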
1 http://www.cas.kth.se/cold-stockholm 2 http://openmind.hri-us.com/


In the specific case of the double office, we set the expected number of computers to two. In all remaining cases the parameter \lambda is estimated by matching p_\lambda(n = 0) with the probability of there being no objects of a certain category according to the common-sense knowledge databases (i.e., \lambda = -\ln p(no object), since p_\lambda(n = 0) = e^{-\lambda}).

V. EXPERIMENTS

A. Experimental Setup

The COLD-Stockholm database contains data from four floors. We divide the database into two subsets. For training and validation, we used the data acquired on floors 4, 5 and 7. The data acquired on floor 6 is used for the evaluation of the performance of the property classifiers and for the real-world experiment. For the purpose of the experiments presented in this paper, we have extended the annotation of the COLD-Stockholm database to include 3 room shapes (elongated, square and rectangular), 3 room sizes (small, medium and large) as well as 7 general appearances (anteroom-, bathroom-, hallway-, kitchen-, lab-, meetingroom- and office-like). The room size and shape were decided based on the length ratio and maximum length of the edges of a rectangle fitted to the room outline. These properties, together with 6 object types, defined the 11 room categories used in our experiments, see Figure 5.

Fig. 4. Confusion matrices for the evaluation of the property categorizers (shape, size, appearance CRFH, appearance BOW-SURF, appearance CRFH + BOW-SURF).

C. Real-world experiments

In the real-world experiment the robot was joysticked around manually. The robot started with only the models obtained in the evaluation of the property categorizers. Laser-based SLAM [5] was performed while moving, and new places were added every meter traveled into unexplored space. The robot was driven through 15 different rooms while performing real-time place categorization, without relying on any previous observations of this particular part of the environment. The object observations were provided by human input. This information enters the chain graph in exactly the same way as real object detections would. Figure 5 illustrates the performance of the system during part of a run. The 11 categories can be found along the vertical axis. The ground truth for the room category is marked with a box with thick dashed lines. The maximum a posteriori (MAP) estimate for the room category is indicated with white dots. The system correctly identified the first two rooms as a hallway and a single office using only shape, size and general appearance (no objects were found). The next room was properly classified as a double office. The MAP estimate switches to professor's office for a short while when one computer is found, and switches back again when a second is found. After some initial uncertainty, where the MAP switches category several times, the next room is classified as a double office until the robot finds a computer, at which point it switches to professor's office. Later the robot enters a robot lab which, according to its models, is very similar to a computer lab. Initially there is a slightly higher probability for the hypothesis that it is a computer lab, but once the robot detects a robot arm the robot-lab hypothesis completely dominates. The next non-hallway room is a single-person office currently occupied by a number of Master's students. Because of its current appearance, the best match is a double office. The robot continues and the rest of the categorizations are correct. The system is able to perform the categorization in real time and, as can be seen, these preliminary results indicate that the accuracy is quite good.

B. Evaluation of Property Categorizers

The performance of each of the property categorizers was evaluated separately. Training and validation datasets were formed by grouping rooms having the same values of the properties. The parameters of the models were obtained by cross-validation. All training and validation data were then collected together and used for training the final models, which were evaluated on test data acquired in previously unseen rooms. Table II presents the results of the evaluation. The classification rates were obtained separately for each of the classes and then averaged in order to exclude the influence of an unbalanced test set. As can be seen, all classifiers provided recognition rates of about 80% or above. Furthermore, integrating the two visual cues (CRFH and BOW-SURF) increased the classification rate of the appearance property by almost 5%. From the confusion matrices in Fig. 4 we see that confusion occurs mainly between property values that are semantically close.
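The class-averaged rate used here can be computed as in the following sketch; the labels are made up for illustration and this is not the authors' evaluation code. Per-class accuracies are computed first and then averaged, so that large classes do not dominate the score.

```python
# Macro-averaged classification rate: per-class accuracy averaged over classes,
# which removes the influence of an unbalanced test set. Labels are illustrative.
from collections import defaultdict

def class_averaged_rate(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

y_true = ["elongated"] * 8 + ["square"] * 2
y_pred = ["elongated"] * 8 + ["elongated", "square"]
print(class_averaged_rate(y_true, y_pred))   # 0.75, not the 0.9 plain accuracy
```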
Property      Cues                     Classification rate
Shape         Geometric features       84.9%
Size          Geometric features       84.5%
Appearance    CRFH                     80.5%
Appearance    BOW-SURF                 79.9%
Appearance    CRFH + BOW-SURF          84.9%

TABLE II. Classification rates for each of the properties and cues.


Fig. 5. Visualization of the beliefs about the categories of the rooms. The room category ground truth is marked with thick dashed lines while the MAP value is indicated with white dots.

VI. CONCLUSIONS AND FUTURE WORK

In this paper we have presented a probabilistic framework combining multi-modal and uncertain information in a hierarchical fashion. So-called properties were introduced as a way to model high-level characteristics of the environment. These properties gave us a way to decouple the categorization into categorization of the properties, based on low-level sensor information, and categorization of high-level concepts such as rooms, based on the properties. A chain graph model was used for the probabilistic inference. We provided an initial evaluation of the system which indicates that it works well in practice. Part of the future work is to evaluate the system more thoroughly. It is important to note that we are not able to evaluate our system on other databases such as VPC [24], as it does not contain laser data. We will also investigate the use of the place categorization system in semantic mapping.

ACKNOWLEDGMENT

This work was supported by the SSF through its Centre for Autonomous Systems (CAS), and by the EU FP7 project CogX.

REFERENCES
[1] An introduction to factor graphs. IEEE Signal Processing Magazine, 21:2841, January 2004. [2] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Proc. of ECCV06, 2006. [3] G. Bouchard and B. Triggs. Hierarchical part-based visual object categorization. 2005. [4] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. In In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005. [5] J. Folkesson, P. Jensfelt, and H. I. Christensen. The m-space feature representation for SLAM. IEEE Trans. Robotics, 23(5):10241035, October 2007. [6] C. Galindo, A. Safotti, S. Coradeschi, P. Buschka, J. A. FernandezMadrigal, and J. Gonzalez. Multi-hierarchical semantic maps for mobile robotics. In IROS, August 2005. [7] Marc Hanheide, Nick Hawes, Charles Gretton, Alper Aydemir, Hendrik Zender, Andrzej Pronobis, Jeremy Wyatt, and Moritz G belbecker. o Exploiting probabilistic knowledge under uncertain sensing for efcient robot behaviour. In IJCAI11, 2011. [8] S. L. Lauritzen and T. S. Richardson. Chain graph models and their causal interpretations. J. Roy. Statistical Society, Series B, 64(3):321 348, 2002.

[9] J. Luo, A. Pronobis, B. Caputo, and P. Jensfelt. Incremental learning for place recognition in dynamic environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS07), San Diego, CA, USA, October 2007. [10] O. Martnez Mozos, C. Stachniss, and W. Burgard. Supervised learning of places from range data using adaboost. In ICRA05, 2005. [11] J. M. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. J. Mach. Learn. Res., 11:21692173, August 2010. [12] Oscar Martnez Mozos, Rudolph Triebel, Patric Jensfelt, Axel Rottmann, and Wolfram Burgard. Supervised semantic labeling of places using information extracted from laser and vision sensor data. Robotics and Autonomous Systems Journal, 55(5):391402, May 2007. [13] A. Pronobis and B. Caputo. Condence-based cue integration for visual place recognition. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS07), San Diego, CA, USA, October 2007. [14] A. Pronobis, B. Caputo, P. Jensfelt, and H.I. Christensen. A discriminative approach to robust visual place recognition. In IROS06, 2006. [15] A. Pronobis, O. Martinez Mozos, B. Caputo, and P. Jensfelt. Multimodal semantic place classication. IJRR, 29(2-3), February 2010. [16] Andrzej Pronobis, Jie Luo, and Barbara Caputo. The more you learn, the less you store: Memory-controlled incremental SVM for visual place recognition. Image and Vision Computing (IMAVIS), March 2010. [17] Ananth Ranganathan. Pliss: Detecting and labeling places using online change-point detection. In RSS, 2010. [18] Ananth Ranganathan and Frank Dellaert. Semantic modeling of places using objects. In RSS, 2007. [19] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Contextbased vision system for place and object recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV03), pages 273280, 2003. [20] Iwan Ulrich and Ilah Nourbakhsh. Appearance-based place recognition for topological localization. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA00), volume 2, pages 10231029, April 2000. [21] C. Valgren and A. Lilienthal. Incremental spectral clustering and seasons: Appearance-based localization in outdoor environments. In ICRA 2008, pages 18561861. IEEE, 2008. [22] S. Vasudevan and R. Siegwart. Bayesian space conceptualization and place classication for semantic maps in mobile robotics. Robot. Auton. Syst., 56:522537, June 2008. [23] J. Vogel and B. Schiele. A semantic typicality measure for natural scene categorization. Pattern Recognition, pages 195203, 2004. [24] Jianxin Wu, Henrik I. Christensen, and James M. Rehg. Visual place categorization: Problem, dataset, and algorithm. In In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS09), 2009. [25] Jianxin Wu and James M. Rehg. Where am i: Place instance and category recognition using spatial pact. In In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Anchorage, Alaska, June 2008. [26] H. Zender, O. M. Mozos, P. Jensfelt, G.-J. M. Kruijff, and W. Burgard. Conceptual spatial representations for indoor mobile robots. Robotics and Autonomous Systems, 56(6):493502, June 2008. [27] Zoran Zivkovic, Olaf Booij, and Ben Kr se. From images to rooms. o Robotics and Autonomous Systems, special issue From Sensors to Human Spatial Concepts, 55(5):411418, May 2007.


Mobile Robot Odor Plume Tracking Using Three Dimensional Information


Luís Osório, Gonçalo Cabrita and Lino Marques
Abstract—This paper studies the utilization of data gathered from multiple gas sensors placed vertically on a wheeled mobile robot to perform plume tracking strategies. The main research questions are whether this is advantageous and, if so, how large the advantage is in terms of plume tracking compared to traditional two-dimensional plume acquisition. This was accomplished by testing a few well-known plume tracking algorithms (Surge-Cast, Casting and Gradient following) against a modified version of each one of the algorithms, designed to take into account odor data from the vertical gas sensor array. Extensive real-world experiments were conducted in a large indoor environment. It was found that plume tracking with three-dimensional acquisition outperformed the two-dimensional variants in the tested environment by about 25%. Index Terms—Three-dimensional plume acquisition, mobile robot olfaction, odor search.

extremely rare. Occasionally they become visible due to particles in suspension (e.g. smoke leaving a chimney), or due to variations in the air density (e.g. the ascension of hot air over the asphalt in the summer). A simple technique that allows visualization of the air flow around a robot is the introduction of small solid or liquid particles into the air [2]. Their capacity to reflect or disperse light makes them visible. If the particles are sufficiently small, their low mass allows them to follow the changes of the air's trajectory, as they are not largely affected by gravity. Some examples are the smoke from tobacco or from burning oils, or droplets of steam [3].

A. Two Dimensional Odor Plume Tracking

Up to this point, most of the research in mobile robot olfaction has focused on the problem of odor source localization. This problem can be divided into three phases [4]. The first phase consists of finding odor cues (odor exploration), the second phase is characterized by following the plume (plume tracking) and the last phase is in charge of pinpointing the odor source (odor source declaration). Current methods and algorithms for detecting odor sources have originated in a wide range of areas of study: biological inspiration [6][7]; logical reasoning [8]; gradient following [9]; neural networks [12]; probabilistic methods [13]; multirobot cooperation [14][15]. Odor tracking can be done with a single gas sensor and multiple spatial samples [5] or with the help of a directional anemometer. Lochmatter et al. [16] used this approach to evaluate the effectiveness of three algorithms that follow the wind direction to track odor sources. The odor sensor was used in a binary fashion, distinguishing only whether the robot was inside or outside the odor plume. In another work, this group studied the shortening of the search time obtained by using a group of robots instead of one [15]. Increasing the number of sensors allows a system to collect simultaneous readings of odor concentration at different positions, thus characterizing the atmosphere in a shorter time interval or orienting more easily towards a gradient ascent direction. Waphare et al. [18] explored this strategy with three odor sensors on a robot: one located on the front of the robot and the other two on the sides. The cooperation between robots showed great potential to speed up the localization of the odor source; however, due to the increase in complexity, progress in this area has been slow. On this subject one can also find the proposal of a cooperative Surge-Spiral algorithm [14], the odor-guided cooperative exploration of areas using stochastic search techniques [20][21], an algorithm inspired by nature to be used in environments with low air flow [22], a system of cooperative odor-guided odor source localization [23] and a study of

I. INTRODUCTION

The last two decades have seen a growing interest in autonomous odor source localization. The use of autonomous agents capable of locating odor sources can be extremely useful in various scenarios, from monitoring public structures for hazardous gases or explosives to aiding in the search for survivors during a disaster. When working with mobile robot olfaction it is necessary to understand how odor dispersion behaves in uncontrolled conditions. In natural environments, even without noticeable air flow, odor disperses by thermal convection induced by thermal gradients existing at the surfaces. On the other hand, in the presence of advection the drainage speed will superimpose on this effect, usually becoming the dominant phenomenon. In any case, when close to planar surfaces, only laminar flow is possible, as a fluid moves parallel to the surface, resulting in the dispersion of odor molecules in the form of filaments that enlarge as they get increasingly far away from the odor source. However, if the intensity of the air flow is high enough, it is possible to perceive increasing turbulence further away from the source. As the odor filament enters the turbulent areas it separates into smaller segments. The odor plume will continue to disperse in the direction of the air flow; however, this will now happen irregularly, as it expands into a series of odor fractions. Furthermore, the weight of the particles and their temperature influence the vertical movement of the plume as well. All these effects, along with the way air is diverted by obstacles, will have a major impact on the process of odor dispersion [1]. The effects of air flow are part of everyday life; in spite of this, the opportunity to observe these patterns in full detail is
L. Osorio, G. Cabrita and L. Marques are with the Department of Electrical and Computer Engineering, Institute of Systems and Robotics, University of Coimbra, 3030-290 Coimbra, Portugal. {bica, goncabrita, lino}@isr.uc.pt


swarm-based robotic plume tracking [24]. Regardless of the approach, all the algorithms previously discussed share one characteristic: the odor plume is sampled at a constant height. This means that all these algorithms perceive the odor plume as a two-dimensional entity when in fact the chemical will scatter over a three-dimensional space.

B. Three Dimensional Odor Plume Tracking

Although there is plenty of work related to tracking airborne odor plumes in two dimensions, there have been relatively few works exploring the three-dimensional capabilities of plume acquisition. Marques et al. [19] used a robot equipped with two gas sensors, one on the top and another near its bottom, in order to infer the location of a ground-located gas source. In [25] Ishida used a blimp as a platform to measure the odor concentration in three dimensions, creating an odor distribution map. In [26] a group of robots was used to determine the environment's odor concentration, resorting to the Kernel DM+V algorithm [27]. In [17] Russell built a robot with a gas sensor moved vertically through the action of a winch, which allowed odor to be measured in three dimensions, building a 3D representation of the plume while it was tracked.

II. PROBLEM APPROACH

To track odor in three dimensions, a robot capable of determining the odor concentration at different heights and detecting the wind's direction was built. Although the robot has the ability to detect odor in three dimensions, it can only move in two dimensions. Therefore, the odor was measured at different heights, but only the highest concentration measured along the gas sensor array was used, resulting in a pseudo-three-dimensional system. An analogy can be made with mobile robot navigation: while the robot can only move in a two-dimensional space, it occupies a three-dimensional space and as such should avoid any obstacle in its way. In our case, although the robot moves in a two-dimensional space, the plume will scatter along a three-dimensional space, hence the need to detect it as such yet translate that information to the robot's two-dimensional workspace. Three algorithms (Surge-Cast, Casting and Gradient) were tested tracking three-dimensional odor plumes and compared against a purely two-dimensional system, to evaluate the advantages of three-dimensional plume acquisition over the traditional two-dimensional method for the purpose of plume tracking. Surge-Cast and Casting use upwind movement, while Gradient, as the name suggests, relies solely on concentration measurements in order to estimate the gradient. Although these algorithms are already well known, a quick overview follows.

A. Surge-Cast

The robot executes an upwind movement while it is inside the odor plume, executing movements perpendicular to the wind when it leaves the plume. Once it is back inside the plume, the robot returns to performing the upwind movement [16]. This is illustrated in Figure 1 and the pseudocode can be found in Algorithm 1.
Algorithm 1: Surge-Cast
if odor_reading < threshold and !found_plume then
    wander();
end
if odor_reading >= threshold then
    upwindSurge();
end
else if found_plume then
    if last_direction == left then
        castRight();
        last_direction = right;
    else
        castLeft();
        last_direction = left;
    end
end
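In the three-dimensional setup, the odor_reading used by the pseudocode comes from the vertical sensor array. A minimal Python sketch of this fusion step is given below; it is our illustration rather than the authors' code, and the 7 ppm detection threshold is the value quoted later in Section IV.

```python
# Illustrative sketch of the pseudo-3D plume acquisition: the four vertically
# mounted sensors are fused by taking the maximum concentration, which is then
# compared against the detection threshold used by the tracking algorithms.
# Sensor values and the helper names are ours, not from the paper.
DETECTION_THRESHOLD_PPM = 7.0   # plume detection threshold reported in Sec. IV

def fused_reading(sensor_ppm):
    """sensor_ppm: list of concentrations (ppm) from the vertical array."""
    return max(sensor_ppm)

def in_plume(sensor_ppm, threshold=DETECTION_THRESHOLD_PPM):
    return fused_reading(sensor_ppm) >= threshold

# Example: only the third sensor (1.06 m) senses the plume, yet the robot
# still reports a detection -- the situation a single fixed sensor would miss.
print(in_plume([0.4, 1.2, 9.8, 0.9]))   # True
```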

Figure 1 Surge-Cast algorithm.

B. Casting

While inside the odor plume, the robot moves in the upwind direction with an added offset angle that makes the robot leave the odor plume from a known side. When it is outside of the plume, it moves perpendicularly to the wind in the direction opposite to the offset angle until it is once more inside the plume, repeating the process with an inverted offset angle [16]. This is illustrated in Figure 2, while the pseudocode can be found in Algorithm 2.
Algorithm 2: Casting
if odor_reading < threshold and !found_plume then
    wander();
end
if odor_reading >= threshold then
    if reacquiring_plume then
        angle = -angle;
        reacquiring_plume = false;
    end
    cast(angle);
end
else if found_plume then
    cast( -(angle/||angle||)*pi/2 );
    reacquiring_plume = true;
end


direction measurement, and a vertical array of four Figaro TGS-2620 gas sensors placed 0.5 m apart from each other, as can be seen in Figure 4. The gas sensors were connected to the Erratic through an Arduino microcontroller board. This robot allowed testing both two- and three-dimensional setups, by using concentration data from only one sensor (2D) or by using the full array of gas sensors (3D). The Erratic was running the ROS navigation stack for autonomous navigation and localization [11]. A metric map of the environment was provided to the robot for each experiment. All the algorithms developed for this work were implemented using the ROS framework and can be downloaded from http://www.ros.org/wiki/odor_search.

Figure 2 Casting algorithm.

C. Gradient

This algorithm only uses odor concentration data to navigate. The robot describes a circle to acquire the odor concentration in its neighborhood, moving to the position where the maximum concentration was found and repeating the process until it reaches the odor source. The radius of the circle is enlarged if odor is detected in most of the circular path and is reduced if odor is absent for most of the path. This is illustrated in Figure 3 and the pseudocode can be found in Algorithm 3.
Algorithm 3: Gradient
if odor_reading < threshold and !found_plume then
    wander();
end
if odor_reading >= threshold then
    circularPath(radius);
    goToMaxOdor();
end
/* Percentage of path where odor was found */
if odor_percent > odor_percent_upper_threshold then
    radius = radius_inc_coeff*radius;
end
if odor_percent < odor_percent_lower_threshold then
    radius = radius_dec_coeff*radius;
end
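For concreteness, one iteration of this gradient strategy could look like the following Python sketch; the circle sampling, the coefficient values and the helper names are our own illustrative choices, not the authors' implementation.

```python
# Illustrative single step of the gradient-following strategy: sample odor on a
# circle around the robot, move toward the maximum, and adapt the radius based
# on the fraction of the circle where odor was present. Parameters are made up.
import math

def gradient_step(x, y, radius, sample_odor, threshold=7.0,
                  upper=0.8, lower=0.2, inc=1.5, dec=0.7, n_samples=16):
    """sample_odor(x, y) returns the fused concentration (ppm) at a position."""
    samples = []
    for k in range(n_samples):
        a = 2.0 * math.pi * k / n_samples
        px, py = x + radius * math.cos(a), y + radius * math.sin(a)
        samples.append((sample_odor(px, py), px, py))
    best_c, best_x, best_y = max(samples)
    odor_fraction = sum(c >= threshold for c, _, _ in samples) / n_samples
    if odor_fraction > upper:
        radius *= inc          # mostly inside the plume: widen the search circle
    elif odor_fraction < lower:
        radius *= dec          # mostly outside the plume: tighten the circle
    return (best_x, best_y), radius

# Example with a synthetic concentration field peaking at the origin.
field = lambda px, py: 30.0 * math.exp(-(px**2 + py**2) / 4.0)
print(gradient_step(2.0, 0.0, 1.0, field))
```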

Figure 4 The robot used in the experiments and the location of the Figaro TGS-2620 gas sensors on the robot (gas sensors spaced 0.50 m apart).

B. Gas Sensor Calibration

Since an array of gas sensors was used, a simultaneous calibration of the Figaro TGS-2620 array was performed to guarantee that all the readings were referenced to the same concentration scale (ppm in this case). This was accomplished by enclosing all four sensors inside a compartment with constant volume, pressure and temperature and then inserting controlled volumes of liquid ethanol, which was later vaporized. The readings of each sensor were registered after each ethanol increase.

C. The Testing Environment

Although simulation is an important step during research, it is important to ensure that algorithms perform well in the real world. Most of the proposed odor localization algorithms have only been demonstrated in simulation so far. In the real world the behavior of an odor plume can appear random and chaotic most of the time, meaning that a realistic plume simulation is hard to achieve. Real-world experimentation is thus an important tool for validating odor-related algorithms. With this in mind, the experiments for this work were performed in our building's entrance hallway, which is 5 m wide, more than 100 m long and more than 10 m high. A part of the testing environment can be seen in Figures 5 and 6. It was observed that this hallway generated an air flow in one direction during the day and in the opposite direction during the night. The tests were performed during the night to have similar conditions during all the tests and to avoid disturbances from people

Figure 3 Gradient algorithm.

III. EXPERIMENTAL SETUP

A. The Robot

A Videre Erratic robot was used for all the experiments presented in this paper. The robot was equipped with a laser range finder (Hokuyo URG-04LX) for obstacle avoidance, a 2D anemometer (Gill Instruments WindSonic) for wind


walking by. No artificial air flow was induced during the experiments, but a steady air flow ranging from 0.1 m/s to 0.3 m/s was observed. The ethanol vapor was released into the atmosphere by bubbling air through an ethanol solution.
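Relating raw sensor readings to ppm, as in the calibration procedure of Section III-B, can be sketched as follows. The calibration pairs and the piecewise-linear interpolation are assumptions made for illustration, not values or choices from the paper.

```python
# Hypothetical calibration sketch: map a raw TGS-2620 reading to ppm by linear
# interpolation between points recorded while injecting known ethanol volumes
# into a sealed chamber. The calibration pairs below are illustrative only.
def make_calibration(points):
    """points: list of (raw_reading, ppm) pairs, sorted by raw_reading."""
    def to_ppm(raw):
        if raw <= points[0][0]:
            return points[0][1]
        for (r0, c0), (r1, c1) in zip(points, points[1:]):
            if raw <= r1:
                return c0 + (c1 - c0) * (raw - r0) / (r1 - r0)
        return points[-1][1]
    return to_ppm

to_ppm = make_calibration([(120, 0.0), (300, 5.0), (520, 15.0), (760, 30.0)])
print(round(to_ppm(410), 2))   # ~10.0 ppm for a reading between two points
```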

Figure 6 Picture and scheme of the testing area. Notice the odor source, the wind direction and the three starting positions marked by the three stars.
Figure 5 Pictures of the hallway where the experiments were conducted.

D. Particle Plume

The Particle Plume mapping algorithm was presented in [10] and allows for a three-dimensional representation of odor plumes. To accomplish this, the chemical readings from the gas sensors are converted into a point cloud around the sensor's location in a global frame. The number of points generated for each reading is proportional to the chemical reading in parts per million (ppm). The point cloud is confined to a predefined radius around the point of origin. The newly created point cloud is finally added to the plume. If the volume occupied by the new points contains older points, these older points are removed from the plume.

E. Gas Molecular Weight

The Figaro TGS-2620 has the ability to measure alcohol and organic gases, mainly Ethanol (CH3CH2OH), Hydrogen (H2), Isobutane (C4H10), Carbon Monoxide (CO) and Methane (CH4). Since their compositions differ, they will behave differently when released into the environment, affecting the way they are carried by the air flow. The molecular weight of the chemical being released (Table 1) has great influence on the experiments conducted in this work, as it changes the way the chemical moves vertically. If the chemical's molecular weight is greater than the air's, it will tend to descend; otherwise, the chemical will tend to rise. This means that, away from the source, the chemical might be detected at a height different from that at which it was released, supporting the idea of sensing chemicals in three dimensions.

IV. EXPERIMENTAL RESULTS

The first set of experiments consisted in setting the odor source at different heights and producing a map of the plume in order to determine the differences between representing the plume in two and three dimensions. First, the odor source was set at ground height and the Surge-Cast algorithm was used to track the plume. This test was performed for both the two- and three-dimensional versions of the algorithm.
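Going back to the Particle Plume representation described above, its update step can be sketched as follows; the points-per-ppm factor, the box-shaped update volume and the data structures are our own simplifying assumptions, not details from [10].

```python
# Illustrative Particle Plume update (not the implementation from [10]): each
# reading adds a number of random points proportional to the ppm value inside a
# small box around the sensor, replacing older points in that volume.
import random

class ParticlePlume:
    def __init__(self, radius=0.3, points_per_ppm=5):
        self.radius = radius                   # half-width of the update volume (m)
        self.points_per_ppm = points_per_ppm   # points generated per ppm measured
        self.points = []                       # list of (x, y, z) tuples

    def add_reading(self, center, ppm):
        cx, cy, cz = center
        r = self.radius
        # older points inside the updated volume are replaced by the new reading
        self.points = [p for p in self.points
                       if max(abs(p[0] - cx), abs(p[1] - cy), abs(p[2] - cz)) > r]
        for _ in range(int(ppm * self.points_per_ppm)):
            self.points.append((cx + random.uniform(-r, r),
                                cy + random.uniform(-r, r),
                                cz + random.uniform(-r, r)))

plume = ParticlePlume()
plume.add_reading((1.0, 2.0, 1.06), ppm=9.8)
print(len(plume.points))   # 49 points for a 9.8 ppm reading
```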

Next, the source was placed at the height of the third gas sensor and the test was repeated. The second set of experiments consisted in comparing the performance of both setups (two and three dimensions) for each algorithm. Several tests were conducted, consisting of placing the robot outside the odor plume and then instructing it to move into it. As soon as the gas sensors detected concentrations above 7 ppm, the robot started tracking the odor plume to its source using the algorithms described previously. When the detected odor concentration was higher than a known predefined threshold measured in the neighborhood of the source (30 ppm), the odor source was declared found and the experiment was finished. A set of 9 experiments was performed for each setup (two and three dimensions) for each algorithm, in a total of 54 experiments. The odor source was placed about 1.06 m above the ground.

A. Results

Figure 7 shows the results for the plume mapping experiments conducted with the odor source at ground height, while in Figure 8 it is possible to see the results for the same experiments conducted with the odor source at the height of the third gas sensor. These screenshots were taken from rviz, a 3D visualization environment for robots using ROS. The plume is represented by the red point cloud, while the yellow path represents the visited map cells. The results of the performance comparison between two-dimensional and three-dimensional plume acquisition for plume tracking can be found in Table 2. Furthermore, Figure 9 shows three rviz screenshots, one from each of the algorithms tested, running with the three-dimensional version of the software.
Table 1 Molecular weight of gases detected by the Figaro TGS-2620.
Gas               Formula      Molecular weight (g/mol)
Hydrogen          H2           2.0
Methane           CH4          16.0
Carbon monoxide   CO           28.0
Air               -            29.0
Ethanol           CH3CH2OH     46.1
Isobutane         C4H10        58.1

Table 2 Results from the tests with Surge-Cast, Casting and Gradient, both in 2D and 3D.
                 Surge-Cast      Casting         Gradient        Total
                 2D    3D        2D    3D        2D    3D        2D    3D
Success          5     7         8     8         2     4         15    19
Fail             4     2         1     1         7     5         12    8
Succ. Rate (%)   56    78        89    89        22    44        56    70

V. DISCUSSION

Figure 7 Surge-cast with the odor source at ground height: a) two-dimensional plume acquisition; b) three-dimensional plume acquisition.

A. Odor Plume Acquisition at Various Heights

In the first experiment, ethanol was released at ground level and, since ethanol vapor is heavier than air, it tends to keep close to the ground. As a consequence, the three upper sensors are unable to detect the odor. In these conditions the three-dimensional setup has no advantage over the two-dimensional one; in fact, they perform exactly the same, since only the lower sensor detects any odor. In the second experiment the advantages of using sensors at different heights became evident. The second sensor was the first to detect odor, leading the robot to the proximity of the odor source. Then the third sensor, at 1.06 m height, started detecting the odor, leading the robot to the odor source and performing the declaration. This can be seen in Figure 8. The lack of the gas sensor array would have resulted in the robot not finding the odor source.

B. Odor Plume Tracking

The Surge-Cast algorithm showed an improvement in the success rate when using the three-dimensional setup, raising its success ratio from 56% to 78%. This algorithm takes the robot close to the border of the odor plume, hence its success rate suffered from the flickering of the odor concentration caused by the irregular air flow. It was noticeable that the two-dimensional setup sometimes failed to reacquire the odor plume due to momentary decreases in the odor concentration while the cast was being performed. For the same situation the three-dimensional setup performed much better, as the momentary decrease did not affect all the sensors at the same time. The Casting algorithm performed successfully with both the two- and three-dimensional setups, with a success rate of 89% in either case. This is probably due to the fact that this algorithm relies mostly on the wind direction to navigate. The Gradient algorithm, on the other hand, relies solely on the odor data to determine the movement of the robot. It is thus no surprise that the three-dimensional setup proved to be more accurate than the two-dimensional setup, scoring a success rate of 44% against 22%. Overall, the three-dimensional odor acquisition improved the success rate of the plume tracking from 56% to 70%, representing an increase of about 25% in the success rate. For the two-dimensional setup experiments, the gas sensor used was always the one at the odor source's height. The two-dimensional setup would have performed worse if sensor and source were at different heights, thus resulting in a greater difference between the traditional two-dimensional odor acquisition setup and the three-dimensional one.

Figure 8 Surge-cast with the odor source at the third gas sensor's height, about 1.06 m from the ground. Three-dimensional plume acquisition.

a) Surge-cast.

b) Casting

c) Gradient. Figure 9 Three experiments with three dimensional plume acquisition.


VI. CONCLUSIONS

Three-dimensional plume acquisition proved to be more efficient than the more traditional two-dimensional approach in terms of odor plume tracking. Odor moves in all three dimensions, rising or descending depending on the molecular weight of the mixture. Even if a chemical compound is heavier or lighter than the air, it can still be carried upward or downward by the wind (vortexes will affect the plume not only in width and length but also in height), rendering a ground robot equipped with a single fixed gas sensor handicapped in various scenarios. A robot with this setup might be unable to pinpoint the odor source or even fail to detect the plume at all. Many odor-related algorithms have been demonstrated in constrained conditions, with the odor source/gas sensor pair located at a height favorable for plume detection. This work shows that, as the field of mobile robot olfaction moves towards real-world applications, the ability to acquire a plume in all its three dimensions will become a mandatory requirement. Future work in this area will include the implementation of other plume tracking algorithms (namely the inclusion of wind information in a variant of the gradient-based algorithm), the study of plumes generated by other gases of different molar masses and, finally, the possibility of fully tracking odor plumes in all their three dimensions and declaring the corresponding source.

REFERENCES
[13] J. Farrell, J. Murlis, W. Li, and R. Carde, "Filament-based atmospheric dispersion model to achieve short time-scale structure of odor plumes", Environmental fluid Mechanics, vol. 2, pp. 143-169, 2002. [14] A. T. Hayes, A. Martinoli, and R. M. Goodman, "Distributed odor source localization", IEEE Sensors Journal, vol. 2, n3, pp. 260-271, June 2002. [15] T. Lochmatter, N. Heiniger, and A. Martinoli, "Understanding the Potential Impact of Multiple Robots in Odor Source Localization", 9th Symposium on Distributed Autonomous Robotic Systems, Tsukuba, Japan, November 2008. [16] T. Lochmatter, N. Heiniger, and A. Martinoli, "Theoretical Analysis of Three Bio-Inspired Plume Tracking Algorithms", Proc. IEEE Int. Conf. on Robotics and Automation, Kobe, Japan, May 2009. [17] A. R. Russel, "Tracking Chemical Plumes in 3-Dimensions", Proc. Int. Conf. on Robotics and Biomimetics, pp. 31-36, Clayton, Australia, 2006. [18] S. Waphare, D. Gharpure, and A. Shaligram, "Implementation of 3Nose Strategy In Odor Plume-Tracking Algorithm", Proc. Int. Conf. on Signal Acquisition and Processing, 2010. [19] L. Marques, N. Almeida, and A. de Almeida, "Mobile robot olfactory sensory system", Proc. of EuroSensors, 2003. [20] L. Marques, U. Nunes, and AT de Almeida, Particle swarm-based olfactory guided search, Autonomous Robots, vol. 20, no 3, 2006. [21] M. Turduev, M. Kirtay, P. Sousa, V. Gazi, and L. Marques, "Chemical concentration map building through bacterial foraging optimization based search algorithm by mobile robots", Proc. IEEE Int. Conference on Systems Man and Cybernetics (SMC), 2010. [22] G. Ferri, E. Casellia, V. Mattolib, A. Mondinib, B. Mazzolaib and P. Dario, "Spiral: A novel biologically-inspired algorithm for gas/odor source localization in an indoor environment with no strong airflow" Robotics and Autonomous Systems, pp. vol 57, n4, 2009. [23] A. Marjovi, J. Nunes, P. Sousa, R. Faria, and L. Marques, "An Olfactory-Based Robot Swarm Navigation Method", Proc. IEEE Int. Conf. on Robotics and Automation, pp. 4958-4963, Alaska, USA, 2010. [24] D. V. Zarzhitsky, D. F. Spears, and D. R. Thayer, "Experimental studies of swarm robotic chemical plume tracing using computational fluid dynamics simulations", Int. Journal of Intelligent Computing and Cybernetics, 2010. [25] H. Ishida, "Blimp Robot for Three-Dimensional Gas Distribution Mapping in Indoor Environment", in Proc. of the 13 International Symposium on Olfaction and Electronic Nose, pp. 61-64, 2009. [26] G. Ferri, A. Mondini, A. Manzi, B. Mazzolai, C. Laschi, V. Mattoli, M. Reggente, T. Stoyanov, A. Lilienthal, M. Lettere and P. Dario, "DustCart, a Mobile Robot for Urban Environments: Experiments of Pollution Monitoring and Maping during Autonomous -Navigation in Urban Scenarios" Proc. ICRA Workshop on Networked and Mobile Robot Olfaction in Natural, Dynamic Environments, 2010. [27] A. J. Lilienthal, M. Reggente, M. Trincavelli, J. L. Blanco, and J. Gonzalez, "A Statistical Approach to Gas Distribution Modelling with Mobile Robot - The Kernel DM+V Algorithm", Proc. Int. Conf. on Intelligent Robots and Systems, pp. 570-576, USA, 2009.

[1] R. A. Russel, Odour Detection by Mobile Robots, World Scientific Publishing, 1999. [2] R. C. Pankhurst and D.W. Holder, "Wind-Tunnel Technique: An Account of Experimental Methods in Low-Speed and High-Speed Wind Tunnels", Sir Isaac Pitman & Sons, 1952 [3] T. J. Mueller, "Flow visualization by direct injection," Goldstein, R.J. (ed), Fluid Mechanics Measurements, 1983. [4] A. Lilienthal, A. Loutfi and T. Duckett, Airborne Chemical Sensing with Mobile Robots, Sensors, 6, pp. 1616-1678, 2006. [5] L. Marques, U. Nunes, and A.T. de Almeida, "Olfaction-based mobile robot navigation", Thin Solid Films, vol. 418, n1, pp. 51-58, 2002. [6] N. J. Vickers, "Mechanisms of animal navigation in odor plumes," Biological Bulletin, vol. 198, pp. 203-212, 2000. [7] F. W. Grasso, T. R. Consi, D. C. Mountain, and J. Atema, "Biomimetic robot lobster performs chemo-orientation in turbulence using a pair of spatially separated sensors: Progress and challenges", Rob. Autonom. Syst., vol. 30, pp. 115-131, 2000. [8] R. A. Russell, D. Thiel, R. Deveza, and Mackay-Sim, "A robotic system to locate hazardous chemical leaks", Proc. IEEE Int. Conf. Robotics Automation, pp. 556-561, 1995. [9] H. Ishida, T. Nakamoto, and T. Moriizumi, "Remote sensing of gas/ odor source location and concentration distribution using mobile system", Sensors and Actuators B, vol. 49, pp. 52-57, 1998. [10] G. Cabrita, P. Sousa and L. Marques, Odor guided exploration and plume tracking - Particle Plume Explorer, Proc. of European Conf. on Mobile Robotics (ECMR), 2011. [11] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey and K. Konolige, "The Office Marathon: Robust Navigation in an Indoor Office Environment", Proc. IEEE Int. Conf. on Robotics and Automation, Anchorage, USA, 2010. [12] A. M. Farah and T. Duckett, "Reactive localization of an odor source by a learning mobile robot", Proc. 2nd Swedish Wokshop on Autonomous Robotics, Stockholm, pp. 29-38, 2002.


A Multi-Goal Path Planning for Goal Regions in the Polygonal Domain


Jan Faigl, Vojtěch Vonásek, Libor Přeučil
Center for Applied Cybernetics, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University in Prague
{xfaigl,vonasek,preucil}@labe.felk.cvut.cz

Abstract—This paper concerns a variant of the multi-goal path planning problem in which the goals may be polygonal regions. The problem is to find a closed shortest path in a polygonal map such that all goals are visited. The proposed solution is based on a self-organizing map algorithm for the traveling salesman problem, which is extended to polygonal goals. Neurons' weights are considered as nodes inside the polygonal domain, and the connected nodes represent a path that evolves according to the proposed adaptation rules. The performance of the rules is evaluated on a set of problems, including an instance of the watchman route problem with restricted visibility range. According to the experimental results, the proposed algorithm provides a flexible approach to solving various NP-hard routing problems in polygonal maps.

I. INTRODUCTION

The multi-goal path planning problem (MTP) stands for finding a shortest path connecting a given set of goals located in a robot working environment. The environment can be represented by the polygonal domain W, and the goals may be sensing locations in an inspection task. Such point goals, guaranteeing that the whole environment will be covered using a sensor with a limited sensing range, can be found by a sensor placement algorithm [8]. The MTP with point goals can be formulated as the Traveling Salesman Problem (TSP) [16], e.g., using all shortest paths between goals found in a visibility graph by Dijkstra's algorithm. Then, the MTP is a combinatorial optimization problem of finding a sequence of goal visits. A more general variant of the MTP can be more appropriate if objects of interest may be located in certain regions of W, e.g., when it is sufficient to reach a particular part of the environment to see the requested object. In such a problem formulation, a goal is a polygonal region rather than a single point. Several algorithms addressing this problem can be found in the literature, however only for particular restricted variants. For example, the goals form a disjoint set of convex polygons attached to a simple polygon in the safari route problem [12], which can be solved in O(n^3) [17]. If the route is not allowed to enter the convex goals, the problem is called the zookeeper problem, which can be solved in O(n log n) for a given starting point and the full shortest path map [1]. However, both problems are NP-hard in general. Routing problems with polygonal goals can be considered as variants of the TSP with neighborhoods (TSPN) [10]. The TSPN is studied for graphs or as a geometric variant in a plane, but typically without obstacles. Approximate algorithms for restricted variants of the TSPN have been proposed [5, 3];

however, the TSPN is APX-hard and cannot be approximated to within a factor of 2 − ε, where ε > 0, unless P=NP [13]. A combinatorial approach [14] can be used for the MTP with partitioned goals, where each goal is represented by a finite (small) set of point goals. However, combinatorial approaches cannot be used for continuous sets, because there are too many possibilities for how to connect the goals. This is also the case for the watchman route problem (WRP), in which the goals are not explicitly prescribed. The WRP stands for finding a closed shortest path such that all points of W are visible from at least one point of the path [11]. Although polynomial algorithms have been proposed for a restricted class of polygons [2], the WRP is NP-hard for the polygonal domain. In this paper, a self-organizing map (SOM) algorithm for the TSP in W [9] is modified to deal with a general variant of the MTP. Contrary to combinatorial approaches, a geometrical interpretation of the SOM evolution in W allows easy and straightforward extensions to deal with polygonal goals. To demonstrate the geometric relation between the learning network and the polygonal goals, several modifications of the adaptation rules are proposed and evaluated on a set of problems. The main advantage of the proposed approach is its ability to address general multi-goal path planning problems in W (not only in a simple polygon) and with goals not necessarily attached to W. The rest of this paper is organized as follows. The addressed problem formulation is presented in the next section. The proposed algorithms are based on the SOM adaptation schema for the TSP in W, and therefore a brief overview of the schema is presented in Section III. The proposed modifications of the adaptation rules for polygonal goals are presented in Section IV. Experimental evaluation of the proposed algorithm variants is presented in Section V. Concluding remarks are presented in Section VI.

II. PROBLEM STATEMENT

The problem addressed in this paper can be formulated as follows: find a closed shortest path visiting a given set of goals represented as convex polygons (possibly overlapping each other) in a polygonal map W. The problem formulation is based on the safari route problem [12]; however, it is more general in three aspects. First, the polygons can be placed inside a polygon with holes. Also, it is not required that the convex polygons are attached to the boundary of W, as in the original safari route problem formulation. Finally, the polygons can overlap each other, and therefore such polygons can represent a polygonal goal of an arbitrary shape.


The proposed problem formulation comprises the WRP with a restricted visibility range d. The set of goals can be found as a convex cover of W, i.e., a set of convex polygons whose union is W. The advantage of an algorithm solving the formulated problem is that it is not required to have a minimal cover set. The convex polygons restricted to the size d can be found by a simple algorithm based on a triangular mesh of W [6].

III. SOM ALGORITHM FOR THE TSP IN W

A SOM algorithm for routing problems, in particular the SOM for the TSP in W [9], is a Kohonen-type unsupervised two-layered learning neural network. The network contains a two-dimensional input vector and an array of output units that are organized into a uni-dimensional structure. An input vector represents the coordinates of a point goal, and the connection weights (between the input and output units) represent the coordinates of the output units. The connection weights can be considered as nodes representing a path, which provides a direct geometric interpretation of the neurons' weights. The nodes thus form a ring in W because of the uni-dimensional structure of the output layer, see Fig. 1.
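To make the ring evolution concrete, the sketch below implements one adaptation epoch of such a SOM for point goals using plain Euclidean distances. The obstacle-aware shortest paths of the actual algorithm are replaced by straight-line distances, and the parameter values and helper names are our own simplifying assumptions rather than the procedure prescribed in [9].

```python
# Simplified SOM ring adaptation for point goals (Euclidean distances only;
# the paper uses shortest paths among obstacles instead). Parameters are
# illustrative, not the values prescribed by the referenced algorithm.
import math, random

def adapt_epoch(ring, goals, gain, mu=0.6, neighborhood=0.2):
    m = len(ring)
    inhibited = set()
    for gx, gy in random.sample(goals, len(goals)):
        # winner selection: nearest non-inhibited node
        winner = min((i for i in range(m) if i not in inhibited),
                     key=lambda i: math.hypot(ring[i][0] - gx, ring[i][1] - gy))
        inhibited.add(winner)
        span = int(neighborhood * m)
        for offset in range(-span, span + 1):
            i = (winner + offset) % m          # cardinal distance along the ring
            f = math.exp(-offset**2 / gain**2) # neighbouring function
            x, y = ring[i]
            ring[i] = (x + mu * f * (gx - x), y + mu * f * (gy - y))
    return ring

goals = [(0, 0), (4, 0), (4, 3), (0, 3)]
ring = [(2 + 0.5 * math.cos(a), 1.5 + 0.5 * math.sin(a))
        for a in [2 * math.pi * k / 8 for k in range(8)]]
for epoch in range(60):
    adapt_epoch(ring, goals, gain=max(10 * 0.9 ** epoch, 0.1))
print([(round(x, 2), round(y, 2)) for x, y in ring])
```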
Fig. 1: A schema of the two-layered neural network and the associated geometric representation.

The network learning process is an iterative stochastic procedure in which goals are presented to the network in a random order. The procedure basically consists of two phases: (1) selection of the winner node for the presented goal; (2) adaptation of the winner and its neighbouring nodes toward the goal. The learning procedure works as follows.

1) Initialization: For a set of n goals G and a polygonal map W, create 2n nodes N around the first goal. Let the initial value of the learning gain be σ = 12.41n + 0.06, and let the adaptation parameters be μ = 0.6 and α = 0.1.

2) Randomizing: Create a random permutation of goals Π(G).

3) Clear Inhibition: I ← ∅.

4) Winner Selection: Select the node ν* closest to the goal g ∈ Π(G) according to ν* = argmin{ |S(ν, g)| : ν ∈ N, ν ∉ I }, where |S(ν, g)| is the length of the shortest path among obstacles S(ν, g) from ν to g.

5) Adapt: Move ν* and its neighbouring nodes along a particular path toward g. Let the current number of nodes be m, and let N* (N* ⊂ N) be the set of ν*'s neighbouring nodes within cardinal distance less than or equal to 0.2m. Move ν* along the shortest path S(ν*, g) toward g by the distance μ|S(ν*, g)|. Move the nodes ν ∈ N* toward g along the path S(ν, g) by the distance μ|S(ν, g)|f(σ, l), where f is the neighbouring function f = exp(-l²/σ²) and l is the cardinal distance of ν to ν*. Update the permutation: Π(G) ← Π(G) \ {g}. Inhibit the winner: I ← I ∪ {ν*}. If |Π(G)| > 0, go to Step 4.

6) Decrease the learning gain: σ ← (1 - α)σ.

7) Termination condition: If all goals have their winner within a sufficient distance, e.g., less than 10⁻³, or if σ < 10⁻⁴, stop the adaptation. Otherwise go to Step 2.

8) Final path construction: Use the last winners to determine the sequence of goal visits.

The algorithm terminates after a finite number of adaptation steps as σ is decreased after each presentation of all goals to the network. Moreover, the inhibition of the winners guarantees that each goal has a distinct winner associated with it; thus, a sequence of all goal visits can be obtained by traversing the ring at the end of each adaptation step. The computational burden of the adaptation procedure depends on the determination of the shortest path in W, because 2n² node-goal distance queries (Step 4) and (0.8n+1)n node-goal path queries (Step 5) have to be resolved in each adaptation step. Therefore, an approximate shortest path is considered, using a supporting division of W into convex cells (a convex partition of W) and pre-computed shortest paths between the map vertices and the point goals. The approximate node-goal path is found as a path over vertices of the cells in which the points (node and goal) are located. Then, such a rough approximation is refined using a test of direct visibility from the node to the vertices of the path. Details and an evaluation of the refinement variants can be found in [9]. Besides the approximation, the computational burden can be decreased using the Euclidean pre-selection [7], because only a node with a shorter Euclidean distance to the goal than the distance (length of the approximate shortest path) of the current winner node candidate can become the winner. In Fig. 2, a ring of nodes connected by an approximate shortest path between two points is shown to provide an overview of the ring evolution in W.
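The adaptation schema above can be summarized in code. The following is a minimal, illustrative C++ sketch of one adaptation epoch (Steps 2-6) for point goals; it is not the authors' implementation. In particular, pathLength() is a hypothetical placeholder that falls back to the Euclidean distance (i.e., an obstacle-free W), and nodes are moved along straight lines instead of along the shortest paths among obstacles.

#include <algorithm>
#include <cmath>
#include <random>
#include <set>
#include <vector>

struct Point { double x, y; };

// Hypothetical placeholder for |S(a, b)|, the length of the shortest path
// among obstacles; here simply the Euclidean distance (obstacle-free map).
static double pathLength(const Point &a, const Point &b) {
  return std::hypot(a.x - b.x, a.y - b.y);
}

// Move p toward g by the given distance (a straight line stands in for
// moving along the shortest path S(p, g)).
static void moveToward(Point &p, const Point &g, double dist) {
  double d = pathLength(p, g);
  if (d < 1e-12) return;
  double t = std::min(1.0, dist / d);
  p.x += t * (g.x - p.x);
  p.y += t * (g.y - p.y);
}

// One adaptation epoch (Steps 2-6); returns the largest winner-goal distance
// so that the caller can evaluate the termination condition of Step 7.
double somEpoch(std::vector<Point> &nodes, std::vector<Point> goals,
                double &sigma, double mu = 0.6, double alpha = 0.1) {
  static std::mt19937 rng(0);
  std::shuffle(goals.begin(), goals.end(), rng);          // Step 2
  std::set<std::size_t> inhibited;                        // Step 3
  double worst = 0.0;
  const int m = static_cast<int>(nodes.size());
  for (const Point &g : goals) {
    std::size_t winner = 0;                               // Step 4
    double best = 1e300;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
      if (inhibited.count(i)) continue;
      double d = pathLength(nodes[i], g);
      if (d < best) { best = d; winner = i; }
    }
    worst = std::max(worst, best);
    const int nbhd = static_cast<int>(0.2 * m);           // Step 5
    for (int l = -nbhd; l <= nbhd; ++l) {
      std::size_t idx = static_cast<std::size_t>(
          (static_cast<int>(winner) + l + m) % m);        // ring topology
      double f = std::exp(-static_cast<double>(l * l) / (sigma * sigma));
      moveToward(nodes[idx], g, mu * f * pathLength(nodes[idx], g));
    }
    inhibited.insert(winner);
  }
  sigma *= (1.0 - alpha);                                 // Step 6
  return worst;
}

The sketch is meant to be driven by an outer loop that repeatedly calls somEpoch() until the returned winner-goal distance or σ drops below the thresholds of Step 7; the names somEpoch, pathLength and moveToward are illustrative only.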


Fig. 2: An example of ring evolution in a polygonal map for the MTP with point goals after (a) 29, (b) 40, (c) 58, and (d) 78 adaptation steps; small green disks represent goals and blue disks are nodes.

Fig. 3: Examples of found paths without and with consideration of winners inside the goals: (a) a found path for termination of the adaptation once all winners are inside the goals, L = 84.3 m; (b) a found path with avoided adaptation of winners inside the goal, L = 65.0 m. Goals are represented by yellow regions with small disks representing the centroids of the regions. Winner nodes are represented by small orange disks. The length of the found path is denoted as L, and the length of the path connecting the centroids is Lref = 85.9 m.

IV. ADAPTATION RULES FOR POLYGONAL GOALS

Although it is obvious that a polygonal goal can be sampled into a finite set of points and the problem can be solved as the MTP with partitioned goals, the aforementioned SOM procedure can be straightforwardly extended to sample the goals during the self-adaptation. Thus, instead of explicit sampling of the goals, three simple strategies for the adaptation toward polygonal goals are presented in this section. The proposed algorithms are based on the SOM for the TSP, using centroids of the polygonal goals as point goals. However, the winner selection and adaptation phases are modified to find a more appropriate point of the polygonal goal and to avoid unnecessary movement into the goal. Therefore, a new point representing a polygonal goal is determined during the adaptation and used as a point goal, which leads to the computation of a shortest path between two arbitrary points in W. Similarly to the node-goal queries, an approximate node-point path is considered to decrease the computational burden. The approximation is also based on a convex partition of W and the shortest path over cells' vertices (a detailed description can be found in [6]).

A. Interior of the Goal

Probably the simplest approach (called goal interior here) can be based on the regular adaptation toward the centroids of the polygonal goals. However, the adaptation, i.e., the node movement toward the centroid, is performed only if the node is not inside the polygonal goal. Determining whether a node is inside a polygonal goal with n vertices can be done in O(n) by computing the winding number, or in O(log n) in the case of a convex goal. So, in this strategy, the centroids act more like attraction points toward which nodes are attracted, because the adaptation process is terminated once all winner nodes are inside their particular polygonal goals. Then, the final path is constructed from the sequence of winner nodes using the approximate shortest node-node paths. An example of solutions using the new termination condition and with the avoided adaptation of winners inside goals is shown in Fig. 3.
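The inside-the-goal test used by the goal interior rule can be illustrated by the following C++ sketch for a convex goal with vertices in counter-clockwise order. It is a generic O(n) check based on edge cross products (the O(log n) binary-search variant and the winding-number test for non-convex polygons are omitted); the names are illustrative, not the authors' code.

#include <vector>

struct Pt { double x, y; };

// True if p lies inside (or on the border of) the convex polygon poly,
// whose vertices are given in counter-clockwise order.
bool insideConvexGoal(const std::vector<Pt> &poly, const Pt &p) {
  const std::size_t n = poly.size();
  for (std::size_t i = 0; i < n; ++i) {
    const Pt &a = poly[i];
    const Pt &b = poly[(i + 1) % n];
    // Cross product of edge (a -> b) with (a -> p); a negative value means
    // p lies to the right of some edge, hence outside a CCW convex polygon.
    double cross = (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    if (cross < 0.0) return false;
  }
  return true;
}

In the goal interior strategy such a test guards every node movement: a node already inside its goal is simply not adapted, and the adaptation terminates once all winners pass the test.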

Fig. 4: Examples of (a) an intersection point determined for the winner node and (b) a found path, L = 59.7 m, using the attraction algorithm variant.

B. Attraction Point

The strategy described above can be extended by the determination of a new attraction point at the border of the polygonal goal. First, a winner node ν* is found regarding its distance to the centroid c(g) of the goal g. Then, an intersection point p of g with the path S(ν*, c(g)) is determined. The point p is used as the point goal to adapt the winner and its neighbouring nodes. This modification is denoted as attraction in the rest of this paper. An example of a determined intersection point p and the final found path is shown in Fig. 4. The found path is about five meters shorter than a path found by avoiding the adaptation of winner nodes inside the goals. Determination of the intersection point increases the computational burden; therefore, an experimental evaluation of the proposed algorithm variants is presented in Section V.
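The determination of the attraction point p can be sketched as follows in C++, under the simplifying assumption that the last leg of the path S(ν*, c(g)) is a straight segment from the winner node to the centroid (in the actual method the intersection is taken with the path among obstacles). All names are illustrative, not the authors' code.

#include <algorithm>
#include <vector>

struct Pt { double x, y; };

// Intersection parameter t of segment p->q with segment a->b, if any.
static bool segmentIntersection(const Pt &p, const Pt &q, const Pt &a,
                                const Pt &b, double *t_out) {
  double rx = q.x - p.x, ry = q.y - p.y;
  double sx = b.x - a.x, sy = b.y - a.y;
  double denom = rx * sy - ry * sx;
  if (denom > -1e-12 && denom < 1e-12) return false;   // parallel segments
  double t = ((a.x - p.x) * sy - (a.y - p.y) * sx) / denom;
  double u = ((a.x - p.x) * ry - (a.y - p.y) * rx) / denom;
  if (t < 0.0 || t > 1.0 || u < 0.0 || u > 1.0) return false;
  *t_out = t;
  return true;
}

// Attraction point: the first crossing of the straight segment from the
// winner node to the goal centroid with the border of the (convex) goal.
Pt attractionPoint(const Pt &node, const Pt &centroid,
                   const std::vector<Pt> &goal) {
  double best_t = 1.0;                 // fall back to the centroid itself
  for (std::size_t i = 0; i < goal.size(); ++i) {
    double t;
    if (segmentIntersection(node, centroid, goal[i],
                            goal[(i + 1) % goal.size()], &t))
      best_t = std::min(best_t, t);
  }
  return Pt{node.x + best_t * (centroid.x - node.x),
            node.y + best_t * (centroid.y - node.y)};
}

The winner and its neighbours are then adapted toward p instead of the centroid, so the ring is pulled only to the goal border rather than into the goal interior.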


C. Selection of Alternate Goal Point

A polygonal goal can be visited at any point of its border. The closest point of the goal border to a node can be determined in the winner selection phase. To find such a point, the straight line segments forming the goal border are considered instead of the goal centroid. Moreover, a goal can be attached to the map, and therefore, only segments lying inside the free space of W are used. Let Sg = {s1, s2, ..., sk} be the border segments of the polygonal goal g that are entirely inside W. Then, the winner node ν* is selected from the set of non-inhibited nodes regarding the shortest path S(ν, s) from a point ν to the segment s, s ∈ Sg. Besides the winner node, a point p at the border of g is found in the winner selection procedure as a result of the determination of S(ν*, s). The border point p is then used as an alternate point goal for the adaptation; therefore, this modification is denoted as alternate goal.

TABLE I: Properties of environments and their polygonal representation

Map       Dimensions [m x m]   No. vertices   No. holes   No. convex polygons
jh        20.6 x 23.2          196            9           77
pb        133.3 x 104.8        89             3           41
h2        84.9 x 49.7          1 061          34          476
dense     21.0 x 21.5          288            32          150
potholes  20.0 x 20.0          153            23          75

Fig. 5: An example of (a) the alternate goal (border) point determined for the winner node and (b) the final found path, L = 56.9 m. Red straight line segments around the goal regions denote parts of the goal border inside the free space of W.

Determination of the exact shortest point-segment path can be too computationally demanding; therefore, the following approximation is proposed. First, the Euclidean distance between the node ν and the segment s is determined. If the distance is smaller than the distance of the current winner node candidate, then the resulting point p of s is used to determine an approximate path among obstacles between p and ν. If |S(p, ν)| is shorter than the path length of the current winner node candidate to its border point, ν becomes the new winner candidate and p is the current alternate goal (border) point. Even though this modification is similar to the modification described in Section IV-B, it provides a sampling of the goal boundary with a smaller distance of the goal point to the winner node; thus, a shorter final path can be found. An example of a found alternate goal point and the found path is shown in Fig. 5.
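The Euclidean pre-check used in this winner selection needs the closest point of a border segment to a node; a standard projection sketch in C++ is given below (illustrative names, not the authors' code).

#include <algorithm>
#include <cmath>

struct Pt { double x, y; };

// Closest point of segment a-b to point p; the Euclidean distance to it is
// returned through *dist (if non-null).
Pt closestPointOnSegment(const Pt &a, const Pt &b, const Pt &p, double *dist) {
  double abx = b.x - a.x, aby = b.y - a.y;
  double len2 = abx * abx + aby * aby;
  double t = (len2 < 1e-12)
                 ? 0.0
                 : ((p.x - a.x) * abx + (p.y - a.y) * aby) / len2;
  t = std::max(0.0, std::min(1.0, t));        // clamp to the segment
  Pt q{a.x + t * abx, a.y + t * aby};
  if (dist) *dist = std::hypot(p.x - q.x, p.y - q.y);
  return q;
}

Only if this Euclidean distance beats the current winner candidate is the more expensive approximate path among obstacles between the returned point and the node evaluated.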

V. EXPERIMENTAL RESULTS

The proposed adaptation rules of Section IV have been experimentally verified on a set of problems. Due to the lack of commonly available multi-goal path planning problems with polygonal goals, several problems have been created within maps of real and artificial environments. An overview of the basic properties of the environments is shown in Table I.

The last column shows the number of convex polygons of the supporting convex polygon partition utilized in the approximation of the shortest path. The partition is found by Seidel's algorithm [15]. Maps jh, pb, and h2 represent real environments (building plans), while maps dense and potholes are artificial environments with many obstacles. Sets of polygonal goals have been placed within the maps in order to create representative multi-goal path planning problems. The name of a problem is derived from the name of the map, the considered visibility range d in meters written as a subscript, and a particular problem variant, i.e., the problem name has the form mapd-variant. The value of d restricts the size of the convex polygonal goals, i.e., all vertices are within a mutual distance less than d. An unrestricted visibility range is considered in problems without the subscript. The three proposed variants of the SOM-based algorithm for the MTP with polygonal goals have been experimentally evaluated on the set of problems. The algorithms are randomized; therefore, twenty solutions of each problem have been found by each algorithm variant. The average length of the path L, the minimal found path length Lmin, and the standard deviation in percent of L (denoted as sL%) are used as the quality metrics. All presented length values are in meters. The experimental results are shown in Table II, where n is the number of goals. The best found solutions are shown in Fig. 6. From the visualized solutions, one can see that high quality solutions are found for all problems. The required computational time is presented in the column T. All algorithms have been implemented in C++, compiled by G++ version 4.2 with -O2 optimization, and executed within the same computational environment using a single core of a 2 GHz CPU. Therefore, the presented computational times can be directly compared. The time includes the determination of all shortest paths between map vertices (used in the path approximation) and the adaptation time. The supporting convex partition and the complete visibility graph are found in tens of milliseconds, and therefore, these times are negligible with respect to the time of the adaptation procedure and the determination of the shortest paths.

Discussion

The presented results provide a performance overview of the proposed adaptation rules. The principles of the attraction and alternate goal algorithm variants are very similar; however, the alternate goal variant provides better results. The advantage of the alternate goal is the sampling of the goals' borders.


TABLE II: Experimental results (L = average path length, sL% = standard deviation in percent of L, Lmin = minimal found path length, T = required computational time)

                            goal interior                       attraction                          alternate goal
Problem          n    L [m]   sL%   Lmin    T [s]    L [m]   sL%   Lmin    T [s]    L [m]   sL%   Lmin    T [s]
dense-small      35   119.5   3.21  113.51  0.85     114.4   3.47  108.64  0.96     111.8   3.09  106.11  1.20
dense5-A         9    68.2    1.75  66.21   0.41     62.4    2.19  60.60   0.43     58.7    1.75  58.06   0.44
h25-A            26   425.3   1.42  416.22  3.56     405.9   1.25  399.62  3.54     402.1   0.97  396.46  3.75
jh-rooms         21   103.6   1.49  101.37  0.28     88.2    0.22  87.83   0.33     88.1    0.26  87.79   0.34
jh10-doors       21   71.5    3.24  67.48   0.47     68.0    1.52  66.11   0.46     62.2    0.25  62.06   0.57
jh10-coverage    106  134.2   2.52  128.31  3.64     106.6   1.59  101.19  4.19     93.8    0.47  93.12   6.22
jh4-A            16   66.6    2.97  64.00   0.44     61.5    2.74  59.10   0.45     57.3    1.02  56.80   0.51
jh5-corridors    11   69.6    2.05  66.96   0.38     66.0    1.23  64.83   0.39     60.0    0.56  59.60   0.41
pb5-A            7    277.8   3.21  268.61  0.84     273.8   4.55  265.56  0.88     270.1   2.66  264.91  0.86
potholes2-A      13   73.9    2.09  72.01   0.13     72.0    1.95  70.42   0.14     72.0    1.78  70.32   0.16

Fig. 6: The best found solutions: (a) jh4-A, Lbest = 56.6 m; (b) jh5-corridors, Lbest = 59.6 m; (c) jh10-doors, Lbest = 62.1 m; (d) jh-rooms, Lbest = 87.8 m; (e) jh10-coverage, Lbest = 93.1 m; (f) h25-A, Lbest = 395.6 m; (g) pb5-A, Lbest = 264.7 m; (h) dense-small, Lbest = 102.8 m; (i) dense5-A, Lbest = 58.0 m; (j) potholes2-A, Lbest = 68.7 m.

Even though a simple approximation of the shortest path between a node (point) and a goal's segment is used, the precision of the approximation increases with the node movements toward the goal, and therefore, a better point of the goal is sampled. This is an important benefit of the SOM adaptation, which allows the usage of a relatively rough approximation of the shortest path. On the other hand, the attraction algorithm variant is more straightforward, as the path to the centroid is utilized as a path to a fixed point goal. The fixed point goals allow using pre-computed shortest paths from the map vertices to the goals, which improves the precision of the approximate node-goal path.


Such an approximation is less computationally intensive at the cost of higher memory requirements. However, this benefit is not evident from the results, because the alternate goal variant provides a faster convergence of the network. Although convex goals are assumed in the problem formulation, the presented adaptation rules do not depend on the goal convexity. Convex goals are advantageous in visual inspection tasks (covering tasks), because the whole goal region is inspected by visiting the goal at any of its points. Also, a point representative of a convex goal can simply be computed as the centroid. If a goal is not convex, a point that is inside the goal has to be determined for the goal interior and attraction algorithms. Basically any point inside the goal can be used, but a bias toward that point can be expected. The alternate goal algorithm variant uses a set of segments representing the goal, and therefore, this algorithm can be directly used for problems with non-convex goals (see Fig. 7).

Fig. 7: Found solutions of problems with non-convex goals by the alternate goal algorithm variant: (a) the jh environment, (b) the potholes environment.

Regarding problems with disjoint convex goals that are relatively far from each other, a sequence of goal visits can be found as a solution of the TSP for the centroids of the goals. Then, having the sequence of polygonal goals, the problem of finding the path connecting them can be formulated as the touring polygons problem (TPP) if start and end points are given. A polynomial algorithm exists for the TPP with convex goals lying inside a simple polygon [4]; however, the TPP is NP-hard in general. Therefore, the proposed alternate goal algorithm seems to be more practical due to its flexibility.

VI. CONCLUSION

A self-organizing map based algorithm for the multi-goal path planning problem in the polygonal domain has been presented. Three variants of the algorithm addressing polygonal goals have been proposed and experimentally evaluated on a set of problems including an instance of the WRP with restricted visibility range (jh10-coverage). Even though the solution quality is not guaranteed because of SOM, the algorithms provide high quality solutions with respect to the experimental results. The advantage of the proposed alternate goal algorithm is that it provides a flexible approach to solve various routing problems including the TSP, WRP, safari route problems, and their variants in the polygonal domain. From the practical point of view, the proposed SOM algorithm is based on relatively simple algorithms and supporting structures, which is an additional benefit. SOM is not a typical technique used for routing problems motivated by robotic applications. The presented results demonstrate the flexibility of the SOM based algorithm; thus, they may encourage roboticists to consider SOM as a suitable planning technique for other multi-goal path planning problems.

ACKNOWLEDGMENT

This work has been supported by the Grant Agency of the Czech Technical University in Prague under grant No. SGS10/185, by the Ministry of Education of the Czech Republic under Projects No. 1M0567 and No. LH11053 and partially by Project No. 7E08006, and by the EU Project No. 216342.

REFERENCES

[1] Sergei Bespamyatnikh. An O(n log n) algorithm for the Zoo-keeper's problem. Computational Geometry: Theory and Applications, 24(2):63-74, 2002.
[2] Wei-Pang Chin and Simeon Ntafos. Optimum watchman routes. In SCG '86: Proceedings of the Second Annual Symposium on Computational Geometry, pages 24-33, Yorktown Heights, New York, United States, 1986. ACM Press.
[3] Mark de Berg, Joachim Gudmundsson, Matthew J. Katz, Christos Levcopoulos, Mark H. Overmars, and A. Frank van der Stappen. TSP with neighborhoods of varying size. Journal of Algorithms, 57(1):22-36, 2005.
[4] Moshe Dror, Alon Efrat, Anna Lubiw, and Joseph S. B. Mitchell. Touring a sequence of polygons. In STOC '03: Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing, pages 473-482, San Diego, CA, USA, 2003. ACM Press.
[5] Adrian Dumitrescu and Joseph S. B. Mitchell. Approximation algorithms for TSP with neighborhoods in the plane. Journal of Algorithms, 48(1):135-159, 2003.
[6] Jan Faigl. Approximate Solution of the Multiple Watchman Routes Problem with Restricted Visibility Range. IEEE Transactions on Neural Networks, 21(10):1668-1679, 2010.
[7] Jan Faigl. On the performance of self-organizing maps for the non-Euclidean traveling salesman problem in the polygonal domain. Information Sciences, 181(19):4214-4229, 2011.
[8] Jan Faigl, Miroslav Kulich, and Libor Přeučil. A sensor placement algorithm for a mobile robot inspection planning. Journal of Intelligent & Robotic Systems, 62(3-4):329-353, 2011.
[9] Jan Faigl, Miroslav Kulich, Vojtěch Vonásek, and Libor Přeučil. An Application of Self-Organizing Map in the non-Euclidean Traveling Salesman Problem. Neurocomputing, 74:671-679, 2011.
[10] Jacob E. Goodman and Joseph O'Rourke, editors. Handbook of Discrete and Computational Geometry. CRC Press LLC, Boca Raton, FL, 2004.
[11] Fajie Li and Reinhard Klette. An approximate algorithm for solving the watchman route problem. In RobVis'08: Proceedings of the 2nd International Conference on Robot Vision, pages 189-206, Berlin, Heidelberg, 2008. Springer-Verlag.
[12] Simeon Ntafos. Watchman routes under limited visibility. Computational Geometry: Theory and Applications, 1(3):149-170, 1992.
[13] Shmuel Safra and Oded Schwartz. On the complexity of approximating TSP with neighborhoods and related problems. Computational Complexity, 14(4):281-307, 2006.
[14] Mitul Saha, Tim Roughgarden, Jean-Claude Latombe, and Gildardo Sánchez-Ante. Planning Tours of Robotic Arms among Partitioned Goals. International Journal of Robotics Research, 25(3):207-223, 2006.
[15] Raimund Seidel. A simple and fast incremental randomized algorithm for computing trapezoidal decompositions and for triangulating polygons. Computational Geometry: Theory and Applications, 1(1):51-64, 1991.
[16] Steven N. Spitz and Aristides A. G. Requicha. Multiple-Goals Path Planning for Coordinate Measuring Machines. In ICRA, pages 2322-2327, 2000.
[17] Xuehou Tan and Tomio Hirata. Finding shortest safari routes in simple polygons. Information Processing Letters, 87(4):179-186, 2003.


Real-Time 3D Perception and Efficient Grasp Planning for Everyday Manipulation Tasks
Joerg Stueckler, Ricarda Steffens, Dirk Holz, and Sven Behnke Autonomous Intelligent Systems Group, University of Bonn, 53113 Bonn, Germany

Abstract: In this paper, we describe efficient methods for solving everyday mobile manipulation tasks that require object pick-up. In order to achieve real-time performance in complex environments, we focus our approach on fast yet robust solutions. For 3D perception of objects on planar surfaces, we develop scene segmentation methods that process Microsoft Kinect depth images in real-time at high frame rates. We efficiently plan feasible, collision-free grasps on the segmented objects directly from the perceived point clouds to achieve fast execution times. We evaluate our approaches quantitatively in lab experiments and also report on the successful integration of our methods in public demonstrations at RoboCup German Open 2011 and RoboCup 2011 in Istanbul, Turkey. Index Terms: Scene Segmentation, Grasp Planning, Mobile Manipulation

I. INTRODUCTION

Mobile manipulation tasks in domestic environments require a vast set of perception and action capabilities. The robot not only requires localization, mapping, path planning, and obstacle avoidance abilities to safely navigate through the environment. It also needs to integrate object detection, recognition, and manipulation. A typical requirement for a service robot is not just to achieve the task, but to perform it in reasonable time. While much research has been invested into the general solution of complex perception and motion planning problems, only little work has focused on methods that solve the tasks efficiently in order to allow for continuous task execution without interruptions. In this paper, we present fast methods to flexibly grasp objects from planar surfaces. To achieve fast performance, we combine real-time object perception with efficient grasp planning and motion control. For real-time perception, we combine rapid normal estimation using integral images with efficient segmentation techniques. We segment the scene into the support plane of interest and the objects thereon. Our perception algorithm processes depth images of a Microsoft Kinect in real-time at a frame rate of approx. 16 Hz. From the raw object point clouds, our grasp planning method derives feasible, collision-free grasps within about 100 milliseconds. We consider grasps on objects from either the side or from above. The planned grasps are then executed using parametrized motion primitives. We integrate our approaches into a system that we publicly evaluate at RoboCup competitions. We further conduct experiments in our lab to demonstrate the robustness and efficiency of our approaches. This paper is organized as follows: after a brief system overview in Sec. III, we detail our approaches to real-time 3D perception and efficient grasp planning in Sec. IV and Sec. V, respectively.

Fig. 1. Top: Cosero grasps a spoon and pours milk into a bowl of cereals. A 5 min video showing the complete demonstration at RoboCup GermanOpen 2011 is available at http://nimbro.net.

In Sec. VI, we evaluate our approaches and report on successful public demonstrations at RoboCup GermanOpen 2011 and at RoboCup 2011 in Istanbul, Turkey.

II. RELATED WORK

Many research groups currently develop systems for mobile manipulation in everyday environments. A very prominent example is the Personal Robot 2 (PR2) developed by Willow Garage [5]. It is equipped with two 7 DoF compliant arms and a parallel gripper with touch sensor matrices on the gripper tips. Similar to our approach, they derive feasible, collision-free grasps from the raw object point cloud [3]. They select the best-ranked grasp and plan a collision-free motion for the arm, taking into account obstacles that are perceived by the robot's 3D sensors. While the authors demonstrate that the approach can robustly grasp a variety of objects in a wide range of configurations, the execution speed of the system for perception and grasping is still far slower than human performance. Further systems that perform object manipulation in cluttered environments have been reported by Srinivasa et al. [8, 7]. In [8], the authors present a robotic busboy system in which a mobile tray delivers mugs to a statically mounted manipulator. The manipulator grasps the mugs and loads them into a dishwasher rack. A real-time vision system that is designed for the mugs estimates the pose of the mugs on the tray. Since the objects are known, valid grasps on the mug are precomputed.


The grasp planner then selects online a best feasible grasp based on several criteria like reachability and collision avoidance. The authors report a total duration of 51 sec on average for executing a grasp and releasing the mug in the dishrack. With the robot HERB [7], the vision system has been extended to more general object recognition and motion planning. While object recognition is aborted after 1 sec, the planning of motions is reported to take several seconds. Our approach is not restricted to recognizable objects. Jain and Kemp develop EL-E [4], a mobile manipulator that shall assist the impaired. EL-E consists of a Katana manipulator on a vertical linear actuator mounted on an Erratic differential drive. While we extract object information in real-time from a depth image sensor, they segment measurements of a 3D laser using connected components labelling to find object clusters above table height. Similar to our approach, they perform top grasps along the object's principal axis. However, side grasps are not considered. If an object is too high or too wide to fit into the gripper, they also consider overhead grasps on the top-most points of the object. To ensure that the grasping motion is not in collision, a cuboid volume from the manipulator base to the object is checked for obstacles. Morales et al. [11] propose a system that selects feasible, collision-free grasps on objects from a database. They determine the set of feasible grasps on an object from its CAD model in an offline phase. After the object has been recognized and localized with a stereo vision system, a grasp simulation framework (GraspIt! [6]) is used to select a collision-free grasp among the potential grasps on the object. The authors report 5 ms computation time for the recognition of objects in a database of 5 objects. The time for planning collision-free, feasible grasps in GraspIt is reported to range from seconds to several minutes in [6].

III. SYSTEM OVERVIEW

A. Design of the Cognitive Service Robot Cosero

Domestic environments are designed for the specific capabilities of the human body. It is therefore natural to endow the robot with an anthropomorphic upper body scheme for similar manipulation abilities. Furthermore, the actions of the robot become predictable and interpretable when they are performed human-like. In such environments, robots also have to interact closely with humans. By its lightweight design, Cosero is inherently less dangerous than a heavy-weight industrial-grade robot. Finally, the robot should also possess natural sensing capabilities, e.g., vision and audio, since humans design their environments to be salient and distinguishable in such perception channels. We focused the design of Cosero on such typical requirements of household settings. We equipped Cosero with an omnidirectional drive to maneuver in the narrow passages found in household environments. Its two anthropomorphic arms resemble average human body proportions and reaching capabilities. A yaw joint in the torso enlarges the workspace of the arms. In order to compensate for the missing torso pitch joint and legs, a linear actuator in the trunk can move the upper body vertically by approx. 0.9 m. This allows the robot to manipulate at similar heights as humans.

Cosero has been constructed from light-weight aluminum parts. All joints in the robot are driven by Robotis Dynamixel actuators. These design choices allow for a light-weight and inexpensive construction, compared to other domestic service robots. While each arm has a maximum payload of 1.5 kg and the drive has a maximum speed of 0.6 m/sec, the low weight (in total ca. 32 kg) requires only moderate actuator power. Compared to its predecessor Dynamaid [9], we increased payload and precision of the robot by stronger actuation. Cosero perceives its environment with a variety of complementary sensors. The robot senses the environment in 3D with a Microsoft Kinect RGB-D camera in its head that is attached to the torso with a pan-tilt unit in the neck. To improve the robustness of manipulation, the robot can measure the distance to obstacles directly from the grippers. We attached infrared distance sensors to each gripper that point downward and forward in the finger tips. Another sensor is placed in the palm and measures the distance to objects within the gripper.

B. Mobile Manipulation in Everyday Environments

We develop Cosero to perform a variety of mobile manipulation tasks in everyday environments. For mobile manipulation, we combine safe navigation of the robot through the environment with motion control methods for the upper body.

1) Motion Control: We developed omnidirectional driving for Cosero's eight-wheeled mobile base [9]. The linear and angular velocity of the drive can be set independently and can be changed continuously. We determine the steering direction and the individual wheel velocities of the four differential drives, which are located at the corners of the rectangular base, from an analytical solution to the drive's inverse kinematics (a generic sketch of this computation is given below). For the anthropomorphic arms, we implemented differential inverse kinematics with redundancy resolution [9]. We also developed compliance control for the arms [10]. For our method we exploit that the servo actuators are back-drivable and that the torque which the servo applies for position control can be limited. Compliance can be set for each direction in task- or joint-space separately. For example, the end-effector can be kept loose in both lateral directions while it keeps the other directions at their targets. With these methods Cosero can perform a variety of parameterizable motions like grasping, placing objects, and pouring out containers.

2) Mobile Manipulation: In typical everyday mobile manipulation scenarios, the workspace of a statically mounted manipulator is too small. One possible solution to achieve a larger workspace is to construct robots with a restricted manipulator workspace but to extend it with a mobile base. Since the robot is not statically mounted to the environment, it has to estimate its pose in reference to static parts of the environment like walls, dynamic objects, and persons. We propose a coarse-to-fine strategy to align the robot to the objects involved in mobile manipulation. For example, when the robot grasps an object from a table, it first approaches the table roughly within the reference frame of the static map. Then, it adjusts in height and distance to the table. Finally, it aligns itself to bring the object into the workspace of its arms. Cosero grasps objects on horizontal surfaces like tables and shelves in a height range from ca. 0.3 m to 1 m [9].
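The analytical inverse kinematics of such an omnidirectional base can be illustrated with a generic sketch: for a commanded planar twist (vx, vy, omega), each steerable wheel module mounted at offset r receives the velocity v + omega x r, from which its steering direction and speed follow. The module positions used below are hypothetical examples, not Cosero's actual geometry, and the per-wheel velocities within each differential pair are omitted.

#include <cmath>
#include <cstdio>

struct WheelCmd { double steer_angle; double speed; };

// Command for a wheel module mounted at (rx, ry) in the base frame, given
// a base twist (vx, vy, omega): the module velocity is v + omega x r.
WheelCmd wheelCommand(double vx, double vy, double omega,
                      double rx, double ry) {
  double wx = vx - omega * ry;   // x-component of the module velocity
  double wy = vy + omega * rx;   // y-component of the module velocity
  WheelCmd c;
  c.steer_angle = std::atan2(wy, wx);   // steering direction of the module
  c.speed = std::hypot(wx, wy);         // speed of the module
  return c;
}

int main() {
  // Hypothetical module positions at the corners of a 0.6 m x 0.4 m base.
  const double rx[4] = {0.3, 0.3, -0.3, -0.3};
  const double ry[4] = {0.2, -0.2, 0.2, -0.2};
  for (int i = 0; i < 4; ++i) {
    WheelCmd c = wheelCommand(0.3, 0.0, 0.5, rx[i], ry[i]);
    std::printf("module %d: steer %.2f rad, speed %.2f m/s\n",
                i, c.steer_angle, c.speed);
  }
  return 0;
}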


Fig. 2. (a) Example table-top setting. (b) Raw point cloud from the Kinect with RGB information. (c) Each detected object is marked with a random color.

It also carries the object and hands it to human users. We also developed solutions to pour out containers, to place objects on horizontal surfaces, to dispose of objects in containers, to grasp objects from the floor, and to receive objects from users. When handing an object over, the arms are compliant in the upward direction so that the human can pull the object, the arm complies, and the object is released. For receiving an object from a person, we localize the handed object with the depth camera and drive towards it. As soon as the object is reachable with the arms, the robot grasps it.

IV. REAL-TIME 3D PERCEPTION

A typical task for a domestic service robot is to fetch and deliver objects. This involves detecting objects in the robot's workspace and recognizing them. In household environments, objects are usually located on planar surfaces such as tables. In [9] we proposed to use the laser scanner in the lower torso to detect such objects. This approach, however, is not able to perceive valuable 3D information like the object's height or principal axes. We therefore developed real-time 3D scene segmentation with the RGB-D camera. In order to identify objects, we extract texture and color information.

A. Real-Time 3D Scene Segmentation

Our approach to object detection processes images of a depth camera such as the Microsoft Kinect at frame rates of approx. 16 Hz. This enables our system to extract information about the objects in a scene with a very low latency for further decision-making and planning stages. We base our approach on fast planar segmentation of the scene. We achieve the high computational speed of our approach by combining rapid normal estimation with efficient segmentation techniques. The basic idea of the normal estimation method is to determine local surface normals from the cross product of two tangents to the surface. For each pixel in the depth image, the tangents are estimated from local pixel neighbors. In the simplest case, both tangents could be calculated from just the horizontal and vertical neighbors, respectively. However, this approach would be highly prone to measurement noise. The tangent estimates should therefore be averaged in an image neighborhood. By using integral images, such averaging operations can be processed rapidly in constant time, independent of the neighborhood size.
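The following C++ sketch illustrates the idea: a summed-area (integral) image of the 3D points allows window means to be computed in O(1), the horizontal and vertical tangents are taken as differences of such means, and the normal is their cross product. It is a simplified illustration (invalid depth readings and image borders are not handled), not the authors' implementation, and the type and function names are assumptions.

#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 cross(Vec3 a, Vec3 b) {
  return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static Vec3 normalized(Vec3 a) {
  double n = std::sqrt(a.x * a.x + a.y * a.y + a.z * a.z);
  return (n > 1e-12) ? Vec3{a.x / n, a.y / n, a.z / n} : a;
}

// Summed-area table of an organized point cloud (row-major, w x h).
struct IntegralImage {
  int w, h;
  std::vector<Vec3> s;   // (w + 1) x (h + 1) prefix sums
  IntegralImage(const std::vector<Vec3> &pts, int w_, int h_)
      : w(w_), h(h_), s((w_ + 1) * (h_ + 1), Vec3{0, 0, 0}) {
    for (int r = 0; r < h; ++r)
      for (int c = 0; c < w; ++c) {
        const Vec3 &p = pts[r * w + c];
        const Vec3 &up = s[r * (w + 1) + (c + 1)];
        const Vec3 &left = s[(r + 1) * (w + 1) + c];
        const Vec3 &diag = s[r * (w + 1) + c];
        s[(r + 1) * (w + 1) + (c + 1)] = {p.x + up.x + left.x - diag.x,
                                          p.y + up.y + left.y - diag.y,
                                          p.z + up.z + left.z - diag.z};
      }
  }
  // Mean point inside the pixel window [r0, r1] x [c0, c1] in O(1).
  Vec3 mean(int r0, int c0, int r1, int c1) const {
    r0 = std::max(r0, 0); c0 = std::max(c0, 0);
    r1 = std::min(r1, h - 1); c1 = std::min(c1, w - 1);
    const Vec3 &A = s[r0 * (w + 1) + c0];
    const Vec3 &B = s[r0 * (w + 1) + (c1 + 1)];
    const Vec3 &C = s[(r1 + 1) * (w + 1) + c0];
    const Vec3 &D = s[(r1 + 1) * (w + 1) + (c1 + 1)];
    double n = double(r1 - r0 + 1) * double(c1 - c0 + 1);
    return {(D.x - B.x - C.x + A.x) / n, (D.y - B.y - C.y + A.y) / n,
            (D.z - B.z - C.z + A.z) / n};
  }
};

// Surface normal at pixel (r, c): tangents from window-averaged neighbours.
Vec3 surfaceNormal(const IntegralImage &I, int r, int c, int k) {
  Vec3 right = I.mean(r - k, c + 1, r + k, c + 2 * k);
  Vec3 left  = I.mean(r - k, c - 2 * k, r + k, c - 1);
  Vec3 down  = I.mean(r + 1, c - k, r + 2 * k, c + k);
  Vec3 up    = I.mean(r - 2 * k, c - k, r - 1, c + k);
  return normalized(cross(sub(right, left), sub(down, up)));
}

Building the table is linear in the number of pixels, and each normal afterwards costs a constant number of table look-ups, which is what makes the per-normal cost independent of the averaging window size.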

The overall runtime complexity of this approach is linear in the number of points for which normals are computed. One way to find all planes in a scene would be to extract planes from the local surface normals in a two-stage process [2]. First, one could find clusters in normal orientation, which are then separated by clustering in the plane distance to the origin. Since we assume here that the robot as well as the objects in the environment are typically standing on horizontal surfaces, we instead focus our method on local surface normals close to the vertical direction. On all points with such a normal orientation and within a 3D region of interest, we apply efficient RANSAC [1] to determine the horizontal plane. Then, we find the points above the plane and extract object clusters whose projections lie within the convex hull of the support plane. For the clustering we assume that there is a small space between the objects. The size of this space has to be chosen carefully, since, due to occlusions, parts of an object may be disconnected. Fig. 2 shows a typical segmentation result for a table-top scene. A multi-object tracker is constantly updated with the detected objects.

V. EFFICIENT GRASP PLANNING

Objects in everyday manipulation scenarios are highly variable in shape and appearance. Furthermore, the configuration of objects and obstacles in a scene is strongly unstructured. It is therefore challenging to develop a grasp planning method that can cope with any encountered situation. Our approach is specifically suited for rigid objects whose shape is symmetric along the principal axes of the object. We also assume that the center of gravity roughly coincides with the center of the object. While many objects meet these assumptions, our approach can also yield robust grasps for objects that violate the constraints. We developed flexible grasping motions to grasp objects from the side or from above. When the robot encounters a new situation, it plans and executes a feasible, collision-free grasp on the object of interest. The robot perceives the scene with its depth camera. It interprets the raw point representation of the objects on the grasp surface, which is provided by our real-time 3D perception method (see Sec. IV-A).


Fig. 3. Left: We extract object pose and shape properties from the object points. The arrows mark the bounding box of the objects along the principal axes. Right: We rank feasible, collision-free grasps (red, size proportional to scores) and select the most appropriate one (larger RGB-coded coordinate frame).

Fig. 4. For each sampled grasp (final position: black dot, pre-grasp at the frame origin, x-direction: red arrow, y-direction: green arrow) we check for collisions that may occur during the execution of the grasp motion primitive. All points on obstacles are projected into the horizontal plane. We require the region around the shoulder (right yellow circle) within upper-arm length distance ru to contain no obstacles. We further require that the gripper and the forearm can move towards the object by checking a cone with a given opening angle and forearm length rf behind the grasping pose. We extend the cone towards the robot's center position to cover the area swept during the reaching motion. At the final grasp position (black circle) the gripper is not in collision when there is no obstacle within a distance of rg.

A. Grasp Motion Primitives

We distinguish two kinds of grasps for which we apply parametrizable motion primitives. Side-grasps are designed to approach the object along its vertical axis, keeping the parallel grippers aligned horizontally. To grasp objects from the top, we pitch the end-effector by 45 degrees downward to grasp objects with the finger tips. Both kinds of grasps are flexible in the orientation around the vertical upward direction. However, we limit the yaw orientation to a range between 0 and 90 degrees (for the right arm) due to kinematic constraints of the robot arm and torso. Orientations beyond this range are grasped at the closest limit angle. Alternatively, the robot can simply choose its left arm to grasp within the reachable range. The motion primitives approach the pre-grasp poses on a direct line with an open gripper. We establish the yaw orientation at the pre-grasp pose by smooth interpolation along the reaching trajectory. Once the pre-grasp pose is reached, the side-grasp motion primitive simply approaches the object and closes the gripper. For the top-grasp motion, we do not establish the pitch orientation of the pre-grasp pose until the pre-grasp position has been reached. We assume that the pre-grasp positions are placed at a fixed distance (0.1 m in our case) behind the grasp position along the grasp direction. We use the IR distance sensors in the gripper to determine premature contact with the object or the support plane. In such a case, the approach of the object is stopped. After the object has been grasped, the end-effector moves back to its initial pose.

B. Planning of Collision-Free Grasps

The grasp planner selects a feasible, collision-free grasp for the object of interest. It samples grasp candidates, removes infeasible and colliding grasps, and ranks the remaining grasps to find the most promising one. The planner outputs a pre-grasp pose to parametrize the grasping motion. A grasp pose directly corresponds to the pose of the end-effector, which we define as follows: We place the grasp at the center of the gripper. The x-axis and y-axis of the grasp pose align with the direction from wrist to finger tips and the direction from the right to the left finger, respectively.

1) Sampling of Grasp Candidates: We sample grasp candidates depending on pose and shape properties of the object.

In order to determine these properties, we project the raw points on the object into the horizontal plane and measure the principal axes of the point distribution. In addition, we determine height, center, and bounding box (aligned with the principal axes) of the object. Once the shape and pose of the object are known, we determine feasible grasps on the object. For the side-grasps, we sample pre-grasp poses on an ellipse in the horizontal plane at equally sized angular intervals. The center and axes of the ellipse directly correspond to the properties of the object's bounding box. The diameters of the ellipse add the distance towards the grasp point to the diameters of the bounding box (0.1 m in our implementation). We grasp the objects as low as possible above the surface at a specific height. This makes the grasping more robust against measurement and control inaccuracies. Otherwise, the object could easily topple over when the robot touches the object while moving in the grasping direction. We set the grasp height to half the height of the gripper plus a safety padding of 0.03 m. We sample the top grasps equidistantly along both principal axes through the center of the bounding box. Due to kinematic constraints of our anthropomorphic arms, we constrain the pitch of the grasp to 45 degrees in downward direction. We place the pre-grasp pose 0.1 m above the object's height, but at least 0.1 m above the support plane.
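A sketch of this sampling step in C++ is given below: the principal axes are obtained from the 2D covariance of the projected object points, the aligned bounding box from projections onto those axes, and the side-grasp pre-grasp positions from an ellipse enlarged by the 0.1 m margin. The type and function names are illustrative, and the fixed grasp height and the top-grasp sampling are omitted.

#include <algorithm>
#include <cmath>
#include <vector>

struct P2 { double x, y; };
struct SideGrasp { double x, y, yaw; };   // pre-grasp position and approach yaw

// Principal-axis direction, centre and axis-aligned box extents of the
// object points projected into the horizontal plane.
void principalAxesBox(const std::vector<P2> &pts, P2 *center, double *angle,
                      double *extent_major, double *extent_minor) {
  double mx = 0.0, my = 0.0;
  for (const P2 &p : pts) { mx += p.x; my += p.y; }
  mx /= pts.size();
  my /= pts.size();
  double cxx = 0.0, cxy = 0.0, cyy = 0.0;   // unnormalized covariance sums
  for (const P2 &p : pts) {
    cxx += (p.x - mx) * (p.x - mx);
    cxy += (p.x - mx) * (p.y - my);
    cyy += (p.y - my) * (p.y - my);
  }
  double ang = 0.5 * std::atan2(2.0 * cxy, cxx - cyy);   // major-axis direction
  double ca = std::cos(ang), sa = std::sin(ang);
  double lo1 = 1e300, hi1 = -1e300, lo2 = 1e300, hi2 = -1e300;
  for (const P2 &p : pts) {
    double u = (p.x - mx) * ca + (p.y - my) * sa;    // along the major axis
    double v = -(p.x - mx) * sa + (p.y - my) * ca;   // along the minor axis
    lo1 = std::min(lo1, u); hi1 = std::max(hi1, u);
    lo2 = std::min(lo2, v); hi2 = std::max(hi2, v);
  }
  *center = {mx, my};
  *angle = ang;
  *extent_major = hi1 - lo1;
  *extent_minor = hi2 - lo2;
}

// Side-grasp pre-grasp positions sampled on an ellipse around the object.
std::vector<SideGrasp> sampleSideGrasps(const std::vector<P2> &obj,
                                        int samples, double margin = 0.1) {
  P2 c; double ang, a, b;
  principalAxesBox(obj, &c, &ang, &a, &b);
  const double ca = std::cos(ang), sa = std::sin(ang);
  const double kPi = 3.14159265358979323846;
  std::vector<SideGrasp> out;
  for (int i = 0; i < samples; ++i) {
    double t = 2.0 * kPi * i / samples;
    // Point on the ellipse (half extents of the box plus the margin),
    // expressed in the principal-axes frame and rotated back.
    double ex = (0.5 * a + margin) * std::cos(t);
    double ey = (0.5 * b + margin) * std::sin(t);
    SideGrasp g;
    g.x = c.x + ca * ex - sa * ey;
    g.y = c.y + sa * ex + ca * ey;
    g.yaw = std::atan2(c.y - g.y, c.x - g.x);   // approach toward the centre
    out.push_back(g);
  }
  return out;
}

Each returned candidate is then completed with the fixed grasp height described above and handed to the filtering stage.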


TABLE I: Computation time of individual processing stages.

processing stage      mean (std) in msec
normal estimation     7.2 (2.4)
scene segmentation    11.9 (1.4)
object clustering     41.6 (1.5)
grasp planning        98.1 (9.1)

2) Filtering for Feasible and Collision-Free Grasps: Since the sampling stage does not consider any feasibility constraints or collisions, we filter the grasp candidates in a post-processing step. We take the following criteria into account:
- Grasp width. We reject grasps when the object's width orthogonal to the grasp direction does not fit into the gripper.
- Object height. Side-grasps are likely to fail when the object is too small.
- Reachability. We do not consider grasps that are outside of the arm's workspace.
- Collisions. We check for collisions during the reaching and grasping motion.
Fig. 3 shows an example of grasps that satisfy our criteria. One possible solution for collision checking would be to search for collisions of all robot limbs during the complete trajectory of the grasping motion. However, we propose to use simple geometric constraints to find all possible collisions (see Fig. 4). While our method is more conservative, we can find collisions with only little computational effort. We first project all points on obstacles into the horizontal plane. In order to avoid collisions of the upper arm, we search for collisions within a circle around the shoulder with a radius equal to the upper-arm length. We further require that the gripper and the forearm can move towards the object by checking a cone with a given opening angle and forearm length behind the grasping pose. We extend the cone towards the robot's center position to cover the area swept during the reaching motion. Finally, we search for collisions within a small circle at the final grasp position. The radius of this circle is set to the maximum diameter of the open gripper.
3) Ranking of Grasps: We rank the feasible and collision-free grasps according to several criteria:
- Distance to object center. We favor grasps with a smaller distance to the object center.
- Grasp width. We reward grasp widths closer to a preferred width (0.08 m).
- Grasp orientation. Preference is given to grasps with a smaller angle between the line towards the shoulder and the grasping direction.
- Distance from robot. We favor grasps with a smaller distance to the shoulder.
Fig. 3 illustrates this process with example rankings. From the ranked grasps, we find the best top- and side-grasps and select the most appropriate one. This decision depends on the relation of the object height to the largest extent of the object in the horizontal plane. We integrate a small bias towards the faster side-grasps.
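The geometric collision test described above (Fig. 4) can be sketched as follows in C++. The parameter names (r_u for the upper-arm circle, r_f and the cone half-angle for the forearm sweep, r_g for the open-gripper circle) are assumptions for illustration; points belonging to the target object itself are assumed to be excluded from the obstacle set, and the extension of the cone toward the robot's centre is omitted.

#include <cmath>
#include <vector>

struct V2 { double x, y; };

static double dist(const V2 &a, const V2 &b) {
  return std::hypot(a.x - b.x, a.y - b.y);
}

// Conservative geometric collision test for one grasp candidate.
// obstacles: obstacle points projected into the horizontal plane;
// approach: unit vector of the grasping direction (x-axis of the grasp pose).
bool graspCollisionFree(const std::vector<V2> &obstacles, const V2 &shoulder,
                        const V2 &grasp, const V2 &approach, double r_u,
                        double r_f, double r_g, double cone_half_angle) {
  const double cos_half = std::cos(cone_half_angle);
  for (const V2 &o : obstacles) {
    // 1) Upper arm: no obstacle within upper-arm reach of the shoulder.
    if (dist(o, shoulder) < r_u) return false;
    double dx = o.x - grasp.x, dy = o.y - grasp.y;
    double d = std::hypot(dx, dy);
    // 2) Gripper: no obstacle within the open-gripper radius at the grasp.
    if (d < r_g) return false;
    // 3) Forearm sweep: no obstacle inside the cone of length r_f opening
    //    behind the grasp pose, i.e. against the approach direction.
    if (d < r_f) {
      double behind = -(dx * approach.x + dy * approach.y);
      if (behind > d * cos_half) return false;
    }
  }
  return true;
}

Candidates passing this test (and the width, height and reachability checks) are handed on to the ranking stage.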

Fig. 5. Example grasps, segmentations, and grasp planning results for each of the 8 household objects used in the experiments.

VI. EXPERIMENTS

A. Quantitative Results

1) Run-Time Efficiency: In Table I, we summarize average run-times of several stages of our perception and grasp planning pipeline for the scene shown in Fig. 2. For a depth image resolution of 160x120, our table-top segmentation approach achieves an average frame rate of approx. 16 Hz. The experiments have been carried out on an HP Pavilion dv6 notebook with an Intel Core i7 Q720 processor.

Using the integral image approach, normals can be estimated rapidly for the 19200 image pixels within only 7.2 msec on average. The segmentation of the scene into vertical points, applying RANSAC to find the support plane, and determining the points above the support plane requires 11.9 msec (avg.). The clustering of the points into objects takes 41.6 msec (avg.). The computation time in this step depends on the number of objects in the scene. Our approach to grasp planning requires computation time in the same order of magnitude as the segmentation, i.e., 98.1 msec (avg.). The timings demonstrate that our approaches are very performant and yield results in short computation times. We also measured the time for the complete object pick-up process. The robot has already approached the table. It perceives the objects on the table and plans a grasp on the closest object in front. It executes the grasp and moves the gripper back to its starting pose. The overall process takes approx. 15 sec for a side-grasp and 25 sec for a top-grasp.

2) Robustness: We evaluate the robustness of our perception and grasp planning pipeline in a series of experiments. We chose 8 typical household objects and executed 10 grasps with the right arm that cover the range of feasible object orientations for this arm. Fig. 5 shows an example grasp for each object. Table II summarizes the results of the experiment. The robot could grasp most objects very reliably. For the tissues, it sometimes chooses a top-grasp along the shorter side of the object. In one situation it missed the object with this grasp. Our approach also estimates the tea box to be high enough to be grasped from the side in some configurations. Despite the fact that the cloth strongly violates our assumptions that objects are rigid and shaped symmetrically along their principal axes, our method succeeds robustly for this object.

B. Public Demonstration

While one can assess the quality of individual system components in the laboratory, it is difficult to compare robot systems with others.


TABLE II: Success rates (success / trials) when grasping objects 10 times in random orientations.

object        side-grasp   top-grasp
filter box    10 / 10      0 / 0
tea box       1 / 1        9 / 9
banana        0 / 0        10 / 10
cup           10 / 10      0 / 0
chewing gums  0 / 0        10 / 10
tissues       0 / 0        9 / 10
cloth         3 / 3        7 / 7
pen           0 / 0        10 / 10

In recent years, competitions such as the DARPA Grand and Urban Challenges and RoboCup play an important role in assessing the performance of robot systems. The international RoboCup competitions include the @Home league for domestic service robots. In this competition, the robots have to perform tasks defined by the rules of the competition, in a given environment at a predetermined time. In addition, there are open challenges and the final demonstration, where the teams can highlight the capabilities of their robots in self-defined tasks. The simultaneous presence of multiple teams allows for a direct comparison of the systems by measuring objective performance criteria, and by subjective judgment of the scientific and technical merit by a jury. With Cosero, we won the RoboCup GermanOpen 2011 competition. In the finals, Cosero and Dynamaid prepared breakfast within the 10 min demonstration slot. Dynamaid fetched orange juice out of the refrigerator, which it opened and closed successfully. It delivered the bottle to the breakfast table. In the meantime, Cosero grasped a bottle of milk, opened the bottle, and poured the milk into a cereal bowl. Cosero disposed of the empty bottle in the trash bin. It then moved to another table and successfully grasped a spoon with a top-grasp. A jury member had placed the spoon in an arbitrary orientation. Cosero put the spoon next to the cereal bowl and finally waited for an instruction to leave the room. Another jury member pointed towards one of two exit doors using a pointing gesture. Cosero successfully recognized the pointing gesture and left the room through the correct door. The jury awarded us the highest score for the finals. We also won the @Home competition at RoboCup 2011 in Istanbul, Turkey. Early in the competition, in the open challenge, Cosero demonstrated preparing cereals to a jury of team leaders of other teams. In the demo challenge, Cosero cleaned up the apartment by picking up laundry from the ground and putting it into the correct laundry basket. A human user could show beforehand in which baskets to put colored and white laundry using gestures. Afterwards, Cosero picked up 3 objects from a table using the perception and grasping pipeline proposed in this paper. In the first attempt to pick up a carrot, it had to choose a grasp perpendicular to the carrot's principal axis and failed to keep grip on the object. However, in the second attempt, it picked up the carrot successfully along its principal axis. Finally, it grasped a tea box with a top-grasp. The objects had been placed randomly. We could convince the jury with this demonstration and achieved the highest score.

VII. CONCLUSION

In this paper, we proposed highly efficient means to perceive objects on planar surfaces and to plan feasible, collision-free grasps on the object of interest. We integrate our methods into a mobile manipulation system that robustly executes object pick-up in reasonable time without longer processing interruptions, i.e., interruptions are only in the range of milliseconds to seconds. For object perception, we segment depth images of a Microsoft Kinect camera in real-time at a frame rate of up to 16 Hz. We demonstrated that our perception and planning modules yield their results in a very short time. In the integrated system this allows for short and steady execution of the task. Our experiments demonstrate that our method is fast yet robust. In future work, we plan to integrate feedback from touch sensors into the grasp execution. By including top-grasps at high points on the objects we could further extend the range of graspable objects (bowl-like shapes, for instance). We could also improve our grasping pipeline through knowledge on how to grasp specific objects.

ACKNOWLEDGMENTS

This research has been partially funded by the FP7 ICT-2007.2.2 project ECHORD (grant agreement 231143), experiment ActReMa.

REFERENCES
[1] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, 1981.
[2] D. Holz, S. Holzer, R. B. Rusu, and S. Behnke. Real-time plane segmentation using RGB-D cameras. In Proc. of the 15th RoboCup International Symposium, 2011.
[3] K. Hsiao, S. Chitta, M. Ciocarlie, and E. G. Jones. Contact-reactive grasping of objects with partial shape information. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), 2010.
[4] A. Jain and C. C. Kemp. EL-E: an assistive mobile manipulator that autonomously fetches objects from flat surfaces. Autonomous Robots, 28:45-64, January 2010.
[5] W. Meeussen, M. Wise, S. Glaser, S. Chitta, C. McGann, P. Mihelich, E. Marder-Eppstein, M. Muja, V. Eruhimov, T. Foote, J. Hsu, R. B. Rusu, B. Marthi, G. Bradski, K. Konolige, B. P. Gerkey, and E. Berger. Autonomous door opening and plugging in with a personal robot. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), 2010.
[6] A. T. Miller, S. Knoop, H. I. Christensen, and P. K. Allen. Automatic grasp planning using shape primitives. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2003.
[7] S. Srinivasa, D. Ferguson, C. Helfrich, D. Berenson, A. C. Romea, R. Diankov, G. Gallagher, G. Hollinger, J. Kuffner, and J. M. Vandeweghe. HERB: a home exploring robotic butler. Autonomous Robots, 28(1):5-20, January 2010.
[8] S. Srinivasa, D. Ferguson, J. M. Vandeweghe, R. Diankov, D. Berenson, C. Helfrich, and H. Strasdat. The robotic busboy: Steps towards developing a mobile robotic home assistant. In Proc. of the Int. Conf. on Intelligent Autonomous Systems (IAS), 2008.
[9] J. Stückler and S. Behnke. Integrating indoor mobility, object manipulation, and intuitive interaction for domestic service tasks. In Proc. of the 9th IEEE-RAS Int. Conf. on Humanoid Robots (Humanoids), 2009.
[10] J. Stückler and S. Behnke. Compliant task-space control with back-drivable servo actuators. In Proc. of the 15th RoboCup International Symposium, 2011.
[11] A. Morales, T. Asfour, P. Azad, S. Knoop, and R. Dillmann. Integrated grasp planning and visual object localization for a humanoid robot with five-fingered hands. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2006.
1 Videos showing these demonstrations can be found at http://nimbro.net.


Odor guided exploration and plume tracking - Particle Plume Explorer


Gonçalo Cabrita, Pedro Sousa and Lino Marques

Abstract: This paper presents a new algorithm for single-robot odor exploration called Particle Plume Explorer. This algorithm is supported by a novel point cloud based plume mapping method, called Particle Plume, which is also presented here. The developed algorithm was validated through extensive real-world experiments in multiple environments using a single robot equipped with a gas sensor.

I. INTRODUCTION

The sense of smell is becoming increasingly popular among mobile robots. The applications for a gas sensor equipped robot are endless, from patrolling an airport for explosives and other illegal substances to aiding in the search for survivors during a natural catastrophe. Applications like these require a robot or group of robots to determine the location of odor sources. Although some applications are still out of the grasp of current off-the-shelf chemical sensing technologies, the algorithms which will one day drive these robots are starting to emerge. The existing solutions to the odor source localization problem can be divided into three families of algorithms: (1) hill climbing or gradient ascent techniques; (2) biologically inspired algorithms; and (3) probabilistic methods. Gradient based techniques require comparisons between two or more spatially separated chemical measurements, meaning that there must be more than one gas sensor on the same agent or, in the case of a single sensor, the agent must perform multiple measurements separated in time and space in order to estimate the gradient and make a decision for the next move. This behavior has been widely observed in several insects (e.g. Drosophila) and in some bacteria (e.g. E. coli). However, this type of behavior is satisfactory only if the agent is either close to or inside the plume; otherwise it will perform random movements while trying to re-acquire the plume. Biologically inspired algorithms have been broadly used in robotics as an attempt to mimic the successful behaviors of animals, in this case while performing odor tracking or odor source declaration. The bacterium E. coli, which is also referred to in smooth gradient ascent techniques, presents a chemotaxis behavior that consists of a series of movements that probabilistically lead towards higher concentrations. However, the moth is the living being that inspired the most algorithms in this area. The moth performs a set of movements to reach the odor source [1], [2], which consist of moving straight upwind while inside the odor plume - the
G. Cabrita, P. Sousa and L. Marques are with the Department of Electrical and Computer Engineering, Institute of Systems and Robotics, University of Coimbra, 3030-290 Coimbra, Portugal. {goncabrita, pvsousa, lino}@isr.uc.pt

The moth performs a set of movements to reach the odor source [1], [2], which consist of moving straight upwind while inside the odor plume - the surge; then, the moth performs counter-turning patterns (zigzag movements) along the plume and according to the wind direction, aiming to acquire odor cues - the casting. Some silkworm moths [3] also describe a kind of irregular spiral movement when the casting strategy bears no fruit. Along with this case, other bio-inspired algorithms have been implemented and tested in mobile robots to accomplish the task of odor source localization [4], [5], using search strategies like Lévy-taxi [6], [7] or Biased Random Walk [8]. Different bio-inspired approaches which rely on large groups of individuals carrying out a task as a whole (swarm behavior) can be found in [9], [10]. The odor source localization problem has also been researched employing probabilistic methods such as Hidden Markov Methods [11] or Bayesian methods [12], [13]. These works propose to localize the odor source by building a probability map of the source localization and, whenever new information becomes available, updating each cell in the map. In spite of the interest of the research community in the odor source localization problem, most of the proposed approaches are limited by certain constraints, such as the absence of obstacles or near-perfect airflow, meaning that most methods are not suitable for real-world implementation. In fact, the majority of the existing odor localization algorithms have only been demonstrated in simulation so far. The trade-offs between real-world, laboratory and simulation experiments are commonly discussed in the community. It is important to ensure that algorithms that perform well in simulation or in the laboratory will also perform well in real-world experiments, thus the simulation environment must include the key features that make it as close as possible to the real world. Due to the amount of variables involved in the process of odor displacement, the behavior of a chemical plume can appear somewhat random and chaotic, hence realistic plume simulation is extremely difficult to achieve, meaning that real-world experimentation is an important step for validating most odor-related algorithms. The remainder of this paper is organized as follows: Section II presents the proposed approach for solving the odor-oriented exploration and plume tracking problems. In Section III one can find the experimental setup used to validate the developed algorithm. Section IV holds the results and Section V the discussion. The conclusions can be found in Section VI.

II. ODOR SOURCE LOCALIZATION

The odor source localization problem can be divided into three stages, as depicted in Figure 1: (1) exploration of odor cues, (2) plume tracking and (3) odor source declaration.



Fig. 1. Diagram of the various stages of an odor source localization algorithm.

Although odor plume tracking based approaches are not the only solution for the odor source localization problem, they are in fact the most studied by the scientific community and are the basis for this work. While searching for odor sources a robot might often find itself lacking odor cues to follow. Furthermore, due to the chaotic nature of chemical plumes, odor cues might be present yet insufficient to provide an accurate estimate of where the robot should advance to. The odor exploration block in Figure 1 is thus the base of many odor source localization algorithms, providing the robot with the ability to explore the environment, which is unknown from a chemical point of view at the beginning of the search. Without any additional information this becomes an optimization problem; during this stage the robot should attempt to cover the unknown area as efficiently as possible while searching for chemical cues. If during chemical exploration the robot is able to find enough odor cues to make an informed decision, the next step towards the odor source is plume tracking, as depicted in Figure 1. Odor is caused by the volatilization of chemicals which are then mixed with the air molecules. Odor can then travel by means of: (1) advection, caused by the motion of the fluid in which the chemical particles are suspended (in this case the air); (2) diffusion, resulting from the movement of particles along concentration gradients; and (3) a combination of the previous two. This leads to the intermittence of the plume, caused by the irregular spreading pattern. This characteristic is responsible for the high discrepancies in the chemical concentrations that can be measured at nearby points inside the plume, making intermittence one of the main problems while attempting to track a plume. Wind plays an important role in the motion component of the odor propagation. Along with this, changes in the wind direction and speed will also affect the odor plume, causing it to meander and produce instantaneous alterations in its direction. It is thus natural to use information provided by the wind to optimize plume tracking. However, the wind might be almost imperceptible under certain circumstances, or so chaotic that it becomes unreliable, giving no useful information to the algorithm. In the final step, odor source declaration, the robot should finally be able to pinpoint the location of the odor source by means of varying chemical concentrations or by the shape of the plume. Nonetheless, there are some researchers who have combined the olfactory information with other sources of information in order to better locate the source, e.g. using

visual information [14]. This topic is however out of the scope of this paper, since the proposed algorithm only goes as far as plume tracking. In spite of the fact that odor mapping is not traditionally seen as part of the solution for odor source localization, it is shown in Figure 1 as a part of the entire algorithm. Odor mapping is often present in odor related experiments in some way; however, building odor maps is usually performed as a goal and not as a tool. One of the main problems encountered during odor mapping is the fact that the chemical concentration at the same point inside a plume can have extremely high variations over time. The work developed in this project can be divided into two parts: Particle Plume (PP), a plume mapping algorithm, and Particle Plume Explorer (PPE), an odor oriented exploration and plume tracking algorithm.

A. Particle Plume

The PP mapping algorithm intends to tackle the problems of plume intermittence, gas sensor noise and chemical data sharing between robots when applicable. To accomplish this, the chemical readings from any gas sensor available (from a robot or static sensor network) are converted into a point cloud around the sensor's location in a global frame. The number of points generated for each reading is proportional to the chemical reading in parts per million (ppm). In this case the MOX gas sensors being used were calibrated to provide readings in ppm; the calibration procedure is described further ahead. The point cloud is confined to a predefined radius around the point of origin. The newly created point cloud is finally added to the plume. If the volume occupied by the new points contains older points, these older points are removed from the plume. This is necessary in order to assure that the number of points in a certain volume reflects the last concentration read in that area. This process is represented in Figure 2. The result is a low-pass-filter-like behavior, producing a smooth representation of the plume. Point clouds with a bigger radius produce smoother plumes, while a smaller radius results in bigger variations inside the plume. The idea is not to capture the plume in a realistic manner, but in such a way that facilitates the job of the odor algorithms. Along with the particle cloud representation of the plume, PP is in charge of generating a grid of visited cells. This means that it is possible to differentiate between explored areas with no odor detected and unexplored areas.
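As an illustration only, the following minimal Python sketch mimics one Particle Plume update step as described above; the 2D simplification, the points-per-ppm scale factor and all parameter values are assumptions for illustration, not values from the paper.

```python
import numpy as np

def pp_update(plume, reading_ppm, sensor_xy, radius=0.10, points_per_ppm=1.0, rng=None):
    """One Particle Plume update: clear old points near the sensor, add new ones (2D sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    sensor_xy = np.asarray(sensor_xy, dtype=float)
    # Remove older points inside the radius around the new reading, so that the local
    # point density reflects the latest measured concentration.
    if len(plume):
        plume = plume[np.linalg.norm(plume - sensor_xy, axis=1) > radius]
    # Generate a number of points proportional to the concentration (in ppm),
    # confined to the predefined radius around the sensor location.
    n = int(round(points_per_ppm * reading_ppm))
    if n > 0:
        r = radius * np.sqrt(rng.uniform(size=n))        # uniform over the disc
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
        new_pts = sensor_xy + np.c_[r * np.cos(theta), r * np.sin(theta)]
        plume = np.vstack([plume, new_pts]) if len(plume) else new_pts
    return plume

# Example: an empty plume updated with a 25 ppm reading taken at (1.0, 2.0).
plume = pp_update(np.empty((0, 2)), reading_ppm=25.0, sensor_xy=(1.0, 2.0))
```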
Fig. 2. Flowchart for the Particle Plume plume mapping algorithm.


Fig. 3. The four steps of the Particle Plume Explorer algorithm: a) draw a circle of radius r around the robot and divide it into n slices; b) remove the slices which overlap with obstacles; c) remove the slices which contain a percentage of explored area over a certain threshold; d) calculate the goal using the cost function in Equation 1.

B. Particle Plume Explorer

The PPE algorithm is in charge of odor oriented exploration and plume tracking. This is accomplished in four steps. (1) Initially, a circle of radius r is drawn around the robot and divided into n slices. (2) Next, the slices which overlap with obstacles are removed. (3) The slices which contain a percentage of explored area over a certain threshold are then also removed, i.e. slices which overlap a significant amount of already explored area (leaving only the slices which overlap unexplored areas). (4) Finally, the cost of each remaining slice is calculated using Equation 1,

S_{cost} = W_o \, |\mathrm{mad}(\alpha, \gamma)| + W_v A_v + W_d \, |\mathrm{mad}(\beta, \gamma)|    (1)

where W_o, W_v and W_d represent the weights of the odor, visited cells and current robot orientation, respectively; mad(·) is the minimum angular distance function; α, β and γ are the direction of growing odor concentrations, the direction of the current slice and the orientation of the robot, respectively; finally, A_v represents the quantity of area in the current slice that was already explored, i.e. a slice which overlaps no explored areas will have an A_v value of 0. The slice with the lowest cost is chosen and the goal pose is set to the edge of the chosen slice. The four stages of the PPE algorithm are illustrated in Figure 3 and the pseudocode can be found in Algorithm 1. There might be situations where all slices are removed before the final stage. When this happens, PPE enters a recovery behavior in which the area surrounding the robot is scanned for the closest viable spot to continue the exploration. This is done respecting the obstacles present in the environment. Determining if a certain location is suitable for further exploration is accomplished using the four-stage algorithm discussed earlier. If, while performing the recovery behavior, PPE is unable to locate a suitable pose to continue exploration, it assumes that there are no more places it needs to explore and the algorithm comes to an end. Since the purpose of PPE is odor exploration and plume tracking, areas which appear to have odor cues are given more importance and are thus more thoroughly explored. This is achieved by means of a dynamic threshold in the third stage of the algorithm.

Algorithm 1: Particle Plume Explorer algorithm.
input : obstacle map, odor map, robot pose
output: goal

/* Step 1: Set the pie around the robot */
initPie(radius, number of slices);
if odor is present in the pie then
    max visited area ← no odor max visited area;
    calculateOdorGradient();
else
    max visited area ← odor max visited area;
    odor weight ← 0;
end
foreach slice do
    /* Step 2: Discard slices with obstacles */
    if slice overlaps with obstacle then
        discard slice;
    /* Step 3: Discard overly explored slices */
    else if slice explored area > max visited area then
        discard slice;
    else
        slice cost = odor weight · abs(shortestAngularDistance(odor, robot))
                   + visited weight · visited area
                   + direction weight · abs(shortestAngularDistance(slice, robot));
    end
end
/* Step 4: Determine best slice and set goal */
sortSlices();
goal ← getBestSliceGoal();
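A minimal Python sketch of the slice cost and selection is given below, following Equation 1 and Algorithm 1 as printed; the default weights are the values reported in Section IV, while the data structure (a list of candidate slices with their directions, visited fractions and obstacle flags) and the threshold default are assumptions for illustration.

```python
import math

def mad(a, b):
    """Minimum (shortest) angular distance between two angles, wrapped to [-pi, pi]."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi

def slice_cost(odor_dir, slice_dir, robot_yaw, visited_frac,
               w_odor=0.10, w_visited=1.00, w_direction=0.01):
    """Cost of one candidate slice (Equation 1 / Algorithm 1)."""
    return (w_odor * abs(mad(odor_dir, robot_yaw))
            + w_visited * visited_frac
            + w_direction * abs(mad(slice_dir, robot_yaw)))

def best_slice(slices, odor_dir, robot_yaw, odor_present, max_visited=0.5):
    """Pick the admissible slice with the lowest cost; each slice is a dict with
    'direction', 'visited_frac' and 'blocked' (overlap with an obstacle)."""
    w_odor = 0.10 if odor_present else 0.0          # odor term disabled without odor cues
    candidates = [s for s in slices
                  if not s["blocked"] and s["visited_frac"] <= max_visited]
    if not candidates:
        return None                                  # would trigger the recovery behavior
    return min(candidates,
               key=lambda s: slice_cost(odor_dir, s["direction"], robot_yaw,
                                        s["visited_frac"], w_odor=w_odor))
```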

The presence of nearby odor cues results in higher thresholds for the percentage of visited area for each slice, while the absence of odor cues results in little tolerance for visited areas.

C. Implementation

Both PP and PPE were implemented in C++ under the ROS framework [15]. All the source code developed for this project is open source and can be downloaded for free from http://www.ros.org/wiki/pp_explorer.

III. EXPERIMENTAL SETUP

For the real world experiments the iRobot Roomba platform was used. Each robot was equipped with a Hokuyo laser range finder, a Figaro TGS2620 MOX gas sensor and an onboard computer for running the software. The robot can be viewed in Figure 4. The robots used the ROS navigation stack for autonomous navigation and localization. Since odor exploration is the main focus of this paper, the metric map of the environment was provided to the robots for each experiment.


The Figaro TGS2620 gas sensor is considerably slow, with about 2.0 seconds of rise time and about 15.0 seconds of decay time. For this reason the maximum robot speed allowed during the experiments was 0.10 m/s. The Figaro TGS2620 MOX sensor was calibrated to provide readings in ppm: the sensor was placed inside a sealed chamber of known volume containing clean air, and a known quantity of CH₃OH (the chemical used throughout the experiments) was released into the chamber. After the sensor stabilized, a sensor reading was recorded along with the current quantity of chemical inside the chamber in ppm (MOX sensors produce changes in resistance as the chemical concentration changes). The procedure was repeated for several calibration points until the sensor was close to saturation. The obtained data set was then fitted by least-squares minimization to a logarithmic model.
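For illustration, a least-squares fit of a logarithmic calibration model like the one described above could look like the sketch below; the calibration pairs and the exact model form (ppm as an affine function of the logarithm of the raw sensor output) are hypothetical, since the paper does not report them.

```python
import numpy as np

# Hypothetical calibration pairs: raw sensor output (arbitrary units) vs. known concentration [ppm].
raw = np.array([120.0, 260.0, 540.0, 1100.0, 2300.0])
ppm = np.array([5.0, 20.0, 60.0, 150.0, 400.0])

# Least-squares fit of ppm = c0 + c1 * ln(raw), one plausible "logarithmic model".
A = np.column_stack([np.ones_like(raw), np.log(raw)])
coef, *_ = np.linalg.lstsq(A, ppm, rcond=None)

def raw_to_ppm(x):
    """Convert a raw MOX reading to an estimated concentration in ppm."""
    return coef[0] + coef[1] * np.log(x)

print(raw_to_ppm(800.0))   # estimated concentration for a raw reading of 800
```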
Real world experiments were performed inside an arena specifically designed for mobile robot odor experiments. The 3 m × 4 m arena is a completely enclosed environment delimited by four walls, two of which are made of plastic in the shape of honeycombs. It includes an array of controllable fans allowing the generation of different kinds of airflows. The chemical substance released during all real world experiments was methanol vapor (CH₃OH), always released at a point about 0.3 m above the ground using the bubbling method. Simulations were performed in Stage under the ROS framework. Plume simulation was accomplished using PlumeSim [16], a plume simulation software developed by the authors. The plumes were generated with the Computational Fluid Dynamics (CFD) software ANSYS Fluent.

Fig. 4. The Roomba equipped with Hokuyo laser range finders and gas sensors: (a) the Roomba used for the experiments; (b) Figaro TGS2620 MOX sensor on the left.

A. Simulated Experiments

A set of simulated experiments was used to fine-tune the algorithms and determine the best parameters for PP and PPE. A set of simulations using a single robot was also conducted to determine the relation between the area of the environment and the amount of time needed to complete the exploration. To accomplish this, the robot was allowed to perform the PPE algorithm in empty square arenas of different sizes: 9 m², 12 m², 36 m² and 81 m².

B. Real World Experiments

The real world experiments were performed using a single robot in the testing arena. Three different maps were tested to determine how robust the PPE algorithm is to different environments. A set of 10 experiments was performed for each map, with the robot starting at a random location inside the map at the beginning of each experiment. The distance travelled by the robot, the time spent on the exploration and the percentage of covered area were recorded for each experiment. The second set of experiments, performed in Map 3, consisted of varying the parameters of the PPE algorithm. For each experiment a single parameter was changed from the control parameters to evaluate how it would affect the behavior of the robot; only the two visited-area percentage thresholds were changed simultaneously, and to the same value, so that in practice PPE would work with only one threshold. A fourth map was used to test the following PPE parameter changes: radius of the circle around the robot increased to 1 m and decreased to 0.3 m (control value 0.6 m); number of slices decreased to 4 and increased to 32 (control value 16); both visited-area percentage thresholds set to 20% and to 80% (control values 20% in the absence of odor and 50% in the presence of odor).

IV. RESULTS

The initial simulations and real world experiments resulted in the following set of parameters, used for the remainder of the experiments: for PP, the sphere radius used was 0.10 m; for PPE, the pie radius r was 0.60 m, the number of slices n was 16, the visited area, current heading and odor weights were 1.00, 0.01 and 0.10, respectively, and the maximum allowed percentage of discovered area per slice was 20% in the absence of odor and 50% in the presence of odor. The results of the real world experiments conducted on Maps 1, 2 and 3 can be found in Figure 5 and Table I. The robot performed all of the experiments successfully. In Figure 5 it is possible to see a picture of the arena followed by a sequence of screenshots from the visualization software rviz captured during operation. The screenshots were captured in real time as the robot was performing the exploration algorithm. The visited cells can be seen in yellow while the point cloud representation of the plume appears in red. Figure 6 shows the evolution of the time needed to perform an exploration as the area to explore grows.


Simulated exploration results for the empty square arenas (cf. Fig. 6):

Area (m²)   Time Spent (min)   Covered Area (%)
9           5.43               52.35
12          8.19               55.33
36          31.35              63.75
81          126.73             66.36

Fig. 5. Results of the real world experiments: (a) Map 1; (b) Map 2; (c) Map 3. Map 1 has no obstacles. Map 2 has a set of obstacles similar to what a robot can find while navigating a building. Map 3 contains a set of obstacles with awkward shapes and angles. The blue circle depicts the localization of the odor source.
Fig. 6. Graphic representing the evolution of the time needed to perform an exploration as the area to explore increases (fitted curve: y = 0.0163x² + 0.2123x + 2.705, R² = 0.9999).
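As a quick sanity check (not part of the paper), evaluating the fitted curve from Fig. 6 at the four tested arena sizes reproduces the measured exploration times to within less than a minute:

```python
# Coefficients read off the fitted curve in Fig. 6.
def predicted_time_min(area_m2):
    return 0.0163 * area_m2**2 + 0.2123 * area_m2 + 2.705

for area in (9, 12, 36, 81):
    print(f"{area:3d} m^2 -> {predicted_time_min(area):6.1f} min")
# Predicted roughly 5.9, 7.6, 31.5 and 126.8 min, close to the measured 5.43, 8.19, 31.35 and 126.73 min.
```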

The results for the cost function experiments can be found in Figure 7, which is composed of three graphics showing the distance travelled in meters, the time spent in minutes and the percentage of explored area.
TABLE I
Single Robot Arena Experiments

                         Map 1          Map 2          Map 3
Explored Area (%)        56.17 ± 2.22   43.57 ± 3.84   47.05 ± 3.23
Distance Traveled (m)    26.03 ± 2.81   21.65 ± 3.26   22.57 ± 3.56
Time Spent (min)          9.69 ± 1.75    7.99 ± 1.59    7.05 ± 1.32

V. DISCUSSION

The results in Table I show that PPE performs in a similar way when faced with different sets of obstacles. It is natural

that Map 2 and Map 3 get lower percentages of explored area, travelled distance and time spent when compared to Map 1, since the obstacles present in Maps 2 and 3 actually end up decreasing the area to explore. The robot tends to avoid exploring the areas very close to the obstacles, since a real robot must take into account its own dimensions when traversing any area. The time spent on the explorations is related to the speed of the robot, which in turn was limited by the slow response of the gas sensors being used. Nevertheless, the times shown in Table I are very satisfactory. The fact that the robot never failed any experiment shows how robust the PPE algorithm is. From the sequences of images in Figure 5 it is possible to observe that the robot covered more area where odor was present. Furthermore, it was also possible to verify during the experiments that the robot was following the odor gradient as intended. This was achieved due to the smooth plumes generated by PP, the point cloud representations of the odor captured by the gas sensors. It is visible that, particularly after being exposed to high concentrations, PP generated a drag effect of the chemical. This is due to the low response speed of the gas sensors being used; however, this behavior was both expected and acceptable in order to maintain the desired average robot speed. This is visible in the rightmost image in Figure 5(b). The graphic in Figure 6 shows that as the area to explore increases so does the time spent on the exploration, following a second order polynomial curve. This behavior was expected and Figure 6 shows that PPE's performance does not degrade as the area to explore increases. The circle radius experiments (Figure 7) showed that as the radius increases the robot is unable to move in tight places, since slices which overlap obstacles are removed, leaving


Fig. 7. Graphics related to the parameter tests performed for the PPE algorithm: (a) Distance Travelled (m); (b) Time Spent (min); (c) Explored Area (%).

most of the area unexplored. Although PPE ended faster, the percentage of explored area is about half of that explored during the control experiments. On the other hand, as the radius was decreased to 0.3 m the robot tended to perform a swirl-like movement, spending more time to complete the exploration while also achieving the highest percentage of explored area (Figure 7). The high amount of time necessary for completing the exploration can also be explained by the increase in the number of iterations: as the radius decreases, the number of iterations PPE must run increases. When the number of slices was decreased to 4, the results were similar to those of a large circle radius, as Figure 7 shows: the robot left much of the area unexplored. However, in this case this is due to the fact that as the number of slices decreases PPE has fewer slices to choose from, making it almost impossible to move near obstacles. When the number of slices was increased to 32, the results were similar to those of the control experiments. The slight increase in time spent, distance travelled and explored area can be explained by the fact that the map being used, and thus the slices, are in fact composed of a grid of cells with a side of 0.05 m. With 32 slices each slice contained very few cells, thus the performance of the algorithm decayed. The results for the percentage of visited area performed as expected. With a lower threshold the explored area, time spent and travelled distance fell, whereas with a higher threshold the robot was able to visit more of the area, consequently increasing time spent and distance travelled. This can be seen in the graphics of Figure 7.

VI. CONCLUSIONS

The presented PP and PPE algorithms were tested under realistic scenarios and performed efficiently. Particle Plume Explorer was able to complete all the odor explorations successfully under three different scenarios, thus proving to be both reliable and robust. Future work will include adding the final stage of odor source declaration to the current work, completing the odor source localization algorithm, as well as the use of wind direction during plume tracking when such information is available. As the time needed to complete an exploration grows with the area to be explored, it is only natural that a multi-robot version of the PPE algorithm will be researched in the near future, to allow a quicker exploration of larger

areas. Furthermore, faster gas sensors will be tested to allow the robots to move at greater speeds, thus accelerating the exploration process.

REFERENCES
[1] J. Belanger and E. Arbas, "Behavioral strategies underlying pheromone-modulated flight in moths: lessons from simulation studies," Journal of Comparative Physiology A: Neuroethology, Sensory, Neural, and Behavioral Physiology, vol. 183, no. 3, pp. 345-360, 1998.
[2] T. Lochmatter and A. Martinoli, "Tracking odor plumes in a laminar wind field with bio-inspired algorithms," in Proc. of the 11th Int. Symposium on Experimental Robotics (ISER 2008). Springer, 2009, pp. 473-482.
[3] R. A. Russell, Odour Detection by Mobile Robots, ser. Robotics and Intelligent Systems. World Scientific, 1999, vol. 22.
[4] L. Marques, U. Nunes, and A. de Almeida, "Olfaction-based mobile robot navigation," Thin Solid Films, vol. 418, no. 1, pp. 51-58, 2002, 1st Int. School on Gas Sensors.
[5] R. Russell, A. Bab-Hadiashar, R. Shepherd, and G. Wallace, "A comparison of reactive robot chemotaxis algorithms," Robotics and Autonomous Systems, vol. 45, no. 2, pp. 83-97, 2003.
[6] Z. Pasternak, F. Bartumeus, and F. W. Grasso, "Lévy-taxis: a novel search strategy for finding odor plumes in turbulent flow-dominated environments," Journal of Physics A: Mathematical and Theoretical, vol. 42, no. 43, 2009. [Online]. Available: http://stacks.iop.org/1751-8121/42/i=43/a=434010
[7] S. Nurzaman, Y. Matsumoto, Y. Nakamura, S. Koizumi, and H. Ishiguro, "Biologically inspired adaptive mobile robot search with and without gradient sensing," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2009), 2009, pp. 142-147.
[8] A. Dhariwal, G. Sukhatme, and A. Requicha, "Bacterium-inspired robots for environmental monitoring," in Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA 2004), vol. 2, 2004, pp. 1436-1443.
[9] L. Marques, U. Nunes, and A. de Almeida, "Particle swarm-based olfactory guided search," Autonomous Robots, vol. 20, no. 3, pp. 277-287, 2006, special issue on Mobile Robot Olfaction.
[10] V. Gazi and K. Passino, Swarm Stability and Optimization. Springer, 2011.
[11] J. A. Farrell, S. Pang, and W. Li, "Plume mapping via hidden Markov methods," IEEE Trans. on Systems, Man, and Cybernetics - Part B, vol. 33, no. 6, pp. 850-863, 2003.
[12] S. Pang and J. Farrell, "Chemical plume source localization," IEEE Trans. on Systems, Man, and Cybernetics, Part B, vol. 36, no. 5, pp. 1068-1080, 2006.
[13] S. Pang, "Plume source localization for AUV based autonomous hydrothermal vent discovery," in OCEANS '10, 2010, pp. 1-8.
[14] H. Ishida, H. Tanaka, H. Taniguchi, and T. Moriizumi, "Mobile robot navigation using vision and olfaction to search for a gas/odor source," Autonomous Robots, vol. 20, pp. 231-238, 2006, doi:10.1007/s10514-006-7100-5.
[15] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, "ROS: an open-source Robot Operating System," in ICRA Workshop on Open Source Software, 2009.
[16] G. Cabrita, P. Sousa, and L. Marques, "Player/Stage simulation of olfactory experiments," in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2010), 2010, pp. 1120-1125.


Environmental Perception for Intelligent Vehicles Using Catadioptric Stereo Vision Systems
Miriam Schönbein, Bernd Kitt, Martin Lauer
Institute for Measurement and Control Systems, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

Abstract: This paper presents advantages of catadioptric camera systems for robotic applications, using the example of autonomous vehicles and driver assistance systems. We propose two different catadioptric stereo camera systems to improve environmental perception. Both stereo camera systems consist of at least one catadioptric camera, which combines a lens with a hyperbolic mirror and provides a panoramic view of the surroundings. This allows a 360° field of view of the vehicle's surroundings and ultimately enables, e.g., object detection around the vehicle or the detection of lane markings next to the vehicle. We discuss the advantages of both camera systems for stereo vision and 3D reconstruction. Furthermore, we show results for depth estimation with sparse features in a simulated environment as well as on a real data set.

Index Terms: catadioptric camera, stereo vision, hybrid imaging system

Fig. 1. Field of view of the catadioptric stereo camera systems: (a) Catadioptric/Perspective System mounted in the center above the windshield (stereo area covered by the omni and perspective camera); (b) Catadioptric/Catadioptric System mounted on the left and right side above the windshield (stereo area covered by the left and right camera).

I. INTRODUCTION

For robotic applications, and first of all autonomous driving, vision-based obstacle detection and 3D reconstruction is an important task. Therefore, it is desirable for camera systems to have a large field of view of the surroundings of the vehicle. However, conventional perspective cameras typically used in vehicles only provide a horizontally limited field of view (< 90°). Commonly the cameras point in the frontal direction, and objects alongside the vehicle are not visible with a single perspective camera. However, for many advanced driver assistance systems such as blind spot detection, lane departure warning or lane tracking, knowledge about the sides of the vehicle is essential. In addition, a panoramic view of the surroundings could enable new applications like intersection reconstruction (e.g. [1]) for autonomous driving in urban environments. Omnidirectional cameras are able to provide such information, for example about passing vehicles, pedestrians or lane markings. The apparent possibility to obtain a panoramic view of the vehicle environment by capturing a set of perspective images, using cameras with different orientations or one rotating camera, suffers from the complexity of stitching the images together, the sizable installation space and the extensive calibration. Hence, such systems are not suitable for the dynamic environment of a vehicle, although they are capable of obtaining a high resolution image. Moreover, violation of the single view point condition while stitching a panorama from multiple images introduces unwanted effects such as ghosting, which makes the panoramas unsuitable for stereo matching and reconstruction. In contrast, catadioptric camera systems are able to provide a 360° field of view of the environment

with only one camera. This eliminates the need to stitch a panorama out of many perspective images. Moreover, a central catadioptric camera system inherently fulfills the single view point condition [2]. Catadioptric cameras are a combination of a convex mirror above a lens and provide a panoramic image. Such panoramic imaging systems are very popular in the robotics research community for autonomous navigation and surveillance systems and have been studied in the past [3, 4]. While monoscopic catadioptric cameras have been considered for driver assistance applications such as blind spot detection [5], lane tracking [6] or egomotion estimation [7], there is very little work on the potential of stereoscopic catadioptric systems for intelligent vehicles. In recent years, different types of catadioptric stereo vision systems have been proposed. Zhu et al. [8] discuss different configurations of omnidirectional stereo setups, mainly binocular stereo with vertically or horizontally aligned cameras. Common methods for vertically aligned omnistereo are the combination of two lenses with two separate, vertically aligned mirrors [9], or two mirrors mounted on top of each other and only one lens [10]. Gandhi and Trivedi [11] used two horizontally aligned catadioptric sensors mounted on the side mirrors of a vehicle for visualizing and analyzing the nearby surroundings of a vehicle, and Ehlgen et al. [12] applied two horizontally aligned catadioptric cameras mounted on the rear roof edges. Both approaches give the driver a visualization of the entire environment. A hybrid camera system, combining a catadioptric and a perspective camera, has been proposed by Lauer et al. [13] and Sturm [14]. This imaging system combines the advantages of a 360° field of view from a catadioptric camera and the high resolution from a conventional perspective

camera. In [15] such a hybrid omnidirectional pinhole sensor (HOPS) is used for stereo obstacle detection in a robotic environment. However, little research has been performed on stereo matching and 3D reconstruction with a stereo camera system consisting of at least one catadioptric camera, as well as on the choice of a catadioptric camera system for autonomous vehicle applications. In this paper, we evaluate the advantages of catadioptric camera systems for environmental perception in applications for driver assistance systems and intelligent vehicles. To this end, we propose two promising systems to improve environmental perception. The first system consists of a perspective and a catadioptric camera which are vertically aligned (Fig. 1(a)). The second system consists of two horizontally aligned catadioptric cameras (Fig. 1(b)). Both systems provide a panoramic view of the surroundings, which allows object detection around the vehicle and not only in the small area in front of the vehicle. We introduce stereo vision and 3D reconstruction for both catadioptric systems and discuss results for simulated and real data sets as well as the respective advantages of both methods. The remainder of this paper is organized as follows. Section II introduces the imaging model of catadioptric cameras and the construction of a catadioptric/perspective stereo camera system as well as a catadioptric/catadioptric stereo camera setup. Section III describes stereo vision utilizing the proposed catadioptric stereo camera systems; we explain feature matching as well as 3D reconstruction for both camera systems. In Section IV we discuss results for feature matching and 3D reconstruction on real data and evaluate the accuracy of the reconstruction for both systems within a simulated environment. Furthermore, we show the advantages of both systems for future intelligent vehicle applications. Finally, Section V concludes our work and gives an overview of future work.

II. CATADIOPTRIC STEREO CAMERA SYSTEMS

A. Catadioptric Camera

Our catadioptric camera consists of a hyperbolic mirror and a camera lens. The catadioptric system fulfills the single view point (SVP) condition, which means that the camera center C coincides with the second focal point F of the hyperbolic mirror, as shown in Fig. 2. The shape of a hyperbolic mirror is defined by the following equation:

\frac{(z + e)^2}{a^2} - \frac{R^2}{b^2} = 1    (1)

Fig. 2. Hybrid imaging system consisting of a catadioptric camera with image plane u_c and a perspective camera with image plane u_p.

where a and b are the mirror parameters and e = \sqrt{a^2 + b^2}. The relationship between a world point and the corresponding camera point consists of two projection steps [16]: a nonlinear projection from the world point X to the point on the mirror surface X_M, and a linear projection from the point on the mirror surface X_M to the image point u_C. The world point in the mirror coordinate system, X_{WM}, is given by

X_{WM} = R_M (X - t_M)    (2)

where R_M is the rotation matrix and t_M the translation vector between the world and the mirror coordinate system. The point X_{WM} is projected to the point on the mirror surface X_M by the nonlinear function \lambda_{1/2}(X_{WM}),

X_M = \lambda_{1/2}(X_{WM}) \, X_{WM}    (3)

with

\lambda_{1/2}(X_{WM}) = \frac{b^2 \left( -eZ \pm a \, \|X_{WM}\| \right)}{b^2 Z^2 - a^2 X^2 - a^2 Y^2}    (4)

where X, Y and Z denote the components of X_{WM}. Consequently, the point X_M is computed from the intersection of the ray from X_{WM} to the focal point F and the hyperbolic mirror surface. Thus, we receive two values for \lambda, caused by the two intersection points between the mirror and the ray. The correct \lambda is given by

\lambda(X_{WM}) = \begin{cases} \min(\lambda_{1/2}), & \text{for } \lambda_{1/2} > 0 \\ \max(\lambda_{1/2}), & \text{for } \lambda_1 > 0 \text{ and } \lambda_2 < 0. \end{cases}    (5)

The point X_M is then transformed into the camera coordinate system with rotation matrix R_C and translation vector t_C = [0, 0, 2e]^T. Finally, the point on the mirror surface in the camera coordinate system, X_C, is projected to the image point u_c with the camera calibration matrix K by

u_c = \frac{1}{z_C} K \, X_C.    (6)

Using equation (6), the complete imaging model is given by

u_c = \frac{1}{z_C} K \, R_C \left( \lambda \, R_M (X - t_M) - t_C \right).    (7)
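A compact numpy sketch of the projection chain in Eqs. (1)-(7) is given below for illustration; the sign and axis conventions follow the reconstruction above, and the extrinsics R_M, t_M, R_C, t_C and the intrinsic matrix K are assumed to come from calibration (the example values at the bottom are purely illustrative).

```python
import numpy as np

def project_catadioptric(X, R_M, t_M, R_C, t_C, K, a, b):
    """Project a 3D world point X onto the catadioptric image plane (Eqs. (2)-(7))."""
    e = np.sqrt(a**2 + b**2)
    X_wm = R_M @ (np.asarray(X, float) - t_M)            # Eq. (2): world -> mirror coordinates
    x, y, z = X_wm
    den = b**2 * z**2 - a**2 * x**2 - a**2 * y**2
    norm = np.linalg.norm(X_wm)
    lams = [b**2 * (-e * z + s * a * norm) / den for s in (1.0, -1.0)]   # Eq. (4)
    lam = min(lams) if all(l > 0 for l in lams) else max(lams)           # Eq. (5): valid intersection
    X_m = lam * X_wm                                     # Eq. (3): point on the mirror surface
    X_c = R_C @ (X_m - t_C)                              # mirror -> camera coordinates
    u = K @ X_c / X_c[2]                                 # Eqs. (6)/(7): perspective projection
    return u[:2]

# Illustrative call: mirror frame aligned with the world frame, camera looking up at the mirror.
a, b = 0.0268578, 0.0208485                              # mirror parameters from Section IV [m]
e = np.sqrt(a**2 + b**2)
K = np.array([[700.0, 0.0, 696.0], [0.0, 700.0, 516.0], [0.0, 0.0, 1.0]])   # assumed intrinsics
R_M, t_M = np.eye(3), np.zeros(3)
R_C, t_C = np.diag([1.0, -1.0, -1.0]), np.array([0.0, 0.0, 2.0 * e])        # assumed extrinsics
# World point 1 m to the side of and slightly above the mirror focal point.
print(project_catadioptric([1.0, 0.0, 0.2], R_M, t_M, R_C, t_C, K, a, b))
```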

Fig. 3. The images show the stereo and the mono region of the catadioptric/catadioptric system in the unwarped panoramic images of the left (above) and the right camera (below). The green area is the visible region in both cameras, the yellow area the visible region of the left camera and the blue area the visible region of the right camera.

B. Catadioptric/Perspective Stereo Camera System

The catadioptric/perspective stereo camera system, which will be denoted as System A, consists of a catadioptric and a perspective camera. The system is mounted on top of the vehicle, in the center above the windshield. As depicted in Fig. 2, the perspective camera is attached under the catadioptric camera, so that the focal points of the perspective camera and the catadioptric camera are aligned vertically. As shown in Fig. 1(a), the system allows a monoscopic view of the surrounding environment in all directions of the car; only the area behind the vehicle is occluded by the vehicle itself. Additionally, the system obtains stereo information in a preferential direction, which is the frontal direction in our case. The stereo area of the system is depicted in Fig. 1(a) and is bounded by the field of view of the perspective camera. The advantages of this camera system are the 360° field of view of the catadioptric camera while at the same time providing a standard perspective camera image in the frontal direction. This allows the use of well-established monoscopic image processing algorithms such as lane boundary estimation, traffic sign recognition and obstacle detection even for longer distances. Besides, in the close-up range in the frontal direction we obtain stereo vision information in the overlapping perspective and catadioptric field of view.

C. Catadioptric/Catadioptric Stereo Camera System

The catadioptric/catadioptric stereo camera system, which will be denoted as System B, consists of two catadioptric cameras which are aligned horizontally. The cameras are mounted on the left and right side on top of the vehicle above the windshield. This enables binocular observation in front of the vehicle as well as on both sides of the vehicle, in contrast to Gandhi et al. [11], who mounted the cameras on the side mirrors and obtain stereo information only in the frontal direction. Moreover, mounting the camera system on a rigid frame on top of the vehicle yields a higher rigidity of the stereo camera system, which is important for stereo vision. The overlapping stereo area which is visible in both catadioptric cameras is shown in Fig. 1(b). The stereo area is also larger than in the catadioptric/perspective system. Due to the two horizontally aligned cameras, the stereo area of the system is only restricted by the vehicle itself. As depicted

in the unwarped panoramic views of the left and the right catadioptric camera in Fig. 3, both cameras observe the area in front of the vehicle and the far range alongside the vehicle. The close range behind and on both sides alongside the vehicle is only visible within one of the two catadioptric camera images. Hence, the advantages of the catadioptric/catadioptric camera system are the large stereo field of view, in which we obtain 3D information about the environment, and the monocular observation alongside the vehicle.

III. STEREO VISION WITH CATADIOPTRIC CAMERAS

The first step for stereo vision is the matching process. During the matching process we obtain corresponding points in the image regions which are visible in both images. To simplify the correspondence search and to verify the established correspondence points with the epipolar geometry, we need the intrinsic and extrinsic calibration parameters of the camera setup. We use the calibration toolbox from [17] for the intrinsic calibration of the catadioptric cameras. The extrinsic calibration is done with the aid of virtual images which are generated out of the catadioptric images. Since we use perspective images for the extrinsic calibration, we can use the stereo calibration toolbox [18] for perspective images. Finally, we reconstruct the corresponding 3D world points from the 2D image matches by triangulation.

A. System A: Catadioptric/Perspective System

In System A we obtain two different images, a panoramic catadioptric image and a perspective image. For the feature matching we compare the overlapping area of both images and search for corresponding feature points directly in the different images. Due to the different mappings in the catadioptric and perspective camera, standard block matching algorithms are not applicable to establish feature correspondences. Therefore, we use the well-known SURF features [19], which are scale and rotation invariant and hence applicable to the different images. By means of the correspondence search directly in the different images we do not lose any image information through the transformation of the catadioptric image into a virtual perspective image.

Fig. 4. Reconstruction of a 3D scene point X from the corresponding catadioptric image points u_L and u_R by the shortest distance between the rays through F_R, x_R and F_L, x_L.

The epipolar geometry for a combined catadioptric/perspective stereo setup is more difficult than in the classical case of stereo vision. Points which lie on a vertical line in the perspective image lie on a radial curve in the catadioptric image. In our case, the perspective camera is attached on the symmetry axis of the mirror, under the catadioptric camera. This simplifies the epipolar geometry, so that points on a radial line in the catadioptric image are mapped onto a straight line in the perspective image [13]. The 3D scene point X is calculated from the corresponding image points. To estimate the 3D points we use the points on the mirror surface X_M which belong to the corresponding image points u_c of the respective image. The mirror points are back-projections from the image points u_c to the mirror points X_c by equation (3). The 3D point can be found as the midpoint of the shortest distance [20] between the ray through the focal point of the mirror F_c and the point on the mirror surface X_M, and the ray through the focal point of the perspective camera F_p and the 3D point on the perspective image plane X_p.

B. System B: Catadioptric/Catadioptric System

In System B we obtain two catadioptric images. The overlapping area, as shown in Fig. 1, is larger and not only in front of the vehicle. As in System A, block matching algorithms are not applicable for matching directly in the catadioptric images, due to the distortions in catadioptric images. Therefore, we again use SURF features to find corresponding feature points directly in the catadioptric images. The epipolar geometry for a catadioptric image pair is described in [16]: a corresponding point in one image lies on a conic in the other catadioptric image. The estimation of the 3D point is performed by calculating the midpoint of the shortest distance between the ray through the focal point of the left mirror F_L and the point on the left mirror surface X_L, and the ray through the focal point of the right mirror F_R and the point on the right mirror surface X_R. The points on the left and right mirror surface depend on the corresponding image points in the catadioptric images. The reconstruction of a 3D scene point from the corresponding image points in the case of two catadioptric cameras is shown in Fig. 4.
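The midpoint construction used for both systems can be written in a few lines; the numpy sketch below is an illustration (not the authors' code) that takes the two ray origins (the mirror or camera focal points) and ray directions (towards the respective mirror or image-plane points) as inputs.

```python
import numpy as np

def ray_midpoint(o1, d1, o2, d2, eps=1e-12):
    """Midpoint of the shortest segment between the rays o1 + s*d1 and o2 + t*d2."""
    o1, d1, o2, d2 = (np.asarray(v, float) for v in (o1, d1, o2, d2))
    w0 = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    den = a * c - b * b                      # ~0 for (nearly) parallel rays
    if abs(den) < eps:
        raise ValueError("rays are (nearly) parallel, triangulation is degenerate")
    s = (b * e - c * d) / den
    t = (a * e - b * d) / den
    p1, p2 = o1 + s * d1, o2 + t * d2        # closest points on the two rays
    return 0.5 * (p1 + p2)

# Example: two rays that almost meet near (1, 1, 5).
X = ray_midpoint([0, 0, 0], [1.0, 1.0, 5.0], [1.0, 0, 0], [0.0, 1.02, 5.0])
```

For System B, the origins would be the mirror focal points F_L and F_R with directions towards X_L and X_R; for System A, one ray starts at the perspective camera center and points towards the image-plane point X_p.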

Fig. 5. SURF correspondences between the perspective and the catadioptric image (System A).
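As an illustration of the feature matching step described in Section III, a typical OpenCV-based SURF matching pipeline is sketched below; it assumes an OpenCV build with the contrib xfeatures2d module and non-free algorithms enabled, the file names are placeholders, and the ratio test is a common pruning heuristic rather than a step reported by the authors.

```python
import cv2

# Placeholder file names for the two views (e.g., catadioptric image and perspective image).
img1 = cv2.imread("catadioptric_view.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("perspective_view.png", cv2.IMREAD_GRAYSCALE)

# SURF keypoints and descriptors (scale and rotation invariant).
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)
kp2, des2 = surf.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to discard ambiguous correspondences.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = [m for m, n in matcher.knnMatch(des1, des2, k=2) if m.distance < 0.7 * n.distance]
print(len(matches), "putative correspondences")
```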

IV. EXPERIMENTAL RESULTS

For our experiments we use catadioptric cameras with IT+Robotics VS-C450MR hyperbolic mirrors with the parameters a = 26.8578 mm, b = 20.8485 mm and r_max = 22.5 mm. The spatial angle of view is 104°. The image resolution of the catadioptric cameras and of the perspective camera is 1392 × 1032 pixels. The catadioptric/catadioptric stereo system as well as the perspective/catadioptric stereo system can outperform standard perspective camera systems in the task of object detection, since the detection of objects is possible not only in front of the vehicle but also on both sides. Moreover, both systems have different additional advantages. Due to the mounting of System B on the sides on top of the vehicle, it is also possible to detect small objects alongside the car, such as lane markings or curbs. As shown in Fig. 3, lane markings alongside the vehicle are clearly visible on both sides in the images captured with System B. The ability to detect lane markings not only in front of the vehicle may help to significantly improve systems such as lane departure warning or vehicle localization. With System A it is not possible to see close objects alongside the vehicle. However, this system has advantages for object detection in front of the vehicle due to the higher resolution of the perspective camera. Objects far away can be detected in the high resolution perspective image before their position is estimated in the stereo vision area. This camera system also allows tracking detected objects even after they have left the stereo field of view.

Fig. 6. 3D reconstruction results within a simulated environment: (a) distribution of the 3D reconstruction for a noisy circular point cloud with 8 m radius around the midpoint of the camera baseline, from System B in the left image and from System A in the right image; (b) depth image from the catadioptric/catadioptric system of a simulated environment with noisy input data.

A. Feature Matching

Within the matching process we obtain corresponding features in both images of the respective camera system by means of a SURF feature detector. We evaluate the feature matching and 3D reconstruction on a sequence of 600 frames captured with the considered catadioptric stereo camera systems. The features are only matched in the area around the vehicle and not on the vehicle itself. On average, 500 features are found with the SURF detector in the perspective images, whereas 1000 features are found in the catadioptric images. From these features an average of 100 corresponding image points is found for System B, which is about two times higher than for System A. This can be explained by the larger stereo field of view of System B compared to System A and the fact that corresponding points are also found on both sides of the vehicle. The number of correspondence points in the frontal direction, which is visible with both systems, is approximately the same. The corresponding image points for one frame of System A are depicted in Fig. 5 and the corresponding image points for System B are shown in Fig. 7.

B. Reconstruction

From the established feature points and the extrinsic calibration, the corresponding 3D world points are calculated. The accuracy of the world points for both systems is evaluated using a simulated 3D environment. Herein, we simulate both camera systems and a circular point cloud with 8 m radius around the midpoint of the camera baseline. Afterwards, we map all these points onto the image planes of the cameras and perform a Monte Carlo simulation. To this end, we create 100 samples of the mapped points, each sample disturbed with Gaussian noise of standard deviation 1.5 pixels. Subsequently, we reconstruct each sample and compute the mean and standard deviation of the Euclidean distance between the reconstructed and ground truth scene points. For System B the error depends on the azimuth angle, as shown in the left image of Fig. 6(a). The accuracy of the reconstruction decreases from the points in the middle between the two cameras to the points on both sides of the cameras. The reconstruction of a point which is horizontally aligned with both catadioptric cameras is not possible, because the baseline length in the horizontal direction of both cameras is close to zero. Table I shows the back-projection error depending on the azimuth angle.
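A Monte Carlo evaluation of this kind can be sketched as follows; idealized omnidirectional bearing sensors stand in for the calibrated catadioptric pair, the 1.5 px image noise is converted to an angular perturbation via an assumed focal length, and the baseline value is an assumption, so the numbers produced only illustrate the qualitative trend over the azimuth angle.

```python
import numpy as np

def ray_midpoint(o1, d1, o2, d2):
    # Midpoint of the shortest segment between two rays (cf. Section III).
    w0 = o1 - o2
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    den = a * c - b * b
    s, t = (b * e - c * d) / den, (a * e - b * d) / den
    return 0.5 * ((o1 + s * d1) + (o2 + t * d2))

rng = np.random.default_rng(0)
f_px = 700.0                          # assumed focal length [px] to convert pixel to angular noise
sigma_ang = 1.5 / f_px                # rough angular equivalent of 1.5 px of image noise
baseline = 1.0                        # assumed camera baseline [m]
cams = [np.array([-baseline / 2, 0.0, 0.0]), np.array([baseline / 2, 0.0, 0.0])]

for az_deg in range(10, 91, 10):      # 0 deg (along the baseline) is degenerate, as in Table I
    az = np.deg2rad(az_deg)
    X = np.array([8.0 * np.cos(az), 8.0 * np.sin(az), 0.0])      # point on the 8 m circle
    errs = []
    for _ in range(100):                                          # 100 Monte Carlo samples
        rays = []
        for o in cams:
            d = (X - o) / np.linalg.norm(X - o)
            d = d + rng.normal(0.0, sigma_ang, size=3)            # perturb the bearing direction
            rays.append(d / np.linalg.norm(d))
        errs.append(np.linalg.norm(ray_midpoint(cams[0], rays[0], cams[1], rays[1]) - X))
    print(f"azimuth {az_deg:2d} deg: mean {np.mean(errs):.3f} m, std {np.std(errs):.3f} m")
```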

For System A the accuracy is more or less constant due to the small aperture angle of the perspective camera, which restricts the stereo vision area. The right image of Fig. 6(a) shows the results for the catadioptric/perspective system within the same simulation environment. Fig. 6(b) shows the depth image obtained with the catadioptric/catadioptric stereo camera system in a simulated environment. To this end, we map a simulated 3D scene onto the catadioptric image planes and disturb the image points with Gaussian noise with a standard deviation of 1.5 pixels. The image shows the calculated depth information of the back-projected scene points. The catadioptric/catadioptric system captures a larger stereo vision area and thus allows feature matching and 3D reconstruction also alongside the vehicle. The mean and standard deviation of the Euclidean distance between reconstructed and ground truth scene points in the frontal direction for System A are not significantly smaller than for System B. Therefore, System B is better suited for 3D reconstruction, because 3D points on the sides of the vehicle can be calculated in addition to points in the frontal direction. Moreover, with the 3D information alongside the vehicle it is also possible to detect and track objects which enter the field of view from one side of the observer vehicle.

V. CONCLUSIONS AND FUTURE WORK

In this work, we presented two different catadioptric stereo camera systems for advanced intelligent vehicle applications. The feature matching and the 3D reconstruction from the feature points for a catadioptric/perspective and a catadioptric/catadioptric stereo camera system were explained. We performed experiments on point correspondence search with SURF features as well as 3D reconstruction for both catadioptric camera systems on real data and evaluated the accuracy of the 3D reconstruction on simulated data. Furthermore, we pointed out the advantages of both systems for applications in autonomous vehicles. The catadioptric/perspective system is advantageous for object detection in the preferential frontal direction due to the higher resolution of the perspective camera. Besides, it would be possible to track objects alongside the vehicle if they were detected first in the frontal direction.

TABLE I
Mean and standard deviation of the Euclidean reconstruction error depending on the azimuth angle for simulated scene points. For evaluation purposes we create scene points on a circle with radius 8 m and disturb the mapped scene points with Gaussian noise of 1.5 pixel standard deviation.

Azimuth angle            0°   10°     20°     30°     40°     50°     60°     70°     80°     90°
Mean [m]                 -    1.7837  0.8703  0.5572  0.4160  0.3769  0.3274  0.2917  0.2995  0.2779
Standard deviation [m]   -    1.8950  0.7073  0.4281  0.3203  0.2739  0.2476  0.2020  0.2371  0.2030

Fig. 7. SURF correspondences between the two catadioptric images (System B) and the corresponding reconstructed 3D points in top view.

However, the catadioptric/catadioptric system allows for feature matching and 3D reconstruction also on both sides of the vehicle. Hence, it is possible to track objects which are not detected in the frontal direction at first. Moreover, lane markings on the side of the vehicle can be detected and used, e.g., to improve localization. The results of this evaluation showed that the use of catadioptric cameras improves many intelligent vehicle applications. Based on the investigation in this study, the catadioptric/catadioptric system turned out to be more promising for further evaluation. Our next steps will include an implementation of object detection and tracking in the entire vehicle surroundings from catadioptric images of a catadioptric/catadioptric stereo system. The possibility to use lane markings next to the vehicle to improve localization or lane departure warning will also be evaluated.

REFERENCES
[1] A. Geiger, M. Lauer, and R. Urtasun, "A generative model for 3D urban scene understanding from movable platforms," in Computer Vision and Pattern Recognition (CVPR), Colorado Springs, USA, June 2011.
[2] C. Geyer and K. Daniilidis, "A unifying theory for central panoramic systems and practical applications," in ECCV '00: Proceedings of the 6th European Conference on Computer Vision - Part II. London, UK: Springer-Verlag, 2000, pp. 445-461.
[3] S. Baker and S. K. Nayar, "A theory of single-viewpoint catadioptric image formation," International Journal of Computer Vision, vol. 35, no. 1, pp. 1-22, 1999.
[4] R. Benosman and S. B. Kang, Panoramic Vision: Sensors, Theory, and Applications. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2001.
[5] T. Ehlgen, T. Pajdla, and D. Ammon, "Eliminating blind spots for assisted driving," vol. 9, no. 4, pp. 657-665, December 2008.
[6] S. Y. Cheng and M. M. Trivedi, "Toward a comparative study of lane tracking using omni-directional and rectilinear images for driver assistance systems," in ICRA 2007 Workshop: Planning, Perception and Navigation for Intelligent Vehicles, 2007.
[7] D. Scaramuzza and R. Siegwart, "Monocular omnidirectional visual odometry for outdoor ground vehicles," Springer Lecture Notes in Computer Science, Computer Vision Systems, 2008.
[8] Z. Zhu, "Omnidirectional stereo vision," in Proc. of ICAR '01, 2001, pp. 22-25.
[9] J. Gluckman, S. K. Nayar, and K. J. Thoresz, "Real-time omnidirectional and panoramic stereo," in Proceedings of the 1998 DARPA Image Understanding Workshop. Morgan Kaufmann, 1998, pp. 299-303.
[10] S. Yi and N. Ahuja, "An omnidirectional stereo vision system using a single camera," in ICPR '06: Proceedings of the 18th International Conference on Pattern Recognition. Washington, DC, USA: IEEE Computer Society, 2006, pp. 861-865.
[11] T. Gandhi and M. Trivedi, "Vehicle surround capture: Survey of techniques and a novel omni-video-based approach for dynamic panoramic surround maps," vol. 7, no. 3, pp. 293-308, September 2006.
[12] T. Ehlgen, M. Thom, and M. Glaser, "Omnidirectional cameras as backing-up aid," Computer Vision, IEEE International Conference on, vol. 0, pp. 1-5, 2007.
[13] M. Lauer, M. Schönbein, S. Lange, and S. Welker, "3D-object tracking with a mixed omnidirectional stereo camera system," Mechatronics, vol. 21, no. 2, pp. 390-398, 2011, special issue on Advances in intelligent robot design for the RoboCup Middle Size League.
[14] P. Sturm, "Mixing catadioptric and perspective cameras," Omnidirectional Vision, Workshop on, vol. 0, p. 37, 2002.
[15] G. Adorni, L. Bolognini, S. Cagnoni, and M. Mordonini, "Stereo obstacle detection method for a hybrid omni-directional/pin-hole vision system," in RoboCup, 2001, pp. 244-250.
[16] T. Svoboda and T. Pajdla, "Epipolar geometry for central catadioptric cameras," Int. J. Comput. Vision, vol. 49, no. 1, pp. 23-37, 2002.
[17] C. Mei and P. Rives, "Single view point omnidirectional camera calibration from planar grids," in IEEE International Conference on Robotics and Automation, April 2007.
[18] J. Y. Bouguet, "Camera calibration toolbox for Matlab," 2008. [Online]. Available: http://www.vision.caltech.edu/bouguetj/calib_doc/
[19] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346-359, 2008.
[20] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, ISBN: 0521540518, 2004.


On-Board Perception and Motion Planning for Legged Locomotion over Rough Terrain
Dominik Belter, Przemysław Łabęcki, Piotr Skrzypczyński
Institute of Control and Information Engineering, Poznań University of Technology, Poznań, Poland
Abstract: This paper addresses the issues of perception and motion planning in a legged robot walking over rough terrain. The solution takes the limited perceptual capabilities of the robot into account. A passive stereo camera and a 2D laser scanner are considered for sensing the terrain, with environment representations matching the uncertainty characteristics of these sensors. The motion planner adheres to the dual-representation concept by applying the A* algorithm for strategic-level planning on a coarse elevation grid from stereo data, and then using an RRT-based method to find a sequence of feasible motions on a more precise but smaller map obtained by using the laser data. Results of simulations and experiments on real data are provided.

Index Terms: legged robot, terrain mapping, path planning

I. INTRODUCTION

In this paper we introduce a motion planning system that allows a legged robot to traverse previously unseen rough terrain using only on-board perception. In our previous work [9] we presented a mapping system for the Messor hexapod robot (Fig. 1A), which uses a tilted-down miniature 2D laser scanner Hokuyo URG-04LX to acquire the terrain profile. The obtained elevation map is then used for planning of the footholds [1] and, as shown in our recent work [2], can be a basis for local motion planning. However, the perception range of the tilted scanner is too small to plan an effective path of the robot on-line, while perceiving the terrain. Hence, we added a stereo camera to obtain a more extended representation of the terrain (because of the limited payload of the robot we cannot use a 3D laser scanner). This permits us to extend our previous motion planning approach [2] to a strategic level, allowing the robot to devise a path to a more distant goal. The motion planning approach is closely related to the perception capabilities of our robot. The stereo camera perceives the environment up to several metres from the robot (Fig. 1B), but the range data are sparse and corrupted by many artefacts. On the other hand, the laser scanner, which is much more precise in acquiring the terrain profile, provides only local data. Therefore, we conceived a dual-resolution terrain representation (Fig. 2), which employs a coarse strategic elevation grid describing the more distant areas perceived by means of the camera, and a smaller, fine-grained tactical elevation map that surrounds the robot and is updated from the URG-04LX range data. The robot is located in the centre of these maps, and both maps are translated while the robot is walking. With such a terrain representation the approach to motion planning resembles human behaviour during a trip in rugged terrain: we usually plan a general path by looking at the area from a distance to avoid trees, large stones, ditches, etc., and then we follow this path, focusing our visual attention on the terrain at our feet in order to avoid smaller obstacles and find appropriate footholds. These two planning levels are implemented in the robot by two different algorithms. The Rapidly-exploring Random Tree (RRT) algorithm [10], which provides a single-step, probabilistically complete planning method for high-dimensional configuration spaces, is applied at the tactical level. At this level the RRT-based planner handles the 6-d.o.f. robot pose and the 18-dimensional space defined by the legs' joints. However, the probabilistic nature of RRT often leads to exploring spaces unrelated to any reasonable path.

Fig. 1. Messor robot with the dual-mode terrain perception system (A), and left camera image from the STOC system (B).

Fig. 2. Concept of the terrain representation and motion planning approach.

To mitigate this problem the strategic-level path planner is added, which uses the heuristic, non-randomized A* algorithm. This algorithm works in a lower-dimensional search space, because the coarse elevation map only allows to roughly determine the traversability between two cells. However, the A* algorithm is guaranteed to yield a path optimal up to the resolution of the map. The path planned at the strategic level is then used to define local sub-goals for the tactical RRT-based planner. Because the two planning levels run repeatedly as the robot moves and perceives new areas, the fine-grained motion plan yielded by the RRT follows the general path from the A*


II. RELATED WORK

This section briefly surveys the terrain perception and motion planning works that are most directly related to our approach. Visual terrain perception in legged robots was pioneered by the work on Ambler [7], which used a laser scanner to build an elevation grid of the surroundings. Passive stereo vision, which is a popular sensing modality in robotics, has so far found limited use in terrain perception for walking robots. One notable example is [13]; however, it relies on the specific mechanics of the RHex robot, which provides robust motion in moderately rough terrain without explicit foothold selection. Also Kolter et al. [5] describe a perception system with an on-board stereo camera, but off-board data processing is used. The terrain modelling method proposed in [5] focuses on filling-in the occluded areas of the terrain that arise due to the line-of-sight constraints of the camera. Elevation values at unseen locations are also predicted in [12] by applying Gaussian process regression to 3D laser data. This approach enables filling-in large discontinuities, but its computational cost is high. Conversely, we use not only a long-range sensor (stereo camera), but also the 2D laser scanner, which yields dense data and is purposefully configured to minimize discontinuities in the obtained map [9].

The notion of a variable-resolution grid map for motion planning is also used in [4]. The spatial resolution of the map decreases with the distance from the robot, which allows an optimal A* path planner to be used in real-time for an environment with moving obstacles. A similar grid map concept is applied in our work; however, it is motivated mostly by the properties of the sensors being used. Our strategic-level path planner exploits some form of cost map computed over this variable-resolution grid. It is based on the spherical variance, which is used also by [14] in connection with the Fast Marching wave-propagating path planner.

Motion planning for a legged robot is computationally expensive because of the huge search space with numerous constraints. Some researchers try to simplify the problem, e.g. by adapting existing 2D planning techniques to operate in 3D terrain. Such an approach may generate quite conservative motion plans, which do not take full advantage of the locomotion capabilities of a legged robot. Another way of addressing the issue of complexity is presented by Kolter et al. [6], who at first plan a path of the LittleDog robot's trunk without considering the trajectories of its feet. Then, the footholds are planned to follow the initial path while considering appropriate constraints. This approach also may result in motion plans that do not consider all possible paths, as some solutions could be eliminated in an early stage of planning. The high-dimensionality issue in motion planning for legged robots can be tackled by employing a randomized planning algorithm. In particular, Hauser et al. [3] successfully used the Probabilistic Roadmaps technique to plan motion of a six-legged lunar robot. In a recent work [16] the R* algorithm, which combines aspects of deterministic and randomized search, is used.

III. TERRAIN PERCEPTION AND MODELLING

Because the terrain map updating procedure applied here is virtually identical to the one we have used for the laser-only perception [9], we provide here only a brief description of the terrain modelling sub-system, focusing on the added stereo vision data processing. The foothold selection procedure [1] requires a high-resolution elevation map, which reflects the geometric properties of the ground at the scale of the robot's foot. Such a map is created incrementally, as the robot moves, using range data from the tilted URG-04LX laser scanner. Currently, self-localization is accomplished by means of proprioceptive sensing supported by a vision-based SLAM procedure [15]. The mapping algorithm maintains two co-located grids of the same size: an elevation grid and a certainty grid. Each cell in the certainty grid holds a value that describes the accuracy of the corresponding cell's estimate of the elevation. We use a local grid with cells of 15×15 mm in size, and maximal dimensions of 2×2 m, which are determined by the range of the URG-04LX sensor in the tilted configuration.
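As a minimal illustration of the co-located elevation and certainty grids described above, the Python sketch below uses the cell size and map extent stated in the text; the running weighted-mean update rule and the function names are assumptions for illustration, not the authors' implementation.

import numpy as np

# Illustrative dual-grid sketch: 15 mm cells over a 2 x 2 m local map.
CELL = 0.015           # [m] tactical grid resolution (from the paper)
SIZE = 2.0             # [m] local map extent (from the paper)
N = int(SIZE / CELL)   # number of cells per side

elevation = np.zeros((N, N))   # estimated terrain height per cell [m]
certainty = np.zeros((N, N))   # support for each elevation estimate (assumed update rule)

def world_to_cell(x, y):
    # map robot-centred metric coordinates to grid indices
    return int((x + SIZE / 2) / CELL), int((y + SIZE / 2) / CELL)

def update_cell(ix, iy, z, weight=1.0):
    # fuse one range measurement (height z) into cell (ix, iy) as a weighted running mean
    c = certainty[ix, iy]
    elevation[ix, iy] = (elevation[ix, iy] * c + z * weight) / (c + weight)
    certainty[ix, iy] = c + weight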

Fig. 3. STOC camera results (brighter colours correspond to objects closer to the camera): disparity image with default filter thresholds (A), disparity images with lowered filter thresholds (B), disparity image after removing pixels of high variance (C), disparity image after final filtering (D)

Long-range terrain perception is accomplished by means of passive stereo vision. Stereo cameras are lightweight and inexpensive sensors, which can be used to obtain 3D data for a range of several metres, depending on the base-line used. To cope with the high computational requirements of stereo image processing in a legged robot with limited on-board computing power we use the STereo-On-Chip (STOC) camera from Videre Design, which provides embedded image processing. The STOC camera implements the Small Vision System stereo algorithm based on area correlation and generates a point cloud of 640×480 resolution at 30 frames per second. However, the small base-line (only 6 cm) results in a limited practical range of the sensor, while changing illumination and insufficient texture of the observed objects contribute to artefacts and large empty areas in the disparity images. Although there are built-in filters in the camera, they don't remove all of the artefacts, and there is a need for measurement post-processing [8]. The artefacts that are most problematic for the built-in filters are those on the borders of objects with disparity values much different from the background. Such measurements can be removed by analyzing the depth image. The artefact removal procedure is illustrated for a simple indoor mockup scene depicted in Fig. 1B. Figure 3A shows the disparity image acquired with default filter thresholds. There are few false measurements in the disparity image, but large patches of the terrain are missing. Lowering the filter thresholds provides a much richer image, but introduces some artefacts (Fig. 3B).


Some of the invalid disparity values are of random nature and can be eliminated by analyzing several consecutive images and discarding pixels of large disparity variance (Fig. 3C). There are also some false measurements that are stable from image to image (encircled by red lines in Fig. 3C). Some of these artefacts can be removed at the expense of increasing the minimal measurement distance (Fig. 3D).
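The temporal filtering step mentioned above can be sketched as follows; this is not the authors' code, and the variance threshold and the value used to mark invalid pixels are assumptions chosen only for illustration.

import numpy as np

def filter_unstable_pixels(disparity_stack, var_threshold=4.0, invalid=0.0):
    # disparity_stack: array of shape (frames, H, W) with consecutive disparity images
    stack = np.asarray(disparity_stack, dtype=float)
    valid = stack != invalid                                   # per-frame validity mask
    count = valid.sum(axis=0)
    mean = np.where(count > 0,
                    (stack * valid).sum(axis=0) / np.maximum(count, 1),
                    invalid)
    var = np.where(count > 1,
                   ((stack - mean) ** 2 * valid).sum(axis=0) / np.maximum(count - 1, 1),
                   np.inf)                                     # a single observation is not trusted
    # keep only pixels whose disparity is stable across the consecutive frames
    return np.where((var <= var_threshold) & (count > 1), mean, invalid)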

Fig. 4. Voting scheme for stereo data filtering and elimination of artefacts

Although the depth image analysis removes many of the artefacts, we apply a simple voting scheme in a local 3D data structure (Fig. 4) in order to eliminate as many spurious measurements as possible before integrating the measurements into the actual elevation map. The obtained point cloud (green points in Fig. 4) is projected into a 3D grid. Usually more than one measured point falls into a single cell, which has a height of 0.1 m. Each cell in the 3D grid is treated as a separate bin, which has its count of votes increased whenever a range measurement hits that bin. After projecting all the points, for each cell of the 2D floor grid the bin in its respective vertical column that has the largest number of votes is selected. The final elevation value in the given cell is computed as a weighted average of the vertical coordinates of the measurements accumulated in the winning bin, with the disparity standard deviation used as the weighting factor. The final elevation values are shown as coloured patches in Fig. 4. Measurements allocated to other bins in the given column are discarded. Moreover, points with elevation |h| > 0.5 m are discarded, as we are only interested in the terrain description.
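A minimal Python sketch of this voting scheme follows. The 0.1 m bin height and the |h| > 0.5 m cutoff come from the text; the horizontal cell size, the inverse-standard-deviation weighting, and all names are assumptions made only for illustration.

import numpy as np

def elevation_from_votes(points, sigmas, cell_xy=0.10, bin_h=0.10, h_max=0.5):
    # points: (N, 3) array of (x, y, z); sigmas: per-point disparity standard deviation
    votes = {}
    for (x, y, z), s in zip(points, sigmas):
        if abs(z) > h_max:                       # keep only terrain-like heights
            continue
        key = (int(x // cell_xy), int(y // cell_xy), int(z // bin_h))
        votes.setdefault(key, []).append((z, 1.0 / max(s, 1e-6)))
    # for every floor cell keep only the bin with the most votes in its column
    best = {}
    for (cx, cy, cz), samples in votes.items():
        if (cx, cy) not in best or len(samples) > len(votes[best[(cx, cy)]]):
            best[(cx, cy)] = (cx, cy, cz)
    elevation = {}
    for (cx, cy), key in best.items():
        zs, ws = zip(*votes[key])
        elevation[(cx, cy)] = np.average(zs, weights=ws)   # weighted mean height
    return elevation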


This map allows the path of the robot to be planned roughly, but it doesn't allow foothold selection. The strategic elevation map is updated only from stereo data, while the smaller tactical map updating procedure uses mostly the URG-04LX range measurements. However, as the robot walks forward the precise map gradually overlaps the area that was already mapped with the STOC camera. To re-use the already acquired elevation data in the precise map we implemented a simple fusion mechanism. As the laser data are much more reliable, these data override the stereo data in all cells that have a certainty value above a given threshold (i.e. they were successfully measured several times). However, if there are low-certainty cells in the tactical map, they are filled with the elevation data from the corresponding (but larger) cells of the less precise map from stereo. Results of this fusion procedure are shown in Fig. 5B (stereo and laser data are mapped to different colours to make them distinguishable).

IV. PERCEPTION-BASED MOTION PLANNING

A. Outline of the planning procedure

Our motion planning method involves two algorithms. The whole procedure (Fig. 6) finds and executes a path between the current start configuration qc of the robot and the goal configuration qg. Although the configuration (denoted by q) is defined in the 24-dimensional space constituted by the 6-d.o.f. pose and the 18-dimensional posture of the robot, in some operations of the planning algorithm particular variables are set to default values, effectively reducing the dimensionality. At first, the planning procedure finds a coarse path PA between the current and the goal pose by using the A* algorithm, which handles only the 2D position of the robot's trunk, assuming a default posture. If the A* fails to yield a path, the goal is considered to be unreachable. If a path is found, a temporary goal configuration qt is created on the path PA. The temporary goal is located at the distance dRRT from qc, which at the start is equal to the current robot pose qR. Then the RRT-based planning procedure searches for a full motion plan between the qc and qt configurations. If this search is successful, the robot executes the desired sequence of motions, but only up to the limits of the known tactical map (this configuration is denoted by qm). However, if the RRT-based algorithm fails to find a feasible motion sequence to qt, the distance dRRT is increased, allowing the tactical path planner to make a bigger detour from the general path found by the strategic planner. While walking, the robot acquires new data which are stored in the tactical and the strategic elevation maps. After reaching a temporary goal the path planning procedures (both A* and RRT-based) are repeated on the updated maps. The whole motion planning procedure stops when the robot reaches the goal configuration or the RRT-based planner fails while searching for a motion plan between qc and qg (cf. Fig. 6).

B. Strategic level planning

The A* algorithm performs strategic-level path planning on the strategic map obtained from the stereo camera. This representation is sufficient for coarse path planning, which allows a general direction of further motion to be obtained, i.e., to guide the RRT-based planner.


Fig. 5. Elevation maps obtained with an indoor mockup: stereo data only (A), and combined stereo and laser data (B)

Long-range perception with the STOC sensor allows a much bigger, strategic elevation map to be created. The STOC camera version we use can perceive objects as far as 5 m away, thus the map size can be set to 10×10 m. Because of the artefacts and empty areas the precision of the strategic map is quite low (Fig. 5A). Therefore, the size of its grid cell is set to 0.1×0.1 m.



C. Tactical level planning

To find a precise sequence of movements for the legged robot, an algorithm working in a continuous configuration space should be used, thus graph search methods like A* are not applicable. We chose an RRT-based planner because of its ability to quickly explore the high-dimensional search space. The tactical level planner is based on the RRT-Connect concept [10], which is used here as a meta-algorithm, invoking several lower-level planning modules. The particular sub-procedures of this meta-algorithm are explained further in the paper. Like RRT-Connect, the RRTMOTIONPLANNING procedure (Alg. 1) creates two random trees. Each node in a tree represents a kinematic configuration of the robot. A child node is connected to the parent node if a transition between these two configurations is possible. The root configuration of the first tree Tc is located in the current configuration qc. The root configuration qt of the second tree Tt defines a temporary goal state of the robot. For the Tc tree the motion is planned forward, whereas the direction in the Tt tree is inverted.

Algorithm 1
procedure RRTMOTIONPLANNING(qc, qt)
    Tc.Init(qc); Tt.Init(qt)
    for k := 1 to K do begin
        qrand := RANDOMCONFIG;
        if not (qnew := EXTEND(Tc, qrand)) = Trapped then
            if EXTEND(Tt, qnew) = Reached then
                return PATH(Tc, Tt);
        SWAP(Tc, Tt);
    end;

Fig. 6. Block diagram of the motion planning procedure
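The overall two-level loop of Fig. 6 can be summarized by the following minimal Python sketch. It is not the authors' implementation: a_star_path, next_subgoal, rrt_plan and execute are hypothetical callables supplied by the caller, and the 0.5 m detour increment is an arbitrary assumption.

def plan_and_walk(q_start, q_goal, a_star_path, next_subgoal, rrt_plan, execute,
                  d_rrt=1.2):
    q_c = q_start
    while q_c != q_goal:
        coarse_path = a_star_path(q_c, q_goal)        # strategic level (A*)
        if coarse_path is None:
            return False                              # goal considered unreachable
        q_t = next_subgoal(coarse_path, q_c, d_rrt)   # temporary goal on the coarse path
        motions = rrt_plan(q_c, q_t)                  # tactical level (RRT-based)
        if motions is None:
            if q_t == q_goal:
                return False                          # tactical planner failed at the goal
            d_rrt += 0.5                              # allow a bigger detour and retry
            continue
        q_c = execute(motions)                        # walk up to the known-map limit,
                                                      # updating both maps on the way
    return True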

Moreover, the lower-resolution grid allows A* to run in real-time. A cost function and a heuristic distance function must be defined for the A* path planner. We defined the transition cost between two neighbouring nodes by using two coefficients and a procedure that finds non-traversable transitions. The first coefficient c1 simply defines the Euclidean distance between the centres of two neighbouring cells. The second coefficient c2 is based on the spherical variance [14], which describes the roughness of the terrain. The higher the value, the bigger the difficulties the robot can expect in finding appropriate footholds. To compute this coefficient the elevation map is converted to a triangle mesh by using the considered [i, j] cell and its neighbours. The vector normal to each triangle on the surface, Nk = (xk, yk, zk), is computed. Then the module R of the vector set Nk is computed:
R = \sqrt{\left(\sum_{k=0}^{n} x_k\right)^{2} + \left(\sum_{k=0}^{n} y_k\right)^{2} + \left(\sum_{k=0}^{n} z_k\right)^{2}}   (1)

The module R is divided by the number of vectors n (we use n = 8) to normalize the values into a range between 0 and 1, and the coefficient is computed as follows:

c_2 = 1 - \frac{R}{n}   (2)

Prior to computing the final transition cost c_final, the kinematic ability of the robot to traverse between the two given nodes is considered by the SIMPLETRAVERSE procedure. While planning, the roll, pitch and yaw angles of the robot's trunk are assumed to be zero, and the robot uses nominal footholds resulting from the gait pattern. Finally, if the transition between the initial and the goal state is possible from the point of view of our hexapod's kinematics, SIMPLETRAVERSE returns True, and the c_final coefficient is computed as the mean value of the c1 and c2 coefficients. If the transition is not achievable (SIMPLETRAVERSE = False), the final cost c_final is set to infinity. We use the Euclidean distance to define the heuristic cost in A*, which ensures that we never overestimate the heuristic part of the cost and that we fulfill the monotonicity condition, guaranteeing the optimality of the planner.
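The roughness coefficient of Eqs. (1)-(2) and the resulting transition cost can be sketched in Python as follows; normals is assumed to be the set of triangle normals around the considered cell (n = 8 in the paper), and traversable is a placeholder for the SIMPLETRAVERSE check, not the authors' API.

import numpy as np

def roughness_c2(normals):
    # normals: (n, 3) array of surface normal vectors around the map cell
    n = len(normals)
    resultant = np.sum(np.asarray(normals, dtype=float), axis=0)
    R = np.linalg.norm(resultant)          # Eq. (1): module of the summed normals
    return 1.0 - R / n                     # Eq. (2): spherical variance

def transition_cost(cell_a, cell_b, normals_b, traversable):
    if not traversable(cell_a, cell_b):    # kinematically impossible transition
        return float("inf")
    c1 = np.linalg.norm(np.asarray(cell_a, dtype=float) - np.asarray(cell_b, dtype=float))
    c2 = roughness_c2(normals_b)
    return 0.5 * (c1 + c2)                 # mean of the two coefficients, as in the text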

At first, the procedure RANDOMCONFIG randomly selects a pose of the robot qrand (only the x and y coordinates) on the given elevation map. Then the EXTEND operation extends the tree and tries to build a new branch in the direction of qrand. If it is possible, and the new node qnew is created, the algorithm extends the second tree in the direction of the qnew node. In the next iteration of the algorithm the order of the trees is swapped (SWAP procedure). As a result, the Tt tree is extended at the beginning of the next iteration. The maximal number of iterations is defined by K. When the two trees are connected and the EXTEND procedure returns the Reached value, the algorithm finds the shortest path between qc and qt. The qnew configuration is common for the two trees. The PATH procedure finds the path from qnew to the root nodes in both trees. The sequence of the nodes for the Tt tree should be inverted. Finally, the sequence of configurations which allows the robot to move from the initial to the goal configuration is obtained.

The EXTEND procedure (Alg. 2) plays a key role in the RRT-based planner for a legged robot. It checks if a transition between two given configurations of the robot is possible. At the beginning the NEARESTNEIGHBOUR operation finds the existing node qnear, which is the closest one to the new, random position qrand. Then, a new configuration is created by using the module which selects footholds and creates a posture (CREATEPOSTURE). Then, the procedure CHECKTRAVERSE verifies if the transition from qrand to qnear is possible.


Trajectories for the feet and a path for the trunk are created. Then, the planner checks if the desired sequence of the robot's postures is achievable (lack of collisions, acceptable footholds inside the workspaces) and safe (the robot maintains static stability). If a transition to this configuration is impossible, another configuration qrand, which is however closer to qnear, is created by the REDUCEDISTANCE procedure. The maximal number of iterations is N.

Algorithm 2
procedure EXTEND(T, qrand)
    qnear := NEARESTNEIGHBOUR(qrand)
    for n := 1 to N do begin
        CREATEPOSTURE(qrand);
        if CHECKTRAVERSE(qrand, qnear) = true then
            if n = 1 then
                return qrand, Reached;
            else
                return qrand, Advanced;
        REDUCEDISTANCE(qrand, qnear);
    end;
    return Trapped;

If the considered qrand node is located inside the tactical map, the robot plans its movements using all the modules described in Subsection IV-D. But whenever the dRRT distance increases beyond the tactical map area, this node can be outside the precise map. In such a case the movement is planned by using a simplified method that roughly determines traversability between two nodes. The CREATEPOSTURE and CHECKTRAVERSE procedures are not used; they are replaced by the default posture and the SIMPLETRAVERSE procedure used by the strategic planner. However, as the robot advances on its path and the precise map covers new areas, the motion sequence is replanned with the full-featured version.

D. Main modules of the tactical motion planner

The module responsible for maintaining an acceptable posture creates the robot's configuration at the given (x, y) location on the elevation map. It computes the required inclination of the robot's trunk and its elevation above the terrain. The static stability of the new posture is checked using a simple but fast procedure implementing the stability criterion defined by [11]. At a given posture of the body, the robot has to find appropriate footholds. The algorithm proposed in [1] is used. The aim of this algorithm is to find an analytical relation between some simple characteristics of the local shape of the terrain and the predicted slippage of the robot's feet. The predicted slippage is the slippage which is expected when the robot chooses the considered point of the elevation map. As shown in [1], the robot learns unsupervised how to determine the footholds. Then it uses the obtained experience to predict the slippage and minimize it by choosing proper points from the map. Finally, to determine if the considered footholds are reachable, the planner checks if each foothold is located inside the workspace of the respective leg. Points that are more distant from the borders of the leg's workspace are preferred to avoid potential kinematic deadlocks. Legs of the robot can collide if a selected foot position is inside the workspace of the other legs. To detect such situations, the planner is equipped with a module for collision detection, which uses a simplified CAD model of the robot (see [2] for details). In the transition between two configurations the path of the robot's trunk is determined as a straight line. However, the motion planner also has to determine if the robot is able to move its legs in such a way that the goal state is reached, e.g. by moving the legs above an obstacle. Therefore, a fast trajectory planner for the feet of the robot is implemented, which takes into account the shape of the terrain as modelled by the detailed elevation map. The procedure for planning trajectories of the feet also sets the sequence of swing and stance phases for the legs, which defines the gait. Although currently the motion planner applies a wave or a tripod gait, it allows the movements to be planned using any kind of cyclic gait. Because using only position-based control is risky on rough terrain, when the internal representation of the environment is uncertain, the robot uses the force sensors in its feet to detect collisions with the ground and to stop the movement when the support of the terrain is sufficient.

V. RESULTS

Initially, the proposed motion planning method was tested with a physics engine-based simulator of the Messor robot. A specially prepared terrain model, which includes a real map of a rocky terrain acquired using the URG-04LX sensor, was used in the simulator¹. Example results are shown in Fig. 7A, where the three paths (depicted in shades of blue) represent three independent simulations of the motion planning procedure. The branches of the resulting RRT trees (for one of the three paths) are shown in green. Although the randomized RRT algorithm is used, the results are repeatable thanks to the guidance provided by the strategic planner. On a 2 GHz PC the average execution time of the A* planner is 10 s, and execution of the RRT planner also takes 10 s for dRRT = 1.2 m.

Fig. 7. Paths planned in simulation by using our algorithm (A), and a path planned by using the RRT-Connect algorithm (B)

¹ A movie clip is available at http://lrm.cie.put.poznan.pl/ecmr2011.wmv


For the sake of comparison we show how the same problem is solved by the RRTMOTIONPLANNING procedure, which is used here without the strategic level planner (Fig. 7B). The method, being in fact our implementation of the RRT-Connect idea for the walking robot, searches over the whole map, which is time-consuming. In contrast, the same procedure in the proposed two-level planning approach is focused on exploring areas around the solution provided by the A* planner.

Fig. 8. Experimental results obtained with the indoor mockup

The described motion planning method and the terrain representation were also verified in a series of small-scale indoor and outdoor experiments, involving a mockup of a rocky terrain (cf. Fig. 1) and a real rubble-covered terrain with some vegetation. Figure 8 visualizes two moments during an indoor experiment, when the robot is about 40 cm from the start point (A), and about one metre from the start (B). The general path provided by the strategic planner is shown in red, the RRT tree in green, while the final path is black. This figure also shows the local feet trajectories (blue lines). It is visible that both the general path and the motion plan get updated as the robot receives new information about the terrain. Figure 9 shows a similar situation during an outdoor experiment, illustrating the ability of the on-board perception system to handle natural scenes. The distance between the two configurations shown in Fig. 9A and B is 80 cm.

Fig. 9. Experimental results obtained outdoors, in a rubble-covered area

VI. CONCLUSIONS AND FUTURE WORK

The motion planning system presented in this paper makes it possible to combine the strengths of the two popular approaches to path planning: the RRT and the graph-search-based approach. Moreover, it provides a natural decomposition of the computationally expensive task, which, being inspired by human behaviour, is well-suited for the dual-mode terrain perception system of the Messor robot. It allows the RRT algorithm to take advantage of the guidance provided by the strategic planner, but at the same time does not eliminate any solutions in the early stage of planning: if the A* path is unfeasible for legged locomotion (e.g., because of poor cost assessment), the whole procedure reduces to an RRT-Connect-like planner, which takes much more time, but finds a path if one exists. Future work includes improved perception from the STOC camera and improvements to the low-level modules, which should make the RRT-based planner fully real-time. Also replacement of the A* algorithm with a dynamic planner (e.g. D* Lite) is considered, to make the replanning step faster.

REFERENCES

[1] D. Belter, P. Łabęcki, and P. Skrzypczyński. Map-based adaptive foothold planning for unstructured terrain walking. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 5256-5261, Anchorage, 2010.
[2] D. Belter and P. Skrzypczyński. Integrated motion planning for a hexapod robot walking on rough terrain. In Prepr. 18th IFAC World Congress, Milan, 2011 (to appear).
[3] K. Hauser, T. Bretl, J.-C. Latombe, and D. Wilcox. Motion planning for a six-legged lunar robot. In 7th Int. Workshop on the Algorithmic Foundations of Robotics, pages 301-316, New York, 2006.
[4] R. Kirby, R. Simmons, and J. Forlizzi. Variable sized grid cells for rapid replanning in dynamic environment. In Proc. IEEE/RSJ Int. Conf. on Intel. Robots & Syst., pages 4913-4918, St. Louis, 2009.
[5] J. Z. Kolter, Y. Kim, and A. Y. Ng. Stereo vision and terrain modeling for quadruped robots. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 1557-1564, Kobe, 2009.
[6] J. Z. Kolter, M. P. Rodgers, and A. Y. Ng. A control architecture for quadruped locomotion over rough terrain. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 811-818, Pasadena, 2008.
[7] E. Krotkov and R. Simmons. Perception, planning, and control for autonomous walking with the Ambler planetary rover. Int. Journal of Robotics Research, 15(2):155-180, 1996.
[8] P. Łabęcki. Improved data processing for an embedded stereo vision system of an inspection robot. In R. Jablonski et al., editors, Recent Advances in Mechatronics, Springer, Berlin, 2011 (to appear).
[9] P. Łabęcki, D. Rosiński, and P. Skrzypczyński. Terrain perception and mapping in a walking robot with a compact 2D laser scanner. In H. Fujimoto et al., editors, Emerging Trends in Mobile Robotics, pages 981-988, World Scientific, Singapore, 2010.
[10] J. J. Kuffner Jr. and S. M. LaValle. RRT-Connect: An efficient approach to single-query path planning. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 995-1001, San Francisco, 2000.
[11] R. B. McGhee and G. I. Iswandhi. Adaptive locomotion for a multilegged robot over rough terrain. IEEE Trans. on Systems, Man, and Cybernetics, SMC-9(4):176-182, 1979.
[12] C. Plagemann, S. Mischke, S. Prentice, K. Kersting, N. Roy, and W. Burgard. A Bayesian regression approach to terrain mapping and an application to legged robot locomotion. Journal of Field Robotics, 26(10):789-811, 2009.
[13] R. B. Rusu, A. Sundaresan, B. Morisset, K. Hauser, M. Agrawal, J.-C. Latombe, and M. Beetz. Leaving flatland: efficient real-time three-dimensional perception and motion planning. Journal of Field Robotics, 26(10):841-862, 2009.
[14] L. Sanctis, S. Garrido, L. Moreno, and D. Blanco. Outdoor motion planning using fast marching. In O. Tosun et al., editors, Mobile Robotics: Solutions and Challenges, pages 1071-1080, World Scientific, Singapore, 2009.
[15] A. Schmidt and A. Kasiński. The visual SLAM system for a hexapod robot. In L. Bolc et al., editors, Computer Vision and Graphics, volume 6375 of LNCS, pages 260-267, Springer, Berlin, 2010.
[16] P. Vernaza, M. Likhachev, S. Bhattacharya, S. Chitta, A. Kushleyev, and D. D. Lee. Search-based planning for a legged robot over rough terrain. In Proc. IEEE Int. Conf. on Robot. & Autom., pages 2380-2387, Kobe, 2009.


A Sampling Schema for Rapidly Exploring Random Trees using a Guiding Path
Vojtěch Vonásek, Jan Faigl, Tomáš Krajník, Libor Přeučil
Department of Cybernetics, Center for Applied Cybernetics
Faculty of Electrical Engineering, Czech Technical University in Prague
{vonasek,xfaigl,tkrajnik,preucil}@labe.felk.cvut.cz

Abstract: In this paper, a novel sampling schema for Rapidly Exploring Random Trees (RRT) is proposed to address the narrow passage issue. The introduced method employs a guiding path to steer the tree growth towards a given goal. The main idea of the proposed approach is to prefer sampling of the configuration space C along a given guiding path instead of sampling the whole space. While for a low-dimensional C the guiding path can be found as a geometric path in the robot workspace, such a path does not provide useful information for efficient sampling of a high-dimensional C. We propose an iterative scaling approach to find a guiding path in such high-dimensional configuration spaces. The approach starts with the geometric model of the robot scaled down to a fraction of its original size, for which a guiding path is found using the RRT algorithm. Then, such a path is iteratively used in the proposed RRT-Path algorithm for a larger robot up to its original size. The experimental results indicate the benefit of the proposed technique: a significantly higher ratio of successfully found feasible paths in comparison to state-of-the-art RRT algorithms.

Index Terms: motion planning, RRT

I. INTRODUCTION

Motion planning is a classical robotic problem, which can be described using the notion of the configuration space C, where q ∈ C is a configuration of the robot. The obstacles in the workspace correspond to a set Cobs ⊂ C, while Cfree = C \ Cobs is the set of feasible configurations. The motion of a robot in the workspace corresponds to a feasible path in Cfree. Several complete methods have been proposed to solve the planning problem, e.g., Voronoi diagrams or Visibility graphs. These approaches require an explicit representation of the configuration space. However, such a representation cannot be easily computed for systems with many degrees of freedom. To overcome this problem, sampling-based methods such as Probabilistic Roadmaps (PRM) [13], Rapidly Exploring Random Trees (RRT) [15] or Expansive Space Trees (EST) [9] were suggested to obtain an approximation of the configuration space. The common idea of the sampling-based methods is to build a roadmap of free configurations q ∈ Cfree. These methods randomly sample the configuration space C and classify the samples as free or non-free using a collision detection method. The free samples are stored and connected to make a roadmap, in which a solution is found. The collision detection is applied as a black box, which makes it possible to cope with robots of arbitrary shapes. The sampling-based methods are intensively studied, and their advantages and issues have been described in many papers. Here, we refer to the summary in [17]. The well-known issue

of the sampling-based methods is the narrow passage problem. A narrow passage can be defined as a part of C whose removal changes the topology of C [14]. The narrow passage becomes important if the resulting path in C has to pass through it. The issue comes from the small number of samples covering the narrow passage, which prevents constructing a feasible path through it. The sparse coverage of the passage is due to the uniform distribution, which is usually used in sampling-based motion planning methods. Even though the methods are probabilistically complete, they are not guaranteed to provide a feasible path within a given computational time.

In this paper, a novel sampling schema for the RRT algorithm is proposed to cope with the narrow passage problem, and to increase the success ratio of finding feasible paths. The approach is based on a modification of the RRT-Path [2] algorithm, which uses a guiding path to steer the tree growth in C. Although a guiding path can be computed directly in the workspace for a low-dimensional C, it is difficult to construct a useful path in a general C. Therefore, we propose to iteratively employ the RRT-Path algorithm to find a path for a robot of downscaled size, but preserving the same motion model. The path found for a smaller robot is then used as a guiding path for a larger one, which increases the performance of the motion planning.

The paper is organized as follows. Related work is described in the next section. In Section III, the modified RRT-Path algorithm is described. It is then used in Section IV in the iterative method to find a guiding path in a high-dimensional C. Experimental verifications are described in Section V.

II. RELATED WORK

A crucial part of the sampling-based methods is the sampling of the configuration space. The sampling process can be characterized by a sampling source and a sampling measure [11]. The sampling source denotes how the samples are constructed, e.g., using pseudorandom number generators or by deterministic approaches like Halton sequences. The sampling measure is the distribution of the samples in C and it is more important for the planning process itself than the sampling source [11]. In the original PRM and RRT, the configuration space is sampled uniformly. To cope with the narrow passage problem, the sampling measure can be modified to generate more samples in difficult regions. A simple method that modifies the sampling measure in the RRT is the goal-bias [16], where the random configuration is replaced by the goal configuration with a probability pg.


The goal-bias suppresses the exploration of C, while the tree is preferentially expanded towards the goal. Although the goal-bias speeds up the growth towards the goal, it may cause a congestion problem of the tree growth due to the existence of an obstacle in the pathway. The location of a narrow passage can be estimated using the shape of the workspace prior to planning. This approach is used in the Gaussian PRM [18] and the Bridge-Test [8], where random configurations are generated close to obstacles. If a roadmap with a higher clearance is preferred, the samples can be generated on the medial axis [23]. Other approaches that use knowledge about the workspace to determine regions for dense sampling are [14, 22, 1, 10]. The sampling measure can also be adapted in the planning phase, which allows generating more samples in sparsely covered regions. The adaptive sampling can be based on the level of information gain brought by the samples [4, 5]. The coverage of the configuration space by the samples can be estimated using the KPIECE [20] algorithm. The adaptive methods improve the sampling in high-dimensional configuration spaces, where the sampling cannot be modified based purely on the workspace knowledge [14]. The sampling measure can be easily modified in a PRM algorithm, because such an algorithm first samples the configuration space, and after that the roadmap is constructed using the samples located in Cfree. The modifications originally suggested for the PRM cannot be directly used in the RRT, because the RRT algorithm simultaneously builds the tree and samples the configuration space. To modify the sampling measure in the RRT, the tree growth needs to be considered, otherwise the tree can get stuck due to an obstacle, like in the situation depicted in Fig. 1a. More samples in the narrow passage np do not guarantee that the tree will escape the obstacle, because the tree tends to grow towards the samples in the np, but an obstacle is located in those directions. Therefore, the sampling measure should be boosted around np only after the tree approaches the narrow passage (Fig. 1b).
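The goal-bias principle mentioned above is simple enough to show in a one-line Python sketch; sample_uniform is a hypothetical callable standing in for uniform sampling of C, and the default probability is an arbitrary assumption.

import random

def goal_biased_sample(sample_uniform, q_goal, p_g=0.05):
    # with probability p_g the goal replaces the uniformly drawn random configuration
    return q_goal if random.random() < p_g else sample_uniform()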

Fig. 1. Environment with a narrow passage (np). If the sampling is boosted around np before the tree approaches it, the tree attempts to grow towards the obstacles (left). To help the tree to pass np, the sampling measure should be boosted in np when the tree approaches it (right).

Several approaches to adjust the sampling measure considering the tree growth in the RRT have been presented so far. The tree nodes can be classified into two groups: (1) frontier nodes, whose Voronoi cells grow together with the growth of the environment; and (2) boundary nodes, which are close to obstacles [24]. The tree should be expanded from the frontier nodes, as these are found at the boundary of the tree; thus, these can be extended towards unexplored regions.

In narrow passages, the nodes are mainly of both the frontier and the boundary types. These are frequently chosen for tree expansion, because they are frontier nodes. Nevertheless, the expansion of the nodes may not be successful as these are located close to obstacles. The authors of [24] proposed the RRT-DD algorithm to deactivate nodes from which the tree cannot be expanded. Each node has an assigned radius defining the maximum distance of a random sample in C that can activate the node for expansion. The radius is initialized to infinity; hence, the nodes can be chosen for the expansion by an arbitrary random sample. If the expansion of a node fails, its radius is decreased to a predefined value R. This suppresses frequent selection of the boundary nodes, and increases the probability of selecting frontier nodes. However, the performance of the RRT-DD strongly depends on the value of R. A strategy for adapting the activation radius has been proposed in [12]. If the expansion of a node succeeds, its activation radius is increased by a given percentage, otherwise it is decreased by the same percentage.

To attract the tree into a narrow passage, the Retraction-RRT [26] computes contact configurations qc that are located on the boundary of Cfree and Cobs. Such configurations are found using a retraction procedure, which works as follows. A random configuration qrand ∈ Cfree is generated and a close non-free configuration qobs ∈ Cobs is found. A contact configuration qc on the segment (qrand, qobs) is found, and its neighborhood is searched for a contact configuration minimizing the distance to qobs. The configuration qc is then added to the tree. It was shown that this approach can deal with narrow passages efficiently, because the generated contact configurations penetrate into the narrow passages. However, to find the contact configurations, the collision detection algorithm is called frequently, which can decrease the performance of the algorithm.

In [2], the performance of the RRT for problems with a low-dimensional C has been increased by the proposed RRT-Path algorithm. The approach is based on a precomputed geometric path in the workspace that is used to guide the randomized construction of the tree through the configuration space. Therein, the guiding path is computed using well-known approaches like the Visibility graph, the Voronoi diagram, or the Visibility-Voronoi combination. A similar idea for attracting the tree was proposed in [21]. In this approach, the tree is grown towards multiple precomputed key configurations. In the RRT-Path, only one attraction configuration is used. Moreover, this configuration is moved along the path considering the growth of the tree. It allows the tree to reach the goal configuration even in an environment with obstacles. A significant performance improvement of the RRT-Path in problems with many narrow passages (like office-like building plans) allows considering the motion planning in the watchman route problem [6], where many trajectories have to be determined to find a final watchman route trajectory from which the whole environment is covered. However, a guiding path found in the workspace cannot be used to guide the RRT in a high-dimensional C, because the workspace importance decreases as the dimension of the configuration space increases [14]. Therefore, in this paper, we extended the former RRT-Path algorithm to efficiently deal with problems in a high-dimensional C.


III. GUIDING A TREE USING A PRECOMPUTED PATH

Although the RRT-Path has been proposed in [2], the algorithm has been further improved to cope with high-dimensional spaces. That is why a detailed description of the modified algorithm is presented in this section. The basic idea of the algorithm is to steer the growth of the tree through the environment without unnecessary exploration steps. To steer the growing process of the tree a precomputed guiding path is employed, and new samples are generated along this path using a temporal goal gt, which follows the goal-bias principle. The temporal goal gt is maintained considering the growth of the tree; thus, gt slides along the guiding path during the tree construction. To determine the temporal goal gt, the nearest node in the tree should be projected to a line segment of the guiding path. To avoid computing this projection, we represent the guiding path by a set of sample points. Then, the temporal goal is searched among these samples. To bias the tree growth towards gt, a new random sample qrand is randomly drawn from the w neighbor sample points of gt. So, qrand is sampled around gt with the probability pg, otherwise qrand is sampled from C. The principle of the sampling is depicted in Fig. 2. Similarly to the original RRT, the nearest configuration qnear in the tree is found and the tree is extended towards qrand. If the tree approaches the current goal gt within a distance smaller than the threshold dt, the temporal goal is updated to the most distant point of the guiding path that has not yet been reached by the tree. The algorithm terminates if the tree approaches qgoal. The RRT-Path algorithm is listed in Algorithm 1.

The guiding path can be found easily for 2D or 3D workspaces using well-known approaches like Visibility graphs or Voronoi diagrams. Here, we do not focus on the guiding path computation itself; rather we assume to have one and focus on the sampling along such a path. To measure the distance between a configuration q ∈ C and a path point p, only the corresponding variables are used in the Euclidean metric. For example, for a car-like robot with q = {x, y, θ} and a 2D guiding path point p = {x, y}, only the x and y variables are used to determine the distance between q and p.

Maintenance of the temporal goal (lines 4-12 in Algorithm 1) is based on distances between the tree and points on the guiding path that have a higher index than gt. If the distance is lower than dt for a point pi, then the next point pi+1 is labeled as a new temporal goal. The considered distance dt implies that the tree may not follow the path exactly. A higher dt leads to less accurate tracking of the guiding path. In such a situation, a new path

Algorithm 1: RRT-Path
Input: Configurations qinit, qgoal, the maximum number of iterations K, the temporal goal bias pg, the number of the temporal goal neighbors w, a guiding path P = {p1, ..., pn}
Output: A trajectory between qinit and qgoal or failure
 1  T.add(qinit)
 2  gt = 1   // index of the temporal goal point
 3  while iteration < K and qgoal not reached do
 4      i = n
 5      while i > gt do
 6          qn = nearestNeighbor(T, pi)
 7          if distance(qn, pi) < dt then
 8              gt = i + 1   // new temporal goal
 9              break
10          end
11          i = i - 1
12      end
13      if pg > rand(0, 1) then
14          qrand = random point among w neighbors of gt
15      else
16          qrand = random configuration in C
17      end
18      qnear = nearestNeighbor(T, qrand)
19      qnew = extend qnear towards qrand
20      if qnew can be connected to qnear then
21          T.add(qnew)
22      end
23      iteration = iteration + 1
24  end
25  if qgoal was reached by the tree T then
26      return trajectory between qinit and qgoal
27  else
28      return failure
29  end
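The sampling step of Algorithm 1 (lines 4-17) can be sketched in Python as below. This is only an illustrative sketch, not the authors' implementation: nearest_tree_distance and sample_uniform are hypothetical callables standing in for the tree and for uniform sampling of C, and the default threshold value is an assumption (pg = 0.95 and w = 15 match the experimental settings in Section V).

import random

def guided_sample(path_xy, g_t, nearest_tree_distance, sample_uniform,
                  p_g=0.95, w=15, d_t=0.2):
    # slide the temporal goal to the farthest path point already reached by the tree
    for i in range(len(path_xy) - 1, g_t, -1):
        if nearest_tree_distance(path_xy[i]) < d_t:
            g_t = min(i + 1, len(path_xy) - 1)
            break
    # draw the random sample either near the temporal goal or uniformly from C
    if random.random() < p_g:
        lo = max(0, g_t - w // 2)
        hi = min(len(path_xy), g_t + w // 2 + 1)
        q_rand = random.choice(path_xy[lo:hi])   # among the w neighbours of g_t
    else:
        q_rand = sample_uniform()                # occasional global exploration
    return q_rand, g_t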

Fig. 2. An example of a guiding path with nine points. The gt is set to point 4. If w = 5 neighbors of the temporal goal are used to attract the tree, qrand is drawn from points {2, 3, 4, 5, 6}.

In such a situation, a new path through the environment may be found and it can even be interconnected with an existing path behind the temporal goal. So, the tree can skip a part of the guiding path, which is useful in situations where a part of the guiding path is not able to steer the tree properly, e.g., due to a low clearance. An example of such a situation is depicted in Fig. 3. The balance between path following and exploration of C depends on the number of neighbors of gt and the distance between the consecutive points on the guiding path. If the number of neighbors w is high, or the consecutive path points are too far apart, the algorithm prefers exploration of C rather than following the guiding path. The reason is that the tree is more attracted by widespread points than by the guiding path itself. If the distance between the consecutive path sample points is small or only a few gt neighbors are used to select the random points, the tree is attracted by points located in a close area; thus, the tree does not explore, but attempts to reach these points. As the distance between consecutive path points decreases, the number of path points increases, which also increases the computational burden of the temporal goal selection.


To overcome these situations, the distance between consecutive sample points on the guiding path should not be higher than dt.

Fig. 3. Example of skipping a part of the guiding path by the tree. The temporal goal is set to gt, but the tree approaches the point p. The new temporal goal is thus set to gt'.

IV. RRT-PATH WITH ITERATIVE SCALING OF ROBOT GEOMETRY

The RRT-Path is able to grow the tree along a predefined geometric path. The guiding path can be computed easily for a point robot in 2D or 3D workspaces, while its computation is PSPACE-hard [19] for a non-point robot. In this section, we propose an iterative scaling algorithm denoted as RRT-IS to find a guiding path in a general C.

The approach is based on an iterative refinement of the guiding path using a scaled model of the robot. First, the original RRT is used to find an initial guiding path with a downscaled model of the robot. This path is then used as the guiding path in the modified RRT-Path algorithm that is executed several times using an enlarged robot in a stepwise manner until the original size of the robot is achieved. The RRT-Path attempts to follow the given guiding path at most m times. Here, we assume that the robot is a single-body robot and it is scaled equally in all dimensions. The algorithm is listed in Algorithm 2.

Algorithm 2: RRT-IS
Input: qinit, qgoal, the number of RRT-Path trials m, the maximum number of allowed iterations K
Output: A path for the robot with the scale 1.0 or failure
    guidingPath = ∅
    while qgoal is not reached by the original robot do
        scale = getNextScale()   // scaling strategy
        robot.scaleGeometry(scale)
        if guidingPath.size = 0 then
            p = rrtOriginal.findPath(qinit, qgoal, robot, K)
        else
            for t = 1 : m do
                p = rrtpath.findPath(guidingPath, qinit, qgoal, K)
                if goal reached by p then
                    break
                end
            end
        end
        if goal reached by p then
            guidingPath = p
        else
            return failure
        end
    end
    return guidingPath   // a solution for scale 1.0

The narrow passages in the C of a smaller robot are relatively wider than the passages in the C of the original (full-sized) robot; thus, it is easier to find a path through the passages. The iterative scaling approach assumes that the path found for a smaller robot is topologically equivalent to the path for the robot of the original size. It is clear that this assumption is not always valid. For example, a path in C found for a small robot can become infeasible for a larger robot due to the existence of an obstacle; see Fig. 4 for an example. However, the modified RRT-Path is able to skip a part of the guiding path, and therefore, it may find a solution also in these cases.

Fig. 4. An example of guiding paths found for different scales of a robot. The workspaces with the robot are depicted in the left column, and their configuration spaces are on the right. Two paths can be found in the configuration space of a smaller robot (b); however, only the dashed path is feasible for a larger robot.

Performance of the proposed RRT-IS depends on the used scaling strategy. The simplest strategy is to start the planning with a low scale and increase the scale at each iteration by a predefined value. Although the initial robot scale can be chosen as extremely low to increase the chance of finding an admissible guiding path, such a path may become infeasible for a larger robot. An improved scaling strategy can be used, e.g., employing a binary search, to speed up the RRT-IS algorithm.

The idea of the iterative scaling approach for finding the guiding path in C is similar to the Iterative Relaxation of Constraints (IRC) [3], which was introduced for the PRM. The IRC method iteratively scales the robot and for each scale a roadmap is built. In the next step, the scale of the robot is increased and the previous roadmap is adapted. The benefit of our approach designed for the RRT is that it provides a solution in situations where: (a) differential constraints must be considered, or (b) in a case of a changing environment, where finding one path between start and goal is faster than building a roadmap of the whole configuration space. Moreover, the proposed RRT-IS uses a guiding path that considers the motion model of the robot, and therefore, it can be used to find a path for a non-holonomic robot.
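The simplest scaling strategy mentioned above, a linear schedule of robot scales, can be sketched as follows; the default start and step values match those used for the 2D experiments in Section V (0.9 and 0.02) and are not part of the algorithm itself.

def linear_scales(start=0.9, step=0.02, final=1.0):
    # yield a linearly increasing sequence of robot scales, ending at full size
    scale = start
    while scale < final:
        yield round(scale, 6)
        scale += step
    yield final

# Usage sketch: for s in linear_scales(): run one RRT / RRT-Path stage at scale s.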

V. EXPERIMENTS

The performance of the proposed modification of the RRT with iterative scaling (RRT-IS) has been experimentally verified and compared with state-of-the-art RRT approaches in two sets of experiments. All experiments have been performed within the same computational environment using an Intel Core2Duo 3.0 GHz CPU with 2 GB RAM. The nearest neighbors in the RRT algorithms were searched using the MPNN library [25] and the collisions were detected using the RAPID library [7].

In the first set of experiments, the RRT-IS was verified in 2D environments¹ BT100 and BT30 with a car-like robot. The subscript in the map name denotes the size of the narrow passage in centimeters. The size of the robot is 100 × 20 cm. Each algorithm was executed 100 times for each map. A trial in which the distance between the resulting trajectory and the goal state was higher than 30 cm is considered a planning failure and the trial is discarded. The ratio of the number of planning failures to the total number of trials performed denotes the failure ratio. Results from the successful trials are averaged. In the RRT-IS, the scale was initialized to 0.9, and it was increased by 0.02 at each iteration. The number of the inner RRT-Path trials was set to m = 3. The parameters of RRT-Path were: w = 15, pg = 95%, and dt = 20 cm. The results are shown in Table I and examples of found solutions in Fig. 5. The maximum number of iterations for the RRT-Original is in the row No. iterations. For RRT-IS, this denotes the maximum number of iterations of one run of the RRT and RRT-Path on a certain scale (parameter K in Alg. 2). The row Time is the required computational time, Tree size is the size of the tree created by the method, and No. collisions is the number of collision queries. The tree size of the RRT-IS can be higher than the number of iterations as the RRT-Path algorithm can be performed up to m times.

A passage that is more than four times wider than the robot does not represent a significant issue for the RRT-Original algorithm in the BT100 environment. However, the failure ratio of RRT-Original is higher in the BT30 environment with a narrow passage. The failure ratio of RRT-Original depends on the maximum number of iterations. However, even if the number was increased (to 50,000), the failure ratio of the RRT-Original was still higher than that of RRT-IS with only 5,000 allowed iterations. Although the RRT-IS is more computationally intensive than the RRT-Original, it provides significantly better performance regarding the failure ratio. The high computational requirements of the RRT-IS are caused by the determination of the temporal goals, as RRT-Path has to find the nearest neighbors to all points on the guiding path in each iteration. The initial construction of the first guiding path for the most downscaled robot is time consuming, whilst the guidance of larger robots is faster, as can be seen in Fig. 7.

The second experimental verification concerns 3D environments in the scenarios Bugtrap² and Room. The task is to find a path for a stick-shaped robot.
² http://parasol.tamu.edu/groups/amatogroup/benchmarks/mp/bug-trap/

Both environments contain a narrow passage. The iterative scaling was started at a scale of 0.05 for the Bugtrap environment and it was increased by 0.05 after each iteration. The initial scale for the Room environment was 0.7 and it was increased by 0.02 after each iteration. The parameters of RRT-Path were: w = 15, pg = 95%, and dt = 0.5 map unit. A trial was accepted if the distance between the found trajectory and the goal state was less than 1 map unit. The results are shown in Table II. Besides the RRT-Original, the RRT-Retraction [26] and RRT-DynamicDomain [24] with adaptive tuning were implemented for comparison. The parameters ε = 0.1 and R = 100 of RRT-DD were used, where ε is the length of the expansion step in the RRT. Although the state-of-the-art methods RRT-DD and RRT-Retraction are faster and need a lower number of collision detection queries, their failure ratios are significantly higher than for the RRT-IS. Regarding the experimental results, the proposed RRT-IS provides a feasible path with a higher frequency, which indicates the benefit of the proposed method of guided sampling.
TABLE I
RESULTS FOR BT100 AND BT30 MAPS WITH CAR-LIKE ROBOT.

                      RRT-Original   RRT-Original   RRT-IS
BT100
  No. iterations      15,000         35,000         5,000
  Time [s]            1.45           3.18           5.0
  Tree size           10,738         24,485         8,279
  No. collisions      180,557        384,971        257,796
  Failures            8%             1%             1%
BT30
  No. iterations      30,000         50,000         5,000
  Time [s]            2.72           4.2            28.81
  Tree size           19,335         26,734         18,834
  No. collisions      331,085        510,084        860,929
  Failures            82%            78%            1%

Fig. 5. Results of RRT-IS on maps BT100 (left) and BT30 (right). The rectangle represents the robot.

VI. CONCLUSION

A novel sampling schema for RRT planning algorithms has been presented. The proposed method employs a guiding path to steer the growth of the tree in the configuration space. The guiding path can be computed in the workspace by geometric path-planning methods, or by the RRT approach with the herein proposed iterative scaling. Whilst the geometric methods are more suitable for a low-dimensional configuration space, the RRT-IS can even deal with high-dimensional problems; thus, it can find a path through a narrow passage in the configuration space C even if the corresponding part of the workspace does not indicate the presence of a narrow passage.

The experimental verifications have shown that the proposed RRT-IS algorithm is able to cope with the narrow passage problem in both 2D and 3D workspaces with a significantly higher success rate than other methods. Although the proposed method is computationally demanding, it may be combined with the RRT-DD or RRT-Retraction to decrease the computational burden, which is a subject of our future work.



TABLE II
RESULTS FOR 3D ENVIRONMENTS.

                         RRT-Original   RRT-IS     RRT-Retr   RRT-DD
Room
  No. iterations         200,000        40,000     40,000     100,000
  Time [s]               66.84          58.2       9.9        19.2
  Tree size              156,874        209,031    69,886     100,000
  No. collisions (10^6)  4.38           4.57       1.23       1.2
  Failures               59%            0%         75%        80%
Bugtrap
  No. iterations         4x10^6         3.5x10^6   10^6       10^6
  Time [s]               2,692          3,000      257        394
  Tree size (10^6)       3.8            2.2        1.63       0.99
  No. collisions (10^6)  133            120        16         12
  Failures               86%            1%         46%        86%

Fig. 6. 3D environments Bugtrap (a) and Room (b) with a tree along a path.

Fig. 7. Tree size needed to find a solution with different scales of a car-like robot in BT100 and BT30 (a); running time of the RRTPath for various scales of robot in the Bugtrap problem (b).

VII. ACKNOWLEDGMENTS

This work has been supported by the Grant Agency of the Czech Technical University in Prague under grants No. SGS10/185 and SGS10/195, by the Ministry of Education of the Czech Republic under Projects No. 7E08006, No. 1M0567, and No. LH11053, and by the EU under Project No. 216342.

REFERENCES

[1] Nancy M. Amato, O. Burchan Bayazit, Lucia K. Dale, Christopher Jones, and Daniel Vallejo. OBPRM: an obstacle-based PRM for 3D workspaces. In WAFR, pages 155-168. A. K. Peters, Ltd., 1998.
[2] Vojtech Vonasek, Jan Faigl, Tomas Krajnik, and Libor Preucil. RRTPath: a guided Rapidly Exploring Random Tree. In Robot Motion and Control, Poznan, Poland, June 2009.
[3] O. B. Bayazit, Dawen Xie, and N. M. Amato. Iterative relaxation of constraints: a framework for improving automated motion planning. In IROS 2005, pages 3433-3440, Aug. 2005.
[4] B. Burns and O. Brock. Information theoretic construction of probabilistic roadmaps. In IROS, volume 1, pages 650-655, Oct. 2003.
[5] Brendan Burns and Oliver Brock. Toward optimal configuration space sampling. In Proceedings of Robotics: Science and Systems, Cambridge, USA, June 2005.
[6] Jan Faigl. Multi-Goal Path Planning for Cooperative Sensing. PhD thesis, Czech Technical University in Prague, 2010.
[7] S. Gottschalk, M. C. Lin, and D. Manocha. OBBTree: a hierarchical structure for rapid interference detection. In Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pages 171-180, New York, NY, USA, 1996. ACM.
[8] David Hsu. The bridge test for sampling narrow passages with probabilistic roadmap planners. In IEEE ICRA, 2003.
[9] David Hsu, Jean-Claude Latombe, and Rajeev Motwani. Path planning in expansive configuration spaces. In International Journal of Computational Geometry and Applications, pages 2719-2726, 1997.
[10] David Hsu, Lydia E. Kavraki, Jean-Claude Latombe, and Rajeev Motwani. On finding narrow passages with probabilistic roadmap planners. In WAFR, 1998.
[11] David Hsu, Jean-Claude Latombe, and Hanna Kurniawati. On the probabilistic foundations of probabilistic roadmap planning. International Journal of Robotics Research, 25(7):627-643, 2006.
[12] L. Jaillet, A. Yershova, S. M. LaValle, and T. Simeon. Adaptive tuning of the sampling domain for dynamic-domain RRTs. In IEEE/RSJ IROS, pages 2851-2856, 2005.
[13] Lydia E. Kavraki, Petr Svestka, Jean-Claude Latombe, and Mark H. Overmars. Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12:566-580, 1996.
[14] Hanna Kurniawati and David Hsu. Workspace importance sampling for probabilistic roadmap planning. In IROS, September 2004.
[15] S. M. LaValle. Rapidly-exploring random trees: A new tool for path planning. TR 98-11, 1998.
[16] Steven M. LaValle and James J. Kuffner. Rapidly-exploring random trees: Progress and prospects. In Algorithmic and Computational Robotics: New Directions, pages 293-308, 2000.
[17] Stephen R. Lindemann and Steven M. LaValle. Current issues in sampling-based motion planning. In Robotics Research: The Eleventh International Symposium, pages 36-54, Berlin, Germany, 2005. Springer-Verlag.
[18] Mark H. Overmars. The Gaussian sampling strategy for probabilistic roadmap planners. In ICRA, pages 1018-1023, 1999.
[19] John H. Reif. Complexity of the mover's problem and generalizations. In Proceedings of SFCS '79, pages 421-427, Washington, DC, USA, 1979. IEEE Computer Society.
[20] Ioan Alexandru Sucan and Lydia E. Kavraki. Kinodynamic motion planning by interior-exterior cell exploration. In WAFR, 2008.
[21] E. Szadeczky-Kardoss and B. Kiss. Extension of the rapidly exploring random tree algorithm with key configurations for nonholonomic motion planning. In IEEE International Conference on Mechatronics, 2006.
[22] J. P. van den Berg and M. H. Overmars. Using workspace information as a guide to non-uniform sampling in probabilistic roadmap planners. In IEEE ICRA, volume 1, pages 453-460, April-May 2004.
[23] Steven A. Wilmarth, Nancy M. Amato, and Peter F. Stiller. MAPRM: A probabilistic roadmap planner with sampling on the medial axis of the free space. In IEEE ICRA, pages 1024-1031, 1999.
[24] A. Yershova, L. Jaillet, T. Simeon, and S. M. LaValle. Dynamic-domain RRTs: Efficient exploration by controlling the sampling domain. In ICRA, pages 3856-3861, April 2005.
[25] A. Yershova and S. M. LaValle. Improving motion-planning algorithms by efficient nearest-neighbor searching. IEEE Transactions on Robotics, 23(1):151-157, Feb. 2007. http://msl.cs.uiuc.edu/~yershova/MPNN/MPNN.htm.
[26] Liangjun Zhang and D. Manocha. An efficient retraction-based RRT planner. In IEEE International Conference on Robotics and Automation, pages 3743-3750, 2008.



Navigation of mobile robots by potential field methods and market-based optimization

Rainer Palm
AASS, Dept. of Technology, Örebro University, Sweden

Abdelbaki Bouguerra
AASS, Dept. of Technology, Örebro University, Sweden

Abstract: Mobile robots play an increasing role in everyday life, be it for industrial purposes, military missions, or for health care and the support of handicapped people. A prominent aspect is the multi-robot planning and autonomous navigation of a team of mobile robots, especially the avoidance of static and dynamic obstacles. The present paper deals with obstacle avoidance using artificial potential fields and selected traffic rules. As a novelty, the potential field method is enhanced by a decentralized market-based optimization (MBO) between competing potential fields of mobile robots. Some potential fields are strengthened and others are weakened depending on the local situation. In addition to that, circular potential fields are deformed by using fuzzy rules to avoid an undesired behavior of a robot in the vicinity of obstacles.

I. I NTRODUCTION A group of autonomously acting mobile robots (mobile platforms) is a system of subsystems with different goals and velocities, competing navigation commands and obstacles to be avoided. In the last two decades several methods addressing these issues have been reported. One of the most popular methods for obstacle avoidance is the articial potential eld method [1]. Borenstein and Koren gave a critical view on this method addressing its advantages and drawbacks regarding stability and deadlocks [2]. Aircardi and Baglietto addressed team building among mobile robots sharing the same task and the appropriate decentralized control [3]. In approaching situations robots act as moving obstacles where coordination is done by online local path planning using the so-called via points. Further research results regarding navigation of nonholonomic mobile robots can be found in [4] and [5]. The execution of robot tasks based on semantic domain-knowledge is reported in detail in [6]. Trying to achieve different tasks at the same time makes a decentralized optimization necessary. Decentralized methods like multi-agent control are expected to handle optimization tasks for systems being composed of a large number of complex local systems more efciently than centralized approaches. One example is the ow control of mobile platforms in a manufacturing plant using intelligent agents [7]. Other decentral control strategies are the so-called utility approach [8] and the behavioral approach [9] used for mobile robot navigation. The most difcult task to optimize a decentralized system consisting of many local systems leads us to game theoretic and related methods. One of the most interesting and promising approaches to large decentralized systems is the market-based (MB) optimization. MB algorithms imitate

the behavior of economic systems in which producer and consumer agents both compete and cooperate on a market of commodities. This simultaneous cooperation and competition of agents is also called coopetition [10]. In [11] an overview on MB multi-robot coordination is presented, which is based on bidding processes. The method deals with motion planning, task allocation and team cooperation, whereas obstacles are not considered. General ideas and some results of MB control strategies are presented in [12] and [13]. In [14], a more detailed description of the optimization algorithm is presented. The authors show how to optimize distributed systems by socalled producer and consumer agents using local cost functions. Given desired setpoints and, with this, a cost function for each local system a set of controls has to be found that leads the whole set of local systems to a so-called Paretooptimum. The present paper adopts many ideas from [13], [14] and [15] to improve the performance of safe navigation of multiple robots using potential elds. In the context of MB navigation, combinations of competing tasks that should be optimized can be manifold, for example the presence of a trafc rule and the necessity for avoiding an obstacle at the same time. Another case is the accidental meeting of more than two robots within a small area. This requires a certain minimum distance between the robots and appropriate (smooth) manoeuvers to keep stability of trajectories to be tracked. The present paper addresses exactly this point where - as a novelty - optimization takes place between competing potential elds of mobile robots, whereas some potential elds are strengthened and some are weakened depending on the local situation. Repulsive forces both between robots and between robots and obstacles are computed under the assumption of circular force elds meaning that forces are computed between the centers of mass of the objects considered. It should be mentioned that each mobile robot uses only local data available through its own sensors in order to compute its local actions. Section II shows the navigation principles applied to the robot task. In Section III the general task of obstacle avoidance for a multi-robot system using articial potential elds is outlined. Section IV gives an introduction to a special MB approach used in this paper. The connection between the MB approach and the system to be controlled is outlined in Section V. Section VI shows simulation experiments and Section VII draws conclusions and highlights future work.

II. NAVIGATION PRINCIPLES To illustrate the navigation problems, let, in a working area, n mobile platforms (autonomous mobile robots) perform special tasks like loading materials from a starting station, bringing them to a target station and unloading the materials there. The task of the platforms is to reach their targets while avoiding other platforms according to specic rules. The problem is to nd an appropriate navigation strategy.

III. NAVIGATION AND OBSTACLE AVOIDANCE USING


POTENTIAL FIELDS

This section gives a more detailed view on the potential field method for obstacle avoidance. Let us start with a simple first-order dynamics for each platform P_i, which automatically avoids abrupt changes in the orientation:

  \dot{v}_i = -k_{di} (v_i - v_{di})    (1)

where v_i is the actual velocity vector of platform P_i, v_{di} \in R^2 is the vector of desired velocity of platform P_i, and k_{di} \in R^{2x2} is a damping matrix (diagonal). The tracking velocity is designed as a control term

  v_{ti} = -k_{ti} (x_i - x_{ti})    (2)

where x_i is the position of platform P_i, x_{ti} \in R^2 is the position of target T_i, and k_{ti} \in R^{2x2} is a gain matrix (diagonal). Virtual repulsive forces appear between platform P_i and obstacle O_j, from which repulsive velocities are derived:

  v_{ij}^{o} = c_{ij}^{o} (x_i - x_j^{o}) (d_{ij}^{o})^{-2}    (3)

where v_{ij}^{o} \in R^2 is the repulsive velocity between platform P_i and obstacle O_j, x_j^{o} \in R^2 is the position of obstacle O_j, d_{ij}^{o} is the Euclidean distance between platform P_i and obstacle O_j, and c_{ij}^{o} \in R^{2x2} is a gain matrix (diagonal). Virtual repulsive forces also appear between platforms P_i and P_j, from which we get the repulsive velocities

  v_{ij}^{p} = c_{ij}^{p} (x_i - x_j) (d_{ij}^{p})^{-2}    (4)

where v_{ij}^{p} is the repulsive velocity between platforms P_i and P_j, x_i \in R^2 and x_j \in R^2 are the positions of platforms P_i and P_j, d_{ij}^{p} is the Euclidean distance between platforms P_i and P_j, and c_{ij}^{p} \in R^{2x2} is a gain matrix (diagonal).

Fig. 1. Platform area.

Let, as an example, mobile robots (platforms) P1, P2, and P3 be supposed to move to targets T1, T2, and T3, respectively, whereas collisions should be avoided. Each platform has some estimate of its own position and orientation and also of the position of its own target. The position of another platform Pj relative to Pi can be measured if it lies within the sensor cone of Pi. Navigation principles for a mobile robot (platform) Pi are meant to be heuristic rules to perform a specific task under certain restrictions originating from the environment, obstacles Oj, and other robots Pj. As already pointed out, each platform Pi is supposed to have an estimate of the position/orientation of itself and of the target Ti. This information can be given either in the base frame (world coordinate system) or in the local frame of platform Pi. In our case this information is given in world coordinates. Apart from the heading-to-target movement, all other navigation actions take place in the local coordinate system of platform Pi. The positions of obstacles (static or dynamic) Oj or other platforms Pj are formulated in the local frame of platform Pi. In the following, four navigation principles are formulated that have been used in our work (see the sketch below):
1. Move in the direction of target Ti.
2. Avoid an obstacle Oj (static or dynamic) if it appears in the sensor cone at a certain distance. Always orient the platform in the direction of motion.
3. Decrease speed if a dynamic (moving) obstacle Oj comes from the right.
4. Move to the right if the obstacle angles [16] of two approaching platforms are small (e.g. < 10 deg) (see Fig. 2).
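The listing below is a small, self-contained sketch of how these four heuristic rules could be combined into a single velocity command. All thresholds, gains and conventions (sensor-cone half-angle, sensor range, strength of the rightward shift) are illustrative assumptions and are not taken from the paper.

import numpy as np

def _unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 1e-9 else v

def navigation_velocity(p_i, heading, target, obstacles, platforms, speed=1.0,
                        cone=np.deg2rad(85.0), rng=140.0,
                        small_angle=np.deg2rad(10.0)):
    """Combine navigation rules 1-4 (illustrative values, not from the paper).

    obstacles: list of (position, is_moving) pairs in world coordinates.
    platforms: list of positions of other platforms.
    """
    v = speed * _unit(target - p_i)                       # rule 1: head to target
    right = np.array([np.sin(heading), -np.cos(heading)])
    for pos, is_moving in obstacles:
        rel = pos - p_i
        d = np.linalg.norm(rel)
        bearing = np.arctan2(rel[1], rel[0]) - heading
        bearing = (bearing + np.pi) % (2.0 * np.pi) - np.pi
        if d < rng and abs(bearing) < cone:
            v += speed * _unit(-rel) * (rng - d) / rng    # rule 2: repel from obstacle
            if is_moving and bearing < 0.0:               # obstacle coming from the right
                v *= 0.5                                  # rule 3: decrease speed
    for pos in platforms:
        rel = pos - p_i
        bearing = np.arctan2(rel[1], rel[0]) - heading
        bearing = (bearing + np.pi) % (2.0 * np.pi) - np.pi
        if np.linalg.norm(rel) < rng and abs(bearing) < small_angle:
            v += 0.3 * speed * right                      # rule 4: move to the right
    return v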


Fig. 2. Geometrical relationship between platforms.

The resulting desired velocity v_{di} is the sum

  v_{di} = v_{ti} + \sum_{j=1}^{m_o} v_{ij}^{o} + \sum_{j=1}^{m_p} v_{ij}^{p}    (5)
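As a concrete illustration of Eqs. (1)-(5), the following sketch computes the desired velocity as the superposition of the tracking term and the active repulsive terms. Scalar gains stand in for the diagonal gain matrices, and the selection of "active" obstacles and platforms (sensor cones, switching) is assumed to have happened beforehand; names and values are illustrative, not taken from the paper.

import numpy as np

def desired_velocity(x_i, x_target, obstacle_positions, platform_positions,
                     k_t=0.5, c_o=200.0, c_p=200.0):
    """Superposition of Eqs. (2)-(4) into Eq. (5), with scalar gains."""
    v_d = -k_t * (x_i - x_target)                       # Eq. (2): tracking term
    for x_o in obstacle_positions:                      # Eq. (3): obstacle repulsion
        d = max(np.linalg.norm(x_i - x_o), 1e-6)
        v_d += c_o * (x_i - x_o) / d**2
    for x_j in platform_positions:                      # Eq. (4): platform repulsion
        d = max(np.linalg.norm(x_i - x_j), 1e-6)
        v_d += c_p * (x_i - x_j) / d**2
    return v_d                                          # Eq. (5)

def integrate_velocity(v, v_d, k_d=2.0, dt=0.05):
    """One Euler step of the first-order platform dynamics of Eq. (1)."""
    return v + dt * (-k_d * (v - v_d))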


where m_o and m_p are the numbers of contributing obstacles and platforms, respectively. It has to be stressed that force fields are switched on/off according to the actual scenario: the distance between interacting systems, the state of activation according to the sensor cones of the platforms, and the positions and velocities of platforms w.r.t. targets, obstacles and other platforms. All calculations of the velocity components (1)-(5), angles and sensor cones are formulated in the local coordinate systems of the platforms (see Fig. 2).

A. Deformation of potential fields using fuzzy rules

Potential fields of obstacles (static and dynamic) normally act independently of the attractive force of the target. This may cause unnecessary repulsive forces, especially in the case when the platform can see the target.

  v_{ij}^{o} = coef_{ij} \, c_{ij}^{o} (x_i - x_j^{o}) (d_{ij}^{o})^{-2}    (6)

The coefficients coef_ij can be calculated by a set of 16 fuzzy rules like

  IF |v_{ti}| = B AND \beta_{ij} = M THEN coef_{ij} = M,    (7)

where \beta_{ij} is the angle between v_{ij}^{o} and v_{ti}. The whole set of 16 rules can be summarized in a table shown in Fig. 4. Z - ZERO, S - SMALL, M - MEDIUM and B - BIG are fuzzy sets [17].

IV. MB APPROACH

Imitation of economical market mechanisms and their application to a multi-robot system requires the modeling of both the system to be optimized (see Sect. III) and the optimization strategy itself. The system and optimization strategy are presented as continuous models, whereas the computational realization is usually discrete. The desired motion of platform P_i is described by

  v_{di} = v_{oi} + \sum_{j=1, j \ne i}^{m} w_{ij} v_{ij}^{p}    (8)

where v_{oi} is a combination of
- the tracking velocity depending on the distance between platform i and its goal,
- repulsive/control terms between platform i and obstacles,
- traffic rules;
m is the number of platforms, v_{ij}^{p} is the repulsive velocity between platforms i and j, and w_{ij} are weighting factors for the repulsive velocities, where \sum_{j=1, j \ne i}^{m} w_{ij} = 1.
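For completeness, the weighted superposition of Eq. (8) can be written as the short helper below. The weights are assumed to be normalized, for instance by the market-based step described in Section IV, and the names are illustrative.

import numpy as np

def weighted_desired_velocity(v_oi, repulsive_velocities, weights):
    """Eq. (8): v_di = v_oi + sum_j w_ij * v_ij_p, with sum_j w_ij = 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # enforce the normalization constraint
    v = np.asarray(v_oi, dtype=float).copy()
    for w_j, v_j in zip(w, repulsive_velocities):
        v += w_j * np.asarray(v_j, dtype=float)
    return v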
Fig. 3. Deformation of potential field.

Fig. 4. Fuzzy table for potential field.

The goal is therefore to deform the repulsive potential field in such a way that it is strong if the obstacle hides the target and weak if the target can be seen from the platform. In addition, the potential field should also be strong for a high tracking velocity and weak for a small one (see Fig. 3). This is done by a coefficient coef_{ij} \in [0, 1] that is multiplied by v_{ij}^{o} to obtain the new repulsive velocity of Eq. (6).
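The sketch below shows one possible crisp stand-in for this rule table: the tracking speed and the angle between v_{ij}^{o} and v_{ti} are each quantized to the four linguistic labels and a 4x4 table returns coef_ij. The table entries and quantization bounds are placeholders; the actual 16 rules and membership functions are those of Fig. 4 and are not reproduced here.

import numpy as np

LEVELS = {"Z": 0.0, "S": 0.33, "M": 0.66, "B": 1.0}

# Placeholder 4x4 rule table: rows index the tracking speed |v_ti|,
# columns index the angle beta_ij; entries are NOT the rules of Fig. 4.
RULE_TABLE = [
    ["Z", "Z", "Z", "Z"],
    ["S", "S", "S", "Z"],
    ["M", "M", "S", "Z"],
    ["B", "M", "S", "Z"],
]

def _label_index(value, max_value):
    """Quantize a crisp value into one of the four labels Z, S, M, B."""
    ratio = min(max(value / max_value, 0.0), 1.0)
    return min(3, int(4.0 * ratio))

def deformation_coefficient(v_ti, beta_ij, v_max=1.0):
    """Return coef_ij in [0, 1] from a crisp table lookup (illustrative)."""
    row = _label_index(np.linalg.norm(v_ti), v_max)
    col = _label_index(abs(beta_ij), np.pi)
    return LEVELS[RULE_TABLE[row][col]]

# Example call
coef = deformation_coefficient(np.array([0.9, 0.0]), beta_ij=0.3)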

The objective is to change the weights w_{ij} so that all contributing platforms show a smooth dynamical behavior in the process of avoiding each other. One possible option for tuning the weights w_{ij} is to find a global optimum over all contributing platforms. This, however, is rather difficult, especially in the case of many interacting platforms. Therefore a multi-agent approach has been preferred. The determination of the weights is done by producer-consumer agent pairs in a MB scenario that is presented in the following. Assume that to every local system S_i (platform) belongs a set of m producer agents Pag_{ij} and m consumer agents Cag_{ij}. Producer and consumer agents sell and buy, respectively, the weights w_{ij} on the basis of a common price p_i. Producer agents Pag_{ij} supply weights w_{ij}^{p} and try to maximize specific local profit functions, where "local" means belonging to system S_i. On the other hand, consumer agents Cag_{ij} demand weights w_{ij}^{c} from the producer agents and try to maximize specific local utility functions U_{ij}. The whole economy is in equilibrium when the sum over all supplied weights w_{ij}^{p} is equal to the sum over all utilized weights w_{ij}^{c}.

  \sum_{j=1}^{m} w_{ij}^{p}(p_i) = \sum_{j=1}^{m} w_{ij}^{c}(p_i)    (9)
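The following sketch summarizes this trading step in code form: closed-form supply and demand follow from maximizing the quadratic profit and utility functions introduced in the derivation that follows, and the common price is obtained by equating total supply and demand. Parameter names (b, c, lam = e/g) follow that derivation only loosely; the snippet illustrates the mechanism and is not the implementation used in the paper.

import numpy as np

def market_based_weights(b, c, lam):
    """One pricing/trading step for the weights w_ij of a single platform i.

    b[j], c[j] : consumer utility parameters for interaction partner j
    lam[j]     : producer cost/income ratio e_ij / g_ij for partner j
    """
    b, c, lam = (np.asarray(x, dtype=float) for x in (b, c, lam))
    # Demand w^c = |b| / (2 c p) and supply w^p = p / (2 lam); equating the
    # sums of supply and demand over j fixes the common price p_i.
    p_i = np.sqrt(np.sum(np.abs(b) / c) / np.sum(1.0 / lam))
    w_demand = np.abs(b) / (2.0 * c * p_i)
    w_supply = p_i / (2.0 * lam)
    assert np.isclose(w_demand.sum(), w_supply.sum())   # market equilibrium
    w = w_demand / w_demand.sum()                       # normalize: sum_j w_ij = 1
    return w, p_i

# Example with three interaction partners (illustrative numbers only)
w, price = market_based_weights(b=[0.8, 0.3, 0.5], c=[1.0, 1.0, 1.0],
                                lam=[1.0, 1.0, 1.0])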

The requirement for an equilibrium between the sums of the produced w ij p and the demanded w ij c leads to the balance equation
m m

The trade between the producer and consumer agents is based on the denition of cost functions for each type of agent. We dene a local utility function for the consumer agent Cag ij U tility Uij benef it expenditure ij wij cij pi (wij )2 = b c c =

wij c =
j=1 j=1

wij p

(19)

Substituting (16) and (18) into (19) gives the prices p i for the weights wilp pi =
m j=1 |bij |/cij m j=1 1/ij

(10)

where ij , cij 0 , pi 0. Furthermore a local prot b function is dened for the producer agent P ag ij prof it = income costs ij = gij pi (wij p ) eij (wij p )2 (11)

(20)

where gij , eij 0 are free parameters which determine the average price level. It has to be stressed that both cost functions (10) and (11) use the same price p i on the basis of which the weights wij are calculated. Using the system equation (8) we dene further a local energy function between the pair of platforms p i and pj to be minimized Jij = =
T vdi vdi

Substituting (20) into (16) yields the nal weights w ij to be implemented in each local system. Once the new weights wij are calculated, each of them has to be normalized with respect to m wij which guarantees the above requirement j=1 m j=1 wij = 1 . V. MB OPTIMIZATION OF OBSTACLE AVOIDANCE In the following the optimization of obstacle avoidance between moving platforms by MB methods will be addressed. Coming back to the equation of the system of mobile robots (8)
m

vdi = voi +
2

wij vij p
j=1,i=j

(21)

where Jij 0, aij , cij > 0 . The question is how to combine the local energy function (12) and the utility function (10), and how are the parameters in (10) to be chosen? An intuitive choice ij = |bij |, b cij = cij (13)

aij + bij wij + cij (wij ) min

(12)

where voi is a subset of the RHS of (5) - a combination of different velocities (tracking velocity, control terms, etc.). v ij p reects the repulsive forces between platforms P i and Pj . The global energy function (12) reads Ji = = +
T vdi vdi m

guarantees wij 0. It can also be shown that, independently of aij , near the equilibrium v di = 0, and for p i = 1 , the energy function (12) reaches its minimum, and the utility function (10) its maximum, respectively. With (13) the utility function (10) becomes Uij = |bij |wij c cij pi (wij c )2 A maximization of (14) leads to Uij = |bij | 2cij pi wij c = 0 wij c from which a local w ij c is obtained wij c |bij | 1 = 2cij pi (16) + (17) + + (18) 2 (15) (14)

voT voi + 2voT i i


m

wij vij p
j=1,i=j m T j=1,i=j

(22)

(
j=1,i=j

wij vij p ) (

wij vi p )

The local energy funcion reects only the energy of a pair of two interacting platforms P i and Pj Jij = =
T vdi vdi m m

voT voi + ( i
k=1,k=i,j m

wik vik p )T (
k=1,k=i,j

wik vik p )

Maximization of the local prot function (11) yields ij = gij pi 2eij wij p = 0 wij p from which a local w ij p is obtained wij p

wik voT vikp i


k=1,k=i,j m

2wij (voT i

+
k=1,k=i,j

wik vik T )vij p p

(23)

pi = 2ij

where

eij ij = gij

2 wij (vij T vij p ) p

Comparison of (23) and (12) yields



aij

= +

voT voi + ( i
k=1,k=i,j m

wik vik p )T (
k=1,k=i,j

wik vik p )

2
k=1,k=i,j

wik voT vik p i


m

bij cij

= =

2(voT i

+
k=1,k=i,j

wik vik T )vij p p

(24)

v_{ij}^{pT} v_{ij}^{p}

VI. SIMULATION RESULTS

The following simulation results consider mainly the obstacle avoidance of the multi-robot system in a relatively small area. The sensor cone of a platform amounts to +/- 170 deg. Inside the cone a platform can see another platform within the range of 0-140 units. Platforms P 1 and P3 are approaching head-on. At the same time platform P 2 crosses the course of P1 and P3 right before their avoidance manoeuver. If there were only platforms P 1 and P3 involved the avoidance manoeuver would work without problems. According to the built-in trafc rules both platforms would move some steps to the right (seen from their local coordinate system) and keep heading to their target after their encounter. Platform P 2 works as a disturbance since both P 1 and P3 react on the repulsive potential of P2 which has an inuence on their avoidance manoeuver. The result is a disturbed trajectory (see Fig. 5) characterized by drastic changes especially of the course of P3 during the rendezvous situation. A collision between P 1 and P3 cannot be excluded because of the crossing of the courses of P1 and P3 . When we activate the MB optimization, we obtain a behavior that follows both the repulsive potential law for obstacle avoidance and the trafc rules during approaching head-on (see Fig. 6). There is no crossing of tracks between P 1 and P3 anymore due to the MB optimization of the repulsive forces between platforms P 1 ,P2 , and P3 and a respective tuning of the weights wij . Figure 7 shows the resulting weights. We also notice that w12 and w13 , w21 and w23 , and w31 and w32 are pairwise mirror-inverted due to the condition m j=1,i=j wij = 1 (see also eq. (8)). In further simulations, which are not shown here, the platforms move on circles with different speed, similar diameters and centerpoints while avoiding other platforms and static obstacles on their tracks. To determine the smoothness of the trajectories, the averages of the curvatures along the trajectories of the platforms were calculated. It turned out that the use of MB optimization leads to a signicant improvement of the smoothness of trajectories ( see Figs. 8 and 9). VII. C ONCLUSIONS Navigation while avoiding obstacles by mobile robots can be successfully done by using a mixture of principles like virtual potential elds, trafc rules, and control methods. It has also been shown that a deformation of the central symmetry

Fig. 5. Approach, no MB optimization.

Fig. 6. Approach, with MB optimization.


Fig. 7. Weights, with MB optimization.

Fig. 8. Platforms moving on circular trajectories, no MB optimization (curv1 = 224.4767, curv2 = 86.4355, curv3 = 335.1066).

Fig. 9. Platforms moving on circular trajectories, with MB optimization (curv1 = 91.1495, curv2 = 116.5760, curv3 = 165.3848).

of an obstacle is helpful because it takes into account whether a robot is moving towards or away from it. An important aspect presented in this paper is the market-based (MB) optimization of competing potential fields of mobile platforms. MB optimization imitates economical behavior between consumer and producer agents on the basis of a common price. By means of MB optimization some potential fields are strengthened and some weakened depending on the actual scenario. This is especially required when more than two robots meet within a small area, which makes a certain minimum distance between the robots and appropriate manoeuvers necessary. Therefore, MB navigation allows smooth motions in such situations. Simulation experiments with simplified robot kinematics and dynamics have shown the feasibility of the presented method. Although it does not belittle the presented results, it has to be underlined that the simulation of the vehicles only meets part of the requirements for non-holonomic systems. Restrictions w.r.t. abrupt changes of motion in position and orientation are only indirectly considered. Restrictions regarding the weight, height, engine and wheels of the vehicles and the sensors have been neglected to a large extent. Therefore, future aspects of this work are the enhancement of the simulation towards more realistic vehicles, with the goal of implementation on a real mobile robot.

REFERENCES
[1] O. Khatib. Real-time obstacle avoidance for manipulators and mobile robots. IEEE Int. Conf. on Robotics and Automation, St. Louis, Missouri, pages 500-505, 1985.

[2] Y. Koren and J. Borenstein. Potential eld methods and their inherent limitations for mobile robot navigation. Proceedings of the IEEE Conference on Robotics and Automation, Sacramento, California, pages 13981404, April 7-12, 1991. [3] M.Aicardi and M. Baglietto. Decentralized supervisory control of a set of mobile robots. Proceedings of the European Control Conference 2001., pages 557561, 2001. [4] M. J.Sorensen. Feedback control of a class of nonholonomic hamiltonian systems. Ph.D. Thesis,Department of Control Engineering Institute of Electronic Systems Aalborg University, Denmark., 2005. [5] J.Alonso-Mora, A. Breitenmoser, M.Rui, P. Beardsley, and R. Siegwart. Optimal reciprocal collision avoidance for multiple non-holonomic robots. Proc. of the 10th Intern. Symp. on Distributed Autonomous Robotic Systems (DARS), Switzerland, Nov 2010. [6] A. Bouguerra. Robust execution of robot task-plans: A knowledge-based approach. Ph.D. Thesis,Oerebro University, 2008. [7] A. Wallace. Flow control of mobile robots using agents. 29th International Symposium on Robotics Birmingham, UK, pages 273276, 1998. [8] T.B.Gold, J.K. Archibald, and R.L. Frost. A utility approach to multi-agent coordination. Proceedings of the 2000 IEEE International Conference on Robotics and Automation, San Francisco, pages 2052 2057, 1996. [9] E.W.Large, H.I.Christensen, and R.Bajcsy. Scaling the dynamic approach to path planning and control: Competition among behavioral constraints. Vol. 18, No. 1, pages 3758, 1999. [10] T.Teredesai and V.C.Ramesh. A multi-agent initiative system for realtime scheduling. SMC98 Conference, San Diego, CA, USA, pages 439 444, 1998. [11] M. B. Dias, R. Zlot, N. Kalra, and A. Stentz. Market-based multirobot coordination: a survey and analysis. Proceedings of the IEEE, vol. 94, no. 7, pages 12571270, July 2006. [12] S.H. Clearwater (ed.). Market-based control: A paradigm for distributed resource allocation. Proceedings of the 38th CDC, Phoenix, Arizona USA., World Scientic, Singapore., 1996. [13] O.Guenther, T.Hogg, and B.A.Huberman. Controls for unstable structures. Proceedings of the SPIE, San Diego, CA, USA, pages 754763, 1997. [14] H.Voos and L.Litz. A new approach for optimal control using marketbased algorithms. Proceedings of the European Control Conference ECC99, Karlsruhe, 1999. [15] R. Palm. Synchronization of decentralized multiple-model systems by market-based optimization. IEEE Trans Syst Man Cybern B, Vol. 34, pages 66572, Feb 2004. [16] B. R. Fajen and W. H. Warren. Behavioral dynamics of steering, obstacle avoidance, and route selection. Journal of Experimental Psychology: Copyright by the American Psychological Association, Inc. Human Perception and Performance, Vol. 29, No. 2, page 343362, 2003. [17] R.Palm, D.Driankov, and H.Hellendoorn. Model based fuzzy control. Springer-Verlag Berlin New York Heidelberg, 1997.


A Sparse Hybrid Map for Vision-Guided Mobile Robots


Feras Dayoub Grzegorz Cielniak Tom Duckett
Department of Computing and Informatics, University of Lincoln, Lincoln, UK
Abstract: This paper introduces a minimalistic approach to produce a visual hybrid map of a mobile robot's working environment. The proposed system uses omnidirectional images along with odometry information to build an initial dense pose-graph map. Then a two-level hybrid map is extracted from the dense graph. The hybrid map consists of global and local levels. The global level contains a sparse topological map extracted from the initial graph using a dual clustering approach. The local level contains a spherical view stored at each node of the global level. The spherical views provide both an appearance signature for the nodes, which the robot uses to localize itself in the environment, and heading information when the robot uses the map for visual navigation. In order to show the usefulness of the map, an experiment was conducted where the map was used for multiple visual navigation tasks inside an office workplace.
{fdayoub,gcielniak,tduckett}@lincoln.ac.uk

I. I NTRODUCTION Different methods have been introduced to tackle the problem of acquiring a map of a mobile robots environment. The problem is called simultaneous localization and mapping (SLAM). Due to the importance of solving the SLAM problem before truly autonomous mobile robots can be built, the robotics community has given much attention to SLAM over the last decade. The result is a wide variety of methods, with each method having its own advantages and disadvantages. Recently, the methods which deal with SLAM as a nonlinear graph optimization problem, i.e. graph-based SLAM, have gained increasing attention in the literature. The output of these methods is a pose-graph where the nodes associate past poses of the robot with map features. The edges of the graph model spatial constraints between the nodes [1]. The optimization step aims to select the spatial conguration of the nodes, which best satises the constraints encoded in the edges. Generally, the output from the graph optimization algorithms is a pose-graph map, where the nodes of the graph are created at every step the robot has performed. Therefore, the graph is dense and it contains redundant information that can be removed leading to a more compact map, which is preferable in the case of mobile robots with limited resources. In this paper we use a graph-based SLAM algorithm to produce an initial dense pose-graph map of the environment. Then the initial map is used to build a sparse hybrid map consisting of two levels, global and local. Fig. 1 illustrates the hybrid map. On the global level, the world is represented as a topological map. The topological map is extracted from the initial dense pose-graph using a dual clustering approach, introduced in this paper. The proposed approach selects nodes from the initial map which are located in areas such as doorways and corners, allowing the topological map to

maintain full coverage of the environment while minimizing the required number of nodes. On the local level of the map, each node stores a spherical view representation of image features extracted from an omnidirectional image recorded at the position of the node. The spherical views contain the 3D location of the image features on a sphere. Thus, we only store the direction of the features (but not their distance or depth) from the centre of the sphere, which corresponds to the centre of that node. The spherical views are used for estimating the robots heading in a visual navigation system where we use the map to perform a path following task. The paper is constructed as follows. In Section II, we discuss related work in the eld. Section III contains details of the method to build the initial map. In Section IV we present the dual clustering algorithm which selects the nodes of the global map. A visual navigation strategy is presented in Section V. The experiment and results are presented in Section VI. Finally, we draw conclusions in Section VII. II. BACKGROUND Different vision-based mapping methods using graph optimization have been proposed [2], [3], [4]. These methods follow the same general approach where the map is built as a graph, with the nodes containing camera views from the environment and the graph edges are expressed as geometric constraints between these views. A loop closing mechanism is deployed to detect when previously mapped areas are revisited. When a loop is detected, a new constraint link is added to the graph and then a graph optimizer is invoked to correct the map. Although the work presented in this paper follows the same general approach, we differ by proposing a spherical view representation for the nodes in the local level of the hybrid map and also we introduce a dual clustering algorithm to reduce the number of nodes in the global level. The loop detection mechanism in this paper uses a similarity measure between the views stored in the nodes of the map. The detection mechanism could simply be a direct one-to-one view matching procedure. However, more sophisticated methods for loop detection can be used to provide real time performance when the graph contains a large number of nodes. Such methods include the hierarchical vocabulary tree [5] and the FABMAP visual vocabulary [6]. A naive solution to reduce the number of nodes in the graph would be based on the time stamp of the nodes, where the graph is sampled using a xed time step. This method would fail if the robot stands still for some time or has a changing speed while mapping the environment. Another simple solution would be based on the distance between the nodes [7]. However, this method does not take into

paper uses different criteria to reduce the number of nodes in the map. Our main goal is to produce a map with sparse nodes located in places that are suitable for using the map for visual navigation tasks.

III. BUILDING THE HYBRID MAP

We aim to produce a hybrid map extracted from an initial dense pose-graph map. The hybrid map consists of a global level, which contains a sparse topological map, and a local level, which stores a spherical view for each node (see Fig. 1). Each spherical view is generated from an omnidirectional image recorded from the position of the node. The initial map building process is carried out based on a stop-sense-go strategy, where the robot is driven by a human operator. The driver follows the following routine. The robot starts a mapping step by capturing an omnidirectional image from an on-board camera, while it is static, along with the odometry reading from the wheel encoders. Then the robot moves a short distance forward and stops. If there is a need to perform a rotation, the robot rotates a certain angle and ends the current mapping step by stopping. The driver repeats as many mapping steps as required to cover the environment. The constraints between consecutive poses along the trajectory of the robot consist of two components, i.e. translation and rotation. These constraints can be extracted from the internal odometry of the robot. However, using the odometry alone is not always accurate enough to build the constraint network. First, the odometry measurements are affected by noise due to wheel drift and slippage, especially during rotation. Second, using odometry alone cannot provide any information about loop closure. To deal with these limitations, we use measurements from the on-board omnidirectional vision sensor as an input to a Bayesian filtering framework (extended Kalman filter) to reduce the error of the odometry measurements. In addition to that, the vision sensor is used to perform loop closure. A loop closure results in an edge in the constraint network which relates the current robot pose with a former robot pose. These loop closure edges contradict the pose estimates resulting from plain odometry. In order to correct the structure of the graph after a loop is detected, we use the graph optimization algorithm TORO [10], which considers each edge in the pose graph as a cost function for the two connected nodes and re-arranges the nodes in the graph such that the total costs associated with all edges are minimized.

A. Relations Based on Odometry

Let the robot's pose at any given time step t be represented as p_t = (x_t, y_t, \theta_t), where (x_t, y_t) are the coordinates of the robot and \theta_t is the current heading. In our stop-sense-go strategy, robot motion between two consecutive poses is approximated by a translation d followed by a rotation \Delta\theta, and the model which obtains the pose p_t from p_{t-1} is

  \begin{pmatrix} x_t \\ y_t \\ \theta_t \end{pmatrix} =
  \begin{pmatrix} x_{t-1} \\ y_{t-1} \\ \theta_{t-1} \end{pmatrix} +
  \begin{pmatrix} d_t \cos(\theta_{t-1}) \\ d_t \sin(\theta_{t-1}) \\ \Delta\theta_t \end{pmatrix},    (1)
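As a small illustration of the motion model of Eq. (1), combined with the Gaussian noise on translation and rotation described in Section III-A, the sketch below samples a noisy successor pose. The noise coefficients are placeholders, and treating the alpha-weighted terms as noise variances is an assumption about the notation, not a statement from the paper.

import numpy as np

def sample_motion(pose, d, dtheta, rng,
                  a_range=0.01, a_turn=0.05, a_drift=0.02):
    """Apply Eq. (1) with noisy odometry (translation d, rotation dtheta)."""
    x, y, theta = pose
    d_n = d + rng.normal(0.0, np.sqrt(a_range * abs(d)))
    dtheta_n = dtheta + rng.normal(
        0.0, np.sqrt(a_turn * abs(dtheta) + a_drift * abs(d)))
    return np.array([x + d_n * np.cos(theta),
                     y + d_n * np.sin(theta),
                     theta + dtheta_n])

rng = np.random.default_rng(0)
pose = np.array([0.0, 0.0, 0.0])
pose = sample_motion(pose, d=0.35, dtheta=np.deg2rad(10.0), rng=rng)  # one mapping step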

Fig. 1. Proposed hybrid map with two levels i.e. global and local. The environment is represented as an adjacency graph of nodes on the global level of the map and each node on the local level represents the 3D location of image features on a sphere. Our method represents the direction of the features (but not their distance or depth) from the centre of the sphere, which corresponds to the centre of that node.
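To make the local-level representation of Fig. 1 concrete, the sketch below stores, for each node, only the unit direction vectors of the image features together with their descriptors (e.g. SURF [13]). The azimuth/elevation inputs are assumed to come from the omnidirectional camera model, and the function name is ours, not the paper's.

import numpy as np

def spherical_view(azimuths, elevations, descriptors):
    """Local map level: bearing-only feature directions on the unit sphere."""
    az = np.asarray(azimuths, dtype=float)
    el = np.asarray(elevations, dtype=float)
    directions = np.stack([np.cos(el) * np.cos(az),
                           np.cos(el) * np.sin(az),
                           np.sin(el)], axis=1)       # unit vectors, no depth stored
    return {"directions": directions, "descriptors": np.asarray(descriptors)}

# Example: three features observed from one node
view = spherical_view([0.1, 1.2, -2.0], [0.0, 0.3, -0.1],
                      descriptors=np.zeros((3, 64)))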

account the rapid change in the appearance of the robots surroundings which could occur when the robot crosses a door or goes round a corner. Successive images can differ considerably on the two sides of the doors or corners, which could become a challenge when the robot tries to actually use the map for navigation. This implies that image similarity should be considered in the process. In general, the methods which use image similarity to reduce the number of nodes in the map start by clustering the nodes based on image similarity and then choose key nodes as a representative for each image group [8]. These methods can be used when information about the geometric distances between the nodes is not available. However, in our case the distance between the nodes is provided by the graph optimization approach and we can use that as an additional source of information. Instead of choosing the key nodes in the graph based on the geometric distance alone or the similarity alone, we propose a method which selects the nodes in the graph based on a combination of the two metrics. Recently, Ila et al. [9] introduced a pose-graph mapping method where they measure statistical content on links to enforce node sparsity and limited graph connectivity. The result is a compact map which contains non-redundant nodes and links. The main criteria with which the nodes are selected relates directly to reducing the uncertainty in the estimation of the robot position. This tends to produce densely distributed nodes when the robot performs curved paths and sparsely distributed nodes when the robot performs straight forward paths. This is because of the increasing uncertainty resulting from turning motions. The method presented in this


where the translation d_t and the rotation \Delta\theta_t used in (1) are obtained from the odometry measurements by adding independent Gaussian noise terms \epsilon_d and \epsilon_{rot}:

  \epsilon_d \sim N(0, \alpha_{range} |d|),    (2)
  \epsilon_{rot} \sim N(0, \alpha_{turn} |\Delta\theta| + \alpha_{drift} |d|).    (3)

range , turn and drif t represent the range error, the turn error and the drift error of the robot encoder respectively. B. Tracking of the robot heading In order to reduce the uncertainty in tracking the position of the robot as much as possible, we use an extended Kalman lter (EKF) [11] to track the heading component of the robots pose as follows: at step t, we compute the relative orientation between the current pose of the robot pt and the previous pose pt1 using the two omnidirectional images which were recorded at each position. Then by adding this relative orientation to the heading t1 , we obtain a vision based observation for the robots heading at step t. Then by feeding this heading observation to the EKF lter with the robot motion model from Eq. 1 to produce a prediction step of the lter, we estimate the robot pose at step t. The relative orientation between two omnidirectional images is computed based on the epipolar geometry of spherical cameras [12]. C. Loop Closure Using Vision Loop closing capability is an essential part of any mapping system. In our case, without this capability the robot can face the problem of assigning the same area in the environment to multiple nodes leading to a globally inconsistent map. Therefore, the robot uses its vision sensor to detect loops along the trajectory using place recognition based on image similarity. As mentioned earlier, each node in the map contains a group of image feature points extracted from an omnidirectional image recorded when the node was rst created. Using these image features, the similarity between any two nodes in the map is measured using the number of matched feature points between the two groups of features stored in each node [13]. So in order to detect loop closures, the robot calculates the similarity between the current node and all the nodes which are located within a certain radius. When the ratio of the number of matched feature points to the number of features stored in the current node exceeds a certain threshold, a loop is considered detected. As a result, the robot adds a link between the two detected nodes in the graph with a distance of zero. Then TORO, the tree optimizer, is run on the graph to correct the pose estimates of the nodes in the map according to the newly added link. There are two important parameters that appear in the above loop closure strategy. First, the distance in which the robot checks for a loop closure. The longer the radius the more nodes need to be matched with the current node. In our experiments we use a radius of 3m. Second, the matching threshold which the robot uses to decide that a loop has been closed. This threshold is affected by the texture of the mapped environment, which in turn affects the number of image features that can be extracted from the images. In our

Fig. 2. Maximising the intra-cluster similarity aims to automate the selection of reference views from areas such the middle of the doorways and the edge of the corners.

experiment a loop is considered closed if 35% of the features are matched (a parameter that was obtained experimentally for our set-up). IV. G RAPH P RUNING BY D UAL C LUSTERING The resulting map from the above step is a dense graph containing nodes created at each step the robot has performed. In this section we extract a sparse topological map from the dense graph by deploying a graph pruning step. In our case, the map is considered as a set of spatial objects with the nodes representing these objects. Each node has two attribute domains, a geometric domain in the XY plane and a non-geometric domain represented by image similarity. The aim is to select a sub-set of these nodes in a way that sufciently covers the environment, allowing the robot to use the map for tasks such as autonomous visual navigation. In order to do that we use a clustering method called dual clustering [14]. Dual clustering is the process which partitions a set of spatial objects into different clusters in such a way that each cluster forms a compact region in the geometric domain while maximizing the similarity in the non-geometric domain. In the geometric domain, the clustering algorithm produces compact clusters. This is a preferable effect as we do not want the selected nodes, which correspond to the centers of the clusters, to be very sparse creating gaps in the nal map. In the non-geometric domain (i.e. image similarity), the clustering algorithm selects the centers of the clusters in a way which maximizes the intra-cluster similarity. The effect of this process is also preferable for our case because, in the cases where a cluster of nodes expands through a doorway or a corner, the center of the cluster will be selected from the middle of the doorway or the edge of the corner. This selection allows the node to cover both sides of the door and the corner preventing a discontinuity in covering the environment, which can be a problem when actually using

the map for tasks such as visual navigation. Fig. 2 illustrates this situation.

A. The clustering algorithm

Our clustering algorithm is inspired by a fast implementation of the dual clustering method presented in [14]. The algorithm performs clustering based on density in the geometric domain, while maximizing the similarity in the non-geometric domain. In order to achieve this, for each cluster center in the geometric domain, the neighborhood of a given radius d should contain a minimum number of points. And in the non-geometric domain, the similarity between the neighborhood points of each cluster and the center of that cluster should be above a certain threshold s. The clustering process is performed incrementally as follows:
1) Initialize a cluster by starting from the first unclassified node in the graph Np (based on the time stamp). If the number of nodes in the neighborhood of Np is less than a predefined threshold, the node is ignored. Otherwise, create a new cluster C.
2) Insert all nodes from the neighborhood of Np which have similarity with Np greater than s into C.
3) Compute the center of the cluster Nref by selecting the node which is most similar to all other nodes in the cluster, as

  N_{ref} = \arg\max_{k \in C} \sum_{j \in C} sim(N_j, N_k),

Fig. 3. The proposed visual navigation strategy. N_j is the current node in the path and N_k is the next node. The dashed line is the path; \theta_j and \theta_k are the relative orientations between the robot's heading and the reference orientations of the nodes N_j and N_k, respectively; \theta_r is the robot's desired heading.

where sim(Nj , Nk ), the similarity between the two views in the nodes Nj and Nk , is the number of matched image features. 4) Check an arbitrary node Nq , from the neighborhood of Np . If the number of nodes in the neighborhood of Nq is at least , and the similarity between an unclassied node No in the neighborhood of Nq and the center of the cluster is above s , then insert No to the cluster C and recompute the center of the cluster as in step 3. Repeat step 4 until the cluster C can not be extended any more. 5) Repeat all the steps until all the nodes in the graph are classied. We discuss how to set the clustering parameters in our experiments in Section VI. V. U SING THE M AP FOR NAVIGATION Every map can be judged by its usefulness for practical purposes. In our case the map is used for a path following routine inside an indoor environment. When robots work inside an indoor environment, their navigation generally is restricted to what the humans consider to be a path inside that environment, such as corridors and the areas between the furniture. These routes effectively simplify the task of navigation by limiting the robot to only one degree of freedom along the path. And by representing this path as a sequence of images, the following framework

of the appearance-based approach for visual navigation is used in the literature [15], [16], [17]: The path is first built during a learning phase where the robot is controlled by a human operator. During this phase the robot captures a sequence of images along the path. A subset of the captured images is selected to represent the reference images along the path. During the replay phase, the robot starts near the first position and is required to repeat the same path. The robot extracts the control commands of its motion by comparing the currently observed image with the reference images along the path. In this work we adopted a similar framework for visual path following using a sequence of nodes from the map. Fig. 3 illustrates the navigation strategy. First the robot localizes itself to one of the nodes in the path. This is done by selecting the node which has the highest similarity score with the currently observed view. Let S_j be the similarity score, i.e. the number of matched points. The similarity score is also computed between the current view and the next node in the sequence. Let S_k be the similarity score with the next node. Then the following ratios are computed:

  w_j = S_j / (S_j + S_k),    w_k = S_k / (S_j + S_k).    (4)

The heading angle \theta_r is computed as a weighted sum:

  \theta_r = w_j \theta_j + w_k \theta_k,    (5)

where \theta_j and \theta_k are the relative orientations between the current view and the nodes N_j and N_k, respectively (see Fig. 3). By following this navigation strategy, the nodes in the path can be considered as directional signs which lead the robot toward its goal. In order to estimate the relative orientation between two views, we use epipolar geometry. The method first estimates

Fig. 4. Left: The true trajectory of the robot. Right: The experimental platform. An ActivMedia P3-AT robot equipped with an omnidirectional vision system.

the so-called essential matrix E. Then, based on the method introduced by Hartley and Zisserman in [18], the essential matrix is factored to give Eq. 6 which contains the rotation matrix R SO(3) and the skew-symmetric matrix [t] of the translation vector t R3 . E = [t] R. (6) After that, the the relative orientation is extracted from the rotation matrix R. VI. E XPERIMENTS AND R ESULTS Our experimental platform is an ActivMedia P3-AT robot equipped with a GigE progressive camera (Jai TMC4100GE, 4.2 megapixels) with a curved mirror from 0360.com. The following experiment was carried out in an ofce oor at the University of Lincoln. We drove the robot on a tour between the ofces while recording a set of omnidirectional images. The resulting database contains 222 images with approximately 35 cm between the consecutive images. Fig. 4 shows the positions of the recorded images obtained using the GMapping library [19]. The GMapping algorithm provides a SLAM solution using laser rangender scans based on a Rao-Blackwellized particle lter. The output of the algorithm is an estimate of the robot trajectory along with an occupancy grid map of the environment. We would like to emphasize that our method does not use laser scan matching. We use this information for visualisation purposes only. The robot starts from location (x0 = 0, y0 = 0) and comes back to the same start point. The dashed line in Fig. 5 shows the trajectory of the robot based on the odometry alone where the drift effect is clear. Fig. 5 also shows the trajectory after we added the EKF as a lter for the observation of the robot heading. Although the robot does not have any method to detect loop closure, the resulting error in estimating the trajectory is smaller. The nal output of the mapping step

Fig. 5. The green dashed line represents the trajectory of the robot from the odometry. The red line is the result of using vision estimated relative orientation as an observation with an EKF lter. The black line is the nal output after loop closing and graph relaxation.

is shown as well in Fig. 5, where the graph optimization algorithm produced a map which is globally consistent. In the next step, we applied our dual clustering algorithm which pruned the dense graph produced from the previous step and generated the nal map. The number of nodes in the nal map is affected by the initial neighborhood radius d and the image similarity threshold s . The parameter d gives an initial radius for the node to cover and then the algorithm expands the node based on the image similarity threshold s . In our experiments the map is intended to be used for autonomous visual navigation; therefore the sparsity of the map should not result in gaps where the robot cannot estimate its heading relative to one of the nodes in the map. In order to achieve that we assign d to 1 m as a minimum distance between the nodes and s to 35 feature points as the minimum number of matched points between any view in the node and the center of the node; and nally we assign , the minimum number of points required in the neighborhood of any node, to 1 m. Fig. 6 shows the result of the pruning step where a set of 23 nodes was selected. In order to test the map for visual navigation, we performed ve path following runs using the nodes of the map. The paths were chosen randomly, while covering all nodes in the map. At the start of each run the robot was given a sequence of nodes to follow. The robot then followed each path using the navigation strategy presented in Section. V. In addition, an array of sonar sensors was used for obstacle avoidance. The same ve runs were then re-executed using manual drive where a human driver steers the robot taking the best track where the robot was driven through the shortest distance and at the same time was kept away from obstacles. Table I shows the results for both the autonomous and the manual runs. In order to show the robustness of the navigation procedure, the 5 autonomous runs were executed

the robot's heading using multi-view geometry. The map is built using a sequence of images along with odometry information. The global consistency of the map is achieved using a graph optimization algorithm. In order to reduce the number of nodes in the map, a dual clustering algorithm for post-processing the initial map was developed. The map was used in an experiment where the robot performed multiple path following tasks.

REFERENCES
[1] F. Lu and E. Milios. Globally consistent range scan alignment for environment mapping. Autonomous Robots, 4(4):333349, 1997. [2] F. Fraundorfer, C. Engels, and D. Nistr. Topological mapping, localization and navigation using image collections. In Proc. IEEE International Conference on Intelligent Robots and Systems (IROS), pages 38723877, 2007. [3] E. Eade and T. Drummond. Monocular slam as a graph of coalesced observations. In Proc. IEEE International Conference on Computer Vision (ICCV), pages 18, 2007. [4] K. Konolige, J. Bowman, J. D. Chen, P. Mihelich, M. Calonder, V. Lepetit, and P. Fua. View-based maps. The International Journal of Robotics Research, 29(8):941, 2010. [5] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006. [6] M. Cummins and P. Newman. Probabilistic appearance based navigation and loop closing. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 20422048, 2007. [7] M. Jogan and A. Leonardis. Robust localization using an omnidirectional appearance-based subspace model of environment. Robotics and Autonomous Systems, 45(1):5172, 2003. [8] O. Booij, Z. Zivkovic, and B. Krose. Sampling in image space for vision based SLAM. In Proc. Workshop on the Inside Data Association, Robotics: Science and Systems Conference (RSS), 2008. [9] V. Ila, J. M Porta, and J. Andrade-Cetto. Information-based compact pose SLAM. IEEE Transactions on Robotics, 26(1):7893, 2010. [10] G. Grisetti, C. Stachniss, S. Grzonka, and W. Burgard. A tree parameterization for efciently computing maximum likelihood maps using gradient descent. In Proc. of Robotics: Science and Systems (RSS), 2007. [11] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian bayesian state estimation. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 107113, 2002. [12] C. Geyer and K. Daniilidis. A unifying theory for central panoramic systems and practical implications. Proc. European Conference on Computer Vision (ECCV), pages 445461, 2000. [13] H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. In Proc. European Conference on Computer Vision (ECCV), 2006. [14] J. Zhou, F. Bian, J. Guan, and M. Zhang. Fast implementation of dual clustering algorithm for spatial data mining. In Proc. International Conference on Fuzzy Systems and Knowledge Discovery, volume 3, pages 568572. IEEE Computer Society, 2007. [15] G. Blanc, Y. Mezouar, and P. Martinet. Indoor navigation of a wheeled mobile robot along visual routes. In Proc. IEEE International Conference on Robotics and Automation (ICRA), pages 33543359, 2005. [16] T. Goedem , M. Nuttin, T. Tuytelaars, and L. Van Gool. Omnidirece tional Vision Based Topological Navigation. International Journal of Computer Vision, 74(3):219236, 2007. [17] O. Booij, B. Terwijn, Z. Zivkovic, and B. Krose. Navigation using an appearance based topological map. Proc. of the IEEE International Conference on Robotics and Automation (ICRA), 2007. [18] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2 edition, March 2004. [19] G. Grisetti, C. Stachniss, and W. Burgard. Improved techniques for grid mapping with Rao-Blackwellized particle lters. IEEE Transactions on Robotics, 23(1):3446, 2007.

Fig. 6. The selected nodes from the dual clustering algorithm along with five sequences of nodes which the robot was given to follow.

twice. The mean and minimum sonar range distances to obstacles were calculated for each run along with the traveled distance. These values are used as an indication of the quality of the navigation performance. As the results show, although the robot takes a slightly longer path to reach its goal, the autonomous routes were smooth and similar to the manual runs. The average mean distance to obstacles was 1.00 m for the manual runs and 0.88 m for the autonomous runs. The average value of the minimum range to obstacles was 0.49 m for manual driving and 0.44 m for the autonomous runs.

VII. CONCLUSION

This paper introduced a minimalistic mapping method using an omnidirectional vision sensor. The produced map is hybrid with two levels of representation, global and local. On the global level, the world is represented as a graph of adjacent nodes with each node containing a group of image features. On the local level, the features inside each node form a spherical view, which is used for estimating the robot's heading using multi-view geometry. The map is built using a sequence of images along with odometry information. The global consistency of the map is achieved using a graph optimization algorithm. In order to reduce the number of nodes in the map, a dual clustering algorithm for post-processing the initial map was developed. The map was used in an experiment where the robot performed multiple path following tasks.
TABLE I
VISION GUIDED NAVIGATION RESULTS

        Run     Distance [m]   Mean Range [m]   Minimum Range [m]
Path 1  Manual  22.87          0.85             0.43
        Auto1   24.07          0.82             0.42
        Auto2   24.28          0.82             0.44
Path 2  Manual  14.30          1.20             0.58
        Auto1   15.34          0.90             0.45
        Auto2   14.90          0.99             0.48
Path 3  Manual   9.81          0.94             0.56
        Auto1   10.66          0.85             0.41
        Auto2   10.70          0.87             0.51
Path 4  Manual   8.97          1.04             0.42
        Auto1    9.27          0.97             0.37
        Auto2    9.09          1.02             0.42
Path 5  Manual  15.52          0.98             0.48
        Auto1   16.42          0.85             0.46
        Auto2   16.69          0.78             0.47


An incremental hybrid approach to indoor modeling


Marco A. Gutiérrez, Pilar Bachiller, Luis J. Manso, Pablo Bustos, Pedro Núñez
Department of Computer and Communication Technology, University of Extremadura, Spain

Abstract - Most of a mobile robot's basic functions are highly dependent on a model of its environment. Proper modeling is crucial for tasks such as local navigation, localization or route planning. This paper describes a novel solution for building models of indoor environments. We use a 3D laser on a mobile robot to scan the surroundings and obtain sensing information. While the robot explores the environment, perceived points are clustered, forming models of rooms. Room modeling is solved using a new variation of the Hough transform. The result of the modeling process is a topological graph that represents the rooms (nodes) and their connecting doors (edges). Each node in this representation contains the metric parametrization of the room model. Using this basic metric information, robots do not need to maintain in parallel a metric map of the environment. Instead, this metric map can be totally or partially built from the topological representation whenever it is necessary. To test the approach we have carried out a modeling experiment in a real environment, obtaining promising results.

Index Terms - Environment modeling, hybrid representation, active exploration, 3D laser scanning.

I. INTRODUCTION

In recent years, the problem of map building has become an important research topic in the field of robotics. A wide variety of proposals use dense metric maps to represent the robot environment. While a metric representation is necessary for solving certain robot tasks, many others require a more qualitative organization of the environment. Topological maps provide such a qualitative description of the space and constitute a good support to the metric information. Several approaches in mobile robotics propose the use of a topological representation to complement the metric information of the environment. In [16] it is proposed to create off-line topological graphs by partitioning metric maps into regions separated by narrow passages. In [14] the environment is represented by a hybrid topological-metric map composed of a set of local metric maps called islands of reliability. [17] describes the environment using a global topological map that associates places which are metrically represented by infinite lines belonging to the same places. [18] constructs a topological representation as a route graph using Voronoi diagrams. In [19] the environment is represented by a graph whose nodes are crossings (corners or intersections). [9] organizes the information of the environment in a graph of planar regions. [3] proposes an off-line method that builds a topological representation, whose nodes correspond to rooms, from a previously obtained metric map.

In this paper, we propose a novel incremental modeling method that provides a hybrid topological/metric representation of indoor environments. Our approach improves the current state of the art in several aspects. Firstly, as in [3], the topological space is represented by a graph whose

nodes encode rooms and whose edges describe connections between rooms. However, in our proposal, the topological representation is incrementally built from the local information provided by the sensor. This means that no global metric map is needed to extract a topological description of the environment. In addition, instead of maintaining in parallel a dense metric map, each topological node contains a minimal set of metric information that allows building a map of a part of the environment when it is needed. This approach drastically reduces the computation the robot must perform to maintain an internal representation of its surroundings. In addition, it can be very helpful for solving certain tasks in an efficient way, such as global navigation or self-localization. We also present a new variation of the Hough transform used for room modeling. Under the rectangularity assumption, this method provides the geometrical parametrization of a room from a set of points. It is shown how this proposal improves the results compared to other approaches.

The rest of the paper is organized as follows. In section II, an overview of the approach is presented. Section III describes the modeling method that provides a description of the environment in terms of rooms and their connections. In section IV, the process for creating the proposed hybrid representation is detailed. Experimental results in a real environment are presented in section V. Finally, section VI summarizes the main conclusions of our work.

II. OVERVIEW OF THE APPROACH

As a first approach towards environment modeling, we focus on indoor environments composed of several rooms connected through doors. Rooms are considered approximately rectangular and contain several objects on the floor. To solve the low-level perceptual process, we have equipped one of our mobile robots with a motorized 2D LRF (Laser Range Finder). Stereo cameras, static 2D LRFs and 3D LRFs constitute an alternative for this purpose. It is possible nowadays to use stereo vision or even the very popular RGB-D PrimeSense sensor [1] to retrieve depth from images and, in consequence, be able to map the robot surroundings. However, most of these vision techniques have been studied under almost ideal circumstances. Uniformly colored surfaces or light variations are some of the problems these solutions might face. In addition, compared to LRF performance, sensors like RGB-D have small fields of view and low depth precision (3 cm at a 3 m scan distance). LRFs constitute a strong alternative to these sensors, especially when facing non-ideal environments. A wide variety of sensors of this kind have lately become available: point range finders, 2D LRFs and 3D LRFs. 3D LRFs seem promising, but their high cost makes them in practice less usable than other sensors.


Motorized 2D LRFs have arisen as a good solution, combining important advantages in relation to other sensors for these applications. To cover the whole 3D spectrum, the 2D LRF is attached to a step motor that makes it scan the whole hemisphere in front of the mobile robot. The resolution is made to be directly dependent on the scanning speed and the robot is able to adjust the related parameters according to its needs. During the exploration, perceived points are stored in a 3D occupancy grid that constitutes a discrete representation of a particular zone of the environment. This occupancy grid is locally used, so, when the robot gets into a new room, the grid is reset. Each cell of this grid contains the certainty degree about the occupancy of the corresponding volume of the environment. The certainty increases if a point is perceived over time in the same position and decreases when the point vanishes from the space covered by the sensor. Thus, stable points produce higher occupancy values than unstable ones. Cells with a high certainty degree are used for detecting a room model fitting the set of perceived points. Once the model of the current room can be considered stable, it is stored in an internal representation that maintains topological and metric information of the environment. In this representation, the environment is described as an undirected graph whose vertices represent the different explored rooms (see figure 1). An edge linking two vertices expresses the existence of a door that connects two rooms. This is a very useful representation for the robot to effectively move around man-made environments. For instance, the robot could analyze the graph to obtain the minimum path connecting any two rooms. Moreover, this representation can be extended using recursive descriptions to express more complex world structures like buildings. Thus, a building could be represented by a node containing several interconnected subgraphs. Each subgraph would represent a floor of the building and contain a description of the interconnections between the different rooms and corridors in it.
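To make the occupancy-update rule above concrete, the following is a minimal sketch (not the authors' implementation) of a sparse local 3D grid whose cell certainties are reinforced by repeated observations and decayed when a cell lies inside the sensed volume without being hit; the cell size, increment, decay and threshold values are illustrative assumptions.

```python
# Minimal sketch of the local 3D occupancy grid described above.
# Cell size, reinforcement/decay rates and the stability threshold
# are illustrative assumptions, not values from the paper.
from collections import defaultdict

class LocalOccupancyGrid:
    def __init__(self, cell_size=0.05, hit_inc=0.2, miss_dec=0.1, max_certainty=1.0):
        self.cell_size = cell_size
        self.hit_inc = hit_inc
        self.miss_dec = miss_dec
        self.max_certainty = max_certainty
        self.cells = defaultdict(float)   # (ix, iy, iz) -> certainty in [0, 1]

    def _index(self, p):
        x, y, z = p
        return (int(x // self.cell_size), int(y // self.cell_size), int(z // self.cell_size))

    def integrate_scan(self, hit_points, observed_cells):
        """Reinforce cells containing perceived points; decay observed-but-empty cells."""
        hit_cells = {self._index(p) for p in hit_points}
        for c in hit_cells:
            self.cells[c] = min(self.max_certainty, self.cells[c] + self.hit_inc)
        for c in observed_cells - hit_cells:   # covered by the sensor but no return
            self.cells[c] = max(0.0, self.cells[c] - self.miss_dec)

    def stable_cells(self, threshold=0.6):
        """Cells whose certainty exceeds the threshold feed the room-model detection."""
        return [c for c, v in self.cells.items() if v > threshold]

    def reset(self):
        """Called when the robot enters a new room: the grid is purely local."""
        self.cells.clear()
```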

Using this topological graph, a minimal set of metric information is maintained. Specifically, each vertex of the graph stores the parametric description of the corresponding room and its doors. Using this basic metric information, the robot does not need to maintain in parallel a metric map of the environment. Instead, a total or partial metric representation can be recovered from the topological one whenever it is necessary.

To deal with the uncertainty derived from odometric and sensor errors, we follow a non-simultaneous approach to localization and modeling. Modeling errors are minimized by performing specific actions directed to quickly obtain measures around the robot. Once a model is fixed, errors in the following models, i.e. adjacent rooms, are canceled by applying geometric restrictions. Finally, detected loop closings are used through a global minimization procedure to further reduce modeling errors. Localization errors are canceled by continuously rebuilding a new model that is compared to the current one. Using an estimated rectangular model instead of the raw measures is an effective procedure to update the position of the robot relative to the current model, since it is less sensitive to wrong measurements. As long as the rectangularity hypothesis is coherent with the real world around the robot, these methods work robustly in real situations. More details on this are given in section IV.
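As a concrete illustration of what each graph vertex stores, here is a minimal sketch of the hybrid representation: an undirected graph whose nodes carry only the rectangle parameters (θ, xc, yc, w, h) and the door segments of a room. The class and attribute names are hypothetical, not taken from the authors' software.

```python
# Minimal sketch of the hybrid topological/metric representation described above.
# Class and attribute names are hypothetical; only the stored quantities
# (rectangle parameters and door segments) follow the text.
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class RoomNode:
    theta: float                      # room orientation
    xc: float                         # centre x
    yc: float                         # centre y
    w: float                          # width
    h: float                          # height
    # each door is stored by its two extreme points in room coordinates
    doors: List[Tuple[Tuple[float, float], Tuple[float, float]]] = field(default_factory=list)

@dataclass
class HybridMap:
    rooms: Dict[int, RoomNode] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)   # doors connecting two rooms

    def add_room(self, room_id: int, node: RoomNode) -> None:
        self.rooms[room_id] = node

    def connect(self, a: int, b: int) -> None:
        """An edge expresses the existence of a door between two rooms."""
        self.edges.add((min(a, b), max(a, b)))

    def adjacent(self, room_id: int) -> List[int]:
        return [b if a == room_id else a for (a, b) in self.edges if room_id in (a, b)]
```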

Fig. 1. Topological representation (a) of an environment (b) composed of three intercommunicated rooms.

III. ROOMS AND DOORS MODELING

Since rooms are assumed to be rectangular and their walls perpendicular to the floor, the problem of modeling a room from a set of points can be treated as a rectangle detection problem using the projections of those points onto the floor plane. Several rectangle detection techniques can be found in the literature [5, 6, 15]. Most of them are based on a search in the 2D point space (for instance, a search in the edge representation of an image) using line primitives. These methods are computationally expensive and can be very sensitive to noisy data. In order to solve the modeling problem in an efficient way, we propose a new rectangle detection technique based on a search in the parameter space using a variation of the Hough Transform [13, 2]. For line detection, several variations of the Hough Transform have been proposed [8, 11]. The extension of the Hough Transform to rectangle detection is not new. [20] proposes a Rectangular Hough Transform used to detect the center and orientation of a rectangle with known dimensions. [4] proposes a Windowed Hough Transform that consists of searching rectangle patterns in the Hough space of every window of suitable dimensions of an image. Our approach for rectangle detection uses a 3D version of the Hough Transform that facilitates the detection of segments instead of lines. This allows considering only those points that belong to the contour of a rectangle in the detection process, which is very important to obtain reliable results. For instance, consider the 2D view of the occupancy grid that is shown in figure 2(a). In this situation, the robot has perceived all the walls and objects of the room in which it is located and, partially, two walls of the adjoining room. Figures 2(b) and 2(c) show the results of the rectangle detection process using variations of the Hough transform based on lines and segments, respectively. The lined region of both figures corresponds to the detected rectangle. As can be observed, a method based on lines, as the one proposed in [4], considers points that can be situated outside the rectangle, sometimes leading to wrong results. Nevertheless, the proposed variation of the Hough transform takes into account only those points belonging to the four segments of the rectangle, always providing the best rectangle pattern.


Fig. 2. Rectangle detection on a 2D view of the occupancy grid (a), using the variation of the Hough transform based on lines (b) and the proposed one based on segments (c).

A. Room detection

In our room detection method, the Hough space is parameterized by (θ, d, p), being θ and d the parameters of the line representation (d = x cos(θ) + y sin(θ)) and |p| the length of a segment in the line. For computing p it is assumed that one of the extreme points of its associated segment is initially fixed and situated at a distance of 0 to the perpendicular line passing through the origin. Under this assumption, being (x, y) the other extreme point of the segment, its signed length p can be computed as:

p = x cos(θ + π/2) + y sin(θ + π/2)    (1)

Using this representation, any point (x, y) contributes to those points (θ, d, p) in the Hough space that verify:

d = x cos(θ) + y sin(θ)    (2)

and

p >= x cos(θ + π/2) + y sin(θ + π/2)    (3)

Equation 2 represents every line intersecting the point, as in the original Hough Transform. The additional condition expressed by equation 3 limits the point contribution to those line segments containing the point. This allows computing the total number of points included in a given segment. For instance, given a segment with extreme points Vi = (xi, yi) and Vj = (xj, yj), and being H the 3D Hough space, the number of points that belong to the segment, denoted as Hij, can be computed as:

Hij = |H(θij, dij, pi) - H(θij, dij, pj)|    (4)

where θij and dij are the parameters of the line common to both points and pi and pj are the signed lengths of the two segments with non-fixed extreme points Vi and Vj, respectively, according to equation 1.

Since a rectangle is composed of four segments, the 3D Hough space parameterized by (θ, d, p) allows computing the total number of points included in the contour of the rectangle. Thus, considering a rectangle expressed by its four vertices V1 = (x1, y1), V2 = (x2, y2), V3 = (x3, y3) and V4 = (x4, y4) (in clockwise order), the number of points of its contour, denoted as Hr, can be computed as:

Hr = H12 + H23 + H34 + H41    (5)

Considering the restrictions on the segments of the rectangle and using equation 4, each Hij of expression 5 can be rewritten as follows:

H12 = |H(θ, d12, d41) - H(θ, d12, d23)|    (6)
H23 = |H(θ + π/2, d23, d12) - H(θ + π/2, d23, d34)|    (7)
H34 = |H(θ, d34, d23) - H(θ, d34, d41)|    (8)
H41 = |H(θ + π/2, d41, d34) - H(θ + π/2, d41, d12)|    (9)

being θ the orientation of the rectangle and dij the normal distance from the origin to the straight line defined by the points Vi and Vj. Since Hr expresses the number of points in a rectangle r defined by (θ, d12, d23, d34, d41), the problem of obtaining the best rectangle given a set of points can be solved by finding the combination of (θ, d12, d23, d34, d41) that maximizes Hr. This parametrization of the rectangle can be transformed into a more practical representation defined by the five-tuple (θ, xc, yc, w, h), being (xc, yc) the central point of the rectangle and w and h its dimensions. This transformation can be achieved using the following expressions:

xc = ((d12 + d34)/2) cos(θ) - ((d23 + d41)/2) sin(θ)    (10)
yc = ((d12 + d34)/2) sin(θ) + ((d23 + d41)/2) cos(θ)    (11)
w = d23 - d41    (12)
h = d34 - d12    (13)
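To make the segment-based scoring concrete, the following is a minimal sketch (not the authors' code) of equations 1-13 on a discretized accumulator: `vote` applies the contribution rule of equations 2-3, `segment_count` implements equation 4, `rectangle_score` combines equations 5-9, and `rectangle_params` is the transformation of equations 10-13 as reconstructed above. Grid resolutions, ranges and helper names are assumptions.

```python
# Minimal sketch of the segment-based 3D Hough accumulation and rectangle
# scoring described by equations 1-13. Discretization steps and helper
# names are illustrative assumptions, not taken from the paper.
import numpy as np

N_THETA, N_D = 180, 400                      # assumed resolutions
THETAS = np.linspace(-np.pi / 2, np.pi / 2, N_THETA, endpoint=False)
D_MIN, D_MAX = -10.0, 10.0                   # assumed metric range for d and p
D_BINS = np.linspace(D_MIN, D_MAX, N_D)

def d_bin(value):
    return int(np.clip(np.searchsorted(D_BINS, value), 0, N_D - 1))

def theta_bin(theta):
    return int(np.clip(np.searchsorted(THETAS, theta), 0, N_THETA - 1))

def vote(H, x, y):
    """Equations 2-3: (x, y) contributes to every (theta, d, p) with
    d = x cos(theta) + y sin(theta) and p >= x cos(theta + pi/2) + y sin(theta + pi/2)."""
    for ti, theta in enumerate(THETAS):
        d = x * np.cos(theta) + y * np.sin(theta)
        p = x * np.cos(theta + np.pi / 2) + y * np.sin(theta + np.pi / 2)
        H[ti, d_bin(d), d_bin(p):] += 1

def segment_count(H, theta, d, p_i, p_j):
    """Equation 4: points on the segment of line (theta, d) bounded by p_i and p_j."""
    ti = theta_bin(theta)
    return abs(int(H[ti, d_bin(d), d_bin(p_i)]) - int(H[ti, d_bin(d), d_bin(p_j)]))

def rectangle_score(H, theta, d12, d23, d34, d41):
    """Equations 5-9. Candidate orientations are restricted to [-pi/2, 0) so that
    theta + pi/2 stays inside the accumulator's theta range."""
    tp = theta + np.pi / 2
    return (segment_count(H, theta, d12, d41, d23) +
            segment_count(H, tp,    d23, d12, d34) +
            segment_count(H, theta, d34, d23, d41) +
            segment_count(H, tp,    d41, d34, d12))

def rectangle_params(theta, d12, d23, d34, d41):
    """Equations 10-13 (as reconstructed above): centre, width and height."""
    a, b = (d12 + d34) / 2.0, (d23 + d41) / 2.0
    return (theta,
            a * np.cos(theta) - b * np.sin(theta),   # xc
            a * np.sin(theta) + b * np.cos(theta),   # yc
            d23 - d41,                               # w
            d34 - d12)                               # h

# Usage: H = np.zeros((N_THETA, N_D, N_D), dtype=np.int32); call vote(H, x, y)
# for every stable occupancy cell, then search (theta, d12, d23, d34, d41)
# for the maximum rectangle_score.
```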


In order to compute Hr, the parameter space H is discretized assuming the range [-π/2, π/2] for θ and [dmin, dmax] for d and p, being dmin and dmax the minimum and maximum distance, respectively, between a line and the origin. Taking some sampling step for each parameter, and being G the 3D occupancy grid and τ the minimum occupancy value to consider a non-empty region of the environment, the proposed method for room modeling can be summarized in the following steps:
1) Initialize all the cells of the discrete Hough space H to 0.
2) For each cell G(xd, yd, zd) such that G(xd, yd, zd).occupancy > τ: compute the real coordinates (x, y) associated to the cell indexes (xd, yd). For θd = θdMin ... θdMax:
   a) Compute the real value θ associated to θd.
   b) Compute d = x cos(θ) + y sin(θ).
   c) Compute the discrete value dd associated to d.
   d) Compute p = x cos(θ + π/2) + y sin(θ + π/2).
   e) Compute the discrete value pd associated to p.
   f) For p'd = pd ... dMax: increment H(θd, dd, p'd) by 1.
3) Compute argmax Hr(θ, d12, d23, d34, d41).
4) Obtain the rectangle r = (θ, xc, yc, w, h) using equations 10, 11, 12 and 13.

As can be observed, the height of walls is only taken into account through histogram contributions. This is because walls correspond to 2D segments with higher histogram values than any other plane perpendicular to the floor. Thus, it is not necessary to explicitly consider the height of points in the room detection method. Even though this method is computationally expensive, in practice its complexity can be significantly reduced in two ways. Firstly, instead of computing H from the whole occupancy grid, it can be updated using only those cells whose occupancy state has changed. In addition, it is not necessary to apply step 3 over the entire parameter space, since only rectangles of certain dimensions are considered rooms. Thus, a specific range of w and h is assumed, which limits the search to those values of d12, d23, d34 and d41 fulfilling that range.

B. Door detection

The proposed 3D Hough space is also used for door detection. Doors are free passage zones that connect two different rooms, so they can be considered as empty segments of the corresponding room rectangle (i.e. segments without points). Taking this into account, once the room model is obtained, doors can be detected by analyzing each wall segment in the 3D Hough space. Therefore, for each segment of the rectangle, defined by Vi and Vj, two points Dk = (xk, yk) and Dl = (xl, yl) situated on the inside of that segment constitute a door segment if they verify:

Hkl = |H(θij, dij, pk) - H(θij, dij, pl)| = 0    (14)

being θij and dij the parameters of the straight line defined by Vi and Vj, and pk and pl the signed lengths of the segments for Dk and Dl:

pk = xk cos(θij + π/2) + yk sin(θij + π/2)    (15)
pl = xl cos(θij + π/2) + yl sin(θij + π/2)    (16)

Assuming pi <= pk < pl <= pj and a minimum length l for each door segment, the door detection process can be carried out by verifying equation 14 for every pair of points between Vi and Vj such that pl - pk >= l. Starting from the discrete representation of the Hough space, this process can be summarized in the following steps:
1) Compute the discrete value θd associated to θij.
2) Compute the discrete value dd associated to dij.
3) Compute the discrete value pdi associated to pi.
4) Compute the discrete value pdj associated to pj.
5) Compute the discrete value ld associated to l (minimum length of doors).
6) pdk <- pdi
7) While pdj - pdk >= ld:
   a) pdl <- pdk + 1
   b) While pdl < pdj and |H(θd, dd, pdk) - H(θd, dd, pdl)| = 0: pdl <- pdl + 1
   c) If pdl - pdk > ld:
      i) Compute the real value pk associated to pdk.
      ii) Compute the real value pl associated to (pdl - 1).
      iii) Compute the door limits Dk and Dl from pk and pl.
      iv) Insert the new door segment with extreme points Dk and Dl into the list of doors.
   d) pdk <- pdl
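The door search above reduces to scanning one row of the accumulator for empty runs. Below is a minimal, self-contained sketch of steps 1-7 operating on a single row H(θd, dd, ·); array layout, names and the example data are assumptions, not the authors' code.

```python
# Minimal sketch of the door-detection scan (steps 1-7 above) along one wall.
# h_row[p] stands for H(theta_d, d_d, p): equation 14 holds for a sub-segment
# (pd_k, pd_l) iff h_row[pd_k] == h_row[pd_l]. Names are assumptions.
import numpy as np

def find_doors(h_row, pd_i, pd_j, ld):
    """Return (pd_k, pd_l) index pairs of empty sub-segments (candidate doors)
    of minimum discrete length ld between the wall extremes pd_i and pd_j."""
    doors = []
    pd_k = pd_i
    while pd_j - pd_k >= ld:
        pd_l = pd_k + 1
        # grow the run while the sub-segment (pd_k, pd_l) contains no points
        while pd_l < pd_j and h_row[pd_k] == h_row[pd_l]:
            pd_l += 1
        if pd_l - pd_k > ld:
            doors.append((pd_k, pd_l - 1))
        pd_k = pd_l
    return doors

# Example with an assumed wall whose cumulative counts stay flat (no points)
# over a stretch of bins, producing one door candidate:
h_row = np.concatenate([np.arange(10), np.full(16, 9), np.arange(9, 30)])
print(find_doors(h_row, pd_i=0, pd_j=len(h_row) - 1, ld=5))   # -> [(9, 26)]
```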


IV. INCREMENTAL MODELING OF THE ENVIRONMENT

Building topological maps requires endowing the robot not only with modeling skills, but also with the ability to actively explore the environment. Exploration plays an important role in our proposal because the robot must make sure that each room model corresponds to a real room before leaving it behind. For this reason the robot must scan the whole space surrounding it to take correct decisions about the current model. In our system, the exploration task is driven by the 3D local grid. When a room model is detected from the set of points stored in the grid, the robot must verify it by scanning the unexplored zones inside the estimated room model. Depending on their occupancy value, the cells of the grid are labeled as occupied, empty or unexplored. Thus, by analyzing the grid, the robot can direct its sensor towards new places and retrieve the necessary information to get a reliable model. At this point, the benefits of using a long range sensor become clear. Few movements of the robot are needed to cover the whole space around it and, in consequence, the modeling process is less sensitive to odometric errors.

Once the local space around the robot has been completely explored, the current room model is inserted as a node in the graph representing the topological space and the robot gets out of the room to model new places. Each node in the graph stores the geometric parametrization of the room and its doors. The center and orientation of the room are used to form a reference frame (Fr) that expresses the location of the room in relation to a global reference frame (Fw). Thus, being r = (θ, xc, yc, w, h) the rectangle that models a given room, the transformation matrix (Tr) that relates Fr with Fw is defined as:

Tr = [cos(θ), -sin(θ), 0, xc; sin(θ), cos(θ), 0, yc; 0, 0, 1, 0; 0, 0, 0, 1]    (17)

Using this transformation matrix, any point of the model is expressed in coordinates relative to the room. This way, if the local or global reference frames are modified, points of room models remain unaffected. In addition, the robot can be aware of errors in its odometric estimation by detecting changes in the room reference frame. These changes come from the estimation of new room models, with the same dimensions as the current one, using the set of recently perceived points. Thus, being r(i) = (θ(i), xc(i), yc(i), w, h) the room model at instant i and r(i+1) = (θ(i+1), xc(i+1), yc(i+1), w, h) a new estimation of the room model at i+1, changes in the robot pose can be computed using the rotational and translational model deviations of equations 18 and 19:

Δθ = θ(i+1) - θ(i)    (18)

t = (xc(i+1), yc(i+1))ᵀ - [cos(Δθ), -sin(Δθ); sin(Δθ), cos(Δθ)] (xc(i), yc(i))ᵀ    (19)

This is useful, once the model is created, for dealing with the problem of robot pose estimation. However, odometric errors are present during the whole modeling process, affecting the resulting representation in two ways. Firstly, when the robot models a new room, the location of the new room may not match its real location. Secondly, odometric errors lead to wrong measurements that cause imperfect estimations of the parameters of rooms and doors. Though the essence of these two problems is slightly different, both can be detected and corrected using the notion of adjacent rooms. Therefore, if two adjacent rooms, r1 and r2, are communicated by a door, any point of the door is common to both rooms. Assume a door point dr1 viewed from the room r1 and the corresponding point dr2 of the room r2. The metric representation of both rooms would ideally be subject to the following restriction:

||Tr2 dr2 - Tr1 dr1||² = 0    (20)

When the robot creates a new room model that is adjacent to a previous one, expression 20 determines the need for correcting the new model reference frame. In addition, the deviation between the positions of the common door allows computing how this correction should be applied in order to fulfill the restriction imposed by the expression above. After an exploration of arbitrary length, if the robot returns to a previously visited room (i.e. a loop closing is detected), the non-correspondence between the input and output doors can also be determined using expression 20. In such cases, new corrections must be done in order to cancel the error. However, this error is caused by wrong estimations of room and door models and, in consequence, a reference frame correction will surely not solve the problem. Our solution to these situations is to minimize a global error function defined over the whole metric representation. In our representation of the environment, the global error is defined in terms of deviations between the positions of the doors connecting adjacent rooms. Thus, the error function to minimize can be expressed as:

ε = Σ over connected(dri(n), drj(m)) of ||Tri dri(n) - Trj drj(m)||²    (21)

being dri(n) and drj(m) the middle points of a common door expressed in the reference frames of rooms ri and rj, respectively, and Tri and Trj the transformation matrices of such rooms. The employed minimization method is based on Stochastic Gradient Descent [12]. The basic idea [10] is to minimize the global error function by introducing small variations in the parameters of room and door models. These variations are constrained by the uncertainty of the measurement, so high-confidence parameters remain almost unchanged during the error minimization process. As a result, a reliable representation of the environment that maintains the restrictions imposed by the real world is obtained.
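As a sketch of how equations 17-21 fit together, the snippet below builds Tr, computes the pose deviation of equations 18-19 between two successive estimates of the same room, and evaluates the global door error of equation 21 over a list of door correspondences. It is an illustrative reconstruction, not the authors' implementation; the stochastic-gradient update itself is omitted.

```python
# Illustrative sketch of equations 17-21: room reference frames, pose deviation
# between successive room estimates, and the global error over common doors.
# Not the authors' implementation; names and data layout are assumptions.
import numpy as np

def room_transform(theta, xc, yc):
    """Equation 17: homogeneous transform relating the room frame Fr to Fw."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0, xc],
                     [s,  c, 0.0, yc],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

def pose_deviation(r_i, r_i1):
    """Equations 18-19: rotational and translational deviation between the room
    model r(i) = (theta, xc, yc, w, h) and its re-estimation r(i+1)."""
    d_theta = r_i1[0] - r_i[0]
    c, s = np.cos(d_theta), np.sin(d_theta)
    R = np.array([[c, -s], [s, c]])
    t = np.array(r_i1[1:3]) - R @ np.array(r_i[1:3])
    return d_theta, t

def global_door_error(door_pairs):
    """Equation 21: sum of squared deviations between the middle points of common
    doors, each expressed in its own room frame and mapped to the global frame.
    door_pairs is a list of ((T_ri, d_ri), (T_rj, d_rj)) with d_* homogeneous 4-vectors."""
    err = 0.0
    for (T_ri, d_ri), (T_rj, d_rj) in door_pairs:
        diff = T_ri @ d_ri - T_rj @ d_rj
        err += float(diff[:3] @ diff[:3])
    return err

# Example: the same door seen from two adjacent, slightly inconsistent rooms.
T1 = room_transform(0.00, 0.0, 0.0)
T2 = room_transform(0.02, 4.1, 0.0)
d1 = np.array([2.0, 1.6, 0.0, 1.0])     # door middle point in room 1 coordinates
d2 = np.array([-2.0, 1.55, 0.0, 1.0])   # the same door in room 2 coordinates
print(global_door_error([((T1, d1), (T2, d2))]))
```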

V. EXPERIMENTAL RESULTS

We used one of our custom differential robots to build a model of the environment. It has been equipped with a Hokuyo UTM-30LX 2D LRF moved by a step motor. It scans up to 30 meters over a range of 270° at a rate of 25 ms per line of scan (700 points). The LRF is positioned at the front of the robot and performs a roll movement to cover an amplitude of 360° (see figure 3(b)). The software has been developed using the component-oriented robotics framework RoboComp [7]. The experimental results presented in this paper correspond to the modeling of an environment formed by two contiguous rooms with their corresponding doors. Rooms have been made to be similar to any regular room and have been filled with random objects to simulate a real human environment. Figure 3 shows two views of the environment used in this experiment. In these views, the two rooms, labeled as 1 and 2, and the doors have been marked in red and green, respectively.

Fig. 3. Two views of the real scene of the experiment: (a) frontal view; (b) side view taken from the upper left corner of (a).

We placed our mobile robot in the center of room 1 in order to start our testing. In the first part of the experiment, the robot performs a full 3D scan of the first room.


Results can be seen in figure 4. From its initial location, the robot makes a first scan and obtains an initial model of the room (figures 4(a) and 4(b)). Figure 4(a) shows the regions and the model, while figure 4(b) shows only the model. Notice that regions containing points are considered walls (in red) and those without points are considered doors (in green). Using this initial model, the robot now tries to scan the unknown areas. After moving 90° counter-clockwise, new points are obtained from a second scan. Figures 4(c) and 4(d) show the regions and model results, respectively. As can be observed, the accuracy of the model increases with this second scan. The detected rectangle fits the room size but there are still some unknown parts considered as doors. The robot tries to verify whether those regions are real doors or not by moving 180° and performing another scan. After this new scan, the room is perfectly matched by the model, as figures 4(e) and 4(f) show. The room scanning is finished when all regions in the model have been scanned, and further scanning (even using higher resolutions) results in no modifications. Then, the obtained room model is fixed and stored in the topological graph. Notice that, although the model size and doors are properly obtained, the room has been slightly mispositioned when compared to reality. This is due to the accumulation of odometric errors during the robot movements. Nevertheless, the relative position of the robot inside the room remains correct and therefore the detection of a new room is not affected by this misplacement.

After the model of the first room has been fixed, the robot uses one of the obtained doors to get to the next room. In the experiment, the robot goes through the door on the right to room 2 and performs a full 3D scan. Figure 5 shows the results of the second room scans. Specifically, figures 5(a) and 5(b) correspond to the first one. As can be observed, the model does not yet match the real room because of the existence of big spaces with missing information. Therefore the robot turns and gets another scan of one of these places. Figures 5(c) and 5(d) show the model and regions after the second scan in this room. Now the model fits the room but, again, there are still unscanned parts. The robot turns 90° again and performs the scan whose regions and resulting model are shown in figures 5(e) and 5(f). After doing all these scans in the second room the final model is obtained. Further scans do not change the model, so it is fixed.

Fig. 4. Modeling process of room 1.

Fig. 5. Modeling process of room 2.

Fig. 6. Final representation obtained after the modeling of the two rooms.


Once the second model has been fixed, the deviation between the two room models (figure 5(f)), caused by odometric errors, is corrected according to the common door restriction (equation 20). Figure 6 shows the result of this correction. Each square of the figure represents an area of 0.27 × 0.27 m². This size corresponds to the sampling step of the Hough space for the parameters d and p. This means that the accuracy of each room model is limited by this sampling step. The real sizes of the two rooms are 3.19 × 3.78 m² (room 1) and 4.20 × 3.78 m² (room 2). The sizes obtained by the modeling process are 2.97 × 3.78 m² (room 1) and 4.05 × 3.78 m² (room 2). As can be observed, the difference between the representation and the real world is within the range of the permissible error. More accurate models can be obtained by reducing the sampling step of the discrete Hough space.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented an incremental modeling method for building hybrid representations of indoor environments. The proposed representation consists of a topological graph that describes the rooms of the environment and their connections. Each node of the graph represents a room and contains the geometric parametrization of the corresponding room and its doors. This geometric parametrization is the only metric information included in the representation. Using this information, the robot can recover a metric map of a particular zone of the environment when it is needed. Thus, no dense metric map is maintained in parallel to the topological graph. Rooms and doors are modeled using a variation of the Hough Transform that detects segments and rectangle patterns. We have proposed methods for dealing with odometric errors in the creation of new models as well as in loop closings. In addition, a method for robot pose estimation using room models has been presented. Real experiments have been carried out using a mobile robot equipped with a motorized 2D laser. Results show the accuracy of the modeling process in real environments.

We are working on several improvements of our proposal. In particular, work to relax the rectangle assumption is currently in progress. We are also extending the modeling ability of our robots to represent other structures of man-made environments like corridors. Bigger and more complex possible surroundings are also being taken into account.

ACKNOWLEDGMENT

This work has been partly supported by grants PRI09A037 and PDT09A044, from the Ministry of Economy, Trade and Innovation of the Extremaduran Government, and by grants TSI-020301-2009-27 and IPT-430000-2010-2, from the Spanish Government and the FEDER funds.

REFERENCES
[1] The PrimeSensor technology. http://www.primesense.com/.
[2] R.O. Duda and P.E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Commun. ACM, 15:11-15, January 1972.
[3] Kwangro Joo, Tae-Kyeong Lee, Sanghoon Baek, and Se-Young Oh. Generating topological map from occupancy grid-map using virtual door detection. In IEEE Congress on Evolutionary Computation, pages 1-6, 2010.
[4] C.R. Jung and R. Schramm. Rectangle detection based on a windowed Hough transform. In Proceedings of the XVII Brazilian Symposium on Computer Graphics and Image Processing, pages 113-120, 2004.
[5] D. Lagunovsky and S. Ablameyko. Straight-line-based primitive extraction in grey-scale object recognition. Pattern Recognition Letters, 20(10):1005-1014, 1999.
[6] C. Lin and R. Nevatia. Building detection and description from a single intensity image. Computer Vision and Image Understanding, 72(2):101-121, 1998.
[7] L. Manso, P. Bachiller, P. Bustos, P. Núñez, R. Cintas, and L. Calderita. RoboComp: a tool-based robotics framework. In Proceedings of the Second International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR'10), pages 251-262, 2010.
[8] J. Matas, C. Galambos, and J. Kittler. Robust detection of lines using the progressive probabilistic Hough transform. Computer Vision and Image Understanding, 78(1):119-137, 2000.
[9] E. Montijano and C. Sagues. Topological maps based on graphs of planar regions. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1661-1666, October 2009.
[10] E. Olson. Robust and Efficient Robotic Mapping. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, June 2008.
[11] P.L. Palmer, J. Kittler, and M. Petrou. Using focus of attention with the Hough transform for accurate line parameter estimation. Pattern Recognition, 27(9):1127-1134, 1994.
[12] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
[13] A. Rosenfeld. Picture processing by computer. ACM Comput. Surv., 1:147-176, September 1969.
[14] S. Simhon and G. Dudek. A global topological map formed by local metric maps. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1708-1714, 1998.
[15] W.-B. Tao, J.-W. Tian, and J. Liu. A new approach to extract rectangular building from aerial urban images. In 6th International Conference on Signal Processing, volume 1, pages 143-146, August 2002.
[16] S. Thrun. Learning metric-topological maps for indoor mobile robot navigation. Artificial Intelligence, 99(1):21-71, 1998.
[17] N. Tomatis, I. Nourbakhsh, and R. Siegwart. Hybrid simultaneous localization and map building: a natural integration of topological and metric. Robotics and Autonomous Systems, 44(1):3-14, 2003.
[18] D. Van Zwynsvoorde, T. Simeon, and R. Alami. Incremental topological modeling using local Voronoi-like graphs. In Proc. of IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS 2000), volume 2, pages 897-902, 2000.
[19] F. Yan, Y. Zhuang, and W. Wang. Large-scale topological environmental model based particle filters for mobile robot indoor localization. In IEEE International Conference on Robotics and Biomimetics, pages 858-863, 2006.
[20] Y. Zhu, B. Carragher, F. Mouche, and C.S. Potter. Automatic particle detection through efficient Hough transforms. IEEE Trans. Med. Imaging, 22(9), 2003.


Efficient Topological Mapping with Image Sequence Partitioning


Hemanth Korrapati, Youcef Mezouar, Philippe Martinet
LASMEA, CNRS Joint Unit 6602, 24 Avenue des Landais, 63177 Aubière, France
Email: firstname.lastname@lasmea.univ-bpclermont.fr

Abstract - Topological maps are vital for fast and accurate localization in large environments. Sparse topological maps can be constructed by partitioning a sequence of images acquired by a robot according to their appearance. All images in a partition have similar appearance and are represented by a node in a topological map. In this paper, we present a topological mapping framework which makes use of image sequence partitioning (ISP) to produce sparse maps. The framework facilitates coarse loop closure at node level and a finer loop closure at image level. Hierarchical inverted files (HIF) are proposed which are naturally adaptable to our sparse topological mapping framework and enable efficient loop closure. The computational gain attained in loop closure with HIF over sparse topological maps is demonstrated. Experiments are performed on outdoor environments using an omnidirectional camera.

Index Terms - Topological Mapping, Omnidirectional Vision, Loop Closure

Fig. 1: A global view of our topological mapping framework.

I. INTRODUCTION

Mapping is one of the fundamental problems of autonomous mobile robotics. The mapping problem can be broadly categorized as topological and metrical [16]. Metrical mapping involves accurate position estimates of robots and landmarks of the environment. Topological mapping, on the other hand, represents an environment as a graph in which nodes correspond to places and the edges between them indicate some sort of connectivity. Recently, a third category called Topo-Metric Mapping [17], [9] is gaining popularity. Topo-metric mapping is a hybrid approach which uses both metrical and topological information in map building.

Building an accurate map, either metrical or topological, depends on loop closure accuracy. Such maps are difficult to build using metrical information, which is prone to gross errors in the position estimation of robot and landmarks. Topological maps facilitate accurate loop closure as they depend on appearance information rather than on exact metrical information of the environment. Many powerful loop closing techniques for topological maps have been introduced recently [2, 3, 5, 7]. Most of them produce dense topological maps, in which every acquired image stands as a node in the topological graph. Sparser topological maps can be built by representing sets of contiguous images with similar appearance features as places. Each place is represented by a node in the topological graph. We refer to this kind of partitioning of image sequences into places as Image Sequence Partitioning (ISP). In a sparse topological map, since each node represents multiple images, fewer nodes are sufficient for the map representation. Maps with fewer nodes reduce the computational complexity involved in loop closure and map merging. An example map merging problem is to localize a topological map of a tiny environment in a larger map (e.g. Google Maps).

We use a topological mapping framework which facilitates coarse loop closure at the node level and a finer loop closure at the image level. The topological map is represented as a graph T = (N, E), where N and E are the sets of nodes and edges respectively. The map is updated with each newly acquired image. Every new image (query image) is checked for similarity to a previously visited node (place) or the current place node and, if similar, the corresponding node is augmented with the image. If the query image is not similar to any of the existing nodes, then a new place node is created and augmented with the image. This process of map update is nothing but Image Sequence Partitioning (ISP). Each node contains a set of representative features representing all the member images of the node (place). The representative features are used in evaluating node-image similarity during ISP.

Another contribution of this paper is the proposal of Hierarchical Inverted Files (HIF) for efficient loop closure at both node and image levels. Similar to traditional inverted files used for object recognition [11] and loop closure problems [2], HIFs are also associated to the visual words in the vocabulary. HIFs combine the power of regular inverted files with our sparse topological map structure and help in fast loop closure. Considering the fact that in a sparse topological map images are again grouped into nodes, HIFs organize


the previous occurrence information of visual words in a hierarchical fashion, which enables fast loop closure. We use an inverse similarity score evaluation methodology in order to take advantage of HIFs for loop closure. Experiments were performed on omnidirectional image data acquired in outdoor urban environments. Omnidirectional images offer a 360 degree field of view which helps in building topological maps invariant of the robot's heading. As a result, loop closure can be performed even if the robot is not headed in the same direction as in the previous visit to the same location. Map building in outdoor environments is challenging due to illumination variation and possible occlusions [15]. Sparsity and accuracy of topological maps constructed using ISP are evaluated. The computational savings achieved in HIF-based loop closure are analysed.

The rest of the paper is organized as follows: Section II details the related work done in this area. Section III describes the steps involved in ISP in detail and provides algorithmic illustrations. Section IV introduces HIFs and discusses how node and image level loop closures are performed using HIFs. Section V evaluates the sparsity of maps produced by ISP, and the accuracy and computational cost of loop closure on the generated topological maps.

II. RELATED WORK

Scene Change Detection and Key Frame Selection for video segmentation and abstraction [19], [13] have similar goals to ISP. They try to represent a video with fewer images, called key frames, whenever there is a sufficient change in the scene, and most of them focus on the video compression domain. The major difference between these video abstraction problems and topological mapping is that topological mapping demands localization of a query image which is obtained at a previously visited place, but with variation in illumination and viewpoint, and a possible occlusion. Hence, video segmentation techniques using pixel-wise intensity measures and global image features like histograms, or motion based segmentation, cannot be applied to our problem.

Loop closure in topological maps has gained popularity among mobile robotics researchers in recent times. Many loop closure algorithms for topological mapping have been proposed and tested in both indoor [7], [2], [3], [21], [8] and outdoor environments [5], [6], [1], [12]. In [21], [22] topological maps are built for indoor environments. They segment the topological graph of the environment using the normalized graph-cuts algorithm, resulting in subgraphs corresponding to convex areas in the environment. In [8] SIFT features were used to perform matching over a sequence of images. They detected transitions between individual indoor locations depending on the number of SIFT features which can be successfully matched between successive frames. In [14] a fingerprint of an acquired image is generated using the omnidirectional image and laser readings, and these fingerprints are compared to those of the previous images. If the similarity is above a threshold the image is added to the existing node and if not a new node is formed. All of these works were focused on indoor environments. Indoor environments contain convex

spaces (rooms) which are relatively simpler to partition as compared to outdoor environments. A topological mapping framework using incremental spectral clustering has been presented in [20]. Nodes containing similar images are constructed using incremental spectral clustering over the affinity matrix of the images, thereby producing a topological graph. An optical flow based ISP technique was presented in [12] for topological mapping in outdoor environments using a quad rotor robot. Optical flow is used to discover change in environmental appearance. In [1], gist features [18] were used to cluster images with similar appearance for topological map construction.

III. IMAGE SEQUENCE PARTITIONING

Figure 1 depicts a global view of our framework, in which we can see a modular view of ISP enclosed by a red dashed line. As can be seen from Figure 1, ISP consists of three main modules: node level loop closure, evaluation of similarity to the current place, and new node formation. Given a query image, initially SURF [4] features are extracted from the image. Using the SURF features, we evaluate the node-image similarity of the query image with all the nodes in the graph except the current place node and pick out the top k similar nodes. The top k similar nodes are assigned to the set of winning nodes Nw. This process is called node level loop closure as it finds the previously visited places (nodes) most similar to the query image. Only the representative feature sets of the nodes are used to compute the node-image similarities during node level loop closure. In our framework, the representative features of a node are the SURF features of the first image augmented to the node. An empty Nw indicates loop closure failure; that is, the query image is not similar to any of the previously visited places. In that case, the query image's similarity to the current node is evaluated. If the similarity of the query image with the current place node is above a certain threshold, the current node is augmented with the query image. If the query image is not similar to the current node either, a new node is created with the query image. But if Nw is not empty, indicating a loop closure, then the set of winning nodes can be considered for a thorough image level loop closure.

Algorithm 1 shows the steps involved in ISP. The node level loop closure function in line 4 is discussed in detail in section IV. The function current_node_image_similarity evaluates the similarity of the current node with that of the query image. This is done by matching the SURF features of the query image to the features of the current place node. Feature matching is performed as proposed in [10]. Two features are considered to be matched if the ratio of the best match and the second best match is greater than 0.6.
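As an illustration of this matching step, the snippet below sketches a nearest-neighbour ratio test on SURF descriptors in the spirit of [10]. It is not the authors' code; it applies the standard formulation in which a match is accepted when the best distance is a small fraction of the second-best, reusing the 0.6 value above as that threshold, which is an assumption about the intended direction of the ratio.

```python
# Minimal sketch of descriptor matching with a nearest-neighbour ratio test.
# The 0.6 threshold and the ratio direction (best / second-best < 0.6) are
# assumptions based on the standard formulation of [10].
import numpy as np

def ratio_test_matches(query_desc, node_desc, ratio=0.6):
    """query_desc: (Nq, D) array, node_desc: (Nn, D) array of SURF descriptors.
    Returns a list of (query_index, node_index) accepted matches."""
    matches = []
    for qi, q in enumerate(query_desc):
        dists = np.linalg.norm(node_desc - q, axis=1)
        order = np.argsort(dists)
        best, second = dists[order[0]], dists[order[1]]
        if second > 0 and best / second < ratio:
            matches.append((qi, int(order[0])))
    return matches

def node_image_similarity(query_desc, node_desc):
    """A simple similarity score: fraction of query features that find a match."""
    if len(query_desc) == 0 or len(node_desc) < 2:
        return 0.0
    return len(ratio_test_matches(query_desc, node_desc)) / float(len(query_desc))

# Usage with random descriptors standing in for 64- or 128-dimensional SURF vectors:
rng = np.random.default_rng(0)
q, n = rng.normal(size=(50, 128)), rng.normal(size=(300, 128))
print(node_image_similarity(q, n))
```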


Algorithm 1 Image Sequence Partitioning Algorithm
 1: procedure PROCESS_QUERY_IMAGE(T, Iq, nc)   ▷ T, Iq, nc, T.N are the topological graph, query image, current node and node-set of the topological graph respectively.
 2:   N = T.N - {nc}                           ▷ Reference node set excluding the current node.
 3:   Nw = {}                                  ▷ Set of winning nodes.
 4:   Nw = Node_Level_Loop_Closure(N, Iq, Ths)
 5:   if is_empty(Nw) then
 6:     if current_node_image_similarity(nc, Iq) > Tht then
 7:       nc.add_image(Iq)
 8:     else
 9:       n = new_node()
10:       n.add_image(Iq)
11:       update_map(T, n)
12:     end if
13:   else
14:     Nw = get_top_k_similar_nodes(Nw)
15:     Iw = get_images_of_nodes(Nw)
16:     n* = Image_Level_Loop_Closure(Iw, Nw, Iq)
17:     add_image_to_node(Iq, n*)
18:     update_map(T, n*)
19:   end if
20: end procedure

Fig. 2: (a) represents a traditional inverted file. (b) represents a hierarchical inverted file. VW1, VW2, ... represent visual words. N1, N2, ... represent node ids in the first level of the HIF and I1, I2, I3, ... represent the image ids in the inverted file.

IV. LOOP CLOSURE & HIERARCHICAL INVERTED FILES

Node and image level loop closures are both performed using the visual words corresponding to the SURF features of the image. Given a query image, to find the most similar reference image in the database, [11, 7] use a Tf-Idf based similarity score between all the reference images in the database and the query image, and select the reference images corresponding to the top n similarity scores. We use an inverse methodology to compute image similarities for loop closure. The steps involved are enumerated as follows:
1) Let the set of reference images be I = {I1, I2, ..., IM}. We consider a histogram H with the number of bins corresponding to the number of reference images, M.
2) Extract the set of visual words W = {w1, w2, ..., wp} from the query image Iq.
3) For each visual word wi, using the inverted file IFi of the word, we extract the reference image indexes I^wi = {I1^wi, I2^wi, ...} in which the word has been previously seen. The histogram bins corresponding to these extracted reference images are incremented by a factor of the Tf-Idf of the corresponding word:
H[Ij^wi] = H[Ij^wi] + TfIdf(Ij^wi, wi)    (1)

The resulting histogram can be interpreted as containing the degrees of similarity of the query image with respect to the reference images. As we can see, the loop closure computation time only depends on the number of words in the query image and the average inverted file length at that instant. As a result, the loop closure time does not increase as steeply as is the case with the forward method. Closely related work can be found in [6, 2, 3]. However, this method is suitable only for loop closure over dense topological maps, not for sparse topological maps in which each node represents multiple images. With a change in the inverted file structure, we can adapt this similarity evaluation method to sparse topological maps. A regular inverted file corresponding to a visual word simply contains a list of all previous images which contained the word. We associate Hierarchical Inverted Files (HIF) with each visual word. As the name suggests, an HIF contains two levels. The first level consists of the ids of the nodes in which the visual word occurred previously. The second level consists of small child inverted files attached to each of the

node ids. These image ids indicate the images belonging to the parent node in which the visual word has occurred. Each child inverted file, attached to a node of the topological graph, contains the list of all previous images belonging to that parent node in which the word has occurred. The difference between traditional inverted files and HIFs is illustrated in figure 2. To perform a node level loop closure using HIF, we do not have to go through the entire HIF; it is sufficient to go through the first level (node ids) of the HIFs. For an image level loop closure using HIF, we only have to traverse those child inverted files corresponding to the winning nodes, which form only a fraction of the total HIF. Thus, HIFs offer a computational gain in loop closure when compared to regular inverted files, which is demonstrated in section V. Algorithms 2 and 3 give a clearer picture of the node level and image level loop closures. There can be multiple winning images given by the image level loop closure and hence multiple corresponding nodes, but for the sake of simplicity we do not represent that in the algorithm. In such cases we use RANSAC based geometric verification to find the right match.
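To make the two-level structure concrete, here is a minimal sketch (with hypothetical names, not the authors' implementation) of an HIF index: each visual word maps to node ids, and each node id to the images of that node in which the word was seen.

```python
# Minimal sketch of a hierarchical inverted file (HIF): for every visual word,
# a first level of node ids and, under each node id, a child inverted file with
# the image ids of that node in which the word occurred. Names are hypothetical.
from collections import defaultdict

class HIFIndex:
    def __init__(self):
        # word id -> node id -> list of image ids
        self.hif = defaultdict(lambda: defaultdict(list))
        self.word_occurrences = defaultdict(int)   # for Idf-style weighting
        self.num_images = 0

    def add_image(self, node_id, image_id, words):
        """Register a new image of a node, updating the HIF of every word it contains."""
        self.num_images += 1
        for w in set(words):
            self.hif[w][node_id].append(image_id)
            self.word_occurrences[w] += 1

    def nodes_of(self, word):
        """First level only: node ids in which the word has been seen (node level)."""
        return self.hif[word].keys()

    def images_of(self, word, node_id):
        """Child inverted file of one node (image level)."""
        return self.hif[word].get(node_id, [])
```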
Algorithm 2 Node Level Loop Closure Algorithm
 1: procedure NODE_LEVEL_LOOP_CLOSURE(N, Iq, Ths)
 2:   N = T.N - {nc}
 3:   H = Histogram(size_of(N))
 4:   W = extract_and_quantize_features(Iq)
 5:   for each word wi in W do
 6:     HIFi = get_hierarchical_inverted_file(wi)
 7:     for each node nj in HIFi do
 8:       H[nj] = H[nj] + 1
 9:     end for
10:   end for
11:   N* = get_winners_from_histogram(H, Ths)
12:   return N*
13: end procedure


Algorithm 3 Image Level Loop Closure Algorithm
 1: procedure IMAGE_LEVEL_LOOP_CLOSURE(Iw, Nw, Iq)
 2:   H = Histogram(M)
 3:   W = extract_and_quantize_features(Iq)
 4:   for each word wi in W do
 5:     HIFi = get_hierarchical_inverted_file(wi)
 6:     for each node nj in Nw do
 7:       IFi^nj = get_inverted_file_of_node(HIFi, nj)
 8:       for each entry fi in IFi^nj do
 9:         H[fi] = H[fi] + TfIdf(fi, wi)
10:       end for
11:     end for
12:   end for
13:   I* = get_winner_from_histogram(H)
14:   n* = get_corresponding_node(I*)
15:   return n*
16: end procedure
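A minimal Python sketch of Algorithms 2 and 3 over the two-level index introduced earlier is given below; it assumes the dictionary layout `hif[word][node_id] -> image ids` from the previous listing, and the Tf-Idf weight is simplified to an Idf term for brevity, which is an assumption rather than the paper's exact weighting.

```python
# Minimal sketch of HIF-based node level (Algorithm 2) and image level
# (Algorithm 3) loop closure. hif[word][node_id] is the child inverted file
# (list of image ids). The Idf-only weighting is a simplifying assumption.
import math
from collections import defaultdict

def node_level_loop_closure(hif, query_words, threshold, exclude_node=None):
    """Vote over the first level of the HIFs: nodes in which query words occurred."""
    votes = defaultdict(float)
    for w in query_words:
        for node_id in hif.get(w, {}):
            if node_id != exclude_node:
                votes[node_id] += 1.0
    return [n for n, v in votes.items() if v >= threshold]

def image_level_loop_closure(hif, query_words, winning_nodes, num_images):
    """Vote over the child inverted files of the winning nodes only."""
    scores = defaultdict(float)
    for w in query_words:
        postings = hif.get(w, {})
        seen = sum(len(v) for v in postings.values())          # images containing w
        idf = math.log(num_images / (1.0 + seen))
        for node_id in winning_nodes:
            for image_id in postings.get(node_id, []):
                scores[image_id] += idf
    if not scores:
        return None
    return max(scores, key=scores.get)                         # best matching image id

# Usage (with the HIFIndex sketch above):
#   winners = node_level_loop_closure(index.hif, words, threshold=20,
#                                     exclude_node=current_node)
#   best = image_level_loop_closure(index.hif, words, winners, index.num_images)
```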


V. EXPERIMENTS

Our experimental setup consists of a Pioneer P3DX robot equipped with an omnidirectional camera. A laptop equipped with an Intel Centrino 2 processor running Ubuntu 9.04 is used for data processing. The experiments were carried out in our artificial urban environment, PAVIN. The environment contains roads, artificial buildings, and a variety of real-world road settings like junctions, traffic lights, roundabouts, curved roads and dead ends. Omnidirectional images were acquired at a frame rate of 2 fps as the robot moved along a manually controlled trajectory. Image data was acquired in four installments (A, B, C and D) at very different times of two days and hence contained significant illumination variation. Figure 3 shows the parts of the environment through which the robot traversed during each installment. We took care that data from all four installments contained overlaps so as to put our loop closure algorithm to the test. We constructed two data-sets by combining data from all four installments. Dataset-6560 was obtained by combining data of installments A, C and D. It contains 6560 images with 52 possible loop closures. Another data-set, Dataset-11200, was obtained by a combination of all four installments. It contains 11200 images and 71 possible loop closures. The number of loop closures was determined by manually examining the data-sets.

A. ISP - Sparsity

The number of nodes in a topological map indicates its sparsity. An ideal topological map is one in which each distinct place in the environment is represented by a node in the topological graph. The sparsity of these ideal maps represents the optimal sparsity of the actual environment. But practically, an ideal topological map is far from being attained. Different features produce different topological representations of the environment. We have experimented using SURF128, U-SURF128, and SIFT features, out of which we found that U-SURF128 features lead to the topological structure closest to the ground truth.

Another important factor that affects the stability of appearance is image distortion. The features directly extracted from warped omnidirectional images are unstable, as the appearance of a keypoint patch changes very much even with a small


Fig. 3: Paths traversed by the robot during each image acquisition installment ((a) A, (b) B, (c) C, (d) D).

TABLE I: SPARSITY

(a) Sparsity - Warped
                 SURF128   U-SURF128   SIFT
DATASET-6560     539       502         1582
DATASET-11200    756       742         2795

(b) Sparsity - Unwarped
                 SURF128   U-SURF128   SIFT
DATASET-6560     504       473         1037
DATASET-11200    737       723         1257

displacement of the camera. Hence it is likely that the maps produced using warped images contain a greater number of place nodes. This happens because, due to the feature instability, each place can be understood as multiple adjacent places and hence multiple nodes in the topological graph. The undistorted (unwarped) images produce relatively sparser maps. Tables I(a) and I(b) show the sparsity of maps produced by ISP using different features on warped and unwarped images. We can see that U-SURF128 is the best performing feature, producing the sparsest maps.

B. Accuracy

In this subsection, we discuss the accuracy of the maps produced by ISP, based on the number of accurate loop closures and the obtained false positives. The sparsest map does not guarantee an accurate map. Only those maps with accurate place partitioning are accurate and can lead to accurate loop closures. Thus a good mapping technique is


TABLE II: NODE LEVEL LOOP CLOSURE ACCURACY
          Dataset-6560   Dataset-11200
#(LC)               49              68
#(FP)                4               6

one which provides an optimal combination of sparsity and accuracy. Given a query image, we first perform loop closure at node level and then at image level if more accuracy is required. Table II shows the number of loop closures detected and the number of false positives obtained on Dataset-6560 and Dataset-11200 by node level loop closure. We can see that a few loop closures are missed and some false positives are observed. The inaccuracy can be attributed to imperfections in ISP; there is a direct relation between the accuracy of the maps and ISP. The way in which we perform topological mapping does not induce any information loss. In other words, we do not perform any kind of sampling or selection of a reduced number of features. Instead, we consider each and every feature extracted from the images and store it in HIFs. This guarantees that no information is lost during the mapping process. Inaccurate loop closure detection might nevertheless occur due to inaccurate partitioning of places. For example, a place can be represented by two nodes if it is partitioned inaccurately during ISP due to appearance feature instability. As a result, during node level loop closure with a query image, both nodes representing that place may not receive high similarity scores and the loop closure becomes inaccurate. A good ISP algorithm produces maps with a minimum number of such situations. Image level loop closure accuracy depends on the accuracy of node level loop closure: if node level loop closure selects an inaccurate set of winning nodes, image level loop closure also becomes inaccurate. However, in the case of an accurate node level loop closure, we observed that 99% accuracy was possible in image level loop closure irrespective of the ISP technique. Figure 4 shows two loop closure scenarios that occurred in our mapping.

C. Computational Time

Table III shows the average computational time (in milliseconds) taken by each stage of our topological mapping framework in processing each query frame on both Dataset-6560 and Dataset-11200. The abbreviations NLLC, LFE+QUANT and ILLC stand for Node Level Loop Closure, Local Feature Extraction & Quantization and Image Level Loop Closure respectively. NLLC involves extracting the most similar nodes using HIFs, as described in Algorithm 2. This takes 10 ms, as shown in Table III, and requires an additional 50 ms to compare with the current place node whenever needed. Local feature extraction (LFE) (200 ms) and quantization (QUANT) (70 ms) times are nominally constant for every acquired frame; in practice, the time required for both tasks increases with the number of features in an image. The computation time of image level loop closure (ILLC) is very low. This low computation time is the result of using

Fig. 4: Example loop closures (panels a-d).

TABLE III: AVERAGE COMPUTATION TIMES (in ms)
NLLC          10 + 50
LFE+QUANT    200 + 70
ILLC               21

hierarchical inverted files (HIFs). As mentioned before, HIFs make the loop closure computation almost independent of the number of reference images and, in our case, of the number of nodes in the topological graph. Figures 5(a) and 5(b) compare the loop closure times of our HIF-based method and of a non-HIF based method. The red curves indicate the time taken for feature quantization, node level loop closure and image level loop closure for each image frame in a sequence. The blue curves represent the time taken by feature quantization and similarity score generation using standard inverted files with the inverse similarity evaluation methodology. We can see that the loop closure time of the non-HIF based method grows relatively quickly with the number of images in the map, while with our method the loop closure time increases much more slowly. The performance gain becomes more prominent for large datasets, as can be seen in Figure 5(b), which corresponds to Dataset-11200 and its 11200 images. The non-HIF based loop closure time for Dataset-11200 increases less steeply than that of Dataset-6560 because the average number of features per image in Dataset-11200 is lower than in Dataset-6560, so each reference frame takes less time to process. The efficiency of our HIF-based method can be attributed to the combination of sparse topological mapping and HIFs for efficient map storage. The representational power of HIFs saves a lot of the computation involved in loop closure.
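To make the node-level lookup concrete, the following is a minimal Python sketch of inverted-file scoring over place nodes; the index structure, the names and the simple sum-of-counts score are illustrative assumptions and not the authors' exact HIF implementation.

    from collections import defaultdict

    class NodeLevelIndex:
        """Toy inverted file over place nodes: visual word -> {node_id: count}."""
        def __init__(self):
            self.index = defaultdict(dict)

        def add_node(self, node_id, words):
            # Register every quantized feature (visual word) of a place node.
            for w in words:
                self.index[w][node_id] = self.index[w].get(node_id, 0) + 1

        def query(self, words, top_k=3):
            # Only nodes sharing at least one word with the query are touched,
            # so the cost depends on the sparsity of the map rather than on
            # the total number of stored images.
            scores = defaultdict(float)
            for w in words:
                for node_id, count in self.index.get(w, {}).items():
                    scores[node_id] += count
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Usage: build a tiny map and query it with the words of a new frame.
    idx = NodeLevelIndex()
    idx.add_node("node_A", [1, 5, 9, 9, 42])
    idx.add_node("node_B", [2, 5, 77])
    print(idx.query([5, 9, 100]))   # node_A should rank first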



Fig. 5: Loop closure computation times of non-HIF based and HIF based loop closure on maps generated from our datasets. (a) Loop closure time - Dataset-6560. (b) Loop closure time - Dataset-11200.

VI. CONCLUSION

We proposed a sparse topological mapping framework involving two levels of loop closure. Hierarchical Inverted Files (HIFs) were naturally adaptable for loop closure in our sparse topological mapping framework and made fast loop closure possible. Image Sequence Partitioning (ISP) played a key role in producing sparse topological maps. The sparsity of the maps produced by different features was analyzed and their accuracy evaluated. Finally, our framework was evaluated in terms of the computational time required for loop closure. The experiments support our argument that HIFs are suitable for sparse topological maps, as they take advantage of the sparsity of the map to perform loop closure efficiently without discarding any information.

ACKNOWLEDGMENT
This work has been supported by the ANR projects CityVip and R-Discover. Special thanks to Mr. Jonathan Courbon for his help with the experimentation.


Kalman filter correction with rational non-linear functions: Application to Visual-SLAM


Thomas Féraud, Roland Chapuis, Romuald Aufrère and Paul Checchin
Clermont Université, Université Blaise Pascal, LASMEA UMR 6602, CNRS, 63177 Aubière, France
Email: firstname.name@univ-bpclermont.fr

Abstract: This article deals with the divergence of the Kalman filter when used with rational non-linear observation functions in the Visual SLAM framework. The objective is to localize a vehicle and simultaneously build a map of the environment perceived by a camera. There are many ways to fuse the data coming from the sensors, and the usual one is the Kalman filter. A main problem of this approach is divergence due to an improper linearization of the observation model, which leads to wrong estimates that disturb the whole process. The presented method allows, under a weak constraint on the observation function, to reduce the divergence effect without modifying the observation noise. In the Visual SLAM context, this method drastically improves the results and gives more stability to a monocular system for initializing landmarks.
Index Terms: Visual SLAM, depth estimation, Kalman filtering

- the estimation process, which consists in integrating the various relative measurements to estimate the robot and landmark positions in a common reference frame.

Fig. 1 gives an example of a monocular SLAM process. Historically, SLAM relied on laser telemeters such as Range-and-Bearing (RB) sensors, allowing robots to build planar maps of the environment. The original solution [12] utilized an Extended Kalman Filter (EKF) to fuse data acquired by a laser range scanner or other range and bearing sensors, leading to Range and Bearing EKF-SLAM. Recently, the use of vision sensors has put the 3D world within reach but has given rise to new difficulties due, on the one hand, to the richness of the information provided and, on the other hand, to the loss of the depth dimension in the measurements, inherent to the projective nature of vision. Visual SLAM with a single camera is more challenging than visual SLAM based on stereo vision, but a single camera will always be cheaper, more compact and easier to calibrate than a multi-camera rig.

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) has become a well-defined problem in the robotics community: a moving sensor platform constructs a representation of its environment on the fly while concurrently estimating its ego-motion [8, 2]. Stochastic approaches have solved the SLAM problem in a consistent way, because they explicitly deal with sensor noise [7, 13]. The implementation of a feature-based SLAM approach encompasses the following functionalities:

- the perception process, which depends on the kind of environment and on the sensors the robot is equipped with. It consists in detecting, from the perceived data, features of the environment that are salient, easily observable and whose relative position to the robot can be estimated.
- the observation process, which concerns the estimation of the features' locations relative to the robot's pose from which they are observed.
- the prediction process, which deals with the estimation of the robot's motion between two feature observations. This estimate can be provided by sensors, by a dynamic model of the robot's evolution fed with the motion control inputs, or based on simple assumptions such as a constant velocity model.
- the data association process, in which the observations of landmarks must be properly associated (or matched), otherwise the robot's position can become totally inconsistent.

Fig. 1. Top: two images taken at steps k and k+1 and the corresponding extracted features. Bottom: the distance between the two observer positions (black triangles) based on the trajectory in blue.


The EKF estimation algorithm used in Davison's system has been shown to be inconsistent [1]: the algorithm tends to be overconfident in the calculated estimate due to linearization errors. As well as improving the computational efficiency of mapping large environments, sub-mapping also improves the quality of the estimate by decreasing these linearization errors [1], [5]. Other monocular SLAM systems have also attempted to improve the quality of the estimate by using different estimation algorithms. Chekhlov et al. [6] used the Unscented Kalman Filter (UKF) for their monocular SLAM system. This algorithm improves the consistency of the estimate by making more accurate approximations of the non-linearities in the observation process. The improvement in consistency comes at a cost in complexity (O(N^3) vs. O(N^2) for the EKF). Recently, Holmes et al. [10] showed that using the square root UKF algorithm [14] a more consistent estimate can be obtained at O(N^2) complexity in a monocular SLAM system. However, the actual processing time required was an order of magnitude greater than for the EKF on reasonably sized maps, due to a large constant factor on the computational cost of their implementation.

In this paper, a camera-centered Extended Kalman Filter is used to process a monocular sequence. The sensor-centered Extended Kalman Filter was first introduced in the context of laser-based SLAM [4]. Contrary to the standard EKF-SLAM, where estimates are always referred to a world reference frame, the sensor-centered approach represents all feature locations and the camera motion in a reference frame attached to the current camera. The typical correlated feature-sensor uncertainty arising from the uncertain camera location is transferred into uncertainty in the world reference frame, resulting in lower variances for the camera and map features, and thus in smaller linearization errors [5].

This paper presents a method to reduce the divergence in the specific case of rational non-linear observation models. At each step, comparisons between solutions with and without correction in the non-linear case demonstrate the effectiveness of the proposed approach. In Section II, we begin by introducing the camera-centered Extended Kalman Filter. Section III presents the solution proposed in this paper. Finally, Section IV shows experimental results of this work and Section V concludes the paper.

II. ISSUES OF EKF VISUAL SLAM

A. Prediction step

The SLAM algorithm begins with the standard EKF prediction using proprioceptive sensors and the known dynamic model associated with the vehicle. The current position of the observer can thus be predicted:

X_{k+1|k} = f(X_{k|k}, u_k)    (1)

where X_k is the state vector at step k and u_k is the vector of control inputs. The prediction step for the variance-covariance matrix is then expressed as:

P_{k+1|k} = F_X P_{k|k} F_X^T + Q_k    (2)

where Q_k is the covariance of the process noise and F_X is the Jacobian of the evolution model f(X_{k|k}) [9] with respect to X_k. At this point we obtain the best estimate of the position according to the proprioceptive information.

B. Update step

In the case of SLAM, the extracted features are used to update the vehicle's position and the map. At time k, a set of features constitutes the map, which can be augmented with newly observed features. On the other hand, if a feature has been previously observed, its location can be predicted in the observation space. The Kalman gain G_{k+1} and the update process can then be written for each observation y as follows:

G_{k+1} = P_{k+1|k} H_X^T (H_X P_{k+1|k} H_X^T + R_obs)^{-1}    (3)
X_{k+1|k+1} = X_{k+1|k} + G_{k+1} (y - h(X_{k+1|k}))    (4)
P_{k+1|k+1} = P_{k+1|k} - G_{k+1} H_X P_{k+1|k}    (5)

where h is the observation model and H_X is its Jacobian with respect to the state vector. R_obs is the covariance of the noise associated with the observations. Localization accuracy depends on the number of points, the precision of the initial positioning and the reconstruction model. At this stage, the gain has two behaviours depending on the observation model: if it is linear, the Kalman gain is optimal; otherwise it is suboptimal and may even lead to divergence, mainly due to inappropriate linearization.

C. Visual SLAM

In the case of Visual SLAM, the dimension of the observation space is lower than that of the state space and the associated model is non-linear. To illustrate the consequences, Fig. 2a shows the result of a 2D feature position updated with a 1D observation. The divergence observed is the result of an improper linearization. The problem of negative depth was addressed for the inverse depth parametrization in [11]; the solution presented here also covers the case of negative depth. We propose in this article to introduce a correction in the update process by putting a constraint on the rational observation function, in order to make it more consistent and to avoid the divergence phenomenon.

D. From 2D to 3D

A first approach would be to note that if the observation model in 3D space is linear and equal to identity, then for each 2D observation a virtual 3D point can be created along the line of sight. For each point, a variance-covariance matrix is associated, dealing with the uncertainty of the pose and of the observation. With this virtual 3D point, only the corresponding 3D point in the state vector is updated. The result of this method is presented in Fig. 2c. This method has the advantage of working with a linear observation function, but at the cost of increased uncertainty associated with the data. Indeed, considering the uncertainty as Gaussian in the image, its rigorous retro-projection in the world is not Gaussian, as shown in Fig. 3.
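For reference, the following is a minimal Python sketch of the generic EKF prediction and update of Eqs. (1)-(5); the state dimensions, the motion model f and the measurement model h below are illustrative placeholders, not the paper's camera-centered formulation.

    import numpy as np

    def ekf_step(x, P, u, y, f, F_jac, h, H_jac, Q, R):
        """One EKF cycle following Eqs. (1)-(5): predict, then correct with y."""
        # Prediction (Eqs. 1-2)
        x_pred = f(x, u)
        F = F_jac(x, u)
        P_pred = F @ P @ F.T + Q
        # Update (Eqs. 3-5)
        H = H_jac(x_pred)
        S = H @ P_pred @ H.T + R                      # innovation covariance
        G = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain, Eq. (3)
        x_new = x_pred + G @ (y - h(x_pred))          # Eq. (4)
        P_new = P_pred - G @ H @ P_pred               # Eq. (5)
        return x_new, P_new

    # Toy usage with a 2D constant-position state observed through a rational 1D model,
    # as in the h = y/x case discussed in Section II-C.
    f = lambda x, u: x + u
    F_jac = lambda x, u: np.eye(2)
    h = lambda x: np.array([x[1] / x[0]])
    H_jac = lambda x: np.array([[-x[1] / x[0]**2, 1.0 / x[0]]])
    x, P = np.array([5.0, 1.0]), np.eye(2)
    x, P = ekf_step(x, P, np.zeros(2), np.array([0.25]), f, F_jac, h, H_jac,
                    0.01 * np.eye(2), np.array([[0.05]]))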



(a). Update process in the 2D space showing divergence

Fig. 3. For a 2D Gaussian uncertainty in the image (left part), the reprojection in the 3D world is not Gaussian and corresponds to a cone (right part).

Fig. 4. 3D point uncertainty approximation by using the sum of three Gaussian functions.

(b). A zoom on the real point proves the integrity

III. PROPOSED APPROACH

In order to directly initialize the new point, it is placed into the world reference frame from its first observation, with an uncertainty coherent with the observation error in the image. However, despite being direct, this method suffers from a linearization problem: the gap between the created point and the real point may be significant. We propose here to compute a correction in the update step of the Kalman process in order to attenuate the filter divergence (see Section III-A). Its application in a Visual SLAM context is then introduced in Section III-B.

(c). Update step processed in the 3D world with a linear observation function

Fig. 2. Simulation: let the green star be a landmark of the environment observed from a first robot position (left blue triangle). The position of a 3D feature is initialized along the line of sight (red cross) with its uncertainty in blue. (a) When observed from a second position (right blue triangle), the location of the 3D feature (green circle) is estimated after the update process in the 2D space, showing divergence due to the non-linear observation function. (b) A zoom on the real point proves it is contained in the uncertainty of the virtual point. (c) The new 3D point (black cross, with its uncertainty) is created from the second observation, but this time the Kalman update step is processed in the 3D world with a linear observation function equal to identity. The result is the black circle with the uncertainty in red.

A. Weighted Coefficient

When the measurement matches the predicted observation, we have the following relation:

observation = h(P_{3D}^{(k+1)})    (6)

where P_{3D}^{(k+1)} = P_{3D}^{(k)} + Gpt_{k+1} (y - h(P_{3D}^{(k)})), with Gpt_{k+1} the part of the Kalman gain related to the 3D point.

Hypothesis: let the observation function be defined as a ratio of two polynomial functions, N and D, of P_{3D}:

h(P_{3D}^{(k+1)}) = N(P_{3D}^{(k+1)}) / D(P_{3D}^{(k+1)})    (7)

where P_{3D}^{(k+1)} is not a root of D. Using (6) and (7), we obtain a polynomial equation with respect to P_{3D}^{(k+1)}:

observation * D(P_{3D}^{(k+1)}) - N(P_{3D}^{(k+1)}) = 0    (8)

One approach commonly used in this case consists in an approximation by a sum of Gaussian functions, each centered on a different point of the line of sight and tangent to the cone resulting from the projection of the measured uncertainty (see Fig. 4).

Let us define Γ as the gain which satisfies (8). As the 2D-3D transformation is not well-conditioned, we seek a simple proportional relationship between Γ and Gpt_{k+1}. This


research requires that Γ satisfies the same structural conditions as the Kalman gain Gpt:

Γ = r Gpt = r P_Pt H_Pt^T (H_Pt P_Pt H_Pt^T + R)^{-1} = P_Pt H_Pt^T (H_Pt P̃_Pt H_Pt^T + R̃)^{-1}    (9)

with P̃_Pt = P_Pt / r and R̃ = R / r    (10)

where P_Pt and H_Pt are the parts of the covariance and Jacobian matrices related to the point and to the observation model, respectively. As P_Pt and R are positive-definite matrices, P̃_Pt and R̃ must also be positive definite, so r cannot be negative. Furthermore, as R is the best estimate of the noise associated with the observation, R̃ cannot be smaller than R; this leads to R̃ >= R, hence r <= 1, and to a polynomial function in r, which is the coefficient of proportionality sought between Γ and Gpt_{k+1}. This can be summarized by the following relation:

Γ_{k+1} = r̄ Gpt_{k+1}  with  r̄ = min(r, 1)    (11)

Once the value of r is found, the corrected update step can be processed:

P_{3D}^{(k+1)} = P_{3D}^{(k)} + Γ_{k+1} ν = P_{3D}^{(k)} + Γ_{k+1} (y - h(P_{3D}^{(k)}))
P_{P3D}^{(k+1)} = P_{P3D}^{(k)} - Γ_{k+1} H_X P_{P3D}^{(k)}    (12)

Algorithm 1: Kalman correction for a non-linear observation model
Data: the observation function h = N[X^p] / D[X^p] is a ratio of two polynomial functions in X^p, the part of the state vector we want to update with the Kalman filter.
Result: state vector updated and corrected according to the specific constraints.
begin
  Prediction step:
    X_{k+1|k} = f(X_{k|k}, u_k)
    P_{k+1|k} = F_X P_{k|k} F_X^T + Q_k
  Kalman gain:
    G_{k+1} = P_{k+1|k} H_X^T (H_X P_{k+1|k} H_X^T + R_obs)^{-1}
  Correction: for each dimension i of the observation space do
    X^p_{k+1|k+1} = X^p_{k+1|k} + G_{k+1} (y - h(X^p_{k+1|k}))
    r_i is the smallest positive root of:
      observation * D(X^p_{k+1|k+1}) - N(X^p_{k+1|k+1}) = 0
  Γ_{k+1} = min(r_i, 1) G_{k+1}
  Update step:
    X^p_{k+1|k+1} = X^p_{k+1|k} + Γ_{k+1} (y - h(X^p_{k+1|k}))
    P^p_{k+1|k+1} = P^p_{k+1|k} - Γ_{k+1} H_X P^p_{k+1|k}
end

IV. EXPERIMENTS AND RESULTS

In this section, the correction is calculated in the Visual SLAM framework and results are given in three cases: firstly for a 2D observer and a 1D observation model, secondly for a 3D observer with a 2D observation function in simulation and finally for a 3D observer with a 2D observation on real data.

A. Correction calculation

The update of a 3D point can be written as follows:

P_{3D}^{(k+1)} = P_{3D}^{(k)} + Gpt_{k+1} ν    (15)

where ν = observation - prediction is the innovation. When the dimension of the observation space is greater than 1, considering each coordinate as independent, we can find a correction for each of them. We then choose the smallest of them to avoid any over-weighted update. Finally, the method is summarized in Algorithm 1.
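As an illustration of this correction idea, here is a minimal Python sketch for the 2D-observation case: the projection model, the closed-form roots corresponding to Eqs. (19)-(20) and all variable names are simplified assumptions, not the authors' implementation.

    import numpy as np

    def corrected_gain(G_pt, nu, P3d, K, R_wc, T, u_obs, v_obs):
        """Scale the 3D-point part of the Kalman gain so that the corrected
        update stays consistent with the observed pixel (cf. Algorithm 1)."""
        K1, K2, K3 = K[0], K[1], K[2]          # rows of the 3x3 camera matrix
        p_cam = R_wc.T @ (P3d - T)             # point in the camera frame
        De = K3 @ p_cam                        # common denominator of the projection
        u_est, v_est = (K1 @ p_cam) / De, (K2 @ p_cam) / De
        d = R_wc.T @ (G_pt @ nu)               # displacement induced by the uncorrected update
        r_u = (u_obs - u_est) * De / ((K1 - u_obs * K3) @ d)
        r_v = (v_obs - v_est) * De / ((K2 - v_obs * K3) @ d)
        candidates = [r for r in (r_u, r_v) if r > 0]
        r_bar = min(candidates + [1.0])        # smallest positive coefficient, capped at 1
        return r_bar * G_pt

    # Usage: shrink a (3x2) point gain before applying the corrected update of Eq. (12).
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    G_corr = corrected_gain(np.ones((3, 2)) * 0.1, np.array([2.0, -1.0]),
                            np.array([1.0, 0.5, 8.0]), K, np.eye(3), np.zeros(3),
                            u_obs=330.0, v_obs=235.0)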

B. Observation function in Visual SLAM

Consider a vehicle moving and observing its environment using a camera. Data association provides the matching between the 2D image points (u_obs, v_obs)^T and the 3D points P_3D of the map. Let us define the rigid transformation between the world and the reference frame of the vehicle as R = R_z(ψ_k) R_y(θ_k) R_x(φ_k) and T = (Tx_k, Ty_k, Tz_k)^T. As in [9], the points P_3D are projected into the image, at u_est and v_est, with the following non-linear relationship:

u_est = h_u(P_3D) = N_u / D_e = K_1 R^T (P_3D - T) / (K_3 R^T (P_3D - T))    (13)
v_est = h_v(P_3D) = N_v / D_e = K_2 R^T (P_3D - T) / (K_3 R^T (P_3D - T))    (14)
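A minimal Python sketch of the projection model in Eqs. (13)-(14), assuming a standard 3x3 intrinsic matrix K and a world-to-camera rotation R and translation T; the numerical values are illustrative only.

    import numpy as np

    def project(P3d, K, R_wc, T):
        """Pinhole projection of Eqs. (13)-(14): a ratio of two linear forms in P3d."""
        p_cam = R_wc.T @ (P3d - T)            # point expressed in the camera frame
        Nu, Nv = K[0] @ p_cam, K[1] @ p_cam   # numerators N_u, N_v
        De = K[2] @ p_cam                     # shared denominator D_e
        return Nu / De, Nv / De               # (u_est, v_est)

    # Usage: a point a few meters in front of the camera projects near the principal point.
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    print(project(np.array([0.2, -0.1, 8.0]), K, np.eye(3), np.zeros(3)))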


The projection can then be expressed as:

u^{(k+1)} = K_1 R^T (P_{3D}^{(k)} + Gpt_{k+1} ν - T) / (K_3 R^T (P_{3D}^{(k)} + Gpt_{k+1} ν - T))    (16)
The observation function is written as a ratio of two linear polynomials with respect to Gpt. We can use relationship (16) and extract a proportionality coefficient which enforces the relation between the projection of the updated 3D point and the current observation, by substituting

Γ = r_u Gpt    (17)

u_obs = K_1 R^T (P_{3D}^{(k)} + r_u Gpt ν - T) / (K_3 R^T (P_{3D}^{(k)} + r_u Gpt ν - T))    (18)

where K_i is the i-th row of the camera matrix and h_u, h_v are the parts of the observation function related to the u and v coordinates.


The resolution of this equation amounts to solving a first-order polynomial. The root of (18) with respect to the u_obs coordinate is:

r_u = (u_obs - u_est) D_e / ((K_1 - u_obs K_3) R^T Gpt ν)    (19)

For a 1D observation function, r_u is the coefficient sought. For a 2D observation function, a coefficient can be defined in the same way with respect to the v_obs coordinate:

r_v = (v_obs - v_est) D_e / ((K_2 - v_obs K_3) R^T Gpt ν)    (20)

Finally, for a 2D observation model, the smallest value (less than 1) between r_u and r_v is chosen.

B. Results: 2D observer, 1D observation

In this section, the observation function is defined as the first component of the 2D observation and is therefore similar to h = y/x. The observer is defined by a position (x, y) and an orientation, and the observed landmark by two coordinates (x_p, y_p). Fig. 2a shows the case without correction, and Fig. 5 shows the results of the first update with correction. This figure allows comparing the Kalman update results after 10 iterations with and without correction. For clarity, the ordinate axis is in logarithmic scale and gives the Euclidean distance between the updated point and the real one. The middle graph shows the result without the correction and confirms the divergence at the first iteration. This divergence is corrected and the convergence can be seen in the graph below.

C. Results: 3D observer, 2D observation in simulation

As the observation model is a 2D function, two coefficients can be found. The observer is now defined by a 3D position (x, y, z) and an orientation (three angles), and the observed landmark by three coordinates (x_p, y_p, z_p). In this case, each coordinate in the image allows computing one correction coefficient. The experimental scenario is simple: from the first position, a virtual point is created in the world and the state vector is augmented. This new feature only indicates a line of sight towards the real point, which will be placed using triangulation with the next observation when the observer reaches another position. For each innovation given in the image, we assume that it corresponds to a 3D point on the first line of sight, obtained by triangulation. In this way, we can increase the innovation to test the correction. Fig. 6 shows the result of this experiment from 1 to 100 pixels of innovation.

D. Results: 3D observer, 2D observation on real data

From proprioceptive data and the corresponding sequence of images acquired by a vehicle on the platform PAVIN (a small realistic urban environment for intelligent vehicle tests), the convergence of the localization of a single point is analyzed. Fig. 7 shows, in a first part, a point in an image and the initialization done in the 3D reference world and, in a second part, the convergence of the Kalman process with and without correction for uncertainty at 1. The divergence, which places the estimated point behind the observer while it continues to be observed, is evident here. This divergence is corrected by the proportionality sought.

V. CONCLUSION

As part of a Visual SLAM project, we faced a rational non-linear observation function which is similar to h = y/x on each coordinate of the observation space, whatever its dimension. The acquisition conditions are those that we would obtain with a low-cost sensor. The presented method allows, in the case where the observation function diverges, calculating a coefficient of proportionality to be applied to the Kalman gain. This correction is particularly important since the observation function is exact while the estimate of the observed object is imprecise. Moreover, when moving at moderate speed, the frame rate of low-cost cameras (15 fps) rapidly generates this kind of divergence, which is corrected by the proportionality sought. The correction method is not specific to Visual SLAM and we want to address the issue in general, giving the limits of use when the observation function is non-linear and satisfies some weak constraints.
ACKNOWLEDGMENTS
This work is supported by the Agence Nationale de la Recherche (ANR, the French national research agency) (ANR

Fig. 5. Top: the blue triangles are the two consecutive observation bases. The virtual point created from the observation of a real point (green star) from the first base is the red cross, with its uncertainty in blue. The observation from the second base leads to an updated point (black circle) with its uncertainty (red flat ellipse). The divergence is reduced and the integrity is guaranteed. Only the first Kalman update is shown; the results below prove the convergence after 10 steps. Middle: the distance between each update and the real 2D point without correction. Bottom: the distance between each corrected update and the real 2D point.


(a). Description of real environment

Fig. 6. Top: the world representation of the experiment. From a first base (left blue triangle), a virtual point (black cross) is created on a line of sight (green line). From a second base (right blue triangle), the first line of sight is projected into the image (green line in the black rectangle in the bottom right corner). On this line the observations (red crosses) are chosen, and the corresponding point is found by triangulation (red line of sight on the main graphic). Middle: the distance between each update and the real 3D point without correction. Bottom: the distance between each corrected update and the real 3D point.

Rinavec PsiRob ANR-07-ROBO-0006). The authors would like to thank L. Malaterre, S. Alizon and all the other members of LASMEA who contributed to this project.

REFERENCES
[1] Bailey T., Nieto J., Guivant J., Stevens M., and Nebot E. Consistency of the EKF-SLAM algorithm. In Proc. IEEE International Conference on Intelligent Robots and Systems, 2006.
[2] T. Bailey and H.F. Durrant-Whyte, Simultaneous Localization and Mapping (SLAM): Part II - State of the Art. Robotics and Automation Magazine, 10 p., 2006.
[3] Bay H., Tuytelaars T., and Van Gool L. SURF: Speeded up robust features. In Proc. European Conference on Computer Vision, 2006.
[4] J. Castellanos, J. Neira, and J. Tardos. Limits to the consistency of EKF-based SLAM. In 5th IFAC Symposium on Intelligent Autonomous Vehicles, 2004.
[5] Castellanos J., Martinez-Cantin R., Tardos J. D., and Neira J. Robocentric map joining: Improving the consistency of EKF-SLAM. In Robotics and Autonomous Systems, 55(1), 2007.
[6] Chekhlov D., Pupilli M., Mayol-Cuevas W., and Calway A. Real-time and robust monocular SLAM using predictive multi-resolution descriptors. In Proc. 2nd International Symposium on Visual Computing, 2006.
[7] G. Dissanayake, P. Newman, H.F. Durrant-Whyte, S. Clark, M. Csobra, A solution to the simultaneous localization and map building (SLAM) problem. IEEE Trans. Robotics and Automation, 17(3), pp. 229-241, 2001.
[8] H.F. Durrant-Whyte, T. Bailey, Simultaneous Localization and Mapping (SLAM): Part I - The Essential Algorithms. Robotics and Automation Magazine, 9 p., 2006.
[9] Féraud T., Checchin P., Aufrère R. and Chapuis R. Communicating Vehicles in Convoy and Monocular Vision-based Localization, 7th IFAC Symposium on Intelligent Autonomous Vehicles, September 7-9, Lecce, Italy.
[10] Holmes S., Klein G., and Murray D. A square root unscented Kalman filter for visual monoSLAM. In Proc. International Conference on Robotics and Automation, 2008.

(b). Convergence without correction

(c). Convergence with correction

Fig. 7. Results of convergence in a Visual SLAM context. 7a: on the left, the environment and the point we refine during the SLAM process; on the right, the initialization of the point in the 3D world. 7b: result of the localization of the point by a Kalman filter without correction; after the 4th iteration, the point is still observed while its localization is behind the observer. 7c: result of the localization of the point by a Kalman filter with correction; at the 4th iteration, the Kalman gain is corrected to obtain a better localization.

[11] Parsley Martin P. and Julier Simon J. Avoiding Negative Depth in Inverse Depth Bearing-Only SLAM, in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, Nice, France, 2008.
[12] R. Smith and P. Cheeseman, On the representation and estimation of spatial uncertainty, The International Journal of Robotics Research, vol. 5, no. 4, pp. 56-68, 1987.
[13] Thrun, S. 2002. Robotic mapping: A survey. In Exploring Artificial Intelligence in the New Millenium, G. Lakemeyer and B. Nebel (eds.), Morgan Kaufmann.
[14] Van der Merwe R. and Wan E. The square-root unscented Kalman filter for state and parameter estimation. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001.


Helicopter safe landing using vision and 3D sensing


Paolo Raspa, Emanuele Frontoni, Adriano Mancini, Primo Zingaretti, Sauro Longhi
Dipartimento di Ingegneria Informatica, Gestionale e dell'Automazione (DIIGA), Università Politecnica delle Marche, Ancona, Italy
{raspa, frontoni, mancini, zinga, longhi}@diiga.univpm.it

Abstract: In this paper a mixed vision-range based approach, based on vision and laser range sensor technology, for safe landing of an Unmanned Aerial Vehicle (UAV) is proposed. The guidance system allows a remote user to define target areas from a high resolution aerial or satellite image to determine the waypoints of the navigation trajectory or the landing area. The helicopter is required to navigate from an initial to a final position in a partially known environment, to locate a landing area and to land on it. The system is based on our previous work on UAV navigation and landing: a feature-based image matching algorithm finds the natural landmarks and gives feedback to the control system for autonomous navigation and landing. An algorithm for safe landing area detection and classification is proposed, based on the use of 4D RGBD (Red, Green, Blue, Distance) data analysis for terrain classification. Results from a real scenario show the appropriateness of the approach, which does not require any artificial landmark (e.g., a helipad) and is robust to occlusions, light variations and strong vibrations.
Index Terms: UAV, safe landing, vision, range data, terrain classification.

I. INTRODUCTION

Typical missions of autonomous helicopters require flying at low speed to follow a path, hovering near an object or landing in a certain place. Positioning equipment such as Inertial Navigation Systems (INS) or Global Positioning Systems (GPS) is well suited for long range, low precision helicopter flight but falls short for precise, close proximity flight. Manoeuvring helicopters close to objects requires accurate positioning relative to the objects, and safe landing also requires precise information about the terrain. Visual and range sensing are rich sources of data for this relative feedback, and low cost sensors are now available for this kind of application. In this paper we propose the use of both Kinect technology and a Laser Range Finder to robustly acquire 3D terrain data. Unmanned Aerial Vehicles constitute a research field that has been extensively explored in the last decade [1, 2, 23-26].

In the literature a wide range of studies on autonomous helicopters has been reported: modelling and identification, simulation, sensor integration, control design and fault diagnosis [2-3]. The use of computer vision as a secondary or primary method for autonomous helicopter control has been discussed frequently in recent years, since the classical combination of GPS and an Inertial Navigation System (INS) cannot sustain autonomous flight in every situation [4-5]. The degree of autonomy that a helicopter can achieve depends on factors such as the ability to solve unexpected critical situations, e.g., loss of the GPS signal, and the ability to interact with the environment, e.g., to use natural landmarks. A vision-based solution for autonomous waypoint navigation and safe landing on unstructured terrain represents a strong improvement for both these abilities. Several techniques have been implemented, decoupling the problem of locating and tracking a high contrast, well known landmark, e.g., a classical helipad, that can be easily identified by standard image processing techniques [6-8], from the problem of detecting and avoiding natural obstacles, e.g., steep slopes and rocks, on a landing area [9-11]. The dependence on fixed, artificial landmarks and on optimal visibility conditions constitutes a strong limit for visual-based navigation in real-environment applications. In some works, vision approaches based on moment descriptors are used; they impose no constraints on the design of the landing pad except that it should lie on a two dimensional plane, but artificial landmarks composed of polygons are indispensable [7]. Moreover, the weakness of that strategy appears in situations in which natural (e.g., due to debris or leaves) or artificial (e.g., due to engine smoke) occlusions can make the computation of moment descriptors difficult. Some examples of more recent results in the field of visual odometry can be found in [21-26]. In this paper a mixed vision-range based approach for guidance and landing of a UAV is proposed. This paper is an extension of our previous work on vision based safe landing using optical flow and image classification [27]. Here we add robust 3D data acquisition using both Kinect technology, described below, and a Hokuyo Laser Range Finder to robustly acquire and classify 3D terrain data for safe

landing area detection. In this previous work the general idea was to guide the UAV using natural landmarks only, which can be selected from aerial images of the operating area. Such images may already be available or may be acquired during the flight. The features extracted from the natural landmarks can be chosen as navigation targets, allowing a vision-based feature tracker to compute the references for the flight control. The landing task is managed by the vision system through the localization of the features' geometric center of mass. Appropriate feedback is given to the autonomous landing control system. In the case of a landing task on unstructured areas, the ground is analyzed to verify whether the area is flat and proper for landing, since the Digital Elevation Map (DEM) resolution cannot guarantee the possibility to identify a flat and safe target area. In a preliminary stage, our studies focused on the problems of natural landmark detection and recognition and of safe landing area detection. The system described in [27] is based on feature-based surface reconstruction, using a plane extraction method and a heuristic estimation of the safe landing area. This approach chooses features in the image that are stable under large motions; thus, a sparse set of confident depth estimates is obtained. If the surface is considered sufficiently stable, the area is used for the autonomous landing manoeuvre. The present extension focuses on the ability to use not only vision, because of its strong dependence on lighting conditions and dynamic shadows, but also 3D sensing, to obtain a more stable terrain classification and to better assess the suitability of the landing area. Results show the quality of the mixed vision-range based approach, which is also robust to occlusions, light variations and vibrations (which strongly influence the quality of depth images). Tests performed in a real scenario using a real helicopter are presented. The organization of the paper is as follows. In Section 2 the HELIBOT test bed is presented. In Section 3 the vision-range approach, including a range-feature based terrain classifier, is presented; results follow in Section 4. Finally, in Section 5, concluding remarks and directions for future improvements are discussed.
II. THE HELIBOT TEST-BED

The experimental test-bed HELIBOT is a customized Bergen Twin Observer: a twin-cylinder, gasoline-powered, radio-controlled model helicopter approximately 1.5 meters in length and capable of lifting approximately 9 kg of payload. Onboard avionics include a PC/104-based computer stack running the QNX RTOS (800 MHz PIII CPU with 256 Mb DRAM and 256 Mb flash disk), a GPS-41EBF3V receiver with WAAS and EGNOS (available in Europe) corrections and a Microstrain 3DM-GX1 Attitude Heading Reference System (AHRS). The helicopter is also equipped with a downward-pointing Nikon camera. The ground station is a laptop that is used to send high level control commands to the helicopter as well as to display telemetry from it. Data are sent using an advanced radio modem that transmits and receives on the 868 MHz band. Autonomous flight is achieved using a hierarchical behavior-based control architecture, discussed further in [17]. Algorithm testing was largely performed using a simulation framework developed by our research group. The framework is provided with a customizable HELIBOT model, allowing an easy switch from simulated to real tasks, see Fig. 1. More information about the framework can be found in [17]. The approach to the operating area is managed with the aid of GPS information: the user gives the waypoint coordinates to the system and the helicopter is autonomously guided to the goal. Then the operating mode is switched and the vision system enters the control loop. The user analyzes a high resolution image of the area and selects a target. The system extracts the appropriate features from the target image to define a natural landmark and starts to search for it in the image sequence coming from the onboard camera. Once features are detected and tracked, the system uses the image location of these features to generate image-based velocity references for the low level attitude and height controllers. In the case of a programmed landing, the vision-based controller pilots the helicopter over the landing area and manages the landing. If the landing area is unknown, an appropriate analysis is performed to select the nearest flat area for a safe landing. Fig. 1 in [27] shows the flying area in which the experimental tests have been carried out. The vision algorithm is described in the next section.

Fig. 1. A sample scenario from the simulation framework, showing the HELIBOT vision system performing landing.

III. THE PROPOSED APPROACH

To obtain a precise 3D sensing model we use both a Kinect sensor and a Hokuyo Laser Range Finder (LRF). Our goal is to robustly acquire and classify 3D terrain data based on per-pixel RGBD data analysis. The Microsoft Kinect is equipped with an infrared (IR) light projector, a CMOS camera and an RGB camera. The IR patterns acquired by the CMOS camera are processed by a


System-on-a-Chip (SoC) to obtain depth data with 640x480 resolution. From the depth map a 3D point cloud can be extracted. Every pixel associated with depth information can be expressed in the CMOS coordinate frame using the following:

x = depthmap(r, c)
y = (c_x - c) x / f_y
z = (c_y - r) x / f_x    (1)

In Eq. (1), r and c denote the pixel coordinates. x is associated with the depth; the x-axis therefore points along the camera viewing direction. The Kinect depth camera calibration provides the other parameters: c_x and c_y are the camera centre pixel coordinates, and f_x and f_y denote the focal distance in pixels. Other sensors, like stereo and Time-of-Flight cameras, can provide depth (or disparity) data for 3D image reconstruction, but the chosen method also works in bad lighting conditions. While the technology is suitable for indoor environments, outdoors the Kinect is not robust due to its use of IR: strong sunlight obscures the range sensing method and only vision data remain available. In our experience the Kinect is useful only when the helicopter is close to the terrain (less than 1 meter). To overcome this limitation we added a Hokuyo LRF to the acquisition platform of the helicopter. The LRF and the Kinect are calibrated together to obtain an RGBD image suitable for terrain classification. The RGBD data may not be completely covered in the D dimension: when the Kinect is not working (out of range) we only have LRF data, which cover only a line of pixels. Also in this case the data are suitable for partial terrain classification. 3D image reconstruction should be invariant, or partially invariant, to translation, rotation, scale, 3D camera viewpoint and outdoor illumination conditions. Furthermore, robustness to image noise, occlusion (see the example in Fig. 2), background changes and clutter should be granted while maintaining near real-time performance. To address all the discussed aspects, an improved version of a terrain classifier, based on our previous experience and extended with 3D image technology, is used. Surface reconstruction can be defined as the process of inferring a mesh of interconnected points representing a three-dimensional surface.
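As a concrete illustration of the conversion in Eq. (1), here is a minimal Python sketch; the calibration values are made-up placeholders and the array layout is an assumption, not the actual onboard code.

    import numpy as np

    def depth_to_points(depthmap, fx, fy, cx, cy):
        """Convert a depth map (indexed by row r and column c) into a point
        cloud following Eq. (1): x is the depth along the viewing axis."""
        rows, cols = depthmap.shape
        c, r = np.meshgrid(np.arange(cols), np.arange(rows))
        x = depthmap                      # depth
        y = (cx - c) * x / fy             # lateral coordinate
        z = (cy - r) * x / fx             # vertical coordinate
        return np.stack([x, y, z], axis=-1).reshape(-1, 3)

    # Usage with placeholder Kinect-like calibration values.
    depth = np.full((480, 640), 0.9)      # a flat surface 0.9 m away
    cloud = depth_to_points(depth, fx=585.0, fy=585.0, cx=319.5, cy=239.5)
    print(cloud.shape)                    # (307200, 3)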

The surface is often assumed to be rigid and fixed. Points can be acquired using many types of sensors (e.g., rangefinders, stereo vision). Computer vision systems generally wish to use image sensors to infer the state of the world; as such, we would like vision systems to be able to reconstruct objects or environments from a sequence of pictures. Recently, the availability of low cost range-image sensors has introduced a novel methodology to perform structure-from-motion reconstruction. In the UAV application presented in this paper we are only interested in the reconstruction of the depth estimate using stable points. In fact, due to vibrations and variable lighting conditions, the system is not able to obtain a stable frame-by-frame depth reconstruction using the depth data acquired as described above. For this reason, a mixed feature-based and range-based depth perception, which allows safe autonomous landing, is described in the following. We learn a parametric model of the saliency distribution by fitting a Gaussian mixture model (GMM) using the Expectation-Maximization (EM) algorithm on a hand-labeled training data set. See [15] for practical details on the EM algorithm. In order to capture the variability of the terrain we ensure that we have at least data from flat surfaces, rough bare surfaces and short grass terrains; from thin branches and power line wires; from dense and sparse tree canopy; and finally from tall grass. We set the limit between linear and surface structure for tree trunks and branches at 10 cm in diameter. The labeling process is performed manually on the RGBD data. The optimal number of Gaussians necessary to capture the distribution without over-fitting the training data set was evaluated experimentally: we fitted from one to ten Gaussians per class model and compared the classification rates of such models on different test data sets. We achieved the best results with 4 Gaussians per class. We also confirmed the correct convergence of the EM algorithm. This labeling and model fitting is performed off-line and only once for the test area. The final classifier is a binary threshold on the flatness_k value resulting from the GMM classifier, which operates in the following way:

safe,   if flatness_k < T
unsafe, if flatness_k >= T    (2)
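A minimal sketch of this kind of GMM-based flatness scoring and thresholding, using scikit-learn's GaussianMixture as a stand-in; the feature vectors, the four-component choice and the decision rule below are simplified assumptions rather than the authors' trained models.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Train one 4-component GMM per terrain class on hand-labeled RGBD features
    # (random placeholders here stand in for per-pixel descriptors).
    rng = np.random.default_rng(0)
    flat_features = rng.normal(0.0, 0.1, size=(500, 4))      # e.g. short grass / bare ground
    rough_features = rng.normal(1.0, 0.5, size=(500, 4))      # e.g. bushes / canopy

    gmm_flat = GaussianMixture(n_components=4, random_state=0).fit(flat_features)
    gmm_rough = GaussianMixture(n_components=4, random_state=0).fit(rough_features)

    def flatness_score(x):
        """Lower means more flat-like: log-likelihood ratio of the class models."""
        x = np.atleast_2d(x)
        return (gmm_rough.score_samples(x) - gmm_flat.score_samples(x))[0]

    T = 0.0                                                    # threshold from supervised tuning
    sample = rng.normal(0.0, 0.1, size=4)
    label = "safe" if flatness_score(sample) < T else "unsafe"
    print(label)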

Fig. 2. Typical occlusions caused by engine smoke; the camera is mounted on the bottom of the UAV and points at the terrain.

The threshold T is calculated in another supervised training phase, where a human operator classified a set of well-known images, coupled with range data, as safe or unsafe; this procedure is simple and requires little time. A simple plane extractor is also computed to make the approach more robust: if we know the location and boundary of the ground plane, then we can establish the position of our vehicle and determine a structure of the world. For extracting the ground plane, we applied a simple and effective algorithm where a robust plane fit is determined from the 3D point clouds given by Eq. (1), using RANSAC [24]. The percentage

N of 3D points which are part of the flatness area according to Eq. (2) has been used for evaluation. A distance threshold of 5 cm has been set. Areas are considered safe if N is greater than 80% and the variance of N among consecutive acquisitions is below a predefined threshold (10%). Any points that are beyond the distance threshold from the plane are classified as outlying obstacles and assigned a probability value. This leads to a confidence density map of where the landing surface is in the 2D view of an image. This obstacle probability is combined with the threshold T described above to confirm the safe state or not; if the safe state is not confirmed, the landing area is directly classified as unsafe. Experiments using both computer simulated images and real video images demonstrate the correctness and effectiveness of our method. These range images can be processed at 4 frames per second; in the case of autonomous landing this is also the frequency of the feedback input to the control system. The software architecture is given in Fig. 6.

IV. RESULTS

A test in a real scenario using a dataset of recorded range images was performed to evaluate the overall safe landing area classification approach on the real UAV. Sequences over both safe and unsafe landing areas have been analyzed. In the following, two different cases of depth estimation and surface classification are reported. Fig. 3 shows two illustrative sequences of images taken while flying over a safe area, measuring only depth (Fig. 3a) and extracting the safe area (Fig. 3b, yellow region).
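A minimal sketch of the described check, fitting a plane with RANSAC and testing the inlier percentage against the N > 80% rule; the three-point plane hypothesis and the parameter values are illustrative assumptions, not the flight code.

    import numpy as np

    def ransac_plane(points, iters=200, dist_thresh=0.05, rng=np.random.default_rng(0)):
        """Fit a plane n.x = d to a point cloud with RANSAC; return (n, d, inlier mask)."""
        best_inliers = np.zeros(len(points), dtype=bool)
        best_plane = (np.array([0.0, 0.0, 1.0]), 0.0)
        for _ in range(iters):
            p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(p1 - p0, p2 - p0)
            if np.linalg.norm(n) < 1e-9:
                continue                                  # degenerate (collinear) sample
            n = n / np.linalg.norm(n)
            d = n @ p0
            inliers = np.abs(points @ n - d) < dist_thresh
            if inliers.sum() > best_inliers.sum():
                best_inliers, best_plane = inliers, (n, d)
        return best_plane[0], best_plane[1], best_inliers

    def is_safe(points, min_inlier_ratio=0.80):
        _, _, inliers = ransac_plane(points)
        return inliers.mean() >= min_inlier_ratio         # N > 80% rule (variance check omitted)

    # Usage: a nearly flat patch 0.7 m below the sensor should be classified as safe.
    rng = np.random.default_rng(1)
    pts = np.column_stack([np.full(2000, 0.7) + rng.normal(0, 0.01, 2000),
                           rng.uniform(-1, 1, 2000), rng.uniform(-1, 1, 2000)])
    print(is_safe(pts))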


Fig. 5. Ground plane extraction (a) from the 3D point cloud of bushes using RANSAC. The corresponding depth map (b) and RGB image (c) are also provided. The x-axis (red) points toward the ground plane and represents the distance from the camera.

Every pair of sequential images was analysed. A variance analysis of the geometrical characteristics, as described before, was performed and a representative value for the flatness parameter was computed. The classification results are reported in Table 1. We tested the proposed method in 10 different landing areas of the Gruppo Aeromodellistico Rio Ete flying area (see Fig. 1 in [27]), where we were allowed to fly our helicopter. Eight of them are unsafe areas characterized by tall grass, bushes and trees. All areas were correctly classified, and a sub-area suitable for landing was extracted in the last two safe test cases of short grass on plain terrain, see Table 1. We also tested the use of only LRF data and found that adding the Kinect allowed us to correctly classify tall grass, which is classified as safe when only LRF data are used. The main issue is that depth data are available only when the landing maneuver has started and the helicopter is at a height of at most 5 meters. When we need Kinect data to obtain reasonable classification performance, we need to be close to the terrain (less than 1 meter), which lengthens the safe landing area analysis. Fig. 4 shows the ground plane extraction for a short grass terrain using the RANSAC algorithm. The helicopter is at a height of 0.7 m. The extracted ground plane equation is given by:

Fig. 3 Example of safe area classification and detection.



0.99 x + 0.05 y + 0.08 z = 0.72    (3)

Fig. 4. Ground plane extraction (a) from the 3D point cloud using RANSAC for short grass terrain. The corresponding depth map (b) and RGB image (c) are also provided. The x-axis (red) points toward the ground plane and represents the distance from the camera.

As can be seen from Eq. (3), the ground plane is approximately parallel to the Kinect camera plane mounted on the HELIBOT; indeed, the coefficients associated with the y and z components are nearly zero. A percentage corresponding to 99.42% of the 3D points lies within a region of space between two planes which are parallel to the extracted ground plane reported in Eq. (3) and


are distant 5 cm from it. This percentage is above the desired threshold N = 80%, therefore the area can be considered safe. In Fig. 5 the helicopter is 1.1 m above bushes. The RANSAC algorithm extracts two planes. Percentages of 35.66% and 23.94% of the 3D points lie within the regions of space between two planes parallel to the extracted ground planes and distant 5 cm from each of them. Most of the points are distributed outside the area defined by the threshold T = 5 cm, therefore the area is detected as unsafe by the classification algorithm.

Table 1. Results of Terrain Classification
Terrain type         N%      Var(N)%   Classification
Short Grass         99.22      1.09    Safe Landing
Short Grass         99.14      1.2     Safe Landing
Tree trunk          89.83     26.86    Unsafe Landing
Bushes              25.64     12.7     Unsafe Landing
Bushes              34.69     20.12    Unsafe Landing
Bushes              29.45     18.73    Unsafe Landing
Artificial boxes    71.9      38.79    Unsafe Landing
Tall Grass          83.23     15.67    Unsafe Landing
Tall Grass          80.54     19.45    Unsafe Landing
Tree branches       32.73     29.33    Unsafe Landing

ACKNOWLEDGMENT
The authors gratefully acknowledge Antonio Pinelli, Biagio Ambrosio and Alberto Cardinali for their essential support.

REFERENCES
[1] Office of the Under Secretary of Defense, Unmanned aerial vehicles annual report, Defense Airborne Reconnaissance Office, Pentagon, Washington DC, Tech. Rep., July 1998.
[2] T.J. Koo and S. Sastry. Output tracking control design of a helicopter model based on approximate linearization. Proceedings of the 37th IEEE Conference on Decision and Control, Dec. 1998.
[3] A. Monteriù, P. Asthana, K. Valavanis, and S. Longhi. Model-based sensor fault detection and isolation system for unmanned ground vehicles: Theoretical aspects (parts I and II). Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2007.
[4] O. Amidi, An autonomous vision-guided helicopter, Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, 1996.
[5] O. Amidi, T. Kanade, and K. Fujita, A Visual Odometer for Autonomous Helicopter Flight, Robotics and Autonomous Systems, 28, pp. 185-193, 1999.
[6] O. Shakernia, R. Vidal, C. S. Sharp, Y. Ma, and S. S. Sastry. Multiple View Motion Estimation and Control for Landing an Unmanned Aerial Vehicle. Proc. Int'l Conf. Robotics and Automation (ICRA '02), pp. 2793-2798, 2002.




[7] S. Saripalli, J. Montgomery, and G. Sukhatme, Visually-guided landing of an unmanned aerial vehicle, IEEE Transactions on Robotics and Automation, 19(3), pp. 371-381, 2003.
[8] S. Saripalli and G. Sukhatme, Landing a helicopter on a moving target, IEEE International Conference on Robotics and Automation, 2007.
[9] P. J. Garcia-Pardo, G. S. Sukhatme, and J. F. Montgomery, Towards vision-based safe landing for an autonomous helicopter, Robotics and Autonomous Systems, 2000.
[10] A. Johnson, J. Montgomery, and L. Matthies, Vision guided landing of an autonomous helicopter in hazardous terrain, Proceedings of the 2005 IEEE International Conference on Robotics and Automation, 2005.
[11] T. Templeton, D. H. Shim, C. Geyer, and S. Sastry, Autonomous vision-based landing and terrain mapping using an MPC-controlled unmanned rotorcraft, Proceedings of the IEEE International Conference on Robotics and Automation, pp. 1349-1356, 2007.
[12] S. Se, D. Lowe, and J. Little, Vision-based mobile robot localization and mapping using scale-invariant features, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA) 2001, Seoul, Korea, May 2001, pp. 2051-2058.
[13] D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision, vol. 60, no. 2, pp. 91-110, 2004.
[14] E. Frontoni and P. Zingaretti, Adaptive and fast scale invariant feature extraction, Second International Conference on Computer Vision Theory and Applications, Workshop on Robot Vision, Barcelona, March 2007.
[15] J. Bilmes, A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, University of Berkeley, Tech. Rep. ICSI-TR-97-021, 1997.
[16] A. R. S. Bramwell, G. Done, and D. Balmford, Bramwell's Helicopter Dynamics, Butterworth-Heinemann, second edition, 2001.
[17] A. Mancini, A. Cesetti, A. Iaul, E. Frontoni, P. Zingaretti, and S. Longhi, A framework for simulation and testing of UAVs in cooperative scenarios, Journal of Intelligent and Robotic Systems, Springer, 54, pp. 307-329, 2009.
[18] J. Montgomery, Learning Helicopter Control through Teaching by Showing, Ph.D. thesis, School of Computer Science, USC, 1999.
[19] M. J. Mataric, Behavior-based control: Examples from navigation, learning and group behavior, Journal of Experimental and Theoretical Artificial Intelligence, special issue on Software Architecture for Physical Agents, vol. 9, no. 2-3, pp. 67-83, 1997.
[20] D. H. Shim, H. J. Kim, and S. Sastry, Hierarchical control system synthesis for rotorcraft-based unmanned aerial vehicles, AIAA Guidance, Navigation and Control Conference and Exhibit, 2000.
[21] B. Bethke, M. Valenti, and J. How, Cooperative vision based estimation and tracking using multiple UAVs, Advances in Cooperative Control and Optimization, pp. 179-189, 2007.
[22] F. Caballero, L. Merino, J. Ferruz, and A. Ollero, Vision-based odometry and SLAM for medium and high altitude flying UAVs, J. Intell. Robotic Syst., vol. 54, no. 1-3, pp. 137-161, 2009.
[23] G. Conte and P. Doherty, An integrated UAV navigation system based on aerial image matching, IEEE Aerospace Conference 2008, pp. 1-10, March 2008.
[24] Z. He, R. V. Iyer, and P. R. Chandler, Vision-based UAV flight control and obstacle avoidance, American Control Conference, 2006.
[25] I. F. Mondragón, P. Campoy, J. F. Correa, and L. Mejías, Visual model feature tracking for UAV control, IEEE International Symposium on Intelligent Signal Processing (WISP 2007), pp. 1-6, Oct. 2007.
[26] P. Campoy, J. F. Correa, I. Mondragón, C. Martínez, M. Olivares, L. Mejías, and J. Artieda, Computer vision onboard UAVs for civilian tasks, Journal of Intelligent and Robotic Systems, vol. 54, issue 1, pp. 105-135, 2009.
[27] A. Cesetti, E. Frontoni, A. Mancini, P. Zingaretti, and S. Longhi, A vision-based guidance system for UAV navigation and safe landing using natural landmarks, Journal of Intelligent and Robotic Systems, Springer, 57, pp. 233-257, 2010.


Safety standard for mobile robots - a proposal for 3D sensors

Klas Hedenberg, School of Technology and Society, University of Skövde, Skövde, Sweden
Björn Åstrand, School of Information Science, Computer and Electrical Engineering, University of Halmstad, Halmstad, Sweden

Abstract— In this paper we present a new and uniform way of evaluating 3D sensor performance. It is rare that standardized test specifications are used in research on mobile robots. A test rig with the objects from the industrial safety standard Safety of industrial trucks - driverless trucks and their systems EN1525 is extended with thin vertical and horizontal objects that represent a fork on a forklift, a ladder and a hanging cable. A comparison of a trinocular stereo vision system, a 3D TOF (Time-Of-Flight) range camera and a Kinect device is made to verify the use of the test rig. All sensors detect the objects in the safety standard EN1525. The Kinect and the 3D TOF camera show reliable results for the objects in the safety standard at distances up to 5 m. The trinocular system is the only sensor in the test that detects the thin structures. The proposed test rig can be used to evaluate the ability of sensors to detect thin structures. Index Terms— Safety standard for mobile robots, 3D sensors, trinocular vision, EN1525


I. INTRODUCTION
Navigation, to get from one point to another, is the most fundamental task of a mobile robot. As with all navigation, the navigator, or designer, must think about what barriers might exist during the trip and how these will be avoided. This is called obstacle avoidance. Obstacle avoidance may be divided into obstacle detection and path planning. Obstacle detection is crucial to how well the obstacle avoidance algorithm will work, since it is the input to path planning. A robust algorithm is the key for a mobile robot in an unknown or partially unknown environment. The obstacle detection algorithm must be able to detect all potential barriers in the environment without any false detections. To avoid an obstacle does not always mean that the route has to change. If the object is a human being, it may be easier to give a signal, light or sound, so that the human can give way to the robot. However, the robot must be able to stop if the human for some reason does not move. Contact-sensitive mechanical bumpers have only two states: on and off. Therefore, mobile robots equipped with these bumpers need to travel at a lower speed. These kinds of bumpers also raise some human-factor issues: workers are uncomfortable with the fact that they have to be hit by the robot before it will stop [22]. Workers can also try to beat the robot and thereby be injured. Therefore, virtual bumpers can increase productivity and flexibility and increase the plant's safety with regard to obstacle detection/avoidance [22]. Our goal is to see if a stereo vision-based obstacle avoidance system for an AGV (Automated Guided Vehicle) is sufficiently

Fig. 1. The test rig simulates in one scene a collection of objects: item A - a lying human [6], item B - a standing human [6], item C - a flat target proposed by [8], item D - a ladder, item E - forks on a forklift, item F - a hanging cable, items G and H - vertical bars, item I - a horizontal bar, item J - a thin cable. The thin cable in position J is included in this work to evaluate a 2D laser. A ladder, item D, typically has a steeper slope than 45°, but objects that have a larger angle may be considered to be vertical, while objects with a lower angle can be considered as horizontal.

reliable and safe for use in an indoor industrial environment. Our AGV will have a speed of 10 km/h and a braking distance of 5 meters. This means that traditional bumpers cannot be used. In this work, we make a proposal to extend the safety standard by defining a test rig, Figure 1, for 3D sensors. We make use of the safety standard Safety of industrial trucks - driverless trucks and their systems EN1525 [6] and add objects that are of interest in an industrial setting. The test rig is used in a comparison of a trinocular stereo vision system, a 3D TOF (Time-Of-Flight) range camera and a Kinect sensor. This paper is organized as follows. Section II discusses related work. Section III addresses the problem of not having a standardized test specification for 3D sensors. The test rig is presented in section IV and the virtual bumpers used in this paper are described in section V. Preliminary experiments on the proposed extension of the safety standard EN1525 (i.e. the test rig) are described in section VI. Finally, in section VII the conclusions are presented.


Fig. 2. Forklift, cable and ladder are examples of thin structures found in an industrial environment.

II. RELATED WORK
The meaning of the word obstacle is according to [15] either "a situation or an event etc. that makes it difficult for you to do or achieve something" or "an object that is in your way and that makes it difficult for you to move forward". For mobile robots the second meaning is more applicable. Objects are considered differently depending on the environment and the task the robot has to carry out. The only objects mentioned in the safety standard EN1525 [6] for driverless trucks are people. Burschka et al. describe the positions of various objects in the robot's route and propose an algorithm for stereo vision that avoids positive and negative obstacles, such as gaps and staircases [10]. Broggi et al. utilize the fact that vertical structures are more common than horizontal ones in an industrial environment [9]. Bostelman et al. test a 3D range camera against standard EN1525 in an industrial setting [27]. They compare the camera with a laser range finder as ground truth at distances up to about 3 m. The authors reflect that the requirements in the standard EN1525 may need to be changed to meet the specifications of contactless sensors. Furthermore, they argue that obstacle detection systems must detect both people and objects that can be pushed against people, and that the round objects described in the standard today are likely to perform better than flat items. Bostelman et al. perform tests with optical and acoustic systems on the circular objects in EN 1525 and a flat object, 500 mm x 100 mm, which the authors recommend be added to the standard [8]. LADAR (Laser Detection and Ranging) is approved for use on AGVs in industrial environments. LADAR is therefore often used as ground truth for comparison with other techniques [27], [21]. Other common virtual bumpers for mobile robots are FLIR (Forward Looking InfraRed) cameras, sonar and vision [27]. Early approaches to obstacle detection with sonar are described in [20] and [7]. Computer vision for obstacle detection can be divided into optical flow, structured lighting and stereo vision. Further reading on vision-based navigation can be found in [11]. Optical flow is described in [12] and makes use of one camera. The unsolved problem with optical flow is the aperture problem: the motion field is not fully defined at each pixel in two directions, since the optical flow is defined from one equation [12]. Structured lighting, for example the projection of a laser plane, is computationally simple but difficult at long distances and requires additional hardware [13]. The traditional stereo setup with two horizontally placed cameras is commonly used in the context of obstacle avoidance [24], [29], [30]. One problem with these systems is that they have trouble detecting horizontal objects [14]. Several works make use of a trinocular system to achieve higher

performance of the vision system [32], [23]. Okutomi and Kanade describe a multiple-baseline vision system [25]. Broggi et al. use stereo vision with a variable baseline to drive a vehicle in the DARPA Grand Challenge [9]. The authors conclude that a variable baseline results in good accuracy at various distances but needs a very accurate calibration. Current approaches in vision-based obstacle detection often combine different sensors and algorithms to overcome the drawbacks of one single approach. A combination of a laser range finder and stereo is used in [18].

III. PROBLEM DEFINITION
It is rare that standardized test specifications are used in research on mobile robots. It is natural, since much research is on new sensors and new algorithms where the researcher is trying to test the limits of each method. Because no single test method is used, it is often difficult to compare methods. A default case would resolve some of these problems. The existing standard EN1525 [6] is very rough regarding the issues of personal safety. The standard is based on contact sensors and describes that driverless trucks have to detect two objects, a standing cylinder with a diameter of 70 mm and a length of 400 mm, and a lying cylinder with a diameter of 200 mm and a length of 600 mm [6]. The philosophy behind the standard is sound: it focuses on avoiding damage to the standing or lying full-grown human, and thus follows the first law of robotics: "A robot may not injure a human being or, through inaction, allow a human being to come to harm" [31]. The standard has some disadvantages, and there are conflicts with the first law of robotics in three ways. Firstly, people may not always be well represented by a cylinder, i.e. people are not always either standing or lying down. Secondly, contact with an object may indirectly harm people, such as people standing on a ladder. Thirdly, contact with an object can cause material damage and become so expensive that it harms people. Thin horizontal and thin hanging structures are common in an industrial setting. Figure 2 shows three different objects (forklift, cable and ladder) that might be found in an industrial setting. Our work focuses on the detection of thin horizontal and thin hanging objects that could indirectly cause human injury. We do not treat negative obstacles, such as gaps and staircases. We propose an extension to the standard EN1525 to include thin structures to be handled by virtual bumpers. We also use


the test rig to compare three different virtual bumpers to verify that the proposed test rig can be used to distinguish between different sensors.

IV. TEST RIG
A common claim is that thin vertical objects are more likely than horizontal ones in outdoor environments for mobile robots [21], [9]. However, in indoor and human-made environments horizontal objects occur to a greater extent. To avoid accidents, discovering these horizontal objects is at least as important as detecting the vertical ones. Our proposal is to extend the standard with horizontal and hanging objects to deal with objects such as those shown in Figure 2. In order to compare the various methods we have used a test rig where the objects described in EN1525 are included. The test rig consists, in addition to the items described in EN1525, also of thin horizontal, vertical, sloping or hanging objects. All items are black and non-reflective (Figure 1). We only take into account the objects' shape and not different types of materials as in [8], where they use sonar sensors. The test rig, Figure 1, simulates in one scene a collection of objects: item A - a lying human [6], item B - a standing human [6], item C - a flat target proposed by [8], item D - a ladder, item E - forks on a forklift, item F - a hanging cable, items G and H - vertical bars, item I - a horizontal bar, item J - a thin cable. The thin cable in position J is included in this work to evaluate a 2D laser. A ladder, item D, typically has a steeper slope than 45°, but objects that have a larger angle may be considered to be vertical, while objects with a lower angle can be considered as horizontal. The test rig measures 1.8 x 1.8 m and the bars have a thickness of 25 mm. The hanging cable has a diameter of 13 mm.

V. VIRTUAL BUMPERS
In order to effectively plan for an action the robot needs to know the distance to the obstacle. Methods that account for the size of objects, instead of the distance, exist [17]. Humans use several algorithms together and in a sophisticated way merge them to determine the distance [28]. The methods used for mobile robots are much simpler and are often based on one principle, but combinations of principles exist [26]. The development of semiconductor technology has led to new sensors. The recently released Kinect device from Microsoft for the Xbox 360 game platform has received much attention among researchers during the last year. The Kinect device consists of an RGB camera, an IR projector and a monochrome camera for measuring IR. The projector emits a known speckled pattern that is measured by the monochrome camera, and thereby the distance can be measured [5]. The slightly older 3D TOF (Time-Of-Flight) range camera is also an active sensor in that it emits light and then calculates the distances to the surroundings by measuring the time it takes for the light to be reflected. The technique is described in [19]. Both the Kinect device and the 3D TOF camera may have problems when strong external lighting occurs.
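As a numerical aside, the unambiguous range of such a continuous-wave TOF camera follows directly from its modulation frequency (a generic sketch of the standard relations, not tied to a particular device in this paper):

import math

C = 3.0e8  # speed of light in m/s

def cw_tof_depth(phase_shift_rad, f_mod):
    """Depth from the measured phase shift of the modulated light
    (standard continuous-wave TOF relation)."""
    return C * phase_shift_rad / (4.0 * math.pi * f_mod)

def non_ambiguity_range(f_mod):
    """Maximum unambiguous range for modulation frequency f_mod."""
    return C / (2.0 * f_mod)

# The 20 MHz modulation used in the experiments of section VI gives
# 3e8 / (2 * 20e6) = 7.5 m of unambiguous range.
print(non_ambiguity_range(20e6))  # 7.5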

Fig. 3. Four different sensors were mounted on an Ackerman steered robot: a time-of-flight range camera, a trinocular vision system and a Microsoft Kinect sensor. A laser range scanner was used for ground truth measurements.


Fig. 4. Excluding outliers. The horizontal bar is correctly matched with the Up-Left camera pair. The match in the right camera is an outlier that normally is rejected with the uniqueness constraint. The star represents the case where there is a match in the Left-Right camera pair and the pixel has the wrong argument. The final step in the algorithm excludes these incorrect matches.

A. Trinocular
Traditionally in stereo vision, geometry is used to determine the distance. For a stereo camera setup with baseline b, focal length f and disparity d, the distance z is: z = bf/d (1)

From one camera image (e.g. the left image) the disparity is determined by locating the corresponding pixel in the other image (e.g. the right image). In the general case, it is still an open issue how the correspondence between the two images is established. Cameras that are horizontally mounted are optimized to detect vertical structures, and vertically mounted cameras are optimized for horizontal structures. This can be combined by using three (or more) cameras mounted in a vertical and horizontal pattern. The camera system in this work is assembled as an L. As a first step all images are rectified and smoothed. From the images of each camera two images are computed: an edge image using the Canny filter, and an image that contains the argument (gradient direction) for every pixel, computed from the image gradients in the x- and y-directions of the image. These images are superimposed to obtain an edge image with the argument for each edge pixel. Two sparse disparity maps are computed with the Sum of Absolute Differences (SAD) for the left-right, D_LR, and up-left, D_UL, image pairs, respectively. The mutual constraint (left-right consistency check) as well as the uniqueness


constraint (rejecting matches in regions with low texture) are used to exclude outliers. The argument, φ, of the lower left image is then used to compute a combined disparity map: D_SUM = sqrt((D_UL cos φ)^2 + (D_LR sin φ)^2) (2)
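A minimal sketch of this combination step is given below (our illustration, not the authors' code; it assumes D_SUM is the Euclidean combination of the two components as in Eq. (2), and the focal length value used for the depth of Eq. (1) is only a placeholder):

import numpy as np

def combine_disparities(d_lr, d_ul, phi):
    """Combined disparity map of Eq. (2) from the left-right and
    up-left sparse maps; phi holds the per-pixel edge argument
    (gradient direction) in radians. Missing matches are NaN."""
    d_lr = np.nan_to_num(d_lr)
    d_ul = np.nan_to_num(d_ul)
    return np.sqrt((d_ul * np.cos(phi)) ** 2 + (d_lr * np.sin(phi)) ** 2)

def depth_from_disparity(d_sum, baseline=0.2, focal_px=700.0):
    """Depth via Eq. (1), z = b*f/d; focal_px must come from the
    actual camera calibration."""
    with np.errstate(divide="ignore"):
        return np.where(d_sum > 0, baseline * focal_px / d_sum, np.inf)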

Disparity information from a pixel that only exists in one disparity map (e.g. D_LR) is then added to D_SUM. This is done only if the argument for the pixel is valid for that disparity map (i.e. not horizontal edges for D_LR). This step excludes some positive outliers. Figure 4 shows two ways to reject outliers. The horizontal bar is correctly matched with the Up-Left camera pair. The match in the right camera is an outlier that is rejected with the uniqueness constraint. The star represents the case where there is a match in the Left-Right camera pair but the pixel does not have an accepted argument.

VI. EXPERIMENTS AND RESULTS
A. Setup
Our goal is to find a method for mobile robots to detect thin structures. In this work we propose a test rig as an extension to the safety standard EN1525. The experiment aims to verify that sensors can be distinguished by the proposed test rig. A robotic platform based on an electrical wheelchair equipped with encoders is used for a comparison between different vision sensors. A SICK LMS200 laser scanner is used as ground truth. An on-board computer running Player [4] is used for data acquisition. For all vision systems the placement of the cameras is essential to get a good result. This has to be considered for every new setup [16]. There are several ways to determine the placement of the camera system on the robot. The easiest way is to just choose a placement by intuition. Putting a little more effort into this judgment will probably increase the precision of the system. It will also make a discussion of camera placement more unbiased if the system's performance has to be increased later. Huang and Krotkov conclude that the best placement for mobile robots is as high as possible [16]. However, in this work, to make the comparison easier, all cameras are placed in front of the robot, close to the laser range finder. The sensors are adjusted to fit the scans from the laser. The trinocular system consists of three 1/3" CCD BlueFox cameras from MatrixVision with 3.6 mm lenses, mounted in an L with a baseline of 0.2 m. The cameras are adjusted to be coplanar. Images are synchronized by hardware. The camera system is calibrated with the Camera Calibration Toolbox for Matlab [1]. An image size of 640x480 is used. We use a SwissRanger SR3000 from Mesa Imaging as a 3D TOF camera, with a modulation frequency of 20 MHz. This gives a non-ambiguity range of 7.5 m. A Microsoft Kinect sensor is used with the libfreenect driver available at [3]. The Kinect is calibrated with the RGB Demo software available at [2]. All data processing is made off-line in MATLAB.

B. Results
Figure 6 shows the three scenes used for evaluation. Scene one shows the test rig in a corridor where reflections from

the floor and shadows on walls are present. Scene two shows the test rig in an industrial setting with little disturbance from light and reflections, but the background is a corner. Scene three has a varied background. The test rig described in section IV is used in all scenes to evaluate the different types of sensors at distances of 3, 5 and 7 meters. The evaluation is qualitative, by visual inspection. A ground truth test rig is generated based on the information from the SICK laser and plotted as a point cloud. All sensor data are plotted in a point cloud and compared to ground truth. Figure 5 shows the point cloud, false and true positives, for the three sensors in scene 2 at 5 m. To aid the inspection, the Euclidean distance between each point and each item in the test rig is computed and plotted in a point cloud, i.e. the true positives. A point within 25 cm of an item in the ground truth is considered to match that item. This is a rough method but gives a hint of the quality of the sensor data. The point clouds in Figures 7, 8 and 9 show the points that match ground truth, i.e. the true positives. As can be seen, all sensors detect the items from the standard [6] as well as the flat item suggested by [8]. However, both the Kinect and the 3D TOF camera have problems detecting objects at 7 m. This seems reasonable since those sensors depend on the light emitted from the sensors, and the range is close to or above the maximum range of the sensor. The trinocular system is the only sensor that detects the thin structures, items D to H (the new thin structures). The trinocular system has problems detecting item C (the flat target proposed by [8]) in scene 3 (Figure 9, left), since the background has nearly the same color as the object. It also has problems detecting item H (the right vertical bar) in scene 3 (Figure 9). The system detects item H, but at a distance further than the 25 cm that is considered a match. A more well-manufactured test rig or a more realistic model, together with a better calibration, could solve this problem. However, an accurate distance measurement is not essential for the obstacle avoidance algorithm. From this test we cannot find any major difference in performance between the Kinect and the 3D TOF camera. Both the Kinect and the 3D TOF camera have problems detecting thin horizontal and vertical objects. Neither of them detects the ladder (item D) or the forks on the forklift (item E). They also have problems detecting the hanging cable. The main problem with the trinocular system is to detect larger flat objects (item C). At close ranges this is more pronounced (Figure 9), as the appearance of the objects becomes larger. This has to do with the fact that the sensor is optimized to detect edges, i.e. thin structures. However, a different detection scheme may be included in the vision algorithms to overcome this problem, e.g. using a dense disparity map.

VII. DISCUSSION AND CONCLUSIONS
In this paper we present a new and uniform way of evaluating 3D sensor performance. The evaluation procedure consists of a test rig of predefined objects that appear in an industrial setting. The objects in the industrial safety standard Safety of industrial trucks - driverless trucks and their systems EN1525 are extended with thin vertical and horizontal objects that represent a fork on a forklift, a ladder and a hanging cable.


Fig. 5. True and false positives (raw data) from the three sensors in scene 2 at 5 m. Trinocular (left), Kinect (middle), 3D TOF camera (right). The test rig is shown in red.

We compare different virtual bumpers to verify that the proposed test rig can be used to distinguish between different sensors' ability to detect thin structures. It is rare that standardized test specifications are used in research on mobile robots. Three different sensors (trinocular, Kinect and 3D TOF camera) were evaluated and all of them can detect the objects in the EN1525 safety standard. The Kinect and the 3D TOF camera show reliable results for those objects at distances up to 5 m. The trinocular system is the only sensor in the test that detects the thin structures; it detects all items in the test rig at all distances. However, a passive sensor like a camera is more sensitive to noise, and further work will focus on refining the trinocular algorithm.

REFERENCES
[1] Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/.
[2] Kinect RGB Demo v0.4.0. http://nicolas.burrus.name/.
[3] OpenKinect. http://openkinect.org/wiki/Main_Page.
[4] Player - cross-platform robot device interface & server. http://playerstage.sourceforge.net/.
[5] Technical description of Kinect calibration. http://www.ros.org/wiki/kinect_calibration/technical.
[6] Safety of industrial trucks - driverless trucks and their systems, Sep 1, 1997.
[7] J. Borenstein and Y. Koren. The vector field histogram - fast obstacle avoidance for mobile robots. IEEE Journal of Robotics and Automation, 7:278-288, 1991.
[8] W. Bostelman and R. Shackleford. Time of flight sensors experiments towards vehicle safety standard advancements. Draft submitted to the Computer Vision and Image Understanding special issue on Time of Flight Sensors, 2010.
[9] A. Broggi, C. Caraffi, P. P. Porta, and P. Zani. The single frame stereo vision system for reliable obstacle detection used during the 2005 DARPA Grand Challenge on TerraMax. In IEEE ITSC'06, pages 745-752, 2006.
[10] D. Burschka, S. Lee, and G. Hager. Stereo-based obstacle avoidance in indoor environments with active sensor re-calibration. In ICRA 2002, pages 2066-2072, 2002.
[11] G. N. DeSouza and A. C. Kak. Vision for mobile robot navigation: A survey. IEEE Trans. PAMI, 24(2):237-267, 2002.
[12] O. Faugeras. Three-dimensional computer vision: a geometric viewpoint. MIT Press, Cambridge, MA, USA, 1993.
[13] J. A. Hancock. Laser intensity-based obstacle detection and tracking. PhD thesis, Pittsburgh, PA, USA, 1999. AAI9936886.

[14] K. Hedenberg and A.-J. Baerveldt. Stereo vision for mobile robots. In B. E. Platin, M. Acar, and A. Erden, editors, Mechatronics 2004, the 9th Mechatronics Forum International Conference, ISBN 975-6707-13-5, pages 259-270, 30 Aug-1 Sept 2004.
[15] A. S. Hornby, S. Wehmeier, C. McIntosh, J. Turnbull, and M. Ashby. Oxford Advanced Learner's Dictionary of Current English. Oxford University Press, 2005.
[16] W. H. Huang and E. P. Krotkov. Optimal stereo mast configuration for mobile robots. In International Conference on Robotics and Automation, volume 3, pages 1946-1951, April 1997.
[17] W. H. Huang, B. R. Fajen, J. R. Fink, and W. H. Warren. Visual navigation and obstacle avoidance using a steering potential function. Robotics and Autonomous Systems, 54(4):288-299, 2006.
[18] S. Kumar, D. Gupta, and S. Yadav. Sensor fusion of laser & stereo vision camera for depth estimation and obstacle avoidance. International Journal of Computer Applications, 1(25):20-25, February 2010. Published by Foundation of Computer Science.
[19] R. Lange and P. Seitz. Solid-state time-of-flight range camera. IEEE Journal of Quantum Electronics, 37(3):390-397, March 2001.
[20] J. J. Leonard and H. F. Durrant-Whyte. Directed Sonar Sensing for Mobile Robot Navigation. Kluwer Academic Publishers, Norwell, MA, USA, 1992.
[21] J. Michels, A. Saxena, and A. Y. Ng. High speed obstacle avoidance using monocular vision and reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 593-600, New York, NY, USA, 2005. ACM.
[22] J. W. Mroszczyk. Safety practices for automated guided vehicles (AGVs). 2004.
[23] D. Murray and J. Little. Using real-time stereo vision for mobile robot navigation. Autonomous Robots, 2000.
[24] S. Nedevschi, R. Danescu, D. Frentiu, T. Marita, F. Oniga, C. Pocol, R. Schmidt, and T. Graf. High accuracy stereo vision system for far distance obstacle detection. In Intelligent Vehicles Symposium, 2004 IEEE, pages 292-297, June 2004.
[25] M. Okutomi and T. Kanade. A multiple-baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(4):353-363, 1993.
[26] C. D. Pantilie, S. Bota, I. Haller, and S. Nedevschi. Real-time obstacle detection using dense stereo vision and dense optical flow. In Intelligent Computer Communication and Processing (ICCP), 2010 IEEE International Conference on, pages 191-196, 2010.
[27] R. Bostelman, T. Hong, R. Madhavan, and T. Chang. Safety standard advancement toward mobile robot use near humans, 2005.
[28] D. D. Regan and R. R. Gray. Visually guided collision avoidance and collision achievement. Trends in Cognitive Sciences, 4(3):99-107, 2000.
[29] K. Sabe, M. Fukuchi, J.-S. Gutmann, T. Ohashi, K. Kawamoto, and T. Yoshigahara. Obstacle avoidance and path planning for humanoid robots using stereo vision. In Robotics and Automation, 2004. Proceedings. ICRA '04. 2004 IEEE International Conference on, volume 1, pages 592-597, April-May 2004.
[30] W. van der Mark and D. M. Gavrila. Real-time dense stereo for intelligent vehicles. IEEE Transactions on Intelligent Transportation Systems, 7(1):38-50, March 2006.
[31] Wikipedia. Three laws of robotics. http://en.wikipedia.org/wiki/Three_Laws_of_Robotics.


Fig. 6. The sensor performance was tested under three different scenarios likely to be found in an industrial setting: a corridor (left), with a corner background (middle) and with a workshop in the background (right).

Fig. 7. Point clouds from the first scenario at 7 m. Trinocular (left), Kinect (middle), 3D TOF camera (right). All sensors detect items 1-3. The trinocular system detects all items. Ground truth (red), sensor readings (blue), laser sensor (black).

Fig. 8. Point clouds from the second scenario at 5 m. Trinocular (left), Kinect (middle), 3D TOF camera (right). All sensors detect items 1-3. The trinocular system detects all items except item H at the distance where the ground truth is; item H is detected a little further away. Ground truth (red), sensor readings (blue), laser sensor (black).

Fig. 9. Point clouds from the third scenario at 3 m. Trinocular (left), Kinect (middle), 3D TOF camera (right). All sensors detect items 1-3. The trinocular system detects all items. Ground truth (red), sensor readings (blue), laser sensor (black).


[32] Todd A. Williamson. A high-performance stereo vision system for obstacle detection. PhD thesis, Pittsburgh, PA, USA, 1998. AAI9918622.


Adaptive Surveying and Early Treatment of Crops with a Team of Autonomous Vehicles
Wajahat Kazmi, Morten Bisgaard, Francisco Garcia-Ruiz, Karl D. Hansen, Anders la Cour-Harbo
Department of Architecture, Design and Media Technology, Aalborg University, Aalborg-9220, Denmark
Department of Agriculture and Ecology, University of Copenhagen, Taastrup-2630, Denmark
Department of Electronics Systems, Aalborg University, Aalborg-9220, Denmark

Abstract— The ASETA project (acronym for Adaptive Surveying and Early treatment of crops with a Team of Autonomous vehicles) is a multi-disciplinary project combining cooperating airborne and ground-based vehicles with advanced sensors and automated analysis to implement a smart treatment of weeds in agricultural fields. The purpose is to control and reduce the amount of herbicides, consumed energy and vehicle emissions in the weed detection and treatment process, thus reducing the environmental impact. The project addresses this issue through a closed-loop cooperation among a team of unmanned aircraft systems (UAS) and unmanned ground vehicles (UGV) with advanced vision sensors for 3D and multispectral imaging. This paper presents the scientific and technological challenges in the project, which include multivehicle estimation and guidance, heterogeneous multi-agent systems, task generation and allocation, remote sensing and 3D computer vision. Index Terms— multivehicle cooperation, multispectral imaging, precision farming, 3D computer vision.

I. INTRODUCTION
Weeds have always remained a major concern to farmers because they compete with crops for sunlight, water and nutrients. If not controlled, they can cause a potential loss to the monetary production value exceeding a global average of 34% [1]. Classical methods for weed removal are manual or mechanical, which are time consuming and expensive. Over the last few decades, herbicide application has been a dominant practice. Indiscriminate use of chemicals, on the other hand, is also detrimental to both the environment and the crop itself. Reduction of the use of pesticides in farming to an economically and ecologically acceptable level is one of the major challenges not just of developed countries but also of the developing countries of the world. Introducing an upper threshold on the amount of pesticides used does not necessarily serve the purpose. It must be accompanied by the knowledge of when and where to apply them. This is known as Site-Specific Weed Management (SSWM). For SSWM, the concept of precision farming scales down to field spots or patches [2] or even to plant scale [3]. This requires real-time intelligence on crop parameters, which significantly increases the complexity of modern production systems and therefore implies the use of automation through information technologies, smart sensors and decision support systems. Over the last five decades, the concept of agricultural automation has evolved from mechanization of manual labor into intelligent sensor-based fully autonomous precision farming

systems. It started with the automation of ground vehicles [4] and, over time, air vehicles also found their way in. Furthermore, advanced perception technologies such as machine vision have become an important part of agricultural automation, and 2D/3D image analysis and multispectral imaging have been very well researched in agriculture. Today, with advanced sensor technologies and both air and ground, manned and unmanned vehicles available on the market, each one with its own pros and cons, the choice has become broad. The technology is on par with most of the industrial demands, but what is needed is an optimal subset of technical attributes, since practice, particularly in agriculture, has usually been limited to the use of one type of vehicle with a limited sensor suite. The drawback of this scheme is that one type of vehicle is unable to satisfy all operational requirements. For example, an unmanned aircraft (UA) used to detect and apply spray to aquatic weeds compromises on spray volume, precision and duration of flight due to weight-size constraints [5], while a ground vehicle alone can significantly slow down the operation along with producing substantial soil impact [6], not to mention the problem of emissions. These constraints imply the use of a team of both air and ground vehicles for a holistic solution. Unmanned (ground) vehicles, being considerably smaller than manned vehicles, have less soil impact and fuel consumption (and thus reduced emissions) and may also be battery operated. Therefore, for economy of time and energy and for higher precision, a network of unmanned air and ground vehicles is inevitable and is destined to outperform conventional systems. Research has also been conducted on cooperative unmanned mixed robotic systems for both civil and military purposes; for example, [7] proposes a hierarchical framework for a mixed team of UAS and UGV for wildfire fighting, and the GRASP laboratory [8] used such systems in urban environments as part of the MARS2020 project. But apparently, no such strategy has been adopted in agriculture. To the best of the authors' knowledge, ASETA is the first project of its kind to use a team of both UAS and UGV in agriculture, which has opened a new chapter in precision farming; researchers, especially in the European Union, are taking increasing interest in such approaches (for example, the RHEA project [9]). This paper describes the scope of ASETA's scientific research, its heterogeneous robotic fleet and its sensor suite for SSWM. The paper is organized as follows: the project is described in section II, followed by an equipment summary in


section III. Main research areas of this project in the context of the related work are presented in section IV. Section V concludes the paper. II. ASETA ASETA (Adaptive Surveying and Early treatment of crops with a Team of Autonomous vehicles) is funded through a grant of 2 million EUR by the Danish Council of Strategic Research. It aims at developing new methods for automating the process of acquiring and using information about weed infestation for an early and targeted treatment. The project is based on the following four hypotheses:

1) Localized detection and treatment of weeds will significantly decrease the need for herbicides and fuel and thereby reduce environmental impact.
2) Such early detection can be accomplished by multi-scale sensing of the crop fields, by having UAS survey the field and then performing closer inspection of detected anomalies.
3) A team of UAS and UGV can be guided to make close-to-crop measurements and to apply targeted treatment on infested areas.
4) A team of relatively few vehicles can be made to perform high level tasks through close cooperation and thereby achieve what no one vehicle can accomplish alone.
The strategy adopted in ASETA (Fig. 1) is to survey crop fields using UAS in order to obtain and localize hotspots (infested areas) through multispectral imaging, followed by cooperative team action among a team of air and ground vehicles for a closer 3D visual inspection, leading to the treatment. The survey may be iterated depending on the team size and field dimensions. Obviously, ASETA's industrial gains come at the cost of certain technical and scientific challenges. A heterogeneous team of several unmanned vehicles is chosen to distribute heavy payloads on the ground vehicles (sensing, perception and treatment) and relatively lighter payloads (sensing and perception only) on the air vehicles, which is potentially a well-balanced approach but puts high demands on team cooperation and task management, keeping in view the constraints of each team member. A further complexity of the proposed system arises from the fact that, although computer vision is very popular and successful in plant inspection, changing weather and sunlight conditions have so far limited in-field agricultural vision systems [10]. These challenges must be addressed in order to produce an optimal combination of more than one type of unmanned vehicle that outperforms the conventional systems in the scope. Therefore, in order to achieve its goals, ASETA will carry forward scientific research in four directions, namely multispectral imaging, 3D computer vision, task management and multivehicle cooperation. The project started in January 2010. Major research work will be carried out from 2011 to 2013. Scientific research is being conducted by four postgraduates and several faculty staff involved at two Danish universities, the University of Copenhagen and Aalborg University. This collaborative work is a mixture of theory, simulations, and actual field tests. The latter is done in cooperation with the university farms at the University of Copenhagen, which will maintain a field of sugar beets throughout the growth seasons in 2011 to 2014. Since sugar beet is the crop-of-choice for the demonstrative part, Nordic Beet Research is also involved in the project.

III. EQUIPMENT
Some of the specialized equipment used in this project is described below:

Fig. 1. ASETA strategy.


A. Robotic Platforms
ASETA has three unmanned mobile robots available for the project. They are briefly described below.
1) UAS: The UAS comprises two rotary wing aircraft. The first UA is a modified Vario XLC helicopter with a JetCat SPTH-5 turbine engine (Fig. 2). The helicopter weighs 26 kg when fully equipped for autonomous flight and can fly for 30 minutes with 6 kg of fuel and 7 kg of payload. For autonomous flight, a NAV440 Inertial Navigation System (INS) from Crossbow is used together with an altitude sonar. The onboard computer is a Mini-ITX with a dual-core 1.6 GHz Intel Atom processor and runs a Debian Linux operating system. The flight time in this configuration is approximately 30 minutes. The second UA is a modified Maxi Joker-3 helicopter from MiniCopter. It is electrically powered and weighs 11 kg when equipped for autonomous flight (Fig. 2). The helicopter can fly for 15 minutes with a payload of 3 kg. It has an Xsens MTiG INS and sonar altimeters for autonomous flight and a Nano-ITX size 1.3 GHz onboard computer with a Debian Linux operating system. Each UA can be configured to carry the multispectral camera (see Section III-B) or a color camera. The sensors are mounted in a modified OTUS L205 gimbal from DST Control. The low level guidance, navigation, and control (GNC) system for the UAS is the baseline GNC software from Aalborg University's UAV lab1. It features a gain-scheduled optimal controller, an unscented Kalman filter for navigation and an advanced trajectory generator.
1 www.uavlab.org


Fig. 2. Autonomous vehicles in ASETA (from left): Vario XLC, Maxi Joker-3 and robuROC-4.

2) UGV: The ground vehicle is a robuROC-4 from Robosoft (Fig. 2). Running on electric power, this vehicle is designed for in-field use and will carry the TOF (see Section III-B) and color cameras for close-to-crop inspection. The total weight is 140 kg (without the vision system) and it is controlled by a standard laptop residing under the top lid, running the cross-platform robot device interface Player/Stage. The vehicle is equipped with RTK GPS to allow it to traverse the crop rows with sufficient accuracy.

B. Vision Systems
As described in section II, two different imaging systems will be used: one for remote sensing and another for ground-based close-to-crop imaging. For remote sensing a multispectral camera will be employed, and for ground-based imaging a fusion of Time-of-Flight and color images will be explored.
1) Multispectral Camera: The multispectral camera used in the project is a Mini MCA from Tetracam2 (Fig. 3). This specific sensor weighs 695 g and consists of six digital cameras arranged in an array. Each of the cameras is equipped with a 1.3 megapixel CMOS sensor with an individual band pass filter. The spectrometer filters used in this project are centred at 488, 550, 610, 675, 780 and 940 nm (bandwidths of 10 nm). The camera is controlled from the on-board computer through an RS232 connection and images are retrieved through a USB interface. Video output is also possible using the output video signal in the control connector.
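For illustration only, one common way to exploit such narrow bands for vegetation discrimination is a normalized difference index built from the near-infrared (780 nm) and red (675 nm) channels; the following is a generic sketch, not necessarily the index that ASETA will adopt:

import numpy as np

def ndvi(nir_780, red_675, eps=1e-6):
    """Normalized Difference Vegetation Index from two of the six
    Mini MCA bands (780 nm NIR and 675 nm red), given as arrays."""
    nir = nir_780.astype(np.float64)
    red = red_675.astype(np.float64)
    return (nir - red) / (nir + red + eps)

def vegetation_mask(nir_780, red_675, threshold=0.4):
    """Rough vegetation mask; the threshold is illustrative and would
    need to be calibrated against field data."""
    return ndvi(nir_780, red_675) > threshold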

Fig. 4. SwissRanger SR4000 TOF camera.

2) Time-of-Flight Camera: A time-of-flight (TOF) camera system has the advantage that depth information for a complete scene is captured with a single shot, thus taking care of the correspondence problem of stereo matching. In this project, Mesa Imaging's SwissRanger SR4000 USB camera3 will be used, which is an industrial grade TOF camera allowing high quality measurements in demanding environments. It operates in the Near-InfraRed (NIR) band (illumination wavelength 850 nm), hence a stable measurement accuracy and repeatability can be achieved even under variations in object reflectivity and color characteristics. The SR4000 can deliver a maximum frame rate of 50 frames/s. As is usually the case with TOF cameras, the resolution is fairly low (176 x 144 pixels), which will be augmented by fusion with high resolution color images.

IV. RESEARCH AREAS
The main scientific contributions will be generated by four research positions associated with the ASETA loop (Fig. 1): two PhD studies in the analysis and interpretation of images for the detection and treatment of weeds, and one PhD study and one Post Doc in task allocation and vehicle cooperation. They are briefly described below in the context of the state of the art.

A. Multispectral Aerial Imaging for Weed Detection
As already discussed in section I, SSWM involves spraying weed patches according to weed species and densities in order to minimize herbicide use. However, a common approach in SSWM is weed mapping in crops, which is still one of the major challenges. Remote sensing supplemented by targeted ground-based measurements has been widely used for mapping soil and crop conditions [11, 12]. Multispectral imaging at low and high spatial resolution (such as satellite and airborne) provides data for field survey and weed patch allocation but, depending on the system used, it varies in accuracy [13]. A higher level of spectral difference between plant and soil makes their separation relatively easy in a multispectral image.
3 www.mesa-imaging.ch

Fig. 3. Mini MCA multispectral camera.

2 www.tetracam.com


But the spectral ambiguity among plant species makes plant classification a difficult task. Thus, the spatial resolution of the sensor becomes an essential criterion for reliable vegetation discrimination, in order to detect the spectral reflectance in the least altered form and to avoid spectral mixing at the pixel level [14]. Therefore, the major requirements for robust aerial remote sensing for weed identification are a high spectral resolution with narrow spectral bands and the highest possible spatial resolution (normally limited by sensor technology) [15]. The high usability of multispectral satellite imagery from QuickBird (2.4 to 2.8 meter spatial resolution) in a sugar beet field for Cirsium arvense L. hotspot detection, for site-specific weed control with spot diameters larger than 0.7 m, was demonstrated by [16]. The relatively low spatial resolution, along with the inability to image the ground during cloudy conditions, makes such systems less suitable for analyzing in-field spatial variability. On the other hand, high resolution images (up to 0.707 mm/pixel) were acquired in a rice crop for yield estimation using a UA flying at 20 m [12]. Keeping this fact in view, in this project the choice of camera-equipped unmanned helicopters is made because they can be guided at lower altitudes above the crop canopy, in contrast to satellite and manned airborne systems, increasing image resolution and reducing atmospheric effects on thermal images [17, 18]. Images obtained from low altitudes will support accurate decision making for precision weed and pest management of arable, tree and row crops. The goal of aerial imaging in ASETA is to explore the potential of multispectral imaging involving multistage sampling for target detection, while employing spatial sampling techniques (stereology) for real-time density estimation. Stereology will be used for target sampling at various scales, using information from lower resolution images (high-altitude helicopter) to plant measurements at higher resolutions (low-altitude helicopter), to maximize information from sparse samples in real time while obeying the rules of probability sampling [19]. The maps of the field provide the basis for optimal designs of sampling locations over several spatial scales using variance reduction techniques [19].

B. 3D Computer Vision for Weed Detection
Multispectral aerial imaging will be able to detect hotspot locations and volumes, but on a macro level. It cannot resolve individual plants at the intra-row level. A ground based imaging system will thus be employed for close-to-crop inspection in this project. In agricultural automation, the expected outputs of a weed detection system are weed plant detection, classification and stem center localization. Ground based imaging is not new, but research has mainly focused on weeds at very early growth stages. There are two main reasons for this: an early detection will lead to an early treatment, and plant imaging and recognition is one of the most demanding tests of computer vision due to complicated plant structures, since the occlusion of crop and weed plants at later stages of growth prevents the proper visual separation of individual plants. While some efforts have shown promise under conditioned

environments such as greenhouses, the lack of robust resolution of occlusions remains a major challenge for in-field systems [20]. By utilizing 3D visual information it becomes possible to detect occlusions and make a better visual separation. Keeping this fact in view, the major objective of the ground based imaging in this project is to utilize 3D computer vision techniques in weed detection. There has been a significant amount of research work done towards 3D analysis of plants as well, but again this has mainly been aimed at navigation in the field, at estimating overall canopy properties through stereovision or at creating very detailed models of plants [10]. 3D modeling is computationally expensive and is potentially hampered by thin structures, surface discontinuities and a lack of distinct object points such as corners, ending up in the correspondence problem [21]. These limitations pose a major challenge for in-field real-time 3D analysis of plants. In order to address these problems, an active sensing system based on Time-of-Flight (TOF) technology will be used, which has been only scantily tested in agricultural applications, mainly due to a very high sensor cost. TOF has the drawbacks of low resolution and sensitivity to ambient light, but these problems have recently been addressed, and fusing the TOF depth map with a high resolution color image has shown very encouraging results, especially with parallelized computations, which significantly reduce the runtime [22]. The idea, therefore, is to use TOF data integrated with high resolution color images to perform in-field plant analysis. TOF technology has only recently found its way towards industrial applications, and in agricultural automation its utility assessment is quite fresh [23, 24, 25]. While 3D analysis is required for resolving occlusions and localization of the plant body, discrimination of weeds from crops is still another challenge. Pattern and object recognition techniques have been widely used in weed discrimination [26]. But most of the techniques use the color or size of the leaves (Leaf Area Index - LAI) as the prime feature. The size of the leaves, or the exposed area of the leaves, varies due to orientation, growth stage and weather conditions. Furthermore, variations in the soil conditions and the amount of sunlight can result in color variations. Instead, vision systems based on shape are less sensitive to variation in target object color [10]. In this project, a shape-based approach to distinguishing sugar beet crop plants from weeds will be used, for example [27]. In general, ASETA will contribute a new approach to weed identification by combining TOF technology with pattern recognition techniques, bringing the lab research to the field.

C. Task Management
The idea of Future Farms is that the farm manager should be able to, more or less, just press a button and then leave it until the process is finished. This demands that the system is capable of identifying the subtasks contained in this high-level command and ensuring their execution. These two processes are commonly known as Task Decomposition and Task Allocation. The task decomposition process is going to break down the overall task into small, manageable chunks that the individual


members (robots) of the system are able to execute. The decomposition depends on the combined set of capabilities of the members. For example, if a member has the capability to take very high resolution images, the initial images might be taken from high altitude and only a few overview images may be sufficient for mapping the entire field. Whereas, if only low resolution cameras are available, several overview images may be required. When the overall task has been decomposed into suitable subtasks, they must be distributed to each of the members in the system. This is known as Task Allocation. Several different approaches to this have been investigated. Two broad categories can be identified: centralized and distributed allocation. The centralized approach is essentially a matter of solving a multiple travelling salesman problem (m-TSP). The distributed approach divides the task of solving the TSP between the members. In this case the members must communicate with each other to make sure that two members are not planning to visit the same point (see section IV-D). The TSP has historically received a great deal of attention and has been shown to be NP-hard [28]; thus simple brute-force algorithms will not be practically usable in the system. The Lin-Kernighan heuristic [29] of 1971 is still one of the most preferred algorithms for solving TSPs, and holds the world record for solving the largest TSP [30]. A strategy to solve the TSP with timing constraints (TCTSP) is devised in [31]. Helicopters conducting a closer examination of the weed infestations in the ASETA scheme will experience a TCTSP, as the high altitude images will be taken over time and thus the close-up tasks are time constrained. Walshaw proposed a multi-level approach for solving the TSP [32]. This is relevant as the high-altitude helicopter process coarsens the TSP for the low-altitude helicopter, and thus gives a coarse representation of the low-level TSP free of charge. The decentralized approach relies on the members to distribute the tasks among themselves, without the intervention of a governing system. The MURDOCH allocation system uses an auctioning approach where each robot bids on the different tasks depending on its own perceived fitness for the task at hand [33]. The fitness assessment of the ALLIANCE architecture [34] is based on an impatience behavioral pattern. These approaches will not guarantee the optimal solution, but provide some robustness that might be missing in the centralized approach. The aim of the ASETA task management is to utilize existing TSP solving methods such as the Lin-Kernighan or Walshaw approaches and adapt them to the situation at hand, with the members gradually revealing more and more information as they move closer to the crops, from the high-altitude helicopter over to the low-altitude helicopter and down to the ground vehicle.

D. Multivehicle Cooperation
Close cooperation among team members (robots) is an important part of ASETA in order to ensure a safe and efficient execution of the tasks provided by the task management. The cooperation layer will determine which robot will tackle which task and, to some extent, in what order. In a situation where

D. Multivehicle Cooperation

Close cooperation among the team members (robots) is an important part of ASETA in order to ensure a safe and efficient execution of the tasks provided by the Task Management. The cooperation layer determines which robot will tackle which task and, to some extent, in what order. In a situation where a team of heterogeneous robots must cooperate in order to complete a task in an open-ended environment, it is crucial that each member has a clear understanding of its own as well as the other members' capabilities, because they are not equally qualified to handle a given task. In this project, the helicopters are equipped with several different types of sensors, including cameras (as described in section III) well suited for observation only, whereas the ground vehicle has an altogether different sensor suite and is meant for closer inspection and treatment. This information is to be used by every member to decide which part of the overall task it should handle and how to do it. To ensure a timely and efficient execution of the tasks, it is equally important for a robot to know what its team members are doing, i.e., their behavior, thereby ensuring that two members do not unnecessarily work on the same subtask. However, it is not always trivial to acquire such knowledge. The distances involved in field operations can become very large and may thus allow only limited communication. Furthermore, reducing the necessary communication among members makes backwards compatibility easier, which is preferable in an industrial product. Therefore, the members must be able to deduce this knowledge from very limited information, such as the state (position, orientation, and velocity) of the other members. This puts fewer constraints on the robots that are allowed to participate in the cooperation. In fact, even robots without any cooperative capabilities can be part of the system, as long as they can share their state with the rest of the team.

Current research in cooperative control of multivehicle systems focuses mainly on the control element, such as formation control or distributed optimization. A comprehensive review of recent research in cooperative control can be found in [35]. Only few projects have taken the limited communication between robots into account (for example [36] or [37]). In this project, the actual cooperation layer is created as a decentralized two-level approach:

1) Level 1: Acquiring team behavioral information: The challenges of this level are seen primarily as a model-based estimation problem, which will be solved using particle filtering. This is done through the formulation of a behavioral modeling framework which describes the different possible behaviors of the members. When used in a particle filter, it is capable of determining the maximum likelihood hypothesis, i.e., the best fitting behavior of the observed team members.

2) Level 2: Task execution: Each member is assumed to contain a low-level navigation and control system as well as simple trajectory planning. As high-level control, a receding horizon is used in the form of a decentralized Model Predictive Controller (MPC). The MPC on each member attempts to find an optimal behavioral action to take, given information about the current behavior of the rest of the team.

In short, the ASETA cooperation scheme will use particle filtering and model predictive control to implement cooperation between loosely coupled robots.
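As a toy illustration of Level 1, the following Python sketch weights a set of particles, each carrying one hypothesized behavior with its own simple motion model, against noisy position observations of a team member and resamples them. The two behaviors, their motion models and all noise levels are invented for this example and are not ASETA's behavioral modeling framework.

import random, math

BEHAVIORS = {"survey": (0.5, math.pi / 2),   # (speed [m/s], heading [rad]) -- illustrative
             "transit": (2.0, 0.0)}

def predict(particle, dt=1.0):
    """Propagate one particle (behavior, x, y) with its behavior's motion model plus noise."""
    behavior, x, y = particle
    v, th = BEHAVIORS[behavior]
    x += v * math.cos(th) * dt + random.gauss(0, 0.2)
    y += v * math.sin(th) * dt + random.gauss(0, 0.2)
    return (behavior, x, y)

def weight(particle, z, sigma=1.0):
    """Likelihood of the observed position z = (x, y) under this particle."""
    _, x, y = particle
    d2 = (x - z[0]) ** 2 + (y - z[1]) ** 2
    return math.exp(-d2 / (2 * sigma ** 2)) + 1e-12

def resample(particles, weights):
    return random.choices(particles, weights=weights, k=len(particles))

if __name__ == "__main__":
    # The teammate actually transits east at 2 m/s; we only see its noisy positions.
    observations = [(2.0 * t + random.gauss(0, 0.3), random.gauss(0, 0.3)) for t in range(1, 11)]
    particles = [(random.choice(list(BEHAVIORS)), 0.0, 0.0) for _ in range(500)]
    for z in observations:
        particles = [predict(p) for p in particles]
        w = [weight(p, z) for p in particles]
        particles = resample(particles, w)
    votes = {b: sum(p[0] == b for p in particles) for b in BEHAVIORS}
    print("most likely behavior:", max(votes, key=votes.get), votes)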


V. CONCLUSION

ASETA will not only produce high quality research in multispectral imaging, computer vision and multivehicle systems, but also aims at developing an actual demonstrator. Working within the price range of other farming machinery and using off-the-shelf hardware throughout enhances the likelihood of the tools developed in this project being adopted by industry. The long-term objective of ASETA is a commercially available autonomous multi-scale surveying system for site-specific weed management that reduces the cost and environmental impact of farming chemicals, fuel consumption and emissions. It therefore holds the potential for significant impact on the future of precision farming worldwide. Given the rising levels of atmospheric CO2 and temperatures under climate change, weed species are expected to show a higher growth pattern than crops due to their greater genetic diversity [38]. At the same time, governments mandate considerable reductions in the use of pesticides. Both facts add importance and promise to such projects. Although dealing with a system of heterogeneous vehicles increases the complexity of the system, it also gives the user flexibility in the choice of vehicles and sensors from a wide range, producing a more customized solution to the application at hand. ASETA therefore has, in the long run, a future beyond agriculture in several other applications such as fire fighting, search & rescue and geological surveying.

REFERENCES
[1] E. Oerke, Crop losses to pests, The Journal of Agricultural Science, vol. 144, no. 01, pp. 3143, 2006. [2] S. Christensen, T. Heisel, A. M. Walter, and E. Graglia, A decision algorithm for patch spraying, Weed Research, vol. 43, no. 4, pp. 276 284, 2003. [3] M. Ehsani, S. Upadhyaya, and M. Mattson, Seed location mapping using RTK GPS, Trans.-American Society of Agricultural Engineers, vol. 47, no. 3, pp. 909914, 2004. [4] K. Morgan, A step towards an automatic tractor, Farm mech, vol. 13, no. 10, pp. 440441, 1958. [5] A. H. G kto an, S. Sukkarieh, M. Bryson, J. Randle, T. Lupton, o g and C. Hung, A rotary-wing unmanned air vehicle for aquatic weed surveillance and management, J. Intell. Robotics Syst., vol. 57, pp. 467 484, January 2010. [6] V. Rusanov, Effects of wheel and track trafc on the soil and on crop growth and yield, Soil and Tillage Research, vol. 19, no. 2-3, pp. 131 143, 1991. [7] C. Phan and H. H. Liu, A cooperative UAV/UGV platform for wildre detection and ghting, in 2008 Asia Simulation Conference - 7th International Conference on System Simulation and Scientic Computing, (Beijing), pp. 494498, Ieee, Oct. 2008. [8] L. Chaimowicz, A. Cowley, D. Gomez-Ibanez, B. Grocholsky, M. Hsieh, H. Hsu, J. Keller, V. Kumar, R. Swaminathan, and C. Taylor, Deploying air-ground multi-robot teams in urban environments, vol. III, pp. 223 234. Springer, 2005. [9] RHEA: Robot Fleets for Highly Effective Agriculture and Forestry Management, http://www.rhea-project.eu/, accessed 08-Apr-2011. [10] C. L. McCarthy, N. H. Hancock, and S. R. Raine, Applied machine vision of plants: a review with implications for eld deployment in automated farming operations, Intelligent Service Robotics, vol. 3, pp. 209217, Aug. 2010. [11] K. R. Thorp and L. F. Tian, A review on remote sensing of weeds in agriculture, Precision Agriculture, vol. 5, no. 5, pp. 477508, 2004. [12] K. C. Swain, S. J. Thomson, and H. P. W. Jayasuriya, Adoption of an unmanned helicopter for low-altitude remote sensing to estimate yield and total biomass of a rice crop, Trans. of the ASABE, vol. 53, pp. 21 27, Jan-Feb 2010.

[13] M. S. Moran, Y. Inoue, and E. M. Barnes, Opportunities and limitations for image-based remote sensing in precision crop management, Remote Sensing of Environment, vol. 61, pp. 319346, Sep 1997. [14] D. W. Lamb and R. B. Brown, Remote-sensing and mapping of weeds in crops, Journal of Agricultural Engineering Research, vol. 78, pp. 117125, Feb 2001. [15] R. B. Brown and S. D. Noble, Site-specic weed management: sensing requirements - what do we need to see?, Weed Science, vol. 53, pp. 252 258, Mar-Apr 2005. [16] M. Beckes and J. Jacobi, Classication of weed patches in quickbird images: Verication by ground truth data, EARSeL eProceedings, vol. 5(2), pp. 173179, 2006. [17] J. A. J. Berni, P. J. Zarco-Tejada, L. Suarez, and E. Fereres, Thermal and narrowband multispectral remote sensing for vegetation monitoring from an unmanned aerial vehicle, IEEE Transactions on Geoscience and Remote Sensing, vol. 47, pp. 722738, Mar 2009. [18] R. Sugiura, N. Noguchi, and K. Ishii, Remote-sensing technology for vegetation monitoring using an unmanned helicopter, Biosystems Engineering, vol. 90, pp. 369379, Apr 2005. [19] D. Wulfsohn, Sampling techniques for plants and soil: In. advanced engineering systems for specialty crops: A review of precision agriculture for water, chemical, and nutrient application, and yield monitoring., Landbauforschung V lkenrode, vol. Special Issue 340, pp. 330, 2010. o [20] D. Slaughter, D. Giles, and D. Downey, Autonomous robotic weed control systems: A review, Computers and Electronics in Agriculture, vol. 61, pp. 6378, Apr. 2008. [21] a. Piron, F. Van Der Heijden, and M. Destain, Weed detection in 3D images, Precision Agriculture, pp. 116, Nov. 2010. [22] B. Huhle, T. Schairer, P. Jenke, and W. Straer, Fusion of range and color images for denoising and resolution enhancement with a non-local lter, Comp. Vision and Image Understanding, vol. 114, pp. 13361345, Dec. 2010. [23] G. Alenya, B. Dellen, and C. Torras, 3D modelling of leaves from color and ToF data for robotized plant measuring, in Proc. of the International Conference on Robotics and Automation, (accepted), 2011. [24] M. Kraft, N. Regina, S. a. D. Freitas, and A. Munack, Test of a 3D Time of Flight Camera for Shape Measurements of Plants, in CIGR Workshop on Image Analysis in Agriculture, no. August, (Budapest), 2010. [25] A. Nakarmi and L. Tang, Inter-plant Spacing Sensing at Early Growth Stages Using a Time-of-Flight of Light Based 3D Vision Sensor, in ASABE Meeting Presentation, no. 1009216, 2010. [26] M. Weis and M. S kefeld, Detection and Identication of Weeds, ch. 8, o pp. 119134. Dordrecht: Springer Netherlands, 2010. [27] M. Persson and B. Astrand, Classication of crops and weeds extracted by active shape models, Biosys. Eng., vol. 100, pp. 484497, Aug. 2008. [28] M. R. Garey and D. S. Johnson, Computers and Intractability. W. H. Freeman and Co., 1979. [29] S. Lin and B. Kernighan, An effective heuristic algorithm for the traveling-salesman problem, Operations research, vol. 21, no. 2, pp. 498516, 1973. [30] K. Helsgaun, General k-opt submoves for the Lin-Kernighan TSP heuristic, Mathematical Programming Computation, vol. 1, pp. 119 163, July 2009. [31] E. K. Baker, An exact algorithm for the time-constrained traveling salesman problem, Operations Res., vol. 31, pp. 938945, Apr. 1983. [32] C. Walshaw, A multilevel approach to the travelling salesman problem, Operations Research, vol. 50, pp. 862877, Sept. 2002. [33] B. P. Gerkey and M. J. 
Mataric, Sold!: auction methods for multirobot coordination, IEEE Transactions on Robotics and Automation, vol. 18, pp. 758768, Oct. 2002. [34] L. E. Parker, Alliance: An architecture for fault tolerant, cooperative control of heterogeneous mobile robots, in Proc. IROS, pp. 776783, 1994. [35] R. M. Murray, Recent Research in Cooperative Control of Multivehicle Systems, Journal of Dynamic Systems, Measurement, and Control, vol. 129, no. 5, pp. 571583, 2007. [36] G. M. Hoffmann, S. L. Wasl, and C. J. Tomlin, Distributed cooperative search using information-theoretic costs for particle lters with quadrotor applications, in Proceedings of the AIAA Guidance, Navigation, and Control Conference, pp. 2124, 2006. [37] K. S. Alessandro Arsie and E. Frazzoli, Efcient Routing Algorithms for Multiple Vehicles With no Explicit Communications, IEEE Transactions on Automatic Control, vol. 54, no. 10, pp. 23022317, 2009. [38] L. Ziska and G. Runion, Future weed, pest and disease problems for plants, pp. 261287. Boca Raton FL: CRC Press, 2007.


A Differential Steering System for Humanoid Robots


Shahriar Asta and Sanem Sariel-Talay
Computer Engineering Department, Istanbul Technical University, Istanbul, Turkey
{asta, sariel}@itu.edu.tr

Abstract- Biped locomotion for humanoid robots is a challenging problem that has come into prominence in recent years. As the number of degrees of freedom of a humanoid robot approaches that of humans, the need for better, more robust, more flexible and simpler maneuverability becomes inevitable for real or realistic environments. This paper presents a new method of controlling the trajectory of a humanoid robot on the basis of differential steering systems. To the best of our knowledge, this is the first time that such a technique has been applied to humanoid robots. It is shown empirically that a humanoid robot can plan its trajectory by using the same principle applied in differential steering systems, and that the change in its orientation can easily be calculated in degrees. Therefore, this method enables the robot to move along a desired curve-shaped trajectory, instead of aligning itself with the target prior to walking or performing more complex and time-consuming motions like diagonal or side walking. This method is also beneficial when further trajectory planning constraints such as obstacle avoidance are considered.

Keywords- Humanoid robots, Differential Steering Systems, Biped Walk, Gait

I. INTRODUCTION

Wheeled design has been a proper choice for robot motion since the early days of robotics research, due to its relatively easy mechanical implementation and balance. However, wheeled robots are not suitable for all terrains. Therefore, the research focus has shifted towards more complex designs such as legged robots, and particularly biped robots. For environments in which the amount of consumed time and energy is considerable in the performance assessment of a robot, the ability to maneuver without decreasing the speed while maintaining the walking motion type is much more efficient. Several different motion controllers and trajectory planning methods are used for humanoid robots. As an example, [1] introduced a ZMP-based motion controller for a humanoid robot which uses fuzzy logic. In the ZMP approach, the main objective is to design the robot's motion in such a way that the zero moment point (the point where the total inertia force equals zero) does not exceed a predefined stability region. [2] uses a Neuro-Fuzzy biped walking model based on the ZMP model. Center of Mass (COM) based methods are another popular trajectory control approach [3], [4]. All these methods suffer from being design-specific, based on the underlying humanoid robot structure. In this research, we focus on a novel trajectory planning method designed especially for a humanoid robot to move along curve-shaped trajectories. We have adapted the motion model from [5] but extended it to curve-shaped trajectories. The proposed curved walking model introduced for a humanoid robot uses the same fundamental principle behind the differential steering system of a wheeled robot. One of the most interesting properties of this method is its independence from the underlying humanoid robot platform. This new method also ensures an energy-efficient trajectory compared to some other design-specific methods, since some joints (i.e., the hip roll joints) are not used in this model to change the orientation. As an example, in [1] and [2] the hip roll joints are used to change the motion direction of the robot. [1] uses fuzzy logic for a gait primitive generator in order to construct a curved path, and [2] smoothes the desired path based on the curve that guarantees optimum ZMP values within the trajectory. Using these roll joints may help in sharp turns over short distances, but results in more energy consumption over long distances. For an autonomous robot operating in a realistic world, where the robot is not constantly plugged into an energy source but carries a battery instead, following a desired curved path without using all joints is much more efficient. On the other hand, this approach promises a shorter traveling distance without a sudden change in the field of view. The rest of the paper is organized as follows. The underlying motion fundamentals are needed to clarify the rest of the model; that is why we introduce our motion model first. Then, an overview of differential steering systems is given, and the two concepts are combined into a differential steering system for humanoid robots, which is presented in the subsequent section. Finally, we present the experimental results and conclude the paper.

II. MOTION MODEL

In this paper, we present a model which is independent from the humanoid robot structure. However, the model is particularly illustrated and experiments are conducted on the simulated Aldebaran humanoid NAO robot. NAO has 22


degrees of freedom, of which only 12 are used in our motion model. The height of the robot is 57 cm and its weight is 4.3 kg [6]. Since the simulated Nao robot (Fig. 1(a)) is a realistic representation of the Nao humanoid robot (Fig. 1(b)), its joint structure (Fig. 1(c)) is the same as that of the real one.


\theta_j(t) = C_j + A_j \sin\!\left(\frac{2\pi}{T}\,t + \phi_j\right), \qquad j = 1, \dots, 12 \qquad (2)

(one oscillator per leg joint used in the gait; the pitch joints of the right leg are shifted in phase by \pi)

Figure 1. The Aldebaran Nao. a) Simulated Nao robot in the Simspark environment, b) Nao robot at the RoboCup Standard Platform League [7], c) Nao joint structure [6].

This PFS model ensures that the right and left feet of the robot alternately perform the swing and support roles. Note that this alternation is achieved by adding a phase shift of \pi to the pitch joints of the right foot. The above-mentioned oscillators give the desired angle of each joint at a specific time. In order to control the robot with these joint angles, we use a simple control method:
V_J = J - J' \qquad (3)

where J is a joint's calculated (desired) angle value from (2) and J' is that joint's current angle value.

Sending the outcome to the robot's motion interface causes the joints to move with the desired speed to the desired values. The robot can move with a maximum speed of nearly 50 cm/sec using this motion model. The produced gait is plotted graphically in Fig. 2.

The human walking motion can be modeled by a smoothed rolling polygonal shape and, accordingly, by a periodic function. In this sense, one can use the principle of Partial Fourier Series (PFS) to decompose the periodic function of the bipedal walk into a set of oscillators [8]. Assigning these oscillators to the joints of a humanoid robot enables one to develop a gait for the robot. According to [5], the main periodic function of the bipedal walk can be formulated as follows:

f(t) = C + \sum_{n=1}^{N} A_n \sin\!\left(\frac{2\pi n}{T}\,t + \phi_n\right) \qquad (1)
Figure 2. The analysis of the step length and height values for forward walking at 50 cm/s. The step height is 5 cm on average and the step length is 12 cm on average.

where N is the number of frequencies (degrees of freedom used in the gait definition), C is the offset, A_n are the amplitudes, T is the period and \phi_n are the phases. Decomposing the main periodic function using PFS gives the per-joint oscillators shown in (2).
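A minimal Python sketch of this oscillator scheme is given below. The joint set, offsets, amplitudes and phases are placeholder values rather than tuned gait parameters, and the command computation follows the simple difference rule of (3).

import math

T = 0.5          # gait period [s] -- illustrative value
DT = 0.02        # control cycle [s]

# Per-joint oscillator parameters (offset C, amplitude A, phase phi) -- invented numbers;
# the real gait uses one tuned triple per pitch/roll joint of each leg.
JOINTS = {"l_knee_pitch": (0.6, 0.4, 0.0),
          "r_knee_pitch": (0.6, 0.4, math.pi)}   # right leg shifted by pi

def desired_angle(joint, t):
    """PFS oscillator: theta(t) = C + A * sin(2*pi*t/T + phi)."""
    C, A, phi = JOINTS[joint]
    return C + A * math.sin(2 * math.pi * t / T + phi)

def joint_command(joint, t, current_angle):
    """Difference between desired and current angle, sent as the joint command (cf. (3))."""
    return desired_angle(joint, t) - current_angle

if __name__ == "__main__":
    current = {j: 0.0 for j in JOINTS}
    for step in range(5):
        t = step * DT
        for j in JOINTS:
            current[j] += joint_command(j, t, current[j])   # stand-in for the motion interface
        print("t=%.2fs" % t, {j: round(a, 3) for j, a in current.items()})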

III. DIFFERENTIAL STEERING SYSTEMS

This paper presents a differential steering system for humanoid robots, and it is based on the kinematic model for differential drive robots. For convenience, the common


notation for differential steering systems is repeated here. A differential steering robot is mounted on two wheels on a single axle. Each wheel is controlled and powered independently, which provides both the drive and the steering function. When both wheels turn in the same direction at the same speed, the robot follows a straight line. When the wheel speeds differ but have the same direction, the robot tracks a curved path towards the slower wheel. Equal speeds in opposite directions make the robot pivot [9]. Robots driven by a differential steering system obey a simple kinematic model which provides the trajectory path and the desired wheel speeds for a specific trajectory. The following equations give the displacement of the wheels [9]:

s_L = \theta\left(r - \frac{b}{2}\right) \qquad (4)
s_R = \theta\left(r + \frac{b}{2}\right) \qquad (5)
s_M = \frac{s_L + s_R}{2} \qquad (6)

where r is the turn radius, b is the axle length and \theta is the angle of the turn in radians; s_L and s_R are the distances traveled by the left and right wheels, respectively, and s_M is the displacement of the center point of the main axle. The following equations are used to update the pose of a differential drive robot along its trajectory. Note that, since the robot is assumed to travel at a fixed, constant speed, acceleration and deceleration effects are ignored in these equations.

x(t) = x_0 + \frac{b\,(V_R + V_L)}{2\,(V_R - V_L)}\left[\sin\!\left(\frac{(V_R - V_L)\,t}{b} + \theta_0\right) - \sin\theta_0\right] \qquad (7)
y(t) = y_0 - \frac{b\,(V_R + V_L)}{2\,(V_R - V_L)}\left[\cos\!\left(\frac{(V_R - V_L)\,t}{b} + \theta_0\right) - \cos\theta_0\right] \qquad (8)
\theta(t) = \frac{(V_R - V_L)\,t}{b} + \theta_0 \qquad (9)

where V_L and V_R are the linear velocities of the left and right wheels, respectively, and [x_0, y_0, \theta_0] is the initial pose of the robot. Based on this background, the next section shows that the same principle can be successfully applied to a humanoid robot for the construction of a curved trajectory with a given turn amount.
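The following Python sketch evaluates the pose update (7)-(9) for constant wheel speeds. The usage example already substitutes V_L = V(1 + CA) and V_R = V(1 - CA) to anticipate the humanoid case of the next section; all numeric values are chosen for illustration only.

import math

def pose_update(x0, y0, th0, v_l, v_r, b, t):
    """Differential-drive odometry for constant wheel speeds over time t (Eqs. (7)-(9))."""
    if abs(v_r - v_l) < 1e-9:                      # straight-line special case
        return x0 + v_l * t * math.cos(th0), y0 + v_l * t * math.sin(th0), th0
    th = (v_r - v_l) * t / b + th0                 # Eq. (9)
    k = b * (v_r + v_l) / (2.0 * (v_r - v_l))
    x = x0 + k * (math.sin(th) - math.sin(th0))    # Eq. (7)
    y = y0 - k * (math.cos(th) - math.cos(th0))    # Eq. (8)
    return x, y, th

if __name__ == "__main__":
    V, CA, b = 0.5, 0.058, 0.11     # speed [m/s], curve amount, leg distance [m] -- illustrative
    v_l, v_r = V * (1 + CA), V * (1 - CA)
    for t in (1.0, 5.0, 10.0):
        x, y, th = pose_update(0.0, 0.0, 0.0, v_l, v_r, b, t)
        print("t=%4.1fs  x=%.2f  y=%.2f  heading=%.1f deg" % (t, x, y, math.degrees(th)))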

IV. THE DIFFERENTIAL STEERING SYSTEM FOR HUMANOID ROBOTS

We propose a method to control the trajectory of a humanoid robot on the basis of differential steering systems. Before the design of this new model, a thorough analysis of the robot's foot trajectories was made. When the joint trajectories of the walking model are analyzed, it is easy to see that each foot performs a semi-circular motion around the axle of the hip of the robot. This is as expected, since the motion model ensures that every joint moves according to a sinusoid shifted by an offset particular to that joint. The trajectory of each joint and that of its counterpart in the neighboring leg are complementary to each other. Moreover, each pitch joint of the right leg follows the same trajectory as that of the left leg, but with a phase shift of \pi. As a consequence, the half circle drawn by the sinusoid of each joint of the left leg (the swinging foot) is completed by the half circle drawn by the sinusoid of the same joint of the right leg (while the left leg is the support leg), and vice versa. This behavior is illustrated in Fig. 3 for the knee joints of both legs.

Figure 3. Joint trajectories for the knee joints.

As can be seen in this figure, the motion starts with a start-up phase in which the speed is increased constantly over time until it reaches its maximum. This is done for all joints to prevent instability of the robot due to the initial acceleration. During the walk, once the speed reaches its maximum at approximately 50 cm/sec, it is kept constant. To obtain a curved path with a humanoid robot, the corresponding joints of the robot's legs should be moved by different angles in the same direction. Since the amount of movement of each leg in a given amount of time then differs, the foot speeds, which can be interpreted as wheel speeds, differ as well. The following general equations describe the value of a joint angle for a curved trajectory of a humanoid robot:

\theta_{j,L}(t) = C_j + A_j \sin\!\left(\frac{2\pi}{T}\,t + \phi_j\right)(1 + CA) \qquad (10)
\theta_{j,R}(t) = C_j + A_j \sin\!\left(\frac{2\pi}{T}\,t + \phi_j\right)(1 - CA) \qquad (11)

where CA is the curve amount in radians, applied to the left and right joints symmetrically to create different joint angle values for the left and right legs in the same direction. Note that the symmetry is crucial for the mathematical simplification of the model, which is illustrated through equations (12)-(14) in this section.


Different joint angles cause the left leg to make a larger step than the right leg when CA is positive, and vice versa. Adding the time parameter to the problem results in different step speeds for the left and right legs. Fig. 4 shows the effect of applying a curve amount of 0.058 to a normal step when the robot travels at its maximum speed.

Figure 4. The effect of applying the Curve Amount variable on the foot trajectory.

The important question here is whether the proposed model obeys the mathematical fundamentals of differential steering systems when applied to a humanoid robot. A convincing answer can be given by rewriting Equations (7)-(9) in terms of the CA variable. Taking the proportional control model in (3) into consideration, we can rewrite (10) and (11) for all joints as follows:

V_{J_L} = (J_L - J'_L)(1 + CA), \qquad V_{J_R} = (J_R - J'_R)(1 - CA) \qquad (12)

J_L is a specific joint's calculated angle value for the left leg and J'_L is that joint's current angle value; J_R and J'_R are interpreted in the same way for the right leg. It is clear that without applying the CA variable the speeds of the left and right legs are equal, and the default joint angular speeds are calculated via the following equation:

V_{J_L} = V_{J_R} = V = J - J'

So we can again rewrite the equations for V_L and V_R as follows:

V_L = V\,(1 + CA), \qquad V_R = V\,(1 - CA)

Replacing V_L and V_R in (9) and carrying out the mathematical simplifications, which are possible because CA is applied to the left and right legs symmetrically, we obtain the following equation:

\theta(t) = \theta_0 - \frac{2\,CA\,V}{b}\,t \qquad (13)

where b is the distance between the legs of the robot. This equation is in fact the relationship between the CA variable and the robot's orientation change. Using the same method we can rewrite (7) and (8) as follows:

x(t) = x_0 - \frac{b}{2\,CA}\left[\sin\theta(t) - \sin\theta_0\right] \qquad (14)
y(t) = y_0 + \frac{b}{2\,CA}\left[\cos\theta(t) - \cos\theta_0\right] \qquad (15)

By using equations (13)-(15) we can calculate the new position of a humanoid robot based on differential steering odometry.

V. EXPERIMENTAL RESULTS

In this paper, we have used the realistic Simspark simulation environment as our test platform. Simspark is the official simulator for RoboCup competitions and uses ODE (Open Dynamics Engine) for physics simulation [6]. Our experiments show that the CA value is bounded by a maximum of 0.15 for stable curved walking at a constant speed of 50 cm/sec. The validity of the approach has been proven empirically and the boundary values have been determined on the simulated Nao robot. The determined values were successfully used by the beeStanbul team in the RoboCup 3D simulation league competitions in Singapore, 2010 [10]. A sample illustration of the simulated Nao robot's path is given in Fig. 5. In this figure, the robot tracks a cyclic path with the CA value set to 0.058. The related video can be found at [11].

Figure 5. The robot completes a full turn using the differential steering system.

We have also investigated the relationship between the CA value and the speed of the robot. It has been observed through simulations that the threshold of CA is closely related to the walking speed. In our experiments, we have used 0.0775 (maximum speed) as our speed coefficient in Eq. (2), and decreased this coefficient linearly with respect to CA. As the speed decreases, the maximum boundary for CA increases. As the empirical analysis illustrates, the


differential steering system is successfully simulated by using the proposed model for a humanoid robot. The graphs in Fig. 6 show the effect of increasing the Curve Amount variable on the stability of the robot, both for forward walking and backward walking (the speed is set to a constant value of 0.0775):

Figure 6. The effect of increasing CA on stability: (a) forward walking, (b) backward walking.

For both forward and backward walking, the increase in the Curve Amount value leads to a very slight increase in instability, which is normal due to the nature of the motion. After a certain threshold (nearly 0.15 for forward and 0.17 for backward walking) the robot starts to swing like a pendulum sideways and back and forth, and then falls. Before reaching that threshold, the instability is fairly small. The difference between the maximum CA threshold for forward and backward walking comes from the fact that in both motion types the trunk of the robot is inclined forward. In backward walking, the direction of the motion vector (and hence the created force vector) is opposite to the direction of inclination, and this makes backward walking more stable. The stability of the robot is measured by aggregating the gyro values in the x, y and z directions. The gyro is mounted at the torso of the robot.

VI. CONCLUSION AND FUTURE WORK

In this work, we have shown both theoretically and empirically that the fundamentals of differential steering robots can be applied to biped robots, whose odometry can then be calculated by the mathematical rules of differential steering systems. The empirical analysis of the proposed model is given for the simulated Nao robot in the Simspark environment. Although the analysis is given for a specific robot type, the model presented here can be extended to any humanoid robot type. As the experimental results illustrate, the curved walking behavior of a humanoid robot can be successfully obtained by using our extended PFS model involving new parameters for curved walking. Future work includes utilizing the benefits of wheeled robot dynamics for humanoid robots on rough terrain, stairs, slopes, etc.

VII. REFERENCES

[1] Zhou Yanjun, Fuzzy Omnidirectional Walking Controller for the Humanoid Soccer Robot, IEEE-RAS International Conference on Humanoid Robots, 2009 [2] Ferreira Joao Paulo, Crisostomo Manuel, Coimbra A.Paulo , Neuro-Fuzzy ZMP Control of a Biped Robot, Proceedings pf the 6th WSEAS International Conference on Simulation, Modeling and Optimization, 2006. [3] Zhihua Cai, Zhenhua Li, Zhuo Kang and Yong Liu, Omnidirectional Motion Control for the Humanoid Soccer Robot, Computational Intelligence and Intelligent Systems 4th International Symposium, ISICA 2009, China, 2009. [4] Graf.C, Hrtl.A,Rfer.T, Laue.T , A Robust Closed-Loop Gait for the Standard Platform League Humanoid, IEEERAS International Conference on Humanoid Robots, 2009 [5] H.Picado, M.Gestal, N.Lau, L.P.Reis, A.M.Tome. Automatic Generation of Biped Walk Behavior Using Genetic Algorithms, Proceedings of the 10th International Conf. on Artificial Neural Networks: Part I: Bio-Inspired Systems: Computational and Ambient Intelligence. 2009 [6] Boedecker Joschka, Dorer Klaus, Rollman Markus, Xu Yuan, Xue Feng, Buchta Marian, Vatankhah Hedayat,Simspark Manual , 2010 [7] Aldebaran Official Web Site : http://www.aldebaranrobotics.com/ [8] Shafii, N., Javadi, M.H., Kimiaghalam B.: A Truncated Fourier Series with Genetic Algorithm for the control of Biped Locomotion. In: Proceeding of the 2009 IEEE/ASME International Conference on advanced intelligent Mechatronics, pp. 1781--1785, 2009 [9] J.Borenstein , H.R.Everett , L.Feng, Where am I? Sensors and Methods for Mobile Robot Positioning, University of Michigan, 1996. [10] S. Sariel-Talay, beeStanbul Team Description Paper, Robocup, 2010. [11] http://air.cs.itu.edu.tr/beeStanbul



Annelid - a Novel Design for Actuated Robots Inspired by Ringed Worms Locomotion
Christian Mandel Udo Frese Department of Mathematics and Computer Science - FB3, University of Bremen, Bremen, Germany

Abstract This paper describes the development of a simulation software that advances the idea of a robotic device inspired by the locomotion of ringed worms. Its basic design is made up of an elongated body that is composed of a spring-style skeleton coated by a flexible skin. Our principal approach is to choose a shape memory alloy material for the skeleton, resulting in different spring forces exerted by the body under varying temperatures. The overall approach requires the elastic skin to prestress the whole body in its rest position, so that it can spatially extend when thermal energy is induced into the system, and relax when an inbuilt air cooling mechanism dissipates the heat.

Index Terms simulation, snake-like robot, shape memory alloy

I. INTRODUCTION

Like their biological archetypes, snake-like robots impress with a great variability of locomotion. In contrast to robotic devices based on wheeled, legged, or track-driven designs, their ability to move by rhythmically undulating their elongated body with its small diameter qualifies them for application scenarios in rough terrain with potential needs to traverse gaps and loopholes. Hopkins et al. survey in [5] the major snake-like robot designs since the early 1970s, which differ in their implemented locomotion gait, i.e., lateral undulation, concertina, sidewinding, and rectilinear progression. Erkmen et al. have proposed a particularly attractive application for snake-like robots, namely to search for victims in the remains of a collapsed building [4]. A snake-like robot could move through tiny holes and caverns, yet still negotiate much larger obstacles and gaps. It could also be deployed through a hole drilled into a concrete slab, which is a common practice in urban search and rescue to gain access into the rubble pile. In this particular application, power supply, computation, and teleoperation could even be realized externally by letting the snake's tail stay outside the rubble pile, a technique that has proved essential even with conventional intervention robots [7]. Technically, most snake robots are realized by a long sequence of joints actuated by geared electrical motors. Compared to its biological counterpart, this design appears complicated and somehow not adequate considering the visual appearance of a snake as a moving continuous shape with infinitely many degrees of freedom. This gap triggered our approach to investigate a continuous actuation mechanism that could provide a large number, i.e., infinitely many, degrees of freedom. Our idea is based on a shape memory alloy (SMA) spiral [10], intended as a spring. Due to the SMA effect the spring expands when heated and can be contracted by an

Fig. 1. Illustration of Annelid within its envisioned application scenario of a collapsed building. Under the flexible yellow skin, segments of the blue spring-style skeleton, made of shape memory alloy (SMA), are covered by red coil-like wires, allowing for thermal heating of parts of the skeleton. The skeleton is cooled by cool air flowing through the core. In doing so, we can control the skeleton's local stiffness by the SMA effect. This leads to an expansion or contraction of the spring and, if applied asymmetrically, to a bending. Locomotion is realized (in simulation) based on these capabilities. The green boards illustrate the electronics driving the heating coils, attached to central backbone wires (magenta) for power supply and control.

external pretension force when cooled down. Menciassi et al. have already proved rectilinear motion capabilities with a long robotic crawler that applies an SMA backbone [6]. Instead of integrating the SMA material as a spine, we imagine a spiral forming the outer skeleton of the robot, wrapped by an elastic skin that encloses the robot and provides the necessary pretension. By heating or cooling the spring locally, local parts of the robot expand and contract, e.g., generating a peristaltic motion to provide propulsion. By heating or cooling the spring on one side only, the spring should expand or contract on that side only and hence bend. This is the mechanism to steer the robot and to adapt its shape to the motion and the environment, e.g., to control where it contacts the ground. Figure 1 illustrates this envisioned technical concept. We expect the main motion capability to be longitudinal, expanding and contracting the spring, whereas we expect bending to be less effective. Hence, our idea is to imitate the locomotion of ringed worms (cf. Sec. V), which use a longitudinal peristaltic motion, not of snakes, which primarily use a bending motion. Further, since ringed worms have a much simpler nervous system than snakes, we hope


that longitudinal locomotion may be easier to control. Obviously, technically building such a robot is challenging, in particular concerning the integration of the heating wires, sensors, and electronics in the spring's free inside. So, our first step is to investigate in a detailed simulation whether locomotion is possible given the technical readiness of the system. We report on this investigation here. The paper's two contributions comprise a novel actuation concept for a worm-like robot and a detailed simulation of the robot, covering SMA material behavior including hysteresis, thermal flow for heating and cooling, elasticity of the outer skin, and robot-ground interaction.

The remainder of this paper is structured as follows. In the rest of this section we give an overview of the essential properties of SMAs. Sec. II describes the simulation framework used to investigate our design approach. In Sec. III we detail the constructional components of the simulated Annelid, i.e., its skeleton, skin, and the heating/cooling approach used. We continue in Sec. IV with a characterization of the implemented austenite-martensite hysteresis model, before Sec. V describes our algorithmic approach to control common movement patterns. We conclude in Sec. VI with a summary of the results achieved and a description of future work that addresses the remaining issues.

A. Shape Memory Alloy

The unique properties of SMA were first observed in 1951 for an Au-Cd alloy, and in 1963 for a Ti-Ni alloy [9]. Dutta et al. describe the main effect as the alloy's ability to regain its original undeformed state when heated to a high temperature after being deformed at a low temperature [2]. The high-temperature austenite phase (AS) and the low-temperature martensite phase (MS) share the same chemical composition and atomic order, but different crystallographic structures. While AS phases are dominated by highly symmetric structures, MS phases are characterized by low-symmetry monoclinic structures [12]. SMAs behaving as described are subject to the so-called one-way memory effect (OWME). These materials can acquire a two-way memory effect (TWME) by a cyclic loading procedure called training [1]. The TWME shows an additional active transformation during MS phases. Both OWME and TWME can exert large forces against external resistance; e.g., a NiTi (nickel/titanium) alloy wire can lift up to 16 pounds during its AS phase. It may be noted that in recent years ferromagnetic shape memory alloys (FSMA) have triggered various works concentrating on fundamental research [14]. In these alloys the shape-memory effect is triggered by a magnetic field, not by temperature. A magnetic field can be switched on and off quickly, allowing for a higher response frequency compared to temperature-controlled SMA. So, these materials are expected to allow for new application scenarios in the years to come, and clearly the approach we present here would also benefit from FSMA becoming available.

II. SIMULATION OVERVIEW

Being an intermediate step between the pure idea of an artificial worm-like robotic device and the construction of a


Fig. 2. Illustration of single segments from Annelid's spring-like skeleton. Two consecutive segments are interconnected by a restricted 6DOF joint, depicted by the joint's coordinate frame. While all translational degrees of freedom, as well as the rotations around the joint's y-axis (green) and z-axis (blue), are locked, the succeeding segment can rotate (twist) around the x-axis (red) of the connecting joint, hereby inducing a restoring force within the joint. Twisting of this joint in the simulation corresponds to torsion of the SMA wire, which is the main mode of strain encountered in a compression spring.

first prototype, the Annelid simulation focuses on the interplay of the physical forces exerted by Annelid's locomotion mechanism and its surrounding environment. For this purpose we make use of the NVIDIA PhysX SDK [8], which provides, among other things, the necessary infrastructure for simulating rigid body dynamics, cloth simulation, and collision detection. Besides these available components, we provide additional simulation components that handle the thermodynamic calculations for Annelid's skeleton and heating/cooling system, as well as a movement generator that builds upon the distinct effects of Annelid's shape-memory-alloy skeleton (including martensite-austenite hysteresis effects).

III. CONSTRUCTION OF SIMULATED ANNELID

A. Annelid's Skeleton

We model Annelid's shape-giving skeleton by a sequence of capsules, i.e., cylindrical elements with hemispheres at the two ends, that are arranged in the form of a common compression spring (cf. Fig. 8 for an illustration, and Table I for the morphological and physical parameters of the simulated spring). Two consecutive capsules are connected by a partially restricted 6DOF joint such that a force applied to the spring produces the usual spring-like behavior of the whole system (cf. Fig. 2 for an illustration). When a capsule twists out of its rest position around the x-axis of the joint that connects it with its predecessor, a restoring force is induced within this joint. PhysX models this restoring force by a spring force that can be parametrized with appropriate spring, damping, and restitution coefficients. It may be noted that this restoring force is the connecting link between the simulated skeleton and the shape memory alloy's property to exert a force that pushes the material to its default shape during AS temperature phases. To be exact, we couple the average thermal energy of two neighboring capsules with the spring force simulated within the connecting joint (cf. Sec. IV).
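The following Python sketch illustrates this coupling in isolation: the spring coefficient of each joint is interpolated between a soft (martensite) and a stiff (austenite) value according to the mean temperature of its two capsules. The linear interpolation is only a stand-in for the hysteresis model of Sec. IV, and all constants are placeholders.

# Stand-in stiffness interpolation between the martensite (soft) and austenite (stiff)
# states of the SMA skeleton; the real model uses the hysteresis of Sec. IV instead.
K_MARTENSITE = 0.2    # joint spring stiffness in the cold state -- illustrative
K_AUSTENITE = 5.0     # joint spring stiffness in the hot state  -- illustrative
T_MS_END, T_AS_END = 290.0, 330.0   # assumed transformation interval [K]

def joint_stiffness(temp_a, temp_b):
    """Spring coefficient of the joint between two capsules, from their mean temperature."""
    t = 0.5 * (temp_a + temp_b)                       # average thermal state of both capsules
    frac = (t - T_MS_END) / (T_AS_END - T_MS_END)     # position inside the AS/MS workspace
    frac = min(1.0, max(0.0, frac))
    return K_MARTENSITE + frac * (K_AUSTENITE - K_MARTENSITE)

if __name__ == "__main__":
    capsule_temps = [295.0, 300.0, 320.0, 335.0, 310.0]   # one value per capsule [K]
    for i in range(len(capsule_temps) - 1):
        k = joint_stiffness(capsule_temps[i], capsule_temps[i + 1])
        print("joint %d-%d: spring coefficient %.2f" % (i, i + 1, k))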


TABLE I
SUBSET OF INITIAL MORPHOLOGICAL AND PHYSICAL PARAMETERS OF ANNELID'S SKELETON

length along main axis: 0.3 m
radius: 0.04 m
number of windings: 15
segments/winding: 8
wire radius: 3.5 mm
density: 7440
specific heat capacity: 470
thermal conductivity coefficient: 42.6
remissivity coefficient: 0.5
austenite start temperature, austenite end temperature, martensite start temperature, martensite end temperature


B. Annelid's Skin

Annelid's spring-style skeleton is coated with an elastic skin such that the whole compound forms a tube through which the cooling air can funnel (cf. Sec. III-C). Further reasons for the skin are the sheltering of Annelid's inner electronics that control the heating and cooling system, as well as the prestressing of the skeleton. The latter allows for expansion of the skeleton under heating, and contraction in cooling-down phases (cf. Sec. IV). We model Annelid's skin by a single PhysX cloth element. Intended for simulating stretchable matter such as clothing material, cloths are represented by a mesh of particles interconnected by springs that mimic constraints between these particles (cf. Fig. 3). Besides general quantities for thickness, density, and static and dynamic friction, PhysX cloth elements are described by parameters that model the behavior of the cloth springs. These include bending stiffness, stretching stiffness, and a damping coefficient. Unfortunately, the current PhysX SDK version realizes collision detection for cloth elements by solely testing for contact between the cloth particles and other objects, e.g., the ground plane. Future versions are announced to implement cloth collision detection on the basis of the triangles that make up the cloth patch. To moderate this inaccuracy we currently disable the collision check of Annelid's skin and get by with a collision test between the skeleton's surface and the ground.

C. Annelid's Heating and Cooling System

The primary locomotion mechanism of Annelid builds upon the temperature-sensitive properties of the shape-memory-alloy skeleton (cf. Sec. IV). For this reason we have modeled a heating and cooling mechanism that approximates the key ideas that are to be realized in a first physical prototype.

D. Heating System

For heating up chosen parts of the spring-style skeleton we plan to attach coiled filaments to the whole framework (cf. the exposed heating wires in Fig. 1). By applying a voltage to given sections of these heating elements, ohmic resistance causes the induction of thermal energy into the spring. We modeled this behavior by simply allowing for bringing in
Fig. 3. Illustration of Annelid's elastic skin, which is modeled by a single PhysX cloth element. The orange-colored cloth element coats the purple-colored skeleton and is made up of a mesh structure. An exemplary patch from this cloth mesh is given by the red-colored cloth points and the green-colored interconnecting springs that enforce the spatial constraints between neighboring cloth points.

certain amounts of energy per simulation time step into a single skeleton capsule. A second mechanism is given by the formulation of a list of desired temperatures for the whole list of capsules that make up the skeleton. During operation, each capsule's current temperature is measured, and a control loop takes care of maintaining the desired temperature level. Using this approach it is possible to specify and maintain a certain temperature baseline, including a systematic description of regions with more or less thermal energy. While simple to encode in simulation, a real-world prototype would have to be equipped with temperature sensors along the whole skeleton.

E. Cooling System

The cooling mechanism of Annelid basically builds on a fan-powered stream of air. For the first real-world prototype we envision an ordinary personal computer housing fan to be integrated into the tail of Annelid. In order to assess whether this type of cooling mechanism is able to dissipate a sufficient amount of thermal energy per time, we have modeled the three physical effects involved [13].

1) Thermal Radiation: The first effect describes the property of all matter to emit electromagnetic radiation, thus losing thermal energy. In the simulation of Annelid, thermal radiation (1) is one of the two effects that account for an exchange of energy between Annelid's skeleton and the cooling air.

\Delta Q_{\mathrm{rad}} = \varepsilon\,\sigma\,A\,T^4\,\Delta t \qquad (1)

with \varepsilon being the remissivity, ranging between 0 for a completely reflective material and 1 for a true black body, \sigma the Stefan-Boltzmann constant, A the body's radiating surface, T the body's absolute temperature, and \Delta t the observed period of time.


Fig. 4. Cooling behavior of a single segment of Annelid's skeleton. Each line depicts the temperature over time after the segment is heated with a fixed amount of thermal energy. Within each of the three temperature bundles (red, green, blue) the curves differ in the speed of the cooling air (top, middle, bottom). The three bundles themselves differ in the initial exhaust air temperature and in the radius.

2) Heat Conduction: The second effect is given in (2) and describes the propagation of thermal energy between neighboring skeleton segments with different temperatures at the same time.

\Delta Q_{\mathrm{cond}} = \lambda\,\frac{A}{d}\,(T_1 - T_2)\,\Delta t \qquad (2)

with \lambda being the coefficient of thermal conductivity, A/d describing the ratio of the contact surface to the thickness of the matter through which the thermal energy passes, T_1 and T_2 the absolute temperatures of the two points between which we observe thermal conduction, and \Delta t the observed period of time.

3) Thermal Transfer: Just as thermal radiation, the third effect (3) is responsible for an exchange of thermal energy between the skeleton and the cooling air. It is based on the exchange of thermal energy between solid matter and a circulating fluid. Thermal transfer can heat up solid matter or dissipate heat, subject to the given temperature gradient.

\Delta Q_{\mathrm{trans}} = \alpha\,A\,(T_{\mathrm{body}} - T_{\mathrm{fluid}})\,\Delta t \qquad (3)

with \alpha being the thermal transfer coefficient, dependent on the body's surface structure and the fluid's velocity, A the contact surface between the body and the fluid, T_{\mathrm{body}} and T_{\mathrm{fluid}} the absolute temperatures of the body and the fluid, and \Delta t the observed period of time.

Fig. 5. Simulated austenite-martensite hysteresis (normalized strain over temperature) of Annelid's shape-memory-alloy skeleton. During the exemplary austenite phase the metallic alloy exerts a large force that rearranges the shape of the skeleton towards its stable and stiff state. During the exemplary martensite phase the material can be deformed by an external force, e.g., by the prestressing force exerted by Annelid's elastic skin. Inner hysteresis loops are reached when the temperature decreases before the austenite end temperature has been reached, or when the temperature increases before the martensite end temperature has been reached, respectively.

For a given cooling air speed, we use the textbook formula from [13] for the thermal transfer coefficient \alpha (Eq. (4)), which distinguishes between low and high air speeds.
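A compact Python sketch of the resulting per-time-step energy bookkeeping for a chain of capsules is given below. It applies (1)-(3) in an explicit update and reuses the remissivity, heat capacity and conductivity values of Table I, while the geometry factors, the capsule mass and the transfer coefficient alpha are placeholders (the latter would follow from (4)).

SIGMA = 5.670e-8          # Stefan-Boltzmann constant [W/(m^2 K^4)]
EPS = 0.5                 # remissivity (Table I)
LAMBDA = 42.6             # thermal conductivity coefficient (Table I)
A_SURF = 1.0e-4           # radiating/contact surface per capsule [m^2] -- placeholder
A_OVER_D = 5.0e-3         # contact surface / thickness between capsules [m] -- placeholder
ALPHA = 25.0              # thermal transfer coefficient [W/(m^2 K)] -- placeholder for Eq. (4)
HEAT_CAP = 470.0 * 0.001  # specific heat capacity x capsule mass [J/K] -- placeholder mass

def step(temps, t_air, dt):
    """One explicit update of all capsule temperatures using Eqs. (1)-(3)."""
    d_q = [0.0] * len(temps)
    for i, t in enumerate(temps):
        d_q[i] -= EPS * SIGMA * A_SURF * t ** 4 * dt            # (1) thermal radiation
        d_q[i] -= ALPHA * A_SURF * (t - t_air) * dt             # (3) thermal transfer to the air
        if i + 1 < len(temps):                                  # (2) conduction to the neighbour
            q = LAMBDA * A_OVER_D * (t - temps[i + 1]) * dt
            d_q[i] -= q
            d_q[i + 1] += q
    return [t + dq / HEAT_CAP for t, dq in zip(temps, d_q)]

if __name__ == "__main__":
    temps = [340.0, 320.0, 300.0, 300.0]      # initial capsule temperatures [K]
    for _ in range(100):                       # 100 steps of 10 ms
        temps = step(temps, t_air=293.0, dt=0.01)
    print(["%.1f" % t for t in temps])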

Fig. 4 shows exemplary temperature curves for a single skeleton capsule that is initially heated up and then cooled down according to (1)-(3).

IV. AUSTENITE-MARTENSITE HYSTERESIS OF SIMULATED ANNELID

In Sec. I-A we described the essential properties of shape memory alloys. The most important effect is an active strain pushing the material into its original shape during high-temperature (AS) phases, and a deformability by external forces during low-temperature (MS) phases. This effect turns out to be a non-linear relationship between the material's thermal energy and the internally exerted force. According to [15], it can be modeled as a hysteretic relation with cubic slopes (cf. Fig. 5 for an illustration). For any given fraction of the material's AS/MS workspace, i.e., the interval ranging from the martensite end to the austenite end temperature mapped to [0, 1], we compute the coefficient of the resulting force as in (5).

V. LOCOMOTION OF SIMULATED ANNELID

So far we have described the structural layout of simulated Annelid, its heating and cooling mechanism, as well as the



Fig. 6. Illustration of the common earthworm's locomotory system. The depicted shape corresponds to the sagittal plane of an earthworm oriented to the left of the image.

Fig. 7. Illustration of simulated forward movement with a superimposed left turn. Annelid's skeleton bends to the left due to the selective heating of the left flanking segments.

Fig. 8. Illustration of simulated forward movement from the top frame to the bottom frame. The color of the skeleton segments denotes their actual temperature, ranging from red (high temperature) to blue (low temperature). It can be seen that the actual forward movement (to the left in the images) occurs between frames 1 and 2, where the front segments expand.

shape memory alloy's reaction to varying thermal energy, based on the AS/MS hysteresis. In this section we describe the interaction of all the components described, resulting in controllable movement of Annelid. The basic principle of locomotion mimics the movement pattern of the common earthworm, i.e., Lumbricidae, a ringed worm of the order Oligochaeta within the clade of Annelida [3]. Its anatomy can be characterized by several sphincters forming its tubular body, combined with longitudinal muscles (cf. Fig. 6 for an illustration). During basic forward movement, a peristaltic wave moves in the opposite direction through the body. This wave provides for alternating body sections in which sphincters decrease/increase the body's diameter, and longitudinal muscles elongate/contract the body's length. During this process, body segments with increased diameter gain the necessary ground friction through little abdominal bristles.

A. Basic Forward Movement

Inspired by the basic concept of the earthworm's forward movement, simulated Annelid's locomotion is driven by a temperature curve that travels in caudal (front-to-back) direction through the body. Windings of the skeleton with higher temperature exert a larger force within the joints that connect two consecutive segments, thus stretching the subsequent windings and reducing their diameter along the way. Compared with this, windings with lower temperature are contracted, leaving these parts of the spiral with an increased diameter and isolated points of ground contact (cf. Fig. 8 for an

illustration). We modeled the initial desired temperature of Annelid's skeleton by a base temperature plus a superimposed sinusoidal temperature curve. For the chosen simulation parameters as given in Table I, we add a fixed maximum temperature offset at the peaks of the desired temperature curve, so that the skeleton segments are subject to a temperature interval that corresponds to available SMA operation ranges [11]. Fig. 9 depicts the resulting trajectory of simulated forward movement.

B. Bending - Sidewards Movement

A basic forward movement can be changed into a left/right turn by adding thermal energy to the lateral flanking skeleton segments1 on top of the skeleton's temperature curve necessary for forward movement (as described in the previous subsection). In doing so, simulated Annelid's body bends into the direction of the heated flank (cf. Fig. 7 for an illustration). We calculate the overall desired amount of thermal energy for every flanking segment such that it equals the maximal desired skeleton segment energy necessary for forward movement, plus a constant \Delta Q_c:

\Delta Q_{\mathrm{flank}} = \max_i \Delta Q_{\mathrm{forward},i} + \Delta Q_c \qquad (6)

Fig. 9 shows that a variation of \Delta Q_c results in a changing curvature of the traveled path.
1 A real-world prototype of Annelid has to be equipped with inertial measurement units that indicate the current orientation of the skeleton segments.
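The following Python sketch generates such a desired-temperature profile: a base temperature plus a sinusoid travelling from head to tail, with an additional offset on the segments of one flank when a turn is requested. The number of segments and the segments-per-winding value follow Table I, whereas the temperatures, the wavelength, the wave speed and the crude selection of flank segments are placeholders, not the values used in the experiments.

import math

N_SEG = 120            # 15 windings x 8 segments/winding (Table I)
T_BASE = 300.0         # base temperature [K] -- placeholder
T_AMP = 30.0           # peak of the superimposed sinusoid [K] -- placeholder
WAVELEN = 40           # wavelength of the travelling wave in segments -- placeholder
WAVE_SPEED = 20.0      # segments per second the wave travels towards the tail -- placeholder
DQ_BEND = 15.0         # extra temperature offset on the heated flank [K], cf. Eq. (6)

def desired_temperature(seg, t, bend_left=False):
    """Desired temperature of one segment: travelling sinusoid plus optional flank offset."""
    phase = 2 * math.pi * (seg - WAVE_SPEED * t) / WAVELEN   # crest moves head (seg 0) -> tail
    temp = T_BASE + T_AMP * max(0.0, math.sin(phase))        # only the positive half-wave heats
    if bend_left and seg % 8 in (0, 1):                      # crude pick of left-flank segments
        temp += DQ_BEND
    return temp

if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0):
        profile = [desired_temperature(s, t, bend_left=True) for s in range(N_SEG)]
        print("t=%.1fs head segments:" % t, ["%.0f" % v for v in profile[:6]])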



Fig. 9. Trajectories resulting from simulated forward movement (red), left turns (blue), and right turns (green). Initially oriented along the positive x-axis, Annelid's forward movement (trajectory 1) results from a sinusoidal temperature wave travelling from its head to its tail (cf. Sec. V-A). Note that the robot does not move in parallel to the x-axis due to currently uncontrolled friction while the robot is initially heated up. Curved movement is triggered by additionally heating up the skeleton segments that belong to Annelid's flank close to the center of curvature (cf. Sec. V-B). Trajectories (3,4) and (2,5) were obtained with different values of the constant \Delta Q_c in (6).

Fig. 10. Minimal (blue) and maximal (red) temperature of Annelid's skeleton during simulated forward movement, as depicted in trajectory 1 in Fig. 9. The green curve depicts the normalized force exerted by an arbitrary but fixed joint connecting two consecutive skeleton segments.

REFERENCES
[1] Z. Bo and D. Lagoudas. Thermomechanical Modeling of PolycrystallineSMAs under Cyclic Loading, Part I: theoretical derivations. International Journal of Engineering Science, 37(9):10891140, 1999. [2] S.M. Dutta and F.H. Ghorbel. Differential Hysteresis Modelling of a Shape Memory Alloy Wire Actuator. IEEE Transactions on Mechatronics, 10(2):189197, 2005. [3] C.A. Edwards and P.J. Bohlen. Biology and Ecology of Earthworms. Chapman & Hall, edition, 1996. [4] I. Erkmen, A.M. Erkmen, F. Matsuno, R. Chatterjee, and T. Kamegawa. Snake robots to the rescue! IEEE Robotics & Automation Magazine, 9(3):1725, 2002. [5] J.K. Hopkins, B.W. Spranklin, and S.K. Gupta. A survey of snakeinspired robot designs. Bioinspiration & Biomimetics, 4(2), 2009. DOI: 10.1088/1748-3182/4/2/021001. [6] A. Menciassi, D. Accoto, and S. Gorini. Development of a biomimetic miniature robotic crawler. Autonomous Robots, 21:155163, 2006. [7] R.R. Murphy. Trial by re activities of the rescue robots at the world trade center from 11-21 september 2001. Robotics and Automation Magazine, 11(3):50 61, 2004. [8] NVIDIA Corporation, 2701 San Thomas Expressway Santa Clara, CA 95050, USA. NVIDIA PhysX for Developers, 2010. http: //developer.nvidia.com/object/physx.html. [9] K. Otsuka and C.M. Wayman, editors. Shape Memory Materials. Cambridge University Press, 2002. [10] SAES Getters, Viale Italia 77, 20020 Lainate (Milan), Italy. SmartFlex Compression and Tensile Springs, 2010. http:// www.saesgetters.com/documents/compression%20and% 20tensile%20springs%20datasheets_low_1487.pdf. [11] SAES Getters, Viale Italia 77, 20020 Lainate (Milan), Italy. SmartFlex Wire Product Datasheet, 2010. http://www.saesgetters.com/ documents/smartflexproductdatasheet_low_1486.pdf. [12] N.Ma. Song and H-J. Lee. Position control of shape memory alloy actuators with internal electrical resistance feedback using neural networks. Smart Materials and Structures, 13:777783, 2004. [13] H. St cker. Taschenbuch der Physik. Harri Deutsch Verlag, Frankfurt o am Main, Germany, 2010. [14] J. Sui, Z. Gao, H. Yu, Z. Zhang, and W. Cai. Martensitic and magnetic transformations of high-temperature ferromagnetic shape memory alloys. Scripta Materiala, 59(8):874877, 2008. [15] Z. Zhu, J. Wang, and J. Xu . Modeling of Shape Memory Alloy Based on Hysteretic Nonlinear Theory. Applied Mechanics and Materials, 44 47:537541, 2011.

VI. CONCLUSIONS AND FUTURE WORK

In this work we have presented a novel design for snake-like robots, based on a spring-style skeleton made of SMA. The main contribution of this paper is a feasibility study in terms of a simulation framework that addresses constructional details such as the modeling of the skeleton's spring-like behavior, the description of a prestressing flexible skin, and a thermodynamic simulation of the heating and cooling mechanism involved. Simulation results show that the proposed design can produce stable forward movement, as well as trajectories with controllable curvature.

A. Future Work

The next step in the realization of the presented ideas is to set up a physical workbench version of Annelid. In this configuration we have to verify the feasibility of the presented heating and cooling mechanism with respect to several aspects. First, we have to ensure the controllability of the skeleton's thermal energy within a tight range of operation, since overheating may permanently damage the SMA's austenite effect. A second task that will be investigated with this mock-up is the selection of an appropriate material for Annelid's skin. According to our presented simulation results, this material must at least resist, and isolate from, skeleton temperatures of around . This requirement is hard to meet, since we need a large temperature difference between the skeleton and the cooling air in order to dissipate thermal energy fast enough.

ACKNOWLEDGMENT

This work has been partially funded by the Deutsche Forschungsgemeinschaft in the context of the SFB/TR8 Spatial Cognition.


SLAM à la carte – GPGPU for Globally Consistent Scan Matching


Andreas Nüchter, Seyedshams Feyzabadi, Deyuan Qiu, Stefan May
Abstract– The computational complexity of SLAM is large and constitutes a challenge for real-time processing of a huge amount of sensor data with the limited resources of a mobile robot. Often, notebooks are used to control a mobile system, and even these computing devices nowadays have graphics cards which allow general-purpose computation using many cores. SLAM à la carte (graphique) exploits these capabilities and carries out 3D scan registrations on the GPU. A speed-up of more than one order of magnitude for precise 3D mapping is reported.

I. INTRODUCTION

Many basic robot tasks require a globally consistent representation of the environment. Many modern robots are equipped with 3D scanners that work according to one of the following principles: the time-of-flight principle using pulsed or continuously emitted laser light, triangulation using projections of laser patterns, or the projection of structured light, i.e., light coding. These sensors acquire 3D point clouds in a local coordinate frame from their surroundings. Local representations have to be matched to build a
A. Nüchter is with the Automation Group within the School of Engineering and Science of the Jacobs University Bremen gGmbH, 28759 Bremen, Germany, a.nuechter@jacobs-university.de. S. Feyzabadi is with DFKI Bremen Robotics Innovation Center, Germany, shams.feyzabadi@dfki.de. D. Qiu is with the Shanghai Quality Supervision & Inspection Center for Medical Devices, China (PRC), deyuan.qiu@googlemail.com. Stefan May is with the Georg-Simon-Ohm University of Applied Sciences, Nürnberg, Germany, stefan.may@ohm-hochschule.de.

global map. The well-known iterative closest point algorithm (ICP) [2] puts two independently acquired 3D point clouds into one frame of reference. Instead of performing the scan matching based on extracted features, it uses closest points in Cartesian coordinates and an iterative procedure to compute the registration. In previous work we have extended the ICP algorithm to globally consistent 3D scan matching [3], [16]. To yield a highly precise 3D mapping algorithm, it differs in an important aspect from other state-of-the-art simultaneous localization and mapping (SLAM) procedures. It repeatedly matches closest points of 3D scans stored in a pose graph to build a linear system of equations, which is also solved repeatedly to update the pose estimates. Thus, globally consistent 3D scan matching performs GraphSLAM in every iteration. Unlike Grisetti et al., who claim in [10, page 9] that "[. . .] the time to compute the linear system is negligible compared to the time to solve it", we experienced otherwise and strongly object (cf. Fig. 1). The data association is the bottleneck in scan matching based SLAM: Imagine matching n point clouds, each containing N 3D points. Usually it holds that N >> n. For example, typical 3D scans taken by the robot Kurt3D [17] contain up to 300,000 points and a typical 3D scan acquired by our robot Irma3D contains 3,000,000 points, while typical scenes contain only a few thousand scan poses. Finding nearest neighbors can be accomplished on average in O(N log N) time by using search trees [8]
[Fig. 1 pie charts: timing breakdowns for GraphSLAM as given in [3, Fig. 1] (nearest neighbors search, covariance computation, other computations), for the split between scan relaxation and ICP, and for the ICP algorithm [2] (pair assignment, transformation estimation); nearest neighbor search dominates in each case.]

Fig. 1. Timing of GraphSLAM based globally consistent scan matching. The initial sequential ICP alignment is refined by an iterative procedure (cf. [16]). Most of the time is spent in nearest neighbor search, which is carried out on a k-d tree.


while solving the linear system is in O(n³). However, since the system of equations usually refers to a sparse graph, the latter Landau notation turns out not to be relevant, since a sparse Cholesky decomposition [6] is applicable. Therefore, speeding up the data association, i.e., the search for correspondences, remains a central point. To this end, we have suggested in [15] to employ OpenMP to compute closest points in a parallel fashion. Of course, for feature based SLAM the number of equations grows with every associated feature and then the SLAM back-end becomes the bottleneck. This paper presents results obtained by incorporating the immense computational power of GPUs (Graphics Processing Units) into our SLAM framework and into 3DTK – The 3D Toolkit (http://threedtk.de). The increasing programmability of graphics cards enables general-purpose computing on graphics processing units (GPGPU) and yields a powerful massively parallel processing alternative to conventional multi-computer or multi-processor systems. Moreover, costs of commodity graphics cards are lower when measured in cost per FLOPS. Traditional methods for computing closest points are difficult to implement on GPUs due to their recursive nature. We take advantage of Arya's priority search algorithm [1] to fit nearest neighbor search (NNS) into the Single Instruction Multiple Data (SIMD) model, so that it can be accelerated by the use of a GPU [21]. The proposed algorithm is implemented using CUDA, NVIDIA's parallel computing architecture [18].
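To illustrate the OpenMP parallelization mentioned above, the following is a minimal sketch and not the actual 3DTK code: each data point is assigned to a loop iteration that independently searches its closest model point. A brute-force scan is used here only to keep the example self-contained; the implementation in [15] performs this search on k-d trees. All names are illustrative.

```cpp
// Sketch: OpenMP-parallel data association in the spirit of [15].
#include <vector>

struct Point { double x, y, z; };

// Returns, for every data point, the index of its closest model point,
// or -1 if no model point is closer than maxDist2 (squared threshold).
std::vector<int> closestPoints(const std::vector<Point>& model,
                               const std::vector<Point>& data,
                               double maxDist2) {
  std::vector<int> pairs(data.size(), -1);
  #pragma omp parallel for schedule(dynamic, 1024)
  for (long i = 0; i < static_cast<long>(data.size()); ++i) {
    double best = maxDist2;
    for (std::size_t j = 0; j < model.size(); ++j) {
      const double dx = model[j].x - data[i].x;
      const double dy = model[j].y - data[i].y;
      const double dz = model[j].z - data[i].z;
      const double d2 = dx * dx + dy * dy + dz * dz;
      if (d2 < best) { best = d2; pairs[i] = static_cast<int>(j); }
    }
  }
  return pairs;
}
```

Since the iterations are independent, the loop scales with the number of cores; compile with OpenMP enabled (e.g., -fopenmp).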
II. STATE OF THE ART

An introduction to robotic mapping and SLAM is given in [24]. Recently, Grisetti et al. published a tutorial on graph-based SLAM [10]. To avoid repeating these summaries, the focus in this related work section is on related GPGPU algorithms. To the best of our knowledge, GPGPU has not been applied to 3D robotic mapping. Our own previous algorithms have been published for example in [3], [16] and [15]. NNS is typically implemented using brute force methods on GPUs, which are by nature highly parallelizable. This property makes brute force based NNS methods easily adaptable for a GPU implementation and there are several implementations available. Purcell et al. used cells to represent photon locations [20]. By calculating the distances between photons in cells that are intersected with a search radius, and a query point, k nearest neighbors are located to estimate the radiance. Purcell et al. stressed that k-d tree and priority queue methods are efficient but difficult to implement on the GPU [20]. Bustos et al. stored indices and distance information as a quaternion in the RGBA channels of a texture buffer. They used three fragment programs to calculate Manhattan distances and to minimize them by reduction [4]. Rozen et al. adopted a bucket sort primitive to search nearest neighbors [22]. Van Kooten et al. introduced a nearest neighbor search method based on projection for the particle repulsion problem [25], which chooses a viewport encompassing every particle (or as many as possible), and

projects particles onto a projection plane. The approach takes advantage of the hardware accelerated functionalities of GPUs, such as projection and the 2D layout of the texture buffer for grid calculation. Garcia et al. implemented a brute force NNS approach using CUDA [9]. When the data dimension is more than 96 and the number of points is less than around 10,000, a speed-up of 40 compared to the CPU-based k-d tree implementation is reported. However, if the data dimension is less than 8 and the number of points is larger than around 19,000, their algorithm becomes slower than the CPU implementation. Other than brute force implementations, GPU-based NNS with advanced search structures is also available in the field of global illumination. In the context of ray tracing, the NNS procedure builds trees from a triangle soup, and takes triangles rather than points as the objects of interest. These algorithms cannot be used as general point-based NNS algorithms. On the other hand, the NNS algorithm for photon mapping shares a similar model with a general point-based NNS problem. Foley built a k-d tree on the CPU and used the GPU to accelerate the search procedure [7]. Horn et al. extended Foley's work by restarting the search at half of the tree rather than from the root [11]. Singh presented an SIMD photon mapping framework, using stack-based k-d tree traversal to search k nearest neighbors [23]. Lieberman et al. used a quad-tree for the similarity join algorithm [14]. Zhou et al. implemented k-d tree based NNS on the GPU for both ray tracing and photon mapping using the CPU as a coordinator [26]. They applied a heuristic function to construct k-d trees. In the k-d tree traversal stage, range searching is used to find the k nearest neighbors. In recent GPU-accelerated NNS implementations, most NNS kernels are based on brute force methods. They are easy to implement but possess the natural drawback of low efficiency compared with advanced data structures. On the other hand, brute force methods mostly need reduction kernels in order to find the minimum distance. A reduction kernel is slow due to its non-parallel nature, even when implemented with highly optimized blocks. Tree-based NNS algorithms have shown a performance leap in global illumination. Hints and inspirations can be gained from these algorithms. However, the purpose for which they were designed makes them not easily adoptable for non-graphics purposes.
III. NEAREST NEIGHBOR SEARCH USING THE GPU

GPU-based algorithms conform to the SIMD model. In our GPU-NNS algorithm, all nearest neighbor searches (as many as the number of points in the scene point cloud) are executed simultaneously. When the number of searches is not the same as the number of threads the processor can allocate, the CUDA driver will schedule and assign the threads to the multi-processors, which is transparent to the user. Our GPU-NNS procedure features three steps. a) k-d Trees: k-d trees are a generalization of binary search trees. Every node represents a partition of a point set into the two successor nodes. The root represents the whole point cloud and the leaves provide a complete disjoint


partition of the points. These leaves are called buckets. Every node contains the ranges of the represented point set. b) Array-based k-d Tree: Before the search procedure is performed, a piece of page-locked memory is allocated on the host side. As a convention, we refer to graphics cards as devices, and to CPU and main memory as host. A left-balanced k-d tree is built for the point set S by splitting the space always at the median of the longest axis. Being left-balanced, the k-d tree can be serialized into a flat array, and thus stored in the page-locked memory. Since the device memory cannot be dynamically allocated on the device, the array-based k-d tree is downloaded to the device before NNS. It is worth mentioning that in order to satisfy the coalescing of CUDA global memory, a Structure of Arrays (SoA) is used. Members of the structure are mostly 32-bit words, so that all threads of a half-warp (16 threads) are coalesced into one 64-byte global memory transaction. c) Priority Search Method: Because recursion is not possible with CUDA, the traditional k-d tree search method cannot be used. However, the priority search method provides a way to put NNS on the GPU [1]. Priority queues are maintained in the registers of the GPU. The priority search algorithm iteratively executes the following three steps: First, extract the element having minimal distance to the query point from the queue. Second, expand the extracted node. Insert the higher node in the queue and then expand the lower node. This step is repeated until a leaf node is reached. Third, update the nearest neighbor found so far. The complete GPU-NNS algorithm is shown in Algorithm 1.

Algorithm 1 GPU-NNS Algorithm
Require: download the k-d tree, model point cloud and scene point cloud to GPU global memory.
1: assign n threads, where n = number of query points.
2: assign arrays pair[n] and distance[n] in GPU memory for results.
3: for every query point in parallel, do
4:   assign the query point to a thread
5:   allocate a dynamic list for the thread
6:   construct dynamic queue q
7:   initialize q with the root node
8:   do priority search, find: pair with shortest distance d
9:   if d < distance threshold then
10:    pair[threadID] = pair
11:    distance[threadID] = d
12:  else
13:    pair[threadID] = non-pair flag
14:    distance[threadID] = 0
15:  end if
16: end for
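The following CPU-side sketch illustrates the two ideas behind Algorithm 1: a k-d tree serialized into a flat array of nodes and an iterative, queue-based priority search. It is deliberately simplified (cycling split axes, explicit child indices and single-point leaves instead of the left-balanced layout with longest-axis splits and bucket leaves described above); all names are illustrative and this is not the CUDA kernel itself.

```cpp
// Sketch: flat-array k-d tree with an iterative (non-recursive) priority search.
#include <algorithm>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Point { double c[3]; };

struct Node {
  int axis = -1, left = -1, right = -1;  // leaf if left < 0
  double split = 0.0;                    // splitting coordinate (internal nodes)
  int pointIndex = -1;                   // stored point (leaf nodes)
};

struct FlatKdTree {
  std::vector<Node> nodes;
  std::vector<Point> pts;

  int build(std::vector<int>& idx, int lo, int hi, int depth) {
    const int me = static_cast<int>(nodes.size());
    nodes.push_back(Node{});
    if (hi - lo == 1) { nodes[me].pointIndex = idx[lo]; return me; }  // leaf
    const int axis = depth % 3;          // simplified: cycle the split axis
    const int mid = lo + (hi - lo) / 2;
    std::nth_element(idx.begin() + lo, idx.begin() + mid, idx.begin() + hi,
                     [&](int a, int b) { return pts[a].c[axis] < pts[b].c[axis]; });
    nodes[me].axis = axis;
    nodes[me].split = pts[idx[mid]].c[axis];
    const int l = build(idx, lo, mid, depth + 1);   // [lo, mid)
    const int r = build(idx, mid, hi, depth + 1);   // [mid, hi)
    nodes[me].left = l; nodes[me].right = r;
    return me;
  }

  explicit FlatKdTree(std::vector<Point> points) : pts(std::move(points)) {
    if (pts.empty()) return;
    std::vector<int> idx(pts.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    build(idx, 0, static_cast<int>(idx.size()), 0);
  }

  // Priority search: always expand the queued subtree whose region has the
  // smallest lower bound on its distance to the query (cf. Algorithm 1, [1]).
  int nearest(const Point& q) const {
    if (nodes.empty()) return -1;
    using Item = std::pair<double, int>;  // (lower bound on squared distance, node)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    pq.push({0.0, 0});
    double best = std::numeric_limits<double>::max();
    int bestIdx = -1;
    while (!pq.empty()) {
      const auto [bound, ni] = pq.top(); pq.pop();
      if (bound >= best) break;                        // nothing closer can remain
      const Node& n = nodes[ni];
      if (n.left < 0) {                                // leaf: update best pair
        const Point& p = pts[n.pointIndex];
        double d2 = 0.0;
        for (int k = 0; k < 3; ++k) d2 += (p.c[k] - q.c[k]) * (p.c[k] - q.c[k]);
        if (d2 < best) { best = d2; bestIdx = n.pointIndex; }
      } else {
        const double diff = q.c[n.axis] - n.split;
        const int nearChild = diff <= 0 ? n.left : n.right;
        const int farChild  = diff <= 0 ? n.right : n.left;
        pq.push({bound, nearChild});                   // same bound on the near side
        pq.push({std::max(bound, diff * diff), farChild});  // far side >= diff^2 away
      }
    }
    return bestIdx;
  }
};
```

On the GPU, one such search runs per thread and the priority queue is bounded (queue length Q in the experiments below), which turns the exact search into an approximate one.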

IV. GPU-ACCELERATED ICP

Given two independently acquired sets of 3D points, M (model set) and D (data set), which correspond to a single shape, we want to find the transformation (R, t), consisting of a rotation matrix R and a translation vector t, which minimizes the following cost function [2]:

\[ E(R, t) = \frac{1}{N} \sum_{i=1}^{N} \left\lVert m_i - (R\,d_i + t) \right\rVert^2 , \qquad (1) \]

All corresponding points can be represented in a tuple (m_i, d_i), where m_i ∈ M and d_i ∈ D. Two things have to be calculated: first, the corresponding points, and second, the transformation (R, t) that minimizes E(R, t) on the basis of the corresponding points. The ICP algorithm uses closest points as corresponding points. A sufficiently good starting guess enables the ICP algorithm to converge to the correct minimum. Regarding the GPU-based ICP approach, there are only a few implementations yet, due to the very recent development of the GPGPU technique. Kitaaki et al. implemented Modified ICP (M-ICP) [12] on the GPU [13]. For nearest neighbor search, they took advantage of the massive raw computational power of the GPU, calculating N_M × N_D distances between query points and model points, in which N_M and N_D are the numbers of points in the model and data point cloud respectively. The method of transformation estimation was not presented, although the whole ICP pipeline is declared to have been implemented on the GPU. As a result, their method is reported to be 2 to 3 times faster than the original one. Park et al. implemented a GPU-based pose estimation pipeline [19]. ICP was used as the last stage for fine-tuning, which is unfortunately not implemented on the GPU. Choi et al. [5] applied Iterative Projection Point (IPP) instead of searching for the closest points to register 3D shapes. Since no searching procedure is required, a registration time of more than 200 milliseconds was reported for point clouds with 320 × 240 points. However, the matching accuracy might not be comparable to the ICP algorithm, which is not discussed in the publication. Our extension of the ICP to the SIMD model accelerates not only the NNS stage but also the remaining parts of the ICP algorithm, namely the scene point cloud transformation, the zero-mean alignment by computing centroids [2] and the covariance matrix for the SVD-based minimizer [16], which also take an observable amount of time. Fig. 2 illustrates the coordination between the host and the device of the GPU-ICP algorithm. The major steps are: Centroids Calculation: To calculate the centroids of the point clouds, the coordinates need to be summed up separately over all the points, yielding 6 sums to be divided by the number of points. To sum up the values, the reduction kernel in the CUBLAS library (CUDA implementation of basic linear algebra methods) is applied, which heavily uses the shared memory. Unlike on the CPU, where the number of nearest neighbor pairs is naturally obtained at the end of the loop, the GPU finds the pairs by threads in parallel, so that the number of pairs has to be counted after the NNS procedure. Counting the number of pairs is implemented by the compact function of the CUDPP library (CUDA Data Parallel Primitives Library),


which compacts an array according to a predefined mask array. The number of pairs is a side gain of the function. The six divisions are executed on the CPU in double precision, and then the centroids are sent to the GPU. Point Cloud Alignment: The point clouds are aligned to the original points using the two calculated centroids in parallel. Similar to the principle of GPU-NNS, the number of threads is the number of points and each point is translated by a standalone thread in parallel. Preparation for SVD: The covariance matrix for the SVD-based ICP [16] is calculated before it is decomposed. It is filled by the dot product function of CUBLAS on the GPU. Every thread, i.e., every point pair, contributes in parallel to the matrix. If a point pair is rejected, we use zero vectors for the dot product. Since the covariance matrix is small (3 × 3) and needs double precision, the SVD is not implemented on the GPU. We used the NewMat library to calculate the decomposition on the CPU. Further Implementation Details: The block size is configured to be 192 or 256 to get the optimal compromise between enough active threads and enough registers per multi-processor.

Fig. 2. The coordination between CPU and GPU in the GPU-ICP algorithm. Notice that other than constructing the k-d tree, the CPU does only negligible work. The data transfer over the PCIe bus is minimized. Pictures of the chips are taken from manufacturer websites.

V. GPU-ACCELERATED GLOBALLY CONSISTENT SCAN MATCHING

In the overall scan matching based 3D mapping, global relaxation takes a longer time than ICP (see Fig. 1). Analyzing the relaxation phase yields that the most time-consuming part is again the NNS, which is required for the covariance computation. We summarize our task distribution of the covariance computation in Fig. 3. Currently, we copy the tree to the GPU in each SLAM iteration to retrieve the covariance matrix for each link. This is time consuming and cannot be solved easily. Since the number of calls of this method is very high, it decreases the efficiency of our algorithm. However, the procedure given by Fig. 3 effectively exploits NNS on the graphics card.

Fig. 3. The coordination between CPU and GPU in the scan relaxation algorithm. In every iteration, the pre-computed trees are transferred to the GPU for NNS and the resulting covariance matrices are copied back.
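The "Preparation for SVD" step described above boils down to the classical centroid and cross-covariance computation for the SVD-based minimizer of Eq. (1). The sketch below is a CPU reference using Eigen in place of the NewMat library mentioned in the text; it is not the GPU code, and the function name is illustrative.

```cpp
// Sketch: centroids, 3x3 cross-covariance and SVD-based estimate of (R, t),
// cf. [2], [16]. Rejected pairs (index -1) contribute nothing.
#include <Eigen/Dense>
#include <vector>

void estimateTransform(const std::vector<Eigen::Vector3d>& model,
                       const std::vector<Eigen::Vector3d>& data,
                       const std::vector<int>& pairs,     // model index per data point, -1 if rejected
                       Eigen::Matrix3d& R, Eigen::Vector3d& t) {
  Eigen::Vector3d cm = Eigen::Vector3d::Zero(), cd = Eigen::Vector3d::Zero();
  int n = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (pairs[i] < 0) continue;
    cm += model[pairs[i]]; cd += data[i]; ++n;
  }
  cm /= n; cd /= n;                                        // centroids of matched subsets

  Eigen::Matrix3d H = Eigen::Matrix3d::Zero();             // cross-covariance matrix
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (pairs[i] < 0) continue;
    H += (data[i] - cd) * (model[pairs[i]] - cm).transpose();
  }

  Eigen::JacobiSVD<Eigen::Matrix3d> svd(H, Eigen::ComputeFullU | Eigen::ComputeFullV);
  R = svd.matrixV() * svd.matrixU().transpose();
  if (R.determinant() < 0) {                               // guard against reflections
    Eigen::Matrix3d V = svd.matrixV();
    V.col(2) *= -1.0;
    R = V * svd.matrixU().transpose();
  }
  t = cm - R * cd;                                         // aligns the centroids
}
```

On the GPU, only the sums (centroids and the nine entries of H) are accumulated in parallel; the tiny 3 × 3 decomposition stays on the CPU, as stated above.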

VI. EXPERIMENTS AND RESULTS

The computer for our experiment was an Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66 GHz with a GeForce GTX 260 with 216 cores. Its graphics memory is 896 MB. For a system test of the ICP algorithm, we use synthetic data with known ground truth poses, thus we are able to analyze the convergence of the proposed algorithm. The 3D scan registration is based on two artificial cubes (Fig. 4, left). Each of them consists of N random points lying on the surface of the cube. The side length of each cube is 10 units. The second cube is generated by translating the first cube through the vector [1.0, 1.0, 1.0] and rotating it around the axis [1.0, 1.0, 1.0] by 0.1 radian. The convergence, which represents the matching quality, is measured by the variation defined as the root mean squared error of the distances between nearest neighbors. It is observed in Fig. 4 that although GPU-ICP with different queue lengths presents different convergence rates, the runs end up with very similar convergence limits. The reason for this behavior is the iterative fashion of ICP. The assumption made by the ICP algorithm is that in the last iteration the point pairs are correct. Retrieving approximate nearest neighbors yields different point pairs for the minimization step. Due to the large number of points in the point cloud, the computed transformation (R, t) is still similar to


[Fig. 4 plots: RMS error [unit^2] over the iterations for GPU-ICP with queue lengths Q = 1, 1 to 6, 8, 16, 32 and for CPU-ICP, shown for N = 1,000, 10,000 and 100,000 points.]

Fig. 4. The left image shows the artificial cube used for the convergence test. The remaining subfigures present the convergence test results for different numbers of points: a comparison of the convergence over 50 iterations between the GPU-ICP implementations and the CPU-based ICP implementation. Q denotes the priority queue length. Approaches with different queue lengths present different convergence rates. However, they end up with very similar convergence limits.
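The synthetic convergence test described above can be reproduced along the following lines. The sketch assumes Eigen for the rotation about the axis [1, 1, 1]; since the text does not state in which order translation and rotation are composed, the rotation is applied first here, and all names are illustrative.

```cpp
// Sketch: generate the two artificial cube point clouds with a known transform.
#include <Eigen/Dense>
#include <Eigen/Geometry>
#include <random>
#include <vector>

std::vector<Eigen::Vector3d> randomCube(int N, double side, unsigned seed = 42) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> coord(0.0, side);
  std::uniform_int_distribution<int> face(0, 5);
  std::vector<Eigen::Vector3d> pts(N);
  for (auto& p : pts) {
    p = Eigen::Vector3d(coord(rng), coord(rng), coord(rng));
    const int f = face(rng);
    p[f / 2] = (f % 2) ? side : 0.0;     // project the point onto one of the six faces
  }
  return pts;
}

int main() {
  const int N = 10000;
  const auto model = randomCube(N, 10.0);          // cube with side length 10 units

  const Eigen::Matrix3d R0 =
      Eigen::AngleAxisd(0.1, Eigen::Vector3d(1, 1, 1).normalized()).toRotationMatrix();
  const Eigen::Vector3d t0(1.0, 1.0, 1.0);

  std::vector<Eigen::Vector3d> data(N);
  for (int i = 0; i < N; ++i)
    data[i] = R0 * model[i] + t0;                  // ground-truth pose is known exactly
  // 'model' and 'data' can now be fed to an ICP variant; the recovered pose
  // should approach the inverse of the applied (R0, t0).
}
```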
[Fig. 5 plots: RMS error [unit^2] over the iterations for a 3D laser scan; processing time [ms] per algorithm (GPU-ICP, OpenMP-ICP with 4 and 2 threads, sequential ICP); runtime [ms] over the number of points, without and with k-d tree construction.]

Fig. 5. From left to right: (1) Convergence test for a 3D laser scan. (2) Speed test: the result of the 3D registration experiment with 68,229 points. "1-6" means the queue length is increased gradually over the iterations as 1, 2, 3, 4, 5, 6, and "1-32" as 1, 2, 4, 8, 16, 32. It is observed from the speed test that GPU-ICP is 25 times faster than the sequential CPU-based ICP. In addition, the OpenMP results with 4 and 2 threads are given. (3) Runtime comparison among the sequential CPU implementation, the OpenMP-based implementation and GPU-ICP, without k-d tree construction. (4) Including k-d tree construction.

the exact case. Therefore, more iterations will be needed to match the point clouds, and again, in the last iteration the point pairs are correct, since the query point will be in the bucket where the true nearest neighbor is. If fewer points are present in the point cloud, the effect of initially wrong correspondences becomes larger; however, the minimum of zero is always reached. In all experiments a slightly better accuracy could be observed for the CPU implementation compared to the GPU implementation. The reason is that the CPU operates in double floating-point mode and allows a longer queue. The data sets used in our real-life evaluation are available in the Robotic 3D Scan Repository (http://kos.informatik.uni-osnabrueck.de/3Dscans/). The first data set is a typical indoor scan acquired by an actuated, i.e., tilting, SICK LMS200 scanner. Both point clouds contain 68,229 points. The second data set refers to the first 65 3D scans of the set Hannover1 contributed by Oliver Wulf, Leibniz Universität Hannover, Germany. These 3D scans have a field of view of 360° × 180° and each contains approximately 30,000 3D points. The third data set is the Koblenz data set acquired by Johannes Pellenz and Dagmar Lang from the Active Vision Group, University of Koblenz-Landau, Germany, using a Velodyne HDL-64E laser range finder with a field of view of 360° × 28.8°. Each scan contains roughly 102,000 points. Registration of two 3D Scans: This experiment is conducted to compare the performance of the GPU-ICP algorithm, the OpenMP-based ICP algorithm [15] and the sequential CPU-based ICP implementation, which uses a single CPU. The convergence test of GPU-ICP with different queue lengths is presented in Fig. 5 (1). It is observed that

although the executions with different queue lengths yield different convergence speeds at the beginning, they all converge to a similar variation limit, as does the one with a single-element queue. Figure 5 (2) shows the time consumed by different queue length configurations, as well as by the two CPU implementations. When registering 68,229 points, GPU-ICP with single-element queues performs more than 25 times faster than the standard sequential CPU implementation. Fig. 5 compares the speed of the three implementations: standard CPU-based ICP, OpenMP-based ICP and GPU-ICP. When registering 68,229 points, OpenMP multi-threading on a quad core machine achieves a speed-up of around 3 compared to the sequential execution, while a speed-up of 25 is achieved by the GPU implementation. It is observed that with more points, the speed-up ratio of GPU-ICP to the CPU implementations improves. Data are transferred to the GPU only once per ICP registration process (here: 2 × 68,229 points amounting to 7 MB), which takes less than 3 ms in case the GPU is connected by PCIe 2.0. The data sets Hannover1 and Koblenz were recorded using continuous 3D scanning. By using this scanning methodology a very large number of scans can be recorded. Fig. 6 shows the timing results and two views of the map obtained by registration of the Koblenz data set. Similar results were achieved with the Hannover1 data set. The overall speed-up is mainly attributed to the acceleration of the ICP algorithm. The high number of points in every scan is advantageous for our GPU-accelerated ICP. It turned out that one can obtain only a speed-up factor of about 5 for the GraphSLAM part as given in Fig. 3. This is due to the fact that in every GraphSLAM iteration all trees must be transferred to the GPU in order to execute the NNS for all graph links.


[Fig. 6 bar chart: run time for the Koblenz data set over the CUDA block size (8, 16, 32, 64), compared to the single-threaded CPU implementation.]

Fig. 6. Left two images: 3D views of the point cloud acquired by a Velodyne scanner. Different scans have different colors. Middle: Bird's-eye view. Right: Run time comparison for the Koblenz data set.

VII. DISCUSSION, CONCLUSIONS, AND FUTURE WORK

We developed a GPU-accelerated NNS approach and used this as part of our implementation of the SLAM front-end. More precisely, we employ a k-d tree based priority search method for computing closest points. The performance is enhanced significantly by the massive parallelism of the GPU SIMD architecture. It is higher by one order of magnitude using a modern commodity video card. The resulting approach is also faster than the MIMD-based approaches running on a state-of-the-art quad core CPU. Needless to say, a lot of work remains to be done. Our current approach suffers from the fact that in every iteration of GraphSLAM, where we conduct an NNS, compute covariances and execute a SLAM back-end, the k-d trees must be time-consumingly transferred to GPU memory. In future work, we will concentrate on solving this memory-bounded GraphSLAM optimization problem, where we only store a certain part of the scans in the limited GPU memory (here 896 MB). With the rapid development of GPU technology, the k-d tree construction stage will be migrated to the GPU, possibly using a breadth-first, dynamic list based approach.

ACKNOWLEDGEMENTS

The GPGPU based registration software is available in 3DTK – The 3D Toolkit. All contributors are gratefully acknowledged. This work was partially supported by the B-IT foundation, Applied Sciences Institute, a cooperation between Fraunhofer IAIS and the University of Applied Sciences Bonn-Rhein-Sieg, and has been partially funded by the grants AUTOSCAN (BMWi KF2470003DF0) and VeloSLAM (NSFC 41050110437).

REFERENCES
[1] S. Arya and D. M. Mount. Algorithms for Fast Vector Quantization. In Proc. Data Compression Conf., IEEE Comp. Society Press, 1993. [2] P. Besl and N. McKay. A method for Registration of 3-D Shapes. IEEE Trans. Pattern Analysis and Machine Intelligence (PAMI), 14(2), 1992. [3] D. Borrmann, J. Elseberg, K. Lingemann, A. N chter, and u J. Hertzberg. Globally consistent 3d mapping with scan matching. Journal Robotics and Autonomous Systems (JRAS), 56(2), 2008. [4] B. Bustos, O. Deussen, S. Hiller, and D. Keim. A Graphics Hardware Accelerated Algorithm for Nearest Neighbor Search. In Proc. of the 6th Intl. Conf. on Computational Science, pages 196199, 2006. [5] S.-I. Choi, S.-Y. Park, J. Kim, and Y.-W. Park. Multi-view Range Image Registration using CUDA. In The 23rd Intl. Technical Conf. on Circuits/Systems, Computers and Communications, 2008. [6] T. A. Davis. Algorithm 849: A concise sparse Cholesky factorization package. ACM Transaction of Mathematical Software, 31(4), 2005.

[7] T. Foley and J. Sugerman. KD-Tree Acceleration Structures for a GPU Raytracer. In Graphics Hardware, pages 1522, July 2005. [8] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for nding best matches in logarithmic expected time. ACM Transaction on Mathematical Software, 3(3):209226, September 1977. [9] V. Garcia, E. Debreuve, and M. Barlaud. Fast k Nearest Neighbor Search using GPU. In Proceedings of the Comp. Vision and Pattern Recognition Workshop (CVPRW), pages 16, June 2008. [10] G. Grisetti, R. K mmerle, C. Stachniss, and W. Burgard. A tutorial on u graph-based SLAM. IEEE Transactions on Intelligent Transportation Systems Magazine, 2:3143, 2010. [11] D. R. Horn, J. Sugerman, M. Houston, and P. Hanrahan. Interactive k-D Tree GPU Raytracing. In Proceedings of the Symposium on Interactive 3D graphics and games, pages 167174, April 2007. [12] S. Kaneko, T. Kondo, and A. Miyamoto. Robust Matching of 3D Contours Using Iterative Closest Point Algorithm Improved by Mestimation. In Pattern Recognition, vol. 36, pp. 20412047, 2003. [13] Y. Kitaaki, H. Okuda, H. Kage, and K. Sumi. High speed 3-D registration using GPU. In SICE Ann. Conf. 2008, 2008. [14] M. D. Lieberman, J. Sankaranarayanan, and H. Samet. A Fast Similarity Join Algorithm Using Graphics Processing Units. In Proc. of the 24th IEEE Intl. Conf. on Data Engineering, 2008. [15] A. N chter. Parallelization of Scan Matching for Robotic 3D Mapping. u In Proceedings of the 3rd European Conference on Mobile Robots (ECMR 07), Freiburg, Germany, September 2007. [16] A. N chter, J. Elseberg, P. Schneider, and D. Paulus. Study of u Parameterizations for the Rigid Body Transformations of The Scan Registration Problem. Journal Computer Vision and Image Understanding (CVIU), 114(8):963980, August 2010. [17] A. N chter, K. Lingemann, J. Hertzberg, H. Surmann, K. Perv lz, u o M. Hennig, K. R. Tiruchinapalli, R. Worst, and T. Christaller. Mapping of Rescue Environments with Kurt3D. In Proc. of the IEEE Intl. SSRR, pages 158163, Kobe, Japan, June 2005. [18] nVidia. NVIDIA CUDA Compute Unied Device Architecture Programming Guide. nVidia, version 2.0 edition, June 2008. [19] I. K. Park, M. Germann, M. D. Breitenstein, and H.-P. Pster. Fast and automatic object pose estimation for range images on the gpu. In Machine Vision and Applications. Springer-Verlag, August 2009. [20] T. J. Purcell, C. Donner, M. Cammarano, H. W. Jensen, and P. Hanrahan. Photon Mapping on Programmable Graphics Hardware. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS Conf. on Graphics Hardware, 2003. [21] D. Qiu, S. May, and A. N chter. GPU-accelerated Nearest Neighbor u Search for 3D Registration. In Proc. of the 7th Intl. Conf. on Computer Vision Systems (ICVS 09), 2009. [22] T. Rozen, K. Boryczko, and W. Alda. GPU bucket sort algorithm with applications to nearest-neighbour search. In Journal of the 16th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, February 2008. [23] S. Singh and P. Faloutsos. SIMD Packet Techniques for Photon Mapping. In Proceedings of the IEEE/EG Symposium on Interactive Ray Tracing, pages 8794, September 2007. [24] S. Thrun. Robotic mapping: A survey. In G. Lakemeyer and B. Nebel, editors, Exploring Articial Intelligence in the New Millenium. Morgan Kaufmann, 2002. [25] K. van Kooten, G. van den bergen, and A. Telea. GPU Gems 3, chapter 7, pages 123148. Addison Wesley Professional, August 2007. [26] K. Zhou, Q. Hou, R. Wang, and B. Guo. Real-Time KD-Tree Construction on Graphics Hardware. 
In SIGGRAPH Asia 2008, page 10, April 2008.


Efficient Topological Mapping with Image Sequence Partitioning


Hemanth Korrapati, Youcef Mezouar, Philippe Martinet. LASMEA, CNRS JOINT UNIT 6602, 24 Avenue des Landais, 63177 Aubière, France. Email: firstname.lastname@lasmea.univ-bpclermont.fr

Abstract– Topological maps are vital for fast and accurate localization in large environments. Sparse topological maps can be constructed by partitioning a sequence of images acquired by a robot, according to their appearance. All images in a partition have similar appearance and are represented by a node in a topological map. In this paper, we present a topological mapping framework which makes use of image sequence partitioning (ISP) to produce sparse maps. The framework facilitates coarse loop closure at node level and a finer loop closure at image level. Hierarchical inverted files (HIF) are proposed which are naturally adaptable to our sparse topological mapping framework and enable efficient loop closure. The computational gain attained in loop closure with HIF over sparse topological maps is demonstrated. Experiments are performed on outdoor environments using an omnidirectional camera. Index Terms– Topological Mapping, Omnidirectional Vision, Loop Closure

[Fig. 1 flowchart: a newly acquired image (query image) passes through local feature extraction and node level loop closure; depending on whether a loop is closed, whether finer loop closure is required, and whether the image is similar to the current place, either the winning node, the current node, or a newly created place node is augmented with the image. The ISP modules are enclosed in a dashed box.]

Fig. 1: A global view of our topological mapping framework.

I. INTRODUCTION

Mapping is one of the fundamental problems of autonomous mobile robotics. The mapping problem can be broadly categorized as topological and metrical [16]. Metrical mapping involves accurate position estimates of robots and landmarks of the environment. Topological mapping, on the other hand, represents an environment as a graph in which nodes correspond to places and the edges between them indicate some sort of connectivity. Recently, a third category called Topo-Metric Mapping [17], [9] is gaining popularity. Topo-Metric mapping is a hybrid approach which uses both metrical and topological information in map building. Building an accurate map, either metrical or topological, depends on loop closure accuracy. Such maps are difficult to build using metrical information, which is prone to gross errors in the position estimation of robot and landmarks. Topological maps facilitate accurate loop closure as they depend on appearance information rather than on exact metrical information of the environment. Many powerful loop closing techniques for topological maps have been introduced recently [2, 3, 5, 7]. Most of them produce dense topological maps, in which every acquired image stands as a node in the topological graph. Sparser topological maps can be built by representing sets of contiguous images with similar appearance features as places. Each place is represented by a node in the topological graph. We refer to this kind of partitioning of image sequences into places as Image Sequence Partitioning (ISP). In a sparse topological map, since each node represents multiple images, fewer nodes are sufficient for the map representation. Maps with fewer nodes reduce the computational complexity involved in loop closure and map merging. An example map merging problem is to localize a topological map of a tiny environment in a larger map (e.g., Google Maps). We use a topological mapping framework which facilitates coarse loop closure at the node level and a finer loop closure at the image level. The topological map is represented as a graph T = (N, E), where N and E are sets of nodes and edges respectively. The map is updated with each newly acquired image. Every new image (query image) is checked for similarity to a previously visited node (place) or the current place node and, if similar, the corresponding node is augmented with the image. If the query image is not similar to any of the existing nodes, then a new place node is created and augmented with the image. This process of map update is nothing but Image Sequence Partitioning (ISP). Each node contains a set of representative features representing all the member images of the node (place). The representative features are used in evaluating node-image similarity during ISP. Another contribution of this paper is the proposal of Hierarchical inverted files (HIF) for efficient loop closure at both node and image levels. Similar to traditional inverted files used for object recognition [11] and loop closure problems [2], HIFs are also associated with the visual words in the vocabulary. HIFs combine the power of regular inverted files with our sparse topological map structure and help in fast loop closure. Considering the fact that in a sparse topological map, images are again grouped into nodes, HIFs organize


the previous occurrence information of visual words in a hierarchical fashion, which enables fast loop closure. We use an inverse similarity score evaluation methodology in order to take advantage of HIFs for loop closure. Experiments were performed on omnidirectional image data acquired in outdoor urban environments. Omnidirectional images offer a 360 degree field of view, which helps in building topological maps invariant of the robot's heading. As a result, loop closure can be performed even if the robot is not headed in the same direction as during the previous visit to the same location. Map building in outdoor environments is challenging due to illumination variation and possible occlusions [15]. Sparsity and accuracy of topological maps constructed using ISP are evaluated. The computational savings achieved in HIF-based loop closure is analysed. The rest of the paper is organized as follows: Section II details the related work done in this area. Section III describes the steps involved in ISP in detail and provides algorithmic illustrations. Section IV introduces HIFs and discusses how node and image level loop closures are performed using HIFs. Section V evaluates the sparsity of maps produced by ISP, and the accuracy and computational cost of loop closure on the generated topological maps.

II. RELATED WORK

Scene Change Detection and Key Frame Selection for video segmentation and abstraction [19], [13] have similar goals as ISP. They try to represent a video with fewer images called key frames whenever there is a sufficient change in the scene, and most of them focused on the video compression domain. The major difference between these video abstraction problems and topological mapping is that topological mapping demands localization of a query image which is obtained at a previously visited place, but with variation in illumination and viewpoint, and possible occlusion. Hence, video segmentation techniques using pixel-wise intensity measures, global image features like histograms, or motion based segmentation cannot be applied to our problem. Loop closure in topological maps has gained popularity among mobile robotics researchers in recent times. Many loop closure algorithms for topological mapping have been proposed and tested in both indoor [7], [2], [3], [21], [8] and outdoor environments [5], [6], [1], [12]. In [21], [22] topological maps are built for indoor environments. They segment the topological graph of the environment using the normalized graph-cuts algorithm, resulting in subgraphs corresponding to convex areas in the environment. In [8] SIFT features were used to perform matching over a sequence of images. They detected transitions between individual indoor locations depending on the number of SIFT features which can be successfully matched between successive frames. In [14] the fingerprint of an acquired image is generated using the omnidirectional image and laser readings, and these fingerprints are compared to those of the previous images. If the similarity is above a threshold the image is added to the existing node, and if not a new node is formed. All of these works were focused on indoor environments. Indoor environments contain convex

spaces (rooms) which are relatively simpler to partition as compared to outdoor environments. A topological mapping framework using incremental spectral clustering has been presented in [20]. Nodes containing similar images are constructed using incremental spectral clustering over the affinity matrix of the images, thereby producing a topological graph. An optical flow based ISP technique was presented in [12] for topological mapping in outdoor environments using a quad rotor robot. Optical flow is used to discover change in environmental appearance. In [1], gist features [18] were used to cluster images with similar appearance for topological map construction.

III. IMAGE SEQUENCE PARTITIONING

Figure 1 depicts a global view of our framework, in which we can see a modular view of ISP enclosed by a red dashed line. As can be seen from Figure 1, ISP consists of three main modules: node level loop closure, evaluation of similarity to the current place, and new node formation. Given a query image, initially SURF [4] features are extracted from the image. Using the SURF features, we evaluate the node-image similarity of the query image with all the nodes in the graph except the current place node and pick out the top k similar nodes. The top k similar nodes are assigned to the set of winning nodes Nw. This process is called node level loop closure as it finds the previously visited places (nodes) most similar to the query image. Only the representative feature sets of the nodes are used to compute the node-image similarities during node level loop closure. In our framework, the representative features of a node are the SURF features of the first image augmented to the node. An empty Nw indicates loop closure failure; that is, the query image is not similar to any of the previously visited places. In that case, the query image similarity to the current node is evaluated. If the similarity of the query image with the current place node is above a certain threshold, the current node is augmented with the query image. If the query image is not similar to the current node either, a new node is created with the query image. But if Nw is not empty, indicating a loop closure, then the set of winning nodes can be considered for a thorough image level loop closure. Algorithm 1 shows the steps involved in ISP. The node level loop closure function in line 4 is discussed in detail in Section IV. The function current node image similarity evaluates the similarity of the current node with that of the query image. This is done by matching the SURF features of the query image to the features of the current place node. Feature matching is performed as proposed in [10]. Two features are considered to be matched if the ratio of the best match and the second best match is greater than 0.6.

IV. LOOP CLOSURE & HIERARCHICAL INVERTED FILES

Node and image level loop closures are performed using visual words corresponding to the SURF features of the image. Given a query image, to find the most similar


Algorithm 1 Image Sequence Partitioning Algorithm
1: procedure PROCESS QUERY IMAGE(T, Iq, nc)  ▷ T, Iq, nc, T.N are the topological graph, query image, current node in the topological graph & node set of the topological graph respectively.
2:   N = T.N \ {nc}  ▷ Reference node set excluding the current node.
3:   Nw = {}  ▷ Set of winning nodes.
4:   Nw = Node level Loop Closure(N, Iq, Ths)
5:   if is empty(Nw) then
6:     if current node image similarity(nc, Iq) > Tht then
7:       nc.add image(Iq)
8:     else
9:       n = new node()
10:      n.add image(Iq)
11:      update map(T, n)
12:    end if
13:  else
14:    Nw = get top k similar nodes(Nw)
15:    Iw = get images of nodes(Nw)
16:    n* = Image Level Loop Closure(Iw, Nw, Iq)
17:    add image to node(Iq, n*)
18:    update map(T, n*)
19:  end if
20: end procedure


Fig. 2: (a) represents a traditional inverted file. (b) represents a hierarchical inverted file. VW1, VW2, ... represent visual words. N1, N2, ... represent node ids in the first level of the HIF and I1, I2, I3, ... represent the image ids in the inverted file.

reference image in the database, [11], [7] use a Tf-Idf based similarity score between every reference image in the database and the query image, and select the reference images corresponding to the top n similarity scores. We use an inverse methodology to compute image similarities for loop closure. The steps involved are enumerated as follows:
1) Let the set of reference images be I = {I_1, I_2, ..., I_M}. We consider a histogram H with the number of bins corresponding to the number of reference images, M.
2) Extract the set of visual words W = {w_1, w_2, ..., w_p} from the query image Iq.
3) For each visual word w_i, using the inverted file IF_i of the word, we extract the reference image indexes I^{w_i} = {I_1^{w_i}, I_2^{w_i}, ...} in which the word has been previously seen. The histogram bins corresponding to these extracted reference images are incremented by a factor of the Tf-Idf of the corresponding word:
\[ H[I_j^{w_i}] = H[I_j^{w_i}] + \mathrm{TfIdf}(I_j^{w_i}, w_i) \qquad (1) \]
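As a sketch of the inverse similarity evaluation enumerated above, each visual word of the query votes through its inverted file only for the reference images in which it occurred. The names below are illustrative, and the Tf-Idf term is simplified to a per-word weight, whereas (1) weights each image-word pair.

```cpp
// Sketch: inverse Tf-Idf voting over flat inverted files, cf. Eq. (1).
#include <unordered_map>
#include <vector>

struct InvertedFile { std::vector<int> imageIds; };   // images containing the word

std::vector<double> scoreQuery(
    const std::vector<int>& queryWords,                        // quantized local features
    const std::unordered_map<int, InvertedFile>& invertedFiles,
    const std::unordered_map<int, double>& tfidf,              // simplified: weight per word
    int numReferenceImages) {
  std::vector<double> H(numReferenceImages, 0.0);              // histogram over reference images
  for (int w : queryWords) {
    const auto it = invertedFiles.find(w);
    if (it == invertedFiles.end()) continue;                   // word never seen before
    const auto tf = tfidf.find(w);
    const double weight = (tf != tfidf.end()) ? tf->second : 0.0;
    for (int img : it->second.imageIds)
      H[img] += weight;                                        // the update in Eq. (1)
  }
  return H;  // candidate loop closures are the top-scoring bins
}
```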

The resulting histogram can be interpreted to contain the degrees of similarity of the query image with respect to the reference images. As we can see, the loop closure computation time only depends on the number of words in the query image and the average inverted file length at that instant. As a result, the loop closure time does not increase as steeply as is the case with the forward method. A closely related approach can be found in [6, 2, 3]. However, this method is suitable only for loop closure over dense topological maps, not for sparse topological maps in which each node represents multiple images. With a change in the inverted file structure, we can adapt this similarity evaluation method to sparse topological maps. A regular inverted file corresponding to a visual word simply contains a list of all previous images which contained the word. We associate Hierarchical inverted files (HIF) with each visual word. As the name suggests, HIFs contain two levels. The first level consists of the ids of the nodes in which the visual word occurred previously. The second level consists of small child inverted files attached to each

node id. These image ids indicate the images belonging to the parent node in which the visual word has occurred. Each child inverted file attached to a node of the topological graph contains the list of all previous images belonging to the parent node in which the word has occurred. The difference between traditional inverted files and HIFs is illustrated in Figure 2. To perform a node level loop closure using a HIF, we do not have to go through the entire HIF; it is sufficient to go through the first level (node ids) of the HIFs. For an image level loop closure using a HIF, we only have to traverse those child inverted files corresponding to the winning nodes, which form only a fraction of the total HIF. Thus, HIFs offer a computational gain in loop closure when compared to regular inverted files, which is demonstrated in Section V. Algorithms 2 and 3 give a clearer picture of the node level and image level loop closures. There can be multiple winning images given by the image level loop closure and hence multiple corresponding nodes, but for the sake of simplicity we do not represent that in the algorithm. In such cases we use RANSAC based geometric verification to find the right match.

Algorithm 2 Node Level Loop Closure Algorithm
1: procedure NODE LEVEL LOOP CLOSURE(N, Iq, Ths)
2:   N = T.N \ {nc}
3:   H = Histogram(size of(N))
4:   W = extract and quantize Features(Iq)
5:   for each word wi in W do
6:     HIFi = get hierarchical inverted file(wi)
7:     for each node nj in HIFi do
8:       H[nj] = H[nj] + 1
9:     end for
10:  end for
11:  N* = get winners from histogram(H, Ths)
12:  return N*
13: end procedure


Algorithm 3 Image Level Loop Closure Algorithm


1: procedure IMAGE LEVEL LOOP CLOSURE(Iw, Nw, Iq)
2:   H = Histogram(M)
3:   W = extract and quantize Features(Iq)
4:   for each word wi in W do
5:     HIFi = get hierarchical inverted file(wi)
6:     for each node nj in Nw do
7:       IFi^nj = get inverted file of node(HIFi, nj)
8:       for each entry fi in IFi^nj do
9:         H[fi] = H[fi] + TfIdf(fi, wi)
10:      end for
11:    end for
12:  end for
13:  I* = get winner from histogram(H)
14:  n* = get corresponding node(I*)
15:  return n*
16: end procedure
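A compact illustration of the HIF idea and of Algorithms 2 and 3 is sketched below: the node-level pass touches only the first level of each HIF, and the image-level pass touches only the child inverted files of the winning nodes. Types and names are illustrative (the Tf-Idf term is again simplified to a per-word weight) and do not correspond to the authors' implementation.

```cpp
// Sketch: hierarchical inverted file (HIF) with two-level loop-closure voting.
#include <unordered_map>
#include <vector>

// First level: node ids; second level: the per-node child inverted files.
struct HierarchicalInvertedFile {
  std::unordered_map<int, std::vector<int>> perNodeImages;   // nodeId -> image ids
};
using Vocabulary = std::unordered_map<int, HierarchicalInvertedFile>;  // word -> HIF

// Node level (Algorithm 2): traverse only the first level of each HIF.
std::unordered_map<int, int> nodeLevelVotes(const std::vector<int>& queryWords,
                                            const Vocabulary& vocab) {
  std::unordered_map<int, int> H;                            // nodeId -> votes
  for (int w : queryWords) {
    const auto it = vocab.find(w);
    if (it == vocab.end()) continue;
    for (const auto& entry : it->second.perNodeImages)
      H[entry.first] += 1;                                   // one vote per node occurrence
  }
  return H;                                                  // threshold -> winning nodes Nw
}

// Image level (Algorithm 3): traverse only the child files of the winning nodes.
std::unordered_map<int, double> imageLevelVotes(
    const std::vector<int>& queryWords, const Vocabulary& vocab,
    const std::vector<int>& winningNodes,
    const std::unordered_map<int, double>& tfidf) {          // simplified per-word weight
  std::unordered_map<int, double> H;                         // imageId -> score
  for (int w : queryWords) {
    const auto it = vocab.find(w);
    if (it == vocab.end()) continue;
    const auto tf = tfidf.find(w);
    const double weight = (tf != tfidf.end()) ? tf->second : 0.0;
    for (int node : winningNodes) {
      const auto child = it->second.perNodeImages.find(node);
      if (child == it->second.perNodeImages.end()) continue;
      for (int img : child->second)
        H[img] += weight;                                    // same update as Eq. (1)
    }
  }
  return H;                                                  // best bin -> I*, its node -> n*
}
```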


V. EXPERIMENTS

Our experimental setup consists of a Pioneer P3DX robot equipped with an omnidirectional camera. A laptop equipped with an Intel Centrino 2 processor running Ubuntu 9.04 is used for data processing. The experiments were carried out in our artificial urban environment, PAVIN. The environment contains roads, artificial buildings, and a variety of real-world road settings like junctions, traffic lights, round-abouts, curved roads and dead ends. Omnidirectional images were acquired at a frame rate of 2 fps as the robot moved along a manually controlled trajectory. Image data was acquired in four installments (A, B, C and D) at very different times of two days and hence contained significant illumination variation. Figure 3 shows the parts of the environment through which the robot traversed during each installment. We took care that the data from all four installments contained overlaps, so as to put our loop closure algorithm to the test. We constructed two data-sets by combining data from the four installments. Dataset-6560 was obtained by combining the data of installments A, C and D. It contains 6560 images with 52 possible loop closures. Another data-set, Dataset-11200, was obtained by a combination of all four installments. It contains 11200 images and 71 possible loop closures. The number of loop closures was determined by manually examining the data-sets.

A. ISP - Sparsity

The number of nodes in a topological map indicates its sparsity. An ideal topological map is one in which each distinct place in the environment is represented by a node in the topological graph. The sparsity of these ideal maps represents the optimal sparsity of the actual environment. But practically, an ideal topological map is far from being attained. Different features produce different topological representations of the environment. We have experimented using SURF128, U-SURF128, and SIFT features, out of which we found that U-SURF128 features lead to the topological structure closest to the ground truth. Another important factor that affects the stability of appearance is the image distortion. The features directly extracted from warped omnidirectional images are unstable, as the appearance of a keypoint patch changes very much even with a small


Fig. 3: Paths traversed by the robot during each image acquisition installment: (a) installment A, (b) installment B, (c) installment C, (d) installment D.

TABLE I: SPARSITY

(a) Sparsity - Warped
                 SURF128   U-SURF128   SIFT
DATASET-6560     539       502         1582
DATASET-11200    756       742         2795

(b) Sparsity - Unwarped
                 SURF128   U-SURF128   SIFT
DATASET-6560     504       473         1037
DATASET-11200    737       723         1257

displacement of the camera. Hence it is likely that the maps produced using warped images contain a greater number of place nodes. This happens because, due to the feature instability, each place can be interpreted as multiple adjacent places and hence multiple nodes in the topological graph. The undistorted (unwarped) images produce relatively sparser maps. Tables I(a) and I(b) show the sparsity of maps produced by ISP using different features on warped and unwarped images. We can see that U-SURF128 is the best performing feature, producing the sparsest maps.

B. Accuracy

In this subsection, we discuss the accuracy of the maps produced by ISP, based on the number of accurate loop closures and the obtained false positives. The sparsest map may not guarantee an accurate map. Only those maps with accurate place partitioning are accurate and can lead to accurate loop closures. Thus a good mapping technique is


TABLE II: NODE LEVEL LOOP CLOSURE ACCURACY
                 #(LC)   #(FP)
Dataset-6560     49      4
Dataset-11200    68      6

one which provides an optimal combination of sparsity and accuracy. Given a query image, first we perform loop closure at node level and then at image level if more accuracy is required. Table II shows the number of loop closures detected and the number of false positives obtained by node level loop closure on Dataset-6560 and Dataset-11200 respectively. We can see that a few loop closures are missed, and some false positives are observed. The inaccuracy can be attributed to imperfections in ISP. There is a direct relation between the accuracy of the maps and ISP. The way in which we perform topological mapping does not induce any information loss. In other words, we do not perform any kind of sampling or selection of a reduced number of features. Instead, we consider each and every feature extracted from the images and store it in the HIFs. This process guarantees that no information is lost during the mapping process. But inaccurate loop closure detection might occur due to inaccurate partitioning of places. For example, a place can be represented by two nodes by partitioning it inaccurately during ISP due to appearance feature instability. As a result, during node level loop closure using a query image, both of the nodes representing that place may not get high similarity scores and hence the loop closure becomes inaccurate. A good ISP algorithm produces maps with a minimum number of these situations. Image level loop closure accuracy depends on the accuracy of node level loop closure. If node level loop closure selects an inaccurate set of winning nodes, then as a result, image level loop closure also becomes inaccurate. However, in the case of an accurate node level loop closure, we have observed that 99% accuracy was possible in image level loop closure irrespective of the ISP technique. Figure 4 shows two loop closure scenarios that occurred in our mapping.

C. Computational Time

Table III shows the average computational time (in milliseconds) taken by each stage of our topological mapping framework in processing each query frame on both Dataset-6560 and Dataset-11200. The abbreviations NLLC, LFE+QUANT, ILLC stand for Node Level Loop Closure, Local Feature Extraction & Quantization and Image Level Loop Closure respectively. NLLC is the node level loop closure, which involves extracting the most similar nodes using HIFs as mentioned in Algorithm 2. This takes 10 ms as shown in Table III, and requires an additional 50 ms in order to compare with the current place node whenever needed. Obviously, the local feature extraction (LFE) (200 ms) and quantization (QUANT) (70 ms) time is constant for every acquired frame. Actually, the time required for both of these tasks increases with the number of features in an image. The computation time of image level loop closure (ILLC) is very low. This low computation time is the result of using

Fig. 4: Example loop closures (panels (a)-(d)).

TABLE III: Average computation times (in ms).

    NLLC         10 + 50
    LFE+QUANT    200 + 70
    ILLC         21

This low computation time is the result of using hierarchical inverted files (HIFs). As mentioned before, HIFs make the loop closure computation almost independent of the number of reference images and, in our case, of the number of nodes in the topological graph. Figures 5(a) and 5(b) compare the loop closure times of our HIF-based method and of a non-HIF based method. The red curves in the graphs indicate the time taken for feature quantization, node level loop closure and image level loop closure for each image frame in a sequence. The blue curves represent the time taken by feature quantization and similarity score generation using inverted files with the inverse similarity evaluation methodology. We can see that the loop closure time of the non-HIF based method grows considerably with the number of images in the map, while with our method the loop closure time increases much more slowly. The performance gain becomes more prominent for large datasets, as can be seen in Fig. 5(b) for Dataset-11200, which contains 11200 images. The non-HIF based loop closure time for Dataset-11200 increases less steeply than that of Dataset-6560 because the average number of features per image in Dataset-11200 is smaller than in Dataset-6560, so each reference frame takes less time to process. The efficiency of our HIF-based method can be attributed to the combination of sparse topological mapping and HIFs for efficient map storage; the representational power of HIFs saves a large amount of the computation involved in loop closure.
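To make the two-level lookup concrete, the following minimal sketch shows how a node-level and an image-level inverted file can be organized so that a query only touches the entries of the visual words it contains. The class name, the plain occurrence-count scoring and the dictionary layout are illustrative assumptions, not the authors' data structure (a real system would weight words, e.g. with tf-idf, and normalise scores).

```python
# Hedged sketch of a two-level (node-then-image) inverted-file lookup.
from collections import defaultdict

class HierarchicalInvertedFile:
    def __init__(self):
        # visual word id -> {node id: occurrence count}
        self.node_index = defaultdict(lambda: defaultdict(int))
        # visual word id -> {(node id, image id): occurrence count}
        self.image_index = defaultdict(lambda: defaultdict(int))

    def add_image(self, node_id, image_id, words):
        for w in words:
            self.node_index[w][node_id] += 1
            self.image_index[w][(node_id, image_id)] += 1

    def query_nodes(self, words):
        """Node-level loop closure: score every place node touched by the query words."""
        scores = defaultdict(int)
        for w in words:
            for node_id, count in self.node_index[w].items():
                scores[node_id] += count
        return sorted(scores.items(), key=lambda kv: -kv[1])

    def query_images(self, words, node_id):
        """Image-level loop closure, restricted to one winning node."""
        scores = defaultdict(int)
        for w in words:
            for (nid, image_id), count in self.image_index[w].items():
                if nid == node_id:
                    scores[image_id] += count
        return sorted(scores.items(), key=lambda kv: -kv[1])

# Usage: quantize local features to visual words, call add_image(node, frame, words)
# while mapping, then query_nodes(words) and query_images(words, best_node) at query time.
```

Because a query word only visits the nodes and images in which it actually occurs, the cost of a lookup depends mainly on the vocabulary statistics rather than on the total number of reference images, which is the behaviour illustrated in Fig. 5.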



Fig. 5: Loop closure computation times of non-HIF based loop closure and HIF based loop closure on maps generated on our datasets. (a) Loop closure time, Dataset-6560; (b) loop closure time, Dataset-11200. Both plots show time (sec) against image id for the HIF based and non-HIF based methods.

VI. CONCLUSION

We proposed a sparse topological mapping framework involving two levels of loop closure. Hierarchical Inverted Files (HIFs) adapt naturally to loop closure in our sparse topological mapping framework and make fast loop closure possible. Image Sequence Partitioning (ISP) plays a key role in producing sparse topological maps. The sparsity of the maps produced with different features was analyzed and their accuracy was evaluated. Finally, our framework was evaluated in terms of the computational time required for loop closure. The experiments support our argument that HIFs are suitable for sparse topological maps, as they take advantage of the sparsity of the map to perform loop closure efficiently without discarding any information.

ACKNOWLEDGMENT
This work has been supported by the ANR projects CityVip and R-Discover. Special thanks to Mr. Jonathan Courbon for his help with the experimentation.


Near-optimal Exploration in Gaussian Process SLAM: Scalable Optimality Factor and Model Quality Rating
Anssi Kemppainen, Janne Haverinen, Ilari Vallivaara, Juha Röning
Computer Science and Engineering Laboratory, University of Oulu, Finland

Abstract: In our previous studies we examined near-optimal exploration in Gaussian process SLAM with a constant optimality factor. In each iteration we used the same factor to select an action that guaranteed near-optimal sensing while minimizing sensing site (i.e. localization) uncertainty. A problem arose from a tendency to select actions that provided extrapolating data with large prediction variance. After a short period of time, the sensing site distribution, represented by particle discretization, was too dispersed for the algorithm to find any near-optimal action. In this paper we propose a scalable optimality factor which still guarantees a constant level of optimality for exploration. This provides a mechanism for achieving a balance between exploration and exploitation, enabling exploitation actions in sensing regions that have smaller prediction variance. We also propose a method for comparing different models in terms of how well they are suited for localization. This is expressed as the expected sensing space coverage required to produce noisy measurements drawn from the model. Our primary application area is indoor localization based on ambient magnetic fields. In order to provide some quantification, we conducted simulations using three different data sets: a random field generated by sampling three orthogonal Gaussian processes, and magnetic field data from a small lobby and a larger hall. Even though the experiments confirmed the usability of our approach in smaller areas, localization problems in larger areas hinder our approach in practical applications. The authors believe the localization problems originate from using only a small fraction of the information that is available along a robot's path. This is supported by the promising results obtained in our parallel research, where we have conducted real-world magnetic field SLAM experiments.

I. INTRODUCTION

The objective of our research is to develop methods for exploiting magnetic fields in localization. Magnetic field localization is based on the observation that indoor magnetic fields provide sufficient spatial variation and temporal stability to permit inference about sensing sites, given noisy measurements (see e.g. [3]). One essential question is how to collect magnetic field information in an efficient way using mobile robots with localization uncertainties. In traditional SLAM approaches based on laser, sonar, or machine vision techniques, sensor coverage plays an important role, enabling previously observed structures to be utilized to decrease motion model uncertainties. In magnetic field SLAM, however, the sensor coverage of point-wise magnetometers is so restricted that frequent visits to already measured areas are required. While we exploit already gathered information in localization, we lose some degree of optimality in sensing quality (or modelling quality). How can we measure this quality, and is there some way to overcome the so-called exploration / exploitation dilemma?

Near-optimal sensing refers to studies by Guestrin et al. [2, 6], in which they developed a mutual-information-based sensing quality function for sensor placement applications. In [4] we extended their work to the SLAM context, where there is a need for replicated measurements, by proving that in Gaussian processes the expected information gain (mutual information) that sensing at certain sites provides for the model is a non-decreasing submodular set function. As was stated by Nemhauser et al. [7], for such a function a greedy algorithm is guaranteed to provide 1 − e⁻¹ optimality. Compared to related SLAM exploration studies (see e.g. [10, 5]), which focus on minimizing uncertainty in the joint distribution of the environmental model and the robot's trajectory, we only consider here the optimality of modelling. The authors do not know whether the near-optimality based on the non-decreasing submodular set function can be extended to consider the joint distribution as well. Since the trajectory distribution depends on the order in which the sensing sites are selected, the formal definition would require considering a set of action sequences instead of a set of sensing sites. In addition, we consider here only near-optimal sensing, ignoring the travelling costs (i.e. travel time) required to reach each sensing site. Near-optimal sensing that also considers travelling costs has been studied by Singh et al. [8, 9], and their approach could be utilized later in our exploration studies.

In this paper we improve our exploration algorithm by proposing a scalable optimality factor. The problem with a constant factor arose from a tendency to select actions that provided extrapolating data with large prediction variance. After a short period of time, the sensing site distribution, represented by particle discretization, was too dispersed for the algorithm to find any near-optimal action. In short, the algorithm was restricted to exploration with the lowest allowable optimality factor, so it never had a chance to take exploitation actions.

In order to compare how well different models are suited for localization, we also propose a model quality rating method. This method is based on the idea that entropy describes the expected fractal dimensionality of a set of sensing sites observed when sampling randomly from a sensing site distribution. It can be extended to approximate the expected sensing space coverage when randomly sampling from Gaussian processes.

This paper is organized as follows. In section 2 we give


a brief review of the sensing quality function. In section 3 we propose a renewed SLAM exploration algorithm that utilizes a scalable optimality factor. Section 4 introduces the model quality rating with quantitative results from different models. In section 5 we provide results from our SLAM exploration experiments, and finally we conclude our study in section 6.

II. SENSING QUALITY FUNCTION

Consider a Gaussian process given by f(x) ~ GP(m(x), k(x, x')), where m(x) is the mean function and k(x, x') is the covariance function. The Gaussian process marginalizes into a normal random vector F_V, where V = {s_1, ..., s_m} is a set of possible sensing sites. Now consider that for the measurement vector we have Y_V = F_V + W_V, where F_V and W_V are independent process and noise vectors, respectively. Following the notation of Krause [6], we can now define the sensing quality of a subset A of V with a set function

    F(A) = ∫ p(y_V) u(y_V, A) dy_V,

where y_V is a realization of a random vector and u(y_V, A) is a utility function. Spatial sampling design [1] asks what is the optimal selection of sensing sites |A| = k necessary to maximize sensing quality. In general, this is an NP-hard problem, but for certain types of sensing quality functions a greedy algorithm can be guaranteed to provide 1 − e⁻¹ optimality. As was stated by Nemhauser et al. [7], these functions must be non-decreasing and submodular. In [4] we proved that the following mutual information criterion

    F_MI(A) = H(Y_A) − H(Y_A | F_V)    (1)

is a non-decreasing submodular set function when A ∈ V^k. In practice this criterion gives us the expected information gain that sensing at sites A provides for the model. For a continuous sensing space X ⊆ R^d, we conjectured that

    max_A F_MI(A) = F_MI(A*)    (2)

where A ⊆ X^k and A* ∈ V^k. If the conjecture is true, this means the optimal set of sensing sites is always within the set of sensing sites where the quality of the model is estimated.

III. NEAR-OPTIMAL EXPLORATION

In the SLAM framework, motion model uncertainties affect the sensing site distribution, described by a stochastic process X_k = (X_1, ..., X_k), X_k : Ω -> X^k. As in [4], we use particle discretization to describe the sensing site distribution. Now, for the expected sensing quality we have

    E[F_MI(a)] = Σ_{i=1}^{q} F_MI(x_k^i(a)) P(x_k^i(a)),

where Σ_{i=1}^{q} P(x_k^i(a)) = 1, q is the number of particles, and a is the action sequence a = (a_1, ..., a_{k-1}). Any convex combination of submodular functions is also a submodular function; in that sense, the expected mutual information is also a submodular function. In contrast to our previous approach, for each action we make two measurements: one after a rotational control and the second after a translational control. This improves particle convergence by considering predictive densities also after the rotational control. However, for notational clarity, here we use x_{k+1} and y_{k+1} to denote such sensing sites and measurements, with x_k = (x_1, ..., x_k) and y_k = (y_1, ..., y_k). Notice that we also use the same principle for the greedy sampler. In the following subsection we present a scalable optimality factor that enables visiting sensing regions with smaller prediction variance, yet still guarantees near-optimality.

A. Scalable optimality factor

As in [4], we now require for all particles

    F_MI(x_k(a)) ≥ α F_MI(A_greedy) ≥ α (1 − e⁻¹) F_MI(A_OPT),    (3)

where α is a constant optimality factor (note that in [4] we denoted the same factor by one minus a constant). In [4], in each iteration we required an action that provides minimum posterior sensing site uncertainty, yet guarantees a constant near-optimal increment in sensing quality. We measured the uncertainty (or sensing cost) by differential entropy over a Gaussian approximation. Here we use the trace of the posterior covariance matrix, which gives a large sensing cost also to narrow distributions. The constrained optimization problem that we had is given by

    min_a  Cost(X_k, X_{k+1}(a) | y_k)
    subject to  Δ(x_{k+1}(a) | x_k) > αδ  for all x_k ∈ X_k,    (4)

where Δ(x_{k+1}(a) | x_k) is the mutual information increase for a particle x_k taking action a, and δ is the greedy increase obtained in round k+1. The algorithm favors actions close to the lower optimality limit, continuously requiring new information from the environment. Since the posterior sensing site distribution is proportional to the measurement likelihood, new information with large prediction variance provides hardly any decrease in sensing site uncertainty. Instead, if we allow the algorithm to take actions that provide maximal increments in sensing quality, this leads to a dispersed particle distribution, which often has no individual particle close to the true sensing sites. So, we have to define the maximum allowed dispersion, which gives us the opportunity to use a scalable optimality factor. This is given by

    max_a  min_{x_k ∈ X_k} Δ(x_{k+1}(a) | x_k)
    subject to  Cost(X_{k+1}(a) | X_k, y_k) < C_lim,    (5)

where C_lim is the upper limit of dispersion of the (k+1)-th sensing site. In the following, we explain how we can guarantee that this approach leads to near-optimal solutions.
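To make the role of the greedy sampler, against which the optimality factor is measured, more concrete, the following minimal Python sketch selects sensing sites one at a time so as to maximise the mutual-information criterion of eq. (1) for a GP with i.i.d. Gaussian noise. It is not the authors' code; the squared-exponential kernel and the parameter values are illustrative assumptions only.

```python
# Greedy MI-based site selection: F_MI(A) = 0.5 * log det(I + K_AA / sigma_n^2),
# which is eq. (1) when the noise W_V is i.i.d. Gaussian. Because F_MI is
# non-decreasing and submodular, the greedy set is within 1 - 1/e of optimal.
import numpy as np

def sq_exp_kernel(X, Y, sf2=1.0, ell=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell ** 2)

def f_mi(sites, X, sigma_n2=0.01):
    if not sites:
        return 0.0
    K = sq_exp_kernel(X[sites], X[sites])
    _, logdet = np.linalg.slogdet(np.eye(len(sites)) + K / sigma_n2)
    return 0.5 * logdet

def greedy_sampler(X, k, sigma_n2=0.01):
    """Pick k sensing sites greedily; also returns the greedy increases delta^(i)."""
    A, gains = [], []
    for _ in range(k):
        base = f_mi(A, X, sigma_n2)
        best = max(range(len(X)), key=lambda s: f_mi(A + [s], X, sigma_n2))
        gains.append(f_mi(A + [best], X, sigma_n2) - base)
        A.append(best)
    return A, gains

# Example: 100 candidate sites on a unit grid, choose 5.
grid = np.stack(np.meshgrid(np.linspace(0, 1, 10), np.linspace(0, 1, 10)), -1).reshape(-1, 2)
sites, deltas = greedy_sampler(grid, 5)
```

The returned increments play the role of the greedy increase δ used in constraints (3)-(5).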


In each iteration we can compute the total sensing quality already provided by the greedy sampler, i.e.

    F̄_MI = Σ_{i=1}^{k} δ^(i),

and define the lowest allowed increment by

    δ_min(k+1) = αδ + [α F̄_MI − min_{x_k ∈ X_k} F_MI(x_k)] / (k+1),

where α is the constant optimality factor, see Equation (3). Now we have the property that δ_min(k+1) ≤ αδ, which in practical use allows several exploitation actions to be taken. This leads us to the algorithm where in each iteration we first start with constrained exploitation (4) and continue, if possible, with constrained exploration (5). The full iterative procedure is presented in Algorithm 1.

Algorithm 1: Iteration of near-optimal exploration with a scalable optimality factor
Input: V, X_k, δ, δ_min, y_k
Output: X_{k+1}, a*
 1: A ← ∅
 2: for all s ∈ V do
 3:   compute control action a_s from E(X_k) to s
 4:   draw X_{k+1}(a_s) ~ g(X_k, U(a_s))
 5:   A ← A ∪ {a_s}
 6: end for
 7: A ← sort(Cost(X_k, X_{k+1}(A) | y_k))
 8: i ← 1
 9: repeat
10:   a_e ← a_i   {A = (a_1, a_2, ..., a_|A|)}
11:   i ← i + 1
12: until ∀ x_k ∈ X_k : Δ(x_{k+1}(a_e) | x_k) > δ_min
13: δ_{k+1} ← δ_min
14: while Cost(X_{k+1}(a_i) | X_k, y_k) < C_lim do
15:   if ∀ x_k ∈ X_k : Δ(x_{k+1}(a_i) | x_k) > δ_{k+1} then
16:     δ_{k+1} ← δ_{k+1} + ε
17:     a_e ← a_i
18:   else
19:     i ← i + 1
20:   end if
21: end while
22: a* ← a_e
23: X_{k+1} ← X_k ∪ X_{k+1}(a*)

At the input we have the set of sensing sites V, the particle distribution X_k, the greedy sampler increase δ = δ(k+1), the minimum allowed increment δ_min, and the measurements y_k. At the output we have the near-optimal action a* and the updated particle distribution X_{k+1}. On lines 2-7 we build a set of actions, draw particle distributions for each action, and sort the actions in ascending order of sensing cost. On lines 9-12 we select a minimum sensing cost action that guarantees near-optimality. On lines 14-21 we select an action with a maximum sensing quality increase whose particle dispersion stays below the upper limit. The constant ε is used to increment the scalable optimality factor δ_{k+1}. Finally, on line 23 we update each particle with the two sensing sites provided by the action a*.

D(si ) =

where the base |V| expresses that all sites can be represented separately.3 The expected dimensionality among all possible outcomes is expressed by entropy H|V| (X) E[D(X)] =
n

1 log(|V|)

log(P (si ))P (si )


i=1

and the expected number of sensing sites observed in i.i.d random sampling is given by NV = |V|H|V| (X) . In order to consider dimensionality in a continuous sensing space X Rd , we can divide the sensing space into a set of evenly sized regions X = {X1 , X2 , . . . , Xn } and approximate the sensing region probabilities by P (Xi ) = p(x) dx
n j=1

xXi

p(xi ) , p(xj )

where p(xi ) is the probability density at the center value xi . This way we can provide a measure of the expected coverage by 1 |X|P (X) = nHn (X) |X|, n where |X| is the volume of the sensing space. Consider now that we are provided with i.i.d. random samples from the marginal measurement distribution and we want to know the expected sensing space coverage required to observe such events. This enables us to compare different models in terms of how well they are suited for localization. Entropy for a posterior sensing region distribution is given by Hn (X|y) = 1 log(n)
n i=1

log(P (Xi |y))P (Xi |y)

and the expected value of entropy integrated over the marginal measurement distribution is given by Hn (X|Y ) = Hn (X|y) p(y)dy, (6)

where p(y) is the probability density for a measurement y. If we now dene a measurement as y(x) = f (x) + w, where
3 Note that decoding the sensing sites on base 2 would lead us to Shannons self-information.
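The entropy-to-coverage mapping above can be illustrated numerically. The sketch below is not from the paper; the two example probability vectors are synthetic, and it only demonstrates that a peaked posterior over the sensing regions yields a much smaller expected coverage than a uniform one.

```python
# Expected coverage from a (posterior) region distribution: n**H regions out
# of n, scaled by the total sensing-space volume.
import numpy as np

def base_n_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]
    return -(nz * np.log(nz)).sum() / np.log(len(p))

def expected_coverage(p, total_area):
    n = len(p)
    return (n ** base_n_entropy(p)) / n * total_area   # |X| * P_bar(X)

uniform = np.ones(100)                                  # flat over 100 regions
peaked = np.exp(-0.5 * (np.arange(100) - 50) ** 2 / 3.0 ** 2)
print(expected_coverage(uniform, 10.0))                 # ~10 m^2
print(expected_coverage(peaked, 10.0))                  # much smaller
```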


If we now define a measurement as y(x) = f(x) + w, where f(x) is the Gaussian process and w is normally distributed sensor noise, the marginal measurement density is given by

    p(y) = Σ_{i=1}^{n} N(y; m(x_i), σ_f(x_i)² + σ_n²) P(X_i),

where m(x) is the mean function, σ_f(x)² is the signal variance, and σ_n² is the sensor noise variance. As we see above, p(y) is, in fact, a mixture of Gaussian densities. Next, assuming that the prior distribution of sensing regions is uniform, we have

    p(y) = (1/n) Σ_{i=1}^{n} N(y; m(x_i), σ_f(x_i)² + σ_n²),

and for the posterior sensing region probabilities we get

    P(X_i | y) = p(y | x_i) / Σ_{j=1}^{n} p(y | x_j),

where p(y | x_i) = N(y; m(x_i), σ_f(x_i)² + σ_n²) as already stated above. In order to compute the expected entropy (6), we can average over m random samples from the marginal measurement distribution, which gives the entropy as

    H_n(X|Y) ≈ (1/m) Σ_{i=1}^{m} H_n(X | y_i).    (7)

Fig. 1. A small lobby (blue color) in the middle of intersecting corridors on the CSE Oulu laboratory floor.

Fig. 2. A larger open hall on the bottom floor of the Department of Electrical and Information Engineering. The magnitude of the magnetic field is projected in the picture.

In magnetic field localization, our emphasis is on providing quantification for the models collected either manually or using SLAM methods (see e.g. [11]). Magnetic fields can be described by three orthogonal Gaussian processes f_x(x), f_y(x) and f_z(x), which means the conditional measurement densities for each component are independent, i.e. p(y | x_i) = p(y_x | x_i) p(y_y | x_i) p(y_z | x_i). However, for the marginal measurement distribution this independence does not hold in general, since

    p(y) = (1/n) Σ_{i=1}^{n} p(y_x | x_i) p(y_y | x_i) p(y_z | x_i)
         ≠ [(1/n) Σ_{i=1}^{n} p(y_x | x_i)] [(1/n) Σ_{i=1}^{n} p(y_y | x_i)] [(1/n) Σ_{i=1}^{n} p(y_z | x_i)],

where equality holds only if two of these conditional densities are sensing site invariant. The entropy of the marginal measurement distribution therefore satisfies

    H_n(Y) = H_n(Y_x, Y_y, Y_z) ≤ H_n(Y_x) + H_n(Y_y) + H_n(Y_z),

and for the posterior sensing region distribution we have

    H_n(X|Y) = H_n(X) + H_n(Y_x|X) + H_n(Y_y|X) + H_n(Y_z|X) − H_n(Y)
             ≥ H_n(X) + H_n(Y_x|X) + H_n(Y_y|X) + H_n(Y_z|X) − H_n(Y_x) − H_n(Y_y) − H_n(Y_z).

In practice, this means we cannot be too optimistic with respect to our models, and sampling must be performed from the joint marginal distribution.
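The Monte Carlo evaluation of eq. (7) for a three-component field can be sketched as below. The field means and noise level are synthetic stand-ins (not the paper's data), and the GP prediction variance term σ_f(x_i)² is omitted for brevity, so this is only an illustration of the averaging procedure.

```python
# Monte Carlo estimate of H_n(X|Y) for a three-component field (eq. (7)).
import numpy as np

rng = np.random.default_rng(0)
n = 400                                    # sensing regions
mean = rng.normal(0.0, 20.0, size=(n, 3))  # per-region means of (f_x, f_y, f_z)
sigma2 = 0.4 ** 2                          # sensor noise variance per component

def posterior_entropy(y):
    # p(y | X_i) is the product of the three per-component Gaussians
    log_lik = -0.5 * ((y - mean) ** 2 / sigma2).sum(axis=1)
    log_lik -= log_lik.max()
    post = np.exp(log_lik)
    post /= post.sum()
    nz = post[post > 0]
    return -(nz * np.log(nz)).sum() / np.log(n)

# i.i.d. samples from the joint marginal: pick a region uniformly, add noise.
m = 1000
regions = rng.integers(0, n, size=m)
samples = mean[regions] + rng.normal(0.0, np.sqrt(sigma2), size=(m, 3))
H = np.mean([posterior_entropy(y) for y in samples])   # eq. (7)
print("H_n(X|Y) ~= %.3f, expected regions ~= %.1f" % (H, n ** H))
```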


A. Results

We measured the sensing space coverage with three different data sets: three random functions drawn from a stationary Gaussian process, and magnetic field measurements from a small lobby (see Fig. 1) and from a larger hall (see Fig. 2). Figures 3, 4, and 5 present these data sets, respectively. In the figures, the different vector components are, from left to right, F_x, F_y and F_z. The model qualities were approximated on |X_i| = (0.05 m)² grids. In order to provide predictive densities p(f | x_i), we used stationary Gaussian process regression with zero mean functions and a squared exponential covariance function with hyperparameters σ_f² = (20 μT)² and l² = (0.5 m)². For the sensor noise we used σ_n² = (0.4 μT)², which modelled the true sensor noise of our robotic platform. The same values were also used in our SLAM exploration experiments. Tables I, II, and III present the sensing space coverage for each vector field component separately and finally for the whole vector field. The benefit of using the whole vector field in localization is clear from the results. However, since we consider only the sensing site distribution, the results assume the sensor orientation is known. In practical applications where robots move in lateral directions, only y_z measurements can be assumed to be somewhat orientation invariant. For the L2 norms √(y_x² + y_y²) and √(y_x² + y_y² + y_z²) the conditional measurement densities follow non-central chi distributions, which are harder to analyze, and we have excluded that comparison here. Comparing the results on the different data sets, we notice that the random vector field, with more spatial variation and a larger range of values, provides a much lower expected sensing space coverage. The results for the lobby and hall data are very similar. This can be explained by looking at the data sets: the lobby data provide more spatial variation, but the hall data have a larger range of values. We used 1000 samples drawn from the marginal measurement densities.

Fig. 3. Randomly sampled data sets from a stationary Gaussian process. The data sets are sampled from 400 evenly spaced sites in a [−1900, 1900] × [−1900, 1900] mm² area.

Fig. 4. Measured magnetic field data sets from a small lobby. The data sets consist of measurements from 285 evenly spaced sites in a [−1750, 1750] × [−2250, 2250] mm² area.

Fig. 5. Measured magnetic field data sets from a larger hall. The data sets consist of 714 measurements from a [−2000, 2000] × [−3300, 3300] mm² area.

TABLE I: Sensing space coverage for random data.

    y                   H_n(X|Y)   n^{H_n(X|Y)}   |X| P̄(X|Y)
    y_x                 0.543      110            0.28 m²
    y_y                 0.580      150            0.38 m²
    y_z                 0.555      120            0.30 m²
    (y_x, y_y, y_z)^T   0.141      3.4            0.0085 m²

TABLE II: Sensing space coverage for lobby data.

    y                   H_n(X|Y)   n^{H_n(X|Y)}   |X| P̄(X|Y)
    y_x                 0.777      910            2.3 m²
    y_y                 0.743      680            1.7 m²
    y_z                 0.751      730            1.8 m²
    (y_x, y_y, y_z)^T   0.503      83             0.21 m²

TABLE III: Sensing space coverage for hall data.

    y                   H_n(X|Y)   n^{H_n(X|Y)}   |X| P̄(X|Y)
    y_x                 0.773      1300           3.3 m²
    y_y                 0.719      800            2.0 m²
    y_z                 0.770      1300           3.3 m²
    (y_x, y_y, y_z)^T   0.479      85             0.21 m²

V. EXPERIMENTS

We conducted SLAM exploration simulations on the previously described data sets. For each vector component we used the same hyperparameters and zero mean functions as described in the previous section. For the optimality factor we used the value α = 0.5 and the number of particles was 500. For each data set we conducted 10 simulations in a [−500, 500] × [−500, 500] mm² area with |V| = 25 sensing sites. In these simulations we used the motion model parameters described in [4]. Table IV presents the mean squared errors of the sensing site distributions for the different data sets. In order to provide some meaningful quantities, we present both the averaged value and the standard deviation over the 10 simulation runs. From the standard deviations we see that the results vary between simulations. However, in general, the near-optimal exploration works with useful accuracy. Comparing the results for the different data sets, we notice that the random field provides a better environment for localization than the true data sets. This can be explained with the results provided in the previous section. Table V presents mean squared errors for the models, averaged over 10 simulations. Here we provide results for the greedy sampler, the near-optimal explorer, and pure odometry. It is interesting to notice that the mean squared error for the f_z component is rather low also for odometry with the true magnetic field data. This can be explained by looking at Figures 4 and 5: since f_z is orientation invariant and its variations in the sensing area are small, the modelling accuracy is less dependent on the sensing site errors. On the other hand, this means the amount of information the f_z channel provides for localization is low. In order to evaluate how well the algorithm scales up to a larger environment, we compared localization results in two different sized sensing areas. Figure 6 shows how the localization error grows when actions are taken outside the well modelled area. This can be explained by the sparser sample set required for near-optimality, which, however, produces large prediction variance.
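The stationary GP regression used for the predictive densities p(f | x_i) can be sketched as follows, with the hyperparameters quoted above (σ_f = 20 μT, l = 0.5 m, σ_n = 0.4 μT). The training points are synthetic placeholders; this is a generic zero-mean GP regression sketch, not the authors' implementation.

```python
# Zero-mean GP regression for one field component, squared-exponential kernel.
import numpy as np

sf2, l2, sn2 = 20.0 ** 2, 0.5 ** 2, 0.4 ** 2   # (uT^2, m^2, uT^2)

def k(X, Y):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / l2)

def gp_predict(X_train, y_train, X_test):
    K = k(X_train, X_train) + sn2 * np.eye(len(X_train))
    Ks = k(X_test, X_train)
    alpha = np.linalg.solve(K, y_train)
    mu = Ks @ alpha                                    # posterior mean
    v = np.linalg.solve(K, Ks.T)
    var = sf2 - np.einsum('ij,ji->i', Ks, v) + sn2     # predictive variance incl. noise
    return mu, var

rng = np.random.default_rng(1)
X_train = rng.uniform(-0.5, 0.5, size=(30, 2))         # sensing sites in metres
y_train = rng.normal(0.0, 20.0, size=30)               # placeholder readings in uT
mu, var = gp_predict(X_train, y_train, np.array([[0.0, 0.0]]))
```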


TABLE IV: Mean squared errors for the sensing site distribution in 10 experiments in a [−500, 500] × [−500, 500] mm² area. Units are [m²].

    iteration          5       10      20      30      40      50
    Random field
      Exp.  avg      0.009   0.052   0.138   0.243   0.323   0.394
            std      0.009   0.066   0.111   0.166   0.249   0.294
      Odo.  avg      0.050   0.468   2.14    6.19    10.3    16.7
            std      0.042   0.762   1.71    4.18    7.04    9.50
    Lobby mf
      Exp.  avg      0.008   0.071   0.207   0.466   0.715   1.01
            std      0.009   0.070   0.238   0.469   0.756   1.10
      Odo.  avg      0.064   0.458   2.03    4.87    10.2    16.6
            std      0.072   0.334   1.26    2.14    4.61    7.19
    Hall mf
      Exp.  avg      0.022   0.057   0.303   0.596   0.765   1.12
            std      0.038   0.044   0.191   0.477   0.520   0.682
      Odo.  avg      0.125   0.715   3.33    8.76    15.2    20.8
            std      0.268   1.00    3.63    11.1    14.6    18.9

TABLE V: Average mean squared errors for the models in 10 experiments in a [−500, 500] × [−500, 500] mm² area. Units are [μT²].

    iteration          5       10      20      30      40      50
    Random field
      Gre.  f_x      35.0    19.8    0.436   0.056   0.049   0.037
            f_y      63.8    30.7    1.40    0.065   0.053   0.037
            f_z      202     17.4    0.639   0.057   0.048   0.038
      Exp.  f_x      125     96.2    36.5    12.8    6.45    6.17
            f_y      147     108     25.6    15.0    14.1    13.3
            f_z      177     119     29.8    15.4    13.6    13.0
      Odo.  f_x      192.5   194.5   585     861     1610    1200
            f_y      237     534     861     2440    1030    940
            f_z      455     979     2870    2850    2290    3200
    Lobby mf
      Gre.  f_x      1.10    0.534   0.117   0.074   0.057   0.041
            f_y      0.857   0.589   0.156   0.069   0.047   0.040
            f_z      9.27    6.58    0.070   0.063   0.048   0.043
      Exp.  f_x      38.0    18.8    12.5    9.47    4.78    2.70
            f_y      11.3    12.1    9.30    7.28    4.66    1.93
            f_z      411     180     70.1    27.8    9.32    3.27
      Odo.  f_x      151     111     98.1    86.2    59.5    57.5
            f_y      91.6    402     468     339     284     180
            f_z      400     200     72.2    36.2    28.9    17.1
    Hall mf
      Gre.  f_x      0.499   0.120   0.097   0.070   0.047   0.035
            f_y      0.980   0.228   0.096   0.067   0.056   0.040
            f_z      13.2    3.08    0.093   0.075   0.049   0.034
      Exp.  f_x      9.48    8.40    10.2    8.85    5.94    3.40
            f_y      43.3    22.8    15.0    10.3    3.17    1.27
            f_z      406     213     67.4    30.6    11.7    3.52
      Odo.  f_x      108     140     255     220     139     132
            f_y      79.4    45.5    145     138     213     109
            f_z      373.2   216     116     97.2    39.4    33.8

Fig. 6. Two experiments with different sized sensing areas in the random field data set. In the upper case the sensing area is [−500, 500] × [−500, 500] mm², whereas in the lower case it is [−1000, 1000] × [−1000, 1000] mm². Green lines are odometry, black lines are the true path, and particles are presented in gray. The particle with maximum weight is presented in blue.

VI. CONCLUSIONS

In this paper we proposed a near-optimal exploration algorithm for Gaussian process SLAM with a scalable optimality factor. Even though the optimality factor scales between iterations, we can still guarantee a constant level of optimality in exploration. At the same time, this provides a technique for achieving a balance between exploration and exploitation. In order to evaluate how well different models are suited for localization, we proposed a model quality rating based on the expected sensing space coverage required to observe random samples drawn from the marginal measurement distribution. This method can be used later to evaluate magnetic fields for localization purposes in different indoor environments (e.g. hospitals, commercial buildings, etc.).

Finally, in order to give some results with the renewed near-optimal exploration algorithm, we built simulations on three different data sets, two of them being real-world magnetic fields. Even though the results were encouraging in small sensing environments, in larger and more practical environments we still face localization problems. Compared to our real-world magnetic field SLAM studies [11, 12], where we have utilized a dense measurement interval, the larger prediction variance in near-optimal exploration causes slower particle convergence when exploitation actions are taken. Instead of measuring sensing quality with respect to the size of the sample set, one could overcome the localization problems by considering sensing quality with respect to the length of the sensing path. This could also decrease the computational complexity by reducing the number of particles required to approximate the sensing site distribution. However, the authors do not know whether there are means to measure the optimality of such paths.

ACKNOWLEDGMENT

This work has been supported by Infotech Oulu and the Academy of Finland.


REFERENCES

[1] N. Cressie. Statistics for Spatial Data. Wiley, New York, USA, 1993.
[2] C. Guestrin, A. Krause, and A. Singh. Near-optimal sensor placements in Gaussian processes. In International Conference on Machine Learning (ICML), 2005.
[3] J. Haverinen and A. Kemppainen. Global indoor self-localization based on the ambient magnetic field. Robotics and Autonomous Systems, 57(10):1028-1035, 2009.
[4] A. Kemppainen, J. Haverinen, I. Vallivaara, and J. Röning. Near-optimal SLAM exploration in Gaussian processes. In IEEE 2010 International Conference on Multisensor Fusion and Integration for Intelligent Systems (IEEE MFI 2010), 2010.
[5] T. Kollar and N. Roy. Efficient optimization of information-theoretic exploration in SLAM. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 1369-1379, 2008.
[6] A. Krause. Optimizing Sensing: Theory and Applications. PhD thesis, Carnegie Mellon University, 2008.
[7] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265-294, 1978.
[8] A. Singh, A. Krause, C. Guestrin, W. Kaiser, and M. Batalin. Efficient informative sensing using multiple robots. Journal of Artificial Intelligence Research (JAIR), 34:707-755, 2009.
[9] A. Singh, A. Krause, and W. Kaiser. Nonmyopic adaptive informative path planning for multiple robots. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2009.
[10] C. Stachniss, G. Grisetti, and W. Burgard. Information gain-based exploration using Rao-Blackwellized particle filters. In Proceedings of Robotics: Science and Systems (RSS), pages 65-72, 2005.
[11] I. Vallivaara, J. Haverinen, A. Kemppainen, and J. Röning. Simultaneous localization and mapping using ambient magnetic field. In IEEE 2010 International Conference on Multisensor Fusion and Integration for Intelligent Systems (IEEE MFI 2010), 2010.
[12] I. Vallivaara, J. Haverinen, A. Kemppainen, and J. Röning. Magnetic field-based SLAM method for solving the localization problem in mobile robot floor-cleaning task. In The 15th International Conference on Advanced Robotics (ICAR 2011), 2011.


An Active SLAM Approach for Autonomous Navigation of Nonholonomic Vehicles


Eduardo Lopez, Caleb De Bernardis, Tomas Martinez-Marin
Department of Physics, System Engineering and Signal Theory, University of Alicante, Alicante, Spain
Abstract In this paper we propose a new approach for active SLAM (Simultaneous Localization And Mapping) of nonholonomic vehicles. Both the environment and the vehicle model are unknown in advance, thus the path planner uses reinforcement learning to acquire the vehicle model, which is estimated by a reduced set of transitions. At the same time, the vehicle explores the environment creating a consistent map through optimal path motion. The mapping is represented by sets of ordered and weighted data points, named objects, that provide some advantages with respect to conventional methods. In order to guide the navigation and to build a map of the environment the planner employs a three-dimensional controller based on the concept of virtual wall following. Both simulation and experimental results are reported to show the satisfactory performance of the method. Index Terms Learning and adaptation, Optimal path planning, SLAM

I. INTRODUCTION

Most of the autonomous vehicles currently found in industry follow a strict itinerary with very limited interaction with the environment. Intelligent vehicles should exhibit autonomous behavior, learning from experience through interaction with the environment. In particular, they should be able to build a consistent map of the environment in order to navigate in an intelligent manner, as well as to learn and update their internal dynamic models to maximize their short-term capabilities under changing conditions (e.g. terrain conditions, battery levels). Over recent years, optimal path planning and active SLAM applications have become more important in many fields. The necessity of keeping hybrid information about the map in path planning problems was stated early in [1], where the integration of grid-based and topological maps enabled the development of path planning processes. Metric and dense maps are essential for this kind of application. Extended Kalman Filter (EKF) based SLAM generates sparse maps, since it uses features for localization. Hybrid maps for EKF-SLAM containing discrete features along with metric grid information are developed in [2], but they have consistency problems when the features are updated. In [3], hierarchical SLAM schemes are presented, extracting sparse maps, solving the problem of loop closure and updating the correlations of the features involved in the loop every time it is detected. Rao-Blackwellized particle filter (RBPF) methods can generate dense maps [4] and can be suitable for control and planning purposes, but they present data association problems.

Car-like vehicles are widely used in industry due to their

load-carrying capability with only one drive motor. These vehicles are nonholonomic systems, whose motion laws have been a well studied topic [5], [6], [7]. A number of planning algorithms have been proposed for generating trajectories that are computed in open loop [8]. Minimal length paths of simplified car-like vehicles have been characterized in [7]; this result is proved in the absence of obstacles and the paths are computed in open loop. Dynamic Programming (DP) provides a closed-loop solution that includes obstacles [9]. In this paper, we present a generic approach for active SLAM applied to nonholonomic vehicles, where the planner guides the motion through the path that maximizes map consistency. The algorithm combines reinforcement learning with some concepts from the CACM technique [10]. In this context, there are several applications of robot motion planning by reinforcement learning. For example, in [11] an optimal motion planner for mobile robots based on RL was developed, but without the SLAM capability. More recently, in [12] a robot docking task was solved through RL in a visual servoing framework.

The paper is organized as follows. Section II describes the vehicle platform. Section III provides a brief introduction to the SLAM problem, including some improvements with respect to conventional SLAM methods. In section IV the optimal motion planner based on reinforcement learning is presented. Implementational aspects of the new algorithm are tested in sections V and VI through simulations and experimentation on a real vehicle, respectively. Conclusions appear in section VII.

II. VEHICLE MODEL

The reinforcement learning controllers have been implemented on the nonholonomic vehicle shown in Fig. 1. The vehicle is equipped with an array of infrared sensors, a laser scanner (Hokuyo URG-04LX) and a CMOS camera (the image sensor has not been employed in this application). The vehicle is autonomous, using two MPC555 microcontrollers to process all sensors. In order to process the data in real time, the laser scanner has been configured to its lowest resolution (2°), which significantly affects the quality of the experimental results. Furthermore, the steering device of the vehicle is a radio control (RC) servo without a feedback sensor for measuring the wheel angle. This prevents accurate vehicle odometry, giving more relevance to the online learning of the vehicle model using the range sensor.

Fig. 1. The autonomous vehicle employed for the experiments.

The vehicle model will be described by means of a state-space formulation [7], as follows:

    ẋ = v_T cos θ cos φ,          (1)
    ẏ = v_T sin θ cos φ,          (2)
    θ̇ = (v_T / L) sin φ,          (3)

where v_T is the translational velocity of the driving wheels and φ is the steering angle of the vehicle, which represent the two control variables of the system. The distance between the reference point (x, y) and the middle point of the driving wheels is L, and the orientation of the car is denoted by θ. It is important to note that the state equations are only used in the simulations to describe the robot trajectories. In the experiments, the model is built online by recording the transitions between states and the immediate rewards.
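For reference, the model (1)-(3) can be simulated with a simple Euler integration as sketched below. The wheelbase, speed and steering angle are illustrative values, not the parameters of the experimental vehicle.

```python
# Euler-integration sketch of the car-like kinematic model (1)-(3).
import math

def step(x, y, theta, v_t, phi, L=0.3, dt=0.1):
    """One sampling period: (x, y) reference point, theta heading,
    v_t driving wheel speed, phi steering angle, L wheelbase (assumed)."""
    x += v_t * math.cos(theta) * math.cos(phi) * dt
    y += v_t * math.sin(theta) * math.cos(phi) * dt
    theta += (v_t / L) * math.sin(phi) * dt
    return x, y, theta

# Example: drive forward for 2 s with a constant 10 degree steering angle.
state = (0.0, 0.0, 0.0)
for _ in range(20):
    state = step(*state, v_t=0.2, phi=math.radians(10.0))
```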

III. SLAM

SLAM deals with the problem of building a map of the environment in which the vehicle is navigating, while maintaining an estimate of the vehicle location within the map. In this paper we propose to represent the environment by objects, in contrast with conventional methods such as EKF-SLAM or grid-based approaches. In this section we outline the basis of the proposed SLAM method, which is integrated with the optimal motion planner in an active SLAM framework. Subsection III-A describes the prediction stage, which uses an online updated vehicle model. Subsection III-B describes how to improve the robustness of data association by taking into account the concept of an object as a new way of organizing sensor data for mapping, in order to reduce erroneous associations between features and observations. Subsection III-C provides a brief introduction to some localization aspects considered in the presented SLAM method. Finally, in subsection III-D the object-based approach for mapping the environment is briefly described.

A. Prediction stage

Our proposed SLAM method aims to fuse the system dynamics of the vehicle with the observations to generate a more robust estimate of the sensor position. For this purpose, we have considered a parametric model based on state transitions to describe the vehicle dynamics. This model is used to predict the next system state in the localization stage. As shown in Fig. 3 (right), each possible trajectory of the vehicle is modeled as a circumference with a specific radius, obtained from a set of calculated positions recorded while the vehicle performs that trajectory. The radius for each trajectory is the optimal radius that minimizes the squared error between the model and the observations. This parametric model has the advantage of taking into account the dynamical variations that could affect the system prediction. Vehicle steering errors can depend on several mechanical factors that usually change over time; by considering the most recent observations, it is possible to maintain an adaptive vehicle model.

B. Data association

In order to properly estimate the vehicle location, the data association between observations from the sensor and features kept in the map must be solved. The general Nearest Neighbor approach [13], based on one-to-one association between observations and features, can lead to data association errors in certain cases, for instance when observations are close to each other or when the vehicle position uncertainty remains high. In order to enhance data association performance, the object-based description is exploited, due to its higher level of abstraction in comparison with other SLAM methods.

C. Localization stage

Most SLAM methods proposed in the literature are mainly based on EKF or RBPF filters [14]. In these methods, cross-correlations between the state vector components and those of the features are calculated to estimate the reliability of each feature for localizing the vehicle. In our case, a metric-based scan matching algorithm is employed [15]. The method uses the Iterative Closest Point framework to estimate and minimize the relative displacements between the laser scan and the associated objects. To overcome inconsistencies due to local scan matching, a weighted average between the observed (X_o) and predicted poses (X_p) is calculated, using the mean squared error between the laser data and the map data points as a measure of the goodness of each pose.

D. Mapping stage

As mentioned above, in this paper we propose to represent the environment by objects. An object is a parametric description of a physical contour by means of a set of ordered and weighted data points. The objects grow, merge and split as the robot explores the environment (mapping). Two mechanisms for resampling the objects are implemented: one physical, applied when the robot range sensor collides with the object, updating the positions of the point sequence; and another virtual, in order to maintain the minimum number of particles that represent the object with a certain degree of accuracy. For this task, the Douglas-Peucker (DP) algorithm has been used. The object-based description introduces some advantages with respect to existing methods. The objects can represent any arbitrary contour, which is particularly suitable for outdoor environments. Through a scan matching algorithm, the robot can be localized using all the information provided by the associated objects, instead of using a set of landmarks (as in EKF-SLAM based approaches), which is especially useful in the case of curved-shape objects (see the simulation results). On the other hand, the proposed method is appropriate for mapping large environments, since the representation is formed by a set of one-dimensional structures. In contrast, grid-based methods build maps in the form of two-dimensional structures.

IV. OPTIMAL MOTION PLANNER

Although the state space of the system is three-dimensional, some vehicle behaviors, such as virtual wall following, can be specified in a two-dimensional state space. A virtual wall is created by manipulating the vehicle sensors in such a way that the vehicle can navigate in free space without any physical guide. In this case, the problem is simplified by fixing the value of v_T. The task is then reduced to controlling the orientation of the vehicle (θ) in a two-dimensional state space (d, θ), where d is the distance between the vehicle and the virtual wall, and θ is the relative orientation of the vehicle with respect to the wall. In our reinforcement learning experiments, we only allow 11 possible actions in each state within the steering range (−15°, 15°). The vehicle control has been implemented by a two-level architecture, which is updated online using reinforcement learning. First, a two-dimensional controller is designed following a reinforcement learning approach. The objective of the 2D controller is to move the vehicle forward or backward in a smooth and local fashion. Such a controller becomes a single action of a three-dimensional controller, which provides the vehicle path planning in the Cartesian space.

TABLE I: Parameters of the RL algorithm on the real vehicle.

    State variables: 2 (1661 states)     x_1: −2.5 ≤ d ≤ 2.5 m. Cells: 41
                                         x_2: −100° ≤ θ ≤ 100°. Cells: 41
    Objective state:                     (d, θ) = (0 m, 0°)
    Control variables: 2 (11 actions)    u_1: −0.2 ≤ v_T ≤ 0.4 m/s
                                         (u_2: −15° ≤ φ ≤ 15°)
    Sampling time:                       T_s: 0.1 sec
    Reward:                              r = 100 if goal; r = −20 if sink;
                                         r = −n(T_s) otherwise
    Adjoining distance:                  D_2

A. Reinforcement learning

Reinforcement learning methods only require a scalar reward to learn to map situations (states) to actions [16]. As opposed to supervised learning, they do not require a teacher to acquire the optimal behavior; they only need to interact with the environment, learning from experience. The knowledge is saved in a look-up table that contains an estimate of the accumulated reward to reach the goal from each situation or state. The objective is to find the actions (a = π(s)) that maximize the accumulated reward in each state. The accumulated reward for each state-action pair Q(s, a) is updated by the one-step equation

    Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a)),    (4)

where Q is the expected value of performing action a in state s, r is the reward, α is a learning rate which controls convergence, and γ is the discount factor. If the reward function is proper, the discount factor can be omitted (γ = 1). The action a with the highest Q value at state s is the best policy up to instant t. When t → ∞ the policy approximates the optimal behavior:

    a* = π*(s) = arg max_{a'} Q(s, a').    (5)

In Q-learning the transitions between states are evaluated at fixed sample times, while with our RL controller the transitions have to satisfy the adjoining distance condition in order to be evaluated. By appropriate selection of this distance with respect to the number of cells, it is possible to minimize quantization effects and better approximate the optimal behavior of the system. The adjoining distance states that the distance D between the current cell and the previous cell is equal to some integer value k equal to or greater than 1. For our RL controller, we define the adjoining property D_k in terms of the continuous states x and x' as follows: the states x and x' satisfy the adjoining distance D_k if

    D(x, x') = max_j |x_j − x'_j| / h_j = k,    (6)

where x_j indicates the j-th component of the state x and h_j is the cell size of the j-th dimension.

B. The two-dimensional controller

The two-dimensional controller has been implemented as a model-based reinforcement learning method, where the backups are made by simulation following the shortest path search backward in time, starting from the goal state. The RL algorithm deals with several data structures for organizing the available information and storing the partial results of the learning process. These structures are the following:

Q(s, a): the Q-value table where the accumulated reward for the (s, a) pair is saved.

Initialize Q(s, a), M(x, a) and first(Queue) ← goal
REPEAT
  x ← current state
  s ← cell(x)
  a ← policy(s, M, Q)
  Execute action a
  Observe resultant state x' and reward r
  IF D_k-adjoining(x, x') THEN
    M(x, a) ← x', r
  REPEAT
    s' ← first(Queue)
    FOR all (s, a) that led to s':
      x' ← M(x, a)
      s' ← cell(x')
      Update Q(s, a) using Eq. (4)
      Insert s in Queue, sorted by max_a Q(s, a)
  UNTIL N times
  If Queue is empty: first(Queue) ← goal
UNTIL training terminated
Fig. 2. Reinforcement Learning Algorithm.

Fig. 3. Left: two-dimensional controller. State space (d, θ) showing the controllable trajectories for forward motion. Right: in red, vehicle trajectories performed for six different controls; in blue, the circumference fitting model.

M(x, a): the local vehicle model. It contains an average of the relative transitions of the car, transformed to local coordinates during the learning process.
Queue: contains the states to be updated, in the correct order to find the optimal policy in only one sweep.
policy(s, M, Q): selects a policy to estimate the model (M) and to exploit the best policy acquired (Q).
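The interplay between the Q-table, the model M and the adjoining-distance condition can be sketched as below. This is a hedged illustration only: the grid bounds follow Table I, but the learning rate, the reward handling and the flat state indexing are assumptions rather than the authors' implementation.

```python
# Tabular backup with the D_k adjoining check of eq. (6) before updating eq. (4).
import numpy as np

cells = np.array([41, 41])                       # 41 x 41 grid of (d, theta) cells
h = np.array([5.0 / 41, 200.0 / 41])             # cell sizes: d in metres, theta in degrees
Q = np.zeros((cells[0] * cells[1], 11))          # 11 steering actions per state

def cell(x, lo=np.array([-2.5, -100.0])):
    idx = np.clip(((x - lo) / h).astype(int), 0, cells - 1)
    return idx[0] * cells[1] + idx[1]

def adjoining(x, x_next, k=2):
    """Eq. (6): the (rounded) maximum per-dimension cell distance equals k."""
    return int(round(np.max(np.abs(x_next - x) / h))) == k

def backup(x, a, x_next, r, alpha=0.3, gamma=1.0):
    """Apply the one-step update (4) only on D_k-adjoining transitions."""
    if adjoining(x, x_next):
        s, s_next = cell(x), cell(x_next)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```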

The RL algorithm that implements the concepts described above is presented in Fig. 2 (see [11] for a more detailed description). The state is represented in the algorithm by a real-valued vector x, which is converted to the discrete state s (an integer index) by the function cell(). In our experiments, uniform discretization was used with 41 cells per variable (see Table I for full details of the RL parameters). The function D_k-adjoining() is used to determine whether the adjoining property has been satisfied. The index s is used to update the Q-table, and x is used to update the function M. Since the controller uses noisy data from an image sensor, the function M() filters the state of the system before storing it; in our experiments an averaging filter was used. For the virtual wall following behavior, the aim of the controller is to move the vehicle from any initial position inside the region of interest to a goal position (in the case of the 2D controller the goal is (0 m, 0°)) through a minimum-time trajectory. The function policy() selects an action for each transition of the system. In this application it is important that at the beginning the vehicle does not move far away from the home state, since the map has not been created yet. For this reason, the initial policy consists of sweeping all the actions, executed in an adequate order to keep the vehicle close to the home state, until the 3D controller takes the

Fig. 4. Three-dimensional controller. Simulation of the time-optimal motion planning of the nonholonomic vehicle considering the presence of obstacles.

control of the vehicle. The 3D controller requires the vehicle to move forward and backward in order to avoid possible obstacles in the environment (Fig. 3). For that reason, two RL controllers are built: one for forward motion and the other for backward motion. Employing both controllers, the vehicle is able to avoid obstacles and turn around (see the experimental results in Fig. 6) in a natural manner.

C. The three-dimensional controller

The path planning can be improved using a control architecture with two levels of abstraction. The bottom layer of the control architecture is the two-dimensional (2D) controller, which becomes a single action when it is used by the three-dimensional (3D) controller (upper layer). The 3D controller employs the current map and the 2D controller to build an optimal look-up table that allows the vehicle to move from the current state to the objective state in the presence of

obstacles, as follows:
1) When a new objective is selected, if there is a path free of obstacles, the controller uses a look-up table previously calculated and saved in memory for optimal motion in the absence of obstacles.
2) If the path collides with an obstacle, the controller calculates a new optimal path using the 2D controller for forward and backward motion alternately. This way, the path is time-optimal, avoiding obstacles and employing the minimum number of changes between forward and backward motion. Furthermore, several artifacts due to the discretization of the state space, such as spurious limit cycles, are avoided.

Fig. 4 shows a simulation of the 3D controller in the presence of obstacles. The forward and backward motion of the vehicle can be observed as it reaches the goal (x = 0 m, y = 0 m, θ = 0°) while avoiding the obstacles. The path planner makes the SLAM active by selecting objectives that maximize both the reliability map and the exploration zone. The reliability factor of the robot pose is a measure of the certainty of the robot localization given the current mapping. In contrast, the exploration map measures the opportunity for growth of the explored zone. Both criteria should be balanced to maximize the explored area without a significant reduction in the reliability of the robot localization. The reliability of the current map can be incorporated into the motion planner in several ways. A straightforward way is to introduce the reliability factor into the cost function of the optimal planner. The drawback of this option is that the reliability factor changes continuously, forcing the planner to update the policy at the same rate, which can be impractical in most scenarios. Another way is to include the reliability information in intermediate objectives, acting as constraints that the planner must satisfy during the exploration of the environment. In this case the computational effort of the motion planning is greatly reduced, and for that reason the latter option has been adopted in this paper. The reliability factor associated with a position of the vehicle is calculated by applying the DP algorithm to the objects accessible from that position. The exploration map is calculated as the ratio of free area to the area detected by the range sensor.

V. SIMULATION RESULTS

In this section we outline the results obtained using the simulation environment depicted in Fig. 5 (top left). To clarify the concept of the reliability map, we have considered an environment large enough that there exists a central zone which the range sensor cannot detect. Combining information from both the exploration and reliability maps, the path planner performs a SLAM policy consisting of exploring the environment while keeping the estimation accuracy bounded. The reliability of the central zone is very low, since it is far away from any contour. Thus, the path planner calculates intermediate objectives to avoid crossing that

Fig. 5. Simulation results. Top left, a curved-shape environment. Top right, robot trajectory along with the reliability map. Bottom left, mapping using the path planner. A single object (external red line) maps the environment (external blue line). Bottom right, mapping without the path planner. A new object (external green line) appears, mapping the environment (external blue line) incorrectly.

Fig. 5 (top right) shows the robot trajectory along with the reliability map, projected from the orientation dimension onto the xy-plane. Although the estimation error (the difference between the real trajectory (red line) and the ideal trajectory (blue line)) grows in the central zone, it decreases again as the vehicle leaves the unreliable estimation zone, since the vehicle is able to correct the localization error using the observations and the map. In this case, the vehicle is initially located at position (x = 0, y = 0) with vertical orientation. The path planner starts by selecting intermediate objectives so that the vehicle first moves straight ahead and afterwards performs a counterclockwise round. At that point, the path planner is switched off, moving the vehicle through the central zone of the environment where the reliability factor is low, as shown in Fig. 5 (bottom left). For the sake of comparison, Fig. 5 (bottom right) shows a simulation without the path planner. In this case, the vehicle moves through the central zone before performing a closed loop as in the previous case. The accumulated odometry error provokes a remarkable error in the mapping when the vehicle detects the first unexplored zone (green line) encountered after leaving the central zone.

VI. EXPERIMENTAL RESULTS

In this section we present a preliminary experiment carried out on the real vehicle described in Section II. To test the performance of the proposed method on a real autonomous vehicle, we have considered moving the vehicle in an open space with a corridor. Fig. 6 (top) shows the experimental environment, where the vehicle is situated in

the top-right of the image. The first step was to move the vehicle around the home state to acquire the optimal motion behavior in a known area. In order to reduce the computation time, the steering actions have been limited to eleven per state within the range of (-15°, 15°). Fig. 6 shows the results of the active SLAM algorithm. The environment is modeled by a set of objects, each of them plotted in a different color. The path planner calculates intermediate objectives using the reliability factors extracted from the current map. The concatenated objectives allow the vehicle to explore new regions while keeping a precise localization at all times. The environment contains some obstacles that do not appear in the map shown in Fig. 6 (top). The vehicle successfully avoided the obstacles at the end of the corridor, although the mapping of the corridor has suffered a certain deviation. Considering that the vehicle odometry is deficient (the vehicle does not have a steering sensor), the resulting mapping can be considered acceptable for its purpose.

VII. CONCLUSION

The new active SLAM approach has been successfully employed for autonomous navigation of nonholonomic vehicles. The vehicle learns in a few minutes a virtual wall-following behavior used by a three-dimensional controller to explore the environment. In contrast with conventional RL techniques (Q-learning, Dyna-Q, Prioritized Sweeping), the algorithm does not need to use function interpolation to find a close-to-optimal behavior in continuous state spaces. Furthermore, this approach provides a closed-loop solution in the presence of obstacles, including forward and backward motion. The result is a consistent map obtained through the adequate selection of intermediate objectives with high localization reliability and close to unexplored areas. The mapping is represented by one-dimensional objects. The object-based description introduces some advantages in relation to existing methods, such as scalability. The proposed algorithm is suitable for exploring unknown environments, creating a vehicle model and a map of the environment. The path planning is improved using a control architecture with two levels of abstraction. As commented in the introduction, vehicles that aim to be considered intelligent must exhibit an autonomous behavior, learning from experience through interaction with the environment. The approach proposed in this work aims at the construction of fully autonomous and intelligent vehicles.

REFERENCES

Fig. 6. Experimental results. The environment is mapped by objects representing contours in different colors.



Robustness of the Quadratic Antiparticle Filter for Robot Localization


John Folkesson
Center for Autonomous Systems, Royal Institute of Technology (KTH), Stockholm, Sweden

Abstract: Robot localization using odometry and feature measurements is a nonlinear estimation problem. An efficient solution is found using the extended Kalman filter, EKF. The EKF, however, suffers from divergence and inconsistency when the nonlinearities are significant. We recently developed a new type of filter based on an auxiliary variable Gaussian distribution, which we call the antiparticle filter, AF, as an alternative nonlinear estimation filter that has improved consistency and stability. The AF reduces to the iterative EKF, IEKF, when the posterior distribution is well represented by a simple Gaussian. It transitions to a more complex representation as required. We have implemented an example of the AF which uses a parameterization of the mean as a quadratic function of the auxiliary variables, which we call the quadratic antiparticle filter, QAF. We present simulations of robot feature-based localization in which we examine the robustness to bias and disturbances, with comparison to the EKF.
Index Terms: Localization, Nonlinear Estimation, Robust Estimation

I. AF IN RELATION TO THE STATE OF THE ART

Nonlinear estimators are used in many fields. They are an important part of feedback control. They can be used for parameter estimation or machine learning. We have developed a new type of recursive Bayesian estimator based on an auxiliary variable Gaussian distribution as the posterior. We call the filter the antiparticle filter, AF. It has properties that allow it to fill a gap in the current spectrum of existing nonlinear estimators.

The extended Kalman filter, EKF [17], has been a workhorse of nonlinear estimation due to its ease of implementation and efficiency of computation. It is however well known that the EKF becomes inconsistent over time due to the linearization around the estimated state. If the estimated state is very close to the true state then this inconsistency will be limited. Inconsistency is often characterized by the state covariance estimate being too small relative to the true error distribution, so-called overconfidence. This can lead to divergence of the filter. Robustness can be improved by adjusting (increasing) the covariance as in the robust EKF [20]. This can be done adaptively [11], [18]. However, this can make the filter underconfident, which will lead to suboptimal decisions and estimates. Our AF is equivalent to the iterative EKF, IEKF [3], while the uncertainty is low. It then adaptively evolves to use a non-Gaussian distribution as uncertainty increases, evolving back if the uncertainty is then reduced.

Problems in the EKF arise due to the linearization of the nonlinear system in order to then apply the linear Kalman filter equations. This linearization at a single point in the state space does not give a picture of the true system at points away from the linearization point. This is addressed by approaches such as the unscented Kalman filter, UKF, which is an example of the linear regression Kalman filters, LRKF [19, 12], [13, 10]. These estimation methods are similar to the EKF in that they represent the posterior distribution as a Gaussian. They however do not linearize the equations but rather map a selection of regression points (or sigma points) through the nonlinearities to estimate how the entire Gaussian should be mapped. Our AF is similar to the UKF in that it also maps selected points, which we call antiparticles, through the nonlinearity. It however does not then fit the points to a Gaussian ellipse but rather to a more general shape.

Particle filters, PF [9, 2], offer a nonparametric approach to nonlinear state estimation. They represent the posterior distribution by a set of particles or samples drawn from that distribution. By mapping these using the full nonlinear system equations and sampling the noise one can estimate arbitrary systems. They do however require that the particles fill the state space adequately. Thus they need more particles as the uncertainty increases and as the number of dimensions increases. Our AF is more general than Gaussian filters but not as general as the PF. On the other hand the AF does not sample the distribution and so it does not suffer the limitations of sampling. In particular it is more efficient for many applications. The need for more particles as the dimension increases is especially problematic for SLAM. This led to the development of the Rao-Blackwellized PF for SLAM, FastSLAM [14]. These represent only part of the state by particles, while the easy dimensions are represented by conditional EKFs. Our AF is similar to the Rao-Blackwellized PF in that it uses a conditional Gaussian distribution solved by the IEKF. We use a continuous distribution instead of the PF for the distribution of the Gaussian parameters. This allows a continuous analytic distribution and avoids particle depletion completely.

Besides the EKF and PF variations there are a few other techniques. Some of these use a mixture of Gaussians [1, 6]. Others use batch and graphical methods [4, 15, 16, 8, 5]. These methods give very accurate results but are in general more complicated to implement and require more calculation


than the AF.

II. GAUSSIAN ESTIMATION

Gaussian estimation of the state x_k starts with a process such as:

x_k = f(x_{k-1}, u_k) + ν_k,   ν_k ∼ G(0, Q_k)
z_k = h(x_k) + η_k,   η_k ∼ G(0, R_k)

Fig. 1. The left figure shows how one can shift uncertainty (cov x) from P to the auxiliary variables (Λ, C). The right figure is a schematic of the predict phase.

Here z_k are the measurements, u_k are the control signals, and Q/R are the covariances of the white noise. In general the functions f and h might be nonlinear. In the EKF one linearizes these around a good estimate μ:

f(x_{k-1}, u_k) ≈ f(μ, u_k) + J^f(μ)(x_{k-1} − μ),
h(x_k) ≈ h(μ) + J^h(μ)(x_k − μ),

where J represents the Jacobian of the nonlinear vector function. The linear Kalman filter is then used as a recursive state estimator which models the posterior distribution as a Gaussian:

G(x − μ, P) = (1 / √((2π)^n |P|)) exp(−(1/2)(x − μ)^T P^{-1} (x − μ))   (1)
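For reference, the following is a minimal sketch of one EKF predict/update cycle built from the process model and linearization given in this section. It is a generic textbook-style implementation under the stated model, not code from the paper; the helper functions f, h, Jf and Jh are assumed to be supplied by the user.

```python
import numpy as np

def ekf_step(mu, P, u, z, f, h, Jf, Jh, Q, R):
    """One EKF cycle for x_k = f(x_{k-1}, u_k) + nu_k, z_k = h(x_k) + eta_k.

    f, h   : nonlinear process and measurement functions
    Jf, Jh : functions returning their Jacobians at a given state
    Q, R   : process and measurement noise covariances
    """
    # Predict: propagate the mean through f and the covariance through J^f.
    F = Jf(mu, u)
    mu_pred = f(mu, u)
    P_pred = F @ P @ F.T + Q

    # Update: linearize h around the predicted mean.
    H = Jh(mu_pred)
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    innovation = z - h(mu_pred)
    mu_new = mu_pred + K @ innovation
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred
    return mu_new, P_new
```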

III. AUXILIARY VARIABLE GAUSSIAN DISTRIBUTION

In [7] we have introduced the antiparticle filter AF and shown it to be far more consistent than the EKF and IEKF. We will present the main points of the AF here. We take, instead of eq. (1), a different form for the posterior:

p(x) = ∫ G(x − m_λ, P) G(λ, C) (dλ)^{n_λ}   (2)

Here λ is a vector of auxiliary variables. The curves m_λ are the mean state conditional on the auxiliary variables. These curves only need to be differentiable wrt λ. We will, however, make a specific choice in this paper of:

m_λ = μ + Λλ + (1/2) λ^T Γ λ   (3)

where Γ is an x-space vector of symmetric λ-space matrices. We call the antiparticle filter that results from this the quadratic antiparticle filter, QAF. For this case we can write the expectation values:

E[x] = ∫ m_λ G(λ, C) (dλ)^{n_λ} = μ + (1/2) Σ_{i,j} Γ_{ij} C_{ij}   (4)

E[xx^T] − E[x]E[x^T] = P + ΛCΛ^T + (1/2) Σ_{i,j,k,l} Γ_{ij} C_{ik} C_{jl} (Γ_{kl})^T   (5)

E[λλ^T] − E[λ]E[λ^T] = C   (6)

IV. GROWING AND SHRINKING AUXILIARY VARIABLES

The Gaussian distribution (as when n_λ = 0) will work fine as long as the uncertainty (modeled by P) is small. We will adaptively change the number of auxiliary dimensions, n_λ, based on the size of the P eigenvalues. Eq. (5) and Fig. (1) show how one can shift uncertainty from P to the auxiliary variables and back. If the eigenvector decomposition of P is

P = U Σ U^T   (7)

we can choose to grow a new auxiliary dimension, q, when Σ_{rr} > L_g for some threshold L_g (L_g = 1.0 in our experiments). This is done using the eigenvector u:

Λ_{iq} = U_{ir} = u_i   (8)
C_{qq} = Σ_{rr} (1 − ε)   (9)
P → P − u C_{qq} u^T   (10)
P^{-1} → P^{-1} + (P^{-1}u) C_{qq} (P^{-1}u)^T / (1 − C_{qq} u^T P^{-1} u)   (11)

where ε is a small number to maintain a positive definite P. We see from Eq. (5) that the covariance will not change from introducing the auxiliary variable. (We can add an auxiliary dimension for any unit vector direction u by substituting (u^T P^{-1} u)^{-1} for Σ_{rr}.)

We can similarly eliminate auxiliary dimension q. We first diagonalize the C matrix. Then we have no change to the mean or covariance by making these changes to μ and P:

μ → μ + (1/2) Γ_{qq} C_{qq}
P_{ij} → P_{ij} + Λ_{iq} C_{qq} Λ_{jq} + Σ_{k≠q} Γ^i_{qk} C_{qq} C_{kk} Γ^j_{qk} + (1/2) Γ^i_{qq} C_{qq} C_{qq} Γ^j_{qq}

We make the above changes if trace(P) < L_s (L_s = 0.8 in our experiments) and also shrink C, Λ, and Γ by deleting the q rows and columns.

V. PREDICT

We will use the common double subscript notation to distinguish parameters, such as P, estimated before the k-th prediction, P_{k-1|k-1}, after prediction, P_{k|k-1}, and after updating, P_{k|k}. Fig. (1) describes the predict step. We predict


the new mean and P exactly as in the EKF. The C parameter is unchanged by prediction, while the predicted m_λ is interpolated based on mapping selected points, called antiparticles, through the nonlinear function f. The antiparticles correspond to m_{λ,k-1|k-1} evaluated at λ = λ_i, chosen as

λ_0 = 0,  λ_i = (0, ..., √C_{k|k-1,ii}, ..., 0)^T,  λ_{-i} = −λ_i,  λ_{ij} = (λ_i + λ_j)/√2,  i, j ∈ {1, 2, ..., n_λ}, j < i,

where we have first diagonalized C. The antiparticles are:

x^i_{k-1|k-1} ≡ m^{λ_i}_{k-1|k-1},   x^{ij}_{k-1|k-1} ≡ m^{λ_{ij}}_{k-1|k-1}

This gives us:

x^i_{k|k-1} = f(x^i_{k-1|k-1}, u_k)
P_{k|k-1} = Q_k + J^f(x_{k-1}) P_{k-1|k-1} (J^f)^T(x_{k-1}) |_{x_{k-1} = m^0_{k-1|k-1}}

We then fit the predicted curve parameters to the predicted antiparticles:

μ_{k|k-1} = x^0   (12)
Λ_{k|k-1,i} = (x^i − x^{-i}) / (2|λ_i|)   (13)
Γ_{ii} = (x^i + x^{-i} − 2x^0) / λ_i^2   (14)
Γ_{ij} = (2x^{ij} − (1 − α)(x^i + x^j) − α(x^{-i} + x^{-j})) / (|λ_i||λ_j|)   (15)

where α = (1 − √2)/2 and the subscripts k|k-1 on x^i are suppressed. We see that we can parameterize m_λ either by (μ, Λ, Γ) or by the antiparticle set {x^i, x^{ij}}. For repeated prediction there is no need to fit (μ, Λ, Γ), as the antiparticle set can be used for the next prediction.

VI. UPDATE

The update starts after converting from the antiparticle set to (μ, Λ, Γ). There are three phases to the update, each of which uses a Gauss-Newton minimization of the negative log of the posterior distribution. Fig. (2) describes the 3-phase update. The maximum likelihood estimate MLE is given by:

(x_MLE, λ_MLE) = argmin_{x_k, λ} g(x_k, λ)   (16)
g(x_k, λ) = [(x_k − m_{λ,k|k-1})^T P^{-1}_{k|k-1} (x_k − m_{λ,k|k-1}) + δz(x_k)^T R^{-1} δz(x_k) + λ^T C^{-1}_{k|k-1} λ] / 2

where δz(x_k) = z_k − h(x_k). During phase 1 we hold x_k = m_{λ,k|k-1} while minimizing g wrt λ. Then in phase 2 we allow both to vary. In phase 3 we hold λ = λ_i during minimization of g, leading to a new set of antiparticles to be used in the following predict.

Fig. 2. The update is done in 3 phases: first move along the curves m_{λ,k|k-1}, then find the new MLE by varying both x and λ, then find the updated interpolation points to fit m_{λ,k|k} to.

Phase 1 is needed when making large changes to the mean. Attempting to vary both x_k and λ is not stable. The minimization is done using Gauss-Newton minimization:

λ_0 = argmin_λ g(m_{λ,k|k-1}, λ)   (17)

We can then move λ_0 iteratively starting from λ_0 = 0:

∇_λ g = (C^{-1}_{k|k-1} λ_0 − (J^h J^m)^T R^{-1}_k δz) |_{x = m^{λ_0}_{k|k-1}}
Δλ_0 = −(C^{-1}_{k|k-1} + (J^h J^m)^T R^{-1}_k J^h J^m)^{-1} ∇_λ g |_{x = m^{λ_0}_{k|k-1}}   (18)

where J^m denotes the Jacobian of m_λ wrt λ. Phase 2 is very similar except that the composite state of both λ and x is used:

w_λ ≡ (J^m)^T P^{-1}_{k|k-1} (x_k − m_{λ,k|k-1}) − C^{-1}_{k|k-1} λ   (19)
w_x ≡ (J^h)^T R^{-1}_k δz(x_k) − P^{-1}_{k|k-1} (x_k − m_{λ,k|k-1})   (20)
W_λλ ≡ (J^m)^T P^{-1}_{k|k-1} J^m + C^{-1}_{k|k-1}   (21)
W_λx ≡ −P^{-1}_{k|k-1} J^m   (22)
W_xx ≡ (J^h)^T R^{-1}_k J^h + P^{-1}_{k|k-1}   (23)

So now the iteration starts at λ = 0 and x_k = m^0_{k|k-1} using:

(Δλ, Δx_k) = W^{-1} w = [ W_λλ  W_λx^T ; W_λx  W_xx ]^{-1} (w_λ, w_x)   (24)

Before we can do phase 3 we need to update the C matrix in order to find the new λ_i points. We do this by first shifting the definition of λ so that λ = 0 at the MLE:

m_λ → m_{λ + λ_MLE, k|k-1}   (25)

We match the covariance of the new λ by updating C by

C_{k|k} = ∫ λλ^T G((λ, x − x_MLE), W^{-1}) (dx)^{n_x} (dλ)^{n_λ}

C^{-1}_{k|k} = [W_λλ − W_λx^T W_xx^{-1} W_λx] |_{x_k, λ = λ_MLE}
            = C^{-1}_{k|k-1} + (J^h J^m)^T (J^h P_{k|k-1} J^{hT} + R)^{-1} J^h J^m |_{λ_MLE}
            = C^{-1}_{k|k-1} + (J^h Λ)^T (J^h P_{k|k-1} J^{hT} + R)^{-1} J^h Λ


where we used the fact that Λ = J^m |_{x_k, λ=λ_MLE} for our particular choice of m_λ. We also can set

P^{-1}_{k|k} = W_xx |_{λ_MLE}   (26)

Phase 3 then uses the updated C_{k|k} to define λ_i as we did during predict. Then another Gauss-Newton iterative minimization is done for each λ_i to find the updated antiparticles, starting from x^i_{k|k} = m^{λ_i}_{k|k}, using

x^i_{k|k} = argmin_{x_k} g(x_k, λ_i)   (27)
Δx^i_{k|k} = W_xx^{-1} w_x |_{x_k = x^i_{k|k}, λ = λ_i}   (28)

We then use exactly the same interpolation formula to give the updated curves m_{λ,k|k}.

The three update phases all require Gauss-Newton minimization. If the system were linear, one iteration of this would take one to the global minimum. In general neither the direction nor the distance along that direction is correct. We therefore do N_i iterations for each Gauss-Newton step and move a distance given by a line search along the given direction. The line search is done by evaluating g at N_s equally spaced points from distance 0 to 2 times the indicated distance. We then fit the minimum of these N_s points and the two points to either side of it to a quadratic polynomial in the distance along the line. Minimizing this polynomial then gives the starting point for the next iteration. We stop after N_i iterations or when the change is small (N_i = 20 and N_s = 21 in our experiments).

VII. EXPERIMENTS

In the first 3 experiments we studied the robustness of the QAF estimates wrt unmodeled noise. We compared the EKF, IEKF and QAF in robot localization experiments. For each experiment we did simulations where the true path was the same each time, with noise added to the odometry. The map had 4 point features on the corners of a square and the path is shown in Fig. (3). For these experiments the main source of nonlinear error was the motion model:

f(x, u) = (x, y, θ)^T + [cos θ, 0; sin θ, 0; 0, 1] (Δs, Δθ)^T   (29)

Fig. 3. Here is an example of one of the 1000 simulations for Experiment 1, where we add no bias. The EKF is shown on the right while the QAF is in the center. The 4 point features are arranged on the corners of a square and the true path is a counter-clockwise circuit around this map. The first leg along the bottom edge has a relatively small error and both the EKF (left) and QAF (right) handle this error well. At the start of the fourth leg the EKF amplifies its heading error while the QAF manages to almost eliminate its heading error completely. The final pose of the EKF is even worse, with it heading along the x-axis while the true heading is in the negative y. The right shows the QAF with large bias just before updating with the fourth feature. The fact that the estimate is still consistent with the QAF distribution is why the QAF is able to correct the heading. The EKF and IEKF were not consistent at this same point.

We set Q_k = J_u Q J_u^T + 10^{-10} I, where J_u is the Jacobian of f wrt u and Q is a 2x2 diagonal matrix. The measurements were of the range and bearing to point features. In experiment 1 we modeled the covariance correctly but introduced bias in the odometry. In Fig. (3) we show an example of the paths. Even with no bias the EKF diverges badly in this example. The QAF manages to make the correct adjustment at all four feature observations for all levels of bias. In Figs. (4) and (5) we summarize the errors over all 12 cases. We see that the QAF, while degraded by the bias, had fewer instances of divergence even at the highest bias and none at the lower biases. The EKF and IEKF had divergences on all their eight cases, even if it is not possible to see all the IEKF outliers in Fig. (4) due to the resolution.

Fig. 4. We show the error histogram for θ, with the amount of theta bias added shown along the left edge of each row. The few divergent runs for the IEKF in each row do not show up at this resolution, but we can note that the EKF and IEKF had runs with heading errors of π on all series, while the QAF never had errors that high (although the angle may have wrapped past π on the last row). For the right column the errors were in (-0.8, 0.9), (-0.8, 0.9), (-0.66, 1.10), and (-3.10, 3.12) top to bottom.

In experiment 2 we modeled the steady-state noise correctly but introduced a random heading disturbance at four points along the trajectory, in the middle of each straight section. In the three right plots of Fig. (5) we summarize the errors. The results were similar to those for adding bias, except that the highest disturbance did not affect the IEKF and QAF as much as the highest bias did. That is of course dependent on just the amount of disturbance and bias we chose. Again the EKF and IEKF had divergent runs for all values of the disturbance while the QAF never diverged. So one can state that the behavior of the filters with disturbance or bias is similar.

In experiment 3 we tested the robustness wrt the model's Q value, which was investigated by estimating using Q/10 and 10Q. In Fig. (6) we show the root mean square errors in the pose that resulted from 500 simulations. The QAF did not diverge as a result of the modeling error in Q. The EKF and IEKF diverged on some simulations for each of the cases. We conclude that the shape matters more than the size of the estimated posterior distribution.

In experiment 4 we changed the map. The odometry path


was a straight line along the x axis starting from the origin and moving 0.2 distance units per iteration. The true path was this with noise added as a diagonal Q matrix giving the Gaussian variance in u = (Δs, Δθ)^T. The robot went with no measurements for about 500 time steps, after which it got a measurement of one of the features that were placed in a large ring around the origin. An example is shown in Fig. (7). In order to evaluate the need for the iteration and interpolation procedure in update phases 1-3, we did some trials (QAF 1 and 2) where we left these out (i.e. set the max iterations to 1 and use the default Gauss-Newton distance). We also evaluated the sufficiency of the pre-update step, phase 1, by preceding it with a random sampling step (for QAF 2 and 4) when the following criterion was met:

δz(m^0_{k|k-1})^T R^{-1}_k δz(m^0_{k|k-1}) > 5 dim(z).   (30)

We sampled the G(λ, C) distribution after the predict step to generate many points at which to evaluate g(m_λ, λ). We then started our phase 1 pre-update at the point that gave the minimum g. This did not help further when we used 20 iterations and interpolation (QAF 3 vs. 4). It did help quite a bit when we did only one iteration and no interpolation (QAF 1 vs. 2).

Fig. 5. Here we show the 1000-simulation root mean square errors in x, y, and θ for Experiments 1 and 2. The bars are in groups of 3 for EKF, IEKF and QAF. For the three Exp. 1 plots on the left, bars 1-3 have no bias, 4-6 have 0.00002 per iteration, 7-9 have 0.0002, and 10-12 have 0.002. We see that the EKF gave very poor estimates regardless of the bias. The IEKF had about double the mean errors of the QAF until series 11, in which it diverged often. The QAF degraded with the highest bias but only to the level of the IEKF with the lower biases. The three rightmost plots show Exp. 2: bars 1-3 have no disturbances, 4-6 have 0.02 rad, 7-9 have 0.08, and 10-12 have 0.16. We see that the EKF gave very poor estimates regardless of the disturbance. The IEKF had about double the mean errors of the QAF until series 11, in which it diverged often. The QAF showed little degradation in the mean from the disturbances.

Fig. 6. Here we see the summarized 500-simulation root mean square errors in (x, y, θ) for the nine trials of Experiment 3. The estimators are grouped by threes for the modeled noise Q_m = (Q, Q/10, 10Q) respectively. Estimators 1, 4, 7 are the EKF, 2, 5, and 8 are the IEKF, and 3, 6, and 9 are the QAF.

Besides the EKF and IEKF we compared 4 different versions of the QAF. In QAF 1 we use the simplest update, with only one iteration and no line search. This is done for all three phases of the update. In QAF 2 we add to algorithm QAF 1, only for the updates where the criterion of eq. (30) was true (for example the first measurement), a random sampling step to find a good starting point for update phase 1. In QAF 3 we increase the number of iterations in QAF 1 from 1 to a maximum of 20 and also search along the line defined by the Gauss-Newton update. We do this for the update phases 1, 2, and 3. In QAF 4 we add to QAF 3 the random sampling as in QAF 2. This turns out not to help QAF 3 further, indicating that the iterative Gauss-Newton updates with line search are sufficient to find the minimum.

In Fig. (8) we plotted the results of 1000 trials for each of the 6 algorithms. The test statistic here is based on the Mahalanobis distance after the first feature measurement update:

d^2 = (x − x_true)^T (Cov_x)^{-1} (x − x_true)   (31)

where x_true is the true pose and Cov_x is the estimated covariance according to eq. (5). If the errors were Gaussian as the EKF assumes, then d^2 would follow a chi-square distribution. The cumulative chi-square distribution of d^2 is computed for each of the 1000 trials and then this list is sorted and the values plotted. If the EKF estimator were consistent, the test statistic would be uniformly distributed and its cumulative distribution would be a line from 0 to 1. We varied the uncertainty in the Q value to see how the various estimators performed. We see in the left of Fig. (8) that for relatively low uncertainty the plots for QAF 2-4 seem to tend towards the hypothetical straight line. They show some overconfidence, but not as much as the IEKF. The EKF and QAF 1 algorithms are already very overconfident here. In the right of Fig. (8) we have increased the uncertainty to the point that the QAF is just beginning to break down. We know from looking at the shapes of the distribution that the QAF are becoming inconsistent, but as they do not assume that errors are Gaussian, they can be consistent even off the straight line. QAF 3 and 4 are still able to stay near the hypothetical line for most of the trials, but seem overconfident on about a quarter of them.

The reason that the QAF outperforms the EKF is easily seen by comparing the EKF and QAF in Fig. (7). We see that the distribution of the EKF model simply cannot follow the crescent shape of the true distribution, while the QAF comes a lot closer. Fig. (7) shows the result of the first feature measurement update. The EKF has the robot outside the circle of features and heading into it. The QAF on the other hand gives a pose that, while not exactly correct, is a reasonable one given the information. The true pose is within the distribution of likely poses for the QAF.

VIII. CONCLUSIONS

The improved posterior distribution of the QAF as compared to the (I)EKF leads both to better accuracy and less sensitivity


Fig. 7. We show an example of one trial of the QAF algorithm. For this estimate the noise was relatively high. The estimated path here is based on the odometry, which gives a path along the x axis starting from the origin. The true path includes simulated noise. This is the distribution just before getting the first feature measurement. Therefore all four QAF variations looked the same at this point. There are 13 auxiliary dimensions and the resulting distribution of poses is displayed by plotting points sampled from it. As one can see, the true path is within the point cloud and thus the estimate is approximately consistent in this regard. The distribution is beginning to become inconsistent, as can be seen by the fact that the point cloud has points outside the circle of features (a circle cannot be represented by a quadratic m_λ). In the middle and right figures we show the QAF 3 algorithm one step after the iteration shown on the left. This is the distribution just after updating with the first feature measurement. There are 4 auxiliary dimensions and the resulting distribution of poses is displayed by plotting points sampled from it. As one can see for the QAF on the left, the true path is within the point cloud and thus the estimate is still approximately consistent in this regard. For the EKF on the right, the distribution is displayed by plotting the 2σ ellipse around the mean of the Gaussian xy distribution. As one can see, the true path is not near this ellipse and thus the estimate is certainly inconsistent. The feature observation is not very close to the actual point feature location either.

Fig. 8. Here we plot the sorted test statistic χ²_3(d²), where d² is the normalized squared estimation error after the first feature measurement is used for updating the estimate (as shown in figure (7)), and χ²_3 is the cumulative chi-square distribution function. This is then compared to the expected uniform distribution's cumulative distribution function. The simulated noise on the odometry was low on the left. The estimates for the QAF were all fairly close to the hypothetical values except for QAF 1, which was not a good fit. The iterative EKF was somewhat overconfident and the EKF was poor, with more than 85% of the trials badly overconfident (test statistic of 1.0). On the right the noise was relatively high. Here we see that the IEKF and EKF had only a few trials that were not out in the tail of the χ² distribution (test statistic of 1.0). This shows that these two estimates were hopelessly overconfident. The QAF had broken down as well, but rather gracefully. QAF 3 and 4 had about 25% in the tail. The fact that versions 3 and 4 were very similar shows that the random sampling search done in QAF 4 did not help to find a better minimum than the analytic iterative Gauss-Newton line search in the λ space done in both versions. QAF 1 and 2 were badly overconfident, but not quite as bad as the EKFs. This indicates that the iterative line search is needed when making large adjustments during the updates. We see that the random sampling of QAF 2 did help somewhat.

to modeling errors. On the other hand we note that the unmodeled errors lead to inconsistency for all the filters, including the QAF. It is not surprising that the estimator cannot be consistent when the model is inconsistent. The inconsistency just seemed to lead to divergence of the filter more often for the EKF and IEKF than for the QAF. The QAF is a new type of nonlinear recursive Bayesian estimator that is not much like the EKF and nothing like a PF. It has been shown to be superior to the EKF and IEKF when the nonlinearities become significant.

ACKNOWLEDGMENT

This work was supported by the SSF through its Centre for Autonomous Systems (CAS).

REFERENCES
[1] Ienkaran Arasaratnam, Simon Haykin, and Robert J. Elliott. Discrete-time nonlinear filtering algorithms using Gauss-Hermite quadrature. In Proc. of the IEEE, volume 95, pages 953-977, 2007.
[2] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174-188, February 2002.
[3] B. M. Bell and F. W. Cathey. The iterated Kalman filter update as a Gauss-Newton method. IEEE Transactions on Automatic Control, 38(2):294-297, February 1993.
[4] Y. Bresler and A. Macovski. Exact maximum likelihood parameter estimation of superimposed exponential signals in noise. IEEE Transactions on Acoustics, Speech and Signal Processing, 34(5):1081-1089, 1986.
[5] Frank Dellaert. Square Root SAM: Simultaneous location and mapping via square root information smoothing. In Robotics: Science and Systems, 2005.
[6] Hugh Durrant-Whyte, Somajyoti Majumder, Sebastian Thrun, Marc de Battista, and Steve Scheding. A Bayesian algorithm for simultaneous localisation and map building. Springer Tracts in Advanced Robotics, 6:49-60, 2003.
[7] J. Folkesson. The antiparticle filter - an adaptive nonlinear estimator. In Proc. of the 15th International Symposium on Robotics Research (ISRR 2011), volume 1, 2011 (to appear).

[8] J. Folkesson and H. I. Christensen. Graphical SLAM - a self-correcting map. In Proc. of the IEEE International Conference on Robotics and Automation (ICRA04), volume 1, 2004.
[9] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F, Radar and Signal Processing, volume 140, pages 107-113, 1993.
[10] K. Ito and K. Xiong. Gaussian filters for nonlinear filtering problems. IEEE Transactions on Automatic Control, 45(5):910-927, May 2000.
[11] L. Jetto, S. Longhi, and G. Venturini. Development and experimental validation of an adaptive extended Kalman filter for the localization of mobile robots. IEEE Transactions on Robotics and Automation, 15(2):219-229, April 1999.
[12] S. J. Julier and J. K. Uhlmann. Unscented filtering and nonlinear estimation. In Proc. of the IEEE, volume 92, pages 401-422, 2004.
[13] T. Lefebvre, H. Bruyninckx, and J. De Schutter. Kalman filters for nonlinear systems: a comparison of performance. Int. Journal of Control, 77(7):639-653, May 2004.
[14] Michael Montemerlo et al. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proc. of the Nat. Conf. on Artificial Intelligence (AAAI-02), Edmonton, Canada, 2002.
[15] Kenneth R. Muske and James B. Rawlings. Nonlinear moving horizon state estimation. Kluwer, 1995.
[16] C. V. Rao, J. B. Rawlings, and D. Q. Mayne. Constrained state estimation for nonlinear discrete-time systems: stability and moving horizon approximations. IEEE Transactions on Automatic Control, 48(2):246-258, February 2003.
[17] K. Reif, S. Gunther, E. Yaz, and R. Unbehauen. Stochastic stability of the discrete-time extended Kalman filter. IEEE Transactions on Automatic Control, 44(4):714-728, April 1999.
[18] K. Xiong, H. Zhang, and L. Liu. Adaptive robust extended Kalman filter for nonlinear stochastic systems. Control Theory and Applications, IET, 2(3):239-250, March 2008.
[19] K. Xiong, H. Y. Zhang, and C. W. Chan. Performance evaluation of UKF-based nonlinear filtering. Automatica, 42(2):261-270, February 2006.
[20] W. Zhang, B. S. Chen, and C. S. Tseng. Robust H-infinity filtering for nonlinear stochastic systems. IEEE Transactions on Signal Processing, 53(2):589-598, February 2005.


Qualitative localization using vision and odometry for path following in topo-metric maps
Stéphane Bazeille, Emmanuel Battesti and David Filliat
ENSTA ParisTech, Unité Électronique et Informatique, 32 Boulevard Victor, 75015 Paris, FRANCE.
firstname.lastname@ensta-paristech.fr

Abstract: We address the problem of navigation in topo-metric maps created by using odometry data and visual loop-closure detection. Based on our previous work [6], we present an optimized version of our loop-closure detection algorithm that makes it possible to create consistent topo-metric maps in real time while the robot is teleoperated. Using such a map, the proposed navigation algorithm performs qualitative localization using the same loop-closure detection framework and the odometry data. This qualitative position is used to support robot guidance to follow a predicted path in the topo-metric map, compensating the odometry drift. Compared to purely visual servoing approaches for similar tasks, our path-following algorithm is real-time, light (no more than two images per second are processed), and robust, as odometry is still available to navigate even if vision information is absent for a short time. The approach has been validated experimentally with a Pioneer P3DX robot in indoor environments, with embedded and remote computations.
Keywords: Path following, vision, robot odometry, topo-metric map, visual servoing.

I. INTRODUCTION

To navigate autonomously in a large environment, a robot often requires the ability to build a map and to localize itself using a process named Simultaneous Localization and Mapping (SLAM). The field of SLAM can be broadly divided into two approaches: topological and metrical. The most common approach is metrical SLAM, in which range sensors such as lasers or sonars are traditionally used. This mapping method is explicitly based on measured distances and positions. The localization is geometric and clearly corresponds to the real world. It can be done continuously and planned navigation is accurate. The main problem is that global geometric consistency is hard to ensure and the map is therefore hard to build. Moreover, the sensors are usually expensive and the computation heavy. Nowadays, these sensors are more and more often replaced by cameras, because they provide many advantages such as lower price, smaller size, lighter weight and lower energy consumption, and they give richer environmental information. Using these sensors it is possible to recover metric information, but a more direct way to map the environment is to use topological approaches, where the environment is modeled as a graph of discrete locations. These maps are easy to build, suitable for large environments and for human interactions. Their

main drawback is the lack of geometric and free-space information, which only allows localization and navigation close to previously mapped routes. In our previous work [6], we built topological maps using visual loop-closure detection and we used odometry data to enrich this topological map with metric information. The choice of this second sensor makes the mapping more accurate, reduces the computational cost compared to purely visual solutions and also makes the system more robust to vision failure.

Contribution: The main contribution of this paper is the development of a new robust and light path-following algorithm combining the use of these two cheap sensors (odometer and camera) that allows autonomous navigation in such previously learned topo-metric maps. The approach is qualitative and uses the feedback information given by the vision sensor to approximately correct the odometry drift in order to follow a path computed from the map. This path-following system can be used for delivery robots, security robots, or guide and follower robots, for example.

Content: In Section 2, we present a review of related work on topological navigation. In Section 3, we recall our previous work on topo-metric mapping and present new optimizations that have been brought to this system. In Section 4, we explain our new framework for path-following navigation using these topo-metrical maps. Finally, in Section 5, we show experimental results and we conclude in Section 6 with a discussion about this contribution and our future work.

II. RELATED WORK

Localization is a key issue for mobile robots in environments where a globally accurate positioning system, such as GPS, is not available. Today, the most used sensor to map an environment and to navigate autonomously in a map is definitely the laser sensor; combined with a SLAM framework, it builds up a map of an unknown environment while at the same time keeping track of the current location (see [26] for an overview of the metrical SLAM approaches). The position is accurate, and the map, displayed as a geometrical occupancy grid, allows the robot to explore its surroundings. In our work, we have addressed the problem of autonomous navigation while focusing on the visual sensor. As we make use of a topological map [5], [7], [25], we have no information about obstacles,


and about free space around the robot, which is why our navigation method has been limited to following paths that have already been taken by the robot. The traditional method for this kind of application is visual servoing, also known as vision-based robot control, which uses feedback information extracted from images to control the motion of the robot [20]. Those methods generally require camera calibration (homography, fundamental matrix, Jacobian, removal of lens distortion [4], [9], [21], [24], [8]). Also, some approaches make assumptions on the environment (artificial landmarks, vertical straight lines, parallel walls) or sometimes need more than one camera or cameras of different kinds (omnidirectional for example) [4], [19], [14], [12]. In our research context, we have been interested in the use of a perspective camera without calibration (indeed, our method also works with an omnidirectional camera [6]), and above all without any assumption on the environment. Such calibration-free methods have been developed by [10], [11]. They are based on image feature tracking, and use qualitative comparisons of images to control the motion. Such methods are very interesting, but they require real-time image processing at a high frame rate and are highly dependent on the quality of the image data. Tracking errors or a temporary absence of information quickly lead to system failure. Moreover, they need lighting constancy, so additional processing is generally added to ensure the desired behavior. To make our system more robust and accurate, and above all lighter from a computational point of view, we enable the use of one more cheap sensor: the odometer. The visual sensor provides rich information and an accurate positioning system, and the combined use of odometry makes the algorithm more robust and relieves the visual system from high frame rate computation. Odometry allows localization for a short time in the absence of visual information, on vision failure (dark or dazzling areas, blurry images, occlusions), or under important changes in the scene that has been learned (light, people). When embedded on small platforms, this makes it possible to remotely process images, by guiding the robot in case of network lag. As we are not too dependent on visual information, it is also possible to use visual localization information only when it is very reliable, avoiding giving position information that would be unsure. We therefore developed a robust visual localization system that completely bans false alarms, at the price of giving less localization information.

III. IMPROVED TOPO-METRIC MAPPING

In the following, we will call loop-closure the event where the robot detects a match between the current and the reference frame. This differs from the traditional loop-closure definition, in which the loop-closure is associated with the event where the robot revisits an area it has not been to for a while.

A. Summary of our previous work

In [6], we have developed a fully incremental topo-metric mapping framework. This algorithm builds in real-time topo-metric maps of an unknown environment, with a monocular or omnidirectional camera and the odometry gathered by

Fig. 1. Comparison of topo-metrical mapping and laser mapping. 1. Raw odometry 2. Corrected odometry applying graph relaxation taking into account the visual loop-closure (two loop-closure locations detected) 3. Ground truth trajectory (SLAM Laser). The three trajectories are shown in the frame of the reference laser map.

motor encoders (see Fig. 1). The system is based on an appearance-based loop-closure detection method that has been designed as a two-level decision system to ensure robust and accurate detection. A first step detects potential loop-closure locations when the robot comes back to a previously visited area, using appearance only. A second one verifies and selects the best potential location using image geometry. A Bayesian filter based on incremental bags of visual words [16] is used to extract potential loop-closure locations, that is to say, to find the previous positions that are potentially close to the current one. In the second step these locations are verified with a 2D motion computation in the image space (translation and rotation in the image plane) based on SIFT [22] keypoints, and we select the loop-closure which shows the smallest translation. In order to discard outliers, the 2D motion is computed using RANSAC, accepting the result only if the number of matching points is above a threshold.
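The sketch below illustrates this kind of geometric validation: a 2D image-plane motion (rotation, translation and scale) is estimated from two matched keypoints inside a RANSAC loop, and the loop closure is accepted only if enough matches agree. The parameter values, function names and the exact inlier test are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fit_similarity(p, q):
    """2D similarity (scale s, rotation theta, translation t) mapping p -> q,
    estimated from exactly two point correspondences (2x2 arrays)."""
    dp, dq = p[1] - p[0], q[1] - q[0]
    s = np.linalg.norm(dq) / np.linalg.norm(dp)
    theta = np.arctan2(dq[1], dq[0]) - np.arctan2(dp[1], dp[0])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    t = q[0] - s * R @ p[0]
    return s, theta, t

def ransac_2d_motion(pts_cur, pts_ref, iters=200, inlier_px=5.0, min_inliers=12):
    """Validate a candidate loop closure from matched keypoint positions.

    pts_cur, pts_ref : Nx2 arrays of matched keypoints (current image / map image)
    Returns (s, theta, t, n_inliers) of the best model, or None if too few inliers.
    """
    n = len(pts_cur)
    best = None
    rng = np.random.default_rng(0)
    for _ in range(iters):
        i, j = rng.choice(n, size=2, replace=False)
        s, theta, t = fit_similarity(pts_cur[[i, j]], pts_ref[[i, j]])
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        residuals = np.linalg.norm(pts_ref - (s * (pts_cur @ R.T) + t), axis=1)
        inliers = int(np.sum(residuals < inlier_px))
        if best is None or inliers > best[3]:
            best = (s, theta, t, inliers)
    # Accept the loop closure only if enough matches agree with the 2D motion.
    return best if best is not None and best[3] >= min_inliers else None
```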


With the inclusion of an odometry-based evolution model in the Bayesian filter, which improves accuracy, robustness and responsiveness, and the addition of a consistent metric position estimation applying an efficient optimization algorithm at each validated loop-closure [18], our system produces a map that corresponds to the real world and presents only limited local drift. This makes it usable for global localization and planned navigation. For the current work, the robot is teleoperated during an initial mapping phase and our algorithm is used to build a map usable later for navigation. The environment is divided into locations (defined by one or more images) that are linked by relative odometry vectors. The sampling of the environment is done each time the robot moves ahead by 50 cm or turns by 10 degrees. This mapping phase does not need any preprocessing, calibration, postprocessing or parameter adjustment, and it builds its map incrementally, adding a new location if no loop-closure has been found, or updating a location and correcting the graph if a loop-closure has been found (see Fig. 1).

B. Optimization of the loop-closure detection algorithm

Since we will use this framework for real-time path-following navigation, we brought some optimizations to the approach, notably to improve the performance of the visual localization module:

- We have improved the performance of the algorithm by replacing SIFT [22] keypoints by STAR [3] keypoints. This has greatly reduced the keypoint extraction time (by more than 20 times), but it has decreased the number of keypoints and their quality. We have compensated this quality loss by using a new validation strategy that is less restrictive on the number of extracted keypoints.

- We improved the accuracy of the prediction step of the Bayesian filter, which is used to extract potential loop-closure locations. Our first version was only using the probability at the previous time step to predict the new one. In the new version, the Bayesian filter takes into account several previous time steps and the evolution model is applied to the odometry displacements corresponding to these time steps. The predictions are finally merged using the max operator to give the final prediction (a sketch of this merging step is given after this list of optimizations). This step reduces the influence of the map discretization on the quality of the prediction and makes the extracted potential locations more accurate.

- We simplified the validation stage by modifying the geometric model of the image transform and by thresholding using all the parameters extracted from the 2D motion computation in the image space (translation, rotation and scale). The computation of a homography using four couples of matching points through RANSAC [17] has been replaced by a simpler computation of a 2D motion using two couples of points through RANSAC. The homography was already a simplified version of the real transform, but as we work on images with very close viewpoints when closing loops, it could be simplified again to speed up the computation.

TABLE I. COMPARED RESULTS OF OUR VISUAL LOOP-CLOSURE DETECTION SYSTEM (LCDS) BEFORE AND AFTER OPTIMIZATION.

                          Museum    Gostai    Lab (Fig. 1)
Images                    112       169       350
Distance (m)              38        82        98
LCD Truth                 14        25        9
Old LCDS [5]              13        18        7
  Missed LC               7%        12%       20%
  False LC                0%        5%        0%
  CPU Time                42s       70s       210s
  CPU Time/image          0.37s     0.41s     0.5s
New LCDS                  13        26        8
  Missed LC               7%        0%        22%
  False LC                0%        3.84%     11%
  CPU Time                2.16s     2.57s     5.56s
  CPU Time/image          0.019s    0.015s    0.016s

- A dedicated embedded version of the algorithm, for use directly on the robot, has been developed. The code has been fully rewritten in C++, suppressing many dependencies, the logging part and obsolete functionalities added during development. The incremental dictionary has been replaced by a generic static one generated from various indoor data sets [23] to increase processing speed.

- A new navigation mode (described in the next section) has been created to perform path-following navigation using the loop-closure detection framework. Compared to the mapping mode, the incremental part of the system that adds new words to the dictionary and new locations to the graph, and that relaxes the topo-metrical map, is disabled. It therefore enables qualitative visual localization in the topo-metric map.
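The following sketch illustrates the multi-time-step prediction merge mentioned in the second optimization of the list above. The representation of the location probabilities as arrays, the evolution-model callback and all names are assumptions made for illustration; the actual implementation is not described at this level of detail in the paper.

```python
import numpy as np

def predict_loop_closure_prior(past_posteriors, odometry_displacements, evolution_model):
    """Merge predictions made from several previous time steps with a max operator.

    past_posteriors        : list of 1D arrays, posterior over map locations at t-1, t-2, ...
    odometry_displacements : accumulated odometry from each of those time steps to now
    evolution_model        : function (posterior, displacement) -> predicted distribution,
                             e.g. spreading probability to neighbouring map locations
    """
    predictions = [evolution_model(p, d)
                   for p, d in zip(past_posteriors, odometry_displacements)]
    merged = np.max(np.stack(predictions), axis=0)   # max-merge of the predictions
    return merged / merged.sum()                     # renormalize to a distribution
```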

It is important to note that, during mapping, a loop-closure is only accepted and integrated in the map if the robot comes back very close to a previously visited location. A loop-closure is therefore accepted only if the two images show enough matching points, and if the computed rotation, translation and scale between them are below some thresholds [6]. We will see below that this definition has been relaxed for the navigation mode by disabling the translation threshold.

Table I presents some computation time and loop-closure comparison results using the old and the optimized version of the algorithm. The old LCDS includes the first version of our algorithm, to which we add the odometry and the relaxation. It uses SIFT features, the old odometry model, and the homography-based validation system. The new LCDS includes the fully optimized code version, the modified odometry model, the simplified validation system and the use of STAR features. The machine used for experimentation was an Intel Xeon 3 GHz, the image size 320x240, and the average speed of the robot 0.4 m/s. See [5] for a description of the different sequences and for more information about the old LCDS.
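As a concrete reading of the acceptance test just described, the sketch below thresholds the number of matching points and the estimated 2D motion parameters, and disables the translation threshold in navigation mode. All threshold values and names are placeholders, not the values used by the authors.

```python
def accept_loop_closure(n_matches, translation_px, rotation_rad, scale,
                        navigation_mode=False,
                        min_matches=20, max_translation_px=40.0,
                        max_rotation_rad=0.35, scale_bounds=(0.8, 1.25)):
    """Return True if a candidate loop closure passes the geometric checks.

    In mapping mode all thresholds apply; in navigation mode the translation
    threshold is disabled, so loop closures seen from a viewpoint farther from
    the mapped one can still be used for qualitative localization.
    """
    if n_matches < min_matches:
        return False
    if not (scale_bounds[0] <= scale <= scale_bounds[1]):
        return False
    if abs(rotation_rad) > max_rotation_rad:
        return False
    if not navigation_mode and abs(translation_px) > max_translation_px:
        return False
    return True
```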


C. Software overview

The system has been developed using Urbi [1], an open-source software platform to control robots. It includes a C++ component library called UObject to describe motors, sensors and algorithms. We also use urbiscript to glue the components together using embedded parallel and event-driven semantics. Figure 2 presents the whole architecture of our mapping, localization and path-following system. It is composed of five different components, including a viewer to supervise the robot behavior. Two components have been tested remotely, in particular our visual SLAM algorithm.

Fig. 2. Diagram of the developed system. Each box represents a UObject.

IV. QUALITATIVE NAVIGATION SYSTEM

The newly developed navigation mode is based on a qualitative position estimate that combines odometry with the visual information provided by loop-closure detection.

A. The navigation mode

The navigation mode of the algorithm presented in this paper requires a topo-metrical map, and the knowledge of the robot starting position in the map. A path to reach a goal from the starting position is computed as a list of nodes using Dijkstra's algorithm [13], taking into account the orientation of the robot in each node. In order to follow the computed path, the robot position is continuously computed using odometry and visual loop-closure detections. In this mode, loop-closure detection is less restrictive than in the mapping mode, as loop-closures are accepted whatever the translation between images is. This translation is used to estimate an approximate position which is used to guide the robot. This use of a less restrictive loop-closure validation makes it possible to benefit from many more position corrections than in the mapping mode, even if these corrections are less accurate. This limited precision is however not a problem, as only localization is performed and the map quality is therefore not impacted. This navigation mode requires that the trajectory is obstacle-free, because obstacle avoidance is not currently included in our model.

B. Qualitative localization using vision and odometry

The visual loop-closure detection framework verifies at each recorded image whether the robot is in an already visited location or not. When a loop-closure is detected, the simple matching between images does not allow estimating precisely the robot position relative to the image in the map, as the scale factor is unknown when computing the camera displacement. Moreover, for small displacements and particular environment configurations, there is an ambiguity because a lateral translation in the images can be caused either by a robot translation or by a rotation. For these reasons, we prefer to estimate a qualitative position, by assuming that the image movement is caused only by a rotation of the robot.
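A minimal sketch of this qualitative position update is given below. It assumes the rough pixel-to-angle conversion based only on image width and horizontal field of view that is detailed in the following paragraphs; the function signature, field names and default values are illustrative.

```python
import numpy as np

def qualitative_pose(last_pose, odom_since_lc, loop_closure=None,
                     image_width_px=320, hfov_rad=np.deg2rad(50)):
    """Estimate a qualitative robot pose in the topo-metric map.

    last_pose     : (x, y, heading) of the last corrected position estimate
    odom_since_lc : (dx, dy, dtheta) odometry accumulated since that estimate
    loop_closure  : None, or (node_x, node_y, node_heading, dx_pixels) of the matched node
    """
    if loop_closure is not None:
        node_x, node_y, node_heading, dx_pixels = loop_closure
        # Interpret the image x-translation as a pure rotation of the robot,
        # converted to an angle with a rough width/field-of-view calibration.
        heading_offset = dx_pixels * hfov_rad / image_width_px
        return node_x, node_y, node_heading + heading_offset
    # No loop closure: dead-reckon from the last corrected estimate using odometry.
    x, y, heading = last_pose
    dx, dy, dtheta = odom_since_lc
    return x + dx, y + dy, heading + dtheta
```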

Fig. 3. Illustration of the qualitative visual localization. A loop-closure is detected between image 9 (right) and image 45 (left) with 33 pixels of x-axis translation. The computed corresponding angle of robot rotation is 4.73 degrees.

Therefore, when a loop-closure is detected, the parameters extracted during the validation of potential loop-closure locations are used to estimate a qualitative direction. Among the three parameters (translation, rotation and scale), we only use the x-axis translation in pixels between the two matching images to compute the angle between the current robot direction and the direction recorded in the map. A rough camera calibration (only based on the image size and the camera field of view) makes it possible to convert this translation in pixels into an angle (Figure 3). The position of the robot is therefore computed as the position of the loop-closure node, but taking into account the deviation in direction. If no loop-closure is detected between the places corresponding to an image acquisition, the position is computed as the previous loop-closure location position to which the relative odometry recorded since that point in time is added. This makes it possible to produce a continuous position estimate which is corrected when a loop-closure is detected.

C. Servoing system for path following

To control the robot motion we have used the strategy proposed by [15]. In order to reach a goal in the topo-metric map, we first compute a sequence of nodes using Dijkstra's algorithm. The path linking this sequence of nodes is then discretized each centimeter to form the global path that


Fig. 4. Left: Example of vision-based path following replaying the trajectory used for mapping. The green and pink trajectories are the laser SLAM trajectories recorded during the mapping phase and the autonomous replay phase, respectively (approx. path length 30 m). The blue trajectory is the odometry recorded by the robot during the replay phase. Right: Topo-metric map of the environment created during the same experiment. Green and pink circles are the nodes of the topological map, displayed here without map relaxation as no loops were closed during the learning phase. The pink circles are the loop-closure places detected during the replay phase. The length of the line in the middle of each pink circle is proportional to the x-axis translation computed from the matching images.
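A minimal sketch of the qualitative position update described in Section IV-B (loop-closure node position corrected in heading, otherwise last loop-closure position plus accumulated odometry); the data structures and names below are hypothetical, not taken from the authors' implementation.

```python
import math

def qualitative_position(last_node_xy, last_node_heading, odom_since_node,
                         heading_deviation_deg=None):
    """Return an approximate (x, y, heading) estimate.

    last_node_xy / last_node_heading: pose stored in the map for the last
        loop-closing node; odom_since_node: (dx, dy, dtheta) accumulated in the
        node frame since that loop closure; heading_deviation_deg: angle derived
        from the x-axis image shift when a loop closure is available."""
    if heading_deviation_deg is not None:
        # Loop closure: snap to the node position, corrected in direction only.
        heading = last_node_heading + math.radians(heading_deviation_deg)
        return last_node_xy[0], last_node_xy[1], heading
    # No loop closure: dead-reckon from the last corrected pose.
    dx, dy, dtheta = odom_since_node
    c, s = math.cos(last_node_heading), math.sin(last_node_heading)
    x = last_node_xy[0] + c * dx - s * dy
    y = last_node_xy[1] + s * dx + c * dy
    return x, y, last_node_heading + dtheta
```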

target. A heading direction error between this point and the robot position is computed and used to estimate the rotation speed using a PID controller. As the robot translation speed is set to a constant, the servoing system adjusts the velocity of each wheel to correct the heading error and to follow the local path as closely as possible (see Fig. 5). While this guidance strategy is quite standard, it should be noted that the interplay between this strategy and the qualitative localization method has the effect of guiding the robot to actively close loops during movement. Indeed, without the qualitative localization, the robot would be guided by odometry only and the induced drift would lead the robot far from the map nodes, thus preventing visual loop-closure detection. With this strategy, each time the robot deviates from the predicted path, the qualitative position correction leads to a local path that guides the robot back onto the global path, thus enforcing future loop-closures and position corrections.

V. RESULTS AND DISCUSSION

To validate our method, experiments have been carried out in an indoor environment using a Pioneer P3DX mobile robot equipped with a Canon VC-C50i camera with a wide angle lens. During the presented experiments all the code was embedded on the robot except the viewing system. The image processing rate was one image every 50 cm or 10 degrees. To give an accurate idea of what the system is able to do, we ran the laser SLAM positioning system Karto [2] in parallel with our mapping and path following system. It gives a reference trajectory in a laser map during the learning and the path following phases. Figure 4 (left) shows an experiment where the trajectory used for mapping (in green) has been replayed using our system (trajectory in pink). The odometry recorded during the path following run (in blue) shows the drift that has been compensated by the visual localization system and that would otherwise inevitably have led to a wall collision. Figure 4 (right) illustrates the effect of our qualitative localization approach during the same experiment. The pink circles correspond to the locations where loop-closures have been detected during path following. The pink line in each circle is the translation computed between the loop-closing images that is used for the qualitative position estimation. We can see that our guidance framework leads to a high loop-closure detection rate (around 60% here) and that the path following behavior is very smooth, sometimes with 5 images in a row without direction correction. Figure 6 shows another experiment illustrating the purpose of Dijkstra's algorithm. The replayed trajectory (in pink), going from the first node to the last node of the map, avoids the large loop executed during map construction because a shortcut is available. This experiment also illustrates that it is possible to map and localize with different image sampling frequencies. Here, the map has been produced with images sampled every 5 cm, leading to a very precise map. Guidance has then been executed with images recorded every 40 cm, that is to say with lighter computation.
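A minimal sketch of the path-following servoing loop described above: a look-ahead point on the local path is chosen, the heading error is fed to a PID controller, and wheel velocities are derived from a constant translation speed. Gains, names and numeric values are hypothetical placeholders, not the authors' parameters.

```python
import math

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def wheel_speeds(pose, target, v_const, pid, dt, wheel_base=0.4):
    """pose = (x, y, theta); target = look-ahead point on the local path."""
    x, y, theta = pose
    desired = math.atan2(target[1] - y, target[0] - x)
    # Wrap the heading error to [-pi, pi] before feeding the controller.
    error = math.atan2(math.sin(desired - theta), math.cos(desired - theta))
    omega = pid.step(error, dt)                  # rotation speed command
    v_left = v_const - 0.5 * wheel_base * omega  # differential-drive wheel speeds
    v_right = v_const + 0.5 * wheel_base * omega
    return v_left, v_right

# Example: hypothetical gains and a 0.2 m/s constant translation speed.
controller = PID(kp=1.5, ki=0.0, kd=0.1)
print(wheel_speeds((0, 0, 0), (1.0, 0.2), v_const=0.2, pid=controller, dt=0.1))
```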

Fig. 5. Left: The robot follows the local path but deviates from the true trajectory because of the odometry drift. Middle: The visual loop-closure detection framework gives a qualitative localization of the robot in the graph, taking into account the deviation in direction. As a consequence, in the real world, the local path is modified and the robot corrects its trajectory in order to stay on the desired path. Right: The robot follows the local path and regains the true trajectory.

should be followed to reach the goal. This global path is computed only once. When an image is acquired, the position is updated using the visual information, and a local path to join or to follow the desired trajectory is computed between the current position and the global path. The local path is a line between the position and a point of the global path situated 40 cm in front of the robot, to which we append the global path after this point. Given the local path, each time the robot moves, the position is estimated as described above and the first point of the local path situated at more than 20 cm from the robot is selected as a


information about the occupancy of the free space around the robot. This will allow navigation with obstacle detection and avoidance, and also autonomous exploration.

REFERENCES
[1] http://www.urbiforge.org, [Online; accessed 20-April-2011]. [2] http://www.kartorobotics.com, [Online; accessed 20 April-2011]. [3] M. Agrawal, K. Konolige, and M. Blas, Censure: Center surround extremas for realtime feature detection and matching, in European Conference on Computer Vision, 2008, pp. 102 115. [4] S. Atiya and G. Hager, Real time vision based robot localization, IEEE Trans. Robot. Automat., vol. 9, pp. 785799, 1993. [5] S. Bazeille and D. Filliat, Combining odometry and visual loopclosure detection for consistent topo-metrical mapping, International journal on operations research (RAIRO), vol. 44, pp. 365377, 2010. [6] S. Bazeille and D. Filliat, Incremental topo-metric slam using vision and robot odometry, in Proceedings of the International Conference on Robotics and Automation (ICRA11), 2011. [7] J. Blanco, J. Gonzlez, and J.-A. Fernndez-Madrigal, Subjective local maps for hybrid metric-topological slam, Robotics and Autonomous Systems, vol. 57, pp. 6474, 2009. [8] O. Booij, B. Terwijn, Z. Zivkovic, and B. Kr se, Navigation using an o appearance based topological map, in IEEE International Conference on Robotics and Automation, 2007. [9] D. Burschka and G. Hager, Vision based control of mobile robots, in Proc. of the International Conference on Robotics and Automation, 2001. [10] Z. Chen and T. Bircheld, Qualitative vision-based mobile robot navigation, in Proc. IEEE Int. Conf. Robot. Autom., 2006. [11] , Qualitative vision-based path following, IEEE Transactions on Robotics, vol. 25, 2009. [12] J. Correa and A. Soto, Active visual perception for mobile robot localization, Journal of Intelligent and Robotic Systems, vol. 58, pp. 339354, 2010. [13] E. W. Dijkstra, A note on two problems in connexion with graphs, Numerische Mathematik, vol. 1, pp. 269271, 1959. [14] A. Diosi, A. Remazeilles, S. Segvic, and F. Chaumette, Outdoor visual path following experiments, in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2007. [15] M. Egerstedt, X. Hu, and A. Stotsky, Control of mobile platforms using a virtual vehicle approach, IEEE Transactions on Automatic Control, vol. 46, pp. 17771782, 2001. [16] D. Filliat, A visual bag of words method for interactive qualitative localization and mapping, in IEEE International Conference on Robotics and Automation, 2007. [17] M. A. Fischler and R. C. Bolles, Random sample consensus: A paradigm for model tting with applications to image analysis and automated cartography, Communications of the ACM, vol. 24, no. 6, pp. 381395, 1981. [18] G. Grisetti, C. Stachniss, S. Grzonka, and W. Burgard, A tree parameterization for efciently computing maximum likelihood maps using gradient descent, in Proceedings of Robotics: Science and Systems, Atlanta, GA, USA, June 2007. [19] J. Guerrero and C. Sagues, Uncalibrated vision based on lines for robot navigation, Mechatronics, vol. 11, pp. 759777, 2001. [20] S. A. Hutchinson, G. D. Hager, and P. I. Corke, A tutorial on visual servo control, IEEE Trans. Robot. Automat., vol. 12, pp. 654670, 1996. [21] B. Liang and N. Pears, Visual navigation using planar homographies, in Proc. of the International Conference on Pattern Recognition, 2002. [22] D. Lowe, Distinctive image feature from scale-invariant keypoint, International Journal of Computer Vision, vol. 60, no. 2, pp. 91110, 2004. [23] D. Nister and H. Stewenius, Scalable recognition with a vocabulary tree, in Accepted for oral presentation at CVPR 2006, 2006. [24] C. Sagues and J. 
Guerrero, Visual correction for mobile robot homing, Robotics and Autonomous Systems, vol. 50, pp. 4149, 2005. [25] S. Thrun, Learning metric-topological maps for indoor mobile robot navigation, Journal of Articial Intelligence, vol. 99, pp. 2171, 1998. [26] S. Thrun, W. Burgard, and D. Fox, Probabilistic robotics, MIT Press, Cambridge, MA,, vol. 99, pp. 2171, 2005.

Fig. 6. Another example of the path following system illustrating the use of Dijkstra's algorithm to replay a shortcut trajectory and the use of different sampling frequencies for mapping and navigation.
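For reference, a minimal sketch of the shortest-path computation over a topo-metric graph such as the one used here (Dijkstra's algorithm [13] over nodes connected by metric edge lengths); the graph structure and values below are hypothetical.

```python
import heapq

def dijkstra_path(graph, start, goal):
    """graph: dict node -> list of (neighbor, edge_length_m). Returns the list of
    nodes of the shortest path from start to goal, or None if unreachable."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, length in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + length, neighbor, path + [neighbor]))
    return None

# Hypothetical topo-metric map: the direct edge 0-3 is a shortcut as in Fig. 6.
graph = {0: [(1, 2.0), (3, 1.5)], 1: [(2, 2.0)], 2: [(3, 2.0)], 3: []}
print(dijkstra_path(graph, 0, 3))  # -> [0, 3]
```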

VI. CONCLUSION AND FUTURE WORK

In this paper, we have addressed the problem of navigation in topo-metric maps by using visual loop-closure detection. The presented algorithm uses the vision system and the odometric data for qualitative localization in the topo-metric map in order to guide the robot along a path already taken during a learning phase. The qualitative visual localization is computed and sent to a servoing system that compensates the odometry drift to ensure that the robot stays on the learned trajectory. This system can be seen as an active loop-closure detection framework, as we force loop closures through controlled guidance. The system only needs two sensors and very few computing resources to achieve the navigation task. It is real-time and does not need any precise camera calibration or parameter adjustment. With minor adaptations, it can use any kind of camera (omnidirectional, wide angle or directional) and is suitable for toy robots as it just needs cheap sensors and modest computational resources. As odometry is used to complement the visual information, the system is robust to lighting changes, moved furniture, people crossing, blurry images, and even temporary sensor occlusion or lag in the vision system response. From this point of view, it can also be used as a remote process to further lighten the on-board computational load. Future work will deal with localization after kidnapping by applying the loop-closure detection framework while the robot spins around. This will be used to retrieve the direction to follow a path after a kidnapping. We will also add sonar data to our system to obtain a map giving an


Stereo-Based Outdoor Localization using a Line Drawing Building Map


Keisuke Matsuo Jun Miura Junji Satake Department of Computer Science and Engineering Toyohashi University of Technology, Toyohashi, Japan

Abstract Line drawing maps are frequently used by people to exchange location information. Mobile robots in the future are expected to be able to communicate with people and thus to localize themselves using such maps. We therefore develop a method of outdoor localization using a stereo camera and a line drawing building map. Omnidirectional range data obtained by a stereo camera and a pan-tilt unit are projected onto a robot-centered 2D grid map, which is to be matched with an input line drawing map. Since various objects such as trees often obstruct buildings in usual outdoor scenes, we apply a view-based object classifier in order to extract only the stereo data from buildings. To cope with uncertainties in stereo data and building extraction, we adopt Monte Carlo localization. Experimental results show that our simple approach works reasonably well in an actual outdoor environment. Index Terms Outdoor navigation, Line-drawing map, Stereo, Particle filter.

I. INTRODUCTION

Localization is one of the fundamental functions of mobile robots. People often use a line drawing map and can navigate themselves where they have never been. Although only some of the objects, such as buildings, are drawn in the map, people can determine their position with respect to such objects by matching their knowledge of the map with observations of the environment. Since mobile robots in the future are expected to communicate with people in the way people do, localization using a line drawing map will be an important capability. As a first step towards this goal, this paper proposes an outdoor localization method using line drawing building maps and stereo. In recent times, vision is often used for SLAM and localization problems [13, 5, 4, 6], in which visual features are used as landmarks. Several view-based approaches have also been proposed [1, 11]. These works are basically learning-based, that is, the localization is performed using a learned map. This means that data of the environment must be collected before localization. Yun and Miura [15] proposed a localization method using a line drawing building map with uncertainty. Their method relies on line segment features constituting building boundaries, the directions of which are estimated using vanishing points, and stereo range features corresponding to building walls. Since many false matches exist between the features and a map, they adopt a multi-hypothesis approach, which is relatively complicated and costly. Leung, Clark, and Huissoon [8] proposed a localization method based on the matching between such line segments and the line features extracted from

aerial images in conjunction with a particle filter. Kümmerle et al. [7] developed a graph SLAM method based on a similar idea and an accurate 3D range sensor. Parsley and Julier [12] proposed a general framework for exploiting various prior information with uncertainty in SLAM. In usual outdoor scenes, buildings are sometimes good landmarks but often obstructed by various objects such as trees and bushes. Conventional SLAM methods using 2D range finders (e.g., [3, 2]) are therefore most probably difficult to apply. The above methods using prior information employ several heuristics to extract information from the input data (from a 3D laser scanner, for example) and maps (or aerial images) to be matched. In this paper, we develop a localization method using a stereo camera and a line drawing building map. We extract stereo range data from buildings by a view-based object classifier. The extracted data are matched with a part of the map, which is actually a set of visible building walls, in the Monte Carlo localization framework. We use omnidirectional stereo data acquired by a stereo camera on a pan-tilt head to obtain a wide field of view. We will show that our simple method works reasonably well in actual outdoor localization problems in spite of the low quality of stereo range data. The rest of the paper is organized as follows. Sec. II describes our robot and the acquisition of stereo data. Sec. III describes a mapping of stereo data to a local map with view-based building data extraction. Sec. IV explains the input line drawing map and the hidden lines removal for matching. Sec. V describes the procedure for Monte Carlo localization. Sec. VI shows experimental results validating our approach. Sec. VII discusses the current performance and possible extensions of the proposed method, and Sec. VIII summarizes the paper.

II. STEREO DATA ACQUISITION

A wider field of view is usually more effective for localization, especially in outdoor environments where features are scarcer and distances to objects are larger. We therefore configure a measurement unit which has a stereo camera (Point Grey Research Bumblebee XB3, 66 [deg.] horizontal field of view) mounted on a pan-tilt head (TRAC Labs Biclops PT) and acquire omnidirectional stereo data. We take pairs of stereo images at six panning positions while the robot is stopped. The tilt angle is set to 20 [deg.] looking up. The measurement unit is put on a mobile robot which is based on an electric wheelchair (PatraFour from Kanto Auto Works


Fig. 4. Examples of training images.
Fig. 1. The measurement unit and the robot: (a) measurement unit; (b) our mobile robot.
Fig. 5. Building extraction results for the images shown in Fig. 2.
Fig. 2. An example set of six input images.

Co.). Fig. 1 shows the measurement unit and the mobile robot. Fig. 2 shows an example set of six images, which collectively cover 360 [deg.], taken at one position. Fig. 3 shows a mapping of omnidirectional stereo data onto the floor of a relatively small room; stereo measurements are fairly accurate at small distances. For distant objects, however, range measurements have large errors, as shown later.

III. LOCAL MAPPING OF BUILDING DATA

A set of stereo data obtained at a robot position is transformed into a local grid map. An SVM (support vector machine)-based building classifier is used for extracting only the stereo data from buildings.

A. Extracting building data using an SVM-based classifier

We use the SVM-based classifier proposed by Miura and Yamamoto [11], which has been shown to be effective in view-based localization under various weather and seasonal conditions.

We divide an input image into 16×16 windows and classify each window using several image features. For the building classifier, we use the normalized color (r, g, b), an edge density, the peak value of the voting in the Hough transform, and the variance of edge directions [11]. This classifier is intended to extract building boundaries and windows where strong edge segments exist. We used 57 training images taken in various seasons and weather conditions. The windows in the images are manually labeled for training. Some of the training images are shown in Fig. 4. We use an SVM with an RBF kernel ($K(\mathbf{x}_1, \mathbf{x}_2) = \exp(-\gamma \|\mathbf{x}_1 - \mathbf{x}_2\|^2)$, $\gamma = 50$) for this classifier. Fig. 5 shows the building extraction results for the images shown in Fig. 2. Although some regions other than buildings are extracted, the overall result is reasonable because complete classification is not a necessary condition for Monte Carlo localization.

B. Making a local map

The map given to the robot is a 2D line drawing map. The obtained stereo data are therefore converted into a 2D local grid map for matching. The local map is robot-centered and the size of each cell is 0.1 [m] × 0.1 [m]. Each stereo data point, which is originally represented in the camera coordinates, is transformed into the robot local coordinates and then voted on the grid map. Each cell accumulates the votes. Finally, a Gaussian smoothing is applied to the local map to account for the discretization of the grid map. Stereo measurements have larger uncertainty for distant objects, and many models for this uncertainty have been proposed (e.g., [10]). In this mapping, however, we do not explicitly consider the positional error of each measured point. The reason is as follows. We model buildings by their 2D boundaries and each boundary is matched with the map in localization as described later. The distribution of actual

Fig. 3. An example indoor mapping using the measurement unit (axes: x [m], y [m]).
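A minimal sketch of the window-level building classification described above, using scikit-learn's SVC with an RBF kernel (gamma = 50 as in the text); the feature extraction and training data here are placeholders, not the features or images used by the authors.

```python
import numpy as np
from sklearn.svm import SVC

def window_features(window):
    """Placeholder feature vector for a 16x16 image window: normalized color
    plus simple gradient statistics standing in for edge density, Hough peak
    and edge-direction variance."""
    rgb = window.reshape(-1, 3).astype(float)
    chroma = rgb / (rgb.sum(axis=1, keepdims=True) + 1e-6)
    gray = rgb.mean(axis=1).reshape(16, 16)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)
    return np.hstack([chroma.mean(axis=0), mag.mean(), mag.max(), ang.var()])

# Hypothetical labeled windows (1 = building, 0 = other); in the paper these
# come from 57 manually labeled training images.
rng = np.random.default_rng(0)
windows = rng.integers(0, 256, size=(40, 16, 16, 3))
labels = rng.integers(0, 2, size=40)

X = np.array([window_features(w) for w in windows])
clf = SVC(kernel="rbf", gamma=50.0).fit(X, labels)
print(clf.predict(X[:5]))
```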


Fig. 6. Effect of building extraction on the local map: (a) 3D data points and (c) local grid map without building extraction; (b) 3D data points and (d) local grid map with building extraction (axes in meters).

measured points for a boundary is naturally larger for a more distant building, and their contribution to localization will be smaller without any adjustments.

C. Mapping results

Fig. 6(a) shows the 3D data points obtained from the images shown in Fig. 2. These data are projected and smoothed on a 2D plane to produce the local grid map shown in Fig. 6(c). The data from the large trees in front of the building on the right (see the second and the third image in Fig. 2) remain in the map. Fig. 6(b) and (d), on the other hand, show the 3D data points and the local grid map, respectively, after applying the building data extraction. These figures show that many of the data from the large trees mentioned above are deleted correctly, although the 2D positions of the data points are not very accurate.

IV. LINE DRAWING MAP AND HIDDEN LINES REMOVAL

A. Map representation

The input line drawing map is composed of lines, each of which is represented by a pair of 2D endpoints in the world coordinates. We developed a line drawing map editor for making, modifying, and saving the map data. The shapes of the buildings are manually measured from the floor and campus plans. The scale of the plans is 1:2500 and the error in the map is estimated to reach a few meters.

B. Hidden lines removal

The degree of matching between the input local map and the line drawing map is used for calculating the likelihood of a particle in Monte Carlo localization, as described later. Although the input map includes all building boundaries, only a part of them is visible from a given position. It is therefore necessary to remove the occluded boundaries from the position in

Fig. 7. Examples of hidden lines removal: (a) line drawing map; (b) hidden line removal example 1; (c) hidden line removal example 2. Circular markers in (b) and (c) indicate robot positions.
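A minimal sketch of the 2D hidden-line test of Sec. IV-B: a boundary segment is kept only if the ray from the viewpoint to its midpoint is not blocked by any other segment. This is a simplified visibility check (sampling each segment at its midpoint), not the authors' exact procedure of clipping segments to their visible portions.

```python
def segments_intersect(p1, p2, q1, q2):
    """Proper intersection test between segments p1-p2 and q1-q2."""
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def visible_segments(viewpoint, segments):
    """Keep the segments whose midpoint is visible from the viewpoint
    (a coarse stand-in for clipping each segment to its visible portion)."""
    visible = []
    for i, (a, b) in enumerate(segments):
        mid = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
        blocked = any(segments_intersect(viewpoint, mid, c, d)
                      for j, (c, d) in enumerate(segments) if j != i)
        if not blocked:
            visible.append((a, b))
    return visible

# Hypothetical map: the second wall hides the first one from the origin.
walls = [((4, -1), (4, 1)), ((2, -2), (2, 2))]
print(visible_segments((0.0, 0.0), walls))   # only the nearer wall remains
```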

order to increase the reliability of matching. Since the input map has only 2D information, we consider only 2D visibility and thus perform a hidden lines removal. Once a robot position is set in the world coordinates, the visible part of each line segment can be calculated by analyzing the relationship between the robot position (i.e., the viewpoint), that line segment, and the other potentially occluding segments. Note that since we use omnidirectional stereo data, the orientation of the robot does not matter in this hidden lines removal. Fig. 7 shows examples of hidden lines removal for two different robot positions.

V. LOCALIZATION

A. Monte Carlo localization

Monte Carlo localization (MCL) [14] is a powerful tool for localizing a robot under uncertain prior knowledge and sensor data, and has been successfully applied to various localization problems. The state vector is represented by a 2D robot pose in the world coordinates, that is:

$$\mathbf{x} = (x_w, y_w, \theta_w)^T \qquad (1)$$

The algorithm of MCL is as follows:
1) Initialization. Particles are distributed around the starting position.
2) Repeat the following for each pair of movement and measurement:


a) Prediction step. Each particle is moved by the odometry reading with a probabilistic error sampled according to the odometry error estimate. The odometry error is estimated according to the motion model described below (see Sec. V-B).
b) Correction step. The likelihood of each particle is calculated as the degree of matching. We compare the local grid map with the part of the line drawing map composed of only the visible lines to estimate the likelihood (see Sec. V-C).
c) Resampling step. Resampling is performed by drawing with replacement, using the normalized weights as probabilities for the particles.

B. Motion model

Fig. 8. Route for the experiments.

We calculate a new pose $(x_w^t, y_w^t, \theta_w^t)$ from the previous pose $(x_w^{t-1}, y_w^{t-1}, \theta_w^{t-1})$ and odometry data $(\Delta x_t, \Delta y_t, \Delta\theta_t)$ as follows:

$$\begin{pmatrix} x_w^t \\ y_w^t \\ \theta_w^t \end{pmatrix} = \begin{pmatrix} x_w^{t-1} \\ y_w^{t-1} \\ \theta_w^{t-1} \end{pmatrix} + \begin{pmatrix} \cos\theta_w^{t-1} & -\sin\theta_w^{t-1} & 0 \\ \sin\theta_w^{t-1} & \cos\theta_w^{t-1} & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \Delta x_t \\ \Delta y_t \\ \Delta\theta_t \end{pmatrix} + \begin{pmatrix} n_x \\ n_y \\ n_\theta \end{pmatrix} \qquad (2)$$

where $(n_x, n_y, n_\theta)$ is the process noise. Currently, each element of the noise estimate is independent of the others and constant; the standard deviations of $n_x$, $n_y$, and $n_\theta$ for a movement of 1 [m] are set to 0.05 [m], 0.05 [m], and 1.0 [deg.], respectively.

C. Likelihood calculation

The likelihood of each particle is given by the degree of matching between the local grid map and the input line drawing map with hidden lines removal. Using the position of a particle, the lines visible from that point are calculated and mapped onto the local map using the pose of the particle. The values in the local map are first summed up where the mapped lines exist and then divided by the number of pixels on the lines to calculate the averaged value, which is used as the degree of matching.

Another approach to calculating the likelihood might be to extract line features from the local map and to match them with the line drawing map. We do not, however, take this approach because a reliable segmentation of measured points into boundaries and an accurate line fitting seem difficult due to large stereo measurement errors.

VI. EXPERIMENTAL RESULTS

A. Experimental environment

Fig. 8 shows the experimental site on our campus and the two routes used. The buildings on both sides of the routes are from two to eight stories high, although that information is not written in the line drawing map. We moved the robot along the routes using a Wiimote and took data for off-line localization. A rough estimate of the initial position is given to the robot, but the orientation is not.

B. Localization results

Figs. 9 and 10 show the results at several positions on the red and the orange route in Fig. 8, respectively. Several factors, in general, affect the degree of convergence of the particles. One is the distance to the surrounding buildings, because the uncertainty in stereo range data becomes larger for more distant objects. Another factor is the existence of objects obstructing building views. Although our building extraction works reasonably well, it is not complete, and uneliminated stereo data degrade the localization quality.

In Fig. 9, the particles are well converged because the robot moves near the building on the left (see the red route in Fig. 8). On the other hand, in Fig. 10, the particles diverge more at many places because the route is in the middle of the buildings and is near big trees (see the orange route in Fig. 8). Although the degree of convergence changes from place to place, the overall localization ability is reasonable considering the uncertainty in stereo data and building extraction.

We then performed localization using raw stereo data without building extraction (see Fig. 6(c)). Fig. 11 shows the results. Compared to those in Fig. 10, the localization accuracy is much worse in places with trees (steps 5, 10, and 20). These results show the effectiveness of building extraction in tree-populated areas.

VII. DISCUSSION

A. Localization accuracy

The necessary accuracy in localization depends on the objective of each mobile robot and its task. If the robot has to reach a specific goal only by a navigation method like the one presented in this paper, the accuracy should be, say, around 50 [cm]. This is a hard task with only stereo range data. Such a high accuracy is not, however, necessarily required when we view the navigation task hierarchically.

Let us consider the following scenario. The task of the robot is to reach the entrance of some building. Then, a global-level navigation takes the robot to a place where it can observe the entrance, and a local-level navigation guides the robot there using visual feedback. Similar scenarios can be applied to navigation in urban areas, where signboards and logos of shops are used for the final, precise localization. In such a scenario, the navigation presented in this paper corresponds to the global-level one, and the required accuracy is not very high; the required accuracy is just to assure that the robot


can observe the entrance after the global-level navigation. Our proposed approach will suffice for this purpose. Also note that even if a global-level navigation provides very accurate positioning, it still needs to be combined with some local scene recognition methods such as road and obstacle detection. We are now planning to combine the localization method presented in this paper with our multisensory road detection method [9].

B. Map uncertainty

At present, line drawing maps are assumed to be almost complete except that objects other than buildings are not drawn. The current localization method relies on this completeness. On the other hand, people can navigate themselves using hand-drawn maps which include many errors, omissions, and sometimes large deformations. In [16], we defined four key uncertainties usually included in hand-drawn maps: dimension, position, shape, and existence uncertainty. As long as the amount of uncertainty is small enough, the current method will probably work thanks to the MCL approach, but it will not for larger uncertainties, especially for existence uncertainty (i.e., the omission of many buildings in the map). We sometimes use maps with drastic omissions of buildings, but in that case we usually add landmark information such as prominent buildings. It is therefore necessary to improve the method to utilize such additional information in localization.

C. Data acquisition cost and localization accuracy

Taking omnidirectional stereo data is certainly the most costly part of the current system (about six seconds at present), although it provides fairly accurate stereo data thanks to a well-calibrated stereo camera. Motion stereo on omnidirectional images could be used for obtaining point clouds, although it may degrade the position data. It is also possible to adopt various SLAM or visual odometry algorithms to obtain reliable position data, but it is still important to extract only the information written in the map, as done by the view-based classifier in this paper.

VIII. SUMMARY

This paper describes an outdoor localization method using a stereo camera and a line drawing building map. We extract stereo data from buildings using an SVM-based classifier to be matched with the visible building boundaries on the map. The Monte Carlo localization approach provides reasonably good localization performance in spite of incomplete building extraction and low quality stereo range data. Several future research directions have also been discussed.

ACKNOWLEDGMENT

This work is supported in part by Grant-in-Aid for Scientific Research (No. 21300075) from JSPS.

REFERENCES
[1] M. Cummins and P. Newman. FAB-MAP: Probabilistic Localization and Mapping in the Space of Appearance. Int. J. of Robotics Research, Vol. 27, No. 6, pp. 647665, 2008. [2] G. Grisetti, C. Stachniss, and W. Burgard. Improving Grid-based SLAM with Rao-Blackwellized Particle Filters by Adaptive Proposals and Selective Resampling. In Proceedings of 2005 IEEE Int. Conf. on Robotics and Automation, pp. 24322437, 2005. [3] J.E. Guivant and E.M. Nebot. Optimization of the Simultaneous Localization and Map-Building Algorithm for Real-Time Implementation. IEEE Trans. on Robotics and Automation, Vol. 17, No. 3, pp. 242257, 2001. [4] K. Irie, T. Yoshida, and M. Tomono. Mobile Robot Localization Using Stereo Vision in Outdoor Environments under Various Illumination Conditions. In Proceedings of 2010 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 51755181, 2010. [5] J. Klippenstein and H. Zhang. Quantitative Evaluation of Feature Extractors for Visual SLAM. In 4th Canadian Conf. on Computer and Robot Vision (CRV 07), pp. 157164, 2007. [6] M. Kronfeld, C. Weiss, and A. Zell. Swarm-supported Outdoor Localization with Sparse Visual Data. Robotics and Autonomous Systems, Vol. 50, No. 2, pp. 166173, 2010. [7] R. K mmerle, B. Steder, C. Dornhege, A. Kleiner, G. Grisetti, and u W. Burgard. Large Scale Graph-Based SLAM using Aerial Images as Prior Information. Autonomous Robots, Vol. 30, No. 1, pp. 2539, 2011. [8] K.Y.K. Leung, C.M. Clark, and J.P. Huissoon. Localization in Urban Environments by Matching Ground Level Video Images with an Aerial Image. In Proceedings of 2008 IEEE Int. Conf. on Robotics and Automation, pp. 551556, 2008. [9] Y. Matsushita and J. Miura. On-Line Road Boundary Modeling with Multiple Sensory Features, Flexible Road Model, and Particle Filter. Robotics and Autonomous Systems, Vol. 59, No. 5, pp. 274284, 2011. [10] J. Miura and Y. Shirai. Vision and Motion Planning for a Mobile Robot under Uncertainty. Int. J. of Robotics Research, Vol. 16, No. 6, pp. 806825, 1997. [11] J. Miura and K. Yamamoto. Robust View Matching-Based Markov Localization in Outdoor Environments. In Proceedings of 2008 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 29702976, 2008. [12] M.P. Parsley and S.J. Julier. Towards the Exploitation of Prior Information in SLAM. In Proceedings of 2010 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pp. 29912996, 2010. [13] F.T. Ramos, J. Nieto, and H.F. Durrant-Whyte. Recognising and Modeling Landmarks to Close Loops in Outdoor SLAM. In Proceedings of 2007 IEEE Int. Conf. on Robotics and Automation, pp. 20362041, 2007. [14] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT Press, 2005. [15] J. Yun and J. Miura. Multi-Hypothesis Localization with a Rough Map using Multiple Visual Features for Outdoor Navigation. Advanced Robotics, Vol. 21, No. 11, pp. 12811304, 2007. [16] J. Yun and J. Miura. A Quantitative Navigability Measure of Rough Maps. J. of Robotics and Mechatronics, Vol. 21, No. 1, pp. 95103, 2009.


Fig. 9. Localization result for the red route in Fig. 8 (particle distributions at steps 0, 5, 10, 15, 20, 25, and 30).
Fig. 10. Localization result for the orange route in Fig. 8 (steps 0, 5, 10, 15, 20, 25, and 29).
Fig. 11. Localization result for the orange route in Fig. 8 without building extraction (steps 0, 5, 10, 15, 20, 25, and 29).


Adaptive Extended Kalman Filter for Indoor/Outdoor Localization using a 802.15.4a Wireless Network
A. Benini A. Mancini E. Frontoni P. Zingaretti S. Longhi
Dipartimento di Ingegneria Informatica, Gestionale e dell'Automazione (DIIGA), Università Politecnica delle Marche, Ancona, Italy
Abstract Indoor and outdoor localization of mobile robots using wireless technologies is very attractive in many applications such as cooperative robotics. Wireless networks can be successfully used not only for communication among heterogeneous vehicles (e.g., ground, aerial) but also for localization. This paper introduces an approach for the indoor/outdoor localization of a mobile robot using IEEE 802.15.4a devices with a ranging capability based on Symmetrical Double-Sided Two Way Ranging (SDS-TWR). This technique tries to overcome the limitations of the classical Received Signal Strength Indication (RSSI) approach (e.g., Wi-Fi mapping), which does not ensure good performance, especially in structured environments. The set of these devices allows the creation of a Wireless Sensor Network (WSN) that is suitable for cooperative tasks where the data link is fundamental to share data and support relative localization. In this paper an Adaptive Extended Kalman Filter (EKF) is introduced as a possible technique to improve localization in both outdoor and indoor environments, also in real time thanks to its reduced computational complexity. The proposed approach allows the bias of the ranging data to be modelled, considering also faults in the measurements. The obtained results highlight the necessity of further ranging data to obtain centimetric accuracy and precision of localization, which is currently about 1 m. Index Terms Localization, Ranging, IEEE 802.15.4a, WSN

I. INTRODUCTION

Indoor and outdoor localization of mobile robots using wireless technologies is very attractive in many applications. Wireless networks can be successfully used not only for communication between devices but also for localization. IEEE 802.15.4a specifies two additional physical layers (PHYs): a PHY using ultra-wideband (UWB) and a PHY using chirp spread spectrum (CSS), both with a precision time-based ranging capability [1], [2], [3], [4], [5], [6]. The UWB PHY operates in three frequency bands: below 1 GHz, between 3 and 5 GHz, and between 6 and 10 GHz. The UWB physical layer channels have large bandwidth and provide high ranging accuracy, up to 1 meter. The main advantages of UWB technology are high data transfer rates, low power consumption, high spatial capacity of wireless data transmission and sophisticated usage of radio frequencies. UWB technology is based on sending and receiving carrierless radio impulses using extremely accurate timing. UWB can be used, for example, in applications where high bandwidth signals must be transmitted between devices. The CSS PHY operates in the 2.4 GHz ISM band. The chirp solution does not support ranging, but the first

802.15.4a CSS chip (nanoLOC) developed by Nanotron has ranging as an additional (proprietary) function. It offers a unique solution for devices moving at high speeds because of its immunity to the Doppler effect, and it provides communication at longer ranges. The main problem of time-based range measurement consists in multipath and non-line-of-sight (NLOS) measurements. Since time-based ranging measurements need line-of-sight (LOS) between the sensors, indoor and outdoor measurements can be affected by a bias due to the presence of obstacles, metallic surfaces and/or electromagnetic fields. This paper studies the indoor/outdoor localization of a mobile robot using the nanoLOC WSN system in conjunction with an Extended Kalman Filter (EKF) in order to improve the position accuracy, considering also the bias. The nanoLOC Development Kit is a complete, easy to use set of tools for evaluating, prototyping and developing applications based on Nanotron's nanoLOC TRX Transceiver. nanoLOC is a highly integrated mixed signal chip utilizing Nanotron's unique Chirp Spread Spectrum (CSS) technology. It provides built-in ranging in the globally available 2.4 GHz ISM band [7]. A review of existing ranging techniques is provided in [8]. The most important methodologies for ranging calculation are listed below:
- Angle of Arrival (AoA)
- Time of Arrival (ToA)
- Time Difference of Arrival (TDoA)
- Received Signal Strength Indication (RSSI)
- Symmetrical Double Sided Two Way Ranging (SDS-TWR)
The Angle of Arrival, or AoA, is a method for determining the direction of propagation of an RF signal received from a tag at an anchor. Using directional antennas, an estimate of the direction of the tag can be obtained. The Angle of Arrival is determined by measuring the angle between a line that runs from the anchor to the tag and a line from the reader in a predefined direction, for instance towards a known point. When the signal to noise ratio is relatively high, the angle of arrival of the strongest signal can be estimated with very simple methods. The differences in the arrival times of the wideband signal received by spatially separated sensors can be estimated using the polarity coincidence correlation. These time differences (i.e., time delays) determine the angle of arrival. In [9] the effects of quantization of the time delays are studied. It is found that this simple


method gives comparable performance to the conventional, direct correlation based methods in the case of a relatively high signal to noise ratio. The Time of Arrival, or ToA, is a method based on the measurement of the propagation delay of the radio signal between a transmitter (tag) and one or more receivers (anchors). The accuracy of measuring the distance using this method is sensitive to the bandwidth of the system and the multipath condition between the wireless terminal and the access point. In general, as the bandwidth increases beyond a certain value, the measured ToA error is expected to approach zero. However, under undetected direct path (UDP) conditions, the system exhibits substantially high distance measurement errors that cannot be eliminated by increasing the bandwidth of the system. In [10], an analysis of the behaviour of super-resolution and traditional ToA estimation algorithms in line-of-sight (LOS), non-LOS, and UDP conditions in indoor areas is presented. The Time Difference of Arrival (TDOA) method is slightly different from ToA. Systems using the TDOA method measure the difference in transmission time between the signals received from each transmitter at a tag. The RSSI (Received Signal Strength Indication) method allows the localization of a device based on the strength of the signals sent from the anchors to the device to be localized [11]. To determine the location of the tag by means of this technique, at least three anchors are needed. In a location system based on RSSI, the distance between the tag and one of the anchors is based on the difference between the power of the signal transmitted by the anchor and the power received by the tag. To be effective, RSSI requires a dense deployment of access points, which considerably increases the cost of the system. However, the key problem of RSSI-based systems is that an adequate underlying path-loss model must be found for both non-line-of-sight and non-stationary environments. Consequently, in practice, the estimated distances are somewhat unreliable. In [12], the effects of different frequencies on RSSI measurements and a method for increasing the distance measurement accuracy are presented. The operating principle of the SDS-TWR technique is similar to the principle on which the ToF is based: it avoids the need to synchronize the clocks of the nodes used in the ranging measurements. SDS-TWR bases its operation on two delays that naturally characterize the transmission of signals over the air:
- the propagation delay of the signal between two devices;
- the time needed for signal processing inside the devices.
This technique is called Symmetrical Double-Sided Two Way Ranging because:
- it is symmetric: the measurement from the tag T to the anchor A is a mirror-image of the measurement made from anchor A to the tag T (TAT to ATA);
- it is double-sided: only two devices (tag and anchor) are needed for ranging;
- it is two-way ranging: SDS-TWR works by sending and receiving ACK packets and information packets.

During the SDS-TWR measurements, one node (the tag) sends a packet to a second node (the anchor) and starts counting time. The time measurement on the second node runs from the reception of the packet until the packet is propagated back to the first node. When the first node receives the acknowledgement from the second one, it stops the time counting and stores the value in its memory. The acknowledgement sent back to the first node includes in its header two delay values: the signal propagation delay and the processing delay. After this, the entire process is repeated from the second node to the first. At the end of this procedure, two range values are determined, and an average of the two can be used to achieve a fairly accurate distance measurement between the two nodes (Fig. 1). SDS-TWR does not require any kind of time synchronization between the transmitter and the receiver; instead it only depends on the time-of-flight between the two devices in order to estimate the range. As reported in [13], the SDS-TWR technology demonstrates a superiority over other localization methods in that it removes the time synchronization between devices, which is a quite demanding requirement.
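A minimal sketch of how a range can be derived from the SDS-TWR round-trip and processing times described above (distance ≈ c · time-of-flight, with the two symmetric exchanges averaged); the timing values are hypothetical and the formula is a textbook simplification, not Nanotron's exact implementation.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def sds_twr_distance(t_round1, t_reply1, t_round2, t_reply2):
    """Estimate the tag-anchor distance from two symmetric two-way exchanges.

    t_round*: round-trip time measured by the initiating node;
    t_reply*: processing (reply) delay reported by the responding node.
    The single-trip time of flight is the average over the two exchanges."""
    tof = ((t_round1 - t_reply1) + (t_round2 - t_reply2)) / 4.0
    return SPEED_OF_LIGHT * tof

# Hypothetical timings (seconds): ~66.7 ns of round-trip flight -> about 10 m.
d = sds_twr_distance(t_round1=66.7e-9 + 1e-3, t_reply1=1e-3,
                     t_round2=66.7e-9 + 1e-3, t_reply2=1e-3)
print(f"estimated range: {d:.2f} m")
```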

Fig. 1. Symmetric Double Sided Two Way Ranging technique.

II. THE NANOLOC LOCALIZATION SYSTEM

Nanotron Technologies has developed a WSN system that allows the development of RTLS applications. The distance between two wireless nodes is determined using the Symmetric Double Sided Two Way Ranging (SDS-TWR)


technique. Both the wireless communication technology and the SDS-TWR technology are integrated into a single chip, the nanoLOC TRX Transceiver [14]. The transceiver operates in the 2.4 GHz ISM band and supports not only location-aware applications but also robust wireless communication. Wireless communication is based on Nanotron's Chirp Spread Spectrum (CSS) according to the IEEE 802.15.4a wireless standard. Data rates are selectable from 125 kbit/s to 2 Mbit/s.

III. THE EXTENDED KALMAN FILTER

The Kalman Filter is suitable for the estimation of the x and y coordinates of a mobile device (tag) on the basis of ranging measurements made between the tag and at least three known points (anchors). Let us denote with $(a_{x,i}, a_{y,i})$ $(i = 1, \ldots, n)$ the x and y coordinates of the anchors and with $T = (t_x, t_y)^T$ the unknown tag coordinates. The distance between an anchor and the tag is calculated in the following way:

$$d_i = \sqrt{(t_x - a_{x,i})^2 + (t_y - a_{y,i})^2} \qquad (1)$$

The tag position can be obtained by trilateration as follows:

$$H \begin{pmatrix} t_x \\ t_y \end{pmatrix} = z \qquad (2)$$

where

$$H = \begin{pmatrix} 2a_{x,1} - 2a_{x,2} & 2a_{y,1} - 2a_{y,2} \\ \vdots & \vdots \\ 2a_{x,1} - 2a_{x,n} & 2a_{y,1} - 2a_{y,n} \end{pmatrix} \qquad (3)$$

and

$$z = \begin{pmatrix} d_2^2 - d_1^2 + a_{x,1}^2 - a_{x,2}^2 + a_{y,1}^2 - a_{y,2}^2 \\ \vdots \\ d_n^2 - d_1^2 + a_{x,1}^2 - a_{x,n}^2 + a_{y,1}^2 - a_{y,n}^2 \end{pmatrix} \qquad (4)$$

An estimate of $T$ can be obtained using the method of least squares:

$$\begin{pmatrix} t_x \\ t_y \end{pmatrix} = (H^T H)^{-1} H^T z \qquad (5)$$

In the extended Kalman filter (EKF), the state transition and observation models need not be linear functions of the state but may instead be differentiable functions:

$$\hat{x}_{k+1} = f(\hat{x}_k, u_k, w_k), \qquad \hat{y}_{k+1} = h(\hat{x}_{k+1}, v_{k+1}) \qquad (6)$$

where $\hat{x}_{k+1}$ and $\hat{y}_{k+1}$ denote respectively the approximated a priori state and observation, and $\hat{x}_k$ the a posteriori estimate of the previous step. The state vector contains the predicted tag coordinates and heading, as expressed in the following equation:

$$x_k = (t_{k_x}, t_{k_y}, \theta_k)^T \qquad (7)$$

Referring to the state estimation, the process is characterized by the statistical variables $w_k$ and $v_k$, which represent respectively the process noise and the measurement noise. $w_k$ and $v_k$ are supposed to be independent, white and normally distributed with given covariance matrices $Q_k$ and $R_k$. The observation vector $y_k$ represents the ranging measurements made between tag and anchors, and defines the input of the filter. Because in the analyzed system the predictor equation contains a linear relationship, the process function $f$ can be expressed as a linear function:

$$x_{k+1} = A x_k + B u_k + w_k \qquad (8)$$

where the matrices $A$ and $B$ are defined as follows:

$$A = \begin{pmatrix} 1 & 0 & -u_t T \sin\theta_k \\ 0 & 1 & u_t T \cos\theta_k \\ 0 & 0 & 1 \end{pmatrix}, \qquad B = \begin{pmatrix} T\cos\theta_k & 0 \\ T\sin\theta_k & 0 \\ 0 & T \end{pmatrix} \qquad (9)$$

and $T$ is the sample time. The input control vector contains the linear ($u_t$) and angular ($u_a$) speed of the robot:

$$u_k = \begin{pmatrix} u_t \\ u_a \end{pmatrix} \qquad (10)$$

The equation $y_{k+1} = h(\hat{x}_{k+1}, v_{k+1})$ is calculated in the following way:

$$\begin{pmatrix} r_1 \\ \vdots \\ r_n \end{pmatrix} = \begin{pmatrix} \sqrt{(\hat{t}_x - a_{x,1})^2 + (\hat{t}_y - a_{y,1})^2} \\ \vdots \\ \sqrt{(\hat{t}_x - a_{x,n})^2 + (\hat{t}_y - a_{y,n})^2} \end{pmatrix} + v_{k+1} \qquad (11)$$

and the related Jacobian matrix $C_{k+1} = \frac{\partial h}{\partial x}(\hat{x}_k, 0)$ is given as:

$$C_{k+1} = \begin{pmatrix} \frac{\partial r_1}{\partial t_x} & \frac{\partial r_1}{\partial t_y} \\ \vdots & \vdots \\ \frac{\partial r_n}{\partial t_x} & \frac{\partial r_n}{\partial t_y} \end{pmatrix} \qquad (12)$$

where

$$\frac{\partial r_i}{\partial t_x} = \frac{t_x - a_{x,i}}{\sqrt{(t_x - a_{x,i})^2 + (t_y - a_{y,i})^2}}, \qquad \frac{\partial r_i}{\partial t_y} = \frac{t_y - a_{y,i}}{\sqrt{(t_x - a_{x,i})^2 + (t_y - a_{y,i})^2}} \qquad (13)$$

A. Detection of faulty range measurements

The measures provided by the Nanotron sensors are accompanied by flags that indicate whether the measure was made correctly or not. Specifically, if the ranging $r_i$ is done correctly, the corresponding flag variable is set to 15. Otherwise, the flag variable assumes a different value depending on the type of failure that occurred. If a fault on a measurement is detected, the corresponding element of the measurement covariance matrix $R$ is increased:

$$R = \begin{pmatrix} \sigma_{r,1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{r,2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{r,n}^2 \end{pmatrix} \qquad (14)$$


where $\bar{\sigma}_{r,i}^2$ can be identified by experiments. In [15], [16], after a series of tests, the authors set $\bar{\sigma}_{r,1}^2 = 0.1328\,\mathrm{m}^2$. Depending on the presence of a fault, the $i$-th element of the $R$ matrix is set in the following way:

$$\sigma_{r,i}^2 = \begin{cases} \alpha\,\bar{\sigma}_{r,i}^2 & \text{with fault} \\ \bar{\sigma}_{r,i}^2 & \text{without fault} \end{cases} \qquad (15)$$

where $\alpha$ is chosen by experiments to give a good localization performance; in the following tests, $\alpha$ is set to $10^5$.

B. The Extended Kalman Filter with bias modelling

The model discussed in the previous paragraphs is based on the assumption that the measured distances contain only noise. However, range measurements made with the CSS sensor not only have noise but also a non-zero average error. This non-zero average error can be treated as a bias. In this case, the measured distance at time $k$, referring to anchor $i$, can be modelled more accurately in the following way:

$$\bar{y}_{k_i} = d_{k_i} + v_{k_i} + b_{k_i} \qquad (16)$$

where $\bar{y}_{k_i}$ denotes the biased measure and $b_{k_i}$ the bias related to anchor $i$ at time $k$. The bias vector is difficult to manage because it is not possible to measure it. A strategy could be to estimate the bias vector and subtract it from the measured distance vector [17]. But in most cases this is not possible because the bias depends on several factors such as multipath or the presence of moving obstacles. To model the bias change, we introduce a scaling factor $s_k$ which satisfies the following relationship:

$$d_{k_i} + b_{k_i} \approx (1 + s_k)\,d_{k_i} \qquad (17)$$

By using (17), we can derive a new formulation of the measurement model:

$$\bar{y}_{k_i} = (1 + s_k)\,d_{k_i} + v_{k_i} \qquad (18)$$

We assume that the bias hardly changes during an iteration; thus, the scaling factor is expressed as $s_{k+1} = s_k$ and the process noise of the scaling factor as $w_{r_k}$. Now we can extend the EKF by adding a new state variable describing the bias that affects the range measurements:

$$\tilde{x}_{k+1} = f(\tilde{x}_k, \tilde{u}_k, \tilde{w}_k), \qquad \tilde{y}_{k+1} = h(\tilde{x}_{k+1}, v_{k+1}) \qquad (19)$$

where

$$\tilde{x}_k = \begin{pmatrix} x_k \\ s_k \end{pmatrix} \quad (20), \qquad \tilde{u}_k = \begin{pmatrix} u_k \\ 0 \end{pmatrix} \quad (21), \qquad \tilde{w}_k = \begin{pmatrix} w_k \\ w_{r_k} \end{pmatrix} \quad (22)$$

IV. EXPERIMENTAL RESULTS

In this section the localization experiments in two different environments are presented. The first environment is critical due to the presence of walls with metallic surfaces that make the use of GPS impossible. The second test allows the performance to be evaluated in terms of accuracy and precision, considering the GPS trace as ground truth.

A. Configuration of the experiments

The architecture of the mobile robotic platform (see Fig. 2) is aligned with a typical outdoor SLAM mission. The selected mobile robotic platform is the Pioneer P3-AT produced by MobileRobots Inc. This platform is suitable for outdoor environments and can hold a payload of 12 kg. The instrumentation payload consists of: a Differential Global Positioning System (DGPS) Topcon GR3 receiver; an Inertial Sensor (Microstrain 3DM-GX1); a Nanotron tag. The DGPS receiver is coupled with a master station (same kind of receiver) to increase the accuracy and the precision of the positioning data; the antenna is mounted at 0.5 m above the ground to avoid dangerous reflections and/or signal attenuation due to the presence of obstacles. In kinematic mode the obtained accuracy and precision was < 2 cm. A Real Time Operating System (RTOS) provides access to the robot by means of a set of software tools (e.g., ARIA and Carmen) included in the Robobuntu Linux distro [18], [19].

Fig. 2. Robotic platform.

B. Experimental results in Environment no. 1

In this subsection the localization results in an indoor environment are presented. The reason for this choice is to evaluate the performance in the presence of multipath.


We assigned the anchors the following 2-D Cartesian coordinates:

$$A_1 = [11.90,\ 0.15]^T, \quad A_2 = [21.00,\ 2.30]^T, \quad A_3 = [30.90,\ 0.15]^T, \quad A_4 = [47.54,\ 2.30]^T \qquad (23)$$

The unit of these coordinates is the meter. The following figure (Fig. 3) shows the indoor environment (environment no. 1):

In Test no. 1 the true position of the robot at the beginning of the test is $[1.70,\ 1.25]^T$ m, while in Test no. 2 it is $[2.75,\ 2.25]^T$ m. The obtained results (see Figures 4 and 5) show that the filter yields better localization thanks to the estimation of the bias, whose amplitude is about 1 m. The filter is also able to track the position of the robot under high uncertainty in the initial position (green line in Fig. 5) by adjusting the error covariance matrix P at time 0. In this case the mean error is approximately 1 m.

C. Experimental results in Environment no. 2

In this second series of tests we chose an outdoor environment without obstacles. The reason for this choice is to evaluate the performance in an open space. The 2-D Cartesian coordinates of the anchors have the following values:

$$A_1 = [0,\ 0]^T, \quad A_2 = [17.1,\ 36.5]^T, \quad A_3 = [12.0,\ 50.3]^T, \quad A_4 = [29.4,\ 13.8]^T \qquad (24)$$
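As an illustration of the linearized trilateration in Eqs. (2)–(5), the sketch below solves the least-squares system for a tag position from four anchors and four measured ranges; the anchor layout mirrors (24), while the tag position and range values are hypothetical, not measured data from the paper.

```python
import numpy as np

def trilaterate(anchors, distances):
    """Least-squares tag position from anchor coordinates and measured ranges,
    following the linearization H [tx, ty]^T = z of Eqs. (2)-(5)."""
    a = np.asarray(anchors, dtype=float)
    d = np.asarray(distances, dtype=float)
    # Rows i = 2..n, differenced against anchor 1 (index 0).
    H = 2.0 * (a[0] - a[1:])                        # shape (n-1, 2)
    z = (d[1:] ** 2 - d[0] ** 2
         + a[0, 0] ** 2 - a[1:, 0] ** 2
         + a[0, 1] ** 2 - a[1:, 1] ** 2)
    t, *_ = np.linalg.lstsq(H, z, rcond=None)        # (H^T H)^-1 H^T z
    return t

anchors = [(0.0, 0.0), (17.1, 36.5), (12.0, 50.3), (29.4, 13.8)]
true_tag = np.array([10.0, 20.0])
ranges = [np.hypot(ax - true_tag[0], ay - true_tag[1]) for ax, ay in anchors]
print(trilaterate(anchors, ranges))   # close to [10, 20] for noise-free ranges
```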

Fig. 3. Map of environment no. 1.

The maximum speed of the robot is set to 0.8 m/s. While the mobile robot moves along the path, the CSS tag node attached to the robot measures the ranges to the four anchors. First the tag measures the distance from anchor 1 (A1), then the distance from anchor 2 (A2), and so on. Because the time required to obtain the four ranging measurements is short (0.05 s) compared to the speed of the robot, we assume that the four distances are measured at the same time.

The unit of these coordinates is the meter.
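The four simultaneous ranges collected at each step feed the EKF correction; a minimal sketch of such a measurement update with the fault-dependent measurement covariance of Eqs. (14)–(15) is given below. The state is reduced to the tag position for brevity, and all matrices, flags and numeric values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

SIGMA_R2 = 0.1328   # nominal range variance [m^2], as reported in [15], [16]
ALPHA = 1e5         # inflation factor for flagged (faulty) ranges

def ekf_range_update(x, P, anchors, ranges, flags):
    """One EKF correction with range measurements to n anchors.

    x: prior state [tx, ty]; P: prior covariance; flags[i] == 15 means the
    i-th range was measured correctly (Nanotron convention in the text)."""
    a = np.asarray(anchors, dtype=float)
    dx = x[0] - a[:, 0]
    dy = x[1] - a[:, 1]
    pred = np.hypot(dx, dy)                       # h(x), Eq. (11)
    C = np.column_stack([dx / pred, dy / pred])   # Jacobian, Eqs. (12)-(13)
    r_var = np.where(np.asarray(flags) == 15, SIGMA_R2, ALPHA * SIGMA_R2)
    R = np.diag(r_var)                            # Eqs. (14)-(15)
    S = C @ P @ C.T + R
    K = P @ C.T @ np.linalg.inv(S)                # Kalman gain
    x_new = x + K @ (np.asarray(ranges) - pred)
    P_new = (np.eye(2) - K @ C) @ P
    return x_new, P_new

# Hypothetical step: four anchors, one flagged-out measurement.
anchors = [(0.0, 0.0), (17.1, 36.5), (12.0, 50.3), (29.4, 13.8)]
x0, P0 = np.array([9.0, 19.0]), np.eye(2) * 4.0
ranges = [22.4, 18.2, 30.4, 20.3]
print(ekf_range_update(x0, P0, anchors, ranges, flags=[15, 15, 3, 15]))
```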

Fig. 4. Test no. 1 in environment no. 1.

Fig. 6. Map of environment no. 2.

Fig. 5. Test no. 2 in environment no. 1.

In Fig. 7 the obtained results are presented. The green dashed lines represent the estimate of the path obtained using the odometric information from the on-board encoders integrated with a MEMS gyro and the Nanotron sensors. The gyro is necessary to estimate the rotation speed of the robot due to the high friction of the wheels with the rough terrain. The blue lines represent the ground truth, i.e., the true path calculated with the GPS with differential corrections (DGPS). The EKF allows the localization error to be reduced compared with standard dead reckoning. The EKF is also able to manage the loss of an anchor due to a temporary or continuous fault (e.g., anchor too far,


Fig. 7. Test no. 1 in environment no. 2.

low battery, high occlusion). Considering the total network traffic, 80% of the packets contain useful ranging data. The remaining 20% contain invalid data (e.g., out of range) that are managed by the EKF.

V. CONCLUSION AND FUTURE WORKS

In this paper an Extended Kalman filter for indoor/outdoor localization using an 802.15.4a wireless network was presented, considering also the bias. The paper presents two different approaches, both based on the Kalman framework, to estimate the position of the robot using ranging data from the wireless network. The algorithms are able to manage measurement faults, isolating the corrupted data without shutting down the localization. The obtained results reinforce the necessity of integrating additional sensors to obtain better results in terms of accuracy and precision. In the case of indoor environments, the integration of a laser range finder or a camera (with a SIFT/SURF extractor) could significantly improve the overall performance. Future work will aim to extend the set of sensors by integrating visual information based on SIFT/SURF [20] and to evaluate this technology also for small-size unmanned aerial systems such as quad-rotors, which are suitable for indoor environments.
ACKNOWLEDGMENTS

This work was developed in the context of the ARTEMIS-JU EU Project R3-COP. In particular the authors would like to thank Mauro Montanari and Riccardo Minutolo (Thales Italia S.p.A.) for their support.

REFERENCES


List of Authors
/ Symbols /
Åstrand, Björn . . . 245
Łabęcki, Przemysław . . . 195
/ A /
Ahmad, Aamir . . . 37
Algan, Hatice Esra . . . 109
Aly, Amir . . . 101
Amigoni, Francesco . . . 25
Andrade-Cetto, Juan . . . 115, 139
Andreasson, Henrik . . . 19
Angeli, Adrien . . . 77
Arends, Marc . . . 153
Asta, Shahriar . . . 259
Aufrère, Romuald . . . 233
Aydemir, Alper . . . 13
/ B /
Bachiller, Pilar . . . 219
Basilico, Nicola . . . 25
Battesti, Emmanuel . . . 303
Bazeille, Stéphane . . . 303
Bedkowski, Janusz . . . 133
Behnke, Sven . . . 25, 177
Beinhofer, Maximilian . . . 55
Belter, Dominik . . . 195
Benini, Alessandro . . . 315
Bisgaard, Morten . . . 253
Bouguerra, Abdelbaki . . . 207
Burgard, Wolfram . . . 49, 55
Bustos, Pablo . . . 219
/ C /
Cabrita, Gonçalo . . . 165, 183
Cadena, Cesar . . . 69
Capezio, Francesco . . . 61
Carrera, Gerardo . . . 77
Chambers, Andrew . . . 43
Chapuis, Roland . . . 233
Checchin, Paul . . . 233
Cielniak, Grzegorz . . . 213
Colonius, Immo . . . 7
Cupec, Robert . . . 127
/ D /
Davison, Andrew J. . . . 77
Dayoub, Feras . . . 213
De Bernardis, Caleb . . . 291
Dube, Daniel . . . 145
Duckett, Tom . . . 213
Dylla, Frank . . . 7
/ E /
Einhorn, Erik . . . 121
/ F /
Féraud, Thomas . . . 233
Faigl, Jan . . . 171, 201
Fernández-Madrigal, Juan-Antonio . . . 1
Feyzabadi, Seyedshams . . . 271
Filko, Damir . . . 127
Filliat, David . . . 303
Filzhuth, Markus . . . 121
Firl, Jonas . . . 89
Folkesson, John . . . 297
Freksa, Christian . . . 7
Frese, Udo . . . 265
Frommberger, Lutz . . . 7
Frontoni, Emanuele . . . 239, 315
/ G /
Göbelbecker, Moritz . . . 13
Galindo, Cipriano . . . 1
Garcia-Ruiz, Francisco . . . 253
González, Javier . . . 1
Gross, Horst-Michael . . . 95, 121
Gutiérrez, Marco A. . . . 219
/ H /
Häselich, Marcel . . . 153
Hansen, Karl D. . . . 253
Haverinen, Janne . . . 283
Hedenberg, Klas . . . 245
Holz, Dirk . . . 25, 177
/ J /
Jensfelt, Patric . . . 13, 159
Jurić-Kavelj, Srećko . . . 31
/ K /
Kaess, Michael . . . 69
Karami, Abir B. . . . 83
Kazmi, Wajahat . . . 253
Kemppainen, Anssi . . . 283
Kessler, Jens . . . 95
Kitanov, Andreja . . . 277
Kitt, Bernd . . . 43, 189
Komma, Philippe . . . 145
Korrapati, Hemanth . . . 227
Kose-Bagci, Hatice . . . 109
Krajník, Tomáš . . . 201
Kreutzmann, Arne . . . 7
/ L /
la Cour-Harbo, Anders . . . 253
Lang, Dagmar . . . 153
Lategahn, Henning . . . 43
Lau, Boris . . . 49
Lauer, Martin . . . 189
Leonard, John J. . . . 69
Lilienthal, Achim J. . . . 19
Lima, Pedro . . . 37
Longhi, Sauro . . . 239, 315
Lopez, Eduardo . . . 291
Louloudi, Athanasia . . . 19
/ M /
Müller, Jörg . . . 55
Mancini, Adriano . . . 239, 315
Mandel, Christian . . . 265
Manso, Luis J. . . . 219
Marković, Ivan . . . 31
Marques, Lino . . . 165, 183
Martinet, Philippe . . . 227
Martinez-Marin, Tomas . . . 291
Masselli, Andreas . . . 145
Mastrogiovanni, Fulvio . . . 61
Matsuo, Keisuke . . . 309
May, Stefan . . . 271
McDonald, John . . . 69
Mezouar, Youcef . . . 227
Miura, Jun . . . 309
Mouaddib, Abdel-Illah . . . 83
/ N /
Nüchter, Andreas . . . 271
Núñez, Pedro . . . 219
Neira, José . . . 69
Nyarko, Emmanuel K. . . . 127
/ O /
Ortega, Agustin . . . 115
Osório, Luís . . . 165
/ P /
Palm, Rainer . . . 207
Paulus, Dietrich . . . 153
Petrović, Ivan . . . 31, 277
Pronobis, Andrzej . . . 13, 159
Přeučil, Libor . . . 171, 201
/ Q /
Qiu, Deyuan . . . 271
/ R /
Röning, Juha . . . 283
Raspa, Paolo . . . 239
Rehder, Jörn . . . 43
/ S /
Saffiotti, Alessandro . . . 1
Sariel-Talay, Sanem . . . 259
Satake, Junji . . . 309
Scalmato, Antonello . . . 61
Schönbein, Miriam . . . 43, 189
Scheidig, Andrea . . . 95
Scherer, Sebastian . . . 145
Schröter, Christof . . . 121
Sgorbissa, Antonio . . . 61
Singh, Sanjiv . . . 43
Sjöö, Kristoffer . . . 13
Skrzypczyński, Piotr . . . 195
Sousa, Pedro . . . 183
Sprunk, Christoph . . . 49
Stückler, Jörg . . . 177
Steffens, Ricarda . . . 177
Stoyanov, Todor . . . 19
/ T /
Tapus, Adriana . . . 101
Teniente, Ernesto H. . . . 139
Tran, Quan . . . 89
/ V /
Vallivaara, Ilari . . . 283
Vernazza, Paolo . . . 61
Vernazza, Tullio . . . 61
Vonásek, Vojtěch . . . 171, 201
/ W /
Wolter, Diedrich . . . 7
/ Y /
Yorganci, Rabia . . . 109
/ Z /
Zaccaria, Renato . . . 61
Zell, Andreas . . . 145
Zingaretti, Primo . . . 239, 315
