
Multi-modal People Detection from a Mobile Robot in Crowded Scenes

Alhayat Ali Mekonnen

Faculty of Sciences and Techniques, University of Bourgogne
Department of Computer Architecture and Technology, University of Girona
School of Engineering and Physical Sciences, Heriot-Watt University

Master's Thesis Carried Out At

LAAS - CNRS
7, avenue du Colonel Roche 31077 - Toulouse Cedex 04

Supervised By: Dr. Frédéric Lerasle

A Thesis Submitted for the Degree of MSc Erasmus Mundus in Vision and Robotics (VIBOT) 2010

Abstract

Automated person detection has a wide range of applications: human-robot (H/R) interaction, robot navigation in the presence of humans, pedestrian detection for Automated Driver Assistance Systems (ADAS), video surveillance, and content based image and video processing. At the same time, it is one of the most challenging problems in computer vision due to the large variation of person appearances and to sensor limitations. Through the years, researchers have used various sensors to automatically detect persons. Recently, approaches that use multiple sensors cooperatively have gained much attention as they benefit from the positive merits and different abilities of the individual sensors. Towards this end, this thesis presents a multi-modal person detector for crowded scenes on a mobile robot that uses a 2D SICK Laser Range Finder and a visual camera. A sequential approach is proposed in which the laser data is segmented to filter human-leg-like structures and generate person hypotheses, which are further refined by a state of the art parts based visual person detector for the final detection. The integration of the implemented multi-modal person detector in our robotic platform and the associated experiments are presented. The results obtained from all the tests carried out are clearly reported, showing that the multi-modal person detector outperforms its single sensor counterparts when detection, subsequent use, computation time, and precision are taken into account.

Research is what I'm doing when I don't know what I'm doing.

Wernher von Braun

Contents

Acknowledgments

1 Introduction
  1.1 Background
  1.2 Overview of the problem
  1.3 Objective of the thesis
  1.4 Robotic platform and software environment
  1.5 Specific tasks and investigations
  1.6 Document organization

2 State of the Art and Framework
  2.1 Sensors
    2.1.1 Passive Sensors
    2.1.2 Active Sensors
  2.2 Candidate Generation
    2.2.1 Sliding Window
    2.2.2 Flat World Assumption
    2.2.3 Stereo Methods
    2.2.4 Multi-sensor Systems
  2.3 Pertinent Features
    2.3.1 Haar-like Features
    2.3.2 Edge Orientation Histograms (EOH)
    2.3.3 Histograms of Oriented Gradients (HOG)
    2.3.4 Shape Context
  2.4 Classification
    2.4.1 Chamfer Matching
    2.4.2 AdaBoost (Adaptive Boosting)
    2.4.3 Support Vector Machines (SVMs)
  2.5 Person Detection Approaches
    2.5.1 Single Sensor Based Person Detection
    2.5.2 Multi-modal Person Detection
  2.6 Tracking
  2.7 Discussion and Summary

3 Detector Implementation
  3.1 Person Detection from 2D Range Data
    3.1.1 Sensor Description
    3.1.2 Detector Algorithm
  3.2 Visual Person Detection
    3.2.1 Feature Set
    3.2.2 Model
    3.2.3 Detection Algorithm
    3.2.4 Training
    3.2.5 Computation Time
  3.3 Our Multi-modal Person Detector
    3.3.1 Framework
    3.3.2 Description
  3.4 Summary

4 Robotic Integration and Associated Experiments
  4.1 Robotic Integration
  4.2 Test Set and Evaluation Criteria
    4.2.1 Test Set
    4.2.2 Evaluation Criteria
  4.3 Evaluation Description
  4.4 Discussion and Summary

5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work

Bibliography

List of Figures

1.1 Rackham in the Lab at LAAS-CNRS.
2.1 General person detection flow.
2.2 Images taken of a pedestrian with a) a visible spectrum camera and b) a thermal infrared camera (from [30]).
2.3 Position of the five radar sensors onboard a vehicle for pedestrian detection (from [49]).
2.4 Robot with multiple sensors. The sensor readings correspond to two persons standing in front of the robot: b) shows the image taken by the fisheye omnidirectional camera, c) the corresponding laser scan, and d) the sonar sensor reading (from [39]).
2.5 Eight directive RF antennas to detect range and azimuth angle of a passive RFID tag, along with their position on the robot (from [26]).
2.6 Images illustrating the sliding window and flat world assumption approaches (from [27]).
2.7 Haar-like feature templates.
2.8 HOG feature extraction steps (from [13]).
2.9 Linear classifiers. a) possible hyperplanes to separate the two classes; b) the optimal hyperplane that separates the two classes while maximizing the margin.
3.1 The 2D SICK LMS200 Laser Range Finder along with its position and orientation on Rackham.
3.2 2D laser scans illustrating captured leg patterns. The leg patterns are shown in the white rectangular bounding boxes.
3.3 Figures illustrating the different geometrical properties used for discriminating leg structures in a 2D laser scan.
3.4 Flow chart illustrating the algorithm used to detect persons from a 2D laser scan.
3.5 Sequence of figures illustrating person detection from 2D laser scans. The red circle represents the robot. Notice that the second leg of the person detected and denoted by a purple circle was occluded by his first leg; hence a virtual leg is placed along the line of sight of the robot and the detected leg, away from the robot.
3.6 Feature computation performed by [18]. A 4x27 2D feature (columns represent orientations, rows the different normalization factors) is reduced to a 31-dimensional vector by 27 sums over the different normalizations, one for each contrast-sensitive and contrast-insensitive orientation channel, and 4 sums over the 9 contrast-insensitive orientations, one for each normalization factor. The arrows represent the summation over the values.
3.7 Two person models making up a complete person mixture model (from [18]).
3.8 Illustration of person detection based on trained parts-based models (from [18]).
3.9 Visual person detections outlining the detected parts along with the complete bounding box. The detection in a) corresponds to the parts model of 3.7a while b) is for 3.7b.
3.10 Block diagram of the proposed multi-modal person detection system.
3.11 Illustration of the multi-modal person detection system with output images placed on the corresponding blocks of the system. The thin white lines depict the camera field of view. The generated hypotheses are shown as white circles on the 2D laser map, while the projected rectangles are shown as thin green rectangular windows on the image plane. The final detected persons are outlined with bounding boxes of varying colors: aqua, red, and blue.
4.1 Rackham's software architecture.
4.2 Sample image from each dataset used for detector evaluation: a-d from set I, e-h from set II, i-l from set III.
4.3 Sample images used for validating the four visual detectors.
4.4 Sample laser-only detections. The upper images show the scene as seen by the robot camera looking forward, and the bottom images show the corresponding laser leg detections. The arc shows the laser scan field and the shaded region corresponds to the camera field of view.
4.5 Sample image showing human-robot positions and the corresponding person detections based on laser and vision. The green boxes on the video image in b) show the hypotheses generated from the laser scans, and the detections with their corresponding laser detection are shown in various colors. On the laser scan map, a small white circle denotes a detected leg and a green circle a detected person. The actual scan data is shown in red while blobs are shown as thin blue circles. The gray shaded region corresponds to the camera field of view.
4.6 Some person detections of the multi-modal person detector. Green boxes denote hypotheses generated using the information from the laser scanner and red boxes are confirmed detections.
4.7 Detection mistakes made by the multi-modal person detector. Green boxes denote hypotheses generated using the information from the laser scanner and red boxes are confirmed detections.

List of Tables

3.1 Main characteristics of the SICK LMS200 2D Laser Range Finder on Rackham.
3.2 Set threshold values of the 2D laser based person detector.
3.3 Threshold values of the laser based person hypothesis generator.
4.1 Summary of person detection results on 142 images containing 388 person instances.
4.2 Summary of person detection results of the 2D laser range only detection on test sets I-III.
4.3 Summary of person detection results for the multi-modal person detector on test sets I-III.

Acknowledgments
This master program has been the best experience I have had so far, and I want to take this opportunity to thank the VIBOT team, especially Fabrice, Alice, Valerie, David, Yvan, Robert, and Julia, for the cooperation, support, and attention they have given me. Next, my deepest gratitude goes to my advisor, Frédéric Lerasle, who believed in me and took me on as his intern to do this research in one of the best robotics labs, LAAS-CNRS. This master's thesis would not have the form it has now were it not for his great support and constructive advice. In conjunction, my thanks go to Thierry Germa for answering my emails without delay and helping me with technical details. I also want to express my appreciation to all the staff of LAAS-CNRS for making my stay bearable, and to the interns at LAAS-CNRS, especially Didier, Lucie, Arthur, Jérémy, and Jorge, for their cooperation with the experiments and for walking in front of Rackham without complaints. A special thanks goes to Aaron Montoya for being a friend and keeping me company for the entire period of this thesis. I also want to take this opportunity to thank all my VIBOT classmates for showing me the world on a small scale and teaching me quite a lot. On a personal note, my heartfelt appreciation goes to my parents, Ali and Mebrat, for being there for me throughout my life, in every aspect of life, and for raising me to be who I am today. I would also like to express my gratitude to my sister and brothers, Amira, Ousman, and Abdu, for their support. And last but not least, my sincere gratitude goes to my wife, Tizita, for being understanding, caring, encouraging, and everything that I had dreamed of having.

June 1, 2010


Chapter 1

Introduction
Robots have been used in the production of goods for a long time. They have relieved mankind of many tasks that require high precision, diligence, and endurance. Yet there is even more demand to use robots in everyday life, a demand for their introduction into everyday human environments. For this, robots should be able to interact with humans at a higher level, with more natural and effective interaction. To bring about this new generation of robots, a lot of research is being carried out on human-robot interaction, including human detection, motion planning, scene reconstruction, intelligent behavior through task planning, etc. At the core of all these is safe interaction, which becomes a great concern as robots come out of industry, where the workspace at any moment is shared by the robot only, and interact with humans in a shared workspace. Hence, the first and foremost task should be the perception of the whereabouts of the humans sharing the workspace.

1.1 Background

The Laboratory for Analysis and Architecture of Systems (LAAS-CNRS) is a research unit of the French National Center for Scientific Research carrying out active research in the areas of Robotics and Artificial Intelligence (RIA), among others. Within the RIA research department at LAAS-CNRS, different research groups focus on the study and design of autonomous robots integrating perception, action, and reasoning capabilities. These groups carry out their research with the aim of having such robots interact rationally with a variable and dynamic environment to perform a wide variety of tasks. The group in which the work presented here has been carried out is the Robotic Action and Perception (RAP) research group, a group working in the RIA area with emphasis on the integration of advanced functions, functional and decisional autonomy of robotic systems, and physical robotic task execution in a dynamic environment. One of the achievements of the research groups in the RIA department is the design and deployment of an autonomous interactive tour guide robot in the Cité de l'Espace¹ [9, 10]. The deployed robot, Rackham (presented in section 1.4), can address humans with an animated avatar and voice synthesis; in turn, visitors may ask it to guide them to specific places of the museum using menus on a tactile screen, and it does so. This is a classic instance of robot use in public, everyday human environments. Another work these research groups are involved in along this track is the CommRob European Project, which primarily aims to advance the state of the art in high-level communication with and among robots [29] (www.commrob.eu). In doing so, it addresses detection and avoidance of dynamic obstacles, self-localization based on landmarks, learning of a topological map, and autonomous navigation based on topological and metrical information, tackling more complex environments for robots. To achieve this, a prototype robot trolley is being built to carry goods, guide a user in a complex structured and dynamic environment, and act as a walking aid in crowded scenes [29]. The work presented in this thesis addresses multi-modal person detection from a mobile robot in crowded scenes. It is aimed at improving the performance of Rackham as an assistant robot in the presence of crowds and is also planned to be incorporated into the CommRob European Project.

¹ The Cité de l'Espace is a space adventure park in Toulouse, France (http://www.cite-espace.com).

1.2 Overview of the problem

Automated person detection finds its applications in many areas, including human-robot interaction, surveillance, pedestrian protection systems, and automated image and video indexing. In the robotics context, detecting the whereabouts of humans is the first requirement if robots are to be highly useful while sharing a workspace with humans. Especially in human-robot interaction, detecting the whereabouts of persons is the fundamental task for avoiding collisions and being able to assist persons. But person detection is by far one of the most challenging tasks due to:

- Physical variation of persons. Person appearance can vary greatly. People do not share a common body size, color, or texture, and their appearance is highly influenced by the clothing they wear.

- Body deformations. For a detection system that depends highly on the shape of a person, body shape deformations can adversely affect the detection system.

- Illumination. For a detection system that depends on the lighting conditions, varying illumination and shading in different environments can affect the detection.

- Viewpoint changes. Depending on the angle from which persons are viewed, they can yield different shapes with varying aspect ratios.

- Background clutter. Sometimes the background exhibits structure and shape similar to that of a person, making the distinction difficult.

- Occlusions. Sometimes people are partially or completely occluded by things they are carrying, by overlaps with other people, or by structures in the environment, making detection from partial views or under complete occlusion difficult.

- Sensor limitations. In the robotics context, embedded sensors mostly have short fields of view and are usually mobile, making the detection task difficult.

- Computational cost. Techniques and methods that achieve state of the art detection usually require a lot of computation time compared to trivial person detection methods. This is a difficulty especially in the robotics context, where a reactive response acceptable to humans is required. Balancing detection performance with computational requirements adds to the challenge faced in person detection.

All these challenges make successful person detection based on a single sensor very difficult. For real world scenarios, more promising approaches combine inputs from more than one sensory channel. In multi-modal person detection, detections from the different sensors can be used to cross-validate each other, yielding a robust detector. Features that are not captured by one sensor can be captured by another, making the detection more invariant to the challenges listed above. Existing works utilizing single and multiple sensors for person detection are presented in Chapter 2. The context of this work is multiple person detection from Rackham. As mentioned in section 1.1, Rackham has been used as an interactive tour guide robot in a museum. Evolving the scenario to a higher level, having a general assistant robot would require the robot to follow the target (assisted person), rather than the other way around, in a socially acceptable manner. Recently, Germa et al. [26] implemented a system to track and follow a given user using Rackham. Their system is capable of detecting and tracking a target person using vision and a Radio Frequency ID sensor, with the targeted person wearing a passive RFID tag during the person-following task. But up to now the robot does not have an onboard system to detect and avoid the other multiple persons around. If the robot is to be used in an actual scenario in a crowded environment, it should be able to detect, track, and avoid passers-by in its vicinity while following the target person. This calls for detection of the whereabouts of the people around the robot using all possible sensors on board. This is crucial, as one of the major concerns when introducing robots into human environments is safe interaction. The person detection used by Germa et al. [26] is not applicable to passers-by detection, as the visual detector is based on face detection and would fail for people facing away from the robot, and it is unlikely that all people in public areas would wear RFID tags.

1.3 Objective of the thesis

This thesis addresses the problem of human detection in the context of using robots in crowded public environments. The aim of the thesis, in a nutshell, is to investigate and implement a multiple person detector module to detect people around the robot using all possible sensors onboard Rackham: possibly a 2D SICK Laser Range Finder (LRF), an RF sensor, and monocular vision. This work fits into the work of a PhD student at LAAS-CNRS which enabled a service robot to track and follow a target person using vision and RF sensors, irrespective of sporadic occlusions, the camera's limited field of view, and appearance changes, in a socially acceptable manner [26]. This work complements it by detecting the other people around, for safe, socially acceptable reactive navigation while following the target person. Hence, this work comprises a comprehensive literature review of existing single sensor and multiple sensor based people detection approaches, with their advantages and drawbacks, and uses the insights gained to investigate and implement a multi-sensor based multiple person detector. The outcomes of this work will be used to build a multi-person tracker, which in turn will be used to define a reactive navigation control law for the robot to follow the user while avoiding the people around in a socially acceptable manner.

1.4 Robotic platform and software environment

The target platform in this thesis is Rackham, hence all developments are carried out on it. Rackham is an iRobot B21r mobile platform. To increase its capability, its standard equipment has been extended with one pan-tilt Sony EVI-D70 camera, one digital camera mounted on a Directed Perception pan-tilt unit (PTU), one ELO touch screen, a pair of loudspeakers, an optical fiber gyroscope, wireless Ethernet, and an RF system for detecting RFID tags. Rackham also has an LMS200 SICK Laser Range Finder as standard equipment. Figure 1.1 shows Rackham in the laboratory environment. All these devices give Rackham the ability to operate in public areas as a service robot. Rackham's software architecture is based on LAAS's architecture for autonomy [1]. All its functionalities are embedded in modules created by GenoM using a C/C++ interface. GenoM, an acronym for Generator of Modules, is a tool for the specification and implementation of operating modules in a distributed robot architecture [20]. It is a development framework that allows the definition and production of modules that encapsulate algorithms. Hence, all software developments are to be embedded in modules created by GenoM. For visualization purposes and some elementary image I/O operations, the OpenCV library is to be used, and some preliminary tests are also to be carried out in Matlab.

Figure 1.1: Rackham in the Lab at LAAS-CNRS.

1.5 Specific tasks and investigations

To meet the objectives of the thesis, the specific tasks and investigations to be carried out are:

- A comprehensive literature review of existing single sensor and multi-modal people detection methods.

- Investigation of the sensors on Rackham suitable for person detection, while refining and updating an already implemented algorithm [22] for detecting humans from 2D Laser Range Finder data.

- Investigation and implementation of a visual person detector in LAAS's openrobots framework.

- Investigation and implementation of a possible fusion of the different detections to obtain a robust person detector.

- Integration of the developed modules on the robotic platform, for human user tracking while avoiding surrounding people based on the above detectors, and carrying out live experiments.

On top of all this, a seamless integration into the RAP research group, coordinating with the different teams working on the human-aware reactive navigation of Rackham, is expected: specifically with T. Germa and F. Lerasle, who are going to develop the multi-person tracker using the outputs of the person detector, and with the team defining the reactive navigation control laws based on the tracker's output, A. D. Petiteville and V. Cadenat.

1.6 Document organization

To effectively communicate the existing works and the work carried out in this area, the thesis document is structured as follows. Chapter 1 briefly presents the problems to be addressed, which make up the objectives of this thesis, along with the utilities to be used for development. In Chapter 2, the state of the art in automated person detection approaches and the tools used are highlighted. Chapter 3 presents the details of the methodology used to develop the single sensor based person detectors and the combined multi-modal person detector. Chapter 4 gives robotic integration details along with the tests performed, the results obtained, and discussions of these results. Chapter 5 closes the document with conclusions and future work.

Chapter 2

State of the Art and Framework


Automated people detection is an active research area with a rapid rate of innovation. Detecting people automatically has a wide range of applications, including human-robot (H/R) interaction, robot navigation in the presence of humans, pedestrian detection for Automated Driver Assistance Systems (ADAS), video surveillance, and content based image and video processing. At the same time, it is one of the most challenging problems in computer vision due to the large variation of person appearances. These challenges are further augmented in robotic applications by the motion and short field of view of embedded sensors, and by the computation time requirements of a reactive system. Most of the research on human detection has been carried out from the H/R interaction, surveillance, pedestrian protection systems, and image and video indexing points of view. As stated in chapter 1, detection of the people around a robot forms the fundamental part of H/R interaction for using robots in everyday human environments. Even though not perfect, different robotic systems that use various modalities to detect surrounding people have been implemented in the past. To mention a few: Minerva [48] inferred people's locations during interaction based on laser scan data and distance filtering; Alpha [6] used a combination of visual and audio sensors; and Biron [37] used a combination of stereo microphones for acoustic person localization, laser scans for leg detection, and camera-based face detection to detect a person during interaction. In this chapter, a literature review of different people detection approaches is presented. To help structure the flow of the chapter, figure 2.1 shows a typical flow diagram of the people detection procedure. First, the sensors capture the environment; then, depending on the amount of data available, specific regions are selected for further processing (candidate generation). From the selected candidates, features with prominent discriminative quality that capture pertinent characteristics of persons are extracted for classification, which makes the final decision on whether the evaluated candidate corresponds to a person or not. As the aim of the work is to detect static and moving people from a mobile autonomous robot, the literature review emphasizes approaches that do not assume a moving target or a stationary sensor. The chapter begins by briefly introducing the different sensors used for detection; then the candidate generation schemes, pertinent features, and classification methods extensively used for person detection are introduced. A review of person detection approaches, followed by a discussion and summary, concludes the chapter.

Figure 2.1: General person detection flow.

2.1 Sensors

Over the many years of research on person detection, different sensors have been used to detect people. They can be broadly categorized as active and passive sensors. Passive sensors measure levels of energy that are naturally emitted, reflected, or transmitted by the target object, whereas active sensors illuminate the target with their own energy source and measure how the target interacts with that energy. In robotic applications, these sensors are embedded on the robotic platform. Almost all of the sensors require further processing techniques to segment and detect persons from the environment.

2.1.1 Passive Sensors

As stated above, passive sensors measure levels of energy that are naturally emitted, reflected, or transmitted by target objects. The most common ones are cameras, which capture light by making use of matricial-scan chips (CCD or CMOS).

Visible Spectrum Cameras

Most of the research on person detection has been done with visible spectrum cameras. These cameras capture light in the visible electromagnetic spectrum (0.4 - 0.74 µm). They have the advantage of a high spatial resolution in both the vertical and horizontal directions and also provide interesting information from cues like texture or color. The trend in person detection is to detect faces, skin color, full bodies, and body parts in images taken by these sensors. But without enough ambient light, these cameras produce overly dark and poorly contrasted scenes, making person detection almost impossible. Due to their low cost, high potential features, high spatial resolution, and richness of texture and color cues, these sensors have been widely used in robotic applications.

Thermal Infrared Cameras

Also known as night vision, long wave infrared, or far infrared cameras, these capture electromagnetic radiation in the range 6 - 15 µm. Any object at a certain temperature emits electromagnetic radiation as a function of its body temperature. Since human beings possess a distinct body temperature, the contrast between humans and the background in images captured with these cameras can be used to detect people, and this has been exploited by some researchers [30]. The main disadvantage of thermal imaging is that the contrast between humans and the rest of the environment depends on the ambient temperature, making detection very difficult in warm conditions [30]. Figure 2.2 shows images of a pedestrian taken with a visible and a thermal camera, respectively.

Figure 2.2: Images taken of a pedestrian with a) a visible spectrum camera and b) a thermal infrared camera (from [30]).

Microphones

Microphones transform mechanical vibration caused by sound into an electrical signal. Using at least two microphones separated by a certain distance, it is possible to localize the direction of a sound source, similar to how we human beings perceive the direction of a sound using our two ears. Many researchers have used different signal processing techniques to determine the Time Delay of Arrival (TDOA) between the two microphone inputs and thereby obtain the angle of the sound source with respect to the sensors [6, 8, 32]. Noise and acoustic reflections (reverberation) affect the performance of the localization. Sound source localization is usually used in H/R interaction to give attention to persons actively interacting with the robot, for example giving it verbal commands. It cannot be used alone for person detection, as it cannot detect silent persons. When detecting, only direction information is obtained, and it may not be possible to differentiate between human and non-human sounds unless human voice recognition is utilized. [8] have shown that a source right in front of the stereo microphone rig (0°) can be located quite accurately, while angles over 70° on either side result in only rough direction estimates. It is mostly used in H/R interaction in combination with other sensors for human perception.
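To make the TDOA computation concrete, a minimal sketch is given below. It is illustrative only: the function and variable names are ours, and practical systems typically use a generalized cross-correlation (e.g. GCC-PHAT) to cope with reverberation.

```python
import numpy as np

def tdoa_angle(sig_left, sig_right, fs, mic_distance, c=343.0):
    """Estimate the sound-source direction from a stereo recording.

    Cross-correlate the two channels to find the time delay of
    arrival, then convert it to an azimuth angle using the
    far-field approximation sin(theta) = c * tau / d.
    """
    # The peak of the full cross-correlation gives the lag (in
    # samples) of one channel relative to the other.
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)
    tau = lag / fs  # delay in seconds

    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```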

2.1.2 Active Sensors

Active sensors transmit signals and observe their reflection from the objects present to detect targets. Thus far, researchers have used Laser Range Finders (LRFs) [21], radars [49], and sonars [39] to detect persons; Radio Frequency receivers have also been used to infer the position of persons carrying some sort of tag [26]. Active sensors are mostly used in robotic applications.

Laser Range Finders (LRFs)

LRFs emit a laser beam, detect its reflection, and then obtain range information from the time of flight. Typical laser scanners rotate and acquire range data for every angular position. For a person in the scan area, the scans will have a distinct shape that resembles the geometry of the person (leg, waist, etc.) at the scanning height. By segmenting these shapes, possible person detections can be achieved. The SICK LMS 200 is one type of 2D LRF that has been used extensively in the robotics world, including for people detection [22]. It has a very precise accuracy, on the order of millimeters, and a scan area spanning 180°. A SICK LRF along with a typical scan profile containing two people standing in front of a robot is shown in figures 2.4a and 2.4c.
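To illustrate the kind of processing involved, a minimal jump-distance segmentation sketch is shown below; the threshold is an illustrative assumption, not a value used by the detector described in chapter 3.

```python
import numpy as np

def segment_scan(ranges, angles, jump=0.1):
    """Split a 2D laser scan into blobs of nearby points.

    A new segment starts whenever the range difference between two
    consecutive beams exceeds `jump` (in meters) -- a common first
    step before filtering segments by leg-like width.
    """
    # Convert polar readings to Cartesian points.
    pts = np.column_stack((ranges * np.cos(angles),
                           ranges * np.sin(angles)))
    segments, current = [], [pts[0]]
    for i in range(1, len(ranges)):
        if abs(ranges[i] - ranges[i - 1]) > jump:
            segments.append(np.array(current))
            current = []
        current.append(pts[i])
    segments.append(np.array(current))
    return segments

# A segment whose spatial extent roughly matches a leg diameter
# (on the order of 0.1-0.25 m) can be kept as a leg candidate.
```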

Radars

Radar is an acronym for RAdio Detection and Ranging. It is a device that emits radio waves and, based on the received signals, tries to identify the range, altitude, direction, or speed of both moving and fixed objects. For persons or any other obstacles in the radar field of view, the reflected radar signals carry information about the direct radial range and relative velocity of the target. These sensors cannot be used alone for person detection, as reflections from persons and from other obstacles in the radar field of view are the same. To obtain azimuth information of an obstacle, a network of radar sensors can be used. Töns et al. [49] used a network of five Frequency Modulated Continuous Wave (FMCW) radar sensors to determine position, direction, and relative velocity of obstacles by fusing the outputs of each sensor, and further used visual detectors to filter out pedestrians from these hypotheses. Using a combination of wide and narrow beam sensors, the authors were able to detect pedestrians from 0.3 to 30 m. Figure 2.3 shows the five radar sensors used by [49] for pedestrian detection.

Figure 2.3: Position of the five radar sensors onboard a vehicle for pedestrian detection (from [49]).

Sonars

Sonar, an acronym for SOund Navigation And Ranging, sends a sound wave and, depending on the reflection received, tries to determine the range and direction of an obstacle. Hence, similar to LRFs, people detection works by analyzing the sonar scan for person profiles at the height at which the scan was performed. Sonars are mostly used in robotic applications. For example, Martin et al. [39] used sonars to detect leg profiles and combined this with detections from other sensors to detect people robustly. The measurement depends not only on the distance of an object, but also on the object's material, the direction of the reflecting surface, and cross-talk effects when using several sonar sensors [39]. Figures 2.4a and 2.4d show sonar sensors on a robot and the corresponding sensor reading, respectively.

Figure 2.4: Robot with multiple sensors: (a) a robot with various sensors, (b) the image taken by the fisheye omnidirectional camera, (c) the corresponding laser scan, and (d) the sonar sensor reading. The sensor readings correspond to two persons standing in front of the robot (from [39]).

RF Sensors

Radio Frequency sensors are used to detect RFID tags in the vicinity. By having the surrounding people wear specific tags, the positions of the people wearing the tags can be detected indirectly by detecting the tags. Since RFID tags provide unique ID information upon detection, they are very useful for differentiating detections among multiple tags and are very suitable for making associations across consecutive detections. The tags can be active or passive. Active tags emit energy and need a power source; hence they tend to be bulky and unsuitable for people to wear, even though the detection performance is higher. On the other hand, passive RFID tags are very small and compact as they have no power source of their own, which in turn reduces the detection range. Germa et al. [26] used RF sensors to detect passive RFID tags carried by a person, hence detecting the position of the person indirectly. The system is embedded on Rackham and uses eight directive antennas to detect the range and azimuth angle of the passive tags worn by persons all around the robot, up to a 5 m range (figure 2.5).

Figure 2.5: Eight directive RF antennas to detect range and azimuth angle of a passive RFID tag, along with their position on the robot (from [26]).

2.2 Candidate Generation

In object detection in general, not all of the information coming from the sensors contains objects. Candidate generation is the selection of the part of this information that is likely to contain the object of interest, while discarding the rest as irrelevant. The advantage of this is twofold: first, the workload of subsequent detection modules is decreased, and second, possible false positives are filtered out at this stage. On the contrary, if the candidate generation module discards relevant information, it leads to missed detections. Considering all the sensors described in section 2.1, with the exception of visual sensors, the information obtained is well manageable by subsequent detection modules (i.e. classifiers). Visual sensors, however, produce a large amount of information that is computationally very expensive for subsequent modules to handle. Hence, filtering and passing on a reduced set of regions of interest is very important. In this section, different methods used by researchers to generate regions of interest in images obtained from visual sensors are described.

2.2.1 Sliding Window

This is the simplest method: an exhaustive scanning approach that selects all possible candidates in an image according to a given scanning window size conforming to the size of a person. Some authors scale the image while others scale the scanning window to detect people at various scales. For example, Dalal and Triggs [13] construct an image pyramid by scaling the input image by a factor of 1.2; candidates are then generated by sampling 64x128 pixel windows placed at 8 pixel intervals both vertically and horizontally at all layers of the pyramid. Even though the candidates generated are fewer than when sampling the image for all possible sizes of rectangular regions (without keeping the aspect ratio), they are still quite numerous, and many irrelevant regions are passed to the classifier. Figure 2.6a illustrates this method and 2.6b shows 10% of the generated candidate windows.
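A minimal sketch of this enumeration is given below, scaling the window instead of the image (the two views are equivalent); the parameter values mirror those of [13].

```python
import itertools

def sliding_windows(img_w, img_h, win=(64, 128), stride=8, scale=1.2):
    """Yield (x, y, w, h) candidate boxes over an implicit image pyramid."""
    w, h = win
    while w <= img_w and h <= img_h:
        xs = range(0, img_w - w + 1, stride)
        ys = range(0, img_h - h + 1, stride)
        # Every grid position at this scale is a candidate window.
        for x, y in itertools.product(xs, ys):
            yield x, y, w, h
        # Move one pyramid level up by enlarging the window.
        w, h = int(w * scale), int(h * scale)
```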

2.2.2 Flat World Assumption

This approach is based on the assumption that people in front of the camera are on the ground (a flat world) and that the geometry of the ground and its position with respect to the camera do not change. For a calibrated camera, rectangular regions conforming to the aspect ratio of a person are placed on the ground of the 3D world up front and projected onto the image using the camera transformation matrix. These regions then form the candidate windows for further processing. This approach has been applied to pedestrian detection from a vehicle by Gavrila et al. [25] and Gerónimo et al. [28]. Figures 2.6c and 2.6d show a sample image with projected rectangles for illustration. Using this approach, the possible regions for subsequent steps are reduced. Gerónimo et al. [28] have shown that the performance of this candidate generation scheme is very dependent on the accuracy of the camera calibration parameters, which becomes an issue for a vehicle or robot moving in variable road geometry and motion dynamics (braking, acceleration, etc.).
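A minimal sketch of this projection is given below. It assumes a calibrated pinhole camera whose axes align with a world frame in which the ground is the plane y = 0, and, for brevity, places a single person-sized rectangle per depth; in practice a lateral grid is scanned as well.

```python
import numpy as np

def ground_candidates(K, R, t, person_size=(0.6, 1.8), depths=(5, 10, 15, 20)):
    """Project person-sized rectangles standing on the ground plane
    into the image; K is the 3x3 intrinsic matrix, [R|t] the extrinsics.
    Returns 2D boxes (u_min, v_min, u_max, v_max)."""
    w, h = person_size
    boxes = []
    for z in depths:
        # Two opposite corners (feet and head) of an upright rectangle.
        corners = np.array([[-w / 2, 0.0, z],
                            [ w / 2, h,   z]])
        uv = []
        for X in corners:
            p = K @ (R @ X + t)          # world point -> homogeneous pixel
            uv.append(p[:2] / p[2])
        (u0, v0), (u1, v1) = uv
        boxes.append((min(u0, u1), min(v0, v1), max(u0, u1), max(v0, v1)))
    return boxes
```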


Figure 2.6: Images illustrating the sliding window and flat world assumption approaches (from [27]): (a) exhaustive search, (b) exhaustive search with 10% of the candidate windows, (c) and (d) rectangles projected under the flat world assumption.

2.2.3 Stereo Methods

Stereo methods use stereo cameras to reconstruct the 3D world up front and generate 2D candidate windows in the image corresponding to vertical objects in the 3D representation that might correspond to persons. Gavrila et al. [25] used this method for pedestrian candidate generation by scanning the stereo depth map for vertical objects fitting a pedestrian-sized region of interest lying in the assumed ground plane and, for every occurrence, selecting the corresponding 2D regions as candidate windows. Gerónimo [27] has pointed out that stereo methods as a standalone candidate generation mechanism are not reliable due to overlapping of close objects, uniform regions, and nonexistent 3D points, which can lead to erroneous results.

2.2.4 Multi-sensor Systems

In multi-sensor person detection systems, the input of one sensor that needs less computational power can be used to determine regions of interest for a more informative sensor input (e.g. images). These types of systems are discussed in detail in section 2.5.2.


2.3 Pertinent Features

Individual data points in a sensor reading do not, on their own, convey global information about what the environment contains. But by looking at a group of data points, it is possible to make a rough interpretation of the input. Features enable us to capture this information by extracting meaningful quantities from groups of data points in various ways. For example, by looking at groups of data points in a 2D LRF scan, geometric features can be extracted to determine the shape of the objects in the region. For all the active sensors described in section 2.1.2, with the exception of RF sensors, geometric features can be used to capture the shape of the objects in the scanning field of view. Visual sensors, on the other hand, capture very informative data, so various meaningful kinds of information can be obtained by using different feature extraction methods. In this section, different image features that are frequently used in visual person detection are briefly presented.

2.3.1 Haar-like Features

Haar wavelets were first introduced in the context of object detection in the late 90s by Papageorgiou et al. [44]. Haar wavelets encode the relationship between the average intensities of neighboring regions along different orientations, capturing edges or changes in texture. This makes them suitable for capturing the structural similarities between various instances of a class. Figure 2.7a shows the three types of 2-dimensional Haar wavelets used by [43]; these bases capture changes in local intensity along the horizontal, vertical, and diagonal directions. When applied to images, the value of a two-rectangle feature is the difference between the sum of the pixels lying in the unshaded area and the sum of the pixels lying in the shaded area; a four-rectangle feature computes the difference between diagonal pairs of rectangles. Viola et al. [51] introduced the notion of the Integral Image to compute such features quickly. Lienhart et al. [36] introduced a set of extended Haar-like features which enrich the basic set: they added upright, 45° oriented, and center-surround rectangular features and allowed the prototypes to be scaled independently along the vertical and horizontal axes, thus generating a rich, overcomplete set of features. Figure 2.7b shows the complete Haar-like feature template set used by [36]. They also presented a variant of the Integral Image using two auxiliary images to compute all the features quickly. These features are mainly used for face detection.
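The constant-time feature evaluation that the Integral Image enables can be sketched as follows; this is a minimal illustration with our own function names, not code from the cited works.

```python
import numpy as np

def integral_image(img):
    """Zero-padded integral image: ii[y, x] is the sum of img[:y, :x]."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, x, y, w, h):
    """Sum of pixels inside a w x h rectangle at (x, y) via 4 lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_horizontal(ii, x, y, w, h):
    """Two-rectangle feature: sum of the left (unshaded) half minus
    the sum of the right (shaded) half of the window."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```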

Figure 2.7: Haar-like feature templates: (a) three of the Haar-like features used by [43] (left to right: horizontal, vertical, and diagonal); (b) the complete set of extended Haar-like feature templates, including center-surround, used by [36].

2.3.2 Edge Orientation Histograms (EOH)

If one observes an image containing a person, it is evident that silhouette and edge information are important cues. EOH capture silhouette and edge information and were first proposed in the face detection context by Levi and Weiss [35]. These features not only maintain invariance to global illumination changes, but also capture geometric properties that are difficult to capture with other features. The feature extraction is performed according to the following series of steps:

1. A Sobel mask is applied to the image to calculate the edge orientation and magnitude.

2. The Sobel image pixels are classified according to their edge orientation into $K$ bins.

3. For a certain region $R$, the feature value is defined as the ratio between two orientations, for example between orientations $k_1$ and $k_2$, i.e.

$$Feature_{EOH}(k_1, k_2, R) = \frac{E_{k_1}(R) + \epsilon}{E_{k_2}(R) + \epsilon} \qquad (2.1)$$

where $E_k$ corresponds to the sum of the gradient magnitudes in $R$ of the specified bin. The small value $\epsilon$ is added for smoothing purposes.

4. A set of dominant features is also defined by taking the ratio of a single orientation to the sum of the others:

$$B_k(R) = \frac{E_k(R) + \epsilon}{\sum_i E_i(R) + \epsilon} \qquad (2.2)$$
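A minimal sketch of these steps is given below; the bin count K and the smoothing value ε are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def eoh_features(region, K=6, eps=1e-3):
    """Per-orientation gradient-energy sums E_k over a region, from
    which the ratio features of eqs. (2.1) and (2.2) are formed."""
    gx = ndimage.sobel(region.astype(float), axis=1)  # step 1: Sobel
    gy = ndimage.sobel(region.astype(float), axis=0)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                  # orientation in [0, pi)

    # Step 2: classify pixels into K orientation bins.
    bins = np.minimum((ang / np.pi * K).astype(int), K - 1)
    E = np.array([mag[bins == k].sum() for k in range(K)])

    ratio = (E[0] + eps) / (E[1] + eps)     # eq. (2.1) with k1=0, k2=1
    dominant = (E + eps) / (E.sum() + eps)  # eq. (2.2) for every k
    return ratio, dominant
```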

2.3.3 Histograms of Oriented Gradients (HOG)

Similar to EOH, the essential idea behind the Histogram of Oriented Gradients descriptor is that local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions. HOGs were first introduced by Dalal and Triggs [13] as features for image description and encoding for object detection. As described in [13], HOG descriptor computation is done in five steps:

1. A global image normalization is performed to reduce the influence of illumination effects.

2. First order image gradients are computed.

3. The image window is divided into small spatial regions, called cells, and a local 1D histogram of edge orientations with K orientation bins is accumulated over all the pixels in the cell. Each pixel contributes to its orientation bin a value proportional to its gradient magnitude.

4. A normalization step takes local groups of cells and contrast-normalizes their overall responses. This is performed by accumulating a measure of local histogram energy over local groups of cells, called blocks, and normalizing each cell in the block with it.

5. The HOG descriptors from all blocks of a dense overlapping grid covering the detection window are collected to form a combined feature vector.

The HOG feature extraction is depicted in Figure 2.8, taken from [13]. Each step of the feature computation tries to increase the discriminative power of the descriptor while allowing a certain degree of invariance. According to the authors, normalizing each color channel using a gamma compression reduces the effects of local shadowing and illumination variations. The gradient computation captures contour, silhouette, and some texture information while decreasing illumination variations. The normalization step further introduces better invariance to illumination, shadowing, and edge contrast. When applied to static images, these descriptors are referred to as static HOGs; hereafter, HOG refers to static HOGs unless mentioned otherwise. The authors also introduced four variants of the HOG descriptor, namely: Rectangular HOG (R-HOG), Circular HOG (C-HOG), Bar HOG, and Center-Surround HOG. In R-HOG, the descriptor blocks use overlapping square or rectangular grids of cells, whereas in C-HOG the cells are laid out in grids of log-polar shape. In Bar HOG, the descriptors are computed similarly to R-HOG but use oriented second derivative (bar) filters rather than first derivatives. Center-Surround HOG differs in using a centre-surround style cell normalization scheme. In this document, unless stated otherwise, HOG refers to the default R-HOG.
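As an illustration, the HOG implementation in scikit-image exposes these same steps as parameters; the sketch below uses values mirroring the defaults of [13] on a 64x128 detection window.

```python
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)   # stand-in for a 64x128 image window

descriptor = hog(
    window,
    orientations=9,            # K = 9 orientation bins per cell
    pixels_per_cell=(8, 8),    # step 3: local cells
    cells_per_block=(2, 2),    # step 4: block normalization groups
    block_norm="L2-Hys",       # normalization scheme used in [13]
    transform_sqrt=True,       # step 1: global gamma compression
)
# 7 x 15 block positions x (2 x 2) cells x 9 bins = 3780 values.
print(descriptor.shape)        # (3780,)
```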

Figure 2.8: HOG feature extraction steps (from [13]).

2.3.4 Shape Context

Shape Contexts were first introduced by [5] in the context of object recognition. The descriptor, computed at a point, expresses the configuration of the entire shape relative to that point. It is computed by first extracting edges using an edge detector. The edge points are then stored in the bins of a log-polar histogram formed by quantizing the locations around the point in both radial and angular directions. The orientation is then quantized into a pre-defined number of bins. By making the location bins uniform in log-polar space, the descriptor can be made more sensitive to nearby sample points than to points further away. These descriptors are very well suited for matching purposes and have also been used for pedestrian detection by [34].

2.4 Classification

During classification, a region of interest is evaluated and a decision is made whether it corresponds to a person or not. The classifiers most used in conjunction with image classification and detection are variants of the AdaBoost algorithm and SVMs, but a silhouette matching technique known as the Chamfer System has also been used. A brief review of these methods is presented in this section for completeness. The section is by no means exhaustive; only classifiers commonly used in conjunction with person detection are discussed, with emphasis on the SVM, which is used later.

2.4.1 Chamfer Matching

Chamfer Matching, introduced by [3], is a technique used to compare the shapes of two collections of shape fragments. For example, for an edge template $T$ composed of edge features $t$ and an image's edge map $I$, the Chamfer distance is given by the average distance $d_I$ to the nearest feature:

$$D_{Chamfer}(T, I) = \frac{1}{|T|} \sum_{t \in T} d_I(t) \qquad (2.3)$$

Hence, by setting a threshold, this can be used as a measure to detect a person. The detection depends on how well the template represents the class and how good the used features are. In order to make a precise decision about the object location, orientation, and scale, it may be necessary to use a subsequent verification stage [25]. This method has been applied for person detection from images [25].
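In practice, $d_I(t)$ is read off a distance transform of the image's edge map, so that equation 2.3 reduces to a lookup and an average; a minimal sketch:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_distance(template_pts, edge_map):
    """Average distance from template edge points to the nearest image
    edge (eq. 2.3). template_pts is an (N, 2) array of (row, col)
    coordinates; edge_map is boolean, True where an edge was detected."""
    # d_I for every pixel: distance to the nearest edge pixel.
    d_I = distance_transform_edt(~edge_map)
    rows, cols = template_pts[:, 0], template_pts[:, 1]
    return d_I[rows, cols].mean()
```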

2.4.2 AdaBoost (Adaptive Boosting)

Introduced by Freund and Schapire in 1995 [24], AdaBoost is an algorithm for constructing a strong classifier as a linear combination of simple weak classifiers. The strong classifier $H(x)$ is the sign of the sum of the weak classifiers' outputs $h_t(x)$, each weighted by a factor $\alpha_t$ learned from the training set. Each weak classifier acts on a single feature.

$$H(x) = \mathrm{sign}(f(x)), \quad \text{with} \quad f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \qquad (2.4)$$

In each iteration, the weights of the training examples are varied: the weights of incorrectly classified examples are increased and those of correctly classified examples are decreased, to give more emphasis to the misclassified examples. AdaBoost has proven to be very useful for feature selection: at the end of training, the features with prominent discriminative power have higher weights. It is frequently used for image classification as a result of its simple implementation, very good feature selection abilities, and fairly good generalization. Recently it has also been used for person leg detection from 2D laser range scans [2, 55].
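A minimal sketch of the discrete AdaBoost training loop described above is given below; it is a simplified illustration, not the exact variant used in the cited works.

```python
import numpy as np

def adaboost_train(X, y, weak_learners, T=50):
    """Greedily pick T weak classifiers and their weights alpha_t.

    X: (N, d) features; y: labels in {-1, +1};
    weak_learners: candidate functions h with h(X) in {-1, +1}^N.
    """
    N = len(y)
    D = np.full(N, 1.0 / N)                  # example weights
    ensemble = []
    for _ in range(T):
        # Weak classifier with the lowest weighted error.
        errors = [np.sum(D * (h(X) != y)) for h in weak_learners]
        best = int(np.argmin(errors))
        eps = max(errors[best], 1e-10)
        alpha = 0.5 * np.log((1 - eps) / eps)
        ensemble.append((alpha, weak_learners[best]))
        # Emphasize misclassified examples, de-emphasize the rest.
        D *= np.exp(-alpha * y * weak_learners[best](X))
        D /= D.sum()
    return ensemble

def adaboost_predict(ensemble, X):
    """H(x) = sign(sum_t alpha_t h_t(x)), as in equation 2.4."""
    f = sum(alpha * h(X) for alpha, h in ensemble)
    return np.sign(f)
```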

2.4.3 Support Vector Machines (SVMs)

Support Vector Machines are statistical supervised learning methods used for classification and regression, introduced by Vladimir Vapnik [50]. The SVM approach is considered a good candidate because of its high generalization performance without the need to add a priori knowledge, even when the dimension of the input space is very high [50]. In an SVM, one constructs a hyperplane that separates a given two-class set of examples into their respective classes. To formulate the Optimal Separating Hyperplane (OSH), let $(x_i, y_i)_{1 \le i \le N}$ be a set of $N$ training examples with $y_i \in \{-1, 1\}$ and $x_i \in \mathbb{R}^m$. The aim is to come up with a hyperplane, $w \cdot x + b = 0$, that separates the dataset into the respective classes, $y_i(w \cdot x_i + b) > 0$ for $i = 1, 2, \ldots, N$, while maximizing the margin between the two data sets (figure 2.9). For a linearly separable class, the minimum distance between the hyperplane and the closest points can be made 1 by rescaling $w$ and $b$: $\min_{1 \le i \le N} y_i(w \cdot x_i + b) = 1$. With this, the minimum distance between two points on opposite sides of the hyperplane is $\frac{2}{\|w\|}$. With this formulation, the problem of finding the optimal hyperplane $w$ and $b$ reduces to minimizing equation 2.5, which is equivalent to maximizing the margin, while satisfying equation 2.6.

Figure 2.9: Linear classifiers. a) possible hyperplanes to separate the two classes; b) the optimal hyperplane that separates the two classes while maximizing the margin.

$$\Phi(w) = \frac{1}{2}\|w\|^2 \qquad (2.5)$$

$$y_i(w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N \qquad (2.6)$$

For a non-separable class, a slack term $\xi_i$ measuring the distance of error points to their correct place can be introduced in equation 2.6, resulting in the optimization problem of equations 2.7 and 2.8:

$$\Phi(w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \qquad (2.7)$$

$$y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N \qquad (2.8)$$

The term $C$ is a constant interpreted as a penalty for misclassification; fixing a higher $C$ penalizes errors more. Minimizing equations 2.5 and 2.7 under their respective constraints 2.6 and 2.8 can be formulated as a quadratic optimization problem. The solution of this optimization problem is an optimal hyperplane determined by equation 2.9:

$$w_o = \sum_{i=1}^{N} \alpha_i y_i x_i \qquad (2.9)$$

$$f(x) = \mathrm{sign}(w_o \cdot x + b) \qquad (2.10)$$

The $\alpha_i$ terms in equation 2.9 are zero for most of the training examples $x_i$, except for a few; the vectors with non-zero coefficients are called Support Vectors, and these classifiers are called linear SVMs. For a given instance $x$, equation 2.10 gives the binary classification into the two classes. In case the given data cannot be separated linearly, the input data can be transformed into a higher dimensional space using kernels, where it may be linearly separable.
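As an illustration, a soft-margin linear SVM of exactly this form can be trained with scikit-learn; the sketch below uses toy Gaussian data standing in for person/non-person feature vectors.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 10)),   # non-person class
               rng.normal(2, 1, (100, 10))])  # person class
y = np.hstack([-np.ones(100), np.ones(100)])

# C is the misclassification penalty of equation 2.7.
clf = LinearSVC(C=1.0).fit(X, y)

# The learned hyperplane (w, b); the decision rule is equation 2.10.
w, b = clf.coef_[0], clf.intercept_[0]
pred = np.sign(X @ w + b)
print((pred == y).mean())   # training accuracy
```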

2.5 Person Detection Approaches

2.5.1 Single Sensor Based Person Detection

As the name implies, single sensor approaches rely solely on a single sensor input for people detection. To the best of our knowledge, only visual sensors and laser range finders have been used alone for person detection. Some of the important approaches are presented below, separated into two broad classes: vision based and laser scanner based approaches.

Vision based detection

In visual-only person detection, the existence and position of persons is determined solely from the input image. Broadly speaking, two major approaches can be outlined: one searches the image for full human bodies by scanning the input image in different ways, hereafter referred to as full body detection; the other aggregates evidence of the existence of a person by using part-based human body models and looking for these body parts, hereafter referred to as part based approaches. Some approaches include the full body as one of the parts making up the parts-based model [18], [52]; these methods are also considered parts based approaches.

Full body detection

In full body detection, the input image is scanned for the presence of humans using a window with the aspect ratio of an average human being, with some tolerances (see subsection 2.2.1). Most of the work on full body detection was done either in a pedestrian detection context [25, 28] or in a general object detection framework [13, 18].

In pedestrian detection, the first promising results were reported by Oren et al. [43]. In their work they used Haar-like templates to extract features and an SVM as a classifier. Instead of using all the extracted features to train the SVM, important features for the task were identified in a template learning stage. In later years, Jones et al. [31] incorporated motion information to detect pedestrians. The authors used the set of Haar-like features of [51] to extract motion as well as intensity information as feature sets. Using AdaBoost as classifier and feature selector, arranged in a cascade configuration, they showed results that outperform previous ones. However, this system is only applicable to a static camera if one wants to exploit the motion features.

Gavrila et al. [25] developed a four-module detector made up of stereo pre-processing, Chamfer matching, texture classification, and stereo verification components. In the stereo pre-processing stage, initial areas of interest are provided by computing a depth map that is scanned considering minimum and maximum pedestrian heights, taking the ground plane location at a particular depth range into account. The selected areas, whose number of depth features exceeds a percentage of the window, are passed to the Chamfer system, which performs shape based pedestrian detection by matching the distance transformed images of the selected areas against a hierarchy of pedestrian templates. The distance transformed images are used as a lookup to compute dI(t) in equation 2.3, with a coarse-to-fine template hierarchy: the root template is the most general, and templates further down towards the leaves are matched only if their respective parent has been matched. This process is repeated until one of the leaves matches. To verify the Chamfer system detections, a texture classification stage follows, using a neural network with local receptive fields. Finally, a stereo verification stage filters out some false detections. The overall system is known as PROTECTOR. Results of extensive tests are reported in [25], and the system showed promising performance.

Gerónimo et al. [28] proposed a combination of different extended Haar-like filter sets and Edge Orientation Histograms (EOH) to learn a model for pedestrian detection. Similar to the above method, they used AdaBoost to select the most discriminant features and perform the classification. To optimize the search for pedestrians in a given frame, instead of brute-force image scanning with a fixed window size, they restricted the search to image locations determined by estimating the current ground plane. At each frame, the ground position and camera extrinsic parameters are estimated using RANSAC based least squares fitting on data points projected onto a 2D space, extracted from 3D points of a stereo pair. Once the ground plane is estimated, a 3D grid sampling the road plane is projected onto the 2D image, defining candidate regions (subsection 2.2.2). Their results showed that, compared with extended Haar-like features alone, detector performance improves with the addition of EOH. Moreover, the authors showed that the ground plane estimation improves the true pedestrian location hypotheses.

In a similar context, Monteiro et al. [41] used Haar-like features to detect pedestrians in a vision system. The critical visual features were selected by AdaBoost and used in the AdaBoost framework to build an efficient classifier. Pedestrian detection is done with a sliding window approach, scaling the detector rather than the image. Their complete pedestrian detector has 30 cascaded stages with over 1000 features, and achieves real time performance with good detection accuracy.


Prominent works on person detection in a general object detection context include those of Dalal and Triggs [13] and Laptev [33]. Dalal and Triggs [13] used HOGs as features, a linear SVM as classifier, and a sliding window approach for candidate generation. Their major contributions were the construction of particularly effective features, HOGs, and a data-mining approach during training in which resulting false positives were re-introduced as hard negative examples. They performed extensive experiments with variants of the HOG descriptor to tune its parameters, and their detector was the winner of the 2006 PASCAL object detection challenge [16]. Later, Zhu et al. [54] reformulated the Dalal-Triggs detector in a cascade-of-rejectors approach to achieve a fast and accurate human detection system. The authors used HOG features of variable block sizes and selected the important sets of blocks in an AdaBoost framework. Their final implementation used an integral array representation and AdaBoost to achieve a significant improvement in run time over the Dalal-Triggs detector while maintaining similar performance. Similarly, Laptev [33] used the AdaBoost framework to select prominent features for object detection. Weighted local histograms of gradient orientations over all rectangular sub-windows of the object are used as features. Weighted Fisher discriminant analysis is used as a weak classifier, and AdaBoost selects the best features out of all histograms computed on all sub-windows. The final detector obtained state-of-the-art results in person detection on the 2005 PASCAL VOC dataset¹.

1 http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2005/

On a different note, Goubet et al. [30] presented and reviewed existing methods for pedestrian detection using thermal infrared cameras. The main idea is to generate regions of interest by segmenting people, who stand out brighter because of their warm bodies, using simple thresholding, and then to use other verification modules to confirm the hypotheses. The authors also outlined the problems caused by environmental temperature variation and surface reflections, and proposed fusing thermal and visual images for improved detection.

Body parts based detection approaches

Body parts based detection approaches rely on detecting the different parts of the body (i.e. head, face, torso, legs) and inferring the presence and position of a person from the detected parts and/or some geometric constraints. A parts based approach has two major components: low-level features and classifiers modeling the individual parts or limbs of a person, and a model representing the topology of the human body that accumulates part evidence.

Forsyth and Fleck [23] introduced body plans for detecting the presence of people in images. Body plans model people as an assembly of cylindrical parts, each cylinder corresponding to a body part, where the individual geometry of the parts and the relationships between parts are constrained by the geometry of the skeleton and ligaments. For a given image, the system segments human skin using color and texture criteria, assembles extended segments, and uses a simple, hand-built body plan to support geometric reasoning. The authors applied it to detect naked people in images. It has been reported in [40] that this body detector fails in the presence of clutter and loose clothing.

Felzenszwalb and Huttenlocher [19] developed a parts based detector using pictorial structures, which are collections of parts arranged in a deformable configuration represented by spring-like connections between the parts. The authors used dynamic programming to group body plans efficiently. The parts were modeled as rectangles with fixed aspect ratio, average color, and color variance; hence the part detectors were simple color based detectors and not robust. Later, Ronfard et al. [11] improved the detection by using better part features and detectors: an articulated body model with 14 joints and 15 body parts, and a feature set consisting of a Gaussian filtered image and its first and second derivatives. Dedicated detectors for each part, 15 in total, were learned using Support and Relevance Vector Machines.

Mikolajczyk et al. [40] presented a human detection system based on a probabilistic assembly of robust part detectors. In their work, humans are modeled as flexible assemblies of parts combined with a joint probabilistic body model that depends on part appearance and the relative positions of the parts. In total, 7 different body parts were used, namely frontal head, frontal face, profile head, profile face, frontal upper body, profile upper body, and legs. The geometric relationship between body parts is represented by a Gaussian whose parameters are learned from the training set. The features are dominant orientations, based on first and second derivatives computed over a neighborhood at different scales; each three neighboring horizontal and vertical orientations are combined into feature groups, together with the location of each feature group in a local coordinate system attached to the object. AdaBoost is used to build a strong, reliable classifier for the detection of each part. For a given image, a scale-space pyramid is built and the described features are computed at each scale. The strong classifiers of the individual body parts, learned in the AdaBoost framework, detect the presence of their respective parts, and given the locations and magnitudes of the local maxima provided by the individual detectors, a likelihood model combines the detection results. The overall system achieved an 87% detection rate with 1 false positive per 1.8 images on 400 images taken from the MIT Pedestrian Database², and took less than 10 seconds on a 2GHz P4 machine for a 640x480 image.

2 http://cbcl.mit.edu/software-datasets/PedestrianData.html

Felzenszwalb et al. [18] developed a person detection scheme in a general object detection framework with discriminatively trained parts based models. The person detector is based on mixtures of multi-scale deformable part models that can represent a highly variable object class like that of a person. The resulting person detector has achieved state-of-the-art results in the PASCAL VOC competitions [14, 15] and on the INRIA person dataset³, with a noticeable margin over former state of the art techniques. The authors used Histograms of Oriented Gradients (HOGs) with analytically reduced dimension as features. All model parameter learning was done by constructing a latent SVM problem and training the latent SVM with a coordinate descent approach. This approach is further elaborated in section 3.2.

3 http://pascal.inrialpes.fr/data/human/

In general, body parts detection approaches cope better with occlusions than full body approaches, as their detection depends not only on the whole body but also on the different detected body parts: head, torso, and legs.

Detection based only on 2D LRFs

Non-vision sensors are mostly used in multi-modal approaches to person detection. Yet some researchers have used 2D LRFs alone, owing to their high depth measurement precision. A 2D LRF sweeps an arc, usually of 180°, measuring the radial distance to obstacles at a given angular step. For a person in the laser scanner's field of view, the scan contains shapes corresponding to the person's body at the height at which the scan is performed. By detecting these signatures in the scan, the presence and position of a person can be determined. Fod et al. [21] used two 2D LRFs mounted at the waist height of an average person to detect the waists of people around. Xavier et al. [53] outlined a set of geometric constraints to detect human legs in a 2D laser scan and demonstrated the results with a simple walking test. In a similar context, Arras et al. [2] used 2D LRFs to detect human legs, with AdaBoost as feature selector and classifier on mostly geometric features. Their final detector achieved a 90.42% true positive rate with a 9.58% false positive rate in a new, unknown environment.

2.5.2 Multi-modal Person Detection

The previous subsection reviewed works on person detection using visual or 2D LRF sensors alone. As noted, LRFs provide a precise location for a detected person, but the detection cannot be used to discriminate between individuals in the presence of multiple people. Visual detectors are computationally expensive, and their detections are highly subject to the difficulties outlined in section 1.2; similar limitations apply to the information obtained from each of the sensors outlined in section 2.1. By combining information from different sensors, the shortcomings of one sensor can be compensated by another, leading to better detection performance. Multi-modal person detection therefore utilizes more than one sensor: the main idea is to combine detectors with different modalities to build a better detector, as no single detector is perfect. Different sensors, including vision, laser range finders, sonars, radars, stereo microphones, and RF sensors, have been used by different researchers to detect persons, mostly in robotic applications. These works fall into two categories: approaches that process the detections from the different sensors independently and then fuse them in a common framework, and approaches with sequential constraints, in which the detections of one sensor constrain the search space of the other detection systems. The second approach is more fragile, as a misdetection in the detector that defines the search space for the others results in a misdetection of the entire system. A review of some multi-modal person detection works in the literature follows.

H/R interaction benefits greatly from multi-modal person detection, and different robotic platforms have made use of it. To mention some: SIG [42] uses a combination of visual face detection and sound source localization based on a stereo microphone system to detect a potential communication partner. Lino [32], a user interface robot for intelligent homes, uses three mutually perpendicular microphone pairs to estimate the 3D direction of a speaker, turn around to face the speaker, and then detect the person with face detection; the authors look for harmonics to detect pitches corresponding to persons, so as to avoid responding to random noise. The Biron robot [37] enjoys advanced multi-modal interaction capabilities for detecting persons in its vicinity and focusing its attention on its current human partner: information from heterogeneous sources, namely 2D laser scans to detect human legs, stereo microphones for acoustic localization, and a camera mounted on a pan-tilt unit for face detection, is combined in a multi-modal anchoring framework that simultaneously tracks multiple persons, integrating data from the different types of sensors while coping with the different spatio-temporal properties of the individual modalities. The robot Alpha [6], a humanoid museum guide robot, uses face detection in a Bayesian belief propagation framework and sound source localization, using stereo microphones, to identify which person is currently speaking; among the tracked faces, the person with the minimum distance to the sound source is identified as the speaker.

Zivkovic and Kröse [55] used a 2D laser range finder and an omni-directional camera mounted on a robot to detect people in an H/R interaction context. The authors formulated the problem as a parts based approach: first, humans are detected in the LRF data using AdaBoost, similarly to [2]; then human parts (full body, upper body, and lower body) are detected in the image, again using AdaBoost with Haar-like features. The constituent parts, the body parts plus the leg detections from the LRF, are combined in a probabilistic framework and detections are evaluated using a maximum likelihood estimate, with all probabilistic model parameters learned from training sets. The final results showed that combining both sensors significantly outperforms using each sensor independently.

Similarly, Cui et al. [12] showed that detection and tracking results in a surveillance application improve when laser scanners and vision are combined. They used two laser scanners and one camera to detect multiple people in an open area. The sensors were all calibrated to a global coordinate system, so legs are detected using both scanners and an area corresponding to an average person's height and width is projected onto the image plane to define a candidate for a mean shift tracking method. In [46], a similar detection was done at chest level instead of leg level. In other work, Bellotto and Hu [4] and Fontmarty et al. [22] detected legs from laser scans using geometric constraints and people from the camera using a face detector, independently, in an H/R interaction context; the two detections are then transformed into the same coordinate reference using simple geometric transformations for tracking. Their work also highlighted the improvement gained by using both sensors. Martin et al. [39] proposed a multi-modal detection approach in which all sensors are processed concurrently and integrated into a robot-centered map using a probabilistic aggregation scheme. The authors used three sensors, an omni-directional camera, a 2D laser range finder, and a sonar, to detect people around a mobile robot: legs were detected using geometric constraints and some heuristics from both laser and sonar data, while a skin color based detector was used to detect people in the omni-directional camera images. The detections are processed independently and then fused in a global map using the covariance intersection method for tracking. In this work, the authors clearly demonstrated the improved performance obtained by integrating detections from different sensors. In [8], sound localization based on stereo microphones, detecting the direction of a speaking person, was added as another detection modality in the same covariance intersection framework.

2.6 Tracking

Tracking allows one to overcome gaps in detection, to suppress spurious measurements, to filter out false positives, and to obtain trajectory information for subsequent use. A substantial amount of work exists in this area as well, and detection modules are usually followed by a tracking module. For example, all the approaches discussed in the multi-modal person detection section have a subsequent tracker that takes the detections as input and tracks the trajectories of the detected persons under some assumed person motion model. Mostly Bayesian approaches, including Kalman filters [22] and particle filters [45], are utilized for tracking.


2.7 Discussion and Summary

In this chapter, a review of the state of the art in automated person detection has been presented: the different sensors used for person detection, along with the main detection approaches, with emphasis on robotic applications.

Of all the sensor types, visual sensors have been used most extensively, in H/R interaction, surveillance, pedestrian protection systems, and video and image indexing, owing to the informative nature of the sensor reading, with high spatial resolution in both the vertical and horizontal directions and rich cues like texture and color. Methods relying solely on vision use either full body or body parts detection approaches. An important advantage of the parts based approach is that, since it relies on object parts, it is much more robust to partial occlusions than the standard approach considering the whole object. The current state of the art in visual person detection is based on discriminatively trained parts based detection, with a significant improvement over other methods.

2D laser range finders have also been used alone for person detection. But even though their depth measurements are very precise, the detections depend on the geometric properties of a person's shape at the scan height and are easily fooled by structures with similar shapes. On top of this, the detections cannot be used to differentiate between persons in the presence of multiple people for tracking purposes.

Currently, the trend, especially in robotic contexts, is to use multiple sensors for person detection. Multi-modal detection systems use different sensors to benefit from the advantages of each individual sensor while complementing the weaknesses of one with the others. Compared to single sensor approaches, systems using multiple sensors for person detection are more robust to viewpoint changes, body deformations, occlusions, background clutter, and on-board sensor shortcomings, while at the same time improving computation time significantly. Hence, a multi-modal approach is chosen in this thesis for person detection.

Detection modules are usually followed by tracking modules to overcome gaps in detection, suppress spurious measurements, filter out false positives, and obtain trajectory information.

Chapter 3

Detector Implementation

This chapter presents the different approaches taken to implement a person detection system. First, two single sensor based detection systems are detailed: a person detector based on a 2D laser range finder, and a parts based visual person detector. After highlighting the shortcomings of each of the two methods, a proposed multi-modal person detection system based on 2D laser scans and vision is presented. The associated evaluations of each single sensor detector and of the combined multi-modal detector are presented in the next chapter, in sections 4.2 and 4.3.

3.1 Person Detection from 2D Range Data

Recently, Laser Range Finders (LRFs) have become attractive tools in robotics for environment perception due to their accuracy and reliability. As described in the previous chapter, LRFs emit a laser beam, detect its reflection, and obtain range information from the time of flight. As an LRF rotates and acquires range data, the scan exhibits distinct signatures corresponding to the shapes of the obstacles in the scan region. For a person in the scan region, the scan contains signatures corresponding to the shape of a person at the height at which the scan is performed: for example, the shape of a leg for a scan below the thigh, or the shape of the waist at waist height. Detecting a person from LRF data hence proceeds by detecting person-like shapes in the scan data at the scan height. In the context of this work, leg detection is considered, as the laser scanner used is positioned 38 cm above the ground.

In some works, scan matching techniques have been used to detect the legs of moving persons [45], but this technique is not considered here as it can only detect moving persons. To detect both moving and stationary persons, different geometric properties that characterize the shape of a leg in a laser scan are used. Two common approaches in the literature for detecting legs using geometric properties are:

1. Train a classifier on a training set of geometric features computed from scans of legs and of non-leg structures, and use this classifier to detect legs in a laser scan. Common geometric features include, but are not limited to, the number of points, standard deviation in range, jump distance from preceding segments, radius of the circle fitted to the points, segment length, mean curvature, and mean angular difference.

2. Consider legs as arcs with associated size and shape constraints, and use these geometric properties of an arc to segment and detect legs.

Zivkovic et al. [55], Arras et al. [2], and Spinello et al. [47] used the first approach; [55] and [2] used AdaBoost with decision stumps as classifier, whereas [47] used SVMs within the AdaBoost framework. Fontmarty et al. [22], Bellotto et al. [4], Xavier et al. [53], and Martin et al. [39] presented works following the second approach. [2] reported results showing that detection with AdaBoost outperforms a method similar to the second approach, which the authors referred to as heuristics; however, that heuristics method did not use many of the important geometric properties outlined in [53]. In this project, the second approach is used due to its simplicity, the ease of tuning its thresholds, and the time constraints of the project, avoiding the time needed to manually collect and label a training set, as no dataset is readily available. This work builds upon previous work at LAAS-CNRS presented in [22]. The approach of [22] uses a map of the environment to identify points not on the map as candidates for person legs. But a map of the environment is not always available; hence, in this work, more effort has been put into obtaining acceptable detections relying solely on the geometric properties of legs, without a map for filtering. The following sub-sections describe the sensor used and the person detection algorithm.

3.1.1 Sensor Description

Rackham is equipped with a SICK LMS200 2D laser range finder. The SICK laser range finder sweeps an arc of 180°, measuring the radial distance of obstacles at a set angular resolution of 0.5°. Figures 3.1a-3.1c show the laser scanner along with its scanning sweep and its position on Rackham. Table 3.1 summarizes the specifications of the sensor. The algorithm described in the next sub section is aimed at segmenting legs from the laser scan data of the environment.


Table 3.1: Main characteristics of the SICK LMS200 2D Laser Range Finder on Rackham

Type: Short-Range
Field of application: Indoor
Light source: Infrared (905 nm)
Laser class: 1 (EN/IEC 60825-1), Eye-safe
Field of view: 180°
Scanning frequency: 75 Hz
Operating range: 0 m ... 80 m
Max. range with 10% reflectivity: 10 m
Angular resolution: 0.5°, 1°
Resolution: 1 mm
Systematic error: ±15 mm
Dimensions: 156 mm x 155 mm x 210 mm
Ambient operating temperature: 0 °C - 50 °C

Figure 3.1: The 2D SICK LMS200 Laser Range Finder along with its position and orientation on Rackham: (a) the SICK LMS200; (b) the scanner's angular sweep; (c) the position of the sensors on Rackham; (d) a schematic diagram of Rackham.

3.1.2 Detector Algorithm

The work described here is based on the works of [22] and [53], which rely on the geometric properties of leg scans. The main idea is to detect humans around the robot by detecting their legs using a laser range finder placed 38 cm above the ground. For a person in the laser scanner's field of view, the scan contains shapes corresponding to the person's legs, as can be seen in figures 3.2a and 3.2b. By detecting these leg signatures in the scan, the presence and position of a person is detected.

Figure 3.2: 2D laser scans illustrating captured leg patterns. The leg patterns are shown inside the white rectangular bounding boxes.

The sequence of steps taken to detect legs from a laser scan is:

1. Candidate points determination. Two modes are used to determine the points considered in further steps: a map based mode and a non-map based mode. If a 2D map of the environment, made of line segments, is available, all points not lying on the map are filtered to be candidate points, as in [22]; otherwise, all scanned points are considered candidates and used in the subsequent steps.

2. Blob segmentation. Sequential candidate scan points that are close to each other are grouped into blobs of points. The grouping is based on the distance between consecutive points, with a distance threshold denoted T.

3. Blob filtering. Once blobs have been formed from the candidate points, they are filtered using the geometric properties outlined in [53], described as follows.

(a) Number of scan points. Assuming the 2D scan of a leg corresponds to a circle, if a leg of radius r is detected at a distance R, the number of points captured by the laser scanner with an angular step of 0.5° is given by equation 3.1 and illustrated in figure 3.3a:

Number of Points = 2 tan⁻¹(r/R) / (0.5 · π/180)   (3.1)

By checking whether the number of points of a given blob lies within the bounds determined by an assumed minimum and maximum leg radius, denoted r1 and r2 respectively, irrelevant blobs are filtered out.


(b) Mid point distance. For a given arc-like shape composed of multiple points, the middle point must lie inside an area delimited by two lines parallel to the line passing through the two extreme points of the blob. If the mid point lies on that line, the arc resembles a straight line; if it is very far from it, the arc resembles a sharp corner. Hence, by setting thresholds on the minimum and maximum distance of the mid point from this line, denoted d1 and d2, shapes similar to a straight line or a sharp corner are filtered out. Figure 3.3b illustrates this filtering step. d1 and d2 are set taking the radius of the curve into consideration: in this work, d1 is set to 0.1 · BlobDiameter and d2 to 0.7 · BlobDiameter of the blob under consideration.

(c) Mean internal angle and internal angle variance. This property makes use of a trigonometric property of arcs: all angles formed by the two extreme points of the arc and a point lying on the arc are congruent, i.e. of equal degree. By computing the mean and variance of all the internal angles, the shape of the arc can be characterized. For leg detection, a mean internal angle between 90° and 135° and an internal angle variance of less than 0.15 radian are used, and all blobs not conforming to these values are filtered out. These values were tuned empirically and reported in [53].

(d) Sharp structure removal. Sometimes sharp, corner-like shapes fulfill all the above requirements. To filter out such corners, the point making the smallest internal angle with respect to the end points is first identified. If more than two neighboring points, left and right, make progressively larger internal angles with the end points, the shape corresponds to a sharp corner and is filtered out. Figure 3.3d shows a sample sharp corner with its associated internal angles.

4. Leg formation. All blobs that survive the above filters are considered legs. A leg is characterized by a blob that fulfilled all the above geometric constraints. The centroid of the points in the blob makes the center of the leg, with a diameter determined by the distance between the two end points of the blob.

5. Pairing legs to make humans. Each formed leg is compared with the detected legs in its vicinity to be paired. If a nearby leg is found within a threshold distance, denoted P, of 30 cm, it is paired. Otherwise, if no nearby leg is found for a single detected leg, it is assumed that one leg occluded the other during the scan. In this case, a virtual leg is placed 25 cm from the detected leg along the line of sight between the robot and the detected leg, away from the laser; this virtual leg defines the second leg and a pair is formed. Once all detected legs have been paired, either with another detected leg or a virtually placed leg, a person detection occurs with its center placed at the center of the two paired legs.

Figure 3.3: Illustration of the different geometric properties used to discriminate leg structures in a 2D laser scan: (a) number of scan points; (b) mid point distance; (c) internal angles of a curved structure [53]; (d) internal angles of a sharp corner [53].

A flowchart of the detection process is shown in figure 3.4, and figure 3.5 illustrates the detection steps on real data.

Figure 3.4: Flow chart illustrating the algorithm used to detect persons from a 2D laser scan.

Figure 3.5: Sequence of figures illustrating person detection from 2D laser scans: (a) the H/R situation; (b) the persons as seen by the camera on the robot; (c) the 2D laser scan obtained by the SICK LRF; (d) blobs formed from the raw laser scan; (e) detected legs, shown as white circles; (f) paired legs and the position of each person. The red circle represents the robot. Notice that the second leg of the person denoted by a purple circle was occluded by his first leg, hence a virtual leg was placed along the line of sight of the robot and the detected leg, away from the robot.

This detection system suffers from false detections of table legs, chair legs, and other narrow objects with a circular pattern. People standing with closed legs or wearing long skirts do not yield the leg signatures needed by the detector, so they are classified as negative instances, resulting in false negatives. In addition, it is not possible to know which leg detections correspond to which person in the presence of multiple people, making association of legs across consecutive frames impossible for tracking purposes unless another sensor providing discriminative information is used.

Table 3.2 presents all the thresholds, empirically tuned to maximize correct person detections while minimizing false detections.

Table 3.2: Threshold values of the 2D laser based person detector

Parameter                        Notation       Value
Blob grouping threshold          T              8 cm
Min and max radius of a leg      r1, r2         r1 = 3 cm, r2 = 15 cm
Min and max mid point distance   d1, d2         d1 = 0.1*BlobDiameter, d2 = 0.7*BlobDiameter
Mean internal angle range        θmin, θmax     θmin = 90°, θmax = 135°
Internal angle variance          Vinternal      0.15 radian
Leg pairing distance threshold   P              45 cm
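To make the blob filtering concrete, the following minimal C++ sketch implements two of the tests of step 3, the point-count test (equation 3.1) and the mid-point distance test. The type names and helper functions are illustrative and not taken from the thesis implementation; the internal-angle tests and sharp-corner removal follow the same pattern. Coordinates and ranges are assumed to be in meters.

#include <cmath>
#include <vector>

struct Point { double x, y; };
struct Blob  { std::vector<Point> pts; };   // consecutive scan points of one blob

const double kPi = 3.14159265358979;

// Expected number of scan points for a leg of radius r at range R with a
// 0.5 degree angular step (equation 3.1).
int expectedPoints(double r, double R) {
    return static_cast<int>(2.0 * std::atan(r / R) / (0.5 * kPi / 180.0));
}

double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Point-count test (3a) and mid-point distance test (3b), with the
// thresholds of table 3.2.
bool looksLikeLeg(const Blob& blob, double range) {
    const double r1 = 0.03, r2 = 0.15;      // min/max leg radius [m]
    const int n = static_cast<int>(blob.pts.size());
    if (n < expectedPoints(r1, range) || n > expectedPoints(r2, range))
        return false;

    // Distance of the middle point from the chord joining the two end points.
    const Point& a = blob.pts.front();
    const Point& c = blob.pts.back();
    const Point& m = blob.pts[n / 2];
    double chord = dist(a, c);              // the blob diameter
    double d = std::fabs((c.x - a.x) * (a.y - m.y) - (a.x - m.x) * (c.y - a.y)) / chord;
    return d >= 0.1 * chord && d <= 0.7 * chord;   // d1, d2 from table 3.2
}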

3.2 Visual Person Detection

In the previous section, a laser scanner based person detector was described. But this detector cannot be used alone for tracking purposes, as it carries no discriminative information to differentiate between multiple people in the scan area. On the contrary, visual detectors contain cues like texture and color that can be used to discriminate detections among multiple people. In this section, a possible realization of a visual person detector is investigated.

Rackham, figure 1.1, is equipped with two cameras mounted on pan-tilt units. One of the cameras is dedicated to tracking a target user; hence the second camera, a Sony EVI-D70, is used in this work for the purpose of multiple people detection. As discussed in section 2.5.1, much work has been done in visual person detection. The aim in this project is to use a state of the art visual person detector, with modifications, on Rackham to detect multiple people along with the other sensors. To this end, four open source person detectors were investigated, namely those of Dalal and Triggs [13], Laptev [33], an upper body detector [38], and Felzenszwalb et al. [17]. The work of Dalal and Triggs uses HOGs with an SVM for detection, while Laptev's uses HOGs with AdaBoost as classifier; both implementations stem from a general object detection framework. The upper body detector [38] is similar to that of Dalal and Triggs, except that the training was done using annotated upper body images. The work of Felzenszwalb et al. also comes from a general object detection framework and uses discriminatively trained part based models.

All four implementations, with their associated person models and default parameters, were evaluated on a set of images of multiple people taken in the LAAS robotic lab (the results of this evaluation are presented in section 4.3). The person detector of Felzenszwalb et al., with discriminatively trained part based models, clearly outperforms the others. [13] and [33] mostly fail on close-up images, which occur quite frequently due to the pose of the camera, where the legs of people are not visible; on the contrary, [38] fails on persons far from the camera. In the sub sections that follow, the person detector based on the work of Felzenszwalb et al. is presented in a comprehensive manner. The detector's source code [17] was available in Matlab, hence a complete C version has been implemented to port it onto the robot.

The detector of Felzenszwalb et al. is based on mixtures of multiscale deformable part models that can represent a highly variable object class like that of a person. The resulting person detector is efficient, accurate, and has achieved state-of-the-art results in the PASCAL VOC competition [14], [15] and on the INRIA person dataset¹. In the following sub sections, this visual person detector is discussed thoroughly.

1 http://pascal.inrialpes.fr/data/human/

3.2.1 Feature Set

For the detection of different object classes, [18] used Histograms of Oriented Gradients (HOGs) with analytically reduced dimension as features. The gradient orientation at each pixel is discretized into contrast sensitive and contrast insensitive orientations. For a color image, the color channel with the largest gradient magnitude is used to define the orientation and magnitude. In each cell of a dense rectangular spatial grid, the orientations are spatially aggregated, normalized with respect to each overlapping block, and truncated, as discussed in subsection 2.3.3. Eighteen contrast sensitive and nine contrast insensitive discrete orientations with a cell size of 8x8 pixels are used; hence a total of 4 · (18 + 9) = 108 features are computed for each cell. Instead of using these features directly, feature vectors obtained by analytic projection of the 108-dimensional vectors are used, defined by 27 sums over the different normalizations, one for each contrast sensitive and insensitive orientation channel, and 4 sums over the 9 contrast insensitive orientations, one for each normalization factor. Visualizing the 108-dimensional vector as a 2D array of 4 rows, representing the different normalizations, and 27 columns, representing the contrast sensitive and insensitive orientations, figure 3.6 illustrates how the final 31-dimensional feature is obtained.

Figure 3.6: Feature computation performed by [18]. A 4x27 2D feature block, where columns represent orientations and rows represent the different normalization factors, is reduced to a 31-dimensional vector by 27 sums over the different normalizations, one for each contrast sensitive and insensitive orientation channel, and 4 sums over the 9 contrast insensitive orientations, one for each normalization factor. The arrows represent the summations over the values.
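The reduction of figure 3.6 amounts to two sets of sums, as in the short C++ sketch below. The channel layout is an assumption of this sketch (channels 0-17 contrast sensitive, 18-26 contrast insensitive); the actual ordering depends on the feature extraction code.

#include <array>

// Reduce one cell's 4x27 feature block (4 normalization factors x 27
// orientation channels) to the 31-dimensional vector of figure 3.6.
std::array<double, 31> reduceCellFeature(const std::array<std::array<double, 27>, 4>& f) {
    std::array<double, 31> out{};
    for (int c = 0; c < 27; ++c)            // 27 sums over the 4 normalizations
        for (int r = 0; r < 4; ++r)
            out[c] += f[r][c];
    for (int r = 0; r < 4; ++r)             // 4 sums over the 9 insensitive channels
        for (int c = 18; c < 27; ++c)       // assumed to be the last 9 channels
            out[27 + r] += f[r][c];
    return out;
}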

3.2.2 Model

A person is modeled using a star-structured part based model defined by a root filter and a set of part filters with associated deformation models. The person model currently implemented consists of a mixture of two models, each of which has one coarse root filter that approximately covers an entire person and six higher resolution part filters that cover smaller parts of the object, figures 3.7a and 3.7b. In this context, a filter is defined as an array of d-dimensional weight vectors, and the response or score of a filter F at a position (x, y) in a feature map G is given by the dot product of the filter and the subwindow of the feature map with top-left corner at (x, y), as in equation 3.2:

Σ_{x', y'} F[x', y'] · G[x + x', y + y']   (3.2)
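Equation 3.2 is a plain cross correlation over feature vectors. A minimal C++ sketch follows; the flat row-major storage layout is an assumption of the sketch, and the caller is assumed to keep (x, y) within bounds so that the filter window fits inside the map.

#include <vector>

// A d-dimensional feature map on a dense grid; data[(y*width + x)*dim + k].
struct FeatureMap { int width, height, dim; std::vector<float> data; };

// Response of filter F at (x, y) in feature map G: the dot product of F
// with the subwindow of G whose top-left corner is at (x, y) (equation 3.2).
float filterResponse(const FeatureMap& F, const FeatureMap& G, int x, int y) {
    float score = 0.f;
    for (int yp = 0; yp < F.height; ++yp)
        for (int xp = 0; xp < F.width; ++xp)
            for (int k = 0; k < F.dim; ++k)
                score += F.data[(yp * F.width + xp) * F.dim + k]
                       * G.data[((y + yp) * G.width + (x + xp)) * G.dim + k];
    return score;
}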

Figure 3.7: The two person models making up the complete person mixture model (from [18]).

The model for a person with n parts is formally defined by an (n + 2)-tuple (F0, P1, ..., Pn, b), where F0 is the root filter, Pi is the model for the ith part, and b is a real valued bias term. The part filters are placed on features computed at twice the resolution of the features at the root filter level. Each part model Pi is defined by (Fi, vi, di), where Fi is the filter for the ith part, vi is a 2D vector specifying an anchor position for part i relative to the root position, and di is a 4D vector specifying the coefficients of a quadratic function defining a deformation cost for each possible placement of the part relative to the anchor position. The score of a person hypothesis in the feature pyramid is given by the scores of each filter at their respective locations (the parts are placed at twice the resolution of the root level) minus a deformation cost that depends on the relative position of each part with respect to the root. Let F be a w x h filter and H be a feature pyramid, with p = (x, y, l) specifying a position (x, y) in the lth level of the pyramid. Let φ(H, p, w, h) denote the vector obtained by concatenating the feature vectors in the w x h subwindow of H with top-left corner at p, in row-major order. Then, for a person hypothesis z = (p0, ..., pn), with pi = (xi, yi, li) specifying the level and position of the ith filter, the score is given by equation 3.3.

score(p0, ..., pn) = Σ_{i=0}^{n} Fi · φ(H, pi) − Σ_{i=1}^{n} di · φd(dxi, dyi) + b   (3.3)

where

(dxi, dyi) = (xi, yi) − (2(x0, y0) + vi)   and   φd(dx, dy) = (dx, dy, dx², dy²)

The bias term, b, is introduced in the score to make the scores of multiple models comparable when combined in a mixture model.
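Given precomputed filter responses, equation 3.3 is a straightforward accumulation. The following C++ sketch illustrates it; the structure and argument names are illustrative, and rootResponse and partResponses stand for the already-computed terms Fi · φ(H, pi).

#include <utility>
#include <vector>

// Deformation parameters of one part: anchor v_i and quadratic coefficients d_i.
struct PartModel {
    double vx, vy;   // anchor position relative to the root, v_i
    double d[4];     // deformation coefficients
};

// d_i . phi_d(dx, dy) with phi_d(dx, dy) = (dx, dy, dx^2, dy^2).
double deformationCost(const PartModel& p, double dx, double dy) {
    return p.d[0] * dx + p.d[1] * dy + p.d[2] * dx * dx + p.d[3] * dy * dy;
}

// Score of a hypothesis z = (p0, ..., pn) per equation 3.3.
double hypothesisScore(double rootResponse, double x0, double y0,
                       const std::vector<PartModel>& parts,
                       const std::vector<double>& partResponses,
                       const std::vector<std::pair<double, double>>& partPositions,
                       double bias) {
    double score = rootResponse + bias;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        // displacement of part i from its anchor; root coordinates are doubled
        // since the parts live at twice the root resolution
        double dx = partPositions[i].first  - (2.0 * x0 + parts[i].vx);
        double dy = partPositions[i].second - (2.0 * y0 + parts[i].vy);
        score += partResponses[i] - deformationCost(parts[i], dx, dy);
    }
    return score;
}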

3.2.3 Detection Algorithm

This subsection describes how persons are detected in a given image. In this project, a person model trained on the PASCAL VOC 2008 dataset by [17] is used. The steps taken during detection are:

1. Feature computation. For a given image, the features are computed to make a feature pyramid.

2. The responses of the root and of each part filter at each level of the feature pyramid are computed independently. Let Ri,l(x, y) denote the response at level l of the feature pyramid for the ith model filter; Ri,l is simply a cross correlation between Fi and level l of the feature pyramid [18].

3. The responses of the part filters are transformed to allow for spatial uncertainty using equation 3.4, which spreads high filter scores to nearby locations, taking deformation costs into consideration. In the implementation, this is achieved by applying a generalized distance transform to the filter responses Ri,l:

Di,l(x, y) = max_{dx,dy} ( Ri,l(x + dx, y + dy) − di · φd(dx, dy) )   (3.4)

Pi,l(x, y) = argmax_{dx,dy} ( Ri,l(x + dx, y + dy) − di · φd(dx, dy) )   (3.5)

In addition, equation 3.5 can be used to compute the optimal displacement of a part as a function of its anchor position vi.

4. Compute the overall root score at each level as the sum of the root filter response at that level plus shifted versions of the transformed and subsampled part responses, equation 3.6:

score(x0, y0, l0) = R0,l0(x0, y0) + Σ_{i=1}^{n} Di,l0−λ(2(x0, y0) + vi) + b   (3.6)

where λ denotes the number of levels to go down in the pyramid to get features computed at twice the resolution of the features at the current level. This is the final score; by thresholding it, a binary detection is obtained. Detection steps 1-4 are illustrated by figure 3.8, taken from [18]. When considering a mixture model with m components, in this case m = 2, the detection algorithm outlined in this section is applied to find root locations that yield high scoring hypotheses independently for each component.


5. Bounding box determination and non-maximal suppression. As stated by [18], the bounding box is determined using functions that map a feature vector g(z), a (2n + 3)-dimensional vector made up of the width of the root filter in image pixels and the locations of the upper-left corner of each filter in the image, to the upper-left, (x1, y1), and lower-right, (x2, y2), corners of the bounding box. The functions are learned by linear least-squares regression from the detection output of a trained model and the PASCAL training data labeled with bounding boxes, independently for each component of the mixture model. Once the bounding boxes are determined, all detections are sorted by decreasing score, and a greedy non-maximal suppression stage discards boxes that overlap a kept box by more than 50%, retaining only the ones with the highest scores. Figures 3.9a and 3.9b illustrate the positions of the parts and the bounding boxes corresponding to the two mixture components of the person model, on two sample images taken by Rackham.

Figure 3.9: Visual person detections outlining the detected parts along with the complete bounding box. The detection in (a) corresponds to the part model of figure 3.7a, while (b) corresponds to that of figure 3.7b.
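The greedy suppression of step 5 is compact enough to sketch in full. In the C++ below, the overlap measure is an assumption (intersection over the area of the lower-scored box); the exact criterion used by the original implementation may differ.

#include <algorithm>
#include <vector>

struct Detection { double x1, y1, x2, y2, score; };

// Fraction of box b covered by box a.
double overlap(const Detection& a, const Detection& b) {
    double w = std::min(a.x2, b.x2) - std::max(a.x1, b.x1);
    double h = std::min(a.y2, b.y2) - std::max(a.y1, b.y1);
    if (w <= 0 || h <= 0) return 0.0;
    return (w * h) / ((b.x2 - b.x1) * (b.y2 - b.y1));
}

// Greedy non-maximal suppression: sort by decreasing score and keep a box
// only if no already-kept box overlaps it by more than 50%.
std::vector<Detection> nonMaxSuppress(std::vector<Detection> dets) {
    std::sort(dets.begin(), dets.end(),
              [](const Detection& a, const Detection& b) { return a.score > b.score; });
    std::vector<Detection> kept;
    for (const Detection& d : dets) {
        bool suppressed = false;
        for (const Detection& k : kept)
            if (overlap(k, d) > 0.5) { suppressed = true; break; }
        if (!suppressed) kept.push_back(d);
    }
    return kept;
}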

3.2.4 Training

To train the mixture models, [18] used a weakly labeled training set whose annotations specify a bounding box for each target person present in an image, with no component labels or part locations. All parameter learning was done by constructing a latent SVM problem and training the latent SVM with a coordinate descent approach. The details of the training formulation can be found in [18]. In this project, a person model trained on the PASCAL VOC 2008 dataset and provided by the authors with the open source Matlab release [17] is used.

3.2.5 Computation time

The C implementation of the person detector takes about 4.6 seconds to detect persons in a 320x240 image on a PIII 850 MHz computer with 6 levels in each octave of the feature pyramid. This computation time is not acceptable for the task at hand, navigation in a crowded environment, and calls for further improvements to speed up the detection process.

3.3 Our Multi-modal Person Detector

As discussed in section 3.1, people detection using LRFs is possible and, when a detection occurs, its location is very precise. But LRF detection suffers from false positives due to structures resembling a person's leg, misses people with closed legs or long skirts, and does not carry enough information to discriminate between multiple persons for tracking purposes. Besides, the visual person detector [18] is not readily applicable for the objective at hand due to its computation time requirement. All this calls for a multi-modal person detector that benefits from the positive merits of the individual person detectors while compensating for their respective downsides, to build a better person detection system.

As highlighted in section 2.5.2, different works have combined laser and vision into a multi-modal person detector. To briefly mention some: Cui et al. [12] detected legs from multiple 2D laser range finders and projected an area corresponding to an average person's height and width onto the image plane to define candidates for a mean shift tracking method. Similarly, Spinello et al. [47] projected clusters from a 2D laser scan onto the image to define regions of interest for a visual person detector. Bellotto et al. [4] and Fontmarty et al. [22] detected legs from 2D laser scans using geometric constraints and people from the camera using a face detector, independently, and transformed the detections into the same coordinate reference to make a global map for tracking. Martin et al. [39] combined laser, sonar, and vision in a global map using covariance intersection for tracking. Zivkovic and Kröse [55] detected legs from a 2D laser range finder and body parts (full body, upper body, and lower body) from an omni-directional camera and combined them in a parts based model made of legs, upper body, lower body, and full body. In the next sub sections, the proposed multi-modal detector system is presented thoroughly.

3.3.1 Framework

A good way to fuse the laser and vision detections would be to run both detectors independently and then cross-validate the detections so as to maximize true positives while minimizing false positives. Unfortunately, this is not possible, as the C implementation of the visual person detector takes about 4.6 seconds to process a 320x240 image on our mobile robot platform, Rackham. Hence, similarly to [12] and [47], the proposed approach is to define regions of interest, henceforth referred to as person hypotheses, using the detections from the laser scanner, and then to validate them by running the visual person detector on these regions only. In short, the geometric requirements used to detect persons from the 2D laser scanner, presented in section 3.1, are first relaxed to achieve 100% person detection, at the cost of many false positives. For every hypothesis, assuming a flat world, i.e. all people standing on a leveled ground plane, a rectangular region corresponding to an average height of 1.8 m is projected onto the image. The visual person detector, presented in section 3.2, is then used to validate each hypothesis and detect persons. A block diagram of the proposed system is presented in figure 3.10.

Figure 3.10: Block diagram of the proposed multi-modal person detection system.
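The sequential fusion of figure 3.10 can be summarized as a short pipeline. In the C++ sketch below, every function name is illustrative, not part of the actual GenoM interface; the three helpers stand for the modules described in sections 3.1, 3.3.2, and 3.2 respectively, and are left as declarations.

#include <vector>

struct Hypothesis { double range, bearing; };   // from the relaxed leg detector
struct Rect  { int x, y, w, h; };               // region of interest on the image
struct Image { /* 320x240 color frame */ };

std::vector<Hypothesis> laserHypotheses(const std::vector<float>& scan);  // section 3.1, relaxed
Rect projectToImage(const Hypothesis& h);       // flat-world projection, section 3.3.2
bool verifyPerson(const Image& img, const Rect& roi);  // parts based detector, section 3.2

std::vector<Rect> detectPeople(const std::vector<float>& scan, const Image& img) {
    std::vector<Rect> detections;
    for (const Hypothesis& h : laserHypotheses(scan)) {
        Rect roi = projectToImage(h);           // 1.8 m tall, 4:11 rectangle
        if (verifyPerson(img, roi))             // visual detector confirms or rejects
            detections.push_back(roi);
    }
    return detections;
}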

3.3.2 Sensors Description

For the multi-modal person detection, two sensors are used: the 2D laser range finder described in sub section 3.1.1, and a Sony EVI-D70 camera mounted on a pan-tilt unit on Rackham. The images captured by the camera are color images of 320x240 pixels. Figures 3.1c and 3.1d show the position and orientation of the two sensors on the actual robot and on a schematic representation of the robot, respectively. The position and orientation of both sensors with respect to the robot frame of reference, in (Tx, Ty, Tz, Yaw, Pitch, Roll) notation with translations in mm and rotations in degrees, are given as follows:

2D Laser Range Finder: (10, 0.0, 380, 0°, 0°, 0°)   (3.7)

Sony EVI-D70 Camera: (65, 80, 1610, 6°, 90°, 90°)   (3.8)

The Sony EVI-D70 camera was calibrated using a checkerboard panel and Bouguet's Camera Calibration Toolbox for Matlab [7]. The intrinsics determined from the calibration, and the complete camera transformation matrix (extrinsics and intrinsics, taking the robot frame as reference, under the pin-hole camera model) are presented in equations 3.9 and 3.10 respectively. Metric units are in mm; hence, to determine the pixel position on the image plane, a point in the world coordinate frame should be expressed in mm. Both matrices are utilized in subsequent sub sections.

A_intrinsic = [ 380.7514   0          154.55
                0          381.4332   134.3235
                0          0          1.0      ]   (3.9)

A_camera = [ 380.75    16.24     155.4049   66659.30
             93.717    393.384   0          639440.404
             0.994     0.105     0          232.93     ]   (3.10)

Hypothesis Generation

To generate person hypotheses, the geometric constraints used to detect person legs from a 2D laser scan in section 3.1 are relaxed. With this modification, the system generates hypotheses containing all the persons within the field of scan of the laser range finder. This asserts a 100% detection of persons, with numerous false positives. The modifications to the geometric constraints are shown in table 3.3.

Table 3.3: Threshold values of the laser based person hypothesis generator

Parameter                        Notation       Value
Blob grouping threshold          T              8 cm
Min and max radius of a leg      r1, r2         r1 = 3 cm, r2 = no limit
Min and max mid point distance   d1, d2         d1 = 0.05*BlobDiameter, d2 = no limit
Mean internal angle range        θmin, θmax     no limit
Internal angle variance          Vinternal      no limit
Leg pairing distance threshold   P              45 cm

Hypothesis Projection

Once hypotheses are generated by the 2D laser based hypothesis generator, a virtual rectangle conforming to an average person height of 1.8 m with an aspect ratio of 4:11 (width:height) is positioned at the precise distance obtained from the laser, assuming a flat world, i.e. all people standing on a leveled ground plane corresponding to z = 0 of the robot. The rectangle is placed parallel to the y-axis of the robot to maximize the encapsulated area of a hypothesis. All four corners of the rectangle are then projected onto the image plane using the camera transformation matrix, A_camera of equation 3.10, to define a rectangular search region on the image.

Person Detection

After all the hypotheses lying within the camera field of view have been projected onto the image plane, the visual person detector based on discriminatively trained part models, described in section 3.2, is used to evaluate the hypothesis regions. All regions confirmed to contain persons are labeled as detections of the overall system, while the hypotheses not confirmed by the visual detector are discarded as false alarms. To obtain a precise bounding box conforming to the actual person, the post-processing part of the visual person detector (sub section 3.2.3) is used.

The main advantage of using the defined regions of interest is the reduced computation time: neither all levels of the feature pyramid nor model scores at all possible positions on the pyramid need be computed. The main difficulty of this approach is knowing which level of the feature pyramid to compute for a given rectangular hypothesis. In this work, a linear relationship between the scale of a rectangle and the level of the feature pyramid is used, given by equation 3.11, where l is the level in the feature pyramid, λ is the number of levels in each octave, and the scale is given by Scale = 88 / HeightOfRectangleInPixels, since an 8x8 pixel region makes one cell and the model aspect ratio is 4:11 (width:height). In the final implementation, λ = 10 is used.

l = −2λ (Scale − 1.5)   (3.11)
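Equation 3.11 transcribes directly into code. The sketch below follows the reconstruction above (Scale = 88 divided by the rectangle height in pixels, λ = 10 levels per octave); rounding the result to the nearest integer level is an assumption of the sketch.

#include <cmath>

// Feature-pyramid level to evaluate for a projected hypothesis rectangle,
// per equation 3.11.
int pyramidLevel(double rectHeightInPixels, int lambda = 10) {
    double scale = 88.0 / rectHeightInPixels;
    return static_cast<int>(std::lround(-2.0 * lambda * (scale - 1.5)));
}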

Figure 3.11 illustrates the multi-modal person detection system with outputs of each block in the system block diagram (gure 3.10) placed on the position of the blocks.

3.4

Summary

In this chapter, the methodology followed to build a multi-modal person detector has been presented. Two independent person detection modules based on 2D laser range scans and vision have been developed. It has been highlighted that eventhough the detection based on the laser scanner is fast, it can not be used for subsequent tracking purposes and spatio temporal analysis due to its inability to discriminate among dierent persons. Though, the visual person detector implemented and integrated into the target robotic platform is state of the art, it can not be

47

3.4 Summary

Figure 3.11: Illustration of the multi-modal person detection system with output images placed on
corresponding blocks of the system. The white thin lines depict the camera eld of view. The generated hypothesis are shown as white circles on the 2D laser map while the projected rectangles are shown as thin green rectangular windows on the image plane. The nal detected persons are outlined with a bounding box of varying colors, aqua, red, and blue.

By combining the two detectors into a multi-modal person detector, characterized by using the laser detection as a hypothesis generator and the visual detector to assert those hypotheses, it has been shown that a combined detector with acceptable computational performance is obtained. For every person detected, the multi-modal person detector provides accurate depth localization from the laser information, and a bounding box on the image that can be used to discriminate among different persons in the scene, making it very suitable for the subsequent tracking task. The next chapter will quantify the performance of our multi-modal person detector and show that it outperforms its single sensor counterparts, taking computation time and detection performance into consideration.


Chapter 4

Robotic Integration and Associated Experiments


In this chapter, the integration of the developed multi-modal person detector into the software architecture of Rackham is first presented. Then the test sets and evaluation criteria used to determine the performance of the person detector, along with the obtained results, are presented. The chapter concludes with a discussion of the obtained results. Conclusions based on these results, as well as future work, are discussed in the next chapter.

4.1 Robotic Integration

The multi-modal person detector described in section 3.3 has been implemented within the LAAS-CNRS openrobots architecture [1], with a C/C++ interface, as a GenoM module. A GenoM module is an independent software component that can integrate a set of functions with various time constraints or algorithmic complexity: control of sensors and actuators, servo-control, monitoring, data processing, trajectory computation, etc. The module has been named multihumPos. Figure 4.1 shows Rackham's software architecture, made of various GenoM modules, along with their interactions. The newly created module, multihumPos, interacts with the sick module to obtain 2D laser range scan data, with pom to obtain the position of the sensors with respect to the robot, and with camera to obtain image frames from the camera.
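Schematically, one execution cycle of the module can be pictured as in the sketch below. This is a minimal illustration only: the placeholder reader and writer functions stand in for GenoM poster accesses and are not the actual GenoM interfaces, and the data structures are hypothetical.

```cpp
#include <vector>

// Hypothetical data types for the three input posters.
struct LaserScan { /* range readings */ };
struct Image     { /* pixel buffer   */ };
struct Pose      { /* sensor pose    */ };
struct Detection {
    double x, y;         // person position from the laser (m)
    int u0, v0, u1, v1;  // confirmed bounding box on the image (pixels)
};

// Stubs standing in for reads of the data exported by the sick, pom,
// and camera modules, and for the output exported by multihumPos itself.
LaserScan readSickPoster()   { return {}; }
Pose      readPomPoster()    { return {}; }
Image     readCameraPoster() { return {}; }
std::vector<Detection> detectMultiModal(const LaserScan&, const Image&,
                                        const Pose&) { return {}; }
void exportDetectionsPoster(const std::vector<Detection>&) {}

// One execution cycle: fuse the freshest data from the three inputs and
// export detections for downstream tracking modules.
void multihumPosCycle() {
    LaserScan scan = readSickPoster();
    Pose pose      = readPomPoster();
    Image frame    = readCameraPoster();
    exportDetectionsPoster(detectMultiModal(scan, frame, pose));
}
```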


Figure 4.1: Rackham's software architecture.

4.2 Test Set and Evaluation Criteria

4.2.1 Test Set

As described in section 3.3, a multi-modal person detector based on a 2D laser scanner and a visual person detector has been implemented and integrated into Rackham for evaluation. Note that in section 3.1 an independent person detector based on the 2D laser range finder has been described. To evaluate the performance of the multi-modal person detector, and also to compare it against the 2D laser range person detector, tests were carried out on the two person detection methodologies, namely:

1. Person detection based on the SICK LRF. The detection proceeds by detecting the legs of persons in the laser scans. To detect and segment out legs, various geometric constraints applicable to the geometry of a leg signature in a scan are used, as described in section 3.1.

2. Multi-modal person detection based on the SICK LRF and vision. In this approach, the laser


scan is used to generate hypotheses on the positions of people. This is achieved by relaxing the geometric constraints of 1. The visual person detector is then used to detect persons at the hypothesized locations, as described in section 3.3.

In order to test the two person detectors, three sets of data obtained from live experiments were used:

(I) 94 frames, each containing a single individual. The data were acquired keeping both the robot and the person static.

(II) 273 frames containing a total of 207 persons. All the targets were walking while the robot was static during data acquisition. This dataset also contains empty frames, which are used to measure false positives in frames containing no person.

(III) 625 frames containing 956 persons in total. During the acquisition, the target persons were all walking around the robot, while at times the robot was also moving.


Figure 4.2: Sample images from each dataset used for detector evaluation: a–d from set I, e–h from set II, i–l from set III.

All the data were acquired inside the LAAS-CNRS robotics lab, which has an area of approximately 10 × 8.20 m² in which Rackham can actually move. The laboratory environment is very complex and contains other robotic platforms, cluttered chairs, desks, and the like. Figure 4.2 shows sample images taken from each dataset.


4.2.2 Evaluation Criteria

To evaluate the person detection performance, True Positives (TP), False Positives (FP), and False Negatives (FN) were counted in each frame. The total number of persons in each frame is determined by counting the people captured by either the SICK laser sensor, as determined by visual inspection of the leg signatures in the scan data, or by the vision system, as determined by the visibility of a person in the acquired image. To be able to compare the performance of both systems, only the region in the camera field of view is considered. As a rule, all persons outside the field of view of the camera, or lying on its border, are not counted. To quantify the performance of the person detectors, two measures are used, Recall (True Positive Rate) and Precision:

Recall (True Positive Rate) (in %) = \frac{TP}{TP + FN} \times 100   (4.1)

Precision (in %) = \frac{TP}{TP + FP} \times 100   (4.2)
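As a quick worked check of equations 4.1 and 4.2, the minimal C++ helper below reproduces, for example, the laser-only figures reported for test set I in table 4.2 (TP = 87, FP = 44, FN = 7); the struct and function names are just for this example.

```cpp
#include <cstdio>

struct Counts { int tp, fp, fn; };

double recall(Counts c)    { return 100.0 * c.tp / (c.tp + c.fn); }  // eq. 4.1
double precision(Counts c) { return 100.0 * c.tp / (c.tp + c.fp); }  // eq. 4.2

int main() {
    Counts setI{87, 44, 7};  // laser-only detector on test set I (table 4.2)
    // Prints: recall = 92.55%, precision = 66.41% (table 4.2 rounds to 66.40).
    std::printf("recall = %.2f%%, precision = %.2f%%\n",
                recall(setI), precision(setI));
    return 0;
}
```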

4.3 Evaluation Description

Before presenting results from the laser based and multi-modal person detectors, the results obtained by evaluating the four open source visual person detectors described in chapter 2 and section 3.3 are presented in table 4.1. These tests were made on a total of 142 images containing 388 instances of people at the LAAS-CNRS robotics lab; sample images are shown in figure 4.3. The aim was to verify that the detector of [17] outperforms the other visual person detectors. As can be seen from the results, the visual person detector of Felzenszwalb et al. [17] significantly outperforms the other three. These results, together with the results reported from the 2008 PASCAL VOC challenge [15], are the basis for the choice of visual detector in section 3.2.

Table 4.1: Summary of person detection results on 142 images containing 388 person instances.

Detector                    Recall (%)   Precision (%)
Dalal et al. [13]           27.83        72.00
Laptev [33]                 31.96        82.12
Upper body detector [38]    11.17        92.86
Felzenszwalb et al. [17]    67.78        96.69


Figure 4.3: Sample images used for validating the four visual detectors.

Person Detection with 2D Laser Range Finder

The results for the person detector based on the 2D laser range finder are presented in table 4.2. The results are obtained considering only the region in the camera field of view; any person lying outside or on the border of the camera field of view is ignored. For people on the border, the acquired images yield only partial views of the person, so they are not considered, to avoid comparing two detection systems where one has all the necessary information while the other is missing part of it. This person detection is quite fast and runs at about 20 scans per second on our robotic platform. Sample detections can be seen in figure 4.4; in the first and last images it can clearly be seen that a robot's curved frame is also detected as a leg.

Table 4.2: Summary of person detection results of the 2D laser range only detection on test sets I–III.

2D Laser Range Only Detection
Test set   TP    FP    FN   Recall (%)   Precision (%)
I          87    44    7    92.55        66.40
II         168   89    39   81.50        65.37
III        887   312   69   92.78        73.97

Figure 4.4: Sample laser-only detections. The upper images show the scene as seen by the robot camera looking forward, and the bottom images show the corresponding laser leg detections. The arc shows the laser scan field and the shaded region corresponds to the camera field of view.


Multi-modal Person Detection

The results of the multi-modal person detector on the test sets (I–III) are shown in table 4.3. The results corresponding to the laser hypotheses are obtained by counting hypotheses generated on an actual person as True Hypotheses, missed persons as Missing Hypotheses, and all other hypotheses generated over non-person objects as False Hypotheses. Figure 4.5 shows two instances of human-robot situations along with the detections of the multi-modal person detector. Each individual is shown with a distinct color matching the corresponding generated laser hypothesis. Currently, the speed of the overall system depends on how many person hypotheses are generated from the 2D laser scanner; thus far, during tests, it has varied from 1.5 fps minimum to 4.5 fps maximum.

Table 4.3: Summary of person detection results for the multi-modal person detector on the test sets I–III.

           Laser Hypothesis                          Multi-modal
Test set   True Hyp.   False Hyp.   Missing Hyp.   TP    FP   FN    Recall (%)   Precision (%)
I          94          172          0              90    1    4     95.74        98.90
II         207         607          0              181   7    26    87.4         96.27
III        946         1863         10             809   54   147   84.62        93.74

Further sample multi-modal person detections are shown in figure 4.6, and figure 4.7 presents some of the erroneous detections.

4.4 Discussion and Summary

As can be seen from the results in tables 4.2 and 4.3, the person detector based on the 2D laser range finder has a better true detection rate but an inferior precision. Its false positives are much higher than those of the multi-modal detector, e.g., 312 FP in dataset III, and this is considering only the region in the camera field of view; if the whole laser scan area were considered, the false positives would be higher still, depending on the structures in the environment. This is attributed to false positives resulting from structures having laser signatures similar to those of people's legs: the detector is easily fooled by circular chair legs, curved corners, or pillars in the lab. Most importantly, it should be noted that a person detector based solely on the 2D laser cannot be used for tracking, because of its inability to differentiate between multiple people detections and to associate persons detected in subsequent frames. On the contrary, the multi-modal person detector enjoys both the precise localization from the laser and the discriminative information from the visual detection.


Figure 4.5: Sample images showing human-robot positions and the corresponding person detections based on laser and vision. The green boxes on the video image in (b) show the hypotheses generated from the laser scans, and the detections are shown in various colors with the corresponding laser detections. On the laser scan map, a small white circle denotes a detected leg and a green circle a detected person. The actual scan data is shown in red, while blobs are shown as thin blue circles. The gray shaded region corresponds to the camera field of view.

Computation-wise, the multi-modal person detector shows a significant improvement over the visual detector alone. The detection performance in table 4.3 shows that the multi-modal detector achieves an 84.62% true positive rate with a precision of 93.74% on a dataset obtained from walking people while at times the robot was also moving. The true positive rate is even higher for stationary people, which makes it all in all a very good person detector, outperforming its single sensor counterparts taking subsequent use, computation time, and precision into account.

Figure 4.6 shows some detection results of the multi-modal person detector. These images demonstrate detections of a stationary person (4.6a), close-ups (4.6b), walking persons (4.6c, 4.6d, 4.6i), persons seen from behind (4.6c, 4.6e, 4.6f), and very crowded scenes (4.6j, 4.6k, 4.6l). Figure 4.7 shows the opposite side, with samples of erroneous detection results. Figures 4.7a and 4.7b depict sample failures of the visual detector even though the correct rectangular hypothesis was



Figure 4.6: Some person detections of the multi-modal person detector. Green boxes denote hypotheses generated using the information from the laser scanner, and red boxes are confirmed detections.

Figure 4.7: Detection mistakes made by the multi-modal person detector: panels (a), (b), (e), and (f) are false negatives; panels (c) and (d) are false positives. Green boxes denote hypotheses generated using the information from the laser scanner, and red boxes are confirmed detections.


projected; figure 4.7c shows a false positive due to a hypothesis generated by a table leg; figure 4.7d shows a double detection counted as a false positive, caused by a person falling within a double hypothesis region because of the person's leg and a nearby structure of similar shape; and figure 4.7e shows a false negative due to body deformation as a result of bending. Figure 4.7f shows a situation that arises when a person is walking fast and, as a result, has an inclined posture: most people tend to lean forward when placing the fore foot and about to lift the lagging leg. At this moment, the projected rectangle specifying the candidate window misses a significant part of the person's actual body, leading to a misdetection.

To summarize, this chapter has presented the integration of the developed multi-modal person detector in our robotic platform Rackham, along with the experiments carried out to evaluate its performance and the corresponding evaluation results. It has also been shown that the implemented multi-modal person detector outperforms its single sensor counterparts, taking computation time and detection performance into consideration.


Chapter 5

Conclusions and Future Work


This thesis has addressed the problem of person detection in crowded scenes from a mobile robot using multiple onboard sensors. This task is crucial in H/R interaction, as the first and foremost task should be the perception of the whereabouts of the persons around. This chapter presents concluding statements and highlights the future work to be carried out.

5.1 Conclusions

The work in this thesis is aimed at detecting people around a mobile robot for tracking purposes, which will in turn be used for reactive navigation in crowded scenes. The work starts by presenting the state of the art in single sensor and multiple sensor based person detection approaches. Then, using insights obtained from the different works presented, a multi-modal person detector based on a 2D SICK Laser Range Finder and vision has been implemented. The multi-modal person detector uses the 2D laser scanner to scan the environment at leg height and segments profiles resembling that of a human leg to generate person hypotheses. These hypotheses are projected onto images from the camera to define search windows for a state of the art parts based visual person detector. The visual detector utilized is currently the state of the art in visual person detection and significantly outperforms previous approaches. Our C implementation of the visual detector alone takes about 5 seconds to detect people in a 320x240 image, while our person detection using 2D laser scans alone cannot be used for spatio-temporal analysis, as it carries no information to differentiate among multiple people detections. But by combining the two modalities, it has been shown that a more convenient multi-modal person detector, capable of running at 1.5 fps minimum and yet carrying enough information to


be used in subsequent stages for tracking, is achieved. Furthermore, the developed multi-modal person detector has been integrated into our robotic platform, and live experiments have been carried out to evaluate its performance. Results corresponding to all the test sets have been clearly reported and show that the multi-modal person detector is all in all a very good person detector, outperforming its single sensor counterparts taking subsequent use, computation time, and precision into account.

The main contributions of this thesis are:

1. a complete C implementation of a state of the art visual person detector;

2. a multi-modal person detector that uses a 2D SICK laser and a state of the art visual person detector sequentially, shown to outperform its single sensor counterparts taking subsequent use, computation time, and precision into account; and

3. the implementation of this multi-modal detector as a GenoM module on Rackham.

The detection system is ready to be used for the subsequent tasks of tracking and reactive navigation control law definition. On a personal note, person detection is by far a very challenging task and still needs a lot of work before a perfect system exists; evidently, detection systems utilizing multiple sensor modalities should be used if one wants to overcome all the challenges.

5.2 Future Work

The aim of this thesis is to develop a multi-modal person detector that will be used to build a multi-person tracker, which in turn will be used to define control laws for the reactive navigation of our robotic platform in crowded scenes. Currently, the multi-modal person detection has been completed. As mentioned in section 1.5, two teams are currently working on multi-person tracking and on control law definition for reactive navigation. Hence, the first and foremost remaining task is the integration of the multi-person tracker and the control laws to obtain a complete system that is aware of the persons around it and can move in crowds while avoiding passers-by in a socially acceptable manner. A possible extension of the multi-modal person detector by incorporating an FMCW radar will also be investigated, assuming a prototype small enough to be mounted on Rackham is made available. Another line of possible work is to improve the existing target person following task of Rackham by incorporating the laser scanner into the current vision and RFID based system for better person localization.

Bibliography
[1] R. Alami, R. Chatila, S. Fleury, M. Ghallab, and F. Ingrand. An architecture for autonomy. Int. Journal of Robotics Research (IJRR'98), 17:315–337, 1998.

[2] K. O. Arras, Ó. Martínez Mozos, and W. Burgard. Using boosted features for the detection of people in 2D range scans. In Proceedings of the Int. Conf. on Robotics and Automation (ICRA'07), 2007.

[3] H. G. Barrow, J. M. Tenenbaum, R. C. Bolles, and H. C. Wolf. Parametric correspondence and chamfer matching: Two new techniques for image matching. In Proceedings of the 5th Int. Joint Conference on Artificial Intelligence (IJCAI'77), pages 659–663, 1977.

[4] N. Bellotto and H. Hu. Vision and laser data fusion for tracking people with a mobile robot. In Proceedings of the Int. Conf. on Robotics and Biomimetics (ROBIO'06), 2006.

[5] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI'02), 24(4):509–522, 2002.

[6] M. Bennewitz, F. Faber, D. Joho, M. Schreiber, and S. Behnke. Towards a humanoid museum guide robot that interacts with multiple persons. In Proceedings of the IEEE/RSJ Int. Conf. on Humanoid Robots (Humanoids'05), 2005.

[7] J.-Y. Bouguet. Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/.

[8] R. Brückmann, A. Scheidig, C. Martin, and H. Gross. Integration of a sound source detection into a probabilistic-based multimodal approach for person detection and tracking. In Proceedings of Autonomous Mobile Systeme (AMS'05), pages 131–137. Springer, 2005.


[9] A. Clodic, S. Fleury, R. Alami, R. Chatila, G. Bailly, L. Brèthes, M. Cottret, P. Danès, X. Dollat, F. Elisei, I. Ferrané, M. Herrb, G. Infantes, C. Lemaire, F. Lerasle, J. Manhes, P. Marcoul, and P. Menezes. Rackham: An interactive robot-guide. In Proceedings of the Int. Conf. on Robot-Machine Interaction (RoMan'06), 2006.

[10] A. Clodic, S. Fleury, R. Alami, M. Herrb, and R. Chatila. Supervision and interaction: Analysis of an autonomous tour-guide robot deployment. In Proceedings of the Int. Conf. on Advanced Robotics (ICAR'05), pages 725–732, 2005.

[11] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. In Proceedings of the European Conference on Computer Vision (ECCV'02), pages 700–714, 2002.

[12] J. Cui, H. Zha, H. Zhao, and R. Shibasaki. Tracking multiple people using laser and vision. In Proceedings of the 2005 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS'05), pages 1301–1306, 2005.

[13] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proceedings of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR'05), pages 886–893, 2005.

[14] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.

[15] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html.

[16] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf.

[17] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 3. http://people.cs.uchicago.edu/~pff/latent-release3/.

[18] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI'09), 2009.


[19] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient matching of pictorial structures. In Proceedings of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR'00), volume 2, pages 66–73, 2000.

[20] S. Fleury, M. Herrb, and R. Chatila. GenoM: A tool for the specification and the implementation of operating modules in a distributed robot architecture. In Proceedings of the Int. Conf. on Intelligent Robots and Systems (IROS'97), pages 842–848, 1997.

[21] A. Fod, A. Howard, and M. J. Matarić. Laser based people tracking. In Int. Conf. on Robotics and Automation (ICRA'02), pages 3024–3029, 2002.

[22] M. Fontmarty, T. Germa, B. Burger, L. F. Marin, and S. Knoop. Implementation of human perception algorithms on a mobile robot. In Proceedings of the 6th IFAC Symposium on Intelligent Autonomous Vehicles (IAV'07), September 2007.

[23] D. A. Forsyth and M. M. Fleck. Body plans. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR'97), page 678, 1997.

[24] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory (EuroCOLT'95), pages 23–37, London, UK, 1995. Springer-Verlag.

[25] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: The PROTECTOR system. In Proceedings of the Intelligent Vehicles Symposium (IV'04), pages 13–18, 2004.

[26] T. Germa, F. Lerasle, N. Ouadah, V. Cadenat, and M. Devy. Vision and RFID-based person tracking in crowds from a mobile robot. In Proceedings of the 2009 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS'09), pages 5591–5596, Piscataway, NJ, USA, 2009. IEEE Press.

[27] D. Gerónimo. A Global Approach to Vision-Based Pedestrian Detection for Advanced Driver Assistance Systems. PhD thesis, Computer Vision Center, 2010.

[28] D. Gerónimo, A. D. Sappa, A. M. López, and D. Ponsa. Pedestrian detection using AdaBoost learning of features and vehicle pitch estimation. In Proceedings of the IASTED Int. Conf. on Visualization, Imaging and Image Processing (VIIP'06), pages 400–405, 2006.

[29] M. Göller, M. Devy, T. Kerscher, J. M. Zöllner, R. Dillmann, T. Germa, and F. Lerasle. Setup and control architecture for an interactive shopping cart in human all day environments. In Proceedings of the Int. Conf. on Advanced Robotics (ICAR'09), pages 1–6, 2009.


[30] E. Goubet, J. Katz, and F. Porikli. Pedestrian tracking using thermal infrared imaging. In SPIE Conference on Infrared Technology and Applications XXXII, volume 6206, pages 797–808, June 2006.

[31] M. Jones, P. Viola, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Int. Conf. on Computer Vision (ICCV'03), pages 734–741, 2003.

[32] B. Kröse, J. Porta, A. van Breemen, K. Crucq, M. Nuttin, and E. Demeester. Lino, the user-interface robot. In First European Symposium on Ambient Intelligence (EUSAI'03), pages 3–14, 2003.

[33] I. Laptev. Improvements of object detection using boosted histograms. In Proceedings of the British Machine Vision Conference (BMVC'06), pages 949–958, 2006.

[34] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In Proceedings of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR'05), pages 878–885, 2005.

[35] K. Levi and Y. Weiss. Learning object detection from a small number of examples: the importance of good features. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR'04), 2:53–60, June 2004.

[36] R. Lienhart and J. Maydt. An extended set of Haar-like features for rapid object detection. In Proceedings of the Int. Conf. on Image Processing (ICIP'02), pages 900–903, 2002.

[37] J. F. Maas, T. Spexard, J. Fritsch, B. Wrede, and G. Sagerer. BIRON, what's the topic? A multi-modal topic tracker for improved human-robot interaction. In Proceedings of the Int. Symposium on Robot and Human Interactive Communication (RoMan'06), 2006.

[38] M. J. Marín-Jiménez, V. Ferrari, and A. Zisserman. Upper-body detector. http://www.robots.ox.ac.uk/~vgg/software/UpperBody/.

[39] C. Martin, E. Schaffernicht, A. Scheidig, and H. Gross. Multi-modal sensor fusion using a probabilistic aggregation scheme for people detection and tracking. Robotics and Autonomous Systems (RAS'06), pages 721–728, 2006.

[40] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. In Proceedings of the European Conference on Computer Vision (ECCV'04), pages 69–82, 2004.


[41] G. Monteiro, P. Peixoto, and U. Nunes. Vision-based pedestrian detection using Haar-like features. In Robótica, pages 400–405, 2007.

[42] H. G. Okuno, K. Nakadai, and H. Kitano. Social interaction of humanoid robot based on audio-visual tracking. In Proceedings of the 15th Int. Conf. on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (IEA/AIE'02), pages 725–735, London, UK, 2002. Springer-Verlag.

[43] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Proceedings of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR'97), pages 193–199, 1997.

[44] C. P. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Proceedings of the Int. Conf. on Computer Vision (ICCV'98), pages 555–562, 1998.

[45] D. Schulz, W. Burgard, D. Fox, and A. B. Cremers. Tracking multiple moving targets with a mobile robot using particle filters and statistical data association. In Proceedings of the Int. Conf. on Robotics and Automation (ICRA'01), pages 1665–1670, 2001.

[46] X. Song, J. Cui, H. Zhao, and H. Zha. Bayesian fusion of laser and vision for multiple people detection and tracking. In SICE Annual Conference (SICE'08), pages 3014–3019, 2008.

[47] L. Spinello, R. Triebel, and R. Siegwart. Multimodal people detection and tracking in crowded scenes. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI'08), pages 1409–1414. AAAI Press, 2008.

[48] S. Thrun, M. Beetz, M. Bennewitz, W. Burgard, A. B. Cremers, F. Dellaert, D. Fox, D. Haehnel, C. Rosenberg, N. Roy, J. Schulte, and D. Schulz. Probabilistic algorithms and the interactive museum tour-guide robot Minerva. Int. Journal of Robotics Research (IJRR'00), 19, 2000.

[49] M. Töns, R. Doerfler, M. M. Meinecke, and M. A. Obojski. Radar sensors and sensor platform used for pedestrian protection in the EC-funded project SAVE-U. In Proceedings of the Intelligent Vehicles Symposium (IV'04), pages 813–818, 2004.

[50] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[51] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Int. Conf. on Computer Vision and Pattern Recognition (CVPR'01), 1:511–518, April 2001.


[52] B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In Int. Conf. on Computer Vision (ICCV'05), 1:90–97, 2005.

[53] J. Xavier, M. Pacheco, D. Castro, and A. Ruano. Fast line, arc/circle and leg detection from laser scan data in a Player driver. In Proceedings of the Int. Conf. on Robotics and Automation (ICRA'05), 2005.

[54] Q. Zhu, S. Avidan, M. Yeh, and K. Cheng. Fast human detection using a cascade of histograms of oriented gradients. In Proceedings of the Int. Conf. on Computer Vision and Pattern Recognition (CVPR'06), pages 1491–1498, 2006.

[55] Z. Zivkovic and B. Kröse. Part based people detection using 2D range data and images. In Int. Conf. on Intelligent Robots and Systems (IROS'07), 2007.
