A Supervised Learning Architecture for Human Pose Recognition in a Social Robot

UNIVERSITY CARLOS III OF MADRID
COMPUTER SCIENCE DEPARTMENT

Victor Gonzalez-Pacheco

A thesis submitted for the degree of Master in Computer Science and Technology - Artificial Intelligence
July 2011


Director: Fernando Fernandez Rebollo
Computer Science Department, University Carlos III of Madrid

Co-Director: Miguel A. Salichs
Systems Engineering and Automation Department, University Carlos III of Madrid

Abstract

A main activity of social robots is to interact with people. To do that, the robot must be able to understand what the user is saying or doing. This document presents a supervised learning architecture that enables a social robot to recognise human poses. The architecture is trained using data obtained from a depth camera, which allows the creation of a kinematic model of the user. The user labels each set of poses by telling it directly to the robot, which identifies these labels with an Automatic Speech Recognition (ASR) system. The architecture is evaluated with two different datasets in which the quality of the training examples varies. In both datasets, a user trains the classifier to recognise three different poses. The learned classifiers are evaluated against twelve different users, demonstrating high accuracy and robustness when representative examples are provided in the training phase. Using this architecture in a social robot might improve the quality of human-robot interaction, since the robot is able to detect non-verbal cues from the user, making it more aware of the interaction context.

Keywords: Pose Recognition, Machine Learning, Robotics, Human-Robot Interaction, HRI.

Resumen

One of the main activities of social robots is to interact with people. This requires the robot to be able to understand what the user is saying or doing. This document presents a supervised learning architecture that allows a social robot to recognise the poses of the people it interacts with. The architecture is trained using the images from a depth camera, which allows the creation of a kinematic model of the user that is used for the training examples. The user labels the poses shown to the robot with his or her own voice. To detect the labels spoken by the user, the robot uses a speech recognition system integrated into the architecture. The architecture is evaluated with two different datasets in which the quality of the training examples varies. In both datasets, a user trains the classifier to recognise three different poses. The classifiers built from these datasets are evaluated in a test with twelve different users. The evaluation shows that this architecture achieves high accuracy and robustness when representative examples are provided in the training phase. Using this architecture in a social robot can improve the quality of human-robot interactions, since it makes the robot able to detect non-verbal information produced by the user. This allows the robot to be more aware of the interaction context in which it finds itself.

Keywords: Pose Recognition, Machine Learning, Robotics, Human-Robot Interaction, HRI.

To Raquel, eternal companion in this, and all the journeys.

Acknowledgements

This project could not have been finished without the help of many people, and I want to thank them for the precious time they dedicated to me. First, I want to express how thankful I am to my two advisors, Fernando Fernández and Miguel Ángel Salichs. You have been a lighthouse guiding me on this journey, and you have helped me whenever I needed it. I want to thank Fernando A., especially for helping me overcome all the difficulties I encountered while integrating the voice system of the robot, even when you were not physically here. It is a real pleasure to work with colleagues like you. I am grateful to the rest of the "Social Robots" team as well, especially Arnaud, Alberto and David. It is incredible how you always respond with invaluable help and great advice. Other people have suffered the collateral damage of this project; thank you Martin, Javier, Miguel, Juan, Alberto, and Silvia. It is wonderful to have such colleagues and friends, always there exchanging ideas and concerns, helping with problems and sharing great coffee breaks with me. Finally, I want to give special thanks to Raquel, my wife, for all the support she has given to me. Without your patience, encouragement, and drive I would not have finished this project. Thank you.

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Problems to be solved
  1.4 Structure of the Document
2 Related Work
  2.1 Machine Learning in Human-Robot Interactions
  2.2 Depth Cameras
  2.3 Gesture Recognition with Depth Cameras
3 Description of the Hardware and Software Platform
  3.1 Hardware Description of the Robot Maggie
  3.2 The AD Software Architecture
  3.3 Robot Operating System
  3.4 The Kinect Vision System
  3.5 The Weka Framework
4 The Supervised Learning Based Pose Recognition Architecture
  4.1 Training Phase
    4.1.1 Pose Labeler Skill
    4.1.2 Pose Labeler Bridge
    4.1.3 Pose Trainer
  4.2 Classifying Phase
    4.2.1 Pose Classifier
    4.2.2 Pose Teller
5 Pilot Experiments
  5.1 Scenario
  5.2 Results
  5.3 Discussion
6 Conclusions
A Class Diagrams of the Main Nodes
B Detailed Results of the Pilot Experiment
C The ASR Skill Grammar for Recognising Pose Labels
References
License

List of Figures

2.1 Categories of Example Gathering
2.2 Operation Diagram of a Depth Camera
2.3 Comparison of the three main depth camera technologies
3.1 The sensors of the Robot Maggie
3.2 Actuators and interaction mechanisms of the Robot Maggie
3.3 The AD architecture
3.4 Block Diagram of the PrimeSense Reference Design
3.5 OpenNI's kinematic model of the human body
4.1 Overview of the Built System
4.2 Diagram of the built architecture
4.3 Sequence Diagram of the training phase
4.4 Use Case of the Pose Labeler Skill
4.5 Use Case of the Pose Labeler Bridge
4.6 Use Case of the Pose Trainer Node
4.7 Sequence Diagram of the classifying phase
4.8 Use Case of the Pose Classifier Node
4.9 Use Case of the Pose Teller Node
5.1 Scenario of the experiment
5.2 The tree built for the models M1 and M2
5.3 The tree built for the model M3
5.4 The tree built for the model M4
A.1 Class Diagram of the Pose Trainer Node
A.2 Class Diagram of the Pose Classifier Node

List of Tables

5.1 Results of the Pilot Experiment
B.1 Detailed Results of the Models M1 and M2
B.2 Confusion matrix for the models M1 and M2
B.3 Detailed Results of the Model M3
B.4 Confusion matrix for the model M3
B.5 Detailed Results of the Model M4
B.6 Confusion matrix for the model M4

Chapter 1

Introduction
1.1 Motivation

Human-Robot Interaction (HRI) is the field of research that studies how humans and robots should interact and collaborate. Humans expect robots to understand them as other people do. In this respect, a robot must understand natural language and should be capable of establishing complex dialogues with its human partners. But dialogues are not only a matter of words: most of the information exchanged in a conversation does not come from the phrases of the people engaged in the talk but from their non-verbal messages. Gestures encompass a great part of this non-verbal information, but there are other factors that also provide information, such as the posture a person shows to his or her listeners. For example, imagine a room where several people are sitting in chairs and talking about something. If, suddenly, one of them stands up, it will draw the attention of the rest of the people in the room. What is this person announcing by suddenly standing up? Is he about to leave the room? Or does he want to say something relevant? His intentions will not be disclosed until he says or does something, but the important point is that the dynamics of the conversation have suddenly changed, or at least they have been affected by a change in the context of the room. Now imagine there is a robot in that room. Would the robot notice this change in the context? Probably not, considering the state of current technology.


Gesture and pose recognition systems have been an active research field in recent years [1], but traditional image capture systems require complex statistical models to recognise the body, making them difficult to apply in practical applications [2]. However, recent technological developments are enabling new types of vision sensors which are more suitable for interactive scenarios [3]. These devices are depth cameras [3], [4], [5], which make the extraction of the body easier than it was with traditional cameras. Because body extraction is easier than before, it is now possible to devote more computing power to algorithms that actually process the gestures or the pose of the user, rather than to detecting and tracking the body. Especially relevant is the case of Microsoft's Kinect sensor [6], a low-cost depth camera that offers precision and performance similar to high-end depth cameras at a cost several times lower. Alongside the Kinect, several drivers and frameworks to control it have appeared. These drivers and frameworks provide direct access to a skeleton model of the user in front of the sensor at a relatively low CPU cost. This model is precise enough to track the pose of the user and to recognise the gestures he or she is making in real time.

1.2 Objectives

The main objective of this master thesis is to build a software architecture that is capable of learning human poses, taking advantage of the new capabilities offered by a Kinect sensor mounted on the robot Maggie [7]. With such an architecture, the robot is expected to understand some of the non-verbal information coming from the users who interact with it, as well as contextual information about the situation. Since the robot Maggie is a platform to study Human-Robot Interaction (HRI), one of the requirements of the system is that the user should be able to teach the robot the poses it does not know. This teaching process must be carried out through natural interaction, i.e. the user and the robot must communicate by voice.


The learning architecture will obtain the user's body information from the robot's vision system. The vision system will rely on a Kinect sensor, which has recently been installed on the robot but not yet fully integrated into its software architecture. For that reason, one of the objectives of the project is to integrate the software that manages the Kinect sensor with the robot architecture. Additionally, no learning architecture is running on the robot; therefore, such an architecture has to be built and integrated into the robot. The integration of all these components is not a trivial task due to the requirements of the robot's software architecture, which demands that its software be capable of working in a distributed manner. To solve this issue, the mechanisms that will glue all the developed components together will be the communication systems of the robot's software architecture and the communication systems of ROS (Robot Operating System). ROS is an open-source software framework that aims to standardise robotics software. The vision components we intend to use are already integrated in ROS, and its multi-language support makes it possible to develop new ROS-compatible software. To avoid redoing work that has already been done in ROS, the vision components will be accessed through ROS. However, ROS is not integrated into the robot and there are no mechanisms to connect ROS with the software architecture of the robot. Hence, a final objective of the project is to integrate ROS into the software architecture of the robot. These objectives can be summarised in the following list:

• To build a machine learning framework that learns from multi-modal examples. In essence, the system should learn by fusing verbal information with the information captured by the vision system.
• The human should be able to teach the robot by interacting naturally, in the same manner as he or she would with another human.
• To develop a pose recognition learning architecture that takes advantage of the image acquisition techniques provided by the Kinect sensor and the algorithms of its drivers.
• To integrate ROS in the robot.
• To validate that the system works.
• To integrate and test the whole architecture in the robot Maggie.


1.3 Problems to be solved

In order to accomplish the objectives enumerated above, several technical problems have to be addressed. Among all of them, three are especially relevant. The first one is to integrate the vision algorithms provided by the Kinect sensor into the architecture of the robot. Until now this task has remained incomplete, and it has to be done to enable the learning framework to detect the pose of the users. Additionally, the architecture of the robot lacks a generic machine learning component. Therefore, if a learning component is to be used, it must be integrated into the robot software as well. Finally, the voice recognition system of the robot must be able to collaborate with the learning architecture in order to feed it with the training examples provided by the user. As shown in the objectives section, one of the objectives of the project is to access some components of the architecture through ROS, which has to be integrated into the robot software architecture. This is a technical problem because the communication mechanisms of ROS differ from those of the robot. Thus, some components must be built to enable communication between ROS and the robot. These technical difficulties are summarised in the following list:

• Integrate the vision acquisition system with the robot's software architecture.
• Integrate the machine learning framework into the architecture of the robot.
• Combine the user's inputs with the learning architecture.
• Integrate ROS with the robot's software architecture.

1.4 Structure of the Document

This document is organised as follows. Chapter 2 presents an overview of the related work in pose and gesture classification with depth cameras. Chapter 3 gives an overview of the systems that act as the building blocks of the developed architecture; it describes hardware components such as the robot Maggie and the Kinect sensor, as well as the software modules that act as the scaffold of the project. Chapter 4 presents the developed architecture and describes its components, separating them according to the two phases of the learning process: the training phase and the classifying phase. After that, chapter 5 presents some pilot experiments that have been carried out to validate the correct functioning of the architecture.


Finally, chapter 6 closes the document, presenting the conclusions and the future work that remains to be done.


Chapter 2

Related Work
This chapter introduces the current state of the art in the topics related to the project. The chapter starts with an overview of the machine learning techniques used in the field of Human-Robot Interaction (HRI) with the purpose of improving the interactions between robots and people. Then, section 2.2 presents an overview of the technology used by depth cameras, which enables them to retrieve depth information from scenes. Finally, section 2.3 gives an insight into the applications and work that other research groups have carried out with depth cameras to detect human poses and gestures.

2.1 Machine Learning in Human-Robot Interactions

Fong et al. made an excellent survey [8] of the interactions between humans and social robots. In the survey, Fong mentions that the main purpose of learning in social robots is to improve the interaction experience. At the date of the survey (2003), most of the learning applications were used in robot-robot interaction. Some works addressed the issue of learning in human-robot interaction, mostly centred on imitating human behaviours such as motor primitives. According to the authors, learning in social robots is used for transferring skills, tasks and information to the robot. However, the authors do not mention the use of learning for transferring concepts to the robot. A few years later, Goodrich and Schultz published a comprehensive survey that covered several HRI fields [9]. The authors remarked on the need for robots with learning capabilities, because scripting every possible interaction with humans is not a feasible task given the complexity and unpredictable behaviour of human beings.


They pointed out the need for a continuous learning process in which the human can teach the robot in an ad-hoc and incremental manner to improve the robot's perceptual ability, autonomy and interaction capabilities. They called this process interactive learning, and it is carried out through natural interaction. Again, the survey only reports works that refer to learning as an instrument to improve abilities, behaviour, perception and multi-robot interaction. No explicit mention was made of using learning to provide the robot with new concepts.
Figure 2.1: Categories of Example Gathering - These four squares categorise the ways of gathering examples in the Learning from Demonstration field applied to robotics. The rows represent whether the recording of the example captured all the sensory input of the teacher (upper row) or not (lower row). The columns represent whether the recorded dataset applies directly to actions or states of the robot (left column) or whether some mapping is needed (right column). (Retrieved from [10])

There are many fields where machine learning is applied to robotics. Among all of the machine learning techniques, supervised learning is one of the most widespread, since the robot can explore the world under the supervision of a teacher, thus reducing the danger to the robot or the environment [10]. This section presents some concepts regarding supervised learning, especially Learning from Demonstration (LfD). In this area, [10] presents an excellent survey that classifies several LfD approaches into different categories depending on how they collect the learning examples and on how they learn a policy from these examples.


The latter is not relevant to this project; thus, only the former is summarised below. Argall et al. [10] use the term correspondence to refer to how learning examples are recorded and transferred to the robot. They divide correspondence into two main categories depending on different aspects of how the examples are recorded and transferred: Record Mapping and Embodiment Mapping. The former refers to whether the experience of the teacher during the demonstration is captured exactly or not. The latter refers to whether the examples recorded in the dataset are exactly those that the learner would observe or execute. Argall et al. present these two ways of categorisation in the form of a 2x2 matrix (see Fig. 2.1). Deepening into the categorisation, the authors in [10] divide the Embodiment Mapping into two subcategories: demonstration and imitation. In the demonstration category, the demonstration is performed on the actual robot or on a physically identical platform and, thus, there is no need for an embodiment mapping. On the contrary, in the imitation category, the demonstration is performed on a platform different from the robot learner. Therefore, an embodiment mapping between the demonstration platform and the learning platform is needed. As depicted in Fig. 2.1, both the demonstration and imitation categories are divided into two sub-categories depending on the Record Mapping: demonstration is divided into teleoperation and shadowing, while imitation is divided into sensors on the teacher and external observation. The four categories are described below.

• Demonstration. The embodiment mapping is direct. The robot uses its own sensors to record the example while its body executes the behaviour.
  – Teleoperation: a technique where the teacher directly operates the learner robot during the execution. The robot learner uses its own sensors to capture the demonstration. In this case there is a direct mapping between the recorded example and the observed example.
  – Shadowing: in this technique, the robot learner uses its own sensors to record the example while, at the same time, it tries to mimic the teacher's motions. In this case there is no direct record mapping.


• Imitation. The embodiment mapping is not direct. The robot needs to retrieve the example data from the actions of the teacher.
  – Sensors on the teacher: in this technique, several sensors are placed on the executing body to record the teacher's execution. Therefore, this is a direct record mapping technique.
  – External observation: in this case, the recording sensors are not located on the executing body. In some cases the sensors are installed on the learner robot, while in others they are external to it. This means that this technique is not a direct record mapping technique.

As a commentary, teleoperation provides the most direct method for transferring information within demonstration learning but, as a drawback, it is not suitable for all learning platforms [10]. On the other hand, shadowing techniques demand more processing to enable the learner to mimic the teacher. The sensors-on-the-teacher technique provides very precise measurements, since the information is extracted directly from the teacher's sensors, but it requires an extra overhead in the form of specialised sensors. In contrast, since external observation sensors do not record the data directly from the teacher, the learner robot is forced to infer the data of the execution. This makes external observation less reliable, but since setting up the sensors produces less overhead than the sensors-on-the-teacher technique, it is more widely used [10]. Typically, the external sensors used to record human teacher executions are vision-based. If we apply the categorisation of [10] to this project, no record mapping is needed because the learning examples are recorded directly by the Kinect sensor of the robot. Additionally, the embodiment mapping categorisation does not apply because there is no action or behaviour to learn. Instead, what is learnt is a concept which, for the scope of this project, does not need to be mapped to the robot. Almost all the presented works focus the learning process on learning tasks or behaviours. Few of them use learning to teach concepts to the robot. This is the case of [11], where the authors train a mobile robotic platform to understand concepts related to the environment in which it has to navigate. The authors use a Feed-Forward Neural Network (NN) to train the robot to understand concepts like doors or walls. They train the NN by showing it numerous images of a trash can (its destination point), labelling each photo with the distance and the orientation of the can.


However, the work presented some limitations, such as a learning process that lacked enough flexibility to generalise to other areas. A work in an area not directly related to robotics [12] can give us an insight into the kinds of concepts our robot could learn with our system. In that work, Van Karsten et al. mounted a wireless sensor network in a home to record the activities of the user living in it. The activities were labelled by voice by the person living in the house, using a wireless Bluetooth headset. The labels were processed by a grammar-based speech recognition system similar to Maggie's (see Annex C for an example of one of Maggie's grammars). The authors recorded the activities of the user for a period of 28 days and built a dataset with them. Of the 28 days of recording, 27 were used as the training dataset and the remaining day was used as a test dataset to evaluate the classifier. The training dataset was used to build two models of the user's activities, a Hidden Markov Model (HMM) and a Conditional Random Field (CRF). The performance of both classifiers was evaluated using two measures: the time slice accuracy and the class accuracy. The former represents the percentage of correctly classified time slices, while the latter represents the average percentage of correctly classified time slices per class. Both classifiers demonstrated good results in detecting the user's activities: the CRF showed better time slice accuracy, while the HMM performed better in class accuracy. Learning the activities of the user might improve the quality of the interactions in a social robot. However, despite activity learning having many potential applications in robotics, no works have been found in the fields of social robotics or human-robot interaction.
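To make the difference between the two measures concrete, the following minimal Python sketch computes both from a sequence of true and predicted time-slice labels. The activity names and data are invented for illustration and are not taken from [12].

    from collections import defaultdict

    def time_slice_accuracy(truth, predicted):
        # Percentage of correctly classified time slices overall.
        correct = sum(t == p for t, p in zip(truth, predicted))
        return 100.0 * correct / len(truth)

    def class_accuracy(truth, predicted):
        # Average, over classes, of the percentage of correctly
        # classified time slices belonging to each class.
        totals, hits = defaultdict(int), defaultdict(int)
        for t, p in zip(truth, predicted):
            totals[t] += 1
            hits[t] += int(t == p)
        return 100.0 * sum(hits[c] / totals[c] for c in totals) / len(totals)

    # Hypothetical activity labels for nine time slices.
    truth = ["sleep", "sleep", "cook", "cook", "cook", "eat", "eat", "eat", "eat"]
    pred  = ["sleep", "cook",  "cook", "cook", "cook", "eat", "eat", "sleep", "eat"]
    print(time_slice_accuracy(truth, pred))   # 7 of 9 slices correct (about 77.8)
    print(class_accuracy(truth, pred))        # mean of 50%, 100% and 75% (75.0)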


Most of the presented works assume the robot is a passive learner, but a new paradigm is appearing in which the robot takes the initiative to ask the user for more examples when they are needed [13]. Active learning techniques can produce classifiers with better performance from fewer examples, or reduce the number of examples required to reach a certain performance [14]. Applied to HRI, some works have demonstrated that active learning not only improves the performance of supervised learning, but also improves the quality of the interaction with the teachers. In [15], the authors studied how a social robot was able to learn different concepts. Twenty-four people participated in the experiment, and the robot was configured to show four different degrees of initiative in the learning process. The robot ranged from a traditional passive supervised learning approach to three different active learning configurations. These active configurations were a naïve active learner that queried the teacher every turn; a mixed passive-active mode, where the robot waited for certain conditions before asking; and a mode where the robot only queried the teacher when the teacher granted permission to do so. In their experiment, [15] found that there was no appreciable difference between the active learning modes, but the three of them outperformed the passive supervised learning mode in both accuracy and the number of examples needed to achieve accurate models. Additionally, a survey of the 24 users showed that they preferred the active learner robot, which they found to be more intelligent, more engaging and easier to teach. Although active learning seems better suited to HRI purposes than traditional supervised learning, the scope of this project remains within passive supervised learning, since this is a first approach to the field. Nevertheless, an objective of the project is to build an architecture that allows future expansion in case active learning techniques are added to it in the near future.
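As an illustration of the kind of condition a mixed passive-active learner can wait for, the sketch below queries the teacher only when the classifier's confidence in its best guess falls below a threshold. This is a generic confidence-based criterion, invented here for illustration; it is not the actual rule used in [15] nor a component of this project's architecture.

    def should_query_teacher(class_probabilities, threshold=0.6):
        # Ask for a label only when the classifier is unsure about the
        # current example, i.e. when its best guess is not confident enough.
        return max(class_probabilities.values()) < threshold

    # Hypothetical pose probabilities produced by the current classifier.
    probs = {"standing": 0.45, "pointing": 0.40, "sitting": 0.15}
    if should_query_teacher(probs):
        print("Robot: I am not sure, which pose is this?")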

2.2 Depth Cameras

Depth cameras are systems that build a 3D depth map of a scene by projecting light onto that scene. The principle is similar to that of LIDAR scanners, with the difference that the latter are only capable of performing a 2D scan of the scene, while depth cameras scan the whole scene at once. Fig. 2.2 depicts an example of the operating principle of a depth camera. Traditionally, depth information has been obtained with stereo vision or laser-based systems. Stereo cameras rely on passive triangulation methods to obtain the depth information from the scene. These methods require two cameras separated by a baseline, which determines a limited working depth range. Within these algorithms also appears the so-called correspondence problem, which is determining which pairs of points in the two images are projections of the same 3D point.


Figure 2.2: Operation Diagram of a Depth Camera - Depth cameras project IR light onto the scene, which is analysed by a sensor to obtain the depth map of the scene. The depicted sensor is PrimeSense's, which uses the Light Coding technology to retrieve the depth data. (Retrieved from [16])

In contrast, depth cameras naturally deliver depth and simultaneous intensity data, avoiding the correspondence problem, and do not require a baseline in order to operate [3]. On the other hand, laser-based systems provide very precise sliced 3D measurements, but they have to deal with difficulties in collision avoidance applications due to their 2D field of view. The most widely adopted solution to this problem has been to mount the sensor on a pan-and-tilt unit. This solves the problem, but it also implies row-by-row sampling, which makes this solution inappropriate for real-time, dynamic scenes. In short, although laser-based systems offer higher depth range, accuracy and reliability, they are voluminous and heavy, increase power consumption, and add moving parts when compared to depth cameras. Depth cameras, on the contrary, are compact and portable, do not require the control of mechanical moving parts, thus reducing power consumption, and do not need row-by-row sampling, thus reducing image acquisition time [3]. There are three main categories of depth cameras, depending on how they project light onto the scene to capture its depth information:

Time-of-Flight (ToF) Cameras ToF cameras obtain the depth information by emitting near-infrared light which is reflected by the 3D surfaces of the scene back to the sensor (see Fig. 2.3a).


Figure 2.3: Comparison of the three main depth camera technologies - (a) shows the principle of operation of ToF cameras. (b) shows the pattern emitted by a Structured Light camera. (c) shows the pattern emitted by a Light Coding camera. (b) and (c) obtain the depth information by comparing the distortions of the pattern received at the sensor with the original emitted pattern.

Currently, two main approaches are employed in ToF technology [17]. The first consists of sensors that measure the time of flight of a light pulse to calculate depth. The second approach measures phase differences between the emitted and received signals.

Structured Light Cameras Structured Light is based on projecting a narrow band of IR light onto the scene [4]. When the projected band hits a 3D surface, it produces a line of illumination that appears distorted from perspectives other than the projector's. When this distorted light is received by a sensor, it is possible to calculate the shape of the 3D surface, because the initial form of the band is known. This applies only to the section of the 3D surface that has been illuminated by the light band. To extend this principle to the whole scene, many methods emit a pattern of several light bands simultaneously (see Fig. 2.3b).

Projected Light Cameras This is the newest technology. It is based on projecting a pattern of IR light onto the scene and calculating its distortions. Unlike Structured Light, here the pattern is based on light dots (see Fig. 2.3c). It is used by devices such as Microsoft's Kinect. Since this is the technology used in this project, it is further described in section 3.4.
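For reference, the depth relations behind the three approaches described above can be written in their generic textbook form. These are standard relations, not formulas taken from any particular sensor's documentation:

    % Pulsed ToF: depth from the round-trip time \Delta t of a light pulse
    d = \frac{c\,\Delta t}{2}
    % Phase-based ToF: depth from the phase shift \Delta\varphi of a signal modulated at frequency f_{mod}
    d = \frac{c\,\Delta\varphi}{4\pi f_{mod}}
    % Projector-camera triangulation (structured light and Light Coding):
    % focal length f, baseline b, observed disparity \delta of a pattern feature
    Z = \frac{f\,b}{\delta}

Here c is the speed of light. The triangulation relation also makes explicit why a baseline b between projector and sensor is still present inside a structured light or Light Coding device, even though no second camera is needed.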


2.3 Gesture Recognition with Depth Cameras

Depth cameras are an attractive opportunity in several fields that require intensive analysis of the 3D environment. [18] presents a survey of the different technologies and applications of depth sensors in recent years, and points out a relevant increase in scientific activity in this field in the two or three years prior to the survey. Among other potential applications, [18] regards gesture recognition as one of the research fields that can benefit most from the appearance of ToF technology. In particular, tracking algorithms that combine the data provided by depth sensors and RGB cameras have seen a significant increase in robustness. Prior to the use of depth cameras, considerable research had been carried out in the field of gesture recognition [1], but using traditional computer vision systems requires complex statistical models for recognition, which are difficult to use in practical applications [2]. In [19] and [20], the authors suggest that it would be easier to infer the geometry and 3D location of body parts using depth images. Because of that, several works focus on the use of depth cameras to detect and track different types of gestures. Some examples of gesture recognition are given below. Since depth sensors enable easy segmentation and extraction of localised parts of the body, several efforts have been dedicated to the field of hand detection and tracking. For instance, in [21] a ToF camera is used to reconstruct and track a 7 Degree-of-Freedom (DoF) hand model. In [22], the authors use a ToF camera for interaction with computers, focusing on two applications: recognising the number of raised fingers on one hand and moving an object in a virtual environment using only a hand gesture. Other works propose the use of ToF cameras to track hand gestures [20]. They use the depth data of the ToF camera to segment the body from the background. Once the body is segmented, they detect the head position, since most hand gestures are relative to the rest of the body. In [23], an algorithm for recognising single-stroke hand gestures in a 3D environment with a ToF camera is presented. The authors modified the "$1" gesture recogniser to be used with depth data. The modification of the algorithm means that fingertip gestures do not have to be articulated from the same perspective in which the gesture templates were recorded.


In [24], a stereo camera is used for the recognition of pointing gestures in the context of Human-Robot Interaction (HRI). To do so, the authors perform visual tracking of the head, the hands and the head orientation. Using a Hidden Markov Model (HMM) based classifier, they show that the gesture recognition performance improves significantly when the classifier is provided with information about the head orientation as an additional feature. In [25], however, the authors report that they achieved better results with a ToF camera. To do so, they extract a set of body features from the depth images of a ToF camera and train a model of pointing directions using Gaussian Process Regression. This model presented higher accuracy than simpler criteria such as head-hand, shoulder-hand or elbow-hand lines, as well as than the aforementioned work of [24]. Some works focus on tracking other parts of the body. For example, [19] uses a time-of-flight camera for a head tracking application. They use a knowledge-based training algorithm to divide the depth data into several initial clusters and then perform the tracking with an algorithm based on a modified k-means clustering method. Other approaches rely on kinematic models to track human gestures once the body is detected. For instance, in [26], two stereo cameras are used to detect the hands of a person and build an Inverse Kinematics (IK) model of the human body. In [27], the authors present a model-based kinematic self-retargeting framework to estimate the human pose from a small number of key points of the body. They show that it is possible to recover the human pose from a small set of key points, provided an adequate kinematic model and a good formulation of tracking control subject to kinematic constraints. A similar approach is used in [28], where the authors use a Kinect RGB-D sensor to extract and track the skeleton model of a human body. This allows them to capture the 3D temporal variations of the velocity vector of the hand and model them in a Finite State Machine (FSM) to classify hand gestures. Most of these works rely on capturing only one or a few parts of the body. However, combining the ease of body segmentation provided by the Kinect with recent kinematic approaches like the one in [27] makes it possible to track the whole body without a significant increase in CPU consumption. This is the case of the OpenNI framework [29], which makes it possible to track the skeleton model of the user at a low CPU cost. Because OpenNI is one of the key components of this project, it is addressed in section 3.4.


Chapter 3

Description of the Hardware and Software Platform
Before describing the architecture developed for this project, it is necessary to understand the building blocks on which it relies. This chapter presents and describes the pre-existing modules and systems that have been used as the principal components on which the developments of this project lean. Each section of the chapter corresponds to one of the main modules used as the base of the project. Section 3.1 describes the robot Maggie and its hardware components. Then, in section 3.2, the software architecture of the robot Maggie is presented. In section 3.3, the Robot Operating System (ROS) is described; it is used by several modules of the system, such as the vision system and the learning system. The vision system is described in section 3.4, which presents the Kinect technology and the algorithms that enable it to extract and track skeleton models. Finally, in section 3.5 the learning framework is presented and the algorithm on which the learning process relies is described.

3.1 Hardware Description of the Robot Maggie

Maggie is a robotic platform developed by the RoboticsLab team at University Carlos III of Madrid [7]. The objective of this development is the exploration of the fields of social robotics and Human-Robot Interaction (HRI).


The main hardware components of the robot can be classified into sensors, actuators and other devices. Maggie's sensing system is composed of a laser sensor, 12 ultrasound sensors, 12 bumpers, several tactile sensors, a video camera, 2 RFID detectors and an external microphone. Fig. 3.1 depicts all the sensors of the robot, while Fig. 3.2 depicts the rest of the robot's hardware. The laser range finder is a Sick LMS 200. The laser is used to build maps of the environment surrounding the robot and to detect obstacles. The 12 ultrasound sensors surround the robot's base and are used as a complement to the laser. The robot base also has 12 contact bumpers that are used to detect collisions in case the laser sensor and the ultrasound sensors fail. Maggie has 9 capacitive sensors installed at different points of the robot's skin. These are used as touch sensors to allow tactile interaction with the robot. The video camera is mounted in the robot's mouth and is used to detect and track humans and objects near the robot. In addition to the standard camera, the robot has recently been equipped with a Microsoft Kinect RGB-D sensor. This sensor allows the robot to retrieve depth information from a scene as well as standard RGB images. The robot also has 2 RFID (Radio Frequency IDentification) detectors, one located in the robot's base and the other in the robot's nose. These RFID detectors allow the robot to extract data from RFID tags; the use of RFID tags to extract information from tagged objects is described in [30]. The external wireless microphone allows the robot to receive spoken commands or indications from humans. The robot is actuated by a mobile base with 2 DoF (rotation and translation in the ground plane). The robot also has two arms with one DoF each. Maggie's head can move with two DoF (pitch and yaw). Finally, each of the robot's eyelids can move with one DoF. The robot has some hardware dedicated to interaction with humans and other devices: an infrared (IR) emitter, a tablet PC, 3 speakers and an array of LEDs (Light Emitting Diodes). The infrared emitter allows the robot to control IR-based devices such as TVs and stereo systems. The tablet PC, located in Maggie's chest, is used to show information related to the robot or as an interaction device. The speakers allow the robot to communicate orally with humans. The LED emitters are located in the robot's mouth and are used to provide expressiveness when the robot is talking.


Figure 3.1: The sensors of the Robot Maggie - Notice the Kinect RGB-D sensor attached to its belly.

Figure 3.2: Actuators and interaction mechanisms of the Robot Maggie


The robot is controlled by a computer located inside its body. The computer communicates with the exterior using an IEEE 802.11n connection. The operating system running on the computer is Ubuntu 10.10 Linux, and the AD architecture runs on top of the OS.

3.2 The AD Software Architecture

The main software architecture of Maggie is a software implementation of the Automatic-Deliberative (AD) Architecture [31]. AD is designed to imitate human cognitive processes. It is composed of two main levels, the Deliberative Level and the Automatic Level. The Deliberative Level hosts the processes that require high-level reasoning and decision capacity; these processes need a large amount of time and resources to be computed. The Automatic Level hosts the low-level processes, which are the ones that interact directly with hardware such as sensors and actuators. Usually, these processes are lighter than the deliberative ones and, therefore, need less time and fewer resources to be computed. Fig. 3.3 shows the general schema of the AD architecture. The AD architecture is composed of the following modules: the sequencer, skills, the shared memory system and the event system. The basic component of the AD architecture is the skill [32]. A skill is the minimum module that allows the robot to execute an action. It can reside in either AD level depending on its behaviour: a skill that executes complex reasoning or decision functions is a Deliberative Skill and resides in the Deliberative Level, while a skill that controls hardware components or does not perform complex reasoning functions is an Automatic Skill. Every skill has a control loop that executes its main functionality. This control loop can run in three different ways: cyclically, periodically or triggered by events. Every skill also has three different states: ready, running and blocked.

Ready. The first state of the skill. It is the state between the moment the skill is instantiated and the first launch of its control loop.

Running. The state in which the control loop is being executed. This state is also called the active state.


Figure 3.3: The AD architecture - Notice that the main communication systems between the skills are the shared memory system and the event system.


Blocked. The state in which the control loop is not being executed.

Two parameters must be defined when the skill is instantiated for the first time. The first parameter is the time between loop cycles, in other words, the time between two running states. The second parameter is the number of times the control loop is executed. Every skill can be activated or blocked in two different ways. The first is to be blocked by other skills; this can be done at both the Deliberative and Automatic levels. The second way to activate or block a skill is through the sequencer. The sequencer operates at the Deliberative Level; therefore, only deliberative skills can be activated or blocked by the sequencer. Every skill is launched as a process. Therefore, the communication between skills is an inter-process communication problem. To solve this problem, two communication systems have been designed and developed: the Shared Memory System and the Event System. The Shared Memory System is composed of the Long-Term Memory (LTM) and the Short-Term Memory (STM). The Long-Term Memory stores permanent knowledge. The robot uses the data stored in the LTM for reasoning or for making decisions, and this data persists even if the system is shut down. The Short-Term Memory is used for storing data that is only needed during the running cycle of the robot. Examples of data stored in the STM are data extracted from sensors or data that needs to be shared among skills. This data is not needed after the robot is powered off; therefore, the STM is not a persistent memory. The Event System is used to communicate relevant events or information between skills. It follows the publisher/subscriber paradigm described by Gamma et al. in [33]. Skills can emit or subscribe to particular events. When a skill needs to inform other skills of a relevant event, it emits the event, and all the skills subscribed to it receive a notification in their "inboxes" when the event is triggered. Events can also carry data related to the nature of the event itself. Every skill has an event manager for each subscribed event; the event manager defines what to do when the event is received. For example, an obstacle monitoring skill can trigger an "obstacle found" event when an obstacle is detected. Other skills that need to know whether an obstacle is near the robot (for example, a movement skill) can subscribe to the "obstacle found" event and act appropriately when it is received, for example by stopping the robot to avoid a collision with the obstacle.
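The AD skills and their communication systems are implemented in the robot's own middleware; the Python sketch below is only a schematic illustration of the event mechanism described above. The class and method names are invented for the example and are not the actual AD API.

    class EventSystem:
        # Minimal publish/subscribe hub, in the spirit of the AD Event System.
        def __init__(self):
            self.subscribers = {}                 # event name -> list of callbacks

        def subscribe(self, event, callback):
            self.subscribers.setdefault(event, []).append(callback)

        def emit(self, event, data=None):
            for callback in self.subscribers.get(event, []):
                callback(data)                    # deliver the event to each "inbox"

    events = EventSystem()

    # A movement skill subscribes to the "obstacle found" event...
    def stop_robot(data):
        print("Movement skill: stopping, obstacle at %.1f m" % data["distance"])

    events.subscribe("obstacle found", stop_robot)

    # ...and an obstacle-monitoring skill emits the event when an obstacle is detected.
    events.emit("obstacle found", {"distance": 0.4})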


Both the Shared Memory System and the Event System are designed and built following a distributed architecture. This allows data and notifications (events) to be shared between different machines. Because the skills only communicate through the Shared Memory System and the Event System, it is possible to run AD on different machines simultaneously and keep the whole architecture connected.

3.3 Robot Operating System

ROS (Robot Operating System) [34] is an open-source, meta-operating system for robots. It provides services similar to those provided by an Operating System (OS), including hardware abstraction, low-level device control, implementation of commonly-used functionality, inter-process communication and package management. Additionally, it provides tools and libraries for obtaining, building, writing and running code in a multi-computer environment. The ROS runtime network is a distributed, peer-to-peer network of processes that interoperate in a loosely coupled environment using the ROS communication infrastructure. ROS provides three main communication styles: synchronous RPC-style communication called services, asynchronous streaming of data called topics, and storage of data using a Parameter Server. All the elements of the architecture form a peer-to-peer network where data is exchanged between elements and processed together. The basic concepts of ROS are nodes, the Master, the Parameter Server, messages, services, topics and bags. All of these provide data to the network.

Nodes: Nodes are the minimum unit of structure in the ROS architecture. They are processes that perform computation. ROS is designed to be modular: a robot control system usually comprises many nodes performing different tasks. For example, one node controls the wheel motors of the robot, one node controls the laser sensors, one node performs the robot localisation, one node performs the path planning, etc.

Master: The ROS Master is the central server that provides name registration and lookup to the rest of the ROS network. Without the Master, nodes are not able to find each other and, therefore, cannot exchange messages or invoke services.


Parameter Server: The Parameter Server is part of the Master. Its functionality is to act as a central server where nodes can store data. Nodes use this server to store and retrieve parameters at runtime. It is not designed for high performance; instead, it is used for static, non-binary data such as configuration parameters. It is designed to be globally viewable, allowing tools and nodes to easily inspect the configuration state of the system and modify it if necessary.

Messages: Nodes communicate with each other by exchanging messages. A message is a data structure of typed fields.

Topics: Messages are routed via a transport system with publish/subscribe semantics. A node sends out a message by publishing it to a given topic. A topic is a name that is used to identify the content of a message. A node that is interested in a certain kind of data subscribes to the appropriate topic. There may be multiple concurrent publishers and subscribers for a single topic, and a single node may publish or subscribe to multiple topics. In general, publishers and subscribers are not aware of each other's existence. The objective is to decouple the production of information from its consumption.

Services: The publish/subscribe model is a very flexible communication paradigm, but its many-to-many, one-way transport is not appropriate for request/reply interactions, which are often required in a distributed system. Request/reply is done via services, which are defined by a pair of message structures: one for the request and one for the reply. A providing node offers a service under a name, and a client uses the service by sending the request message and awaiting the reply. ROS client libraries generally present this interaction to the programmer as if it were a Remote Procedure Call (RPC).

Bags: Bags are a format for saving and playing back ROS message data. They are an important mechanism for storing data, such as sensor data, that can be difficult to collect but is necessary for developing and testing algorithms.
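As an illustration of the publish/subscribe model, the following minimal rospy node both publishes and subscribes to a topic. The topic name and message content are invented for the example, and the snippet targets the ROS 1 Python API, which has evolved since the ROS version used in this project.

    #!/usr/bin/env python
    import rospy
    from std_msgs.msg import String

    def callback(msg):
        # Executed every time a message arrives on the subscribed topic.
        rospy.loginfo("heard: %s", msg.data)

    if __name__ == "__main__":
        rospy.init_node("pub_sub_demo")
        pub = rospy.Publisher("/chatter", String, queue_size=10)
        rospy.Subscriber("/chatter", String, callback)
        rate = rospy.Rate(1)                      # publish at 1 Hz
        while not rospy.is_shutdown():
            pub.publish(String(data="hello"))
            rate.sleep()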


Finally, to close the ROS section, it is worth mentioning that ROS only runs on Unix-based platforms and that it is language independent. It currently supports the C++ and Python languages, while support for other languages such as Lisp, Octave and Java is still in an experimental phase. The first stable release of ROS was delivered in March 2010.

3.4 The Kinect Vision System

Microsoft's Kinect RGB-D sensor is a peripheral designed as a video-game controller for the Microsoft Xbox console. Despite its initial purpose, it is currently being used by numerous robotics research groups thanks to the combination of its high capabilities and low cost. The sensor provides a depth resolution similar to that of high-end ToF cameras, but at a cost several times lower. The reason for this balance between capabilities and low cost resides in how the Kinect retrieves the depth information. To obtain the depth information, the device uses PrimeSense's Light Coding technology [35]. This technology consists of projecting an Infra-Red (IR) pattern onto the scene, similarly to how structured light sensors do. But Light Coding differs from Structured Light in the light pattern: while Structured Light usually uses grids or strip bands as a pattern, Light Coding emits a dot pattern onto the scene [5], [36] (see Fig. 2.3c). This projected light pattern creates textures that make finding the correspondence between pixels easier, especially with shiny or texture-less objects or under harsh lighting conditions. Also, because the pattern is fixed, there is no time-domain variation other than the movements of the objects in the field of view of the camera. This ensures a precision similar to that of ToF and Structured Light cameras, but the IR receiver mounted by PrimeSense is a standard CMOS sensor, which reduces the price of the device drastically. The sensor is composed of one IR emitter, responsible for projecting the light pattern onto the scene, and a depth sensor responsible for capturing the emitted pattern. It is also equipped with a standard RGB sensor that records the scene in visible light (see Fig. 3.4). Both the depth and RGB sensors have a resolution of 640x480 pixels, which facilitates the matching between the depth and the RGB pixels.


Figure 3.4: Block Diagram of the PrimeSense Reference Design - This is the block diagram of the reference design used by the Kinect sensor. The Kinect incorporates both a depth CMOS sensor and a colour CMOS sensor. (Retrieved from [16])

This calibration process, referred to by PrimeSense as registration, is done at the factory. Other processes, such as correspondence (matching the pixels of one camera with the pixels of the other camera) and reconstruction (recovering the 3D information from the disparity between both cameras), are handled by the chip. Together with other organisations, PrimeSense has created a non-profit organisation formed to promote the use of devices such as the Kinect in natural interaction applications. The organisation is named OpenNI (NI stands for Natural Interaction) [29]. OpenNI has released an open-source framework, also called OpenNI, that provides several algorithms for the use of PrimeSense-compliant depth cameras in natural interaction applications. Some of these algorithms provide the extraction and tracking of a skeleton model of the user who is interacting with the device. This project uses these algorithms to get the data from the user's joints. In other words, the information that is provided to the learning framework comes from the output of OpenNI's skeleton extraction algorithms.


The kinematic model provided by OpenNI is a skeleton model of the body consisting of 15 joints. Fig. 3.5 shows these joints. The algorithms provide the positions and orientations of every joint, and additionally they also provide the confidence of these measures. Moreover, these algorithms are able to track up to four simultaneous skeletons, although this feature is not used in this project. ROS provides a package, openni_kinect, that wraps OpenNI and enables access to this framework from other ROS packages. Thanks to that, other packages can access the data of the Kinect sensor. One of such packages is the pi_tracker package. This package has a node, named Skeleton Tracker, which uses the OpenNI Application Programming Interface (API) to retrieve the tracking information of the user. The node publishes the data of the joints on the /skeleton topic so that other nodes can use it. This is the case of the Pose Trainer node.
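The following sketch illustrates how a ROS node written in Python could read the joint data published on the /skeleton topic. The exact definition of the Skeleton message (parallel name, position, orientation and confidence arrays) is an assumption about the pi_tracker package and may differ from the actual message.

#!/usr/bin/env python
# Sketch of a node reading the OpenNI skeleton published by pi_tracker.
# The Skeleton message is assumed to carry parallel arrays: name[],
# position[] (3D points), orientation[] (quaternions) and confidence[].
import rospy
from pi_tracker.msg import Skeleton  # assumed message definition

def skeleton_callback(msg):
    for i, joint_name in enumerate(msg.name):
        pos = msg.position[i]        # assumed to have x, y, z fields
        conf = msg.confidence[i]
        rospy.logdebug("%s pos=(%.2f, %.2f, %.2f) conf=%.2f",
                       joint_name, pos.x, pos.y, pos.z, conf)

if __name__ == '__main__':
    rospy.init_node('skeleton_reader')
    rospy.Subscriber('/skeleton', Skeleton, skeleton_callback)
    rospy.spin()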
Figure 3.5: OpenNI's kinematic model of the human body - OpenNI algorithms are able to create and track a kinematic model of the human body. The model has 15 joints (head, neck, torso, and the left and right shoulders, elbows, hands, hips, knees and feet), whose positions and orientations are updated at 30 frames per second.



3.5 The Weka Framework

The machine learning framework on which this project relies is Weka (Waikato Environment for Knowledge Analysis) [37]. Weka is a popular suite of machine learning software written in Java and developed at the University of Waikato, New Zealand. Weka is open-source software released under the GNU General Public License. Weka supports several data mining tasks such as data preprocessing, classification, clustering, regression, visualisation, and feature selection. Most of Weka's operations assume that the data is available in a single flat file or relation. Weka also provides access to SQL databases using Java Database Connectivity (JDBC), so any result returned by a database query can be processed as input. In both files and database queries, each data instance is described by a fixed number of attributes. It is possible to operate with Weka through a graphical user interface, from the command line, or by accessing its Application Programming Interface (API). The latter is the method chosen for this project. Since Weka's API is written in Java and ROS provides a client library for this language, all the elements of the architecture that use the Weka API are programmed as ROS nodes. Although the ROS Java client library is, at the time of writing, in an experimental phase, its core functionalities are sufficiently stable to be used. These functionalities are connecting to the ROS Master and subscribing and publishing to topics. Since the project focuses on supervised learning methods, only these methods are used in the Weka framework. Concretely, the C4.5 decision tree [38] has been chosen as the main algorithm to learn and detect the poses of the user. This algorithm has been chosen for its good performance [39] and because the resulting tree makes it possible to inspect what the robot has learnt during the learning phase.
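In this project the Weka API is called from Java code running inside ROS nodes. Purely as an illustration of the same workflow, the sketch below uses the third-party python-weka-wrapper package (not used in this work) to load a dataset in ARFF format and build a J48 tree; the file name is hypothetical.

# Illustrative sketch of the Weka workflow used in this project, written
# with the third-party python-weka-wrapper package (the thesis calls the
# Java API directly). The ARFF file name is hypothetical.
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

jvm.start()
try:
    loader = Loader(classname="weka.core.converters.ArffLoader")
    data = loader.load_file("poses.arff")  # hypothetical training dataset
    data.class_is_last()                   # the label is the last attribute

    j48 = Classifier(classname="weka.classifiers.trees.J48")
    j48.build_classifier(data)
    print(j48)                             # the learnt tree can be inspected
finally:
    jvm.stop()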


Chapter 4

The Supervised Learning Based Pose Recognition Architecture
This chapter describes all the modules that have been built in order to learn to detect human poses. It is divided into two parts, each one describing one part of the learning process. First, Section 4.1 describes all the components that operate in the training phase. Second, Section 4.2 describes all the components that operate in the process of recognising the poses of the user once the system has been trained. Before entering into detail, an overview of the system is provided. Fig. 4.1 depicts the general scheme of the built architecture. It consists of two differentiated parts, each one with a single purpose. The upper part represents the training phase, where the user teaches the system to recognise certain poses. The lower part of the figure represents the classifying phase, where the user stands in a given pose and the robot tells her which pose she is in. In the training phase, the robot uses two sensory systems to learn from the user. The first is its Kinect-based RGB-D vision system. With it, the robot acquires the figure of the user separated from the background and processes it to extract a kinematic model of the user's skeleton. The second input is the Automatic Speech Recognition (ASR) System, which allows the robot to process the words said by the user and convert them into text strings. In other words, the sensory system captures a data pair formed by:

• The pose of the user, defined by the configuration of the joints of the kinematic model of the user.

• A label identifying this pose, defined by the text string captured by the auditory system of the robot.



Figure 4.1: Overview of the Built System - The upper part of the diagram shows the training phase, where the user teaches the robot, using verbal commands, which poses it must learn. The lower part of the diagram depicts the classifying phase, in which the robot loads the learnt model and announces the user's current pose through its voice system.


These two sensory inputs are fused together to form a data pair that is stored in a dataset. At the end of the training process, the dataset contains all the pose-label associations that the user has shown to the robot. This dataset is processed by a machine learning framework that builds a model establishing the relations between the poses and their associated labels. That is, the learned model establishes the rules that define when a given pose is associated with a certain label. In the classifying phase, the robot continues to receive snapshots of the skeleton model every frame. But this time it does not receive the auditive input that tells it the pose of the user: this is the moment when the robot has to guess that pose. To do that, it loads the learnt model into a classifier module that receives the skeleton model as input every frame. The skeleton is processed by the model, which returns the label corresponding to that skeleton. Then, the label is sent to the Emotional Text To Speech (ETTS) module of the robot. The ETTS module is in charge of transforming text strings into audible phrases that are played through the robot's speakers. In this way, the label is sent to the ETTS module and then said by the robot. Fig. 4.2 presents an overview of the whole system, going deeper into how all these processes are carried out and detailing the modules and messages that participate in them. There, the reader can see that the architecture is divided into two separate parts, the AD part and the ROS part. AD provides powerful tools for HRI; especially relevant are the ASR and ETTS skills. The ASR Skill processes the speech of the user and transforms it into text. On the other hand, the ETTS skill processes strings of text and transforms them into audible words or phrases emitted by the robot's speakers. Therefore, all the parts of the architecture that need verbal inputs or outputs from or to the user have been developed as AD skills, or have some mechanism to communicate with AD skills. In other words, the interaction with the user is carried out in the AD part of the system. One of these interaction skills is the pose_labeler_skill, described in section 4.1.1. In the other part of the architecture, ROS provides OpenNI and other packages to track humans and extract their skeleton from the Kinect sensor.


Therefore, the components that need information about the human skeleton have been developed as ROS nodes. Additionally, since ROS provides a client library in Java, all the components of the architecture that need to access the Weka framework have been programmed as ROS nodes as well. This is the case of the pose_trainer and pose_classifier nodes, described in sections 4.1.3 and 4.2.1 respectively.

Figure 4.2: Diagram of the built architecture - The figure depicts a diagram of the whole architecture and its components. Note that all the interaction modules reside in the AD part of the architecture while the skeleton tracking algorithms and the learning framework reside in the ROS part.

Additionally, there are some components whose main function is to establish links between ROS and AD and enable the communication between their modules. These are the Pose_Labeler_Bridge and the Pose_Teller, described in sections 4.1.2 and 4.2.2 respectively. Since the architecture is composed of several modules that act independently, the key to the integration of these modules is the messaging system of the architecture. Communication among the architecture modules is done by exchanging asynchronous messages.


In other words, if a module needs information from another module, it subscribes to its publications. Conversely, when a module has some information that needs to be shared with others, it publishes it to the network. That means that the mechanism followed for the exchange of information in the architecture is based on the publisher/subscriber paradigm. The following sections briefly describe what these modules do, how they operate and, finally, how they link together to build the complete system. The description of the components follows the usual temporal sequence of a user trying to use the system: usually, the user would first train the pose classifier and then use it to classify her pose. The former is described in section 4.1 and the latter in section 4.2.

4.1 Training Phase

The training phase is the phase of the process in which the classifier is built. In this phase, the human teaches the robot to recognise some poses. Completing this phase means that a classifier is built and can be used in the classifying phase. This section describes all the components of the architecture that have been developed to train the system. Figure 4.3 depicts the temporal sequence of the training phase, showing the collaboration among all the modules that participate in it. Summarised, the training phase occurs in the following sequence. First, the system needs to detect what the user is saying to the robot. This is explained in section 4.1.1. If what she is saying is a valid pose, it is labeled and sent to a node in charge of communicating the interaction modules (which are located in the AD part of the system) with the ROS modules (which are in charge of the learning step). The bridging between AD and ROS is described in section 4.1.2. Finally, the label arrives at the machine learning module, which gathers the label describing the human pose and the data from the vision sensor. These data are written to a dataset and sent to the Weka framework to build a classifier able to detect human poses. This final part of the process is described in section 4.1.3.


Figure 4.3: Sequence Diagram of the training phase - The pose_trainer node is the node which fuses the information of the skeleton model and the labels told by the user.


4.1.1 Pose Labeler Skill

When the user starts training the robot, she has to carry out two tasks. The first one is to put herself in the pose she wants to show the robot. The second is to tell the robot which pose she is in. From the robot's point of view, first it has to detect the human pose and, second, it has to understand what the user is saying to it. The former is described in section 4.1.3 while the latter is described below. The user reports her pose to the robot by saying it. The robot has an Automatic Speech Recognition (ASR) System that allows it to detect and process natural language. This system mainly relies on the AD ASR Skill [40]. The ASR Skill detects the human speech and processes it according to a predefined grammar. If the speech of the human matches one or more of the semantic elements of that grammar, the ASR Skill sends an event notifying all the other skills that it has recognised a speech. Other skills that are subscribed to the ASR events read the results obtained by the ASR Skill and process them accordingly. The ASR Skill cannot understand what the user is saying unless it is previously provided with a grammar that defines the semantics of these utterances. Therefore, if we want to make the ASR Skill able to understand which pose is being said by the human, we need to build a special grammar. This grammar is summarised below and described in Appendix C. In this scope, a grammar is a set of possible word combinations that the user can say, linked to their semantic meaning. In this way, the built grammar is able to detect several words that define distinct poses of the human. The semantics of those words are coded into labels that can be codified as variables in a computer program. The grammar has been built in a manner that allows the detection of up to 18 labels by combining different semantics in three different categories:
1. Position Semantics
(a) SIT. Defines that the user is sitting on a chair.
(b) STAND. Defines that the user is standing in front of the robot.

2. Action Semantics
(a) TURNED. Defines that the user's body is turned towards a direction defined in the direction category.
(b) LOOKING. Defines that the user is looking towards a direction defined in the direction category.


(c) POINTING. Defines that the user is pointing with her arm to a location defined in the directions category

3. Direction Semantics
(a) LEFT. Defines that the action the user is performing is towards her own left side. For example, if she is pointing, she is pointing to her left.
(b) FORWARD. Defines that the action performed by the user is towards her front.
(c) RIGHT. Defines that the action performed by the user is towards her right.

When the ASR Skill detects a combination of words that defines the three semantics from above, it emits an event to notify all its subscribers and stores the recognition results in the Short Term Memory System. One of these subscribers is the Pose Labeler Skill. When it receives an event from the ASR Skill, the Pose Labeler Skill reads the recognition results from the Short Term Memory and analyses their semantics to form a label. Examples of labels are SIT_LOOKING_LEFT, meaning that the user is sitting and looking towards her left, or STAND_TURNED_FORWARD, meaning that the user is standing and turned towards the robot. Finally, if the label is a valid label (i.e. one of the labels from above), the Pose Labeler Skill sends an event with the label ID. Fig. 4.4 summarises the main functionalities of the Pose Labeler Skill.
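The way a label is composed from the three semantic categories can be sketched as follows. The function is only illustrative (the real Pose Labeler Skill is implemented as an AD skill); the valid semantic values are the ones listed above, which give the 18 possible labels.

# Illustrative composition of a pose label from the three semantic
# categories recognised by the grammar (the real skill is an AD skill).
POSITIONS = {"SIT", "STAND"}
ACTIONS = {"TURNED", "LOOKING", "POINTING"}
DIRECTIONS = {"LEFT", "FORWARD", "RIGHT"}

def build_label(position, action, direction):
    """Return a label such as SIT_LOOKING_LEFT, or None if invalid."""
    if position in POSITIONS and action in ACTIONS and direction in DIRECTIONS:
        return "%s_%s_%s" % (position, action, direction)
    return None

print(build_label("STAND", "TURNED", "FORWARD"))  # STAND_TURNED_FORWARD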

Figure 4.4: Use Case of the Pose Labeler Skill - The figure depicts the use case diagram of the Pose Labeler Skill. Its main function is to transform the labels told by the user into system-readable labels and emit them as an AD event.

In addition to the semantics shown above, the Pose Labeler Skill can also process two more semantics that have no relation to the labels. These semantics act as a control layer that allows the user to control, by voice, some aspects of the training phase. These semantics are the following:
• CHANGE. Used to allow the user to change her pose. This command is said when the user wants to change her pose, and it allows the classifier to discriminate the transitions between two poses.

• STOP. Used to end the training process. When the user says she wants to finish the training process, the ASR builds this semantic to allow the Pose Labeler Skill to end it.

4.1.2 Pose Labeler Bridge

Figure 4.5: Use Case of the Pose Labeler Bridge - The figure depicts the use case diagram of the Pose Labeler Bridge ROS node. Its main function is to listen to the AD LABELED_POSE events and to bridge their data to the ROS /labeled_pose topic.

The Pose Labeler Bridge is the next step of the process. It is a ROS node that acts as a bridge between AD and ROS. Its main functionality is to transform the events emitted by the Pose Labeler Skill into a ROS topic. Concretely, when the Pose Labeler Skill detects a label in the ASR recognition results, it emits an event called LABELED_POSE. The Pose Labeler Bridge parses the content of the messages sent through this event and transforms them to fit into a ROS topic called /labeled_pose. Fig. 4.5 summarises the main functionalities of the Pose Labeler Bridge.
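A minimal sketch of the bridging idea is shown below: a label received from the AD side is republished on the ROS /labeled_pose topic. The AD event interface is not shown, so the callback that would be triggered by a LABELED_POSE event is hypothetical, and the use of a plain string message is an assumption.

#!/usr/bin/env python
# Sketch of the bridging idea: republish a label received from the AD side
# on the ROS /labeled_pose topic. The AD event interface is not shown, so
# on_ad_labeled_pose_event() is a hypothetical callback.
import rospy
from std_msgs.msg import String

pub = None

def on_ad_labeled_pose_event(label):
    # Hypothetical hook invoked when an AD LABELED_POSE event arrives.
    pub.publish(String(data=label))

if __name__ == '__main__':
    rospy.init_node('pose_labeler_bridge')
    pub = rospy.Publisher('/labeled_pose', String, queue_size=10)
    on_ad_labeled_pose_event('STAND_TURNED_LEFT')  # example invocation
    rospy.spin()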

4.1.3 Pose Trainer

The last step of the training process is performed in the Pose Trainer node. The Pose Trainer node is a ROS node that does several things (see Fig. 4.6). First of all, it subscribes to the /labeled_pose topic to know the pose of the user. Secondly, it also subscribes to the /skeleton topic (see section 3.4 for more information about this topic). The Pose Trainer node reads the messages of this topic to extract the information of the joints of the user. This information is combined with the information from the /labeled_pose topic and formatted properly to be understood by the Weka framework.

Figure 4.6: Use Case of the Pose Trainer Node - The figure depicts the use case diagram of the Pose Trainer Node. Its main function is to receive messages from the /labeled_pose and /skeleton topics and build a dataset with these messages. After the dataset is built, the node also creates a learned model from it.

Each skeleton message is coded as a Weka instance. Each instance has 121 attributes, divided in the following way. The message consists of 15 joints, with 3 attributes for the position of each joint, 4 attributes for the orientation of each joint (the orientation is coded as a quaternion), and 1 attribute for the confidence of the measures of that joint. This makes 120 attributes. The last attribute is the label that comes from the /labeled_pose topic. This last attribute is also the class of the instance (the class of an instance is the attribute that tells the learning algorithm which class the instance belongs to; in other words, it tells the classifier how this data must be classified). While the Pose Trainer node is running, it collects /skeleton messages and /labeled_pose messages and fuses them, creating instances. During its operation, the node continuously builds a dataset of instances with the received messages. Finally, when it receives a label with the "STOP" identifier, it stops adding messages to the dataset. Fig. 4.3 shows this process.
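The mapping from a skeleton snapshot to the 121 attributes of an instance can be sketched as a simple flattening function. The per-joint field names follow the assumptions made in the earlier /skeleton sketch.

# Sketch of how a skeleton snapshot maps to the 121 attributes of a Weka
# instance: 15 joints x (3 position + 4 orientation + 1 confidence) = 120
# numeric attributes, plus the class label. Field names are assumed.
def skeleton_to_row(skeleton_msg, label):
    row = []
    for i in range(len(skeleton_msg.name)):       # 15 joints expected
        pos = skeleton_msg.position[i]
        ori = skeleton_msg.orientation[i]
        row += [pos.x, pos.y, pos.z]              # 3 position attributes
        row += [ori.x, ori.y, ori.z, ori.w]       # 4 orientation attributes (quaternion)
        row.append(skeleton_msg.confidence[i])    # 1 confidence attribute
    row.append(label)                             # attribute 121: the class
    return row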


With the dataset completed, the node calls the Weka API in order to build a model from the dataset. This model is the classifier that will be used in section 4.2 to classify the poses of the user. For this project, only one type of classifier is built from the dataset: Weka's J48 decision tree, which is an open-source implementation of the C4.5 decision tree (see section 3.5). If the dataset has relevant data, the model will be able to generalise to other situations for which the classifier has not been trained. If not, the classifier will not be able to handle situations that were not contemplated during the training phase, which will cause classification errors. The structure of the Pose Trainer node is depicted in Fig. A.1 in Appendix A.

4.2 Classifying Phase

The classifying phase is the phase in which the robot starts guessing the pose of the user. To do so, it needs a previously created model of the poses the user may adopt. The main elements of the classifying phase are the Pose Classifier node and the Pose Teller node. The first is described in section 4.2.1 while the second is described in section 4.2.2. The temporal sequence of how these nodes interact is depicted in Fig. 4.7.

4.2.1 Pose Classifier

The Pose Classifier ROS node is the node that classifies the pose of the user. Its main functions are depicted in Fig. 4.8. To do so, it needs two different inputs. The first one is the knowledge to decide the pose of the user from the data of her joints; this comes from the classifier that has been built in section 4.1.3. The second input the node needs is the content of the /skeleton topic messages. As said above, these messages contain the information of the user's joints. The node subscribes to the /skeleton topic and starts reading its messages. For each received message, the node parses and formats it as a Weka instance. This instance is similar to the instances created by the Pose Trainer node in section 4.1.3, but these instances differ from the others in one aspect: they do not have the class defined.


Figure 4.7: Sequence Diagram of the classifying phase - The pose_classifier node processes the skeleton messages using the learnt model and sends the output to the voice system of the robot.


Figure 4.8: Use Case of the Pose Classifier Node - The figure depicts the use case diagram of the Pose Classifier Node. The node loads a previously trained model to classify Skeleton messages to known poses.

In other words, the class of the instance is not set. The Pose Classifier node uses the classifier to determine which class the instance should belong to. After classifying the instance, the node emits a message to the /classified_pose topic. In fact, the sent message has the same format as the messages that are sent to the /labeled_pose topic, with the difference that the former carries labels that have been inferred by the classifier while the latter carries labels specified by the user. The structure of this node is depicted in Fig. A.2 in Appendix A.
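The logic of the node can be outlined as follows. The call to the learnt Weka model is replaced by a placeholder, and the message types and field names are assumptions, so this is only a sketch of the data flow, not the actual implementation.

#!/usr/bin/env python
# Outline of the Pose Classifier node logic. classify() stands in for the
# call to the learnt Weka model; message types and field names are assumed.
import rospy
from std_msgs.msg import String
from pi_tracker.msg import Skeleton  # assumed message definition

def classify(features):
    # Placeholder for classifying an instance with the learnt J48 tree.
    return "STAND_TURNED_FORWARD"

def on_skeleton(msg, pub):
    features = []
    for i in range(len(msg.name)):
        p, o = msg.position[i], msg.orientation[i]
        features += [p.x, p.y, p.z, o.x, o.y, o.z, o.w, msg.confidence[i]]
    pub.publish(String(data=classify(features)))

if __name__ == '__main__':
    rospy.init_node('pose_classifier')
    pub = rospy.Publisher('/classified_pose', String, queue_size=10)
    rospy.Subscriber('/skeleton', Skeleton, on_skeleton, callback_args=pub)
    rospy.spin()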

4.2.2 Pose Teller

The Pose Teller node is the node in charge of telling the user which pose she is in. In other words, it tells the user which pose has been detected by the classifier. The pose of the user is announced by the Pose Classifier node on the /classified_pose topic, but the content of the messages of this topic is not understandable by humans. Therefore, a node is needed that translates that content into something that can be understood by people. The Pose Teller node is the module that carries out this task (see Fig. 4.9). First of all, the Pose Teller node subscribes to the /classified_pose topic and reads its messages to retrieve the label identifier written by the classifier. Then, it transforms the label ID into a description that can be understood by the user. For example, if the label is STAND_LOOKING_LEFT, the node transforms it into the text "You're standing and looking to the left". But this is only a text string. If we want the robot to say this text, it must be sent to the AD ETTS (Emotional Text To Speech) Skill [41]. This skill is in charge of transforming text strings into audible speech. Therefore, the Pose Teller node sends the "textified" label to the ETTS skill through an AD event. Finally, the ETTS skill pronounces the text using the robot's voice system.
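The label-to-text mapping can be sketched with a simple lookup table; the hand-off to the ETTS skill happens on the AD side and is therefore represented here by a placeholder function. Only two example labels are shown.

# Sketch of the label-to-text mapping performed by the Pose Teller; the
# hand-off to the AD ETTS skill is represented by a placeholder function.
LABEL_TEXTS = {
    "STAND_LOOKING_LEFT": "You're standing and looking to the left",
    "STAND_TURNED_FORWARD": "You're standing and turned towards me",
}

def send_to_etts(text):
    # Placeholder for the AD event that triggers the ETTS skill.
    print("robot says:", text)

def on_classified_pose(label):
    send_to_etts(LABEL_TEXTS.get(label, "I don't recognise your pose"))

on_classified_pose("STAND_LOOKING_LEFT")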

Figure 4.9: Use Case of the Pose Teller Node - The figure depicts the use case diagram of the Pose Teller Node. Its main function is to receive labels from the /classified_pose topic and send them to the ETTS skill to make the robot say the label.


Chapter 5

Pilot Experiments
The main objective of carrying out a pilot experiment is to test that the classifier is able to learn human poses. But, apart from that, there is another objective that emerged during the design of the system. During the initial tests of the trainer it was observed that the trainer node built different models depending on the kind of data, or features, provided to it. That is, the pose_trainer node built different models when it was fed only with the positions of the joints than when it was fed with both their positions and orientations. It was also observed that, when the classifier was trained using all three data types (position, orientation and confidence), the human trainer had to take care to feed the node with representative data. In other words, if the human trained the pose_trainer node while standing in only one position, the classifier was only able to detect the learned poses if they were shown in the exact same position in which it was trained. Therefore, if we want to build a classifier that is able to generalise, we have two options. The first one is to train the classifier without giving it position data. The second option is to make the position data irrelevant during the training process. The first option involves pre-processing the data before it is given to the trainer node. During this pre-processing stage, the position of the joints is removed. The second option feeds the pose_trainer node with all the data from the human joints, but during the training process the human must move around the field of view of the Kinect sensor in order to feed the node with all the data that is relevant. Note that, while the first option relies on the automation of the process, the second one yields the responsibility to the human teacher, who is in charge of providing the classifier with good data that will allow it to generalise.


But, a priori, it is not clear which method builds better classifiers. Therefore, the objective of this evaluation is to discover which of the two methods produces better models. This is especially relevant because it is difficult to guarantee that the users are "good teachers" and will train the classifier with a good and representative model. The initial intuition leads us to think that the classifier is able to detect the poses quite accurately if a good model is provided, regardless of which method is used. But our hypothesis is that it would be easier to build better models with fewer features, especially if these features better represent the states of a pose. In other words, it seems that the joint orientations are more representative for detecting poses than the information retrieved from the positions of the joints.
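The first option corresponds to a simple attribute filter applied to each 121-attribute row before training (in Weka terms, an attribute Remove filter could be used); the sketch below assumes the per-joint layout described in section 4.1.3 and is only illustrative.

# Sketch of the pre-processing option: drop the 3 position values of each
# joint and keep orientation + confidence, assuming the per-joint layout
# [posX, posY, posZ, orientX, orientY, orientZ, orientW, confidence].
def drop_positions(row):
    features, label = row[:-1], row[-1]
    kept = []
    for j in range(0, len(features), 8):   # 8 values per joint
        kept += features[j + 3:j + 8]      # keep orientation (4) + confidence (1)
    return kept + [label]                  # 15 * 5 + 1 = 76 attributes remain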

5.1 Scenario

To validate the hypothesis, two datasets were created. Both datasets consist of the data from a user that has trained the classifier to detect three different poses:
1. Standing, turned left
2. Standing, turned front
3. Standing, turned right

But each dataset has been recorded in a different manner. For the first dataset, "D1", the user trained the classifier showing the three poses in different positions. For the second dataset, "D2", the user trained each pose without changing her relative position to the robot. That means that D1 has better training data than D2. Also, for each dataset, two models have been built, each one using different features or attributes. The differences between the datasets and their models are listed below:

• Dataset 1 (D1): Representative data. The user moved around while showing the poses.

– Model 1 (M1): All attributes were used to construct the model (position, orientation and confidence).
– Model 2 (M2): Only orientation and confidence attributes were used to build the model.


• Dataset 2 (D2): Non-representative data. The user did not move from her original position.

– Model 3 (M3): All attributes were used to construct the model (position, orientation and confidence).
– Model 4 (M4): Only orientation and confidence attributes were used to build the model.

The recording of the datasets D1 and D2 was done in the scenario depicted in Fig. 5.1. The scenario consists of the robot Maggie equipped with a Kinect sensor. The rectangle in Fig. 5.1 shows the area where the user who trained the classifier was standing. She was able to move wherever she wanted inside the rectangle while recording the poses. The only conditions were that she was not allowed to exit the rectangle during the recording phase and that she was not allowed to change her pose without warning the robot. During the recording of dataset D1, the user moved throughout the rectangle, whereas in the case of dataset D2 the user did not move from the centre of the rectangle.

Figure 5.1: Scenario of the experiment - The cone represents the field of view of the Kinect sensor. The user was allowed to move inside the rectangle, but turned in the direction of the arrows.


After the training phase, 12 different users were asked to record a test dataset. Each one of these people recorded the same 3 poses as the user who trained the four models. Moreover, they had the same recording conditions as the user who trained dataset D1; in short, they were allowed to move inside the rectangle during the recording phase. The data of the twelve users were recorded and gathered in one single dataset file, which was then tested against the four models. The results of these tests are summarised in the following section, and they are fully reported in Appendix B. A discussion of these results is presented in section 5.3.

5.2 Results

The results of the experiment are summarised in Table 5.1. The table presents how the models M1, M2, M3 and M4 performed against the test dataset. The best models were M1 and M2, with more than 92% of correctly classified instances and barely 4% of false positives. Next, model M4 performed worse, with 70% of correctly classified instances and 14% of false positives. Finally, as expected, model M3 showed the worst performance, with 56% of correctly classified instances and 21% of false positives. Note that the table shows the results of models M1 and M2 in the same row. This is because they used the same dataset (D1) to build their trees and, in the end, the J48 algorithm built the same tree in both cases. Fig. 5.2 depicts the tree of the models M1 and M2. Although they used different data from the dataset (remember that M1 used position, orientation and confidence, while M2 used only orientation and confidence), the J48 algorithm decided that the relevant information of D1 was located in the orientation attributes, producing the same tree in both cases. This did not happen in the tree built for model M3 (see Fig. 5.3). Here, the algorithm that built the tree considered that some relevant information in the training dataset was in the position of the right knee. When the users tested the tree, their positions varied with respect to the user who trained it, which caused several errors.


Model     TP Rate   FP Rate   Precision   MAE      RMSE
M1, M2    0.926     0.039     0.940       0.0049   0.0583
M3        0.563     0.213     0.687       0.0416   0.204
M4        0.700     0.141     0.796       0.0286   0.1691

Table 5.1: Results of the Pilot Experiment - Models M1 and M2 produced the same results and performed better than the other models. Note: TP = True positive; FP = False Positive; MAE = Mean Absolute Error; RMSE = Root Mean Squared Error.

The last tree, M4 (Fig. 5.4), is similar to M3, but with the difference that the former only uses orientation information. This enabled it to be more accurate than M3, although not as accurate as M1 and M2. The reason for this could be that the M1/M2 tree is a bit more complex, so it seems it can cover more cases than M4.
torsoOrient_w <= 0.956792
|   torsoOrient_y <= 0.11679: STAND_TURNED_RIGHT
|   torsoOrient_y > 0.11679
|   |   left_kneeOrient_w <= 0.609245: STAND_TURNED_RIGHT
|   |   left_kneeOrient_w > 0.609245: STAND_TURNED_LEFT
torsoOrient_w > 0.956792: STAND_TURNED_FORWARD
Figure 5.2: The tree built for the models M1 and M2 - The classifier only used the orientations of the joints.
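Because the learnt model is a decision tree, it can be read directly as a set of rules. The sketch below rewrites the M1/M2 tree of Fig. 5.2 as explicit conditions, with the thresholds exactly as learnt by J48; it is shown only to make the learnt knowledge explicit, not as part of the implementation.

# The M1/M2 decision tree of Fig. 5.2 written as explicit rules
# (thresholds exactly as learnt by J48).
def classify_m1_m2(torso_orient_w, torso_orient_y, left_knee_orient_w):
    if torso_orient_w > 0.956792:
        return "STAND_TURNED_FORWARD"
    if torso_orient_y <= 0.11679:
        return "STAND_TURNED_RIGHT"
    if left_knee_orient_w <= 0.609245:
        return "STAND_TURNED_RIGHT"
    return "STAND_TURNED_LEFT"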

5.3 Discussion

The results show that the initial hypothesis is partially validated in this experiment. The hypothesis stated that using only orientation and confidence attributes would


right_kneeOrient_Z <= -0.196633: STAND_TURNED_RIGHT
right_kneeOrient_Z > -0.196633
|   right_kneePos_X <= 0.100138: STAND_TURNED_LEFT
|   right_kneePos_X > 0.100138: STAND_TURNED_FORWARD
Figure 5.3: The tree built for the model M3 - This time, the classifier used position information of the joints.

right_kneeOrient_Z <= -0.196633: STAND_TURNED_RIGHT
right_kneeOrient_Z > -0.196633
|   torsoOrient_w <= 0.699489: STAND_TURNED_LEFT
|   torsoOrient_w > 0.699489: STAND_TURNED_FORWARD
Figure 5.4: The tree built for the model M4 - The tree is quite similar to the one in Fig. 5.3, but this one only uses orientation information for its joints.


lead to models with higher generalisation capabilities. This has only been partially confirmed. When a good training set is provided, as in dataset D1, the classifier builds models that are able to generalise, regardless of the attributes used to build those models. This is the case of models M1 and M2, which ended up being the same. However, when the dataset has no relevant information, using only orientation attributes leads to classifiers that perform better than classifiers built using both position and orientation attributes. In the studied case this difference was nearly 15 percentage points in correct classifications and nearly 7 percentage points in false positives. In fact, the classifier built a model based on orientation attributes rather than positions. This means that, as was thought at the beginning, the orientation information was more significant than the position information. But it also means that, if the classifier is provided with a relevant dataset, it will choose the most significant attributes. To sum up, it seems that providing good training data to the classifier is of paramount importance. Good datasets produce better models no matter which joint attributes are used to build them. But when it is not possible to ensure that the user will train the classifier properly, it could be better to avoid the use of position attributes.


Chapter 6

Conclusions
A pose recognition architecture has been built and integrated in the robot Maggie. This architecture allows the robot to learn the poses that the user teaches it. The system relies on two main pillars. The first is the vision system of the robot, which is composed of a Kinect depth camera and its official algorithms to track people. The sensor and its algorithms have proven to be robust to changes in lighting conditions and partial occlusions of the body. The second pillar of the learning platform is the HRI capabilities of the robot Maggie, especially its ability to communicate with people by voice. Thanks to that, the user taught the robot by speaking to it as she would do with other people. To validate the architecture, a pilot experiment was carried out. In the experiment, two datasets were used to build 4 different models that were tested against twelve people. The experiment demonstrated that the learning system is able to detect the poses of the users, obtaining high accuracy rates when a good training dataset is provided to the robot. The main contributions of this project are listed below.

• A Pose Recognition Architecture has been developed and integrated in a social robot, allowing it to recognise the poses of different people with high accuracy.

• ROS has been integrated with the AD architecture. Although it is not fully integrated, the initial communication mechanisms between both architectures have been established. Additionally, some AD skills have started the process of becoming both AD skills and ROS nodes.


• The Weka machine learning framework has been integrated with the AD architecture, thanks to the integration between ROS and AD.

• The robot Maggie has now fully integrated the Kinect sensor, its drivers and the OpenNI framework. This opens up many new possibilities, not only in the HRI field, but also in the object recognition and navigation fields.

• The robot Maggie is now able to understand human poses. This means that the robot is able to understand some contextual information when it is interacting with a user. Thanks to that, the robot can adapt its behaviour according to this context and improve the interaction quality as it is perceived by the user. As an example, imagine that two people are talking to each other just in front of the robot. If the robot has its ASR turned on, it will process every word in the conversation even though the words are not addressed to it. But because the robot can see that these two people are turned towards each other, it may infer that those words are not addressed to it and simply discard them.

Extending the architecture is the first step to be taken as future work. Since the architecture has proven to be valid within its field of use, it would be challenging to extend it to other fields such as gesture recognition. This would enable the robot to better understand the interaction situations with the users. This work also opens the door to building a continuous learning framework. Thanks to the integration between the learning system and the interaction system, it is now possible to strengthen these bonds to build a more generic platform that would allow the robot to learn continuously from its environment and its partners. Since the learning platform allows the robot to understand information from the user, one possible line of work is to study how the robot can infer the intentions of the users using the information it has learnt from them. Now that the robot can learn human poses, the next step is to understand what these poses mean in an interaction context. Moving the focus from the generic to the specific, regarding the developed learning system itself, another line of further work is to go deeper into the relation between the position attributes, the orientation attributes and the quality of the training. Although good training data seems to be enough to build models that generalise, this relation deserves a deeper study.


It would be interesting to know whether changes in the reference coordinate system would affect the training phase. For example, there are some poses that are relative to the robot, such as being near to or far from the robot. In this case, it is clear that the adequate coordinate frame is the robot's one. But in other cases it might be better to use other coordinate frames. An example would be a user spreading her arms to announce that something is big and, as its contrary pose, bringing the arms closer together to point out that something is small. It has to be studied whether, in this last case, it is better to use a coordinate frame whose origin is not in the robot's sensor but, for instance, in one of the user's hands. Other possibilities for further research are comparing several classifiers or, going further, other data mining techniques. There are studies that have made these comparisons in generic situations [39], but introducing the real-time human-robot interaction component might lead to interesting findings. Finally, since the main purpose of the robot is the study of HRI, user studies should be carried out to understand how to improve this learning process from the user's perspective. Understanding what the user thinks about the process might lead to better training scenarios that would result in robots that learn better from their users.


Appendix A

Class Diagrams of the Main Nodes
[Figure A.1 shows the following classes: PoseTrainer (the ROS node, holding a NodeHandle), PoseSet (the dataset of poses, with methods such as parseSkeletonMessage, addMessage, setClass and saveToFile), Pose (userID, stamp and 15 Joints), Joint (name, confidence, position and orientation fields) and PoseModel (wrapping Weka's FilteredClassifier and J48, with loadFromFile, saveToFile, classifyInstance and buildClassifier).]
Figure A.1: Class Diagram of the Pose Trainer Node - The main class is a ROS node (it has a node handle) that connects with Weka to build a dataset and a model from the inputs it receives from the topics to which it is subscribed (defined in the node handle).


Figure A.2: Class Diagram of the Pose Classifier Node - Notice that it is almost identical to the class diagram of the Pose Trainer Node. The main difference is that this node loads the model built by the Pose Trainer Node and uses this model to classify the skeleton messages it receives.


Appendix B

Detailed Results of the Pilot experiment
This appendix shows the full results obtained in the pilot experiment described in chapter 5. In the experiment, four models were tested against a dataset recorded by twelve people. The four models were trained according to the description given in section 5.1. The results were partially shown in section 5.2 and discussed in section 5.3. The results are presented in the following tables. Table B.1 presents the detailed results of the models M1 and M2, while Table B.2 presents their confusion matrix. The detailed results of model M3 are presented in Table B.3, while its confusion matrix is detailed in Table B.4. Finally, the results of model M4 are shown in Table B.5 and its confusion matrix can be analysed in Table B.6.
Class                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
STAND_TURNED_LEFT       0.787     0.000     1.000       0.787    0.881       0.999
STAND_TURNED_RIGHT      0.996     0.111     0.826       0.996    0.903       0.997
STAND_TURNED_FORWARD    0.982     0.002     0.996       0.982    0.989       0.99
Weighted Avg.           0.926     0.039     0.939       0.926    0.926       0.995

Table B.1: Detailed Results of the Models M1 and M2 - These models performed with 92% of correctly classified instances (TP Rate) and barely 4% of false positives (FP Rate).


classified as −→           a       b       c
STAND_TURNED_LEFT = a      1171    317     0
STAND_TURNED_RIGHT = b     0       1648    6
STAND_TURNED_FORWARD = c   0       30      1619

Table B.2: Confusion matrix for the models M1 and M2 - Almost all the errors came between the left and the right orientations.
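The per-class figures of Table B.1 follow directly from this confusion matrix. As a worked check, the sketch below recomputes the TP rate and precision of each class (for example, 1171/1488 = 0.787 TP rate for STAND_TURNED_LEFT and 1648/1995 = 0.826 precision for STAND_TURNED_RIGHT).

# Recomputing Table B.1 from the confusion matrix of Table B.2
# (rows = actual class, columns = predicted class a, b, c).
confusion = {
    "LEFT":    [1171, 317,    0],
    "RIGHT":   [   0, 1648,   6],
    "FORWARD": [   0,   30, 1619],
}
classes = list(confusion)
for i, c in enumerate(classes):
    tp = confusion[c][i]
    tp_rate = tp / sum(confusion[c])                        # e.g. LEFT: 1171/1488 = 0.787
    precision = tp / sum(confusion[r][i] for r in classes)  # e.g. RIGHT: 1648/1995 = 0.826
    print("%s  TP rate=%.3f  precision=%.3f" % (c, tp_rate, precision))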

Class                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
STAND_TURNED_LEFT       0.673     0.315     0.490       0.673    0.567       0.679
STAND_TURNED_RIGHT      0.209     0.001     0.989       0.209    0.345       0.604
STAND_TURNED_FORWARD    0.819     0.334     0.563       0.819    0.667       0.743
Weighted Avg.           0.563     0.213     0.687       0.563    0.525       0.675

Table B.3: Detailed Results of the Model M3 - It showed a poor performance, with only 56% of correctly classified instances (TP Rate) and more than 20% of false positives (FP Rate). Most of the errors were produced due to the low performance when classifying the TURNED_RIGHT pose.

classified as −→           a       b       c
STAND_TURNED_LEFT = a      1001    4       483
STAND_TURNED_RIGHT = b     743     346     565
STAND_TURNED_FORWARD = c   299     0       1350

Table B.4: Confusion matrix for the model M3 - The TURNED_RIGHT pose produced a great percentage of the errors.

Class                   TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area
STAND_TURNED_LEFT       0.929     0.317     0.569       0.929    0.706       0.806
STAND_TURNED_RIGHT      0.209     0.001     0.989       0.209    0.345       0.604
STAND_TURNED_FORWARD    0.984     0.123     0.807       0.984    0.887       0.931
Weighted Avg.           0.700     0.141     0.796       0.700    0.644       0.779

Table B.5: Detailed Results of the Model M4 - It showed a better performance than model M3, with 70% of correctly classified instances (TP Rate) and 14% of false positives (FP Rate). As happened with model M3, the performance was poor when classifying the TURNED_RIGHT pose.


classified as −→           a       b       c
STAND_TURNED_LEFT = a      1383    4       101
STAND_TURNED_RIGHT = b     1022    346     286
STAND_TURNED_FORWARD = c   26      0       1623

Table B.6: Confusion matrix for the model M4 - Like in model M3, the TURNED_RIGHT pose produced a great percentage of the errors.


Appendix C

The ASR Skill Grammar for recognising Pose Labels
This appendix describes the grammar used to detect the poses that the user can announce to the robot. The grammar has the format used by the ASR engine of the ASR Skill, which is similar to [42] with slight modifications in its tags. Since the ASR Skill currently only supports Spanish words, the grammar is written in Spanish. However, the semantic values of the words that can be detected by the grammar are written in English. The grammar allows the robot to understand two control commands and one pose command. The control commands are defined by their semantics. The first one is the STOP command, used to finish the training phase. The second one is the CHANGE command, used to mark the transitions between poses. The pose semantics are defined in the $pose field. This field understands three categories of semantics: $position, $action and $direction. $position defines whether the user is sitting (SIT) or standing (STAND). The second semantic, $action, defines whether the user is turned (TURNED), looking (LOOKING) or pointing (POINTING) towards a certain direction. Finally, the third semantic, $direction, defines the direction in which the user's $action is oriented. These directions can be left (LEFT), right (RIGHT) or forward (FORWARD). The Spanish words that appear before the semantic labels are the words the user has to say in order to trigger the related semantic value. For instance, in the $position semantics, the words "sentado" or "en una silla" will trigger the semantic value SIT, while the words "de pie" or "levantado" will trigger the semantic value STAND.
#ABNF 1.0 ISO-8859-1;
language es-ES;
tag-format <loq-semantics/1.0>;

public $root = $pose_trainer;

$pose_trainer = [$GARBAGE] $stop
              | [$GARBAGE] $change [$GARBAGE]
              | [$GARBAGE] $pose;

$stop = ("para":STOP | "para de etiquetar":STOP | "stop":STOP |
         "ya esta bien":STOP | "dejalo ya":STOP | "cansado":STOP)
        {<@STOP_COMMAND $value>};

$change = ("pausa":CHANGE | "cambio":CHANGE | "cambio de":CHANGE |
           "cambiar de":CHANGE)
          {<@CHANGE_COMMAND $value>};

$pose = [$position] $action [$GARBAGE] $direction;

$position = ("sentado":SIT | "en una silla":SIT | "de pie":STAND |
             "levantado":STAND)
            {<@POSITION $value>};

$action = ("girado":TURNED | "mirando":LOOKING | "apuntando":POINTING)
          {<@ACTION $value>};

$direction = ("derecha":RIGHT | "izquierda":LEFT | "delante":FORWARD)
             {<@DIRECTION $value>};


References
[1] Sushmita Mitra and Tinku Acharya. Gesture Recognition: A Survey. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 37(3):311–324, May 2007.
[2] A. Jaume-i-Capó and Javier Varona. Representation of human postures for vision-based gesture recognition in real-time. Gesture-Based Human-Computer, pages 102–107, 2009.
[3] Sergi Foix, G. Alenyà, and C. Torras. Lock-in Time-of-Flight (ToF) Cameras: A Survey. IEEE Sensors Journal, 11(99):1, 2011.
[4] Daniel Scharstein and Richard Szeliski. High-Accuracy Stereo Depth Maps Using Structured Light. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 1:195, 2003.
[5] B. Freedman, A. Shpunt, M. Machline, and Y. Arieli. Depth mapping using projected patterns, October 2008.
[6] Various authors. Kinect entry at the Wikipedia, June 2011.
[7] M.A. Salichs, R. Barber, A.M. Khamis, M. Malfaz, J.F. Gorostiza, R. Pacheco, R. Rivas, Ana Corrales, E. Delgado, and D. Garcia. Maggie: A robotic platform for human-robot social interaction. In 2006 IEEE Conference on Robotics, Automation and Mechatronics, pages 1–7, 2006.
[8] Terrence Fong, Illah Nourbakhsh, and Kerstin Dautenhahn. A survey of socially interactive robots. Robotics and Autonomous Systems, 42(3-4):143–166, 2003.


[9] Michael A. Goodrich and Alan C. Schultz. Human-Robot Interaction: A Survey. Foundations and Trends in Human-Computer Interaction, 1(3):203–275, 2007.
[10] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[11] Sridhar Mahadevan, Georgios Theocharous, and Nikfar Khaleeli. Rapid Concept Learning for Mobile Robots. Autonomous Robots, 5(3):239–251, 1998.
[12] Tim van Kasteren, Athanasios Noulas, Gwenn Englebienne, and Ben Kröse. Accurate activity recognition in a home setting. In Proceedings of the 10th International Conference on Ubiquitous Computing, UbiComp '08, pages 1–9, New York, NY, USA, 2008. ACM.
[13] Stephanie Rosenthal, J. Biswas, and M. Veloso. An effective personal mobile robot agent through symbiotic human-robot interaction. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, pages 915–922. International Foundation for Autonomous Agents and Multiagent Systems, 2010.
[14] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2010.
[15] Maya Cakmak, Crystal Chao, and Andrea L. Thomaz. Designing Interactions for Robot Active Learners. IEEE Transactions on Autonomous Mental Development, 2(2):108–118, June 2010.
[16] PrimeSense Ltd. PrimeSense's PrimeSensor Reference Design 1.08, June 2011.
[17] A. Kolb, E. Barth, and R. Koch. ToF-sensors: New dimensions for realism and interactivity. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW '08. IEEE Computer Society Conference on, pages 1–6, 2008.


[18] Andreas Kolb, Erhardt Barth, Reinhard Koch, and Rasmus Larsen. Time-of-flight sensors in computer graphics. In Eurographics State of the Art Reports, pages 119–134, 2009.
[19] S.B. Gokturk and C. Tomasi. 3D head tracking based on recognition and interpolation using a time-of-flight depth sensor. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II-211–II-217, 2004.
[20] Xia Liu and K. Fujimura. Hand gesture recognition using depth data. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 529–534, May 2004.
[21] Pia Breuer, Christian Eckes, and Stefan Müller. Hand Gesture Recognition with a Novel IR Time-of-Flight Range Camera - A Pilot Study. In André Gagalowicz and Wilfried Philips, editors, Computer Vision/Computer Graphics Collaboration Techniques, volume 4418 of Lecture Notes in Computer Science, pages 247–260. Springer Berlin / Heidelberg, 2007.
[22] Hervée Lahamy and Derek Litchi. Real-time hand gesture recognition using range cameras. In The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences [on CD-ROM], page 38, 2010.
[23] Nadia Haubner, Ulrich Schwanecke, R. Dörner, Simon Lehmann, and J. Luderschmidt. Recognition of Dynamic Hand Gestures with Time-of-Flight Cameras. In ITG/GI Workshop on Self-Integrating Systems for Better Living Environments 2010: SENSYBLE 2010, pages 1–7, 2010.
[24] K. Nickel and R. Stiefelhagen. Visual recognition of pointing gestures for human-robot interaction. Image and Vision Computing, 25(12):1875–1884, December 2007.
[25] David Droeschel, Jörg Stückler, and Sven Behnke. Learning to interpret pointing gestures with a time-of-flight camera. In Proceedings of the 6th International Conference on Human-Robot Interaction, HRI '11, pages 481–488, New York, NY, USA, 2011. ACM.

[26] Ronan Boulic, Javier Varona, Luis Unzueta, Manuel Peinado, Angel Suescun, and Francisco Perales. Evaluation of on-line analytic and numeric inverse kinematics approaches driven by partial vision input. Virtual Reality, 10(1):48–61, April 2006. 16
[27] Youding Zhu, Behzad Dariush, and Kikuo Fujimura. Kinematic self retargeting: A framework for human pose estimation. Computer Vision and Image Understanding, 114(12):1362–1375, December 2010. 16
[28] Arnaud Ramey, Víctor González-Pacheco, and Miguel A. Salichs. Integration of a low-cost RGB-D sensor in a social robot for gesture recognition. In Proceedings of the 6th International Conference on Human-Robot Interaction, HRI '11, page 229, New York, New York, USA, 2011. ACM Press. 16
[29] OpenNI Members. OpenNI web page, June 2011. 16, 26
[30] Ana Corrales, R. Rivas, and M.A. Salichs. Sistema de identificación de objetos mediante RFID para un robot personal [RFID-based object identification system for a personal robot]. In XXVIII Jornadas de Automática, pages 50–54, Huelva, 2007. Comité Español de Automática. 18
[31] R. Barber. Desarrollo de una arquitectura para robots móviles autónomos. Aplicación a un sistema de navegación topológica [Development of an architecture for autonomous mobile robots. Application to a topological navigation system]. PhD thesis, Universidad Carlos III de Madrid, 2000. 20
[32] R. Rivas, Ana Corrales, R. Barber, and M.A. Salichs. Robot skill abstraction for AD architecture. In 6th IFAC Symposium on Intelligent Autonomous Vehicles, 2007. 20
[33] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Professional, 1995. 22
[34] M. Quigley, B. Gerkey, K. Conley, J. Faust, T. Foote, J. Leibs, E. Berger, R. Wheeler, and A. Ng. ROS: an open-source Robot Operating System. In Open-Source Software Workshop of the International Conference on Robotics and Automation (ICRA), 2009. 23

[35] PrimeSense Ltd. PrimeSense's Frequently Asked Questions (FAQ) website, June 2011. 25
[36] Z. Zalevsky, A. Shpunt, A. Maizels, and J. Garcia. Method and System for Object Reconstruction, April 2007. 25
[37] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and I.H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009. 28
[38] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning. Morgan Kaufmann Publishers, 1993. 28
[39] S.B. Kotsiantis. Supervised machine learning: A review of classification techniques. Informatica, 31(3):249–268, 2007. 28, 53
[40] F. Alonso-Martin and Miguel A. Salichs. Integration of a Voice Recognition System in a Social Robot. Cybernetics and Systems, 42(4):215–245, May 2011. 35
[41] F. Alonso-Martin, Arnaud A. Ramey, and Miguel A. Salichs. Maggie: el robot traductor [Maggie: the translator robot]. In UPM, editor, 9th Workshop RoboCity2030-II, pages 57–73, Madrid, 2011. RoboCity2030. 42
[42] D. Crocker and P. Overell. Augmented BNF for Syntax Specifications: ABNF. RFC 2234, Internet Engineering Task Force, November 1997. 61


This document is published under a Creative Commons Attribution-ShareAlike (CC BY-SA) license.

You are free to:

To Share - To copy, distribute and transmit this document.
To Remix - To adapt the document.
To make commercial use of the document.

Under the following conditions:

Attribution - You must attribute the document in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
Share Alike - If you alter, transform, or build upon this document, you may distribute the resulting work only under the same or a similar license to this one.
