
Karl-Friedrich Kraiss (Ed.)

Advanced Man-Machine Interaction


Karl-Friedrich Kraiss (Ed.)

Advanced Man-Machine
Interaction
Fundamentals and Implementation

With 280 Figures

Editor
Professor Dr.-Ing. Karl-Friedrich Kraiss
RWTH Aachen University
Chair of Technical Computer Science
Ahornstraße 55
52074 Aachen
Germany
Kraiss@techinfo.rwth-aachen.de

Library of Congress Control Number: 2005938888

ISBN-10 3-540-30618-8 Springer Berlin Heidelberg New York


ISBN-13 978-3-540-30618-4 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication
of this publication or parts thereof is permitted only under the provisions of the German Copyright
Law of September 9, 1965, in its current version, and permission for use must always be obtained from
Springer. Violations are liable for prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
Typesetting: Digital data supplied by editor
Cover Design: design & production GmbH, Heidelberg
Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Printed on acid-free paper 7/3100/YL 543210
To: Eva,
Gregor,
Lucas,
Martin, and
Robert.
Preface

The way in which humans interact with machines has changed dramatically during
the last fifty years. Driven by developments in software and hardware technologies, appliances never thought of only a few years ago appear on the market. There are
apparently no limits to the functionality of personal computers or mobile telecom-
munication appliances. Intelligent traffic systems promise formerly unknown levels
of safety. Cooperation between factory workers and smart automation is under way
in industrial production. Telepresence and teleoperation in space, under water, or in micro worlds are made possible. Mobile robots provide services on duty and at home.
As, unfortunately, human evolution does not keep pace with technology, we face
problems in dealing with this brave new world. Many individuals have experienced
minor or major calamities when using computers for business or in private. The el-
derly and people with special needs often refrain altogether from capitalizing on
modern technology. Statistics indicate that in traffic systems at least two out of three
accidents are caused by human error.
In this situation man-machine interface design is widely recognized as a major
challenge. It represents a key technology, which promises excellent pay-offs. As a
major marketing factor it also secures a competitive edge over contenders.
The purpose of this book is to provide deep insight into novel enabling technolo-
gies of man-machine interfaces. This includes multimodal interaction by mimics,
gesture, and speech, video based biometrics, interaction in virtual reality and with
robots, and the provision of user assistance. All these are innovative and still dynam-
ically developing research areas.
The concept of this book is based on my lectures on man-machine systems
held regularly for students in electrical engineering and computer science at RWTH
Aachen University and on research performed during the last years at this chair.
The compilation of the materials would not have been possible without contribu-
tions of numerous people who worked with me over the years on various topics of
man-machine interaction. Several chapters are authored by former or current doctoral
students. Some chapters build in part on earlier work of former coworkers such as
Suat Akyol, Pablo Alvarado, Ingo Elsen, Kirsti Grobel, Hermann Hienz, Thomas
Krüger, Dirk Krumbiegel, Roland Steffan, Peter Walter, and Jochen Wickel. There-
fore their work deserves mentioning as well.
To provide a well balanced and complete view on the topic the volume further
contains two chapters by Professor Gerhard Rigoll from Technical University of
Munich and by Professor Rüdiger Dillmann from Karlsruhe University. I’m very
much indebted to both for their contributions. I also wish to express my gratitude to
Eva Hestermann-Beyerle and to Monika Lempe from Springer Verlag: to the former
for triggering the writing of this book, and to both for the helpful support they provided
during its preparation.

I am deeply indebted to Jörg Zieren for administering a central repository for


the various manuscripts and for making sure that the guidelines were accurately
followed. Also Lars Libuda deserves sincere thanks for undertaking the tedious work
of composing and testing the book's CD. The assistance of both was a fundamental
prerequisite for meeting the given deadline.

Finally special thanks go to my wife Cornelia for unwavering support and, more generally, for sharing life with me.

Aachen, December 2005 Karl-Friedrich Kraiss


Table of Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Non-Intrusive Acquisition of Human Action . . . . . . . . . . . . . . . . . . . . . . . 7


2.1 Hand Gesture Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Image Acquisition and Input Data . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1.1 Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1.2 Recording Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1.3 Image Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.2.1 Hand Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2.2 Region Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1.2.3 Geometric Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.2.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.1.3 Feature Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.3.1 Classification Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.3.2 Classification Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.1.3.3 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.1.3.4 Rule-based Classification . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1.3.5 Maximum Likelihood Classification . . . . . . . . . . . . . . . . 34
2.1.3.6 Classification Using Hidden Markov Models . . . . . . . . . 37
2.1.3.7 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.1.4 Static Gesture Recognition Application . . . . . . . . . . . . . . . . . . . . 48
2.1.5 Dynamic Gesture Recognition Application . . . . . . . . . . . . . . . . . 50
2.1.6 Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.2 Facial Expression Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.2.1 Image Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.2.2 Image Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.2.2.1 Face Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.2.2.2 Face Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.2.3 Feature Extraction with Active Appearance Models . . . . . . . . . . 68
2.2.3.1 Appearance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
2.2.3.2 AAM Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.2.4 Feature Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.2.4.1 Head Pose Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.2.4.2 Determination of Line of Sight . . . . . . . . . . . . . . . . . . . . . 83
2.2.4.3 Circle Hough Transformation . . . . . . . . . . . . . . . . . . . . . . 83
2.2.4.4 Determination of Lip Outline . . . . . . . . . . . . . . . . . . . . . . 87
2.2.4.5 Lip modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.2.5 Facial Feature Recognition – Eye Localization Application . . . . 90
2.2.6 Facial Feature Recognition – Mouth Localization Application . 90

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

3 Sign Language Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


3.1 Recognition of Isolated Signs in Real-World Scenarios . . . . . . . . . . . . . . . . 99
3.1.1 Image Acquisition and Input Data . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.1.2 Image Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.1.2.1 Background Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.1.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
3.1.3.1 Overlap Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
3.1.3.2 Hand Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.1.4 Feature Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.1.5 Feature Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.1.6 Test and Training Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
3.2 Sign Recognition Using Nonmanual Features . . . . . . . . . . . . . . . . . . . . . . . . 111
3.2.1 Nonmanual Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.2.2 Extraction of Nonmanual Features . . . . . . . . . . . . . . . . . . . . . . . . 113
3.3 Recognition of Continuous Sign Language using Subunits . . . . . . . . . . . . . 116
3.3.1 Subunit Models for Signs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.3.2 Transcription of Sign Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
3.3.2.1 Linguistics-orientated Transcription of Sign Language . 118
3.3.2.2 Visually-orientated Transcription of Sign Language . . . 121
3.3.3 Sequential and Parallel Breakdown of Signs . . . . . . . . . . . . . . . . 122
3.3.4 Modification of HMMs to Parallel Hidden Markov Models . . . . 122
3.3.4.1 Modeling Sign Language by means of PaHMMs . . . . . . 124
3.3.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.3.5.1 Classification of Single Signs by Means of Subunits
and PaHMMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
3.3.5.2 Classification of Continuous Sign Language by Means
of Subunits and PaHMMs . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.3.5.3 Stochastic Language Modeling . . . . . . . . . . . . . . . . . . . . 127
3.3.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.3.6.1 Initial Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
3.3.6.2 Estimation of Model Parameters for Subunits . . . . . . . . 131
3.3.6.3 Classification of Single Signs . . . . . . . . . . . . . . . . . . . . . . 132
3.3.7 Concatenation of Subunit Models to Word Models for Signs. . . 133
3.3.8 Enlargement of Vocabulary Size by New Signs . . . . . . . . . . . . . . 133
3.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.4.1 Video-based Isolated Sign Recognition . . . . . . . . . . . . . . . . . . . . 135
3.4.2 Subunit Based Recognition of Signs and Continuous Sign
Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
3.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

4 Speech Communication and Multimodal Interfaces . . . . . . . . . . . . . . . . . 141


4.1 Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.1.1 Fundamentals of Hidden Markov Model-based Speech
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.1.2 Training of Speech Recognition Systems . . . . . . . . . . . . . . . . . . . 144
4.1.3 Recognition Phase for HMM-based ASR Systems . . . . . . . . . . . 145
4.1.4 Information Theory Interpretation of Automatic Speech
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
4.1.5 Summary of the Automatic Speech Recognition Procedure . . . . 149
4.1.6 Speech Recognition Technology . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.1.7 Applications of ASR Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
4.2 Speech Dialogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.2.2 Initiative Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.2.3 Models of Dialog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.2.3.1 Finite State Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
4.2.3.2 Slot Filling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.2.3.3 Stochastic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
4.2.3.4 Goal Directed Processing . . . . . . . . . . . . . . . . . . . . . . . . . 159
4.2.3.5 Rational Conversational Agents . . . . . . . . . . . . . . . . . . . . 160
4.2.4 Dialog Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
4.2.5 Scripting and Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.3 Multimodal Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.3.1 In- and Output Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.3.2 Basics of Multimodal Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 166
4.3.2.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.3.2.2 Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.3.3 Multimodal Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.3.3.1 Integration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.3.4 Errors in Multimodal Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.3.4.1 Error Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.3.4.2 User Specific Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.3.4.3 System Specific Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.3.4.4 Error Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.3.4.5 Error Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.4 Emotions from Speech and Facial Expressions . . . . . . . . . . . . . . . . . . . . . . . 175
4.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.4.1.1 Application Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
4.4.1.2 Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.4.1.3 Emotion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.4.1.4 Emotional Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.4.2 Acoustic Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.4.2.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
4.4.2.2 Feature Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
4.4.2.3 Classification Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

4.4.3 Linguistic Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180


4.4.3.1 N-Grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.4.3.2 Bag-of-Words . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
4.4.3.3 Phrase Spotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
4.4.4 Visual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.4.4.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.4.4.2 Holistic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
4.4.4.3 Analytic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.4.5 Information Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

5 Person Recognition and Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191


5.1 Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.1.1 Challenges in Automatic Face Recognition . . . . . . . . . . . . . . . . . 193
5.1.2 Structure of Face Recognition Systems . . . . . . . . . . . . . . . . . . . . . 196
5.1.3 Categorization of Face Recognition Algorithms . . . . . . . . . . . . . 200
5.1.4 Global Face Recognition using Eigenfaces . . . . . . . . . . . . . . . . . . 201
5.1.5 Local Face Recognition based on Face Components or
Template Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.1.6 Face Databases for Development and Evaluation . . . . . . . . . . . . 212
5.1.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
5.1.7.1 Provided Images and Image Sequences . . . . . . . . . . . . . . 214
5.1.7.2 Face Recognition Using the Eigenface Approach . . . . . 215
5.1.7.3 Face Recognition Using Face Components . . . . . . . . . . . 221
5.2 Full-body Person Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
5.2.1 State of the Art in Full-body Person Recognition . . . . . . . . . . . . 224
5.2.2 Color Features for Person Recognition . . . . . . . . . . . . . . . . . . . . . 226
5.2.2.1 Color Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
5.2.2.2 Color Structure Descriptor . . . . . . . . . . . . . . . . . . . . . . . . 227
5.2.3 Texture Features for Person Recognition . . . . . . . . . . . . . . . . . . . 228
5.2.3.1 Oriented Gaussian Derivatives . . . . . . . . . . . . . . . . . . . . . 228
5.2.3.2 Homogeneous Texture Descriptor . . . . . . . . . . . . . . . . . . 229
5.2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.2.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.2.4.2 Feature Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
5.3 Camera-based People Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
5.3.1 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
5.3.1.1 Foreground Segmentation by Background Subtraction . 234
5.3.1.2 Morphological Operations . . . . . . . . . . . . . . . . . . . . . . . . 236
5.3.2 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
5.3.2.1 Tracking Following Detection . . . . . . . . . . . . . . . . . . . . . 238
5.3.2.2 Combined Tracking and Detection . . . . . . . . . . . . . . . . . . 243
5.3.3 Occlusion Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
5.3.3.1 Occlusion Handling without Separation . . . . . . . . . . . . . 248

5.3.3.2 Separation by Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248


5.3.3.3 Separation by Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
5.3.3.4 Separation by 3D Information . . . . . . . . . . . . . . . . . . . . . 250
5.3.4 Localization and Tracking on the Ground Plane . . . . . . . . . . . . . 250
5.3.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259

6 Interacting in Virtual Reality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263


6.1 Visual Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.1.1 Spatial Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.1.2 Immersive Visual Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.1.3 Viewer Centered Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.2 Acoustic Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.2.1 Spatial Hearing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
6.2.2 Auditory Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
6.2.3 Wave Field Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
6.2.4 Binaural Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.2.5 Cross Talk Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
6.3 Haptic Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
6.3.1 Rendering a Wall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
6.3.2 Rendering Solid Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
6.4 Modeling the Behavior of Virtual Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
6.4.1 Particle Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
6.4.2 Rigid Body Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
6.5 Interacting with Rigid Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
6.5.1 Guiding Sleeves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
6.5.2 Sensitive Polygons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
6.5.3 Virtual Magnetism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
6.5.4 Snap-In . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
6.6 Implementing Virtual Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
6.6.1 Scene Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
6.6.2 Toolkits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
6.6.3 Cluster Rendering in Virtual Environments . . . . . . . . . . . . . . . . . 303
6.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
6.7.1 A Simple Scene Graph Example . . . . . . . . . . . . . . . . . . . . . . . . . . 307
6.7.2 More Complex Examples: SolarSystem and Robot . . . . . . 308
6.7.3 An Interactive Environment, Tetris3D . . . . . . . . . . . . . . . . . . . . . . 310
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
6.9 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312

7 Interactive and Cooperative Robot Assistants . . . . . . . . . . . . . . . . . . . . . 315


7.1 Mobile and Humanoid Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
7.2 Interaction with Robot Assistants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
7.2.1 Classification of Human-Robot Interaction Activities . . . . . . . . . 323
7.2.2 Performative Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
7.2.3 Commanding and Commenting Actions . . . . . . . . . . . . . . . . . . . . 325
7.3 Learning and Teaching Robot Assistants . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
7.3.1 Classification of Learning through Demonstration . . . . . . . . . . . 326
7.3.2 Skill Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
7.3.3 Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
7.3.4 Interactive Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
7.3.5 Programming Robots via Observation: An Overview . . . . . . . . . 329
7.4 Interactive Task Learning from Human Demonstration . . . . . . . . . . . . . . . . 330
7.4.1 Classification of Task Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
7.4.2 The PbD Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
7.4.3 System Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
7.4.4 Sensors for Learning through Demonstration . . . . . . . . . . . . . . . . 336
7.5 Models and Representation of Manipulation Tasks . . . . . . . . . . . . . . . . . . . 338
7.5.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
7.5.2 Task Classes of Manipulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
7.5.3 Hierarchical Representation of Tasks . . . . . . . . . . . . . . . . . . . . . . 340
7.6 Subgoal Extraction from Human Demonstration . . . . . . . . . . . . . . . . . . . . . 342
7.6.1 Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
7.6.2 Segmentation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
7.6.3 Grasp Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
7.6.3.1 Static Grasp Classification . . . . . . . . . . . . . . . . . . . . . . . . 344
7.6.3.2 Dynamic Grasp Classification . . . . . . . . . . . . . . . . . . . . . 346
7.7 Task Mapping and Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
7.7.1 Event-Driven Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
7.7.2 Mapping Tasks onto Robot Manipulation Systems . . . . . . . . . . . 350
7.7.3 Human Comments and Advice . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
7.8 Examples of Interactive Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
7.9 Telepresence and Telerobotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
7.9.1 The Concept of Telepresence and Telerobotic Systems . . . . . . . . 356
7.9.2 Components of a Telerobotic System . . . . . . . . . . . . . . . . . . . . . . 359
7.9.3 Telemanipulation for Robot Assistants in Human-Centered
Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
7.9.4 Exemplary Teleoperation Applications . . . . . . . . . . . . . . . . . . . . . 361
7.9.4.1 Using Teleoperated Robots as Remote Sensor Systems 361
7.9.4.2 Controlling and Instructing Robot Assistants via
Teleoperation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364

8 Assisted Man-Machine Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369


8.1 The Concept of User Assistance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
8.2 Assistance in Manual Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
8.2.1 Needs for Assistance in Manual Control . . . . . . . . . . . . . . . . . . . . 373
8.2.2 Context of Use in Manual Control . . . . . . . . . . . . . . . . . . . . . . . . . 374
8.2.3 Support of Manual Control Tasks . . . . . . . . . . . . . . . . . . . . . . . . . 379
8.3 Assistance in Man-Machine Dialogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
8.3.1 Needs for Assistance in Dialogs . . . . . . . . . . . . . . . . . . . . . . . . . . 386
8.3.2 Context of Dialogs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
8.3.3 Support of Dialog Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395

A LTI-LIB — A C++ Open Source Computer Vision Library . . . . 399


A.1 Installation and Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
A.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
A.2.1 Duplication and Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
A.2.2 Serialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
A.2.3 Encapsulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
A.2.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
A.3 Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
A.3.1 Points and Pixels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
A.3.2 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
A.3.3 Image Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
A.4 Functional Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
A.4.1 Functor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
A.4.2 Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
A.5 Handling Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
A.5.1 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
A.5.2 Image IO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
A.5.3 Color Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
A.5.4 Drawing and Viewing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
A.6 Where to Go Next . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421

B IMPRESARIO — A GUI for Rapid Prototyping of Image Processing


Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
B.1 Requirements, Installation, and Deinstallation . . . . . . . . . . . . . . . . . . . . . . . 424
B.2 Basic Components and Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
B.2.1 Document and Processing Graph . . . . . . . . . . . . . . . . . . . . . . . . . . 426
B.2.2 Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
B.2.3 Processing Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
B.2.4 Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
B.2.5 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
B.3 Adding Functionality to a Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432

B.3.1 Using the Macro Browser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432


B.3.2 Rearranging and Deleting Macros . . . . . . . . . . . . . . . . . . . . . . . . . 434
B.4 Defining the Data Flow with Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
B.4.1 Creating a Link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
B.4.2 Removing a Link . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
B.5 Configuring Macros . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
B.5.1 Image Sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
B.5.2 Video Capture Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
B.5.3 Video Stream . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
B.5.4 Standard Macro Property Window . . . . . . . . . . . . . . . . . . . . . . . . 444
B.6 Advanced Features and Internals of IMPRESARIO . . . . . . . . . . . . . . . . . . . . 446
B.6.1 Customizing the Graphical User Interface . . . . . . . . . . . . . . . . . . 446
B.6.2 Macros and Dynamic Link Libraries . . . . . . . . . . . . . . . . . . . . . . . 448
B.6.3 Settings Dialog and Directories . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
B.7 IMPRESARIO Examples in this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
B.8 Extending IMPRESARIO with New Macros . . . . . . . . . . . . . . . . . . . . . . . . . . 450
B.8.1 Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
B.8.2 API Installation and Further Reading . . . . . . . . . . . . . . . . . . . . . . . 451
B.8.3 IMPRESARIO Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453
Contributors

Dipl.-Ing. Markus Ablaßmeier


Institute for Human-Machine Communication
Technische Universität München, Munich, Germany
Email: ablassmeier@tum.de
Dr.-Ing. José Pablo Alvarado Moya
Escuela de Electrónica
Instituto Tecnológico de Costa Rica
Email: palvarado@ietec.org
Dipl.-Inform. Ingo Assenmacher
Center for Computing and Communication,
RWTH Aachen University, Aachen, Germany
Email: assenmacher@rz.rwth-aachen.de
Dr. rer. nat. Britta Bauer
Gothaer Versicherungen, Köln, Germany
Email: britta.bauer@gmx.de
Dr.-Ing. Ulrich Canzler
CanControls, Aachen
Email: canzler@cancontrols.com
Prof. Dr.-Ing. Rüdiger Dillmann
Institut für Technische Informatik (ITEC)
Karlsruhe University, Germany
Email: dillmann@ira.uka.de
Dipl.-Ing. Peter Dörfler
Chair of Technical Computer Science,
RWTH Aachen University, Aachen, Germany
Email: ammi@pdoerfler.com
Dipl.-Ing. Holger Fillbrandt
Chair of Technical Computer Science,
RWTH Aachen University, Aachen, Germany
Email: fillbrandt@techinfo.rwth-aachen.de
Dipl.-Ing. Michael Hähnel
Chair of Technical Computer Science,
RWTH Aachen University, Aachen, Germany
Email: haehnel@techinfo.rwth-aachen.de

Dipl.-Ing. Lenka Jerábková


Center for Computing and Communication,
RWTH Aachen University, Aachen, Germany
Email: jerabkova@rz.rwth-aachen.de
Prof. Dr.-Ing. Karl-Friedrich Kraiss
Chair of Technical Computer Science,
RWTH Aachen University, Aachen, Germany
Email: kraiss@techinfo.rwth-aachen.de
Dr. rer. nat. Torsten Kuhlen
Center for Computing and Communication,
RWTH Aachen University, Aachen, Germany
Email: kuhlen@rz.rwth-aachen.de
Dipl.-Ing. Lars Libuda
Chair of Technical Computer Science,
RWTH Aachen University, Aachen, Germany
Email: libuda@techinfo.rwth-aachen.de
Dipl.-Ing. Ronald Müller
Institute for Human-Machine Communication
Technische Universität München, Munich, Germany
Email: mueller@mmk.ei.tum.de
Dipl.-Ing. Tony Poitschke
Institute for Human-Machine Communication
Technische Universität München, Munich, Germany
Email: poitschke@tum.de
Dipl.-Ing. Stefan Reifinger
Institute for Human-Machine Communication
Technische Universität München, Munich, Germany
Email: reifinger@mmk.ei.tum.de
Prof. Dr.-Ing. Gerhard Rigoll
Institute for Human-Machine Communication
Technische Universität München, Munich, Germany
Email: rigoll@tum.de
Dipl.-Ing. Björn Schuller
Institute for Human-Machine Communication
Technische Universität München, Munich, Germany
Email: schuller@tum.de

Dipl.-Ing. Jörg Zieren


Chair of Technical Computer Science,
RWTH Aachen University, Aachen, Germany
Email: joerg@zieren.de
Dipl.-Inform. Raoul D. Zöllner
Institut für Technische Informatik (ITEC)
Karlsruhe University, Germany
Email: zoellner@fzi.de
Chapter 1

Introduction
Karl-Friedrich Kraiss
Living in modern societies relies on smart systems and services that support
our collaborative and communicative needs as social beings. Developments in soft-
ware and hardware technologies, such as microelectronics, mechatronics, speech
technology, computer linguistics, computer vision, and artificial intelligence are
continuously driving new applications for work, leisure, and mobility. Man-machine
interaction, which is ubiquitous in human life, is driven by technology as well. It
is the gateway providing access to functions and services which, due to increasing
complexity, threatens to turn into a bottleneck. Since for the user the interface is
the product, interface design is a key technology which accordingly is subject to
significant national and international research efforts.
Interfaces to smart systems exploit the same advanced technologies as the said
systems themselves. This textbook therefore not only introduces researchers, students, and practitioners to advanced concepts of interface design, but also gives insight
into the related interface technologies.
This introduction first outlines the broad spectrum of applications in which man-
machine interaction takes place. Then, following an overview of interface design
goals, major approaches to interaction augmentation are identified. Finally the individual chapters of this textbook are briefly characterized.

Application Areas of Man-Machine Interaction


According to their properties man-machine systems may be classified into dialog sys-
tems and dynamic systems (Fig. 1.1). The former comprise small single user ap-
pliances like mobile phones, personal digital assistants and personal computers, but
also multi-user systems like plant control stations or call centers. All of these require
discrete user actions by, e.g., keyboard or mouse in either stationary or mobile mode.
Representatives of the latter are land, air, or sea vehicles, robots, and master-
slave systems. Following the needs of an aging society, rehabilitation robots, which provide movement assistance and restoration for people with disabilities, also increase steadily in importance. All these systems put the operator in a control loop
and require continuous manual control inputs.
As dynamic systems are becoming more complex, functions tend to be automated
in parts, which takes the operator out of the control loop and assigns supervisory
control tasks to him or suggests cooperation with machines by task sharing. This
approach has a long tradition in civil and military aviation and industrial plants.
Currently smart automation embedded in the real world makes rapid progress in
road traffic, where a multitude of driver assistance functions are under development.
Many of these require interaction with the driver. In the near future also mobile ser-
vice robots are expected on the marketplace as companions or home helpers. While
these will have built-in intelligence for largely autonomous operation, they neverthe-
less will need human supervision and occasional intervention.
The characterization proposed in (Fig. 1.1) very seldom occurs in pure form.
Many systems are blended in one way or other, requiring discrete dialog actions,
manual control, and supervisory control at the same time. A prototypical example is
a commercial airliner, where the cockpit crew performs all these actions in sequence
or, at times, in parallel.

[Fig. 1.1 is a tree diagram: man-machine systems branch into dialog systems and dynamic systems. Dialog systems are either stationary (electronic appliances in home and office, industrial plants, control stations) or mobile (mobile phones, laptops, PDAs, Blackberries). Dynamic systems involve either manual control (land, air, and sea vehicles, master-slave systems, rehabilitation robotics) or supervisory control (automated land, air, and sea vehicles, service robots, industrial plants).]

Fig. 1.1. Classification of man-machine systems and associated application areas.

Design Goals for Man-Machine Systems


For an interaction between man and machine to run smoothly, the interface between
both must be designed appropriately. Unfortunately, quite unlike technical
transmission channels, no crisp evaluation measures are at hand to characterize qual-
ity of service. Looking from the user’s perspective, the general goals to be achieved
are usability, safety of use, and freedom of error.
Usability describes the extent to which a product can be used effectively, efficiently, and with satisfaction, as defined by DIN EN ISO 9241. In this definition effectiveness describes the accuracy and completeness with which a user can reach his goals. Ef-
ficiency means the effort needed to work effectively, while satisfaction is a measure
of user acceptance. More recently it has been realized that user satisfaction also cor-
relates with fun of use. The design of hedonistic interfaces that grant joy of use is
already apparent in game boxes and music players, but also attracts the automotive
industry, which advertises products with pertinent slogans that promise, e.g. ”driving
pleasure”.
The concept of safety of use is complementary to usability. It describes the extent to which a product is understandable, predictable, controllable, and robust. Under-
standability implies that the workings of system functions are understood by a user.
Predictability describes the extent to which a user is aware of system limits. Control-


lability indicates whether system functions can be switched off or overruled at the user's
discretion. Robustness, finally, describes how a system reacts to unpredictable per-
formance of certain risk groups, due to complacency, habituation processes, reflexes
in case of malfunctioning, or from testing the limits (e.g. when driving too fast).
Freedom of error is a third objective of interface design, since a product free
of errors complies with product liability requirements, so that no warranty-related user claims need be feared. Freedom of error requires that an interface sat-
isfies a user’s security expectations, which may be based on manuals, instructions,
warnings or advertising. Bugs in this sense are construction errors due to inadequate
consideration of all possible modes of use and abuse. Other sources of malfunction
are instruction errors resulting from incomplete or baffling manuals or mistakable
marketing information. It is interesting to note here that a product is legally considered defective if it fails to meet the expectations of the least informed user. Since
such a requirement is extremely difficult to meet, this legal situation prevents the
introduction of more advanced functions with possibly dangerous consequences.
The next section will address the question of how advanced interaction techniques can contribute to meeting the stated design goals.

Advanced Interaction Concepts

Starting point for the discussion of interface augmentation is the basic interaction
scheme represented by the white blocks in (Fig. 1.2), where the term machine stands
for appliance, vehicle, robot, simulation or process. Man and machine are interlinked
via two unidirectional channels, one of which transmits machine generated informa-
tion to the user, while the other one feeds user inputs in the opposite direction. Interaction
takes place with a real machine in the real world.

[Fig. 1.2 is a block diagram: man and machine (a real machine in the real world, or its operation in virtual reality) are linked by an information display channel and a user input channel. Grey blocks augment this basic scheme: multimodality, with fission of the output and fusion of the input via speech, gesture, mimics, etc., and user assistance, fed by the context of use, which acts through information display augmentation, user input augmentation, and automation.]

Fig. 1.2. Interaction augmented by multimodality, user assistance, and virtual reality.

There are three major approaches, which promise to significantly augment inter-
face quality. These are multimodal interaction, operation in virtual reality, and user
assistance, which have been integrated into the basic MMI scheme of (Fig. 1.2) as
grey rectangles.
Multimodality is ubiquitous in human communication. We gesture and mimic
while talking, even on the phone, when the addressee cannot see it. We nod or shake
the head or change head pose to indicate agreement or disagreement. We also signal
attentiveness by suitable body language, e.g. by turning towards a dialog partner. In
so doing conversation becomes comfortable, intuitive, and robust.
It is exactly this lack of multimodality that makes current interfaces fail so often. It therefore is a goal of advanced interface design to make use of various modalities. Modal-
ities applicable to interfaces are mimics, gesture, body posture, movement, speech,
and haptics. They serve either information display or user input, sometimes even
both purposes. The combination of modalities for input is sometimes called multi-
modal fusion while the combination of different modalities for information display is
labeled multimodal fission. For fission modalities must be generated e.g. by synthe-
sizing speech or by providing gesturing avatars. This book focuses on fusion, which
requires automatic recognition of multimodal human actions.
Speech recognition has been around for almost fifty years and is available for
practical use. Further improvements of speech recognizers with respect to context
sensitive continuous speech recognition and robustness against noise are under way.
Observation of lip movements during talking contributes to speech recognition un-
der noise conditions. Advanced speech recognizers will also cope with incomplete
grammar and spontaneous utterances.
Interest in gesture and mimic recognition is more recent. In fact the first related
papers appeared only in the nineties. First efforts to record mimics and gesture in
real time in laboratory setups and in movie studios involved intrusive methods with
calibrated markers and multiple cameras, which are of no use for the applications at
hand. Only recently video based recognition has achieved an acceptable performance
level in out-of-the-laboratory settings. Gesture, mimics, head pose, line of sight and
body posture can now be recognized based on video recordings in real time, even
under adverse real world conditions. People may even be optically tracked or located
in public places. Emotions derived from a fusion of speech, gestures and mimics now
open the door for yet scarcely exploited emotional interaction.
The second column on which interface augmentation rests is virtual reality,
which provides comfortable means of interactive 3D-simulation, hereby enabling
various novel approaches for interaction. Virtual reality allows, e.g., operating vir-
tual machines in virtual environments such as flight simulators. It may also be used to
generate virtual interfaces. By providing suitable multimodal information channels,
e.g., realistic telepresence is achieved that permits teleoperation in worlds which
are physically not accessible because of distance or scale (e.g. space applications or
micro- and nano-worlds).
In master-slave systems the operator usually acts on a local real machine, and the
actions are repeated by a remote machine. With virtual reality the local machine can
be substituted by simulation. Virtual reality also has a major impact on robot explo-
ration, testing, and training, since robot actions can be taught-in and tested within a
local system simulation before being transferred and executed in a remote and pos-
sibly dangerous and costly setting. Interaction may be further improved by making
use of augmented or mixed reality, where information is presented non-intrusively
on head mounted displays as an overlay to virtual or real worlds.
Apart from multimodality and virtual reality, the concept of user assistance is
a third ingredient of interpersonal communication lacking so far in man-machine
interaction. Envision, e.g., how business executives often call upon a staff of personal
assistants to get their work done. Based on familiarity with the context of the given
task, i.e. the purpose and constraints of a business transaction, a qualified assistant is
expected to provide correct information just in time, hereby reducing the workload
of the executive and improving the quality of the decisions made.
In man-machine interaction context of task transfers to context of use, which may
be characterized by the state of the user, the state of the system, and the situation. In
conventional man-machine interaction little of this kind is yet known. Neither user
characteristics nor the circumstances of operation are made explicit for support of
interaction; the knowledge about context resides with the user alone. However, if
context of use was made explicit, assistance could be provided to the user, similar to
that offered to the executive by his personal assistant. Hence knowledge of context
of use is the key to exploiting interface adaptivity. As indicated by vertical arrows in
(Fig. 1.2) assistance can augment interaction in three ways. First information display
may be improved, e.g., by providing information just in time and coded in the most
suitable modality. Secondly user inputs may be augmented, modified, or even sup-
pressed as appropriate. A third method for providing assistance exists for dynamic
systems, where system functions may be automated either permanently or temporar-
ily. An appliance assisted in this way is supposed to appear to the user simpler than
it actually is. It also will be easier to learn, more pleasant to handle, and less error-
prone.

The Scope of This Book

Most publications in the field of human factors engineering focus on interface design,
while leaving the technical implementation to engineers. While this book addresses
design aspects as well, it puts in addition special emphasis on advanced interface
implementation. A significant part of the book is concerned with the realization of
multimodal interaction via gesture, mimics, and speech. The remaining parts deal
with the identification and tracking of people, the interaction in virtual reality and
with robots, and with the implementation of user assistance.
The second chapter treats image processing algorithms for the non-intrusive ac-
quisition of human action. In the first section techniques for identifying and classify-
ing hand gesture commands from video are described. These techniques are extended
to facial expression commands in the second section.
Sign language recognition, the topic of chapter 3, is the ultimate challenge in
multimodal interaction. It builds on the basic techniques presented in chapter 2,
but considers specialized requirements for the recognition of isolated and continuous
signs using manual as well as facial expression features. A method for the automatic
transcription of signs into subunits is presented, which, comparable to phonemes in
speech recognition, opens the door to large vocabularies.
The role of speech communication in multimodal interfaces is the subject of
chapter 4, which in its first section presents basics of advanced speech recognition.
Speech dialogs are treated next, followed by a comprehensive description of mul-
timodal dialogs. A final section is concerned with the extraction of emotions from
speech and gesture.
The 5th chapter deals with video-based person recognition and tracking. A first
section is concerned with face recognition using global and local features. Then full-
body person recognition is dealt with based on the color and texture of the clothing
a subject wears. Finally various approaches for camera-based people tracking are
presented.
Chapter 6 is devoted to problems arising when interacting with virtual objects
in virtual reality. First technological aspects of implementing visual, acoustic, and
haptic virtual worlds are presented. Then procedures for virtual object modeling are
outlined. The third section is devoted to various methods to augment interacting with
virtual objects. Finally the chapter closes with a description of software tools ap-
plicable for the implementation of virtual environments.
The role of service robots in the future society is the subject of chapter 7. First
a description of available mobile and humanoid robots is given. Then actual chal-
lenges arising from interacting with robot assistants are systematically tackled like,
e.g., learning, teaching, programming by demonstration, commanding procedures,
task planning, and exception handling. As far as manipulation tasks are considered,
relevant models and representations are presented. Finally telepresence and telero-
botics are treated as special cases of man-robot interaction.
The 8th chapter addresses the concept of assistance in man-machine interaction,
which is systematically elaborated for manual control and dialog tasks. In both cases
various methods of providing assistance are identified based on user capabilities and
deficiencies. Since context of use is identified as the key element in this undertak-
ing, a variety of algorithms are introduced and discussed for real-time non-intrusive
context identification.
For readers with a background in engineering and computer science chapters 2
and 5 present a collection of programming examples, which are implemented with
’Impresario’. This rapid prototyping tool offers an application programming inter-
face for comfortable access to the LTI-Lib, a large public domain software library
for computer vision and classification. A description of the LTI-Lib can be found in
Appendix A, a manual of Impresario in Appendix B.
For the examples a selection of head, face, eye, and lip templates, hand ges-
tures, sign language sequences, and people tracking clips are provided. The reader
is encouraged to use these materials to start experimentation on his own. Chapter 6, finally, provides the Java3D code of some simple VR applications. All software, pro-
gramming code, and templates come with the book CD.
Chapter 2

Non-Intrusive Acquisition of Human Action


Jörg Zieren (Sect. 2.1), Ulrich Canzler (Sect. 2.2)
This chapter discusses control by gestures and facial expression using the vi-
sual modality. The basic concepts of vision-based pattern recognition are introduced
together with several fundamental image processing algorithms. The text puts
low-level methods in context with actual applications and enables the reader to
create a fully functional implementation. Rather than provide a complete survey
of existing techniques, the focus lies on the presentation of selected procedures
supported by concrete examples.

2.1 Hand Gesture Commands

The use of gestures as a means to convey information is an important part of human
communication. Since hand gestures are a convenient and natural way of interaction,
man-machine interfaces can benefit from their use as an input modality. Gesture con-
trol allows silent, remote, and contactless interaction without the need to locate knobs
or dials. Example applications include publicly accessible devices such as informa-
tion terminals [1], where problems of hygiene and strong wear of keys are solved, or
automotive environments [2], where visual and mental distraction are reduced and
mechanical input devices can be saved.

Fig. 2.1. The processing chain found in many pattern recognition systems.

Many pattern recognition systems have a common processing chain that can
roughly be divided into three consecutive stages, each of which forwards its output
to the next (Fig. 2.1). First, a single frame is acquired from a camera, a file system,
a network stream, etc. The feature extraction stage then computes a fixed number of
scalar values that each describe an individual feature of the observed scene, such as
the coordinates of a target object (e.g. the hand), its size, or its color. The image is
thereby reduced to a numerical representation called a feature vector. This step is
often the most complex and computationally expensive in the processing chain. The
accuracy and reliability of the feature extraction process can have significant influ-
ence on the system’s overall performance. Finally, the extracted features of a single
frame (static gesture) or a sequence of frames (dynamic gesture) are classified either
by a set of rules or on the basis of a training phase that occurs prior to the actual ap-
plication phase. During the training the correct classification of the current features
is always known, and the features are stored or learned in a way that later allows the
system to classify features computed in the application phase, even in the presence of noise
and minor variations that inevitably occur in the real world.
Moving along the processing chain, the data volume being processed continu-
ously decreases (from an image of several hundred kB to a feature vector of several
bytes to a single event description) while the level of abstraction increases accord-
ingly (from a set of pixels to a set of features to a semantic interpretation of the
observed scene).
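The following minimal C++ sketch illustrates this three-stage structure. All interface and type names (Frame, FrameSource, FeatureExtractor, Classifier) are choices of this sketch only; they are not part of the LTI-Lib or of the example applications on the CD.

// Minimal sketch of the acquisition -> feature extraction -> classification
// chain of Fig. 2.1. All names are illustrative only.
#include <string>
#include <vector>

struct Frame {                        // raw image data from camera, file, network, ...
  int width = 0, height = 0;
  std::vector<unsigned char> pixels;  // e.g. interleaved RGB
};

using FeatureVector = std::vector<double>;

struct FrameSource {                  // stage 1: image acquisition
  virtual bool nextFrame(Frame& f) = 0;
  virtual ~FrameSource() = default;
};

struct FeatureExtractor {             // stage 2: image -> feature vector
  virtual FeatureVector extract(const Frame& f) = 0;
  virtual ~FeatureExtractor() = default;
};

struct Classifier {                   // stage 3: feature vector(s) -> class label
  virtual std::string classify(const FeatureVector& features) = 0;
  virtual ~Classifier() = default;
};

// The application merely pushes data along the chain and reacts to the result.
void run(FrameSource& source, FeatureExtractor& extractor, Classifier& classifier) {
  Frame frame;
  while (source.nextFrame(frame)) {
    const FeatureVector features = extractor.extract(frame);
    const std::string result = classifier.classify(features);
    (void)result;                     // trigger the system response associated with 'result'
  }
}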
The following sections 2.1.1, 2.1.2, and 2.1.3 explain the three processing stages
in detail. En route, we will create example applications that recognize various static
and dynamic hand gestures. These examples show the practical application of the
processing concept and algorithms presented in this chapter. Since the C++ source
code is provided on the accompanying CD, they can also serve as a basis for own
implementations.

2.1.1 Image Acquisition and Input Data

Before looking into algorithms, it is important to examine those aspects of the
planned gesture recognition system that directly affect the interaction with the user.
Three important properties can be defined that greatly influence the subsequent de-
sign process:
• The vocabulary, i.e. the set of gestures that are to be recognized. The vocabulary
size is usually identical with the number of possible system responses (unless
multiple gestures trigger the same response).
• The recording conditions under which the system operates. To simplify system
design, restrictions are commonly imposed on image background, lighting, the
target’s minimum and maximum distance to the camera, etc.
• The system’s performance in terms of latency (the processing time between the
end of the gesture and the availability of the recognition result) and recognition
accuracy (the percentage or probability of correct recognition results).
The first two items are explained in detail below. The third involves performance
measurement and code optimization for a concrete implementation and application
scenario, which is outside the scope of this chapter.
2.1.1.1 Vocabulary
Constellation and characteristics of the vocabulary have a strong influence on all
processing stages. If the vocabulary is known before the development of the recognition
system starts, both design and implementation are usually optimized accord-
ingly. This may later lead to problems if the vocabulary is to be changed or extended.
In case the system is completely operational before the constellation of the vocab-
ulary is decided upon, recognition accuracy can be optimized by simply avoiding
gestures that are frequently misclassified. In practice, a mixture of both approaches
is often applicable.
The vocabulary affects the difficulty of the recognition task in two ways. First,
with growing vocabulary size the number of possible misclassifications rises, render-
ing recognition increasingly difficult. Second, the degree of similarity between the
vocabulary’s elements needs to be considered. This cannot be quantified, and even a
qualitative estimate based on human perception cannot necessarily be transferred to
an automatic system. Thus, for evaluating a system’s performance, vocabulary size
and recognition accuracy alone are insufficient. Though frequently omitted, the ex-
act constellation of the vocabulary (or at least the degree of similarity between its
elements) needs to be known as well.
2.1.1.2 Recording Conditions
The conditions under which the input images are recorded are crucial for the design
of any recognition system. It is therefore important to specify them as precisely and
completely as possible before starting the actual implementation. However, unless
known algorithms and hardware are to be used, it is not always foreseeable how
the subsequent processing stages will react to certain properties of the input data.
For example, a small change in lighting direction may cause an object’s appearance
to change in a way that greatly affects a certain feature (even though this change
may not be immediately apparent to the human eye), resulting in poor recognition
accuracy. In this case, either the input data specification has to be revised not to
allow changes in lighting, or more robust algorithms have to be deployed for feature
extraction and/or classification that are not affected by such variations.
Problems like this can be avoided by keeping any kind of variance or noise in the
input images to a minimum. This leads to what is commonly known as “laboratory
recording conditions”, as exemplified by Tab. 2.1.
While these conditions can usually be met in the training phase (if the classifi-
cation stage requires training at all), they render many systems useless for any kind
of practical application. On the other hand, they greatly simplify system design and
might be the only environment in which certain demanding recognition tasks are
viable.
In contrast to the above enumeration of laboratory conditions, one or more of the
“real world recording conditions” shown in Tab. 2.2 often apply when a system is to
be put to practical use.
Designing a system that is not significantly affected by any of the above condi-
tions is, at the current state of research, impossible and should thus not be aimed for.
General statements regarding the feasibility of certain tasks cannot be made, but from
experience one will be able to judge what can be achieved in a specific application
scenario.

Table 2.1. List of common “laboratory recording conditions”.

Image Content: The image contains only the target object, usually in front of a unicolored,
untextured background. No other objects are visible. The user wears long-sleeved,
non-skin colored clothing, so that the hand can be separated from the arm by color.

Lighting: Strong diffuse lighting provides an even illumination of the target with no
shadows or reflections.

Setup: The placement of camera, background, and target remains constant. The target’s
distance to the camera and its position in the image do not change. (If the target is moving,
this applies only to the starting point of the motion.)

Camera: The camera hardware and camera parameters (resolution, gain, white balance,
brightness etc.) are never changed. They are also optimally adjusted to prevent overexposure
or color cast. In case of a moving target, shutter speed and frame rate are sufficiently high
to prevent both motion blur and discontinuities. A professional quality camera with a high
resolution is used.

Table 2.2. List of common “real world recording conditions”.

Image Content: The image may contain other objects (distractors) besides, or even in place
of, the target object. The background is unknown and may even be moving (e.g. trees,
clouds). A target may become partially or completely occluded by distractors or other
targets. The user may be wearing short-sleeved or skin-colored clothing that prevents
color-based separation of arm and hand.

Lighting: Lighting may not be diffuse, resulting in shadows and uneven illumination within
the image region. It may even vary over time.

Setup: The position of the target with respect to the camera is not fixed. The target may be
located anywhere in the image, and its size may vary with its distance to the camera.

Camera: Camera hardware and/or parameters may change from take to take, or even during
a single take in case of automatic dynamic adaptation. Overexposure and color cast may
occur. In case of a moving target, low shutter speeds may cause motion blur, and a low
frame rate (or high target velocity) may result in discontinuities between successive frames.
A consumer quality camera is used.

2.1.1.3 Image Representation


The description and implementation of image processing algorithms requires a suit-
able mathematical representation of various types of images. A common representa-
tion of a rectangular image with a resolution of M rows and N columns is a discrete
function

I(x, y)   with   x ∈ {0, 1, . . . , N − 1},   y ∈ {0, 1, . . . , M − 1}     (2.1)
The 2-tuple (x, y) denotes pixel coordinates with the origin (0, 0) in the upper-
left corner. The value of I(x, y) describes a property of the corresponding pixel. This
property may be the pixel’s color, its brightness, the probability of it representing
human skin, etc.
Color is described as an n-tuple of scalar values, depending on the chosen color
model. The most common color model in computer graphics is RGB, which uses a
3-tuple (r, g, b) specifying a color’s red, green, and blue components:
I(x, y) = ( r(x, y), g(x, y), b(x, y) )ᵀ     (2.2)
Other frequently used color models are HSI (hue, saturation, intensity) or YCbCr
(luminance and chrominance). For color images, I(x, y) is thus a vector function.
For many other properties, such as brightness and probabilities, I(x, y) is a scalar
function and can be visualized as a gray value image (also called intensity image).
Scalar integers can be used to encode a pixel’s classification. For gesture recogni-
tion an important classification is the discrimination between foreground (target) and
background. This yields a binary valued image, commonly called a mask,

Imask (x, y) ∈ {0, 1} (2.3)


Binary values are usually visualized as black and white.
An image sequence (i.e. a video clip) of T frames can be described using a time
index t as shown in (2.4). The amount of real time elapsed between two successive
frames is the inverse of the frame rate used when recording the images.

I(x, y, t), t = 1, 2, . . . , T (2.4)


The C++ computer vision library LTI-Lib (described in Annex A) uses the types
channel and channel8 for gray value images, image for RGB color images,
and rgbPixel for colors.
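To keep the code sketches in this chapter self-contained, they use a small stand-alone image type rather than the LTI-Lib classes. The following definition is only a stand-in chosen for these sketches and mirrors (2.1)–(2.3):

// Minimal image containers mirroring Eqs. (2.1)-(2.3); stand-ins for the
// LTI-Lib types, not their actual interface.
#include <cstdint>
#include <vector>

struct RgbPixel { std::uint8_t r, g, b; };       // color triple of Eq. (2.2)

template <typename T>
struct Image {
  int width = 0, height = 0;                     // N columns, M rows
  std::vector<T> data;                           // row-major pixel storage

  Image(int w, int h, T init = T()) : width(w), height(h), data(w * h, init) {}

  T&       at(int x, int y)       { return data[y * width + x]; }
  const T& at(int x, int y) const { return data[y * width + x]; }
};

using ColorImage = Image<RgbPixel>;              // I(x, y) as in Eq. (2.2)
using GrayImage  = Image<float>;                 // scalar images, e.g. probabilities
using Mask       = Image<std::uint8_t>;          // binary mask of Eq. (2.3): 0 or 1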
2.1.1.4 Example
We will now specify the input data to be processed by the two example recognition
applications that will be developed in the course of this chapter, as well as recording
conditions for static and dynamic gestures.
Static Gestures
The system shall recognize three different gestures – “left”, “right”, and “stop” –
as shown in Fig. 2.2 (a-c). In addition, the event that the hand is not visible in the
image or rests in an idle position (such as Fig. 2.2d) and does not perform any of
these gestures shall be detected. Since this vocabulary is of course very small and its
elements are sufficiently dissimilar (this is proven in Sect. 2.1.3.7), high recognition
accuracy can be expected.

(a) “left” (b) “right” (c) “stop” (d) none (idle)

Fig. 2.2. Static hand gestures for “left”, “right”, and “stop” (a-c), as well as example for an
idle position (d).

Dynamic Gestures

The dynamic vocabulary will be comprised of six gestures, where three are sim-
ply backwards executions of the other three. Tab. 2.3 describes these gestures, and
Fig. 2.3 shows several sample executions. Since texture will not be considered, it
does not matter whether the palm or the back of the hand faces the camera. There-
fore, the camera may be mounted atop a desk facing downwards (e.g. on a tripod or
affixed to the computer screen) to avoid background clutter.

Table 2.3. Description of the six dynamic gestures to be recognized.

“clockwise”: circular clockwise motion of the closed hand, starting and ending at the
bottom of an imaginary circle
“counterclockwise”: backwards execution of “clockwise”
“open”: opening motion of the flat hand by rotation around the arm’s axis by 90°
(fingers extended and touching)
“close”: backwards execution of “open”
“grab”: starting with all fingers extended and not touching, a fist is made
“drop”: backwards execution of “grab”

Recording Conditions

As will be described in the following sections, skin color serves as the primary im-
age cue for both example systems. Assuming an office or other indoor environment,
Tab. 2.4 lists the recording conditions to be complied with for both static and dy-
namic gestures. The enclosed CD contains several accordant examples that can also
be used as input data in case a webcam is not available.

2.1.2 Feature Extraction

The transition from low-level image data to some higher-level description thereof,
represented as a vector of scalar values, is called feature extraction. In this process,

(a) “clockwise” (left to right) and “counterclockwise” (right to left)

(b) “open” (left to right) and “close” (right to left)

(c) “grab” (left to right) and “drop” (right to left)

Fig. 2.3. The six dynamic gestures to be recognized. In each of the three image sequences,
the second gesture is a backwards execution of the first, i.e. corresponds to reading the image
sequence from right to left.

irrelevant information (background) is discarded, while relevant information (foreground
or target) is isolated. Most pattern recognition systems perform this step,
because processing the complete image is computationally too demanding and in-
troduces an unacceptably high amount of noise (images often contain significantly
more background pixels than foreground pixels).
Using two-dimensional images of the unmarked hand, it is at present not possible
to create a three-dimensional model of a deformable object as complex and flexible
as the hand in real-time. Since the hand has 27 degrees of freedom (DOF), this would
require an amount of information that cannot be extracted with sufficient accuracy
and speed. Model-based features such as the bending of individual fingers are there-
fore not available in video-based marker-free real-time gesture recognition. Instead,
appearance-based features that describe the two-dimensional view of the hand are
exploited.
Frequently used appearance-based shape features include geometric properties of
the shape’s border, such as location, orientation, size etc. Texture features shall not be
considered here because resolution and contrast that can be achieved with consumer
cameras are usually insufficient for their reliable and accurate computation.
This section presents algorithms and data structures to detect the user’s hand in
the image, describe its shape, and calculate several geometric features that allow
the distinction of different static hand configurations and different dynamic gestures,

Table 2.4. Recording conditions for the example recognition applications.

Image Content: Ideally, the only skin colored object in the image is the user’s hand. Other
skin colored objects may be visible, but they must be small compared to the hand. This also
applies to the arm; it should be covered by a long-sleeved shirt if it is visible. In dynamic
gestures the hand should have completely entered the image before recording starts, and
not leave the image before recording ends, so that it is entirely visible in every recorded
frame.

Lighting: Lighting is sufficiently diffuse so that no significant shadows are visible on the
hand. Slight shadows, such as in Fig. 2.2 and Fig. 2.3, are acceptable.

Setup: The distance between hand and camera is chosen so that the hand fills approximately
10–25% of the image. The hand’s exact position in the image is arbitrary, but no parts of
the hand should be cropped. The camera is not rotated, i.e. its x axis is horizontal.

Camera: Resolution may vary, but should be at least 320 × 240. Aspect ratio remains
constant. The camera is adjusted so that no overexposure and only minor color cast occurs.
(Optimal performance is usually achieved when gain, brightness and shutter speed are set
so that the image appears somewhat underexposed to the human observer.) For dynamic
gestures, a frame rate of 25 per second is to be used, and shutter speed should be high
enough to prevent motion blur. A consumer quality camera is sufficient.

such as those shown in Fig. 2.2 and 2.3. This process, which occurs within the feature
extraction module in Fig. 2.1, is visualized in Fig. 2.4.

Fig. 2.4. Visualization of the feature extraction stage. The top row indicates processing steps,
the bottom row shows examples for corresponding data.

2.1.2.1 Hand Localization


The identification of foreground or target regions constitutes an interpretation of the
image based on knowledge which is usually specific to the application scenario. This
knowledge can be encoded explicitly (as a set of rules) or implicitly (in a histogram,
a neural network, etc.). Known properties of the target object, such as shape, size,
or color, can be exploited. In gesture recognition, color is the most frequently used
feature for hand localization since shape and size of the hand’s projection in the two-
dimensional image plane vary greatly. It is also the only feature explicitly stored in
the image.
Using the color attribute to localize an object in the image requires a definition
of the object’s color or colors. In the RGB color model (and most others), even
objects that one would call unicolored usually occupy a range of numerical values.
This range can be described statistically using a three-dimensional discrete histogram
hobject(r, g, b), with the dimensions corresponding to the red, green, and blue com-
ponents.
hobject is computed from a sufficiently large number of object pixels that are
usually marked manually in a set of source images that cover all setups in which the
system is intended to be used (e.g. multiple users, varying lighting conditions, etc.).
Its value at (r, g, b) indicates the number of pixels with the corresponding color. The
total sum of hobject over all colors is therefore equal to the number of considered
object pixels nobject, i.e.

Σ_r Σ_g Σ_b hobject(r, g, b) = nobject     (2.5)

Because the encoded knowledge is central to the localization task, the creation of
hobject is an important part of the system’s development. It is exemplified below for
a single frame.
Fig. 2.5 shows the source image of a hand (a) and a corresponding, manually
generated binary mask (b) that indicates object pixels (white) and background pixels
(black). In (c) the histogram computed from (a), considering only object pixels (those
that are white in (b)), is shown. The three-dimensional view uses points to indicate
colors that occurred with a certain minimum frequency. The three one-dimensional
graphs to the right show the projection onto the red, green, and blue axis. Not sur-
prisingly, red is the dominant skin color component.

(a) Source image (b) Object mask (c) Object color histogram

Fig. 2.5. From a source image (a) and a corresponding, manually generated object mask (b),
an object color histogram (c) is computed.

On the basis of hobject, color-based object detection can now be performed in
newly acquired images of the object. The aim is to compute, from a pixel’s color,
a probability or belief value indicating its likeliness of representing a part of the
target object. This value is obtained for every pixel (x, y) and stored in a probability
image Iobject(x, y) with the same dimensions as the analyzed image. The following
paragraphs derive the required stochastic equations.
Given an object pixel, the probability of it having a certain color (r, g, b) can be
computed from hobject as

P (r, g, b|object) = hobject(r, g, b) / nobject     (2.6)
By creating a complementary histogram hbg of the background colors from a
total number of nbg background pixels the accordant probability for a background
pixel is obtained in the same way:

P (r, g, b|bg) = hbg(r, g, b) / nbg     (2.7)
Applying Bayes’ rule, the probability of any pixel representing a part of the ob-
ject can be computed from its color (r, g, b) using (2.6) and (2.7):

P (object|r, g, b) = [P (r, g, b|object) · P (object)] / [P (r, g, b|object) · P (object) + P (r, g, b|bg) · P (bg)]     (2.8)

P (object) and P (bg) denote the a priori object and background probabilities,
respectively, with P (object) + P (bg) = 1.
Using (2.8), the object probability image Iobj,prob is created from I as

Iobj,prob(x, y) = P (object|I(x, y)) (2.9)


To classify each pixel as either background or target, an object probability thresh-
old Θ is defined. Probabilities equal to or above Θ are considered target, while all
others constitute the background. A data structure suitable for representing this clas-
sification is a binary mask


Iobj,mask(x, y) = 1 if Iobj,prob(x, y) ≥ Θ (target),   0 otherwise (background)     (2.10)

Iobj,mask constitutes a threshold segmentation of the source image because it
partitions I into target and background regions.
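A possible transcription of (2.6)–(2.10) is sketched below. The histogram layout (a flat array with a configurable number of bins per channel, see also the implementational notes at the end of this section) and all names are assumptions of this sketch; the threshold Θ is simply passed in, its choice being discussed next.

// Sketch of Eqs. (2.6)-(2.10): per-pixel object probability from color
// histograms and threshold segmentation. Layout and names are choices of
// this sketch only.
#include <cstddef>
#include <cstdint>
#include <vector>

struct RgbPixel { std::uint8_t r, g, b; };

struct ColorHistogram {
  int bins;                              // bins per channel, e.g. 32
  std::vector<double> count;             // bins^3 cells
  double total = 0.0;                    // n_object or n_bg

  explicit ColorHistogram(int b) : bins(b), count(b * b * b, 0.0) {}

  int index(RgbPixel p) const {          // map 0..255 to 0..bins-1 per channel
    const int div = 256 / bins;
    return (p.r / div) * bins * bins + (p.g / div) * bins + (p.b / div);
  }
  void add(RgbPixel p) { count[index(p)] += 1.0; total += 1.0; }

  double probability(RgbPixel p) const { // P(r,g,b | class), Eqs. (2.6)/(2.7)
    return total > 0.0 ? count[index(p)] / total : 0.0;
  }
};

// Eq. (2.8): P(object | r,g,b) via Bayes' rule; pObject is the prior P(object).
double objectProbability(RgbPixel p, const ColorHistogram& hObject,
                         const ColorHistogram& hBg, double pObject = 0.5) {
  const double likeObj = hObject.probability(p) * pObject;
  const double likeBg  = hBg.probability(p) * (1.0 - pObject);
  const double denom   = likeObj + likeBg;
  return denom > 0.0 ? likeObj / denom : 0.0;
}

// Eqs. (2.9)/(2.10): probability image and binary mask for a whole frame,
// given as a row-major pixel array.
void segment(const std::vector<RgbPixel>& image,
             const ColorHistogram& hObject, const ColorHistogram& hBg,
             double theta,
             std::vector<float>& probability,          // I_obj,prob
             std::vector<std::uint8_t>& mask) {        // I_obj,mask
  probability.resize(image.size());
  mask.resize(image.size());
  for (std::size_t i = 0; i < image.size(); ++i) {
    probability[i] = static_cast<float>(objectProbability(image[i], hObject, hBg));
    mask[i] = probability[i] >= theta ? 1 : 0;
  }
}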
Rewriting (2.8) as (2.11), it can be seen that varying P (object) and P (bg) is
equivalent to a non-linear scaling of P (object|r, g, b) and does not affect Iobj,mask
when Θ is scaled accordingly (though it may improve contrast in the visualization
of Iobj,prob ). In other words, for any prior probabilities P (object), P (bg) = 0, all
possible segmentations can be obtained by varying Θ from 0 to 1. It is therefore
practical to choose arbitrary values, e.g. P (object) = P (bg) = 0.5. Obviously, the
value computed for P (object|r, g, b) will then no longer reflect an actual probability,
but a belief value (for simplicity, the term “probability” will still be used in either
case). This is usually easier to handle than the computation of exact values for
P (object) and P (bg).

P (object|r, g, b) = [ 1 + (P (r, g, b|bg) / P (r, g, b|object)) · (P (bg) / P (object)) ]⁻¹     (2.11)

A suitable choice of Θ is crucial for an accurate discrimination between background
and target. In case the recording conditions remain constant and are known
in advance, Θ can be set manually, but for uncontrolled environments an automatic
determination is desirable. Two different strategies are presented below.
If no knowledge of the target object’s shape and/or position is available, the fol-
lowing iterative low-level algorithm presented in [24] produces good results in a
variety of conditions. It assumes the histogram of Iobj,prob to be bi-modal but can
also be applied if this is not the case.

1. Arbitrarily define a set of background pixels (some usually have a high a priori back-
ground probability, e.g. the four corners of the image). All other pixels are defined as
foreground. This constitutes an initial classification.
2. Compute the mean values for foreground and background, μobject and μbg, based on
the most recent classification. If the mean values are identical to those computed in the
previous iteration, halt.
3. Compute a new threshold Θ = ½ (μobject + μbg) and perform another classification of
all pixels, then goto step 2.

Fig. 2.6. Automatic computation of the object probability threshold Θ without the use of high-
level knowledge (presented in [24]).
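A sketch of this iterative procedure, operating on an object probability image stored as a flat array of values in [0, 1], could look as follows; the choice of the four corner pixels as the initial background set is one of the arbitrary initializations the algorithm allows.

// Sketch of the iterative threshold selection of Fig. 2.6.
#include <cstddef>
#include <vector>

double automaticThreshold(const std::vector<float>& prob, int width, int height) {
  // Step 1: initial classification - the four corner pixels are background,
  // everything else is foreground.
  std::vector<char> isForeground(prob.size(), 1);
  const std::size_t corners[4] = {
      0u, static_cast<std::size_t>(width - 1),
      static_cast<std::size_t>((height - 1) * width),
      static_cast<std::size_t>(height * width - 1)};
  for (std::size_t c : corners) isForeground[c] = 0;

  double theta = 0.5;
  double prevMuObj = -1.0, prevMuBg = -1.0;

  for (;;) {
    // Step 2: class means based on the most recent classification.
    double sumObj = 0.0, sumBg = 0.0;
    std::size_t nObj = 0, nBg = 0;
    for (std::size_t i = 0; i < prob.size(); ++i) {
      if (isForeground[i]) { sumObj += prob[i]; ++nObj; }
      else                 { sumBg  += prob[i]; ++nBg;  }
    }
    const double muObj = nObj ? sumObj / nObj : 0.0;
    const double muBg  = nBg  ? sumBg  / nBg  : 0.0;
    if (muObj == prevMuObj && muBg == prevMuBg) break;   // means unchanged: halt
    prevMuObj = muObj; prevMuBg = muBg;

    // Step 3: new threshold and reclassification of all pixels.
    theta = 0.5 * (muObj + muBg);
    for (std::size_t i = 0; i < prob.size(); ++i)
      isForeground[i] = prob[i] >= theta ? 1 : 0;
  }
  return theta;
}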

In case the target’s approximate location and geometry are known, humans can
usually identify a suitable threshold just by observation of the corresponding object
mask. This approach can be implemented by defining an expected target shape, cre-
ating several object masks using different thresholds, and choosing the one which
yields the most similar shape. This requires a feature extraction as described in
Sect. 2.1.2 and a metric to quantify the deviation of a candidate shape’s features
from those of the expected target shape.
An example for hand localization can be seen in Fig. 2.7. Using the skin color
histogram shown in Fig. 2.5 and a generic background histogram hbg , a skin proba-
bility image (b) was computed from a source image (a) of the same hand (and some
clutter) as shown in Fig. 2.5, using P (object) = P (bg) = 0.5. For the purpose of
visualization, three different thresholds Θ1 < Θ2 < Θ3 were then applied, resulting
in three different binary masks (c, d, e). In this example none of the binary masks

3
For simplicity, the term “probability” will still be used in either case.
18 Non-Intrusive Acquisition of Human Action

is entirely correct: For Θ1 , numerous background regions (such as the bottle cap in
the bottom right corner) are classified as foreground, while Θ3 leads to holes in the
object, especially at its borders. Θ2 , which was computed automatically using the
algorithm described in Fig. 2.6, is a compromise that might be considered optimal
for many applications, such as the example systems to be developed in this section.


Fig. 2.7. A source image (a) and the corresponding skin probability image (b). An ob-
ject/background classification was performed for three different thresholds Θ1 < Θ2 < Θ3
(c-e).

While the human perception of an object’s color is largely independent of the
current illumination (an effect called color constancy), colors recorded by a camera
are strongly influenced by illumination and hardware characteristics. This restricts
the use of histograms to the recording conditions under which their source data was
created, or necessitates the application of color constancy algorithms [7, 9].4
4. In many digital cameras, color constancy can be achieved by performing a white balance:
the camera is pointed to a white object (such as a blank sheet of paper), and a transformation
is performed so that this object appears colorless (r = g = b) in the image as well. The
transformation parameters are then stored for further exposures under the same illumination
conditions.

In [12], a skin and a non-skin color histogram are presented that were created
from several thousand images found on the WWW, covering a multitude of skin
hues, lighting conditions, and camera hardware. Thereby, these histograms implic-
itly provide user, illumination, and camera independence, at the cost of a comparably
high probability for non-skin objects being classified as skin (false alarm). Fig. 2.8
shows the skin probability image (a) and binary masks (b, c, d) for the source im-
age shown in Fig. 2.7a, computed using the histograms from [12] and three different
thresholds Θ4 < Θ5 < Θ6 . As described above, P (object) = P (bg) = 0.5 was
chosen. Compared to 2.7b, the coins, the pens, the bottle cap, and the shadow around
the text marker have significantly higher skin probability, with the pens even exceed-
ing parts of the fingers and the thumb (d). This is a fundamental problem that cannot
be solved by a simple threshold classification because there is no threshold Θ that
would achieve a correct result. Unless the histograms can be modified to reduce the
number of false alarms, the subsequent processing stages must therefore be designed
to handle this problem.


Fig. 2.8. Skin probability image (a) for the source image of Fig. 2.7a and object/background
classification for three thresholds Θ4 < Θ5 < Θ6 (b, c, d).

Implementational Aspects

In practical applications the colors r, g, and b commonly have a resolution of 8 bit
each, i.e.

r, g, b ∈ {0, 1, . . . , 255} (2.12)


A corresponding histogram would have to have 256³ = 16,777,216 cells, each
storing a floating point number of typically 4 bytes in size, totaling 64 MB. To reduce
these memory requirements, the color values are downsampled to e.g. 5 bit through
a simple integer division by 2³ = 8:

r′ = ⌊r/8⌋,   g′ = ⌊g/8⌋,   b′ = ⌊b/8⌋     (2.13)
This results in a histogram h′(r′, g′, b′) with 32³ = 32,768 cells that requires
only 128 kB of memory. The entailed loss in accuracy is negligible in most applications,
especially when consumer cameras with high noise levels are employed.
Usually the downsampling is even advantageous because it constitutes a general-
ization and therefore reduces the amount of data required to create a representative
histogram.
The LTI-Lib class skinProbabilityMap comes with the histograms presented
in [12]. The algorithm described in Fig. 2.6 is implemented in the class
optimalThresholding.
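As a concrete illustration of the downsampling, the following stand-alone sketch builds a 32 × 32 × 32 object histogram from a training image and its manually created object mask (cf. Fig. 2.5); it does not reflect the skinProbabilityMap interface.

// Building a downsampled (5 bit per channel) object color histogram
// h'(r', g', b') from a training image and its object mask.
#include <cstddef>
#include <cstdint>
#include <vector>

struct RgbPixel { std::uint8_t r, g, b; };

std::vector<float> buildObjectHistogram(const std::vector<RgbPixel>& image,      // row-major pixels
                                        const std::vector<std::uint8_t>& mask,   // 1 = object pixel
                                        double& nObject) {
  std::vector<float> hist(32 * 32 * 32, 0.0f);   // 32^3 = 32,768 cells (128 kB with 4-byte floats)
  nObject = 0.0;
  for (std::size_t i = 0; i < image.size(); ++i) {
    if (!mask[i]) continue;                      // count object pixels only
    const int r = image[i].r / 8;                // Eq. (2.13): integer division by 2^3 = 8
    const int g = image[i].g / 8;
    const int b = image[i].b / 8;
    hist[(r * 32 + g) * 32 + b] += 1.0f;
    nObject += 1.0;
  }
  return hist;                                   // divide by nObject to obtain Eq. (2.6)
}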
2.1.2.2 Region Description
On the basis of Iobj,mask the source image I can be partitioned into regions. A region
R is a contiguous set of pixels p for which Iobj,mask has the same value. The concept
of contiguity requires a definition of pixel adjacency. As depicted in Fig. 2.9, adja-
cency may be based on either a 4-neighborhood or an 8-neighborhood. In this sec-
tion the 8-neighborhood will be used. In general, regions may contain other regions
and/or holes, but this shall not be considered here because it is of minor importance
in most gesture recognition applications.

Fig. 2.9. Adjacent pixels (gray) in the 4-neighborhood (a) and 8-neighborhood (b) of a refer-
ence pixel (black).

Under laboratory recording conditions, Iobj,mask will contain exactly one target
region. However, in many real world applications, other skin colored objects may
be visible as well, so that Iobj,mask typically contains multiple target regions (see
e.g. Fig. 2.8). Unless advanced reasoning strategies (such as described in [23, 28])
are used, the feature extraction stage is responsible for identifying the region that
represents the user’s hand, possibly among a multitude of candidates. This is done
on the basis of the regions’ geometric features.
The calculation of these features is performed by first creating an explicit descrip-
tion of each region contained in Iobj,mask. Various algorithms exist that generate, for
every region, a list of all of its pixels. A more compact and computationally more
efficient representation of a region can be obtained by storing only its border points
(which are, in general, significantly fewer). A border point is conveniently defined
as a pixel p ∈ R that has at least one pixel q ∉ R within its 4-neighborhood. An
example region (a) and its border points (b) are shown in Fig. 2.10.
In a counterclockwise traversal of the object’s border, every border point has a
predecessor and a successor within its 8-neighborhood. An efficient data structure
for the representation of a region is a sorted list of its border points. This can be

(a) Region pixels (b) Border points (c) Polygon

Fig. 2.10. Description of an image region by the set of all of its pixels (a), a list of its bor-
der pixels (b), and a closed polygon (c). Light gray pixels in (b) and (c) are not part of the
respective representation, but are shown for comparison with (a).

interpreted as a closed polygon whose vertices are the centers of the border points
(Fig. 2.10c). In the following, the object’s border is defined to be this polygon. This
definition has the advantages of being sub-pixel accurate and facilitating efficient
computation of various shape-based geometric features (as described in the next sec-
tion).5
Finding the border points of a region is not as straightforward as identifying its
pixels. Fig. 2.11 shows an algorithm that processes Iobj,mask and computes, for every
region R, a list of its border points

BR = {(x0 , y0 ), (x1 , y1 ), . . . , (xn−1 , yn−1 )} (2.14)


Fig. 2.12 (a, b, c) shows the borders computed by this algorithm from the masks
shown in Fig. 2.8 (b, c, d). In areas where the skin probability approaches the thresh-
old Θ, the borders become jagged due to the random nature of the input data, as can
be seen especially in (b) and (c). This effect causes an increase in border length that
is random as well, which is undesirable because it reduces the information content
of the computed border length value (similar shapes may have substantially different
border lengths, rendering this feature less useful for recognition). Preceding the seg-
mentation (2.10) by a convolution of Iobj,prob with a Gaussian kernel alleviates this
problem by dampening high frequencies in the input data, providing smooth borders
as in Fig. 2.12 (d, e, f).
The LTI-Lib uses the classes borderPoints and polygonPoints (among
others) to represent image regions. The class objectsFromMask contains an
algorithm that is comparable to the one presented in Fig. 2.11, but can also detect
holes within regions, and further regions within these holes.

5. It should be noted that the polygon border definition leads to the border pixels no longer
being considered completely part of the object. Compared to other, non-sub-pixel accurate
interpretations, this may lead to slight differences in object area that are noticeable only for
very small objects.

1. Create a helper matrix m with the same dimensions as Iobj,mask and initialize
all entries to 0. Define (x, y) as the current coordinates and initialize them to
(0, 0). Define (x′, y′) and (x″, y″) as temporary coordinates.
2. Iterate from left to right through all image rows successively, starting at y = 0.
If m(x, y) = 0 ∧ Iobj,mask(x, y) = 1, goto step 3. If m(x, y) = 1, ignore
this pixel and continue with the next pixel (m(x, y) = 1 indicates touching the
border of a region that was already processed). If m(x, y) = 2, increase x until
m(x, y) = 2 again, ignoring the whole range of pixels (m(x, y) = 2 indicates
crossing the border of a region that was already processed).
3. Create a list B of border points and store (x, y) as the first element.
4. Set (x′, y′) = (x, y).
5. Scan the 8-neighborhood of (x′, y′) using (x″, y″), starting at the pixel that
follows the butlast pixel stored in B in a counterclockwise orientation, or at
(x′ − 1, y′ − 1) if B contains only one pixel. Proceed counterclockwise, skipping
coordinates that lie outside of the image, until Iobj,mask(x″, y″) = 1. If
(x″, y″) is identical with the first element of B, goto step 6. Else store (x″, y″)
in B. Set (x′, y′) = (x″, y″) and goto step 5.
6. Iterate through B, considering, for every element (xi, yi), its predecessor
(xi−1, yi−1) and its successor (xi+1, yi+1). The first element is the successor
of the last, which is the predecessor of the first. If yi−1 = yi+1 = yi, set
m(xi, yi) = 1 to indicate that the border touches the line y = yi at xi. Otherwise,
if yi−1 ≠ yi ∨ yi ≠ yi+1, set m(xi, yi) = 2 to indicate that the border
intersects with the line y = yi at xi.
7. Add B to the list of computed borders and proceed with step 2.
Fig. 2.11. Algorithm to find the border points of all regions in an image.
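A full transcription of Fig. 2.11 is somewhat lengthy; the following much-simplified sketch merely collects the (unordered) border points of all foreground regions according to the 4-neighborhood definition given above, without ordering them into closed polygons per region as the algorithm of Fig. 2.11 does.

// Simplified sketch: unordered border points of a binary mask. A border
// point is a foreground pixel with at least one background 4-neighbor;
// pixels outside the image are treated as background.
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<int, int>> borderPoints(const std::vector<std::uint8_t>& mask,
                                              int width, int height) {
  auto at = [&](int x, int y) -> int {
    if (x < 0 || y < 0 || x >= width || y >= height) return 0;
    return mask[y * width + x];
  };
  std::vector<std::pair<int, int>> border;
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
      if (at(x, y) &&                              // foreground pixel ...
          (!at(x - 1, y) || !at(x + 1, y) ||       // ... with at least one
           !at(x, y - 1) || !at(x, y + 1)))        //     background 4-neighbor
        border.push_back({x, y});
  return border;
}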

2.1.2.3 Geometric Features


From the multitude of features that can be computed for a closed polygon, only a
subset is suitable for a concrete recognition task. A feature is suitable if it has a
high inter-gesture variance (varies significantly between different gestures) and a
low intra-gesture variance (varies little between multiple productions of the same
gesture). The first property means that the feature carries much information, while
the second indicates that it is not significantly affected by noise or unintentional
variations that will inevitably occur. Additionally, every feature should be stable,
meaning that small changes in input data never result in large changes in the feature.
Finally the feature must be computable with sufficient accuracy and speed. A certain
feature’s suitability thus depends on the constellation of the vocabulary (because this
affects inter- and intra-gesture variance) and on the actual application scenario in
terms of recording conditions, hardware etc. It is therefore a common approach to
first compute as many features as possible and then examine each one with respect
to its suitability for the specific recognition task (see also Sect. 2.1.3.3).
The choice of features is of central importance since it affects system design
throughout the processing chain. The final system’s performance depends signifi-
cantly on the chosen features and their properties.
This section presents a selection of frequently used features and the equations
to calculate them. Additionally, the way each feature is affected by the camera’s
perspective, resolution, and distance to the object is discussed, for these are often


Fig. 2.12. a, b, c: Borders of the regions shown in Fig. 2.8 (b, c, d), computed by the algorithm
described in Fig. 2.11. d, e, f: Additional smoothing with a Gaussian kernel before segmentation.

important factors in practical applications. In the image domain, changing the cam-
era’s perspective results in rotation and/or translation of the object, while resolution
and object distance affect the object’s scale (in terms of pixels). Since the error intro-
duced in the shape representation by the discretization of the image (the discretiza-
tion noise) is itself affected by image resolution, rotation, translation, and scaling of
the shape, features can, in a strict sense, be invariant of these transformations only in
continuous space. All statements regarding transformation invariance therefore refer
to continuous space. In discretized images, features declared “invariant” may still
show small variance. This can usually be ignored unless the shape’s size is on the
order of several pixels.

Border Length

The border length l can trivially be computed from B, considering that the distance
between two successive border points is 1 if either their x or y coordinates are equal,
and √2 otherwise. l depends on scale/resolution, and is translation and rotation
invariant.

Area, Center of Gravity, and Second Order Moments

In [25] an efficient algorithm for the computation of arbitrary moments νp,q of poly-
gons is presented. The area a = ν0,0, as well as the normalized moments αp,q
and central moments μp,q up to order 2, including the center of gravity (COG)
(xcog , ycog ) = (α1,0 , α0,1 ), are obtained as shown in (2.15) to (2.23). xi and yi
with i = 0, 1, . . . , n − 1 refer to the elements of BR as defined in (2.14). Since these
equations require the polygon to be closed, xn = x0 and yn = y0 is defined for
i = n.

ν0,0 = a = (1/2) Σ_{i=1..n} (xi−1 yi − xi yi−1)     (2.15)

α1,0 = xcog = 1/(6a) Σ_{i=1..n} (xi−1 yi − xi yi−1)(xi−1 + xi)     (2.16)

α0,1 = ycog = 1/(6a) Σ_{i=1..n} (xi−1 yi − xi yi−1)(yi−1 + yi)     (2.17)

α2,0 = 1/(12a) Σ_{i=1..n} (xi−1 yi − xi yi−1)(x²i−1 + xi−1 xi + x²i)     (2.18)

α1,1 = 1/(24a) Σ_{i=1..n} (xi−1 yi − xi yi−1)(2 xi−1 yi−1 + xi−1 yi + xi yi−1 + 2 xi yi)     (2.19)

α0,2 = 1/(12a) Σ_{i=1..n} (xi−1 yi − xi yi−1)(y²i−1 + yi−1 yi + y²i)     (2.20)

μ2,0 = α2,0 − α²1,0     (2.21)

μ1,1 = α1,1 − α1,0 α0,1     (2.22)

μ0,2 = α0,2 − α²0,1     (2.23)

The area a depends on scale/resolution, and is independent of translation and rotation.
The center of gravity (xcog, ycog) is obviously translation variant and depends
on resolution. It is rotation invariant only if the rotation occurs around (xcog , ycog ) it-
self, and affected by scaling (i.e. changing the object’s distance to the camera) unless
its angle to the optical axis remains constant.
The second order moments (p+q = 2) are primarily used to compute other shape
descriptors, as described below.
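Equations (2.15)–(2.23) translate directly into code. The following stand-alone sketch (its naming is not that of the LTI-Lib class geometricFeatures) computes the area, the center of gravity, and the central moments from the border polygon; if the border happens to be traversed clockwise, the signed area comes out negative, so the signs are flipped in that case.

// Area, center of gravity, and second order central moments of a closed
// polygon, transcribed from Eqs. (2.15)-(2.23). B holds the ordered border
// points; the polygon is closed implicitly (x_n = x_0, y_n = y_0).
#include <cstddef>
#include <vector>

struct Point { double x, y; };

struct Moments {
  double a;                          // area, Eq. (2.15)
  double xcog, ycog;                 // center of gravity, Eqs. (2.16)/(2.17)
  double mu20, mu11, mu02;           // central moments, Eqs. (2.21)-(2.23)
};

Moments polygonMoments(const std::vector<Point>& B) {
  double a = 0, a10 = 0, a01 = 0, a20 = 0, a11 = 0, a02 = 0;
  const std::size_t n = B.size();
  for (std::size_t i = 0; i < n; ++i) {
    const Point& p = B[i];                        // (x_i, y_i)
    const Point& q = B[(i + n - 1) % n];          // (x_{i-1}, y_{i-1})
    const double c = q.x * p.y - p.x * q.y;       // x_{i-1} y_i - x_i y_{i-1}
    a   += c;
    a10 += c * (q.x + p.x);
    a01 += c * (q.y + p.y);
    a20 += c * (q.x * q.x + q.x * p.x + p.x * p.x);
    a11 += c * (2 * q.x * q.y + q.x * p.y + p.x * q.y + 2 * p.x * p.y);
    a02 += c * (q.y * q.y + q.y * p.y + p.y * p.y);
  }
  a *= 0.5;                                       // nu_{0,0}
  if (a < 0) {                                    // clockwise traversal: flip signs
    a = -a; a10 = -a10; a01 = -a01; a20 = -a20; a11 = -a11; a02 = -a02;
  }
  a10 /= 6 * a;  a01 /= 6 * a;                    // alpha_{1,0}, alpha_{0,1}
  a20 /= 12 * a; a11 /= 24 * a; a02 /= 12 * a;    // alpha_{2,0}, alpha_{1,1}, alpha_{0,2}
  return { a, a10, a01,
           a20 - a10 * a10,                       // mu_{2,0}, Eq. (2.21)
           a11 - a10 * a01,                       // mu_{1,1}, Eq. (2.22)
           a02 - a01 * a01 };                     // mu_{0,2}, Eq. (2.23)
}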

Eccentricity

One possible measure for eccentricity e that is based on central moments is given
in [10]:

e = [ (μ2,0 − μ0,2)² + 4μ²1,1 ] / a     (2.24)
(2.24) is zero for circular shapes and increases for elongated shapes. It is rotation
and translation invariant, but variant with respect to scale/resolution.

Another intuitive measure is based on the object’s inertia along and perpendicular
to its main axis, j1 and j2 , respectively:
e = j1 / j2     (2.25)

with

j1 = (μ2,0 + μ0,2)/(2a) + √[ ((μ2,0 − μ0,2)/(2a))² + (μ1,1/a)² ]     (2.26)

j2 = (μ2,0 + μ0,2)/(2a) − √[ ((μ2,0 − μ0,2)/(2a))² + (μ1,1/a)² ]     (2.27)

Here e starts at 1 for circular shapes, increases for elongated shapes, and is in-
variant of translation, rotation, and scale/resolution.

Orientation

The region’s main axis is defined as the axis of the least moment of inertia. Its orien-
tation α is given by

α = (1/2) arctan( 2μ1,1 / (μ2,0 − μ0,2) )     (2.28)
and corresponds to what one would intuitively call “the object’s orientation”. Ori-
entation is translation and scale/resolution invariant. α carries most information for
elongated shapes and becomes increasingly random for circular shapes. It is 180◦ -
periodic, which necessitates special handling in most cases. For example, the angle
by which an object with α1 = 10◦ has to be rotated to align with an object with
α2 = 170◦ is 20◦ rather than |α1 − α2 | = 160◦ . When using orientation as a feature
for classification, a simple workaround to this problem is to ensure 0 ≤ α < 180◦
and, rather than use α directly as a feature, compute two new features f1 (α) = sin α
and f2 (α) = cos 2α instead, so that f1 (0) = f1 (180◦ ) and f2 (0) = f2 (180◦ ).

Compactness

A shape’s compactness c ∈ [0, 1] is defined as


c = 4πa / l²     (2.29)
Compact shapes (c → 1) have short borders l that contain a large area a. The
most compact shape is a circle (c = 1), while for elongated or frayed shapes, c → 0.
Compactness is rotation, translation, and scale/resolution invariant.
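Given the central moments from the sketch above, eccentricity (2.25)–(2.27), orientation (2.28), and compactness (2.29) follow in a few lines; the function names are again this sketch's own. Using atan2 keeps the orientation well defined when μ2,0 = μ0,2.

// Eccentricity (Eqs. 2.25-2.27), orientation (2.28), and compactness (2.29)
// from area a, border length l, and central moments mu20, mu11, mu02.
#include <cmath>

double eccentricity(double a, double mu20, double mu11, double mu02) {
  const double mean = (mu20 + mu02) / (2 * a);
  const double diff = (mu20 - mu02) / (2 * a);
  const double root = std::sqrt(diff * diff + (mu11 / a) * (mu11 / a));
  const double j1 = mean + root;            // inertia along the main axis
  const double j2 = mean - root;            // inertia perpendicular to it
  return j1 / j2;                           // 1 for a circle; grows for elongated shapes
}

double orientation(double mu20, double mu11, double mu02) {
  return 0.5 * std::atan2(2 * mu11, mu20 - mu02);   // Eq. (2.28), in radians
}

double compactness(double a, double l) {
  const double pi = 3.14159265358979323846;
  return 4 * pi * a / (l * l);              // 1 for a circle, -> 0 for frayed shapes
}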

Border Features

In addition to the border length l, the minimum and maximum pixel coordinates
of the object, xmin , xmax , ymin and ymax , as well as the minimum and maximum
distance from the center of gravity to the border, rmin and rmax , can be calculated
from B. Minimum and maximum coordinates are not invariant to any transformation.
rmin and rmax are invariant to translation and rotation, and variant to scale/resolution.

Normalization

Some of the above features can only be used in a real-life application if a suitable
normalization is performed to eliminate translation and scale variance. For example,
the user might perform gestures at different locations in the image and at different
distances from the camera.
Which normalization strategy yields the best results depends on the actual appli-
cation and recording conditions, and is commonly found empirically. Normalization
is frequently neglected, even though it is an essential part of feature extraction. The
following paragraphs present several suggestions. In the remainder of this section,
the result of the normalization of a feature f is designated by f  .
For both static and dynamic gestures, the user’s face can serve as a reference
for hand size and location if it is visible and a sufficiently reliable face detection is
available.
In case different resolutions with identical aspect ratio are to be supported, reso-
lution independence can be achieved by specifying lengths and coordinates relative
to the image width N (x and y must be normalized by the same value to preserve
image geometry) and area relative to N².
If a direct normalization is not feasible, invariance can also be achieved by com-
puting a new feature from two or more unnormalized features. For example, from
xcog , xmin , and xmax , a resolution and translation invariant feature xp can be com-
puted as
xp = (xmax − xcog) / (xcog − xmin)     (2.30)
xp specifies the ratio of the longest horizontal protrusion, measured from the
center of gravity, to the right and to the left.
For dynamic gestures, several additional methods can be used to compute a normalized
feature f′(t) from an unnormalized feature f(t) without any additional
information (a code sketch of these normalizations follows the list):
• Subtracting f(1) so that f′(1) = 0:

f′(t) = f(t) − f(1)     (2.31)


• Subtracting the arithmetic mean f̄ or median fmedian of f:

f′(t) = f(t) − f̄     (2.32)

f′(t) = f(t) − fmedian     (2.33)

• Mapping f to a fixed interval, e.g. [0, 1]:

f′(t) = ( f(t) − min_t f(t) ) / ( max_t f(t) − min_t f(t) )     (2.34)
• Dividing f by max_t |f(t)| so that |f′(t)| ≤ 1:

f′(t) = f(t) / max_t |f(t)|     (2.35)
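A minimal sketch of these normalizations, applied to a feature sequence stored as a vector of samples (the median variant (2.33) is analogous to the mean variant and omitted for brevity):

// Normalizations (2.31), (2.32), (2.34) and (2.35) of a feature sequence f(t).
#include <algorithm>
#include <cmath>
#include <vector>

using Series = std::vector<double>;

Series subtractFirst(Series f) {                 // Eq. (2.31): f'(1) = 0
  const double f1 = f.front();
  for (double& v : f) v -= f1;
  return f;
}

Series subtractMean(Series f) {                  // Eq. (2.32): zero-mean sequence
  double mean = 0;
  for (double v : f) mean += v;
  mean /= f.size();
  for (double& v : f) v -= mean;
  return f;
}

Series mapToUnitInterval(Series f) {             // Eq. (2.34): values in [0, 1]
  const auto minmax = std::minmax_element(f.begin(), f.end());
  const double lo = *minmax.first, range = *minmax.second - lo;
  for (double& v : f) v = range > 0 ? (v - lo) / range : 0.0;
  return f;
}

Series divideByMaxAbs(Series f) {                // Eq. (2.35): |f'(t)| <= 1
  double m = 0;
  for (double v : f) m = std::max(m, std::abs(v));
  if (m > 0) for (double& v : f) v /= m;
  return f;
}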
Derivatives
In features computed for dynamic gestures, invariance to a constant offset can also
be achieved by differentiation. Computing the derivative f˙(t) of a feature f(t) and using
it as an additional element in the feature vector to emphasize changes in f(t) can
sometimes be a simple yet effective method to improve classification performance.
2.1.2.4 Example
To illustrate the use of the algorithms described above, the static example gestures
introduced in Fig. 2.2 were segmented as shown in Fig. 2.13, using the automatic
algorithms presented in Fig. 2.6 and 2.11, and features were computed as shown
in Tab. 2.5, using the LTI-Lib class geometricFeatures. For resolution independence,
lengths and coordinates are normalized by the image width N, and area
is normalized by N². α specifies the angle by which the object would have to be
rotated to align with the x axis.

(a) “left” (b) “right” (c) “stop” (d) none

Fig. 2.13. Boundaries of the static hand gestures shown in Fig. 2.2. Corresponding features
are shown in Tab. 2.5.

For the dynamic example gestures “clockwise”, “open”, and “grab”, the features
area a, compactness c, and x coordinate of the center of gravity xcog were computed.
Since a depends on the hand’s anatomy and its distance to the camera, it was nor-
malized by its maximum (see (2.35)) to eliminate this dependency. xcog was divided
by N to eliminate resolution dependency, and additionally normalized according
to (2.32) so that x′cog has zero mean. This allows the gestures to be performed anywhere in
the image. The resulting plots for normalized area a′, compactness c (which does not
require normalization), and normalized x coordinate of the center of gravity x′cog are
visualized in Fig. 2.14.
The plots are scaled equally for each feature to allow a direct comparison. Since
“counterclockwise”, “close”, and “drop” are backwards executions of the above ges-
tures, their feature plots are simply temporally mirrored versions of these plots.

Table 2.5. Features computed from the boundaries shown in Fig. 2.13.

Feature                            Symbol    “left”    “right”   “stop”    none
                                             (2.13a)   (2.13b)   (2.13c)   (2.13d)
Normalized Border Length           l′        1.958     1.705     3.306     1.405
Normalized Area                    a′        0.138     0.115     0.156     0.114
Normalized Center of Gravity       x′cog     0.491     0.560     0.544     0.479
                                   y′cog     0.486     0.527     0.487     0.537
Eccentricity                       e         1.758     1.434     1.722     2.908
Orientation                        α         57.4°     147.4°    61.7°     58.7°
Compactness                        c         0.451     0.498     0.180     0.724
Normalized Min./Max. Coordinates   x′min     0.128     0.359     0.241     0.284
                                   x′max     0.691     0.894     0.881     0.681
                                   y′min     0.256     0.341     0.153     0.325
                                   y′max     0.747     0.747     0.747     0.747
Protrusion Ratio                   xp        0.550     1.664     1.110     1.036

Fig. 2.14. Normalized area a′, compactness c, and normalized x coordinate of the center of
gravity x′cog plotted over time t = 1, 2, . . . , 60 for the dynamic gestures “clockwise” (left
column), “open” (middle column), and “grab” (right column) from the example vocabulary
shown in Fig. 2.3.

2.1.3 Feature Classification

The task of feature classification occurs in all pattern recognition systems and has
been subject of considerable research effort. A large number of algorithms is avail-
able to build classifiers for various requirements. The classifiers considered in this
section operate in two phases: In the training phase the classifier “learns” the vo-
cabulary from a sufficiently large number of representative examples (the training
samples). This “knowledge” is then applied in the following classification phase.
Classifiers that continue to learn even in the classification phase, e.g. to automati-
cally adapt to the current user, shall not be considered here.
2.1.3.1 Classification Concepts
This section introduces several basic concepts and terminology commonly used in
pattern classification, with the particular application to gesture recognition.

Mathematical Description of the Classification Task

From a mathematical point of view, the task of classification is that of identifying an


unknown event ω, given a finite set Ω of n mutually exclusive possible events

ω ∈ Ω = {ω1 , ω2 , . . . , ωn } (2.36)
The elements of Ω are called classes. In gesture recognition Ω is the vocabulary,
and each class ωi (i = 1, 2, . . . , n) represents a gesture. Note that (2.36) restricts the
allowed inputs to elements of Ω. This means that the system is never presented with
an unknown gesture, which influences the classifier’s design and algorithms.
The classifier receives, from the feature extraction stage, an observation O,
which, for example, might be a single feature vector for static gestures, or a sequence
of feature vectors for dynamic gestures. Based on this observation it outputs a result

ω̂ ∈ Ω where ω̂ = ωk , k ∈ {1, 2, . . . , n} (2.37)


k denotes the index of the class of the event ω assumed to be the source of the
observation O. If ω̂ = ω then the classification result is correct, otherwise it is wrong.
To account for the case that O does not bear sufficient similarity to any element
in Ω, one may wish to allow a rejection of O. This is accomplished by introducing a
pseudo-class ω0 and defining a set of classifier outputs Ω̂ as

Ω̂ = Ω ∪ {ω0 } (2.38)
Thus, (2.37) becomes

ω̂ ∈ Ω̂ where ω̂ = ωk , k ∈ {0, 1, . . . , n} (2.39)



Modeling Classification Effects

Several types of classifiers can be designed to consider a cost or loss value L for
classification errors, including rejection. In general, the loss value depends on the
actual input class ω and the classifier’s output ω̂:

L(ω, ω̂) = cost incurred for classifying event of class ω as ω̂ (2.40)

The cost of a correct result is set to zero:

L(ω, ω) = 0 (2.41)
This makes it possible to model applications where certain misclassifications are more
serious than others. For example, let us consider a gesture recognition application that
performs navigation of a menu structure. Misclassifying the gesture for “return to
main menu” as “move cursor down” requires the user to perform “return to main
menu” again. Misclassifying “move cursor down” as “return to main menu”, how-
ever, discards the currently selected menu item, which is a more serious error because
the user will have to navigate to this menu item all over again (assuming that there is
no “undo” or “back” functionality).

Simplifications

Classifiers that consider rejection and loss values are discussed in [22]. For the pur-
pose of this introduction, we will assume that the classifier never rejects its input (i.e.
(2.37) holds) and that the loss incurred by a classification error is a constant value L:

L(ω, ω̂) = L for ω ≠ ω̂     (2.42)


This simplifies the following equations and allows us to focus on the basic clas-
sification principles.

Garbage Classes

An alternative to rejecting the input is to explicitly include garbage classes in Ω
that represent events to which the system should not react. The classifier treats these
classes just like regular classes, but the subsequent stages do not perform any action
when the classification result ω̂ is a garbage class. For example, a gesture recognition
system may be constantly observing the user’s hand holding a steering wheel, but
react only to specific gestures different from steering motions. Garbage classes for
this application would contain movements of the hand along the circular shape of the
steering wheel.

Modes of Classification

We distinguish between supervised classification and unsupervised classification:


• In supervised classification the training samples are labeled, i.e. the class of each
training sample is known. Classification of a new observation is performed by
a comparison with these samples (or models created from them) and yields the
class that best matches the observation, according to a matching criterion.
• In unsupervised classification the training samples are unlabeled. Clustering al-
gorithms are used to group similar samples before classification is performed.
Parameters such as the number of clusters to create or the average cluster size
can be specified to influence the clustering process. The task of labeling the sam-
ples is thus performed by the classifier, which of course may introduce errors.
Classification itself is then performed as above, but returns a cluster index in-
stead of a label.
For gesture recognition, supervised classification is by far the most common
method since the labeling of the training samples can easily be done in the recording
process.

Overfitting

With many classification algorithms, optimizing performance for the training data at
hand bears the risk of reducing performance for other (generic) input data. This is
called overfitting and presents a common problem in machine learning, especially
for small sets of training samples. A classifier with a set of parameters p is said
to overfit the training samples T if there exists another set of parameters p′ that
yields lower performance on T , but higher performance in the actual “real-world”
application [15].
A strategy for avoiding overfitting is to use disjoint sets of samples for training
and testing. This explicitly measures the classifier’s ability to generalize and allows
it to be included in the optimization process. Obviously, the test samples need to be
sufficiently distinct from the training samples for this approach to be effective.
2.1.3.2 Classification Algorithms
From the numerous strategies that exist for finding the best-matching class for an
observation, three will be presented here:
• A simple, easy-to-implement rule-based approach suitable for small vocabularies
of static gestures.
• The concept of maximum likelihood classification, which can be applied to a
multitude of problems.
• Hidden Markov Models (HMMs) for dynamic gestures. HMMs are frequently
used for the classification of various dynamic processes, including speech and
sign language.
Further algorithms, such as artificial neural networks, can be found in the litera-
ture on pattern recognition and machine learning. A good starting point is [15].

2.1.3.3 Feature Selection


The feature classification stage receives, for every frame of input data, one feature
vector from the feature extraction stage. For static gestures this constitutes the com-
plete observation, while dynamic gestures are represented by a sequence of feature
vectors. The constellation of the feature vector (which remains constant throughout
the training and classification phases) is an important design decision that can sig-
nificantly affect the system’s performance.
Most recognition systems do not use all features that, theoretically, could be com-
puted from the input data. Instead, a subset of suitable features is selected using crite-
ria and algorithms described below. Reducing the number of features usually reduces
computational cost (processing time, memory requirements) and system complexity
in both the feature extraction and classification stage. In some cases it may also in-
crease recognition accuracy by eliminating erratic features and thereby emphasizing
the remaining features.
In the feature selection process one or more features are chosen that allow the
reliable identification of each element in the vocabulary. This is provided if the se-
lected features are approximately constant for multiple productions of the same ges-
ture (low intra-gesture variance) but vary significantly between different gestures
(high inter-gesture variance). It is not required that different gestures differ in every
element of the feature vector, but there must be at least one element in the feature
vector that differs beyond the natural (unintentional) intra-gesture variance, segmen-
tation inaccuracy, and noise.
It is obvious that, in order to determine intra- and inter-gesture variance, a feature
has to be computed for numerous productions of every gesture in the vocabulary.
A representative mixture of permitted recording conditions should be used to find
out how each feature is affected e.g. by lighting conditions, image noise etc. For
person-independent gesture recognition this should include multiple persons. Based
on these samples features are then selected either by manual inspection according to
the criteria mentioned above, or by an automatic algorithm that finds a feature vector
which optimizes the performance of the chosen classifier on the available data. Three
simple feature selection algorithms (stepwise forward selection, stepwise backward
elimination, and exhaustive search) that can be used for various classifiers, including
Hidden Markov Models and maximum likelihood classification (see Sect. 2.1.3.5
and 2.1.3.6), are presented in Fig. 2.15.
It is important to realize that although the feature selection resulting from either
manual or automatic analysis may be optimal for the data upon which it is based, this
is not necessarily the case for the actual practical application. The recorded samples
often cannot cover all recording conditions under which the system may be used,
and features that work well on this data may not work under different recording
conditions or may even turn out to be disadvantageous. It may therefore be necessary
to reconsider the feature selection if the actual recognition accuracy falls short of the
expectation.

Stepwise Forward Selection

1. Start with an empty feature selection and a recognition accuracy of zero.


2. For every feature that is not in the current feature selection, measure recogni-
tion accuracy on the current feature selection extended by this feature. Add the
feature for which the resulting recognition accuracy is highest to the current
selection. If no feature increases recognition accuracy, halt.
3. If computational cost or system complexity exceed the acceptable maximum,
remove the last added feature from the selection and halt.
4. If recognition accuracy is sufficient or all features are included in the current
selection, halt.
5. Goto step 2.

Stepwise Backward Elimination

1. Start with a feature selection containing all available features, and measure
recognition accuracy.
2. For every feature in the current feature selection, measure recognition accuracy
on the current feature selection reduced by this feature. Remove the feature for
which the resulting recognition accuracy is highest from the current selection.
3. If computational cost or system complexity exceed the acceptable maximum
and the current selection contains more than one feature, goto step 2. Otherwise,
halt.

Exhaustive Search

1. Let n be the total number of features and m the number of selected features.
2. For every m ∈ {1, 2, . . . , n}, measure recognition accuracy for all $\binom{n}{m}$ possible
feature combinations (this equals a total of $\sum_{i=1}^{n} \binom{n}{i}$ combinations). Discard
all cases where computational cost or system complexity exceed the acceptable
maximum, then choose the selection that yields the highest recognition accu-
racy.
Fig. 2.15. Three algorithms for automatic feature selection. Stepwise forward selection and
stepwise backward elimination are fast heuristic approaches that are not guaranteed to find
the global optimum. This is possible with exhaustive search, albeit at a significantly higher
computational cost.
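To make the first of these procedures concrete, the following Python sketch implements stepwise forward selection around a generic evaluation callback. The function evaluate_accuracy (which trains and tests the chosen classifier on a feature subset) and the max_features bound are placeholders standing in for the application-specific cost and accuracy measurements; they are not part of the systems described in this chapter.

def forward_selection(all_features, evaluate_accuracy, max_features=None, target_accuracy=1.0):
    """Stepwise forward selection as outlined in Fig. 2.15 (illustrative sketch).
    all_features      -- list of candidate feature names
    evaluate_accuracy -- callback mapping a feature list to a recognition accuracy in [0, 1]
    max_features      -- simplified stand-in for the acceptable computational cost"""
    selection, best_accuracy = [], 0.0
    while True:
        candidates = [f for f in all_features if f not in selection]
        if not candidates:                       # step 4: all features already included
            break
        # Step 2: tentatively extend the selection by each remaining feature.
        accuracy, feature = max((evaluate_accuracy(selection + [f]), f) for f in candidates)
        if accuracy <= best_accuracy:            # no feature increases accuracy: halt
            break
        selection.append(feature)
        best_accuracy = accuracy
        # Step 3: roll back if the selection became too expensive.
        if max_features is not None and len(selection) > max_features:
            selection.pop()
            break
        if best_accuracy >= target_accuracy:     # step 4: accuracy is sufficient
            break
    return selection, best_accuracy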

2.1.3.4 Rule-based Classification


A simple heuristic approach to classification is a set of explicit IF-THEN rules that
refer to the target’s features and require them to lie within a certain range that is
typical of a specific gesture. The fact that an IF-THEN rule usually results in just a
single line of source code makes this method straightforward to implement.
For example, to identify the static “stop” gesture (Fig. 2.2c) among various other
gestures, a simple solution would be to use the shape’s compactness c:

IF c < 0.25 THEN the observed gesture is ”stop”. (2.43)



The threshold of 0.25 for c was determined empirically from a data set of multiple
productions of “stop” and all other gestures, performed by different users. On this
data set, max c ≈ 0.23 and c̄ ≈ 0.18 were observed for “stop”, while c ≫ 0.23 for
all other gestures.
Obviously, with increasing vocabulary size, the number and complexity of the rules
grow, and they can quickly become difficult to handle. The manual creation of rules,
as described in this section, is therefore only suitable for very small vocabularies
such as the static example vocabulary, where appropriate thresholds are obvious.
Algorithms for automatic learning of rules are described in [15].
For larger vocabularies, rules are often used as a pre-stage to another classifier.
For instance, a rule like

IF a < Θa N² THEN the object is not the hand. (2.44)


discards noise and distractors in the background that are too small to represent the
hand. The threshold Θa specifies the required minimum size relative to the squared
image width. Another example is to determine whether an object in an image se-
quence is moving or idle. The following rule checks for a certain minimum amount
of motion:


IF $\max_{i \neq j} \sqrt{(x_{cog}(i) - x_{cog}(j))^2 + (y_{cog}(i) - y_{cog}(j))^2} < \Theta_{motion} N$
THEN the hand is idle. (2.45)
where i and j are frame indices, and Θmotion specifies the minimum dislocation for
the hand to qualify as moving, relative to the image width.
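As a small illustration, the motion rule (2.45) can be written directly in Python; the list of centers of gravity, the image width, and the threshold value 0.02 are assumptions of this sketch, not values prescribed by the text.

from itertools import combinations
from math import hypot

def hand_is_idle(cogs, image_width, theta_motion=0.02):
    """Rule (2.45): the hand counts as idle if no pair of frames shows a center-of-gravity
    displacement larger than theta_motion times the image width.
    cogs -- list of (x_cog, y_cog) tuples, one entry per frame"""
    max_dislocation = max(
        (hypot(xi - xj, yi - yj) for (xi, yi), (xj, yj) in combinations(cogs, 2)),
        default=0.0,
    )
    return max_dislocation < theta_motion * image_width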
2.1.3.5 Maximum Likelihood Classification
Due to its simplicity and general applicability, the concept of maximum likelihood
classification is widely used for all kinds of pattern recognition problems. An exten-
sive discussion can be found in [22]. This section provides a short introduction. We
assume the observation O to be a vector of K scalar features fi :
O = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_K \end{pmatrix} \qquad (2.46)

Stochastic Framework

The maximum likelihood classifier identifies the class ω ∈ Ω that is most likely to
have caused the observation O by maximizing the a posteriori probability P (ω|O):

\hat{\omega} = \operatorname*{argmax}_{\omega \in \Omega} P(\omega \mid O) \qquad (2.47)

To solve this maximization problem we first apply Bayes’ rule:



P(\omega \mid O) = \frac{P(O \mid \omega)\, P(\omega)}{P(O)} \qquad (2.48)

For simplification it is commonly assumed that all n classes are equiprobable:

P(\omega) = \frac{1}{n} \qquad (2.49)
Therefore, since neither P (ω) nor P (O) depend on ω, we can omit them when
substituting (2.48) in (2.47), leaving only the observation probability P (O|ω):

\hat{\omega} = \operatorname*{argmax}_{\omega \in \Omega} P(O \mid \omega) \qquad (2.50)

Estimation of Observation Probabilities

For reasons of efficiency and simplicity, the observation probability P (O|ω) is com-
monly computed from a class-specific parametric distribution of O. This distribution
is estimated from a set of training samples. Since the number of training samples is
often insufficient to identify an appropriate model, a normal distribution is assumed
in most cases. This is motivated by practical considerations and empirical success
rather than by statistical analysis. Usually good results can be achieved even if the
actual distribution significantly violates this assumption.
To increase modeling accuracy one may decide to use multimodal distributions.
This decision should be backed by sufficient data and motivated by the nature of the
regarded feature in order to avoid overfitting (see Sect. 2.1.3.1). Consider, for exam-
ple, the task of modeling the compactness c for the “stop” gesture shown in Fig. 2.2c.
Computing c for e.g. 20 different productions of this gesture would not provide suffi-
cient motivation for modeling it with a multimodal distribution. Furthermore, there is
no reason to assume that c actually obeys a multimodal distribution. Even if measure-
ments do suggest this, it is not backed by the biological properties of the human hand;
adding more measurements is likely to change the distribution to a single mode.
In this section only unimodal distributions will be used. The multidimensional
normal distribution is given by


N(O, \mu, C) = \frac{1}{\sqrt{(2\pi)^K \det C}} \exp\left( -\frac{1}{2} (O - \mu)^T C^{-1} (O - \mu) \right) \qquad (2.51)

where μ and C are the mean and covariance of the stochastic process {O}:

\mu = E\{O\} \qquad \text{mean of } \{O\} \qquad (2.52)

C = E\{(O - \mu)(O - \mu)^T\} = \operatorname{cov}\{O\} \qquad \text{covariance matrix of } \{O\} \qquad (2.53)

For the one-dimensional case with a single feature f this reduces to the familiar
equation

N(f, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2} \left( \frac{f - \mu}{\sigma} \right)^2 \right) \qquad (2.54)
Computing μi and Ci from the training samples for each class ωi in Ω yields the
desired n models that allow us to solve (2.50):

P (O|ωi ) = N (O, μi , Ci ), i = 1, 2, . . . , n (2.55)

Evaluation of Observation Probabilities

Many applications require the classification stage to output only the class ωi for
which P (O|ω) is maximal, or possibly a list of all classes in Ω, sorted by P (O|ω).
The actual values of the conditional probabilities themselves are not of interest. In
this case, a strictly monotonic mapping can be applied to (2.55) to yield a score S
that imposes the same order on the elements of Ω, but is faster to compute. S(O|ωi)
is obtained by multiplying P(O|ωi) by $\sqrt{(2\pi)^K}$ and taking the logarithm:

S(O \mid \omega_i) = -\tfrac{1}{2} \ln(\det C_i) - \tfrac{1}{2} (O - \mu_i)^T C_i^{-1} (O - \mu_i) \qquad (2.56)
Thus,

\hat{\omega} = \operatorname*{argmax}_{\omega \in \Omega} S(O \mid \omega) \qquad (2.57)

It is instructive to examine this equation in feature space. The surfaces of constant
score S(O|ωi) form hyperellipsoids with center μi. This is caused by the fact that,
in general, the elements of O show different variances.
In gesture recognition, variance is commonly caused by the inevitable “natural”
variations in execution, by differences in anatomy (e.g. hand size), segmentation
inaccuracy, and by noise introduced in feature computation. Features are affected
by this in different ways: For instance, scale invariant features are not affected by
variations in hand size.
If the covariance matrices Ci can be considered equal for i = 1, 2, . . . , n (which
is often assumed for simplicity), i.e.

C_i = C, \quad i = 1, 2, \ldots, n \qquad (2.58)

we can further simplify (2.56) by omitting the constant term $-\tfrac{1}{2}\ln(\det C)$ and multiplying by −2:

S(O \mid \omega_i) = (O - \mu_i)^T C^{-1} (O - \mu_i) \qquad (2.59)


Changing the score’s sign requires replacing the argmax operation in (2.57) with
argmin. The right side of (2.59) is known as the Mahalanobis distance between O
and μi, and the classifier itself is called the Mahalanobis classifier.
Expanding (2.59) yields

S(O \mid \omega_i) = O^T C^{-1} O - 2\mu_i^T C^{-1} O + \mu_i^T C^{-1} \mu_i \qquad (2.60)

For efficient computation this can be further simplified by omitting the constant
$O^T C^{-1} O$:

S(O \mid \omega_i) = -2\mu_i^T C^{-1} O + \mu_i^T C^{-1} \mu_i \qquad (2.61)


If all elements of O have equal variance σ², C can be written as the product of
σ² and the identity matrix I:

C = \sigma^2 I \qquad (2.62)
Replacing this in (2.59) and omitting the constant σ² leads to the Euclidean clas-
sifier:

S(O \mid \omega_i) = (O - \mu_i)^T (O - \mu_i) = |O - \mu_i|^2 \qquad (2.63)
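The complete maximum likelihood classifier of this section fits into a few lines of numpy. The following is an illustrative sketch (not the LTI-LIB classes used in the example applications); train_models, score, and classify are hypothetical helper names.

import numpy as np

def train_models(samples_per_class):
    """Estimate (mu_i, C_i) per class from a dict {label: array of shape (n_i, K)}, cf. (2.52), (2.53)."""
    models = {}
    for label, X in samples_per_class.items():
        models[label] = (X.mean(axis=0), np.cov(X, rowvar=False))
    return models

def score(o, mu, C):
    """Score (2.56): -1/2 ln(det C) - 1/2 (o - mu)^T C^-1 (o - mu)."""
    d = o - mu
    return -0.5 * np.log(np.linalg.det(C)) - 0.5 * d @ np.linalg.solve(C, d)

def classify(o, models):
    """Decision rule (2.57): return the label of the best-scoring class."""
    return max(models, key=lambda label: score(o, *models[label]))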
2.1.3.6 Classification Using Hidden Markov Models
The classification of time-dependent processes such as gestures, sign language, or
speech suggests explicitly modeling not only the samples’ distribution in the fea-
ture space, but also the dynamics of the process from which the samples originate.
In other words, a classifier for dynamic events should consider variations in execu-
tion speed, which means that we need to identify and model these variations. This
requirement is met by Hidden Markov Models (HMMs), a concept derived from
Markov chains.8
A complete discussion of HMMs is beyond the scope of this chapter. We will
therefore focus on basic algorithms and accept several simplifications, yet enough
information will be provided for an implementation and practical application. Fur-
ther details, as well as other types and variants of HMMs, can be found e.g. in
[8, 11, 13, 19–21]. The LTI-LIB contains multiple classes for HMMs and HMM-
based classifiers that offer several advanced features not discussed here.
In the following the observation O denotes a sequence of T individual observa-
tions ot :

O = (o1 o2 · · · oT ) (2.64)
Each individual observation corresponds to a K-dimensional feature vector:
o_t = \begin{pmatrix} f_1(t) \\ f_2(t) \\ \vdots \\ f_K(t) \end{pmatrix} \qquad (2.65)
For the purpose of this introduction we will at first assume K = 1 so that ot is
scalar:
\mathbf{o}_t = o_t \qquad (2.66)

This restriction will be removed later.

⁸ Markov chains describe stochastic processes in which the conditional probability distribution of future states, given the present state, depends only on the present state.

Concept

This section introduces the basic concepts of HMMs on the basis of an example.
In Fig. 2.16 the feature ẋcog (t) has been plotted for three samples of the gesture
“clockwise” (Fig. 2.3a). It is obvious that, while the three plots have similar shape,
they are temporally displaced. We will ignore this for now and divide the time axis
into segments of 5 observations each (one observation corresponds to one frame).
Since the samples have 60 frames this yields 12 segments, visualized by straight
vertical lines.9 For each segment i, i = 1, 2, . . . , 12, mean μi and variance σi² were
computed from the corresponding observations of all three samples (i.e. from 15
values each), as visualized in Fig. 2.16 by gray bars of length 2σi , centered at μi .

Fig. 2.16. Overlay plot of the feature ẋcog (t) for three productions of the gesture “clockwise”,
divided into 12 segments of 5 observations (frames) each. The gray bars visualize mean and
variance for each segment.

This represents a first crude model (though not an HMM) of the time dependent
stochastic process ẋcog(t), as shown in Fig. 2.17. The model consists of 12 states si,
each characterized by μi and σi². As in the previous section we assume unimodal
normal distributions, again motivated by empirical success. The model is rigid with
respect to time, i.e. each state corresponds to a fixed set of observations. For example,
state s2 models ẋcog(t) for t = 6, 7, 8, 9, 10.

⁹ The segment size of 5 observations was empirically found to be a suitable temporal resolution.

Fig. 2.17. A simple model for the time dependent process ẋcog (t) that rigidly maps 60 obser-
vations to 12 states.

This model allows us to compute the probability of a given observation O originat-
ing from the modeled process (i.e., ẋcog(t) for the gesture “clockwise”) as

P(O \mid \text{“clockwise”}) = \prod_{t=1}^{60} N\bigl(o_t, \mu_{i(t)}, \sigma_{i(t)}^2\bigr) \quad \text{with} \quad i(t) = \left\lfloor \frac{t-1}{5} \right\rfloor + 1 \qquad (2.67)

where N is the normal distribution (see (2.54)).
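In code, (2.67) is simply a product of per-frame normal densities. The sketch below assumes the twelve segment means and variances are already available as numpy arrays; it illustrates the rigid model only and is not part of the HMM implementation discussed next.

import numpy as np

def rigid_model_probability(obs, means, variances, segment_length=5):
    """Evaluate (2.67): product of N(o_t, mu_i(t), sigma_i(t)^2) over all frames,
    where i(t) = floor((t - 1) / segment_length) + 1 rigidly maps frames to states."""
    obs = np.asarray(obs, dtype=float)
    idx = np.arange(len(obs)) // segment_length          # zero-based state index per frame
    mu, var = means[idx], variances[idx]
    densities = np.exp(-0.5 * (obs - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return float(np.prod(densities))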


Thus far we have modeled the first stochastic process, the amplitude of ẋcog (t),
for fixed time segments. To extend the model to the second stochastic process inher-
ent in ẋcog (t), the variation in execution speed and temporal displacement apparent
in Fig. 2.16, we abandon the rigid mapping of observations to states. As before, s1
and s12 are defined as the initial and final state, respectively, but the model is ex-
tended by allowing stochastic transitions between the states. At each time step the
model may remain in its state si or change its state to si+1 or to si+2 . A transition
from si to sj is assigned a probability ai,j , with

\sum_j a_{i,j} = 1 \qquad (2.68)

and

a_{i,j} = 0 \quad \text{for } j \notin \{i,\, i+1,\, i+2\} \qquad (2.69)
Fig. 2.18 shows the modified model for ẋcog(t), which now constitutes a com-
plete Hidden Markov Model. Transitions from si to si allow modeling a “slower-than-
normal” execution part, while those from si to si+2 model a “faster-than-normal”
execution part.
Equation (2.69) is not a general requirement for HMMs; it is a property of the
specific Bakis topology used in this introduction. One may choose other topologies,
allowing e.g. a transition from si to si+3 , in order to optimize classification perfor-
mance. In this section we will stick to the Bakis topology because it is widely used

Fig. 2.18. A Hidden Markov Model in Bakis topology. Observations are mapped dynamically
to states.

in speech and sign language recognition, and because it is the simplest topology that
models both delay and acceleration.10
Even though we have not changed the number of states compared to the rigid
model, the stochastic transitions affect the distribution parameters μi and σi² of each
state i. In the rigid model, the lack of temporal alignment among the training sam-
ples leads to μi and σi² not accurately representing the input data. Especially for the
first six states σ is high and μ has a lower amplitude than the training samples (cf.
Fig. 2.16). Since the HMM provides temporal alignment by means of the transition
probabilities ai,j we expect σ to decrease and μ to better match the sample plots (this
is proven in Fig. 2.21). The computation of ai,j , μi and σi2 will be discussed later.
For the moment we will assume these values to be given.
In the following a general HMM with N states in Bakis topology is considered.
The sequence of states adopted by this model is denoted by the sequence of their
indices

q = (q1 q2 . . . qT ) with qt ∈ {1, 2, . . . , N }, t = 1, 2, . . . , T (2.70)

where qt indicates the state index at time t and T equals the number of observations.
This notation allows the probabilities ai,j to be specified as

ai,j = P (qt = j|qt−1 = i) (2.71)


For brevity, the complete set of transition probabilities is denoted by a square
matrix of N rows and columns

A = [ai,j ]N ×N (2.72)
One may want to loosen the restriction that q1 = s1 by introducing an initial state
probability vector π:

\pi = (\pi_1\ \pi_2\ \ldots\ \pi_N) \quad \text{with} \quad \pi_i = P(q_1 = s_i) \quad \text{and} \quad \sum_i \pi_i = 1 \qquad (2.73)

This can be advantageous when recognizing a continuous sequence of gestures
(such as a sentence of sign language), where coarticulation effects might lead to the
initial states of some models being skipped. In this section, however, we stick to
π1 = 1 and πi = 0 for i ≠ 1.

¹⁰ If the number of states N is significantly smaller than the number of observations T, acceleration can also be modeled solely by the transition from si to si+1.
To achieve a compact notation for a complete HMM we introduce a vector con-
taining the states’ distribution functions:

B = (b1 b2 . . . bN ) (2.74)
where bi specifies the distribution of the observation o in state si . Since we use
unimodal normal distributions,

bi = N(o, μi, σi²) (2.75)


A Hidden Markov Model λ is thus completely specified by

λ = (π, A, B) (2.76)
Obviously, to use HMMs for classification, we again need to compute the proba-
bility that a given observation O originates from a model λ. In contrast to (2.67) this
is no longer simply a product of probabilities. Instead, we have to consider all pos-
sible state sequences of length T, denoted by the set QT, and sum up the corresponding
probabilities:

P(O \mid \lambda) = \sum_{q \in Q_T} P(O, q \mid \lambda) = \sum_{q \in Q_T} b_1(o_1) \prod_{t=2}^{T} a_{q_{t-1},q_t}\, b_{q_t}(o_t) \qquad (2.77)

with

q1 = 1 ∧ qT = N (2.78)
Since, in general, the distributions bi will overlap, it is not possible to identify
a single state sequence q associated with O. This property is reflected in the name
Hidden Markov Model.
The model λ̂ that best matches the observation O is found by maximizing (2.77)
on the set Λ of all HMMs:

\hat{\lambda} = \operatorname*{argmax}_{\lambda \in \Lambda} P(O \mid \lambda) \qquad (2.79)

Even though efficient algorithms exist to solve this equation, (2.79) is commonly
approximated by

\hat{\lambda} = \operatorname*{argmax}_{\lambda \in \Lambda} P(O \mid \lambda, q^*) \qquad (2.80)

where q∗ represents the single most likely state sequence:

q^* = \operatorname*{argmax}_{q \in Q_T} P(q \mid O, \lambda) \qquad (2.81)

This is equivalent to approximating (2.77) by its largest summand, which may
appear drastic, but (2.79) and (2.80) have been found to be strongly correlated [21],
and practical success in terms of accuracy and processing speed warrants this approx-
imation.
Application of Bayes’ rule to P (q|O, λ) yields

P(q \mid O, \lambda) = \frac{P(O, q \mid \lambda)}{P(O \mid \lambda)} \qquad (2.82)
Since P (O|λ) does not depend on q we can simplify (2.81) to

q^* = \operatorname*{argmax}_{q \in Q_T} P(O, q \mid \lambda) \qquad (2.83)

In order to solve this maximization problem it is helpful to consider a graphical


representation of the elements of QT . Fig. 2.19 shows the Trellis diagram in which
all possible state sequences q are shown superimposed over the time index t.

Fig. 2.19. Trellis diagram for an HMM with Bakis topology.

The optimal state sequence q∗ can efficiently be computed using the Viterbi algo-
rithm (see Fig. 2.20): Instead of performing a brute force search of QT , this method

exploits the fact that from all state sequences q that arrive at the same state si at time
index t, only the one for which the probability up to si is highest, q∗i , can possibly
solve (2.83):

q^*_i = \operatorname*{argmax}_{q \,\in\, \{q \mid q_t = i\}} \; b_1(o_1) \prod_{t'=2}^{t} a_{q_{t'-1}, q_{t'}}\, b_{q_{t'}}(o_{t'}) \qquad (2.84)
All others need not be considered, because regardless of the subsequence of states
following si , the resulting total probability would be lower than that achieved by
following q∗i by the same subsequence. (A common application of this algorithm in
everyday life is the search for the shortest path from a location A to another location
B in a city: From all paths that have a common intermediate point C, only the one
for which the distance from A to C is shortest has a chance of being the shortest path
between A and B.)

Viterbi Search Algorithm

Define t as a loop index and Qt as a set of state sequences of length t.


1. Start with t = 1, Q1 = {(s1 )} and Qt = ∅ for t = 2, 3, . . . , T .
2. For each state si (i = 1, 2, . . . , N ) that is accessible from the last state of at
least one of the sequences in Qt do:
Find the state sequence q with

q = \operatorname*{argmax}_{q \in Q_t} \; b_1(o_1) \left( \prod_{t'=2}^{t} a_{q_{t'-1}, q_{t'}}\, b_{q_{t'}}(o_{t'}) \right) a_{q_t, i}\, b_i(o_{t+1}) \qquad (2.85)

Append i to q then add q to Qt+1 .


3. Increment t.
4. If t < T , goto step 2.
5. The state sequence q ∈ QT with qT = N solves (2.83).

Fig. 2.20. The Viterbi algorithm for an efficient solution of (2.83).

In an implementation of the Viterbi algorithm the value of (2.85) is efficiently
stored to avoid repeated computation. In case multiple state sequences maxi-
mize (2.85), only one of them needs to be considered.
Implementational Aspects
The product of probabilities in (2.77) and (2.85) may become very small, causing
numerical problems. This can be avoided by using scores instead of probabilities,
with

Score(P ) = ln(P ) (2.86)


Since ln(ab) = ln a + ln b this allows the product of probabilities to be replaced with
a sum of scores. Because the logarithm is a strictly monotonic mapping, the inverse

exp(·) operation may be omitted without affecting the argmax operation. In addition
to increased numerical stability the use of scores also reduces computational cost.11
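The following sketch combines the Viterbi search of Fig. 2.20 with the score representation (2.86) for a model with scalar observations. The array-based model layout (log transition matrix, per-state means and variances) is an assumption of this illustration and does not reflect the lti::hiddenMarkovModel interface.

import numpy as np

def viterbi_score(obs, log_a, means, variances):
    """Best-path log score max_q ln P(O, q | lambda) for an N-state model.
    obs    -- sequence of scalar observations o_1 ... o_T
    log_a  -- (N, N) matrix of ln a_{i,j}, with -inf for forbidden transitions
    means, variances -- per-state parameters of the unimodal normal densities b_i"""
    def log_b(i, o):      # ln b_i(o) for the one-dimensional normal distribution (2.54)
        return -0.5 * np.log(2.0 * np.pi * variances[i]) - 0.5 * (o - means[i]) ** 2 / variances[i]

    N, T = len(means), len(obs)
    delta = np.full(N, -np.inf)
    delta[0] = log_b(0, obs[0])                          # q_1 = s_1 since pi_1 = 1
    for t in range(1, T):
        delta = np.array([np.max(delta + log_a[:, j]) + log_b(j, obs[t])
                          for j in range(N)])            # keep only the best path into each state
    return delta[N - 1]                                  # the path must end in q_T = s_N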

Estimation of HMM Parameters

We will now discuss the iterative estimation of the transition probabilities A and
the distribution functions B for a given number of states N and a set of M training
samples O1 , O2 , . . . , OM . Tm denotes the length of Om (m ∈ {1, 2, . . . , M }). An
initial estimate of B is obtained in the same way as shown in Fig. 2.16, i.e. by equally
distributing the elements of all training samples to the N states.12 A is initialized to
A = \begin{pmatrix}
0.3 & 0.3 & 0.3 & 0 & 0 & \cdots & 0 \\
0 & 0.3 & 0.3 & 0.3 & 0 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & & \vdots \\
0 & \cdots & 0 & 0.3 & 0.3 & 0.3 \\
0 & \cdots & 0 & 0 & 0.5 & 0.5 \\
0 & \cdots & 0 & 0 & 0 & 1
\end{pmatrix} \qquad (2.87)
We now apply the Viterbi algorithm described above to find, for each train-
ing sample Om = (o1,m o2,m . . . oTm ,m ), the optimal state sequence q∗m =
(q1,m q2,m . . . qT,m ). Each q∗m constitutes a mapping of observations to states, and
the set of all q∗m , m = 1, 2, . . . , M , is used to update the model’s distributions B
and transition probabilities A accordingly: Mean μi and variance σi2 for each state si
are recomputed from all sample values ot,m for which qt,m = i holds. The transition
probabilities ai,j are set to the number of times that the subsequence {. . . , qi , qj , . . . }
is contained in the set of all q∗m and normalized to fulfill (2.68). Formally,

\mu_i = E\{o_{t,m} \mid q_{t,m} = i\} \qquad (2.88)

\sigma_i^2 = \operatorname{var}\{o_{t,m} \mid q_{t,m} = i\} \qquad (2.89)

a_{i,j} = \frac{\tilde{a}_{i,j}}{\sum_{j} \tilde{a}_{i,j}} \qquad (2.90)

with

ãi,j = |{qt,m |qt = i ∧ qt+1 = j}| (2.91)


This process is called segmental k-means or Viterbi training. Each iteration im-
proves the model’s match with the training data, i.e.

P(O, q^* \mid \lambda^{(n)}) \ge P(O, q^* \mid \lambda^{(n-1)}) \qquad (2.92)


where λ(n) denotes the model at iteration n.

¹¹ To avoid negative values, the score is sometimes defined as the negative logarithm of P.
¹² If N is not a divisor of Tm, an equal distribution is approximated.

The training is continued on the same training samples until λ remains constant.
This is the case when the optimal state sequences q∗m remain constant for all M
samples.13 A proof of convergence can be found in [13].
It should be noted that the number of states N is not part of the parameter esti-
mation process and has to be specified manually. Lower values for N reduce compu-
tational cost at the risk of inadequate generalization, while higher values may lead to
overspecialization. Also, since the longest forward transition in the Bakis topology
is two states per observation,

N ≤ 2 min{T1 , T2 , . . . , TM } − 1 (2.93)
is required.
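One re-estimation step of this Viterbi training can be sketched as follows. The decoded state sequences are assumed to come from a hypothetical viterbi_path decoder (analogous to the score computation above, but returning the state sequence); observations and paths are expected as 1-D numpy arrays.

import numpy as np

def reestimate(samples, paths, N):
    """Viterbi-training update (2.88)-(2.90) from decoded state sequences.
    samples -- list of observation sequences (1-D arrays)
    paths   -- list of 0-based state index sequences q*_m, e.g. from a hypothetical
               viterbi_path(obs, model) decoder
    N       -- number of states"""
    means, variances = np.zeros(N), np.ones(N)
    counts = np.zeros((N, N))
    for i in range(N):
        values = np.concatenate([obs[path == i] for obs, path in zip(samples, paths)])
        means[i], variances[i] = values.mean(), values.var()          # (2.88), (2.89)
    for path in paths:
        for i, j in zip(path[:-1], path[1:]):
            counts[i, j] += 1                                         # occurrence counts (2.91)
    row_sums = counts.sum(axis=1, keepdims=True)
    a = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)  # (2.90)
    return means, variances, a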
Revisiting the example introduced at the beginning of this section, we will now
create an HMM for ẋcog from the three productions of “clockwise” presented in
Fig. 2.16. To allow a direct comparison with the rigid state assignment shown in
Fig. 2.17 the number of states N is set to 12 as well. Fig. 2.21 overlays the parameters
of the resulting model’s distribution functions with the same three plots of ẋcog .

Fig. 2.21. Overlay plot of the feature ẋcog (t) for three productions of the gesture “clockwise”,
and mean μ and variance σ of the 12 states of an HMM created therefrom. The gray bars are
centered at μ and have a length of 2σ. Each represents a single state.

¹³ Alternatively, the training may also be aborted when the number of changes in all q∗m falls below a certain threshold.

Compared to Fig. 2.16, the mean values μ now represent the shape of the plots
significantly better. This is especially noticeable for 10 ≤ t ≤ 30, where the differ-
ences in temporal offset previously led to μ having a much smaller amplitude than
ẋcog . Accordingly, the variances σ are considerably lower. This demonstrates the
effectiveness of the approach.

Multidimensional Observations

So far we have assumed scalar observations ot (see (2.66)). In practice, however, we


will rarely encounter this special case.
The above concepts can easily be extended to multidimensional observations,
removing the initial restriction of K = 1 in (2.65). (2.75) becomes

bi = N (o, μi , Ci ) (2.94)
where μi and Ci are mean and covariance matrix of state si estimated in the Viterbi
training:

μi = E{ot,m |qt,m = i} (2.95)


Ci = cov{ot,m |qt,m = i} (2.96)

These equations replace (2.88) and (2.89). In practical applications the elements
of o are frequently assumed uncorrelated so that C becomes a diagonal matrix, which
simplifies computation.

Extensions

A multitude of extensions and modifications exist to improve the performance of the
simple HMM variant presented here; two of them are briefly mentioned below.
Firstly, the distributions b may be chosen differently. This includes other types
of distributions, such as Laplace, and multimodal models.14 The latter modification
brings about a considerable increase in complexity.
Furthermore, in the case of multidimensional observations one may want the
temporal alignment performed by the HMM to be independent for certain parts of
the feature vector. This can be achieved by simply using multiple HMMs in parallel,
one for each set of temporally coupled features. For instance, a feature vector in a
gesture recognition system typically consists of features that describe the hand’s po-
sition, and others that characterize its posture. Using a separate HMM for each set
of features facilitates independent temporal alignment (see Sect. 3.3.4 for a discus-
sion of such Parallel Hidden Markov Models (PaHMMs)). Whether this is actually
desirable depends on the concrete application.

¹⁴ For multimodal models, the considerations stated in Sect. 2.1.3.5 apply.

2.1.3.7 Example
Based on the features of the static and the dynamic vocabulary computed in
Sect. 2.1.2.4, this section describes the selection of suitable features to be used in
the two example applications.
We will manually analyze the values in Tab. 2.5 and the plots in Fig. 2.14 to identify
features with high inter-gesture variance. Since all values and plots describe only a
single gesture, they first have to be computed for a number of productions to get an
idea of the intra-gesture variance. For brevity, this process is not described explicitly.
The choice of classification algorithm is straightforward for both the static and the
dynamic case.

Static Gesture Recognition

The data in Tab. 2.5 shows that a characteristic feature of the “stop” gesture
(Fig. 2.2a) is its low compactness: c is smaller than for any other shape. This was
already mentioned in Sect. 2.1.3.4, and found to be true for any production of “stop”
by any user.
Pointing gestures result in a roughly circular, compact shape with a single pro-
trusion from the pointing finger. For the “left” and “right” gestures (Fig. 2.2b,c), this
protrusion is along the x axis. Assuming that the protruding area does not signifi-
cantly affect xcog, the protrusion ratio xp introduced in (2.30) can be used: Shapes
with xp ≫ 1 are pointing to the right, while 1/xp ≫ 1 indicates pointing to the left.
Gestures for which xp ≈ 1 are not pointing along the x axis. This holds regardless
of whether the thumb or the index finger is used for pointing. Empirically, xp > 1.2
and 1/xp > 1.2 were found for pointing right and left, respectively.
Since the hand may not be the only skin-colored object in the image (see
Tab. 2.4), the segmentation might yield multiple skin colored regions. In our ap-
plication scenario it is safe to assume that the hand is the largest object; however, it
need not always be visible. The recording conditions in Tab. 2.4 demand a minimum
shape size amin of 10% of the image size, or

amin = 0.1N M (2.97)


Therefore, all shapes with a < amin may be discarded, and from the remainder,
we simply choose the one with the largest area a. If no shape exceeds the minimum
area, the image is considered empty.
The feature vector for static gesture recognition is therefore
O = \begin{pmatrix} a \\ c \\ x_p \end{pmatrix} \qquad (2.98)
The above observations regarding a, c, and xp suggest using a set of rules for
classification. The required thresholds can easily be identified since different gestures
(including typical “idle” gestures) scarcely overlap in the feature space. A suitable
threshold Θc for identifying “stop” was already found in (2.43).

Rephrasing the above paragraphs as rules yields

1. Discard all objects with a < 0.1N M. (2.99)


2. If no objects remain then the image is empty. (2.100)
3. Consider the object for which a is maximum. (2.101)
4. If c < 0.25 then the observed gesture is “stop”. (2.102)
5. If xp > 1.2 then the observed gesture is “right”. (2.103)
6. If 1/xp > 1.2 then the observed gesture is “left”. (2.104)
7. Otherwise the hand is idle and does not perform any gesture. (2.105)
These rules conclude the design of the static gesture recognition example appli-
cation. Sect. 2.1.4 presents the corresponding IMPRESARIO process graph.
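For illustration, rules (2.99)-(2.105) translate almost literally into code. The sketch below assumes that the segmented regions are available as simple records with precomputed attributes a (area), c (compactness), and xp (protrusion ratio); in the actual example this role is played by the IMPRESARIO macros described in Sect. 2.1.4.

def classify_static(regions, image_width, image_height):
    """Rule-based static gesture classification following (2.99)-(2.105)."""
    a_min = 0.1 * image_width * image_height
    candidates = [r for r in regions if r.a >= a_min]        # rule 1: discard small objects
    if not candidates:                                       # rule 2: empty image
        return "empty"
    hand = max(candidates, key=lambda r: r.a)                # rule 3: largest remaining object
    if hand.c < 0.25:                                        # rule 4
        return "stop"
    if hand.xp > 1.2:                                        # rule 5
        return "right"
    if 1.0 / hand.xp > 1.2:                                  # rule 6
        return "left"
    return "idle"                                            # rule 7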

Dynamic Gesture Recognition

The feature plots in Fig. 2.14 show that normalized area a′, compactness c, and the nor-
malized x coordinate of the center of gravity xcog should be sufficient to discriminate
among the six dynamic gestures. “clockwise” and “counterclockwise” can be identi-
fied by their characteristic sinusoidal xcog feature. “open” and “grab” both exhibit a
drop in a′ around T/2 but can be distinguished by compactness, which simultaneously
drops for “open” but rises for “grab”. “close” and “drop” are discerned correspond-
ingly. We thus arrive at a feature vector of

O = \begin{pmatrix} a' \\ c \\ x_{cog} \end{pmatrix} \qquad (2.106)
An HMM-based classifier is the obvious choice for processing a time series of
such feature vectors. The resulting example application and its IMPRESARIO process
graph is presented in Sect. 2.1.5.

2.1.4 Static Gesture Recognition Application

To run the example application for static gesture recognition, please install the
rapid prototyping system IMPRESARIO from the accompanying CD as described in
Sect. B.1. After starting IMPRESARIO, choose “File”→“Open...” to load the process
graph named StaticGestureClassifier.ipg (see Fig. 2.22). The data flow
is straightforward and corresponds to the structure of the sections in this chapter. All
six macros are described below.
Video Capture Device Acquires a stream of images from a camera connected e.g.
via USB. In the following, each image is processed individually and indepen-
dently. Other sources may be used by replacing this macro with the appropriate
source macro (for instance, Image Sequence (see Sect. B.5.1) for a set of image
files on disk).

Fig. 2.22. The process graph of the example application for static gesture recognition (file
StaticGestureClassifier.ipg). The image shows the classification result.

ProbabilityMap Computes a skin probability image as described in Sect. 2.1.2.1.


The parameters Object color histogram file and Non-object color histogram file
point to the files skin-32-32-32.hist and nonskin-32-32-32.hist
(these are the histograms presented in [12], with a resolution of 32 bins per di-
mension as described in (2.13)). All parameters should be set to their default
values.
GaussFilter Smoothes the skin probability image to prevent jagged borders (see
Sect. 2.1.2.2). The parameter Kernel size should be set to around 15 for a resolu-
tion of 320×240. Higher resolutions may require larger kernels. Setting Variance
to zero computes a suitable Gauss kernel variance automatically.
ConvertChannel The output of the preceding stage is a floating point gray value
image I(x, y) ∈ [0, 1]. Since the next macro requires a fixed point gray value
image I(x, y) ∈ {0, 1, . . . , 255}, a conversion is performed here.
ObjectsFromMask Extracts contiguous regions in the skin probability image using
an algorithm similar to that presented in Fig. 2.11. This macro outputs a list of all
detected regions sorted descending by area (which may be empty if no regions
were found). A value of zero for the parameter Threshold uses the algorithm
shown in Fig. 2.6 for automatic threshold computation. If this fails, the threshold

may have to be set manually. The remaining parameters control further aspects of
the segmentation process and should be left at their defaults.
StaticGestureClassifier Receives the list of regions from the preceding stage and
discards all but the largest one. The macro then computes the feature vector de-
scribed in (2.98), using the equations discussed in Sect. 2.1.2.3, and performs
the actual classification according to the rules derived in Sect. 2.1.3.7 (see (2.99)
to (2.105)). The thresholds used in the rules can be changed to modify the sys-
tem’s behavior. For the visualization of the classification result, the macro also re-
ceives the original input image from Video Capture Device (if the output image
is not shown, double-click on the macro’s red output port). It displays the border
of the hand and a symbol indicating the classification result (“left”, “right”, or
“stop”). If the hand is idle or not visible, the word IDLE is shown.
Press the “Play” button or the F1 key to start the example. By double-clicking on
the output port of GaussFilter the smoothed skin probability image can be displayed.
The hand should be clearly distinguishable from the background. Make sure that
the camera’s white balance is set correctly (otherwise the skin color detection will
fail), and that the image is never overexposed, but rather slightly underexposed. If
required, change the device properties as described in Sect. B.5.2.
The application recognizes the three example gestures “left”, “right”, and “stop”
shown in Fig. 2.2. If the classification result is always IDLE or behaves erratically, see
Sect. 2.1.6 for troubleshooting help.

2.1.5 Dynamic Gesture Recognition Application


The process graph of the example application for dynamic gesture recognition is
stored in the file DynamicGestureClassifier.ipg (see Fig. 2.23). It is iden-
tical to the previous example (cf. Fig. 2.22), except for the last macro Dynam-
icGestureClassifier. From the list of regions received from ObjectsFromMask, this
macro considers only the largest one (just as StaticGestureClassifier) and computes
the feature vector shown in (2.106). The feature vectors of subsequent frames are
buffered, and when the gesture is finished, the resulting feature vector sequence is
forwarded to an HMM-based classifier integrated in the macro. The following ex-
plains the parameters and operation of DynamicGestureClassifier.
When starting the example, the macro is in “idle” state: It performs a segmenta-
tion and visualizes the resulting hand border in the color specified by Visualization
Color but does not compute feature vectors. This allows you to position the cam-
era, set the white balance etc. Setting the State parameter to “running” activates the
recording and recognition process.
Whenever the macro’s state is switched from “idle” to “running” it enters a prepa-
ration phase in which no features are computed yet, in order to give you some time
to position your hand (which is probably on the mouse at that point) in the image.
During this period a green progress bar is shown at the bottom of the image so that
you know when the preparation phase is over and the actual recording starts. The
duration of the preparation phase (in frames) is specified by the parameter Delay
Samples.

Fig. 2.23. Process graph of the example application for dynamic gesture recognition (file
DynamicGestureClassifier.ipg). The image was recorded with a webcam mounted
atop the screen facing down upon the keyboard.

After the preparation phase, the recording phase starts automatically. Its duration
can be configured by Gesture Samples. A red progress bar is shown during recording.
The complete gesture must be made while this progress bar is running. The hand
motion should be smooth and take up the entire recording phase.
The parameter Hidden Markov Model specifies a file containing suitable Hidden
Markov Models.15 You can either use gestures.hmm, which is provided on the
CD and contains HMMs for the six example gestures shown in Fig. 2.3, or generate
your own HMMs using the external tool generateHMM.exe as described below.
Since the HMMs are identified by numerical IDs (0, 1, . . . ), the parameter Ges-
ture Names allows setting names for each model that are used when reporting clas-
sification results. The names are specified as a list of space-separated words, where
the first word corresponds to ID 0, the second to ID 1, etc. Since the HMMs in
gestures.hmm are ordered alphabetically this parameter defaults to “clockwise
close counterclockwise drop grab open”.

¹⁵ The file actually contains an instance of lti::hmmClassifier, which includes instances of lti::hiddenMarkovModel.

The parameter Last Classified Gesture is actually not a parameter, but is set by
the macro itself to reflect the last classification result (which is also dumped to the
output pane).
Start the example application by pressing the “Play” button or the F1 key. Ad-
just white balance, shutter speed etc. as for StaticGestureRecognition (Sect. 2.1.4).
Additionally, make sure that the camera is set to the same frame rate used when
recording the training samples from which the HMMs were created (25 fps for
gestures.hmm).
When the segmentation of the hand is accurate, set State to “running” to start the
example. Perform one of the six example gestures (Fig. 2.3) while the red progress
bar is showing and check Last Classified Gesture for the result. If the example does
not work as expected, see Sect. 2.1.6 for help.

Test/Training Samples Provided on CD

The subdirectory imageinput\Dynamic_Gestures in your IMPRESARIO in-
stallation contains three productions of each of the six example gestures. From these
samples, gestures.hmm was created (this process is described in detail below).
For testing purposes you can use them as input to DynamicGestureClassifier. How-
ever, conclusions regarding classification performance cannot be made by feeding
training data back to the system, but only by testing with previously unseen input.
The process graph DynamicGestureClassifierFromDisk.ipg shown
in Fig. 2.24 uses Image Sequence as the source macro. It allows selecting one of
six image sequences from a drop-down list, where each sequence corresponds to
one production of the gesture of the same name. A detailed description of Image
Sequence is provided in Sect. B.5.1.
In contrast to acquiring images from a camera, reading them from disk does not
require a preparation phase, so the parameter Delay Samples of DynamicGesture-
Classifier is set to 0. Gesture Samples should exactly match the number of files to
read (60 for the example gestures). Make sure that State is set to “running” every
time before starting the example, otherwise one or more images would be ignored.

Generation of Hidden Markov Models

The command line utility generateHMM.exe creates Hidden Markov Models


from a set of training samples and stores them in an .hmm file16 for use in Dy-
namicGestureClassifier. These training samples may either be the above described
samples provided on the CD, your own productions of the example gestures, or any
other gestures that can be classified based on the feature vector introduced in (2.106).
(To use a different feature vector, modify the class DynamicGR accordingly and re-
compile both the affected IMPRESARIO files and generateHMM.)
The training samples must be arranged on disk as shown in Fig. 2.25. The macro
RecordGestures facilitates easy recording of gestures and creates this directory

¹⁶ As above, this file actually contains a complete lti::hmmClassifier.

Fig. 2.24. Recognition of dynamic gestures read from disk frame by frame (process graph file
DynamicGestureClassifierFromDisk.ipg). Each frame corresponds to a single
file.

Fig. 2.25. Directory layout for training samples to be read by the generateHMM.exe com-
mand line utility.

structure automatically. Its process graph is presented in Fig. 2.26. Like Dynam-
icGestureRecognition, it has an “idle” and a “running” state, and precedes the ac-

tual recording with a preparation phase. The parameters State, Delay Samples, and
Gesture Samples work as described above.

Fig. 2.26. Process graph RecordGestures.ipg for recording gestures to be processed by generateHMM.exe.

To record a set of training gestures, first set the parameter Destination Directory
to specify where the above described directory structure shall be created. Ensure that
Save Images to Disk is activated and State is set to “idle”, then press “Play” or F1.
For each gesture you wish to include in the vocabulary, perform the following steps:
1. Set the parameter Gesture Name (since the name is used as a directory name, it
should not contain special characters).
2. Set Sequence Index to 0.
3. Set State to “running”, position your hand appropriately in the preparation phase
(green progress bar) and perform the gesture during the recording phase (red
progress bar). After recording has finished, Sequence Index is automatically in-
creased by 1, and State is reset to “idle”.
4. Repeat step 3 several times to record multiple productions of the gesture.
5. Examine the results using an image viewer such as the free IrfanView17 or the
macro Image Sequence (see Sect. B.5.1). If a production was incorrect or oth-
erwise not representative, set Sequence Index to its index and repeat step 3 to
overwrite it.

¹⁷ http://www.irfanview.com/
Each time step 3 is performed, a directory with the name of Sequence Index
is created in a subdirectory of Destination Directory named according to Gesture
Name, containing as many image files (frames) as specified in Gesture Samples. This
results in a structure as shown in Fig. 2.25. All images are stored in BMP format.
The files created by RecordGestures can be used for both training and test-
ing. For instance, in order to observe the effects of changes to a certain pa-
rameter while keeping the input data constant, replace the source macro Video
Capture Device in the process graphs StaticGestureClassifier.ipg
or DynamicGestureClassifier.ipg with Image Sequence and use the
recorded gestures as image source.
To run generateHMM.exe on the recorded gestures, change to the direc-
tory where it resides (the IMPRESARIO data directory), ensure that the files
skin-32-32-32.hist and non-skin-32-32-32.hist are also present
there, and type

generateHMM.exe -d <dir> -e <extension> -f <outfile>

where the options -d, -e, and -f specify the input directory, the input file extension,
and the output filename, respectively. For use with Record Gestures, replace <dir>
with the directory specified as Destination Directory, <extension> with bmp18 ,
and <outfile> with the name of the .hmm file to be created. The directories are
then scanned and each presentation is processed (this may take several seconds). The
resulting .hmm file can now be used in DynamicGestureClassifier.
To control the segmentation process, additional options as listed in Tab. 2.6 may
be used. It is important that these values match the ones of the corresponding parame-
ters of the ObjectsFromMask macro in DynamicGestureClassifier.ipg
to ensure that the process of feature computation is consistent. The actual HMM
creation can be configured as shown in Tab. 2.7.

Table 2.6. Options to control the skin color segmentation in generateHMM.exe.


-k Size of the Gauss kernel used for smoothing the skin probability images.
-t Segmentation threshold Θ.

Calling generateHMM.exe with no arguments shows a short help as well


as the default values and allowed ranges for all options. For the provided file
gestures.hmm the option -s 15 was used, and all other parameters were left
at their defaults.
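As a concrete (hypothetical) example, recreating models comparable to the provided gestures.hmm from self-recorded BMP sequences stored in a directory named MyGestures could look like this:

generateHMM.exe -d MyGestures -e bmp -f mygestures.hmm -s 15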

¹⁸ For the samples provided on CD, specify jpg instead.

Table 2.7. Options to control the HMM creation in generateHMM.exe.


-c Type of the distribution functions B to be used.
-s Number of states N .
-m, -M Minimum and maximum number of states to jump forward per observation.
These values specify a range, so -m 0 -M 2 allows jumps of zero, one, or
two states (Bakis topology).

2.1.6 Troubleshooting

In case the classification performance of either example is poor, verify that the
recording conditions in Tab. 2.4 are met. Wooden furniture may cause problems for
the skin color segmentation, so none should be visible in the image. Point the camera
to a white wall to be on the safe side. If the segmentation fails (i.e. the hand’s border
is inaccurate) even though no other objects are visible, recheck white balance, shutter
speed, and gain/brightness. The camera may perform an automatic calibration that
yields results unsuitable for the skin color histogram. Disable this feature and change
the corresponding device parameters manually. The hand should be clearly visible in
the output of the GaussFilter macro (double-click on its red output port if no output
image is shown).

2.2 Facial Expression Commands


The human face is, besides speech and gesture, a primary channel of information
exchange during natural communication. With facial expressions, emotions can be ex-
pressed faster, more subtly, and more efficiently than with words [3].
Facial expressions are used actively by a speaker to support communication, but
also intuitively as facial feedback by a dialog partner. In medicine, symptoms and
progress in therapy may be read from the face. Sign languages use facial expressions
as an additional communication channel supplementing the syntax and semantics of a
conversation. Already a few minutes after birth, babies focus on faces and thereby
differentiate between persons.
While the recognition of facial expressions is easy for humans, it is extremely dif-
ficult for computers. Many attempts are documented in the literature which mostly
concentrate on subparts of the face [17]. Eye blinks are, e.g., applied in the auto-
motive industry to estimate driver fatigue, lip outline and motion contribute to the
robustness of speech recognizers, and face recognition is used as a biometric method
for identification [4].
Parallel registration of facial features currently takes place only in “motion cap-
ture” for the realistic animation of computer-generated characters. Here the tiniest
movements in the face are recorded with intricate optical measurement devices.
Fig. 2.27 shows a selection of current hardware for facial expression recording.
The use of non-intrusive video-based registration techniques poses serious prob-
lems due to possible variations in user appearance, which may have intrinsic and ex-

Fig. 2.27. Hardware for facial expression recording. Sometimes the reflections of Styrofoam
balls are exploited (as in a. and b.) Other solutions use a head cam together with markers on
the face (c) or a mechanical device (d). In medical diagnostics line of sight is measured with
piezo elements or with cameras fixed to the spectacles to measure skin tension.

trinsic causes. Intrinsic variations result from individual face shape and from move-
ments while speaking, extrinsic effects result from camera position, illumination, and
possibly occlusions.
Here we propose a generic four stage facial feature analysis system that can cope
with such influencing factors. The processing chain starts with video stream acquisition
and proceeds through image preprocessing and feature extraction to classification.
During image acquisition, a camera image is recorded while camera adjustments
are optimized continuously. During preprocessing, the face is localized and its ap-
pearance is improved in various respects. Subsequently feature extraction results in

Fig. 2.28. Information processing stages of the facial expression recognition system.

the segmentation of characteristic facial regions among which significant attributes


are identified. A final classification stage assigns features to predefined facial ex-
pressions. An overview of the complete system is depicted in Fig. 2.28. The individual
processing stages are described in more detail in the following sections.

2.2.1 Image Acquisition

During image acquisition, the constancy of color and luminance even under variable
illumination conditions must be guaranteed. Commercial webcams, which are used
here, usually permit the adjustment of various parameters. Most provide a feature
for partial or full self-configuration. The hardware-implemented algorithms take
care of white balance, shutter speed, and backlight compensation and yield, on
average, satisfactory results.
Camera parameter adjustment can, however, be further optimized if context
knowledge is available, e.g. shadows from which the position of a light source
can be estimated, or the color of a facial region.
Since the factors lighting and reflection cannot be controlled here, only the ad-
justment of the camera parameters remains. The most important parameters of a
digital camera for optimizing image quality are the shutter speed, the white balance,
and the gain. In order to optimize these factors in parallel, an extended simplex
algorithm is utilized, similar to the approach given in [16].
A simplex in n-dimensional space (here n = 3 for the three parameters) is a
geometrical object defined by n+1 position vectors which all have the same distance
to each other. A position vector is defined as follows:

P = \begin{pmatrix} s \\ w \\ g \end{pmatrix} \qquad s = \text{shutter}, \quad w = \text{white balance}, \quad g = \text{gain} \qquad (2.107)

|P| is a scalar value that represents the number of skin-colored pixels in the face
region, depending on the three parameters s, w, and g. The worst (smallest) vector
is mirrored at the line given by the remaining vectors, resulting in a new simplex.
The simplex algorithm iteratively tries to find an optimal position vector that comes
closest to the desired target. In the present case this means approximating the
average color of the face region to an empirically determined reference value.
For the optimal adjustment of the parameters, however, a suitable initialization is
necessary, which can be achieved for example by first running the camera's automatic
white balance.
Fig. 2.29 (left) shows an example for the two-dimensional case. Two additional
position vectors U and V are added to the position vector P. Since |P| produces too
few skin-colored areas, P is mirrored at the straight line UV onto P′. This results in
the new simplex S′. The procedure is repeated until it converges towards a local
maximum, Fig. 2.29 (right).
A disadvantage of the simplex algorithm is its rigid, stepwise, and linear behavior.
Therefore the approach was extended by introducing a set of rules that allow the
mirroring to be combined with a compression and/or stretching. Fig. 2.30
illustrates the extended simplex algorithm for a two-dimensional parameter space.

The rules are as follows:



Fig. 2.29. Simple simplex similar to the approach described in [16].

1. If |P′| > |U| ∧ |P′| > |V|, then repeat the mirroring with stretching factor 2,
from P to P₂.
2. If (|P′| > |U| ∧ |P′| < |V|) ∨ (|P′| < |U| ∧ |P′| > |V|), then use P′.
3. If |P′| < |U| ∧ |P′| < |V| ∧ |P′| > |P|, then repeat the mirroring with stretching
factor 0.5, from P to P₀.₅.
4. If |P′| < |U| ∧ |P′| < |V| ∧ |P′| < |P|, then repeat the mirroring with stretching
factor −0.5, from P to P₋₀.₅.

Fig. 2.30. Advanced simplex with dynamic stretching factor.
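The extended simplex step can be sketched as follows; evaluate(P) stands for the camera-dependent measurement of the number of skin-colored pixels obtained with parameter vector P and is a placeholder, so the routine only illustrates the rule set above rather than the actual camera control code.

import numpy as np

def extended_simplex_step(vertices, evaluate):
    """One iteration of the extended simplex search over (shutter, white balance, gain).
    vertices -- (n+1, n) array of parameter vectors, here 4 x 3
    evaluate -- placeholder callback returning the skin-pixel count for a parameter vector"""
    scores = np.array([evaluate(v) for v in vertices])
    worst = np.argmin(scores)                        # P: the vertex to be mirrored
    others = np.delete(vertices, worst, axis=0)      # U, V, ...: they span the mirror line/plane
    centroid = others.mean(axis=0)

    def mirrored(factor):                            # mirroring of P with a stretching factor
        return centroid + factor * (centroid - vertices[worst])

    p_ref = mirrored(1.0)                            # plain reflection P'
    s_ref = evaluate(p_ref)
    if s_ref > scores.max():                         # rule 1: better than all others -> factor 2
        candidate = mirrored(2.0)
    elif s_ref > scores[worst]:                      # P' is at least better than P
        candidate = p_ref if s_ref > np.delete(scores, worst).min() else mirrored(0.5)  # rules 2 / 3
    else:                                            # rule 4: even worse than P -> factor -0.5
        candidate = mirrored(-0.5)
    vertices[worst] = candidate                      # the new simplex replaces the worst vertex
    return vertices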

2.2.2 Image Preprocessing

The goal of image preprocessing is the robust localization of the facial region which
corresponds to the rectangle bounded by bottom lip and brows. With regard to
processing speed only skin colored regions are analyzed in an image. Additionally a
general skin color model is adapted to each individual.
Face finding relies on a holistic approach which takes the orientation of facial
regions into consideration. Brows, e.g., are characterized by vertically alternating

bright and dark horizontal regions. A speed-up of processing is achieved by a so-
called integral-image approach and an Ada-Boosting classification. Variations in
head pose and occlusions by, e.g., hands are compensated by tracking characteristic
face regions in parallel. Furthermore, a reduction of shadow and glare effects is per-
formed as soon as the face has been found [5].
2.2.2.1 Face Localization
Face localization is simplified by exploiting a-priori knowledge, either with respect
to the whole face or to parts of it. Analytical or feature-based approaches make use
of local features like edges, intensity, color, movement, contours, and symmetry, sepa-
rately or in combination, to localize facial regions. Holistic approaches consider regions as
a whole.
The approach described here is holistic by finding bright and dark facial regions
and their geometrical relations. A search mask is devised to find skin colored regions
with suitable movement patterns only. For the exploitation of skin color information,
the approach already described in Sect. 2.1.2.1 is used.
A movement analysis makes use of the fact that neighboring pixels of contiguous
regions move in similar fashion. There are several options how to calculate optical
flow. Here a Motion History Image (MHI), i.e. the weighted sum of subsequent bi-
nary difference images is used, where a motion image is defined as:


$$I_{motion}(x, y, t_k) = \begin{cases} 1.0 & \text{if } \| I(x, y, t_k) - I(x, y, t_{k-1}) \|_2 > \Theta \\ \max(0.0,\; I_B(x, y, t_{k-1}) - T) & \text{otherwise} \end{cases} \qquad (2.108)$$

I_motion is set to 1.0 if the intensity difference of a pixel between two consecutive images, measured by the L2 norm, is larger than a given threshold Θ. Otherwise the contribution of elapsed differences is linearly decreased by the constant T.
Fig. 2.31 shows the result of this computation for head and background movements.
The intensity of the visualized movement pattern indicates contiguous regions.

Fig. 2.31. Visualized movement patterns for movement of head and background (left), move-
ment of the background alone (middle), and movement of the head alone (right).
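As an illustration, the following NumPy sketch maintains such a motion history image over a stream of gray value frames. It is a minimal sketch, not the authors' code; the values of the threshold Θ and of the decay constant T are placeholders.

```python
import numpy as np

def update_motion_history(mhi, prev_frame, frame, theta=0.1, decay=0.05):
    """Update a motion history image for one new frame.

    mhi        -- current motion history image, float values in [0, 1]
    prev_frame -- previous gray value frame, float values in [0, 1]
    frame      -- current gray value frame, float values in [0, 1]
    theta      -- intensity difference threshold (Theta in Eq. 2.108)
    decay      -- linear decay constant (T in Eq. 2.108)
    """
    moved = np.abs(frame - prev_frame) > theta      # pixels with a large difference
    decayed = np.maximum(0.0, mhi - decay)          # fade out old motion
    return np.where(moved, 1.0, decayed)            # recent motion dominates

# usage: mhi = np.zeros_like(first_frame); then per frame:
# mhi = update_motion_history(mhi, prev_frame, frame)
```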

For a combined skin color and movement search the largest skin colored ob-
ject is selected and subsequently limited by contiguous, non skin colored regions
(Fig. 2.32).

Fig. 2.32. Search mask composed of skin color (left) and motion filter (right).

Holistic face finding makes use of three separate modules. The first transforms
the input image in an integral image for the efficient calculation of features. The
second module supports the automatic selection of suitable features which describe
variations within a face, using Ada-Boosting. The third module is a cascade classifier
that sorts out insignificant regions and analyzes in detail the remaining regions. All
three modules are subsequently described in more detail.
Integral Image
To keep feature computation as simple as possible, Haar features as presented by Papageorgiou et al. [18] are used. A Haar feature results from calculating the luminance difference of equally sized neighboring regions. The regions combine to the five constellations depicted in Fig. 2.33, where the feature M is determined by:

M(x_1, y_1, x_2, y_2) = abs(Intensity_1 − Intensity_2)    (2.109)

Fig. 2.33. The five feature constellations used.

With these features, regions showing maximum luminance difference can be found. E.g., the brow region mostly shows different intensity from the region above
the brow. Hence, the luminance difference is large and the resulting feature is rated
significant.
However, in a 24x24 pixel region there are already 45396 options to form the
constellations depicted in Fig. 2.33. This enormous number results from the fact that
the side ratio of the regions may vary. To speed-up search in such a large search
space an integral image is introduced which provides the luminance integral in an
economic way.
The integral image Iint at the coordinate (x, y) is defined as the sum over all
intensities I(x, y):

$$I_{int}(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y') \qquad (2.110)$$

By introducing the cumulative column sum Srow (x, y) two recursive calculations
can be performed:

S_row(x, y) = S_row(x, y − 1) + I(x, y)    (2.111)


Iint (x, y) = Iint (x − 1, y) + Srow (x, y) (2.112)

The application of these two equations enables the computation of a complete in-
tegral image by iterating just once through the original image. The integral image has
to be calculated only once. Subsequently, the intensity integrals Fi of any rectangle
are determined by executing a simple addition and two subtractions.
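A minimal NumPy sketch of this computation is given below; integral_image implements the integral image of Eq. (2.110) via cumulative sums, and rect_sum evaluates a rectangle from four table look-ups as in Fig. 2.34. The function names are ours, chosen for illustration.

```python
import numpy as np

def integral_image(img):
    """Integral image: entry (y, x) holds the sum of all intensities
    above and to the left of (and including) that pixel."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(iint, x0, y0, x1, y1):
    """Intensity sum of the rectangle spanned by the inclusive corners
    (x0, y0) and (x1, y1), using one addition and two subtractions."""
    total = iint[y1, x1]
    if x0 > 0:
        total -= iint[y1, x0 - 1]
    if y0 > 0:
        total -= iint[y0 - 1, x1]
    if x0 > 0 and y0 > 0:
        total += iint[y0 - 1, x0 - 1]
    return total

# a Haar-like two-region feature is then the difference of two such sums:
# m = abs(rect_sum(iint, ...) - rect_sum(iint, ...))
```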

Fig. 2.34. Calculation of the luminance of a region with the integral image. The luminance of region F_D is I_int(x_D, y_D) + I_int(x_A, y_A) − I_int(x_B, y_B) − I_int(x_C, y_C).

In Fig. 2.34 we find:



FA =Iint (xA , yA ) (2.113)


FB =Iint (xB , yB ) − Iint (xA , yA ) (2.114)
FC =Iint (xC , yC ) − Iint (xA , yA ) (2.115)
FD =Iint (xD , yD ) + Iint (xA , yA ) − Iint (xB , yB ) − Iint (xC , yC ) (2.116)
FE =Iint (xE , yE ) − Iint (xC , yC ) (2.117)
FF =Iint (xF , yF ) + Iint (xC , yC ) − Iint (xD , yD ) − Iint (xE , yE ) (2.118)
F_F − F_D = I_int(x_F, y_F) − I_int(x_A, y_A) + I_int(x_B, y_B) − I_int(x_E, y_E) + 2I_int(x_C, y_C) − 2I_int(x_D, y_D)    (2.119)

The approach is further exemplified when regarding Fig. 2.35. The dark/gray
constellation is the target feature. In a first step the integral image is constructed.
This, in a second step, enables the economic calculation of a difference image. Each
entry in this image represents the luminance difference of both regions with the upper
left corner being the reference point. In the example there is a maximum at position
(2, 3).

Fig. 2.35. Localizing a single feature in an integral image, here consisting of the luminance
difference of the dark and gray region.

Ada-Boosting
Now the n most significant features are selected with the Ada-Boosting-algorithm,
a special kind of ensemble learning. The basic idea here is to construct a diverse
ensemble of “weak” learners and combine these to a “strong” classifier.
Starting point is a learning algorithm A that generates a classifier h from training data D, where:

A : D → h    (2.120)
D = {(x_i, y_i) ∈ X × Y | i = 1 . . . m}    (2.121)
h : X → Y    (2.122)

X is the set of all features; Y is the perfect classification result. The number of
training examples is m. Here A was selected to be a perceptron:

$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \Theta_j \\ 0 & \text{otherwise} \end{cases} \qquad (2.123)$$

The perceptron h_j(x) depends on the feature f_j(x) and a given threshold Θ_j, where the parity p_j determines the direction of the inequality.
During search for a weak classifier a number of variants Dt with t = 1, 2, . . . K
is generated from the training data D. This results in an ensemble of classifiers ht =
A(Dt ). The global classifier H(x) is then implemented as weighted sum:


$$H(x) = \sum_{k=1}^{K} \alpha_k h_k(x) \qquad (2.124)$$

Ada-Boosting iteratively generates classifiers. During each iteration every training example is weighted according to its contribution to the classification error. Since the
weight of “difficult” examples is steadily increased, the training of later generated
classifiers concentrates on the remaining problematic cases. In the following the
algorithm for the selection of suitable features by Ada-Boosting is presented.

Step 1: Initialization
• let m1 . . . mn be the set of training images
• let b_i be a binary variable, with b_i = 0 if m_i ∈ M_faces and b_i = 1 if m_i ∉ M_faces
• let m be the number of all faces and l the number of non-faces
• w_{1,i} are initialization weights, with w_{1,i} = 1/(2m) for b_i = 0 and w_{1,i} = 1/(2l) for b_i = 1

Step 2: Iterative search for the T most significant features. For t = 1, . . . , T:
• Normalize the weights w_{t,i} = w_{t,i} / Σ_{j=1}^{n} w_{t,j} such that w_t forms a probability distribution
• For each feature j train a classifier h_j. The error depending on w_t is ε_j = Σ_i w_i · |h_j(x_i) − y_i|
• Select the classifier h_t with the smallest error ε_t
• Update the weights w_{t+1,i} = w_{t,i} β_t^{1−e_i}, where e_i = 0 for a correctly and e_i = 1 for an incorrectly classified example, and β_t = ε_t / (1 − ε_t)

Step 3: Estimation of the Resulting Classifier
• The resulting strong classifier finally is:

$$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases} \quad \text{with } \alpha_t = \log\frac{1}{\beta_t} \qquad (2.125)$$
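To illustrate the selection loop, the sketch below implements a simplified AdaBoost round over precomputed Haar feature values with perceptron (decision stump) weak classifiers. It is not the authors' implementation; feature_values and labels are assumed inputs, and the brute-force threshold search is included only for clarity — a practical system would organize this search far more efficiently.

```python
import numpy as np

def adaboost_select(feature_values, labels, rounds):
    """Select `rounds` weak classifiers (feature, threshold, parity, alpha).

    feature_values -- array (n_examples, n_features) of Haar feature responses
    labels         -- array of 0/1 labels (1 = face)
    """
    n, _ = feature_values.shape
    pos, neg = labels.sum(), n - labels.sum()
    w = np.where(labels == 1, 1.0 / (2 * pos), 1.0 / (2 * neg))   # initialization weights
    chosen = []
    for _ in range(rounds):
        w = w / w.sum()                                  # normalize weights
        best = None
        for j in range(feature_values.shape[1]):
            for parity in (+1, -1):
                for theta in np.unique(feature_values[:, j]):
                    pred = (parity * feature_values[:, j] < parity * theta).astype(int)
                    err = np.sum(w * np.abs(pred - labels))
                    if best is None or err < best[0]:
                        best = (err, j, theta, parity, pred)
        err, j, theta, parity, pred = best
        beta = err / (1.0 - err)
        e = np.abs(pred - labels)                        # 0 = correct, 1 = incorrect
        w = w * beta ** (1 - e)                          # de-emphasize correctly classified examples
        chosen.append((j, theta, parity, np.log(1.0 / beta)))
    return chosen
```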
Cascaded Classifier
This type of classifier is applied to reduce classification errors and simultaneously shorten processing time. To this end small, efficient classifiers are built that reject numerous sub-windows of an image as non-faces (potential false positives), without incorrectly discarding faces (false negatives).
Subsequently more complex classifiers are applied to further reduce the false
positives-rate. A simple classifier for a two-region feature hereby reaches, e.g., in a
first step a false-positive rate of only 40 % while at the same time recognizing 100
% of the faces correctly. The total system is equivalent to a truncated decision tree,
with several classifiers arranged in layers called cascades (Fig. 2.36).

Fig. 2.36. Schematic diagram of the cascaded classifier: several classifiers in line are applied for each image region. Each classifier can reject that region. This approach strongly reduces computational effort and at the same time provides very accurate results.

The false-positive rate F of a cascade is:

$$F = \prod_{i=1}^{K} f_i \qquad (2.126)$$

where f_i is the false-positive rate of one of the K layers. The recognition rate D is, respectively:

$$D = \prod_{i=1}^{K} d_i \qquad (2.127)$$
The final goal for each layer is to compose a classifier from several features until the required recognition rate D is reached while the false-positive rate F remains as small as possible. This is achieved by the following algorithm:

Step 1: Initialization
• The user defines f , d and F_Target
• Let Mfaces be the set of all faces
• Let M¬faces be the set of all non-faces
• F0 = 1.0; D0 = 1.0; i = 0
Step 2: Iterative Determination of all Classifiers of a Layer
• i = i + 1; ni = 0; Fi = Fi−1
• while Fi > FTarget
– n_i = n_i + 1
– Use Mfaces and M¬faces to generate with Ada-Boosting a classifier with ni
features.
– Test the classifier of the actual layer on a separate test set to determine Fi and
Di .
– Reduce the threshold of the i-th classifier, until the classifier of the actual
layer reaches a recognition rate of d × Di−1 .
• If F_i > F_Target, test the classifier cascade obtained so far on a set of non-face images and add each misclassified image to M_¬faces.
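At run time, the behavior of such a cascade can be sketched as follows: a sub-window is rejected as soon as any stage votes against it, so most image regions are discarded after evaluating only the first, very small classifiers. The sketch assumes stages to be a list of stage classifiers with their thresholds; the names are illustrative.

```python
def cascade_accepts(window, stages):
    """Pass an image sub-window through all cascade stages.

    stages -- list of (classifier, threshold) pairs, ordered from the
              simplest (few features) to the most complex classifier.
    Returns True only if every stage accepts; rejection ends evaluation early.
    """
    for classifier, threshold in stages:
        if classifier(window) < threshold:   # stage votes "non-face"
            return False                     # reject immediately, skip later stages
    return True                              # all stages accepted: face candidate
```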
2.2.2.2 Face Tracking
For side view images of faces the described localization procedure yields only uncer-
tain results. Therefore an additional module for tracking suitable points in the facial
area is applied, using the algorithm of Tomasi and Kanade [26].
The tracking of point pairs in subsequent images can be partitioned in two sub-
tasks. First suitable features are extracted which are then tracked in the following
frames of a sequence. In case features are lost, new ones must be identified. If in
subsequent images only small motions of the features to be tracked occur, the inten-
sity of immediately following frames is:

I′(x, y) = I(x + ζ, y + η)    (2.128)


Here d := (ζ, η) describes the shift of point (x, y). Now the assumption holds
that points whose intensity varies strongly in either vertical or horizontal direction in
the first frame show similar characteristics in the second frame. Therefore, a feature
is described by a small rectangular region w with center (x, y) that shows a large
variance with respect to its pixel intensities. The region W is called feature window.
The intensity variance is determined by the gradient g = (gx gy ) at the position
(x, y):
$$G = \begin{pmatrix} g_x^2 & g_x g_y \\ g_x g_y & g_y^2 \end{pmatrix} \qquad (2.129)$$
In case matrix G has large eigenvalues λ1 and λ2 , the region establishes a good
feature with sufficient variance. Consequently those windows w are selected, for
which the eigenvalue

λ = min(λ1 , λ2 ) (2.130)
is highest. In addition, further restrictions like the minimal distance between fea-
tures or the minimal distance from image border determine the selection.
For the selected features, the shift in the next frame is determined based on trans-
lation models. If the intensity function is approximated linearly by

I(x + ζ, y + η) = I(x, y) + g T d (2.131)


then the following equations have to be solved:

G · d = e    (2.132)

$$\text{with} \quad e = \int_W \big( I(x, y) - I'(x, y) \big)\, g\, \omega \, dW \qquad (2.133)$$

Here the integration runs over the feature window W, with ω being a weighting function. Since neither the translation model nor the linear approximation is exact, the equations are solved iteratively and the admissible shift d is assigned an upper limit. In addition a motion model is applied which makes it possible to define a measure of feature dissimilarity by estimating the affine transform. In case this dissimilarity gets too large, the feature is substituted by a new one.
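A minimal NumPy sketch of the feature selection criterion is given below: for each candidate window the structure matrix G of Eq. (2.129) is accumulated from image gradients, and its smaller eigenvalue is used as the quality measure as in Eq. (2.130). Window size, feature count, and the use of scipy's box filter are illustrative choices, not those of the authors; minimum-distance constraints are omitted.

```python
import numpy as np
from scipy.signal import convolve2d

def min_eigenvalue_map(image, win=7):
    """For each pixel, the smaller eigenvalue of the gradient structure
    matrix G accumulated over a (win x win) window centered there."""
    gy, gx = np.gradient(image.astype(float))
    kernel = np.ones((win, win))
    sxx = convolve2d(gx * gx, kernel, mode="same")
    sxy = convolve2d(gx * gy, kernel, mode="same")
    syy = convolve2d(gy * gy, kernel, mode="same")
    # closed form for the smaller eigenvalue of a symmetric 2x2 matrix
    return (sxx + syy) / 2.0 - np.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)

def select_features(image, count=50, win=7):
    """Return (y, x) positions of the `count` windows with the largest
    minimal eigenvalue, i.e. the best trackable features."""
    lam = min_eigenvalue_map(image, win)
    idx = np.argsort(lam.ravel())[::-1][:count]
    return np.column_stack(np.unravel_index(idx, lam.shape))
```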

2.2.3 Feature Extraction with Active Appearance Models

For the analysis of single facial features the face is modeled by dynamic face graphs.
Most common are Elastic Bunch Graphs (EBG) [27] and Active Appearance Models
(AAM), the latter are treated here.
An Active Appearance Model (AAM) is a statistical model combining textural and geometrical information into an appearance. The combination of outline and texture models is based on an eigenvalue approach which reduces the amount of data needed, hereby enabling real-time calculation.
Outline Model
The contour of an object is represented by a set of points placed at characteristic positions of the object. The point configuration must be stable with respect to shift, rotation and scaling. Cootes et al. [6] describe how marking points, i.e. landmarks,
are selected consistently on each image of the training set. Hence, the procedure is
restricted to object classes with a common basic structure.
In a training set of triangles, e.g., typically the corner points are selected since
they describe the essential property of a triangle and allow easy distinction of differ-
ences in the various images of a set. In general, however, objects have more compli-
cated outlines so that further points are needed to adequately represent the object in
question as, e.g., the landmarks for a face depicted in Fig. 2.37.
As soon as n points are selected in d dimensions, the outline (shape) is repre-
sented by a vector x which is composed of the single landmarks.

Fig. 2.37. 70 landmarks are used for the description of a face.

x = [x11 , . . . , x1n , x21 , . . . , x2n , . . . , xd1 , . . . , xdn ]T (2.134)


If there are s examples in the training set, then s vectors x are generated.
For statistical analysis images must be transformed to the same pose, i.e. the same
position, scaling, and rotation. This is performed by a procrustes analysis which con-
siders the outlines in a training set and minimizes the sum of distances with respect
to the average outline (Fig. 2.38). For making adjustments, only outline-preserving
operations must be applied in order to keep the set of points compact and to avoid
unnecessary nonlinearities.

Fig. 2.38. Adjustment of a shape with respect to a reference shape by a procrustes analysis
using affine transformation. Shape B is translated such that its center of gravity matches that
of shape A. Subsequently B is rotated and scaled until the average distance of corresponding
points is minimized (procrustes distance).

Now the s point sets are adjusted in a common coordinate system. They describe
the distribution of the outline points of an object. This property is now used to con-
struct new examples which are similar to the original training set. This means that a
new example x can be generated from a parameterized model M in the form

x = M (p) (2.135)
where p is the model parameter vector. By a suitable selection of p, new examples are kept similar to the training set. The Principal Component Analysis (PCA) is a means for dimensionality reduction, first identifying the main axes of a cluster.
Therefore it involves a mathematical procedure that transforms a number of possi-
bly correlated parameters into a smaller number of uncorrelated parameters called
principal components. The first principal component accounts for as much of the
variability in the data as possible, and each succeeding component accounts for as
much of the remaining variability as possible.
For performing the PCA, first the mean vector x̄ of the s training set vectors x_i is determined by:

$$\bar{x} = \frac{1}{s} \sum_{i=1}^{s} x_i \qquad (2.136)$$
Let the outline points of the model be described by n parameters x_i. The variance of a parameter x_i describes how far the values x_{ij} deviate from the expectation value E(x_i). The covariance Cov then describes the linear coherence of two parameters x_i and x_j:

Cov(x_i, x_j) = E[(x_i − E(x_i)) · (x_j − E(x_j))]    (2.137)
The difference between each outline vector x_i and the average x̄ is ∂x_i. Arranging the differences as columns of a matrix yields:

D = (∂x1 . . . ∂xs ) (2.138)


From this the covariance matrix follows
$$S = \frac{1}{s} D D^T \qquad (2.139)$$
from which the eigenvalues λ_i and Eigenvectors ϕ_i, which form an orthogonal basis, are calculated. Each eigenvalue represents the variance with respect to the
mean value in the direction of the corresponding Eigenvector. Let the deformation
matrix Φk contain the t largest Eigenvectors of S, i.e. those influencing the deviation
from the average value most:

Φk = (ϕ1 , ϕ2 , . . . ϕt ) (2.140)
Then these t vectors form a subspace basis which again is orthogonal. Each point
in the training set can now be approximated by

x ≈ x + Φk · pk (2.141)

where pk corresponds to the model parameter vector mentioned above, now with
a reduced number of parameters.
The number t of Eigenvectors may be selected so that the model explains the
total variance to a certain degree, e.g. 98%, or that each example in the training set is
approximated to a desired accuracy. Thus an approximation for the n marked contour
points with dimension d each is found for all s examples in the training set, as related
to the deviations from the average outline. With this approximation, and considering
the orthogonality of the Eigenvector matrix, the parameter vector p_k results in

p_k = Φ_k^T (x − x̄)    (2.142)
By varying the elements of pk the outline x may be varied as well. The average
outline and exemplary landmark variances are depicted in (Fig. 2.39).

Fig. 2.39. Average outline and exemplary landmark variances.

The eigenvalue λ_i is the variance of the i-th parameter p_{k,i} over all examples in the training set. To make sure that a newly generated outline is similar to the training patterns, limits are set. Empirically it was found that the maximum deviation for the parameter p_{k,i} should be no more than ±3√λ_i (Fig. 2.40).
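The outline model can be summarized in a few lines of NumPy. The sketch below builds the mean shape and the deformation matrix Φ_k from aligned training shapes and reconstructs a shape from a (clipped) parameter vector as in Eqs. (2.141)–(2.142). It is our illustration under the assumption that the procrustes alignment has already been performed.

```python
import numpy as np

def train_shape_model(shapes, variance_kept=0.98):
    """shapes: array (s, 2n) of aligned outline vectors (Eq. 2.134).
    Returns mean shape, Eigenvector matrix Phi_k and eigenvalues."""
    mean = shapes.mean(axis=0)
    D = (shapes - mean).T                          # differences as columns (Eq. 2.138)
    S = D @ D.T / shapes.shape[0]                  # covariance matrix (Eq. 2.139)
    eigvals, eigvecs = np.linalg.eigh(S)           # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    t = np.searchsorted(np.cumsum(eigvals) / eigvals.sum(), variance_kept) + 1
    return mean, eigvecs[:, :t], eigvals[:t]

def shape_from_params(mean, phi_k, eigvals, p_k):
    """x = mean + Phi_k * p_k (Eq. 2.141), parameters clipped to +-3 sqrt(lambda)."""
    limit = 3.0 * np.sqrt(eigvals)
    p_k = np.clip(p_k, -limit, limit)
    return mean + phi_k @ p_k

def params_from_shape(mean, phi_k, x):
    """p_k = Phi_k^T (x - mean) (Eq. 2.142)."""
    return phi_k.T @ (x - mean)
```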
Texture Model
With the calculated mean outline and the principal components it is possible to re-
construct each example of the training data. This removes texture variances resulting
from the various contours. Afterwards the RGB values of the normalized contour are
extracted in a texture-vector tim .
To minimize illumination effects, the texture vectors are normalized once more using a scaling factor α and an offset β, resulting in:
Fig. 2.40. Outline models for variations of the first three Eigenvectors φ1, φ2 and φ3 between 3√λ, 0 and −3√λ.

$$t = \frac{t_{im} - \beta \cdot 1}{\alpha} \qquad (2.143)$$
For the identification of t the average of all texture vectors has to be determined
first. Additionally a scaling and an offset have to be estimated, because the sum of
the elements must be zero and the variance should be 1. The values for α and β for
normalizing tim are then

$$\alpha = t_{im} \cdot \bar{t}, \qquad \beta = \frac{t_{im} \cdot 1}{n} \qquad (2.144)$$
where n is the number of elements in a vector. As soon as all texture vectors have
been assigned scaling and offset, the average is calculated again. This procedure is
iterated for the new average until the average converges to a stable value.
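A compact sketch of this iterative normalization is given below. It is our own illustration, assuming the texture vectors are stored as rows of a NumPy array and that the normalized mean texture is brought to zero-sum elements and unit variance, as stated above.

```python
import numpy as np

def normalize_textures(textures, iterations=10):
    """Iteratively normalize texture vectors against their re-estimated mean.

    textures -- array (s, n) of sampled texture vectors
    """
    mean = textures[0].astype(float).copy()           # initial mean estimate
    for _ in range(iterations):
        mean = mean - mean.mean()                     # zero-sum reference
        mean = mean / mean.std()                      # unit variance reference
        normalized = []
        for t in textures:
            beta = t.mean()                           # offset (Eq. 2.144)
            alpha = (t - beta) @ mean / len(t)        # scaling against current mean
            normalized.append((t - beta) / alpha)     # normalization (Eq. 2.143)
        new_mean = np.mean(normalized, axis=0)
        if np.allclose(new_mean, mean, atol=1e-6):    # average has converged
            break
        mean = new_mean
    return np.array(normalized), mean
```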
From here the further processing is identical to that described in the outline sec-
tion. Again a PCA is applied to normalized data which results in the following ap-
proximation for a texture vector in the training set

t = t + Φt · pt (2.145)
where Φt is again the Eigenvector matrix of the covariance matrix, now for tex-
ture vectors. The texture parameters are represented by pt .
For texture description an artificial 3-band image is used which is composed of the intensity channel K1, the I3 channel K2 and the Y gradient K3 (Fig. 2.41). It
boosts edges and color information, facilitating the separation of lips from the face.
The color space with its three channels K1 , K2 and K3 is defined as:

$$K_1 = \frac{R + G + B}{3} \qquad (2.146)$$

$$K_2 = \frac{2G - R - B}{4} \qquad (2.147)$$

$$K_3 = \frac{2G - R - B}{4} \qquad (2.148)$$

Fig. 2.41. An artificial 3-band image (a) composed of intensity channel (b), I3 channel (c),
and Y gradient (d).

2.2.3.1 Appearance Model


The parameter vectors pk and pt permit to describe each example in the training set
by outline and texture. To this end a concatenated vector pa is generated:

$$p_a = \begin{pmatrix} W_k p_k \\ p_t \end{pmatrix} \qquad (2.149)$$
where Wk is a diagonal matrix of weights for each outline parameter. This mea-
sure is needed, since the elements of the outline parameter vectors pk are pixel dis-
tances while those of the texture parameter vectors are pixel intensities. Hence both
cannot be compared directly. The weights are identified by systematically varying the values of the outline parameter vector p_k and checking the effect of this variation on the texture.
Since the Eigenvector matrix is orthogonal, its inverse equals its transpose. Therefore:

pk = ΦTk (x − x) (2.150)
pt = ΦTt (t − t) (2.151)

By insertion we then get:


$$p_a = \begin{pmatrix} W_k \Phi_k^T (x - \bar{x}) \\ \Phi_t^T (t - \bar{t}) \end{pmatrix} \qquad (2.152)$$

A further PCA is applied to exploit possible correlations between the outline


and texture parameters for data compression. Since the parameter vectors pk and
pt describe variances, their mean is the zero vector. The PCA then results in the
following approximation for an arbitrary appearance vector from the training set:

pa ≈ Φa c (2.153)
Here Φa denotes the appearance Eigenvector matrix and c its parameter vector.
The latter controls the parameters of outline as well as of texture. The linearity of
the model now permits to apply linear algebra to formulate outline and texture as a
function of c. Setting:


$$\Phi_p = \begin{pmatrix} \Phi_{pk} \\ \Phi_{pt} \end{pmatrix} \qquad (2.154)$$

$$c = \begin{pmatrix} c_{pk} \\ c_{pt} \end{pmatrix} \qquad (2.155)$$

results in

$$p_a = \begin{pmatrix} W_k \Phi_k^T (x - \bar{x}) \\ \Phi_t^T (t - \bar{t}) \end{pmatrix} = \begin{pmatrix} W_k \Phi_k^T x - W_k \Phi_k^T \bar{x} \\ \Phi_t^T t - \Phi_t^T \bar{t} \end{pmatrix} = \begin{pmatrix} \Phi_{pk} c_{pk} \\ \Phi_{pt} c_{pt} \end{pmatrix} \qquad (2.156)$$

$$\Leftrightarrow \quad \begin{pmatrix} W_k \Phi_k^T x \\ \Phi_t^T t \end{pmatrix} = \begin{pmatrix} \Phi_{pk} c_{pk} + W_k \Phi_k^T \bar{x} \\ \Phi_{pt} c_{pt} + \Phi_t^T \bar{t} \end{pmatrix} \qquad (2.157)$$

$$\Leftrightarrow \quad \begin{pmatrix} x \\ t \end{pmatrix} = \begin{pmatrix} \Phi_k W_k^{-1} \Phi_{pk} c_{pk} + \bar{x} \\ \Phi_t \Phi_{pt} c_{pt} + \bar{t} \end{pmatrix} \qquad (2.158)$$

From this the following functions for the determination of an outline x and a
texture t can be read:

x = x + Φk Wk−1 Φpk cpk (2.159)


t = t + Φt Φpt cpt (2.160)

or, in shortened form

x = x + Qk cpk (2.161)
t = t + Qt cpt (2.162)

with

Qk = Φk Wk−1 Φpk (2.163)


Qt = Φt Φpt (2.164)

Using these equations an example model can be generated by


• defining a parameter vector c,
• generating the outline free texture image with vector t,
• deforming this by using the control points in x.
In Fig. 2.42 example appearance models for variations of the first five Eigenvectors between 3√λ, 0, and −3√λ are presented.
2.2.3.2 AAM Search
Image interpretation in general tries to recognize a known object in an a priori un-
known image, even if the object is changed e.g. by a variation in camera position.
When Active Appearance Models are used for image interpretation, the model pa-
rameters have to be identified such that the difference between the model and the
object in the image is minimized. Image interpretation is thus transformed into an
optimization problem.
By performing a PCA the amount of data is considerably reduced. Nevertheless the optimization process is rather time consuming if the search space is not limited. There are, however, possibilities to optimize the search algorithm. Each attempt to match the face graph to an actual image proceeds similarly: changing one parameter of the configuration changes a specific pattern in the output image. Linear dependencies can be shown to exist in this process, which can be exploited for speeding up the system. The general approach is as follows:
The distance vector ∂I between a given image and the image generated by an
AAM is:

∂I = Ii − Im (2.165)
For minimizing this distance, the error Δ = |∂I|² is considered. Optimization then concerns
• Learning the relation between image difference ∂I and deviations in the model
parameters.
• Minimizing the distance error Δ by applying the learned relation iteratively.
Experience shows that a linear relation between ∂I and ∂c is a good approxima-
tion:

∂c = A · ∂I (2.166)
To identify A a multiple multivariate linear regression is performed on a selection
of model variations and corresponding difference images ∂I. The selection could be
Fig. 2.42. Appearance models for variations of the first five Eigenvectors c1, c2, c3, c4 and c5 between 3√λ, 0, and −3√λ.
generated by carefully applying random changes to the model parameters of the given examples.
given examples.
For the determination of error Δ the reference points have to be identical. There-
fore the image outlines to be compared are made identical and their texture is com-
pared. In addition to variations in the model parameters, small deviations in pose, i.e.
translation, scaling and orientation are introduced.
These additional parameters are made part of the regression by simply adding
them to the model parameter deviation vector ∂c. To maintain model linearity, pose
is varied only by scaling s, rotation Θ and translation in x/y direction (t_x, t_y). The corresponding parameters are s_x, s_y, t_x, t_y, where s_x = s · cos(Θ) − 1 and s_y = s · sin(Θ) perform scaling and rotation in combination.
To identify the relation between ∂c, ∂I, and ∂t the following steps are performed
for each image in the training set:
• Initialize the starting vector c0 with the fixed appearance parameters c of the
actually selected texture tim .
• Define arbitrary deviations for the parameters in |∂c| and compute new model
parameters c = ∂c + c0
• With this new parameter vector c, generate the corresponding target contour
which is then used to normalize ti , such that its outline matches that of the mod-
eled object.
• Then generate a texture image tm free of outline by applying c.
• Now the difference to the current image induced by ∂c can be calculated as ∂t =
tn − tm
• Note ∂c and the corresponding ∂t and iterate several times for the same example
image. Then iterate over all images in the training set.
The selected ∂c should be made as large as possible. Empirical evaluations have
shown that the optimal region for the model parameters is 50% of the training set
standard deviation, 10% scaling, and 2 pixel translation. For larger values the as-
sumption of a linear relation is no longer warranted.
The identified value pairs ∂c and ∂t are now submitted to a multiple multivariate
regression to find A. In general a regression analysis describes the effect of influence variables on some target variable. Here, more specifically, the linear dependency of the induced texture differences ∂t on the selected parameter deviations ∂c is examined. So A encodes a-priori knowledge which is
gained during the iteration described above.
By establishing the relation between model parameter deviations and their effect
on generated images the first part of the optimization problem has been solved. What
follows now in a second step is error minimization.
This is achieved by iterations, whereby the linear model described earlier predicts
for each iteration the change of model-parameters which lets the model converge to
an optimum. Convergence is assumed as soon as the error reaches a given threshold.
The estimated model parameters c_0 are initialized with those of the standard model, i.e. with the average values; once again the normalized image t_n is involved. The algorithm proceeds as follows:
• Identify the image difference ∂t_0 = t_n − t_m and the error E_0 = |∂t_0|²
• Compute the estimated deviation ∂c = A · ∂t0
• Set k = 1 and c1 = c0 − k∂c
• Generate the image to estimate the parameter vector c1 which results in a new
tm .
• Execute for the given normalized image a further normalization, this time for the
outline given by the new parameter-vector c1 .

• Compute the new error vector ∂t1 .


• Repeat the procedure until the error Ei converges. In case error Ei does not
converge, retry with smaller values, e.g., 0.5, 0.25 etc. for k.
This results in a unique configuration of model parameters for describing an object,
here the face, in a given image. Hence the object can be assigned to a particular
object class, i.e. to the one for which the model was constructed.
In case the configuration coincides with a normalized example of the training-
set, the object is not only localized but also identified. To give an example: If the
algorithm for optimization determines a graph with a certain parameter-configuration
very similar to parameters of a particular view in the training-set, then the actually
localized face could match that of the trained person.
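The iterative search can be condensed into the following sketch. The regression matrix A (Eq. 2.166), the model synthesis function and the sampling/normalization step are assumed to be available from the training stage described above; the step size sequence follows the halving strategy of the algorithm. The function and parameter names are ours.

```python
import numpy as np

def aam_search(c0, A, synthesize, sample_normalized, max_iter=30, tol=1e-6):
    """Iterative AAM matching (sketch).

    c0                   -- initial appearance parameters (e.g. the mean model)
    A                    -- learned regression matrix with dc = A @ dt (Eq. 2.166)
    synthesize(c)        -- returns the shape-free model texture t_m for parameters c
    sample_normalized(c) -- samples the input image under the shape given by c
                            and returns the normalized texture t_n
    """
    c = c0.copy()
    error = None
    for _ in range(max_iter):
        dt = sample_normalized(c) - synthesize(c)
        new_error = float(dt @ dt)                     # E = |dt|^2
        if error is not None and abs(error - new_error) < tol:
            break                                      # error has converged
        dc = A @ dt                                    # predicted parameter update
        for k in (1.0, 0.5, 0.25, 0.125):              # try decreasing step sizes
            c_try = c - k * dc
            dt_try = sample_normalized(c_try) - synthesize(c_try)
            if float(dt_try @ dt_try) < new_error:
                c, error = c_try, float(dt_try @ dt_try)
                break
        else:
            break                                      # no improvement found
    return c
```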

2.2.4 Feature Classification

To localize areas of interest, a face graph is iteratively matched to the face using a
modified active appearance model (AAM) which exploits geometrical and textural
information within a triangulated face. The AAM must be trained with data of many subjects. The facial features of interest which have to be identified in this processing stage are head pose, line of sight, and lip outline.
2.2.4.1 Head Pose Estimation
For the identification of the head posture two approaches are pursued in parallel. The first method is an analytical method based on the trapezoid described
by the outer corners of the eyes and the corners of the mouth. The four required
points are taken from a matched face graph.
The second, holistic method also makes use of a face graph. Here the face region
is derived from the convex hull that contains all nodes of the face-graph (Fig. 2.43).
Then a transformation of the face region into the eigenspace takes place. The basis of the eigenspace is created by the PCA transformation of 60 reference views which represent different head poses. The calculation of the Euclidean distance between the projection of the actual view and the reference views in the eigenspace allows the estimation of the head pose.
Finally the results of the analytic and the holistic approach are compared. In case
of significant differences the actual head pose is estimated by utilizing the last correct
result combined with a prediction involving the optical flow.
Analytic Approach
In a frontal view the area between eye and mouth corners appears to be symmetrical.
If, however, the view is not frontal, the face plane will be distorted. To identify roll σ
and pitch angle τ the distorted face plane is transformed into a frontal view (Fig. 2.44). A point x on the distorted plane is transformed into an undistorted point x′ by

x′ = U x + b    (2.167)
where U is a 2x2 transformation matrix and b the translation. It can be shown
that U can be decomposed to:

Fig. 2.43. Head pose is derived by a combined analytical and holistic approach.

U = λ1 R(Θ)P (λ, τ ) (2.168)


Here R(Θ) is a rotation by Θ and P(λ, τ) is the following symmetric matrix:

$$P(\lambda, \tau) = R(\tau) \begin{pmatrix} \lambda & 0 \\ 0 & 1 \end{pmatrix} R(-\tau) \qquad (2.169)$$
with

$$R(\tau) = \begin{pmatrix} \cos\tau & -\sin\tau \\ \sin\tau & \cos\tau \end{pmatrix} \qquad (2.170)$$
The decomposition of U shows the ingredients of the linear back transformation, consisting of an isotropic scaling by λ_1, a rotation around the optical axis of the virtual camera corresponding to R(Θ), and a scaling by the factor λ in the direction of τ.
With two orthogonal vectors a (COG eyes – outer corner of the left eye) and b
(COG eyes – COG mouth corners) the transformation matrix U is:


$$U (a\ b) = \begin{pmatrix} 0 & \frac{1}{2} R_e \\ -1 & 0 \end{pmatrix} \qquad (2.171)$$

$$\Leftrightarrow \quad U = \begin{pmatrix} 0 & \frac{1}{2} R_e \\ -1 & 0 \end{pmatrix} (a\ b)^{-1} \qquad (2.172)$$

where R_e is the ratio of a and b. Since a and b are parallel and orthogonal, respectively, with respect to the symmetry axes, it follows:

Fig. 2.44. Back transformation of a distorted face plane (right) on an undistorted frontal view
of the face (left) which yields roll and pitch angle.

aT · U T · U · b = 0 (2.173)
Hence U^T · U is a symmetric matrix:

$$U^T \cdot U = \begin{pmatrix} \alpha & \beta \\ \beta & \gamma \end{pmatrix} \qquad (2.174)$$
Also it follows that:

$$\tan(2\tau) = \frac{2\beta}{\alpha - \gamma} \quad \text{and} \quad \cos(\sigma) = \frac{1}{\sqrt{\mu} + \sqrt{\mu - 1}} \qquad (2.175)$$

$$\text{with} \quad \mu = \frac{(\alpha + \gamma)^2}{4(\alpha \cdot \gamma - \beta^2)} \qquad (2.176)$$

Finally the normal vector for head pose is:


$$n = \begin{pmatrix} \sin(\sigma) \cdot \cos(\tau) \\ \sin(\sigma) \cdot \sin(\tau) \\ -\cos(\sigma) \end{pmatrix} \qquad (2.177)$$
Due to the trigonometric functions involved, roll and pitch angles are always ambiguous, i.e. the trapezoid described by mouth and eye corners is identical for different head poses. This problem can be solved by considering an additional point such as, e.g., the nose tip, to fully determine the system.

Fig. 2.45. Pose reconstruction from facial geometry distortion is ambiguous as may be seen
by comparing the left and right head pose.

Holistic Approach
A second method for head pose estimation makes use of the Principal Component
Analysis (PCA). It transforms the face region into a data space of Eigenvectors that
results from different poses. The monochrome images used for reference are first
normalized by subtracting the average intensity from individual pixels and then di-
viding the result by the standard deviation. Hereby variations resulting from different
illuminations are averaged.
For n views of a rotating head a pose eigenspace (PES) can be generated by
performing the PCA. A reference image sequence distributed equally between -60
and 60 degrees yaw angle is collected, where each image has size m = w × h and
is represented as m-dimensional row vector x. This results in a set of such vectors
(x1 , x2 , . . . , xn ). The average μ and the covariance matrix S is calculated by:

$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (2.178)$$

$$S = \frac{1}{n} \sum_{i=1}^{n} [x_i - \mu][x_i - \mu]^T \qquad (2.179)$$

Let ϕ_j with j = 1 . . . n be the Eigenvectors of S with the largest corresponding eigenvalues λ_j:

$$S \varphi_j = \lambda_j \varphi_j \qquad (2.180)$$

The n Eigenvectors establish the basis for the PES. For each image of size w × h
a pattern vector Ω(x) = [ω1 , ω2 , . . . , ωn ] may be found by projection with the
Eigenvectors φj .

ωj = ϕTj (x − μ) with j = 1, . . . , n (2.181)


After normalizing the pattern vector with the eigenvalues the image can be pro-
jected into the PES:

$$\Omega_{norm}(x) = \left[ \frac{\omega_1}{\lambda_1}, \frac{\omega_2}{\lambda_2}, \cdots, \frac{\omega_n}{\lambda_n} \right] \qquad (2.182)$$
The reference image sequence then corresponds to a curve in the PES. This is
illustrated by Fig. 2.46 where only the first three Eigenvectors are depicted.

Fig. 2.46. top: Five of sixty synthesized views. middle: masked faces. bottom: cropped faces.
right: Projection into the PES.

In spite of color normalization the sequence projection is influenced significantly


by changes in illumination as illustrated by Fig. 2.47. Here three image sequences
with different illumination are presented together with the resulting projection curves
in the PES.

Fig. 2.47. Variations in the PES due to different illumination of the faces.
To compensate this influence a Gabor Wavelet Transformation (GWT) is applied,
which is the convolution with a Gabor Wavelet in Fourier space. Here a combination
of the four GWT responses corresponding to 0°, 45°, 90° and 135° is used. As a result the curves describing head pose under different illumination lie closer together in the PES (Fig. 2.48).

Fig. 2.48. Here four Gabor wavelet transforms are used instead of image intensity. The influence of illumination variations is much smaller than in Fig. 2.47.

Now, if the pose of a new view is to be determined, this image is projected into
the PES as well and subsequently the reference point with the smallest Euclidean
distance is identified.
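In code, the holistic pose estimate reduces to a projection followed by a nearest-neighbor search, sketched below with NumPy. The quantities eigvecs, eigvals and mean, and the reference projections with their yaw angles, are assumed to come from the training stage (Eqs. 2.178–2.182); the names are ours.

```python
import numpy as np

def project_to_pes(x, mean, eigvecs, eigvals):
    """Project an image vector into the pose eigenspace (Eqs. 2.181-2.182)."""
    omega = eigvecs.T @ (x - mean)          # pattern vector
    return omega / eigvals                  # normalization with the eigenvalues

def estimate_yaw(x, mean, eigvecs, eigvals, ref_points, ref_angles):
    """Return the yaw angle of the reference view closest to the projected input.

    ref_points -- array (n_refs, n_eigvecs): projected reference views
    ref_angles -- array (n_refs,): corresponding yaw angles (e.g. -60..60 deg)
    """
    p = project_to_pes(x, mean, eigvecs, eigvals)
    distances = np.linalg.norm(ref_points - p, axis=1)
    return ref_angles[np.argmin(distances)]
```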
2.2.4.2 Determination of Line of Sight
For iris localization the circle Hough transformation is used which supports the re-
liable detection of circular objects in an image. Since the iris contains little red hue, only the red channel is extracted from the eye region; this channel shows a high contrast between skin and iris. By computing directed X and Y gradients on the red channel image, the amplitude and phase of the transition between the iris and its surroundings are emphasized.
The Hough space is created using an extended Hough transform. It is applied to a search mask which is generated by a threshold segmentation of the red channel image. Local maxima then point to circular objects in the original image.
The iris by its very nature represents a homogeneous area. Hence the local extrema in the Hough space are validated by verifying the filling degree of the circular area with the expected radius in the red channel. Finally the line of sight is determined by comparing the intensity distributions around the iris with trained reference distributions using a maximum likelihood classifier. Fig. 2.49 illustrates the entire concept of line of sight analysis.
2.2.4.3 Circle Hough Transformation
The Hough transformation (HT) represents digital images in a special parameter
space called Hough Space. For a line, e.g., this space is two dimensional and has the
coordinates slope and shift. For representing circles and ellipses a Hough space of
higher dimensions is required.

Fig. 2.49. Line of sight is identified based on amplitude and phase information in the red
channel image. An extended Hough transformation applied to a search mask finds circular
objects in the eye region.

For the recognition of circles, the circle hough transformation (CHT) is applied.
Here the Hough space H(x, y, r) corresponds to a three dimensional accumulator
that indicates the probability that there is a circle with radius r at position (x, y) in
the input image. Let Cr be a circle with radius r:

Cr = {(i, j)|i2 + j 2 = r2 } (2.183)


Then the Hough Space is described by:

$$H(x, y, r) = \sum_{(i,j) \in C_r} I(x + i, y + j) \qquad (2.184)$$

This means that for each position P of the input image where the gradient is
greater than a certain threshold, a circle with a given radius is created in the Hough
space. The contributions of the single circles are accumulated in Hough space. A
local maximum at position P  then points to a circle center in the input image as
illustrated in Fig. 2.50 for 4 and 16 nodes, if a particular radius is searched for.
For the application at hand the procedure is modified such that the accumulator in Hough space is no longer incremented by full circles but by circular arc segments (Fig. 2.51). The sections of the circular arc are incremented orthogonally with respect to the phase, towards both sides.

Fig. 2.50. In the Hough transformation all points in an edge-coded image are interpreted as
centers of circles. In Hough space the contributions of single circles are accumulated. This
results in local maxima which correspond to circle centers in the input image. This is exem-
plarily shown for 4 and 16 nodes.

Fig. 2.51. Hough transformation for circular arc segments which facilitates the localization of
circles by reducing noise. This is exemplarily shown for 4 and 16 nodes.
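A basic version of the circle Hough transform for a single radius can be sketched as follows; edge thresholding and the arc-segment refinement are omitted, and the gradient threshold is an illustrative parameter.

```python
import numpy as np

def circle_hough(edge_strength, radius, threshold=0.3):
    """Accumulate a Hough space H(x, y) for one fixed circle radius.

    edge_strength -- 2D array of gradient magnitudes in [0, 1]
    radius        -- circle radius searched for (pixels)
    Every strong edge pixel votes for all centers lying on a circle around it.
    """
    h, w = edge_strength.shape
    accumulator = np.zeros((h, w))
    angles = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
    offsets = np.stack([np.round(radius * np.sin(angles)),
                        np.round(radius * np.cos(angles))], axis=1).astype(int)
    ys, xs = np.nonzero(edge_strength > threshold)
    for y, x in zip(ys, xs):
        for dy, dx in offsets:
            cy, cx = y + dy, x + dx
            if 0 <= cy < h and 0 <= cx < w:
                accumulator[cy, cx] += 1        # vote for this candidate center
    return accumulator

# the iris center candidate is then a local maximum of the accumulator:
# cy, cx = np.unravel_index(np.argmax(accumulator), accumulator.shape)
```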

For line of sight identification the eyes' sclera is analyzed. Following iris localization, a concentric circle with twice the iris diameter is described around it. The resulting annulus is divided into eight segments. The intensity distribution of these segments then indicates the line of sight. This is illustrated by Fig. 2.52, where distributions for two different lines of sight are presented which are similar for both eyes
but dissimilar for different lines of sight. A particular line of sight associated with
a distribution as shown in Fig. 2.52 is then identified with a maximum likelihood
classifier.
The maximum likelihood classifier assigns a feature vector g = m(x, y) to one
of t object classes. Here g consists of the eight intensities of the circular arc segments
mentioned above. For classifier training, samples of 15 (5 × 3) different lines of sight were collected from ten subjects. From these data the average vector μ_i (i = 0 . . . 14), the covariance matrix S_i, and the inverse covariance matrix S_i^{-1} were calculated. Based on these, the statistical spread vector s_i could be calculated.

Fig. 2.52. Line of sight identification by analyzing intensities around the pupil.

From these instances the rejection threshold d_{z,i} = s_i^T · S_i^{-1} · s_i for each object class is determined. For classification the Mahalanobis distances d_i = (g − μ_i)^T S_i^{-1} (g − μ_i) of the actual feature vector to all classes are calculated. Afterwards a comparison with the pretrained thresholds takes place. The feature vector g is then assigned to that class k_j for which the Mahalanobis distance is minimal and for which the distance is lower than the rejection threshold of that class (Fig. 2.53).

Fig. 2.53. A maximum likelihood classifier for line of sight identification. A feature vector is
assigned to that class to which the Mahalanobis distance is smallest.
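The classification step can be written compactly as below; means, inv_covs and reject_thresholds are assumed to be the per-class quantities estimated during training, and the names are ours. The function returns None when every class rejects the feature vector.

```python
import numpy as np

def classify_gaze(g, means, inv_covs, reject_thresholds):
    """Maximum likelihood (minimum Mahalanobis distance) classification.

    g                 -- feature vector (eight annulus segment intensities)
    means             -- list of class mean vectors mu_i
    inv_covs          -- list of inverse covariance matrices S_i^-1
    reject_thresholds -- per-class rejection thresholds d_z,i
    Returns the index of the best class, or None if all classes reject.
    """
    best_class, best_dist = None, np.inf
    for i, (mu, s_inv, d_reject) in enumerate(zip(means, inv_covs, reject_thresholds)):
        diff = g - mu
        d = diff @ s_inv @ diff            # squared Mahalanobis distance
        if d < best_dist and d < d_reject:
            best_class, best_dist = i, d
    return best_class
```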

2.2.4.4 Determination of Lip Outline


A point distribution model (PDM) in connection with an active shape model (ASM)
is used for lip outline identification. For ASM initialization the lip borders must be
segmented from the image as accurately as possible.

Lip Region Segmentation

Segmentation is based on four different feature maps which all emphasize the lip
region from the surrounding by color and gradient information Fig. 2.54.

Fig. 2.54. Identification of lip outline with an active shape model. Four different features
contribute to this process.

The first feature represents the probability that a pixel belongs to the lips. The
required ground truth, i.e. lip color histograms, has been derived from 800 images
segmented by hand. The a posteriori probabilities for lips and background are then
calculated as already described in Sect. 2.1.2.1.
In the second feature the nonlinear LUX color space (Logarithmic hUe eXtension) is exploited to emphasize the contrast between lips and their surroundings [14], where L is the luminance, and U and X incorporate chromaticity. Here the U channel is used as a feature.

$$L = (R + 1)^{0.3} (G + 1)^{0.6} (B + 1)^{0.1} - 1 \qquad (2.185)$$

$$U = \begin{cases} \dfrac{256}{2} \cdot \dfrac{R + 1}{L + 1} & \text{if } R < L \\[2mm] 256 - \dfrac{256}{2} \cdot \dfrac{L + 1}{R + 1} & \text{otherwise} \end{cases} \qquad (2.186)$$

$$X = \begin{cases} \dfrac{256}{2} \cdot \dfrac{B + 1}{L + 1} & \text{if } B < L \\[2mm] 256 - \dfrac{256}{2} \cdot \dfrac{L + 1}{B + 1} & \text{otherwise} \end{cases} \qquad (2.187)$$
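The U channel of this color space can be computed per pixel as sketched below (vectorized with NumPy, following Eqs. 2.185–2.186); the function name is ours.

```python
import numpy as np

def lux_u_channel(rgb):
    """U channel of the LUX color space for an RGB image (uint8, HxWx3)."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    L = (r + 1) ** 0.3 * (g + 1) ** 0.6 * (b + 1) ** 0.1 - 1      # Eq. 2.185
    u_low = 128.0 * (r + 1) / (L + 1)                              # case R < L
    u_high = 256.0 - 128.0 * (L + 1) / (r + 1)                     # case R >= L
    return np.where(r < L, u_low, u_high)
```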

A third feature makes use of the I1 I2 I3 color space which also is suited to em-
phasize the contrast between lips and surrounding. The three channels are defined as
follows:

$$I_1 = \frac{R + G + B}{3} \qquad (2.188)$$

$$I_2 = \frac{R - B}{2} \qquad (2.189)$$

$$I_3 = \frac{2G - R - B}{4} \qquad (2.190)$$
The fourth feature utilizes a Sobel operator that emphasizes the edges between
the lips and skin- or beard-region.
$$\begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} \qquad (2.191)$$
The filter masks are convolved with the corresponding image region. The four different features need to be combined to establish an initialization mask. For fusion, a logical OR without individual weighting was selected, as weighting did not improve the results.
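Assuming the four feature maps have already been thresholded to binary masks, the fusion step is then a simple element-wise OR, for example:

```python
import numpy as np

def fuse_feature_maps(prob_map, lux_map, i3_map, gradient_map):
    """Combine four binary lip feature maps (same shape, dtype bool)
    into the initialization mask by a logical OR without weighting."""
    return prob_map | lux_map | i3_map | gradient_map
```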
While the first three features support the segmentation of the lips from their surroundings, the gradient map serves to unite single regions in cases where upper and lower lip are segmented separately due to dark corners of the mouth or to teeth.
Nevertheless, segmentation errors still occur when the color change between lips and surroundings is gradual. This effect is observed if, e.g., shadow falls below the lips, making that region dark. Experiments show that morphological operators like erosion or dilation are inappropriate to smooth out such distortions. Therefore an iterative procedure is used which relates the center of gravity of the rectangle surrounding the lips, Cog_Lip, to the center of gravity of the segmented area, Cog_Area. If Cog_Lip lies above Cog_Area, this is most likely due to line-formed artifacts. These are then erased line by line until a recalculation results in approximately identical centers of gravity (Fig. 2.55).
2.2.4.5 Lip modeling
For the collection of ground truth data, mouth images were taken from 24 subjects
with the head filling the full image format. Each subject had to perform 15 visu-
ally distinguishable mouth forms under two different illumination conditions. Sub-
sequently upper and lower lip, mouth opening, and teeth were segmented manually

Fig. 2.55. The mask resulting from fused features (left) is iteratively curtailed until the distance
of the center of gravity from the bounding box to the center of gravity of the area is minimized
(right).

in these images. Then 44 points were equally assigned on the outline with point 1
and point 23 being the mouth corners.
It is possible to use vertically symmetrical and asymmetrical PDMs. Research
into this shows that most lip outlines are asymmetrical, so asymmetrical models fit
better. Since the segmented lip outlines vary in size, orientation, and position in the
image, all points have to be normalized accordingly. The average form of training
lip outlines, their Eigenvector matrix and variance vector, then form the basis for the
PDMs which are similar to the outline models of the Active Appearance Models.
The resulting PDMs are depicted in Fig. 2.56, where the first four Eigenvectors have been varied in a range between 3√λ, 0, and −3√λ.

Fig. 2.56. Point distribution models for variations of the first four Eigenvectors φ1, φ2, φ3, φ4 between 3√λ, 0, and −3√λ.

2.2.5 Facial Feature Recognition – Eye Localization Application

To run the example application for eye localization, please install the rapid proto-
typing system I MPRESARIO from the accompanying CD as described in Sect. B.1.
After starting I MPRESARIO, choose File → Open to load the process graph named
FaceAnalysis_Eye.ipg (Fig. 2.57). The data flow is straightforward and
corresponds to the structure of the sections in this chapter. All six macros are
described below.

Image Sequence Acquires a sequence of images from hard disk or other storage
media. In the following, each image is processed individually and independently.
Other sources may be used by replacing this macro with the appropriate source
macro (for instance, Video Capture Device for a video stream from an USB cam).

SplitImage Splits the image into the three basic channels Red – Green – Blue. It is possible to split all the channels at the same time or to extract just one channel (here the Red channel is used).

Gradient This is a simple wrapper for the convolution functor with some conve-
nience parameterization to choose between different common gradient kernels. Not
only the classical simple difference computation (right minus left for the x direction
or bottom minus top for the y direction) and the classical Sobel, Prewitt, Robinson,
Roberts and Kirsch kernels can be used, but more sophisticated optimal kernels (see
lti::gradientKernelX) and the approximation using oriented Gaussian derivatives
can be used. (output here: An absolute channel computed from X and Y gradient
channels)

ConvertChannel The output of the preceding stage is a floating point gray value
image I(x, y) ∈ [0, 1]. Since the next macro requires a fixed point gray value image
I(x, y) ∈ {0, 1, . . . , 255} a conversion is performed here.

EyeFinder creates the Hough space by utilizing the gradient information of the red input channel. Local maxima in the Hough space are taken as the centers of the eyes. There are two possibilities to visualize the results: first, the Hough space itself can be shown; second, the contours of the iris can be highlighted in green.

2.2.6 Facial Feature Recognition – Mouth Localization Application

To run the example application for mouth localization, please install the rapid
prototyping system I MPRESARIO from the accompanying CD as described in Sect.
B.1. After starting I MPRESARIO, choose File → Open to load the process graph
named FaceAnalysis_Mouth.ipg (Fig. 2.58). The data flow is straightforward
and corresponds to the structure of the sections in this chapter. All six macros are
described below.

Fig. 2.57. Process graph of the example application for eye-localization.

Image Sequence Acquires a sequence of images from hard disk or other storage
media. In the following, each image is processed individually and independently.
Other sources may be used by replacing this macro with the appropriate source
macro (for instance, Video Capture Device for a video stream from an USB cam).

SplitImage Splits the image into the three basic channels Red – Green – Blue. It is possible to split all the channels at the same time or to extract just one channel (here the Red channel is used).

Gradient This is a simple wrapper for the convolution functor with some conve-
nience parameterization to choose between different common gradient kernels. Not
only the classical simple difference computation (right minus left for the x direction
or bottom minus top for the y direction) and the classical Sobel, Prewitt, Robinson,
Roberts and Kirsch kernels can be used, but more sophisticated optimal kernels
(see lti::gradientKernelX) and the approximation using oriented Gaussian
derivatives can be used. (output here: An absolute channel computed from X and Y
gradient channels)

GrayWorldConstancy Performs a color normalization on an image using a


gray world approach, in order to eliminate effects of illumination color. Gray world methods and white world methods assume that some low order statistics on the
colors of a real world scene lie on the gray line of a color space. The most simple
approach uses only the mean color and maps it linearly into one specific point on
the gray line, assuming that a diagonal transform (a simple channel scaling) is valid.
Normalization is done under the assumption of mondrian worlds, constant point
light sources and delta function camera sensors.

ASMInitializer creates an initialization mask for the MouthFinder. To this end, three color-based maps are generated and combined with the gradient information. The result is a gray value image that represents the probability for each pixel to belong to the background or to the lips.

ObjectsFromMask outputs a list of all detected regions sorted in descending order by area (which may be empty if no regions were found). A value of zero for the
parameter Threshold uses the algorithm shown in Fig. 2.6 for automatic threshold
computation. If this fails, the threshold can be set manually. The remaining parame-
ters control further aspects of the segmentation process and should be left at their
defaults.

MouthFinder utilizes point distribution models to extract the mouth contour. First, a selectable number of points is extracted from the initialization mask given by the ASMInitializer module. Afterwards the points are interpolated by a spline. The resulting contour serves for the initialization of the Active Shape Model.

References
1. Akyol, S. and Canzler, U. An Information Terminal using Vision Based Sign Language
Recognition. In Büker, U., Eikerling, H.-J., and Müller, W., editors, ITEA Workshop
on Virtual Home Environments, VHE Middleware Consortium, volume 12, pages 61–68.
2002.
2. Akyol, S., Canzler, U., Bengler, K., and Hahn, W. Gesture Control for Use in Automo-
biles. In Proceedings of the IAPR MVA 2000 Workshop on Machine Vision Applications.
2000.
3. Bley, F., Rous, M., Canzler, U., and Kraiss, K.-F. Supervised Navigation and Manip-
ulation for Impaired Wheelchair Users. In Thissen, W., Wierings, P., Pantic, M., and
Ludema, M., editors, Proceedings of the IEEE International Conference on Systems, Man
and Cybernetics. Impacts of Emerging Cybernetics and Human-Machine Systems, pages
2790–2796. 2004.
4. Canzler, U. and Kraiss, K.-F. Person-Adaptive Facial Feature Analysis for an Advanced
Wheelchair User-Interface. In Drews, P., editor, Conference on Mechatronics & Robotics,
volume Part III, pages 871 – 876. Sascha Eysoldt Verlag, September 2004.
5. Canzler, U. and Wegener, B. Person-adaptive Facial Feature Analysis. The Faculty of
Electrical Engineering, page 62, May 2004.
6. Cootes, T., Edwards, G., and Taylor, C. Active Appearance Models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23(6):681 – 685, 2001.
7. Csink, L., Paulus, D., Ahlrichs, U., and Heigl, B. Color Normalization and Object Local-
ization. In 4. Workshop Farbbildverarbeitung, Koblenz, pages 49–55. 1998.

Fig. 2.58. Process graph of the example application for mouth localization.

8. Dugad, R. and Desai, U. B. A Tutorial on Hidden Markov Models. Technical Report SPANN-96.1, Department of Electrical Engineering, Indian Institute of Technology, May 1996.
9. Finlayson, G. D., Schiele, B., and Crowley, J. L. Comprehensive Colour Image Normal-
ization. Lecture Notes in Computer Science, 1406(1406), 1998.
10. Jain, A. K. Fundamentals of Digital Image Processing. Prentice-Hall, 1989.
11. Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, 1998. ISBN 0-262-
10066-5.
12. Jones, M. and Rehg, J. Statistical Color Models with Application to Skin Detection.
Technical Report CRL 98/11, Compaq Cambridge Research Lab, December 1998.
13. Juang, B. H. and Rabiner, L. R. The Segmental k-means Algorithm for Estimating the Pa-
rameters of Hidden Markov Models. In IEEE Trans. Acoust., Speech, Signal Processing,
volume ASSP-38, pages 1639–1641. 1990.
14. Lievin, M. and Luthon, F. Nonlinear Colour Space and Spatiotemporal MRF for Hierar-
chical Segmentation of Face Features in Video. IEEE Transactions on Image Processing,
13:63 – 71, 2004.
15. Mitchell, T. M. Machine Learning. McGraw-Hill, 1997.
16. Nelder, J. and Mead, R. A Simplex Method for Function Minimization. Computer Jour-
nal, 7(4):308 – 313, 1965.
17. Pantic, M. and Rothkrantz, L. J. M. Expert Systems for Automatic Analysis of Facial
Expressions. In Image and Vision Computing Journal, volume 18, pages 881–905. 2000.

18. Papageorgiou, C., Oren, M., and Poggio, T. A General Framework for Object Detection. In Proceedings of the International Conference on Computer Vision, pages 555 – 562. 1998.
19. Rabiner, L. and Juang, B.-H. An Introduction to Hidden Markov Models. IEEE ASSP
Magazine, 3(1):4–16, 1986.
20. Rabiner, L. R. A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition. Readings in Speech Recognition, 1990.
21. Schukat-Talamazzini, E. Automatische Spracherkennung. Vieweg Verlag, Braun-
schweig/Wiesbaden, 1995.
22. Schürmann, J. Pattern Classification. John Wiley & Sons, 1996.
23. Sherrah, J. and Gong, S. Tracking Discontinuous Motion Using Bayesian Inference. In
Proceedings of the European Conference on Computer Vision, pages 150–166. Dublin,
Ireland, 2000.
24. Sonka, M., Hlavac, V., and Boyle, R. Image Processing, Analysis and Machine Vision.
Brooks Cole, 1998.
25. Steger, C. On the Calculation of Arbitrary Moments of Polygons. Technical Report
FGBV–96–05, Forschungsgruppe Bildverstehen (FG BV), Informatik IX, Technische
Universität München, October 1996.
26. Tomasi, C. and Kanade, T. Detection and Tracking of Point Features. Technical Report
CS-91-132, CMU, 1991.
27. Wiskott, L., Fellous, J., Krüger, N., and von der Malsburg, C. Face Recognition by Elas-
tic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 19(7):775–779, 1997.
28. Zieren, J. and Kraiss, K.-F. Robust Person-Independent Visual Sign Language Recognition. In Proceedings of the 2nd Iberian Conference on Pattern Recognition and Image Analysis IbPRIA 2005, Lecture Notes in Computer Science. 2005.
Chapter 3

Sign Language Recognition


Jörg Zieren1 , Ulrich Canzler2 , Britta Bauer3 , Karl-Friedrich Kraiss
Sign languages are naturally developed and fully fledged languages used by
the Deaf and Hard-of-Hearing for everyday communication among themselves.
Information is conveyed visually, using a combination of manual and nonmanual
means of expression (hand shape, hand posture, hand location, hand motion; head
and body posture, gaze, mimic). The latter encode e.g. adjectives and adverbials, or
provide specialization of general items.
Some signs can be distinguished by manual parameters alone, while others re-
main ambiguous unless additional nonmanual information is made available. Unlike
pantomime, sign language does not include its environment. Signing takes place in a
3D space close to the trunk and the head, called signing space.
The grammar of sign language is fundamentally different from spoken language.
The structure of a sentence in spoken language is linear, one word followed by an-
other, whereas in sign language, a simultaneous structure exists with a parallel tem-
poral and spatial configuration. The configuration of a sign language sentence carries
rich information about time, location, person, or predicate. Thus, even though the du-
ration of a sign is approximately twice as long as the duration of a spoken word, the
duration of a signed sentence is about the same.
Unfortunately, very few hearing people speak sign language. The use of inter-
preters is often prohibited by limited availability and high cost. This leads to prob-
lems in the integration of deaf people into society and conflicts with an independent
and self-determined lifestyle. For example, most deaf people are unable to use the
World Wide Web and communicate by SMS and email in the way hearing people
do, since they commonly do not surpass the level of literacy of a 10-year-old. The
reason for this is that hearing people learn and perceive written language as a visual
representation of spoken language. For deaf people, however, this correspondence
does not exist, and letters – which encode phonemes – are just symbols without any
meaning.
In order to improve communication between the Deaf and the Hearing, research in automatic sign language recognition is needed. This work shall provide the technical requirements for translation systems and user interfaces that support the integration of deaf people into the hearing society. The aim is the development of a mobile system, consisting of a laptop equipped with a webcam, that visually reads a signer’s gestures and facial expressions and performs a translation into spoken language. This device is
intended as an interpreter in everyday life (e.g. at the post office or in a bank). Fur-

1 Sect. 3.1
2 Sect. 3.2
3 Sect. 3.3

thermore, it allows deaf people intuitive access to electronic media such as computers
or the internet (Fig. 3.1).

Fig. 3.1. Laptop and webcam based sign language recognition system.

State of the Art in Automatic Sign Language Recognition

Since a two-dimensional video signal is significantly more complex than a one-dimensional audio signal, the current state of sign language recognition is roughly 30 years behind speech recognition. In addition, sign language is far from fully explored. Little is known about syntax and semantics, and no dictionary exists.
The lack of national media – such as radio, TV, and telephone for the hearing – leads
to strong regional variation. For a large number of signs, there is not even a common
definition.
Sign language recognition became the subject of scientific publications only in the early 1990s. Most presented systems operate near real-time and require 1 to 10 seconds of processing time after completion of the sign. For video-based approaches, details on camera hardware and resolution are rarely published, suggesting that professional equipment, high resolution, low noise, and optimal camera placement were used.
The method of data acquisition largely determines a user interface’s quality and constitutes the primary criterion for categorizing different works. The most reliable, exact, and at the same time simplest techniques are intrusive: Data gloves measure the

flexion of the finger joints, optical or magnetic markers placed on face and hands fa-
cilitate a straightforward determination of mimic and manual configuration. For the
user, however, this is unnatural and restrictive. Furthermore, data gloves are unsuit-
able for practical applications due to high cost. Also, most existing systems exploit manual features only; facial features have rarely been used so far [17]. A recognition system’s usability is greatly influenced by the robustness of its image processing stage, i.e. its ability to handle inhomogeneous, dynamic, or generally uncontrolled backgrounds and suboptimal illumination. Many publications do not explicitly address this issue, which – in connection with the accompanying illustrations – suggests homogeneous backgrounds and strong diffuse lighting. Another common assumption is that
the signer wears long-sleeved clothing that differs in color from his skin, allowing
color-based detection of hands and face.
The majority of systems only support person-dependent operation, i.e. every user
is required to train the system himself before being able to use it. Person-independent
operation requires a suitable normalization of features early in the processing chain
to eliminate dependencies of the features on the signer’s position in the image, his
distance from the camera, and the camera’s resolution. This is rarely described in
publications; instead, areas and distances are measured in pixels, which even for
person-dependent operation would require an exact reproduction of the conditions
under which the training material was recorded.
Similar to the early days of speech recognition, most researchers focus on iso-
lated signs. While several systems exist that process continuous signing, their vo-
cabulary is very small. Recognition rates of 90% and higher are reported, but the exploitation of context and grammar – which is sometimes rigidly fixed to a certain sentence structure – aids considerably in classification. As in speech recognition,
coarticulation effects and the resulting ambiguities form the primary problem when
using large vocabularies.
Tab. 3.1 lists several important publications and the described systems’ features.
When comparing the indicated performances, it must be kept in mind that, in con-
trast to speech recognition, there is no standardized benchmark for sign language
recognition. Thus, recognition rates cannot be compared directly. The compilation
also shows that most systems’ vocabularies are in the range of 50 signs.
Larger vocabularies have only been realized with the use of data gloves. All sys-
tems are person-dependent. All recognition rates are valid only for the actual test
scenario. Information about robustness in real-life settings is not available for any
of the systems. Furthermore, the exact composition of the vocabularies is unknown,
despite its significant influence on the difficulty of the recognition task. In summary
it can be stated that none of the systems currently found in literature meets the re-
quirements for a robust real world application.

Visual Non-Intrusive Sign Language Recognition

This chapter describes the implementation of a video-based non-intrusive sign language classifier for real world applications. Fig. 3.2 shows a schematic of the concept
that can be divided into a feature extraction stage and a classification stage.

Table 3.1. Classifier characteristics for user-dependent sign language recognition.

Author, Year          Features           Interface        Vocabulary  Language Level  Recognition Rate in %
Vamplew 1996 [24]     manual             data glove        52         word            94
Holden 2001 [10]      manual             optical markers   22         word            95.5
Yang 2002 [28]        manual             video             40         word            98.1
Murakami 1991 [15]    manual             data glove        10         sentence        96
Liang 1997 [14]       manual             data glove       250         sentence        89.4
Fang 2002 [8]         manual             data glove       203         sentence        92.1
Starner 1998 [20]     manual             video             40         sentence        97.8
Vogler 1999 [25]      manual             video             22         sentence        91.8
Parashar 2003 [17]    manual and facial  video             39         sentence        92

Fig. 3.2. Schematic of the sign language recognition system.

The input image sequence is forwarded to two parallel processing chains that
extract manual and facial features using a priori knowledge of the signing process.
Before the final sign classification is performed, a pre-classification module restricts
the active vocabulary to reduce processing time. Manual and facial features are then
classified separately, and both results are merged to yield a single recognition result.
The first section of this chapter deals with the person-independent recognition of isolated signs in real-world scenarios. The second section explains how facial features can be integrated into the sign language recognition process. Both parts build on related basic concepts presented already in chapter 2.
The third section describes how signs can be divided automatically into subunits by a data-driven procedure and how these subunits enable the coding of large vocabularies and accelerate training. Finally some performance measures for the described system are provided.

3.1 Recognition of Isolated Signs in Real-World Scenarios


Sign language recognition constitutes a challenging field of research in computer
vision. Compared to gesture recognition in controlled environments as described in
Sect. 2.1, recognition of sign language in real-world scenarios places significantly
higher demands on feature extraction and processing algorithms.
First, sign language itself provides a framework that restricts the developer’s
freedom in the choice of vocabulary. In contrast to an arbitrary selection of ges-
tures, common computer vision problems like overlap of both hands and the
face (Fig. 3.3a4 ), ambiguities (Fig. 3.3b), and similarities between different signs
(Fig. 3.3c, d) occur frequently and cannot be avoided.

(a) “digital” (b) “conference”

(c) “food” (d) “date”

Fig. 3.3. Difficulties in sign language recognition: Overlap of hands and face (a), tracking of
both hands in ambiguous situations (b), and similar signs with different meaning (c, d).

Just like in spoken language, there are dialects in sign language. In contrast to
the pronunciation of words, however, there is no standard for signs, and people may
use an altogether different sign for the same word (Fig. 3.4). Even when performing
identical signs, the variations between different signers are considerable (Fig. 3.5).
Finally, when using color and/or motion to detect the user’s hands and face, un-
controlled backgrounds as shown in Fig. 3.6 may give rise to numerous false alarms
4 The video material shown in this section was kindly provided by the British Deaf Association (http://www.signcommunity.org.uk). All signs are British Sign Language (BSL).


Fig. 3.4. Two different signs for “actor”: Circular motion of hands in front of upper body (a)
vs. dominant hand (here: right) atop non-dominant hand (b).

(a) “autumn” (b) “recruitment” (c) “tennis” (d) “distance” (e) “takeover”

Fig. 3.5. Traces of (xcog , ycog ) for both hands from different persons (white vs. black plots)
performing the same sign. The background shows one of the signers for reference.

that require high-level reasoning to tell targets from distractors. This process is com-
putationally expensive and error-prone, because the required knowledge and experience that comes naturally to humans is difficult to encode in machine-readable form.

Fig. 3.6. Examples for uncontrolled backgrounds that may interfere with the tracking of hands
and face.

A multitude of algorithms has been published that aim to solve the above problems. This section describes several concepts that were successfully combined in a working sign language recognition system at RWTH Aachen University [29].
A comprehensive overview of recent approaches can be found in [6, 16].
This section focuses on the recognition of isolated signs based on manual pa-
rameters. The combination of manual and non-manual parameters such as facial
expression and mouth shape, which constitute an integral part of sign language, is
discussed in Sect. 3.2. Methods for the recognition of sign language sentences are
described in Sect. 3.3.
The following text builds upon Sect. 2.1. It follows a similar structure since
it is based on the same processing chain. However, the feature extraction stage is
extended as shown in Fig. 3.7. An image preprocessing stage is added that ap-
plies low-level algorithms for image enhancement, such as background modeling
described in Sect. 3.1.2.1. High-level knowledge is applied for the resolution of over-
laps (Sect. 3.1.3.1) and to support hand localization and tracking (Sect. 3.1.3.2).
Compared to Sect. 2.1, the presentation is less concerned with concrete algo-
rithms. Due to the complexity and diversity of the topic, the focus lies on general
considerations that should apply to most, though certainly not all, applications.

Fig. 3.7. Feature extraction extended by an image preprocessing stage. High-level knowledge
of the signing process is used for overlap resolution and hand localization/tracking.

3.1.1 Image Acquisition and Input Data

With regard to the input data, switching from gestures to sign language input brings
about the following additional challenges:

1. While most gestures are one-handed, signs may be one- or two-handed. The
system must therefore not only be able to handle two moving objects, but also
to detect whether the sign is one- or two-handed, i.e. whether one of the hands
is idle.
2. Since the hands’ position relative to the face carries a significant amount of in-
formation in sign language, the face must be included in the image as a reference
point. This reduces image resolution available for the hands and poses an addi-
tional localization task.
3. Hands may occlude one another and/or the face. For occluded objects, the extrac-
tion of features through a threshold segmentation is no longer possible, resulting
in a loss of information.
4. Signs may be extremely similar (or even identical) in their manual features and
differ mainly (or exclusively) in non-manual features. This makes automatic
recognition based on manual features difficult or, for manually identical signs,
even unfeasible.
5. Due to dialects and lack of standardization, it may not be possible to map a
given word to a single sign. As a result, either the vocabulary has to be extended
to include multiple variations of the affected signs, or the user needs to know
the signs in the vocabulary. The latter is problematic because native signers may
find it unnatural to be forced to use a certain sign where they would normally
use a different one.5
The last two problems cannot be solved by the approach described here. The following text presents solution strategies for the first three.

3.1.2 Image Preprocessing

The preprocessing stage improves the quality of the input data to increase the per-
formance (in terms of processing speed and/or accuracy) of the subsequent stages.
High-level information computed in those stages may be exploited, but this bears
the usual risks of a feedback loop, such as instability or the reinforcement of errors.
Low-level pixel-wise algorithms, on the other hand, do not have this problem and can
often be used in a wide range of applications. A prominent example is the modeling
of the image background.
3.1.2.1 Background Modeling
The detection of background (i.e. static) areas in dynamic scenes is an important step
in many pattern recognition systems. Background subtraction, the exclusion of these
areas from further processing, can significantly reduce clutter in the input images
and decrease the amount of data to be processed in subsequent stages.
If a view of the background without any foreground objects is available, a sta-
tistical model can be created in a calibration phase. This approach is described in
Sect. 5.3.1.1. In the following, however, we assume that the only available input data
is the signing user, and that the signing process takes about 1–10 seconds. Thus the
5 This problem also exists in speech recognition, e.g. for the words “zero” and “oh”.

background model is to be created directly from the corresponding 25–250 image frames (assuming 25 fps) containing both background and foreground, without prior calibration.

Median Background Model

A simple yet effective method to create a background model in the form of an RGB image Ibg (x, y) is to compute, for every pixel (x, y), the median color over all frames:

Ibg (x, y) = median{I(x, y, t)|1 ≤ t ≤ T } (3.1)


where the median of a set of vectors V = {v1 , v2 , . . . , vn } is the element for which
the sum of Euclidean distances to all other elements is minimal:

median V = argmin_{v∈V} Σ_{i=1}^{n} |v_i − v|    (3.2)
For the one-dimensional case this is equivalent to the 50th percentile, which is
significantly faster to compute, e.g. by simply sorting the elements of V . There-
fore, (3.2) is in practice often approximated by the channel-wise median
Ibg (x, y) = ( median{r(x, y, t)|1 ≤ t ≤ T},  median{g(x, y, t)|1 ≤ t ≤ T},  median{b(x, y, t)|1 ≤ t ≤ T} )^T    (3.3)
The median background model has the convenient property of not requiring any
parameters and is consequently very robust. Its only requirement is that the im-
age background must visible at the considered pixel in more than 50% of the input
frames, which is a reasonable assumption in most scenarios. A slight drawback might
be that all input frames need to be buffered in order to compute (3.1).
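As an illustration, the channel-wise median of (3.3) can be computed directly with NumPy. The following is only a minimal sketch, assuming the buffered input sequence is available as a single array; the function and variable names are illustrative and not part of the system described here.

import numpy as np

def median_background(frames: np.ndarray) -> np.ndarray:
    """Channel-wise median background model, cf. (3.3).

    frames: uint8 array of shape (T, H, W, 3) holding the buffered
    RGB input sequence. Returns an (H, W, 3) background image.
    """
    # The median is taken per pixel and per color channel over the
    # temporal axis; all T frames must be buffered beforehand.
    return np.median(frames, axis=0).astype(np.uint8)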
To apply the background model, i.e. classify a given pixel as background or fore-
ground, we need to define a suitable metric Δ to quantify the difference between a
background color vector (r_bg g_bg b_bg)^T and a given color vector (r g b)^T:

Δ((r g b)^T, (r_bg g_bg b_bg)^T) ≥ 0    (3.4)
Computing Δ for every pixel in an input image I(x, y, t) and comparison with a
motion sensitivity threshold Θmotion yields a foreground mask Ifg,mask :

Ifg,mask (x, y, t) = { 1  if Δ(I(x, y, t), Ibg (x, y)) ≥ Θmotion
                     { 0  otherwise                                    (3.5)
Θmotion is chosen just large enough so that the camera noise is not classified as
foreground.
A straightforward implementation of Δ is the Euclidean distance:

Δ((r g b)^T, (r_bg g_bg b_bg)^T) = sqrt((r − r_bg)² + (g − g_bg)² + (b − b_bg)²)    (3.6)

However, (3.6) does not distinguish between differences in brightness and dif-
ferences in hue. Separating the two is desirable since it permits ignoring shadows (which mainly affect brightness) cast on the background by moving (foreground) objects. Shadows can be modeled as a multiplication of Ibg with a scalar. We define

Δ = I − Ibg    (3.7)
ϕ = ∠(I, Ibg)    (3.8)

This allows splitting the color difference vector Δ into the orthogonal components Δbrightness and Δhue and weighting the contribution of shadows in the total difference with a factor w ∈ [0, 1]:

Δbrightness = |Δ| cos ϕ    (3.9)
Δhue = |Δ| sin ϕ    (3.10)
Δ = sqrt((w Δbrightness)² + Δhue²)    (3.11)

A value of w = 1 disables shadow detection and is equivalent to (3.6), while w = 0 yields maximum shadow tolerance. No distinction is made between shadows
that appear on the background and shadows that vanish from it. Obviously, the choice
of w is a tradeoff between errors (foreground classified as background) and false
alarms (background classified as foreground) and thus depends on which is handled
better in the subsequent processing stages. Similar approaches to shadow detection
are described e.g. in [13, 18].
Thus, while the computation of the median background model is itself parameter-
free, its application involves the two parameters Θmotion and w.
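A possible vectorized implementation of (3.5) with the shadow-tolerant distance (3.9)–(3.11) is sketched below. The values of the two parameters Θmotion and w are placeholders only and would have to be tuned to the camera noise and scene at hand.

import numpy as np

def foreground_mask(frame, background, theta_motion=20.0, w=0.5):
    """Apply the background model as in (3.5), using the
    shadow-tolerant distance of (3.9)-(3.11)."""
    I = frame.astype(np.float32)
    B = background.astype(np.float32)
    diff = I - B                                     # Delta, (3.7)
    # Brightness component: projection of Delta onto the background
    # color direction, i.e. |Delta| * cos(phi) from (3.9).
    b_norm = np.linalg.norm(B, axis=-1) + 1e-6
    d_brightness = np.sum(diff * B, axis=-1) / b_norm
    d_total_sq = np.sum(diff ** 2, axis=-1)
    d_hue_sq = np.maximum(d_total_sq - d_brightness ** 2, 0.0)  # |Delta|^2 sin^2(phi)
    delta = np.sqrt((w * d_brightness) ** 2 + d_hue_sq)         # (3.11)
    return (delta >= theta_motion).astype(np.uint8)             # (3.5)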
Fig. 3.8 shows four example images from an input video of 54 frames in total.
Each of the 54 frames contains at least one person in the background. The resulting
background model, as well as an exemplary application to a single frame, is visual-
ized in Fig. 3.9.

Gaussian Mixture Models

Most background models, including the one presented above, assume a static back-
ground. However, backgrounds may also be dynamic, especially in outdoor scenar-
ios. For example, a tree with green leaves that move in front of a white wall results
in each background pixel alternating between green and white.
Approaches that model dynamic backgrounds are described in [13, 18, 21]. In-
stead of computing a single background color as in (3.1), background and foreground
are modeled together for each pixel as a mixture of Gaussian distributions, usually in
RGB color space. The parameters of these distributions are updated online with every
new frame. Simple rules are used to classify each individual Gaussian as background
or foreground. Handling of shadows can be incorporated as described above.

Fig. 3.8. Four example frames from the sign “peel”. The input video consists of 54 frames, all
featuring at least one person in the background.

Compared to simpler methods this is computationally more demanding and requires more input frames to compute an accurate model, because the number of pa-
rameters to be estimated is higher. Also, several parameters have to be specified
manually to control creation and updates of the individual Gaussians and their back-
ground/foreground classification. Performance may degrade if these parameters are
inappropriate for the input video.
If a sufficient amount of input data is available, the resulting model performs well
under a wide variety of conditions. For single signs of approximately two seconds
duration this is usually a critical factor, while one or more sign language sentences
may provide enough data. For the recognition of isolated signs one might therefore consider never stopping image capture and continuously updating the distribution parameters.
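For experimentation, adaptive Gaussian mixture background subtraction of this kind is readily available, e.g. in OpenCV. The sketch below is only one possible setup, not the implementation used in the system described here, and the parameter values shown are library defaults rather than tuned values.

import cv2

# Adaptive per-pixel mixture-of-Gaussians background model with
# built-in shadow handling; parameters must be tuned to the input video.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

def update_and_classify(frame):
    """Update the model with one frame and return its foreground mask
    (255 = foreground, 127 = shadow, 0 = background)."""
    return subtractor.apply(frame)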

3.1.3 Feature Extraction

Systems that operate in real-world conditions require sophisticated feature extraction approaches. Rather than localize a single hand, both hands and the face (which
is needed as a reference point) have to be segmented. Since a sign may be one- or
two-handed, it is impossible to know in advance whether the non-dominant hand will
remain idle or move together with the dominant hand. Also, the resolution available
for each target object is significantly smaller than in most gesture recognition appli-

(a) Background Model Ibg (b) Example Frame

(c) Foreground Mask Ifg,mask

Fig. 3.9. Background model computed for the input shown in Fig. 3.8 (a), its application to an
example frame (b), and the resulting foreground mask (c).

cations, where the hand typically fills the complete image. Background subtraction
alone cannot be expected to isolate foreground objects without errors or false alarms.
Also, the face – which is mostly static – has to be localized before background sub-
traction can be applied.
Since the hands’ extremely variable appearance prevents the use of shape or tex-
ture cues, it is common to exploit only color for hand localization. This leads to the
restriction that the signer must wear long-sleeved, non-skin-colored clothing to facil-
itate the separation of the hand from the arm by means of color. By using a generic
color model such as the one presented in [12], user and illumination independence
can be achieved at the cost of a high number of false alarms. This method therefore
requires high-level reasoning algorithms to handle these ambiguities. Alternatively,
one may choose to either explicitly or automatically calibrate the system to the cur-
rent user and illumination. This chapter describes the former approach.
3.1.3.1 Overlap Resolution
In numerous signs both hands overlap with each other and/or with the face. When
two or more objects overlap in the image, the skin color segmentation yields only a
single blob for all of them, rendering a direct extraction of meaningful features im-
possible. Low contrast, low resolution, and the hands’ variable appearance usually

do not allow a separation of the overlapping objects by an edge-based segmentation either. Most of the geometric features available for unoverlapped objects can there-
fore not be computed for overlapping objects and have to be interpolated. However,
a hand’s appearance is sufficiently constant over several frames for template match-
ing [19] to be applied. Using the last unoverlapped view of each overlapping object
as a template, at least position features – which carry much information – can be
reliably computed during overlap. The accuracy of this method decreases with in-
creasing template age. Fortunately, the same technique can be used twice for a single
period of overlap, the second time starting with the first unoverlapped view after
the cessation of the overlap and proceeding temporally backwards. This effectively
halves the maximum template age and increases precision considerably.
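The position estimation during overlap can be sketched with normalized cross-correlation template matching, e.g. using OpenCV. The function and variable names below are illustrative; the forward pass starts from the last unoverlapped view of the hand.

import cv2

def track_during_overlap(frames, template):
    """Estimate hand center positions during an overlap period by
    matching the last unoverlapped view of the hand in each frame."""
    h, w = template.shape[:2]
    centers = []
    for frame in frames:
        # Normalized cross-correlation; the maximum marks the best match.
        response = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
        _, _, _, max_loc = cv2.minMaxLoc(response)
        centers.append((max_loc[0] + w // 2, max_loc[1] + h // 2))
    return centers

Applying the same routine to the reversed frame sequence, with the first unoverlapped view after the overlap as template, yields the backward pass described above.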
3.1.3.2 Hand Tracking
Performing a simple hand localization in every frame, as described in Sect. 2.1.2.1,
is not sufficiently reliable in complex scenarios. The relation between temporally
adjacent frames has to be exploited in order to increase performance. Localization
is thus replaced by tracking, using information from previous frames as a basis for
finding the hand in the current frame.
A common problem in tracking is the handling of ambiguous situations where
more than one interpretation is plausible. For instance, Fig. 3.10a shows the skin
color segmentation of a typical scene (Fig. 3.9b). This observation does not allow
a direct conclusion as to the actual hand configuration. Instead, there are multiple
interpretations, or hypotheses, as visualized in Fig. 3.10 (b, c, d).
Simple approaches weigh all hypotheses against each other and choose the most likely one in every frame on the basis of information gathered in preceding
frames, such as, for instance, a position prediction. All other hypotheses are dis-
carded.
This concept is error-prone because ambiguous situations that cannot reliably be
interpreted occur frequently in sign language. Robustness can be increased signifi-
cantly by evaluating not only preceding, but also subsequent frames, before any hy-
pothesis is discarded. It is therefore desirable to delay the final decision on the hands’
position in each frame until all available data has been analyzed and the maximum
amount of information is available.
These considerations lead to a multiple hypotheses tracking (MHT) approach
[29]. In a first pass of the input data all conceivable hypotheses are created for every
frame. Transitions are possible from each hypothesis at time t to all hypotheses at
time t + 1, resulting in a state space as shown exemplarily in Fig. 3.11. The total number of paths through this state space equals ∏_t H(t), where H(t) denotes the
number of hypotheses at time t. Provided that the skin color segmentation detected
both hands and the face in every frame, one of these paths represents the correct
tracking result. In order to find this path (or one as close as possible to it), proba-
bilities are computed that indicate the likeliness of each hypothesized configuration,
pstate , and the likeliness of each transition, ptransition (see Fig. 3.11).



Fig. 3.10. Skin color segmentation of Fig. 3.9b (a) and a subset of the corresponding hypothe-
ses (b, c, d; correct: d).

The computation of pstate and ptransition is based on high-level knowledge, encoded as a set of rules or learned in a training phase. The following aspects, possibly
extended by application-dependent items, provide a general basis:
• The physical configuration of the signer’s body can be deduced from position
and size of face and hands. Configurations that are anatomically unlikely or do
not occur in sign language reduce pstate .
• The three phases of a sign (preparation, stroke, retraction), in connection with
the signer’s handedness (which has to be known in order to correctly interpret the
resulting feature vector), should also be reflected in the computation of pstate .
• Even in fast motion, the hand’s shape changes little between successive frames at
25 fps. As long as no overlap occurs, the shape at time t can therefore serve as an
estimate for time t + 1. With increasing deviation of the actual from the expected
shape, ptransition is reduced. Abrupt shape changes due to overlap require special
handling.
• Similarly, hand position changes little from frame to frame (at 25 fps) so that
coordinates at time t may serve as a prediction for time t + 1. Kalman filters [27]
may increase prediction accuracy by extrapolating on the basis of all past mea-
surements.

Fig. 3.11. Hypothesis space and probabilities for states and transitions.

• Keeping track of the hand’s mean or median color can prevent confusion of the
hand with nearby distractors of similar size but different color. This criterion
affects ptransition .
To search the hypothesis space, the Viterbi algorithm (see Sect. 2.1.3.6) is applied
in conjunction with pruning of unlikely paths at each step.
The MHT approach ensures that all available information is evaluated before the
final tracking result is determined. The tracking stage can thus exploit, at time t,
information that becomes available only at time t1 > t. Errors are corrected retro-
spectively as soon as they become apparent.
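The search over the hypothesis space of Fig. 3.11 can be sketched as a Viterbi-style dynamic program over per-frame hypothesis lists. The scoring functions p_state and p_transition are application-specific and only assumed here; pruning of unlikely paths is omitted for brevity.

import math

def best_hypothesis_path(hypotheses, p_state, p_transition):
    """Viterbi search over the hypothesis space of Fig. 3.11.

    hypotheses[t]           -- list of configuration hypotheses at frame t
    p_state(h)              -- likeliness of a configuration (> 0)
    p_transition(h_prev, h) -- likeliness of a transition (> 0)
    """
    log = lambda p: math.log(p + 1e-300)     # guard against log(0)
    scores = [log(p_state(h)) for h in hypotheses[0]]
    backptr = [[None] * len(hypotheses[0])]
    for t in range(1, len(hypotheses)):
        new_scores, new_back = [], []
        for h in hypotheses[t]:
            cand = [scores[i] + log(p_transition(hp, h))
                    for i, hp in enumerate(hypotheses[t - 1])]
            best = max(range(len(cand)), key=cand.__getitem__)
            new_scores.append(cand[best] + log(p_state(h)))
            new_back.append(best)
        scores, backptr = new_scores, backptr + [new_back]
    # Backtrack the best path; only now is the final decision made.
    idx = max(range(len(scores)), key=scores.__getitem__)
    path = [idx]
    for t in range(len(hypotheses) - 1, 0, -1):
        idx = backptr[t][idx]
        path.append(idx)
    return [hypotheses[t][i] for t, i in enumerate(reversed(path))]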

3.1.4 Feature Normalization

After successful extraction, the features need to be normalized. Assuming that reso-
lution dependencies have been eliminated as described in Sect. 2.1.2.3, the area a still
depends on the signer’s distance to the camera. Hand coordinates (xcog , ycog ) addi-
tionally depend on the signer’s position in the image. Using the face position and
size as a reference for normalization can eliminate both dependencies. If a simple
threshold segmentation is used for the face, the neck is usually included. Different
necklines may therefore affect the detected face area and position (COG). In this case
one may use the topmost point of the face and its width instead to avoid this prob-
lem. If the shoulder positions can be detected or estimated, they may also provide a
suitable reference.
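A normalization using the topmost face point and the face width as reference could look as follows; this is only a sketch following the discussion above, and all names are illustrative.

def normalize_hand_features(x_cog, y_cog, area, face_top_x, face_top_y, face_width):
    """Normalize hand position and area using the face as reference,
    removing dependence on image position and camera distance."""
    x_norm = (x_cog - face_top_x) / face_width
    y_norm = (y_cog - face_top_y) / face_width
    area_norm = area / (face_width ** 2)   # areas scale quadratically
    return x_norm, y_norm, area_norm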

3.1.5 Feature Classification


The classification of the extracted feature vector sequence is usually performed using
HMMs as discussed in Sect. 2.1.3.6. However, due to the nature of sign language,
the following additional processing steps are advisable.
Firstly, leading and/or trailing idle frames in the input data can be detected by a
simple rule (e.g. (2.45)). This is done for each hand individually. Cropping frames
in which both hands are idle speeds up classification and prevents the classifier from
processing input data that carries no information. Obviously, this process must be
applied in the training phase as well.
If the sign is single-handed, i.e. one hand remains idle in all frames, all classes in
the vocabulary that represent two-handed signs can be disabled (or vice versa). This
further reduces computational cost in the classification stage.
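A possible cropping rule is sketched below; the per-frame motion measure and the threshold are assumptions standing in for the simple rule referenced above (e.g. (2.45)).

import numpy as np

def crop_idle_frames(features, motion, threshold):
    """Remove leading and trailing frames in which both hands are idle.

    features: (T, D) feature vector sequence
    motion:   (T,) per-frame motion measure for both hands (assumed)
    """
    active = np.flatnonzero(np.asarray(motion) >= threshold)
    if active.size == 0:
        return features                    # nothing moves; keep input unchanged
    return features[active[0]:active[-1] + 1]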

3.1.6 Test and Training Samples


The accompanying CD contains a database of test and training samples that consists
of 80 signs in British Sign Language (BSL) performed five times each, all by the
same signer.6
The signs were recorded under laboratory conditions, with a unicolored blue
background that allows fading in arbitrary other backgrounds using the common
bluescreen method. Image resolution is 384 × 288, the frame rate is 25 fps, and the
average sign duration is 73.4 frames.

Fig. 3.12. Example frame from the test and training samples provided on CD.

As shown in Fig. 3.12 the signer is standing. The hands are always visible, and
their start/end positions are constant and identical, which simplifies tracking.
The directory slrdatabase has five subdirectories named var00 to var04,
each of which contains one production of all 80 signs. Each sign production is stored
in the form of individual JPG files in a directory named accordingly.
6 We sincerely thank the British Deaf Association (http://www.signcommunity.org.uk) for these recordings.

This database can be used to test tracking and feature extraction algorithms, and
to train and test a sign language classifier based e.g. on Hidden Markov Models.
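The database can be loaded, for instance, as follows. Only the directory layout (slrdatabase/var00 … var04/<sign>/) is taken from the description above; the frame file naming inside each sign directory is an assumption.

import glob
import os
import cv2

def load_sign(database_root, variation, sign_name):
    """Load one production of a sign as a list of BGR frames.

    Example (hypothetical paths): load_sign("/media/cdrom", "var00", "autumn")
    """
    sign_dir = os.path.join(database_root, "slrdatabase", variation, sign_name)
    frame_paths = sorted(glob.glob(os.path.join(sign_dir, "*.jpg")))
    return [cv2.imread(path) for path in frame_paths]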

3.2 Sign Recognition Using Nonmanual Features


Since sign languages are multimodal languages, several channels for transferring
information are used at the same time. One basically differentiates between the manual/gestural channels and the nonmanual/facial channels and their respective parameters [4]. In the following, first the nonmanual parameters are described in more detail, and afterwards the extraction of facial features for sign language recognition is presented.

3.2.1 Nonmanual Parameters


Nonmanual parameters are indispensable in sign language. They code adjectives
and contribute to grammar. In particular some signs are identical with respect to
gesturing and can only be differentiated by making reference to nonmanual parame-
ters [5]. This is, e.g., the case for the signs ”not” and ”to” in German Sign Language,
which can only be distinguished by making reference to head motion (Fig. 3.13).
Similarly in British Sign Language (BSL) the signs ”now” and ”offer” need lip out-
line for disambiguation (Fig. 3.14). In the following the most important nonmanual
parameters will be described in more detail.

Fig. 3.13. In German Sign Language the signs ”not”(A) and ”to”(B) are identical with respect
to manual gesturing but vary in head movement.

Upper Body Posture


The torso generally serves as reference of the signing space. Spatial distances and
textual aspects can be communicated by the posture of the torso. The signs ”reject-

Fig. 3.14. In British Sign Language the signs ”now” (A) and ”offer” (B) are identical with
respect to manual gesturing but vary in lip outline.

ing” or ”enticing”, e.g., show a slight inclination of the torso towards the rear and
in forward direction. Likewise, grammatical aspects such as indirect speech can be coded by torso posture.

Head pose

The head pose also supports the semantics of sign language. Questions, affirmations,
denials, and conditional clauses are, e.g., communicated with the help of head pose.
In addition, information concerning time can be coded. Signs that refer to a short time span are, e.g., characterized by a minimal change of head pose, while signs referring to a long span are performed by turning the head clearly in the direction opposite to the gesture.

Line of sight

Two communicating deaf persons will usually establish a close visual contact. How-
ever, a brief change of line of sight can be used to refer to the spatial meaning of a
gesture. Likewise, line of sight can be used in combination with torso posture to express indirect speech, e.g. when a conversation between two absent persons, who are virtually placed behind the signer, is represented.

Facial expression

Facial expressions essentially serve the transmission of feelings (lexical mimics). In addition, grammatical aspects may be mediated as well. A change of head pose
combined with the lifting of the eye brows corresponds, e.g., to a subjunctive.

Lip outline
Lip outline represents the most pronounced nonmanual characteristic. Often it differs from voicelessly articulated words in that part of a word is shortened. Lip outline resolves ambiguities between signs (brother vs. sister) and specifies expressions (meat vs. hamburger). It also provides information redundant to gesturing to support the differentiation of similar signs.

3.2.2 Extraction of Nonmanual Features


In Fig. 3.15 a general concept for nonmanual feature extraction is depicted. At the beginning of the processing chain, images are acquired from the camera. In the next step the Face Finder (Sect. 2.2.2.1) is applied in order to locate the face region of the signer in the image, which covers the entire signing space. Afterwards this region is cropped and upscaled.

Fig. 3.15. Processing chain of nonmanual feature extraction.

The next step is the fitting of a user-adapted face-graph by utilizing Active Ap-
pearance Models (Sect. 2.2.3). The face graph serves for the extraction of the nonmanual parameters like lip outline, eyes, and brows (Fig. 3.16).
The nonmanual features used (Fig. 3.17) can be divided into three groups. First, the lip outline is described by width, height, and shape features like invariant moments, eccentricity, and orientation (Sect. 2.1.2.3). The second group comprises the distances between eyes and mouth corners, and the third group contains the distances between eyes and eyebrows.
For each image of the sequence the extracted parameters are merged into a feature
vector, which in the next step is used for classification (Fig. 3.18).
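The per-frame feature vector could be assembled as sketched below; the grouping follows Fig. 3.17, while the concrete variable names and dimensions are illustrative assumptions.

import numpy as np

def nonmanual_feature_vector(lip_width, lip_height, lip_shape,
                             eye_mouth_distances, eye_brow_distances):
    """Merge the extracted nonmanual parameters of one frame into a
    single feature vector for classification.

    lip_shape           -- invariant moments, eccentricity, orientation
    eye_mouth_distances -- distances between eyes and mouth corners
    eye_brow_distances  -- distances between eyes and eye brows
    """
    return np.concatenate([
        [lip_width, lip_height],
        np.asarray(lip_shape, dtype=np.float32),
        np.asarray(eye_mouth_distances, dtype=np.float32),
        np.asarray(eye_brow_distances, dtype=np.float32),
    ]).astype(np.float32)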
Overlap Resolution
In case of partial overlap of hands and face, an accurate fitting of the Active Appearance Models is usually no longer possible. Furthermore, it is problematic that

Fig. 3.16. Processing scheme of the face region cropping and the matching of an adaptive
face-graph.

sometimes the face graph appears to be fitted precisely even though not enough face texture is visible. In this case a determination of the features in the affected regions is no longer possible. To compensate for this effect an additional module is involved that evaluates each specific region separately with regard to overlap by the hands.
In Fig. 3.19 two cases are presented. In the first case one eye and the mouth are
hidden, so the feature-vector of the facial parameters is not used for classification. In
the second case the overlap is not critical, so the facial features are used in the classification process.
The hand tracker indicates a crossing of one or both hands with the face, once
the skin colored surfaces touch each other. In these cases it is necessary to decide,
whether the hands affect the shape substantially. Therefore, an ellipse for the hand
is computed by using the manual parameters. In addition an oriented bounding box
is drawn around the lip contour of the active appearance shape. If the hand ellipse
touches the bounding box, the Mahalanobis distance of the shape fitting is decisive.
If this is too large, the shape is marked as invalid. Since the Mahalanobis-distance
of the shapes depends substantially on the trained model, not an absolute value is

Fig. 3.17. Representation of nonmanual parameters suited for sign language recognition.

Fig. 3.18. Scheme of the extraction and classification of nonmanual features.

used here, but a proportional worsening. Experiments showed that a good overlap detection can be achieved when 25% of the face is hidden.
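The decision logic can be summarized roughly as follows. This is only a sketch of the relative-worsening criterion described above; the threshold value and all names are illustrative and not taken from the experiments mentioned.

def facial_features_valid(hand_ellipse_touches_lip_box,
                          shape_mahalanobis, reference_mahalanobis,
                          max_relative_worsening=0.5):
    """Decide whether the fitted face graph may contribute features.

    Returns False (shape marked invalid) if the hands touch the lip
    bounding box and the Mahalanobis distance of the shape fitting has
    worsened too much relative to its unoccluded reference value.
    """
    if not hand_ellipse_touches_lip_box:
        return True                                   # no critical overlap
    worsening = (shape_mahalanobis - reference_mahalanobis) / reference_mahalanobis
    return worsening <= max_relative_worsening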

Fig. 3.19. During the overlap of the hands and face, several regions of the face are evaluated separately. If, e.g., the mouth and one eye could be hidden (left), no features of the face are considered. However, if eyes and mouth are located sufficiently well, the extracted features can be used for the classification (right).

3.3 Recognition of continuous Sign Language using Subunits


Sign language recognition is a rather young research area compared to speech recognition, where research has been conducted for about 50 years. The head start of speech recognition makes a comparison between the two fields worthwhile, especially to draw conclusions on how to model sign language. While phoneme-based speech recognition systems represent today’s state of the art, early speech recognizers dealt with words as model units. A similar development is observed for sign language recognition. Current systems are still based on word models. First steps towards subunit-based recognition systems have been undertaken only recently [2, 26]. The following section outlines a sign language recognition system which is based on automatically generated subunits of signs.

3.3.1 Subunit Models for Signs

It is yet unclear in sign language recognition which part of a sign sentence serves as a good underlying model. The theoretically best solution is a whole sign language sentence as one model: due to their long duration, sentences rarely lead to misclassification. However, the main disadvantage is obvious: there are far too many combinations to train all possible sentences in sign language. Consequently most sign language recognition systems are based on word models, where one sign represents one model in the model database. This makes the recognition of sign language sentences more flexible. Nevertheless, even this method has some drawbacks:
• The training complexity increases with vocabulary size.
• A future enlargement of the vocabulary is problematic as new signs are usually
signed in context of other signs for training (embedded training).

As an alternative approach, signs can also be modeled with subunits of signs, which is similar to modeling speech by means of phonemes. Subunits are segments of signs, which emerge from the subdivision of signs. Figure 3.20 shows an example of the above-mentioned different possibilities for model units in the model database. Subunits should represent a finite quantity of models. Their number should be chosen


Fig. 3.20. The signed sentence TODAY I COOK (HEUTE ICH KOCHEN) in German Sign
Language (GSL). Top: the recorded sign sequence. Center: Vertical position of right hand
during signing. Bottom: Different possibilities to divide the sentence. For a better visualization
the signer wears simple colored cotton gloves.

in such a way that any sign can be composed of subunits. The advantages are:
• The training period terminates when all subunit models are trained. An enlargement of the vocabulary is achieved by composing a new sign through concatenation of suitable subunit models.
• The general vocabulary size can be enlarged.
• Eventually, the recognition will be more robust (depending on the vocabulary
size).
Modification of Training and Classification for Recognition with Subunits
A subunit based recognition system needs an additional knowledge source in which the coding (also called transcription) of each sign into subunits is itemized. This knowledge source is called sign-lexicon and contains all transcriptions of signs and their correspondence with subunits. Both the training and the classification process are based on a sign-lexicon.

Modification for Training

The aim of the training process is the estimation of the subunit model parameters. The example in figure 3.21 shows that sign 1 consists of the subunits (SU) SU4, SU7 and SU3. The corresponding model parameters of these subunits are trained on the recorded data of sign 1 by means of the Viterbi algorithm (see section 2.1.3.6).


Fig. 3.21. Components of the training process for subunit identification.

Modification for Classification

After completion of the training process, a database is filled with all subunit models, which now serve as the basis for the classification process. However, the aim of the classification is not the recognition of subunits but of signs. Hence, the information which sign consists of which subunits is needed again; it is contained in the sign-lexicon. The corresponding modification for classification is depicted in figure 3.22.

3.3.2 Transcription of Sign Language

So far we have assumed that a sign-lexicon is available, i.e. it is already known which
sign is composed of which subunits. This however is not the case. The subdivision of
a sign into suitable subunits still poses difficult problems. In addition, the semantics
of the subunits have yet to be determined. The following section gives an overview
of possible approaches to linguistic subunit formation.
3.3.2.1 Linguistics-orientated Transcription of Sign Language
Subunits for speech recognition are mostly linguistically motivated and are typically
syllables, half-syllables or phonemes. The basis of this breakdown of speech is similar to the notation system of speech: written text with the corresponding orthography is the standard notation system for speech. Nowadays, huge speech lexica exist consisting


Fig. 3.22. Components of the classification process based on subunits.

of the transcription of speech into subunits. These lexica are usually the basis for today’s speech recognizers.
Transferring this concept of linguistic breakdown to sign language recognition,
one is confronted with a variety of options for notation, none of which is unfortunately standardized yet, as is the case for speech. A notation system as widely accepted as written text does not exist for sign language. However, some known notation systems are examined in the following section, especially with respect to their applicability in a recognition system. Corresponding to phonemes, the term cheremes (derived from
the Greek term for ’manual’) is used for subunits in sign language.

Notation System by Stokoe

Stokoe was one of the first to do research in the area of sign language linguistics, in the sixties [22]. He defined three different types of cheremes. The first type describes
the configuration of handshape and is called dez for designator. The second type is
sig for signation and describes the kind of movement of the performed sign. The
third type is the location of the performed sign and is called tab for tabula. Stokoe
developed a lexicon for American Sign Language by means of the above mentioned
types of cheremes. The lexicon consists of nearly 2500 entries, where signs are coded
in altogether 55 different cheremes (12 ’tab’, 19 ’dez’ and 24 different ’sig’). An
example of a sign coded in the Stokoe system is depicted in figure 3.23 [22].
The employed cheremes seem to qualify as subunits for a recognition system; their practical employment in a recognition system, however, turns out to be difficult. Even though Stokoe’s lexicon is still in use today and consists of many entries, not all signs, and in particular no German Sign Language signs, are included in this lexicon. Also, most of Stokoe’s cheremes are performed in parallel, whereas a recognition system expects subunits in sequential order. Furthermore, none of the signation cheremes
(encoding the movement of a performed sign) are necessary for a recognition system,
as movements are modeled by Hidden Markov Models (HMMs). Hence, Stokoe’s

Fig. 3.23. Left: The sign THREE in American Sign Language (ASL). Right: notation of this
sign after Stokoe [23].

lexicon is a very good linguistic breakdown of signs into cheremes; however, without manual alterations it is not usable as a basis for a recognition system.

Notation system by Liddell and Johnson

Another notation system was invented by Liddell and Johnson. They break signs
into cheremes by a so-called Movement-Hold model, which was introduced in 1984 and has been further developed since then. Here, the signs are divided in sequential order into segments. Two different kinds of segments are possible: ’movement’ segments, in which the configuration of a hand is still in motion, and ’hold’ segments, in which no movement takes place, i.e. the configuration of the hands is fixed. Each sign can now be modeled as a sequence of movement and hold segments. In addition, each hold segment consists of articulatory features [3]. These describe the handshape, the position and orientation of the hand, movements of fingers, and rotation and orienta-
tion of the wrist. Figure 3.24 depicts an example of a notation of a sign by means of movement and hold segments.

Fig. 3.24. Notation of the ASL sign FATHER by means of the Movement-Hold model (from
[26]).

While Stokoe’s notation system is based on a mostly parallel breakdown of signs, here a sequence of segments is produced, which is better suited for a recognition system. However, similar to Stokoe’s notation system, there exists no comprehensive lexicon where all signs are encoded. Moreover, the detailed coding of the articulatory features might cause additional problems. The video-based feature extraction of the
recognition system might not be able to reach such a high level of detail. Hence, the

movement-hold notation system is not suitable for employment as a sign-lexicon within a recognition system without manual modifications or even manual transcription of signs.
3.3.2.2 Visually-orientated Transcription of Sign Language
The visual7 approach to a notation (or transcription) system for sign language recognition does not rely on any linguistic knowledge about sign language – unlike the two approaches described in the previous sections. Here, the breakdown of signs into subunits is based on a data-driven process. In a first step each sign of the
vocabulary is divided sequentially into different segments, which have no semantic
meaning. A subsequent process determines similarities between the identified seg-
ments. Similar segments are then pooled and labeled. They are deemed to be one
subunit. This process requires no other knowledge source except the data itself. Each
sign can now be described as a sequence of the contained subunits, which are dis-
tinguished by their labels. This notation is also called fenonic baseform [11]. Figure
3.25 depicts as an example the temporal horizontal progression (right hand) of two
different signs.

(Figure 3.25 plots the hand position over time for two signs; the resulting transcriptions are Sign 1: SU3 SU7 SU1 SU9 SU5 and Sign 2: SU3 SU2 SU4 SU6.)
Fig. 3.25. Example for different transcriptions of two signs.

7 In speech recognition the corresponding term is acoustic subunits; for sign language recognition the name is adapted.

The performed signs are similar in the beginning. Consequently, both signs are assigned to the same subunit (SU3). However, the further progression differs significantly: while the trajectory of sign 2 goes upwards, that of sign 1 goes down. Hence, the further transcription of both signs differs.
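A minimal data-driven transcription can be sketched as below: each sign is cut into short segments, all segments are clustered, and the cluster labels form the fenonic baseforms. Fixed-length segmentation and k-means clustering are simplifying assumptions made only for this sketch.

import numpy as np
from sklearn.cluster import KMeans

def build_sign_lexicon(sign_features, segment_length=5, n_subunits=64):
    """Derive subunit transcriptions (fenonic baseforms) from data.

    sign_features: dict mapping sign name -> (T, D) feature matrix
    Returns a dict mapping sign name -> list of subunit labels.
    """
    segments, owners = [], []
    for name, feats in sign_features.items():
        for start in range(0, len(feats) - segment_length + 1, segment_length):
            # Represent each segment by its mean feature vector.
            segments.append(feats[start:start + segment_length].mean(axis=0))
            owners.append(name)
    labels = KMeans(n_clusters=n_subunits, n_init=10).fit_predict(np.array(segments))
    lexicon = {}
    for name, label in zip(owners, labels):
        lexicon.setdefault(name, []).append(f"SU{label}")
    return lexicon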

3.3.3 Sequential and Parallel Breakdown of Signs

The example in Fig. 3.25 illustrates merely one aspect of the performed sign: the
horizontal progression of the right hand. Regarding sign language recognition and its feature extraction, this aspect would correspond to the y-coordinate of the
right hand’s location. For a complete description of a sign, one feature is however
not sufficient. In fact a recognition system must handle many more features which
are merged in so-called feature groups. The composition of these feature groups must take the linguistic sign language parameters into account, which are location, handshape and hand orientation.
The further details in this section refer to an example separation of a feature vector into a feature group ’pos’, where all features regarding the position (pos) of both hands are grouped. Another group represents all features describing the ’size’ of the visible part of the hands (size), whereas the third group comprises all features regarding distances between all fingers (dist). The latter two groups stand for the sign parameters handshape and hand orientation. Note that these are only examples of how to model a feature vector and its corresponding feature groups. Many other ways of modeling sign language are conceivable, and the number of feature groups may vary. To demonstrate the general approach, this example makes use of the three feature groups ’pos’, ’size’ and ’dist’ mentioned above.

Following the parallel breakdown of a sign, each resulting feature group is seg-
mented in sequential order into subunits. The identification of similar segments is
not carried out on the whole feature vector, but only within each of the three feature
groups. Similar segments finally stand for the subunits of one feature group. Pooled
segments of the feature group ’pos’ e.g., now represent a certain location independent
of any specific handshape and -orientation. The parallel and sequential breakdown
of the signs finally yields three different sign lexica, which are combined to one (see
also figure 3.26). Figure 3.27 shows examples of similar segments of different signs
according to specific feature groups.

3.3.4 Modification of HMMs to Parallel Hidden Markov Models

Conventional HMMs are suited to handle sequential signals (see section 2.1.3.6).
However, after the breakdown into feature groups, we now deal with parallel signals
which can be handled by Parallel Hidden Markov Models (PaHMMs) [9]. Parallel
Hidden Markov Models are conventional HMMs used in parallel. Each of the parallel
combined HMMs is called a channel. Each channel is independent of the others; the state probabilities of one channel do not influence any of the other channels.


Fig. 3.26. Each feature group leads to its own sign-lexicon; these are finally combined into one sign-lexicon.


Fig. 3.27. Different examples for the assignment to all different feature groups.

A corresponding PaHMM is depicted in figure 3.28. The last state, a so-called confluent state, combines the probabilities of the different channels into one probability, valid for the whole sign. The combination of probabilities is determined by the following equation:

P(O|λ) = ∏_{j=1}^{J} P(O_j | λ_j).    (3.12)

O_j stands for the observation sequence of one channel, which is evaluated for the corresponding segment. All probabilities of a sign calculated in parallel finally


Fig. 3.28. Example of a PaHMM with Bakis topology.

result in the following equation:


P(O|λ) = ∏_{j=1}^{J} ∏_{k=1}^{K} P(O_j^k | λ_j^k).    (3.13)

3.3.4.1 Modeling Sign Language by means of PaHMMs


For recognition based on subunit models each of the feature groups is modeled by
one channel of the PaHMMs (see also 3.3.3). The sequential subdivision into sub-
units is then conducted in each feature group separately. Figure 3.29 depicts the
modeling of the GSL sign HOW MUCH (WIEVIEL) with its three feature groups.
The figure shows the word model of the sign, i.e. the sign and all its contained sub-
units in all three feature groups. Note that it is possible and highly probable that the
different feature groups of a sign contain different numbers of subunit models. This
is the case if, as in this example, the position changes during the execution of the
sign, whereas the handshape and -orientation remains the same.


Fig. 3.29. Modeling the GSL sign HOW MUCH (WIEVIEL) by means of PaHMMs.

Figure 3.29 also illustrates the specific topology for a subunit based recognition
system. As the duration of one subunit is quite short, a subunit HMM consists merely

of two states. The connection of several subunits inside one sign, however, follows the Bakis topology (see also equation 2.69 in section 2.1.3.6).

3.3.5 Classification

Determining the most likely sign sequence Ŝ which best fits the feature vector sequence X results in a demanding search process. The recognition decision is carried
out by jointly considering the visual and linguistic knowledge sources. Following
[1], where the most likely sign sequence is approximated by the most likely state
sequence, a dynamic programming search algorithm can be used to compute the
probabilities P (X|S) · P (S). Simultaneously, the optimization over the unknown
sign sequence is applied. Sign language recognition is then solved by matching the
input feature vector sequence to all the sequences of possible state sequences and
finding the most likely sequence of signs using the visual and linguistic knowledge
sources. The different steps are described in more detail in the following subsections.
3.3.5.1 Classification of Single Signs by Means of Subunits and PaHMMs
Before dealing with continuous sign language recognition based on subunit models,
the general approach will be demonstrated by a simplified example of single sign
recognition with subunit models. The extension to continuous sign language recog-
nition is then given in the next section. The general approach is the same in both
cases. The classification example is depicted in figure 3.30. Here, the sign consists of
three subunits for feature group ’size’, four subunits for ’pos’, and finally two for feature group ’dist’. It is important to note that the depicted HMM is neither a random sequence of subunits in each feature group nor a random parallel combination of subunits. The combination of subunits – in parallel as well as in sequence – corresponds to a trained sign, i.e. the sequence of subunits is transcribed in the sign-lexicon of the corresponding feature group. Furthermore, the parallel combination, i.e. the transcription in all three sign-lexica, codes the same sign. Hence, the recognition process does not search for ’any’ best sequence of subunits independently.
The signal of the example sign of figure 3.30 has a total length of 8 feature vectors.
In each channel the assignment of feature vectors (more precisely, of the part of the feature
vector belonging to the accordant feature group) to the different states is carried out entirely
independently of the other channels by time alignment (Viterbi algorithm). Only at the end of
the sign, i.e. after the 8th feature vector has been assigned, the probabilities calculated so
far in each channel are combined. Here, the first and last states are confluent states. They
do not emit any observation probability, as they serve as a common beginning and end state
for the three channels. The confluent end-state can only be reached via the accordant
end-states of all channels. In the depicted example this is the case only after feature
vector 8, even though the end-state in channel 'dist' is already reached after 7 time
steps. The corresponding equation for calculating the combined probability for one
model is:
P(X|λ) = P(X|λ_Pos) · P(X|λ_Size) · P(X|λ_Dist) .    (3.14)
The decision on the best model is reached by a maximisation over all models of
the signs of the vocabulary:
Fig. 3.30. Classification of a single sign by means of subunits and PaHMMs.

Ŝ = argmax_i P(X|λ_i) .    (3.15)

After completion of the training process, a word model λ_i exists for each sign S_i,
which consists of the Hidden Markov Models of the accordant subunits. This word
model will serve as reference for recognition.
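
The following Python sketch outlines this decision rule. The helper viterbi_log_prob(hmm, observations), the lexicon layout, and the channel names are assumptions standing in for whatever HMM toolkit and data structures are actually used; equations 3.14 and 3.15 are evaluated in the log domain:

    def classify_single_sign(features_by_channel, lexicon, viterbi_log_prob):
        # features_by_channel: {"Pos": [...], "Size": [...], "Dist": [...]}
        # lexicon: {sign: {"Pos": hmm, "Size": hmm, "Dist": hmm}} (word models)
        # viterbi_log_prob(hmm, observations) -> log-likelihood of one channel
        best_sign, best_score = None, float("-inf")
        for sign, channel_models in lexicon.items():
            # Eq. 3.14 in the log domain: sum the three channel scores.
            score = sum(viterbi_log_prob(channel_models[ch], features_by_channel[ch])
                        for ch in ("Pos", "Size", "Dist"))
            if score > best_score:          # Eq. 3.15: maximise over all signs
                best_sign, best_score = sign, score
        return best_sign, best_score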
3.3.5.2 Classification of Continuous Sign Language by Means of Subunits and
PaHMMs
In principle the classification process for continuous and isolated sign language
recognition is identical. However, in contrast to the recognition of isolated signs,
continuous sign language recognition is concerned with a number of further difficul-
ties, such as:
• A sign may begin or end anywhere in a given feature vector sequence.
• It is ambiguous how many signs are contained in each sentence.
• There is no specific order of given signs.
• All transitions between subsequent signs are to be detected automatically.
All these difficulties are strongly linked to the main problem: The detection of sign
boundaries. Since these can not be detected accurately, all possible beginning and
end points have to be accounted for.


As introduced in the last section, the first and last state of a word model is a confluent
common state for all three channels. Starting from the first state, the feature vector is
divided into feature groups for the different channels of the PaHMM. From the last
confluent state of a model there exists a transition to the first confluent states of all other
models. This scheme is depicted in figure 3.31.
Fig. 3.31. A Network of PaHMMs for Continuous Sign Language Recognition with Subunits.

Detection of Sign Boundaries

At the time of classification, the number of signs in the sentence as well as the transi-
tions between these signs are unknown. To find the correct sign transitions all models
of signs are combined as depicted in figure 3.31. The generated model now consti-
tutes one comprehensive HMM. Inside a sign-model there are still three transitions
(Bakis-Topology) between states. The last confluent state of a sign model has tran-
sitions to all other sign models. The Viterbi algorithm is employed to determine the
best state sequence of this three-channel PaHMM. Hereby the assignment of feature
vectors to different sign models becomes obvious and with it the detection of sign
boundaries.
3.3.5.3 Stochastic Language Modeling
The classification of sign language depends on two knowledge sources: the visual
model and the language model. Visual modeling is carried out by using HMMs as
described in the previous section. Language modeling will be discussed in this sec-
tion. Without any language model, the transition probabilities between two successive
signs are assumed to be equal, and knowledge about a specific order of the signs in the
training corpus is not utilised during recognition.
In contrast, a statistical language model takes advantage of the knowledge that
certain pairs of signs, i.e. two successive signs, occur more often than others. The following
equation gives the probability of so-called bigram models.

P(w) = P(w_1) · ∏_{i=2}^{m} P(w_i | w_{i−1})    (3.16)

The equation describes the estimation of sign pairs w_{i−1} and w_i. During the classification
process the probability of a subsequent sign changes, depending on the
classification result of the preceding sign. By this method, typical sign pairs receive
a higher probability. The estimation of these probabilities, however, requires a huge
training corpus. Unfortunately, since such training corpora do not exist for sign language,
a simple but efficient enhancement of the usual statistical language model is introduced
in the next subsection.
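
For illustration, bigram probabilities can be estimated as relative frequencies from a corpus of gloss sequences; the tiny corpus below is invented, and a real system would additionally smooth the estimates for unseen pairs:

    from collections import Counter

    def estimate_bigrams(corpus):
        # corpus: list of sentences, each a list of sign glosses
        unigrams, bigrams = Counter(), Counter()
        for sentence in corpus:
            for prev, cur in zip(sentence, sentence[1:]):
                unigrams[prev] += 1
                bigrams[(prev, cur)] += 1
        # relative frequency estimate of P(w_i | w_{i-1})
        return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}

    corpus = [["I", "EAT"], ["I", "DRINK"], ["YOU", "EAT"]]
    print(estimate_bigrams(corpus)[("I", "EAT")])   # 0.5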

Enhancement of language models for sign language recognition

The approach of an enhanced statistical language model for sign language recognition
is based on the idea of dividing all signs of the vocabulary into different
sign-groups (SG) [2]. Occurrence probabilities are then determined between these
sign-groups and not between specific signs. If a combination of different sign-groups
is not seen in the training corpus, this is a hint that signs of these specific sign-groups
do not follow each other. This approach does not require that all combinations of signs
occur in the database.
If the sequence of two signs of two different sign-groups SGi and SGj is observed
in the training corpus, any sign of sign-group SGi followed by any other sign of
sign-group SG_j is allowed for recognition. For instance, if the sign succession 'I
EAT' is contained in the training corpus, the probability that a sign of group 'verb'
(SG_verb) occurs after a sign of sign-group 'personal pronoun' (SG_personal pronoun)
has already been seen is increased. Hereby, the sign succession 'YOU DRINK'
receives a high probability even though this succession does not occur in the training
corpus. On the other hand, if the training corpus does not contain a sample of two
succeeding ’personal pronoun’ signs (e.g. ’YOU WE’), it is a hint that this sequence
is not possible in sign language. Therefore, the recognition of these two succeeding
signs is excluded from the recognition process.
By this modification, a good compromise between statistical and linguistic language
modeling is found. The assignment of signs to specific sign-groups is mainly motivated
by word categories known from German grammar. Sign-groups are 'nouns',
'personal pronouns', 'verbs', 'adjectives', 'adverbs', 'conjunctions', 'modal verbs',
'prepositions' and two additional groups, which take the specific sign language
characteristics into account.
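
A sketch of this enhancement: bigram counts are collected over sign-groups instead of individual signs, so that the observed pair 'I EAT' also licenses 'YOU DRINK'. The sign-to-group mapping below is an illustrative fragment, not the full inventory used in the system:

    from collections import Counter

    def estimate_group_bigrams(corpus, sign_to_group):
        unigrams, bigrams = Counter(), Counter()
        for sentence in corpus:
            groups = [sign_to_group[sign] for sign in sentence]
            for prev, cur in zip(groups, groups[1:]):
                unigrams[prev] += 1
                bigrams[(prev, cur)] += 1
        return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

    sign_to_group = {"I": "personal pronoun", "YOU": "personal pronoun",
                     "EAT": "verb", "DRINK": "verb"}
    model = estimate_group_bigrams([["I", "EAT"]], sign_to_group)
    # P(verb | personal pronoun) = 1.0, so 'YOU DRINK' also receives a high
    # probability, whereas a pronoun-pronoun succession was never observed
    # and is excluded from recognition.
    print(model.get(("personal pronoun", "verb"), 0.0))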

3.3.6 Training

This section finally details the training process of subunit models. The aim of the
training phase is the estimation of all subunit model parameters λ = (Π, A, B) for
each sign of the vocabulary. It is important to note that the subunit-model parameters
are estimated from data recorded for single signs. The signs are not signed in the context of
whole sign language sentences, which is usually the case for continuous sign language
recognition and is then called embedded training. As only single signs serve
as input for the training, this approach leads to a decreased training complexity.
The training of subunits of signs proceeds in four different iterative steps. An
overview is depicted in figure 3.32. In a first step a transcription for all signs has
to be determined for the sign-lexicon. This step is called initial transcription (see
section 3.3.6.1). Hereafter, the model parameters are estimated (estimation step, see
section 3.3.6.2), which leads to preliminary models for subunits. A classification
step follows (see section 3.3.6.3). Afterwards, the accordant results have to be
reviewed in an evaluation step. The result of the whole process is also known as
singleton fenonic base forms or synthetic base forms, respectively [11]. However, before
the training process starts, the feature vector has to be split to take the different
feature groups into account.

Splitting of the feature vector

The partition of the feature vector into three different feature groups (pos, size, dist)
has to be considered during the training process. Hence, as a first step before the
actual training process starts, the feature vector is split into the three feature groups.
Each feature group can be considered as a complete feature vector and is trained
separately, since the groups are (mostly) independent of each other. Each training step
is performed for all three feature groups. The training process eventually leads to
subunit models λ_pos, λ_size and λ_dist.
The following training steps are described for one of the three feature groups. The
same procedure applies to the two remaining feature groups. The training process
eventually leads to three different sign-lexica, in which the transcriptions of all signs in
the vocabulary are listed with respect to the corresponding feature group.
3.3.6.1 Initial Transcription
To start the training process, a sign-lexicon is needed which does not yet exist.
Hence, an initial transcription of the signs for the sign-lexicon is needed before
the training can start. This choice of an initial transcription is arbitrary and of little
importance for the success of the recognition system, as the transcription is updated
iteratively during training. Even with a poor initial transcription, the training process
may lead to good models for subunits of signs [11]. Two possible approaches for the
initial transcription are given in the following.

1. Clustering of Feature Vectors

Again, this step is undertaken separately for the data of each feature group. The existing
data is divided into K different clusters, and each vector is assigned to one of the
K clusters. Eventually, all feature vectors belonging to the same cluster are similar,
which is one of the goals of the initial transcription. A widespread clustering method
is the K-means algorithm [7]. After the K-means algorithm has been run, an assignment
of the feature vectors to the clusters is fixed. Each cluster is also associated
Fig. 3.32. The different processing steps for the identification of subunit models.

with a unique index number. For an initial transcription in the sign-lexicon, these
index numbers have to be coded for each feature vector into the sign-lexicon. An
example is depicted in figure 3.33. Experiments show that an inventory of K = 200
is reasonable for sign recognition.
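
A possible realisation of this first approach, here using the K-means implementation of scikit-learn (one of several suitable tools); the data layout and the subunit label format are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    def initial_transcription(signs, n_clusters=200):
        # signs: {gloss: 2-D array of per-frame feature vectors of one
        #         feature group}; in practice the total number of vectors
        #         must clearly exceed n_clusters.
        all_vectors = np.vstack(list(signs.values()))
        kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(all_vectors)
        lexicon = {}
        for gloss, vectors in signs.items():
            # code the cluster index of every feature vector as a subunit label
            lexicon[gloss] = [f"SU{idx}" for idx in kmeans.predict(vectors)]
        return lexicon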
Fig. 3.33. Example for clustering of feature vectors for a GSL sign language sentence THIS HOW MUCH COST (DAS KOSTEN WIEVIEL).

2. Even Division of Signs into Subunits

For this approach, each sign is divided into a fixed number of subunits, e.g. six, per
sign and feature group. Accordingly, sign 1 is associated with subunits 1-6, sign 2
with subunits 7-12, and so forth. However, as it is not desirable to have six times more
subunits than signs in the vocabulary, not all signs of the vocabulary are chosen.
Instead, only a subgroup of approximately 30 signs, chosen at random, is used
for the initial transcription, which finally leads to approximately 200 subunits.
3.3.6.2 Estimation of Model Parameters for Subunits
The estimation of model parameters is actually the core of this training process, as
it eventually leads to the parameters of an HMM λ_i for each subunit. To start with,
the data of single signs (each sign with m repetitions) serve here as input.
In addition, the transcription of the signs coded in the sign-lexicon (which in the first
iteration is the result of the initial transcription step) is needed. In all following
steps, the transcription of the prior iteration step is the input for the current iteration.
Before the training starts, the number of states per sign has to be determined. The
total number of states for a sign is equal to the number of contained subunits of the
sign times two, as each subunit consists of two states.
Viterbi training, which has already been introduced in section 2.1.3.6, is employed to
estimate the model parameters for subunits. Thus, only the basic approach is outlined
here; for further details and equations see section 2.1.3.6. The final result
of this step is a set of HMMs for the subunits. However, as data for single signs are recorded,
we are again concerned with the problem of finding boundaries, here the boundaries
of subunits within signs. Figure 3.34 illustrates how these boundaries are found.
Fig. 3.34. Example for the detection of subunit boundaries. After time alignment the assignment of feature vectors to states is obvious.

The figure details the situation after time alignment by the Viterbi algorithm performed
on a single sign. Here, the abscissa represents the time and thereby the feature
vector sequence, whereas the concatenated models of the subunits contained in the
sign are depicted on the ordinate. After the alignment of the feature vector sequence
to the states, the detection of subunit boundaries is obvious. In this example the first
two feature vectors are assigned to the states of the model SU5, the following two
are assigned to subunit SU19, one to SU1 and finally the last three to subunit SU7.
Now the Viterbi algorithm has to be performed again, this time on the feature vectors
assigned to each subunit, in order to identify the parameters of an HMM λi for one
subunit.
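
A small sketch of this boundary detection. It assumes that the Viterbi state alignment of the sign is already available as one state index per frame and that every subunit contributes two consecutive states, so integer division by two maps a state to its subunit; the alignment values below are invented to mirror the example of Fig. 3.34:

    def split_by_subunits(frames, state_alignment, states_per_subunit=2):
        # Group the feature vectors of one sign by the subunit whose states
        # they were aligned to.
        segments = {}
        for frame, state in zip(frames, state_alignment):
            segments.setdefault(state // states_per_subunit, []).append(frame)
        return segments

    frames = [f"x{i}" for i in range(1, 9)]      # the 8 feature vectors
    alignment = [0, 1, 2, 3, 4, 6, 7, 7]         # illustrative state indices
    print(split_by_subunits(frames, alignment))
    # {0: ['x1', 'x2'], 1: ['x3', 'x4'], 2: ['x5'], 3: ['x6', 'x7', 'x8']}
    # i.e. the four subunits of the sign (SU5, SU19, SU1, SU7 in Fig. 3.34)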
3.3.6.3 Classification of Single Signs
The same data of single signs, which already served as input to the training process,
is now classified to prove the quality of the estimated subunit models (of the previous
step). It is expected that most recognition results are similar to the transcription
coded in the sign-lexicon. However, not all results will comply with the original tran-
scription, which is especially true if data of several, e.g. m, repetitions of one sign
exist.
As the recognition results of all repetitions of a single sign are most probably
not identical, the best representation for all repetitions of that specific sign has to be
found. To reach this goal, an HMM for each of the m repetitions is constructed. Each of
these Hidden Markov Models is based on the recognition result of the previous step:
the HMMs of the accordant subunit models are concatenated to a word model for
that repetition of the sign. For instance, if the recognition result of a repetition of a
sign is SU4, SU1, SU7 and SU9, the resulting word model for that repetition is the
concatenation SU4 ⊕ SU1 ⊕ SU7 ⊕ SU9. In the following, λ_{s_i} refers to the word model
built from the i-th repetition of a sign s.
We start with the evaluation of the word model λ_{s_1} of the first repetition. For this
purpose, the probability is determined that the feature vector sequence of a specific
repetition, say s_2, is produced by the model λ_{s_1}. This procedure eventually leads
to m different probabilities (one for each of the m repetitions) for the model λ_{s_1}. The
multiplication of all these probabilities finally results in an indication of how 'good'
this model fits all m repetitions. For the remaining repetitions s_2, s_3, . . . , s_m of
that sign the procedure is analogous. Finally, the model of the repetition s_i which
yields the highest probability is chosen. This probability is determined by:

C*(w) = argmax_{j=1,…,M} ∏_{i=1}^{M} P_{A_j(w)}(A_i(w)) .    (3.17)

Eventually, the transcription of the most probable repetition si is coded in the sign-
lexicon. If the transcription of this iteration step differs from the previous iteration
step, another iteration of the training will follow.
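
A sketch of this evaluation step in the log domain. The callables build_word_model() and log_prob() are placeholders for the concatenation of subunit HMMs and the production-probability computation of the HMM toolkit in use:

    def select_reference_transcription(repetitions, build_word_model, log_prob):
        # repetitions: list of (transcription, feature_sequence) pairs, one
        #              pair per repetition of the sign
        best_transcription, best_score = None, float("-inf")
        for transcription_j, _ in repetitions:
            model_j = build_word_model(transcription_j)
            # sum of log-probabilities = log of the product over all
            # repetitions (Eq. 3.17)
            score = sum(log_prob(model_j, features_i)
                        for _, features_i in repetitions)
            if score > best_score:
                best_transcription, best_score = transcription_j, score
        return best_transcription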

3.3.7 Concatenation of Subunit Models to Word Models for Signs

With the determination of the subunit models the basic training process is complete. For
continuous sign language recognition, 'legal' sequences of subunits must be composed to
signs. It may, however, happen that an 'illegal' sequence of subunits, i.e. a sequence
of subunits which is not coded in the sign-lexicon, is recognised.
To avoid this misclassification, the subunits are concatenated according to their
transcription in the sign-lexicon. For example, the model λ_S for a sign S with a
transcription of SU6, SU1, SU4 is then defined by λ_S = λ_SU6 ◦ λ_SU1 ◦ λ_SU4.
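
In code, building such a word model is little more than a lookup in the sign-lexicon followed by joining the trained subunit HMMs; concatenate() is a placeholder for whatever model-joining operation the HMM toolkit provides:

    from functools import reduce

    def word_model_for_sign(transcription, subunit_models, concatenate):
        # transcription: e.g. ["SU6", "SU1", "SU4"] as coded in the sign-lexicon
        # subunit_models: {"SU6": hmm, "SU1": hmm, ...} (trained subunit HMMs)
        models = [subunit_models[label] for label in transcription]
        return reduce(concatenate, models)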

3.3.8 Enlargement of Vocabulary Size by New Signs

A data-driven approach with subunit models gives the opportunity to increase the
vocabulary size by non-trained signs, as simply a new entry in the sign-lexicon has to
be determined. An additional training step is not needed for the new signs. The
approach is similar to the one described in section 3.3.6, the training of subunit
models [11]. For new signs, i.e. signs which have so far not been included in the
vocabulary, it has to be determined which sequence of already trained subunit models
is able to represent the specific sign. Figure 3.35 illustrates this process.
To determine a new entry to the sign-lexicon at least one performance of the
sign is needed. However, for a better modeling, more repetitions of that sign are
desirable. In a first step the new sign is recognised. For this recognition process, the
Fig. 3.35. Steps to enlarge the vocabulary by new, non-trained signs.

trained subunit models serve as reference for classification. However, if more than
one repetition of the sign is available, it is most unlikely that all repetitions lead to the
same recognition result. The worst case is n different results for n different
sign repetitions. To solve this problem, a further evaluation step, which is the same
as the one in the training process, is required. Here, for all repetitions of the sign,
the ’best’ recognition result (in terms of all repetitions) is chosen and added to the
sign-lexicon.

3.4 Performance Evaluation


In this section some performance data that have been achieved with the video-based
sign language recognizer described above are presented. Person-dependent as well
as person-independent recognition rates are given for single signs under controlled
laboratory conditions as well as under real-world conditions. Recognition results are
also presented regarding the applicability of subunits, which were automatically derived
from signs.

3.4.1 Video-based Isolated Sign Recognition

As a first result Tab. 3.2 shows the person-dependent recognition rates from a
leave-one-out test for four signers and various video resolutions under controlled
laboratory conditions. The training resolution was always 384 × 288. Since the num-
ber of recorded signs varies slightly the vocabulary size is specified separately for
each signer. Only manual features were considered. Interestingly COG coordinates
alone accounted for approx. 95% recognition of a vocabulary of around 230 signs.
On a 2 GHz PC, processing took an average of 11.79s/4.15s/3.08s/2.92s per sign, de-
pending on resolution. Low resolutions caused only a slight decrease in recognition
rate but reduced processing time considerably. So far a comparably high performance
has only been reported for intrusive systems.

Table 3.2. Person-dependent isolated sign recognition with manual features in controlled en-
vironments.
Video        Features     Test Signer (Vocabulary Size)                                    Avg.
Resolution                Ben (235)    Michael (232)   Paula (219)    Sanchu (230)    (229)
384 × 288    all          98.7%        99.3%           98.5%          99.1%           98.9%
192 × 144    all          98.5%        97.4%           98.5%          99.1%           98.4%
128 × 96     all          97.7%        96.5%           98.3%          98.6%           97.8%
96 × 72      all          93.1%        93.7%           97.1%          95.9%           94.1%
384 × 288    x, ẋ, y, ẏ   93.8%        93.9%           95.5%          96.1%           94.8%

Under the same conditions head pose, eyebrow position, and lip outline were
employed as nonmanual features. The recognition rates achieved on the vocabular-
ies presented in Tab. 3.2 varied between 49.3% and 72.4% among the four signers,
with an average of 63.6%. Hence roughly two of three signs were recognized just
from nonmanual features, a result that emphasizes the importance of mimics for sign
language recognition.
Tab. 3.3 shows results for person-independent recognition. Since the signers used
different signs for some words, the vocabulary has been chosen as the intersection of
the test signs with the union of all training signs. No selection has been performed
otherwise, and no minimal pairs have been removed. As expected, performance drops
significantly. This is caused by strong interpersonal variance in signing, as visualized
in Fig. 3.5 by COG traces for identical signs done by different signers. Recognition
rates are also affected by the exact constellation of training/test signers.
Table 3.3. Person-independent isolated sign recognition with manual features in controlled
environments. The n-best rate indicates the percentage of results for which the correct sign
was among the n signs deemed most similar to the input sign by the classifier. For n = 1 this
corresponds to what is commonly specified as the recognition rate.
Training Signer(s)       Test Signer   Vocabulary Size   n-Best Rate
                                                         n=1      n=5      n=10
Michael                  Sanchu        205               36.0%    58.0%    64.9%
Paula, Sanchu            Michael       218               30.5%    53.6%    63.2%
Ben, Paula, Sanchu       Michael       224               44.5%    69.3%    77.1%
Ben, Michael, Paula      Sanchu        221               54.2%    79.4%    84.5%
Ben, Michael, Sanchu     Paula         212               37.0%    63.6%    72.8%
Michael, Sanchu          Ben           206               48.1%    70.0%    77.4%

Person-independent performance in uncontrolled environments is difficult to
measure since it depends on multiple parameters (signer, vocabulary, background,
lighting, camera). Furthermore, noise and outliers are inevitably introduced in the
features when operating in real-world settings. Despite the large interpersonal vari-
ance, person-independent operation is feasible for small vocabularies, as can be seen
in Tab. 3.4. Each test signer was recorded in a different real-life environment, and
the selection of signs is representative of the complete vocabulary (it contains one-
handed and two-handed signs, both with and without overlap).

Table 3.4. Person-independent recognition rates in real life environments.


Vocabulary   Test Signer                                                       Avg.
Size         Christian   Claudia   Holger   Jörg     Markus   Ulrich
6            96.7%       83.3%     96.7%    100%     100%     93.3%            95.0%
18           90.0%       70.0%     90.0%    93.3%    96.7%    86.7%            87.8%

3.4.2 Subunit Based Recognition of Signs and Continuous Sign Language

Recognition of continuous sign language requires a prohibitive amount of training
material if signed sentences are used. Therefore the automated transcription of signs
to subunits described in Sect. 3.3 was used here, which is expected to reduce the
effort required for vocabulary extension to a minimum. For 52 signs taken arbitrarily
from the domain ”shopping” an automatic transcription was performed, which re-
sulted in 184 (Pos), 184 (Size) and 187 (Dist) subunits respectively. Subsequently
performed classification tests with signs synthesized from the identified subunits
yielded ¿ 90% correct recognition. This result shows, the the automatic segmentation
of signs produced reasonable subunits. Additionally, 100 continuous sign language
sentences were created using only signs synthesized from subunits. Each sentence
consisted of two to nine signs. After eliminating coarticulation effects and applying
bigrams according to the statistical probability of sign order, a recognition rate of
87% was achieved. This finding is essential, as it solves the problem of vocabulary
extension without additional training.

3.4.3 Discussion

In this chapter a concept for video based sign language recognition was described.
Manual and non-manual features have been integrated into the recognizer to cover
all aspects of sign language expressions. Excellent recognition performance has
been achieved for person-dependent classification and medium sized vocabularies.
The presented system is also suitable for person-independent real world applications
where small vocabularies suffice, as e.g. for controlling interactive devices. The re-
sults also show that a reasonable automatic transcription of signs to subunits is feasi-
ble. Hence the extension of an existing vocabulary is made possible without the need
for large amounts of training data. This constitutes a key feature in the development
of sign language recognition systems supporting large vocabularies.

References
1. Bahl, L., Jelinek, F., and Mercer, R. A Maximum Likelihood Approach to Continuous
Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
5(2):179–190, March 1983.
2. Bauer, B. Erkennung kontinuierlicher Gebärdensprache mit Untereinheiten-Modellen.
Shaker Verlag, Aachen, 2003.
3. Becker, C. Zur Struktur der deutschen Gebärdensprache. WVT Wissenschaftlicher Ver-
lag, Trier (Germany), 1997.
4. Canzler, U. and Dziurzyk, T. Extraction of Non Manual Features for Videobased Sign
Language Recognition. In The University of Tokyo, I. o. I. S., editor, Proceedings of
IAPR Workshop on Machine Vision Applications, pages 318–321. Nara, Japan, December
2002.
5. Canzler, U. and Ersayar, T. Manual and Facial Features Combination for Videobased Sign
Language Recognition. In 7th International Student Conference on Electrical Engineer-
ing, page IC8. Prague, May 2003.
6. Derpanis, K. G. A Review of Vision-Based Hand Gestures. Technical report, Department
of Computer Science, York University, 2004.
7. Duda, R., Hart, P., and Stork, D. Pattern Classification. Wiley-Interscience, New York,
2000.
8. Fang, G., Gao, W., Chen, X., Wang, C., and Ma, J. Signer-Independent Continuous Sign
Language Recognition Based on SRN/HMM. In Revised Papers from the International
Gesture Workshop on Gestures and Sign Languages in Human-Computer Interaction,
pages 76–85. Springer, 2002.
9. Hermansky, H., Timberwala, S., and Pavel, M. Towards ASR on Partially Corrupted
Speech. In Proc. ICSLP ’96, volume 1, pages 462–465. Philadelphia, PA, 1996.
10. Holden, E. J. and Owens, R. A. Visual Sign Language Recognition. In Proceedings of
the 10th International Workshop on Theoretical Foundations of Computer Vision, pages
270–288. Springer, 2001.
11. Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, 1998. ISBN 0-262-
10066-5.
12. Jones, M. and Rehg, J. Statistical Color Models with Application to Skin Detection.
Technical Report CRL 98/11, Compaq Cambridge Research Lab, December 1998.
13. KaewTraKulPong, P. and Bowden, R. An Improved Adaptive Background Mixture Model
for Realtime Tracking with Shadow Detection. In AVBS01. 2001.
14. Liang, R. H. and Ouhyoung, M. A Real-time Continuous Gesture Interface for Taiwanese
Sign Language. In UIST ‘97 Proceedings of the 10th Annual ACM Symposium on User
Interface Software and Technology. ACM, Banff, Alberta, Canada, October 14-17 1997.
15. Murakami, K. and Taguchi, H. Gesture Recognition Using Recurrent Neural Networks.
In Proceedings of the SIGCHI conference on Human factors in computing systems, pages
237–242. ACM Press, 1991.
16. Ong, S. C. W. and Ranganath, S. Automatic Sign Language Analysis: A Survey and the
Future Beyond Lexical Meaning. IEEE TPAMI, 27(6):873–891, June 2005.
17. Parashar, A. S. Representation and Interpretation of Manual and Non-manual Information
for Automated American Sign Language Recognition. PhD thesis, Department of
Computer Science and Engineering, College of Engineering, University of South Florida,
2003.
18. Porikli, F. and Tuzel, O. Human Body Tracking by Adaptive Background Models and
Mean-Shift Analysis. Technical Report TR-2003-36, Mitsubishi Electric Research Labo-
ratory, July 2003.
19. Sonka, M., Hlavac, V., and Boyle, R. Image Processing, Analysis and Machine Vision.
Brooks Cole, 1998.
20. Starner, T., Weaver, J., and Pentland, A. Real-Time American Sign Language Recognition
Using Desk and Wearable Computer Based Video. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 20(12):1371–1375, December 1998.
21. Stauffer, C. and Grimson, W. E. L. Adaptive Background Mixture Models for Real-time
Tracking. In Computer Vision and Pattern Recognition 1999, volume 2. 1999.
22. Stokoe, W. Sign language structure: An Outline of the Visual Communication Systems of
the American Deaf. (Studies in Linguistics. Occasional Paper; 8) University of Buffalo,
1960.
23. Sutton, V. http://www.signwriting.org/, 2003.
24. Vamplew, P. and Adams, A. Recognition of Sign Language Gestures Using Neural Net-
works. In European Conference on Disabilities, Virtual Reality and Associated Technolo-
gies. 1996.
25. Vogler, C. and Metaxas, D. Parallel Hidden Markov Models for American Sign Language
Recognition. In Proceedings of the International Conference on Computer Vision. 1999.
26. Vogler, C. and Metaxas, D. Toward Scalability in ASL Recognition: Breaking Down
Signs into Phonemes. In Braffort, A., Gherbi, R., Gibet, S., Richardson, J., and Teil,
D., editors, The Third Gesture Workshop: Towards a Gesture-Based Communication in
Human-Computer Interaction, pages 193–204. Springer-Verlag Berlin, Gif-sur-Yvette
(France), 2000.
27. Welch, G. and Bishop, G. An Introduction to the Kalman Filter. Technical Report TR
95-041, Department of Computer Science, University of North Carolina at Chapel Hill,
2004.
28. Yang, M., Ahuja, N., and Tabb, M. Extraction of 2D Motion Trajectories and Its Applica-
tion to HandGesture Recognition. IEEE Transactions On Pattern Analysis And Machine
Intelligence, 24:1061–1074, 2002.
29. Zieren, J. and Kraiss, K.-F. Robust Person-Independent Visual Sign Language Recog-
nition. In Proceedings of the 2nd Iberian Conference on Pattern Recognition and Image
Analysis IbPRIA 2005, volume Lecture Notes in Computer Science. 2005.
Chapter 4

Speech Communication and Multimodal Interfaces


Björn Schuller, Markus Ablaßmeier, Ronald Müller, Stefan Reifinger,
Tony Poitschke, Gerhard Rigoll

Within the area of advanced man-machine interaction, speech communication
has always played a major role for several decades. The idea of replacing the con-
ventional input devices such as buttons and keyboard by voice control and thus in-
creasing the comfort and the input speed considerably, seems that much attractive,
that even the quite slow progress of speech technology during those decades could
not discourage people from pursuing that goal. However, nowadays this area is in a
different situation than in those earlier times, and these facts shall be also considered
in this book section: First of all, speech technology has reached a much higher de-
gree of maturity, mainly through the technique of stochastic modeling which shall
be briefly introduced in this chapter. Secondly, other interaction techniques became
more mature, too, and in the framework of that development, speech became one
of the preferred modalities of multimodal interaction, e.g. as ideal complementary
mode to pointing or gesture. This shall be also reflected in the subsection on mul-
timodal interaction. Another relatively recent development is the fact that speech is
not only a carrier of linguistic information, but also one of emotional information,
and emotions became another important aspect in today’s advanced man-machine
interaction. This will be considered in a subsection on affective computing, where
this topic is also consequently investigated from a multimodal point of view, taking
into account the possibilities for extracting emotional cues from the speech signal
as well as from visual information. We believe that such an integrated approach to
all the above mentioned different aspects is appropriate in order to reflect the newest
developments in that field.

4.1 Speech Recognition


This section is concerned with the basic principles of Automatic Speech Recognition
(ASR). This research area has had a dynamic development during the last decades,
and has been considered in its early stage as an almost unsolvable problem, then went
through several evolution steps during the 1970s and 1980s, and eventually became
more mature during the 1990s, when extensive databases and evaluation schemes be-
came available that clearly demonstrated the superiority of stochastic machine learn-
ing techniques for this task. Although the existing technology is still far from being
perfect, today there is a speech recognition market with a number of existing prod-
ucts that are almost all based on the before mentioned technology.
Our goal here is neither to describe the entire history of this development, nor
to provide the reader with a detailed presentation of the complete state-of-the-art
of the fundamental principles of speech recognition (which would probably require
a separate book). Instead, the aim of this section is to present a relatively compact
overview on the currently most actual and successful method for speech recognition,
which is based on a probabilistic approach for modeling the production of speech us-
ing the technique of Hidden-Markov-Models (HMMs). Today, almost every speech
recognition system, including laboratory as well as commercial systems, is based on
this technology and therefore it is useful to concentrate in this book exclusively on
this approach. Although this method makes use of complicated mathematical foun-
dations in probability theory, machine learning and information theory, it is possible
to describe the basic functionality of this approach using only a moderate amount of
mathematical expressions which is the approach pursued in this presentation.

4.1.1 Fundamentals of Hidden Markov Model-based Speech Recognition

The fundamental assumption of this approach is the fact that each speech sound for
a certain language (more commonly called phoneme) is represented as a Hidden-
Markov-Model, which is nothing else but a stochastic finite state machine.
Similarly to classical finite state machines, an HMM consists of a finite num-
ber of states, with possible transitions between those states. The speech production
process is considered as a sequence of discrete acoustic events, where typically each
event is characterized by the production of a vector of features that basically describe
the produced speech signal at the equivalent discrete time step of the speech produc-
tion process. At each of these discrete time steps, the HMM is assumed to be in one
of its states, where the next time step will let the HMM perform a transition into a
state that can be reached from its current state according to its topology (including
possible transitions back to its current state or formerly visited states).

Fig. 4.1. Example for a Hidden Markov Model.

Fig. 4.1 shows such an HMM, which has two major sets of parameters: The first
set is the matrix of transition probabilities describing the probability p(s(k − 1) →
s(k)), where k is the discrete time index and s the notation for a state. These prob-
abilities are represented by the parameters a in Fig. 4.1. The second set represents
the so-called emission probabilities p(x|s(k)), which is the probability that a certain
feature vector can occur while the HMM is currently in state s at time k. This proba-
bility is usually expressed by a continuous distribution function, denoted as function
b in Fig. 4.1, which is in many cases a mixture of Gaussian distributions. If a transi-
tion into another state is assumed at discrete time k with the observed acoustic vector
x(k), it is very likely that this transition will be into a state with a high emission prob-
ability for this observed vector, i.e. into a state that represents well the characteristics
of the observed feature vector.

Fig. 4.2. Example for HMM-based speech recognition and training.

With these basic assumptions, it is already possible to formulate the major prin-
ciples of HMM-based speech recognition, which can be best done by having a closer
look at Fig. 4.2: In the lower part of this figure, one can see the speech signal of
an utterance representing the word sequence ”this was”, also displayed in the upper
part of this figure. This signal has been subdivided into time windows of constant
length (typically around 10 ms length) and for each window a vector of features has
been generated, e.g. by applying a frequency transformation to the signal, such as
a Fourier-transformation or similar. This results in a sequence of vectors displayed
just above the speech signal. This vector sequence is now ”observed” by the Markov-
Model in Fig. 4.2, which has been generated by representing each phoneme of the un-
derlying word sequence by a 2-state-HMM and concatenating the separate phoneme
HMMs into one larger single HMM. An important issue in Fig. 4.2 is represented
by the black lines that visualize the assignment of each feature vector to one specific
state of the HMM. We have thus as many assignments as there are vectors in the fea-
ture vector sequence X = [x(1), x(2), ..., x(K)], where K is the number of vectors,
i.e. the length of the feature vector sequence. This so-called state-alignment of the
feature vectors is one of the essential capabilities of HMMs and there are different
algorithms for computation of this alignment, which shall not be presented here in
detail. However, the basic principle of this alignment procedure can be made clear by
considering one single transition of the HMM at discrete time k from state s(k−1) to
state s(k), where at the same time the occurrence of feature vector x(k) is observed.
Obviously, the joint probability of this event, consisting of the mentioned transition
and the occurrence of vector x(k) can be expressed according to Bayes law as:

p(x(k), s(k−1) → s(k)) = p(x(k) | s(k−1) → s(k)) · p(s(k−1) → s(k))
                       = p(x(k) | s(k)) · p(s(k−1) → s(k))    (4.1)

and thus is exactly composed of the two parameter sets a and b mentioned
before, which describe the HMM and are shown in Fig. 4.1. As already stated before,
a transition into one of the next states that results in a high joint probability as
expressed in the above formula will be likely. One can thus imagine that an algo-
rithm for computing the most probable state sequence as alignment to the observed
feature vector sequence must be based on an optimal selection of a state sequence
that eventually leads to the maximization of the product of all probabilities according
to the above equation for k = 1 to K. A few more details on this algorithm will be
provided later.
For now, let us assume that the parameters of the HMMs are all known and that
the optimal state sequence has been determined as indicated in Fig. 4.2 by the black
lines that assign each vector to one most probable state. If that is the case, then this
approach has produced two major results: The first one is a segmentation result that
assigns each vector to one state. With this, it is for instance possible to determine
which vectors – and thus which section of the speech signal – have been assigned to a
specific phoneme, e.g. to the sound /i/ in Fig. 4.2 (namely vectors no. 6-9). The
second result will be the before mentioned overall probability, that the feature vec-
tor sequence representing the speech signal has been assigned to the optimal state
sequence. Since this probability will be the maximum possible probability and no
other state sequence will lead to a larger value, this probability can be considered as
the overall production probability that the speech signal has been produced by the
underlying hidden Markov model shown in Fig. 4.2. Thus, an HMM is capable of
processing a speech signal and producing two important results, namely a segmen-
tation of the signal into subordinate units (such as e.g. phonemes) and the overall
probability that such a signal can have been produced at all by the underlying prob-
abilistic model.

4.1.2 Training of Speech Recognition Systems

These results can be exploited in HMM-based speech recognition in the following
manner: In the training phase, the HMM parameters in Fig. 4.2 are not known, but the
transcription of the corresponding word sequence (here: "this was") will be known.
If some initial parameters for the large concatenated HMM of Fig. 4.2 are assumed
for the start of an iterative procedure, then it will be of course possible to compute
with these parameters the optimal corresponding state sequence, as outlined before.
This will certainly not be an optimal sequence, since the initial HMM parameters
might not have been selected very well. However, it is possible to derive from that
state sequence a new estimation for the HMM parameters by simply exploiting its
statistics. This is quite obvious for the transition probabilities, since one only needs
to count the occurring transitions between the states of the calculated state sequence
and divide that by the total number of transitions. The same is possible for the prob-
abilities p(x(k)|s(k)) which can basically be (without details) derived by observing
which kind of vectors have been assigned to the different states and calculating sta-
tistics of these vectors, e.g. their mean values and variances. It can be shown that
this procedure can be repeated iteratively: With the HMM parameters updated in the
way as just described, one can now compute again a new improved alignment of the
vectors to the states and from there again exploit the state sequence statistics for up-
dating the HMM parameters. Typically, this leads indeed to a useful estimation of the
HMM parameters after a few iterations. Moreover, at the end of this procedure, an-
other important and final step can be carried out: By ”cutting” the large concatenated
HMM in Fig. 4.2 again into the smaller phoneme-based HMMs, one obtains a single
HMM for each phoneme which has now estimated and assigned parameters that rep-
resent well the characteristics of the different sounds. Thus, the single HMM for the
phoneme /i/ in Fig. 4.2 will have certainly different parameters than the HMM for
the phoneme /a/ in this figure and serves well as probabilistic model for this sound.
This is basically the training procedure of HMMs, and it becomes obvious that
the HMM technology allows the training of probabilistic models for each speech unit
(typically the unit ”phoneme”) from the processing of entire long sentences without
the necessity to phoneme-label or to pre-segment these sentences manually, which is
an enormous advantage.
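
The iterative procedure can be summarised as the following loop; align() and reestimate_from_alignments() are hypothetical methods that stand in for the state-alignment and statistics-gathering steps an HMM toolkit would provide:

    def iterative_training(hmm, training_utterances, iterations=5):
        # hmm: concatenated sentence model with (initially rough) parameters
        for _ in range(iterations):
            # 1. align every feature vector sequence to the states with the
            #    current parameters (e.g. by the Viterbi algorithm)
            alignments = [hmm.align(features) for features in training_utterances]
            # 2. re-estimate transition and emission parameters from the
            #    statistics of these alignments
            hmm.reestimate_from_alignments(training_utterances, alignments)
        return hmm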

4.1.3 Recognition Phase for HMM-based ASR Systems

Furthermore, Fig. 4.2 can also serve as suitable visualization for demonstrating the
recognition procedure in HMM-based speech recognition. In this case – contrary to
the training phase – now the HMM parameters are given (from the training phase)
and the speech signal represents an unknown utterance, therefore the transcription
in Fig. 4.2 is now unknown and shall be reconstructed by the recognition procedure.
From the previous description, we have to recall once again that very efficient algo-
rithms exist for the computation of the state-alignment between the feature vector
sequence and the HMM-states, as depicted by the black lines in the lower part of
Fig. 4.2. So far we have not looked at the details of such an algorithm, but shall go
into a somewhat more detailed analysis of one of these algorithms now by looking
at Fig. 4.3, which displays in the horizontal direction the time axis with the feature
vectors x appearing at discrete time steps and the states of a 3-state HMM on the
vertical axis.
Assuming that the model starts in the first state, it is obvious that the first fea-
ture vector will be assigned to that initial state. Looking at the second feature vector,
Fig. 4.3. Trellis diagram for the Viterbi algorithm.

according to the topology of the HMM, it is possible that the model stays in state 1
(i.e. makes a transition from state 1 to state 1) or moves with a transition from state 1
into state 2. For both options, the probability can be computed according to Eqn. 4.1
and both options are shown as possible path in Fig. 4.3. It is then obvious that from
both path end points for time step 2, the path can be extended, in order to compute if
the model has made a transition from state 1 (into either state 1 or state 2) or a transi-
tion from state 2 (into either state 2 or state 3). Thus, for the 3rd time step, the model
can be in state 1, 2 or 3, and all these options can be computed with a certain prob-
ability, by multiplying the probabilities obtained for time step 2 by the appropriate
transition and emission probabilities for the occurrence of the third feature vector. In
this way it should be easily visible that the possible state sequence can be displayed
in a grid (also called trellis) as shown in Fig. 4.3, which displays all possible paths
that can be taken from state 1 to the final state 3 in this figure, by assuming the obser-
vation of five different feature vectors. The bold line in this grid shows one possible
path through this grid and it is clear that for each path, a probability can be computed
that this path has been taken, according to the before described procedure of mul-
tiplying the appropriate production probabilities of each feature vector according to
Eqn. 4.1. Thus, the optimal path can be computed with the help of the principle of
dynamic programming, which is well-known from the theory of optimization. The
above procedure describes the Viterbi algorithm, which can be considered as one
of the major algorithms for HMM-based speech recognition. As already mentioned
several times, the major outcome of this algorithm is the optimal state sequence as
well as the ”production probability” for this sequence, which is as well the proba-
bility that the given HMM has produced the associated feature vector sequence X.
Let’s assume that the 3-state HMM in Fig. 4.3 represents a phoneme, as mentioned
in the previous section on the training procedure, resulting in trained parameters for
HMMs which typically represent all the phonemes in a given language. Typically,
an unknown speech signal will either represent a spoken word or an entire spoken
sentence. How can such a sentence then be recognized by the Viterbi algorithm as
described above for the state-alignment procedure of an HMM representing a single
phoneme? This can be achieved by simply extending the algorithm so that it com-
putes the most likely sequence of phoneme-HMMs that maximize the probability of
emitting the observed feature vector sequence. That means that after the final state
of the HMM in Fig. 4.3 has been reached (assuming a rather long feature vector se-
quence of which the end is not yet reached) another HMM (and possibly more) will
be appended and the algorithm is continued until all feature vectors are processed
and a final state (representing the end of a word) is obtained. Since it is not known
which HMMs have to be appended and what will be the optimal HMM sequence, it
becomes obvious that this procedure implies a considerable search problem and this
process is therefore also called ”decoding”. There are however several ways to sup-
port this decoding procedure, for instance by considering the fact that the phoneme
order within words is rather fixed and variation can basically only happen between
word boundaries. Thus the phoneme search procedure is more a word search proce-
dure which will be furthermore assisted by the so-called language model, that will
be discussed later and assigns probabilities for extending the search path into a new
word model if the search procedure has reached the final state of a preceding word
model. Therefore, finally the algorithm can compute the most likely word sequence
and the final result of the recognition procedure is then the recognized sentence. It
should be noted that due to the special capabilities of the HMM approach, this de-
coding result can be obtained additionally with the production probability of that
sentence, and with the segmentation result indicating which part of the speech signal
can be assigned to the word and even to the phoneme boundaries of the recognized
utterance. Some extensions of the algorithms make it even possible to compute the
N most likely sentences for the given utterance (for instance for further processing
of that result in a semantic module), indicating once again the power and elegance of
this approach.
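
A compact log-domain implementation of this trellis search for a single model is sketched below; the transition, emission, and initial log-probabilities are assumed to be given, e.g. from a trained phoneme HMM:

    import numpy as np

    def viterbi(log_a, log_b, log_pi):
        # log_a:  (S, S) log transition probabilities, log_a[i, j] = log p(i -> j)
        # log_b:  (K, S) log emission probabilities of the K feature vectors
        # log_pi: (S,)   log initial state probabilities
        K, S = log_b.shape
        delta = np.full((K, S), -np.inf)   # best log score ending in state s at time k
        psi = np.zeros((K, S), dtype=int)  # backpointers for the best path
        delta[0] = log_pi + log_b[0]
        for k in range(1, K):
            for s in range(S):
                scores = delta[k - 1] + log_a[:, s]
                psi[k, s] = int(np.argmax(scores))
                delta[k, s] = scores[psi[k, s]] + log_b[k, s]
        # backtrack the optimal state alignment of the feature vectors
        path = [int(np.argmax(delta[-1]))]
        for k in range(K - 1, 0, -1):
            path.append(int(psi[k, path[-1]]))
        path.reverse()
        return path, float(np.max(delta[-1]))  # state sequence and log "production probability"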

4.1.4 Information Theory Interpretation of Automatic Speech Recognition

With this background, it is now possible to consider the approach to HMM-based
speech recognition from an information theoretic point of view, by considering
Fig. 4.4.

Fig. 4.4. Information theory interpretation of automatic speech recognition.

Fig. 4.4 can be interpreted as follows: A speaker formulates a sentence as a sequence
of words denoted as W = [w(1), w(2), ..., w(N )]. He speaks that sentence
into the microphone that captures the speech signal which is actually seen by the
automatic speech recognizer. Obviously, this recognizer does not see the speaker’s
originally uttered word sequence, but instead sees the ”encoded” version of that in
form of the feature vector sequence X that has been derived from the acoustic wave-
form, as output of the acoustic channel as displayed in Fig. 4.4. There is a probabilis-
tic relation between the word sequence W and the observed feature vector sequence
X, and indeed this probabilistic relation is modeled by the Hidden Markov Models
that represent the phoneme sequence implied by the word sequence W . In fact, this
model yields the already mentioned ”production probability” that the word sequence
W , represented by the appropriate sequence of phoneme HMMs has generated the
observed feature vector sequence and this probability can be denoted as p(X|W ).
The second part of Fig. 4.4 shows the so-called ”linguistic decoder”, a module that
is responsible for decoding the original information W from the observed acoustic
sequence X, by taking into account the knowledge about the model provided by the
HMMs expressed in p(X|W ). The decoding strategy of this module is to find the
best possible word sequence W from observing the acoustic feature string X and
thus to maximize p(W |X) according to Bayes’ rule as follows:

max_W p(W|X) = max_W [ p(X|W) · p(W) / p(X) ]    (4.2)
And because finding the optimal word sequence W is independent of the proba-
bility p(X), the final maximization rule is:

max_W [ p(X|W) · p(W) ]    (4.3)

Exactly this product of probabilities has to be maximized during the search pro-
cedure described before in the framework of the Viterbi algorithm. In this case, it
should be noted that p(X|W ) is nothing else but the probability of the feature vec-
tor sequence X under the assumption that it has been generated by the underlying
word sequence W , and exactly this probability is expressed by the HMMs resulting
from the concatenation of the phoneme-based HMMs into a model that represents
the resulting word string. In this way, the above formula expresses indeed the be-
fore mentioned decoding strategy, namely to find the combination of phoneme-based
HMMs that maximize the corresponding emission probability. However, the above
mentioned maximization rule contains an extra term p(W ) that has to be taken into
account additionally to the so far known maximization procedure: This term p(W )
is in fact the ”sentence probability” that the word sequence W will occur at all, in-
dependent of the acoustic observation X. Therefore, this probability is described by
the so-called language model, that simply expresses the occurrence probability of a
word in a sentence given its predecessors, which can be expressed as:

p(w(n)|w(n − 1), w(n − 2), ..., w(n − m)) (4.4)


In this case, the variable m denotes the ”word history”, i.e. the number of pre-
decessor words that are considered to be relevant for the computation of the cur-
rent word’s appearance probability. Then, the overall sentence probability can be
expressed as the product of all single word probabilities in a sentence according to

p(W) = ∏_{n=1}^{N} p(w(n) | w(n−1), ..., w(n−m))    (4.5)

where N is the length of the sentence and m is the considered length of the
word history. As already mentioned, the above word probabilities are completely
independent of any acoustic observation and can be derived from statistics obtained
e.g. from the analysis of large written text corpora by counting the occurrence of all
words given a certain history of word predecessors.
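
A sketch of how these two knowledge sources enter the decision rule; ngram_log_prob() and acoustic_log_prob() are placeholders for a trained language model and for the HMM-based production probability log p(X|W), respectively:

    def sentence_log_prob(words, ngram_log_prob, m=2):
        # Eq. 4.5 in the log domain: sum the conditional log-probabilities of
        # every word given its (at most m) predecessor words.
        total = 0.0
        for n, word in enumerate(words):
            history = tuple(words[max(0, n - m):n])
            total += ngram_log_prob(word, history)
        return total

    def decode(candidate_word_sequences, acoustic_log_prob, ngram_log_prob):
        # Eq. 4.3 in the log domain: pick the word sequence maximising
        # log p(X|W) + log p(W).
        return max(candidate_word_sequences,
                   key=lambda W: acoustic_log_prob(W)
                                 + sentence_log_prob(W, ngram_log_prob))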

4.1.5 Summary of the Automatic Speech Recognition Procedure

Fig. 4.5. Block diagram for HMM-based speech recognition.

Finally, to summarize the functioning of HMM-based speech recognition, the
block diagram in Fig. 4.5 can be interpreted as follows: The speech signal is cap-
tured by a microphone, sampled and digitized. Preprocessing of the speech signal
includes some filtering process and possible noise compensation. The next step in
Fig. 4.5 is feature extraction, where the signal is split into windows of roughly 10
msec length and for each window, a feature vector is computed that typically rep-
resents the windowed signal in the frequency domain. Then, in recognition mode,
for each vector of the resulting feature vector sequence, state conditional probabili-
ties are computed, basically by inserting the feature vector x(k) into the right hand
side of Eqn. 4.1 for each state which will be considered in the decoding procedure.
According to Fig. 4.5, the gray-shaded table labeled as ”acoustic phoneme models”
contains the parameters of the distribution functions that are used to compute these
state conditional probabilities. This computation is integrated into the already men-
tioned search procedure, that attempts to find the best possible string of concatenated
HMM phoneme models that maximize the emission probability of the feature vector
sequence. This search procedure is controlled by the gray-shaded table containing
the phonological word models (i.e. how a word is composed of phonemes) and the
table containing the language model that delivers probabilities for examining the next
likely word if the search procedure has reached the final HMM-state of a previous
word model. In this way, the above mentioned computation of state conditional prob-
abilities does not need to be carried out for all possible states, but only for the states
that are considered to be likely by the search procedure. The result of the search pro-
cedure is the most probable word model sequence that is displayed as transcription
representing the recognized sentence to the user of the speech recognition system.
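
The search over concatenated HMM states mentioned above can be illustrated by a bare-bones Viterbi recursion. The sketch below is a simplified stand-in rather than the decoder of any particular system: it assumes that the state-conditional (emission) log-probabilities for every frame have already been computed from the acoustic models, uses invented toy numbers, and omits the interleaving of phonological word models and language model scores at word boundaries that a real decoder performs.

import math

def viterbi(log_emission, log_transition, log_initial):
    """Most likely HMM state sequence for a sequence of feature vectors.

    log_emission[t][s]   : log p(x(t) | state s), from the acoustic models (Eqn. 4.1)
    log_transition[p][s] : log p(state s | state p)
    log_initial[s]       : log p(state s at t = 0)
    """
    n_frames, n_states = len(log_emission), len(log_initial)
    score = [log_initial[s] + log_emission[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, n_frames):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_transition[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev]
                             + log_transition[best_prev][s]
                             + log_emission[t][s])
        backpointers.append(pointers)
        score = new_score
    # backtracking from the best final state yields the state path
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return list(reversed(path)), max(score)

# toy example: 2 states, 3 frames, log-probabilities chosen arbitrarily
log = math.log
emis = [[log(0.7), log(0.3)], [log(0.4), log(0.6)], [log(0.2), log(0.8)]]
trans = [[log(0.8), log(0.2)], [log(0.3), log(0.7)]]
init = [log(0.6), log(0.4)]
print(viterbi(emis, trans, init))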
This brief description of the basic functionality of HMM-based speech recognition
can of course not cover this complicated subject in sufficient detail, and it is
therefore not surprising that many interesting sub-topics in ASR have not been covered
in this introduction. These include e.g. the area of discriminative training techniques
for HMMs, different HMM architectures such as discrete, hybrid and tied-mixture
approaches, the field of context-dependent modeling where acoustic models are cre-
ated that model the coarticulation effects of phonemes in the context of neighboring
phonemes, as well as clustering techniques that are required for representing the
acoustic parameters of these extended models. Other examples include the use of
HMM multi-stream techniques to handle different acoustic feature streams or the
entire area of efficient decoding techniques, e.g. the inclusion of higher level seman-
tics in decoding, fast search techniques, efficient dictionary structures or different
decoder architectures such as e.g. stack decoders. The interested reader is advised
to study the available literature on these topics and the large number of conference
papers describing those approaches in more detail.

4.1.6 Speech Recognition Technology

As already mentioned, the HMM-technology has become the major technique for
Automatic Speech Recognition and has nowadays reached a level of maturity that
has led to the fact that this is not only the dominating technology in laboratory sys-
tems but in commercial systems as well. The HMM technology is also so flexible
that it can be deployed for almost every specialization in ASR. Basically,
one can distinguish the following different technology lines: Small-vocabulary ASR
systems with 10-50 word vocabulary, in speaker-independent mode, mainly used for
telephone applications. Here, the capability of HMMs to capture the statistics of large
training corpora obtained from many different speakers is exploited. The second line
is represented by speaker-independent systems with medium size vocabulary, often
used in automotive or multimedia application environments. Here, noise reduction
technologies are often combined with the HMM framework, and the efficiency of
HMMs for decoding entire sequences of phonemes and words is exploited for the
recognition of continuously spoken utterances in adverse environments. The last ma-
jor line are dictation systems with very large vocabulary (up to 100,000 words) which
often operate in speaker-dependent and/or speaker-adaptive mode. In this case, the
special capabilities of HMMs are in the area of efficient decoding techniques for
searching very large trellis spaces, and especially in the field of context-dependent
acoustic modeling, where coarticulation effects in continuous speech can be effi-
ciently modeled by so-called triphones and clustering techniques for their acoustic
parameters. One of the most recent trends is the development of so-called embedded
systems, where medium to large vocabulary size ASR systems are implemented
on systems such as mobile phones, thin clients or other electronic devices. This has
become possible with the availability of appropriate memory cards with sufficiently
large storage capacity, so that the acoustic parameters and especially the memory-
intensive language model with millions of word sequence probabilities can be stored
directly on such devices. Due to these mentioned memory problems, another recent
trend is so-called Distributed Speech Recognition (DSR), where only feature extrac-
tion is computed on the local device and the features are then transferred by wireless
transmission to a large server where all remaining recognition steps are carried out,
i.e. computation of emission probabilities and the decoding into the recognized word
sequence. For these mentioned steps, an arbitrarily large server can be employed,
with sufficient computation power and memory for large language models and a
large number of Gaussian parameters for acoustic modeling.

4.1.7 Applications of ASR Systems

It is not surprising that the above mentioned algorithms and available technologies
have led to a large variety of interesting applications. Although ASR technology
is far from perfect, the currently achievable performance is in many
cases satisfactory enough to create novel ideas for application scenarios
or revisit already established application areas with now improved technology. Al-
though current speech recognition applications are manifold, a certain structure can
be established by identifying several major application areas as follows:
Telecommunications: This is probably still the most important application area,
since telecommunications is very much concerned with telephony applications that
naturally involve speech technology. Here, speech recognition forms a natural bridge
between telecommunications and information technology by providing a natural
interface for entering data via a communication channel into information systems.
Thus, most application scenarios are in the area of speech recognition involving
the telephone. Prominent scenarios include inquiry systems, where
the user will inquire information by calling an automated system, e.g. for banking
information or querying train and flight schedules. Dialing assistance, such as speak-
ing the telephone number instead of dialing or typing it belongs to this application
area. More advanced applications include telephony interpretation, i.e. the automatic
translation of phone calls between partners from different countries, and all tech-
niques involving mobile communications, such as embedded speech recognition on
mobile clients and distributed speech recognition.
Office Automation: Similarly to telecommunications, office automation has been
a traditional application area of ASR for several decades and has been one of the
driving forces for very large vocabulary ASR. This area has been revived by the
latest emerging commercial large vocabulary speech recognition systems that re-
ally provided the required performance in terms of vocabulary size and appropriate
recognition performance. Typical concrete applications in that area include the clas-
sical dictation task where a secretary creates a letter directly via voice input and other
scenarios, such as e.g. the use of ASR in CAD applications or the direct command
input for PCs.
Medical Applications: This area also has quite a long tradition and has been a
favorite experimental scenario for ASR for at least 20 years. The most prominent
field has been radiology, where the major idea has been to create a medical report
directly from the visual analysis of an x-ray image by dictating directly into a micro-
phone connected to an ASR system. Other scenarios include the control of micro-
scopes via voice and another major application area of speech technology in general,
namely the area of handicapped users, where the malfunction of hands or arms can
be compensated by voice control of devices and interfaces. The size of this potential
user group and the variety of different applications for handicapped users make this
one of the most important ASR application scenarios.
Production and Manufacturing: This area may be less popular than the previ-
ously mentioned application areas, but has been also investigated for a long time as
potentially very interesting application with a large industrial impact. Popular appli-
cations include data distribution via spoken ID codes, programming of NC machines
via voice, or spoken commands for the control of large plants, such as power or
chemical plants.
Multimedia Applications: Naturally, the rise of multimedia technology has also
increased the demand for speech-based interfaces, as one major modality of multi-
modal interfaces. Most of those systems are still in the experimental phase, but some
industrial applications are already underway, such as e.g. in smart environments or
information kiosks, that require e.g. user input via voice and pointing. Another inter-
esting area in this field are voice-enabled web applications, where speech recognition
is used for the access to web documents.
Private Sector: This field contains several very popular speech application fields
with huge potential, such as the automotive sector, electronic devices and games.
Without any doubt, those represent some of the most important application fields,
where consumers are ready to make some extra investment in order to add more
functionality to their user interface, e.g. in case of luxury automobiles or expensive
specialized electronic devices.
This overview demonstrates the huge potential of speech recognition technology
for a large variety of interesting applications that can be well subdivided into the
above mentioned major application areas which will represent also in the future the
most relevant domains for even further improved ASR systems to come.
4.2 Speech Dialogs

4.2.1 Introduction

It is strongly believed that, following command line interfaces being popular in the
years 1960-80 and graphical user interfaces in the years 1980-2000, the future lies
in speech and multimodal user interfaces. A lot of factors thereby clearly speak for
speech as interaction form:
• Speech is the most natural communication form between humans.
• Only limited space is required: a microphone and a loudspeaker can be placed
even in wearable devices. However, this does not respect computational effort.
• Hands and eyes are left free, which makes speech interaction the number one
modality in many controlling situations as driving a car.
• Approximately 1.3 billion telephones exist worldwide, which is more than five
times the number of computers connected to the Internet at the time.
This provides a big market for automatic dialog systems in the future.

Using speech as an input and output form leads us to dialog systems: in general these
are, in hierarchical order of complexity, systems that allow for control, e.g. of functions
in the car, information retrieval such as flight data, structured transactions by voice,
for example stock control, combined tasks of information retrieval and transactions
such as booking a hotel according to flight availability, and finally complex tasks such
as the care of elderly persons. More specifically, a dialog may be defined as an exchange
of different aspects in a reciprocal conversation between at least two instances, which
may either be human or machine, with at least one change of the speaker.
Within this chapter we focus on spoken dialog, still it may also exist in other
forms as textual or combined manners. More concretely we will deal with so called
Spoken Language Dialog Systems, abbreviated SLDS, which in general are a combi-
nation of speech recognition, natural language understanding, dialog act generation,
and speech synthesis. While it is not exactly defined which parts are mandatory for
a SLDS, this contributes to their interdisciplinary nature uniting the fields of speech
processing, dialog design, usability engineering and process analysis within the tar-
get area.
As an enhancement of Question and Answer (Q&A) systems, which allow natural
language access to data by direct answers without reference to past context (e.g. by
pronominal allusions), dialog systems also allow for anaphoric expressions. This
makes them significantly more natural as anaphora are frequently used as linguistic
elements in human communication. Furthermore SLDS may also take over initiative.
What finally distinguishes them from sheer Q&A and Command and Control (C&C)
systems is that user modeling may be included. However,
both Q&A and SLDS already provide an important improvement over today’s pre-
dominant systems where users are asked to adhere to a given complex and artificial
syntax. The following figure gives an overview of a typical circular pipeline architecture
of a SLDS [23]. At the top the user can be found, who controls an application
shown at the bottom by use of a SLDS as interaction medium. While the components
are mostly the same, other architectures exist, such as an organization around a central
process [15]. In detail, the single components are (a toy pipeline sketch follows at the
end of this subsection):

Fig. 4.6. Overview of a SLDS.

• Automatic Speech Recognition (ASR): Spoken input analysis leading to hypotheses
of linguistic units such as phonemes or words, often with some form of confidence
measurement of the certainty of these (see Sect. 4.1). The output is mostly provided
in so-called lattices resembling tree structures, or in n-best lists. Key factors of
an ASR module or engine, as it is mostly referred to, are speaker (in)dependence,
the vocabulary size of known words, and its general robustness. On the acoustic
level also a user's underlying affect may be analyzed in order to include emotional
aspects (see Sect. 4.4) [48].
• Natural Language Understanding (NLU): Interpretation of the intention or mean-
ing of the spoken content. Here again several hypotheses may be forwarded to
the dialog management combined with certainty.
• Dialog Management (DM): The DM is arguably the central module of a voice
interface as it functions as an intermediate agent between user and application
and is responsible for the interaction between them. In general it operates on
an intention representation provided by the NLU which models what the user
(presumably) said. On the basis of this information, the DM has several options
as to change the state of an underlying application in the case of voice control or
retrieve a piece of data from a database of interest in the case of an information
service. Furthermore the DM decides when and which type of system voice
output is performed. Briefly summarized, the DM's primary tasks are storage
and analysis of context and dialog history, flow control e.g. for active initiative or
barge-in handling, direction of the course of a conversation, answer production
in an abstract way, and database access or application control.
• Database (DB): Storage of information in respect of dialog content.
• Natural Language Generation (NLG): Formulation of the abstract answer pro-
vided by the DM. A variety of approaches exists for this task reaching from
probabilistic approaches with grammatical post-processing to pre-formulated ut-
terances.
• Speech Synthesis (TTS): Audio production for the naturally formulated system
answer. Such modules are in general called Text-to-Speech engines. The two major
types are, first, formant synthesizers, which produce genuinely artificial audio by
modeling the vocal tract with formants, and, second, concatenative synthesis.
In the latter, audio clips of diverse lengths, ranging from phonemes to whole
words of recorded speech, are concatenated to produce new words or sentences.
At the moment these synthesizers tend to sound more natural, depending on the
type of modeling and post-processing such as pitch and loudness correction. More
sophisticated approaches use diphones or triphones to model phonemes in the context
of their neighboring ones. Recently, prosodic cues such as emotional speech have
furthermore gained interest in the field of synthesis. However, the most natural form
still remains prerecorded speech, at the cost of less flexibility and higher effort for
recording and storage.
Still, as it is disputed which parts besides the DM belong to a strict definition of a
SLDS, we will focus on the dialog manager in the following.
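
To make the division of labor between these components more tangible, the toy pipeline below wires up crude stand-ins for ASR, NLU, DM, database access, NLG and TTS in the order of Fig. 4.6. Every function body is an invented placeholder (keyword spotting instead of real recognition and parsing); only the control flow reflects the architecture described above.

def asr(audio):
    """Stand-in recognizer: pretend the audio was already transcribed."""
    return audio  # in reality: a lattice or n-best list of word hypotheses

def nlu(utterance):
    """Very crude intent extraction by keyword spotting."""
    if "weather" in utterance:
        return {"intent": "get_weather", "city": utterance.split()[-1]}
    return {"intent": "unknown"}

def dialog_manager(intent, database):
    """Decide on the next system action based on the interpreted intent."""
    if intent["intent"] == "get_weather":
        forecast = database.get(intent["city"], "no data")
        return {"act": "inform", "content": forecast}
    return {"act": "clarify", "content": None}

def nlg(system_act):
    """Turn the abstract system act into a natural language answer."""
    if system_act["act"] == "inform":
        return f"The forecast is: {system_act['content']}."
    return "Sorry, could you rephrase that?"

def tts(text):
    print(text)  # stand-in for speech synthesis

database = {"aachen": "rain", "berlin": "sunny"}
user_turn = "what is the weather in aachen"
tts(nlg(dialog_manager(nlu(asr(user_turn)), database)))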

4.2.2 Initiative Strategies

Before getting into dialog modeling, we want to make a classification in view of the
initiative:
• system-driven: the system keeps the initiative throughout the whole dialog
• user-driven: the user keeps the initiative
• mixed initiative: the initiative changes throughout the dialog
Usually, the kind of application, and thus the kind of voice interface, codeter-
mines this general dialog strategy. For example systems with limited vocabulary
tend to employ rigid, system initiative dialogs. The system thereby asks very spe-
cific questions, and the user can do nothing else but answer them. Such a strategy is
required due to the highly limited set of inputs the system can cope with. C&C ap-
plications tend to have rigid, user initiative dialogs: the system has to wait for input
from the user before it can do anything. However, the ideal of many researchers and
developers is a natural, mixed initiative dialog: both counterparts have the possibility
of taking the initiative when this is opportune given the current state of the dialog,
and the user can converse with the system as (s)he would with another human. This is
difficult to obtain in general, for at least two reasons: Firstly, it is technically demand-
ing, as the user should have the freedom to basically say anything at any moment,
which is a severe complication from a speech recognition and language processing
point of view. Secondly, apart from such technical hurdles, it is also difficult from a
dialog point of view, as initiative has to be tracked and reactions have to be flexible.

4.2.3 Models of Dialog

Within the DM an incorporated dialog model is responsible for the structure of the
communication. We want to introduce the most important of these models in the
following. They can be mainly divided into structural models, which will be introduced
first, and non-structural ones [23, 31]. In structural models a more or less predefined
path is given. While these are quite practicable at first sight, their methodology is not
very principled, and the quality of dialogs based on them is arguable. Trying to
overcome this, the non-structural approaches rely rather on general
principles of dialog.
4.2.3.1 Finite State Model
In the very basic graph-based, or Finite State Automaton (FSA), model the dialog
structure is represented as a sequence of a limited number of predetermined steps or
states in the form of a state transition graph that models all legal dialogs. This graph's
nodes represent system questions, system outputs or system actions, while its edges
represent all possible paths through the network and are labeled with the according
user expressions. As all paths in a deterministic finite automaton are fixed, one cannot
speak of explicit dialog control. The dialog therefore tends to be rigid, and all reactions
to each user input are predetermined. We can denote such a model as the quintuple
{S, s0 , sf , A, τ }. Here S represents the set of states besides s0 , the initial state, and
sf , the final state, and A a set of actions including an empty action ε. τ , finally, is a
transition function with τ : S × A → S. By τ it is specified to which state an action
leads, given the actual state. Likewise a dialog is defined as a path from s0 to sf
within the space of possible states.
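
Such a deterministic FSA dialog model can be written down directly as the quintuple {S, s0 , sf , A, τ }. The toy flight-booking example below is invented purely for illustration; it simply follows τ from s0 and treats the visited state sequence as the dialog.

# A toy dialog FSA: states, initial/final state, actions, and transition function tau
S = {"s0", "ask_city", "ask_date", "confirm", "sf"}
s0, sf = "s0", "sf"
A = {"start", "give_city", "give_date", "yes", "no"}
tau = {
    ("s0", "start"): "ask_city",
    ("ask_city", "give_city"): "ask_date",
    ("ask_date", "give_date"): "confirm",
    ("confirm", "yes"): "sf",
    ("confirm", "no"): "ask_city",   # start over on rejection
}

def run_dialog(actions):
    """Follow tau from s0; a dialog is the path from s0 to sf."""
    state, path = s0, [s0]
    for action in actions:
        state = tau[(state, action)]
        path.append(state)
    assert state == sf, "dialog did not reach the final state"
    return path

print(run_dialog(["start", "give_city", "give_date", "yes"]))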
Generally, deterministic FSA are the most natural kind to represent system-driven
dialogs. Furthermore they are well suitable for completely predictable information
exchange due to the fact that the entire dialog model is defined at the time of the
development. Often it is possible to represent the dialog flow graphically, which
makes for a very natural development process.
However, there is no natural order of the states, as each one can be followed
by any other, resulting in a combinatorial explosion. Furthermore the model is
unsuited for different abstraction levels of the exchanged information, and for complex
dependencies between information units. Systems based on FSA models are also
inflexible, as they are completely system-led, and no path deviations are possible. The
user’s input is restricted to single words or phrases that provide responses to carefully
designed system prompts. Tasks, which require negotiations, cannot be implemented
with deterministic automata due to the uncertainty of the result. Grounding and re-
pair is very rigid and must be applied after each turn. Later corrections by the user are
hardly possible, and the context of the statements cannot be used. Summed up, such
systems are appropriate for simple tasks with flat menu structure and short option
lists.
4.2.3.2 Slot Filling
In so-called frame-based systems the user is asked questions that enable the system to
fill slots in a template in order to perform a task [41]. These systems are more or less
today's standard for database retrieval systems such as flight or cinema information.
In contrast to automaton techniques the dialog flow is not predetermined but depends
on the content of the user's input and the information that the system has to elicit.
However, if the user provides more input than is requested at a time, the system can
accept this information and check if any additional item is still required. A necessary
component therefore is a frame that keeps track of the information fragments already
expressed.
If multiple slots are yet to be filled, the question of the optimal order of system
questions remains. The idea is to constrain the originally large set of a-priori possible
actions in order to speed up the dialog. Likewise, one reasonable approach to establish
a hierarchy of slots is to ask for the attribute with the highest information content,
based on Shannon's entropy measure, as proposed in [24]. Such an item with high
information might e.g. be the one leaving most alternatives open. Let therefore ck
be a random variable over the items within the set C, and vj a value of attribute a.
The attribute a that minimizes the following entropy measure H(c|a) will accordingly
be selected:
\[
H(c \mid a) = - \sum_{v_j \in a} \sum_{c_k \in C} P(c_k, a = v_j) \, \log_2 P(c_k \mid a = v_j) \tag{4.6}
\]

Thereby the probability P(ck , a = vj ) is set to 0 unless ck matches the revised
partial description; in this case it equals P(ck ). The still missing probability
P(ck |a = vj ) is calculated by the following equation, where C' is the set of items
that match the revised description including a = vj , and P(ck ) can be approximated
based on the assumption that all items are equally likely:

\[
P(c_k \mid a = v_j) = \frac{P(c_k)}{\sum_{c_j \in C'} P(c_j)}, \qquad P(c_k) = \frac{1}{|C|} \tag{4.7}
\]
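
A minimal sketch of this entropy-driven slot selection is given below. It assumes a small table of database items that still match the partial description; the attribute that minimizes H(c|a) according to Eqns. 4.6 and 4.7 (with equally likely items) is asked next. The item and attribute names are invented for illustration.

import math

def conditional_entropy(items, attribute):
    """H(c|a) over the items still matching the partial description (Eqns. 4.6, 4.7)."""
    p_item = 1.0 / len(items)            # P(c_k) = 1/|C|, items assumed equally likely
    h = 0.0
    for value in {item[attribute] for item in items}:
        matching = [item for item in items if item[attribute] == value]
        p_cond = 1.0 / len(matching)     # P(c_k | a=v_j) = P(c_k) / sum over C' of P(c_j)
        # P(c_k, a=v_j) equals P(c_k) for matching items and 0 otherwise
        h -= len(matching) * p_item * math.log2(p_cond)
    return h

def next_attribute(items, open_attributes):
    """Select the unfilled slot whose answer minimizes the expected remaining entropy."""
    return min(open_attributes, key=lambda a: conditional_entropy(items, a))

flights = [
    {"destination": "Rome",   "airline": "LH", "class": "economy"},
    {"destination": "Rome",   "airline": "LH", "class": "business"},
    {"destination": "Paris",  "airline": "AF", "class": "economy"},
    {"destination": "Madrid", "airline": "IB", "class": "economy"},
]
print(next_attribute(flights, ["destination", "airline", "class"]))  # -> "destination"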

An advantage of dialog control with frames is higher flexibility for the user and
multiple slot filling. The duration of dialogs is shorter, and a mixed initiative control
is possible. However, an extended recognition grammar is required for more flexible
user expressions, and the dialog control algorithm must specify the next system ac-
tions on basis of the available frames. The context that decides on the next action (last
user input, status of the slots, simple priority ranking) is limited, and the knowledge
level of the user, negotiating or collaborative planning cannot – or only be modeled
to a very limited degree. No handling for communication problems exists, and the
form of system questions is not specified. Still, there is an extension [53], whereby
complex questions are asked if communication works well. In case of problems a
system can switch to lower-level questions splitting up the high-level ones. Finally,
an overview of the system behavior is difficult to obtain due to the partially complex
rules: which rule fires when?
4.2.3.3 Stochastic Model
In order to find a less hand-crafted approach that relies mainly on a designer's
input, data-driven methods successfully applied in machine learning have recently
been used in the field of dialog modeling. They aim at overcoming the shortcomings
of low portability to new domains and the limited predictability of potential problems
within a design process. As the aim is to optimize a dialog, let us first define a
cost function C. Thereby the numbers Nt of turns, Ne of errors, and Nm of missing
values are summed up after applying individual weights wt , we , and wm :

\[
C = w_t(N_t) + w_e(N_e) + w_m(N_m) \tag{4.8}
\]


Now if we want to find an ideal solution minimizing this target function, we
face |A|^|S| possible strategies. A variety of suitable approaches for a machine-based
solution to this problem exists [57]. However, we choose the predominant Markov
Decision Processes (MDP) as a probabilistic extension of the introduced FSA,
following [28] herein. MDP differ from FSA in that they introduce transition probabilities
instead of the function τ . We therefore denote the state at time t as st , and the action
as at . Given st and at we change to state st+1 with the conditional probability
P (st+1 |st , at ). Thereby we respect only events one time step back, known as the
limited horizon Markov property. Next, we combine this with the introduced costs,
whereby ct shall be the cost incurred if in state st the action at is performed, with the
probability P (ct |st , at ). The cost of a complete dialog can likewise be denoted as:


\[
C = \sum_{t=0}^{t_f} c_t \tag{4.9}
\]

where tf denotes the instant when the final state sf is reached. Consequently, the
best action to take in a state s is the one that minimizes the sum of the incurred cost
and the expected cost of the succeeding state, assuming the best action is taken there
as well; the resulting optimal cost V*(s) satisfies:

\[
V^*(s) = \min_{a} \left[ c(s,a) + \sum_{s'} P(s' \mid s, a) \, V^*(s') \right] \tag{4.10}
\]
Given all model parameters and a finite state space, the unique function V*(s)
can be computed by value iteration techniques. Finally, the optimal strategy ψ*
corresponds to the chain of actions that minimizes the overall costs. In our case of a SLDS
the parameters are not known in advance, but have to be learned from data. Normally
such data might be acquired by test users operating on a simulated system known as
Wizard-of-Oz experiments [33]. However, as loads of data are needed, and there is no
data like more data, the possibility of simulating a user by another stochastic process
as a solution to sparse data handling exists [27]. Based on annotated dialogs with
real users, probabilities are derived with which a simulated user responds to
the system. Next, reinforcement learning, e.g. Monte Carlo with exploring starts, is used
to obtain an optimal state-action value function Q*(s, a). It corresponds to the cost to be
expected for a dialog starting in state s, moving to s' by action a, and continuing optimally
to the end:

\[
Q^*(s,a) = c(s,a) + \sum_{s'} P(s' \mid s, a) \, \min_{a'} Q^*(s', a') \tag{4.11}
\]
Note that V*(s) = min_a Q*(s, a). Starting from an arbitrary initialization of Q*(s, a),
the algorithm iteratively finds the costs for a dialog session. In [27] it is shown that it
converges to an optimal solution, and that it is expensive to immediately shortcut a
dialog by "bye!".
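
The Bellman-style recursion of Eqns. 4.10 and 4.11 can be sketched in a few lines of model-based value iteration. The toy dialog MDP below (states, actions, costs and transition probabilities are all invented) is solved by repeatedly applying the minimization; the Monte Carlo procedure described in the text estimates the same quantities from simulated dialogs instead of from a known model.

def value_iteration(states, actions, cost, trans, final_state, n_iter=100):
    """Iteratively compute V*(s) = min_a [ c(s,a) + sum_s' P(s'|s,a) V*(s') ]."""
    V = {s: 0.0 for s in states}
    for _ in range(n_iter):
        for s in states:
            if s == final_state:
                continue  # the final state incurs no further cost
            V[s] = min(
                cost[(s, a)] + sum(p * V[s2] for s2, p in trans[(s, a)].items())
                for a in actions
            )
    return V

states = ["greet", "ask_slot", "done"]
actions = ["ask", "confirm"]
cost = {("greet", "ask"): 1.0, ("greet", "confirm"): 2.0,
        ("ask_slot", "ask"): 1.0, ("ask_slot", "confirm"): 1.5}
trans = {("greet", "ask"): {"ask_slot": 1.0},
         ("greet", "confirm"): {"done": 0.6, "greet": 0.4},
         ("ask_slot", "ask"): {"ask_slot": 0.3, "done": 0.7},
         ("ask_slot", "confirm"): {"done": 1.0}}
print(value_iteration(states, actions, cost, trans, final_state="done"))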
The goals set at the beginning of this subsection can be accomplished by stochastic
dialog modeling. The need for data seems to be a drawback, but often quite sparse
data suffices, as even low accuracies of the initial transition probabilities may already
lead to satisfying results.
Still, the state space is hand-crafted, which strongly influences the learning process.
Ideally, the state space itself should be learned as well. Furthermore the determination
of the cost function is not trivial, especially assigning the right weights, while it also
strongly influences the result. In the suggested cost function no account is taken of
subjective costs such as user satisfaction.
4.2.3.4 Goal Directed Processing
The more or less finite state and action based approaches with predefined paths in-
troduced so far are well suited for C&C and information retrieval, but less suited
for task-oriented dialogs where the user and the system have to cooperate to solve
a problem. Consider therefore a plan-based dialog with a user that has only sparse
knowledge about the problem to solve, and an expert system. The actual task thereby
mostly consists of sub-tasks, and influences the structure of the dialog [16]. A system
has now to be able to reason about the problem at hand, the application, and the user
to solve the task [50]. This leads to a theorem prover that derives conclusions by the
laws of logic from a set of axioms considered true, contained in a knowledge base.
The idea is to prove whether a subgoal or goal, as defined by a theorem, has been
accomplished yet. User modeling is thereby also done in the form of stored axioms
that describe the user's competence and knowledge. Whenever an axiom is missing,
interaction with the user becomes necessary, which calls for an interruptible theorem
prover that is able to initiate a user-directed question. This is known as the missing
axiom theory [50].
As soon as a subgoal is fulfilled, the selection of a new one can be made accord-
ing to the highest probability of success given the actual dialog state. However, the
prover should be flexible enough to switch to different sub-goals, if the user actively
provides new information. This process is iteratively repeated until the main goal is
reached.
In the framework of Artificial Intelligence (AI) this can be described by the Be-
liefs, Desires, and Intentions (BDI) model [23], as shown in the following figure 4.7.
This model can be applied to conversational agents that have beliefs about the cur-
rent state of the world, and desires for how they want it to be. Based on these they
determine their intention, respectively goal, and build a plan consisting of actions to
satisfy their desires. Note that utterances are thereby treated as (speech) actions.

Fig. 4.7. Beliefs, Desires, and Intentions Model.

To conclude, the advantages of goal directed processing are the ability to also
deal with task-oriented communication and informational dialogs, a more principled
approach to dialog based on a general theory of communication, and less domain-
dependency at least for the general model.
Yet, by applying AI methods, several problems are inherited: high computational
cost, the frame problem, i.e. specifying which parts of the world are not influenced
by actions, and the lack of proper formalizations of belief and desire and of
independently motivated rationality principles for agent behavior.
4.2.3.5 Rational Conversational Agents
Rational Conversational Agents directly aim at imitating natural conversation behavior
by trying to overcome the lack of human-like rationality of plan-based approaches.
The latter are therefore reformulated in a formal framework establishing an intelligent
system that relies on a more general competence basis [43]. The beliefs, desires and
intentions are logically formulated, and a rational unit that decides upon the actions
to be taken is constructed based thereon.
Let us denote an agent i believing in proposition p as B(i, p), and exemplary
propositions as ϕ, and ψ. We can next construct a set of logical rules known as
NKD45 or Weak S5 that apply for the B operator:

(N) : always ϕ → always B(i, ϕ) (4.12)
(K) : B(i, ϕ) ∧ B(i, ϕ → ψ) → B(i, ψ) (4.13)
(D) : B(i, ϕ) → ¬B(i, ¬ϕ) (4.14)
(4) : B(i, ϕ) → B(i, B(i, ϕ)) (4.15)
(5) : ¬B(i, ϕ) → B(i, ¬B(i, ϕ)) (4.16)
Let us furthermore denote desires, respectively goals, as G(i, p). Likewise an
agent indexed i has the desire that p comes true. As an exercise, one can reflect why
only (K) and (4) apply for goals:

(K) : G(i, ϕ) ∧ G(i, ϕ → ψ) → G(i, ψ) (4.17)
(4) : G(i, ϕ) → G(i, B(i, ϕ)) (4.18)
We next connect goals and beliefs by a realism constraint which states that an
agent cannot have a goal he believes to be false:

B(i, ϕ) → G(i, ϕ) (4.19)


Expected consequences of a goal furthermore also have to be a goal of an agent,
which is known as the expected consequences constraint:

G(i, ϕ) ∧ B(i, ϕ → ψ) → G(i, ψ) (4.20)


Finally, let us introduce the persistent goal constraint with the persistent goal
P G(i, ϕ): an agent i should not give up a goal ϕ that is to be true in the future (1)
and that he believes not to be true presently (2), until either its fulfillment or its
necessary unattainability (3).
(1) : P G(i, ϕ) if and only if G(i, (Future ϕ))∧ (4.21)
(2) : B(i, ¬ϕ)∧ (4.22)
(3) : [Before ((B(i, ϕ) ∨ B(i, (Necessary ¬ϕ)))¬G(i, (Future ϕ)))] (4.23)
The rational unit of a SLDS fed with the formalized notions of belief, desires and
intentions has now a selection of communication actions to choose from that are
associated with feasibility preconditions and rational effects [43]. Such an action is
e.g. that an agent i informs an agent j about ϕ, whereby the user is also understood
as an agent. Now, if an agent has the intention to achieve a goal, it selects an ap-
propriate action and thereby inherits the intention to fulfill according preconditions.
This approach defines a planning algorithm that demands a theorem prover as in the
previous section. However, user-directed questions do not directly concern missing
axioms; rather, it is only checked whether preconditions are fulfilled and whether the
effects of system actions help with respect to the current goal.
To sum up, the characteristics of rational conversational agents are full mixed
initiative and the ability to implement theoretically complex systems which solve
dynamic and (only) cooperative complex tasks. They are also much more oriented
towards linguistic theory.
However, there is no general definition of the agent term beyond the requirement
that agents should be reactive, autonomous, social, rational and anthropomorphic.
Furthermore, far more resources are needed both quantitatively (computer speed,
programming expenditure) and qualitatively (complexity of the problems which can
be solved). Also, the formalization of all task-relevant domain knowledge is not
trivial. So far only academic systems have been realized.
4.2.4 Dialog Design


Let us now turn our attention to the design of a dialog, as besides the recognition
performance of the ASR unit a lot of factors have significant influence on the ultimate
design goals of naturalness, efficiency, and effectiveness. As we have learned so far,
initiative can e.g. be chosen mixed or one-sided. Furthermore confirmations can be
given or spared, suggestions can be made in case of failure, etc. In the following, the
three main steps in the design of a dialog will be outlined:
• Script writing: In this part, also known as call flow layout, the interaction between
user and system is laid out step by step. Focus should be given to the naturalness
of the dialog, which is also significantly influenced by the quality of the dialog
flow.
• Prompt Design: Similar to a prompt in console-based interfaces, a sound or
announcement of the system signals the user when and, depending on the situation,
what to speak. This acoustic prompt has to be well designed in order to be
informative and well heard, but not disturbing. A frequently used prompt should
therefore be kept short. On the other hand, in the case of error handling, tutorial
information or help provision, prompts may be chosen more complex, as the quality
of a user's answer depends strongly on the quality of the prompt. Finally, only by
appropriate prompt crafting can the initiative throughout a dialog be controlled
effectively.
• Grammar Writing: Within the grammar the possible user statements given a
dialog state are defined. A compromise between coverage and recognition accuracy
has to be found, as too broad a coverage often leads to decreased performance
due to a large space of possible hypotheses. Also, phrases of multiple
words should be handled. This is often realized by finite state grammars and
triggering by keywords (a toy example follows after this list).
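
As a toy illustration of the grammar writing step, the sketch below defines a small keyword-triggered grammar for a single dialog state and extracts slot values from a free utterance. The slot names, keywords and the regular-expression formalism are invented for illustration; real systems typically use dedicated finite-state or context-free grammar formats.

import re

# A small per-state grammar: each slot is triggered by alternative keywords/phrases
STATE_GRAMMAR = {
    "destination": re.compile(r"\b(?:to|towards)\s+(rome|paris|madrid)\b"),
    "date":        re.compile(r"\b(today|tomorrow|monday|friday)\b"),
}

def parse_utterance(utterance, grammar=STATE_GRAMMAR):
    """Return the slots that can be filled from a free user utterance."""
    slots = {}
    text = utterance.lower()
    for slot, pattern in grammar.items():
        match = pattern.search(text)
        if match:
            slots[slot] = match.group(1)
    return slots

print(parse_utterance("I would like to fly to Paris tomorrow, please"))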
For an automatic system it seems crucial to understand the actual meaning of a
user statement, which highly depends on the context. Furthermore it is important to
design system announcements in a clear and understandable way for the user. We
therefore also want to take a brief linguistic view on dialog within this section.
Three different aspects are important considering the meaning of a verbally ut-
tered phrase in the chain from sender to receiver: Firstly, we have the locution, which
represents the semantic or literal significance of the utterance; secondly there is the
illocution, the actual intention of the speaker, and finally the perlocution stands for
how an utterance is received by the counterpart. Likewise to speak is to perform a
locution, but to speak with an intent (ask, promise, request, assert, demand, apolo-
gize, warn, etc.) is to perform an illocution. The purpose, the illocutionary intent,
is meaningful and will ordinarily be recognized by hearers. Within this context we
want to have a look at the four well-known Gricean conversational maxims:
• Maxim of relevance: be relevant. Consider hereon: ’He kicked the bucket’ (we
assume that someone died because that’s what’s relevant), or ’Do you know what
time it is?’ (we assume that the speaker wants to know the time because the ”real”
question is irrelevant).
• Maxim of quality: be truthful. E.g. ’If I hear that song again I’ll kill myself’ (we
accept this as a hyperbole and do not immediately turn the radio off), or ’The boss
has lost his marbles’ (we imagine a mental problem and not actual marbles).
• Maxim of quantity: be informative, say neither too much nor too little. (Asked for
the date, we do not include the year.)
• Maxim of manner: be clear and orderly.
In order to obtain high overall dialog quality, some aspects shall be further out-
lined: Firstly, consistency and transparency are important to enable the user to picture
a model of the system. Secondly, social competence in view of user behavior mod-
eling and providing a personality to the system seems very important. Thirdly, error
handling plays a key role, as speech recognition is prone to errors. The precondition
thereby is clarification, demanding that problems with the user input (no input,
cut-off input, word-recognition errors, wrong interpretations, etc.) are made known
to the system. In general it is said that users accept an error rate of about five percent
or less. Therefore trade-offs have to be made considering the vocabulary size and the
naturalness of the input. However, there is a chance to raise the accuracy by expert
prompt modeling and allusion to the nature of the problem, or at least to cover errors
and reduce the annoyance of the users by a careful design. Also, the danger of
over-modeling a dialog in view of world knowledge, social experience or the general
complexity of human communication shall be mentioned. Finally, the main characteristic
of a good voice interface is probably that it is usable. Usability thereby is a widely
discussed concept
in the field of interfaces, and various operationalizations have been proposed. In [33]
it is stated that usability is a multidimensional concept comprising learnability, effi-
ciency, memorability, errors and satisfaction, and ways are described, in which these
can be measured. Defining when a voice interface is usable is one thing, developing
one is quite another. It is by now received wisdom that usability design is an iterative
process which should be integrated in the general development process.

4.2.5 Scripting and Tagging


We want to conclude this section with a short introduction of the two most important
dialog mark-up languages in view of scripting and tagging: VoiceXML and DAMSL.
VoiceXML (VXML) is the W3C’s standard XML format for specifying interactive
voice dialogs between a human and a computer. VXML is fully analogous to HTML,
and just as HTML documents are interpreted by a visual web browser, VXML docu-
ments are interpreted by a voice browser. A common architecture is to deploy banks
of voice browsers attached to the public switched telephone network so that users
can simply pick up a phone to interact with voice applications. VXML has tags that
instruct the voice browser to provide speech synthesis, automatic speech recogni-
tion, dialog management, and soundfile playback. Considering dialog management,
a form in VXML consists of fields and control units: fields collect information from
the user via speech input or DTMF (dual tone multi-frequency) signals. Control units
are sequences of procedural statements, and the control of the dialog is handled by the
following form interpretation algorithm, which consists of at least one major loop with
three phases (a simplified procedural sketch follows after the list):
• Select: The first form not yet filled is selected in a top down manner in the active
VXML document that has an open guard condition.
• Collect: The selected form is visited, and the following prompt algorithm is ap-
plied to the prompts of the form. Next the input grammar of the form is activated,
and the algorithm waits for user input.
• Process: Evaluation of the input for the form's fields in accordance with the active
grammar. Filled elements are passed on, for example to input validation. The process
phase ends if no more items can be selected or a jump point is reached.
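
The select-collect-process loop can be paraphrased procedurally as follows. This is a deliberately simplified sketch and not the normative form interpretation algorithm of the VoiceXML specification: field names, prompts and the keyword "grammars" are invented, guard conditions are reduced to simple predicates, and console input stands in for speech recognition.

def form_interpretation(fields):
    """Simplified select-collect-process loop over the fields of a VXML-style form."""
    values = {name: None for name in fields}
    while True:
        # Select: first field that is not yet filled and whose guard condition is open
        pending = [n for n, f in fields.items()
                   if values[n] is None and f.get("guard", lambda v: True)(values)]
        if not pending:
            break
        name = pending[0]
        field = fields[name]
        # Collect: play the prompt, activate the grammar, wait for user input
        answer = input(field["prompt"] + " ")
        # Process: evaluate the input against the active grammar
        if answer in field["grammar"]:
            values[name] = answer
        else:
            print("Sorry, I did not understand that.")
    return values

form = {
    "city": {"prompt": "Which city do you want to fly to?",
             "grammar": {"rome", "paris", "madrid"}},
    "date": {"prompt": "On which day?",
             "grammar": {"monday", "tuesday", "friday"}},
}
# print(form_interpretation(form))  # interactive: reads the answers from the console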
Typically, HTTP is used as the transport protocol for fetching VXML pages.
While simpler applications may use static VXML pages, nearly all rely on dynamic
VXML page generation using an application server. In a well-architected web appli-
cation, the voice interface and the visual interface share the same back-end business
logic.
Tagging of dialogs for machine learning algorithms, on the other hand, is mostly
done using Dialogue Act Mark-up in Several Layers (DAMSL), an annotation
scheme for communicative acts in dialog [9]. While different dialogs analyzed
with different aims in mind will lead to diverse acts, it seems reasonable to agree on
a common basis for annotation in order to enable database enlargement by the
integration of other corpora. The scheme has three layers: Forward Communicative Functions,
Backward Communicative Functions, and Information Level. Each layer allows mul-
tiple communicative functions of an utterance to be labeled. The Forward Commu-
nicative Functions consist of a taxonomy in a similar style as the actions of tradi-
tional speech act theory. The most important thereby are statement (assert, reassert,
other statement), influencing addressee future action (open-option, directive (info-
request, action-directive)), and committing speaker future action (offer, commit). The
Backward Communicative Functions indicate how the current utterance relates to
the previous dialog: agreement (accept, accept-part, maybe, reject-part, reject, hold),
understanding (signal non-understanding, signal understanding (acknowledge, re-
peat phrase, completion), correct misspeaking), and answer. Finally, the Information
Level annotation encodes whether an utterance is occupied with the dialog task, the
communication process or meta-level discussion about the task.
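
To give a feeling for what a DAMSL-style annotation looks like in practice, the fragment below encodes a single dialog turn with the three layers as a plain data structure. The utterance and the exact label strings are illustrative choices, not taken verbatim from the official annotation manual.

# One user turn annotated on the three DAMSL layers (labels chosen for illustration)
annotated_turn = {
    "speaker": "user",
    "utterance": "Could you book the earlier flight instead?",
    "forward_function": {
        "statement": [],
        "influencing_addressee_future_action": ["action-directive"],
        "committing_speaker_future_action": [],
    },
    "backward_function": {
        "agreement": "reject-part",      # rejects part of the previously offered option
        "understanding": "acknowledge",
        "answer": False,
    },
    "information_level": "task",         # task / communication / meta-level discussion
}

def forward_functions(turn):
    """Collect all forward communicative functions marked for a turn."""
    return [label for labels in turn["forward_function"].values() for label in labels]

print(forward_functions(annotated_turn))   # -> ['action-directive']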

4.3 Multimodal Interaction

The research field of Human-Computer Interaction (HCI) focuses on making the
interaction with computers easier, safer, more effective and to a high degree seamless
for the user.
”Human-Computer Interaction is a discipline concerned with the design,
evaluation and implementation of interactive computing systems for human
use and with the study of major phenomena surrounding them.” [18]
As explained, HCI is an interdisciplinary field of research in which many different
subjects are involved to reach the long-term objective of a natural and intuitive
way of interaction with computers. In general, the term interaction describes the
mutual influence of several participants who exchange information that is transported
by most diverse means in a bilateral fashion. Therefore, a major goal of HCI is to
converge the interface increasingly towards a familiar and ordinary interpersonal way of
interaction. For natural human communication several in- and output channels are
combined in a multimodal manner.

4.3.1 In- and Output Channels

The human being is able to gather, process and express information through a number
of channels. The input channel can be described as sensor function or perception, the
processing of information as cognition, and the output channel as motor function (see
figure 4.8).

Fig. 4.8. Human information input and output channels.

Humans are equipped with six senses, respectively sense organs, to gather stimuli.
These senses are defined by physiology [14]:

• sense of sight – visual channel
• sense of hearing – auditive channel
• sense of smell – olfactory channel
• sense of taste – gustatory channel
• sense of balance – vestibular channel
• sense of touch – tactile channel
The human being is also provided with a broad range of abilities for information
output. The output channels can process a smaller amount of information than the
input channels. Outgoing information is transmitted by the auditive, the visual and
the haptic channel (tactile as a perception modality is distinguished from haptic as
an output manner). By this means, the human is enabled to communicate with others
in a simple, effective, and error-robust fashion. With an integrated and synchronized use of
different channels he can flexibly adapt to the specific abilities of the conversational
partner and the current surroundings.
However, HCI is far from information transmission in this intuitive and natural
manner. This is due to the limitations of technical systems with regard to the number
and performance of the single in- and output modalities. The term modality refers
to the type of communication channel used to convey or acquire information. One of
the core difficulties of HCI is the divergent boundary conditions between computers
and humans. Nowadays, the user can transfer his commands to the computer only by
standard input devices – e.g. mouse or keyboard. The computer feedback is carried
out via the visual or acoustic channel – e.g. monitor or speakers. Thus, today's
technology uses only a few of the interaction modalities of humans. Whenever two or
more of these modalities are involved, one speaks of multimodality.

4.3.2 Basics of Multimodal Interaction

The term ”multimodal” is derived from ”multi” (lat.: several, numerous) and ”mode”
(lat.: naming the method). The word ”modal” may cover the notion of ”modality” as
well as that of ”mode”. Scientific research focuses on two central characteristics of
multimodal systems:
• the user is able to communicate with the machine by several input and output
modes
• the different information channels can interact in a sensible fashion
According to S. Oviatt [35] multimodal interfaces combine natural input modes
- such as speech, pen, touch, manual gestures, gaze and head and body movements
- in a coordinated manner with multimedia system output. They are a new class of
interfaces which aims to recognize naturally occurring forms of human language and
behavior, and which incorporate one or more recognition-based technologies (e.g.,
speech, pen, vision). Benoit [5] expanded the definition to a system which repre-
sents and manipulates information from different human communication channels at
multiple levels of abstraction. These systems are able to automatically extract mean-
ing from multimodal raw input data and conversely produce perceivable informa-
tion from symbolic abstract representations. In 1980, Bolt’s ”Put That There” [6]
demonstration showed the new direction for computing which processed speech
in parallel with touch-pad pointing. Multimodal systems benefit from the progress
in recognition-based technologies which are capable to gather naturally occurring
forms of human language and behavior. The dominant theme in users' natural
organization of multimodal input actually is complementarity of content, which means
that each input consistently contributes different semantic information. The partial
information sequences must be fused and can only be interpreted together. However,
redundancy of information is much less common in human communication.
Sometimes different modalities can input concurrent content that has to be processed
independently. Multimodal applications range from map-based (e.g. tourist informa-
tion) and virtual reality systems, to person identification and verification systems, to
medical and web-based transaction systems. Recent systems integrate two or more
recognition-based technologies like speech and lips. A major aspect is the integra-
tion and synchronization of these multiple information streams what is discussed
intensely in Sect. 4.3.3.
4.3.2.1 Advantages
Multimodal interfaces are largely inspired by the goal of supporting more transpar-
ent, flexible, effective, efficient and robust interaction [29, 35]. The flexible use of
input modes is an important design issue. This includes the choice of the appropri-
ate modality for different types of information, the use of combined input modes,
or the alternate use between modes. A more detailed differentiation is made in Sect.
4.3.2.2. Input modalities can be selected according to context and task by the user or
system. Especially for complex tasks and environments, multimodal systems permit
the user to interact more effectively. Because there are large individual differences
in abilities and preferences, it is essential to support selection and control for diverse
user groups [1]. For this reason, multimodal interfaces are expected to be easier to
learn and use. The continuously changing demands of mobile applications enable
the user to shift between these modalities, e.g. in in-vehicle applications. Many studies
show that multimodal interfaces satisfy higher levels of user preference. The main advantage
is probably the efficiency gain that derives from the human ability to process input
modes in parallel. The human brain structure is developed to gather a certain kind of
information with specific sensor inputs. Multimodal interface design allows a supe-
rior error handling to avoid and to recover from errors which can have user-centered
and system-centered reasons. Consequently, it can function in a more robust and sta-
ble manner. In-depth information about error avoidance and graceful resolution from
errors is given in Sect. 4.3.4. A future aim is to interpret continuous input from vi-
sual, auditory, and tactile input modes for everyday systems to support intelligent
adaption to user, task and usage environment.
4.3.2.2 Taxonomy
A base taxonomy for the classification of multimodal systems was created in the
MIAMI-project [17]. As a basic principle there are three decisive degrees of freedom
for multimodal interaction:
• the degree of abstraction
• the manner (temporal) of application
• the fusion of the different modalities
Figure 4.9 shows the resulting classification space and the consequential four ba-
sic categories of multimodal applications dependent on the parameters value of data
fusion (combined/independent) and temporal usage (sequential/parallel). The exclu-
sive case is the simplest variant of a multimodal system. Such a system supports two
or more interaction channels but there is no temporal or content related connection. A
sequential application of the modalities with functional cohesion is denominated as
alternative multimodality. Besides a sequential use of the modalities there is the
possibility of parallel operation of different modalities, as seen in figure 4.9. Thereby,
Fig. 4.9. Classification space for classification of multimodal systems [44].

it is differentiated according to the manner of fusion of the interaction channels
between simultaneous and synergistic multimodality. The third degree of freedom is
the level of
abstraction. This refers to the technical level, on which the signals are processed that
ranges from simple binary sequences to highly complex semantic terms.

4.3.3 Multimodal Fusion

As described, in multimodal systems the information flow from human to computer
occurs via different modalities. To be able to utilize the information transmitted by
the different senses, it is necessary to integrate the input channels to create an appropriate
command which is equivalent to the user's intention. The input data gathered from
the single modalities is generated via single mode recognizers. There are three basic
approaches for combining the results of each recognition modules to one information
stream [36, 56]:
• Early (signal) fusion: The earliest possible fusion of the sensor data is the
combination of the sensor specific raw data. The classification of the data is
mostly achieved by Hidden-Markov-Models (HMM), temporal Neural Networks
(NN) or Dynamic Bayesian Networks (DBN). Early fusion is well suited for
temporally synchronized inputs. This approach to fusion only succeeds if the
data provided by different sources is of the same type and a strong correlation of
modalities exists. For example the fusion of the images generated with a regular
camera and an infrared camera for use in night vision systems. Furthermore, this
type of fusion is applied in speech recognition systems supported by lip-reading
technology, in which the viseme1 and phoneme progression can be registered
1
A viseme is the generic image of the face (especially the lip positioning) at the moment of
creation of a certain sound. Visemes are thus the visual counterpart to phonemes.
collectively in one HMM. A great problem of early fusion is the large amount of data
necessary for the training of the utilized HMMs.

• Late (semantic) fusion: Multimodal systems which use late fusion consist of
several single mode recognition devices as well as a downstream data fusion
device. This approach contains a separate preprocessing, feature extraction and
decision level for each separate modality. The results of the separate decision
levels are fused into a total result. For each classification process each discrete
decision level delivers a probability or confidence result for the choice of a class n.
These confidence results are afterwards fused, for example by an appropriate
linear combination (a small numeric sketch follows after this list). The advantage
of this approach is that the different recognition devices can be realized independently.
Therefore the acquisition of multimodal data sets is not necessary; the separate
recognition devices are trained with monomodal data sets. Because of this easy
integration of new recognizers, systems that use late fusion scale up more easily
than early fusion, either in the number of modalities or in the size of the command
set [56].

• Soft decision fusion: A compromise between early and late fusion is the so-called
soft decision fusion. In this method the confidence of each classifier is taken into
account, as well as an N-best list from each classifier.
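
As a small numeric illustration of late (semantic) fusion, the snippet below combines the confidence scores that two independent recognizers assign to the same candidate commands by a weighted linear combination. The commands, scores and weights are invented; a soft decision variant would carry the full N-best lists with their confidences into the subsequent decision instead of only the fused winner.

def late_fusion(confidences_per_modality, weights):
    """Combine per-modality confidence scores by a weighted linear combination."""
    commands = confidences_per_modality[0].keys()
    fused = {}
    for command in commands:
        fused[command] = sum(w * conf[command]
                             for w, conf in zip(weights, confidences_per_modality))
    return max(fused, key=fused.get), fused

speech_scores  = {"open_map": 0.70, "zoom_in": 0.20, "close": 0.10}
gesture_scores = {"open_map": 0.30, "zoom_in": 0.60, "close": 0.10}
best, fused = late_fusion([speech_scores, gesture_scores], weights=[0.6, 0.4])
print(best, fused)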

In general, multimodal systems consist of various modules (e.g., different single
mode recognizers, multimodal integration, user interface). Typically, these software
components are developed and implemented independently from each other. There-
fore, different requirements to the software architecture of a multimodal system arise.
A common infrastructure approach that has been adopted by the multimodal research
community involves multi-agent architectures, where agents are defined as any soft-
ware process. In such architectures, the many modules that are needed to support
the multimodal system, may be written in different programming languages and run
on several machines. One example for an existing multimodal framework is given
by [30]. The system architecture consists of three main processing levels: the input
level, the integration level, and the output level. The input level contains any kind
of interface that is capable of recognizing user inputs (e.g., mouse, buttons, speech
recognizer, etc.). Dedicated command mappers (CMs) encode the information bits
of the single independent modality recognizers and context sensors into a meta lan-
guage based on a context-free grammar (CFG). In the integration level, the recog-
nizer outputs and additional information of context sensors (e.g., information about
application environment, user state) are combined in a late semantic fusion process.
The output level provides the devices for adequate multimodal system feedback.
4.3.3.1 Integration Methods
For signal and semantic fusion, different integration methods exist. Some of these
methods are explained in the following:
• Unification-based Integration:
Typed-feature structure unification is an operation that verifies the consistency
of two or more representational structures and combines them into a single
result. Typed feature structures are used in natural language processing and
computational linguistics to enhance syntactic categories. They are very similar
to frames from knowledge representation systems or records from various
programming languages like C, and have been used for grammar rules, lexical
entries and meaning representation. A feature structure consists of a type,
which indicates the kind of entity it represents, and an associated collection of
feature-value or attribute-value pairs [7]. Unification-based integration allows
different modalities to mutually compensate for each other's errors [22]. Feature
structure unification can combine complementary and redundant input, but
excludes contradictory input. For this reason it is well suited to integrate, e.g.,
multimodal speech and gesture inputs.

• Statistical Integration:
Every input device produces an N-best list with recognition results and prob-
abilities. The statistical integrator produces a probability for every meaningful
combination from the different N-best lists by calculating the cross product of
the individual probabilities. The multimodal command with the best probability
is then chosen as the recognized command. A minimal sketch of this cross-product
integration is given after this list.
In a multimodal system with K input devices we assume that the $R_{i,j}$, $j = 1, \dots, N_i$,
with the probabilities $P_{i,j} = P_i(1), \dots, P_i(N_i)$, are the $N_i$ possible recognition
results from interface number $i$, $i \in \{1, \dots, K\}$. The statistical integrator
calculates the product of each combination of probabilities, yielding

$C_n, \quad n = 1, \dots, \prod_{i=1}^{K} N_i$   (4.24)

combinations with the probabilities

$P_n = \prod_{i=1}^{K} P_{i,j}, \quad j \in \{1, \dots, N_i\}$   (4.25)

The integrator chooses which combinations represent valid system commands.
The choice could be based on a semantic approach or on a database with all
meaningful combinations. Invalid results are deleted from the list, giving a new
list $C_{\mathrm{valid},n}$ with at most $\prod_{i=1}^{K} N_i$ valid combinations. The valid
combination with the maximum probability $\max[P_{\mathrm{valid},n}]$ is chosen as the
recognized command.
The results can be improved significantly if empirical data is integrated into
the statistical process. Statistical integrators are easy to scale up, because all
application knowledge and empirical data is integrated at the configuration level.
Statistical integrators work very well with supplementary information and well
with complementary information. One problem is how the system should react
when no meaningful combinations are available. Furthermore, the system has to
be programmed with all meaningful combinations to decide which ones represent
valid system commands. Another possibility to decide whether a combination
is valid or not is to use a semantic integration process after the statistical
integration. This avoids the explicit programming of all valid combinations.

• Hybrid Processing:
In hybrid architectures, symbolic unification-based techniques that integrate
feature structures are combined with statistical approaches. In contrast to
purely symbolic approaches, these architectures function very robustly. The
Associative Mapping and the Members-Teams-Committee (MTC) are two main
techniques. They develop and optimize the following factors with a statistical
approach: the mapping structure between multimodal commands and their
respective constituents, and the manner of combining posterior probabilities.
The Associative Mapping defines all semantically meaningful mapping relations
between the different input modes. It supports a process of table lookup that
excludes consideration of those feature structures that cannot possibly be
unified semantically. This table can be built by the user or automatically. The
associative mapping approach basically helps to exclude contradictory inputs
from the different recognizers and to quickly rule out impossible combinations.
Members-Teams-Committee is a hierarchical technique with multiple members,
teams and committees. Every recognizer is represented by a member. Every
member reports its results to the team leader. The team leader applies its own
function to weight the results and reports them to the committee. Finally, the
committee chooses the best team and reports the recognition result to the
system. The weightings at each level have to be trained.

• Rule-based Integration:
Rule-based approaches are well established and applied in many integration ap-
plications. They are similar to temporal approaches, but are not strictly bound to
timing constraints. The combination of different inputs is given by rules or look-
up tables. For example, in a look-up table every combination is rated with a score.
Redundant inputs are represented with high scores, contradictory information with
negative scores. This makes it possible to profit from redundant and complementary
information while excluding contradictory information.
A problem with rule-based integration is its complexity: many of the rules
have preconditions in other rules. With an increasing number of commands these
preconditions lead to exponential complexity. Rule-based integrators are highly
specialized and bound to the domains they were developed for. Moreover, appli-
cation knowledge has to be integrated into the design process to a high degree.
Thus, they are hard to scale up.

• Temporal Aspects:
Overlapping inputs, or inputs that fall within a specific period of time, are com-
bined by the temporal integrator. It checks whether two or more signals from
different input devices occur within a specific period of time. The main use of
temporal integrators is to determine whether a signal should be interpreted on its
own or in combination with other signals. Programmed with results from user
studies, the system can exploit knowledge about how users react in general.
Temporal integration consists of two parts, microtemporal and macrotemporal
integration:
Microtemporal integration is applied to combine related information from vari-
ous recognizers in parallel or in a pseudo-parallel manner. Overlapping redundant
or complementary inputs are fused and a system command or a sequence of com-
mands is generated.
Macrotemporal integration is used to combine related information from the
various recognizers in a sequential manner. Redundant or complementary inputs
which do not directly overlap each other, but fall together into one timing window,
are fused by macrotemporal integration. The macrotemporal integrator has to
be programmed with results from user studies to determine which inputs in a
specified timing window belong together.
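The statistical integration of Eqs. (4.24) and (4.25) can be sketched as follows. The command names and the database of valid combinations are hypothetical illustrations; a real system would derive them from the application knowledge mentioned above.

from itertools import product
from typing import Dict, List, Tuple

def statistical_integration(nbest_lists: List[List[Tuple[str, float]]],
                            valid_commands: Dict[Tuple[str, ...], str]) -> str:
    """Cross-product integration of per-modality N-best lists (Eqs. 4.24, 4.25):
    multiply the probabilities of every combination, keep only combinations
    that map to a valid system command, and return the most probable one."""
    best_cmd, best_p = None, 0.0
    for combo in product(*nbest_lists):              # all C_n combinations
        labels = tuple(result for result, _ in combo)
        p = 1.0
        for _, prob in combo:                        # P_n = product of the P_{i,j}
            p *= prob
        cmd = valid_commands.get(labels)             # database of valid combinations
        if cmd is not None and p > best_p:
            best_cmd, best_p = cmd, p
    return best_cmd

# Hypothetical example: speech and pointing-gesture N-best lists.
speech = [("select", 0.8), ("delete", 0.2)]
gesture = [("object_3", 0.6), ("object_7", 0.4)]
valid = {("select", "object_3"): "SELECT OBJECT 3",
         ("select", "object_7"): "SELECT OBJECT 7",
         ("delete", "object_3"): "DELETE OBJECT 3"}
print(statistical_integration([speech, gesture], valid))   # -> SELECT OBJECT 3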

4.3.4 Errors in Multimodal Systems

Generally, we speak of an error in HCI if the user does not reach her or his desired
goal and no coincidence can be held responsible for it. Independent of the do-
main, error robustness substantially influences the user acceptance of a technical
system. Error robustness is therefore indispensable, although errors can never
be avoided completely. Thus, in this field both passive (a-priori) error avoidance and
active (a-posteriori) error handling are of great importance. Unde-
sired system reactions due to faulty operation as well as system-internal errors must
be avoided as far as possible. Error resolution must be efficient, transparent and ro-
bust. The following scenario shows a familiar error-prone situation: a driver wants
to change the current radio station using the speech command ”listen to hot radio”.
The speech recognizer misinterprets the command as ”stop radio” and the radio
stops playing.
4.3.4.1 Error Classification
For a systematic classification of error types, we basically assume that either the user
or the system can cause an error in the human-machine communication process.
4.3.4.2 User Specific Errors
The user interacting with the system is one error source. According to J. Reason [42]
user specific errors can be categorized into three levels:
• Errors on the skill-based level (e.g., slipping off a button)
The skill-based level comprises smooth, automated, and highly integrated
routine actions that take place without conscious attention or control. Human
performance is governed by stored patterns of pre-programmed instructions
represented as analog structures in a time-space domain. Errors at this level
are related to the intrinsic variability of force, space, or time coordination.
Sporadically, the user checks whether the action initiated by her or him runs as
planned and whether the plan for reaching the intended goal is still adequate. Error
patterns on the skill-based level are execution or memory errors that result from
inattention or overattention of the user.

• Errors on the rule-based level (e.g., using a valid speech command which is
not permitted in the current mode but in another one)
Concerning errors on the rule-based level, the user violates stored prioritized
rules (so-called productions). Errors are typically associated with the misclas-
sification of situations leading to the application of the wrong rule or with the
incorrect recall of procedures.

• Errors on the knowledge-based level (e.g., using a speech command which is
unknown to the system)
At the knowledge-based level, the user applies stored knowledge and analytical
processes in novel situations in which actions must be planned on-line. Errors at
this level arise from resource limitations (bounded rationality) and incomplete
or incorrect knowledge.

4.3.4.3 System Specific Errors

The second branch of the error taxonomy addresses errors caused by the system.
System specific errors can be divided into three categories:
• Errors on the recognition level
Examples are misinterpretations, the false recognition of a correct user
input, or an incorrect system-intrinsic activation of a speech recognizer (e.g., the
user coincidentally utters the keyword which activates the speech recognizer during
a conversation).

• Errors on the processing level
Timing problems, contradictory recognition results of different monomodal
recognizers, etc. (e.g., the result of speech recognition differs from the gesture
recognition input) cause processing errors.

• Errors on the technical level
System overflow or the breakdown of system components lead to errors on the
technical level.

4.3.4.4 Error Avoidance


According to [49] there are eight rules for designing user interfaces. These rules are
derived from experience and are applicable in most interactive systems. They do not
serve error avoidance directly, but they simplify the user's interaction with
the system. In that way, many potential errors are prevented. From these rules one
can derive some guidelines for multimodal interfaces:
• Strive for consistency: Similar situations should require similar sequences of
actions and use identical terminology in, e.g., menus or prompts. For multimodal
interfaces, consistency has two aspects. First, within one situation all
commands should be the same for all modalities. Second, within one modality,
all similar commands in different situations should be the same. E.g., the com-
mand to return to the main menu should be the same in all submenus,
and all submenus should be accessible by all modalities through the same command
(the menu name on the button is identical to the speech command).
• Offer informative feedback: Every step of interaction should be answered with a
system feedback. This feedback should be modest for frequent and minor actions
and more substantial for infrequent or major actions. Multimodal interfaces can
take advantage of different output modalities. Feedback should be given in the
same modality as the used input modality. E.g., a speech command should be
answered with acoustic feedback.
• Reduce short-term memory load: Human information processing is limited by
short-term memory capacity. This calls for simple dialog systems. Multimodal systems
should use the modalities in a way that reduces the user's memory load. E.g.,
object selection is easy by pointing at an object, but difficult by using speech commands.
• Synchronize multiple modalities: Interaction by speech is highly temporal. Visual
interaction is spatial. Synchronization of these input modalities is needed. E.g.,
objects are selected by pointing at them with the finger while the selection is
triggered by speech (”Select this item”).
There are also many other design aspects and guidelines to avoid errors. In
comparison to monomodal interfaces, multimodal interfaces can even improve er-
ror avoidance by enabling the user to choose freely which input modality to use. In
this way, the user is able to select the input modality that is the most comfortable
and efficient way to achieve his aim. Also, when interacting through more than one
channel, typical errors of a single modality can be compensated by merging all input
data into one piece of information. Thus, providing more than one input modality
increases the robustness of the system and helps to avoid errors in advance.
4.3.4.5 Error Resolution
If errors occur (system or user errors), the system tries to solve the resulting
problems by initiating dialogs with the user. Error resolution strategies are differen-
tiated into single-step and multi-level dialog strategies. In the context of a single-step
strategy a system prompt is generated, to which the user can react with an individual
input. In the context of a multi-level strategy, on the other hand, a more complex
inquiry dialog is initiated, in which the user is led step by step through the error
resolution process. Especially the second approach offers great potential for an
adaptation to the current user and the momentary environmental situation.
For example, the following error handling strategies can be differentiated:
• Warning
• Asking for repetition of last input
• Asking to change the input modality
• Offering alternative input modalities


These strategies differ in characteristics such as the initialization of the error warning,
the strength of context, individual characteristics of the user, the complexity of the error
strategy, and the inclusion of the user. The choice of which dialog strategy to use depends
mainly on contextual parameters and the current state of the system (e.g., the
chosen input modality, the state of the application). According to Sect. 4.3.3, the error
management component is located in the integration level of a multimodal architec-
ture [30]. The error management process consists of four steps: error feature extrac-
tion, error analysis, error classification, and error resolution. First, a module contin-
uously extracts certain features from the stream of incoming messages and checks
for a potential error. Then, the individual error patterns can be classified. In this phase
of the error management process, the resulting error type(s) are determined. After-
wards, for the selection of a dedicated dialog strategy, the current context parameters
as well as the error types are analyzed. From the results, the strategy with the highest
plausibility is chosen and finally helps to solve the failure in a way that is comfortable
for the user. Summarizing, the importance of error robustness for multimodal systems
has been discussed and ways of avoidance and resolution have been presented.
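As an illustration of the last two steps of this error management process, the following minimal sketch maps a classified error type and a few contextual parameters to one of the resolution strategies listed above. All error type names and context fields are hypothetical and merely indicate how such a dispatch could look.

def choose_strategy(error_type: str, context: dict) -> str:
    """Map a classified error type and contextual parameters to a
    resolution strategy (warning, repetition, modality change, ...)."""
    if error_type == "technical":
        return "warning"
    if error_type == "recognition":
        # If another input modality is available, offer switching to it.
        if context.get("alternative_modalities"):
            return "offer alternative input modality"
        return "ask for repetition of last input"
    if error_type == "processing":
        return "ask to change the input modality"
    return "ask for repetition of last input"

print(choose_strategy("recognition", {"alternative_modalities": ["touch"]}))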

4.4 Emotions from Speech and Facial Expressions


Today the great importance of integrating emotional aspects as the next step
toward more natural human-machine interaction is commonly accepted. Through-
out this chapter we therefore want to give an overview of important existing ap-
proaches to recognizing human affect from the audio and video signal.

4.4.1 Background

Within this section we motivate emotion recognition and the modalities chosen in
this chapter. Also, we introduce models of emotion and discuss databases.
4.4.1.1 Application Scenarios
Even though button pressing starts to be substituted by more natural communica-
tion forms such as talking and gesturing, human-computer communication still feels
somehow impersonal, insensitive, and mechanical. If we take a comparative glance at
human-human communication, we notice the lack of the extra information sensed
by humans concerning the affective state of the counterpart. This emotional information
strongly influences the interpretation of the explicit information recognized by today's
human-machine communication systems, and with increasingly natural communication
users will expect it to be taken into account. Throughout the design of next generation
man-machine interfaces the inclusion of this implicit channel therefore seems obligatory [10].
Automatic emotion recognition is nowadays already introduced experimentally
in call centers, where an annoyed customer is handed over from a robot to a human
call operator [11, 25, 39], and in first commercial lifestyle products such as fun software
intended to detect lies or the stress or love level of a caller. Besides these, more gen-
eral fields of application are an improved comprehension of the user's intention, emo-
tional accommodation in the communication (e.g. adaptation of acoustic parameters
for speech synthesis if a user seems sad), behavioral observation (e.g. whether an
airplane passenger seems aggressive), objective emotional measurement (e.g. as a
guideline for therapists), transmission of emotion (e.g. sending laughing or crying
images within text-based emails), affect-related multimedia retrieval (e.g. highlight
spotting in a sports event), and affect-sensitive lifestyle products (e.g. a trembling
cross-hair in video games if the player seems nervous) [3, 10, 38].
4.4.1.2 Modalities
Human emotion is basically observable within a number of different modalities.
First attempts at automatic recognition applied invasive measurements of, e.g., skin
conductivity, heart rate, or temperature [40]. While the exploitation of this information
source provides a reliable estimation of the underlying affect, it is often perceived as
uncomfortable and unnatural, as the user needs to be wired or at least has to stay in
touch with a sensor. Modern emotion recognition systems therefore focus on video
or audio based non-invasive approaches in the style of human emotion recognition:
It is claimed that we communicate by 55% visually, through body language, by 38%
through the tone of our voice and by 7% through the actual spoken words [32]. In
this respect the most promising approach clearly seems to be a combination of these
sources. However, in some systems and situations only one may be available.
Interestingly, contrary to most other modalities, speech allows the user to control
the amount of emotion shown, which may play an important role if the user otherwise
feels observed too closely. Speech-based emotion recognition in general already
provides reasonable results. However, it seems certain that visual information
helps to enable a more robust estimation [38]. From an economic point of
view, a microphone as sensor is standard hardware in many HCI systems today, and
more and more cameras appear as well, e.g. in current cellular phones. In
these respects we want to give an insight into acoustic, linguistic and vision-based
information analysis in search of affect within this chapter, and present approaches
to fusing these sources.
4.4.1.3 Emotion Model
Prior to recognizing emotion one needs to establish an underlying emotion model.
In order to obtain a robust recognition performance it seems reasonable to limit the
complexity of the model, e.g. the kind and number of emotion labels used, in view of
the target application. This may be one of the reasons why no consensus about such
a model exists in technical approaches yet. Two generally different views dominate
the scene: on the one hand, an emotion space is spanned by two to three orthogonal
axes: firstly arousal or activation, reflecting the readiness to take some action; sec-
ondly valence or evaluation, describing a positive or negative attitude; and finally
control or power, describing the speaker's dominance or submission [10]. While this
approach provides a good basis for emotional synthesis, it is often too complex for
concrete application scenarios. The better known way therefore is to classify emo-
tion by a limited set of discrete emotion tags. A first standard set of such labels exists
within the MPEG-4 standard, comprising anger, disgust, fear, joy, sadness and sur-
prise [34]. In order to discriminate a non-emotional state it is often supplemented
by neutrality. While this model conflicts with many psychological approaches, it provides
a feasible basis from a technical point of view. However, further emotions such as
boredom are often used as well.
4.4.1.4 Emotional Databases
In order to train and test the intended recognition engines, a database of emotional sam-
ples is needed. Such a corpus should provide spontaneous and realistic emotional
behavior from the field. The samples should ideally have studio audio and video
quality, but for analyzing robustness against noise, samples with known back-
ground noise conditions may also be desired. A database further has to consist of a high
number of ideally equally distributed samples for each emotion, both of the same
person and of many different persons in total. These persons should form a representative
sample considering genders, age groups, ethnic backgrounds, among others. To cover fur-
ther variability, the uttered phrases should possess different contents, lengths, or even
languages. An unambiguous assignment of the collected samples to emotion
classes is especially hard in this discipline. Also, perception tests by human test
persons are very useful: as we know, it may be hard to rate a person's emotion with
certainty. In this respect it seems obvious that comparatively lower recognition rates
are to be expected in this discipline than in related pattern recognition tasks. However,
the named human performance provides a reasonable benchmark for a maximum
expectation. Finally, a database should be made publicly available in view of inter-
national comparability, which is a problem considering the lacking consensus
about the emotion classes used and the privacy of the test persons. A number of methods
exist to create a database, with arguably different strengths: the predominant ones
among these are acting or eliciting emotions in test set-ups, hidden or conscious
long-term observations, and the use of clips from public media content. However, most
databases use acted emotions, which allow the named requirements to be fulfilled
except for spontaneity, as there is doubt whether acted emotions are capable of rep-
resenting the true characteristics of affect. Still, they do provide a reasonable starting
point, considering that databases of real emotional speech are hard to obtain. In [54]
an overview of existing speech databases can be found. Among the most popular
ones we want to name the Danish Emotional Speech Database (CEICES), the Berlin
Emotional Speech Database (EMO-DB), and the AIBO Emotional Speech Corpus
(AEC) [4]. Audio-visual databases are, however, still sparse, especially in view of the
named requirements.

4.4.2 Acoustic Information

Basically two main information sources are exploited for emotion recognition from
speech: the acoustic information, analyzing the prosodic structure, and the spoken
content itself, namely the linguistic information.
The predominant aims, besides high reliability, are independence of the
speaker, the spoken language, the spoken content (when considering acoustic process-
ing), and the background noise. A number of parameters besides the underlying
emotion model and the database size and quality strongly influence the performance
in these respects and will be addressed in the following: the signal
capturing, pre-processing, feature selection, classification method, and a reasonable
integration into the interaction and application context.
4.4.2.1 Feature Extraction
In order to estimate a user's emotion by acoustic information one has to carefully
select suited features. These have to carry information about the transmitted emo-
tion, but they also need to fit the chosen classification approach. Feature sets used
in existing works differ greatly, but the feature types used in acoustic emotion
recognition may be divided into prosodic features (e.g. intensity, intonation, dura-
tions), voice quality features (e.g. 1-7 formant positions, 1-7 formant bandwidths,
harmonic-to-noise ratio (HNR), spectral features, 12-15 Mel Frequency Cepstral
Coefficients (MFCC)), and articulatory ones (e.g. spectral centroid, and harder to
compute ones such as the centralization of vowels).
In order to calculate these, the speech signal is first weighted with a shifting
soft window function (e.g. a Hamming window) of lengths ranging from 10-30 ms
with a window overlap of around 50%. This is a common procedure in speech process-
ing and is needed because the speech signal is only quasi-stationary. Next a contour
value is computed for every frame and every contour, leading to a multivariate time
series. For intensity, mostly the simple logarithmic frame energy is computed; however,
it should be mentioned that this does not reflect human loudness perception. Spectral
analysis mostly relies on the Fast Fourier Transform or on MFCCs, a standard homomor-
phic spectral transformation in speech processing aiming at the de-convolution of the
vocal tract transfer function and at perceptual modeling by the Mel frequency scale.
First problems arise here, as the remaining feature contours pitch, HNR, and formants
can only be estimated. Especially pitch and HNR can either be derived from the
spectrum, or, more popularly, by a peak search within the autocorrelation function
of the speech signal. Formants may be obtained by analysis of the Linear Predic-
tion Coefficients (LPC), which we will not go into here. Often backtracking by means of
dynamic programming is used to ensure smooth feature contours and to reduce global
costs rather than local ones. Mostly, higher order derivatives such as speed and ac-
celeration are also included to better model temporal changes. In any case, filtering
the contours leads to a gain by noise reduction and is done with low-pass filters such as
moving average or median filters [34, 46].
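As a minimal illustration of this frame-wise contour extraction, the following sketch computes a logarithmic energy contour and a crude autocorrelation-based pitch estimate per frame. The window length, overlap, sampling rate, and pitch search range are example values and not prescriptions from the text.

import numpy as np

def frame_signal(x, sr, win_len=0.025, hop=0.0125):
    """Cut the signal into overlapping Hamming-windowed frames (25 ms, 50% overlap)."""
    n, h = int(win_len * sr), int(hop * sr)
    frames = [x[i:i + n] * np.hamming(n) for i in range(0, len(x) - n, h)]
    return np.array(frames)

def log_energy(frames):
    """Simple logarithmic frame energy contour."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def pitch_autocorr(frame, sr, fmin=60, fmax=400):
    """Crude pitch estimate: peak of the autocorrelation within a plausible lag range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 120 * t)                  # synthetic 120 Hz "voiced" signal
frames = frame_signal(x, sr)
print(log_energy(frames)[:3], pitch_autocorr(frames[0], sr))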
Next, two generally different approaches exist for the further acoustic
feature processing in view of the succeeding classification: dynamic and static mod-
eling. Within the dynamic approach the raw feature contours, e.g. the pitch or inten-
sity contours, are directly analyzed frame-wise by methods capable of handling mul-
tivariate time series by means of dynamic programming (e.g. Hidden Markov Models
(HMM) [34] or Dynamic Bayesian Networks (DBN)).
The second way, by far the more popular, is to systematically derive functionals from
the time series by means of descriptive statistics. Mostly used are moments
such as the mean and standard deviation, or extrema and their positions. Also zero-crossing
rates (ZCR), the number of turning points, and others are often considered. As temporal
information is thereby mostly lost, duration features are included. These may be the
mean length of pauses or voiced sounds, etc.; however, these are more complex to
estimate. All features should generally be normalized by mean subtraction and
division by the standard deviation or the maximum, as some classifiers are susceptible to
different number ranges.
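A minimal sketch of such static functionals, computed over an arbitrary feature contour and z-normalized across utterances. The particular set of functionals is an illustrative selection, not the complete set used in the cited studies.

import numpy as np

def functionals(contour: np.ndarray) -> np.ndarray:
    """Derive static descriptive-statistics features from one contour:
    mean, standard deviation, extrema and their relative positions,
    and the zero-crossing rate of the mean-removed contour."""
    centered = contour - contour.mean()
    zcr = np.mean(np.abs(np.diff(np.sign(centered))) > 0)
    return np.array([contour.mean(), contour.std(),
                     contour.max(), np.argmax(contour) / len(contour),
                     contour.min(), np.argmin(contour) / len(contour),
                     zcr])

def z_normalize(features: np.ndarray) -> np.ndarray:
    """Mean subtraction and division by the standard deviation per dimension."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-10)

# Example: functionals of pitch contours, stacked for several utterances.
pitch_contours = [np.random.rand(120) * 100 + 100 for _ in range(5)]
feats = np.vstack([functionals(c) for c in pitch_contours])
print(z_normalize(feats).shape)                  # (5, 7)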
In a direct comparison under constant test conditions the static features outper-
formed the dynamic approach in our studies [46]. This is largely due to the unsatis-
factory independence of the overall contour from the spoken content. In the
following we therefore focus on the static approach.
4.4.2.2 Feature Selection
Now that we have generated a high order multivariate time series (approx. 30 dimensions),
included delta regression coefficients (approx. 90 dimensions in total), and started to
derive static features in a deterministic way, we end up with too high a dimensionality
(>300 features) for most classifiers to handle, especially considering the typically sparse
databases in this field. Also, we would not expect every feature that is generated to
actually carry important and non-redundant information about the underlying affect.
Still, every feature costs extraction effort. We therefore aim at a dimensionality reduction
by feature selection (FS) methods.
An often chosen approach is to use Principal Component Analysis (PCA)
in order to construct superposed features out of all features, and to select the ones with
the highest eigenvalues, corresponding to the highest variance [8]. This however does not
save the original extraction effort, as all features are still needed for the computation
of the artificial ones. A genuine reduction of the original features should therefore
be favored, and can be done e.g. by single feature relevance calculation (e.g. infor-
mation gain ratio based on entropy calculation), called filter-based selection. Still,
the best single features do not necessarily result in the best set. This is why so called
wrapper-based selection methods usually deliver better overall results at lower di-
mensionality. The term wrapper alludes to the fact that the target classifier is used
as the optimization target function, which helps to optimize not only the feature set,
but the compound of features and classifier as a whole. A search function is thereby
needed, as the selection problem is in general NP-hard and exhaustive search is infeasible.
Mostly applied among these are Hill Climbing search methods (e.g. Sequential Forward
Search (SFS) or Sequential Backward Search (SBS)), which start from an empty or full
feature set and step-wise add the most important or remove the least relevant feature.
If this is done in a floating manner we obtain the Sequential Floating Search Methods
(SFSM) [48]. However, several other powerful methods exist, among which especially
genetic search has proven effective [55].
After such a reduction the feature vector may shrink to approximately 100 fea-
tures, which are classified in the next step.
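A minimal sketch of wrapper-based Sequential Forward Search, using the cross-validated accuracy of the target classifier as the selection criterion. The scikit-learn library and the random toy data are used purely for illustration; the chapter does not prescribe a particular toolkit.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def sfs(X, y, n_select, clf=None):
    """Wrapper-based Sequential Forward Search: greedily add the feature
    that most improves the cross-validated accuracy of the target classifier."""
    clf = clf or SVC(kernel="linear")
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        scores = [(cross_val_score(clf, X[:, selected + [f]], y, cv=3).mean(), f)
                  for f in remaining]
        best_score, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy example: 40 utterances, 10 acoustic functionals, 2 emotion classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = rng.integers(0, 2, size=40)
print(sfs(X, y, n_select=3))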
4.4.2.3 Classification Methods


A number of factors influence the choice of the classification method. Besides high
recognition rates and efficiency, economical aspects and a reasonable integration into
the target application framework play a role. In ongoing research a broad spec-
trum of classifiers is applied, reaching from rather basic ones such as instance-based
learners or Naive Bayes to more complex ones such as Decision Trees, Artificial Neural
Networks (e.g. Multi Layer Perceptrons (MLP) or Radial Basis Function Networks (RBF)),
and Support Vector Machines (SVM) [10, 38]. While the more complex ones tend to
show better results, no general agreement can be found so far. However, SVM tend to
be among the most promising ones [45]. The power of such base classifiers can also be
boosted or combined by methods of ensemble construction (e.g. Bagging, Boosting, or
Stacking) [39, 48, 55].
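As a sketch of this final classification stage, selected functionals can be fed to an SVM, optionally wrapped in a bagging ensemble. Again, scikit-learn and the random data merely illustrate the idea and stand in for a real emotional speech corpus.

import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))                     # 60 utterances, 8 selected functionals
y = rng.integers(0, 3, size=60)                  # 3 hypothetical emotion classes

svm = SVC(kernel="rbf")                          # base classifier
ensemble = BaggingClassifier(svm, n_estimators=10)  # ensemble construction

print("SVM accuracy:     ", cross_val_score(svm, X, y, cv=3).mean())
print("Ensemble accuracy:", cross_val_score(ensemble, X, y, cv=3).mean())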

4.4.3 Linguistic Information

Up to here we have described a considerable amount of research effort on feature ex-
traction and classification algorithms investigating vocal properties for the
purpose of inferring the probably expressed emotion from the sound. So the ques-
tion about the information transmitted within the acoustic channel, ”How was it said?”,
has been addressed with great success. Recently more attention has been paid to the inter-
pretation of the spoken content itself, dealing with the related question ”What was
said?” in view of the underlying affect. In psychological studies it is claimed that
a connection between certain terms and the related emotion has been learned by
the speaker [25]. As the speaker's expression of his emotion consists in the us-
age of certain phrases that are likely to be mixed with meaningful statements in
the context of the dialog, an approach capable of spotting emotionally rel-
evant information is needed. Consider for this the example ”Could you please tell
me much more about this awesome field of research”. The ratio of affective words
clearly depends on the underlying application background and the personal nature
of the speaker; however, it will mostly be very low. It therefore remains question-
able whether linguistic information is sufficient when applied standalone. However,
its integration has shown a clear increase in performance [8, 11, 25, 45], even though the
conclusions drawn rely per definition on erroneous Automatic Speech Recognition
(ASR) outputs. In order to reasonably handle the incomplete and uncertain data of the
ASR unit, a robust approach should take acoustic confidences into account through-
out the processing. Still, no existing system for emotional language interpretation
calculates an output certainty based upon the input certainty, except for
the one presented in [47]. The most trivial approach to linguistic analysis would be
the spotting of single emotional terms $w_j \in U$ within an utterance $U$ labeled with
an emotion $e_i \in E$ out of the set of emotions $E$. All known emotional keywords
$w_k \in V$ would then be stored within a vocabulary $V$. In order to handle only emo-
tional keywords, it helps to sort out abstract terms that cannot carry information about
the underlying emotion, such as names, comparable to the feature space reduction for
acoustic features. This is known as stopping within linguistics. A so called stop-list
can thereby be obtained either by expert knowledge or by automated approaches such as
the calculation of the salience of a word [25]. One can also cope with emotionally ir-
relevant information by a normalized log likelihood ratio between an emotion and a
general task specific model [11]. Additionally, by stemming, words of the same stem
are clustered, which also reduces the vocabulary size while in general directly increasing
performance. This is because hits within an utterance are crucial, and their number
increases significantly if none are lost due to minor word differences such as plural forms
or verb conjugations.
However, such an approach models neither word order nor the fact that one term
can represent several emotions, which leads us to the more sophisticated approaches
shown in the following.
4.4.3.1 N-Grams
A common approach to speech language modeling is the use of n-grams. Let us
first assume that the conditional probability of a word $w_j$ is given by its predecessors
from left to right within an utterance $U$ as $P(w_j|w_1, \dots, w_{j-1})$. Next, for language
interpretation based emotion recognition, class-based n-grams are needed. Accord-
ingly, an emotion $e_i$ within the emotion set $E$ shall have the a-posteriori probability
$P(e_i|w_1, \dots, w_j)$ given the word $w_j$ and its predecessors in $U$. However, following
Zipf's principle of least effort, which states that irrelevant function words occur very
frequently while terms of interest are rather sparse, we reduce the number of consid-
ered words to N in order to prevent over-modeling. Applying the first order Markov
assumption we can therefore use the following estimation:

$P(e_i|w_1, \dots, w_j) \approx P(e_i|w_{j-N+1}, \dots, w_j)$   (4.26)


However, mostly uni-grams have been applied so far [11, 25], besides bi-grams
and trigrams [2], which is due to the very limited typical corpus sizes in speech emo-
tion recognition. Uni-grams provide the probability of an emotion under the condi-
tion of single known words, i.e. words contained in the vocabulary, without
modeling of neighborhood dependencies. In a decision process, the actual
emotion $e$ can then be calculated as:

$e = \arg\max_i P(e_i|w_j), \quad (w_j \in U) \wedge (w_j \in V)$   (4.27)

Now, in order to calculate $P(e_i|w_j)$, we can use simple Maximum Likelihood
Estimation (MLE):

$P_{\mathrm{MLE}}(e_i|w_j) = \dfrac{TF(w_j, e_i)}{TF(w_j, E)}$   (4.28)

Let thereby $TF(w_j, e_i)$ denote the frequency of occurrence of the term $w_j$ tagged
with the emotion $e_i$ within the whole training corpus. $TF(w_j, E)$ then denotes the
total frequency of occurrence of this particular term. One problem is that
term/emotion pairs that never occur lead to a probability of zero. As this
is critical for the overall calculation, we assume that every word has a general
probability of appearing under each emotion. This is realized by the introduction of the
Lidstone coefficient $\lambda$ as shown in the following equation:

$P_{\lambda}(e_i|w_j) = \dfrac{TF(w_j, e_i) + \lambda}{TF(w_j, E) + \lambda \cdot |E|}, \quad \lambda \in [0, 1]$   (4.29)

If $\lambda$ equals one, this is also known as Maximum a-Posteriori (MAP) estima-
tion.
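A minimal sketch of this unigram approach with Lidstone smoothing (Eqs. 4.28 and 4.29); the per-word probabilities are combined by a product over the known words of the utterance, which is one simple way of applying the decision rule of Eq. (4.27). The tiny training corpus is purely illustrative.

from collections import defaultdict

def train_unigram(corpus, emotions, lam=0.5):
    """Estimate P_lambda(e|w) from (word list, emotion) pairs via Eq. (4.29)."""
    tf = defaultdict(lambda: defaultdict(int))       # tf[word][emotion]
    for words, emo in corpus:
        for w in words:
            tf[w][emo] += 1
    def p(e, w):
        total = sum(tf[w].values())                  # TF(w, E)
        return (tf[w][e] + lam) / (total + lam * len(emotions))
    return p

def classify(words, emotions, p):
    """Choose the emotion maximizing the product of the smoothed unigram
    probabilities over the words of the utterance."""
    scores = {e: 1.0 for e in emotions}
    for w in words:
        for e in emotions:
            scores[e] *= p(e, w)
    return max(scores, key=scores.get)

emotions = ["anger", "joy", "neutral"]
corpus = [(["this", "is", "awesome"], "joy"),
          (["stupid", "tin", "box"], "anger"),
          (["please", "repeat"], "neutral")]
p = train_unigram(corpus, emotions)
print(classify(["awesome", "research"], emotions, p))   # -> joy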
4.4.3.2 Bag-of-Words
The so called Bag-of-Words method is a standard representation form for text in
automatic document categorization [21] that can also be applied to recognize emo-
tion [45]. Each word $w_k \in V$ in the vocabulary $V$ adds a dimension to a
linguistic vector $x$ representing the logarithmic term frequency within the actual ut-
terance $U$, known as logTF (other calculation methods exist, which we will not show
here). A component $x_{\mathrm{logTF},j}$ of the vector $x_{\mathrm{logTF}}$ with dimension $|V|$
can be calculated as:

$x_{\mathrm{logTF},j} = \log \dfrac{TF(w_j, U)}{|U|}$   (4.30)

As can be seen in the equation, the term frequency is normalized by the phrase
length. Due to the fact that a high dimensionality may decrease the performance of
the classifier and that inflections of terms reduce performance especially within small data-
bases, stopping, stemming, or further methods of feature reduction are mandatory. A
classification can now be performed as with the acoustic features. Preferably, one would
choose SVM for this task, as they are well known to show high performance [21].
As with the uni-grams, word order is not modeled by this approach. How-
ever, one advantage is the possibility of a direct inclusion of linguistic features within
the acoustic feature vector.
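A minimal sketch of the logTF Bag-of-Words representation of Eq. (4.30). The toy vocabulary stands for the result of stopping and stemming, and a small constant is added before the logarithm to keep absent terms finite, which is an implementation detail not specified in the text.

import math

def log_tf_vector(utterance, vocabulary):
    """Build the |V|-dimensional logTF vector of Eq. (4.30):
    logarithm of the term frequency normalized by the utterance length."""
    words = utterance.lower().split()
    eps = 1e-6                                       # avoid log(0) for absent terms
    return [math.log(words.count(w) / len(words) + eps) for w in vocabulary]

vocabulary = ["good", "bad", "great", "awful"]       # after stopping and stemming
print(log_tf_vector("this is a great great day", vocabulary))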
4.4.3.3 Phrase Spotting
As mentioned, a key drawback of the approaches so far is the lacking view of the
whole utterance. Consider the negation in the following example: ”I do not
feel too good at all,” where the positively perceived term ”good” is negated. There-
fore Bayesian Networks (BN) may be used as a mathematical framework for the semantic
analysis of spoken utterances, taking advantage of their capabilities in spotting and
handling uncertain and incomplete information [47]. Each BN consists of
a set of $N$ nodes related to state variables $X_i$, each comprising a finite set of states. The
nodes are connected by directed edges reaching from parent to child nodes and ex-
pressing quantitatively the conditional probabilities of nodes given their parent nodes.
A complete representation of the network structure and conditional probabilities is
provided by the joint probability distribution:

$P(X_1, \dots, X_N) = \prod_{i=1}^{N} P(X_i \,|\, \mathrm{parents}(X_i))$   (4.31)
Methods for inferring the states of some query variables based on observations
regarding evidence variables are provided by the network. Similar to a standard ap-
proach to natural speech interpretation, the aim here is to make the net maximize the
probability of the root node modeling the specific emotion expressed by the speaker
via his choice of words and phrases. The root probabilities are distributed equally
in the initialization phase and represent the priors of each emotion. If the emotional
language information interpretation is used stand-alone, a maximum likelihood de-
cision takes place. Otherwise the root probability for each emotion is fed forward
to a higher-level fusion algorithm. On the input layer a standard HMM-based ASR
engine with zero-grams as language model providing N-best hypotheses with sin-
gle word confidences may be applied. In order to deal with the acoustic certainties,
the traditional BN may be extended to handle soft evidence [47]. The approach dis-
cussed here is based on the integration and abstraction of semantically similar units
into higher-level units in several layers. On the input level the N-best recognized
phrases are presented to the algorithm, which maps this input onto defined interpreta-
tions via its semantic model, consisting of a BN.
At the beginning, the spotting of items known to the semantic model is achieved
by matching the words in the input level to word-nodes contained in the lowest
layer of the BN. Within this step the knowledge about the uncertainty of the recognized
words, represented by their confidences, is completely transferred into the interpre-
tation model by accordingly setting soft evidence in the corresponding word-nodes.
While stepping forward to each superior model layer, those units resembling each
other in their semantic properties regarding the target interpretations are clustered
into semantic super-units until the final layer with the root nodes of the network is
reached. Thereby the evidences assigned to word-nodes, due to corresponding ap-
pearances in the utterance, finally result in changes of the probabilities in the root
nodes, representing the confidence of each specific emotion and its extent. Overall,
the BN approach allows for an entirely probabilistic processing of uncertain input to
obtain genuinely probability-weighted output. To illustrate what is understood as
semantically similar, consider for instance some words expressing a positive attitude,
such as ”good”, ”well”, ”great”, etc., being integrated into a super-word ”Positive”.
The quantitative contribution $P(e_i|w_j)$ of any word $w_j$ to the belief in an emotion
$e_i$ is calculated in a training phase by its frequency of occurrence under the
observation of the emotion on the basis of the speech corpus, as shown for the
n-grams. If a word order within a phrase is given, an important modification of the
classic BN has to be carried out, as BNs are in general not capable of processing
sequences due to their entirely commutative evidence assignment.

4.4.4 Visual Information

For a human spectator the visual appearance of a person provides rich information
about his or her emotional state [32]. Thereby several sources can be identified,
such as body-pose (upright, slouchy), hand-gestures (waving about, folded arms),
head-gestures (nodding, inclining), and – especially in a direct close conversation –
the variety of facial expressions (smiling, surprised, angry, sad, etc.). Very few ap-
proaches exist towards an affective analysis of body-pose and gestures, while several
works report considerable efforts in investigating various methods for facial expres-
sion recognition. Therefore we are going to concentrate on the latter in this section,
provide background information, necessary pre-processing stages, algorithms for af-
fective face analysis, and give an outlook on forthcoming developments.
4.4.4.1 Prerequisites
Similar to the acoustic analysis of speech for emotion estimation, the task of Facial
Expression Recognition can be regarded as a common pattern recognition problem
with the familiar stages of preprocessing, feature extraction, and classification. The
addressed signal, i.e. the camera view of a face, provides lots of information that
is neither dependent on nor relevant to the expression. These are mainly ethnic and
inter-cultural differences in the way of expressing feelings, inter-personal differences in
the look of the face, gender, age, facial hair, hair cut, glasses, the orientation of the face,
and the direction of gaze. All these influences act as disturbing noise with respect to
the target source, the facial expression, and the aim is to reduce the impact of all noise
sources while preserving the relevant information. This task, however, constitutes a
considerable challenge located in the preprocessing and feature extraction stages, as
explained in the following.
From the technical point of view another disturbing source should be minimized,
namely the variation of the position of the face in the camera image. Therefore, the pre-
processing comprises a number of required modules for robust head localization and
estimation of the face orientation. As a matter of fact, facial expression recognition
algorithms perform best when the face is localized most accurately with respect to
the person's eyes. Here, best refers to both the quality and the execution time of the
method. However, eye localization is a computationally even more complex task than
face localization. For this reason a layered approach is proposed, where the search area
for the eyes is limited to the output hypotheses of the previous face localizer.
Automatic Facial Expression Recognition has still not arrived in real-world sce-
narios and applications. Existing systems impose a number of restrictions regarding
camera hardware, lighting conditions, facial properties (e.g. glasses, beard), the allowed
head movements of the person, and the view of the face (frontal, profile). Different ap-
proaches show different robustness with respect to the mentioned parameters. In the
following we want to give an insight into different basic ideas that are applied to automatic
mimic analysis. Many of them were derived from the task of face recognition, which is the
visual identification or authentication of a person based on his or her face. Here,
the applied methods can generally be categorized into holistic and non-holistic or an-
alytic methods [37].
4.4.4.2 Holistic Approaches
Holistic methods (Greek: holon = the whole, all parts together) strive to process
the entire face as it is, without any incorporated expert knowledge like geometric
properties or special regions of interest for mimic analysis.
One approach is to extract a comprehensive and correspondingly large set of features
from the luminance representation or textures of the face image. For example, Gabor-
Wavelet coefficients proved to be an adequate parametrization for textures and edges
in images [51]. The response of the Gabor filter can be written as a correlation of the
input image $I(\mathbf{x})$ with the Gabor kernel $p_k(\mathbf{x})$:
$a_k(\mathbf{x}_0) = \int I(\mathbf{x})\, p_k(\mathbf{x} - \mathbf{x}_0)\, d\mathbf{x}$   (4.32)

where the Gabor filter $p_k(\mathbf{x})$ can be formulated as:

$p_k(\mathbf{x}) = \dfrac{k^2}{\sigma^2} \exp\left(-\dfrac{k^2}{2\sigma^2}\mathbf{x}^2\right)\left[\exp(i\mathbf{k}\mathbf{x}) - \exp\left(-\dfrac{\sigma^2}{2}\right)\right]$,   (4.33)
while $\mathbf{k}$ is the characteristic wave vector. Most works [26] apply 5 spatial fre-
quencies with $k_i = \pi/2, \pi/4, \pi/8, \pi/16, \pi/32$ and 8 orientations from 0 to $\pi$
differing by $\pi/8$, while $\sigma$ is set to the value of $\pi$. Consequently, 40 coefficients
are computed for each position of the image. Let $I(\mathbf{x})$ be of height 150 pixels and
width 100 pixels; then $150 \cdot 100 \cdot 40 = 6 \cdot 10^5$ features are computed.
Subsequently, Machine Learning methods for feature selection identify the most
relevant features as described above, which allow for a best possible discrimination of the
addressed classes during the training phase. A common algorithm applied to this prob-
lem was proposed by Freund and Schapire and is known as AdaBoost.M1 [13, 19].
AdaBoost and its derivatives are capable of operating on feature vectors of six-digit di-
mensionality, while the execution time remains tolerable. For the assignment of
the reduced static feature set to emotional classes, any statistical classifier can be
used. In the case of video processing and real-time requirements, the choice of classifiers
might focus on linear methods or decision trees, depending on the computational
effort to localize the face and extract the limited number of features.
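A minimal numpy sketch of the Gabor kernel of Eq. (4.33) and of extracting the 40 filter responses per pixel mentioned above. Kernel size and the use of magnitude responses are implementation choices made for illustration, not taken from [26] or [51].

import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(k_mag, theta, sigma=np.pi, size=31):
    """Complex Gabor kernel of Eq. (4.33) for the wave vector k = k_mag*(cos t, sin t)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    kx, ky = k_mag * np.cos(theta), k_mag * np.sin(theta)
    r2 = x ** 2 + y ** 2
    envelope = (k_mag ** 2 / sigma ** 2) * np.exp(-k_mag ** 2 * r2 / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def gabor_features(image):
    """Magnitude responses for 5 frequencies x 8 orientations = 40 values per pixel."""
    responses = []
    for k_mag in [np.pi / 2, np.pi / 4, np.pi / 8, np.pi / 16, np.pi / 32]:
        for theta in np.arange(8) * np.pi / 8:
            kern = gabor_kernel(k_mag, theta)
            responses.append(np.abs(fftconvolve(image, kern, mode="same")))
    return np.stack(responses, axis=-1)

face = np.random.rand(150, 100)                  # stand-in for a normalized face image
print(gabor_features(face).shape)                # (150, 100, 40)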
In Face Recognition the approach of Eigenfaces proposed by Turk and Pentland
has been examined thoroughly [52]. The aim is to find the principal com-
ponents of the distribution of two-dimensional face representations. This is achieved
by the determination and selection of Eigenvectors of the covariance matrix of a
set of representative face images. This set should cover different races for the com-
putation of the covariance matrix. Each image of size N × M pixels, which can be
thought of as an N × M matrix of 8-bit luminance values, is transformed into a vector
of dimensionality N · M. Images of faces, being similar in overall configuration, are
not randomly distributed in this very high-dimensional image space, and thus can be
described by a relatively low dimensional subspace, spanned by the relevant Eigen-
vectors. This relevance is measured by the corresponding Eigenvalues, indicating
a different amount of variation among the faces. Each image pixel contributes more
or less to each Eigenvector, so that each of them can be displayed, resulting in a kind
of ghostly face, the so called Eigenface. Finally, each individual face can be rep-
resented by a linear combination of these Eigenfaces, with a remaining error
due to the reduced dimensionality of the new face space. The coefficients or weights
of that linear combination that minimize the error between the original image and
the face space representation now serve as features for any kind of classification. In
the case of Facial Expression Recognition, the covariance matrix would be computed on
a set of faces expressing all categories of mimics. Subsequently, we apply the Eigen-
vector analysis and extract representative sets of weights for each mimic class that
should be addressed. During the classification of an unknown face, the weights for this
image are computed accordingly and are then compared to the weight vectors of the
training set.
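A compact sketch of the Eigenface computation via principal component analysis. Random data stands in for the flattened training face images, and an SVD shortcut replaces the explicit covariance matrix for brevity; both are illustrative choices, not the procedure of [52] in every detail.

import numpy as np

def train_eigenfaces(images, n_components=10):
    """Compute the mean face and the leading Eigenfaces of a training set.
    images: array of shape (num_faces, H*W), one flattened face per row."""
    mean_face = images.mean(axis=0)
    centered = images - mean_face
    # The right singular vectors of the centered data are the Eigenvectors
    # of its covariance matrix, ordered by decreasing Eigenvalue.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_face, vt[:n_components]              # Eigenfaces as rows

def project(face, mean_face, eigenfaces):
    """Weights of the linear combination that best approximates the face."""
    return eigenfaces @ (face - mean_face)

# Hypothetical training set: 30 flattened 150x100 face images.
faces = np.random.rand(30, 150 * 100)
mean_face, eigenfaces = train_eigenfaces(faces)
weights = project(faces[0], mean_face, eigenfaces)
print(eigenfaces.shape, weights.shape)               # (10, 15000) (10,)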
4.4.4.3 Analytic Approaches
These methods concentrate on the analysis of dominant regions of interest. Thus,
pre-existing knowledge about geometry and facial movements is incorporated, so
that subsequent statistical methods benefit [12]. Ekman and Friesen introduced the
so called Facial Action Coding System in 1978, which consists of 64 Action Units
(AU). Presuming that all facial muscles are relaxed in the neutral state, each AU
models the contraction of a certain set of them, leading to deformations in the face.
Thereby the focus lies on the predominant facial features, such as eyes, eye-brows,
nose, and mouth. Their shape and appearance contain most information about the fa-
cial expressions. One approach for analyzing shape and appearance of facial features
is known as Point Distribution Model (PDM). Cootes and Taylor gave a comprehen-
sive introduction to the theory and implementation aspects of PDM. Subclasses of
PDM are Active Shape Models (ASM) and Active Appearance Models (AAM), which
showed their applicability to mimic analysis [20].
Shape models are statistical descriptions of two-dimensional relations between
landmarks, positioned on dominant edges in face images. These relations are freed
of all transformations, like rotation, scaling and translation. The different shapes that
occur just due to the various inter-personal proportions are modeled by PCA of the
observed landmark displacements during training phase. The search of the landmarks
starts with an initial estimation where the shape is placed over a face manually or
automatically, when preprocessing stages allow for. During search, the edges are
approximated by an iterative approach that tries to measure the similarity of the sum
of edges under the shape to the model. Appearance Models additionally investigate
the textures or gray-value distributions over the face, and combine this knowledge
with shape statistics.
As mentioned before, works in the research community investigate a broad
range of approaches and combinations of different holistic and analytic methods in
order to arrive at algorithms that are robust to the broad range of different
persons and real-life situations. One of the major problems is still the
immense computational effort, and applications will possibly have to be distributed
over multiple CPUs and include the power of GPUs to reach real-time capability.

4.4.5 Information Fusion

In this chapter we aim to fuse the acoustic, linguistic and visual information ob-
tained. This integration (see also Sect. 4.3.3) is often done in a late semantic manner,
e.g. as (weighted) majority voting [25]. More elegant, however, is the direct fusion of the
streams within one feature vector [45], known as early feature fusion. The advan-
tage is that less knowledge is lost prior to the final decision. A compromise
between these two is the so called soft decision fusion, whereby the confidence of
each classifier is also taken into account. The integration of an N-best list of each classi-
fier is possible as well. A problem, however, is the synchronization of the video stream
with the acoustic and linguistic audio streams. Especially if audio is classified on a
global word or utterance level, it may become difficult to find video segments that
correspond to these units. Consequently, dynamic audio processing may be preferred
in view of an early fusion with a video stream.

4.4.6 Discussion

Especially automatic speech emotion recognition based on acoustic features already
comes close to human performance [45], at somewhere around 80% recognition per-
formance. However, usually very idealized conditions such as known speakers and studio
recording conditions are still assumed. Video processing does not reach these re-
gions at this time, and its conditions are very idealized as well. When using emotion recog-
nition systems out of the lab, a number of new challenges arise which have been
hardly addressed within the research community up to now. In this respect the fu-
ture research efforts will have to lead to larger databases of spontaneous emotions,
robustness under noisy conditions, less person-dependency, reliable confidence mea-
surements, integration of further multimodal sources, contextual knowledge integra-
tion, and acceptance studies of emotion recognition applied in everyday systems. In
this respect we are looking forward to a flourishing human-like man-machine com-
munication supplemented by emotion for utmost naturalness.

References
1. When Do We Interact Multimodally? Cognitive Load and Multimodal Communication
Patterns, 2004.
2. Ang, J., Dhillon, R., Krupski, A., Shriberg, E., and Stolcke, A. Prosody-Based Automatic
Detection of Annoyance and Frustration in Human-Computer Dialog. In Proceedings of
the International Conference on Speech and Language Processing (ICSLP 2002). Denver,
CO, 2002.
3. Arsic, D., Wallhoff, F., Schuller, B., and Rigoll, G. Video Based Online Behavior Detec-
tion Using Probabilistic Multi-Stream Fusion. In Proceedings of the International IEEE
Conference on Image Processing (ICIP 2005). 2005.
4. Batliner, A., Hacker, C., Steidl, S., Nöth, E., Russel, S. D. M., and Wong, M. ’You
Stupid Tin Box’ - Children Interacting with the AIBO Robot: A Cross-linguistic Emo-
tional Speech Corpus. In Proceedings of the LREC 2004. Lisboa, Portugal, 2004.
5. Benoit, C., Martin, J.-C., Pelachaud, C., Schomaker, L., and Suhm, B., editors. Audiovi-
sual and Multimodal Speech Systems. In: Handbook of Standards and Resources for Spo-
ken Language Systems-Supplement Volume. D. Gibbon, I. Mertins, R.K. Moore ,Kluwer
International Series in Engineering and Computer Science, 2000.
6. Bolt, R. A. “Put-That-There”: Voice and Gesture at the Graphics Interface. In Inter-
national Conference on Computer Graphics and Interactive Techniques, pages 262–270.
July 1980.
7. Carpenter, B. The Logic of Typed Feature Structures. Cambridge, England, 1992.


8. Chuang, Z. and Wu, C. Emotion Recognition using Acoustic Features and Textual Con-
tent. In Proceedings of the International IEEE Conference on Multimedia and Expo
(ICME) 2004. Taipei, Taiwan, 2004.
9. Core, M. G. Analyzing and Predicting Patterns of DAMSL Utterance Tags. In AAAI
Spring Symposium Technical Report SS-98-01. AAAI Press, 1998. ISBN ISBN 1-57735-
046-4.
10. Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., and
Taylor, J. G. Emotion Recognition in Human-computer Interaction. IEEE Signal Process-
ing magazine, 18(1):32–80, January 2001.
11. Devillers, L. and Lamel, L. Emotion Detection in Task-Oriented Dialogs. In Proceedings
of the International Conference on Multimedia and Expo(ICME 2003), IEEE, Multimedia
Human-Machine Interface and Interaction, volume III, pages 549–552. Baltimore, MD,
2003.
12. Ekman, P. and Friesen, W. Facial Action Coding System. Consulting Psychologists Press,
1978.
13. Freund, Y. and Schapire, R. Experiments with a New Boosting Algorithm. In Interna-
tional Conference on Machine Learning, pages 148–156. 1996.
14. Geiser, G., editor. Mensch-Maschine-Kommunikation. Oldenbourg-Verlag, München,
1990.
15. Goldschen, A. and Loehr, D. The Role of the DARPA Communicator Architecture as a
Human-Computer Interface for Distributed Simulations. In Simulation Interoperability
Standards Organization (SISO) Spring Simulation Interoperability Workshop. Orlando,
Florida, 1999.
16. Grosz, B. and Sidner, C. Attentions, Intentions and the Structure of Discourse. Compu-
tational Linguistics, 12(3):175–204, 1986.
17. Hartung, K., Münch, S., and Schomaker, L. MIAMI: Software Architecture, Deliverable
Report 4. Report of ESPRIT III: Basic Research Project 8579, Multimodal Interface for
Advanced Multimedia Interfaces (MIAMI). Technical report, 1996.
18. Hewett, T., Baecker, R., Card, S., Carey, T., Gasen, J., Mantei, M., Perlman, G., Strong,
G., and Verplank, W., editors. Curricula for Human-Computer Interaction. ACM Special
Interest Group on Computer-Human Interaction, Curriculum Development Group, 1996.
19. Hoch, S., Althoff, F., McGlaun, G., and Rigoll, G. Bimodal Fusion of Emotional Data in
an Automotive Environment. In Proc. of the ICASSP 2005, IEEE Int. Conf. on Acoustics,
Speech, and Signal Processing. 2005.
20. Jiao, F., Li, S., Shum, H., and Schuurmanns, D. Face Alignment Using Statistical Models
and Wavelet Features. In Conference on Computer Vision and Pattern Recognition. 2003.
21. Joachims, T. Text Categorization with Support Vector Machines: Learning with Many
Relevant Features. Technical report, LS-8 Report 23, Dortmund, Germany, 1997.
22. Johnston, M. Unification-based Multimodal Integration. In Below, R. K. and Booker, L.,
editors, Proceedings of the 4th International Conference on Genetic Algorithms. Morgan
Kaufmann, 1997.
23. Krahmer, E. The Science and Art of Voice Interfaces. Technical report, Philips Research,
Eindhoven, Netherlands, 2001.
24. Langley, P., Thompson, C., Elio, R., and Haddadi, A. An Adaptive Conversational In-
terface for Destination Advice. In Proceedings of the Third International Workshop on
Cooperative Information Agents. Springer, Uppsala, Sweden, 1999.
25. Lee, C. M. and Pieraccini, R. Combining Acoustic and Language Information for Emo-
tion Recognition. In Proceedings of the International Conference on Speech and Lan-
guage Processing (ICSLP 2002). Denver, CO, 2002.

26. Lee, T. S. Image Representation Using 2D Gabor Wavelets. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(10):959–971, 1996.
27. Levin, E., Pieraccini, R., and Eckert, W. A stochastic Model of Human-Machine Interac-
tion for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing,
8(1):11–23, 2000.
28. Litman, D., Kearns, M., Singh, S., and Walker, M. Automatic Optimization of Dialogue
Management. In Proceedings of the 18th International Conference on Computational
Linguistics. Saarbrücken, Germany, 2000.
29. Maybury, M. T. and Stock, O. Multimedia Communication, including Text. In Hovy,
E., Ide, N., Frederking, R., Mariani, J., and Zampolli, A., editors, Multilingual Informa-
tion Management: Current Levels and Future Abilities. A study commissioned by the
US National Science Foundation and also delivered to European Commission Language
Engineering Office and the US Defense Advanced Research Projects Agency, 1999.
30. McGlaun, G., Althoff, F., Lang, M., and Rigoll, G. Development of a Generic Multi-
modal Framework for Handling Error Patterns during Human-Machine Interaction. In
SCI 2004, 8th World Multi-Conference on Systems, Cybernetics, and Informatics, Or-
lando, FL, USA. 2004.
31. McTear, M. F. Spoken Dialogue Technology: Toward the Conversational User Interface.
Springer Verlag, London, 2004. ISBN 1-85233-672-2.
32. Mehrabian, A. Communication without Words. Psychology Today, 2(4):53–56, 1968.
33. Nielsen, J. Usability Engineering. Academic Press, Inc., 1993. ISBN 0-12-518405-0.
34. Nogueiras, A., Moreno, A., Bonafonte, A., and Marino, J. Speech Emotion Recognition
Using Hidden Markov Models. In Eurospeech 2001 Poster Proceedings, pages 2679–
2682. Scandinavia, 2001.
35. Oviatt, S. Ten Myths of Multimodal Interaction. Communications of the ACM 42, 11:74–
81, 1999.
36. Oviatt, S., Cohen, P., Wu, L., Vergo, J., Duncan, L., Suhm, B., Bers, J., Holzman, T.,
Winograd, T., Landay, J., Larson, J., and Ferro, D. Designing the User Interface for
Multimodal Speech and Pen-based Gesture Applications: State-of-the-Art Systems and
Future Research Directions. Human Computer Interaction, 15(4):263–322, 2000.
37. Pantic, M. and Rothkrantz, L. Automatic Analysis of Facial Expressions: The State of
the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1424–
1445, 2000.
38. Pantic, M. and Rothkrantz, L. Toward an Affect-Sensitive Multimodal Human-Computer
Interaction. Proceedings of the IEEE, 91:1370–1390, September 2003.
39. Petrushin, V. Emotion in Speech: Recognition and Application to Call Centers. In Pro-
ceedings of the Conference on Artificial Neural Networks in Engineering (ANNIE ’99).
1999.
40. Picard, R. W. Affective Computing. MIT Press, Massachusetts, 2nd edition, 1998. ISBN
0-262-16170-2.
41. Pieraccini, R., Levin, E., and Eckert, W. AMICA: The AT&T Mixed Initiative Conver-
sational Architecture. In Proceedings of the Eurospeech ’97, pages 1875–1878. Rhodes,
Greece, 1997.
42. Reason, J. Human Error. Cambridge University Press, 1990. ISBN 0521314194.
43. Sadek, D. and de Mori, R. Dialogue Systems. In de Mori, R., editor, Spoken Dialogues
with computers, pages 523–562. Academic Press, 1998.
44. Schomaker, L., Nijtmanns, J., Camurri, C., Morasso, P., and Benoit, C. A Taxonomy
of Multimodal Interaction in the Human Information Processing System. Report of ES-
PRIT III: Basic Research Project 8579, Multimodal Interface for Advanced Multimedia
Interfaces (MIAMI). Technical report, 1995.

45. Schuller, B., Müller, R., Lang, M., and Rigoll, G. Speaker Independent Emotion Recog-
nition by Early Fusion of Acoustic and Linguistic Features within Ensembles. In Proceed-
ings of the ISCA Interspeech 2005. Lisboa, Portugal, 2005.
46. Schuller, B., Rigoll, G., and Lang, M. Hidden Markov Model-Based Speech Emo-
tion Recognition. In Proceedings of the IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP 2003), volume II, pages 1–4. 2003.
47. Schuller, B., Rigoll, G., and Lang, M. Speech Emotion Recognition Combining Acoustic
Features and Linguistic Information in a Hybrid Support Vector Machine - Belief Net-
work Architecture. In Proceedings of the IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP 2004), volume I, pages 577–580. Montreal, Que-
bec, 2004.
48. Schuller, B., Villar, R. J., Rigoll, G., and Lang, M. Meta-Classifiers in Acoustic and
Linguistic Feature Fusion-Based Affect Recognition. In Proceedings of the International
Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2005, volume 1, pages
325–329. Philadelphia, Pennsylvania, 2005.
49. Shneiderman, B. Designing the user interface: Strategies for effective human-computer
interaction (3rd ed.). Addison-Wesley Publishing, 1998. ISBN 0201694972.
50. Smith, W. and Hipp, D. Spoken Natural Language Dialog Systems: A Practical Approach.
Oxford University Press, 1994. ISBN 0-19-509187-6.
51. Tian, Y., Kanade, T., and Cohn, J. Evaluation of Gabor-wavelet-based Facial Action Unit
Recognition in Image Sequences of Increasing Complexity. In Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, pages 229–
234. May 2002.
52. Turk, M. and Pentland, A. Face Recognition Using Eigenfaces. In Proc. of Conference
on Computer Vision and Pattern Recognition, pages 586–591. 1991.
53. van Zanten, G. V. User-modeling in Adaptive Dialogue Management. In Proceedings of
the Eurospeech ’99, pages 1183–1186. Budapest, Hungary, 1999.
54. Ververidis, D. and Kotropoulos, C. A State of the Art Review on Emotional Speech
Databases. In Proceedings of the 1st Richmedia Conference, pages 109–119. Lausanne,
Switzerland, 2003.
55. Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools with Java
Implementations. Morgan Kaufmann, San Francisco, CA, 2000. ISBN 1-558-60552-5.
56. Wu, L., Oviatt, S., and Cohen, P. Multimodal Integration - A Statistical Review. IEEE Transactions on Multimedia, 1(4):334–341, 1999.
57. Young, S. Probabilistic Methods in Spoken Dialogue Systems. Philosophical Transac-
tions of the Royal Society, 358:1389–1402, 2000.
Chapter 5

Person Recognition and Tracking


Michael Hähnel (Sect. 5.1, 5.2) and Holger Fillbrandt (Sect. 5.3)

Automatic person recognition denotes the process of determining the identity of a person by using stored information about a group of subjects. In general,
recognition can be categorized into authentication (or verification) and identification
processes. For authentication, a person has to claim his identity which is then
verified using the previously stored information about this person (e.g. PIN number,
face image). This is called a 1 : 1 comparison. Identification is performed in those
applications where an unknown person needs to be identified from a set of people
without any claimed identity (1 : n comparison).
Authentication systems have made everyday life easier in many ways. With a
small plastic card one can get cash at nearly any corner of a city or even directly pay
for goods and services. Access to this source of financial convenience, however, must
be protected against unauthorized use. Information exchange via internet and email
made it necessary to personalize information which needs to be protected against
unauthorized access as well. Passwords and PIN codes are needed to gain access to
mail accounts or personalized web pages. Just like the world wide web also local
computer networks are secured using passwords.
Apart from security aspects authentication also has an essential role in man-
machine interaction. To make applications adaptive to the context of use, it is es-
sential to know the user, his whereabouts, and his activities. Person tracking in com-
bination with person authentication must be applied to acquire suitable information
for interactive as well as assisted systems.
Identification is usually performed in surveillance applications where the persons
to be recognized do not actively participate in the recognition process. The methods
discussed in this section can be used for identification as well as for authentication
purposes. Therefore, when referring to the term “recognition”, it can be substituted
by either identification or authentication.
Recognition methods can be put into three categories: token-based (keys,
passport, cards), knowledge-based (passwords, PINs), and biometric (Fig. 5.1).
Knowledge-based recognition is based on information which is unique and identi-
fies a person when fed into the identification system. Token-based methods imply
the possession of an identifying object like a key, a magnetic card or a passport. The
advantages of authentication systems following these approaches are their relatively simple implementation, their ease of use, and their widespread application. However, tokens can be lost or stolen, and passwords and PINs might be forgotten. An

access to a system is then not possible anymore or at least requires additional effort
and causes higher costs.

Fig. 5.1. Biometric and non-biometric (token-based and knowledge-based) person recognition methods. In the following two sections face recognition and camera-based full-body recognition are discussed.

Hence, in the last years much attention has been paid to biometric systems which
are either based on physiological features (face, hand shape and hand vein pattern,
fingerprint, retina pattern) or depend on behavioral patterns that are unique for each
person to be recognized (gait, signature stroke, speech).
These systems are mostly intrusive because a minimum of active participation of the person to be identified is required: one has to place a thumb on the fingerprint sensor or look into a camera or scanner so that the necessary biometric features can be extracted properly. This condition is acceptable in many applications.
Non-intrusive recognition can be applied for surveillance tasks where subjects usu-
ally are observed by cameras. In this case, it is not desired that the persons participate
in the recognition process.
The second advantage of non-intrusive over intrusive recognition is the increased
comfort of use of such systems, e.g. if a person can be identified while heading to-
wards a security door, the need for putting the thumb on the fingerprint sensor or the
insertion of a magnetic card into a slot would be unnecessary.
Today, identification of a person from a distance is possible only by gait recognition, though this biometric method is also still a research issue. Face recognition has the potential to fulfill this task as well, but further research needs to be done to develop

reliable systems based on face recognition at a glance. This section cannot address all aspects of recognition; however, selected topics of special interest are discussed.
Face recognition has evolved to commercial systems which control the access to
restricted areas e.g. in airports or high-security rooms or even computers. Though
being far from reliably working under all real world conditions, many approaches
have been developed to at least guarantee a safe and reliable functioning under well
determined conditions (Sect. 5.1).
A new non-biometric method for non-intrusive person recognition is based on
images of the full body, in applications where it can be assumed that clothing is not
changed. The big variety in colors and patterns makes clothing a valuable indicator
that adds information to an identification system even if fashion styles and seasonal
changes may affect the condition of “constant clothing”. An approach to this token-
based method is discussed in Sect. 5.2.
Like full-body recognition, gait recognition also depends on a proper detection and segmentation of persons in a scene, which is a main issue of video-based person tracking. Hence, person tracking can also be considered as a preprocessing step for
these recognition methods. Besides that, determining the position of persons in an
observed scene is a main task for surveillance applications in buildings, shopping
malls, airports and other public places and is hence discussed as well (Sect. 5.3).

5.1 Face Recognition


For humans, the face is the most discriminating part of the body used to determine a person's identity. Automatic face recognition, however, is still far away from human
performance in this task. As many factors influence the appearance of the face, an
overall approach to model, detect and recognize faces still could not be found even
after 30 years of research. For many applications, however, it is not necessary to deal
with all factors. For those applications, commercial systems are already available, but
they are highly intrusive and require active participation in the recognition process.
Sect. 5.1.1 will outline factors influencing the recognition rate which need to
be considered when developing a face recognition system. The general structure
of a face recognition system is outlined in Sect. 5.1.2. Known approaches to au-
tomatic face recognition will be categorized in Sect. 5.1.3. Afterwards the popular
“eigenface” approach (Sect. 5.1.4) and a component-based approach to face recog-
nition (Sect. 5.1.5) are described in more detail. This section will close with a brief
overview of the most common face databases that are publicly available.

5.1.1 Challenges in Automatic Face Recognition

The challenge of (non-intrusive) automatic face recognition is to decrease the


influence of factors that reduce the face recognition performance. Some factors may
have lower priority (e.g. illumination changes when a quite constant illumination
through lighting is available), some can be very important (recognition when only
a single image per person could be trained) depending on the application and

environment. These factors can be categorized into human and technical factors
(Fig. 5.2).

Fig. 5.2. Factors mostly addressed by the research community that affect face recognition performance and their categorization into technical and human (biological and behavioral) factors. Additionally, fashion (behavioral) and twin faces (biological) must be mentioned.

Technical factors arise from the technology employed and the environment in
which a recognition system is operated. Any image-based object recognition task has
to deal with these factors. Their influence can often easily be reduced by choosing
suitable hardware, like lighting (for illumination) and cameras (for illumination and
resolution).
1. Image resolution: The choice of the image resolution is a trade-off between the desired processing performance (⇒ low resolution) and the separability of the features used for recognition (⇒ high resolution). The highest resolution that still meets the given time restrictions should be used.
2. One-sample problem: Many methods can extract good descriptions of a person’s
face out of a bunch of images taken under different conditions, e.g. illuminations,
facial expressions and poses. However, several images of the same person are
often not available. Sometimes a few, maybe even only one, sample image can

be used to train a recognition system. In these cases, generic a-priori knowledge


(e.g. head models, databases of face images of other persons) needs to be added
to model possible variations.
3. Illumination: Light sources from different directions and different intensities as
well as objects between a light source and the observed face let the same face
appear in many ways due to cast shadows and (over-)illuminated regions. As a consequence, important facial features or even the whole face itself may not be detected, or the extracted features no longer describe the face properly, leading to a misclassification or a
rejection of the face image. Hence, two approaches of considering illumination
can be followed to solve these problems: (1) use illumination invariant features
and classifiers or (2) model the illumination changes and preprocess the image
considering this model. Such a model has to integrate a lot of a-priori knowledge
taken from images of faces of varying identity and illumination, maybe even
under different poses and showing several facial expressions. These images are
often not available (one-sample problem).
Human factors influence the recognition process due to the special characteristics
and properties of the face as object to be recognized. They can be sub-divided into
behavioral and biological factors.
1. Behavioral factors
As the human head is not a steady and rigid object it can undergo certain changes
due to (non-)intentional movements and augmentation with objects like e.g.
glasses.
a) Facial expressions: Beside speech, different facial expressions are the most
important information source for communication through the face. How-
ever, the ability to change the appearance of the face poses a big problem for image processing and recognition methods. Dozens of facial muscles allow
a very subtle movement of the skin so that wrinkles and bulges appear and
important texture regions move or are distorted.
b) Head pose: Extensive changes in appearance are due to 3D head rotation.
This does not only lead to textural changes but also to geometrical changes,
as the human head is not rotationally invariant. When the face is rotated,
important regions of the (frontal) face are not visible anymore. The more a head is rotated, the more of one half of the face becomes invisible and the more valuable information for the recognition process is lost.
c) Occlusion: A reduction of exploitable information is often the consequence
of occlusion by objects, e.g. scarfs, glasses, caps. Either the mouth region
or even the more important eye region is covered in part and does not allow
the extraction of features anymore. However, occlusion can be handled by
local approaches for face recognition (Sect. 5.1.3). The variation of occlu-
sion is infinite because the appearance of the occluding objects (e.g. glasses,
newspapers, caps) is usually not known.
d) Fashionable appearance changes: Intentional changes of appearance due
to e.g. make-up, mustaches, or beards can complicate feature detection and

extraction if these textural changes are not considered by the detection algo-
rithms.
2. Biological factors
a) Aging: Changes of facial appearance due to aging and their modeling are not
well studied yet. Main changes of the skin surface and slight geometrical
changes (considering longer time spans) cause problems in the recognition
process due to strong differences in the extracted features.
b) The twin-problem: The natural “copy” of a face to be identified is the twin’s
face. Even humans have difficulties distinguishing between two unknown twins, and so do automatic systems. However, humans are able to recognize twins if they know them well.
In contrast to a non-intrusive face recognition system that has to deal with most of the above mentioned factors, an intrusive system can neglect many of these fac-
tors. If a so called “friendly user” can be asked either to step into an area with con-
trolled illumination or to look straight into the camera, it is unnecessary to handle
strong variations in e.g. illumination and pose changes. Operational conditions can
often reduce the need for more sophisticated algorithms by setting up suitable hard-
ware environments. However, friendly users usually can not be found in surveillance
tasks.

5.1.2 Structure of Face Recognition Systems

This section will discuss the basic processing steps of a face recognition system
(Fig. 5.3). Referring to chapter 2.1, the coarse structure of an image-based object
recognition system consists of an (image) acquisition, a feature extraction and a
feature classification stage. These stages can be broken up into several steps: While
acquisition, suitable preprocessing, face detection and facial feature extraction are described in more detail in chapter 2.2, feature extraction, feature transformation, classification and combination will be discussed in the following.

Acquisition
During acquisition the camera takes snapshots or video streams of a scene con-
taining one or more faces. Already at this stage it should be considered where
to set up a camera. If possible, one should use more than one camera to get a
couple of views of the same person either to combine recognition results or to
build a face model that is used to handle different influencing factors like pose
or illumination. Besides positioning the camera, the location for suitable lighting
should be considered as well. If a constant illumination situation can be provided
in the enrollment phase (training) as well as in the test phase (classification), less
effort needs to be put into modeling illumination changes and the accuracy of
the system will increase. If possible, the camera parameters like shutter speed,
amplification and white balance should be adaptively controlled by a suitable

Fig. 5.3. General structure of a face recognition system: acquisition (acquisition, preprocessing), feature extraction (face detection, facial feature detection, feature extraction, feature transformation) and feature classification (classification, combination) leading to the recognition result.

algorithm that evaluates the current illumination situation and adapts the parame-
ters accordingly. For a more detailed discussion of this issue, please refer to Sect. 2.2.

Preprocessing
In the preprocessing stage usually the images are prepared on a low-level basis to be
further processed in other stages on a high-level basis. Preprocessing steps include
e.g. (geometrical) image normalization (e.g. Sect. 5.1.4), the suitable enhancement
of brightness and contrast, e.g. brightness planes or histogram equalization [45] [52]
or image filtering.
This preprocessing always depends strongly on the subsequent processing step.
For the face detection stage, it might be useful to determine face candidates, i.e.
image regions that might contain a face, on a low-level basis. This can e.g. be
achieved using skin color histograms (Sect. 2.1.2.1). For the facial feature extraction

step, a proper gray-level or color normalization is needed (Sect. 2.2).
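
To make such a low-level preprocessing step concrete, the following sketch implements a plain histogram equalization for an 8-bit grey-level image with NumPy. It is only an illustration under the assumption that images are available as NumPy arrays; it is not the specific preprocessing chain of the macros described later in the exercises.

import numpy as np

def equalize_histogram(gray):
    # Histogram equalization for an 8-bit grey-level image (H x W uint8 array).
    hist, _ = np.histogram(gray.ravel(), bins=256, range=(0, 256))
    cdf = hist.cumsum()                          # cumulative grey-level distribution
    cdf_min = cdf[cdf > 0][0]                    # first non-empty bin
    denom = cdf[-1] - cdf_min
    if denom == 0:                               # completely flat image: nothing to equalize
        return gray.copy()
    lut = np.round((cdf - cdf_min) / denom * 255.0).clip(0, 255).astype(np.uint8)
    return lut[gray]                             # apply the look-up table pixel-wise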

Face Detection
Face detection is a wide field of research and many different approaches with creditable results have been developed [5] [58] [50]. However, it is difficult to get these methods to work under real-world conditions. As all face detection algorithms are controlled by several parameters, generally suitable parameter settings can usually not be found. Additional image preprocessing is needed to obtain a working system that performs fast (maybe even in real time) and reliably, even under different kinds of influencing factors such as resolution, pose and illumination (Sect. 5.1.1). The result of most face detection algorithms is a set of rectangles containing the face area. Usually a good estimate of the face size, which is needed for normalization and facial feature detection, can be derived from these rectangles.
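
As an illustration of how such rectangles are typically obtained in practice, the sketch below uses the Viola-Jones cascade detector shipped with OpenCV. This is only one possible off-the-shelf detector, not one of the approaches cited above, and the parameter values are example settings that normally have to be tuned per application — which is exactly the parameter problem mentioned in the text.

import cv2

def detect_face_rectangles(gray_image, cascade_file="haarcascade_frontalface_default.xml"):
    # Returns a list of (x, y, width, height) rectangles containing face candidates.
    detector = cv2.CascadeClassifier(cascade_file)
    # scaleFactor and minNeighbors are typical example values and usually need tuning.
    return detector.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)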

Facial Feature Detection


This processing step is only necessary for recognition methods using local infor-
mation (Sect. 5.1.3). Also, normalizing the size of images is often done by making use of the distance between the eyes, which needs to be detected for that purpose (Sect. 5.1.4). A holistic approach which performs feature detection quite well is that of Active Appearance Models (AAMs) [13]. With AAMs, however, the compromise
between precision and independence from e.g. illumination needs to be accepted.
For further details, please refer to Sect. 2.2.

Feature Extraction
Face recognition research usually concentrates on the development of feature ex-
traction algorithms. Besides a proper preprocessing, feature extraction is the most
important step in face recognition because a description of a face is built which must
be sufficient to separate the face images of one person from the images of all other
subjects.
Substantial surveys about face recognition are available, e.g. [60]. Therefore,
we refrain from giving an overview of face recognition methods and do not discuss
all methods in detail because much literature can be found for the most promising
approaches. Some feature extraction methods for face recognition based on local regions are, however, briefly outlined in Sect. 5.1.5. Those methods were originally applied globally, i.e. on the whole face image.

Feature Transformation
Feature extraction and feature transformation are very often combined in one single
step. The term feature transformation, however, describes a set of mathematical
methods that try to decrease the within-class distance while increasing the between-
class difference of the features extracted from the image. Procedures like e.g. PCA
(principal component analysis, Sect. 5.1.4) transform the feature space to a more

suitable representation that can better be processed by a classifier.

Classification
The result of a recognition process is determined in the classification stage in
which a classification method compares a given feature vector to the previously
stored training data. Simple classifiers like the k-nearest-neighbor classifier (Sect.
5.1.4) are often surprisingly powerful, especially when only one sample per class is
available. More sophisticated methods like neural networks [6] or Support Vector
Machines [49] can determine more complex, non-linear feature space subdivisions
(class borders) which are often needed if more than one sample per class was trained.

Combination
A hybrid classification system can apply different combination strategies which in
general can be categorized in three groups:
• Voting:
The overall result is given by the ID which was rated best by the majority of all
single classifiers.
• Ranking:
If the classifiers can build a ranking for each test image then the ranks for each
class are added. Winner is the class with the lowest sum.
• Scoring:
If the rankings are annotated with a score that integrates information about
how similar a test image is in comparison to the closest trained image of each
class, more sophisticated combination methods can be applied that combine these
scores to an overall score. The winner class is the one with the best score. Recent
findings can be found in [33].
The input to a combination algorithm can be the result either of classifications
of different facial parts or the result of different types of recognition algorithms and
classifiers (e.g. local and hybrid recognition methods, Sect. 5.1.3). It was argued
in [39] that, e.g. PCA-based methods (global classifiers) should be combined with
methods extracting local information. PCA-based methods are robust against noise
because of estimating low-order modes with large eigenvalues, while local classifiers
better describe high-order modes that amplify noise but are more discriminant con-
cerning different persons. Fig. 5.4 shows some Eigenfaces (extracted by a PCA on
gray-level images) ordered by their eigenvalues. The upper Eigenfaces (high eigen-
values) seem to code only basic structure, e.g. the general shape of a face, while in the Eigenfaces in the lower part of Fig. 5.4 structures can be identified that are more specific to certain persons. The errors of the two methods (global and local) tend to cancel each other out, while the correct result is amplified. In general, a higher recognition
rate can be expected when different kinds of face recognition algorithms are com-
bined [60].
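
The voting and ranking strategies described above can be sketched in a few lines of Python. The sketch assumes that every single classifier either returns one winning class ID (for voting) or a complete ranking of all class IDs (for rank-sum combination); it is an illustration, not the combination scheme of any particular system discussed here.

from collections import Counter

def combine_by_voting(single_results):
    # single_results: one winning class ID per classifier; the majority wins.
    return Counter(single_results).most_common(1)[0][0]

def combine_by_ranking(rankings):
    # rankings: one list of class IDs per classifier, ordered from best (rank 0) to worst.
    # The class with the lowest rank sum over all classifiers wins.
    rank_sums = {}
    for ranking in rankings:
        for rank, class_id in enumerate(ranking):
            rank_sums[class_id] = rank_sums.get(class_id, 0) + rank
    return min(rank_sums, key=rank_sums.get)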

Fig. 5.4. Visualization of a mean face (upper left corner) and 15 Eigenfaces. The corresponding
eigenvalues are decreasing from left to right and top to bottom. (from [47]). Eigenfaces with
high eigenvalues only code general appearance information like the face outline and global
illumination (upper row).

5.1.3 Categorization of Face Recognition Algorithms

Face recognition algorithms can be categorized into four groups:


• Geometry-based methods depend only on geometrical relations of different
facial regions. These methods were developed in the early stages of face
recognition research [27]. However, it could be proven that geometrical features
are not sufficient when used exclusively [8] [21]. Today these kinds of methods are no longer relevant. Experiments of Craw et al. [14] could show that
shape-free representations are even more suitable than representations including
geometrical information.

• Global or holistic methods use the whole image information of the face at once.
Here geometrical information is not used at all or just indirectly through the dis-
tribution of intensity values (e.g. eyes in different images). The most prominent
representative of this class of algorithms is the eigenface approach [48] which
will be discussed in Sect. 5.1.4.

• Local feature (analysis) methods only consider texture information of regions in


the face without directly integrating their geometrical information, e.g. [39] [51].
The local regions contain just a few pixels but are distinctive for the specific face

image. These methods include an algorithm to detect distinctive regions which


is applied during (a) enrollment (training) to gain a proper description of the
face and (b) during testing to find corresponding regions in the test image that
are similar to regions previously described in the training stage.

• Hybrid algorithms combine geometrical and appearance information to recog-


nize a face. These methods usually incorporate some kind of local information,
in the sense that they use appearance information from facial regions with a
high entropy regarding face recognition (e.g. eye region), while disregarding re-
gions with poor informational content i.e. homogeneous textures (e.g. forehead,
cheeks). Additionally, geometrical information about the relation of the locations where features were extracted is incorporated in the classification process. The
use of local information reduces the influence of illumination changes and facial
expressions and can handle occlusion if detected. Most modern recognition algo-
rithms like Elastic Bunch Graph Matching [28] [55] and component-based face
recognition [26] follow the hybrid approach.

5.1.4 Global Face Recognition using Eigenfaces

The eigenface approach [44] [48] is one of the earliest methods of face recognition
that uses appearance information. It originally was applied globally and, though
it lacks stability and good performance under general conditions, it is still a state-of-the-art method, which is often used as a reference for new classification
methods.

The following sections will describe the eigenface approach considering the basic
structure of face recognition algorithms discussed in Sect. 5.1.2.
Preprocessing
The eigenface approach shows bad performance if the training and input images do
not undergo a preprocessing that normalizes them in a way that only keeps the usable
information in the face area of the image.
For this normalization step, the face and the position of different facial features
must be detected as described in 2.2. The coordinates of the eye centers and the nose
tip are sufficient for a simple yet often applicable normalization (Fig. 5.5).
In a first step, the image must be rotated by an angle α to make sure the eye
centers have the same y-coordinate (Fig. 5.5b). The angle α is given by:
\alpha = \arctan\left( \frac{y_\text{lefteye} - y_\text{righteye}}{x_\text{lefteye} - x_\text{righteye}} \right)
where (xlefteye , ylefteye ) and (xrighteye , yrighteye ) are the coordinates of the left
and right eye, respectively.
Then the image is scaled in the directions of x and y independently (Fig. 5.5 (c)).
The scaling factors are determined by:

Fig. 5.5. Face image normalization for recognition using Eigenfaces. (a) original image, (b)
rotated image (eyes lie on a horizontal line), (c) scaled image (eyes have a constant distance;
the nose has a constant length), (d) a mask with previously defined coordinates for the eyes
and the nose tip to which the image is shifted, (e) normalized image.

\text{scale}_x = \frac{x_\text{lefteye} - x_\text{righteye}}{d_\text{eyes}}, \qquad
\text{scale}_y = \frac{y_\text{eye} - y_\text{nose}}{d_\text{nose}}

where x_lefteye and x_righteye denote the x-coordinates of the eyes, and y_eye (y_nose) is the y-coordinate of the eyes (nose tip) after the rotation step. The distance between the eyes d_eyes and the nose length d_nose must be constant for all considered
images and must be chosen a-priori. Cropping the image, shifting the eyes to a
previously determined position, and masking out the border regions of the face
image (using a mask like in Fig. 5.5d) leads to an image that contains only the
appearance information of the face (Fig. 5.5e). All training and test images must be
normalized to yield images that have corresponding appearance information at the
same positions.
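
The rotation and scaling parameters defined above can be computed directly from the annotated feature coordinates, for instance as in the following NumPy sketch. The target distances d_eyes and d_nose are example values only; applying the actual rotation, scaling, cropping and masking to the image is left to an image processing library and is therefore not shown.

import numpy as np

def normalization_parameters(left_eye, right_eye, nose_tip, d_eyes=40.0, d_nose=45.0):
    # left_eye, right_eye, nose_tip: (x, y) coordinates of the detected facial features.
    lx, ly = left_eye
    rx, ry = right_eye
    # rotation angle that brings both eye centers onto the same image row
    alpha = np.arctan2(ly - ry, lx - rx)

    def rotate(point, center, angle):
        # rotate a point by -angle around 'center' (undoes the measured head tilt)
        c, s = np.cos(angle), np.sin(angle)
        v = np.asarray(point, float) - np.asarray(center, float)
        return np.array([c * v[0] + s * v[1], -s * v[0] + c * v[1]]) + np.asarray(center, float)

    le, no = rotate(left_eye, right_eye, alpha), rotate(nose_tip, right_eye, alpha)
    re = np.asarray(right_eye, float)            # rotation center stays fixed
    scale_x = abs(le[0] - re[0]) / d_eyes        # horizontal scale from the eye distance
    scale_y = abs(no[1] - le[1]) / d_nose        # vertical scale from the nose length
    return alpha, scale_x, scale_y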

The “face space”


The eigenface approach considers an image as a point in a high-dimensional data
space. As faces in general have a similar appearance, it is assumed that face images

are not uniformly distributed over the whole image space but are concentrated in a
lower dimensional “face space” (Fig. 5.6).


Fig. 5.6. Simplified illustration of an image space. The face images are located in a sub-space
of the whole image space and form the face space.

The basis of this face space can be used to code the images using their coordi-
nates. This coding is a unique representation for each image. Assuming that images
of the same person are grouped together allows to apply a nearest-neighbor classifier
(Sect. 5.1.4) or more sophisticated classification methods.

Feature Extraction and Transformation


To obtain the basis of the face space, a set of N training images Ii(x, y), i = 1, . . . , N, of width W and height H is needed. Each of the 2D-images must be sampled
row-wise into a one-dimensional vector:

Ii (x, y) −→ Ji (5.1)

All Ji have the dimension W · H. This is the simplest feature extraction pos-
sible. Just the grey-levels of the image pixels are considered. Modified approaches
preprocess the images, e.g. by filtering with Gabor functions3 [10].
3 A Gabor function is given by g(x) = \cos(kx) \cdot e^{-\frac{k^2 x^2}{2\sigma^2}}. This is basically a cosine wave combined with an enveloping exponential function limiting the support of the cosine wave; k determines the direction of the wave and |k| its frequency in the two-dimensional space.

Now applying the principal component analysis (PCA or Karhunen-Loeve ex-


pansion) on the Ji will lead to a set of orthogonal Eigenvectors which constitute
the basis of the images. First, the mean of the feature vectors (here the Ji ) must be
calculated by:
\Psi = \frac{1}{N} \sum_i J_i \qquad (5.2)
This mean value is then used to determine the covariance matrix C:
C = \frac{1}{N} \sum_i (J_i - \Psi)^T (J_i - \Psi) \qquad (5.3)

Applying PCA to C will yield a set of Eigenvectors e_k. The corresponding eigenvalue \lambda_k of each e_k can be obtained by:
\lambda_k = \frac{1}{N} \sum_i \left( e_k^T (J_i - \Psi) \right)^2 \qquad (5.4)

Considering only a subset of these Eigenvectors yields the lower-dimensional basis


of the face space. Which Eigenvectors to consider depends on the value of the
eigenvalues. The larger the eigenvalue the larger is the variance of the data in the
direction of the corresponding Eigenvector. Therefore only the Eigenvectors with the
largest eigenvalues are used because smaller eigenvalues correspond to dimensions
mainly describing noise.

Fig. 5.4 (see page 200) shows a visualization of a set of these Eigenvectors. Be-
cause they have a face-like appearance, they were called “eigenpictures” [44] or
Eigenfaces [48]. Belhumeur [4] suggested to discard some of the most significant
Eigenfaces (high eigenvalues) because they mainly seem to code variations in illu-
mination but no significant differences between subjects which are coded by Eigen-
faces with smaller eigenvalues. This leads to some stability, at least when global
illumination changes appear.
There are basically two strategies for determining the number of Eigenvectors
to be kept: (1) simply define a constant number Ñ of Eigenvectors (Ñ ≤ N )4 or (2)
define a minimum amount of information to be kept by determining an information
quality measure Q:

Q = \frac{\sum_{k=1}^{\tilde{N}} \lambda_k}{\sum_{k=1}^{N} \lambda_k}

Eigenvalues must be added to the numerator until Q is bigger than the previously
determined threshold Qthreshold . Hereby, the number Ñ of dimensions is deter-
mined. Commonly used thresholds are Qthreshold = 0,95 or Qthreshold = 0,98.
4 Turk and Pentland claim that about 40 Eigenfaces are sufficient [48]. This, however, strongly depends on the variations (e.g. illumination, pose) in the images to be trained. The more variation is expected, the more Eigenfaces should be kept. The optimal number of Eigenfaces can only be determined heuristically.

The choice of Qthreshold is often determined by finding a compromise between


processing performance (low Qthreshold ) and recognition rate (high Qthreshold ).

After calculating the PCA and performing dimension reduction, we get a basis B = {e_1, . . . , e_Ñ} of Eigenvectors and corresponding eigenvalues B_λ = {λ_1, . . . , λ_Ñ}. A new image Ĩ(x, y) −→ J̃ is then projected into the face space determined by the basis B. The projection onto each dimension of the face space basis is given by:
w_k = e_k^T (\tilde{J} - \Psi) \qquad (5.5)
for k = 1, . . . Ñ . Each wk describes the contribution of eigenface ek to the currently
examined image J̃. The vector W = (w_1, . . . , w_Ñ) is hence a representation of
the image and can be used to distinguish it from other images.
The application of the PCA and the following determination of the wk can be
regarded as a feature transformation step which transforms the original feature
vector (i.e. the grey-level values) into a differently coded feature vector containing
the wk .
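
A compact NumPy sketch of Eqs. (5.2)-(5.5) is given below. Instead of building the very large W·H x W·H covariance matrix explicitly, it uses the singular value decomposition of the centered training data, which yields the same Eigenvectors and eigenvalues; the parameter q_threshold corresponds to Q_threshold above. This is an illustrative sketch, not the implementation used in the exercises.

import numpy as np

def train_eigenfaces(train_images, q_threshold=0.95):
    # train_images: (N, W*H) array with one row-sampled face image per row (Eq. 5.1).
    J = np.asarray(train_images, dtype=float)
    psi = J.mean(axis=0)                              # mean face, Eq. (5.2)
    A = J - psi                                       # centered data
    # SVD of A yields the Eigenvectors of the covariance matrix (Eq. 5.3) as rows of vt
    _, s, vt = np.linalg.svd(A, full_matrices=False)
    eigenvalues = (s ** 2) / J.shape[0]               # eigenvalues lambda_k, cf. Eq. (5.4)
    q = np.cumsum(eigenvalues) / eigenvalues.sum()    # information quality Q per cut-off
    n_tilde = int(np.searchsorted(q, q_threshold)) + 1
    basis = vt[:n_tilde]                              # the N_tilde Eigenfaces e_k
    weights = A @ basis.T                             # projections w_k of the training set, Eq. (5.5)
    return psi, basis, weights

def project_into_face_space(image_vector, psi, basis):
    # Project a new row-sampled image into the face space (Eq. 5.5).
    return basis @ (np.asarray(image_vector, dtype=float) - psi)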

Classification
A quite simple yet often very effective algorithm is the k-nearest-neighbor classifier
(kNN), especially when just a few feature vectors of each class are available. In
this case, training routines of complex classification algorithms do not converge
or do not achieve better recognition results (than a kNN) but are computationally
more expensive. Introductions to more complex classification methods like Support
Vector Machines or neural networks can be found in the literature [49] [16].

The k-nearest-neighbor classifier determines the k trained feature vectors which


lie closest to the examined feature vector. The winner class is determined by voting,
e.g. selecting the class to which the most of the k feature vectors belong. The prop-
erty “closest” is defined by a distance measure that needs to be chosen depending
on the structure of the feature vector distribution, i.e. the application. Training in
the sense of calculating class borders or weights like with neural networks is not
performed here. It basically just consists of storing the feature vectors and their
corresponding class IDs5 .

If W = (wi ) is the feature vector to be classified and Ti = (ti ), i = 0, . . . , N


are the trained (stored) vectors, commonly used distance measures d are:
1. Angle: d_\text{angle} = \frac{W \cdot T_i}{|W| \cdot |T_i|} = \cos(\alpha)
2. Manhattan distance (L_1-norm): d_\text{Manh.} = \sum |w_i - t_i|
3. Euclidean distance (L_2-norm): d_\text{Eukl.} = \left( \sum |w_i - t_i|^2 \right)^{\frac{1}{2}}
4. Mahalanobis distance: d_\text{Mahal.} = \left( (W - T_i)^T C^{-1} (W - T_i) \right)^{\frac{1}{2}}

5
The term “class ID” is usually used for a unique number that acts as a label for each class
(person) to be identified.

where C describes the covariance matrix of the Ti .

Fig. 5.7 exemplifies the functionality of a kNN classifier using the Euclidean dis-
tance. The vectors which were trained are drawn as points into the two-dimensional
feature space while the vector to be classified is marked with a cross. For each of the
training samples the corresponding class is stored (only shown for classes A, B, and
C). Determining the training sample that is closest to the examined feature vector
determines the class which this vector is assigned to (here class “A”, as dA is the
smallest distance). Usually k = 1 is used because exact class assignment is needed
in object recognition tasks. The class ID is then determined by:
class ID = argminp {d(W, {Ti })}
For k > 1 an ordered list is built containing the class IDs of the k closest feature
vectors in increasing order.
class IDs = {a, b, . . .}
with d(W, Ta ) ≤ d(W, Tb ) ≤ · · · ≤ d(W, Ti ) and ||class IDs|| = k. From these k results the winner class must be determined. Usually the voting scheme is applied,
however, other combination strategies can be used as well (Sect. 5.1.2).
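
A minimal k-nearest-neighbor classifier over the projected feature vectors, using the Euclidean distance and the voting scheme described above, might look as follows. It is an illustrative sketch only; the stored vectors and class IDs are assumed to come from the training stage.

import numpy as np
from collections import Counter

def knn_classify(w, trained_vectors, trained_ids, k=1):
    # w: feature vector to classify; trained_vectors: (N, dim) array of stored vectors;
    # trained_ids: one class ID per stored vector.
    distances = np.linalg.norm(np.asarray(trained_vectors) - np.asarray(w), axis=1)
    nearest = np.argsort(distances)[:k]               # indices of the k closest training samples
    votes = Counter(trained_ids[i] for i in nearest)  # voting over their class IDs
    return votes.most_common(1)[0][0]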


Fig. 5.7. Two-dimensional feature space. A 1NN classifier based on the Euclidean distance
measure will return the point A as it is the closest to the point to be classified (marked with a
cross). Class borders between A, B, and C are marked with dotted lines.

5.1.5 Local Face Recognition based on Face Components or Template Matching

Psychological experiments showed that human perception of a face is determined by


face parts and not by the whole face [43]. Additionally, the good results of spatially restricted feature extraction were demonstrated with different face recognition approaches (e.g. [26] [30]).

Besides modelling possible facial expressions, a common approach to achieve


facial expression invariance is to ignore the face regions that are likely to change
with different facial expressions, but contain no significant information that can be
used for recognition (e.g. cheek, forehead, lips). Therefore, only certain face regions
(face components) are considered. Experiments showed that especially the eye re-
gion contains information that codes the identity of a person [1]. Also other facial
regions can be used for recognition. However, their influence on the overall recog-
nition result should be limited by weighting the different facial regions according to
their importance [1] [30].
It was further argued that component-based approaches can better cope with
slight rotations than holistic approaches because variations within a component are
much smaller than the overall variation of an image of the whole face. The reason is,
that the positions of the components can individually be adapted to the rotated face
image. In the whole face image the face regions can be regarded as rigidly connected,
and hence, can not be subject to adaptation.

Fig. 5.8. Overview of a component-based face recognition system. For each component, a
feature extraction (F.E.) and a classification step (class.) is separately performed.

Fig. 5.8 shows an overview of a component-based face recognition system. In


the input image, face contours must be detected. From distinctive contour points the
position of the components can be determined and the image information of the com-
ponents can be extracted. For each component, feature extraction and classification is

performed separately. The single classification results are finally combined, leading
to an overall classification result.
Component Extraction
As a preprocessing step for component-based face recognition, facial regions must
be detected. Suitable approaches for the detection of distinctive facial points and
contours were already discussed in chapter 2.2.
As soon as facial contours are detected, facial components can be extracted by
determining their position according to distinctive points in the face (Fig. 5.9).


Fig. 5.9. Face component extraction. Starting from mesh/shape (thin white line in (a)) that
is adapted to the input image, the centers of facial components (filled white dots) can be
determined (b) and with pre-defined size the facial components can be extracted (white lines
in (c)). The image was taken from the FERET database [41].

The centers c_j(x, y) of the facial components are either determined by a single contour point, i.e. c_j(x, y) = p_k(x, y) (e.g. for the corners of the mouth or the nose
root) or are calculated using the points of a certain facial contour (e.g. the center of
the points of the eye or nose contour) with:
c_j(x, y) = \frac{1}{\tilde{n}} \sum_{i \in C} p_i(x, y)

where C is a contour containing the ñ distinctive points pi (x, y) of e.g. the eyes.
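
In code, determining a component center and cutting out the corresponding image region amounts to little more than averaging contour points and slicing the image. The following is a hedged sketch with fixed, pre-defined component sizes and without any boundary handling:

import numpy as np

def component_center(contour_points):
    # Mean of the distinctive contour points of one facial contour (e.g. an eye).
    return np.asarray(contour_points, dtype=float).mean(axis=0)

def extract_component(image, center, size):
    # Cut a size x size patch around the component center (no boundary checks).
    cx, cy = int(round(center[0])), int(round(center[1]))
    half = size // 2
    return image[cy - half: cy - half + size, cx - half: cx - half + size]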

Heisele et al. determined 10 face components to be the most distinctive (Fig.


5.10). Especially the eye and eye brow components were proven to contain the most important information [26].

Fig. 5.10. 10 most distinctive face components determined by [26].

Feature Extraction and Classification


In the feature extraction stage, any suitable feature vector can be determined for each
component independently. However, it must be considered that the face components
usually only have a fraction of the size of a face image. Hence, not all features are
suitable to describe face components. Texture descriptors, for example, in general
need image information of a certain size, so that the texture can be described prop-
erly.

Features Based on Grey-Level Information (Template Matching)

Grey-level information can be used in two ways for object recognition purposes: (1)
the image is transformed row-wise into a feature vector. Then, the trained vector
having the smallest distance to the testing vector is determined, or, (2) a template
matching based on a similarity measure can be applied. Given an image I(x, y) and
a template T (x, y)6 , the similarity can be calculated by:
S(u, v) = \frac{1}{\sum_{(i,j) \in V} \left[ I(i + u, j + v) - T(i, j) \right]^2 + 1}

where V is the set of all coordinate pairs. The template with the highest similarity
score Smax (u, v) determines the classified person.

6 A template is an image pattern showing the object to be classified. A face component can be regarded as a template as well.
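
A direct, unoptimized implementation of the similarity measure S(u, v) defined above could look like this, assuming grey-level NumPy arrays and interpreting (u, v) as the column and row offset of the template within the image:

import numpy as np

def template_similarity(image, template, u, v):
    h, w = template.shape
    patch = image[v:v + h, u:u + w].astype(float)        # image region compared to the template
    ssd = np.sum((patch - template.astype(float)) ** 2)  # sum of squared differences over V
    return 1.0 / (ssd + 1.0)                             # S(u, v) as defined above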

Eigenfeatures (local Eigenfaces)


An extension to the grey-level features is the locally applied eigenface approach
(Sect. 5.1.4). With I_{FC_j}(x, y) −→ J_{FC_j} being the image information of face component FC_j, the mean
\Psi_{FC_j} = \frac{1}{N_{FC_j}} \sum_i J_{FC_j,i}
and the covariance matrix of the face component:
C_{FC_j} = \frac{1}{N_{FC_j}} \sum_i (J_{FC_j,i} - \Psi_{FC_j})^T (J_{FC_j,i} - \Psi_{FC_j})

can be determined (i being the index of the training images, and j being the index of the facial component, here j ∈ {1, . . . , 10}). The Eigenvectors of C_{FC_j} are often
called “eigenfeatures” referring to the term “Eigenfaces” as they span a sub-space
for specific facial features [40]. The classification stage provides a classification result for each face component. This is done according to any holistic method by determining a distance measure (Sect. 5.1.4).
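
Reusing the eigenface routines sketched in Sect. 5.1.4, the eigenfeature idea reduces to running one PCA per face component and classifying each component separately before combining the results. The following minimal sketch assumes that the train_eigenfaces() helper from the earlier eigenface sketch is available:

def train_eigenfeatures(component_images, q_threshold=0.95):
    # component_images: dict mapping component index j (e.g. 1..10) to an (N, w_j*h_j)
    # array of row-sampled component images; one (mean, basis, projections) triple per component.
    return {j: train_eigenfaces(images, q_threshold) for j, images in component_images.items()}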
Fourier coefficients for image coding
The 2D-DFT (Discrete Fourier Transform) of an image I(x, y) is defined as:
g(m, n) = \sum_x \sum_y I(x, y) \cdot e^{j 2\pi \left( \frac{x \cdot m}{M} + \frac{y \cdot n}{N} \right)}

where m and n index the spectral coefficients in x- and y-direction, respectively, and M and N denote the image width and height.

Fig. 5.11. Face image and a typical 2D-DFT (magnitude).

Fig. 5.11 shows a face image and the magnitude of a typical DFT spectrum in
polar coordinates. Note that mainly coefficients representing low frequencies (close
to the center) can be found in face images. Therefore, only the r most relevant Fourier
coefficients g(m, n) are determined and combined in an r-dimensional feature vector
which is classified using e.g. the Euclidean distance.
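
The following sketch extracts such a feature vector with NumPy's FFT: the spectrum is shifted so that low frequencies lie in the center, and the magnitudes of the r coefficients closest to the center are kept. Here r is an example parameter, and phase information is discarded for simplicity; this is an illustration, not the exact feature used in the experiments below.

import numpy as np

def dft_features(image, r=50):
    # Magnitudes of the r lowest-frequency 2D-DFT coefficients as an r-dimensional feature vector.
    spectrum = np.fft.fftshift(np.fft.fft2(image.astype(float)))  # low frequencies at the center
    cy, cx = spectrum.shape[0] // 2, spectrum.shape[1] // 2
    yy, xx = np.indices(spectrum.shape)
    dist = np.hypot(yy - cy, xx - cx)                             # distance to the spectrum center
    closest = np.argsort(dist.ravel())[:r]                        # r coefficients closest to the center
    return np.abs(spectrum.ravel()[closest])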

Discussion
To show the distinctiveness of facial components, we ran experiments with different
combinations of facial components (Fig. 5.12) on a subset of the FERET database
(Sect. 5.1.6). Only grey-level information was used as feature, together with a 1NN
classifier. Frontal images of a set of 200 persons were manually annotated with dis-
tinctive points/shapes. One image of each person was trained; another one, showing slight facial expression changes, was tested.

Fig. 5.12. Test sets: combinations of face components.

In table 5.1 the recognition rate of each single component is given. The results
show that the eye region (eye and eye brow components) is the most discriminating
part of the face. Just with a single eye component a recognition rate of about 86%
could be achieved. The lower face part (mouth, nose) performs worse with rates up
to only 50%.

Table 5.1. Classification rates of different face components (Fig. 5.10)


component recognition rate
left eye 86,9%
right eye 85,4%
left eye brow 88,9%
right eye brow 85,4%
left temple 73,9%
right temple 68,3%
nose root 76,4%
nose tip 49,7%
left mouth corner 43,2%
right mouth corner 41,7%

The results of the combined tests are shown in table 5.2. The combination of
all face components increased the recognition rate to 97%. Even though the mouth components alone do not perform very well, they still add valuable information to the classification, as the recognition rate drops by 1,5% when the mouth components are not considered (test set (b)). Further reduction (test set (c)) shows, however, that the nose tip in our experiments could not add valuable information; it actually worsened
the result. A recognition rate of about 90% can be achieved under strong occlusion,
simulated by only considering four components of the right face half. This shows that
occlusion can be compensated reasonably well by the component-based approach.

Table 5.2. Classification rates of different combinations of face components (Fig. 5.12)
test sets recognition rate
(a) 10 components 97,0%
(b) 8 components (no mouth) 95,5%
(c) 7 components (eye region only) 96,5%
(d) 4 components (right face half) 90,1%

5.1.6 Face Databases for Development and Evaluation

To evaluate a face recognition algorithm it is essential to have an extensive image


database containing images of faces under different conditions. Many databases are
publicly available; some contain about 1000 persons, others only a few persons but with images under a wide variety of conditions (e.g. pose, illumination).

Table 5.3. A selection of publicly available face databases for evaluation.


URL
Color FERET http://www.itl.nist.gov/iad/humanid/colorferet/home.html
AR Face Database http://rvl1.ecn.purdue.edu/~aleix/ar.html
Yale Face Database http://cvc.yale.edu/projects/yalefaces/yalefaces.html
Yale Face Database B http://cvc.yale.edu/projects/yalefacesB/yalefacesB.html

The Color FERET Face Database [41] is the biggest database in terms of dif-
ferent individuals. It contains images of 994 persons including frontal images with
neutral and a randomly different facial expression (usually a smile) for all persons.
For about 990 persons there are images of profiles, half profiles and quarter pro-
files. Images where the head was turned more or less randomly are also included. The ground-truth information defines the year of birth (1924-1986), gender, race and whether the person was wearing glasses, a beard, or a mustache. For the frontal images,
also the eye, mouth and nose tip coordinates are annotated. Older images of the first
version of the FERET database are included in the newer Color FERET database.
However, these images are supplied in grey-scale only.
Fig. 5.13. Example images of the four mentioned face databases (Color FERET, AR, Yale Face Database, Yale Face Database B).



In comparison to the FERET database, the AR Face database [31] contains


systematically annotated changes in facial expression (available are neutral, smile,
anger, scream), illumination (ambient, left, right, direct light from all sides) and
occlusion by sun glasses or scarfs. No pose changes are included. Images of 76
men and 60 women were taken in two sessions on different days. Fiducial point
annotations are missing. For some of the images the FGnet project provided annotations which are available on the internet7.

There are two versions of the Yale Face Database. The original version [4] con-
tains 165 frontal face images of 15 persons under different illuminations, facial ex-
pressions, with and without glasses. The second version (Yale Face Database B) [20]
contains grey-scale images of 10 individuals (nine male and one female) under 9
poses and 64 illumination conditions with neutral facial expressions. No subject
wears glasses and no other kind of occlusion is contained.

5.1.7 Exercises

The following exercises deal with face recognition using the Eigenface (Sect. 5.1.7.2)
and the component-based approach (Sect. 5.1.7.3). Image processing macros for the
IMPRESARIO software are explained and suitable process graphs are provided on the
book CD.
5.1.7.1 Provided Images and Image Sequences
A few face images and ground truth data files (directories
<Impresario>/imageinput/LTI-Faces-Eigenfaces and
<Impresario>/imageinput/LTI-Faces-Components)8 are provided on
the book CD with which the process graphs of this exercise can be executed. This
image database is not extensive, however, it should be big enough to get a basic
understanding of the face recognition methods.
Additionally, we suggest using the Yale face database [4], which can be obtained easily through the internet9. Additional ground truth data like coordinates of facial feature points are provided on the CD shipped with this book and can be found after installation of IMPRESARIO in
the directories <Impresario>/imageinput/Yale DB-Eigenfaces and
<Impresario>/imageinput/Yale DB-Components, respectively.

Set up of the Yale Face Database

To set up the image and ground truth data of the Yale face database follow these
steps:
1. Download the Yale face database from the internet9 and extract the image files
and their 15 sub-directories (one for each subject) to the following directories:
7 http://www-prima.inrialpes.fr/FGnet/data/05-ARFace/tarfd_markup.html
8 <Impresario> denotes the installation directory on your hard disc.
9 http://cvc.yale.edu/projects/yalefaces/yalefaces.html

• <Impresario>/imageinput/Yale DB-Eigenfaces for the Eigenface exercise
• <Impresario>/imageinput/Yale DB-Components for the exercise dealing with
the component-based face recognition approach
Each of the 15 sub-directories of the two above mentioned directories should
now contain 11 ground truth files (as provided with the book CD) and 11 image
files.
2. Convert all image files to the bitmap (BMP) format using any standard image
processing software.

5.1.7.2 Face Recognition Using the Eigenface Approach


Face recognition using the Eigenface approach (Sect. 5.1.4) can be implemented in
I MPRESARIO in three steps which are modelled in separate process graphs10 :
1. Calculating the transformation matrix (Eigenfaces_calcPCA.ipg): The
PCA needs to be calculated from a set of face images which contain possible
variations in face appearance, i.e. mainly variations in illumination and facial
expression.
2. Training (Eigenfaces_Train.ipg): Using the transformation matrix cal-
culated in the first step, the face images of the persons to be recognized are
projected into the eigenspace and their transformed coordinates are used to train
a classifier.
3. Experiments/Testing (Eigenfaces_Test.ipg): The test images are pro-
jected into the eigenspace as well and are processed by the previously trained
nearest-neighbor classifier. By comparing the classified ID with the target ID of
a set of test images, a recognition rate can be determined by:

R = \frac{\text{correct classifications}}{\text{correct classifications} + \text{false classifications}}
In the following, these three process graphs are explained in more detail. All
processes have in common that they perform a normalization of the face images
which will be described in the next section.

Calculating the Transformation Matrix by Applying PCA

Fig. 5.14 shows the first process graph. The ImageSequence source provides an
image (left output pin) together with its file name (right output pin), which is
needed by the LoadFaceCoordinates macro. This macro loads the coordinates of the
eyes, mouth and nose, which are stored in a file with the same name as the image
file but a different extension (e.g. subject05.normal.asf for image
subject05.normal.bmp).
The ExtractIntensityChannel macro extracts the intensity channel of the image,
which is then passed to the FaceNormalizer macro.
10 The process graph files can be found on the book CD. The file names are given in brackets.

Fig. 5.14. I MPRESARIO process graph for the calculation of the PCA transformation matrix.
The grey background marks the normalization steps which can be found in all process graphs
of this Eigenface exercise. (the face image was taken from the Yale face database [4])

FaceNormalizer The FaceNormalizer macro uses the facial feature coordinates to
perform the normalization (Sect. 5.1.4) of the image. Five parameters must be set
to perform a proper normalization:
width: The width of the result image. The face image will be cropped to this
width.
height: The height of the result image. The face image will be cropped to this
height.
norm width: The distance between the eye centers. The face image will be scaled
horizontally to gain this distance between the eyes.
norm height: The distance between the y-coordinates of the eye centers and the
nose tip/mouth. The face image will be scaled vertically to gain this distance.
normalization by: This parameter can be set to mouth or nose tip and determines
which feature point is used when scaling to norm height.
At this point, the face image is normalized regarding scale and rotation and is
cropped to a certain size (face image in the center of Fig. 5.14). Normalization of the
intensity values is then performed in the macro IntensityNormalizer (cropped face
image in the lower left corner of Fig. 5.14).
IntensityNormalizer Performs a normalization of the intensity values. The current
grey-level mean avg_cur is determined and all pixel values are adapted considering
the new (normalized) grey-level mean avg_norm by:

I_{norm}(x, y) = I(x, y) \cdot \frac{avg_{norm}}{avg_{cur}}

After this normalization step, all face images have the same mean grey-level
avg_norm. Global linear illumination differences can be reduced by this step.

mean grey-value: The new mean value of all intensities in the image (avg_norm).
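The principle of this normalization can be illustrated with a few lines of code. The
following sketch is plain C++ without the LTILib; the function name and the float
image representation are chosen freely for illustration and are not part of the macro:

#include <vector>
#include <cstddef>

// Scale all pixel values of a grey-level image so that its mean becomes
// avgNorm (sketch of the IntensityNormalizer principle).
void normalizeIntensity(std::vector<float>& image, float avgNorm) {
  if (image.empty()) return;
  // determine the current grey-level mean avg_cur
  double sum = 0.0;
  for (std::size_t i = 0; i < image.size(); ++i) sum += image[i];
  const double avgCur = sum / image.size();
  if (avgCur <= 0.0) return;                   // nothing to normalize
  // adapt every pixel: I_norm(x,y) = I(x,y) * avg_norm / avg_cur
  const float factor = static_cast<float>(avgNorm / avgCur);
  for (std::size_t i = 0; i < image.size(); ++i) image[i] *= factor;
}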
As some parts of the image might still contain background, the normal-
ized face image is masked using a pre-defined mask image which is loaded by
LoadImageMask. The macro MaskImage actually performs the masking operation.
Finally, another ExtractIntensityChannel macro converts the image information
into a feature vector which is then forwarded to the calculatePCAMatrix macro.
calculatePCAMatrix This macro collects the feature vector of every image. After
the last image has been processed (the ExitMacro function of the macros is called),
calculatePCAMatrix calculates the mean feature vector, the covariance ma-
trix and its eigenvalues, which are needed to determine the transformation ma-
trix (Sect. 5.1.4) that is then stored.
data file: Path to the file in which the transformation matrix is stored.
result dim: With this parameter the number of dimensions to which the feature
vectors are reduced can be explicitly determined. If this parameter is set to −1,
a suitable dimension reduction is performed by dismissing all eigenvectors
whose eigenvalues are smaller than λ_max/100000, where λ_max denotes the largest
eigenvalue.
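This automatic choice can be illustrated by a short sketch (plain C++; the function
name is hypothetical and the eigenvalues are assumed to be sorted in descending
order), which counts how many eigenvalues pass the λ_max/100000 threshold:

#include <vector>
#include <cstddef>

// Count the eigenvectors kept when "result dim" is set to -1: all
// eigenvalues smaller than lambda_max / 100000 are dismissed.
// Assumes eigenvalues are sorted in descending order.
std::size_t selectDimension(const std::vector<double>& eigenvalues) {
  if (eigenvalues.empty()) return 0;
  const double threshold = eigenvalues.front() / 100000.0;
  std::size_t dim = 0;
  while (dim < eigenvalues.size() && eigenvalues[dim] >= threshold) ++dim;
  return dim;
}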

Fig. 5.15. I MPRESARIO process graph for training an Eigenface classifier.



Eigenface Training

The normalization steps described above (Fig. 5.14) can be found in the training
process graph as well (Fig. 5.15). Instead of calculating the PCA, the training
graph contains the EigenfaceFeature macro which loads the previously stored
transformation matrix and projects given feature vectors into the eigenspace (face
space). The transformed feature vectors are then forwarded to the kNNTrainer
macro which stores all feature vectors with their corresponding person IDs which
are determined by the macro ExtractIDFromFilename.

EigenfaceFeature Determines the eigenspace representation of the given image.


data file: File where the transformation matrix to the eigenspace (face space) was
stored by the macro calculatePCAMatrix.
ignored Eigenfaces: Number of most relevant Eigenfaces which are ignored
when determining the eigenspace representation of the given image.
width: This is the width of the images to be processed. The width is explicitly
needed to display the eigenface determined by the parameter eigenface num-
ber.
height: This is the height of the images to be processed. The height is explic-
itly needed to display the eigenface determined by the parameter eigenface
number.
eigenface number: You can display an eigenface image at the eigenface image
output (the right one). The index of the eigenface to be displayed can be set
by this parameter.
ExtractIDFromFilename This macro extracts the ID from the file name following
one of two naming conventions.
naming convention: The naming conventions of two face databases can be cho-
sen with this parameter:
• Yale face database (A):
example file name: subject04.glasses.bmp ⇒ ID = 4
• FERET face database:
example file name: 00002_930831_fa.jpg ⇒ ID = 2
kNNTrainer This macro trains a kNN classifier and stores the trained information
on disc.
training file: Path to the file where the trained kNN classifier is stored.

Recognition Experiments Using Eigenfaces

Fig. 5.16 shows the graph which can be used for running tests on the Eigen-
face approach. The kNNTrainer macro of the training graph is substituted by a
kNNClassifier.
kNNClassifier Classifies the given feature vector by loading the previously stored
classifier information (see kNNTrainer above) from hard disc.

Fig. 5.16. I MPRESARIO process graph for testing the eigenface approach using a nearest-
neighbor classifier.

training file: File in which the kNNTrainer macro stored the trained feature vec-
tors.
RecognitionStatistics Compares the classified ID (obtained by kNNClassifier) to
the target ID which is again obtained by the macro ExtractIDFromFilename.
Additionally, it counts the correct and false classifications and determines the
classification rate which is displayed in the output window of I MPRESARIO (Ap-
pendix B).
label: A label which is displayed in the output window of I MPRESARIO to mark
each message from the respective RecognitionStatistics macro in case more
than one such macro is used.

Annotations

For the normalization of the images, parameters must be set which affect the size
of the result images. FaceNormalizer crops the image to a certain size (determined
by parameters width and height). These parameters must be set to the same value
as in LoadImageMask and EigenfaceFeature (Fig. 5.17). If this convention is not
followed, the macros will exit with an error because the dimensionality for vectors
and matrices do not correspond and hence the calculations can not be performed.

Suggested experiments

To understand and learn about how the different parameters influence the perfor-
mance of the eigenface approach, the reader might want to try changing the following
parameters:

Fig. 5.17. Proper image size settings for the Eigenface exercise. The settings in all involved
macros must be set to the same values, e.g., 120 and 180 for the width and height, respectively.

• Variation of the normalization parameters
Possible variations of the normalization parameters are:
– Image size: The parameters width and height of FaceNormalizer,
LoadImageMask, and EigenfaceFeature must be set in all three process-
ing graphs.
– Norm sizes: The distance between the eyes and nose/mouth can be varied
such that only certain parts of the face are visible, e.g. only the eye
region.

• Manually determine the dimensionality of the subspace (face space)
The dimensionality of the face space strongly determines the performance of the
eigenface approach. In general, decreasing the number of dimensions will lead
to a decreasing recognition rate. Increasing the number of dimensions at some
point has no effect, as the additional dimensions only describe noise, which does
not improve the recognition performance. The number of dimensions can be set
manually with the result dim parameter of the calculatePCAMatrix macro.
All process graphs must then be executed again; however, the same images can be used.

• Varying the training set
The recognition performance is also determined by the images which are used to calculate the
transformation matrix and to train the classifier. These two sets of images
do not necessarily need to be the same. While varying the training set, the
image set used to determine the face space can be kept constant. However, variations
(e.g. illumination) present in the training set should also be included in the image set for
calculating the PCA.
Tests on how, e.g., illumination affects the recognition rate can be performed
by training with face images (e.g. of the Yale face database) under one illumination
situation (e.g. the yale.normal.seq images) and testing under another
illumination (the yale.leftlight.seq images). The same applies to
training and testing images showing different facial expressions.

• Ignoring dominant Eigenfaces for illumination invariance


Another possibility to gain a certain independence from illumination changes is
to ignore dominant Eigenfaces when transforming the feature vectors into the
face space. The number of dominant Eigenfaces which should be ignored can be
set by changing the parameter ignored Eigenfaces (macro EigenfaceFeature).
This parameter must be changed in the training as well as in the testing process
graph. The transformation matrix does not need to be recalculated.
5.1.7.3 Face Recognition Using Face Components
This exercise will deal with face recognition based on facial components. Two
process graphs are needed for this exercise which are shipped with the book CD.
The IntensityFeatureExtractor macro is provided with C++ source code and can
be used as a basis for new feature extraction macros that can be coded for these
processing graphs.

Fig. 5.18. I MPRESARIO Process graph for training a component-based face recognizer.

Training of a Component-Based Face Recognizer

The training graph of the component-based classifier is quite simple (Fig. 5.18). The
macro LoadComponentPositions loads a list of facial feature points from a file on
disc that corresponds to each image file to be trained. The macro returns a list of
feature points which is forwarded to the ComponentExtractor.

ComponentExtractor Determines the positions and sizes of 12 face components
from the facial feature points given by the macro LoadComponentPositions.
The macro returns a list containing images of the components.

component: One can take a look at the extracted components by setting the com-
ponent parameter to the ID of the desired component (e.g. 4 for the nose tip;
all IDs are documented in the macro help; see the left of Fig. 5.18).
The list of face component images is then forwarded to a feature extractor macro
that builds up a list of feature vectors containing one vector per component image. In
Fig. 5.18 the IntensityFeatureExtractor is used which only extracts the grey-level
information from the component images. This macro must be substituted if other
feature extraction methods should be used.
The list of feature vectors is then used to train a set of 12 kNN classifiers (macro
ComponentskNNTrainer). For each classifier, a file is stored in the given directory
(parameter training directory) that contains information of all trained feature vectors.
ComponentskNNTrainer Trains 12 kNN classifiers, one for each component. The
ground truth ID is extracted by the macro ExtractIDFromFilename (Sect.
5.1.7.2).
training directory: Each trained classifier is stored in a file in the directory given
here.
training file base name: The name of each of the 12 classifier files contains
the base file name given here and the ID of the face component, e.g.
TrainedComponent4.knn for the nose tip.

Face Recognition Experiments Using a Component-Based Classifier

For a test image the facial feature point positions are loaded and the face compo-
nent images are extracted from the face image. Then, the features for all compo-
nents are determined and forwarded to the classifier. The ComponentskNNTrainer
is substituted by a ComponentskNNClassifier that will load the previously stored
12 classifier files to classify each component separately (Fig. 5.19).
ComponentskNNClassifier For each component, this macro determines a win-
ner ID and an outputVector11 containing information about the distances
(Euclidean distance is used) and the ranking of the different classes, i.e. a list
containing a similarity value for each class. The winner ID can be used to
determine the classification rate of a single component by feeding it into a
RecognitionStatistics macro that compares the winner ID with the target ID. The
outputVector can be used to combine the results of the single components to
an overall result.
training directory: Directory where ComponentskNNTrainer stored the classi-
fication information for all components.
11 outputVector is an LTILib class. Please refer to the documentation of the LTILib for implementation details.

Fig. 5.19. I MPRESARIO process graph of a component-based face recognizer. Whether a component
should influence the overall recognition result can be controlled by connecting the classifier
output with the corresponding combiner input. Here, all components contribute to the overall
result.

training file base name: The base name of the files in which the classification
information is stored.
The ResultCombiner macro performs this step by linearly combining the outputs
and determining an overall winner ID, which again can be examined in a
RecognitionStatistics macro to get an overall classification rate.
Classification results of single components can be considered in the
overall recognition result by connecting the corresponding output of the
ComponentskNNClassifier macro to the input of the ResultCombiner. In Fig. 5.19
all components are considered as all are connected to the combination macro.
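The combination step itself can be sketched in a few lines. The following plain C++
fragment is only an illustration of the linear combination principle, not the actual
ResultCombiner code; the data layout (one similarity value per class and component)
is an assumption:

#include <vector>
#include <cstddef>

// Linearly combine per-component similarity vectors and return the index
// of the winning class (sketch of the combination principle).
int combineComponentResults(const std::vector<std::vector<double> >& similarities) {
  if (similarities.empty() || similarities.front().empty()) return -1;
  const std::size_t numClasses = similarities.front().size();
  std::vector<double> combined(numClasses, 0.0);
  for (std::size_t c = 0; c < similarities.size(); ++c)    // components
    for (std::size_t k = 0; k < numClasses; ++k)           // classes
      combined[k] += similarities[c][k];                   // linear combination
  std::size_t winner = 0;
  for (std::size_t k = 1; k < numClasses; ++k)
    if (combined[k] > combined[winner]) winner = k;
  return static_cast<int>(winner);
}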

Suggested Experiments

In this exercise the reader is encouraged to implement his or her own feature extractors on
the basis of the IntensityFeatureExtractor macro. By reimplementing the function
CIntensityFeatureExtractor::featureExtraction (see Fig. 5.20)
this macro can be extended to extract more sophisticated features. The basic struc-
ture of the macro can be kept because it already provides the necessary interface for
the processing of all components for which this function is called.
The reader is also encouraged to train and test different face images, as the perfor-
mance of the component-based classifier strongly depends on which variations
are trained and tested, e.g., on differences in illumination, facial expressions or facial
accessories like glasses.

void CIntensityFeatureExtractor::featureExtraction() {
  // set the size of the feature vector to the dimension of the face component
  if(m_Feature.size() != m_FaceComponent.rows()*m_FaceComponent.columns())
    m_Feature.resize(m_FaceComponent.rows()*m_FaceComponent.columns(),
                     0,false,false);

  // Implement your own feature extraction here and remove the
  // following code ...

  // copy the (intensity) channel to a lti::dvector to generate
  // a feature vector
  m_FaceComponentIter = m_FaceComponent.begin();
  int i=0;
  while(m_FaceComponentIter!=m_FaceComponent.end())
  {
    m_Feature.at(i) = *m_FaceComponentIter;
    ++m_FaceComponentIter;
    ++i;
  }
}

Fig. 5.20. Code of the feature extractor interface for component-based face classification.
Reimplement this function to build your own feature extraction method.

5.2 Full-body Person Recognition


Video cameras are widely used for visual surveillance which is still mainly per-
formed by human observers. Fully automatic systems would have to perform (1)
person detection, (2) person tracking (Sect. 5.3) and (3) person recognition. For this
non-intrusive scenario neither knowledge nor token-based recognition systems are
applicable. Face recognition cannot be performed due to the lack of illumination
and pose-variation invariance of state-of-the-art systems. On top of that, the resolution
is often poor.
A possible solution might be to consider the overall outer appearance of a per-
son. The wide variety in colors and textures of clothes and the possibility of combin-
ing jackets, shirts, pants, shoes and accessories like scarves and hats offers a unique
measure for identification. Different styles and typical color combinations can be
associated with a single person and hence be used for identification (Fig. 5.21).
For a suitable description of the appearance of persons, the following section will
outline extraction algorithms for color (Sect. 5.2.2) and texture descriptors (Sect.
5.2.3). It is assumed that persons are already detected and the image accordingly
segmented (Sect. 5.3).

5.2.1 State of the Art in Full-body Person Recognition

Until today, the description of the appearance of people for recognition purposes
has been addressed only by the research community. Work on appearance-based full-body
recognition was done by Yang et al. [57] and Nakajima et al. [37], whose work is
briefly reviewed here.
Yang et al. developed a multimodal person identification system which, besides face

Fig. 5.21. Different persons whose appearance differs because of their size and especially in
texture and color of their clothing (a - d). However, sometimes similar clothing might not
allow a proper separation of individuals (d + e).

and speaker recognition, also contains a module for the recognition based on color
appearance. Using the whole segmented image, they build up color histograms based
on the r-g color space and the tint-saturation color space (t-s-color space) to gain illu-
mination independence to some extent (Sect. 5.2.2.1). Depending on the histograms
which are built after the person is separated from the background, a smoothed proba-
bility distribution function is determined for each histogram. During training for each
color model and each person, i probability distribution functions Pi,model (r, g) and
Pi,model (t, s) are calculated and stored. For classification, distributions Pi,test (r, g)
and Pi,test (t, s) are determined and compared to all previously trained Pi,model for
the respective color space. For comparison the following measure (Kullback-Leibler
divergence) is used:

D(\text{model}_i \,\|\, \text{test}) = \sum_{s \in S} P_{i,model}(r, g) \cdot \log\frac{P_{i,model}(r, g)}{P_{i,test}(r, g)}

where S is the set of all histogram bins. For the t-s-color space models, this is
performed analogously. The model having the smallest divergence determines the
recognition result.
Tests on a database containing 5000 images of 16 persons showed that the t-s model
performs slightly better than the r-g color model. Recognition is not influenced
heavily by the size of the histograms for either of the two color spaces, and
recognition rates reach up to 70%, depending on histogram size and color model.

Nakajima et al. examined different features which are extracted globally from the whole
image. Besides an RGB histogram and an r-g histogram (like Yang et al. [57]), a shape
histogram and local shape features are used. The shape histogram simply contains
the number of pixels in the rows and columns of the image to be examined. This feature

was combined with an RGB histogram. The local shape features were determined by
convolving the image with 25 shape pattern patches. The convolution is performed
on the color channels R - G, R + G and R + G - B. For classification a
Support Vector Machine [49] was used.
Tests were performed on a database of about 3500 images of 8 persons which
were taken during a period of 16 days. All persons wore different clothes during
the acquisition period so that different features were assigned to the same person.
Recognition rates of about 98% could be achieved. However, the ratio of training
images to test images was up to 9 : 1. It could be shown that the simple shape fea-
tures are not useful for this kind of recognition task, probably because of insufficient
segmentation. The local shape features performed quite well, though the normalized
color histogram produced the best results. This is in line with the results of Yang [57].

5.2.2 Color Features for Person Recognition


This section describes color features for person recognition and discusses their
advantages and disadvantages.
5.2.2.1 Color Histograms
Histograms are commonly used to describe the appearance of objects by counting
the number of pixels of each occurring color. Each histogram bin represents either
a single color or a color range. The size of the histogram, and hence the amount of
memory required, grows with the number of bins. With bins representing a color range,
the invariance against noise increases, because subtle changes in color have no effect:
the same bin is incremented when building the histogram. Therefore, a trade-off
between computing performance and noise invariance (small number of bins) on the one
hand and differentiability (large number of bins) on the other must be found.
The dimensionality of a histogram is determined by the number of channels which
are considered. A histogram over the RGB color space, for instance, has three dimensions,
whereas a histogram of intensity values has only one.

A histogram of an image I(x, y) with height h and width w is given by:

n_i = |\{ I(x, y) \mid I(x, y) = i \}|, \quad i = 0, \ldots, K - 1

where K is the number of colors and \sum_{i=0}^{K-1} n_i = w \cdot h.

Histograms can be based on any color space that suits the specific application.
The RGBL-histogram is a four-dimensional histogram built on the three RGB chan-
nels of an image and the luminance channel which is defined by:
L = \frac{\min(R, G, B) + \max(R, G, B)}{2}
Yang et al. [57] and Nakajima et al. [37] used the r-g color space for person
identification, where an RGB image is transformed into an r-channel and a g-channel
based on the following equations applied to each pixel P(x, y) = (R, G, B):

r = \frac{R}{R + G + B} \qquad g = \frac{G}{R + G + B}
Each combination (r̃, g̃) of specific colors is represented by a bin in the his-
togram. Nakajima used 32 bins per channel, i.e. K = 32 · 32 = 1024 bins for
the whole r-g histogram. The maximum size of the histogram (assuming the com-
monly used color resolution of 256 per channel) is K_max = 256 · 256 = 65536. Thus,
just a fraction of the maximum size was used, leading to more robust classification.
As the r-channel and the g-channel are coding the chrominance of an image, this
histogram is also called chrominance histogram.
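Building such a chrominance histogram is straightforward. The following sketch is
plain C++; the RGB pixel struct and the 32-bin layout are assumptions chosen for
illustration only:

#include <vector>
#include <cstddef>

struct RGB { unsigned char r, g, b; };   // assumed pixel layout

// Build a bins x bins r-g chrominance histogram (sketch).
std::vector<int> rgHistogram(const std::vector<RGB>& pixels, int bins = 32) {
  std::vector<int> hist(bins * bins, 0);
  for (std::size_t i = 0; i < pixels.size(); ++i) {
    const double sum = pixels[i].r + pixels[i].g + pixels[i].b;
    if (sum <= 0.0) continue;                     // skip pure black pixels
    const double r = pixels[i].r / sum;           // r = R / (R+G+B)
    const double g = pixels[i].g / sum;           // g = G / (R+G+B)
    int rBin = static_cast<int>(r * bins); if (rBin >= bins) rBin = bins - 1;
    int gBin = static_cast<int>(g * bins); if (gBin >= bins) gBin = bins - 1;
    ++hist[rBin * bins + gBin];                   // K = bins * bins entries
  }
  return hist;
}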
5.2.2.2 Color Structure Descriptor
The Color Structure Descriptor (CSD) was defined by the MPEG-7 standard for im-
age retrieval applications and image content based database queries. It can be seen
as an extension of a color histogram which not only represents the color distribu-
tion but also considers the local spatial distribution of the colors. With the so-called
structuring element, not only each pixel of the image is considered when building the
histogram but also its neighborhood.

Fig. 5.22. 3 × 3 structuring elements located in two images to consider the neighborhood of
the pixel under the element's center. In both cases, the histogram bins representing the colors
"black" and "white" are increased by exactly 1.

Fig. 5.22 illustrates the usage of the structuring element (here with a size of 3 × 3,
shown in grey) with two simple two-colored images. Both images have a size of
10 × 10 pixels, of which 16 are black. For building the CSD histogram the structuring
element is moved row-wise over the whole image. A bin of the histogram is increased
by exactly 1 if the corresponding color can be found within the structuring element
window, even if more than one pixel has this color. At the position marked
in Fig. 5.22 in the left image, one black pixel and eight white pixels are found. The
values of the bins representing the black color and the white color are increased by 1
each, just like on the right side of Fig. 5.22, even though two black pixels and only
seven white pixels are found within the structuring element.

By applying the structuring element, the histogram bins will have the values
n_Black = 36 and n_White = 60 for the left image, and n_Black = 62 and n_White = 64
for the right. As can be seen for the black pixels, the more the color is distributed,
the larger is the value of the histogram bin. A normal histogram built on these two
images would have the bin values n_Black = 16 and n_White = 84 in both cases. The images could
not be separated by these identical histograms, while the CSD histograms still show
a difference.
The MPEG-7 standard suggests a structuring element size of 8 × 8 pixels for images
smaller than 256 × 256 pixels, which is the case for most segmented images in our
experiments.
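The effect of the structuring element can be sketched as follows. The code is plain
C++; the image is assumed to be already quantized to numColors color indices, and
the 8 × 8 window follows the MPEG-7 suggestion mentioned above:

#include <vector>
#include <algorithm>
#include <cstddef>

// Color Structure Descriptor histogram (sketch): a bin is increased by
// exactly 1 whenever its color occurs at least once inside the structuring
// element window, regardless of how many pixels have that color.
std::vector<int> csdHistogram(const std::vector<int>& quantized, // color index per pixel
                              int width, int height,
                              int numColors, int window = 8) {
  std::vector<int> hist(numColors, 0);
  std::vector<char> present(numColors);
  for (int y = 0; y + window <= height; ++y) {
    for (int x = 0; x + window <= width; ++x) {
      std::fill(present.begin(), present.end(), 0);
      for (int dy = 0; dy < window; ++dy)
        for (int dx = 0; dx < window; ++dx)
          present[quantized[(y + dy) * width + (x + dx)]] = 1;
      for (int c = 0; c < numColors; ++c)
        if (present[c]) ++hist[c];       // at most +1 per window position
    }
  }
  return hist;
}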
It must be noted that the MPEG-7 standard also introduced a new color space
[29]. Images are transformed into the HMMD color space (Hue, Min, Max, Diff)
and then the CSD is applied.
The hue value is the same as in the HSV color space while Min and Max are the
smallest and the biggest of the R, G, and B values and Diff is the difference between
these extremes:

Hue = \begin{cases} \left(0 + \frac{G - B}{Max - Min}\right) \times 60, & \text{if } R = Max \\ \left(2 + \frac{B - R}{Max - Min}\right) \times 60, & \text{if } G = Max \\ \left(4 + \frac{R - G}{Max - Min}\right) \times 60, & \text{if } B = Max \end{cases}

Min = \min(R, G, B)
Max = \max(R, G, B)
Diff = \max(R, G, B) - \min(R, G, B)
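The conversion of a single RGB pixel into the HMMD components follows directly
from these equations. The sketch below is plain C++; the struct and function names
are chosen for illustration, and the wrapping of negative hue angles is an added
assumption not stated in the equations above:

#include <algorithm>

struct HMMD { double hue, min, max, diff; };   // assumed result type

// Convert one RGB pixel (0..255 per channel) to HMMD (sketch).
HMMD rgbToHMMD(double R, double G, double B) {
  HMMD out;
  out.max  = std::max(R, std::max(G, B));
  out.min  = std::min(R, std::min(G, B));
  out.diff = out.max - out.min;                // Diff = Max - Min
  if (out.diff == 0.0) {
    out.hue = 0.0;                             // grey pixel: hue undefined, set to 0
  } else if (R == out.max) {
    out.hue = (0.0 + (G - B) / out.diff) * 60.0;
  } else if (G == out.max) {
    out.hue = (2.0 + (B - R) / out.diff) * 60.0;
  } else {
    out.hue = (4.0 + (R - G) / out.diff) * 60.0;
  }
  if (out.hue < 0.0) out.hue += 360.0;         // wrap into [0, 360) (assumption)
  return out;
}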

5.2.3 Texture Features for Person Recognition

Clothing is not only characterized by color but also by texture. So far, texture features
have not been evaluated as much as color features though they might give valuable
information about a person’s appearance. The following two features turned out to
have good recognition performance, especially when combined with other features.
In general, most texture features filter an image, split it up into separate fre-
quency bands and extract energy measures that are merged into a feature vector. The
applied filters have different specific properties, e.g. they extract either certain fre-
quencies or frequencies in certain directions.
5.2.3.1 Oriented Gaussian Derivatives
For a stable rotation-invariant feature we considered a feature based on the Oriented
Gaussian Derivatives (OGD) which showed promising results in other recognition
tasks before [54]. The OGDs are steerable, i.e. it can be determined how the in-
formation in a certain direction influences the filter result and therefore the feature
vector.
In general, a steerable filter g^\theta(x, y) can be expressed as:

g^\theta(x, y) = \sum_{j=1}^{J} k_j(\theta)\, g_j^\theta(x, y)

where g_j^\theta(x, y) denote basis filters and k_j(\theta) interpolation functions. For the first-
order OGDs the filter can be defined as:

g^\theta(x, y) = g^{0^\circ}(x', y') = \frac{\partial}{\partial x'} g(x', y')
             = -\frac{1}{2\pi\sigma^4}\,(x\cos\theta + y\sin\theta)\,\exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)
             = \cos\theta\,\frac{\partial}{\partial x} g(x, y) + \sin\theta\,\frac{\partial}{\partial y} g(x, y)
             = \cos\theta\, g^{0^\circ}(x, y) + \sin\theta\, g^{90^\circ}(x, y)

where x' = x\cos\theta and y' = y\sin\theta. Here the interpolation functions are
k_1(\theta) = \cos(\theta) and k_2(\theta) = \sin(\theta). The basis functions are given by the derivatives
of the Gaussian in the directions of x and y, respectively.

A directed channel C^\theta with its elements c^\theta(x, y) can then be extracted by con-
volving the steerable filter g^\theta(x, y) with the image I(x, y):

c^\theta(x, y) = g^\theta(x, y) * I(x, y)

The power p^\theta(x, y) of the channel C^\theta is then determined by:

p^\theta(x, y) = (c^\theta(x, y))^2

For an n-th order Gaussian derivative, the energy of a region R in a filtered channel
c^\theta(x, y) can then be expressed as [2]:

E_R^\theta = \iint_R p^\theta(x, y)\, dx\, dy
          = \sum_{(i,j)=1}^{J} k_i(\theta)\, k_j(\theta) \iint_R c_i^\theta(x, y)\, c_j^\theta(x, y)\, dx\, dy
          = A_0 + \sum_{i=1}^{n} A_i \cos(2i\theta - \Theta_i)

where A_0 is independent of the steering angle \theta and the A_i are rotation-invariant, i.e.
the energy E_R^\theta is steerable as well. The feature vector is composed of all coefficients
A_i. For the exact terms of the coefficients A_i and more details the interested reader
is referred to [2].
5.2.3.2 Homogeneous Texture Descriptor
The Homogeneous Texture Descriptor (HTD) describes texture information in a
region by determining the mean energy and energy deviations from 30 frequency

channels non-uniformly distributed in a cylindrical frequency domain [38, 42]. The
MPEG-7 specifications for the HTD suggest using Gabor functions to model the
channels in the polar frequency domain, which is split into six bands in angular
direction and five bands in radial direction (Fig. 5.23).

Fig. 5.23. (Ideal) HTD frequency domain layout. The 30 ideal frequency bands are modelled
with Gabor filters.

In the literature an implementation of the HTD is proposed that applies the "central
slice theorem" and the Radon transformation [42]. However, an implementation
using a 2D Fast Fourier Transformation (FFT) turned out to be computationally much
more efficient in our experiments.

The 2D-FFT is defined as:

F_I(u, v) = \sum_{x=0}^{W-1} \sum_{y=0}^{H-1} I(x, y)\, e^{-2\pi i \left(\frac{ux}{W} + \frac{vy}{H}\right)}, \quad u = 0, \ldots, W - 1;\; v = 0, \ldots, H - 1

The resulting frequency domain is symmetric, i.e. only N/2 points of the discrete
transformation need to be calculated and afterwards filtered by Gabor filters, which
are defined as:

G(\omega, \theta) = \exp\left(\frac{-(\omega - \omega_s)^2}{2\sigma_{\omega_s}^2}\right) \cdot \exp\left(\frac{-(\theta - \theta_r)^2}{2\sigma_{\theta_r}^2}\right)

where \omega_s (\theta_r) determines the center frequency (angle) of the band i, with i =
6s + r + 1, r defining the radial index and s the angular index.
From each of the 30 bands the energies e_i and their deviations d_i are defined as:

e_i = \log(1 + p_i)
d_i = \log(1 + q_i)

with

p_i = \sum_{\omega=0}^{1} \sum_{\theta=0^\circ}^{360^\circ} [G(\omega, \theta) \cdot F(\omega, \theta)]^2

q_i = \sqrt{\sum_{\omega=0}^{1} \sum_{\theta=0^\circ}^{360^\circ} \left\{ [G(\omega, \theta) \cdot F(\omega, \theta)]^2 - p_i \right\}^2}

respectively, where G(\omega, \theta) denotes the Gabor function in the frequency domain and
F(\omega, \theta) the 2D Fourier transformation in polar coordinates.

The HTD feature vector is then defined as:

HTD = [f_{DC}, f_{SD}, e_1, e_2, \ldots, e_{30}, d_1, d_2, \ldots, d_{30}]

f_{DC} is the intensity mean value of the image (or color channel), f_{SD} the cor-
responding standard deviation. If f_{DC} is not used, the HTD can be thought of as an
intensity-invariant feature [42].

5.2.4 Experimental Results

This section presents some results which were achieved by applying the feature
descriptors described above.
5.2.4.1 Experimental Setup
The tests were performed on a database of 53 persons [22]. For each individual, a
video sequence was available showing the person standing upright. 10 images per
person were chosen for training and 200 images were used for testing. Some examples
of the images used are depicted in Fig. 5.21.

5.2.4.2 Feature Performance


The results achieved with the features described above are shown in Fig. 5.24. It can
clearly be seen that the performance of the color features is superior to the perfor-
mance of the texture features. Note the good stability of the CSD with an increasing
number of classes (persons).
The good performance of the chrominance histogram reported in [37] could not be
verified. Both the chrominance histogram and the RGBL histogram perform on average
worse than the CSD: the recognition rate of the RGBL histogram is 3 - 4% lower,
that of the chrominance histogram up to about 10% lower.

Note that the combination of the CSD and the HTD12 still led to a slight
performance improvement, though the performance of the HTD was worse than the
12 Combination was performed by adding the distance measures of the single features, d(x) = d_{CSD}(x) + d_{HTD}(x), and then determining the class with the smallest overall distance.

Fig. 5.24. Recognition results of the examined features (recognition rate plotted over the
number of persons for OGD, HTD, CSD, RGBL, chrominance, and CSD+HTD). Color features
are marked with filled symbols. Texture features have outlined symbols.

performance of the CSD. This proves that the HTD describes information that can-
not be completely covered by the CSD. Therefore, the combination of texture and
color features is advisable.

5.3 Camera-based People Tracking


Numerous present and future tasks of technical systems increasingly require the per-
ception and understanding of the real world outside the machine's metal box. As for the
human eye, camera images are the preferred input to provide the required infor-
mation, and therefore computer vision is the appropriate tool to approach these tasks.
One of the most fundamental abilities in scene understanding is the perception and
analysis of people in their natural environment, a research area often titled the “Look-
ing at People”-domain in computer vision. Many applications rely on this ability, like
autonomous service robots, video surveillance, person recognition, man-machine in-
terfaces, motion capturing, video retrieval and content-based low-bandwidth video
compression.
“People Tracking” is the image analysis element that all of these applications
have in common. It denotes the task of detecting humans in an image sequence and
keeping track of their movements through the scene. Often this provides merely the
initial information for following analysis, e.g. of the body posture, the position of
people in the three-dimensional scene, the behaviour, or the identity.
There is no general solution at hand that could process the images of any camera
in any situation and deliver the positions and actions of the detected persons, be-

Fig. 5.25. Different situations require more or less complex approaches, depending on the
number of people in the scene as well as the camera perspective and distance.

cause real situations are far too varied and complex (see Fig. 5.25) to be interpreted
without any prior knowledge. Furthermore, people have a wide variety of shape and
appearance. A human observer solves the image analysis task by three-dimensional
understanding of the situation in its entirety, using a vast background knowledge
about every single element in the scene, physical laws and human behaviour. No
present computer model is capable of representing and using that amount of world know-
ledge, so the task has to be specialized to allow for more abstract descriptions of the
desired observations, e.g. “Humans are the only objects that move in the scene” or
“People in the scene are walking upright and therefore have a typical shape”.

Fig. 5.26. Overview of the general structure of a people tracking system: starting from the
camera image, a segmentation step (people detection) is followed by tracking (updating the
position of each person, person description update, position prediction), new person detection,
and an optional feature extraction (position in the 3D scene, body posture, identity).

People tracking is a high-level image processing task that consists of multiple
processing steps (Fig. 5.26). In general, a low-level segmentation step, i.e. a search
for image regions with a high probability of representing people, precedes the track-
ing step, which uses knowledge and observations from the previous images to assign
each monitored person to the corresponding region and to detect new persons en-
tering the scene. But as will be shown, there are also approaches that combine the
segmentation and the tracking step.
The following sections present various methods to solve the individual tasks of a
tracking system, namely the initial detection and segmentation of the people in
the image, the tracking in the image plane and, if required, the calculation of the real-
world position of each person. Every method has advantages and disadvantages in
different situations, so the choice of how to approach the development of a "Looking
at People" system depends entirely on the specific requirements of the intended ap-
plication. Surveys and summaries of a large number of publications in this research
area can be found in [19, 35, 36].

5.3.1 Segmentation

“Image segmentation” denotes the process of grouping pixels with a certain common
property and can therefore be regarded as a classification task. In the case of people
tracking, the goal of this step is to find pixels in the image that belong to persons.
The segmentation result is represented by a matrix M(x, y) of Booleans that labels
each image pixel I(x, y) as one of the two classes "background scene" or "probably
person" (Fig. 5.27 b):

M(x, y) = \begin{cases} 0 & \text{if background scene} \\ 1 & \text{if moving foreground object (person)} \end{cases} \quad (5.6)

While this two-class segmentation suffices for numerous tracking tasks (Sect.
5.3.2.1), individual tracking of people during inter-person overlap requires further
segmentation of the foreground regions, i.e. to label each pixel as one of the classes
background scene, n-th person, or unrecognized foreground (probably a new person)
(Fig. 5.27 c). This segmentation result can be represented by individual foreground
masks Mn (x, y) for every tracked person n ∈ {1, . . . , N } and one mask M0 (x, y)
to represent the remaining foreground pixels. Sect. 5.3.3 presents several methods
to segment the silhouettes separately using additional knowledge about shape and
appearance.

Fig. 5.27. Image segmentation; a) camera image, b) foreground segmentation, c) separate
person segmentation.

5.3.1.1 Foreground Segmentation by Background Subtraction


Most people-tracking applications use one or more stationary cameras. The advan-
tage of such a setup is that each pixel of the background image B(x, y), i.e. the
camera image of the static empty scene without any persons, has a constant value
as long as there are no illumination changes. The most direct approach to detect

moving objects would be to label as foreground all those pixels of the camera image I(x, y, t) at
time t that differ by more than a certain threshold θ from the prerecorded
background image:

M(x, y, t) = \begin{cases} 1 & \text{if } \| B(x, y) - I(x, y, t) \| > \theta \\ 0 & \text{otherwise} \end{cases} \quad (5.7)

θ has to be chosen greater than the average camera noise. In real camera setups,

Fig. 5.28. Foreground segmentation steps: a) background model, b) camera image, c) background
subtraction, d) shadow reduction, e) morphological filtering.

however, the amount of noise is not only correlated to local color intensities (e.g. due
to saturation effects), but there are also several additional types of noise that affect the
segmentation result. Pixels at high contrast edges often have high noise variance due
to micro-movements of the camera and “blooming/smearing”-effects in the image
sensor. Non-static background like moving leaves on trees in outdoor scenes also
causes high local noise. Therefore it is more suitable to describe each background
pixel individually by its mean value μc (x, y) and its noise variance σ 2c (x, y). In the
following, it is assumed that there is no correlation between the noise of the separate
color channels Bc (x, y), c ∈ {r, g, b}. The background model is created from N
images B(x, y, i), i ∈ {1, . . . , N} of the empty scene:

\mu_c(x, y) = \frac{1}{N} \sum_{i=1}^{N} B_c(x, y, i) \quad (5.8)

\sigma_c^2(x, y) = \frac{1}{N} \sum_{i=1}^{N} \left( B_c(x, y, i) - \mu_c(x, y) \right)^2 \quad (5.9)

Assuming a Gaussian noise model, the Mahalanobis distance \Delta_M provides a suitable
measure to decide whether the difference between image and background is relevant
or not:

\Delta_M(x, y, t) = \sum_{c \in \{r, g, b\}} \frac{\left( I_c(x, y, t) - \mu_c(x, y) \right)^2}{\sigma_c^2(x, y)} \quad (5.10)
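A per-pixel background model of this kind and the corresponding Mahalanobis test
can be sketched as follows. The code is plain C++; the data structures, the explicit
threshold and the small constant guarding against a zero variance are illustrative
assumptions, not the exact implementation discussed here:

#include <vector>
#include <cstddef>

struct Pixel { double r, g, b; };              // assumed color pixel type

// Per-pixel background statistics according to eqs. 5.8 and 5.9.
struct BackgroundModel {
  std::vector<Pixel> mean;                     // mu_c(x,y)
  std::vector<Pixel> variance;                 // sigma_c^2(x,y)
};

// Create the background model from N images of the empty scene.
BackgroundModel buildModel(const std::vector<std::vector<Pixel> >& frames) {
  BackgroundModel m;
  if (frames.empty()) return m;
  const std::size_t numPixels = frames.front().size();
  const double N = static_cast<double>(frames.size());
  m.mean.assign(numPixels, Pixel());
  m.variance.assign(numPixels, Pixel());
  for (std::size_t f = 0; f < frames.size(); ++f)
    for (std::size_t p = 0; p < numPixels; ++p) {
      m.mean[p].r += frames[f][p].r / N;
      m.mean[p].g += frames[f][p].g / N;
      m.mean[p].b += frames[f][p].b / N;
    }
  for (std::size_t f = 0; f < frames.size(); ++f)
    for (std::size_t p = 0; p < numPixels; ++p) {
      m.variance[p].r += (frames[f][p].r - m.mean[p].r) * (frames[f][p].r - m.mean[p].r) / N;
      m.variance[p].g += (frames[f][p].g - m.mean[p].g) * (frames[f][p].g - m.mean[p].g) / N;
      m.variance[p].b += (frames[f][p].b - m.mean[p].b) * (frames[f][p].b - m.mean[p].b) / N;
    }
  return m;
}

// Label one pixel as foreground if the Mahalanobis distance (eq. 5.10)
// exceeds a threshold (sketch).
bool isForeground(const Pixel& img, const Pixel& mu, const Pixel& var,
                  double threshold) {
  const double eps = 1e-6;                     // avoid division by zero
  const double d = (img.r - mu.r) * (img.r - mu.r) / (var.r + eps)
                 + (img.g - mu.g) * (img.g - mu.g) / (var.g + eps)
                 + (img.b - mu.b) * (img.b - mu.b) / (var.b + eps);
  return d > threshold;
}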

element positioned at (x, y), the formal description of the morphological operations
is given by:
F(x, y) = \frac{\sum_{\delta_x=-d}^{+d} \sum_{\delta_y=-d}^{+d} M(x + \delta_x, y + \delta_y)\, E(\delta_x, \delta_y)}{\sum_{\delta_x=-d}^{+d} \sum_{\delta_y=-d}^{+d} E(\delta_x, \delta_y)} \quad (5.14)

Erosion: \quad M_E(x, y) = \begin{cases} 1 & \text{if } F(x, y) = 1 \\ 0 & \text{otherwise} \end{cases} \quad (5.15)

Dilation: \quad M_D(x, y) = \begin{cases} 1 & \text{if } F(x, y) > 0 \\ 0 & \text{otherwise} \end{cases} \quad (5.16)
The effects of both operations are shown in Fig. 5.29. Since the erosion reduces the
Fig. 5.29. Morphological operations. Black squares denote foreground pixels, grey lines mark
the borderline of the original region.

size of the original shape by the radius of the structure element while the dilation
expands it, the operations are always applied pairwise. An erosion followed by a di-
lation is called morphological opening. It erases foreground regions that are smaller
than the structure element. Holes in the foreground are closed by morphological clos-
ing, i.e. a dilation followed by an erosion. Often opening and closing operations are
performed one after the other, since both effects are desired.
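Erosion and dilation of a binary mask with an all-one square structure element can
be sketched as follows (plain C++; border pixels are handled by simply shrinking the
neighborhood, which is only one of several possible conventions):

#include <vector>

// Erosion/dilation of a binary mask with a (2d+1)x(2d+1) all-one structure
// element (sketch). erode=true keeps a pixel only if the whole neighborhood
// is foreground (eq. 5.15); erode=false keeps it if any neighbor is
// foreground (eq. 5.16).
std::vector<int> morph(const std::vector<int>& mask, int width, int height,
                       int d, bool erode) {
  std::vector<int> out(mask.size(), 0);
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x) {
      int count = 0, total = 0;
      for (int dy = -d; dy <= d; ++dy)
        for (int dx = -d; dx <= d; ++dx) {
          const int xx = x + dx, yy = y + dy;
          if (xx < 0 || yy < 0 || xx >= width || yy >= height) continue;
          ++total;
          count += mask[yy * width + xx];
        }
      out[y * width + x] = erode ? (count == total) : (count > 0);
    }
  return out;
}

// Morphological opening: erosion followed by dilation.
std::vector<int> opening(const std::vector<int>& mask, int w, int h, int d) {
  return morph(morph(mask, w, h, d, true), w, h, d, false);
}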

5.3.2 Tracking

The foreground mask M(x, y) resulting from the previous section marks each image
pixel either as "potential foreground object" or as "image background"; however, so
far no meaning has been assigned to the foreground regions. This is the task of the

tracking step. “Tracking” denotes the process of associating the identity of a moving
object (person) with the same object in the preceding frames of an image sequence,
thus following the trajectory of the object either in the image plane or in real world
coordinates. To this end, features of each tracked object, like the predicted position
or appearance descriptions, are used to classify and label the foreground regions.
There are two different types of tracking algorithms: In the first one (Sect.
5.3.2.1), all human candidates in the image are detected before the actual tracking
is done. Here, tracking is primarily a classification problem, using various features
to assign each person candidate to one of the internal descriptions of all observed
persons.
The second type of tracking systems combines the detection and tracking step
(Sect. 5.3.2.2). The predicted image area of each tracked person serves as initiali-
zation to search for the new position. This search can be performed either on the
segmented foreground mask or on the image data itself. In the latter case, the process
often also includes the segmentation step, using color descriptions of all persons to
improve the segmentation quality or even to separate people during occlusion. This
second, more advanced class of tracking algorithms is also preferred in model-based
tracking, where human body models are aligned to the person’s shape in the image
to detect the exact position or body posture.
In the following, both methods are presented and discussed. Inter-person overlap
and occlusions by objects in the scene will be covered in Sect. 5.3.3.

5.3.2.1 Tracking Following Detection


The first step in algorithms of the tracking following detection type is the detection
of person candidates in the segmented foreground mask (see Fig. 5.30 b). The most
common solution for this problem is to analyze the size of connected foreground
regions, keeping only those with a pixel number above a certain threshold. An ap-
propriate algorithm is presented in Sect. 2.1.2.2 (lti::objectsFromMask, see
Fig. 2.11). If necessary, additional criteria can be considered, e.g. a valid range for
the width-to-height ratio of the bounding boxes surrounding the detected regions.
Which criteria to apply depends largely on the intended application area. In most
cases, simple heuristic rules are adequate to distinguish persons from other moving
objects that may appear in the scene, like cars, animals or image noise (see also
example in Sect. 5.3.5).
After detection, each tracked person from the last image is assigned to one of
the person candidates, thus finding the new position. This classification task relies
on the calculation of appropriate distance metrics using spatial or color features as
described below.
Newly appearing persons are detected if a person candidate cannot be identified as
one of the tracked persons. If no image region can be assigned to a tracked person, it is
assumed that the person has left the camera field of view.
Tracking following detection is an adequate approach in situations that are not too
demanding. Examples are the surveillance of parking lots or similar set-ups fulfilling
the following criteria (Fig. 5.25 b):

Fig. 5.30. Tracking following detection: a) camera image, b) foreground segmentation and
person candidate detection, c) people assignment according to distance and appearance.

• The camera is mounted at a position significantly higher than a human. The
steeper the tilt towards the scene, the smaller is the probability of inter-person
overlap in the image plane.
• The distance between the camera and the monitored area is large, so that each
person occupies a small region in the camera image. As a result, the variability
of human shape and even grouping of persons have only negligible influence on
the detected position.
• There are only few people moving simultaneously in the area, so the frequency
of overlap is small.
• The application does not require separating people when they merge into groups
(see Sect. 5.3.3).
If, on the other hand, the monitored area is narrow with many people inside, who
can come close to the camera or are likely to be only partially visible in the
camera image (Fig. 5.25 a), it is more appropriate to use the combined tracking and
detection approach as presented in Sect. 5.3.2.2.
Various features have been applied in tracking systems to assign the tracked per-
sons to the detected image areas. The most commonly used distance metric dn,k
between a tracked person n and a person candidate k is the Euclidean distance be-
tween the predicted center position (xpn , ypn ) of the person’s shape and those of the
respective segmented foreground region (xck , yck ):

d_{n,k} = \sqrt{(x_{p_n} - x_{c_k})^2 + (y_{p_n} - y_{c_k})^2} \quad (5.17)

The prediction (xpn , ypn ) is extrapolated from the history of positions in the preced-
ing frames of the sequence, e.g. with a Kalman Filter [53].
A similar distance metric is the overlap ratio between the predicted bounding box
of a tracked person and the bounding box of the compared foreground region. Here,
the region with the largest area inside the predicted bounding box is selected.
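A minimal assignment step based on this spatial distance can be sketched as follows.
The code is plain C++; the Point struct and the simple nearest-candidate search are
assumptions for illustration, and a real system would additionally enforce a maximum
distance and resolve conflicting assignments between persons:

#include <vector>
#include <cmath>
#include <cstddef>

struct Point { double x, y; };   // assumed 2D position type

// For one tracked person, find the candidate region whose center is closest
// to the predicted position (eq. 5.17); returns -1 if no candidate exists.
int closestCandidate(const Point& predicted, const std::vector<Point>& candidates) {
  int best = -1;
  double bestDist = 0.0;
  for (std::size_t k = 0; k < candidates.size(); ++k) {
    const double dx = predicted.x - candidates[k].x;
    const double dy = predicted.y - candidates[k].y;
    const double dist = std::sqrt(dx * dx + dy * dy);
    if (best < 0 || dist < bestDist) { best = static_cast<int>(k); bestDist = dist; }
  }
  return best;
}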
While assignment by spatial distance provides the basic framework of a tracking
system, more advanced features are necessary for reliable identification whenever
people come close to each other in the image plane or have to be re-identified after
occlusions (Sect. 5.3.3). To this end, appearance models of all persons are created
when they enter the scene and compared to the candidates in each new frame. There

are many different types of appearance models proposed in the literature, of which a
selection is presented below.
The most direct way to describe a person would be his or her full-body im-
age. Theoretically, it would be possible to take the image of a tracked person in
one frame and to compare it pixel-by-pixel with the foreground region in question in
the next frame. This, however, is not suitable for re-identification after more than a
few frames, e.g. after an occlusion, since the appearance and silhouette of a person
changes continuously during the movement through the scene. A good appearance
model has to be general enough to comprise the varying appearance of a person dur-
ing the observation time, as well as being detailed enough to distinguish the person
from others in the scene. To solve this problem for a large number of people un-
der varying lighting conditions requires elaborate methods. The appearance models
presented in the following sections assume that the lighting within the scene is tem-
porally and spatially constant and that the clothes of the tracked persons look similar
from all sides. Example applications using these models for people tracking can be
found in [24, 59] (Temporal Texture Templates), [3, 32] (color histograms), [34, 56]
(Gaussian Mixture Models) and [17] (Kernel Density Estimation). Other possible
descriptions, e.g. using texture features, are presented in Sect. 5.2.

Temporal Texture Templates

A Temporal Texture Template consists of the average color or monochrome image
T(x̃, ỹ) of a person in combination with a matrix of weights W(x̃, ỹ) that counts
the occurrence of every pixel as foreground. The matrix is initialized with zeros.
(x̃, ỹ) are the relative coordinates inside the bounding box of the segmented person,
normalized to a constant size to cope with the changing scale of a person in the
camera image. In the following, (x, y) will be considered as the image coordinates
that correspond to (x̃, ỹ) and vice versa. In image frame I(x, y, t) with the isolated
foreground mask M_n(x, y, t) for the n-th person, the Temporal Texture Template is
updated as follows:

W_n(\tilde{x}, \tilde{y}, t) = W_n(\tilde{x}, \tilde{y}, t - 1) + M_n(x, y, t) \quad (5.18)

T_n(\tilde{x}, \tilde{y}, t) = \frac{W_n(\tilde{x}, \tilde{y}, t - 1)\, T_n(\tilde{x}, \tilde{y}, t - 1) + M_n(x, y, t)\, I(x, y, t)}{\max\{W_n(\tilde{x}, \tilde{y}, t), 1\}} \quad (5.19)

Figure 5.31 shows an example of how the template and the weight matrix evolve
over time. Depicting the weights as a grayscale image shows the average silhouette
of the observed person with higher values towards the body center and lower values
in the regions of moving arms and legs. Since these weights are also a measure for
the reliability of the respective average image value, they are used to calculate the
weighted sum of differences d_{T_{n,k}} between the region of a person candidate k and
the average appearance of the compared n-th person:

d_{T_{n,k}} = \frac{\sum_x \sum_y M_k(x, y, t)\, W_n(\tilde{x}, \tilde{y}, t - 1)\, \left| T_n(\tilde{x}, \tilde{y}, t - 1) - I(x, y, t) \right|}{\sum_x \sum_y M_k(x, y, t)\, W_n(\tilde{x}, \tilde{y}, t - 1)} \quad (5.20)
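The update rule of eqs. 5.18 and 5.19 translates almost directly into code. The
following sketch is plain C++ and updates template and weight matrix with one new
frame; warping from image coordinates to the normalized bounding-box grid is
assumed to be done beforehand, so all arrays share the same layout:

#include <vector>
#include <algorithm>
#include <cstddef>

// Update a Temporal Texture Template (sketch). mask contains 0/1 values,
// image holds the grey value warped to the normalized bounding-box grid.
void updateTemplate(std::vector<double>& T, std::vector<double>& W,
                    const std::vector<int>& mask, const std::vector<double>& image) {
  for (std::size_t i = 0; i < T.size(); ++i) {
    const double wOld = W[i];
    W[i] = wOld + mask[i];                            // eq. 5.18
    const double denom = std::max(W[i], 1.0);         // eq. 5.19
    T[i] = (wOld * T[i] + mask[i] * image[i]) / denom;
  }
}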

Fig. 5.31. Temporal Texture Templates calculated from k = 1, 5, and 50 images.

Color Histograms

A color histogram h(r', g', b') is a non-parametric representation of the color distrib-
ution inside an object (Fig. 5.32 a). The indices r', g' and b' quantize the (r,g,b)-color
space, e.g. in 32 intervals per axis. Each histogram bin represents the total number
of pixels inside the corresponding color space cube. For a generalized description
of a person, the histogram has to be generated from multiple images. The normal-
ized histogram h̃(r', g', b') represents the relative occurrence frequency of each color
interval:

\tilde{h}(r', g', b') = \frac{h(r', g', b')}{\sum_{\tilde{r}'} \sum_{\tilde{g}'} \sum_{\tilde{b}'} h(\tilde{r}', \tilde{g}', \tilde{b}')} \quad (5.21)

To compare a person candidate to this description, a normalized color histogram
of the corresponding foreground region h̃_k(r', g', b') is created, and the difference
between both histograms is calculated:

d_{H_{n,k}} = \sum_{r'} \sum_{g'} \sum_{b'} |h_n(r', g', b') - h_k(r', g', b')| \quad (5.22)

An alternative method to compare two normalized histograms is the "histogram in-
tersection" that was introduced by Swain and Ballard [46]. The higher the similarity
between two histograms, the greater is the sum s_{H_{n,k}}:

s_{H_{n,k}} = \sum_{r'} \sum_{g'} \sum_{b'} \min\{h_n(r', g', b'), h_k(r', g', b')\} \quad (5.23)

Instead of the (r,g,b)-space, other color spaces are often used that are less sensitive
to illumination changes (see appendix A.5.3). Another expansion of this
principle uses multiple local histograms instead of a global one to include spatial
information in the description.
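The histogram intersection of eq. 5.23 is particularly easy to implement. The following
sketch (plain C++) assumes that both histograms are already normalized and stored as
flat vectors of equal length:

#include <vector>
#include <algorithm>
#include <cstddef>

// Histogram intersection (eq. 5.23): the higher the result, the more
// similar the two normalized histograms are (sketch).
double histogramIntersection(const std::vector<double>& hn,
                             const std::vector<double>& hk) {
  double s = 0.0;
  const std::size_t n = std::min(hn.size(), hk.size());
  for (std::size_t i = 0; i < n; ++i)
    s += std::min(hn[i], hk[i]);
  return s;   // equals 1 for identical normalized histograms
}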

Gaussian Mixture Models

A Gaussian Mixture Model is a parametric representation of the color distribution
of an image. The image is clustered into N regions of similar color ("blobs"), e.g.
using the Expectation Maximization (EM) algorithm [15]. Each v-dimensional blob
i = 1 . . . N is described as a Gaussian distribution by its mean \mu_i and covariance
matrix K_i, either in the three-dimensional color space (c = (r, g, b), v = 3) or in the
five-dimensional combined feature space of color and normalized image coordinates
(c = (r, g, b, x̃, ỹ), v = 5):

p_i(c) = \frac{1}{(2\pi)^{v/2} \sqrt{\det K_i}} \cdot \exp\left( -\frac{1}{2} (c - \mu_i)^T K_i^{-1} (c - \mu_i) \right) \quad (5.24)

The Mahalanobis distance between a pixel value c and the i-th blob is given as:

\Delta_M(c, i) = (c - \mu_i)^T K_i^{-1} (c - \mu_i) \quad (5.25)

To reduce the computational complexity, the uncorrelated Gaussian distribution is often
used instead, see eqs. 5.8 - 5.10. The difference d_{G_{n,k}} between the description of the n-th
person and an image region M_k(x, y) is calculated as the average minimal Maha-
lanobis distance between each image pixel c(x, y) inside the mask and the set of
Gaussians:

d_{G_{n,k}} = \frac{\sum_x \sum_y M_k(x, y) \min_{i_n = 1 \ldots N_n} \Delta_M(c(x, y), i_n)}{\sum_x \sum_y M_k(x, y)} \quad (5.26)

While a parametric model is a compact representation of the color distribution of
a person, it is in most cases only a rough approximation of the actual appearance.
On the other hand, this results in high generalization capabilities, so that it is often
sufficient to create the model from one image frame only. A difficult question, how-
ever, is how to choose the right number of clusters for an optimal approximation of the
color density function.

Kernel Density Estimation

Kernel density estimation combines the advantages of both parametric Gaussian
modeling and non-parametric histogram descriptions, and allows the approximation
of any color distribution by a smooth density function. Given a set of N randomly
chosen sample pixels c_i, i = 1 . . . N, the estimated color probability density func-
tion p_K(c) is given by the sum of kernel functions K(c_i) which are centered at every
color sample:

p_K(c) = \frac{1}{N} \sum_{i=1}^{N} K(c_i) \quad (5.27)
Multivariate Gaussians (eq. 5.24) are often used as an appropriate kernel. The vari-
ances of the individual axes of the chosen feature space have to be defined a priori.
This offers the possibility to weigh the axes individually, e.g. with the purpose to
increase the stability towards illumination changes. Fig. 5.32 illustrates the principle
of kernel density estimation in comparison to histogram representation.
To measure the similarity between a person described by kernel density estima-
tion and a segmented image region, the average color probability according to eq.
5.27 is calculated over all pixels in that region.

Fig. 5.32. Comparison of probability density approximation by histograms (a) and kernel den-
sity estimation (b).

5.3.2.2 Combined Tracking and Detection


In contrast to the tracking principle presented in the preceding section, the tracked
persons here are not assigned to previously detected person candidates, but instead
the predicted positions of their silhouettes are iteratively matched to image regions
(Fig. 5.33). For robust matching, the initial positions have to be as close as possible to
the target coordinates, therefore high frame rates and fast algorithms are preferable.
After the positions of all tracked people are identified, new persons entering the
camera field of view are detected by analyzing the remaining foreground regions as
explained in the previous section.

(a) Initialization (b) Adaptation

Fig. 5.33. Combined tracking and detection by iterative model adaptation.

This approach to tracking can be split into two different types of algorithms


depending on the chosen person modeling. The first type, shape-based tracking,
aligns a model of the human shape to the segmented foreground mask, while
the second type, color-based tracking, works directly with the camera image and
combines segmentation and tracking by using a color description of each person.
Both approaches are discussed in the following.

Shape-based Tracking

The silhouette adaptation process is initialized with the predicted image position
of a person’s silhouette, which can be represented either by the parameters of
the surrounding bounding box (center coordinates, width and height) or by the
parameters of a 2D model of the human shape. The simplest model is an average
human silhouette that is shifted and scaled to approximate a person’s silhouette in
the image. If the framerate is not too low, the initial image area overlaps with the
segmented foreground region that represents the current silhouette of that person.
The task of the adaptation algorithm is to update the model parameters to find the
best match between the silhouette representation (bounding box, 2D model) and the
foreground region.

Mean Shift Tracking


Using a bounding box representation, the corresponding rectangular region can be
regarded as a search window that has to be shifted towards the local maximum of
the density of foreground pixels. That density is approximated by the number of
foreground pixels inside the search window at the current position. The Mean Shift
Algorithm is an iterative gradient climbing algorithm to find the closest local density
maximum [9, 11]. In each iteration i, the mean shift vector dx(i) = (dx(i), dy(i))
is calculated as the vector pointing from the center (xm , ym ) of the search window
(with width w and height h) to the center of gravity (COG) of all included data points.
In addition, each data point can be weighted by a kernel function k(x, y) centered
at the middle of the search window, e.g., a Gaussian kernel with $\sigma_x \sim w$ and $\sigma_y \sim h$:

$$k(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \cdot \exp\left(-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right)\right) \qquad (5.28)$$
As a result, less importance is given to the outer points of a silhouette, which include
high variability and therefore unreliability due to arm and leg movements. Using the
segmented foreground mask M(x, y) as source data, the mean shift vector dx(i) is
calculated as

$$\mathbf{dx}(i) = \frac{\displaystyle\sum_{y=-h/2}^{h/2} \sum_{x=-w/2}^{w/2} \begin{pmatrix} x \\ y \end{pmatrix} M(x_m(i-1)+x,\, y_m(i-1)+y)\, k(x, y)}{\displaystyle\sum_{y=-h/2}^{h/2} \sum_{x=-w/2}^{w/2} M(x_m(i-1)+x,\, y_m(i-1)+y)\, k(x, y)} \qquad (5.29)$$

The position of the search window is then updated with the mean shift vector:

$$\mathbf{x}_m(i) = \mathbf{x}_m(i-1) + \mathbf{dx}(i) \qquad (5.30)$$

This iteration is repeated until it converges, i.e. $\|\mathbf{dx}(i)\| < \Theta$, with $\Theta$ being a small
threshold. Since the silhouette grows or shrinks if a person moves forwards or back-
wards in the scene, the size of the search window has to be adapted accordingly.
To this end, the rectangle is resized after the last iteration according to the size of
the enclosed foreground area (CAMShift - Continuously Adaptive Mean Shift Algo-
rithm [7]),


$$w \cdot h = C \cdot \sum_{y=-h/2}^{h/2} \sum_{x=-w/2}^{w/2} M(x_m(i)+x,\, y_m(i)+y)\, k(x, y) \qquad (5.31)$$

with $C$ and the width-to-height ratio $w/h$ being user-defined constants. An alternative


method is to calculate the standard deviations σx and σy of the enclosed foreground
and set the width and height to proportional values.
One advantage of this method over tracking following detection is the higher
stability in case of bad foreground segmentation: Where an advanced person
detection algorithm would be necessary to recognize a group of isolated small
foreground regions as one human silhouette, the mean shift algorithm solves this
problem automatically by adapting to the density maximum.
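
The following sketch summarizes one mean shift iteration (eqs. 5.28–5.30) and the subsequent CAMShift-style window resizing (eq. 5.31) for a binary foreground mask. The data structures, the choice of the kernel bandwidths, and the omission of the kernel weighting in the resize step are simplifications made for this example.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Binary foreground mask, stored row-major; pixels outside the image count as background.
struct Mask {
    int width = 0, height = 0;
    std::vector<std::uint8_t> data;
    int at(int x, int y) const {
        return (x < 0 || y < 0 || x >= width || y >= height) ? 0 : data[y * width + x];
    }
};

struct Window { float xm, ym; int w, h; };   // search window center and size

// One mean shift iteration (eqs. 5.28-5.30): move the window center to the
// kernel-weighted center of gravity of the enclosed foreground pixels.
// The return value is the length of the shift, used as convergence criterion.
float meanShiftStep(const Mask& mask, Window& win) {
    const float sx = win.w / 2.0f, sy = win.h / 2.0f;   // sigma_x ~ w, sigma_y ~ h
    float sum = 0.0f, sumX = 0.0f, sumY = 0.0f;
    for (int y = -win.h / 2; y <= win.h / 2; ++y)
        for (int x = -win.w / 2; x <= win.w / 2; ++x) {
            const float k = std::exp(-0.5f * (x * x / (sx * sx) + y * y / (sy * sy)));
            const float m = mask.at(int(win.xm) + x, int(win.ym) + y) * k;
            sum += m;  sumX += x * m;  sumY += y * m;
        }
    if (sum <= 0.0f) return 0.0f;                       // empty window: no shift
    const float dx = sumX / sum, dy = sumY / sum;
    win.xm += dx;  win.ym += dy;
    return std::sqrt(dx * dx + dy * dy);
}

// CAMShift-style resizing after convergence: couple the window area w*h to the
// enclosed foreground area (eq. 5.31, here without the kernel weighting);
// C and the aspect ratio w/h are user-defined constants.
void resizeWindow(const Mask& mask, Window& win, float C, float aspect) {
    float area = 0.0f;
    for (int y = -win.h / 2; y <= win.h / 2; ++y)
        for (int x = -win.w / 2; x <= win.w / 2; ++x)
            area += static_cast<float>(mask.at(int(win.xm) + x, int(win.ym) + y));
    win.h = std::max(1, static_cast<int>(std::sqrt(C * area / aspect)));
    win.w = std::max(1, static_cast<int>(aspect * win.h));
}

In a tracker built along these lines, meanShiftStep would be called in a loop until the returned shift falls below the threshold, followed by a single call to resizeWindow.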

Linear Regression Tracking


For further improvement of tracking stability, knowledge about the human shape
is incorporated into the system. In the following, an average human silhouette
is used for simplicity, of which the parameters are x- and y-position, scale and
width-to-height ratio (Fig. 5.34 a). The task is to find the optimal combination of
model parameters that minimizes the difference between the model and the target
region.


Fig. 5.34. Iterative alignment (c) of a shape model (a) to the segmented foreground (b) using
linear regression.

One approach to solve this problem is to define a measure for the matching error
(e.g. the sum of all pixel differences) and to find the minimum of the error function
in parameter space using global search techniques or gradient-descent algorithms.
These approaches, however, disregard the fact that the high-dimensional vector of
pixel differences between the current model position and the foreground region
contains valuable information about how to adjust the parameters. For example, the
locations of pixel differences in Fig. 5.34 c between the average silhouette and the
foreground region correspond to downscaling and shifting the model to the right.
Expressed mathematically, the task is to find a transformation between two vector
spaces: the w ∗ h-dimensional space of image differences di (by concatenating all
rows of the difference image into one vector) and the space of model parameter
changes dp , which has 4 dimensions here. One solution to this task is provided by
linear regression, resulting in the regression matrix R as a linear approximation of
the correspondence between the two vector spaces:

$$\mathbf{d}_p = \mathbf{R}\, \mathbf{d}_i \qquad (5.32)$$
The matching algorithm works iteratively. In each iteration, the pixel difference vec-
tor di is calculated using the current model parameters, which are then updated using
the parameter change resulting from eq. 5.32.
To calculate the regression matrix R, a set of training data has to be collected in
advance by randomly varying the model parameters and calculating the difference
vector between the original and the shifted and scaled shape. All training data is
written in two matrices Di and Dp , of which the corresponding columns are the
difference vectors di or the parameter changes dp respectively. The regression
matrix is then calculated as

$$\mathbf{R} = \mathbf{D}_p\, \boldsymbol{\Phi}\, \boldsymbol{\Lambda}^{-1}\, \boldsymbol{\Phi}^T\, \mathbf{D}_i^T \qquad (5.33)$$

where the columns of matrix $\boldsymbol{\Phi}$ are the first $m$ eigenvectors of $\mathbf{D}_i^T \mathbf{D}_i$ and $\boldsymbol{\Lambda}$ is the
diagonal matrix of the corresponding eigenvalues $\lambda$. A derivation of this equation
with application to human face modeling can be found in [12].
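
Assuming the regression matrix R has been computed offline according to eq. 5.33, the iterative alignment of eq. 5.32 reduces to a repeated matrix–vector product and parameter update, as the following sketch indicates; the difference image is assumed to be provided by a rendering step that is not shown here.

#include <vector>

// Iterative model alignment of eq. 5.32: the regression matrix R (one row per
// model parameter, one column per difference-image pixel) is assumed to have
// been computed offline from training data according to eq. 5.33.
using Vec = std::vector<float>;
using Mat = std::vector<Vec>;          // row-major: R[parameter][pixel]

Vec multiply(const Mat& R, const Vec& di) {
    Vec dp(R.size(), 0.0f);
    for (std::size_t r = 0; r < R.size(); ++r)
        for (std::size_t c = 0; c < di.size(); ++c)
            dp[r] += R[r][c] * di[c];
    return dp;
}

// params = (x, y, scale, aspect ratio). The difference image between the model
// at the current parameters and the segmented foreground is computed elsewhere.
// The step is repeated until the predicted parameter change becomes negligible.
void alignmentStep(const Mat& R, Vec& params, const Vec& differenceImage) {
    const Vec dp = multiply(R, differenceImage);
    for (std::size_t i = 0; i < params.size(); ++i)
        params[i] -= dp[i];            // sign depends on how the differences are defined
}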
Models of the human shape increase the overall robustness of the tracking sys-
tem, because they cause the system to adapt to similar foreground areas instead
of tracking arbitrarily shaped regions. Furthermore, they improve people separation
during overlap (Sect. 5.3.3).

Color-based Tracking

In this type of tracking algorithms, appearance models of all tracked persons in the
image are used for the position update. Therefore no previous segmentation of the
image in background and foreground regions is necessary. Instead, the segmentation
and tracking steps are combined. Using a background model (Sect. 5.3.1) and the
color descriptions of the persons at the predicted positions, each pixel is classified ei-
ther as background, as belonging to a specific person, or as unrecognized foreground.
An example is the Pfinder (“Person finder”) algorithm [56]. Here, each person n is
represented by a mixture of m multivariate Gaussians Gn,i (c), i = 1 . . . m, with
c = (r, g, b, x, y) being the combined feature space of color and image coordinates
(Sect. 5.3.2.1). For each image pixel c(x, y), the Mahalanobis distances ΔM,B to
the background model and ΔM,n,i to each Gaussian are calculated according to
equation 5.10 or 5.25 respectively. A pixel is then classified using the following
equation, where θ is an appropriate threshold:
$$\Delta_{M,\min} = \min_{n,i}\{\Delta_{M,B},\, \Delta_{M,n,i}\} \qquad (5.34)$$

$$\mathbf{c}(x, y) = \begin{cases} \text{unrecognized foreground} & \text{if } \Delta_{M,\min} \geq \theta \\ \text{background} & \text{if } \Delta_{M,B} = \Delta_{M,\min} \text{ and } \Delta_{M,B} < \theta \\ \text{person } n, \text{ cluster } i & \text{if } \Delta_{M,n,i} = \Delta_{M,\min} \text{ and } \Delta_{M,n,i} < \theta \end{cases} \qquad (5.35)$$
The set of pixels assigned to each cluster is used to update the mean and variance of
the respective Gaussian. The average translation of all clusters of one person defines
the total movement of that person. The above algorithm is repeated iteratively until
it converges.
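
The per-pixel decision of eqs. 5.34 and 5.35 can be sketched as follows; the Mahalanobis distances are assumed to have been computed beforehand with eq. 5.10 or 5.25.

#include <vector>

// Classification result for one pixel according to eqs. 5.34/5.35.
struct PixelLabel {
    enum Kind { Background, Person, Unrecognized } kind = Unrecognized;
    int person = -1, cluster = -1;
};

// distBackground: Mahalanobis distance of the pixel to the background model;
// distPerson[n][i]: distance to cluster i of person n; theta: decision threshold.
PixelLabel classifyPixel(float distBackground,
                         const std::vector<std::vector<float>>& distPerson,
                         float theta) {
    float best = distBackground;                 // Delta_M,min so far
    int bestPerson = -1, bestCluster = -1;
    for (std::size_t n = 0; n < distPerson.size(); ++n)
        for (std::size_t i = 0; i < distPerson[n].size(); ++i)
            if (distPerson[n][i] < best) {
                best = distPerson[n][i];
                bestPerson = static_cast<int>(n);
                bestCluster = static_cast<int>(i);
            }
    PixelLabel label;
    if (best >= theta) {
        label.kind = PixelLabel::Unrecognized;   // too far from every model
    } else if (bestPerson < 0) {
        label.kind = PixelLabel::Background;     // background model is closest
    } else {
        label.kind = PixelLabel::Person;         // closest cluster of person n
        label.person = bestPerson;
        label.cluster = bestCluster;
    }
    return label;
}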

(a) source image; (b) high threshold; (c) low threshold; (d) expectation-based segmentation

Fig. 5.35. Image segmentation with fixed thresholds (b,c) or expected color distributions (d).

One problem of common background subtraction (Sect. 5.3.1) is to find a global


threshold that minimizes the segmentation error (see Sect. 2.1.2.1: optimal threshold-
ing algorithm in Fig. 2.6), because people may wear clothes whose colors are more or less
similar to those of different background regions (Fig. 5.35 b,c). In theory, the optimal decision border
between two Gaussians is defined by equal Mahalanobis distances to both distribu-
tions. The same principle is applied here to improve image segmentation (Fig. 5.35
d). An additional advantage of this method is the possibility to include the separation
of people during occlusion in the segmentation process (Sect. 5.3.3.3).
In a similar way, the method can be applied with other appearance models, e.g.
with kernel density estimation (Sect. 5.3.2.1).

5.3.3 Occlusion Handling

In the previous sections, several methods have been presented to track people as
isolated regions in the image plane. In real situations, however, people form groups or
pass each other at different distances from the camera. Depending on the monitored
scene and the camera perspective, it is more or less likely that the corresponding
silhouettes in the image plane merge into one foreground region. How to deal with
these situations is an elementary question in the development of a tracking system.
The answer depends largely on the application area and the system requirements.
In the surveillance of a parking lot from a high and distant camera position, it is
sufficient to detect when and where people merge into a group and, after the group
splits again, to continue tracking them independently (Sect. 5.3.3.1). Separate person
tracking during occlusion is neither necessary nor feasible here due to the small
image regions the persons occupy and the relatively small resulting position error. A
counter-example is the automatic video surveillance in public transport like a subway
wagon, where people partially occlude each other permanently. In complex situations
like that, elaborate methods that combine person separation by shape, color and 3D
information (Sect. 5.3.3.2 - 5.3.3.3) are required for robust tracking.
5.3.3.1 Occlusion Handling without Separation
As mentioned above, there are many surveillance tasks where inter-person occlusions
are not frequent and therefore a tracking logic that analyzes the merging and split-
ting of foreground regions suffices. This method is applied particularly in tracking
following detection systems as presented in Sect. 5.3.2.1 [24, 32].
Merging of two persons in the image plane is detected, if both are assigned to
the same foreground region. In the following frames, both are tracked as one group
object until it splits up again or merges with other persons. The first case occurs, if
two or more foreground regions are assigned to the group object. The separate image
regions are then identified by comparing them to the appearance models of the per-
sons in question. This basic principle can be extended by a number of heuristic rules
to cope with potential combinations of appearing/disappearing and splitting/merging
regions. The example in Sect. 5.3.5 describes a tracking system of this type in detail.
5.3.3.2 Separation by Shape
When the silhouettes of overlapping people merge into one foreground region, the
resulting shape contains information about how the people are arranged (Fig. 5.36).
There are multiple ways to take advantage of that information, depending on the type
of tracking algorithm used.
The people detection method in tracking following detection systems (Sect.
5.3.2.1) can be improved by further analyzing the shape of each connected fore-
ground region to derive a hypothesis about the number of persons it is composed of.
A common approach is the detection and counting of head shapes that extend from
the foreground blob (Fig. 5.36 b). Possible methods are the detection of local peaks
in combination with local maxima of the extension in y-direction [59] or the analysis
of the curvature of the foreground region’s top boundary [23]. Since these methods

Fig. 5.36. Shape-based silhouette separation: (b) by head detection, (c) by silhouette align-
ment.

will fail in situations where people are standing directly behind each other so that
not all of the heads are visible, the results are primarily used to provide additional
information to the tracking logic or to following processing steps.
A more advanced approach uses models of the human shape as introduced in
Sect. 5.3.2.2. Starting with the predicted positions, the goal is to find an arrangement
of the individual human silhouettes that minimizes the difference towards the re-
spective foreground region. Using the linear regression tracking algorithm presented
above, the following principle causes the silhouettes to align only with the valid outer
boundaries and ignore the areas of overlap: In each iteration, the values of the differ-
ence vector di are set to zero in foreground points that lie inside the bounding box
of another person (gray region in Fig. 5.36 c). As a result, the parameter update is
calculated only from the remaining non-overlapping parts. If a person is completely
occluded, the difference vector and therefore also the translation is zero, resulting in
the prediction as final position assumption.
Tracking during occlusion using only the shape of the foreground region requires
good initialization to converge to the correct positions. Furthermore, there is often
more than one possible solution in arranging the overlapping persons. Tracking ro-
bustness can be increased in combination with separation by color or 3D information.
5.3.3.3 Separation by Color
The best results in individual people tracking during occlusions are achieved if their
silhouettes are separated from each other using appearance models as described in
Sect. 5.3.2.2.
In contrast to the previous methods that work entirely in the 2D image plane,
reasoning about the relative depth of the occluding people towards each other is
necessary here to cope with the fact that people in the back are only partially visible,
thus affecting the position calculation. Knowing the order of overlapping persons
from front to back in combination with their individually segmented image regions,
a more advanced version of the shape-based tracking algorithm from Sect. 5.3.3.2
can be implemented that adapts only to the visible parts of a person and ignores the
corresponding occluded areas, i.e. image areas assigned to persons closer towards
the camera.
The relative depth positions can be derived either by finding the person arrange-
ment that maximizes the overall color similarity [17], by using 3D information, or
by tracking people on the ground plane of the scene [18, 34].
5.3.3.4 Separation by 3D Information
While people occluding each other share the same 2D region in the image plane, they
are positioned at different depths in the scene. Therefore additional depth information
allows splitting the combined foreground region into areas of similar depth, which can
then be assigned to the people in question using one of the above methods (e.g.
color descriptions).
One possibility to measure the depth of each image pixel is the use of a stereo
camera [25]. It consists of a pair of identical cameras attached side by side. An im-
age processing module searches the resulting image pair for corresponding feature
points, whose depth is calculated from the horizontal offset. Drawbacks of this
method are that the depth resolution is often coarse and that feature point detection
in uniformly colored areas is unstable.
An extension of the stereo principle is the usage of multiple cameras that are
looking at the scene from different viewpoints. All cameras are calibrated to trans-
form image coordinates into 3D scene coordinates and vice versa, as explained in
the following section. People are tracked on the ground plane, taking advantage of
the least occluded view of each person. Such multi-camera systems are the most
favored approach to people tracking in narrow indoor scenes [3, 34].

5.3.4 Localization and Tracking on the Ground Plane

In the preceding sections, various methods have been presented to track and describe
people as flat image regions in the 2D image plane. These systems provide the infor-
mation, where the tracked persons are located in the camera image. However, many
applications need to know their location and trajectory in the real 3D scene instead,
e.g., to evaluate their behaviour in sensitive areas or to merge the views from multiple
cameras. Additionally, it has been shown that knowledge about the depth positions
in the scene improves the segmentation of overlapping persons (Sect. 5.3.3.3).
A mathematical model of the imaging process is used to transform 3D world
coordinates into the image plane and vice versa. To this end, the prior knowledge of
extrinsic (camera position, height, tilt angle) and intrinsic camera parameters (focal
length, opening angle, potential lense distortions) is necessary. The equations derived
in the following use so-called homogeneous coordinates to denote points in the world
or image space:

$$\mathbf{x} = \begin{pmatrix} wx \\ wy \\ wz \\ w \end{pmatrix} \qquad (5.36)$$

The homogeneous coordinate representation extends the original coordinates
(x, y, z) by a scaling factor w, resulting in an infinite number of possible descriptions
of each 3D point. The concept of homogeneous coordinates is basically a mathe-
matical trick to represent fractional-linear (projective) transformations by linear matrix
operations. Furthermore, it enables the computer to perform calculations with points
lying at infinity (w = 0).
A coordinate transformation with homogeneous coordinates has the general form
x̃ = Mx, with M being the transformation matrix:
$$\mathbf{M} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & x_\Delta \\ r_{21} & r_{22} & r_{23} & y_\Delta \\ r_{31} & r_{32} & r_{33} & z_\Delta \\ \frac{1}{d_x} & \frac{1}{d_y} & \frac{1}{d_z} & s \end{pmatrix} \qquad (5.37)$$

The coefficients $r_{ij}$ denote the rotation of the coordinate system, $(x_\Delta, y_\Delta, z_\Delta)$ the
translation, $(\frac{1}{d_x}, \frac{1}{d_y}, \frac{1}{d_z})$ the perspective distortion, and $s$ the scaling.


Fig. 5.37. Elevated and tilted pinhole camera model.

Most cameras can be approximated by the pinhole camera model (Fig. 5.37). In
case of pincushion or other image distortions caused by the optical system, additional
normalization is necessary. The projection of 3D scene coordinates into the image
plane of a pinhole camera located in the origin is given by the transformation matrix
$\mathbf{M}_H$:

$$\mathbf{M}_H = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & -\frac{1}{D_C} & 0 \end{pmatrix} \qquad (5.38)$$

DC denotes the focal length of the camera model. Since the image coordinates are
measured in pixel units while other measures are used for the scene positions (e.g.
cm), measurement conversion is included in the coordinate transformation by ex-


pressing DC in pixel units using the width- or height resolution rx or ry of the
camera image together with the respective opening angles θx or θy :
$$D_C = \frac{r_x}{2\tan\left(\frac{\theta_x}{2}\right)} = \frac{r_y}{2\tan\left(\frac{\theta_y}{2}\right)} \qquad (5.39)$$

In a typical set-up, the camera is mounted at a certain height y = HC above the


ground and tilted by an angle α. The corresponding transformation matrices, MR for the
rotation around the x-axis and MT for the translation in the already rotated coordi-
nate system, are given by:
$$\mathbf{M}_R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\alpha & \sin\alpha & 0 \\ 0 & -\sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}; \quad \mathbf{M}_T = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & -H_C\cos\alpha \\ 0 & 0 & 1 & H_C\sin\alpha \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (5.40)$$

The total transformation matrix M results from the concatenation of the three trans-
formations:
$$\mathbf{M} = \mathbf{M}_H \mathbf{M}_T \mathbf{M}_R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\alpha & \sin\alpha & -H_C\cos\alpha \\ 0 & -\sin\alpha & \cos\alpha & H_C\sin\alpha \\ 0 & \frac{\sin\alpha}{D_C} & -\frac{\cos\alpha}{D_C} & -\frac{H_C\sin\alpha}{D_C} \end{pmatrix} \qquad (5.41)$$

The camera coordinates (xC , yC ) can now be calculated from the world coordinates
(x, y, z) as follows:
$$\begin{pmatrix} w_C x_C \\ w_C y_C \\ w_C z_C \\ w_C \end{pmatrix} = \mathbf{M} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} x \\ y\cos\alpha + z\sin\alpha - H_C\cos\alpha \\ -y\sin\alpha + z\cos\alpha + H_C\sin\alpha \\ \frac{1}{D_C}\,(y\sin\alpha - z\cos\alpha - H_C\sin\alpha) \end{pmatrix} \qquad (5.42)$$

By dividing the homogeneous coordinates by $w_C$, the final equations for the image
coordinates are derived:

$$x_C = -D_C\, \frac{x}{z\cos\alpha + (H_C - y)\sin\alpha} \qquad (5.43)$$

$$y_C = D_C\, \frac{(H_C - y)\cos\alpha - z\sin\alpha}{z\cos\alpha + (H_C - y)\sin\alpha} \qquad (5.44)$$
The value of $z_C$ is constant and denotes the image plane, $z_C = -D_C$. The inverse
transformation of the image coordinates into the 3D scene requires the prior know-
ledge of the value in one dimension due to the lower dimensionality of the 2D image.
This has to be the height y above the ground, since the floor position (x, z) is un-
known. In the analysis of the camera image, the head and feet coordinates of a person
can be detected as the extrema of the silhouette, or, in a more robust way, with the
help of body models. Therefore the height is either equal to zero or to the body height
HP of the person. The body height can be calculated, if the y-positions of the head
$y_{C,H}$ and of the feet $y_{C,F}$ are detected simultaneously:

$$H_P = H_C D_C\, \frac{y_{C,F} - y_{C,H}}{(y_{C,F}\cos\alpha + D_C\sin\alpha)(D_C\cos\alpha - y_{C,H}\sin\alpha)} \qquad (5.45)$$
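
For illustration, the following sketch implements the focal length conversion (eq. 5.39), the world-to-image projection (eqs. 5.43/5.44) and the body height estimate (eq. 5.45); angles are assumed to be given in radians, and the structure names are chosen for this example only.

#include <cmath>

// Camera parameters: height H_C above the ground, tilt angle alpha (radians)
// and focal length D_C expressed in pixel units.
struct Camera { float Hc, alpha, Dc; };

// Focal length in pixels from the horizontal resolution and opening angle (eq. 5.39).
float focalLengthPixels(float resolutionX, float openingAngleX) {
    return resolutionX / (2.0f * std::tan(openingAngleX / 2.0f));
}

// World-to-image projection according to eqs. 5.43/5.44; the world point (x, y, z)
// has y measured upwards from the ground plane.
void projectToImage(const Camera& cam, float x, float y, float z,
                    float& xc, float& yc) {
    const float ca = std::cos(cam.alpha), sa = std::sin(cam.alpha);
    const float denom = z * ca + (cam.Hc - y) * sa;
    xc = -cam.Dc * x / denom;
    yc =  cam.Dc * ((cam.Hc - y) * ca - z * sa) / denom;
}

// Body height of a person from the image y-coordinates of head and feet (eq. 5.45).
float bodyHeight(const Camera& cam, float ycFeet, float ycHead) {
    const float ca = std::cos(cam.alpha), sa = std::sin(cam.alpha);
    return cam.Hc * cam.Dc * (ycFeet - ycHead)
           / ((ycFeet * ca + cam.Dc * sa) * (cam.Dc * ca - ycHead * sa));
}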

Besides improved tracking stability especially during occlusions [3,59], tracking


on the ground plane also enables the inclusion of additional real-world knowledge
into the system, like limits on human walking speed, the minimum distance between
two persons, or valid ground positions defined by a floor map. To deal with partial
occlusions by objects in the scene, additional knowledge about the three-dimensional
structure of the monitored area can be used to predict which image regions occlude
a person’s silhouette and therefore contain no valid information about the true shape
of the person in the image plane [18].

5.3.5 Example

In this example, a people tracking system of the type tracking following detection is
built and tested using IMPRESARIO and the LTI-LIB (see appendices A and B). As
explained in Sect. 5.3.2.1, systems of this type are more appropriate in scenes where
inter-person occlusion is rare, e.g. the surveillance of large outdoor areas from a
high camera perspective, than in narrow or crowded indoor scenes. This is because the
tracked persons need to be separated from each other most of the time to
ensure stable tracking results.
The general structure of the system as it appears in IMPRESARIO is shown in
Fig. 5.38. The image source can either be a live stream from a webcam or a prerecorded
image sequence. Included on the book CD are two example sequences, one of an out-
side parking lot scene (in the IMPRESARIO subdirectory ./imageinput/ParkingLot) and
an indoor scene (directory ./imageinput/IndoorScene). The corresponding IMPRESARIO
projects are PeopleTracking Parking.ipg and PeopleTracking Indoor.ipg.
The first processing step is the segmentation of moving foreground objects by
background subtraction (see Sect. 5.3.1.1). The background model is calculated from
the first initialFrameNum frames of the image sequence, so it has to be ensured
that these frames show only the empty background scene without any persons or
other moving objects. Alternatively, a pre-built background model (using the IMPRESARIO
macro trainBackgroundModel) can be loaded, which must have been created
under exactly the same lighting conditions. The automatic adjustment of the camera
parameters has to be deactivated to prevent background colors from changing when
people with dark or bright clothes enter the scene.
After reducing the noise of the resulting foreground mask with a sequence of
morphological operations, person candidates are detected using the objectsFrom-
Mask macro (see Fig. 2.11 in Sect. 2.1.2.2). The detected image regions are passed
on to the peopleTracking macro, where they are used to update the internal list of
tracked objects. A tracking logic handles the appearing and disappearing of people in
the camera field of view as well as the merging and splitting of regions as people oc-
clude each other temporarily in the image plane. Each person is described by a one-
or two-dimensional Temporal Texture Template (Sect. 5.3.2.1) for re-identification
after an occlusion or, optionally, after re-entering the scene.

Fig. 5.38. IMPRESARIO example for people tracking.
The tracking result is displayed using the drawTrackingResults macro (Fig.
5.39). The appearance models of individual persons can be visualized with the macro
extractTrackingTemplates.


Fig. 5.39. Example output of tracking system (a) before, (b) during and (c) after an overlap.

In the following, all macros are explained in detail and the possibilities to vary
their parameters are presented. The parameter values used in the example sequences
are given in parentheses after each description.

Background Subtraction As presented in Sect. 5.3.1.1, the background model de-


scribes each pixel of the background image as a Gaussian color distribution. The
additional shadow reduction algorithm assumes that a shadow is a small reduc-
tion in image intensity without a significant change in chromaticity. The shadow
similarity $s$ of a pixel is calculated as

$$s = w_s(\Delta_I)\, \frac{w_I \Delta_I - \Delta_c}{w_I \Delta_I}\,, \qquad (5.46)$$
where ΔI is the decrease of intensity between the background model and the cur-
rent image and Δc is the chromaticity difference, here calculated as the Euclid-
ean distance between the u and v values in the CIELuv color space. The user
defined parameter intensityWeight wI can be used to give more or less weight
to the intensity difference, depending on how stable the chromaticity values in
the shadows are in the particular scene. To prevent dark foreground colors (e.g.
black clothes) from being classified as shadows, the shadow similarity is weighted with
a function ws (ΔI ) that gives a higher weight to a low decrease in intensity (Fig.
5.40). This function is defined by the parameters minShadow and maxShadow. If
a segmented foreground pixel has a shadow similarity s that is greater than the
user defined threshold shadowThreshold, the pixel is classified as background.


Fig. 5.40. Shadow similarity scaling factor ws as a function of the decrease of intensity ΔI .
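
A sketch of this shadow test is given below. Since the exact shape of $w_s$ is only defined qualitatively by Fig. 5.40, a simple piecewise-linear falloff between minShadow and maxShadow is assumed here; the function and parameter names are chosen for this example.

// Weighting function w_s of Fig. 5.40, assumed here to be 1 for small intensity
// decreases and to fall off linearly to 0 between minShadow and maxShadow.
float shadowWeight(float deltaI, float minShadow, float maxShadow) {
    if (deltaI <= minShadow) return 1.0f;
    if (deltaI >= maxShadow) return 0.0f;
    return (maxShadow - deltaI) / (maxShadow - minShadow);
}

// Shadow test for a segmented foreground pixel (eq. 5.46). deltaI is the decrease
// in intensity between background model and current image, deltaC the chromaticity
// difference (e.g. Euclidean distance of the u,v components in CIELuv).
bool isShadowPixel(float deltaI, float deltaC, float intensityWeight,
                   float minShadow, float maxShadow, float shadowThreshold) {
    if (deltaI <= 0.0f) return false;            // pixel is not darker than the background
    const float s = shadowWeight(deltaI, minShadow, maxShadow)
                    * (intensityWeight * deltaI - deltaC) / (intensityWeight * deltaI);
    return s > shadowThreshold;                  // if true, reclassify as background
}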

The macro has the following parameters:


load background model: True: load background model from file (created with
macro trainBackgroundModel), false: create background model from the
first initFrameNum frames of the image sequence. These frames have to show
the empty background scene, without any moving persons or objects. (False)
model file: Filename to load background model from if load background model
is true.
initFrameNum: Number of first frames to create the background model from if
load background model is false. (30)
threshold: Threshold for the Mahalanobis distance in color space to decide be-
tween foreground object and background. (30.0)
use adaptive model: True: Update background model in each frame at the back-
ground pixels, false: use static background model for entire sequence. (True)
adaptivity: Number of past images to define the background model. A new image
is weighted with 1/adaptivity, if the current frame number is lower than
adaptivity, or with 1/(frame number) otherwise. “0” stands for an infinite
number of frames, i.e. the background model is the average of all past frames.
(100)
apply shadow reduction: True: reduce the effect of classifying shadows as fore-
ground object. (True)
minShadow: Shadow reduction parameter, see explanation above. (0.02)
maxShadow: Shadow reduction parameter. (0.2)
shadowThreshold: Shadow reduction parameter. (0.1)
intensityWeight: Shadow reduction parameter. (3.0)
Morphological Operations This macro applies a sequence of morphological oper-
ations like dilation, erosion, opening and closing on a binary or grayscale image
as presented in Sect. 5.3.1.2. Input and output are always of the same data type,
either a lti::channel8 or a lti::channel. The parameters are:
Morph. Op. Sequence: std::string that defines the sequence of morpholog-
ical operations using the following abbreviations: “e” = erosion, “d” = dilation,
“o” = opening, “c” = closing. (“ddeeed”)
Mode: process input channel in “binary” or “gray” mode. (“binary”)
Kernel type: Type of structure element used. (square)
Kernel size: Size of the structure element (3,5,7,...). (3)
apply clipping: Clip values that lie outside the valid range (only used in gray
mode). (False)
filter strength: Factor for kernel values (only used in gray mode). (1)
ObjectsFromMask This macro detects connected regions in the foreground mask.
Threshold: Threshold to separate between foreground and background values.
(100)
Level: Iteration level to describe hierarchy of objects and holes. (0)
Assume labeled mask: Is input a labeled foreground mask (each region identified
by a number)? (False)
Melt holes: Close holes in object description (not necessary when Level=0).
(False)
Minimum size: Minimum size of detected objects in pixel. (60)
Sort Objects: Sort objects according to Sort by Area. (False)
Sort by Area: Sort objects by their area size (true) or the size of the data structure
(false). (False)
PeopleTracker The tracking macro is a tracking following detection system (see
Sect. 5.3.2.1) based on a merge/split logic (Sect. 5.3.3.1). The internal state of
the system, which is also the output of the macro, is represented by an instance
of the class trackedObjects (Fig. 5.41). This structure includes two lists: The first
one is a list of the individual objects (here: persons), each described by a one-
or two-dimensional Temporal Texture Template.

trackedObjects: std::list<individualObjects>, std::list<visibleBlobs>

Fig. 5.41. Structure of class trackedObjects. One or more individual objects (persons) are
assigned to one visible foreground blob. Gray (not assigned) objects have left the scene, but are
kept for re-identification.

The second list includes all cur-
rent foreground regions (“blobs”), represented each by lti::areaPoints,
the bounding box and a Kalman filter to predict the movement from the past tra-
jectory. One or, in case of inter-person overlap, multiple individual objects are
assigned to one visible blob. In each image frame, this structure is updated by
comparing the predicted positions of the foreground blobs in the visibleBlob list
with the newly detected foreground regions. The following cases are recognized
(see Fig. 5.42; a simplified sketch of this decision logic is given after the list):

Fig. 5.42. Possible cases during tracking. Dashed line = predicted bounding box of tracked
blob, solid line and gray rectangle = bounding box of new detected foreground region.

a) Old blob does not overlap with new blob: The blob is deleted from the
visibleBlob list, the assigned persons are deleted or deactivated in the indi-
vidualObjects list.
b) One old blob overlaps with one new blob: Clear tracking case, the position
of the blob is updated.
c) Multiple old blobs overlap with one new blob: Merge case, new blob is
added to visibleBlob list, old blobs are deleted, and all included persons are
assigned to the new blob.
d) One old blob overlaps with multiple new blobs: To solve the ambiguity,
each person included in the blob is assigned to one of the new blobs using
the Temporal Texture Template description. Each assigned new blob is added
to the visibleBlob list, the old blob is deleted. In the special case of only one
included person and multiple overlapping blobs which do not overlap with
other old blobs, a “false split” is assumed. This can occur if the segmented
silhouette of a person consists of separate regions due to similar colors in the
background. In this case, the separate regions are united to update the blob
position.
e) New blob that does not overlap with an old blob: The blob is added to
the visibleBlob list if its width-to-height ratio is inside the valid range for
a person. If the parameter memory is true, the system tries to re-identify the
person; otherwise a new person is added to the individualObjects list.
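
The following sketch outlines this case analysis. It only computes the bounding-box overlaps and marks where the cases a)–e) would be handled; the actual bookkeeping (Temporal Texture Templates, Kalman prediction, list updates) is part of the peopleTracking macro and is not reproduced here.

#include <vector>

// Axis-aligned bounding box of a tracked (predicted) or newly detected blob.
struct Box { float x0, y0, x1, y1; };

bool overlaps(const Box& a, const Box& b) {
    return a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;
}

// Skeleton of the case analysis of Fig. 5.42.
void updateTrackedBlobs(const std::vector<Box>& predictedOld,
                        const std::vector<Box>& detectedNew) {
    std::vector<int> newOverlapCount(detectedNew.size(), 0);
    for (const Box& oldBox : predictedOld) {
        std::vector<std::size_t> hits;
        for (std::size_t j = 0; j < detectedNew.size(); ++j)
            if (overlaps(oldBox, detectedNew[j])) {
                hits.push_back(j);
                ++newOverlapCount[j];
            }
        if (hits.empty()) {
            // case a): blob has disappeared -> delete it, deactivate its persons
        } else if (hits.size() == 1) {
            // case b): clear update of the blob position, or
            // case c): merge, if other old blobs overlap the same new blob
        } else {
            // case d): split -> assign each contained person to one of the new
            // blobs via its Temporal Texture Template (or treat as a false split)
        }
    }
    for (std::size_t j = 0; j < detectedNew.size(); ++j)
        if (newOverlapCount[j] == 0) {
            // case e): new blob -> add it if its width-to-height ratio is plausible,
            // then try to re-identify a known person or create a new one
        }
}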
The macro uses the following parameters:
Memory: True: Persons that have left the field of view are not deleted but used
for re-identification whenever a new person is detected. (True)
Threshold: Similarity threshold to re-identify a person. (100)
Use Texture Templates: True: use Temporal Texture Templates to identify a per-
son, false: assign persons using only the distance between the predicted and
detected position (True)
Texture Template Weight: Update ratio for Temporal Texture Templates (0.1)
Template Size x/y: Size of Temporal Texture Template in pixels (x=50, y=100)
Average x/y Values: Use 2D Temporal Texture Template or use average values in
x- or y-direction. (x: True, y: False)
Intensity weight: Weight of intensity difference in contrast to color difference
when comparing templates (0.3)
UseKalmanPrediction: True: use a Kalman filter to predict the bounding box
positions in the next frame, false: use old positions in new frame. (True)
Min/Max width-to-height ratio: Valid range of width-to-height ratio of new de-
tected objects (Min = 0.25, Max = 0.5)
Text output: Write process information in console output window (False)
DrawTrackingResults This macro draws the tracking result into the source image.
Text: Label objects. (True)
Draw Prediction: Draw predicted bounding box additionally. (False)
Color: Choose clearly visible colors for each single object and merged objects.
ExtractTrackingTemplates This macro can be used to display the appearance
model of a chosen person from the trackedObjects class.
Object ID: ID number of tracked object (person) to display.
Display Numbers: True: Display ID number.
Color of the numbers: Color of displayed ID number.

References
1. Ahonen, T., Hadid, A., and Pietikäinen, M. Face Recognition with Local Binary Patterns.
In Proceedings of the 8th European Conference on Computer Vision, pages 469–481.
Springer, Prague, Czech Republic, May 11-14 2004.
2. Alvarado, P., Dörfler, P., and Wickel, J. AXON 2 - A Visual Object Recognition System
for Non-Rigid Objects. In Hamza, M. H., editor, IASTED International Conference-
Signal Processing, Pattern Recognition and Applications (SPPRA), pages 235–240.
Rhodes, Greece, July 3-6 2001.
3. Batista, J. Tracking Pedestrians Under Occlusion Using Multiple Cameras. In Int. Con-
ference on Image Analysis and Recognition, volume Lecture Notes in Computer Science
3212, pages 552–562. 2004.
4. Belhumeur, P., Hespanha, J., and Kriegman, D. Eigenfaces vs. Fisherfaces: Recognition
Using Class Specific Linear Projection. In IEEE Transactions on Pattern Analysis and
Machine Intelligence, pages 711–720. 1997.
5. Bileschi, S. and Heisele, B. Advances in Component-based Face Detection. In 2003 IEEE
International Workshop on Analysis and Modeling of Faces and Gestures, AMFG 2003,
pages 149–156. IEEE, October 2003.
6. Bishop, C. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
7. Bradski, G. Computer Vision Face Tracking for Use in a Perceptual User Interface. Intel
Technology Journal, Q2, 1998.
8. Brunelli, R. and Poggio, T. Face Recognition: Features versus Templates. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 15(10):1042–1052, 1993.
9. Cheng, Y. Mean Shift, Mode Seeking and Clustering. IEEE PAMI, 17:790–799, 1995.
10. Chung, K., Kee, S., and Kim, S. Face Recognition Using Principal Component Analysis
of Gabor Filter Responses. In International Workshop on Recognition, Analysis, and
Tracking of Faces and Gestures in Real-Time Systems. 1999.
11. Comaniciu, D. and Meer, P. Mean Shift: A Robust Approach toward Feature Space Analy-
sis. IEEE PAMI, 24(5):603–619, 2002.
12. Cootes, T. and Taylor, C. Statistical Models of Appearance for Computer Vision. Tech-
nical report, University of Manchester, September 1999.
13. Cootes, T. and Taylor, C. Statistical Models of Appearance for Computer Vision. Tech-
nical report, Imaging Science and Biomedical Engineering, University of Manchester,
March 2004.
14. Craw, I., Costen, N., Kato, T., and Akamatsu, S. How Should We Represent Faces for
Automatic Recognition? IEEE Transactions on Pattern Recognition and Machine Intel-
ligence, 21(8):725–736, 1999.
15. Dempster, A., Laird, N., and Rubin, D. Maximum Likelihood from Incomplete Data
Using the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38,
1977.
16. Duda, R. and Hart, P. Pattern Classification and Scene Analysis. John Wiley & Sons,
1973.
17. Elgammal, A. and Davis, L. Probabilistic Framework for Segmenting People Under Oc-
clusion. In IEEE ICCV, volume 2, pages 145–152. 2001.
18. Fillbrandt, H. and Kraiss, K.-F. Tracking People on the Ground Plane of a Cluttered Scene
with a Single Camera. WSEAS Transactions on Information Science and Applications,
9(2):1302–1311, 2005.
19. Gavrila, D. The Visual Analysis of Human Movement: A Survey. Computer Vision and
Image Understanding, 73(1):82–98, 1999.
20. Georghiades, A., Belhumeur, P., and Kriegman, D. From Few to Many: Illumination Cone
Models for Face Recognition under Variable Lighting and Pose. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
21. Gutta, S., Huang, J., Singh, D., Shah, I., Takács, B., and Wechsler, H. Benchmark Studies
on Face Recognition. In Proceedings of the International Workshop on Automatic Face-
and Gesture-Recognition (IWAFGR). Zürich, Switzerland, 1995.
22. Hähnel, M., Klünder, D., and Kraiss, K.-F. Color and Texture Features for Person Recog-
nition. In International Joint Conference on Neural Networks (IJCNN 2004). Budapest,
Hungary, July 2004.
23. Haritaoglu, I., Harwood, D., and Davis, L. Hydra: Multiple People Detection and Track-
ing Using Silhouettes. In Proc. IEEE Workshop on Visual Surveillance, pages 6–13. 1999.
24. Haritaoglu, I., Harwood, D., and Davis, L. W4: Real-Time Surveillance of People and
Their Activities. IEEE PAMI, 22(8):809–830, 2000.
25. Harville, M. and Li, D. Fast, Integrated Person Tracking and Activity Recognition with
Plan-View Templates from a Single Stereo Camera. In IEEE CVPR, volume 2, pages
398–405. 2004.
26. Heisele, B., Ho, P., and Poggio, T. Face Recognition with Support Vector Machines:
Global versus Component-based Approach. In ICCV01, volume 2, pages 688–694. 2001.
27. Kanade, T. Computer Recognition of Human Faces. Birkhauser, Basel, Switzerland,
1973.
28. Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P.,
and Konen, W. Distortion Invariant Object Recognition in the Dynamic Link Architecture.
IEEE Transactions on Computers, 42(3):300–311, 1993.
29. Manjunath, B., Salembier, P., and Sikora, T. Introduction to MPEG-7. John Wiley and
Sons, Ltd., 2002.
30. Martinez, A. Recognizing Imprecisely Localized, Partially Occluded, and Expression
Variant Faces from a Single Sample per Class. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 24(6):748–763, June 2002.
31. Martinez, A. and Benavente, R. The AR face database. Technical Report 24, CVC, 1998.
32. McKenna, S., Jabri, S., Duric, Z., Wechsler, H., and Rosenfeld, A. Tracking Groups of
People. CVIU, 80(1):42–56, 2000.
33. Melnik, O., Vardi, Y., and Zhang, C.-H. Mixed Group Ranks: Preference and Confidence
in Classifier Combination. PAMI, 26(8):973–981, August 2004.
34. Mittal, A. and Davis, L. M2-Tracker: A Multi-View Approach to Segmenting and Track-
ing People in a Cluttered Scene. Int. Journal on Computer Vision, 51(3):189–203, 2003.
35. Moeslund, T. Summaries of 107 Computer Vision-Based Human Motion Capture Papers.
Technical Report LIA 99-01, University of Aalborg, March 1999.
36. Moeslund, T. and Granum, E. A Survey of Computer Vision-Based Human Motion Cap-
ture. Computer Vision and Image Understanding, 81(3):231–268, 2001.
37. Nakajima, C., Pontil, M., Heisele, B., and Poggio, T. Full-body Person Recognition Sys-
tem. Pattern Recognition, 36:1997–2006, 2003.
38. Ohm, J. R., Cieplinski, L., Kim, H. J., Krishnamachari, S., Manjunath, B. S., Messing,
D. S., and Yamada, A. Color Descriptors. In Manjunath, B., Salembier, P., and Sikora, T.,
editors, Introduction to MPEG-7, chapter 13, pages 187–212. Wiley & Sons, Inc., 2002.
39. Penev, P. and Atick, J. Local Feature Analysis: A General Statistical Theory for Object
Representation. Network: Computation in Neural Systems, 7(3):477–500, 1996.
40. Pentland, A., Moghaddam, B., and Starner, T. View-based and Modular Eigenspaces for
Face Recognition. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR’94). Seattle, WA, June 1994.
41. Phillips, P., Moon, H., Rizvi, S., and Rauss, P. The FERET Evaluation Methodology
for Face-Recognition Algorithms. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(10):1090–1104, October 2000.
42. Ro, Y., Kim, M., Kang, H., Manjunath, B., and Kim, J. MPEG-7 Homogeneous Texture
Descriptor. ETRI Journal, 23(2):41–51, June 2001.
43. Schwaninger, A., Mast, F., and Hecht, H. Mental Rotation of Facial Components and
Configurations. In Proceedings of the Psychonomic Society 41st Annual Meeting. New
Orleans, USA, 2000.
44. Sirovich, L. and Kirby, M. Low-Dimensional Procedure for the Characterization of Hu-
man Faces. Journal of the Optical Society of America, 4(3):519–524, 1987.
45. Sung, K. K. and Poggio, T. Example-Based Learning for View-Based Human Face De-
tection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–51,
1997.
46. Swain, M. and Ballard, D. Color Indexing. Int. Journal on Computer Vision, 7(1):11–32,
1991.
47. Turk, M. A Random Walk through Eigenspace. IEICE Transactions on Information and
Systems, E84-D(12):1586–1595, 2001.
48. Turk, M. and Pentland, A. Eigenfaces for Recognition. Journal of Cognitive Neuro-
science, 3(1):71–86, 1991.
49. Vapnik, V. Statistical Learning Theory. Wiley & Sons, Inc., 1998.
50. Viola, P. and Jones, M. Robust Real-time Object Detection. In ICCV01, 2, page 747.
2001.
51. Wallraven, C. and Bülthoff, H. View-based Recognition under Illumination Changes Us-
ing Local Features. In Proceedings of the International Conference on Computer Vision
and Pattern Recognition (CVPR). Kauai, Hawaii, December 8-14 2001.
52. Waring, C. A. and Liu, X. Face Detection Using Spectral Histograms and SVMs. IEEE
Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 35(3):467–476,
June 2005.
53. Welch, G. and Bishop, G. An Introduction to the Kalman Filter. Technical Report TR
95-041, Department of Computer Science, University of North Carolina at Chapel Hill,
2004.
54. Wickel, J., Alvarado, P., Dörfler, P., Krüger, T., and Kraiss, K.-F. Axiom - A Modular
Visual Object Retrieval System. In Jarke, M., Koehler, J., and Lakemeyer, G., editors,
Proceedings of 25th German Conference on Artificial Intelligence (KI 2002), Lecture
Notes in Artificial Intelligence, pages 253–267. Springer, Aachen, September 2002.
55. Wiskott, L., Fellous, J., Krüger, N., and von der Malsburg, C. Face Recognition by Elas-
tic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 19(7):775–779, 1997.
56. Wren, C., Azerbayejani, A., Darell, T., and Pentland, A. Pfinder: Real-Time Tracking of
the Human Body. IEEE PAMI, 19(7):780–785, 1997.
57. Yang, J., Zhu, X., Gross, R., Kominek, J., Pan, Y., and Waibel, A. Multimodal People
ID for a Multimedia Meeting Browser. In Proceedings of the Seventh ACM International
Conference on Multimedia, pages 159 – 168. ACM, Orlando, Florida, United States, 1999.
58. Yang, M., Kriegman, D., and Ahuja, N. Detecting Faces in Images: A Survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24:34–58, 2002.
59. Zhao, T. and Nevatia, R. Tracking Multiple Humans in Complex Situations. IEEE PAMI,
26(9):1208–1221, 2004.
60. Zhao, W., Chellappa, R., Phillips, P. J., and Rosenfeld, A. Face recognition: A Literature
Survey. ACM Computing Surveys, 35(4):399–458, 2003.

Chapter 6

Interacting in Virtual Reality


Torsten Kuhlen, Ingo Assenmacher, Lenka Jeřábková

The word ”virtual” is widely used in computer science and in a variety of situ-
ations. In general, it refers to something that is merely conceptual, as opposed to something
that is physically real. In this respect, the term ”Virtual Reality”, which was coined
by the American artist Jaron Lanier in the late eighties, is a paradox in itself. To this
day, there exists no standardized definition of Virtual Reality (VR). All definitions
have in common, however, that they specify VR as a computer-generated scenario of
objects (virtual world, virtual environment) a user can interact with in real time and
in all three dimensions. The characteristics of VR are most often described by means
of the $I^3$: Interaction, Imagination, and Immersion:
• Interaction
Unlike an animation, which cannot be influenced by the user, in VR real-time
interaction is a fundamental and imperative feature. Thus, although movies like ”Jurassic
Park” present computer-generated virtual worlds, strictly speaking they are not
VR.
The interaction comprises navigation as well as the manipulation of the virtual
scene. In order to achieve real-time interaction, two important basic criteria have
to be fulfilled:
1. Minimum Frame Rate: A VR application must not run at a frame rate lower
than a threshold, which Bryson [6] sets at 10 Hz. A more restrictive limit is
given by Kreylos et al. [9], who recommend a minimum rate of 30 frames
per second.
2. Latency: Additionally, Bryson as well as Kreylos define the maximum
response time of a VR system after a user input has been applied. Both state that a
maximum latency of 100 milliseconds can be tolerated before feedback must
take place.
I/O devices, and especially methods and algorithms have to meet these require-
ments. As a consequence, high quality, photorealistic rendering algorithms in-
cluding global illumination methods, or complex algorithms like the Finite El-
ement Method for a physically authentic manipulation of deformable objects,
typically cannot be used in VR without significant optimizations or simplifica-
tions.
• Imagination
The interaction within a virtual environment should be as intuitive as possible. In
the ideal case, a user can perceive the virtual world and interact with it in exactly
the same way as with the real world, so that the difference between both becomes
blurred. Obviously, this goal has impact on the man-machine interface design.
First of all, since humans interact with the real world in all three dimensions, the
interface must be three-dimensional as well. Furthermore, multiple senses should
be included into the interaction, i.e., besides the visual sense, the integration of
other senses like the auditory, the haptic/tactile, and the olfactory ones should
be considered in order to achieve a more natural, intuitive interaction with the
virtual world.
• Immersion
By special display technology, the user has the impression of being a part of the
virtual environment, standing in the midst of the virtual scene, fully surrounded
by it instead of looking from outside.
In recent years, VR has proven its potential to provide an innovative human-
computer interface from which multiple application areas can profit. Whereas in the
United States VR research is mainly pushed by military applications, in Germany the
automotive industry has significantly promoted VR during the last ten years,
prompting universities and research facilities like the Fraunhofer Institutes to make
significant advances in VR technology and methodology. The reason for this pioneering
role is the goal of realizing the digital vehicle, where the whole development,
from the initial design process up to the production planning, is implemented in the
computer. By VR technology, so-called ”virtual mockups” can to some extent replace
physical prototypes, with the potential to shorten time to market, to reduce costs,
and to lead to a higher quality of the final product. In particular, the following con-
crete applications are relevant for the automotive industry and can profit from VR as
an advanced human-computer interface:
• Car body design
• Car body aerodynamics
• Design and ergonomical studies of car interiors
• Simulation of air conditioning technology in car interiors
• Crash simulations
• Car body vibrations
• Assembly simulations
• Motor development: Simulation of airflow and fuel injection in the cylinders of
combustion engines
• Prototyping of production lines and factories
Besides automotive applications, medicine has emerged as a very important field
for VR. Obviously, data resulting from 3D computed tomography or magnetic reso-
nance imaging suggest a visualization in virtual environments. Furthermore, the VR-
based simulation of surgical interventions is a significant step towards the training
of surgeons or the planning of complicated operations. The development of realistic
surgical procedures, however, poses a series of challenges, starting with the real-time
simulation of (deformable) organs and tissue, and ending with a realistic simulation of
fluids like blood and cerebrospinal fluid.

Obviously, VR covers a large variety of complex aspects which cannot all be cov-
ered within the scope of a single book chapter. Therefore, instead of trying to cover
all topics, we decided to pick out single aspects and to describe them in adequate
detail. Following the overall focus of this book, chapter 6 concentrates on the
interaction aspects of VR. Technological aspects are only described when they con-
tribute to an understanding of the algorithmic and mathematical issues. In particular,
in sections 6.1 to 6.3 we address the multimodality of VR with emphasis on visual,
acoustic, and haptic interfaces for virtual environments. Sect. 6.4 is about the model-
ing of a realistic behavior of virtual objects. The methods described are restricted to
rigid objects, since a treatment of deformable objects would go beyond the scope of
this chapter. In Sect. 6.5, support mechanisms are introduced as an add-on to haptics
and physically-based modeling, which facilitate the completion of elementary inter-
action tasks in virtual environments. Sect. 6.6 gives an introduction to algorithmic
aspects of VR toolkits, and scene graphs as an essential data structure for the im-
plementation of virtual environments. Finally, Sect. 6.7 explains the examples which
are provided on the CD that accompanies this book. For further reading, we recom-
mend the book ”3D User Interfaces – Theory and Practice” written by D. Bowman
et al. [5].

6.1 Visual Interfaces


6.1.1 Spatial Vision

The capability of the human visual system to perceive the environment in three di-
mensions originates from psychological and physiological cues. The most important
psychological cues are:
• (Linear) perspective
• Relative size of objects
• Occlusion
• Shadows and lighting
• Texture gradients
Since these cues rely on monocular and static mechanisms only, they can be
covered by conventional visual displays and computer graphics algorithms. The main
difference between vision in VR and traditional computer graphics is that VR aims
at emulating all cues. Thus, in VR interfaces the following physiological cues must
be taken into account as well:
• Stereopsis
Due to the interocular distance, slightly different images are projected onto the
retina of the two eyes (binocular disparity). Within the visual cortex of the brain,
these two images are then fused to a three-dimensional scene. Stereopsis is con-
sidered the most powerful mechanism for seeing in 3D. However, it only works
for objects that are located no more than a few meters from the
viewer.

• Oculomotor factors
The oculomotor factors are based on muscle activities and can be further subdi-
vided into accommodation and convergence.
– Accommodation
In order to focus on an object and to produce a sharp image on the eye’s
retina, the ciliar muscles deform the eye lenses accordingly.
– Convergence
The viewer rotates her eyes towards the object in focus so that the two images
can be fused together.
• Motion parallax
When the viewer is moving from left to right while looking at two objects at
different distances, the object which is closer to the viewer passes through her field of
view faster than the object which is farther away.

6.1.2 Immersive Visual Displays

While the exact imitation of psychological cues – especially photorealistic global


illumination – is mainly limited by the performance of the computer system and
its graphics hardware, the realization of physiological cues requires dedicated hard-
ware. In principle, two different types of visual, immersive VR displays can be dis-
tinguished: head-mounted and room-mounted displays.
In the early days of VR, a head-mounted display (HMD) usually formed the
heart of every VR system. A HMD is a display worn like a helmet and consists of
two small monitors positioned directly in front of the viewer’s eyes, which realize
the stereopsis. The first HMD was developed as early as 1965 by Ivan Suther-
land [19]; the first commercially available systems, however, did not appear until the late 1980s
(see Fig. 6.1, left). Although HMDs have the advantage that they fully immerse the
user into the virtual environment, they also suffer from major drawbacks. Besides
ergonomic problems, they tend to isolate the user from the real environment, which
makes the communication between two users sharing the same virtual environment
more difficult. In particular, the user loses the visual contact to her own body, which
can lead to significant irritations if the body is not represented by an adequate com-
putergraphic counterpart.
Today, HMDs are mainly used in low cost or mobile and portable VR systems.
Furthermore, a special variant of HMDs called see-through displays has only recently
gained high relevance in Augmented Reality. Here, semi-transparent mirrors are used
in order to supplement the real environment by virtual objects.
In the early 90’s, the presentation of the CAVETM at the ACM SIGGRAPH
conference [7] caused a paradigm change, away from head-mounted displays to-
wards room-mounted installations based on a combination of large projection screens
which surround the user in order to achieve the immersion effect. Fig. 6.2 shows
a CAVE-like display installed at RWTH Aachen University, Germany, consisting
of five projection screens. Although it is preferable to exclusively use rear projec-
tion screens to prevent the user from standing in the projection beam and producing
shadows on the screens, the floor of such systems is most often realized as a front
projection in order to achieve a more compact construction. In systems with four
walls allowing a 360-degree look-around, one of the walls is typically designed as a
(sliding) door to allow entering and leaving the system.

Fig. 6.1. Head-mounted displays (left photograph courtesy of P. Winandy).
In room-mounted systems, a variety of techniques exist to realize stereopsis. One
of the most common principles is based on a time multiplex method. Here, the im-
ages for the left eye and the right eye are projected on the screen one after another,
while the right and the left glass of so-called shutter glasses are opened and closed
respectively. Beside this ”active” technology, methods based on polarized or spec-
tral filters (e.g., INFITECTM ) are in use. Unlike the shutter technique, these ”passive”
methods do not require active electronic components in the glasses and work well

without an extremely precise time synchronization (known as ”gen locking”). How-
ever, since the images for the left and the right eye are shown in parallel, they demand
two projectors per screen.

Fig. 6.2. Immersive, room-mounted displays: The CAVE-like display at RWTH Aachen Uni-
versity, Germany.

6.1.3 Viewer Centered Projection

Head-mounted as well as room-mounted VR displays typically come along with a


so-called tracking system, which captures the user’s head position and orientation in
real time. By this information, the VR software is capable of adapting the perspective
of the visual scene to the user’s current viewpoint and view direction. A variety of
tracking principles for VR systems are in use, ranging from mechanical, acoustic (ul-
trasonic), and electromagnetic techniques up to inertial and opto-electronic devices.
Recently, a combination of two or more infrared (IR) cameras and IR light reflect-
ing markers attached to the stereo glasses have become the most popular tracking
technology in virtual environments (Fig. 6.3) since, in comparison to other technolo-
gies, they are working with higher precision and lower latency and are also nearly
non-intrusive.

(a) IR camera; (b) stereo shutter glasses with markers

Fig. 6.3. Opto-electronical head tracking (Courtesy of A.R.T. GmbH, Weilheim, Germany).

While in VR systems with HMDs the realization of motion parallax is obviously


straightforward by just setting the viewpoint and view direction to the position and
orientation of the tracking sensor (plus an offset indicating the position of the
left/right eye relative to the sensor), the situation is a little more difficult with room-
mounted displays. At first glance, it is not obvious that motion parallax is mandatory
in such systems at all, since many stereo vision systems are working quite well with-
out any head tracking (cp., e.g., IMAX cinema). However, since navigation is a key
feature in VR, the user is typically walking around, looking at the scene from very
different viewpoints. The left drawing in Fig. 6.4 shows what happens: Since a static
stereogram is only correct for one specific eye position, it shows a distorted image
of virtual objects when the viewer starts moving in front of the screen. This is not
only annoying, but also makes a direct, manual interaction with virtual objects in 3D
space impossible. Instead, the projection has to be adapted to the new viewpoint as
shown in the right drawing of Fig. 6.4. By this adaptation called Viewer Centered
Projection (VCP), the user is also capable of looking around objects by just moving
her head. In the example, the right side of the virtual cube becomes visible when the
user moves to the right. Fig. 6.5 demonstrates the quasi-holographic effect of VCP by
means of a tracking sensor attached to the camera. Like real objects, virtual objects
are perceived as stationary in 3D space.

(a) Distortion of the virtual cube when moving from viewpoint 1 to viewpoint 2;
(b) correction of perspective.

Fig. 6.4. The principle of viewer centered projection.

Fig. 6.5. The viewer centered projection evokes a pseudo-holographic effect. As with the
lower, real cuboid, the right side of the upper, virtual cube becomes visible when the viewpoint
moves to the right.

In the following, the mathematical background of VCP is briefly explained. In


computer graphics, 3D points are typically described as 4D vectors, the so called
homogeneous coordinates (see, e.g., [10]). Assuming the origin of the coordinate
system is in the middle of the projection plane with its x- and y-axes parallel to the
plane, the following shearing operation has to be applied to the vertices of a virtual,
polygonal object:

$$SH_{xy}\left(-\frac{e_x}{e_z},\,-\frac{e_y}{e_z}\right) =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
-\frac{e_x}{e_z} & -\frac{e_y}{e_z} & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix} \quad (6.1)$$

For an arbitrary point P = (x, y, z, 1) in the object space, applying the projection
matrix leads to the following new coordinates:
 
$$(x, y, z, 1) \cdot SH_{xy}\left(-\frac{e_x}{e_z},\,-\frac{e_y}{e_z}\right) =
\left(x - \frac{e_x}{e_z} \cdot z,\; y - \frac{e_y}{e_z} \cdot z,\; z,\; 1\right) \quad (6.2)$$
By this operation, the viewpoint is shifted to the z-axis, i.e., P = (ex, ey, ez, 1) is
transformed to P′ = (0, 0, ez, 1). The result is a simple, central perspective projec-
tion (Fig. 6.6), which is most common in computer graphics and which can, e.g., be
further processed into a parallel projection in the unit cube and handled by computer
graphics hardware.

Fig. 6.6. Shearing the view volume.

Of course, shearing the view volume as shown above must leave the pixels (pic-
ture elements) on the projection plane unaffected, which is the case when applying
(6.1):

$$(x, y, 0, 1) \cdot SH_{xy}\left(-\frac{e_x}{e_z},\,-\frac{e_y}{e_z}\right) = (x, y, 0, 1) \quad (6.3)$$
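To make the use of (6.1)–(6.3) concrete, the following small Python sketch builds the shearing matrix for a tracked eye position and applies it to points given as homogeneous row vectors. It uses numpy; the variable names and the example coordinates are chosen freely for illustration and do not refer to any particular VR toolkit.

import numpy as np

def shear_matrix(eye):
    # Shearing matrix SH_xy(-ex/ez, -ey/ez) of (6.1), acting on row vectors (x, y, z, 1).
    ex, ey, ez = eye
    S = np.eye(4)
    S[2, 0] = -ex / ez
    S[2, 1] = -ey / ez
    return S

eye = np.array([0.3, 0.1, 1.5])                # tracked eye position in screen coordinates
S = shear_matrix(eye)

p = np.array([0.2, -0.4, 0.8, 1.0])            # arbitrary point in homogeneous coordinates
print(p @ S)                                   # sheared point, cf. (6.2)
print(np.array([0.2, -0.4, 0.0, 1.0]) @ S)     # a point on the projection plane is unaffected, cf. (6.3)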
In CAVE-like displays with several screens not parallel to each other, VCP is ab-
solutely mandatory. If the user is not tracked, perspective distortions arise, especially
at the boundaries of the screens. Since every screen has its own coordinate system,
for each eye the number of different projection matrices is equal to the number of
screens.
Experience has shown that not every VR application needs to run in a fully im-
mersive system. In many cases, semi-immersive displays work quite well, especially
in the scientific and technical field. In the simplest case such displays only consist
of a single, large wall (see Fig. 6.7 left) or a table-like horizontal screen known as
Responsive Workbench [12]. Workbenches are often completed by an additional ver-
tical screen in order to enlarge the view volume (see Fig. 6.7 right). Recent efforts
towards more natural visual interfaces refer to high-resolution, tiled displays built
from a matrix of multiple projectors [11].

(a) Large rear projection wall; (b) L-shaped workbench (photograph courtesy of P. Winandy)

Fig. 6.7. Semi-immersive, room-mounted displays.

In summary, today’s visual VR interfaces are able to emulate all physiological


cues for 3D viewing. It must be taken into consideration, however, that accommoda-
tion and convergence form a natural teamwork when looking at real objects. Since
in all projection-based displays the eye lenses have to focus on the screen, this team-
work is disturbed, which can cause eye strain and can be one of the reasons for
simulator sickness.
Physiological cues can only be realized by usage of dedicated hardware like
HMDs or stereo glasses, and tracking sensors, which have to be attached to the
user. Since a major goal of VR interfaces is to be as non-intrusive as possible, auto-
stereoscopic and volumetric displays are under development (see, e.g., [8]), which
are still in their infancy, however. Today’s optical IR tracking systems represent a signifi-
cant step forward, although they still rely on markers. Up to now, fully non-intrusive
optical systems based on video cameras and image processing are still too slow for
use in VR applications.

6.2 Acoustic Interfaces

6.2.1 Spatial Hearing

In contrast to the human visual system, the human auditory system can perceive input
from all directions and has no limited field of view. As such, it provides valuable cues
for navigation and orientation in virtual environments. With respect to immersion,
the acoustic sense is a very important additional source of information for the user,
as acoustical perception works especially precisely for close auditory stimuli. This
enhances the liveliness and credibility of the virtual environment. The ultimate goal
thus would be the ability to place virtual sounds at any direction and distance
around the user in real time. However, audio stimuli are not that common in today’s
VR applications.
For visual stimuli, the brain compares pictures from both eyes to determine the
placing of objects in a scene. With this information, it creates a three-dimensional
cognitive representation that humans perceive as a three-dimensional image. In
straight analogy, stimuli that are present at the eardrums will be compared by the
brain to determine the nature and the direction of a sound event [3]. In spatial
acoustics, it is most convenient to describe the position of a sound source relative
to the position and orientation of the listener’s head by means of a coordinate system
which has its origin in the head, as shown in Fig. 6.8. Depending on the horizontal
angle of incidence, different time delays and levels between both ears consequently
arise. In addition, frequency characteristics dependent on the angle of incidence are
influenced by the interference between the direct signal and the reflections of head,
shoulder, auricle and other parts of the human body. The interaction of these three
factors permits humans to assign a direction to acoustic events [15].

Fig. 6.8. Dimensions and conventions of measurement for head movements.

The characteristics of the sound pressure at the eardrum can be described in the
time domain by the Head Related Impulse Response (HRIR) and in the frequency
domain by the Head Related Transfer Function (HRTF). These transfer functions
can be measured individually with small in-ear microphones or with an artificial
head. Fig. 6.9 shows a measurement with an artificial head under 45◦ relative to the
frontal direction in the horizontal plane. The Interaural Time Difference (ITD) can
be assessed in the time domain plot. The Interaural Level Difference (ILD) is shown
in the frequency domain plot and clarifies the frequency dependent level increase at
the ear turned toward the sound source, and the decrease at the ear which is turned
away from the sound source.


Fig. 6.9. HRIR and HRTF of a sound source under 45◦ measured using an artificial head. The
upper diagram shows the Interaural Time Difference in the time domain plot. The Interaural
Level difference is depicted in the frequency domain plot of the lower diagram.

6.2.2 Auditory Displays

In VR systems with head-mounted displays, an auditory stimulation by means of


headphones integrated into the HMD is suitable. For room-mounted displays, a so-
lution based on loudspeakers is obviously preferable. Here, simple intensity panning
is most often used to produce three-dimensional sound effects. However, intensity
panning is not able to provide authentic virtual sound scenes. In particular, it is im-
possible to create virtual near-to-head sound sources although these sources are of
high importance in direct interaction metaphors for virtual environments, such as
by-hand-placements or manipulations. In CAVE-like displays, virtual objects which
shall be directly manipulated are typically displayed with negative parallax, i.e., in
front of the projection plane, rather near to the user’s head and within the grasping
range, see Fig. 6.10.
As a consequence, in high quality, interactive VR systems, a technique has to be
applied which is more complex than the simple intensity panning. Creating a virtual
sound scene with spatially distributed sources needs a technique for adding spatial
cues to audio signals and an appropriate reproduction. There are mainly two different
approaches to reproducing a sound event with true spatial relation: wave field synthesis
and binaural synthesis. While wave field synthesis reproduces the whole sound field
with a large number of loudspeakers, binaural synthesis in combination with cross
talk cancellation (CTC) is able to provide a spatial auditory representation by using
only a few loudspeakers. The following sections will briefly introduce the principles
and problems of these technologies with a focus on the binaural approach.

Fig. 6.10. Typical interaction range in a room-mounted, CAVE-like display.
Another problem can be found in the synchronization of the auditory and visual
VR subsystems. Enhanced immersion dictates that the coupling between the two
systems has to be tight and with small computational overhead, as each sub-system
introduces its own lag and latency problems due to different requirements on the
processing. If the visual and the auditory cues differ too much in time and space, this
is directly perceived as a presentation error, and the immersion of the user ceases.

6.2.3 Wave Field Synthesis

The basic theory of the wave field synthesis is the Huygens principle. An array of
loudspeakers (ranging from just a few to a few hundreds in number) is placed in the
same position a microphone array was placed in at the time of recording the sound
event in order to reproduce an entire real sound field. Fig. 6.11 shows this principle
of recording and reproduction [2], [20]. In a VR environment the loudspeaker signal
will then be calculated for a given position of one or more virtual sources.
By reproducing a wave field in the whole listening area, it is possible to walk
around or turn the head while the spatial impression does not change. This is the big
advantage of this principle. The main drawback, beyond the high effort of processing
power, is the size of the loudspeaker array. Furthermore, mostly two-dimensional so-
lutions have been presented so far. The placement of those arrays in semi-immersive,
projection-based VR systems like a workbench or a single wall is only just possible,
but in display systems with four to six surfaces as in CAVE-like environments, it
is nearly impossible. A wave field synthesis approach is thus not applicable, as the
loudspeaker arrays have to be positioned around the user and outside of the environ-
ment’s boundaries.

Fig. 6.11. Setup for recording a sound field and reproducing it by wave field synthesis.

6.2.4 Binaural Synthesis

In contrast to wave field synthesis a binaural approach does not deal with the com-
plete reproduction of the sound field. It is convenient and sufficient to reproduce that
sound field only at two points, the ears of the listener. In this case, only two signals
have to be calculated for a complete three-dimensional sound scene.
The procedure of convolving a mono sound source with an appropriate pair of
HRTFs in order to obtain a synthetic binaural signal is called binaural synthesis.
The binaural synthesis transforms a sound source without position information into
a virtual source related to the listener’s head. The synthesized signals contain the
directional information of the source, which is provided by the information in the
HRTFs.
For the realization of a room related virtual source, the HRTF must be changed
when the listener turns or moves her head, otherwise the virtual source would move
with the listener. Since a typical VR system includes head tracking for the adaptation
of the visual perspective, the listener’s position is always known and can also be used
to realize a synthetic sound source with a fixed position corresponding to the room
coordinate system. The software calculates the relative position and orientation of
the listener’s head to the imaginary point where the source should be localized. By
knowing the relative position and orientation the appropriate HRTF can be chosen
from a database (see Fig. 6.12) [1].
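A minimal sketch of this selection-and-convolution step is given below in Python with numpy. The HRIR database, its angular resolution, and all function and variable names are assumptions made for illustration only; a real system would use measured HRIRs, interpolate between them, and also take elevation and distance into account.

import numpy as np

# Hypothetical HRIR database: one impulse-response pair per 5 degrees of azimuth,
# each 256 taps long; here filled with placeholder data.
AZIMUTH_STEP = 5
hrir_db = {az: (np.random.randn(256) * 0.01, np.random.randn(256) * 0.01)
           for az in range(0, 360, AZIMUTH_STEP)}

def binaural_synthesis(mono, source_pos, head_pos, head_yaw_deg):
    # Choose the HRIR pair matching the source direction relative to the tracked head
    # and convolve the mono signal with it to obtain the two binaural channels.
    dx, dy = source_pos[0] - head_pos[0], source_pos[1] - head_pos[1]
    azimuth = (np.degrees(np.arctan2(dy, dx)) - head_yaw_deg) % 360
    key = int(round(azimuth / AZIMUTH_STEP) * AZIMUTH_STEP) % 360
    hrir_left, hrir_right = hrir_db[key]
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

signal = np.random.randn(48000)                 # 1 s of a mono source signal at 48 kHz
xl, xr = binaural_synthesis(signal, source_pos=(2.0, 1.0),
                            head_pos=(0.0, 0.0), head_yaw_deg=15.0)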
All in all, the binaural synthesis is a powerful and at the same time feasible
method for an exact spatial imaging of virtual sound sources, which in principle only
needs two loudspeakers. In contrast to panning systems where the virtual sources are
always on or behind the line spanned by the speakers, binaural synthesis can realize
a source at any distance to the head, and in particular one close to the listener’s
head, by using an appropriate HRTF. It is also possible to synthesize many different
sources and create a complex three-dimensional acoustic scenario.

Fig. 6.12. The principle of binaural synthesis.

6.2.5 Cross Talk Cancellation

The requirement for a correct binaural presentation is that the right channel of the
signal is audible only in the right ear and the left one is audible only in the left ear.
From a technical point of view, the presentation of binaural signals by headphones
is the easiest way since the acoustical separation between both channels is perfectly
solved. While headphones may be quite acceptable in combination with HMDs, in
room-mounted displays the user should wear as little active hardware as possible.
Furthermore, when the ears are covered by headphones, the impression of a source
located at a certain point and distance to the listener often does not match the im-
pression of a real sound field. For these reasons, a reproduction by loudspeakers
should be considered, especially in CAVE-like systems. Speakers can be mounted,
e.g., above the screens. The placing of virtual sound sources is not restricted due to
the binaural reproduction.
Using loudspeakers for binaural synthesis introduces the problem of cross talk.
As sound waves determined for the right ear arrive at the left ear and vice versa,
without an adequate CTC the three-dimensional cues of the binaural signal would be
destroyed. In order to realize a proper channel separation at any point of the listening
area, CTC filters have to be calculated on-line when the listener moves his head.
When XL and XR are the signals coming from the binaural synthesis (see
Fig. 6.12 and 6.13), and ZL and ZR are the signals arriving at the listener’s left
and right ear, filters must be found providing signals YL and YR , so that XL = ZL
and XR = ZR . Following the way of the binaural signals to the listener’s ears results
in the two equations

$$Z_L = Y_L \cdot L \cdot H_{LL} + Y_R \cdot R \cdot H_{RL} \overset{!}{=} X_L \quad (6.4)$$

$$Z_R = Y_R \cdot R \cdot H_{RR} + Y_L \cdot L \cdot H_{LR} \overset{!}{=} X_R\,, \quad (6.5)$$

where L, R are the transfer functions of the loudspeakers, and Hpq is the transfer
function from speaker p to ear q.

In principle, a simple analytic approach exists for this problem, leading to

$$Y_L = L^{-1}\left( \frac{H_{RR}}{H_{LL} H_{RR} - H_{RL} H_{LR}}\, X_L - \frac{H_{RL}}{H_{LL} H_{RR} - H_{RL} H_{LR}}\, X_R \right) \quad (6.6)$$

and an analogous solution for YR.

However, since singularities arise especially at lower frequencies, where
HLL ∼ HRL and HRR ∼ HLR, this simple calculation is not feasible. Instead,
an iterative solution has to be applied, of which the principle is depicted in Fig. 6.13
by example for an impulse presented to the left ear.

Fig. 6.13. Principle of (static) cross-talk cancellation. Presenting an impulse addressed to the
left ear (1) and the first steps of compensation (2, 3).

For a better understanding of the CTC filters, the first four iteration steps will be
shown by example for a signal from the right speaker. In an initial step, the influence
R of the right speaker and HRR are eliminated by

$$Block_{RR} := R^{-1} \cdot H_{RR}^{-1} \quad (6.7)$$

1. Iteration Step The resulting signal XR · BlockRR · R from the right speaker also
arrives at the left ear via HRL and must be compensated for by an adequate filter
K1R , which is sent from the left speaker L to the left ear via HLL :

$$X_R \cdot Block_{RR} \cdot R \cdot H_{RL} - X_R \cdot K_{1R} \cdot L \cdot H_{LL} \overset{!}{=} 0 \quad (6.8)$$

$$\Rightarrow K_{1R} = L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}} =: Block_{RL} \quad (6.9)$$

2. Iteration Step The compensation signal −XR · K1R · L from the first iteration step
arrives at the right ear via HLR and is again compensated for by a filter K2R , sent
from the right speaker R to the right ear via HRR :

$$-X_R \cdot \left[ L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}} \right] \cdot L \cdot H_{LR} + X_R \cdot K_{2R} \cdot R \cdot H_{RR} \overset{!}{=} 0 \quad (6.10)$$

$$\Rightarrow K_{2R} = \underbrace{R^{-1} \cdot H_{RR}^{-1}}_{Block_{RR}} \cdot \underbrace{\frac{H_{RL} \cdot H_{LR}}{H_{LL} \cdot H_{RR}}}_{=:K} \quad (6.11)$$

3. Iteration Step The cross talk produced by the preceding iteration step is again
eliminated by a compensation signal from the left speaker.

$$X_R \cdot \left[ R^{-1} \cdot H_{RR}^{-1} \cdot K \right] \cdot R \cdot H_{RL} - X_R \cdot \left[ K_{3R} \right] \cdot L \cdot H_{LL} \overset{!}{=} 0 \quad (6.12)$$

$$\Rightarrow K_{3R} = \underbrace{L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}}}_{Block_{RL}} \cdot K \quad (6.13)$$

4. Iteration Step The right speaker produces a compensation signal to eliminate the
cross talk of the third iteration step:

$$-X_R \cdot \left[ L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}} \cdot K \right] \cdot L \cdot H_{LR} + X_R \cdot \left[ K_{4R} \right] \cdot R \cdot H_{RR} \overset{!}{=} 0 \quad (6.14)$$

$$\Rightarrow K_{4R} = \underbrace{R^{-1} \cdot H_{RR}^{-1}}_{Block_{RR}} \cdot K^2 \quad (6.15)$$

With the fourth step, the regularity of the iterative CTC becomes clear. Even iter-
ations extend BlockRR and odd iterations extend BlockRL . All in all, the following
compensation filters can be identified for the right channel (compare to Fig. 6.13):

$$CTC_{RR} = Block_{RR} \cdot (1 + K + K^2 + \dots)\,, \qquad
CTC_{RL} = Block_{RL} \cdot (1 + K + K^2 + \dots) \quad (6.16)$$

For an infinite number of iterations, the solution converges to the simple analytic
approach:

$$\sum_{i=0}^{\infty} K^i = \frac{1}{1-K} \quad \text{with} \quad K = \frac{H_{RL} \cdot H_{LR}}{H_{LL} \cdot H_{RR}} \quad (6.17)$$

Be aware that the convergence is guaranteed, since |K| < 1 due to |HRL| < |HLL| and
|HLR| < |HRR|.
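Written in the frequency domain, the closed-form filters can be evaluated directly per frequency bin. The numpy sketch below assumes that all transfer functions are already available as complex spectra of equal length; the toy values at the end merely satisfy the convergence condition and are not measured data.

import numpy as np

def ctc_filters_right(HLL, HRL, HLR, HRR, L_inv, R_inv):
    # Compensation filters CTC_RR and CTC_RL for the right binaural channel,
    # following (6.16) and (6.17).
    K = (HRL * HLR) / (HLL * HRR)            # cross-talk ratio, |K| < 1
    geometric = 1.0 / (1.0 - K)              # 1 + K + K^2 + ...
    block_rr = R_inv / HRR                   # R^-1 * H_RR^-1
    block_rl = (L_inv / HRR) * (HRL / HLL)   # L^-1 * H_RR^-1 * H_RL / H_LL
    return block_rr * geometric, block_rl * geometric

n = 512
ones = np.ones(n, dtype=complex)
ctc_rr, ctc_rl = ctc_filters_right(HLL=ones, HRL=0.4 * ones, HLR=0.4 * ones,
                                   HRR=ones, L_inv=ones, R_inv=ones)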

In principle, a complete binaural synthesis with CTC can be achieved by only


two loudspeakers. However, a dynamic CTC is only possible in the angle spanned by
the loudspeakers. When the ITD in one HRTF decreases toward zero – this happens
when the listener faces one speaker – an adequate cancellation is impossible [14].
Since for CAVE-like displays, a dynamic CTC is needed which allows the user a
full turnaround, the well-known two-speaker CTC solution must be extended to a
four-speaker setup in order to provide a full 360◦ rotation for the listener. An imple-
mentation of such a setup is described in [13].

6.3 Haptic Interfaces


Haptics comes from the Greek haptesthai - to touch. Haptic refers to touching as
visual refers to seeing and auditory to hearing. There are two types of human haptic
sensing:
• tactile – the sensation arises from stimulus to the skin (skin receptors). The tactile
haptics is responsible for sensing heat, pressure, vibration, slip, pain and surface
texture.
• kinesthetic – the sensation arises from the body movements. The mechanorecep-
tors are located in muscles, tendons and joints. The kinesthetic haptics is respon-
sible for sensing the limb positions, motion and forces.
The VR research focuses mainly on kinesthetics, which is easier to realize techni-
cally. Kinesthetic human-computer interfaces are also called force feedback inter-
faces. Force feedback interfaces have evolved from robotics and teleoperation ap-
plications. In the early 90’s, the field of haptic force feedback began to develop an
identity of its own. There were research projects specifically aimed at providing force
feedback or tactile feedback rather than teleoperation. The first commercial device,
the PHANToM (Fig. 6.14), was for sale in 1995.
Designing and developing haptic interfaces is an intensely interdisciplinary area.
The knowledge of human haptics, machine haptics and computer haptics have to be
put together to accomplish a particular interaction task. Human haptics is a psycho-
logical study of the sensory and cognitive capacities relating to the touch sense, and
the interaction of touch with other senses. This knowledge is crucial to effective in-
terface design. Machine haptics is where mechanical and electrical engineers get
involved. It consists of the design of the robotic machine itself, including its kine-
matic configuration, electronics and sensing, and communication to the computer
controller. The computed haptic models and algorithms that run on a computer host
belong to computer haptics. This includes the creation of virtual environments for
particular applications, general haptic rendering techniques and control issues.

Fig. 6.14. The PHANToM haptic device, Sensable Technologies.
The sensorimotor control guides our physical motion in coordination with our
touch sense. There is a different balance of position and force control when we are
exploring an environment (i.e. lightly touching a surface), versus manipulation (dom-
inated by motorics), where we might rely on the touch sense only subconsciously.
As anyone knows intuitively, the amount of force you generate depends on the way
you hold something.
Many psychological studies have been done to determine human capabilities in
force sensing and control resolution. Frequency is an important sensory parameter.
We can sense tactile frequencies much higher than kinesthetic. This makes sense,
since skin can be vibrated more quickly than a limb or even a fingertip (10-10000
Hz vs. 20-30 Hz). The control bandwidth, i.e., how fast we can move our own limbs
(5-10 Hz), is much lower than the rate of motion we can perceive. As a reference,
Table 6.1 provides more detail on what happens at different bandwidths.
A force feedback device is typically a robotic device interacting with the user,
who supplies physical input by imposing force or position onto the device. If the
measured state is different from the desired state, a controller translates the desired
signal (or the difference between the measured and the desired state) into a control
action. The controller can be a computer algorithm or an electronic circuit. Haptic
control systems can be difficult, mainly because the human user is a part of the sys-
tem, and it is quite hard to predict what the user is going to do. Fig. 6.15 shows the
haptic control loop schematically. The virtual environment takes as input the mea-
sured state, and supplies a desired state. The desired state depends on the particular
environment, and its computation is called haptic rendering.
The goal of haptic rendering is to enable a user to touch, feel, and manipulate
virtual objects through a haptic device. A rendering algorithm will generally include
a code loop that reads the position of the device, tests the virtual environment for
collisions, calculates collision responses and sends the desired device position to the
device. The loop must execute in about 1 millisecond (1 kHz rate).

Table 6.1. Force sensing and control resolution.

1-2 Hz      The maximum bandwidth with which the human finger can react to unexpected
            force/position signals
5-10 Hz     The maximum bandwidth with which the human finger can apply force and
            motion commands comfortably
8-12 Hz     The bandwidth beyond which the human finger cannot correct for its positional
            disturbances
12-16 Hz    The bandwidth beyond which the human fingers cannot correct their grasping
            forces if the grasped object slips
20-30 Hz    The minimum bandwidth with which the human finger demands the force input
            signals to be present for meaningful perception
320 Hz      The bandwidth beyond which the human fingers cannot discriminate two con-
            secutive force input signals
5-10 kHz    The bandwidth over which the human finger needs to sense vibration during
            skillful manipulative tasks

Fig. 6.15. The force feedback loop.
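The structure of such a rendering loop can be sketched as follows. The Python code below is only a skeleton: the device object with its read_position and send_force calls is a hypothetical stand-in for a real haptic API, the wall at z = 0 replaces a general collision test, a force command is sent although some devices expect a desired position instead, and time.sleep is a crude placeholder for precise 1 kHz scheduling.

import time

DT = 0.001            # 1 ms loop period, i.e., a 1 kHz haptic rate
STIFFNESS = 800.0     # assumed wall stiffness in N/m

class DummyDevice:
    # Stand-in for a real haptic device API.
    def __init__(self, steps=5):
        self.steps = steps
    def running(self):
        self.steps -= 1
        return self.steps >= 0
    def read_position(self):
        return (0.0, 0.0, -0.002)             # 2 mm inside the wall
    def send_force(self, force):
        print("commanded force:", force)

def haptic_loop(device, wall_height=0.0):
    while device.running():
        x, y, z = device.read_position()      # 1. read the device position
        penetration = wall_height - z         # 2. test for collision with the wall
        fz = STIFFNESS * penetration if penetration > 0.0 else 0.0
        device.send_force((0.0, 0.0, fz))     # 3. send the collision response
        time.sleep(DT)                        # 4. keep the ~1 kHz loop rate

haptic_loop(DummyDevice())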

6.3.1 Rendering a Wall

A wall is the most common virtual object to render haptically. It involves two dif-
ferent modes (contact and non-contact) and a discontinuity between them. In non-
contact mode, your hand is free (although holding the haptic device) and no forces
are acting, until it encounters the virtual object. The force is proportional to the pen-
etration depth determined by the collision detection, pushing the device out of the
virtual wall (Fig. 6.16). The position of the haptic device in the virtual environment is
called the haptic interaction point (HIP).
Rendering a wall is actually one of the hardest tests for a haptic interface even
though it is conceptually so simple. This is because of the discontinuity between a
regime with zero stiffness and a regime with very high stiffness (usually the highest
stiffness the system can manage). This can lead to a computational and physical
instability and cause vibrations. In addition to that, the contact might not feel very
hard. When you first enter the wall, the force you are feeling is not very large, since kΔx
is close to zero when you are near the surface. Fig. 6.17 depicts changes in the force
profile with respect to position for real and virtual walls of a given stiffness. Since the
position of the probe tip is sampled with a certain frequency during the simulation, a
“staircase” effect is observed. The difference in the areas enclosed by the curves that
correspond to penetrating into and moving out of the virtual wall is a manifestation
of energy gain. This energy gain leads to instabilities as the stiffness coefficient is
increased (compare the energy gains for stiffness coefficients k1 and k2 ). On the
other hand, a low value of the stiffness coefficient generates a soft virtual wall, which
is not desirable either.

Fig. 6.16. Rendering a wall. In non-contact mode (a) no forces are acting (F = 0). Once the
penetration into the virtual wall is detected (b), the force is proportional to the penetration
depth (F = kΔx), pushing the device out of the virtual wall.

Fig. 6.17. The energy gain. Due to the finite sampling frequency of the haptic device a “stair-
case” effect is observed when rendering a virtual wall. The difference in the areas enclosed by
the curves that correspond to penetrating into and moving out of the virtual wall is a manifes-
tation of energy gain.
The hardness of a contact can be increased by applying damping when the user
enters the wall, i.e., by adding a term proportional to velocity v to the computation
of force:
F = kΔx + bv
It makes the force large right at entry if the HIP is moving with any speed. This term
has to be turned off when the HIP is moving out of the wall, otherwise the wall would
feel sticky.
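A compact sketch of this force law, with the damping term only active while moving into the wall, could look as follows (plain Python; the stiffness and damping values are arbitrary examples):

def wall_force(penetration, velocity_into_wall, k=800.0, b=2.0):
    # Spring-damper force F = k*dx + b*v for a virtual wall.
    # penetration:        depth of the HIP inside the wall (> 0 means contact)
    # velocity_into_wall: velocity component pointing into the wall
    # The damping term is switched off while the HIP moves out of the wall,
    # otherwise the wall would feel sticky.
    if penetration <= 0.0:
        return 0.0                        # non-contact mode
    force = k * penetration
    if velocity_into_wall > 0.0:          # still pushing into the wall
        force += b * velocity_into_wall
    return force

print(wall_force(0.003, 0.05))            # entering the wall: spring plus damping
print(wall_force(0.003, -0.05))           # moving out: spring term only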
Adding either visual or acoustic reinforcement of the wall’s entry can make the
wall seem harder than relying on haptic feedback alone. Therefore, the wall penetra-
tion should never be displayed visually. The HIP is always displayed where it would
be in the real world (on the object’s surface). In the literature, the ideal position has
been given different names, but all of them refer to the same concept (Fig. 6.18).

Fig. 6.18. The surface contact point (SCP) is also called the ideal haptic interaction point
(IHIP), the god-object or the proxy point. Basically, it corresponds to the position of the HIP
on the object’s surface.

6.3.2 Rendering Solid Objects

Penalty based methods represent a naive approach to the rendering of solid objects.
Similarly to the rendering of a wall, the force is proportional to the penetration depth.
Fig. 6.19 demonstrates the limitations of the penalty based methods.
The penalty based methods simply pull the HIP to the closest surface point with-
out tracking the HIP history. This causes force discontinuities at the object’s corners,
where the probe is attracted to other surfaces than to the one the object originally pen-
etrated. When the user touches a thin surface, he feels a small force. As he pushes
harder, he penetrates deeper into the object until he passes more than halfway
through the object, where the force vector changes direction and shoots him out the
other side. Furthermore, when multiple primitives touch or are allowed to intersect,
it is often difficult to determine, which exterior surface should be associated with a
given internal volume. In the worst case, a global search of all primitives may be re-
quired to find the nearest exterior surface. To avoid these problems, constraint based
methods have been proposed by [21] (the god-object method) and [17] (the virtual
proxy method). These methods are similar in that they trace the position of a vir-
tual proxy, which is placed where the haptic interface point would be if the haptic
interface and the object were infinitely stiff (Figure 6.20).

Fig. 6.19. The limitations of the penalty based methods: (a) object corners, (b) thin objects,
(c) multiple touching objects.

Fig. 6.20. The virtual proxy. As the probe (HIP) is moving, the virtual proxy (SCP) is trying
to keep the minimal distance to the probe by moving along the constrained surfaces.

The constraint based methods use the polygonal representation of the geometry
to create one-way penetration constraints. The constraint is a plane determined by
the polygon through which the geometry has been entered. For convex parts of ob-
jects, only one constraint is active at a time. Concave parts require up to three active
constraints at a time. The movement is then constrained to a plane, a line or a point
Fig. 6.21.
Finally, we are going to show how the SCP can be computed if the HIP and the
constraint planes are known. Basically, we want to minimize the distance between
the SCP and HIP so that the SCP is placed on the active constraint plane(s). This is a
typical application case where the Lagrange multipliers can be used. Lagrange mul-
tipliers are a mathematical tool for constrained optimization of differentiable func-
tions. In the unconstrained situation, we have some (differentiable) function f that
we want to minimize (or maximize). This can be done by finding the points where
the gradient ∇f is zero, or, equivalently, each of the partial derivatives is zero. The
constraints have to be given by functions gi (one function for each constraint). The
ith constraint is fulfilled if gi (x) = 0. It can be shown, that for the constrained case,
the gradient of f has to be equal to the linear combination of the gradients of the
constraints. The coefficients of the linear combination are the Lagrange multipliers
λi:

$$\nabla f(x) = \lambda_1 \nabla g_1(x) + \dots + \lambda_n \nabla g_n(x)$$

Fig. 6.21. The principle of constraints. Convex objects require only one active constraint at a
time. Concave objects require up to three active constraints at a time. The movement is then
constrained to a plane, a line or a point.
The function we want to minimize is

$$|HIP - SCP| = \sqrt{(x_{HIP} - x_{SCP})^2 + (y_{HIP} - y_{SCP})^2 + (z_{HIP} - z_{SCP})^2}$$

Alternatively, we can use

$$f = \frac{1}{2}\left[(x_{HIP} - x_{SCP})^2 + (y_{HIP} - y_{SCP})^2 + (z_{HIP} - z_{SCP})^2\right]$$
which corresponds to the energy of a virtual spring between SCP and HIP with the
spring constant k = 1. This function is easier to differentiate and leads to the same
results. The constraint planes are defined by the plane equation

gi = Ai xSCP + Bi ySCP + Ci zSCP − Di

The unknowns are the xSCP , ySCP , zSCP and the λi . Evaluating the partial deriv-
atives and rearranging leads to the following system of 4, 5, or 6 linear equations
(depending on the number of constraints):

$$\begin{bmatrix}
1 & 0 & 0 & A_1 & A_2 & A_3 \\
0 & 1 & 0 & B_1 & B_2 & B_3 \\
0 & 0 & 1 & C_1 & C_2 & C_3 \\
A_1 & B_1 & C_1 & 0 & 0 & 0 \\
A_2 & B_2 & C_2 & 0 & 0 & 0 \\
A_3 & B_3 & C_3 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix}
x_{SCP} \\ y_{SCP} \\ z_{SCP} \\ \lambda_1 \\ \lambda_2 \\ \lambda_3
\end{bmatrix}
=
\begin{bmatrix}
x_{HIP} \\ y_{HIP} \\ z_{HIP} \\ D_1 \\ D_2 \\ D_3
\end{bmatrix}$$

The first three rows are always present; the fourth, fifth, and sixth rows (together with the
corresponding columns and unknowns λ1 , λ2 , λ3 ) are only included for one, two, or three
active constraints, respectively.
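The following numpy sketch assembles and solves this system for one to three active constraint planes. The plane format (A, B, C, D) follows the plane equation given above; the example values are arbitrary.

import numpy as np

def surface_contact_point(hip, planes):
    # Compute the SCP for a HIP constrained to 1-3 active planes by solving the
    # Lagrange-multiplier system shown above.
    n = len(planes)                        # number of active constraints (1, 2 or 3)
    M = np.zeros((3 + n, 3 + n))
    rhs = np.zeros(3 + n)
    M[:3, :3] = np.eye(3)
    rhs[:3] = hip
    for i, (A, B, C, D) in enumerate(planes):
        normal = np.array([A, B, C])
        M[:3, 3 + i] = normal              # upper right block: plane normals
        M[3 + i, :3] = normal              # lower left block: constraint rows
        rhs[3 + i] = D
    solution = np.linalg.solve(M, rhs)
    return solution[:3]                    # (x_SCP, y_SCP, z_SCP); solution[3:] holds the lambdas

hip = np.array([0.2, -0.1, -0.05])
plane = (0.0, 0.0, 1.0, 0.0)               # the plane z = 0 as the only active constraint
print(surface_contact_point(hip, [plane])) # the HIP projected back onto the surface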

6.4 Modeling the Behavior of Virtual Objects


Physically based modeling (PBM) has become an important approach to computer
animation and computer graphics modeling in the late 80’s. The rising computer
performance allows for simulating physics in real time, making PBM a useful tech-
nique for VR. A lot of effort is being invested to make virtual worlds look realistic.
PBM is the next step toward making virtual objects behave realistically. The
users are familiar with the behavior of objects in the real world, therefore providing
”real physics” in the virtual environment allows for an intuitive interaction with the
virtual objects. There is a wide range of VR applications that benefit from PBM. To
mention some examples, think of assembly simulations, robotics, training and teach-
ing (medical, military, sports), cloth simulation, hair simulation, entertainment. In
fact, anything that flies, rolls, slides, deforms can be modeled to create a believable
content of the virtual world.
Physics is a vast field of science that covers many different subjects. The subject
most applicable to virtual worlds is mechanics, and especially a part of it called dy-
namics, which focuses on bodies in motion. Within the subject of dynamics, there
are even more specific subjects to investigate, namely kinematics and kinetics. Kine-
matics focuses on the motion of bodies without regard to forces. Kinetics considers
both the motion of bodies and the forces that affect the bodies’ motion.
PBM involves numerical methods for solving differential equations describing
the motion, collision detection and collision response, application of constraints (e.g.,
joints or contact) and, of course, the laws of mechanics. 1

6.4.1 Particle Dynamics


Particles are objects that have mass, position and velocity (particle state), and respond
to forces, but they have no spatial extent. Particles are the easiest objects to simulate.
Particle systems are used to simulate natural phenomena, e.g., rain, dust and non-
rigid structures, e.g., cloth – particles connected by springs and dampers.
The theory of mechanics is based on Newton’s laws of motion, of which Newton’s
second law is of particular interest in the study of dynamics. It states that
a force F acting on a body gives it an acceleration a, which is in the direction of the
force and has a magnitude inversely proportional to the mass m of the body.
F = ma (6.18)
Further, we can introduce the particle velocity v and acceleration a as the first and
second time derivatives of position.
$$\dot{x} = \frac{dx}{dt} = v \quad (6.19)$$

$$\ddot{x} = \frac{d^2 x}{dt^2} = \frac{dv}{dt} = a \quad (6.20)$$
1
The Particle Dynamics and Rigid Body Dynamics sections are based on the Physi-
cally Based Modeling SIGGRAPH 2001 Course Notes by Andrew Witkin and David Baraff.
http://www.pixar.com/companyinfo/research/pbm2001

The equation 6.18 is thus an ordinary differential equation (ODE) of the second or-
der, and solving it means finding a function x that satisfies the relation F = mẍ.
Moreover, the initial value of x0 at some starting time t0 is given and we are inter-
ested in following x in the time thereafter.
There are various kinds of forces in the real world, e.g., the gravitational force is
causing a constant acceleration, a damping force is proportional to velocity, a (static)
spring force is proportional to the spring elongation. The left side of the equation
6.18 has to be replaced with a specific formula describing the acting force.
When solving the equation numerically it is easier to split it into two equations
of first order introducing velocity. The state of the particle is then described by its
position and velocity: [x, v]. At the beginning of the simulation, the particle is in the
initial state [x0 , v0 ]. A discrete time step Δt is used to evolve the state [x0 , v0 ] →
[x1 , v1 ] → [x2 , v2 ] → . . . corresponding to the next time steps.
The most basic numerical integration method is the Euler method.
$$v = \frac{dx}{dt} \;\overset{\text{finite time step}}{\Longrightarrow}\; \Delta x = v\,\Delta t \;\overset{\text{explicit}}{\Longrightarrow}\; x_{t+\Delta t} = x_t + \Delta t\, v_t \quad (6.21)$$

$$\overset{\text{implicit}}{\Longrightarrow}\; x_{t+\Delta t} = x_t + \Delta t\, v_{t+\Delta t} \quad (6.22)$$

$$a = \frac{dv}{dt} \;\overset{\text{finite time step}}{\Longrightarrow}\; \Delta v = a\,\Delta t \;\overset{\text{explicit}}{\Longrightarrow}\; v_{t+\Delta t} = v_t + \Delta t\, a_t \quad (6.23)$$

$$\overset{\text{implicit}}{\Longrightarrow}\; v_{t+\Delta t} = v_t + \Delta t\, a_{t+\Delta t} \quad (6.24)$$

$$x_{t=0} = x_0\,, \qquad v_{t=0} = v_0$$
The equations 6.21, 6.23 correspond to the explicit (forward) Euler method, equa-
tions 6.22, 6.24 correspond to the implicit (backward) Euler method.
For each time step t, the current state [xt , vt ] is known, and the acceleration can
be evaluated using equation 6.18:

$$a_t = \frac{F_t}{m}$$
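As an illustration, the Python sketch below advances a single particle under gravity and linear damping using the explicit Euler scheme of (6.21) and (6.23); all constants are example values.

import numpy as np

GRAVITY = np.array([0.0, -9.81, 0.0])    # constant gravitational acceleration
DAMPING = 0.1                            # damping coefficient (example value)
MASS = 1.0

def force(x, v):
    # Gravity plus a damping force proportional to velocity.
    return MASS * GRAVITY - DAMPING * v

def explicit_euler_step(x, v, dt):
    a = force(x, v) / MASS               # a_t = F_t / m
    x_new = x + dt * v                   # x_{t+dt} = x_t + dt * v_t
    v_new = v + dt * a                   # v_{t+dt} = v_t + dt * a_t
    return x_new, v_new

x = np.array([0.0, 2.0, 0.0])            # initial state [x0, v0]
v = np.array([1.0, 0.0, 0.0])
for _ in range(100):                     # 100 steps of 10 ms
    x, v = explicit_euler_step(x, v, 0.01)
print(x, v)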
Though simple, the explicit Euler’s method is not accurate. Moreover, the explicit
Euler’s method can be unstable. For sufficiently small time steps we get reasonable
behavior, but as the time step gets larger, the solution oscillates or even explodes
toward infinity.
The implicit Euler’s method is unconditionally stable, though in general not neces-
sarily more accurate. However, the implicit method leads to a system of equations
that has to be solved in each time step, as xt+Δt and vt+Δt not only depend on the
previous time step but also on vt+Δt and at+Δt respectively. The first few terms
of an approximation of at+Δt using a Taylor’s series expansion can be used in
equation 6.24.

$$a_{t+\Delta t} = a(x_{t+\Delta t}, v_{t+\Delta t}) = a(x_t, v_t) + \frac{\partial a(x_t, v_t)}{\partial x}\,\Delta x + \frac{\partial a(x_t, v_t)}{\partial v}\,\Delta v + \dots$$

$$v_{t+\Delta t} = v_t + \Delta t\, a_{t+\Delta t} \;\Rightarrow\; \Delta v = \Delta t\left[ a(x_t, v_t) + \frac{\partial a(x_t, v_t)}{\partial x}\,\Delta x + \frac{\partial a(x_t, v_t)}{\partial v}\,\Delta v \right]$$

Furthermore, Δx can be eliminated using equation 6.22.

$$x_{t+\Delta t} = x_t + \Delta t\, v_{t+\Delta t} \;\Rightarrow\; \Delta x = \Delta t\, v_{t+\Delta t} = \Delta t\,(v_t + \Delta v)$$

$$\Delta v = \Delta t\left[ a(x_t, v_t) + \frac{\partial a(x_t, v_t)}{\partial x}\,\Delta t\,(v_t + \Delta v) + \frac{\partial a(x_t, v_t)}{\partial v}\,\Delta v \right]$$

Finally, we get an equation for Δv.

$$\left[ I_{3\times 3} - \Delta t\,\frac{\partial a(x_t, v_t)}{\partial v} - \Delta t^2\,\frac{\partial a(x_t, v_t)}{\partial x} \right] \Delta v = \Delta t\left[ a_t + \Delta t\,\frac{\partial a(x_t, v_t)}{\partial x}\, v_t \right] \quad (6.25)$$
This equation has to be solved for Δv in each time step. In general, it is a system of
three equations. The a, v and x are vectors, therefore ∂a/∂v and ∂a/∂x are 3 × 3 matrices.
The I3×3 on the left side of the equation is a 3 × 3 identity matrix. However, in many
practical cases, the matrices can be replaced by a scalar. When the equation 6.25 is
solved, vt+Δt and xt+Δt can be computed as

$$v_{t+\Delta t} = v_t + \Delta v\,, \qquad x_{t+\Delta t} = x_t + \Delta t\, v_{t+\Delta t}$$

The choice of the integration method depends on the problem. The explicit
Euler’s method is very simple to implement and it is sufficient for a number of prob-
lems. Other methods focus on improving accuracy, but of course, the price for this
accuracy is a higher computational cost per time step. In fact, more sophisticated
methods can turn out to be worse than the Euler’s method, because their higher cost
per step outweighs the larger time step they allow. For further numerical
integration methods (e.g., midpoint, Runge-Kutta, Verlet, multistep methods), the
stability analysis, and methods for solving systems of linear and nonlinear equations,
see textbooks on numerical mathematics or [16].

6.4.2 Rigid Body Dynamics

Simulating the motion of a rigid body is almost the same as simulating the motion
of a particle. The location of a particle in space at time t can be described as a
vector xt , which describes the translation of the particle from the origin. Rigid bodies
are more complicated, in that in addition to translating them, we can also rotate
them. Moreover, a rigid body, unlike a particle, has a particular shape and occupies
a volume of space. The shape of a rigid body is defined in terms of a fixed and
unchanging space called body space. Given a geometric description of a body in
body space, we use a vector xt and a matrix Rt to translate and rotate the body in
the world space respectively (Fig. 6.22). In order to simplify some equations, it is
required that the origin of the body space lies in the center of mass of the body. If p0
is an arbitrary point on the rigid body in body space, then the world space location
p(t) of p0 is the result of rotating p0 about the origin and then translating it:

p(t) = R(t)p0 + x(t) (6.26)


Fig. 6.22. Body space (a) vs. world space (b).

We have defined the position and orientation of the rigid body. As the next step
we need to determine how they change over time, i.e. we need expressions for ẋ(t)
and Ṙ(t). Since x(t) is the position of the center of mass in world space, ẋ(t) is the
velocity of the center of mass in world space v(t) (also called linear velocity). The
angular velocity ω(t) describes the spinning of a rigid body. The direction of ω(t)
gives the direction of the spinning axis that passes through the center of mass. The
magnitude of ω(t) tells how fast the body is spinning. Angular velocity ω(t) cannot
be Ṙ(t), since ω(t) is a vector and Ṙ(t) is a matrix. Instead:

$$\dot{r}(t) = \omega(t) \times r(t)\,, \qquad \dot{R}(t) = \omega^{*} R(t)\,, \qquad
\omega^{*} = \begin{bmatrix} 0 & -\omega_z(t) & \omega_y(t) \\ \omega_z(t) & 0 & -\omega_x(t) \\ -\omega_y(t) & \omega_x(t) & 0 \end{bmatrix}$$

where r(t) is a vector in world space fixed to the rigid body, r(t) = p(t) − x(t), and
R(t) is the rotation matrix of the body.

Fig. 6.23. Linear and angular velocity.

To determine the mass of the rigid body, we can imagine that the rigid body is
made up of a large number of small particles. The particles are indexed from 1 to N .
The mass of the ith particle is mi . The total mass of the body M is the sum

$$M = \sum_{i=1}^{N} m_i \quad (6.27)$$

Each particle has a constant location p0i in the body space. The location of the ith
particle in world space at time t, is therefore given by the formula

pi (t) = R(t)p0i + x(t) (6.28)

The center of mass of a rigid body in world space is

$$c(t) = \frac{\sum_{i=1}^{N} m_i\, p_i(t)}{M} = \frac{\sum_{i=1}^{N} m_i\,(R(t)p_{0i} + x(t))}{M} = R(t)\,\frac{\sum_{i=1}^{N} m_i\, p_{0i}}{M} + x(t)$$

When we are using the center of mass coordinate system for the body space, it means
that:

$$\frac{\sum_{i=1}^{N} m_i\, p_{0i}}{M} = 0$$
Therefore, the position of the center of mass in the world space is x(t).
Imagine Fi as an external force acting on the ith particle. Torque acting on the
ith particle is defined as

τi (t) = (pi (t) − x(t)) × Fi (t)

Torque differs from force in that it depends on the location of the particle relative to
the center of mass. We can think of the direction of torque as being the axis the body
would spin about due to Fi if the center of mass was held in place.

$$F = \sum_{i=1}^{N} F_i(t) \quad (6.29)$$

$$\tau = \sum_{i=1}^{N} \tau_i(t) = \sum_{i=1}^{N} (p_i(t) - x(t)) \times F_i(t) \quad (6.30)$$

The torque τ (t) tells us something about the distribution of forces over the body.
The total linear momentum P (t) of a rigid body is the sum of the products of
mass and velocity of each particle:

$$P(t) = \sum_{i=1}^{N} m_i\, \dot{p}_i(t)$$
which leads to
P (t) = M v(t)
Thus, the linear momentum of a rigid body is the same as if the body was a particle
with mass M and velocity v(t). Because of this, we can also write:
Ṗ (t) = M v̇(t) = F (t)
The change in linear momentum is equivalent to the force acting on a body. Since
the relationship between P (t) and v(t) is simple, we can use P (t) as a state variable
of the rigid body instead of v(t). However, P (t) tells us nothing about the rotational
velocity of the body.
While the concept of linear momentum P (t) is rather intuitive (P (t) = M v(t)),
the concept of angular momentum L(t) for a rigid body is not. However, it will
let us write simpler equations than using angular velocity. Similarly to the linear
momentum, the angular momentum is defined as:
L(t) = I(t)ω(t)
where I(t) is a 3 × 3 matrix called the inertia tensor. The inertia tensor describes
how the mass in a body is distributed relative to the center of mass. It depends on the
orientation of the body, but does not depend on the body’s translation.

$$I(t) = \sum_{i=1}^{N} m_i \left[ (r_i^T(t) \cdot r_i(t))\,[1]_{3\times 3} - r_i(t) \cdot r_i^T(t) \right]\,, \qquad r_i(t) = p_i(t) - x(t)$$
[1]3×3 is a 3 × 3 identity matrix. Using the body space coordinates, I(t) can be
efficiently computed for any orientation:
I(t) = R(t)Ibody RT (t)
Ibody is specified in the body space and is constant over the simulation.

$$I_{body} = \sum_{i=1}^{N} m_i \left[ (p_{0i}^T \cdot p_{0i})\,[1]_{3\times 3} - p_{0i} \cdot p_{0i}^T \right]$$

To calculate the inertia tensor, the sum can be treated as a volume integral.

$$I_{body} = \rho \int_V \left[ (p_{0i}^T \cdot p_{0i})\,[1]_{3\times 3} - p_{0i} \cdot p_{0i}^T \right] dV$$
ρ is the density of the rigid body (M = ρV ). For example, the inertia tensor of a
block a × b × c with origin in the center of mass is

$$I_{body} = \rho \int_V \begin{bmatrix} y^2 + z^2 & -yx & -zx \\ -xy & x^2 + z^2 & -zy \\ -xz & -yz & x^2 + y^2 \end{bmatrix} dx\,dy\,dz
= \frac{M}{12} \begin{bmatrix} b^2 + c^2 & 0 & 0 \\ 0 & a^2 + c^2 & 0 \\ 0 & 0 & a^2 + b^2 \end{bmatrix}$$

The relationship between the angular momentum L(t) and the total torque τ is very
simple
L̇(t) = τ (t)
analogously to the relation Ṗ (t) = F (t). The simulation loop performs the following
steps:
• determine the force F (t) and torque τ (t) acting on a rigid body
• update the state derivatives ẋ(t), Ṙ(t), Ṗ (t), L̇(t)
• update the state, e.g., using the Euler’s method
Table 6.2 summarizes all important variables and formulas for particle dynamics and
rigid body dynamics.

Table 6.2. The summary of particle dynamics and rigid body dynamics.

                             Particles           Rigid bodies
constants                    m                   M, I_body
state                        x(t), v(t)          x(t), R(t), P(t), L(t)
time derivatives of state    v(t), a(t)          v(t), ω*R(t), F(t), τ(t)
simulation input             F(t)                F(t), τ(t)
useful forms                 F(t) = m a(t)       P(t) = M v(t)
                                                 ω(t) = I^-1(t) L(t)
                                                 I^-1(t) = R(t) I_body^-1 R^T(t)
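A condensed sketch of one such simulation step over the state variables of Table 6.2 is given below (numpy). The constant force and torque, the crude re-orthonormalization of R via an SVD, and all numerical values are simplifications made for illustration.

import numpy as np

def rigid_body_step(x, R, P, L, M, I_body_inv, F, tau, dt):
    # One explicit Euler step for a rigid body with state (x, R, P, L).
    v = P / M                                   # linear velocity from linear momentum
    I_inv = R @ I_body_inv @ R.T                # world-space inverse inertia tensor
    omega = I_inv @ L                           # angular velocity
    omega_star = np.array([[0.0, -omega[2], omega[1]],
                           [omega[2], 0.0, -omega[0]],
                           [-omega[1], omega[0], 0.0]])
    x = x + dt * v                              # integrate position
    R = R + dt * (omega_star @ R)               # integrate orientation (R_dot = w* R)
    P = P + dt * F                              # integrate linear momentum
    L = L + dt * tau                            # integrate angular momentum
    u, _, vt = np.linalg.svd(R)                 # re-orthonormalize the rotation matrix
    return x, u @ vt, P, L

# unit cube of mass 1: I_body = (M/12) diag(b^2+c^2, a^2+c^2, a^2+b^2)
M = 1.0
I_body_inv = np.linalg.inv((M / 12.0) * np.diag([2.0, 2.0, 2.0]))
x, R, P, L = np.zeros(3), np.eye(3), np.zeros(3), np.array([0.0, 0.0, 0.1])
for _ in range(100):
    x, R, P, L = rigid_body_step(x, R, P, L, M, I_body_inv,
                                 F=np.array([0.0, -9.81 * M, 0.0]),
                                 tau=np.zeros(3), dt=0.01)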

6.5 Interacting with Rigid Objects


Multiple studies have been carried out which demonstrate that physically-based mod-
eling (PBM) and haptics have the potential to significantly improve interaction in vir-
tual environments. As an example, one such study is described in [18]. Here, three
virtual elements in form of the letters I, L, and T, had to be positioned into a frame.
The experiment consisted of three phases. In the first phase, subjects had to complete
the task in a native ”virtual environment”, where collisions between a letter and the
frame were only optically signaled by a wireframe presentation of the letter. During
the second phase, the subjects were supported by haptic feedback. Finally, in the
third phase, the subjects were asked to move the letters above their target positions
and let them fall into the frame. The falling procedure, as well as the behaviour of the
objects when touching the frame, had been physically modeled. The results depicted
in Fig. 6.24 show that haptics as well as PBM significantly improve the user interface.
Fig. 6.24. Influence of haptics and PBM on the judgement of task difficulty (a) and user inter-
face (b) [18].

However, the preceding sections about haptics and PBM should also have made
clear that the ultimate goal of VR – an exact correlation between interaction in the
virtual and in the real world – cannot be achieved in the near future for the following
main reasons:
• Modeling
– Geometry
To achieve a graphical representation of the virtual scene in real time, the
shapes of virtual objects are typically tessellated, i.e., they are modeled as
polyhedrons. It is obvious that by such a polygonal representation, curved
surfaces in the virtual world can only be an approximation of those in a real
world. As a consequence, even very simple positioning tasks lead to unde-
sired effects. As an example, it nearly impossible for a user to put a cylindri-
cal virtual bolt into a hole, since both, bolt and hole, are not perfectly round,
so that inadequate collisions inevitably occur.
– Physics
Due to the discrete structure of a digital computer, collisions between vir-
tual objects and the resulting, physically-based collision reaction can only be
computed at discrete time steps. As a consequence, not only the graphical,
but also the ”physical” complexity of a virtual scene is limited by the perfor-
mance of the underlying computer system. As with the graphical representa-
tion, the physical behavior of virtual objects can only be an approximation of
reality.
• Hardware Technology
The development of adequate interaction devices, especially force and tactile
feedback devices, remains a challenging task. Hardware solutions suitable for a
use in industrial and scientific applications, like the PHANToM Haptic Device,
typically only display a single force vector and cover a rather limited work vol-
ume. First attempts have also been made to add force feedback to instrumented
gloves (see Fig. 6.25), thus providing kinesthetic feedback to complete hand
movements. It seems arguable however, whether such approaches become ac-
cepted in industrial and technical applications. With regard to tactile devices, it
seems impossible altogether to develop feasible solutions due to the large number
of tactile sensors in the human skin that have to be addressed.
In order to compensate for the problems arising from the non-exact modeling of
geometry and behavior of virtual objects as well as from the inadequacy of avail-
able interaction devices, additional mechanisms should be made available to the user
which facilitate the execution of manipulation tasks in virtual environments.

Fig. 6.25. The CyberForceTM Haptic Interface, Immersion Corporation.

As an example, we show here how elementary positioning tasks can be supported


by artificial support mechanisms which have been developed at RWTH Aachen Uni-
versity [18]. Fig. 6.26 illustrates how the single mechanisms – guiding sleeves, sen-
sitive polygons, virtual magnetism, and snap-in – are assigned to the three classical
phases of a positioning process. On the book CD, movies are available which show
these support mechanisms. In the following, the single mechanisms will be described
in more detail.

6.5.1 Guiding Sleeves

At the beginning of a positioning task is the transport phase, i.e., the movement to
bring a selected virtual object towards the target position. Guiding sleeves are a sup-
port mechanism for the transport phase which make use of a priori knowledge about
possible target positions and orientations of objects. In the configuration phase of
the virtual scenario, the guiding sleeves are generated in form of a scaled copy or a
scaled bounding box of the objects that are to be assembled, and then placed as invis-
ible items at the corresponding target location. When an interactively guided object
collides with an adequate guiding sleeve, the target position and orientation is visual-
ized by means of a semi-transparent copy of the manipulated object. In addition, the
motion path necessary to complete the positioning task is animated as a wireframe
representation (see Fig. 6.27).
When the orientations of objects are described as quaternions (see [10]), the an-
imation path can be calculated on the basis of the position difference pdif f and the
orientation difference qdif f between the current position pcur and orientation qcur ,
and the target position pgoal and orientation qgoal :
$$p_{diff} = p_{cur} - p_{goal} \quad (6.31)$$

$$q_{diff} = q_{cur} - q_{goal} \quad (6.32)$$

Fig. 6.26. Assignment of support mechanisms to the single interaction phases of a positioning
task.

Fig. 6.27. Guiding Sleeves support the user during the transport phase of a positioning task by
means of a wireframe animation of the correct path. A video of this scenario is available on
the book CD.

The in-between representations are then computed according to the Slerp (Spherical
Linear intERPolation) algorithm [4]:

$$Slerp(q_{n,cur}, q_{n,goal}, t) = \frac{\sin((1-t)\,\theta)}{\sin(\theta)}\, q_{n,cur} + \frac{\sin(t\,\theta)}{\sin(\theta)}\, q_{n,goal}\,, \quad (6.33)$$

$$\theta = \arccos(q_{n,cur} \cdot q_{n,goal})\,, \qquad t \in [0.0, 1.0]\,. \quad (6.34)$$
The interpolation step t = 0.0 describes the normalized, current orientation
qn,cur , and t = 1.0 stands for the normalized goal orientation qn,goal .
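A direct transcription of (6.33) and (6.34) reads as follows (numpy; quaternions are represented as plain normalized 4-vectors, and the small-angle fallback is an added safeguard against division by zero):

import numpy as np

def slerp(q_cur, q_goal, t):
    # Spherical linear interpolation between two unit quaternions, cf. (6.33), (6.34).
    dot = float(np.clip(np.dot(q_cur, q_goal), -1.0, 1.0))
    theta = np.arccos(dot)
    if theta < 1e-6:                     # orientations are nearly identical
        return q_goal
    return (np.sin((1.0 - t) * theta) / np.sin(theta)) * q_cur + \
           (np.sin(t * theta) / np.sin(theta)) * q_goal

q_start = np.array([1.0, 0.0, 0.0, 0.0])                              # identity
q_goal = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])   # 90 degrees about z
for step in np.linspace(0.0, 1.0, 5):                                 # in-between representations
    print(step, slerp(q_start, q_goal, step))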

6.5.2 Sensitive Polygons


Sensitive polygons are created before simulation start and positioned at adequate
positions on object surfaces. A function can be assigned to every sensitive polygon,
which is automatically called when an interactively guided element collides with
the sensitive polygon. Sensitive polygons are primarily designed to support the user
during the coarse positioning phase of an assembly task. Among others, the polygons
can be used to constrain the degrees of freedom for interactively guided objects.
For instance, a sensitive polygon positioned at the port of a hole can restrict the
motion of a pin to the axial direction of the hole and thus considerably facilitate
the manipulation task. In case of this classical peg in hole assembly task, a copy
of the guided object is created and orientated into the direction of the hole’s axis.
The original object is switched into a semi-transparent representation (see Fig. 6.28,
left) in order to provide the user at any time with a visual feedback of the actual
movement. The new position Pneu of the copy can be calculated from the current
manipulator position Pact , the object’s target position Pz , and the normal vector N
of the sensitive polygon, which is identical to the object’s target orientation:
$$P_{neu} = P_z + \left[ N \cdot (P_{act} - P_z) \right] N\,. \quad (6.35)$$

Fig. 6.28. Sensitive polygons can be used to constrain the degrees of freedom for an interac-
tively guided object.
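For the peg-in-hole example, applying (6.35) amounts to projecting the manipulator position onto the hole axis. A small numpy sketch, with arbitrary example coordinates:

import numpy as np

def constrained_position(p_act, p_z, normal):
    # New object position according to (6.35): p_act is projected onto the axis
    # through the target position p_z with direction 'normal' (the normal vector
    # of the sensitive polygon).
    n = normal / np.linalg.norm(normal)        # ensure a unit axis direction
    return p_z + np.dot(n, p_act - p_z) * n

p_target = np.array([0.0, 0.0, 0.0])           # port of the hole
axis = np.array([0.0, 0.0, 1.0])               # hole axis = polygon normal
p_hand = np.array([0.02, -0.01, 0.15])         # current manipulator position
print(constrained_position(p_hand, p_target, axis))   # motion restricted to the hole axis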

6.5.3 Virtual Magnetism


In assembly procedures, it is often necessary to position objects exactly, i.e., parallel
to each other at arbitrary locations, without leaving any space between them. Since
in a virtual environment this task is nearly impossible to accomplish without any


artificial support mechanisms, the principle of virtual magnetism has been introduced
[18]. The left part of Fig. 6.29 illustrates the effect of virtual magnetism on movable
objects. In contrast to other support mechanisms, a virtual magnetism should not be
activated automatically in order to avoid non-desired object movements. Instead, the
user should actively initiate it during the simulation, e.g., by a speech command.

Fig. 6.29. Virtual magnetism to facilitate a correct alignment of virtual objects. A short video
on the book CD shows virtual magnetism in action.

Before simulation start, an object’s influence radius is calculated as a function of


its volume. Whenever two of such influence areas overlap during the simulation, an
attraction force is generated that consists of two components (see Fig. 6.29, right).
The first component is calculated from the distance between the two objects’ centers
of gravity, and the second component is a function of the polygon surfaces. To calculate
the second component, a search beam is cast from every polygon midpoint in the
direction of the surface normal. If such a beam intersects a polygon of the other
object, a force directed along the surface normal is created, which is proportional to
the distance between the polygons. Finally, the sum of the two force components is
applied to move the smaller object towards the larger one.
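The following fragment gives a heavily simplified impression of how such an attraction force could be assembled; it is not the implementation described in [18]. The polygon-based component is assumed to be supplied by a separate (hypothetical) ray-casting routine, and all names are illustrative.

import javax.vecmath.Point3d;
import javax.vecmath.Vector3d;

public class VirtualMagnetism {
    // Returns the attraction force acting on the smaller object, or null if the
    // influence spheres of the two objects do not overlap.
    public static Vector3d attraction(Point3d cogSmall, double radiusSmall,
                                      Point3d cogLarge, double radiusLarge,
                                      Vector3d polygonComponent) {
        Vector3d toLarge = new Vector3d();
        toLarge.sub(cogLarge, cogSmall);              // points towards the larger object
        if (toLarge.length() > radiusSmall + radiusLarge) {
            return null;                              // influence areas do not overlap
        }
        Vector3d force = new Vector3d(toLarge);       // first component: grows with the
                                                      // distance of the centers of gravity
        force.add(polygonComponent);                  // second component: from the ray casts
        return force;                                 // applied to move the smaller object
    }
}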

6.5.4 Snap-In

In addition to virtual magnetism, a snap-in mechanism can be provided to support the


final phase of a positioning task. The snap-in is activated when the position and orientation
differences between a guided object and the target location fall below specific
thresholds. As with guiding sleeves and sensitive polygons, snap-in mechanisms
require a priori knowledge of the exact target positions. The following threshold
conditions d and α for position and orientation can be used to initiate the snap-in
process:

d ≥ |Pgoal − Pcur|   and   α ≥ arccos( (N · Pcur) / (|N| |Pcur|) ).                    (6.36)
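A direct translation of condition (6.36) into code could look as follows; the thresholds d and alpha as well as the class and method names are illustrative, and the javax.vecmath types of Java3D are used again.

import javax.vecmath.Point3d;
import javax.vecmath.Vector3d;

public final class SnapIn {
    // Returns true if both thresholds of Eq. (6.36) are met, i.e., the snap-in
    // process may be initiated for the guided object.
    public static boolean shouldSnapIn(Point3d pCur, Point3d pGoal, Vector3d n,
                                       double d, double alpha) {
        boolean closeEnough = pGoal.distance(pCur) <= d;
        Vector3d vCur = new Vector3d(pCur);
        double angle = Math.acos(n.dot(vCur) / (n.length() * vCur.length()));
        return closeEnough && angle <= alpha;
    }
}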
Studies could show that the additional support from guiding sleeves, sensitive
polygons, virtual magnetism, and snap-in, enhance user performance and acceptance
significantly [18]. One of the experiments that have been carried out refers to a task,
in which virtual nails must be brought into the corresponding holes of a virtual block.
The experiment consisted of three phases: In phase 1, subjects had to complete the
task in a native virtual environment. In phases 2 and 3, they were supported by guid-
ing sleeves and sensitive polygons, respectively. Every phase was carried out twice.
During the first trial, objects snapped in as soon as the subject placed them at the
target position within a certain tolerance. Fig. 6.30 depicts the qualitative analysis of
this experiment. Subjects judged task difficulty as much easier and the quality of the
user interface as much better when the support mechanisms were activated.

Fig. 6.30. Average judgement of task difficulty and user interface with and without support
mechanisms [18].

6.6 Implementing Virtual Environments


When implementing virtual environments, a number of obstacles have to be over-
come. For a lively and vivid virtual world, there is the need for an efficient data
structure that keeps track of hierarchies and dependencies between objects. The most
fundamental data structure that is used for this purpose is called scene graph. While
scene graphs keep track of things in the virtual world, there is usually no standardized
support for real world problems, e.g., the configuration of display and input devices
and the management of user input. For tasks like these, a number of toolkits exist
that assist in programming virtual environments. Basic assumptions and common
principles about scene graphs and toolkits will be discussed in the following section.

6.6.1 Scene Graphs

The concept of scenes is derived from theatre plays and describes units of a play that
are smaller than an act or a tableau. Usually, scenes are defined by a change of place
or the exchange of characters on the set. In direct analogy, scenes in virtual environments
define the set of objects that make up the place where the user can act. This
definition of course includes the objects the user can interact with. From
a computer graphics point of view, several objects have to work together in order
to display a virtual environment. The most prominent data is the geometry of the
objects, usually defined as polygonal meshes. In addition to that, information about
the illumination of the scene is necessary. Trivially, positions and orientations of


objects are needed in order to place objects in the scene. Looking closer, there is often
a dependency between the geometrical transformations of certain objects: tires, e.g.,
are always attached to the chassis of a car, and objects that reside inside a vehicle
move along when the vehicle moves. For a programmer, it quickly becomes a tedious
task to keep these dependencies between objects consistent. For this purpose, the
scene graph is an established data structure that eases the task considerably.
Basically, a scene graph can be modeled as a directed acyclic graph (DAG) where
the nodes describe objects of the scene in various attributes, and edges define a ”child
of” dependency between these objects. In graphics processing, scene graphs are
used for the description of a scene. As such, they provide a hierarchical framework for
easily grouping objects spatially. There are basically three processes that are covered
by scene graph toolkits. The first is the management of objects in the scene, mainly
creating, accessing, grouping and deleting them. The second one is the traversal of
the scene, where distinct traversal paths can result in different scenes that are ren-
dered. A further aspect covers operations that are called when encountering nodes,
that is, realizing functionality on the basis of the types of objects that are contained
in the scene.
As stated above, there exist different types of nodes that can be contained in a
scene graph. In order to realize the spatial grouping of objects, group nodes define
the root of a subtree of a scene graph. Nodes that do not have a child are called leaf
nodes. Children of group nodes can be group nodes again or content nodes, that is,
an abstract node type that can have individual properties, such as a geometry
to draw, a color to use, and so on. Depending on the toolkit you choose to implement
the virtual environment with, content nodes may not be part of the scene graph directly, but
are indirectly linked to traversal nodes in the scene graph. The scene graph nodes
then only define the spatial relation of the content nodes. This enables the distinction
of algorithms that do the traversal of a scene graph and the actual rendering of the
contents. However, the following passages will assume that leaf nodes define the
geometrical content to render. As all things have a beginning and an end, a special
set of group nodes is designated as the root nodes of the virtual world. While some
toolkits allow only one root node, others may allow more than one and usually select
a special root node for an upcoming rendering procedure. Fig. 6.31(a) depicts a naive
example of the scene graph that could be used to render a car. Generally speaking, the
traversal of a scene graph is a concept of its own and thus the order in which a scene
graph is traversed can be toolkit specific. In the subsequent passages, a left-order-
depth-first tree walk is assumed. That means that in every group node, the leftmost
child is visited in order to determine the next traversal step. If this child is a group
node itself, again the leftmost child will be visited. Once the traversal encounters a
leaf node, the content that is linked to this leaf node will be rendered. After that, the
right sibling of the rendered node will be visited until there is no right sibling. Once
the traversal routine re-encounters the world’s root node, another rendering traversal
may begin as the whole scene is painted. This process will terminate on a finite DAG.
Fig. 6.31. Two very basic examples of a scene graph describing a car with four tyres: (a) a
naive scene graph where content nodes are modeled as leaf nodes; (b) a still simple but more
elaborate scene graph for the car.

As stated above, any node defines a spatial relation; more precisely, transformations
that are applied to the parent of a set of nodes affect the transformations of the
child nodes. With a left-order-depth-first tree walk, transformations are processed


”bottom up”, that is from leaf nodes to the root node. This corresponds to a matrix
multiplication of a set of transformation matrices from right to left. In order to let
the car from Fig. 6.31(a) drive, the programmer has to apply nine transformations
in total, touching every single entity in the graph with the exception of the root node.
Fig. 6.31(b) therefore depicts a more elegant solution, where a group node for a parent
object ”car” is introduced. This node defines a spatial relation itself, called a local
transformation, i.e., a translation of the node ”car” translates all its child nodes by the
same amount. This introduces the concept of a local coordinate system, which defines
all spatial relations locally for every node and its children. The coordinate system of
the world’s root node is called the global coordinate system; it is defined either by the
application programmer or by the toolkit that is used.
An example with quite the same structure as the car example depicted above can
be found on the book CD. See Sect. 6.7 for more details on this simple example.
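To make the idea of local transformations concrete, the following Java3D-flavored fragment builds a hierarchy similar to Fig. 6.31(b). The helper methods loadChassis() and loadTyre() are hypothetical placeholders for content nodes, and the offsets are arbitrary; only the grouping pattern matters here.

import javax.media.j3d.BranchGroup;
import javax.media.j3d.Group;
import javax.media.j3d.Transform3D;
import javax.media.j3d.TransformGroup;
import javax.vecmath.Vector3f;

public class CarSceneGraph {
    public static BranchGroup build() {
        BranchGroup root = new BranchGroup();

        // one transform node for the whole car: changing its local transformation
        // moves the chassis and all four tyres at once
        TransformGroup car = new TransformGroup();
        car.setCapability(TransformGroup.ALLOW_TRANSFORM_WRITE);
        root.addChild(car);

        car.addChild(loadChassis());
        for (int i = 0; i < 4; i++) {
            TransformGroup tyreMount = new TransformGroup();   // local offset per tyre
            Transform3D offset = new Transform3D();
            offset.setTranslation(new Vector3f(i < 2 ? -1f : 1f, 0f, (i % 2 == 0) ? -0.7f : 0.7f));
            tyreMount.setTransform(offset);
            tyreMount.addChild(loadTyre());
            car.addChild(tyreMount);
        }
        return root;
    }

    // hypothetical content loaders -- replace by real geometry in an application
    private static javax.media.j3d.Node loadChassis() { return new Group(); }
    private static javax.media.j3d.Node loadTyre()    { return new Group(); }
}

A single call such as car.setTransform(...) then translates the chassis and all tyres by the same amount, exactly as described for the local coordinate system above.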
As stated above, content nodes can have different types and can influence what
can be seen in the scene. These range from nodes that describe the polygonal mesh to
be rendered, via application programmers’ callback nodes, to intelligent ”pruning”
nodes that can be used to simplify a scene and speed up rendering. This happens,
e.g., when parts of the scene that are not visible from the current viewpoint are culled,
or when highly detailed geometries that are far away are replaced by simpler
counterparts that still convey the same visual information to the viewer.
The latter method is called level of detail rendering. Some content nodes describe the
position of the camera in the current scene, or a ”virtual platform” where the user is
located in the virtual world. By using such nodes correctly, the application program-
mer can tie the user to an object. E.g., the camera that is attached as a child of the
car in Fig. 6.31(b) is affected by the transformation of the car node, as are its siblings.

6.6.2 Toolkits

A number of toolkits for the implementation of virtual environments is currently
available, and their abilities and focuses vary. Nevertheless, some basic concepts can
be distinguished that are more or less equally followed by most environments. A list
of some available toolkits, together with URLs to download them, is given in the
README.txt on the book CD in the vr directory. The following section concentrates
on these common principles, without describing a specific toolkit and without
enumerating the currently available frameworks.
The core of any VR toolkit is the application model that can be followed by the
programmer to create the virtual world. A common approach is to provide a complete
and semi-open infrastructure, that includes scene graphs, input/output facilities, de-
vice drivers and a tightly defined procedure for the application processing. Another
strategy is to provide independent layers that augment existing low level facilities,
e.g., scene graph APIs with VR functionality, such as interaction metaphors, hard-
ware abstraction, physics or artificial intelligence. While the first approach tends to
be a static all-in-one approach, the latter one needs manual adaptations that can be te-
dious when using exotic setups. In either case, the application model can be event driven
or procedural: the application programmer either listens to events propagated
over an event bus or defines a number of callback routines that are called
by the toolkit in a defined order.
A common model for interactive applications is the frame-loop. It is depicted
in Fig. 6.32. In this model, the application calculates its current state in between
the rendering of two consequent frames. After the calculation is done, the current
scene is rendered onto the screen. This is repeated in an endless loop until the
user breaks the loop and the application exits. A single iteration of the loop is called
a frame. It consists of a calculation step for the current application state and the
rendering of the resulting scene.
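In code, the frame-loop reduces to a very small skeleton; all method names below are placeholders for toolkit- or application-specific functionality.

public class FrameLoop {
    private volatile boolean running = true;

    public void run() {
        while (running) {                 // one iteration of the loop is one frame
            dispatchInteractionEvents();  // propagate device input over the event bus
            updateApplicationState();     // compute the new state of the domain objects
            renderScene();                // draw the resulting scene on all displays
        }
    }

    public void stop() { running = false; }   // e.g., triggered by a quit event

    private void dispatchInteractionEvents() { /* poll trackers, wands, keyboards ... */ }
    private void updateApplicationState()    { /* physics, behaviors, animations ... */ }
    private void renderScene()               { /* traverse the scene graph and draw */ }
}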
User interaction is often dispatched after the rendering of the current scene, e.g.,
when no traversal of the scene graph is occurring. Any state change of an interaction
entity is propagated to the system and the application using events. An event indicates
that a certain state is present in the system. E.g., the press of a button on the keyboard
defines such a state. All events are propagated over an event-bus.
Events usually trigger a change of state in the world’s objects which results in a
changed rendering on the various displays. As a consequence, the task of implement-
ing and vitalizing a virtual environment deals with the representation of the world’s
objects for the various senses. For the visual sense, this is the graphical representation,
that is, the polygonal meshes and descriptions needed by the graphics hardware
to render an object properly. For the auditory sense, it comprises sound waves
and radiation characteristics or attributes needed to calculate the filters that drive the
loudspeakers. Force feedback devices, which appeal to the haptic sense, usually need
information about materials, such as stiffness and masses, together with forces and torques.

Fig. 6.32. A typical endless loop of a VR application: after the application start, every frame
consists of an interaction update, an application update and the rendering of the scene, followed
by a system update; all parts communicate over the event bus until the application ends.

Real VR toolkits therefore provide algorithms for the transformation or storage of


data structures that are needed for the rendering of virtual objects. Usually a collec-
tion of algorithms for numerical simulation and physically based modeling have to
be present, too.
In addition to the structures that help to implement the virtual objects in a virtual
world, a certain amount of real world housekeeping is addressed by proper VR tool-
kits. The most apparent issue is that a number of non-standard input devices have to
be supported in order to process user input. This includes tracking devices, ranging
from electromagnetic and ultrasonic trackers to optical tracking systems. VR
toolkits provide a layer that maps the broad range of devices to a standard input and
output bus and allow the application programmer to process the output of the devices.
The most fundamental concepts for input processing include strategies for real time
dispatching of incoming data and the transformation from device dependent spatial
data to spatial data for the virtual world. Another aspect of input processing is the
ability to join spatially distributed users of a virtual environment in a collaborative
setting. If that aspect is supported, the VR toolkit provides a means of differentiating
between the ”here” and ”there” of user input and suitable locking mechanisms to
avoid conflicting states of the distributed instances of the virtual environment.
Another fundamental aspect of a toolkit that implements virtual environments is
the flexible setup for a varying range of display devices. A trivial requirement for the
toolkit is to support different resolutions and geometries of the projection surfaces
as well as active and passive stereo support. This includes multi-pipe environments,
where a number of graphics boards is hosted by a single machine, or distributed
cluster architectures where each node in a cluster drives a single graphics board but
the whole cluster drives a single display.
A fundamental requirement is the presence of a well performing scene graph
structure that enables the rendering of complex scenes. In addition to that, the tool-
kit should provide basic means to create geometry for 3D widgets and algorithms
for collision detection, deformation and topological changes of geometries. A very
obvious requirement for toolkits that provide an immersive virtual environment is
the automatic calculation of the user-centered projection that enables stereoscopic
views. Ideally, this can directly be bound to a specific tracking device sensor such
that it is transparent to the user of the toolkit. Fig. 6.33 depicts the building blocks
of modern virtual environments that are more or less covered by most available tool-
kits for implementing virtual environments today. It can be seen that the software
necessary for building vivid virtual worlds is more than just a scene graph, and that even
simple applications can be rather complex in their set-up and runtime behavior.

6.6.3 Cluster Rendering in Virtual Environments

As stated above, room mounted multi-screen projection displays are nowadays driven
by off-the-shelf PC clusters instead of multi-pipe, shared memory machines. The
topology and system layout of a PC cluster raises fundamental differences in the
software design, as the application has to respect distributed computation and inde-
pendent graphics drawing. The first issue introduces the need for data sharing among
the different nodes of the cluster, while the latter one raises the need for a synchro-
nization of frame drawing across the different graphics boards. Data locking deals
with the question of sharing the relevant data between nodes. Usually, nodes in a PC
cluster architecture do not share memory. Requirements on the type of data differ,
depending on whether a system distributes the scene graph or synchronizes copies of the same
application. Data-locked applications calculate on the same data and, if the
algorithms are deterministic, compute the same results at the same granularity.
An important issue especially for VR applications is rendering. The frame draw-
ing on the individual projection screens has to be precisely timed. This is usually
achieved with specialized hardware, e.g., genlock or framelock features, that
are available on some graphics boards. However, such hardware is not common in off-
the-shelf graphics boards and is usually expensive. Ideally, a software-based solution
to the swap synchronization issue would strengthen the idea of using non-specialized
hardware for VR rendering, thus making the technique more common.
In VR applications, it is usually distinguished between two types of knowledge,
the graphical setup of the application (the scene graph) and the values of the domain
models that define the state of the application. Distributing the scene graph results
in a simple application setup for cluster environments. A setup like this is called a
client-server setup, where the server cluster nodes provide the service of drawing the
scene, while the client dictates what is drawn by providing the initial scene graph
and subsequent modifications to it. This technique is usually embedded as a low
level infrastructure in the scene graph API that is used for rendering. Alternatives
to the distribution of the scene graph can be seen in the distribution of pixel based
information over high bandwidth networks, where all images are rendered in a high
performance graphics environment, or the distribution of graphics primitives.
A totally different approach is based on the observation that instances of a Virtual Reality
application that have basically the same state of their domain objects will render the same scene.

Fig. 6.33. Building blocks and aspects that are covered by a generic toolkit that implements
virtual environments. Applications based on these toolkits need a large amount of infrastruc-
ture and advanced real time algorithms.

It is therefore sufficient to distribute the state of the domain objects in order to
render the same scene. In a multi-screen environment, only the camera on the virtual
scene has to be adapted to the layout of the projection system. This very common
approach is called the master-slave, or mirrored application, paradigm, as the
calculation of the domain objects for a given time step is done on a master machine
and afterward distributed to the slave machines.

One can choose between the distribution of the domain objects or the distribution
of the influences that can alter the state of any domain object, e.g., user input. Do-
main objects and their interactions are usually defined on the application level by the
author of the VR application, so it seems more reasonable to distribute the entities of
influence to these domain objects and apply these influences to the domain objects
on the slave nodes of the PC cluster, see Fig. 6.34.
Every VR system constantly records user interaction and provides information
about this to the application which will change the state of the domain objects. For
a master-slave approach, as depicted above, it is reasonable to distribute this inter-
action and non-deterministic influences across several cluster nodes, which will in
turn calculate the same state of the application. This results in a very simple appli-
cation model at the cost of redundant calculations and synchronization costs across
different cluster nodes.
We can see that the distribution of user interaction in the form of events is a
key component to the master-slave approach. As a consequence, in a well designed
system it is sufficient to mirror application events to a number of cluster nodes to
transparently realize a master-slave approach of a clustered virtual environment. The
task for the framework is to listen to the events that run over the event bus during
the computational step in between the rendering steps of an application frame, and
distribute this information across a network.
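The following fragment sketches how such event mirroring could be realized with plain Java object serialization. The InteractionEvent type and the two helper classes are simplified stand-ins and not part of a real VR toolkit.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.Socket;

class InteractionEvent implements Serializable {
    final String device;        // e.g., "wand"
    final double[] position;    // tracked position for this frame
    InteractionEvent(String device, double[] position) {
        this.device = device;
        this.position = position;
    }
}

class MasterEventMirror {
    private final ObjectOutputStream out;
    MasterEventMirror(Socket slave) throws IOException {
        out = new ObjectOutputStream(slave.getOutputStream());
    }
    // called for every event seen on the master's event bus during one frame
    void mirror(InteractionEvent e) throws IOException {
        out.writeObject(e);
        out.flush();
    }
}

class SlaveEventListener {
    private final ObjectInputStream in;
    SlaveEventListener(Socket master) throws IOException {
        in = new ObjectInputStream(master.getInputStream());
    }
    // blocks until the next mirrored event arrives; it is then fed to the local event bus
    InteractionEvent next() throws IOException, ClassNotFoundException {
        return (InteractionEvent) in.readObject();
    }
}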

Fig. 6.34. Data locking by event sharing: during the computational step of a frame, an event
observer on the master node serializes interaction events and sends them over the cluster’s
intranet to a net listener on each slave node, so that master and slaves compute the same
domain object states for each time step.

6.7 Examples

The CD that accompanies this book contains some simple examples for the implementation
of VR applications. It is, generally speaking, a hard task to provide real-world
examples for VR applications, as much of their fascination and expressiveness
comes from the hardware that is used, e.g., stereo capable projection devices,
haptic robots or precisely calibrated loudspeaker set-ups. The examples provided
here are intended to introduce the reader to basic work with 3D graphical applications
and to provide a beginner’s set-up that can be used for one’s own extensions.
For a number of reasons, we have chosen to use the Java3D API for implemen-
tation of the examples, which is a product of SUN Microsystems. Java3D needs a
properly installed Java environment in order to run. At the time of writing this book,
the current Java environment version is 1.5.0 (J2SE). Java3D is an add-on
package that has to be downloaded separately, and its interface is still under devel-
opment. The version that was used for this book is 1.3.2. The book CD contains a
README.txt file where you can find the URLs to download the Java Environment,
the Java3D API, and some other utilities needed for the development. You can use
any environment that is suitable for the development of normal Java applications,
e.g., Eclipse, JBuilder or the like. Due to licensing policies, you have to download
all needed libraries yourself; they are not contained on the book CD. You can find
the contents of this chapter on the book CD in the directory vr. There you will find
the folders SimpleScene, Robot, SolarSystem and Tetris3D. Each folder
contains a README.txt file that contains some additional information about the
execution of the applications or the configuration.
The Java3D API provides a well structured and intuitive access to the work with
scene graphs. For a detailed introduction, please see the comprehensive tutorials on
the Java3D website. You can find the URLs for the tutorials and additional stuff on
the book CD. The following examples all use a common syntax when depicting the
scene graph layouts. Figure 6.35 shows the entities that are used in all the diagrams.
This syntax is close to the one in the Java3D tutorials that can be found at the URLs
as given on the book CD.

Fig. 6.35. Syntax of the scene graph figures that describe the examples in this chapter. The
syntax is close to the one that can be found in the Java3D tutorials.

6.7.1 A Simple Scene Graph Example


The folder SimpleScene contains an example that can be used as a very first
step into the world of scene graphs. It is a starting point for playing around with
hierarchies, local and global transformations. In addition to that, it embeds a 3D
canvas (an object that is used for the rendering of a 3D scene) into an AWT context,
that is, the simple 2D user interface layer of Java, which provides many types of 2D
control widgets, e.g., buttons, list views and sliders.
The structure of the scene that is displayed by the unmodified SimpleScene.java
is outlined in figure 6.36. The figure depicts the part of the scene graph
that is created by the application and the part that is automatically provided by the
object SimpleUniverse, which is shipped with the standard Java3D API in or-
der to easily realize 3D window mouse interaction.

Fig. 6.36. The scene graph that is outlined by the SimpleScene example, together with the
augmented scene graph that is automatically created by the SimpleUniverse object of the
Java3D API.

The application first creates the root node, called objRoot, and a transform node that allows the interaction
with an input device on this node. Java3D provides predefined behavior classes that
can be used to apply rotation, translation and scaling operations on transform nodes.
Behaviors in Java3D can be active or inactive depending on the current spatial rela-
tions of objects. In this example, we define the active region of the mouse behavior
for the whole scene. This can be accomplished by the code fragment in the method
createBehavior(), see Fig. 6.37.

public void createBehavior(TransformGroup oInfluencedNode, Group oParent)


{
// behavior classes shipped with the Java3D API
mouseRotate = new MouseRotate();
mouseTranslate = new MouseTranslate();
mouseZoom = new MouseZoom();

// you have to set the node that is transformed by the mouse


// behavior classes
mouseRotate.setTransformGroup(oInfluencedNode);
mouseTranslate.setTransformGroup(oInfluencedNode);
mouseZoom.setTransformGroup(oInfluencedNode);

// these values determine the "smoothness" of the interaction


mouseTranslate.setFactor(0.02);
mouseZoom.setFactor(0.02);

// determines the region of influence


// setting the radius of the bounding sphere to Float.MAX_VALUE
// states that these behaviors are active everywhere
BoundingSphere boundingSphere =
new BoundingSphere(new Point3d (0.0, 0.0, 0.0), Float.MAX_VALUE);

mouseRotate.setSchedulingBounds(boundingSphere);
mouseTranslate.setSchedulingBounds(boundingSphere);
mouseZoom.setSchedulingBounds(boundingSphere);

// behaviors are part of the scene graph and they live


// as a child of the transform node
oParent.addChild(mouseRotate);
oParent.addChild(mouseTranslate);
oParent.addChild(mouseZoom);
}

Fig. 6.37. Example code for mouse behavior. This code sets the influence region of the mouse
interaction to the complete scene.

The application’s scene graph is constructed by the method


createSceneGraph(). This method can be adapted to create an arbitrary
scene graph, but take care to initialize the members t1Group and t2Group, which
indicate a target for the mouse transformation that is changed by clicking on the
radio button on the left hand side of the application panel.
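As an illustration of the pattern, a possible minimal variant of createSceneGraph() is sketched below. It assumes the members t1Group and t2Group of the SimpleScene class and uses the ColorCube utility shipped with Java3D; the actual code on the book CD may differ in its details.

public BranchGroup createSceneGraph() {
    BranchGroup objRoot = new BranchGroup();

    // two independently transformable groups, switchable as mouse targets
    // via the radio buttons of the application panel
    t1Group = new TransformGroup();
    t1Group.setCapability(TransformGroup.ALLOW_TRANSFORM_WRITE);
    t1Group.addChild(new com.sun.j3d.utils.geometry.ColorCube(0.2));

    t2Group = new TransformGroup();
    t2Group.setCapability(TransformGroup.ALLOW_TRANSFORM_WRITE);
    Transform3D offset = new Transform3D();
    offset.setTranslation(new javax.vecmath.Vector3f(0.6f, 0.0f, 0.0f));
    t2Group.setTransform(offset);
    t2Group.addChild(new com.sun.j3d.utils.geometry.ColorCube(0.2));

    objRoot.addChild(t1Group);
    objRoot.addChild(t2Group);
    objRoot.compile();          // let Java3D optimize the subgraph
    return objRoot;
}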

6.7.2 More Complex Examples: SolarSystem and Robot


The SimpleScene application is a very basic step in understanding the power of a
scene graph. The book CD provides two more elaborate example applications that
are described in the following section.

A classic example of spatial relations and the usage of scene graphs is the vi-
sualization of an animated solar system containing earth, moon and the sun. The
earth rotates around the sun, and the moon orbits the earth. Additionally, the sun and
the earth rotate around their own axis. The folder SolarSystem on the book CD
contains the necessary code in order to see how to construct these spatial relations.
Figure 6.38 depicts the scene graph of the application. In addition to the branch group
and transform group nodes, the figure contains the auxiliary objects that are needed
for the animation. These are given, e.g., by the rotator objects that modify the
transformation of the transform group they are attached to over time.

Fig. 6.38. The scene graph that is outlined by the SolarSystem example. It contains the
branch, transform and content nodes as well as special objects needed for the animation of the
system, e.g., rotator objects.
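In Java3D, such rotators are typically realized with the RotationInterpolator class. The fragment below shows how one of them might be set up for a transform group; the names and timing values are illustrative and may differ from the code on the book CD (the usual javax.media.j3d and javax.vecmath imports are assumed).

// rotate the transform group "earthSpin" once around its local y-axis every 8 seconds
TransformGroup earthSpin = new TransformGroup();
earthSpin.setCapability(TransformGroup.ALLOW_TRANSFORM_WRITE);

Alpha alpha = new Alpha(-1, 8000);                        // -1 = loop forever, 8000 ms period
RotationInterpolator rotator = new RotationInterpolator(alpha, earthSpin);
rotator.setSchedulingBounds(new BoundingSphere(new Point3d(0.0, 0.0, 0.0), 100.0));
earthSpin.addChild(rotator);                              // rotators live in the scene graph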

In the folder Robot you will find the file start.bat, which starts the Robot
application on your Microsoft Windows system. You can take the command line of
the batch file as an example to start the Robot application on a different operating
system that has Java and Java3D installed.
The Robot application example creates an animated and textured robot which
can be turned and zoomed with the mouse, using left and middle mouse button re-
spectively. In addition to the more complex scene graph, it features light nodes. Light
is an important aspect, and in most scene graph APIs, special light nodes provide
different types of lighting effects. These lights are themselves part of the scene
graph. The Robot application illustrates the use of spatial hierarchies rather clearly.
Every part that can move independently is part of a subtree of a transform group.
Figure 6.39 depicts the scene graph needed to animate the Robot. It omits the aux-
iliary objects necessary for the complete technical realization of the animation. See
the source code on the book CD for more details.

6.7.3 An Interactive Environment, Tetris3D

While SimpleScene, SolarSystem and Robot deal with transformations and


spatial hierarchies, they are not very interactive apart from the fact that the complete
scene can be rotated, translated, or zoomed. Typical VR applications gain much of their
fascination when they allow the user to interact with objects in the scene in real time. Java3D
uses behavior objects for the implementation of interaction with scene objects. An
intuitive way of learning this interaction concept is a game that benefits from depth
perception.
On the book CD, a complete Tetris3D game can be found in the Tetris3D
directory contained in the vr folder. It shows how to realize time-based behav-
ior with the TetrisStepBehavior class, which realizes the time dependent
falling of a puzzle piece. A way to react to keyboard input is outlined in the
TetrisKeyBahavior class; a minimal sketch of such a key behavior is given below.
Puzzle elements are defined in the Figure class and are basically a number of cube
primitives assembled into the well-known Tetris figures.
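The sketch below shows the general structure of such a keyboard behavior; the real TetrisKeyBahavior class on the book CD is more elaborate, and the class name used here is our own.

import java.awt.AWTEvent;
import java.awt.event.KeyEvent;
import java.util.Enumeration;
import javax.media.j3d.Behavior;
import javax.media.j3d.WakeupOnAWTEvent;

public class SimpleKeyBehavior extends Behavior {
    private final WakeupOnAWTEvent wakeup = new WakeupOnAWTEvent(KeyEvent.KEY_PRESSED);

    public void initialize() {
        wakeupOn(wakeup);                       // ask Java3D to wake us up on key presses
    }

    public void processStimulus(Enumeration criteria) {
        AWTEvent[] events = wakeup.getAWTEvent();
        for (int i = 0; i < events.length; i++) {
            int code = ((KeyEvent) events[i]).getKeyCode();
            if (code == KeyEvent.VK_LEFT)  { /* move the falling piece to the left  */ }
            if (code == KeyEvent.VK_RIGHT) { /* move the falling piece to the right */ }
        }
        wakeupOn(wakeup);                       // re-arm for the next key press
    }
}

As with the mouse behaviors of the SimpleScene example, such a behavior only becomes active after scheduling bounds have been set and it has been added to the scene graph.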
In addition to that, this example provides the configuration file
j3d1x2-stereo-vr for a workbench set-up with two active stereo projec-
tors, an electromagnetic tracking system (Flock of Birds, Ascension Technology),
and a wand-like interaction metaphor. It enables head tracking and VCP for the user.
The default configuration file for the game is in the same directory on the book CD
and is called j3d1x1-window. In order to determine which configuration file
is picked at start-up, the Tetris3D::Tetris3D() method has to be changed
and the class has to be recompiled. See Tetris3D.java file on the book CD for
comprehensive details.

Fig. 6.39. The scene graph that is outlined by the Robot example.

6.8 Summary

In recent years, VR has profited considerably from dramatically increasing computing
power and especially from the growing computer graphics market. Virtual
scenarios can now be modeled and visualized at interactive frame rates and at a
fidelity which could not nearly be achieved a few years ago. As a consequence,
VR technology promises to be a powerful and henceforth affordable man-machine
interface not only for the automotive industry and the military, but also for smaller
companies and medical applications. However, in practice most applications only address
the visual sense and are restricted to navigation through the virtual scenario. More
complex interaction functionality, which is, e.g., absolutely necessary for applications
like assembly simulation, is still in its early stages. The main reason for this
situation is that such interaction tasks heavily rely on the realistic behavior of
objects and on the interplay of multiple senses. Therefore, we concentrated in this
chapter on the physically-based modeling of virtual object behavior as well as on the
multimodal aspects of VR with a focus on acoustics and haptics, since for assembly
tasks these two modalities promise to be the most important complement to the
visual sense. Due to limitations in hardware technology and modeling, for which
solutions are not in sight in the near future, we introduced additional support mechanisms
as one way to facilitate the completion of VR-based assembly tasks.
The chapter closes with an introduction to the fundamental concept of scene
graphs that are used for the modeling of spatial hierarchies and play an important
role in the vitalization of virtual worlds. Scene graphs are often the most important
part of VR toolkits that provide general functionality for special VR hardware like
trackers and display systems. The usage of PC cluster based rendering introduces a
tool that can be used to lower the total costs for a VR installation and to ease the
utilization of VR techniques in the future.
On the book CD, videos are available which demonstrate how visual and haptic
interfaces, physically-based modeling, and additional support mechanisms have been
integrated to a comprehensive interaction tool, allowing for an intuitive completion
of fundamental assembly tasks in virtual environments.

6.9 Acknowledgements
We want to thank Tobias Lentz and Professor Michael Vorländer, our cooperation
partners in a joint research project funded by the German Research Foundation about
acoustics in virtual environments. Without their expertise it would have been im-
possible to write Sect. 6.2. Furthermore, we thank our former colleague and friend
Roland Steffan. The discourse on support mechanisms in Sect. 6.5 is based on stud-
ies in his Ph.D. thesis. We kindly thank Marc Schirski for the Robot example that is
contained on the book CD.

References
1. Assenmacher, I., Kuhlen, T., Lentz, T., and Vorländer, M. Integrating Real-time Binau-
ral Acoustics into VR Applications. In Virtual Environments 2004, Eurographics ACM
SIGGRAPH Symposium Proceedings, pages 129–136. June 2004.
2. Berkhout, A., Vogel, P., and de Vries, D. Use of Wave Field Synthesis for Natural Rein-
forced Sound. In Proceedings of the Audio Engineering Society Convention 92, volume
Preprint 3299. 1992.
3. Blauert, J. Spatial Hearing, Revised Edition. MIT Press, 1997.
4. Bobick, N. Rotating Objects Using Quaternions. Game Developer, 2(26):21–31, 1998.
5. Bowman, D. A., Kruijff, E., LaViola, J. J., and Poupyrev, I. 3D User Interfaces - Theory
and Practice. Addison Wesley, 2004.
6. Bryson, S. Virtual Reality in Scientific Visualization. Communications of the ACM,
39(5):62–71, 1996.
7. Cruz-Neira, C., Sandin, D. J., DeFanti, T. A., Kenyon, R. V., and Hart, J. C. The CAVE:
Audio Visual Experience Automatic Virtual Environment. Communications of the ACM,
35(6):64–72, 1992.

8. Delaney, B. Forget the Funny Glasses. IEEE Computer Graphics & Applications,
25(3):14–19, 2005.
9. Farin, G., Hamann, B., and Hagen, H., editors. Virtual-Reality-Based Interactive Explo-
ration of Multiresolution Data, pages 205–224. Springer, 2003.
10. Foley, J., van Dam, A., Feiner, S., and Hughes, J. Computer Graphics: Principles and
Practice. Addison Wesley, 1996.
11. Kresse, W., Reiners, D., and Knöpfle, C. Color Consistency for Digital Multi-Projector
Stereo Display Systems: The HEyeWall and The Digital CAVE. In Proceedings of the
Immersive Projection Technologies Workshop, pages 271–279. 2003.
12. Krüger, W. and Fröhlich, B. The Responsive Workbench. IEEE Computer Graphics &
Applications, 14(3):12–15, 1994.
13. Lentz, T. and Renner, C. A Four-Channel Dynamic Cross-Talk Cancellation System. In
Proceedings of the CFA/DAGA. 2004.
14. Lentz, T. and Schmitz, O. Realisation of an Adaptive Cross-talk Cancellation System for
a Moving Listener. In Proceedings of the 21st Audio Engineering Society Conference.
2002.
15. Møller, H. Reproduction of Artificial Head Recordings through Loudspeakers. Journal
of the Audio Engineering Society, 37(1/2):30–33, 1989.
16. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. Numerical Recipes in
C: The Art of Scientific Computing. Cambridge University Press, second edition edition,
1992.
17. Ruspini, D. C., Kolarov, K., and Khatib, O. The Haptic Display of Complex Graphical
Environments. In Proceedings of the 24th Annual Conference on Computer Graphics and
Interactive Techniques. 1997.
18. Steffan, R. and Kuhlen, T. MAESTRO - A Tool for Interactive Assembly Simulation in
Virtual Environments. In Proceedings of the Immersive Projection Technologies Work-
shop, pages 141–152. 2001.
19. Sutherland, I. The Ultimate Display. In Proceedings of the IFIP Congress, pages 505–
508. 1965.
20. Theile, G. Potential Wavefield Synthesis Applications in the Multichannel Stereophonic
World. In Proceedings of the 24th AES International Conference on Multichannel Audio,
page published on CDROM. 2003.
21. Zilles, C. B. and Salisbury, J. K. A Constraint-based God-object Method for Haptic Dis-
play. In Proceedings of the International Conference on Intelligent Robots and Systems.
1995.
Chapter 7

Interactive and Cooperative Robot Assistants


Rüdiger Dillmann, Raoul D. Zöllner

A proven approach for learning knowledge about actions and sensory-motor


abilities is to acquire prototypes of human actions by observing these actions with
sensors and transferring the acquired abilities to the robot, the so-called ”learning
by demonstration´´. Learning by demonstration requires human motion capture, ob-
servation of interactions and object state transitions, and observation of spatial and
physical relations between objects. By doing so, it is possible to acquire so-called
”skills”, situative knowledge as well as task knowledge. In this way, the robot can be intro-
duced to new and unknown tasks. New terms, new objects and situations, even new
types of motion can be learned with the help of a human tutor or be corrected inter-
actively via multimodal channels. In this case, the term “multimodality´´ points out
communication channels which are intuitive for humans, such as language, gesture
and haptics (i.e. physical human-robot contact). These communication channels are
to be used for commanding and instructing the robot system.
The research field of programming by demonstration has evolved as a response
to the need of generating flexible programs for service robots. It is largely driven
by attempts to model human behavior and to map it onto virtual androids or hu-
manoid robots. Programming by demonstration comprises a broad set of observation
techniques processing large sets of data from high speed camera systems, laser scan-
ners, data gloves and even exoskeleton devices. Some programming by demonstra-
tion systems operate with precise a priori models, others use statistical approaches
to approximate human behavior. In any case, observation is done to identify motion
over space and time, interaction with the environment and its effects, useful regu-
larities or structures and their interpretation in a given context. Keeping this goal in
mind, systems have been developed which combine active sensing, computational
learning techniques and multimodal dialogs to enrich the robot’s semantic system
level, memorization techniques and mapping strategies to make use of the learned
knowledge to control a real robot.
In the remainder of this chapter, we will introduce several different aspects of
interactive and cooperative robot assistants. The first section presents an overview of
currently developed mobile and humanoid robots with their abilities to interact and
cooperate with humans. In the second section, basic principles of human-robot inter-
action are introduced, whereas the third section discusses how robot assistants can
learn interactively from a human in a programming by demonstration process. More
details of this process and the relevant models and representations are presented in
sections four and five, respectively. Section six is concerned with how to extract
subgoals from human demonstrations, section seven with the questions of task mapping
and exception handling. Section eight gives a short overview of the interesting
research areas of telepresence and telerobotics. Finally, section nine shows some ex-
amples of human-robot interaction and cooperation.

7.1 Mobile and Humanoid Robots

Often, robots are described according to their functionality, e.g. underwater robot,
service robot etc. In contrast to this, the term “humanoid robot´´ does not in particular
refer to the functionality of a robot, but rather to its outer design and form. Compared
to other more specialized robot types, humanoid robots resemble humans not only
in their form, but also in the fact that they can cope with a variety of different tasks.
Their human-like form also makes them very suitable to work in environments which
are designed for humans like households. Another important factor is that a more
humanoid form and behavior is more appealing especially to non-expert users in
everyday situations, i.e. a humanoid robot will probably be more accepted as a robot
assistant e.g. in a household. On the other hand, users expect much more in terms of
communication, behavior and cognitive skills of these robots. It is important to note,
though, that there is not one established meaning of the term “humanoid robot´´
in research – as shown in this section, there currently exists a wide variety of so-
called humanoid robots with very different shapes, sizes, sensors and actors. What is
common among them is the goal of creating more and more human-like robots.
Having this goal in mind, humanoid robotics is a fast developing area of research
world-wide, and currently represents one of the main challenges for many robotics
researchers. This trend is particularly evident in Japan. Many humanoid robotic plat-
forms have been developed there in the last few years, most of them with an approach
very much focussed on mechanical or, more generally, on hardware problems, in the
attempt to replicate as closely as possible the appearance and the motions of human
beings.
Impressive results have been achieved by the Waseda University of Tokyo since 1973
with the WABOT system and its later version WASUBOT, which was shown
playing the piano with the NHK Orchestra at a public concert in 1985. The WASUBOT
system could read sheet music with a vision system (Fig. 7.1) and could use
its feet and five-finger hands for playing the piano. The evolution of this research line at
the Waseda University has led to further humanoid robots, such as: Wabian (1997),
able to walk on two legs, to dance and to carry objects; Hadaly-2 (1997), focused
on human-robot interaction through voice and gesture communication; and Wendy
(1999) ( [29]).
With a great financial effort over more than ten years, Honda was able to transfer
the results of such research projects into a family of humanoid prototypes. Honda’s
main focus has been on developing a stable and robust mechatronic structure for their
humanoid robots (cf. Fig. 7.2). The result is the Honda Humanoid Robot, whose cur-
rent version is called ASIMO. It presents very advanced walking abilities and fully
functional arms and hands with very limited capabilities of hand-eye coordination
for grasping and manipulation ( [31]).

Fig. 7.1. Waseda humanoid robots: a) Wabot-2, the pioneer piano player, b) Hadaly-2, c)
walking Wabian and d) Wendy.

Even though the robots presented in Fig. 7.2 seem to have about the same size, the
real size was consistently reduced during the evo-
lution: height went down from the P3’s 160cm to the ASIMO’s 120cm, and weight
was reduced from the P3’s 130kg to ASIMO’s 43kg.

Fig. 7.2. The Honda humanoid robots: a) P2, b) P3, and c) the current version, ASIMO.

As Honda’s research interest concentrated mostly on mechatronic aspects, the


combination of industrial and academic research by Honda and the Japanese National
Institute of Advanced Industrial Science and Technology (AIST) led to first
applications like the one shown in Fig. 7.3 a), where HRP-1S is operating a backhoe.
The engagement of Honda in this research area motivated several other Japanese
companies to start their own research projects in this field. In 2002, Kawada Indus-
tries introduced the humanoid robot HRP-2 ( [39]). With a size of 154 cm and 58 kg,
HRP-2 has about the dimensions of a human. It is the first humanoid robot which is
able to lie down and stand up without any help (as shown in Fig. 7.3).
By reuse of components, Toyota managed to set up a family of humanoid robots
with different abilities in a relatively short time. In 2004, the family of robot partners
consisted of four models, two of which are shown in Fig. 7.4. The left one is a
walking android which is intended to assist the elderly. It is 1.20 m tall and weighs 35 kg.

Fig. 7.3. a) HRP-1S operating a backhoe, b) The humanoid robot HRP-2 is able to lie down
and stand up without any help.

The right robot, 1 m tall and weighing 35 kg, is supposed to assist humans in
manufacturing processes ( [63]).

Fig. 7.4. Two of Toyota’s humanoid robot partners: a) a walking robot for assisting elderly
people, and b) a robot for manufacturing purposes.

The Japanese research activities which led to these humanoid robot systems con-
centrate especially on the mechanical, kinematic, dynamic, electronic, and control
problems, with minor concern for the robots’ behavior, their application perspectives,
and their acceptability. Most progress in humanoid robotics in Japan has been
accomplished without focusing on specific applications. In Japan, the long-term perspective
of humanoid robots concerns all those roles in society in which humans
can benefit by being replaced by robots. This includes some factory jobs, emergency
actions in hazardous environments, and other service tasks. Amongst them, in a
society whose average age is increasing fast and steadily, personal assistance to humans
is felt to be one of the most critical tasks.
In the past few years, Korea has developed into a second center for humanoid
robotics research on the Asian continent. Having carried out several projects on walk-
ing machines and partly humanoid systems in the past, the Korea Advanced Institute
of Science and Technology (KAIST) announced a new humanoid system which they
call HUBO in 2005 (Fig. 7.5). HUBO has 41 DOF, stands 1.25m tall and weighs
about 55 kg. It can walk, talk and understand speech. As a second humanoid project
the robot NBH-1 (Smart-Bot, Fig. 7.5 b)) is being developed in Seoul. The research group
claims their humanoid to be the first network-based humanoid which will be able to
think like a human and learn like a human. NBH-1 stands 1.5m tall, weighs about 67
kg, and is also able to walk, talk and understand speech.

Fig. 7.5. Korean humanoid robot projects: a) HUBO, b) Smart-Bot.

In the USA, the research on humanoid robotics received its major impulse
through the studies related to Artificial Intelligence, mainly through the approach
proposed by Rodney Brooks, who identified the need for a physical human-like structure
as a prerequisite for achieving human-like intelligence in machines ( [12]). Brooks’ group
at the AI lab of the MIT is developing human-like upper bodies which are able to
learn how to interact with the environment and with humans. Their approach is much
more focused on robot behavior, which is built-up by experience in the world. In this
framework, research on humanoids does not focus on any specific application. Nev-
ertheless, it is accompanied by studies on human-robot interaction and sociability,
which aim at favoring the introduction of humanoid robots in the society of Humans.
The probably most famous example of MIT’s research is COG, shown in Fig. 7.6.
Also at the MIT AI Lab, the Kismet robot has been developed as a platform for
investigating human-robot interaction (Fig. 7.6). Kismet is a pet-like head with
vision, audition, speech and eye and neck motion capability. It can therefore perceive
external stimuli, track faces and objects, and express its own feelings accordingly
( [11]).

Fig. 7.6. a) The MIT’s COG humanoid robot and b) Kismet.

Europe is entering the field of humanoid robotics more cautiously, but can rely on
an approach that, based on its particular cultural background, allows considerations
of a very different nature to be combined by integrating multidisciplinary knowledge
from engineering, biology and the humanities. Generally speaking, in Europe, research
on robotic personal assistants has received greater attention, even when anthropomorphic
solutions are not involved. On the other hand, in the European humanoid
robotics research the application as personal assistants has always been much more
clear and explicitly declared. Often, humanoid solutions are answers to the problem
of developing personal robots able to operate in human environments and to interact
with human beings. Personal assistance or, more in general, helpful services are the
European key to introduce robots in the society and, actually, research and industrial
activities on robotic assistants and tools (not necessarily humanoids) have received
a larger support than research on basic humanoid robotics. While robotic solutions
for rehabilitation and personal care are now at a more advanced stage with respect
to their market opportunities, humanoid projects are currently being carried out by
several European Universities.
Some joint European Projects, like the Brite-Euram Syneragh and the IST-FET
Paloma, are implementing biologically-inspired sensory-motor coordination models
onto humanoid robotic systems for manipulation. In Italy, at the University of Gen-
ova, the Baby-Bot robot is being developed for studying the evolution of sensory-
motor coordination as it happens in human babies (cf. Fig. 7.7). Starting development
in 2001, a cooperation between the Russian company New Era and the St.
Petersburg Polytechnic University led to the announcement of two humanoid systems
called ARNE (male) and ARNEA (female) in 2003. Both robots have 28 DOF,
stand about 1.23m and weigh 61 kg (Fig. 7.7), i.e. their dimensions resemble those of
Asimo. Like Asimo, they can walk on their own, avoid obstacles, distinguish and
remember objects and colors, recognize 40 separate commands, and can speak in
a synthesized voice. The androids can run on batteries for 1 hour, compared to 1.5
hours for the Asimo. Located in Sofia, Bulgaria, the company Kibertron Inc. has
a full scale humanoid project called Kibertron. Their humanoid is 1.75m tall and
weighs 90 kg. Kibertron has 82 DOF (Fig. 7.7). Each robot hand has 20 DOF and each
arm has another 8 DOF ( [45]).

Fig. 7.7. a) Baby-Bot, b) ARNE, and c) Kibertron.

Compared to the majority of electric-motor-driven robot systems, the German
company Festo is following a different approach. Their robot called Tron X (cf. Fig. 7.8
a)) is driven by servo-pneumatic muscles and uses over 200 controllers to emulate
human-like muscle movement ( [23]).
ARMAR, developed at Karlsruhe University, Germany, is an autonomous mobile
humanoid robot for supporting people in their daily life as personal or assistance
robot (Fig. 7.8). Currently, two anthropomorphic arms have been constructed and
mounted on a mobile base with a flexible torso and studies on manipulation based on
human arm movements are carried out ( [7,8]). The robot system is able to detect and
track the user by vision and acoustics, talk and understand speech. Simple objects can
be manipulated.
Thus, European humanoid robot systems in general are being developed in view
of their integration in our society and they are very likely to be employed in assis-
tance activities, meeting the need for robotic assistants already strongly pursued in
Europe.

Fig. 7.8. Humanoid robot projects in Germany: a) Festo’s Tron X and b) ARMAR by the
University of Karlsruhe.

7.2 Interaction with Robot Assistants

Robot assistants are expected to be deployed in a vast field of applications in everyday
environments. To meet the consumer’s requirements, they should be helpful,
simple and easy to instruct. Household environments especially are tailored to an
individual’s needs and taste. So, in each household a different set of facilities, fur-
niture and tools is used. Also, the wide variety of tasks the robot encounters cannot
be foreseen in advance. Ready-made programs will not be able to cope with such
environments. So, flexibility will surely be the most important requirement for the
robot’s design. It must be able to navigate through a changing environment, to adapt
its recognition abilities to a particular scenery and to manipulate a wide range of ob-
jects. Furthermore, it is extremely important for the user to easily adapt the system
to his needs, i.e. to teach the system what to do and how to do it.
The basis of intelligent robots acting in close cooperation with humans and being
fully accepted by humans is the ability of the robot systems to understand and to
communicate. This process of interaction, which should be intuitive for humans, can
be structured along the role of the robot in:
• Passive interaction or Observation. The pure observation of humans together
with the understanding of seen actions is an important characteristic of robot
assistants, and it enables them to learn new skills in an unsupervised manner and
to decide by prediction when and how to become active.
• Active interaction or Multimodal Dialog. Performing a dialog using speech,
gesture and physical contact denotes a mechanism of controlling robots in an
intuitive way. The control process can either take place by activity selection and
parameterization or by selecting knowledge to be transferred to the robot system.
Regardless of the type and role of interaction with robot assistants, one nec-
essary capability is to understand and to interpret the given situation. In order to
achieve this, robots must be able to recognize and to interpret human actions and
spatio-temporal situations. Consequently, the cognitive capabilities of robot systems
in terms of understanding human activities are crucial. In cognitive psychology, hu-
man activity is characterized by three features [3]:
Direction: Human activity is purposeful and directed to a specific goal situation.

Decomposition: The goal to be reached is being decomposed into subgoals.


Operator selection: There are known operators that may be applied in order to
reach a subgoal. The operator concept designates an action that directly real-
izes such a subgoal. The solution of the overall problem can be represented as a
sequence of such operators.
It is important to know that cognition is mainly taken to be directed as well.
Furthermore, humans tend to perceive activity as a clearly separated sequence of
elementary actions. For transferring this concept to robots, the set of elementary
actions must be identified, and, furthermore, a classification taxonomy for the inter-
pretation must be set up. Concerning the observation and the recognition process, the
most useful information for interpreting a sequence of actions is to be found in the
transition from one elementary action to another [52].

7.2.1 Classification of Human-Robot Interaction Activities

The set of supported elementary actions should be derived from human interaction
mechanisms. Depending on the goal of the operator’s demonstration, a categorization
into three classes seems to be appropriate [20]:
• Performative actions: One human shows a task solution to somebody else. The
interaction partner observes the activity as a teaching demonstration. The ele-
mentary actions change the environment. Such demonstrations are adapted to
the particular situation in order to be captured in an appropriate way by the ob-
server [36].
• Commenting actions: Humans refer to objects and processes by their name, they
label and qualify them. Primarily, this type of action serves to reduce complexity
and to aid the interpretation process after and during execution of performative
actions.
• Commanding actions: Giving orders falls into the last category. This could, e.g.,
be commands to move, stop, hand something over, or even complex sequences
of single commands that directly address robot activity.
The identified categories may be further differentiated by the modality of their application (see figure 7.9). For teaching an action sequence to a robot system, the most important type of action is the performative one. But performative actions are usually accompanied by comments containing additional information about the performed task. So, for knowledge acquisition, it is very important to include this type of action in a robot assistant system. Intuitive handling of robot assistants through instruction is based on commanding actions, which trigger all function modes of the system.

7.2.2 Performative Actions

Performed actions are categorized by action type (e.g. "change position" or "move object"), and, closely related to this first categorization, also by the involved actuators or subsystems. Under this global viewpoint, user and robot are

[Figure 7.9 diagram: performative actions (manipulation with one- or two-handed, static or dynamic grasps and moves with or without objects, tools or contact; navigation and transport), commenting actions (speech utterances and iconic, metaphoric, deictic or emblematic gestures), and commanding actions (including physical guidance).]
Fig. 7.9. Hierarchy and overview of possible demonstration actions. Most of them are taken
from everyday tasks (see lower right images)

part of the environment. Manipulation, navigation and the utterance of verbal perfor-
mative sentences are different forms of performative actions.
Manipulation: Manipulation is certainly one of the main functions of a robot
assistant: tasks such as fetch and carry and handling tools or devices are vital
parts of most applications in household- or office-like environments. Grasps and
movements are relevant for interpretation and representation of actions of the
type ”manipulation”.
Grasps: Established classification schemes can be drawn upon for grasps that involve one hand. Here, an underlying distinction is made between grasps dur-
ing which finger configuration does not change (”static grasps”) and grasps
that require such configuration changes (”dynamic grasps”). While for sta-
tic grasps exhaustive taxonomies exist based on finger configurations and
the geometrical structure of the carried object [14], dynamic grasps may be
categorized by movements of manipulated objects around the local hand co-
ordinate system [71]. Recognizing grasps performed with two hands additionally has to take into account synchronicity and parallelism beyond the requirements of single-handed grasp recognition.
Movement: Here, the movement of the extremities and the movement of objects have to be distinguished. The first may be partitioned further into movements that require a specific goal pose and movements where position changes underlie certain conditions (e.g. force/torque, visibility or collision). On the other hand, the transfer of objects can be carried out with or without contact. It is very useful to check whether the object in the hand serves as a tool. If it does, reasoning about the operator's goal is easier (e.g. tool type screwdriver → operator turns a screw upwards or downwards). Today's programming systems support very limited movement patterns only. Transportation movements

have been investigated in block worlds [33, 37, 41, 54, 64] or in simple as-
sembly scenarios [9] and dedicated observation environments [10, 27, 49, 61]. In contrast to grasp analysis, movement analysis is addressed quite differently by different researchers. The representation of movement types varies strongly with respect to the action representation as well as to the goal situation.
Navigation: In contrast to object manipulation, navigation means the movement
of the interaction partner himself. This includes position changes with a certain
destination in order to transport objects and movement strategies that may serve
for exploration [13, 55, 56].
Verbal performative utterance: In language theory, there are actions that are realized by uttering so-called performative sentences (such as "You're welcome"). Performative utterances are currently not relevant in the context of robot programming. We treat this as a special case and will not discuss it further.

7.2.3 Commanding and Commenting Actions

Commanding actions are used for instructing a robot either to fulfill a certain ac-
tion or to initiate adequate behaviors. Typically these kinds of actions imply a robot
activity in terms of manipulation, locomotion or communication. In contrast to com-
mands, commenting actions of humans usually result in a passive attitude of the robot
system, since they do not require any activity. Nevertheless, an intelligently behav-
ing robot assistant might decide to initiate a dialog or interaction after perceiving a
comment action.
For robot assistants in human environments, the most intuitive way of transferring commands or actions is via the channels which are also used for human-human communication: speech, gestures or physical guidance. Speech and gesture allow a symbolic, high-level communication. But for explaining manipulation tasks, which humans often perform mechanically, physical guidance seems more appropriate as a communication channel.
As the complexity of grasping, navigation, and interaction shows, observation of human actions requires extensive and dedicated sensor systems. Diverse information is vital for the analysis of an applied operator: a grasp type may have various rotation axes, a certain profile of force/torque exerted on the grasped object, special grasp points where the object is touched, etc.

7.3 Learning and Teaching Robot Assistants


Besides interaction, learning is one of the most important properties of robot as-
sistants, which are supposed to act in close cooperation with humans. Here the learn-
ing process must enable robots to cope with everyday situations and natural (house-
hold or office) environments in a way intuitive for humans. This means that through

the learning process on the one hand new skills have to be acquired and on the other
hand the knowledge of the system has to be adapted to new contexts and situations.
Robot assistants coping with these demands must be equipped with innate skills and should learn lifelong from their users. These users are typically not robotics experts and require a system that adapts itself to their individual needs. For meeting these requirements, a new paradigm for teaching robots is defined to solve the problems of skill and task transfer from human (user) to robot, as a special way of knowledge transfer between man and machine.
The only approach that satisfies this condition consists of programming systems
that automatically acquire relevant information for task execution by observation and
multimodal interaction. Obviously, systems providing such a functionality require:
1. powerful sensor systems to gather as much information as possible by observing
human behavior or processing explicit instructions like commands or comments,
2. a methodology to transform observed information for a specific task to a robot-
independent and flexible knowledge structure, and
3. actuator systems using this knowledge structure to generate actions that will
solve the acquired task in a specific target environment.

7.3.1 Classification of Learning through Demonstration

A classification of methods for teaching robot assistants in an intuitive way might


be done according to the type of knowledge which is supposed to be transferred be-
tween human and robot. Concerning performative actions, especially manipulation
and navigation tasks are addressed. Here the knowledge needed for characterizing
the actions can be categorized by the included symbolic and semantic information.
The resulting classification, shown in figure 7.10, contains three classes of learning
or teaching methods, namely: skill learning, task learning and interactive learning.

7.3.2 Skill Learning

For the term "skill", different definitions exist in the literature. In this chapter, it will be
used as follows: A skill denotes an action (i.e. manipulation or navigation), which
contains a close sensor-actuator coupling. In terms of representation, it denotes the
smallest symbolic entity which is used for describing a task.
Examples of skills are grasps or moves of manipulated objects under constraints like "move along a table surface". The classical "peg in hole" problem can either be modeled as a skill, for example using a force controller which minimizes a cost function, or as a task consisting of a sequence of skills like "find the hole", "set perpendicular position" and "insert".
Due to the close coupling between sensors and actuators, the process of skill learning is mainly done by direct programming using the robot system and, if necessary, some special control devices. Skill learning means finding the transfer function R which satisfies the equation U(t) = R(X(t), Z(t)). The model of the skill transfer process is visualized in Fig. 7.11.

Fig. 7.10. Classes of skill transfer through interaction.

Fig. 7.11. Process of skill transfer.

For skill acquisition, statistical methods like neural networks or Hidden Markov Models are often used in order to generalize over a set of skill examples in terms of sensory input. Given a model of a skill, for example obtained through demonstration, reinforcement learning techniques can be applied to refine the model. Here a lot of background knowledge, in the form of an evaluation function which guides the optimization process, has to be included in the system.
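As a minimal illustration of this idea (not the implementation of the systems described here), the following sketch fits a simple linear stand-in for the transfer function R from recorded demonstration data. It assumes that X(t) collects sensor readings, Z(t) internal states and U(t) the corresponding actuator commands; in practice a neural network or HMM would take the place of the linear model, and all array names are illustrative.

```python
import numpy as np

def fit_linear_skill(X, Z, U):
    """Fit U(t) ~ R(X(t), Z(t)) with a linear model U = W @ [X; Z; 1].

    X: (T, dx) sensor readings, Z: (T, dz) internal states,
    U: (T, du) recorded actuator commands from the demonstration.
    Returns W of shape (du, dx + dz + 1).
    """
    T = X.shape[0]
    features = np.hstack([X, Z, np.ones((T, 1))])        # constant bias term
    # Least-squares solution of features @ W.T = U
    W_T, *_ = np.linalg.lstsq(features, U, rcond=None)
    return W_T.T

def apply_skill(W, x_t, z_t):
    """Reproduce an actuator command for the current sensor/state input."""
    return W @ np.concatenate([x_t, z_t, [1.0]])

# Usage with synthetic data as a stand-in for recorded skill examples:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
Z = rng.normal(size=(200, 2))
W_true = rng.normal(size=(3, 9))
U = np.hstack([X, Z, np.ones((200, 1))]) @ W_true.T
W = fit_linear_skill(X, Z, U)
print(apply_skill(W, X[0], Z[0]))
```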
However, teaching skills to robot assistants is time-intensive and requires a lot of knowledge about the system. Therefore, for robot assistants whose users are not experts, it seems more feasible that the majority of the needed skills are pre-learned. The system would have a set of innate skills and would only have to learn a few new skills during its lifetime.

7.3.3 Task Learning

From the semantic point of view, a task denotes a more complex action type than a
skill. From the representation and model side a task is usually seen as a sequence
of skills, and that should serve as a definition for the following sections. In terms of
knowledge representation a task denotes a more abstract view of actions, where the
control mechanisms for actuators and sensors are enclosed in modules represented
by symbols.

Learning of tasks differs from skill learning since, apart from the control strategies of sensors and actuators, the goals and subgoals of actions have to be considered. As in this case the action depends on the situation as well as on the related action domain, the learning methods used rely on a vast knowledge base. Furthermore, analytical learning seems more appropriate for coping with the symbolic environment descriptions used to represent states and effects of skills within a task.
For modeling a task T as a sequence of skills (or primitives) S_i, the context, including environmental information or constraints E_i and internal states of the robot system Z_i, is introduced in the form of pre- and post-conditions of the skills. Formally, a task can be described as:

T = {(Pre-Cond(E_i, Z_i))  S_i  (Effect(E_i, Z_i))}                    (7.1)

Effect = Post-Cond \ Pre-Cond

where Effect() denotes the effect of a skill on the environment and the internal robot states. Consequently, the goals and subgoals of a task are represented by the effect of the task or of its subsequences.
Noticeable in the presented task model is the fact that tasks are state-dependent but, unlike skills, not time-dependent, and therefore enable a higher generalization in terms of learning. Although the sensory-motor skills in this representation are encapsulated in symbols, learning and planning task execution for robot assistants becomes very complex, because state descriptions of environments (including humans) and internal states are very complex and decision making is not always well-defined.
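The task model of equation (7.1) can be made concrete with a small data structure. The sketch below is only an illustration under the stated definitions (skills with symbolic pre- and post-conditions, the effect being their set difference); the predicate names and class names are hypothetical.

```python
from dataclasses import dataclass, field

State = frozenset      # a symbolic state: a set of predicates, e.g. {"cup_on_table"}

@dataclass
class SkillStep:
    name: str
    pre_cond: State    # Pre-Cond(E_i, Z_i): predicates required before the skill
    post_cond: State   # Post-Cond(E_i, Z_i): predicates holding afterwards

    @property
    def effect(self) -> State:
        # Effect = Post-Cond \ Pre-Cond  (cf. Eq. 7.1)
        return self.post_cond - self.pre_cond

@dataclass
class Task:
    steps: list = field(default_factory=list)

    def is_applicable(self, world: State) -> bool:
        """Check whether the first skill's pre-conditions hold in the current state."""
        return bool(self.steps) and self.steps[0].pre_cond <= world

    def goal_effect(self) -> State:
        """The task goal, represented by the union of all skill effects."""
        effects = frozenset()
        for step in self.steps:
            effects |= step.effect
        return effects

# Hypothetical example: a simplified "pour water" task
pick = SkillStep("pick_bottle", frozenset({"bottle_on_table", "hand_free"}),
                 frozenset({"bottle_in_hand"}))
pour = SkillStep("pour", frozenset({"bottle_in_hand", "glass_on_table"}),
                 frozenset({"bottle_in_hand", "glass_filled"}))
task = Task([pick, pour])
print(task.is_applicable(frozenset({"bottle_on_table", "hand_free", "glass_on_table"})))
print(task.goal_effect())
```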

7.3.4 Interactive Learning

The process of interactive learning has the highest abstraction level because of the
information exchange through speech and gestures. Through these interaction chan-
nels various situation-dependent symbols are used to refer to objects in the environ-
ment, or to select actions to be performed. Information like control strategies on the sensory-motor level can no longer be transferred; instead, global parameters can be transferred, which are used for triggering and adapting existing skills and tasks.
The difference between task learning and interactive learning consists on the one hand in the complexity of the actions that can be learned and on the other hand in the use of dialog, with explicit system requests, during the learning process. Besides communication issues, one of the main problems in developing robot assistants with such capabilities is the understanding and handling of "common sense" knowledge. In this context, the problem of establishing and communicating a shared view of situations between robot and human is still a huge research area.
For learning, however, a state-driven representation of actions, which can either be skills, tasks or behaviors, is required, as well as a powerful decision-making module which is able to disambiguate gathered information by initiating adequate dialogs. This process relies on a complex knowledge base including skill and task models, environment descriptions, and human-robot interaction mechanisms. Learning in this

context denotes finding and memorizing sequences of actions or important parame-


ters for action execution or interaction. Associating semantic information with objects
or actions for expanding space and action concepts is a further goal of interactive
learning.

7.3.5 Programming Robots via Observation: An Overview

Different programming systems and approaches to teach robots based on human


demonstrations have been proposed during the past years. Many of them address
special problems or a special subset of objects only. An overview and classification
of the approaches can be found in [16, 58]. Learning of action sequences or oper-
ation plans is an abstract problem, which supposes a modeling of cognitive skills.
When learning complex action sequences, basic manipulation skills or controlling
techniques become less relevant. In fact, the aim is to generate an abstract descrip-
tion of the demonstration, reflecting the user´s intention and modeling the problem
solution as optimally as possible. Robot-independence is an important issue because
this would allow for exchange of robot programs between robots with different kine-
matics. Likewise, a given problem has to be suitably generalized, for example distin-
guishing parameters specific to the particular demonstration from parameters specific to the problem concept. Both generalizations demand a certain insight into the en-
vironment and the user’s performance.
The basis for mapping a demonstration to a robot system is the task representation and the task analysis. Often, a demonstration is analyzed by observing the changes in the scene. These changes can be described using relational expressions or contact relations [40, 47, 53].
The problem of learning to map action sequences to dissimilar agents has been investigated in [2]. Here, the agent learns by imitation an explicit correspondence between its own possible actions and the actions performed by a demonstrator agent. To learn correct correspondences, in general several demonstrations of the same problem are necessary.
For generalizing a single demonstration, mainly explanation-based methods are used [24, 51]. They allow for an adequate generalization from only one example (one-shot learning). Approaches based on one-shot learning techniques are the only ones feasible for end users, since giving many similar demonstrations is an annoying task.
A physical demonstration in the real world can be too time-consuming and may
not be mandatory. Some researchers rely on virtual or iconic demonstrations [5].
The advantage of performances in a virtual world is that here manipulations can be
executed that would not be feasible in the real world because of physical constraints.
Furthermore, inaccuracies of sensor-based approaches do not have to be taken into
account. Thus, object poses or trajectories [41, 62, 64] and/or object contexts [26, 30, 53, 59] can be generalized in an appropriate way. A physical demonstration, however,
is more convenient for the human operator. Iconic programming starts at an even
more abstract level. Here, a user may fall back on a pool of existing or acquired
basic skills. These can be retrieved through specific cues (like icons) and embedded

in an action sequence. The result of such a demonstration of operator sequences can


then be abstracted, generalized and summarized as new knowledge for the system.
An example can be found in the skill-oriented programming system SKORP [5] that
helps to set up macro operators interactively from actuator or cognitive elementary
operators.
Besides the derivation of action sequences from a user demonstration, direct cooperation with users has been investigated. Here, user and robot reside in a common
work cell. The user may direct the robot on the basis of a common vocabulary like
speech or gestures [60, 69, 70]. But this approach allows for rudimentary teaching
procedures only.
A more general approach to programming by demonstration together with the
processing phases that have to be realized when setting up such a system is outlined
in [19]. A discussion of sensor employment and sensor fusion for action observation
can be found in [21, 57, 71]. Here, basic concepts for the classification of dynamic
grasps are presented as well.
To summarize, the most crucial task concerning the programming process is the
recognition of the human's actions and the recognition of effects in the environment. Assuming that background knowledge about tasks and actions is limited, valuable information must be extracted by closely observing the user during demonstrations. For real-world problems, simplified demonstration environments will not lead to robot instructions that are applicable in general. Therefore, we propose to rely on powerful sensor systems that allow human actions to be tracked in unstructured environments as well. Unfortunately, the on-board sensors of today's autonomous robot systems will not provide enough data to obtain the relevant information in the general case. So, it is necessary to integrate additional sensor
systems in the environment to enlarge the robot’s recognition capabilities as well as
to undertake a fundamental inspection of actions that can be recognized in everyday
environments.

7.4 Interactive Task Learning from Human Demonstration


The approach of programming by demonstration (PbD) emphasizes the interpreta-
tion of what has been done by the human demonstrator. Only a correct interpretation
enables a system to reuse formerly observed action sequences in a different environ-
ment. Actions that might be performed during the demonstration can be classified as
explained above, cf. figure 7.9. The main distinction here is drawn by their objectives:
when observing a human demonstrator, performative, commenting and commanding
actions are the most relevant. They are performed in order to show something by
doing it oneself (performative actions), in order to annotate or comment on some-
thing (commenting actions), and in order to make the observer act (commanding
actions). These actions may be further subclassified as explained before.
Some of these actions may be interpreted immediately (e.g. the spoken command "stop!"). But mostly, the interpretation depends on environment states, robot and user positions and so on. This makes it possible to interpret actions that serve for man-machine

interaction. It is even harder, though, to analyze manipulation sequences. Here, the


interpretation has to take into account previously performed actions as well. Thus,
the process of learning complex problem solving knowledge consists of several dif-
ferent phases.

7.4.1 Classification of Task Learning

Systems programmed by user demonstrations are not restricted to a specific area of


robotics. Even for construction of graphical user interfaces and work-flow manage-
ment systems such approaches have been used [15]. In any case the system learns from
examples provided by a human user. Within the field of robotics, and especially for
manipulation tasks, PbD systems can be classified according to several classification
features (Fig. 7.12), in particular with regard to task complexity, class of demon-
stration, internal representation and mapping strategy. The concrete system strongly
depends on the complexity of the task that is being learned.

Fig. 7.12. Classification features for Programming by Demonstration (PbD) systems.

• The first classification feature is the abstraction level. Here, the learning goal can be divided into learning of low-level (elementary) skills or high-level skills
(complex task knowledge).
Elementary skills represent human reflexes or actions applied unconsciously. For
them, a direct mapping between the actuator’s actions and the corresponding
sensor input is learned. In most cases neural networks are trained for a certain
sensor/actuator combination like grasping objects or peg-in-hole tasks [6,38,46].
Obviously, those systems must be trained explicitly and can not easily be reused
for a different sensor/actuator combination. Another drawback is the need for a
high number of demonstration examples, since otherwise the system might only
learn a specific sensor/actuator configuration.

Acquiring complex task knowledge from user demonstrations, on the other hand, requires a large amount of background knowledge and sometimes communication with the user. Most applications deal with assembly tasks, see for example [35, 43, 59].
• Another important classification feature is the methodology of giving examples to
the system. Within robotic applications, active, passive and implicit examples can
be provided for the learning process. Following the terminology, active examples
indicate those demonstrations where the user performs the task by himself, while
the system uses sensors like data-gloves, cameras and haptic devices for tracking
the environment and/or the user. Obviously, powerful sensor systems are required
to gather as much data as available [25, 32, 42, 44, 65, 66]. Normally, the process
of finding relevant actions and goals is a demanding challenge for those systems.
Passive examples can be used if the robot can be controlled by an external device such as a space mouse, a master-slave system or a graphical interface. During the demonstration process the robot monitors its internal or external sensors and tries to directly map the observed actions to a well-known action or goal [38].
Obviously, this method requires less intelligence, since perception is restricted
to well known sensor sources and actions for the robot system are directly avail-
able. However, the main disadvantage is given by the fact that the demonstration
example is restricted to the target system.
If implicit examples are given, the user directly specifies system goals to be
achieved by selecting a set of iconic commands. Although this methodology is
the easiest one to interpret, since the mapping to well known symbols is implic-
itly given, it both restricts the user to a certain set of actions and provides the
most uncomfortable way of giving examples.
• A third aspect is the system’s internal representation of the given examples. To-
day’s systems regard effects in the environment, trajectories, operations and ob-
ject positions. While observing effects in the environment often requires highly
developed cognitive components, observing trajectories of the user’s hand and
fingers can be considered comparably simple. However, recording only trajecto-
ries is in most cases not enough for real world applications, since all informa-
tion is restricted to only one environment. Tracking positions of objects, either
statically at certain phases of the demonstration or dynamically by extracting
kinematic models, abstracts on the one hand from the demonstration devices, but
implies an execution environment similar to the demonstration environment.
If the user demonstration is transformed into a sequence of pre-defined actions,
it is called operation-based representation. If the target system supports the same
set of actions, execution becomes very easy. However, this approach only works
in very structured environments where a small number of actions is enough to
solve most of the occurring tasks.
From our point of view a hybrid representation, where knowledge about the user
demonstration is provided not only on the level of representation class, but also
on different abstraction levels is the most promising approach for obtaining the
best results.

• Finally, the representation of the user demonstration must somehow be mapped


onto a target system. Therefore, the systems can be classified with regard to the mapping strategy. If the user demonstration is represented by effects in the
environment or goal states given by object positions, planning the robot move-
ments is straightforward. However, this requires a very intelligent planning sys-
tem and most of the data that were acquired in the demonstration, like trajectories
or grasps, is lost. Therefore, for learning new tasks this approach has only limited
applicability.
Instead of planning actions based on the environment state and current goal, the
observed trajectories can be mapped directly to robot trajectories, using a fixed
or learned transformation model [42, 48]. The problem here is that the execution environment must be similar to the demonstration environment. Thus, this mapping method is restricted to simple teach-in applications.

7.4.2 The PbD Process

In this section, the process of programming a robot by demonstration is presented in


more detail. To this end, the steps of the PbD process are discussed in their order
in such a programming cycle. As summarized in figure 7.13, the whole program-
ming process starts with a user demonstration of a specific task which is observed
by a sensor system. The following phases are the basic components for successfully
making use of data from a human demonstration:
1. Sensor systems are used for observing the user's movements and actions. Also
important changes like object positions and constraints in the environment can
be detected. The sensor efficiency might be improved by allowing the user to
comment on his actions.
2. During the next phase, relevant operations or environment states are extracted from the sensor data. This process is called segmentation. Segmentation can be performed online during the observation process or offline based on recorded data. Here, the system's performance can be improved significantly by asking
the user whether decisions made by the system are right or wrong. This is also
important to reduce sensor noise.
3. Within the interpretation phase the segmented user demonstration is mapped
onto a sequence of symbols. These symbols contain information about the ac-
tion (e.g. type of grasp or trajectory) as well as important data from the sensor
readings such as forces, contact points, etc.
4. Abstraction from the given demonstration is vital for representing the task solution as generally as possible. Generalization of the obtained operators includes further advice by the user. Spontaneous and non-goal-oriented motions may be identified and filtered out in this phase. It is important to store the task knowledge in a form that makes it reusable even if execution conditions vary slightly from the demonstration conditions.

5. The next necessary phase is the mapping of the internal knowledge representa-
tion to the target system. Here, the task solution knowledge generated in the pre-
vious phase serves as input. Additional background knowledge about the kine-
matic structure of the target system is required. Within this phase as much in-
formation as possible available from the user demonstration should be used to
allow robot programming for new tasks.
6. In the simulation phase the generated robot program is tested for its applica-
bility in the execution environment. It is also desirable to let the user confirm
correctness in this phase to avoid dangerous situations in the execution phase.
7. During execution in the real world success and failure can be used to modify the
current mapping strategy (e.g. by selecting higher grasping forces when a picked
workpiece slips out of the robot's gripper).

Fig. 7.13. The process of Programming by Demonstration (PbD).

The overall process with its identified phases is suitable for manipulation tasks
in general. However, in practice the working components are restricted to a certain
domain to reduce environment and action space.
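To make the ordering of these phases explicit, the following sketch chains them as a simple pipeline. Every function body is a placeholder standing in for the real observation, segmentation, interpretation, generalization, mapping, simulation and execution components; names and data formats are purely illustrative.

```python
# Each phase consumes the output of the previous one. The bodies below are
# placeholders; a real system would plug in its sensor drivers, segmenters,
# classifiers, planners and simulators here.

def observe(demo_source) -> list:            # phase 1: raw sensor streams
    return list(demo_source)

def segment(observations: list) -> list:     # phase 2: cut into candidate actions
    return [obs for obs in observations if obs is not None]

def interpret(segments: list) -> list:       # phase 3: map segments to symbols
    return [("elementary_op", seg) for seg in segments]

def generalize(symbols: list) -> dict:       # phase 4: abstract to a macro operator
    return {"macro_op": symbols, "variables": []}

def map_to_robot(macro: dict, robot: str) -> dict:   # phase 5: target-specific program
    return {"robot": robot, "program": macro["macro_op"]}

def simulate(program: dict) -> bool:         # phase 6: validity check before execution
    return len(program["program"]) > 0

def execute(program: dict) -> None:          # phase 7: run and collect feedback
    print(f"executing {len(program['program'])} operators on {program['robot']}")

def pbd_cycle(demo_source, robot: str) -> None:
    program = map_to_robot(generalize(interpret(segment(observe(demo_source)))), robot)
    if simulate(program):                     # only execute validated programs
        execute(program)

pbd_cycle(demo_source=[{"grasp": "power"}, None, {"move": "place"}], robot="ARMAR")
```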

7.4.3 System Structure

Fig. 7.14. The overall system: training center and robot.

In this section, the structure of a concrete PbD system developed at the Uni-
versity of Karlsruhe [17] is presented as an example of necessary components of
a programming by demonstration system. Since visual sensors cannot easily cope
with occlusions and therefore cannot track dexterous manipulation demonstrations, a
fixed robot training center has been set up in this project that integrates various sen-
sors for observing user actions. The system is depicted schematically in figure 7.14.
It uses data gloves, magnetic field tracking sensors supplied by Polhemus, and active
stereo camera heads. Here, manipulation tasks can be taught which the robot is then able to execute in a different environment.
To meet the requirements for each phase of the PbD-process identified in sec-
tion 7.4.2 the software system consists of the components shown in figure 7.15. The
system has four basic processing modules which are connected to a set of databases
and a graphical user interface. The observation and segmentation module is respon-
sible for analysis and pre-segmentation of information channels, connecting the sys-
tem to the sensor data. Its output is a vector describing states in the environment
and user actions. This information is stored in a database including the world model.
The interpretation module operates on the gained observation vectors and associates
sequences of observation vectors to a set of predefined symbols. These parameteriz-
able symbols represent the elementary action set. During this interpretation phase
the symbols are chunked into hierarchical macro operators after replacing specific
task-dependent parameters by variables. The result is stored in a database as gener-
alized execution knowledge. This knowledge is used by the execution module which
uses specific kinematic robot data for processing. It calculates optimized movements
for the target system, taking into account the actual world model. Before sending
the generated program to the target system its validity is tested through simulation.
In case of unforeseen errors, movements of the robot have to be corrected and op-

timized. All four components communicate with the user via a graphical user-
interface. Additional information can be retrieved from the user, and also hypotheses
can be accepted or rejected via this interface.

Fig. 7.15. System structure.

7.4.4 Sensors for Learning through Demonstration


The question of which sensors are useful for learning from human demonstration is
closely related to the goal of the learning process and the background knowledge
of the system. Trying to imitate or learn movements from humans, the exact trajec-
tory of the articulated limbs is needed and consequently motion capturing systems
like the camera-based VICON system [68] can be used. These sensors extract, based
on marker positions, the 6D positions of the limbs, which are further used for cal-
culating the joint angles, or their changes, of the human demonstrator. Another way of measuring human body configuration is the use of exoskeletons, like the one offered by MetaMotion [50]. However, imitation of movements alone is only possible using humanoid or anthropomorphic robots, which have a kinematic structure similar to that of humans.
While observing manipulation tasks, the movements of hands and their pose or
configuration (joint angles) certainly are essential for learning skillful operations.
Due to occlusion constraints, camera-based systems are not adequate for tracking
finger movements. Therefore data gloves like the one offered by Immersion [34] can be used for obtaining hand configurations with acceptable accuracy. The hand position can be gathered by camera systems and/or magnetic trackers.
For identifying the goal of manipulation tasks, the analysis of changes in the ob-
served scene is crucial. For this purpose object tracking is needed and consequently
sensors for object recognition and tracking are used. Nowadays active (stereo) cam-
eras are fast and accurate enough and can be used in combination with real-time

tracking algorithms to observe complex realistic scenes like household or office en-
vironments.

[Figure 7.16 components: two pan- and tiltable camera heads for color and depth image processing, one mounted on a movable platform for an overview perspective; a microphone and loudspeaker for speech interaction; data gloves with attached tactile sensors and magnetic field trackers; processing units; and a magnetic field emitter.]

Fig. 7.16. Training center for observation and learning of manipulation tasks at the University
of Karlsruhe.

Figure 7.16 shows the demonstration area (”Training Center”) for learning and
acquisition of task knowledge built at the University of Karlsruhe (cf. section 7.4.3).
During the demonstration process, the user handles objects in the training center.
This is equipped with the following sensors: two data gloves with magnetic field
based tracking sensors and force sensors which are mounted on the finger tips and the
palm, as well as two active stereo camera heads. Object recognition is done by com-
puter vision approaches using fast view-based approaches described in [18]. From
the data gloves, the system extracts and smoothes finger joint movements and hand
positions in 3D space. To reduce noise in trajectory information, the user’s hand
is additionally observed by the camera system. Both measurements are fused using
confidence factors, see [21]. This information is stored with discrete time stamps in
the world model database.
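A minimal sketch of such a confidence-based fusion step is given below. It is not the fusion method of [21], merely a weighted averaging of two hypothetical 3D hand-position estimates with scalar confidence factors.

```python
import numpy as np

def fuse_positions(p_tracker, c_tracker, p_camera, c_camera):
    """Confidence-weighted fusion of two 3D hand-position estimates.

    p_*: position estimates as length-3 arrays, c_*: scalar confidences in [0, 1].
    Returns the fused position and a combined confidence. This only sketches the
    idea of confidence-based fusion, not the actual system described in [21].
    """
    p_tracker, p_camera = np.asarray(p_tracker), np.asarray(p_camera)
    total = c_tracker + c_camera
    if total == 0.0:                       # neither sensor trusts its measurement
        return None, 0.0
    fused = (c_tracker * p_tracker + c_camera * p_camera) / total
    return fused, max(c_tracker, c_camera)

# Example: magnetic tracker slightly distrusted due to field distortion
pos, conf = fuse_positions([0.52, 0.10, 0.93], 0.6, [0.50, 0.11, 0.95], 0.9)
print(pos, conf)
```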

7.5 Models and Representation of Manipulation Tasks

One of the basic capabilities of robot assistants is the manipulation of the environ-
ment. Under the assumption that robots will share the working or living space with
humans, learning manipulation tasks from scratch would not be adequate for many
reasons: first, in many cases it is not possible for a non-expert user to specify what
and how to learn and secondly, teaching skills is a very time-consuming and maybe
annoying task. Therefore a model-based approach including a general task represen-
tation where a lot of common (or innate) background knowledge is incorporated is
more appropriate.
Modeling the manipulation tasks requires definitions of general structures used
for representing the goal of a task as well as a sequence of actions. Therefore the
next section will present some definitions the following task models will rely on.

7.5.1 Definitions

The term "manipulation" in robotics usually refers to a change of the position of an object through an action of the robot. In this chapter a more general definition of manipulation will be used, which is not oriented towards the goal of the manipulation but rather is related to how a manipulation is performed. The following definition of
”manipulation” will be used: a manipulation (task) is a task, or sequence of actions,
which contains a grasp action. A grasp action denotes picking an object but also
touching or pushing it (like pushing a button).
The class of manipulation tasks can be divided into more specific subclasses in
order to incorporate them into different task models. Considering the complexity of
tasks, the following two subclasses are particularly interesting:
• Fine manipulation: This class contains all manipulations which are executed
by using the fingers to achieve a movement of objects. From the point of view of
robotics, the observation of these movements as well as their reproduction poses a big challenge.
• Dual hand manipulation: All manipulations making use of two arms for
achieving the task goal are included in this class. For modeling this kind of manipulation, the coordination between the hands has to be considered in addition.

7.5.2 Task Classes of Manipulations

Considering the household and office domain under the hypothesis that the user
causes all of the changes in the environment (closed world assumption) through
manipulation actions, a goal-related classification of manipulation tasks can be per-
formed. Here, manipulation actions are viewed in a general sense, including dual hand and fine manipulations. Furthermore, the goal of manipulation tasks is closely
related to the functional view of actions and, besides Cartesian criteria, consequently

incorporates a semantic aspect, which is very important for interaction and commu-
nication with humans.
According to the functional role a manipulation action aims for, the following
classes of manipulation tasks can be distinguished (Figure 7.17):
Transport operations are one of the most widely executed action classes of robots in the role of servants or assistants. Tasks like Pick & Place or Fetch & Carry, denoting the transport of objects, are part of almost all manipulations. The change of
the Cartesian position of the manipulated object serves as a formal distinguishing
criterion of this kind of task. Consequently, for modeling transport actions, the
trajectory of the manipulated object has to be considered and modeled as well.
In terms of teaching transport actions to robots the acquisition and interpretation
of the performed trajectory is therefore crucial.
Device handling. Another class of manipulation tasks deals with changing the
internal state of objects like opening a drawer, pushing a button etc. Actions of
this class are typically applied when using devices, but many other objects also
have internal states that can be changed (e.g. a bottle can be open or closed,
filled or empty etc.). Every task changing an internal state of an object belongs to
this class. In terms of modeling device handling tasks, transition actions leading
to an internal state change have to be modeled. Additionally, the object models
need to integrate an adequate internal state description. Teaching this kind of task
requires observation routines able to detect internal state changes by continuously
tracking relevant parameters.
Tool handling. The most distinctive feature for actions belonging to the class
of tool handling is the interaction modality between two objects, typically a tool
and some workpiece. Hereby the grasped object interacts with the environment
or another grasped object in the case of dual arm manipulations. Like the class of
device handling tasks, this action type does not only include manipulations using
a tool, but rather all actions containing an interaction between objects. Examples
for such tasks are pouring a glass of water, screwing etc. The interaction modal-
ity is related to the functional role of objects used or the correlation between the
roles of all objects included in the manipulation, respectively. The model of tool
handling actions thus should contain a model of the interaction. According to dif-
ferent modalities of interaction, considering contact, movements etc., a diversity
of handling methods has to be modeled. In terms of teaching robots the observa-
tion and interpretation of the actions can be done using parameters corresponding
to the functional role and the movements of the involved objects.
These three classes are obviously not disjoint, since for example a transport action is almost always part of the other two classes. However, assigning actions to these classes eases the teaching and interaction process with robot assistants, since the inherent semantics behind these classes correlates with human descriptions of manipulation tasks.
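As an illustration, the following sketch assigns an observed manipulation to these (overlapping) classes from three hypothetical observation flags; it is only meant to make the classification criteria explicit, not to reproduce an actual recognition module.

```python
def classify_manipulation(object_moved: bool, internal_state_changed: bool,
                          object_interaction: bool) -> list:
    """Assign an observed manipulation to the (non-disjoint) task classes.

    The three flags are hypothetical outputs of the observation system:
    whether the manipulated object changed its Cartesian position, whether an
    internal object state changed (e.g. a drawer was opened), and whether the
    grasped object interacted with another object or the environment.
    """
    classes = []
    if object_moved:
        classes.append("transport")
    if internal_state_changed:
        classes.append("device_handling")
    if object_interaction:
        classes.append("tool_handling")
    return classes or ["unclassified"]

# Pouring a glass of water: the bottle is moved and interacts with the glass
print(classify_manipulation(True, False, True))   # ['transport', 'tool_handling']
```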

[Figure 7.17 panels: 1. transport actions (relative position and trajectory), 2. device handling (internal state, e.g. opening angle), 3. tool handling (object interaction).]

Fig. 7.17. Examples for the manipulation classes distinguished in a PbD system.

7.5.3 Hierarchical Representation of Tasks

Representing manipulation tasks as pure action sequences is neither flexible nor scalable. Therefore a hierarchical representation is introduced in order to generalize an action sequence and to group elementary skills into more complex subtasks. Considering only manipulation tasks, the assumption is that each manipulation consists of a grasp and a release action, as defined before. To cope with the manipulation classes specified above, pushing or touching an object is also interpreted as a grasp. The representation of grasping an object constitutes a "pick" action and consists of three sub-actions: an approach, a grasp type and a dis-approach (Fig. 7.18). Each of these
sub-actions consists of a variable sequence of elementary skills represented by ele-
mentary operators (EO’s) of certain types (i.e. for the approach and dis-approach the

EO’s are ”move” types and the grasp will be of type ”grasp”). A ”place” action is
treated analogously.

Fig. 7.18. Hierarchical model of a pick action.

Between a pick and a place operation, depending on the manipulation class, seve-
ral basic manipulation operations can be placed (Fig. 7.19). A demonstration of the
task ”pouring a glass of water” consists, e.g., of the basic operations: ”pick a bottle”,
"transport the bottle", "pour in", "transport the bottle" and "place the bottle". A
sequence of basic manipulation operations starting with a pick and ending with a
place is abstracted to a manipulation segment.

Fig. 7.19. Representation of a manipulation segment.

The level of manipulation segments denotes a new abstraction level on closed


subtasks of manipulation. In this context ”closed” means that the end of a manipu-
lation segment ensures that both hands are free and that the environmental state is
stable. Furthermore, the synchronization of the EO's for the left and right hand is included in the manipulation segments. Pre- and post-conditions describing the state of the environment at the beginning and at the end of a manipulation segment are sufficient for their instantiation. The conditions are propagated from the EO level to the manipulation segment level and are computed from the environmental changes during
the manipulation. In parallel to the propagation of the conditions a generalization in
terms of positions and object types and features is performed.

A complex demonstration of a manipulation task is represented by a macro op-


erator, which is a sequence of manipulation segments. The pre- and post-conditions of the manipulation segments are propagated to a context and an effect of the macro operator. At execution time, a positive evaluation of the context of a macro enables its execution, and an error-free execution leads to the desired effect in the environment.
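The hierarchy described in this section can be summarized with a small set of nested data structures. The sketch below is illustrative only; class and field names are hypothetical, and the propagation of conditions is reduced to taking the first segment's pre-conditions as the context and the last segment's post-conditions as the effect.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class ElementaryOperator:          # EO: "move" or "grasp" type primitives
    kind: str                      # e.g. "move", "grasp", "release"
    params: dict = field(default_factory=dict)

@dataclass
class BasicOperation:              # e.g. a "pick": approach + grasp + dis-approach
    name: str
    eos: List[ElementaryOperator]

@dataclass
class ManipulationSegment:         # closed subtask: starts with a pick, ends with a place
    operations: List[BasicOperation]
    pre_cond: Set[str]
    post_cond: Set[str]

@dataclass
class MacroOperator:               # a whole demonstrated task
    segments: List[ManipulationSegment]

    @property
    def context(self) -> Set[str]:         # conditions that enable execution
        return self.segments[0].pre_cond if self.segments else set()

    @property
    def effect(self) -> Set[str]:          # desired change in the environment
        return self.segments[-1].post_cond if self.segments else set()

# Hypothetical fragment of "pouring a glass of water"
pick_bottle = BasicOperation("pick_bottle", [
    ElementaryOperator("move", {"target": "bottle"}),
    ElementaryOperator("grasp", {"type": "power"}),
    ElementaryOperator("move", {"direction": "up"}),
])
segment = ManipulationSegment([pick_bottle], {"bottle_on_table"}, {"bottle_in_hand"})
macro = MacroOperator([segment])
print(macro.context, macro.effect)
```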

7.6 Subgoal Extraction from Human Demonstration

When teaching robots through demonstration, independently of the sensors used and the abstraction level of the demonstration (i.e. for interactive learning, task learning and mostly also skill learning), a search for the goals and subgoals of the demonstration has to be performed. Therefore, a recorded or observed demonstration has to be segmented and analyzed in order to understand the user's intention and to extract all information necessary for enabling the robot to reproduce the learned action.
A feasible approach to subgoal extraction proceeds over several steps, in which the gathered sensor data is successively abstracted. As a practical example of a PbD system with subgoal extraction, this section presents a two-step approach based on the manipulation models and representations above and on the sensor system mounted in the training center (see Fig. 7.16). The example system outlined here is presented in more detail in [71]. In the first step the sensor data has to be preprocessed for
extracting reliable measurements and key points, which are in a second step used to
segment the demonstration. The following sections describe these steps.

Fig. 7.20. Sensor preprocessing and fusion for segmentation.



7.6.1 Signal Processing

Figure 7.20 shows only the preprocessing steps performed for the segmentation. A more elaborate preprocessing and fusion system would, e.g., be necessary for detecting gesture actions, as shown in [21]. The input signals j_t, h_t, f_t are gathered from the data glove, the magnetic tracker and the tactile force sensors. All signals are filtered, i.e. smoothed and high-pass filtered, in order to eliminate outliers. In the next step, the joint angles are normalized and passed via a rule-based switch to a static (SGC) and a dynamic (DGC) grasp classifier. The tracker values, representing the absolute position of the hand, are differentiated in order to obtain the velocity of the hand movement. As mentioned in [71], the tactile sensors have a 10-20% hysteresis, which is eliminated by the function H(x). The normalized force values, together with the velocity of the hand, are passed to the module R, which consists of a rule set that determines whether an object is potentially grasped and triggers the SGC and DGC. The outputs of these classifiers are grasp types according to the Cutkosky hierarchy (static grasps) or dynamic grasps according to the hierarchy presented in section 7.6.3.2.
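The following sketch illustrates the flavor of this preprocessing stage: smoothing, joint normalization, numerical differentiation of the hand position, and a simple rule combining force and hand speed to flag potential grasp contact. The filters, thresholds and signal names are placeholders, not those of the actual system.

```python
import numpy as np

def moving_average(signal, window=5):
    """Simple smoothing filter for glove/force channels (stand-in for the
    filtering stage; the actual filters of the system are not specified here)."""
    kernel = np.ones(window) / window
    return np.convolve(signal, kernel, mode="same")

def velocity(positions, dt):
    """Finite-difference velocity of the hand from tracker positions (T, 3)."""
    return np.gradient(positions, dt, axis=0)

def normalize_joints(joint_angles, j_min, j_max):
    """Map raw data-glove joint angles to [0, 1] per joint."""
    return (joint_angles - j_min) / (j_max - j_min + 1e-9)

# Hypothetical raw channels: 20 joint angles, 3D hand position, 1 force channel
T, dt = 200, 0.01
rng = np.random.default_rng(1)
joints = rng.uniform(0, 90, size=(T, 20))
hand = np.cumsum(rng.normal(scale=0.0002, size=(T, 3)), axis=0)
force = moving_average(rng.uniform(0, 1, size=T))

j_norm = normalize_joints(joints, joints.min(axis=0), joints.max(axis=0))
v_hand = velocity(hand, dt)
speed = np.linalg.norm(v_hand, axis=1)
# rule-based trigger for the grasp classifiers: force applied while the hand is slow
grasp_candidate = (force > 0.5) & (speed < 0.05)
print(grasp_candidate.sum(), "frames flagged as potential grasp contact")
```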

7.6.2 Segmentation Step

The segmentation of a recorded demonstration is performed in two steps:


1. Trajectory segmentation
This step segments the trajectory of the hand during the manipulation task.
Hereby the segmentation is done by detecting grasp actions. Therefore the time of contact between hand and object has to be determined. This is done by analyzing the force values with a threshold-based algorithm. To improve the reliability of the system, the results are fused with those of a second algorithm, based on the analysis of the trajectories of finger poses, velocities and accelerations with regard to their minima.
Figure 7.21 shows the trajectories of force values and finger joint velocity values
of three Pick & Place actions.
2. Grasp segmentation
For detecting fine manipulations, the actions performed while an object is grasped have to be segmented and analyzed. The upper part of figure 7.21 shows that the force graph features a relatively constant plateau. Since no external forces are applied to the object, this effect is plausible. But if the grasped object collides with the environment, the force profile changes. The result is high peaks, i.e. both amplitude and frequency oscillate, as shown in the lower part of figure 7.21.
Looking at the force values during a grasp, three different profiles can be distinguished:
• Static Grasp
Here the gathered force values are nearly constant. The force profile shows characteristic plateaus, whose height reflects the weight of the grasped object.
• External Forces
The force graph of this class shows high peaks. Because of the hysteresis

Fig. 7.21. Analyzing segments of a demonstration: force values and finger joint velocity.

of the sensors, no quantitative prediction about the applied forces can be


made. A proper analysis of external forces applied to a grasped object will be the subject of future work.
• Dynamic Grasps
During a dynamic grasp both amplitude and frequency oscillate moderately,
as a result of finger movements performed by the user.
The result of the segmentation step is a sequence of elemental actions like moves
and static and dynamic grasps.
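A coarse sketch of such a segmentation is given below: per-frame contact flags split the demonstration into alternating move and grasp segments, and the variability of the force profile is used as a stand-in criterion for distinguishing static from dynamic grasps. Thresholds and labels are illustrative only, not those of the classifiers described in the text.

```python
import numpy as np

def segment_demonstration(contact, force, var_threshold=0.02):
    """Split a demonstration into elemental actions from per-frame contact flags.

    contact: boolean array (True while an object is grasped), force: force signal.
    Frames without contact become 'move' segments; contact runs are labeled
    'static_grasp' or 'dynamic_grasp' by the variability of the force profile.
    """
    segments, start = [], 0
    for t in range(1, len(contact) + 1):
        if t == len(contact) or contact[t] != contact[start]:
            if contact[start]:
                label = ("dynamic_grasp" if np.var(force[start:t]) > var_threshold
                         else "static_grasp")
            else:
                label = "move"
            segments.append((label, start, t))
            start = t
    return segments

# Example: 0.4 s free motion, 0.6 s grasp with oscillating forces, 0.4 s free motion
contact = np.r_[np.zeros(40, bool), np.ones(60, bool), np.zeros(40, bool)]
force = np.r_[np.zeros(40), 0.6 + 0.3 * np.sin(np.linspace(0, 20, 60)), np.zeros(40)]
print(segment_demonstration(contact, force))
```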

7.6.3 Grasp Classification

For mapping detected manipulative tasks and representations of grasps, the diversity
of human object handling has to be considered. In order to preserve as much information as possible from the demonstration, two classes of grasps are distinguished.
Static grasps are used for Pick & Place tasks, while fine manipulative tasks require
dynamic handling of objects.
7.6.3.1 Static Grasp Classification
Static grasps are classified according to the Cutkosky hierarchy ( [14]). This hier-
archy subdivides static grasps into 16 different types mainly with respect to their
posture (see figure 7.22). Thus, a classification approach based on the finger angles
measured by the data glove can be used. All these values are fed into a hierarchy

Fig. 7.22. Cutkosky grasp taxonomy.

of neural networks, each consisting of a three-layer Radial Basis Function (RBF) network. The hierarchy resembles the Cutkosky hierarchy, passing the finger values from node to node as in a decision tree. Thus, each network has to distinguish very
few classes and can be trained on a subset of the whole set of learning examples.
The classification results of the system for all static grasp types are listed in
table 7.1. A training set of 1280 grasps recorded from 10 subjects and a test set of 640 examples were used. The average classification rate is 83.28%, where most confusions happen between grasps of similar posture (e.g. types 10 and 12 in

the Cutkosky hierarchy). In the application, classification reliability is furthermore


enhanced by heuristically regarding hand movement speed (which is very low when
grasping an object) and distance to a possibly grasped object in the world model
(which has to be very small for grasp recognition).

Table 7.1. Classification rates gained on test data

Grasp type   1     2     3     4     5     6     7     8
Rate         0.95  0.90  0.90  0.86  0.98  0.75  0.61  0.90

Grasp type   9     10    11    12    13    14    15    16
Rate         0.88  0.95  0.85  0.53  0.35  0.88  1.00  0.45

7.6.3.2 Dynamic Grasp Classification


For describing various household activities like opening a twist cap or screwing a bolt into a nut, simple operations like Pick & Place are inadequate. Therefore other elemental operations like Dynamic Grasps need to be detected and represented in order to program such tasks within the PbD paradigm. With Dynamic Grasps we denote operations like screw, insert etc., which all have in common that the finger joint angles change during the grasping phase (i.e. the force sensors provide non-zero values).
For classifying dynamic grasps we choose the movement of the grasped object as the distinguishing criterion. This allows for an intuitive description of the user's intention when performing fine manipulative tasks. For describing grasped object movements, these are transformed into hand coordinates. Figure 7.23 shows the axes of the hand according to the taxonomy of Elliot et al. [22]. Rotation and translation are defined with regard to the principal axes. Some restrictions of the human hand, like the limited rotation about the y-axis, are considered.
Figure 7.24 shows the distinguished dynamic grasps. The classification is done
separately for rotation and translation of the grasped object. Furthermore, the number of fingers involved in the grasp (i.e. from 2 to 5) is considered. There are several precision tasks which are performed only with thumb and index finger, like opening a bottle, which is classified as a rotation around the x-axis.
Other grasps require higher forces and all fingers are involved in the manipulation,
as e.g. Full Roll for screwing actions. The presented classification contains most of
the common fine manipulations, if we assume that three and four finger manipula-
tions are included in the five finger classes. For example a Rock Full dynamic grasp
can be performed with three, four or five fingers.
Whereas the classification of static grasps in this system is done by Neural Net-
works (cf. section 7.6.3.1), Support Vector Machines (SVM) are used to classify
dynamic grasps. Support vector machines are a general class of statistical learning
architectures, which are getting more and more important in a variety of applications

Fig. 7.23. Principal axes of the human hand.

because of their profound theoretical foundation as well as their excellent empirical


performance. Originally developed for pattern recognition, SVMs justify their application by a large number of positive qualities, like fast learning, accurate classification and at the same time high generalization performance. The basic training principle behind the SVM is to find the optimal class-separating hyperplane such
that the expected classification error for previously unseen examples is minimized.
In the remainder of this section, we briefly present the results of the dynamic grasp
classification using Support Vector Machines. For an introduction to the theory of
SVMs, we recommend e.g. [67], for more details about the dynamic grasp classifi-
cation system see [71].
For training the SVM, twenty-six classes corresponding to the elementary dynamic grasps presented in figure 7.24 were used. Because a dynamic grasp is defined by a progression of joint values, a time-delay approach was chosen. Consequently, the input vector of the SVM classifier comprised 50 joint configurations of 20 joint values each. The training data set contained 2600 input vectors. The fact that SVMs can learn from significantly less data than neural networks ensures that this approach works very well in this case.
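The sketch below illustrates the time-delay encoding and SVM training on synthetic stand-in data. Scikit-learn's SVC is assumed as the SVM implementation; the actual classes, kernel and parameters of the system in [71] are not reproduced here, and the class labels are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Time-delay encoding: a dynamic grasp is a short window of hand joint
# trajectories, flattened into one feature vector
# (here 50 time steps x 20 joint values = 1000 features, as in the text).
WINDOW, N_JOINTS = 50, 20

def make_feature_vector(joint_window):
    """joint_window: (WINDOW, N_JOINTS) array of normalized joint angles."""
    assert joint_window.shape == (WINDOW, N_JOINTS)
    return joint_window.reshape(-1)

# Synthetic stand-in data: two hypothetical dynamic grasp classes
rng = np.random.default_rng(2)
def synth_class(offset, n=50):
    t = np.linspace(0, 1, WINDOW)[:, None]
    return [make_feature_vector(np.sin(2 * np.pi * (t + offset)) +
                                0.05 * rng.normal(size=(WINDOW, N_JOINTS)))
            for _ in range(n)]

X = np.array(synth_class(0.0) + synth_class(0.25))
y = np.array([0] * 50 + [1] * 50)           # e.g. "x-roll" vs. "full-roll"

clf = SVC(kernel="rbf", C=1.0)              # RBF kernel as a common default choice
clf.fit(X, y)
print("support vectors used:", clf.n_support_.sum())
print("prediction for one window:", clf.predict(X[:1]))
```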
Figure 7.24 also shows results of the DGC. Since the figure shows either only
right or forward direction, the displayed percentages represent an average value of
the two values. The maximum variance between these two directions is about 2%.
It is remarkable that the SVM needs only 486 support vectors (SV) for generalizing over 2600 vectors, i.e. 18.7% of the data set. A smaller number of SVs improves not only the generalization behavior but also the runtime of the resulting algorithm during the application.
The joint data is normalized, but some user-specific variance surely remains. Therefore, the DGC was tested with data from a second user. A sample of ten instances of each of eight elemental dynamic grasps (i.e. 80 grasps) was evaluated. The maximum variance was less than or equal to 3%.

Fig. 7.24. Hierarchy of dynamic grasps containing the results of the DGC.

7.7 Task Mapping and Execution


The execution of meaningful actions in human centered environments is highly de-
manding on the robot’s architecture, which has to incorporate interaction and learn-
ing modalities. The execution itself needs to be reliable and legible in order to be
accepted by humans. Two main questions which have to be tackled in the context of
task execution with robot assistants are: How to organize a robot system to cope with
the demands on interaction and acceptance, and how to integrate new knowledge into
the robot?

7.7.1 Event-Driven Architecture

In order to cope with high interaction demands, an event-driven architecture like the
one outlined in figure 7.25 can be used. Hereby the focus of the robot system's con-
trol mechanism lies in enabling the system to react to interaction demands on
the one hand and to act autonomously on the other. All multimodal input, especially
from gesture and object recognition together with speech input, can be merged into
so-called "events", which form the key concept for representing this input.

An event is, on a rather abstract level of representation, the description of some-
thing that happened. The input values from the data glove are, e.g., transformed into
”gesture events“ or ”grasp events“. Thus, an event is a conceptual description of in-
put or inner states of the robot on a level which the user can easily understand and
cope with – in a way, the robot assistant is speaking ”the same language“ as a human.
As events can be of very different types (e.g. gesture events, speech events), they
have to be processed with regard to the specific information they carry.
Figure 7.25 shows the overall control structure transforming events in the envi-
ronment to actions performed by robot devices. Action devices are called ”agents”
and represent semantic groups of action performing devices (Head, Arms, Platform,
Speech, ...). Each agent is capable of generating events. All events are stored in the
”Eventplug”. A set of event transformers called ”Automatons”, each of which can
transform a certain event or group of events into an ”action“, is responsible for se-
lecting actions. Actions are then carried out by the corresponding part of the system,
e.g. the platform, the speech output system etc. A very simple example would be a
speech event generated from the user input "Hello Albert", which the responsible
event transformer would turn into a robot action speechoutput(Hello!). In prin-
ciple, the event transformers are capable of fusing two or more events that belong
together and generate the appropriate actions. This would, e.g., be a pointing gesture
with accompanying speech input like ”take this“. The appropriate robot action would
be to take the object on which the user is pointing.
The order of event transformers is variable, and transformers can be added or
deleted as suitable, thus representing a form of focus – the transformer highest in
order tries to transform incoming events first, then (if the first one does not succeed)
the transformer second in order, etc. The priority module is responsible for encap-
sulating the mechanism of reordering event transformers (automatons).
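The event mechanism sketched above can be illustrated by the following fragment. All class and method names (Event, EventPlug, EventTransformer, try_transform) are hypothetical and merely mirror the concepts of events, the event store, and the ordered chain of transformers; the example is a simplified sketch, not the architecture's actual interface.

```python
# Illustrative sketch of the event-transformer chain described above.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Event:
    kind: str              # e.g. "speech", "gesture", "grasp"
    payload: dict = field(default_factory=dict)

@dataclass
class Action:
    device: str             # e.g. "speech", "platform", "arm"
    command: str

class EventTransformer:
    """Transforms an event into actions, or declines by returning None."""
    def try_transform(self, event: Event) -> Optional[List[Action]]:
        raise NotImplementedError

class GreetingTransformer(EventTransformer):
    def try_transform(self, event):
        if event.kind == "speech" and "hello" in event.payload.get("text", "").lower():
            return [Action("speech", "speechoutput(Hello!)")]
        return None   # not responsible, let the next transformer try

class EventPlug:
    """Holds incoming events and dispatches them through the ordered chain."""
    def __init__(self, transformers: List[EventTransformer]):
        self.transformers = transformers   # order reflects the current priority

    def dispatch(self, event: Event) -> List[Action]:
        for transformer in self.transformers:
            actions = transformer.try_transform(event)
            if actions is not None:
                return actions
        return []      # no transformer felt responsible

# usage: plug = EventPlug([GreetingTransformer()])
#        plug.dispatch(Event("speech", {"text": "Hello Albert"}))
```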

Fig. 7.25. Event manager architecture.



As mentioned above, event transformers are also capable of fusing two or more
events. A typical problem is a verbal instruction by the user, combined with some
gesture, e.g. ”Take this cup.“ with a corresponding pointing gesture. In such cases, it
is the task of the appropriate event transformer to identify the two events which be-
long together and to initiate the actions which are appropriate for this situation.
In each case, one or more actions can be started, as appropriate for a certain event.
In addition, it is easily possible to generate questions concerning the input: an event
transformer designed specifically for queries (to the user) can handle events which
are ambiguous or underspecified, start an action which asks an appropriate question
to the user, and then use the additional information to transform the event. For con-
trolling hardware resources, a scheduling system that locks and unlocks execution of
the action performers is applied.
All possible actions (learned or innate) which can be executed by the robot sys-
tem are coded or wrapped as an automaton. These control the execution, and generate
events which can start new automatons or generate ”actions” for the subsystems. The
mechanism for starting new automatons, which are carrying out subtasks, leads to a
hierarchical execution of actions. Furthermore, interaction can be initiated from
each automaton, which implies that every subtask can handle its own interaction
needs.
For ensuring reliability of the system, the "Security and Exception Handler" is in-
troduced into the system's architecture. The main tasks of the module are handling ex-
ceptions from the agents, reacting to cancelling commands by the user and internally
clearing up all automatons after an exception. In line with the distributed control
through different automatons, every automaton reacts to special events for "cleaning
up" its activities and ensures a stable state of all used and locked agents.
The presented architecture concept not only ensures a working way of control-
ling robot assistants, but also represents a general way of coping with new know-
ledge and behaviors. These can easily be integrated as new automatons in the system.
The integration process itself can be done via a learning automaton, which creates
new automatons. The process of generating execution knowledge from general task
knowledge is part of the next section.

7.7.2 Mapping Tasks onto Robot Manipulation Systems

Task knowledge acquired through demonstration or interaction represents an abstract
description for problem solving, stored in some kind of a macro operator which is
not directly usable for the target system. Grasp representations describing
human hand configurations and trajectories are not optimal for robot kinematics,
since they are tailored to human kinematics. Besides, sensor control information for
the robot, e.g. force control parameters, cannot be extracted directly from a
demonstration.
Considering manipulation operations, a method for automatically mapping
grasps to robot grippers is needed. Furthermore, stored trajectories in the macro have
to be optimized for the execution environment and the target system.

As an example for grasp mapping the 16 different power and precision grasps
of Cutkosky’s grasp hierarchy [14] should be considered. As input the symbolic op-
erator description and the positions of the user’s fingers and palm during the grasp
are used. For mapping the grasps, optimal groups of coupled fingers have to be
calculated. These are fingers having the same grasp direction. The finger groups are
associated with the chosen gripper type. This helps to orient the finger positions for an ad-
equate grasp pose.
Since the symbolic representation of performed grasps includes information about
enclosed fingers, only those fingers are used for finding the correct pose. Let's as-
sume that the fingers used for the grasp are numbered as H1, H2, . . . , Hn, n ≤ 5¹.
For coupling fingers with the same grasping direction it is necessary to calculate
forces affecting the object. In case of precision grasps the finger tips are projected on
a grasping plane E defined by the finger tip positions during the grasp (figure 7.26).
Since the plane might be overdetermined if more than three fingers are used, the
plane is determined by using the least squares method.
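A minimal sketch of the plane fit is given below. The chapter only states that a least-squares method is used; the SVD-based total-least-squares fit, the centroid computation, and the function names are assumptions chosen for illustration.

```python
# Sketch of fitting the grasping plane E through the fingertip positions
# and computing their centroid G (the "gravity point" used below).
import numpy as np

def fit_grasp_plane(fingertips):
    """fingertips: (n x 3) array of fingertip positions, n >= 3.
    Returns (G, normal): centroid of the fingertips and unit plane normal."""
    P = np.asarray(fingertips, dtype=float)
    G = P.mean(axis=0)                      # centroid of the fingertip positions
    # The plane normal is the direction of least variance of the centered points.
    _, _, Vt = np.linalg.svd(P - G)
    normal = Vt[-1]
    return G, normal / np.linalg.norm(normal)

def project_to_plane(point, G, normal):
    """Project a fingertip onto the fitted plane E."""
    p = np.asarray(point, dtype=float)
    return p - np.dot(p - G, normal) * normal
```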

Fig. 7.26. Calculation of a plane E and the gravity point G defined by the finger tips' positions.

Now the force-vectors can be calculated by using the geometric formation of the
fingers given by the grasp type and the force value measured by the force sensors on
the finger tips. Prismatic and sphere grasps are distinguished. For prismatic grasps all
finger forces are assumed to act against the thumb. This leads to a simple calculation
of the forces F1, . . . , Fn, n ≤ 5, according to figure 7.27.
For circular grasps, the directions of the forces are assumed to act against the finger
tip’s center of gravity G, as shown in Fig. 7.28.
To evaluate the coupling of fingers the degree of force coupling Dc (i, j) is de-
fined:
Dc(i, j) = (Fi · Fj) / (|Fi| |Fj|)          (7.2)
¹ It should be clear that the number of enclosed fingers can be less than 5.

Fig. 7.27. Calculation of forces for a prismatic 5-finger grasp.

Fig. 7.28. Calculation of forces for a sphere precision grasp.

This means |Dc(i, j)| ≤ 1, and in particular Dc(i, j) = 0 for Fi ⊥ Fj. So the
degree of force coupling is high for forces acting in the same direction and low for
forces acting in orthogonal directions. The degree of force coupling is used for finding
two or three optimal groups of fingers using Arbib's method [4].
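The following sketch computes the degree of force coupling of equation (7.2) and groups fingers with similar force directions. The greedy grouping shown here is a simplified stand-in for Arbib's method [4], which is not reproduced; the coupling threshold and the maximum group count are illustrative values.

```python
# Sketch of the degree of force coupling from equation (7.2) and a
# simplified greedy grouping of fingers with similar force directions.
import numpy as np

def force_coupling(F_i, F_j):
    """Normalized dot product of two finger force vectors, see (7.2)."""
    F_i, F_j = np.asarray(F_i, float), np.asarray(F_j, float)
    return float(np.dot(F_i, F_j) / (np.linalg.norm(F_i) * np.linalg.norm(F_j)))

def group_fingers(forces, threshold=0.7, max_groups=3):
    """Greedily merge fingers whose pairwise coupling exceeds the threshold."""
    groups = [[i] for i in range(len(forces))]
    while len(groups) > max_groups:
        # find the pair of groups with the highest average coupling
        best, best_pair = -2.0, None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                c = np.mean([force_coupling(forces[i], forces[j])
                             for i in groups[a] for j in groups[b]])
                if c > best:
                    best, best_pair = c, (a, b)
        if best < threshold:
            break
        a, b = best_pair
        groups[a].extend(groups[b])
        del groups[b]
    return groups
```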
When the optimal groups of fingers are found, the robot fingers are assigned
to these finger groups. This process is rather easy for a three finger hand, since the
three groups of fingers can be assigned directly to the robot fingers. In case there are
only two finger groups, two arbitrary robot fingers are selected to be treated as one
finger.
The grasping position depends on the grasp type as well as the grasp pose. Since
there is no knowledge about the object present, the grasp position must be obtained
directly from the performed grasp. Two strategies are feasible:
• Precision Grasps: The center of gravity G which has been used for calculat-
ing the grasp pose serves as reference point for grasping. The robot gripper is
positioned relative to this point performing the grasp pose.
• Power Grasps: The reference point for power grasping is the user’s palm rela-
tive to the object. Thus, the robot’s palm is positioned with a fixed transformation
with respect to this point.
Now, the correct grasp pose and position are determined and the system can
perform force controlled grasping by closing the fingers.

Figure 7.29 shows 4 different grasp types mapped from human grasp operations
to a robot equipped with a three finger hand (Barrett) and 7 DoF arm. It can be seen
that the robot’s hand position depends strongly on the grasp type.

Fig. 7.29. Example of different grasp types. 1. 2-finger precision grasp, 2. 3-finger tripod
grasp, 3. circular power grasp, 4. prismatic power grasp

For mapping simple movements of the human demonstration to the robot, a set of
logical rules can be defined that select sensor constraints depending on the execution
context, e.g. selecting a force threshold parallel to the movement when
approaching an object, or zero-force control while grasping an object. So
context information serves for selecting intelligent sensor control. The context infor-
mation should be stored, in order to be available for processing.
Thus, a set of heuristic rules can be used for handling the system’s behavior
depending on the current context. To give an overview the following rules turned out
to be useful in specific contexts:
• Approach: The main approach direction vector is determined. Force control is
set to 0 in orthogonal direction to the approach vector. Along the grasp direction
a maximum force threshold is selected.
• Grasp: Set force control to 0 in all directions.
• Retract: The main approach direction vector is determined. Force control is set
to 0 in orthogonal direction to the approach vector.
• Transfer: Segments of basic movements are chunked into complex robot moves
depending on direction and speed.

Summarizing, the mapping module generates a set of default parameters for the
target robot system itself as well as for movements and force control. These parame-
ters are directly used for controlling the robot.
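The heuristic rules above can be encoded, for instance, as a simple context lookup that yields default force-control parameters, as sketched below. The threshold value and the parameter names are placeholders, not values from the system described here.

```python
# Illustrative encoding of the heuristic context rules listed above as a
# simple lookup producing default force-control parameters.
import numpy as np

MAX_APPROACH_FORCE = 5.0   # [N], placeholder threshold along the approach axis

def sensor_constraints(context, approach_direction=None):
    """Return (force_setpoints, force_limit_along_motion) for a move segment."""
    if context == "approach":
        d = np.asarray(approach_direction, float)
        d = d / np.linalg.norm(d)
        # zero-force control orthogonal to the approach vector,
        # force threshold along the approach/grasp direction
        return {"orthogonal_force": 0.0, "axis": d}, MAX_APPROACH_FORCE
    if context == "grasp":
        return {"orthogonal_force": 0.0, "axis": None}, 0.0   # zero force in all directions
    if context == "retract":
        d = np.asarray(approach_direction, float)
        d = d / np.linalg.norm(d)
        return {"orthogonal_force": 0.0, "axis": d}, None     # no limit along the axis
    if context == "transfer":
        return {}, None   # chunk basic movements; no extra force constraints
    raise ValueError(f"unknown context: {context}")
```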

7.7.3 Human Comments and Advice

Although it is desirable to make hypotheses about the user's wishes and intentions
based on only one demonstration, this will not work in the general case. Therefore,
the system must offer the user the possibility to accept and reject interpretations of
the human demonstration generated by the system. Since the generation of a well working robot
program is the ultimate goal of the programming process, the human user can interact
with the system in two ways:
1. Evaluation of hypotheses concerning the human demonstration, e.g. recognized
grasps, grasped objects, important effects in the environment.
2. Evaluation and correction of the robot program that will be generated from the
demonstration.
Interaction with the system to evaluate learned tasks can be done through speech
when only very abstract information has to be exchanged, like specifying the order
of actions or some parameters. Otherwise, especially if Cartesian points or trajecto-
ries have to be corrected, the interaction must be done in an adequate way, using for
example a graphical interface. In addition, a simulation system showing the identified
objects of the demonstration and actuators (human hand with its particular observed
trajectory and robot manipulators) in a 3D environment would be helpful.
During the evaluation phase of the user demonstration all hypotheses about
grasps including types and grasped objects can be displayed in a replayed scenario.
With graphical interfaces the user is prompted for acceptance or rejection of actions
(see figure 7.30).
After the modified user demonstration has been mapped onto a robot program
the correctness of the program must be validated in a simulation phase. The modi-
fication of the environment and the gripper trajectories are displayed within the 3D
environment, giving the user the opportunity to interactively modify trajectories or
grasp points of the robot if desired (see figure 7.31).

7.8 Examples of Interactive Programming


The concept of teaching robot assistants interactively by speech and gestures as well
as through a demonstration in a training center was evaluated in various systems,
e.g. [17]. Due to cognitive skills of the programmed robots and their event manage-
ment concept, intuitive and fluent interaction with them is possible. The following
example shows how a previously generated complex robot program for laying a table
is triggered. As in figure 7.32, the user shows the manipulable objects to the robot,
which instantiates an executable program from its database with the given environ-
mental information. It is also possible to advise the robot to grasp selected objects
and lay them down at desired positions.

Fig. 7.30. Example of a dialog mask for accepting and rejecting hypotheses generated by a
PbD system.

Fig. 7.31. Visualization of modifiable robot trajectories.
In this experiment, off-the-shelf cups and plates in different forms and colors
were used. The examples were recorded in the training center described in sec-
tion 7.4.4 [21]. Here, the mentioned sensors are integrated into an environment al-
lowing free movements of the user without restrictions. In a single demonstration,
crockery is placed for one person (see left column of figure 7.33). The observed ac-
tions can then be replayed in simulation (cf. second column). Here, sensor errors can
be corrected interactively. The system displays its hypotheses on recognized actions
and their segments in time.

Fig. 7.32. Interaction with a robot assistant to teach how to lay a table. The instantiation of
manipulable objects is aided through pointing gestures.

After interpreting and abstracting this information, the
learned task can be mapped to a specific robot. The third column of Fig. 7.33 shows
the simulation of a robot assistant performing the task. In the fourth column, the real
robot is performing the task in a household environment.

7.9 Telepresence and Telerobotics

Although autonomous robots are the explicit goal of many research projects, there
are nonetheless tasks which are very hard to perform for an autonomous robot at the
moment, and which robots may never be able to perform completely on their own.
Examples are tasks where the processing of sensor data would be very complex or
where it is vital to react flexibly to unexpected incidents. On the other hand, some
of these tasks cannot be performed by humans either, since they are too dangerous
(e.g. in contaminated areas), too expensive (e.g. some underwater missions) or too
time-consuming (e.g. space missions). In all of these cases, it would be advantageous
to have a robot system which performs the given tasks, combined with the foresight
and knowledge of a human expert.

7.9.1 The Concept of Telepresence and Telerobotic Systems

In this section, the basic ideas of telepresence and telerobotics are introduced, and
some example applications are presented. "Telepresence" indicates a system by
which all relevant environment information is presented to a human user as if he
or she were present on site. In "telerobotic" applications, a robot is located re-
motely from the human operator. Due to latencies in signal transfer from and to the
robot, instantaneous intervention by the user is not possible.
Fig. 7.33. Experiment "Laying a table". 1. column: real demonstration, 2. column: analysis
and interpretation of the observed task, 3. column: simulated execution, 4. column: execution
with a real robot in a household environment.

Fig. 7.34. The characteristics of telerobotic systems in terms of system complexity and level of
programming abstraction.

Fig. 7.34 shows the range of potential systems with regard to both system com-
plexity and the abstraction level of programming which is necessary. As can be seen
from the figure, both the interaction with the user and the use of sensor informa-
tion increase heavily with increasing system complexity and programming abstrac-
tion. At the one end of the spread are industrial robot systems with low complexity
and explicit programming. The scale extends towards implicit programming (as in
Programming by Demonstration systems) and via telerobots towards completely au-
tonomous systems. In terms of telepresence and telerobotics, this means a shift from
telemanipulation via teleoperation towards autonomy. "Telemanipulation" refers to
systems where the robot is controlled by the operator directly, i.e. without the use
of sensor systems or planning components. In ”teleoperation”, the planning of robot
movements is initiated by the operator, and is carried out using environment model
data which is updated continuously from sensor data processing.
Telerobotic systems are thus semi-autonomous robot systems which integrate the
human operator into their control loop because of his perceptive and reactive capa-
bilities which are indispensable for some tasks. Fig. 7.35 shows the components of
such a system in more detail. The robot is monitored by sensor systems like cameras.
The sensor data is presented to the operator who decides about the next movements
of the robot and communicates them to the system by appropriate input devices. The
planning and control system integrates the instructions of the operator and the results
of the sensor data processing, plans the movements to be performed by the robot and
controls the robot remotely. The main problem of these systems is the data transfer
between the robot and the control system on the user’s side. Latencies in signal trans-
fer due to technical reasons make it impossible for the user or for the planning and
control system, respectively, to intervene immediately in the case of emergencies.

Fig. 7.35. Structure of a telerobotic system.

Beside the latencies in the signal transfer, the kind of information gathered from
the teleoperated robot system is often limited due to limited sensors on board, and
the sensing or the visualization of important sensor values like force feedback or
temperature is restricted as well.

7.9.2 Components of a Telerobotic System

A variety of input devices is possible, depending on the application, to bring the
user's commands to the telerobotic system. Graphical user interfaces are very com-
mon, but joysticks, potentially with force feedback, are also used to transfer the user's
movements to the system. Other examples are 6-dimensional input devices,
so-called Space Masters, or, even more elaborate, data gloves or exoskeletons which
provide the exact finger joint angles of the human hand. The output of information,
e.g., sensor information from the robot, can be brought to the user in different ways,
e.g., in textual form or, more comfortably, via graphical user interfaces. Potential out-
put modes include other forms of visual display, like stereoscopic (shutter glasses) or
head-mounted displays which are often used in virtual or augmented reality. Force
feedback is realized by use of data gloves (ForceGlove) or by exoskeletons which
communicate the forces that the robot system experiences to the user.
For enabling the robot system to overcome the latencies caused by the transmis-
sion channel, it needs a local reaction and planning module. This component ensures
autonomous behavior for a short time. Local models concerning actions and per-
ception constraints are needed for analyzing, planning, and reacting in a safe way.
They need to be synchronized with the teleoperator's planning and control system in order
to determine the internal state of the robot.
The teleoperator is aided by a planning and control system which simulates the
robot. The simulation is updated with sensor values and internal robot states. Based
on the simulation, a prediction module is needed for estimating the actual robot state
and (if possible) the sensor values. Based on this the planning module will generate
a new plan, which is then sent to the remote system.
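A very reduced sketch of such a prediction module is given below: the last reported robot state is dead-reckoned forward with the last commanded velocity to bridge the transmission latency. A real system would rely on a proper dynamic model and sensor fusion; all names and the simple vector state representation are assumptions for the example.

```python
# Minimal sketch of a prediction module that bridges the transmission
# latency by dead-reckoning the last reported robot state forward with
# the last commanded velocity.
import numpy as np

class StatePredictor:
    def __init__(self):
        self.last_state = None            # e.g. planar pose (x, y, heading)
        self.last_velocity = np.zeros(3)
        self.last_timestamp = 0.0         # time the state was measured on the robot

    def update(self, state, velocity, timestamp):
        """Called whenever a (delayed) state report arrives from the robot."""
        self.last_state = np.asarray(state, float)
        self.last_velocity = np.asarray(velocity, float)
        self.last_timestamp = timestamp

    def predict(self, now):
        """Estimate the robot state at time 'now' on the operator side."""
        if self.last_state is None:
            return None
        dt = max(0.0, now - self.last_timestamp)   # elapsed time incl. latency
        return self.last_state + self.last_velocity * dt
```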
Beside the automatic planning module the interaction modality of the teleop-
erator and the remote system represents a crucial component whenever the human
operator controls the robot directly. Here, apart from adequate input and output
modalities (depending on the sensors and devices used), the selection of informa-
tion and the presentation form must be adapted to the actual situation and control
strategy.
Error and exception handling of a teleoperated system must ensure that the
system does not destroy itself or collide with the environment on the one hand, and
that it maintains the communication channel on the other. Therefore, safety
strategies like stopping the robot's actions in dangerous situations and waiting for
manual control are often implemented.

7.9.3 Telemanipulation for Robot Assistants in Human-Centered Environments

Considering robot assistants which work in human-centered environments and are
supposed to act autonomously, telemanipulation follows different goals:
Exception and error handling. Robot assistants usually have adequate excep-
tion and error handling procedures, which enable them to fulfill their normal
work. But since these robots are supposed to work in dynamic and
often in unstructured environments, situations can occur which require an inter-
vention of human operators. Having a teleoperation device, the systems can be
moved manually into a safe position or state and, if necessary, be reconfigured by
a remote service center.
Expert programming device. Adaptation is definitely one of the most important
features of a robot assistant, but due to strict constraints in terms of reliability, the
adaptation to new environments and tasks where new skills are needed cannot be
done automatically by the robot. For extending the capabilities of a robot assis-
tant, a teleoperation interface can serve as direct programming interface which
enables a remote expert to program new skills with respect to the special context
the system is engaged in.
In both cases the teleoperation interface is used only in exceptional situations,
when an expert is needed to ensure the further autonomous working of the systems.

7.9.4 Exemplary Teleoperation Applications

7.9.4.1 Using Teleoperated Robots as Remote Sensor Systems


Fig. 7.36 shows the 6-legged robot LAURON, which can be guided by a remote
teleoperator in order to find a way through the rubble of collapsed buildings. The
robot can be equipped with several sensors depending on the concrete situation (e.g.
video cameras, infrared cameras, laser scanner etc.). The images gathered from the
on-board sensors are passed to the teleoperator, who guides the robot to places of
interest.

Fig. 7.36. (a) Six-legged walking machine LAURON; (b) Head-mounted display with the head
pose sensor, (c) flow of image and head pose data (for details see [1]).

The undistorted images are presented to the teleoperator via a head-mounted
display (HMD) which is fitted with a gyroscopic and inertial sensor. This sensor
yields a direction-of-gaze vector which controls both the center of the virtual planar
camera and the robot’s pan-tilt unit (cf. Fig. 7.36). The direction of gaze is also
used to determine which part of the panoramic image needs to be transferred over
the network in the future (namely, a part covering a 180° angle centered around
the current direction of gaze). Using this technique together with low-impact JPEG
compression, the image data can be transferred even over low-capacity wireless links
at about 20fps.
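The gaze-dependent selection of image data can be sketched as follows: a window covering 180° of the cylindrical panorama is cut out around the current direction of gaze before compression and transfer. The indexing scheme, function name, and parameters are assumptions made for illustration, not the actual implementation of [1].

```python
# Sketch of selecting the part of the panoramic image to transmit:
# a window covering 180 degrees centered on the current direction of gaze.
# Column indexing assumes a cylindrical panorama spanning 360 degrees.
import numpy as np

def gaze_window(panorama, gaze_yaw_deg, fov_deg=180.0):
    """panorama: (H x W x 3) image covering 360 degrees horizontally.
    gaze_yaw_deg: current yaw of the head-mounted display in degrees."""
    w = panorama.shape[1]
    center_col = int((gaze_yaw_deg % 360.0) / 360.0 * w)
    half = int(fov_deg / 360.0 * w / 2)
    cols = [(center_col + c) % w for c in range(-half, half)]   # wrap around
    return panorama[:, cols]
```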
To control the movements of LAURON the graphical point-and-click control in-
terface has been expanded by the possibility to use a commercial two-stick ana-
logue joypad. Joypads, as they are used in the gaming industry, have the great ad-
vantage that they are designed to be used without looking at them (relief buttons,
hand ergonomics). This eliminates the need for the operator to look at the keyboard
or the mouse. The joypad controls the three Cartesian walking directions (forward,
sidewards and turn) and the height of the body (coupled to the step height) and is
equipped with a dead-man circuit.
The immersive interface makes the following information (apart from the sur-
rounding camera view) available to the user: artificial horizons of both the user (us-
ing the HMD inertial sensor) and the robot (using the robot’s own inertial sensor),
HMD direction of gaze relative to the robot’s forward direction, movement of the ro-
bot (translation vector, turn angle and body height), foot-ground contact and display
and network speed. Additionally, a dot in the user’s view marks the current intersec-
tion of the laser distance sensor beam with the scene and presents the distance obtained from the
sensor at that point, cf. Fig. 7.37.

Fig. 7.37. Immersive interface (user view) and the remotely working teleoperator.

7.9.4.2 Controlling and Instructing Robot Assistants via Teleoperation


In this section, an example of a teleoperation system is presented [28]. The system
consists of the augmented reality component which is displayed in Fig. 7.38. For
visualization, a head-mounted display is used. The head is tracked using two cameras
and black-and-white artificial landmarks. As input devices, a data glove and a so-
called magic wand are used, which are both also tracked with such landmarks. This
system allows teleoperating robots with either a mobile platform or an anthropomorphic
shape, as shown in Fig. 7.39. For the mobile platform, e.g. topological nodes of the
map can be displayed in the head-mounted display and then manipulated by the
human operator with the aid of the input devices. Examples of these operations can
be seen in Fig. 7.40. In the case of robots with manipulative abilities, manipulations
and trajectories can be visualized, corrected or adapted as needed (cf. Fig. 7.40).
With the aid of this system, different applications can be realized:
• Visualization of robot intention. Such a system is very well suited to visualize
geometrical properties, such as planned trajectories for a robot arm or grip points
for a dexterous hand. Therefore, the system can be used to refine existing inter-
action methods by providing the user with richer data about the robot’s intention.
• Integration with a rich environmental model. Objects can be selected for manip-
ulation by simply pointing at them, together with immediate feedback to the user
about whether the correct object has been selected, etc.

Fig. 7.38. Input setup with data gloves and magic wand as input devices.

Fig. 7.39. (left) Mobile platform ODETE , with emotional expression that humans can read;
(right) Anthropomorphic robot ALBERT loading a dish washer.

• Manipulating the environmental model. Since robots are excellent at refining ex-
isting geometrical object models from sensor data, but very bad at creating new
object models from unclassified data themselves, the human user is integrated
into the robot’s sensory loop. The user receives a view of the robot’s sensor data
and is thus able to use interactive devices to perform cutting and selection actions
on the sensor data to help the robot create new object models from the data.

Fig. 7.40. Views in the head-mounted display: top) topological map, and bottom) trajectory of
the manipulation "loading a dishwasher".

References
1. Albiez, J., Giesler, B., Lellmann, J., Zöllner, J., and Dillmann, R. Virtual Immersion for
Tele-Controlling a Hexapod Robot. In Proceedings of the 8th International Conference
on Climbing and Walking Robots (CLAWAR). September 2005.
2. Alissandrakis, A., Nehaniv, C., and Dautenhahn, K. Imitation with ALICE: Learning to
Imitate Corresponding Actions Across Dissimilar Embodiments. IEEE Transactions on
Systems, Man, and Cybernetics, 32(4):482–496, 2002.
3. Anderson, J. Kognitive Psychologie, 2. Auflage. Spektrum der Wissenschaft Verlagsge-
sellschaft mbH, Heidelberg, 1989.
4. Arbib, M., Iberall, T., Lyons, D., and Linscheid, R. Hand Function and Neocortex, chapter
Coordinated control programs for movement of the hand, pages 111–129. A. Goodwin
and T. Darian-Smith (Hrsg.), Springer-Verlag, 1985.
5. Archibald, C. and Petriu, E. Computational Paradigm for Creating and Executing Sensor-
based Robot Skills. 24th International Symposium on Industrial Robots, pages 401–406,
1993.
6. Asada, H. and Liu, S. Transfer of Human Skills to Neural Net Robot Controllers. In IEEE
International Conference on Robotics and Automation (ICRA), pages 2442–2448. 1991.
7. Asfour, T., Berns, K., and Dillmann, R. The Humanoid Robot ARMAR: Design and
Control. In IEEE-RAS International Conference on Humanoid Robots (Humanoids 2000),
volume 1, pages 897–904. MIT, Boston, USA, Sep. 7-8 2000.
8. Asfour, T. and Dillmann, R. Human-like Motion of a Humanoid Robot Arm Based on
a Closed-Form Solution of the Inverse Kinematics Problem. In IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS 2003), volume 1, pages 897–904.
Las Vegas, USA, Oct. 27-31 2003.
9. Bauckhage, C., Kummert, F., and Sagerer, G. Modeling and Recognition of Assembled
Objects. In IEEE 24th Annual Conference of the Industrial Electronics Society, volume 4,
pages 2051–2056. Aachen, September 1998.
10. Bergener, T., Bruckhoff, C., Dahm, P., Janßen, H., Joublin, F., Menzner, R., Steinhage,
A., and von Seelen, W. Complex Behavior by Means of Dynamical Systems for an An-
thropomorphic Robot. Neural Networks, 12(7-8):1087–1099, 1999.
11. Breazeal, C. Sociable Machines: Expressive Social Exchange Between Humans and Ro-
bots. Master’s thesis, Department of Electrical Engineering and Computer Science, MIT,
2000.

12. Brooks, R. The Cog Project. Journal of the Robotics Society of Japan Advanced Robotics,
15(7):968–970, 1997.
13. Bundesministerium für Bildung und Forschung. Leitprojekt Intelligente anthropomorphe
Assistenzsysteme, 2000. Http://www.morpha.de.
14. Cutkosky, M. R. On Grasp Choice, Grasp Models, and the Design of Hands for Manu-
facturing Tasks. IEEE Transactions on Robotics and Automation, 5(3):269–279, 1989.
15. Cypher, A. Watch what I do - Programming by Demonstration. MIT Press, Cambridge,
1993.
16. Dillmann, R., Rogalla, O., Ehrenmann, M., Zöllner, R., and Bordegoni, M. Learning
Robot Behaviour and Skills based on Human Demonstration and Advice: the Machine
Learning Paradigm. In 9th International Symposium of Robotics Research (ISRR 1999),
pages 229–238. Snowbird, Utah, USA, 9.-12. October 1999.
17. Dillmann, R., Zöllner, R., Ehrenmann, M., and Rogalla, O. Interactive Natural Program-
ming of Robots: Introductory Overview. In Proceedings of the DREH 2002. Toulouse,
France, October 2002.
18. Ehrenmann, M., Ambela, D., Steinhaus, P., and Dillmann, R. A Comparison of Four Fast
Vision-Based Object Recognition Methods for Programming by Demonstration Applica-
tions. In Proceedings of the 2000 International Conference on Robotics and Automation
(ICRA), volume 1, pages 1862–1867. San Francisco, California, USA, 24.–28. April
2000.
19. Ehrenmann, M., Rogalla, O., Zöllner, R., and Dillmann, R. Teaching Service Robots
complex Tasks: Programming by Demonstration for Workshop and Household Environ-
ments. In Proceedings of the 2001 International Conference on Field and Service Robots
(FSR), volume 1, pages 397–402. Helsinki, Finland, 11.–13. June 2001.
20. Ehrenmann, M., Rogalla, O., Zöllner, R., and Dillmann, R. Analyse der Instrumentarien
zur Belehrung und Kommandierung von Robotern. In Human Centered Robotic Systems
(HCRS), pages 25–34. Karlsruhe, November 2002.
21. Ehrenmann, M., Zöllner, R., Knoop, S., and Dillmann, R. Sensor Fusion Approaches
for Observation of User Actions in Programming by Demonstration. In Proceedings of
the 2001 International Conference on Multi Sensor Fusion and Integration for Intelligent
Systems (MFI), volume 1, pages 227–232. Baden-Baden, 19.–22. August 2001.
22. Elliot, J. and Connolly, K. A Classification of Hand Movements. In Developmental
Medicine and Child Neurology, volume 26, pages 283–296. 1984.
23. Festo. Tron X web page, 2000. Http://www.festo.com.
24. Friedrich, H. Interaktive Programmierung von Manipulationssequenzen. Ph.D. thesis,
Universität Karlsruhe, 1998.
25. Friedrich, H., Holle, J., and Dillmann, R. Interactive Generation of Flexible Robot Pro-
grams. In IEEE International Conference on Robotics and Automation. Leuven, Belgium,
1998.
26. Friedrich, H., Münch, S., Dillmann, R., Bocionek, S., and Sassin, M. Robot Programming
by Demonstration: Supporting the Induction by Human Interaction. Machine Learning,
pages 163–189, May/June 1996.
27. Fritsch, J., Lomker, F., Wienecke, M., and Sagerer, G. Detecting Assembly Actions by
Scene Observation. In International Conference on Image Processing 2000, volume 1,
pages 212–215. Vancouver, BC, Canada, September 2000.
28. Giesler, B., Salb, T., Steinhaus, P., and Dillmann, R. Using Augmented Reality to Interact
with an Autonomous Mobile Platform. In IEEE International Conference on Robotics
and Automation (ICRA 2004), volume 1, pages 1009–1014. New Orleans, USA, Oct. 27-
31 2004.

29. Hashimoto. Humanoid Robots in Waseda University - Hadaly-2 and Wabian. In IEEE-
RAS International Conference on Humanoid Robots (Humanoids 2000), volume 1, pages
897–904. Cambridge, MA, USA, Sep. 7-8 2000.
30. Heise, R. Programming Robots by Example. Technical report, Department of Computer
Science, The university of Calgary, 1992.
31. Honda Motor Co., L. Honda Debuts New Humanoid Robot ”ASIMO”. New Technol-
ogy allows Robot to walk like a Human. Press Release of November 20th, 2000, 2000.
Http://world.honda.com/news/2000/c001120.html.
32. Ikeuchi, K. and Suehiro, T. Towards an Assembly Plan from Observation. In International
Conference on Robotics and Automation, pages 2171–2177. 1992.
33. Ikeuchi, K. and Suehiro, T. Towards an Assembly Plan from Observation, Part I: Task
Recognition with Polyhedral Objects. IEEE Transactions on Robotics and Automation,
10(3):368–385, 1994.
34. Immersion. Cyberglove Specifications, 2004. Http://www.immersion.com.
35. Inaba, M. and Inoue, H. Robotics Research 5, chapter Vision Based Robot Programming,
pages 129–136. MIT Press, Cambridge, Massachusetts, USA, 1990.
36. Jiar, Y., Wheeler, M., and Ikeuchi, K. Hand Action Perception and Robot Instruction.
Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh,
Pennsylvania 15213, 1996.
37. Jiar, Y., Wheeler, M., and Ikeuchi, K. Hand Action Perception and Robot Instruction. In
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems,
volume 3, pages 1586–93. 1996.
38. Kaiser, M. Interaktive Akquisition elementarer Roboterfähigkeiten. Ph.D. thesis, Uni-
versität Karlsruhe (TH), Fakultät für Informatik, 1996. Published by Infix-Verlag, St.
Augustin (DISKI 153).
39. Kanehiro, F., Kaneko, K., Fujiwara, K., Harada, K., Kajita, S., Yokoi, K., Hirukawa, H.,
Akachi, K., and Isozumi, T. KANTRA – Human-Machine Interaction for Intelligent
Robots using Natural Language. In IEEE International Workshop on Robot and Human
Communication, volume 4, pages 106–110. 1994.
40. Kang, S. and Ikeuchi, K. Temporal Segmentation of Tasks from Human Hand Mo-
tion. Technical Report CMU-CS-93-150, Computer Science Department, Carnegie Mel-
lon University, Pittsburgh, PA, April 1993.
41. Kang, S. and Ikeuchi, K. Toward Automatic Robot Instruction from Perception: Mapping
Human Grasps to Manipulator Grasps. Robotics and Automation, 13(1):81–95, February
1997.
42. Kang, S. and Ikeuchi, K. Toward Automatic Robot Instruction from Perception: Mapping
Human Grasps to Manipulator Grasps. Robotics and Automation, 13(1):81–95, February
1997.
43. Kang, S. B. Robot Instruction by Human Demonstration. Ph.D. thesis, Carnegie Mellon
University, Pittsburgh, PA, 1994.
44. Kang, S. B. and Ikeuchi, K. Determination of Motion Breakpoints in a Task Sequence
from Human Hand Motion. In IEEE International Conference on Robotics and Automa-
tion (ICORA‘94), volume 1, pages 551–556. San Diego, CA, USA, 1994.
45. Kibertron. Kibertron Web Page, 2000. Http://www.kibertron.com.
46. Koeppe, R., Breidenbach, A., and Hirzinger, G. Skill Representation and Acquisition
of Compliant Motions Using a Teach Device. In IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS‘96), volume 1, pages 897–904. Pittsburgh, PA,
USA, Aug. 5-9 1996.

47. Kuniyoshi, Y., Inaba, M., and Inoue, H. Learning by Watching: Extracting Reusable
Task Knowledge from Visual Observation of Human Performance. IEEE Transactions
on Robotics and Automation, 10(6):799–822, 1994.
48. Lee, C. and Xu, Y. Online, Interactive Learning of Gestures for Human/Robot Inter-
faces, Minneapolis, Minnesota. In Proceedings of the IEEE International Conference on
Robotics and Automation, volume 4, pages 2982–2987. April 1996.
49. Menzner, R. and Steinhage, A. Control of an Autonomous Robot by Keyword Speech.
Technical report, Ruhr-Universität Bochum, April 2001. Pages 49–51.
50. MetaMotion. Gypsy Specification. Meta Motion, 268 Bush St. 1, San Francisco, Califor-
nia 94104, USA, 2001. Http://www.MetaMotion.com/motion-capture/magnetic-motion-
capture-2.htm.
51. Mitchell, T. Explanation-based Generalization - a unifying View. Machine Learning,
1:47–80, 1986.
52. Newtson, D. The Objective Basis of Behaviour Units. Journal of Personality and Social
Psychology, 35(12):847–862, 1977.
53. Onda, H., Hirukawa, H., Tomita, F., Suehiro, T., and Takase, K. Assembly Motion
Teaching System using Position/Force Simulator—Generating Control Program. In 10th
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages
389–396. Grenoble, France, 7.-11. September 1997.
54. Paul, G. and Ikeuchi, K. Modelling planar Assembly Tasks: Representation and Recogni-
tion. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), Pittsburgh, Pennsylvania, volume 1, pages 17–22. 5.-9. August 1995.
55. Pomerleau, D. A. Efficient Training of Artificial Neural Networks for Autonomous Nav-
igation. Neural Computation, 3(1):88 – 97, 1991.
56. Pomerleau, D. A. Neural Network Based Vision for Precise Control of a Walking Robot.
Machine Learning, 15:125 – 135, 1994.
57. Rogalla, O., Ehrenmann, M., and Dillmann, R. A Sensor Fusion Approach for PbD. In
Proc. of the IEEE/RSJ Conference Intelligent Robots and Systems, IROS’98, volume 2,
pages 1040–1045. 1998.
58. Schaal, S. Is Imitation Learning the Route to Humanoid Robots? In Trends in Cognitive
Sciences, volume 3, pages 323–242. 1999.
59. Segre, A. Machine Learning of Assembly Plans. Kluwer Academic Publishers, 1989.
60. Steinhage, A. and Bergener, T. Learning by Doing: A Dynamic Architecture for Gener-
ating Adaptive Behavioral Sequences. In Proceedings of the Second International ICSC
Symposium on Neural Computation (NC), pages 813–820. 2000.
61. Steinhage, A. and v. Seelen, W. Dynamische Systeme zur Verhaltensgenerierung eines
anthropomorphen Roboters. Technical report, Institut für Neuroinformatik, Ruhr Univer-
sität Bochum, Bochum, Deutschland, 2000.
62. Takahashi, T. Time Normalization and Analysis Method in Robot Programming from
Human Demonstration Data. In Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA), Minneapolis, USA, volume 1, pages 37–42. April 1996.
63. Toyota. Toyota Humanoid Robot Partners Web Page, 2000.
Http://www.toyota.co.jp/en/special/robot/index.html.
64. Tung, C. and Kak, A. Automatic Learning of Assembly Tasks using a Dataglove System.
In Proceedings of the International Conference on Intelligent Robots and Systems (IROS),
pages 1–8. 1995.
65. Tung, C. P. and Kak, A. C. Automatic Learning of Assembly Tasks Using a Data-
Glove System. In IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS‘95), volume 1, pages 1–8. Pittsburgh, PA, USA, Aug. 5-9 1995.

66. Ude, A. Rekonstruktion von Trajektorien aus Stereobildfolgen für die Programmierung
von Roboterbahnen. Ph.D. thesis, IPR, 1995.
67. Vapnik, V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
68. Vicon. Vicon Web Page, 2000. Http://www.vicon.com/products/viconmx.html.
69. Voyles, R. and Khosla, P. Gesture-Based Programming: A Preliminary Demonstration. In
Proceedings of the IEEE International Conference on Robotics and Automation, Detroit,
Michigan, pages 708–713. May 1999.
70. Zhang, J., von Collani, Y., and Knoll, A. Interactive Assembly by a Two-Arm-Robot
Agent. Robotics and Autonomous Systems, 1(29):91–100, 1999.
71. Zöllner, R., Rogalla, O., Zöllner, J., and Dillmann, R. Dynamic Grasp Recognition within
the Framework of Programming by Demonstration. In The 10th IEEE International Work-
shop on Robot and Human Interactive Communication (Roman), pages 418–423. 18.-21.
September 2001.
Chapter 8

Assisted Man-Machine Interaction


Karl-Friedrich Kraiss

When comparing man-machine dialogs with human-to-human conversation, it
is stunning how comparatively hassle-free the latter works. Even breakdowns in
conversation due to incomplete grammar or missing information are often repaired
intuitively. There are several reasons for this robustness of discourse like the mutually
available common sense and special knowledge about the subject under discussion
[33]. Also, the conduct of a conversation follows agreed conventions, allows mixed
initiative, and provides feedback of mutual understanding. Focused questions can
be asked and answers be given in both directions. On this basis an assessment of
plausibility and correctness can be made, which enables understanding in spite of
incomplete information (Fig. 8.1).


Fig. 8.1. Interpersonal conversation (left) vs. conventional man-machine interaction (right).

Unfortunately, man-machine interaction still lacks these features almost entirely.
Only some of these properties have been incorporated rudimentarily into advanced
interfaces [8, 17, 41]. In conventional man-machine interaction knowledge resides
with the user alone. Neither user characteristics nor the circumstances of operation
are made explicit.
In this chapter the concept of assisted interaction is put forward. It is based on the
consideration that, if systems were made more knowledgeable, human deficiencies
could be avoided or compensated in several respects, hereby increasing individual
abilities and overall performance. If e.g. the traffic situation and destination were
known, a driver could be assisted in accelerating, decelerating and steering his car
appropriately. Also, if the whereabouts and intentions of a personal digital assistant
user were available, information likely to be needed in a particular context could be
provided just in time.
Some examples of assistance that can be provided are:
• support of individual skills, e.g. under adverse environmental conditions
• compensation of individual deficiencies, e.g. of the elderly
• correction and prevention of human errors
• automatic override of manual action in case of emergency
• prediction of action consequences
• reduction of user workload
• provision of user tutoring
Assistance augments the information provided to the user and the inputs made
by the user in one way or another. In consequence an assisted system is expected to ap-
pear simpler to the user than it actually is and to be easier to learn. Hereby handling
is made more efficient or even more pleasant. Also system safety may profit from
assistance since users can duly be made aware of errors and their consequences. Es-
pecially action slips may be detected and suppressed even before they are committed.

8.1 The Concept of User Assistance


The generation of assistance relies on the system architecture presented in (Fig. 8.2).
As may be seen, the conventional man-machine system is augmented by three func-
tion blocks labeled context of use identification, user assistance generation, and data
bases.


Fig. 8.2. Assisted man-machine interaction.

Context of Use Identification

Correct identification of context of use is a prerequisite for assistance to be useful.
It describes the circumstances under which a system operates and includes the state
of the user, the system state, and the situation. Depending on the problem at hand,
some or all of these attributes may be relevant.
User state is defined by various indicators like user identity, whereabouts, work-
load, skills, preferences, and intentions. Identification requires e.g. user prompting
for passwords or fingerprints. Whereabouts may be derived from telecommunication
activities or from ambient intelligence methods. Workload assessment is based ei-
ther on user prompting or on physiological data like heartbeat or skin resistance. The
latter requires tedious installation of biosensors to the body which in general is not
acceptable.
Skills, preferences, and intentions usually are ascertained by interrogation, a
method which in everyday operations is not accepted. Hence only non-intrusive
methods can be used, which rely on user observation alone. To this end, user ac-
tions or utterances are recorded and processed.
System state characterization is application dependent. In dialog systems the state
of the user interface and of single applications and functions are relevant for an as-
sessment. In contrast, dynamic systems are described by state variables, state vari-
able limitations, and resources. All these data are in most systems readily accessible
for automatic logging.
Parameters relevant for situation assessment are system dependent and different
for mobile or stationary appliances. In general acquisition of environmental con-
ditions poses no serious problem, as sensors are available abundantly. As long as
automatic situation assessment is identical to what a user perceives with his nat-
ural senses, assistive functions are based on the same inputs as human reasoning and
are therefore coincident with user expectations. Problems may arise, however, if dis-
crepancies occur between both, as may be the case if what technical sensors see is
different from what is perceived by human senses. Such deviations may lead to a de-
ficient identification of assistance requirements. Consider, e.g. a car distance control
system based on radar that logs, with a linear beam, just the distance to the preceding
car. The driver, in contrast, sees a whole queue of cars ahead on a curved road. In this
case the speed suggested to the driver may be incompatible with own experiences.

Data Bases

In general there are several application specific data bases needed for user support
generation. For dialog systems these data bases may contain menu hierarchy, nomen-
clature, and functionality descriptions. For dynamic systems a system data base must
be provided that contains information related to performance limits, available re-
sources, functionality schematics for subsystems, maintenance data and diagnostic
aids.
For vehicle systems an environmental data base will contain navigational, geo-
graphical and topological information, infrastructure data as well as weather con-
ditions. For air traffic the navigational data base may e.g. contain airport approach
maps, approach procedures or radio beacon frequencies. In surface traffic e.g. digital
road maps and traffic signaling would be needed.

User Assistance Generation

Based on context of use and on data base content there are five options for supporting
the user in task execution:

Providing access to information
In this mode the user is granted access to the data bases on request, however no
active consultation is provided. An example would be the call-up of wiring diagrams
during trouble shooting. In addition information may be enhanced to match the char-
acteristics of human senses. Consider, e.g. the use of infrared lighting to improve
night vision.
Providing consultation
In this mode the user is automatically accommodated with information relevant
in the prevailing context of use. During diagnostic activities available options are e.g.
presented together with possible consequences of choosing a particular one. Consul-
tation is derived by the inference engine of an expert system, which makes use of
facts and rules accumulated in data bases.
Providing commands
Commands are the proper means of assistance if time is scarce, a situation that
often happens during manual control. The user is then triggered by a visual, acoustic,
or haptic command to act immediately. The command can be followed or disregarded
but, due to time restriction, not disputed.
Data entry management
This type of assistance refers to the correction, prediction and/or completion of
keyboard and mouse entries into dialog systems, in order to reduce the number of
required input actions.
Providing intervention
Manual inputs into dynamic systems sometimes happen to be inappropriate in
amplitude, duration or timing. Here automatic intervention is applied to appropri-
ately shape, filter, limit or modify manual input signals. A well known example for
this approach is the antilock braking system common in cars. Here the driver's brake
pedal pressure is substituted by synthesized intermittent braking pulses, scaled to
prevent the wheels from locking. While antilock system operation is audible, the user
in most systems is not even aware of intervening systems being active.
Takeover by Automation
If a task requires manual action beyond human capabilities, the human operator
may be taken out-of-the-loop altogether and authority may be transferred to perma-
nent automation as exemplified by stabilizers in airplanes. In case task requirements
are heavy but still within human capabilities, temporary automation may be consid-
ered to reduce workload or enhance comfort, as is the case with autopilots. Suitable
strategies for task delegation to automatons must then be considered, which are ex-
plained in a later section.
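For illustration, the assistance options above can be thought of as modes selected from the identified context of use, as in the following sketch. The enumeration merely mirrors the options listed in this section; the selection criteria and the context keys are purely illustrative placeholders and not part of the concept itself.

```python
# Illustrative sketch only: the assistance options above as an enumeration,
# with a toy selection rule based on the identified context of use.
from enum import Enum, auto

class AssistanceMode(Enum):
    INFORMATION_ACCESS = auto()
    CONSULTATION = auto()
    COMMAND = auto()
    DATA_ENTRY_MANAGEMENT = auto()
    INTERVENTION = auto()
    TAKEOVER = auto()

def select_mode(context):
    """context: dict with illustrative keys such as 'time_critical',
    'beyond_human_capability', 'task_type' and 'workload'."""
    if context.get("beyond_human_capability"):
        return AssistanceMode.TAKEOVER
    if context.get("time_critical"):
        return AssistanceMode.COMMAND
    if context.get("manual_input_out_of_range"):
        return AssistanceMode.INTERVENTION
    if context.get("task_type") == "dialog":
        return AssistanceMode.DATA_ENTRY_MANAGEMENT
    if context.get("workload", 0.0) > 0.7:
        return AssistanceMode.CONSULTATION
    return AssistanceMode.INFORMATION_ACCESS
```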
In the following the concept of assistance outlined above will be elaborated in
more detail for manual control and dialog tasks.

8.2 Assistance in Manual Control


In this section various aspects of assistance in manual control tasks are discussed.
First the needs for assistance are identified based on an analysis of human motor
limitations and deficiencies. Then parameters describing context of use as relevant in
manual control are described. Finally measures to support manual control tasks are
discussed.

8.2.1 Needs for Assistance in Manual Control

Manual control relies on perception, mental models of system dynamics, and man-
ual dexterity. While human perception outperforms computer vision in many respects
like, e.g. in Gestalt recognition and visual scene analysis, there are also some serious
shortcomings. It is, e.g. difficulty for humans to keep attention high over extended
periods of time. In case of fatigue there is a tendency for selective perception, i.e.
the environment is partially neglected (Tunnel vision ). Night vision capabilities are
rather poor; also environmental conditions like fog or rain restrict long range vision.
Visual acuity deteriorates with age. Accommodation and adaptation processes result
in degraded resolution, even with young eyes. Support needs arise also from cogni-
tive deficiencies in establishing and maintaining mental models of system dynamics.
Human motor skills are on the one hand very elaborate, considering the 27 degrees of
freedom of a hand. Unfortunately hand skills are limited by muscle tremor as far as movement
precision is concerned and by fatigue as far as force exertion is concerned [7, 42].
Holding a position steadily during an extended period of time is beyond human abil-
ities. In high precision tasks, as e.g. during medical teleoperation, this can not be
tolerated. Vibrations also pose a problem, since vibrations present in a system, origi-
nating e.g. from wind gusts or moving platforms, tend to cross-talk into manual inputs
and disturb control accuracy.
For the discussion of manual control limitations a simple compensatory control
loop is depicted in (Fig. 8.3a), where r(t) is the input reference variable, e(t) the
error variable, u(t) the control variable, and y(t) the output variable. The human
operator is characterized by the transfer function GH (s), and the system dynamics
by GS (s).

Fig. 8.3. Manual compensatory control loop (a) and equivalent crossover model (b).

The control loop can be kept stable manually, as long as the open loop transfer
function G0 (s) meets the crossover model of (8.1), where ωc is the crossover fre-
quency and τ the delay time accumulated in the system. Model parameters vary with
input bandwidth, subjects and subject practice. The corresponding crossover model
loop is depicted in (Fig. 8.3b).
G0(s) = GH(s) · GS(s) = (ωc / s) · e^(−sτ)          (8.1)
A quasilinear formulation of the human transfer function is

GH(s) = [(1 + T1 s) / (1 + T2 s)] · [1 / (1 + Tn s)] · e^(−sτ)          (8.2)

Stabilization requires adaptation of lead/lag and gain terms in GH(s). In a second
order system the operator must e.g. produce a lead time T1 in equation (8.2). With
reference to the crossover model it follows, that stabilization of third order system
dynamics is beyond human abilities (except after long training and following build-
up of mental models of system dynamics).
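The crossover model of (8.1) can be evaluated numerically to read off the gain-crossover frequency and the phase margin of the loop, as sketched below. The parameter values ωc = 3 rad/s and τ = 0.2 s are illustrative only and not taken from the text.

```python
# Numerical sketch of the crossover model (8.1): evaluate the open-loop
# frequency response G0(jw) = (wc / jw) * exp(-jw*tau) and read off the
# phase margin at the gain-crossover frequency.
import numpy as np

def crossover_phase_margin(wc=3.0, tau=0.2):
    w = np.logspace(-2, 2, 4000)                 # frequency grid [rad/s]
    G0 = (wc / (1j * w)) * np.exp(-1j * w * tau)
    mag = np.abs(G0)
    idx = np.argmin(np.abs(mag - 1.0))          # |G0| = 1: gain crossover
    phase_deg = np.degrees(np.angle(G0[idx]))
    return w[idx], 180.0 + phase_deg            # crossover frequency, phase margin

# For wc = 3 rad/s and tau = 0.2 s the gain crossover lies near w = wc and
# the phase margin is about 56 degrees (90 degrees minus wc*tau in degrees).
print(crossover_phase_margin())
```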

8.2.2 Context of Use in Manual Control

Context of use for manual control tasks must take into account operator state, system
state, and environmental state. The selection of suitable data describing these states
is largely application dependent and cannot be discussed in general.
Therefore the methods of state identification addressed here will focus on car
driving. A general and necessarily incomplete scheme characterizing the context of car driving
has to take into account the driver state, the vehicle state, and the traffic situation. For each of
these typical input data are given in (Fig. 8.4), which will subsequently be discussed
in more detail.

Fig. 8.4. Factors relevant in the context of car driving.

Identification of Driver State

Biosignals appear to be the first choice for driver state identification. In particular aviation
medicine has stimulated the development of manageable biosensors to collect
data via electrocardiograms (ECG), electromyograms (EMG), electrooculograms
(EOG), and electrodermal activity (EDA) sensors. All of these have proved to
be useful, but have the disadvantage of intrusiveness. Identification of the driver's
state in everyday operations must in no case interfere with driver comfort. Therefore
non-intrusive methods of biosignal acquisition have been tested, e.g. deriving skin
resistance and heartbeat from the hands holding the steering wheel.

With breakthroughs in camera and computer technology, computer vision has become
robust enough for in-car applications. In future premium cars a camera will be
mounted behind the rearview mirror. From the video pictures head pose, line-of-sight,
and eye blinks are identified, which indicate where the driver is looking and where he
focuses his attention. Eye blinks are an indicator of operator fatigue (Fig. 8.5). In the
state of fatigue frequent slow blinks are observed, in contrast to a few fast blinks in a
relaxed state [21].

Fig. 8.5. Eye blink as an indicator of fatigue (left: drowsy; right: awake).

Driver control behavior is easily accessible from control inputs to steering wheel,
brakes, and accelerator pedal. The recorded signals may be used for driver modeling.
Individual and interindividual behavior may however vary significantly over time.
For a model to be useful, it must therefore be able to adapt to changing behavior in
real time.
Consider, e.g., the development of a curve-speed warning system for drivers on a
winding road. Since the lateral acceleration preferred during curve driving is different
for relaxed and dynamic drivers, the warning threshold must be individualized (Fig.
8.6). A warning must only be given if the expected lateral acceleration in the curve
ahead exceeds the individual preference range. A fixed threshold set too high would
be useless for the relaxed driver, while a threshold set too low would be annoying to
the dynamic driver [32].


Fig. 8.6. Interpersonal differences in lateral acceleration preferences during curve driving [32].
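A minimal sketch of such an individualized warning criterion is given below (hypothetical function and parameter names; the expected lateral acceleration is estimated from the planned speed v and the curve radius r of the road ahead as a = v²/r):

// Sketch: individualized curve-speed warning (illustrative only).
// preferredLatAccMax is the upper bound of the driver's learned preference
// range for lateral acceleration (cf. Fig. 8.6).
bool curveWarningNeeded(double speed,              // planned speed [m/s]
                        double curveRadius,        // radius of the curve ahead [m]
                        double preferredLatAccMax) // learned threshold [m/s^2]
{
  // expected lateral acceleration in the curve ahead: a = v^2 / r
  const double expectedLatAcc = speed * speed / curveRadius;

  // warn only if the expectation exceeds the individual preference range
  return expectedLatAcc > preferredLatAccMax;
}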

With respect to user models, parametric and black-box approaches must be distinguished.
If a priori knowledge about behavior is available, a model structure can be
developed where only some parameters remain to be identified; otherwise no a priori
assumptions can be made [26]. An example of parametric modeling is the method
for the direct measurement of crossover model parameters proposed by [19]. Here a gradient
search approach is used to identify ω_c and τ in the frequency domain, based on
partial derivatives of G_0, as shown in (Fig. 8.7).
In contrast to parametric models, black box models use regression to match
model output to observed behavior (Fig. 8.8). The applied regression method must
be able to cope with inconsistent and uncertain data and follow behavioral changes
in real time.

Fig. 8.7. Identification of crossover model parameters ωc and τ [19].

One option for parametric driver modeling is the functional link network, which
approximates a function y(x) by nonlinear basis functions Φ_j(x) [34].


y(x) = \sum_{j=1}^{b} w_j \Phi_j(x) \qquad (8.3)

Kraiss [26] used this approach to model driver following and passing behavior in a
two-way traffic situation. The weights are learned by supervised training, and the
weighted basis functions are summed up to yield the desired model output. Since global
basis functions are used, the resulting model represents mean behavior averaged over
subjects and time.


Fig. 8.8. Black-box model identification.

For individual driver support mean behavior is not sufficient, since rare events
are also of interest and must not be concealed by frequently occurring events. For
the identification of such details semi-parametric, local regression methods are better
suited than global regression [15]. In a radial basis function network a function y(x)
is accordingly approximated by weighted Gaussian functions Φ_j,


y(x) = \sum_{j=1}^{b} w_j \Phi_j(\|x - x_j\|) \qquad (8.4)

where \|x - x_j\| denotes a norm, mostly the Euclidean distance, of a data point x from
the center x_j of Φ_j.
Garrel [13] used a modified version of basis function regression as proposed
by [39] for driver modeling. The task was to learn individual driving behavior when
following a leading car. Inputs were the own speed as well as the distance and speed
difference to the leading car. Data were collected in a driving simulator during training
intervals of 20 minutes. As documented by (Fig. 8.9), the model output emulates driver
behavior almost exactly.

Fig. 8.9. Identification of follow-up behavior by driver observation.
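A minimal sketch of radial basis function regression as in (8.4) is shown below (illustrative code only, not the implementation of [13, 39]); in a driver model the input vector x would hold, e.g., the own speed, the distance, and the speed difference, while the centers x_j, the width, and the weights w_j are learned from the recorded driving data:

#include <cmath>
#include <vector>

// Sketch of a radial basis function network, cf. equation (8.4).
// Input x: e.g. {own speed, distance to leading car, speed difference}.
// Output:  e.g. the driver's acceleration command.
struct RBFNetwork {
  std::vector<std::vector<double> > centers; // the centers x_j
  std::vector<double> weights;               // the weights w_j
  double sigma;                              // common width of the Gaussians

  // Evaluate the network output; x is assumed to have the same
  // dimension as each center.
  double evaluate(const std::vector<double>& x) const {
    double y = 0.0;
    for (std::size_t j = 0; j < centers.size(); ++j) {
      // squared Euclidean distance ||x - x_j||^2
      double d2 = 0.0;
      for (std::size_t k = 0; k < x.size(); ++k) {
        const double diff = x[k] - centers[j][k];
        d2 += diff * diff;
      }
      // Gaussian basis function Phi_j(||x - x_j||)
      y += weights[j] * std::exp(-d2 / (2.0 * sigma * sigma));
    }
    return y;
  }
};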

Identification of Vehicle State


Numerous car dynamics parameters can potentially contribute to the generation of
driver assistance. Among these are state variables for translation, speed, and acceleration.
In addition there are parameters indicating limits of operation, e.g. maximum acceleration
and braking. The friction coefficient between wheel and road is important to
prevent locking and skidding of the wheels. Finally, car yaw angle acceleration has to
be mentioned as an input for automatic stabilization. All these data can be accessed
on board without difficulty.

Identification of a Traffic Situation


Automatic acquisition and assessment of a traffic situation is based on the current
position and the destination of the own car in relation to other traffic participants.
Also traffic rules to be followed and alternative paths in case of traffic jams must be
known. Further factors of influence include time of day, week, month, or year as well
as weather conditions and available infrastructure.

Fig. 8.10. Detection ranges of distance sensors common in vehicles [22].

Distance sensors used for reconnaissance include far, mid, and short range radar,
ultrasonic, and infrared sensors (Fig. 8.10). Recently video cameras have also become
powerful enough for outdoor application. Especially CMOS high dynamic range cameras
can cope with varying lighting conditions.
Advanced real-time vision systems try to emulate human vision by active gaze
control and simultaneous processing of foveal and peripheral field of view. A sum-
mary of related machine vision developments for road vehicles in the last decade
may be found in [11]. Visual scene analysis also draws to a great extent upon digital
map information for the disambiguation of a priori hypotheses [37].
In digital maps infrastructural data like the position of gas stations, restaurants,
and workshops are associated with geographical information. In combination with
traffic telematics, information can be provided beyond what the driver perceives with
his own eyes, e.g. looking around a corner. Telematics also enables communication
between cars and drivers (a feature to be used e.g. for advance warnings), as well as
between driver and infrastructure, e.g. via electronic beacons (Fig. 8.11).
Sensor data fusion is needed to take best advantage of all sensors. In (Fig. 8.12)
a car is located on the street by a combination of digital map and differential global
positioning system (DGPS) outputs. In parallel a camera tracks the road side-strip and
mid-line to identify the car position relative to the lane margins exactly.
Following augmentation by radar, traffic signs and obstacles are extracted from the
pictures, as well as other leading, trailing, passing, or cutting-in cars. Reference to the
digital map also yields the curve radius and reference speed of the road ahead. Finally
braking distance and impending skidding are derived from yaw angle sensing,
wheel slip detection, wheel air pressure, and road condition.


Fig. 8.11. Traffic telematics enables contact between drivers and between driver and infrastructure.

Fig. 8.12. Traffic situation assessment by sensor data fusion.

8.2.3 Support of Manual Control Tasks

A system architecture for assistance generation in dynamic systems, which takes the
context of manual control and digital map data bases as inputs, is proposed in (Fig.
8.13). The manual control assistance generator provides complementary support with
respect to information management, control input management, and the substitution of
manual inputs by automation takeover.

Information Management

Methods to improve information display in a dynamic system may be based on environmental
cue perceptibility augmentation, on making available predictive information,
or on the provision of commands and alerts. In this section the different
measures are presented and discussed.

Fig. 8.13. Assistance in manual control.
Environmental cue perceptibility augmentation
Compensation of sensory deficiencies by technical means applies in difficult environmental
conditions. An example from ground traffic is active night vision, where
the headlights are made to emit near-IR radiation reaching farther than the own low
beam. Thus a normally invisible pedestrian (Fig. 8.14, left) is illuminated and can
be recorded by an IR-sensitive video camera. The infrared picture is then superimposed
on the outside view with a head-up display (Fig. 8.14, right). Hereby no change
of accommodation is needed when looking from the panel to the outside and vice versa.

Fig. 8.14. Sensory augmentation by near infrared lighting and head-up display of pictures
recorded by an IR sensitive video camera [22].

Provision of predictive information


Systems incorporating long reaction times caused by inertia or transmission delays
are notoriously hard to control manually. An appropriate method to compensate such
latencies is prediction. A predictor presents to the operator not only the actual state
but also a calculated future state, which is characterized by a transfer function G_P
(Fig. 8.15).
A simple realization of the predictor transfer function GP by a second order
Taylor expansion with prediction time τ yields:

Fig. 8.15. Control loop with prediction.

G_P(s) = \frac{y_P}{y} = 1 + \tau \cdot s + \frac{\tau^2}{2} \cdot s^2 + \mathrm{remnant} \qquad (8.5)

Now assume the dynamics G_S of the controlled process to be of third order (with ζ =
degree of damping and ω_n = natural frequency):

G_S(s) = \frac{y}{u} = \frac{1}{\left(1 + \frac{2\zeta}{\omega_n} \cdot s + \frac{1}{\omega_n^2} \cdot s^2\right) \cdot s} \qquad (8.6)

With τ = √2/ω_n and ζ = 1/√2 the product G_S(s) · G_P(s) simplifies – with
respect to the predicted state variable – to an easily controllable integrator (8.7).

G_S(s) \cdot G_P(s) = \frac{y_P}{u} = \left(1 + \tau \cdot s + \frac{\tau^2}{2} \cdot s^2\right) \cdot \left[\left(1 + \frac{2\zeta}{\omega_n} \cdot s + \frac{1}{\omega_n^2} \cdot s^2\right) \cdot s\right]^{-1} = \frac{1}{s} \qquad (8.7)
As an example a submarine depth control predictor display is shown in (Fig. 8.16).
The boat is at an actual depth of 22 m and tilts downward. The predictor is depicted as
a quadratic spline line. It shows that the boat will level off after about 28 s at a
depth of about 60 m, given the control input remains as is. Dashed lines in (Fig.
8.16) indicate the limits of maneuverability, if the depth rudders are set to the extreme
upper and lower deflections, respectively. The helmsman can observe the effect of inputs
well ahead of time and take corrective actions in time. The predictor display thus
visualizes system dynamics which otherwise would have to be trained into a mental
model [25].

Fig. 8.16. Submarine depth predictor display.
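In discrete time the predictor of (8.5) can be realized by extrapolating the current output with its first and second derivatives, the remnant being neglected. The following sketch (hypothetical interface, finite differences for the derivatives) illustrates the idea for a scalar output such as submarine depth:

// Sketch: second order Taylor predictor, cf. equation (8.5).
// y0, y1, y2 are the three most recent samples of the output variable
// (y2 newest), dt the sampling interval, tau the prediction horizon.
double predictOutput(double y0, double y1, double y2, double dt, double tau)
{
  const double yDot    = (y2 - y1) / dt;                   // first derivative
  const double yDotDot = (y2 - 2.0 * y1 + y0) / (dt * dt); // second derivative

  // y_p = y + tau * y' + (tau^2 / 2) * y''  (remnant neglected)
  return y2 + tau * yDot + 0.5 * tau * tau * yDotDot;
}

Evaluating this expression for prediction times from zero up to the full horizon yields the spline-like predictor curve of the kind shown in (Fig. 8.16).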
Predictors can be found in many traffic systems, in robotics, and in medicine.
Successful compensation of several seconds of time delay by prediction has e.g. been
demonstrated for telerobotics in space [16]. Reintsema et al. [35] summarize recent
developments in high fidelity telepresence in space and surgery robotics, which also
make use of predictors.
Provision of alerts and commands
Since reaction time is critical during manual control, assistance often takes the
form of visual, acoustic, or haptic commands and alerts. Making best use of human
senses requires addressing the sensory modality that matches task requirements best.
Reaction to haptic stimulation is the fastest among human modalities (Fig. 8.17).
Also haptic stimulation is perceived in parallel to seeing and hearing and thus es-
tablishes an independent sensory input channel. These characteristics have led to an
extensive use of haptic alerts. In airplanes control stick shakers become active if
airspeed is getting critically slow. More recently a tactile situation awareness system
for helicopter pilots has been presented, where the pilot's waistcoat is equipped with
120 pressure sensors which provide haptic 3D warning information. In cars the driver
receives haptic torque or vibration commands via the steering wheel or accelerator
pedal as part of an adaptive distance and heading control system.
Also redundant coding of a signal by more than one modality is a proven means
to guarantee perception. An alarm coded simultaneously via display and sound is
less likely to be missed than a visual signal alone [5].

Fig. 8.17. Reaction times following haptic and acoustic stimulation [30].

The buckle-up sign when starting a car is an example of a nonverbal acoustic
alert. Speech commands are common in car navigation systems. In commercial airplanes
synthesized altitude callouts are provided during landing approaches. In case
of impending flight into terrain, pull-up calls alert the pilot to take action. In military
aircraft visual commands are displayed on a head-up display to point to an impending
threat. However, visual commands address a sensory modality which often is
already overloaded by other observation tasks. Therefore other modalities are often
preferable.

Control Input Management

In this section two measures of managing control inputs in dynamic systems are
illustrated. This refers to control input modification and adaptive filtering of platform
vibrations.
Control input modifications
Modifications of the manual control input signal may consist of shaping, limiting,
or amplifying it. Nonlinear signal shaping is, e.g., applied in active steering systems
in premium cars, where it contributes to an increased comfort of use. During parking
at low speed, small turning angles at the steering wheel result in large wheel
rotation angles, while during fast highway driving, wheel rotations corresponding to
the same steering input are comparatively small. The amplitude of required steering
actions is thus made adaptive to the traffic situation.
Similar to steering gain, nonlinear force execution gain can be a useful method
of control input management. Applying the right level of force to a controlled element
can be a difficult task. It is, e.g., a common observation that people refrain from
exerting the full power at their disposal on the brake pedal. In fact, even in emergency
situations, only 25% of the physically possible pressure is applied. In consequence
a car comes to a stop much later than brakes and tires would permit. To compensate
for this effect, car manufacturers offer a braking assistant which intervenes
by multiplying the applied braking force by a factor of about four. Hence
already moderate braking pressure results in emergency braking (Fig. 8.18). This
braking augmentation is however only triggered when the brake pedal is activated
very fast, as is the case in emergency situations.


Fig. 8.18. The concept of braking assistance.
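The triggering logic described above can be summarized in a few lines (a sketch with assumed threshold values, not a production algorithm): the applied force is amplified only if the brake pedal is depressed faster than a critical rate.

// Sketch of the braking assistance concept (cf. Fig. 8.18).
// pedalForce:     force currently applied by the driver (0..1, normalized)
// pedalForceRate: rate of pedal application [1/s]
// The rate threshold and the gain of 4 are illustrative values only.
double assistedBrakeForce(double pedalForce, double pedalForceRate)
{
  const double emergencyRateThreshold = 5.0; // assumed critical pedal rate
  const double gain = 4.0;                   // amplification factor

  if (pedalForceRate > emergencyRateThreshold) {
    // very fast pedal activation: interpret as emergency braking
    double f = gain * pedalForce;
    return (f > 1.0) ? 1.0 : f;              // saturate at full braking
  }
  return pedalForce;                          // normal braking unchanged
}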



A special case of signal shaping is control input limiting. The structure of an airplane
is, e.g., dimensioned for a range of permitted flight maneuvers. In order to
prevent a pilot from executing illegal trajectories, like flying turns that are too narrow at
excessive speeds, steering inputs are confined with respect to the exerted accelerations
(g-force limiters). Another example is traction control systems in cars, where
inputs to the accelerator pedal are limited to prevent the wheels from slipping.
Input shaping by smoothing plays an important role in human-robot interaction
and teleoperation. Consider, e.g., a surgeon using a robot actuator over an extended
period of time. As a consequence of strain his muscles will unavoidably show tremor
that cross talks into the machine. This tremor can be smoothed out to guarantee
steady movements of the robot tool.
Adaptive filtering of moving platform vibration cross talk into control input
In vehicle operations control inputs may be affected by moving platform vibrations,
which must be disposed of by adaptive filtering. This method makes use of the
fact that the noise n_0 contained in the control signal s is correlated with a reference
noise n_1 acquired at a different location in the vehicle [14, 29, 40]. The functioning
of adaptive filtering is illustrated by (Fig. 8.19), where s, n_0, n_1, and y are assumed to be
stationary and have zero means:


Fig. 8.19. Adaptive filtering of moving platform vibrations.

\delta = s + n_0 - y \qquad (8.8)
Squaring (8.8) and taking into account that s is correlated with neither n_0 nor y yields
the expectation E:

E[\delta^2] = E[s^2] + E[(n_0 - y)^2] + 2E[s(n_0 - y)] = E[s^2] + E[(n_0 - y)^2] \qquad (8.9)

Adapting the filter to minimize E[δ²] will minimize E[(n_0 − y)²] but not affect the
signal power E[s²]. The filter output y therefore is the best least squares estimate of
the noise n_0 contained in the manual control input signal. Subtracting y according to
(8.8) results in the desired noise canceling.
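A common realization of such an adaptive filter is the least-mean-squares (LMS) algorithm treated in [14, 29]. The sketch below (illustrative only, fixed step size and at least one filter tap assumed) estimates n_0 from the reference noise n_1 and subtracts the estimate from the contaminated control input:

#include <vector>

// Sketch: LMS adaptive noise canceller, cf. Fig. 8.19 and equations (8.8), (8.9).
class LmsNoiseCanceller {
public:
  LmsNoiseCanceller(std::size_t taps, double stepSize)
    : w_(taps, 0.0), x_(taps, 0.0), mu_(stepSize) {}

  // input:     s + n0 (manual control input plus platform noise)
  // reference: n1     (noise measured elsewhere on the platform)
  // returns:   delta = s + n0 - y, the cleaned control signal
  double process(double input, double reference) {
    // shift the reference noise into the filter's delay line
    for (std::size_t i = x_.size() - 1; i > 0; --i) x_[i] = x_[i - 1];
    x_[0] = reference;

    // filter output y: least squares estimate of n0
    double y = 0.0;
    for (std::size_t i = 0; i < w_.size(); ++i) y += w_[i] * x_[i];

    const double delta = input - y;          // equation (8.8)

    // LMS weight update: minimize E[delta^2], cf. equation (8.9)
    for (std::size_t i = 0; i < w_.size(); ++i) w_[i] += 2.0 * mu_ * delta * x_[i];

    return delta;
  }

private:
  std::vector<double> w_; // adaptive filter weights
  std::vector<double> x_; // delay line of reference noise samples
  double mu_;             // adaptation step size
};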

Automation

If the required speed or accuracy of manual inputs needed in a dynamic system is
beyond human capabilities, manual control is not feasible and must be substituted
by permanent automation. Examples of permanent automation are stabilizers in aircraft,
which level out disturbances resulting from high frequency air turbulence.
Frequently it is assumed that automation simplifies the handling of complex systems
in any case. This is however not true, since the removal of subtasks from a working
routine truncates familiar working procedures. Users find it difficult to cope with
the task segments remaining for manual operation. Therefore it is widely accepted
that a user centered procedure must be followed in man-machine task allocation. In
order to be accepted, automated functions must, e.g., be transparent in the sense that
performance must correspond to user expectations. Any automode operation puts an
operator into an out-of-the-loop situation. Care has to be taken that the training level
of an operator does not deteriorate due to lack of active practice.
Besides permanent automation there is a need for optional on/off automation
of tasks during phases of excessive operator workload or for comfort. Autopilots
are, e.g., switched on regularly by pilots while cruising at high altitudes. Also moving
or grasping objects with robotic actuators is an application where on/off automation
makes sense. The manual control of the various degrees of freedom involved in manipulators
demands time and special skills on the part of the operator [6]. Instead, a
manipulator can be taught to perform selected skills just once. Subsequently the trained
skills may be called up as needed and executed autonomously [23, 31].
If automation is only temporary, decisions about function allocation and handover
procedures between man and machine have to be made, both of which influence system
safety and user acceptance [27].
The handover procedures for on/off automatic functions may follow different
strategies:
• Management by delegation (autonomous operation if switched on)
• Management by consent (autonomous operation following acknowledgement by
the operator)
• Management by exception (autonomous operation, optional check by the opera-
tor)
For the choice of one of these handover strategies the workload of the opera-
tor and the possible consequences of a malfunction are decisive. Dangerous actions
should never be delegated to autonomous operation, since unexpected software and
logic errors may occur anytime. Transition from manual to automated operation also
implies serious liability problems. If, e.g. an emergency braking is triggered auto-
matically and turns out to be erroneous in hindsight, insurance companies tend to
deny indemnity.
Intervention is a special form of temporary takeover by automation. The electronic
stabilization program (ESP) in cars, e.g., only becomes active if a tendency
towards instability is observed. In such cases it makes use of combined single wheel
braking and active steering to stabilize a car's yaw angle. In general the driver does
not even become aware of the fact that he has been assisted in stabilizing his car.

8.3 Assistance in Man-Machine Dialogs


In this section the needs for assistance in dialogs are substantiated. Then methods
for context of dialog identification are treated. Finally options for dialog support are
presented and discussed.

8.3.1 Needs for Assistance in Dialogs

Executing a dialog means performing a sequence of discrete events at the user
interface of an appliance in order to achieve a set goal. To succeed in this
task, inputs have to be made to formulate legal commands, or menu options have to
be selected and triggered in the right order. If an appliance provides many functions
and if display space is scarce, the menu tends to be hierarchically structured in many
levels (Fig. 8.20).


Fig. 8.20. Example menu hierarchy.

Executing a dialog requires two kinds of knowledge. On the one hand the user needs
application knowledge, which refers to knowing whether a function is available,
which service it provides, and how it is accessed. On the other hand, knowledge about the
menu hierarchy, the interface layout, and the labeling of functions must be familiar.
All this know-how is subsumed as interaction knowledge. Unfortunately only the
top layer menu items are visible to the user, while sublayers are hidden until activated.
Since a desired function may be buried in the depths of a multilayer dialog
hierarchy, menu selection requires a mental model of the dialog tree structure that
tells where a menu item is located in the menu tree and by which menu path it can
be reached.
An additional difficulty arises from the fact that the labeling of menu items can be
misleading, since many synonyms exist. Occasionally even identical labels
occur on different menu branches and for different functions. Finally the geometric
layout of the interface must be known, to be able to localize the correct control buttons.
All of this must be kept ready in memory. Unfortunately human working memory
is limited to only 7 ± 2 chunks. Also mental models are restricted in accuracy and
completeness and subject to forgetting, which forces the use of manuals or queries
to other people.

Depending on the available level of competence in the application and interaction
domain, users are classified as novices, trained specialists, application experts, and
interaction experts (Fig. 8.21). Assistance in dialogs is geared towards the reduction of
the described knowledge requirements. In particular for products designed for everybody's
use, like mobile phones or electronic appliances, interaction knowledge requirements
must be kept as low as possible to ensure positive out-of-the-box experiences on the
part of the buyers. As will be shown, identification of the context of dialog is the key
to achieving this goal.

Fig. 8.21. Application and interaction knowledge required in man-machine dialogs.

8.3.2 Context of Dialogs

The context of a dialog is determined by the state of the user, the state of the dialog
system itself, and the situation in which the dialog takes place, as depicted by (Fig.
8.22). In this figure parameters available for context of dialog identification are listed,
which will be discussed in more detail in the following sections.


Fig. 8.22. Context of dialog.



User State Identification

Data sources for user state identification are the user’s identity and inputs. From these
his skill level, preferences, and intentions are inferred. Skill level and preferences
are associated with the user’s identity. Entering a password is a simple method to
make oneself known to a system. Alternatively biometric access like fingerprints is
an option. Both approaches require action by the user. Nonintrusive methods rely
on observed user actions. Preferences may e.g. be derived from the frequency with
which functions are used.
The timing of keystrokes is an indicator of skill level. This method exploits the
fact that people deliberately choose interaction intervals between 2 and 5 seconds
when not forced to keep pace. Longer intervals are indicative of deliberation, while
shorter intervals point to a trial and error approach [24].
Knowing what a user plans to do is a crucial asset in providing support. At the
same time it is difficult to find out which plans are pursued, since prompting the user is
out of the question. The only applicable method then is keyhole intent recognition,
which is based on the observation of user inputs [9]. Techniques for plan identification from
logged data include Bayesian Belief Networks (BBNs), Dempster-Shafer theory,
and fuzzy logic [1, 2, 10, 18, 20]. Of these, BBNs turn out to be the most suitable (Fig.
8.23):


Fig. 8.23. The concept of plan recognition from observed user actions [17].

As long as plans are executed completely and flawlessly they can easily be inferred from action
sequences. Unfortunately users also act imperfectly, by typing flawed inputs or by
applying wrong operations or operations irrelevant to the set goal. Classification also has
to consider individual preferences among various alternative strategies to achieve
identical goals. A plan recognizer architecture that can cope with these problems
was proposed by [17] and is depicted in (Fig. 8.24):
The sequence of observations O in (Fig. 8.24) describes the logged user actions.
Feature extraction performs a filtering of O in order to get rid of flawed inputs, re-
sulting in a feature vector M. The plan library P contains all intentions I a user may
possibly have while interacting with an application. Therefore the library content often
is identical with the list of functions a system offers. The a priori probability of
plans can be made dependent on the context of use.

Fig. 8.24. Plan based interpretation of user actions (adopted from [17]).
The plan execution models PM are implemented as three layer Bayesian Belief
Networks, where nodes represent statistical variables, while links represent dependencies
(Fig. 8.25). User inputs are fed as features M into the net and combined into
syntactically and semantically meaningful operators R, which link to plan nodes and
serve to support or reject plan hypotheses.
The structure of and the conditional probabilities in a plan execution model PM
must cover all possible methods applicable to pursue a particular plan. The conditional
probabilities of a BBN plan execution model are derived from training data,
which must be collected during preceding empirical tests.

Fig. 8.25. General plan model structure.

For plan based classification, feature vector components are fed into the PM feature
nodes. In case of a feature match the corresponding plan model state variable is set
to true, while the state of all other feature nodes remains false. Matching refers to the
feature as well as to the feature sequence (Fig. 8.26). For each plan hypothesis the
probability of the observed actions being part of that plan is then calculated, and the
most likely plan is adopted. This classification procedure is iterated after every new
user input.

Fig. 8.26. Mapping feature vector components to plan model feature nodes (adopted from
[17]).
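To make the procedure concrete, the following strongly simplified sketch scores each plan hypothesis by a naive product of feature likelihoods instead of the full three layer belief network of [17]; all names, priors, and probabilities are illustrative assumptions:

#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Simplified sketch of plan based classification (not the BBN of [17]).
// Each plan model stores P(feature observed | plan); features not found in
// the map are assigned a small default probability.
struct PlanModel {
  std::string name;
  double prior;                                   // context dependent a priori probability
  std::map<std::string, double> featureLikelihood;
};

std::string mostLikelyPlan(const std::vector<PlanModel>& library,
                           const std::set<std::string>& observedFeatures)
{
  const double defaultLikelihood = 0.01; // assumed floor for unseen features
  std::string best;
  double bestScore = -1.0;

  for (std::size_t i = 0; i < library.size(); ++i) {
    double score = library[i].prior;
    for (std::set<std::string>::const_iterator f = observedFeatures.begin();
         f != observedFeatures.end(); ++f) {
      std::map<std::string, double>::const_iterator it =
          library[i].featureLikelihood.find(*f);
      score *= (it != library[i].featureLikelihood.end()) ? it->second
                                                          : defaultLikelihood;
    }
    if (score > bestScore) { bestScore = score; best = library[i].name; }
  }
  return best; // re-evaluated after every new user input
}

In the real system the plan execution models additionally encode admissible feature sequences and operator nodes, which a flat scoring like this cannot represent.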

Dialog System State Identification

Identification of dialog system state and history poses no problem if data related to
GUI state, user interaction, and the running application can be derived from the operating
system or the application software, respectively. Otherwise the expense needed to
collect these data may be prohibitive and prevent an assistance system from being
installed.

Dialog Situation Assessment

Essential attributes of a dialog situation are user whereabouts and concurrent user
activities. In the age of global positioning, ambient intelligence, and ubiquitous
computing, ample technical measures exist to locate and track people's whereabouts.
Also people may be equipped with various sensors (e.g. micro gyros) and wearable
computing facilities, which allow the identification of their motor activities.
Sensors may alternatively be integrated into handheld appliances, as implemented
in the mobile phone described by [36]. This appliance is instrumented with
sensors for temperature, light, touch, acceleration, and tilt angle. Evaluation of such
sensor data enables the recognition of user micro activities, like the phone being "held
in hand" or "located in pocket", from which situations like the user "walking", "boarding
a plane", "flying", or "being in a meeting" can be inferred.
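A rule based fusion of such sensor readings might look as follows (a sketch with assumed sensor values and thresholds, not the implementation of [36]):

#include <string>

// Sketch: inferring a micro activity of a mobile phone from raw sensors.
// All field names and thresholds are illustrative assumptions.
struct PhoneSensors {
  double lightLevel;    // ambient light (0 = dark, 1 = bright)
  double tiltVariance;  // variance of the tilt angle over the last seconds
  bool   touched;       // touch sensor on the casing
};

std::string inferMicroActivity(const PhoneSensors& s)
{
  if (s.lightLevel < 0.1 && !s.touched)   return "located in pocket";
  if (s.touched && s.tiltVariance > 0.05) return "held in hand";
  if (s.tiltVariance < 0.01)              return "lying on a table";
  return "unknown";
}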

8.3.3 Support of Dialog Tasks

According to ISO/DIS 9241-110 (2004) the major design principles for dialogs
refer to functionality and usability. As far as system functionality is concerned,
a dialog must be tailored to the requirements of the task at hand. No fewer, but also
no more functions than needed must be provided. Hereby user perceived complexity
is kept as low as possible. The second design goal, usability, is substantiated by the
following attributes:

• Self-descriptiveness
• Conformity with expectations
• Error tolerance
• Transparency
• Controllability
• Suitability for individualization
Self-descriptiveness requires that the purpose of single system functions or menu
options is readily understood by the user or can be explained by the system on
request. Conformity with expectations requires that system performance complies
with user anticipations originating from previous experiences or user training. Error
tolerance suggests that a set plan can be achieved in spite of input errors or with
only minor corrections. Transparency refers to GUI design and message formatting,
which must be matched to human perceptual, behavioral, and cognitive processes.
Controllability of a dialog allows the user to adjust the speed and sequence of inputs
as individually preferred. Suitability for individualization finally describes the
adaptability of the user interface to individual preferences as well as to language and
culture.
Dialog assistance is expected to support the stated dialog properties. A system architecture
in accordance with this goal is proposed in (Fig. 8.27). It takes dialog context,
dictionaries, and a plan library as inputs. The generated assistance is concerned with
various aspects of information management and data entry management, as detailed
in the following sections.


Fig. 8.27. Assistance in dialogs.

Information Management

Information management is concerned with the question which data should be presented
to the user, when, and how. Different aspects of information management are
codes, formats, tutoring, functionality shaping, prompting, timing, and prioritizing.

Display codes and formats


When information is to be presented to a user, the right decisions have to be made
with respect to display coding and format in order to ensure self-descriptiveness and
transparency. Various options for information coding exist, like text, graphics, animation,
video, sound, speech, force, and vibrations. The selection of a suitable code
and of the best sensory modality depends on the context of use. Extensive literature
in human factors engineering and many guidelines exist on these topics and
may be consulted, see e.g. [38].
Tutoring
Tutoring means context-dependent provision of information for teaching purposes.
Following successful intent recognition, step-by-step advice can be given for
successful task completion. Also hints to shortcuts and to seldom used functions
may be provided. Descriptions and explanations of functions may be offered, with
their granularity adapted to the user's skill level. Individualization of the GUI is achieved
by adapting menu structure and labeling to user preferences. Augmented reality
makes it possible to overlay tutoring instructions over the real world on a head mounted
display [4, 12].
Functionality shaping
In order to achieve suitability for the task, many applications offer the user various
functional levels to choose from, which are graded from basic to expert functions.
This is reflected in the menu tree by suppressed or locked menu items, which generally
are rendered grey, but keep their place in the menu hierarchy so as to ensure
constancy in menu appearance.
Prompting, timing, and prioritizing
Information may be prompted by the user or show up unsolicited, depending on the
context of use. Also the timing of information, as well as its duration and frequency,
can be made context-dependent. Prioritizing messages and alerts according to their
urgency is a means to fight information overload, e.g. in process control situations.
As an example of advanced information management consider the electronically
centralized aircraft monitoring system (ECAM) installed in Airbus commercial airplanes,
which is designed to assist pilots in resource and error management. The
appertaining information is presented on a warning display and a system display screen in
the center of the cockpit panel. The top left warning display in (Fig. 8.28) illustrates
an unsolicited warning as a consequence of a hydraulic system breakdown. The top
right system display shows, also unsolicited, the flow diagram of the three redundant
onboard hydraulic systems, which are relevant for error diagnosis. Here the pilot can
see at a glance that the blue system has lost pressure.
Next the pilot may call up the following two pages, as depicted in the lower part
of (Fig. 8.28), for consultation. Here the warning display lists the control surfaces
affected by the hydraulic failure, e.g. SLATS SYS 1 FAULT, and the system
display shows where those surfaces are located on the wings. The lower part of the warning
display informs the pilot about failure consequences. LDG DISTANCE: MULTIPLY
BY 1.2, e.g., indicates that the landing distance will be stretched by 20% due to defective
control surfaces.

Fig. 8.28. Electronically Centralized Aircraft Monitoring System (ECAM).

The ECAM system employs a sophisticated concept for presenting warnings to
the pilot. As soon as an error appears, various phases of error management are passed
through, i.e. warning, error identification, and error isolation. Three types of alarms are
distinguished and treated differently. Single alarms have no consequences for other
components. Primary alarms indicate major malfunctions, which will be the cause
of a larger number of secondary alarms to follow. In normal operation the cockpit is
"dark and silent". Alarms are graded as to their urgency. An emergency alarm, e.g.,
requires immediate action. In an exceptional situation some latency in reaction can
be tolerated. Warnings call upon the pilot for enhanced supervision, while simple
messages require no action at all.

Data Entry Management

Data entry often poses a problem, especially during the mobile use of small appliances.
Also people with handicaps need augmentative and alternative communication
(AAC). The goal of data entry management is to provide suitable input media, to accelerate
text input by minimizing the number of inputs needed, and to cope with input
errors. Recently multimodal interaction has received increasing attention, since input by
speech, gesture, mimics, and haptics can substitute for keyboards and mice altogether
(see the related chapters in this book).
For spelling-based systems in English, the number of symbols is at least twenty-
seven (space included), and more when the system offers options to select from a
number of lexical predictions. A great variety of special input hardware has been
devised to enable selection of these symbols on small hand-held devices.

The simplest solution is to assign multiple letters to each key. The user then is
required to press the appropriate key a number of times for a particular letter to
be shown. This makes entering text a slow and cumbersome process. For wearable
computing, keyboards have also been developed which can be operated with one hand.
The thumb operates a track point and function keys on the back side, while the other
four fingers activate keys on the front side. Key combinations (chords) permit input
of all signs.
Complementary to such hardware, prediction, completion, and disambiguation
algorithms are used to curtail the needed number of inputs. Prediction is based on
statistics of "n-grams", i.e. groups of n letters as they occur in sequence in words
[28]. Disambiguation of signs is based on letter-by-letter or word-level comparison
[3]. In any case reference is made to built-in dictionaries. Some applications are:
Predictive text
For keyboards where multiple letters are assigned to each key, predictive text reduces
the number of key presses necessary to enter text. By restricting the available
words to those in a dictionary, each key needs to be pressed only once. It allows, e.g.,
the word "hello" to be typed with 5 key presses instead of 12.
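The dictionary lookup behind this scheme can be sketched as follows (illustrative code, a standard phone keypad assumed): every dictionary word is mapped to its key sequence, and a typed digit sequence such as "43556" retrieves the matching candidates, here "hello".

#include <cstddef>
#include <string>
#include <vector>

// Sketch: predictive text disambiguation on a phone keypad.
// Maps a lowercase letter to its key digit, e.g. 'h' -> '4', 'e' -> '3'.
char keyForLetter(char c) {
  static const char* groups[8] = {"abc", "def", "ghi", "jkl",
                                  "mno", "pqrs", "tuv", "wxyz"};
  for (int k = 0; k < 8; ++k)
    for (const char* p = groups[k]; *p; ++p)
      if (*p == c) return static_cast<char>('2' + k);
  return '?'; // character not on the keypad
}

// Convert a dictionary word to its key sequence ("hello" -> "43556").
std::string keySequence(const std::string& word) {
  std::string keys;
  for (std::size_t i = 0; i < word.size(); ++i) keys += keyForLetter(word[i]);
  return keys;
}

// Return all dictionary words matching the typed key sequence.
std::vector<std::string> candidates(const std::vector<std::string>& dictionary,
                                    const std::string& typedKeys) {
  std::vector<std::string> result;
  for (std::size_t i = 0; i < dictionary.size(); ++i)
    if (keySequence(dictionary[i]) == typedKeys) result.push_back(dictionary[i]);
  return result;
}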
Auto completion
Auto-completion or auto-fill is a feature of text-entry fields that automatically
completes typed entries with the best guess of what the user may intend to enter, such
as pathnames, URLs, or long words, thus reducing the amount of typing necessary
to enter long strings of text. It uses techniques similar to predictive text.
Do what I mean
In do-what-I-mean functions an input string is compared with all dictionary entries.
An automatic correction is made if both strings differ by one character, if one
character is inserted or deleted, or if two adjacent characters are transposed. Sometimes even
transposed adjacent sub-words are recognized (existsFile == fileExists).
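A possible realization of this comparison (a sketch, not tied to any particular editor or dictionary) checks whether the input string and a dictionary entry are within one editing operation; for example, withinOneEdit("teh", "the") is true because of the transposition:

#include <cstddef>
#include <string>

// Sketch: "do what I mean" check -- are two strings within one editing
// operation (substitution, insertion, deletion, adjacent transposition)?
bool withinOneEdit(const std::string& a, const std::string& b)
{
  if (a == b) return true;

  if (a.size() == b.size()) {
    // candidate: one substitution or one adjacent transposition
    std::size_t i = 0;
    while (i < a.size() && a[i] == b[i]) ++i;                        // first mismatch
    if (a.substr(i + 1) == b.substr(i + 1)) return true;             // substitution
    if (i + 1 < a.size() && a[i] == b[i + 1] && a[i + 1] == b[i] &&
        a.substr(i + 2) == b.substr(i + 2)) return true;             // transposition
    return false;
  }

  // candidate: one insertion or deletion
  const std::string& shorter = (a.size() < b.size()) ? a : b;
  const std::string& longer  = (a.size() < b.size()) ? b : a;
  if (longer.size() != shorter.size() + 1) return false;
  std::size_t i = 0;
  while (i < shorter.size() && shorter[i] == longer[i]) ++i;         // first mismatch
  return shorter.substr(i) == longer.substr(i + 1);                  // skip one char
}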
Predictive scanning
For menu item selection, single-switch scanning can be used with a simple linear
scan, thus requiring only one switch activation per key selection. Predictive scanning
alters the scan such that keys are skipped when they do not correspond to any
letter which occurs in the database at the current point in the input key sequence.
This is a familiar feature for destination input in navigation systems.

8.4 Summary
Technically there are virtually no limits to the complexity of systems and appliances.
The only limits appear to be user performance and acceptance. Both threaten to encumber
innovation and prevent success in the market place. In this chapter the concept
of assistance has been put forward as a remedy. This approach is based on the
hypothesis that man-machine system usability can be improved by providing user
assistance adapted to the actual context of use.
First the general concept of user assistance was described, which involves context
of use identification as a key issue. Subsequently this concept was elaborated for the
two main tasks facing the user in man-machine interaction, i.e. manual control and
dialogs.
Needs for assistance in manual control were identified. Then attributes of context
of use for manual control tasks were listed and specified for a car driving situation.
Also methods for context identification were discussed, which include parametric
and black-box approaches to driver modeling. Finally three methods of providing
support to a human controller were identified. These are information management,
control input management, and automation.
The second part of the chapter was devoted to assistance in man-machine dialogs.
First support needs in dialogs were identified. This was followed by a treatment of the
variables describing a dialog context, i.e. user state, dialog system state, and dialog
situation. Several methods for context identification were discussed with emphasis on
user intent recognition. Finally two methods of providing assistance in dialog tasks
were proposed. These are information management and data entry management.
In summary it was demonstrated how assistance based on context of use can
be a huge asset to the design of man-machine systems. However, the potential of
this concept remains yet to be fully exploited. Progress relies mainly on improved
methods for context of use identification.

References
1. Akyol, S., Libuda, L., and Kraiss, K.-F. Multimodale Benutzung adaptiver Kfz-
Bordsysteme. In Jürgensohn, T. and Timpe, K.-P., editors, Kraftfahrzeugführung, pages
137–154. Springer, 2001.
2. Albrecht, D. W., Zukerman, I., and Nicholson, A. Bayesian Models for Keyhole Plan
Recognition in an Adventure Game. User Modeling and User-Adapted Interaction, 8(1–
2):5–47, 1998.
3. Arnott, A. L. and Javed, M. Y. Probabilistic Character Disambiguation for Reduced
Keyboards Using Small Text Samples. Augmentative and Alternative Communication,
8(3):215–223, 1992.
4. Azuma, R. T. A Survey of Augmented Reality. Presence: Teleoperators and Virtual
Environments, 6(4):355–385, 1997.
5. Belz, S. M. A Simulator-Based Investigation of Visual, Auditory, and Mixed-Modality
Display of Vehicle Dynamic State Information to Commercial Motor Vehicle Operators.
Master’s thesis, Virginia Polytechnic Institute, Blacksburg, Virginia, USA, 1997.
6. Bien, Z. Z. and Stefanov, D. Advances in Rehabilitation Robotics. Springer, 2004.
7. Boff, K. R., Kaufmann, L., and Thomas, J. P. Handbook of Perception and Human Per-
formance. Wiley, 1986.
8. Breazeal, C. Recognition of Affective Communicative Intent in Robot-Directed Speech.
Autonomous Robots, 12(1):83–104, 2002.
9. Canberry, S. Techniques for Plan Recognition. User Modeling and User-Adapted Inter-
action, 11(1–2):31–48, 2001.
10. Charniak, E. and Goldmann, R. B. A Bayesian Model of Plan Recognition. Artificial
Intelligence, 64(1):53–79, 1992.
11. Dickmanns, E. D. The Development of Machine Vision for Road Vehicles in the last
Decade. In Proceedings of the IEEE Intelligent Vehicle Symposium, volume 1, pages
268–281. Versailles, June 17–21 2002.

12. Friedrich, W. ARVIKA Augmented Reality für Entwicklung, Produktion und Service. Pub-
licis Corporate Publishing, Erlangen, 2004.
13. Garrel, U., Otto, H.-J., and Onken, R. Adaptive Modeling of the Skill- and Rule-Based
Driver Behavior. In The Driver in the 21st Century, VDI-Berichte 1613, pages 239–261.
Berlin, 3.–4. Mai 2001.
14. Haykin, S. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition,
1991.
15. Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle
River, NJ, 2nd edition, 1999.
16. Hirzinger, G., Brunner, B., Dietrich, J., and Heindl, J. Sensor-Based Space Robotics -
ROTEX and its Telerobotic Features. IEEE Transactions on Robotics and Automation
(Special Issue on Space Robotics), 9(5):649–663, October 1993.
17. Hofmann, M. Intentionsbasierte maschinelle Interpretation von Benutzeraktionen. Dis-
sertation, Technische Universität München., 2003.
18. Hofmann, M. and Lang, M. User Appropriate Plan Recognition for Adaptive Interfaces.
In Smith, M. J., editor, Usability Evaluation and Interface Design: Cognitive Engineer-
ing, Intelligent Agents and Virtual Reality. Proceedings of the 9th International Confer-
ence on Human-Computer Interaction, volume 1, pages 1130–1134. New Orleans, August
5–10 2001.
19. Jackson, C. A Method for the Direct Measurement of Crossover Model Parameters. IEEE
MMS, 10(1):27–33, March 1969.
20. Jameson, A. Numerical Uncertainty Management in User and Student Modeling: An
Overview of Systems and Issues. User Modeling and User-Adapted Interaction. The
Journal of Personalization Research, 5(4):193–251, 1995.
21. Ji, Q. and Yang, X. Real-Time Eye, Gaze, and Face Pose Tracking for Monitoring Driver
Vigilance. Real-Time Imaging, 8(5):357–377, 2002.
22. Knoll, P. The Night Sensitive Vehicle. In VDI-Report 1768, pages 247–256. VDI, 2003.
23. Kragic, D. and Christensen, H. F. Robust Visual Servoing. The International Journal of
Robotics Research, 22(10–11):923 ff, 2003.
24. Kraiss, K.-F. Ergonomie, chapter Mensch-Maschine Dialog, pages 446–458. Carl Hanser,
München, 3rd edition, 1985.
25. Kraiss, K.-F. Fahrzeug- und Prozessführung: Kognitives Verhalten des Menschen und
Entscheidungshilfen. Springer Verlag, Berlin, 1985.
26. Kraiss, K.-F. Implementation of User-Adaptive Assistants with Neural Operator Models.
Control Engineering Practice, 3(2):249–256, 1995.
27. Kraiss, K.-F. and Hamacher, N. Concepts of User Centered Automation. Aerospace
Science Technology, 5(8):505–510, 2001.
28. Kushler, C. AAC Using a Reduced Keyboard. In Proceedings of the CSUN California
State University Conference, Technology and Persons with Disabilities Conference. Los
Angeles, March 1998.
29. Larimore, M. G., Johnson, C. R., and Treichler, J. R. Theory and Design of Adaptive
Filters. Prentice Hall, 1st edition, 2001.
30. Mann, M. and Popken, M. Auslegung einer fahreroptimierten Mensch-Maschine-
Schnittstelle am Beispiel eines Querführungsassistenten. In GZVB, editor, 5. Braun-
schweiger Symposium ”Automatisierungs- und Assistenzsysteme fuer Transportmittel”,
pages l82–108. 17.–18. Februar 2004.
31. Matsikis, A., Zoumpoulidis, T., Broicher, F., and Kraiss, K.-F. Learning Object-Specific
Vision-Based Manipulation in Virtual Environments. In IEEE Proceedings of the 11th
International Workshop on Robot and Human Interactive Communication ROMAN 2002,
pages 204–210. Berlin, September 25–27 2002.

32. Neumerkel, D., Rammelt, P., Reichardt, D., Stolzmann, W., and Vogler, A. Fahrermodelle
- Ein Schlüssel für unfallfreies Fahren? Künstliche Intelligenz, 3:34–36, 2002.
33. Nickerson, R. S. On Conversational Interaction with Computers. In Treu, S., editor,
Proceedings of the ACM/SIGGRAPH Workshop on User-Oriented Design of Interactive
Graphics Systems, pages 101–113. Pittsburgh, PA, October 14–15 1976.
34. Pao, Y. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, 1989.
35. Reintsema, D., Preusche, C., Ortmaier, T., and Hirzinger, G. Towards High Fidelity Telep-
resence in Space and Surgery Robotics. Presence- Teleoperators and Virtual Environ-
ments, 13(1):77–98, 2004.
36. Schmidt, A. and Gellersen, H. W. Nutzung von Kontext in ubiquitären Informationssyste-
men. it+ti - Informationstechnik und technische Informatik, Sonderheft: Ubiquitous
Computing - der allgegenwärtige Computer, 43(2):83–90, 2001.
37. Schraut, M. Umgebungserfassung auf Basis lernender digitaler Karten zur vorauss-
chauenden Konditionierung von Fahrerassistanzsystemen. Ph.D. thesis, Technische Uni-
versität München, 2000.
38. Shneiderman, B. and Plaisant, C. Designing the User Interface. Strategies for Effective
Human-Computer Interaction. Addison Wesley, 4th edition, 2004.
39. Specht, D. F. A General Regression Neural Network. IEEE Transactions on Neural
Networks, 2(6):568–576, 1991.
40. Velger, M., Grunwald, A., and Merhav, S. J. Adaptive Filtering of Biodynamic Stick
Feedthrough in Manipulation Tasks on Board Moving Platforms. AIAA Journal of Guid-
ance, Control, and Dynamics, 11(2):153–158, 1988.
41. Wahlster, W., Reitiger, N., and Blocher, A. SMARTKOM: Multimodal Communication
with a Life-Like Character. In Proceedings of the 7th European Conference on Speech
Communication and Technology, volume 3, pages 1547–1550. Aalborg, Denmark, Sep-
tember 3–7 2001.
42. Wickens, C. D. Engineering Psychology and Human Performance. Columbus. Merrill,
1984.
Appendix A

LTI-LIB —
a C++ Open Source Computer Vision Library
Peter Dörfler and José Pablo Alvarado Moya

The LTI-LIB is an open source software library that contains a large collection
of algorithms from the field of computer vision. Its roots lie in the need for more
cooperation between different research groups at the Chair of Technical Computer
Science at RWTH Aachen University, Germany. The name of the library stems from
the German name of the Chair: Lehrstuhl für Technische Informatik.
This chapter gives a short overview of the LTI-LIB that provides the reader with
sufficient background knowledge for understanding the examples throughout this
book that use this library. It can also serve as a starting point for further exploration
and use of the library. The reader is expected to have some experience in programming
with C++.
The main idea of the LTI-LIB is to facilitate research in different areas of computer
vision. This leads to the following design goals:
• Cross platform:
It should at least be possible to use the library on two main platforms: Linux and
Microsoft Windows© .
• Modularity:
To enhance code reusability and to facilitate exchanging small parts in a chain of
operations the library should be as modular as possible.
• Standardized interface:
All classes should have the same or very similar interfaces to increase interoper-
ability and flatten the user’s learning curve.
• Fast code:
Although not the main concern, the implementations should be efficient enough
to allow the use of the library in prototypes working in real time.

Based on these goals the following design decisions were made:


• Use an object oriented programming language to enforce the standardized inter-
face and facilitate modularity.
• For the same reasons and also for better code reuse the data structures should be
separated from the functionality unless the data structure itself allows only one
way of manipulation.
• To gain fast implementations C++ was chosen as the programming language.
With the GNU Compiler Collection (gcc) and Microsoft Visual C++ compilers
are available for both targeted platforms.

The LTI-LIB is licensed under the GNU Lesser General Public License (LGPL)1
to encourage collaboration with other researchers and allow its use in commercial
products. This has already led to many enhancements of the existing code base and
also to some contributions from outside the Chair of Technical Computer Science.
From the number of downloads the user base can be estimated at a few thousand
researchers.
This library differs from other open source software projects in its design: it
follows an object oriented paradigm that makes it possible to hide the configuration of the
algorithms, making it easier for inexperienced users to start working with them. At
the same time it allows more experienced researchers to take complete control of
the software modules as well. The LTI-LIB also exploits the advantages of generic
programming concepts for the implementation of container and functional classes,
seeking a compromise between "pure" generic approaches and a limited size for
the final code.
This chapter only gives a short introduction to the fundamental concepts of the
LTI-LIB. We focus on the functionality needed for working with images and im-
age processing methods. For more detailed information please refer to one of the
following resources:
• project webpage: http://ltilib.sourceforge.net
• online manual: http://ltilib.sourceforge.net/doc/html/index.shtml
• wiki: http://ltilib.pdoerfler.com/wiki
• mailing list: ltilib-users@lists.sourceforge.net
The LTI-LIB is the result of the collaboration of many people. As the current
administrators of the project we would like to thank Suat Akyol, Ulrich Canzler,
Claudia Gönner, Lars Libuda, Jochen Wickel, and Jörg Zieren for the parts they
play(ed) in the design and maintenance of the library, in addition to their coding.
We would also like to thank the numerous people who contributed their time and the
resulting code to make the LTI-LIB such a large collection of algorithms:
Daniel Beier, Axel Berner, Florian Bley, Thorsten Dick, Thomas Erger, Hel-
muth Euler, Holger Fillbrandt, Dorothee Finck, Birgit Gehrke, Peter Gerber, Ingo
Grothues, Xin Gu, Michael Haehnel, Arnd Hannemann, Christian Harte, Bastian
Ibach, Torsten Kaemper, Thomas Krueger, Frederik Lange, Henning Luepschen, Pe-
ter Mathes, Alexandros Matsikis, Ralf Miunske, Bernd Mussmann, Jens Pausten-
bach, Norman Pfeil, Vlad Popovici, Gustavo Quiros, Markus Radermacher, Jens Ri-
etzschel, Daniel Ruijters, Thomas Rusert, Volker Schmirgel, Stefan Syberichs, Guy
Wafo Moudhe, Ruediger Weiler, Benjamin Winkler, Xinghan Yu, Marius Wolf

A.1 Installation and Requirements

The LTI-LIB does not in general require additional software to be usable. However,
it becomes more useful if some complementary libraries are utilized. These are in-

1 http://www.gnu.org/licenses/lgpl.html

cluded in most distributions of Linux and are easily installed. For Windows systems
the most important libraries are included in the installer found on the CD.
The script language Perl should be available on your computer for maintenance
and to build some commonly used scripts. It is included on the CD for Windows, and
is usually installed on Linux systems. The Gimp Tool Kit (GTK, http://www.gtk.org)
should be installed as it is needed for almost all visualization tools. To read and write
images using the Portable Network Graphics (PNG) and the Joint Photographic Experts
Group (JPEG) image formats you either need the relevant files from the ltilib-extras
package, which are adaptations of the Colosseum Builders C++ Image Library, or you
have to provide access to the freely available libraries libjpeg (http://www.ijg.org) and
libpng (http://www.libpng.org). Additionally, the data compression library zlib
(http://www.zlib.net) is useful for some I/O functions.
The LTI-LIB provides interfaces to some functions of the well-known LAPACK
(Linear Algebra PACKage, http://netlib.org/lapack/) for optimized algorithms. On
Linux systems you need to install lapack, blas, and the Fortran-to-C translation tool
f2c (including the corresponding development packages, which in distributions like
SuSE or Debian are always denoted with an additional "-dev" or "-devel" postfix). The
installation of LAPACK on Windows systems is a bit more complicated. Refer to
http://ltilib.pdoerfler.com/wiki/Lapack for instructions.
The installation of the LTI-L IB itself is quite straightforward on both systems.
On Windows execute the installer program (ltilib/setup-ltilib-1.9.15-activeperl.exe).
This starts a guided graphical setup program where you can choose among some of
the packages mentioned above. If unsure we recommend installing all available files.
Installation on Linux follows the standard procedure for configuration and instal-
lation of source code packages established by GNU as a de facto standard:
cd $HOME
cp /media/cdrom/ltilib/*.gz .
tar -xzvf 051124_ltilib_1.9.15.tar.gz
cd ltilib
tar -xzvf ../051124_ltilib-extras_1.9.15.tar.gz
cd linux
make -f Makefile.cvs
./configure
make
Here it has been assumed that you want the sources in a subdirectory called
ltilib within your home directory, and that the CD has been mounted into
/media/cdrom. The next line unpacks the tarballs found on the CD in the direc-
tory ltilib. In the file name, the first number of six digits specifies the release date
of the library, where the first two digits relate to the year, the next two indicate the
month and the last two the day. You can check for newer versions on the library
homepage. The sources in the ltilib-extras package should be unpacked within the
directory ltilib created by the first tarball. The next line is optional, and creates all
configuration scripts. For it to work the auto-configure packages have to be installed.
You can however simply use the provided configure script as indicated above.
Additional information can be obtained from the file ltilib/linux/README.
You can of course customize your library via additional options of the configure
script (call ./configure --help for a list of all available options). To install the
libraries you will need write privileges on the --prefix directory (which defaults
to /usr/local). Switch to a user with the required privileges (for instance su
root) and execute make install. If you have doxygen and graphviz installed
on your machine, and you want to have local access to the API HTML documenta-
tion, you can generate it with make doxydoc. The documentation for release
1.9.15, which comes with this book, is already present on the CD.

A.2 Overview
Following the design goals, data structures and algorithms are mostly separated in
the LTI-L IB. The former can be divided into small types such as points and pix-
els and more complex types ranging from (algebraic) vectors and matrices to trees
and graphs. Data structures are discussed in section A.3. Algorithms and other func-
tionality are encapsulated in separate classes that can have parameters and a state and
manipulate the data given to them in a standardized way. These are the subject of sec-
tion A.4. Before dealing with those topics, we cover some general design concepts
used in the library.

A.2.1 Duplication and Replication

Most classes used for data structures and functional objects support several mecha-
nisms for duplication and replication. The copy constructor is typically used when
initializing a new object with the contents of another one. The copy() method and
operator= are equivalent and copy all data contained in another instance; as a
matter of style, the former is preferred within the library, as it makes immediately
visible that the object belongs to an LTI-LIB class rather than being an elementary
data type (like an integer or floating point value). As LTI-LIB instances might
represent or contain images or other relatively large data structures, it is important
to keep in mind that such a replication may take some time, and it is therefore useful
to explicitly indicate where these expensive copies are made.
In class hierarchies it is often necessary to replicate an instance of a class through
a pointer to a common parent class. This is the case, for instance, when
implementing the factory design pattern. For these situations the LTI-LIB provides
a virtual clone() method.
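As an illustration, the following minimal sketch shows the clone() idiom behind this mechanism; the classes base and derived are made up for this example and are not LTI-LIB classes.

// Minimal sketch of polymorphic copying via a virtual clone() method
// (illustrative classes only, not part of the LTI-LIB).
class base {
public:
  virtual ~base() {}
  virtual base* clone() const = 0;      // replicate through a pointer to the base class
};

class derived : public base {
public:
  virtual derived* clone() const {      // covariant return type
    return new derived(*this);          // relies on the copy constructor
  }
};

void replicate(const base& obj) {
  base* copy = obj.clone();             // the exact dynamic type is preserved
  // ... use *copy like the original object ...
  delete copy;
}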

A.2.2 Serialization

The data of almost all classes of the LTI-LIB can be written to and read from
a stream. For this purpose an ioHandler is used that formats and parses the
stream in a specific way. At this time, two such ioHandlers exist. The first one,
lispStreamHandler, writes the data of the class in a Lisp-like fashion, i. e. each
attribute is stored as a list enclosed in parentheses. This format is human readable
and can be edited manually. However, it is not useful for large data sets due to its
large size and expensive parsing. For such applications the second ioHandler,
binaryStreamHandler, is better suited.
All elementary types of C++ and some more complex types of the STL (the
Standard Template Library, which comes with most compilers and provides common
container classes and algorithms), like std::string, std::vector, std::map
and std::list, can be read from and written to the LTI-LIB streams using a set of
global functions called read() and write(). Global functions also allow reading
and writing of almost all LTI-LIB types from and to the stream handlers, although,
as a matter of style, if a class defines its own read() and write() methods these
should be preferred over the global ones to indicate that the given object belongs to
the LTI-LIB.
The example in Fig. A.1 shows the principles of copying and serialization on an
LTI-LIB container, vector<double>, and an STL list of strings.

A.2.3 Encapsulation

All classes of the LTI-LIB reside in the namespace lti to avoid ambiguity when
other libraries use the same class names. This allows, for instance, the use of the
class name vector, even though the Standard Template Library already uses this
name. For the sake of readability the examples in this chapter omit the explicit scoping
of the namespace (i. e., the prefix lti:: before all class names is omitted).
Within the library, types regarding just one specific class are always defined
within that class. If you want to use those types you have to indicate the class scope
too (for example lti::filter::parameters::borderType). Even if this
is cumbersome to type, it is definitely easier to maintain, as the place of declaration
and the context are directly visible.
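As a minimal sketch of explicit scoping (the vector constructor follows the examples later in this chapter; the snippet is illustrative rather than a complete program):

#include "ltiVector.h"

// Fully qualified use of an LTI-LIB type without importing the namespace.
lti::vector<double> values(10, 2.0);   // a 10-element vector filled with 2.0

// Alternatively, import the namespace once and simply write vector<double>:
// using namespace lti;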

A.2.4 Examples

If you want to try out the examples in this chapter, the easiest way is to copy them
directly into the definition of the operator() of the class tester found in the
file ltilib/tester/ltiTester.cpp. The necessary header files to be in-
cluded are listed at the beginning of each example. Standard header files (e.g. cmath,
cstdlib, etc.) are omitted for brevity.


#include "ltiVector.h"
#include "ltiLispStreamHandler.h"
#include <list>

vector<double> v1(10,2.); // create a double vector


vector<double> v2(v1); // v2 is equal to v1
vector<double> v3;

v3=v2; // v3 is equal to v2
v3.copy(v2); // same as above

std::list<std::string> myList; // a standard list


myList.push_back("Hello");     // with two elements
myList.push_back("World!");

// create a lispStreamHandler and write v1 to a file


std::ofstream os("test.dat"); // the STL output file stream
lispStreamHandler lsh; // the LTI-Lib stream handler
lsh.use(os); // tell the LTI-Lib handler
// which stream to use

v1.write(lsh); // write the lti::vector


write(lsh,myList); // write the std::list through
// a global function
os.close(); // close the output file

// read the file into v4


vector<double> v4;
std::list<std::string> l2;

std::ifstream is("test.dat"); // STL input file stream


lsh.use(is); // tell the handler to read
// from the "is" stream
v4.read(lsh); // load the vector
read(lsh,l2); // load the list
is.close(); // close the input file

Fig. A.1. Serialization and copying of LTI-L IB classes.

A.3 Data Structures


The LTI-LIB uses a reduced set of fundamental types. Due to the lack of fixed-size
types in C/C++, the LTI-LIB defines a set of aliases at configuration time which are
mapped to standard types. If you want to ensure 32-bit types you should therefore use
int32 for signed integers or uint32 for the unsigned version. For 8-bit numbers
you can use byte and ubyte for the signed and unsigned versions, respectively.
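A short sketch of these aliases is given below; the header name ltiTypes.h is an assumption made for this example, so check the online documentation for the exact location of the type definitions.

#include "ltiTypes.h"   // assumed header providing the fixed-size aliases

lti::int32  offset = -100000;   // guaranteed 32-bit signed integer
lti::uint32 count  = 4000000u;  // 32-bit unsigned integer
lti::ubyte  grey   = 255;       // 8-bit unsigned value, e.g. a grey level
lti::byte   delta  = -8;        // 8-bit signed value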

When the size of the numeric representation is not important, the default
C/C++ types are used. In the LTI-LIB a policy of avoiding unsigned integer types
is followed, unless the whole data representation range is required (for example,
ubyte when values from 0 to 255 are to be used). This also applies to the
values used to represent the sizes of container classes, which are usually of type
int, as opposed to the unsigned int of other class libraries like the STL.
Most math and classification modules employ the double floating point precision
(double) types to allow a better conditioning of the mathematical computations in-
volved. Single precision numbers (float) are used where the memory requirements
are critical, for instance in grey valued images that require a more flexible numerical
representation than an 8-bit integer.
More involved data structures in the LTI-L IB (for instance, matrix and vector
containers) use generic programming concepts and are therefore implemented as
C++ template types. However, these are usually explicitly instantiated to reduce the
size of the final library code. Thus, you can only choose between the explicitly in-
stantiated types. Note that many modules are restricted to those types that are useful
for the algorithm they implement, i. e. linear algebra functors only work on float and
double data structures. This is a design decision in the LTI-L IB architecture, which
might seem in some cases rigid, but simplifies the maintenance and optimization of
code for the most frequently used cases. At this point we want to stress the differ-
ence between the LTI-LIB and other "pure" generic programming oriented libraries.
For the "generic" types, like genericVector or genericMatrix, you can also
create explicit instantiations for your own types, for which you need to include the
header files with the template implementations, usually denoted by a _template
postfix in the file name (for example, ltiGenericVector_template.h).
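The following hedged sketch illustrates this mechanism; the element struct and the constructor call are made up for the example, and only the header naming convention is taken from the text above.

#include "ltiGenericVector.h"
#include "ltiGenericVector_template.h"  // template implementations for own types

struct sample {        // a user-defined element type (illustrative)
  int   label;
  float score;
};

// Explicit use of the generic container with the user-defined type.
lti::genericVector<sample> data(100);   // 100 default-constructed elements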
The next sections introduce the most important object types for image processing.
Many other data structures exist in the library. Please refer to the online documenta-
tion for help.

A.3.1 Points and Pixels

The template classes tpoint<T> and tpoint3D<T> define two and three dimen-
sional points with attributes x, y, and if applicable z all of the template type T. For
T=int the type aliases point and point3D are available. Especially point is
often used as coordinates of a pixel in an image.
Pixels are defined similarly: trgbPixel<T> has three attributes red, green,
and blue. They should be accessed via the member functions like getRed().
However, the pixel class used for color images is rgbPixel which is a 32 bit data
structure. It has the three color values and an extra alpha value all of which are
ubyte. There are several reasons for a four byte long pixel representation of a typ-
ically 24 bit long value. First of all, in 32 bit or 64 bit processor architectures, this
is necessary to ensure the memory alignment of the pixels with the processor’s word
length, which helps to improve the performance of the memory access. Second, it
helps to improve the efficiency when reading images from many 32-bit based
framegrabbers. Third, some image formats store the 24-bit pixel information together with
an alpha channel, resulting in 32-bit long pixels.
All of these basic types support the fundamental arithmetic functions. The exam-
ple in Fig. A.2 shows the idea.

#include "ltiPoint.h"
#include "ltiRGBPixel.h"

// create an rgbPixel, set its values and divide them by 2


rgbPixel pix; // a 32-bit long pixel
pix.set(127,0,255); // set the three RGB values
pix.divide(2); // divide all components by 2
std::cout << pix << std::endl; // show the new components

// create two points and add the first one to the second
tpoint<float> p1(2.f,3.f); // create a 2D point
tpoint<float> p2(3.f,2.f); // and another one
p2.add(p1); // add both points and leave
// the result in p2
std::cout << p2 << std::endl; // show the result
Fig. A.2. This small code example shows how to create instances of some basic types and how
to do arithmetic operations on them.

A.3.2 Vectors and Matrices

Probably the most important types in the LTI-LIB are vector and matrix. Un-
like the well known STL classes these are vectors and matrices in the mathematical
sense and provide the appropriate arithmetic functions. Since LTI-L IB vectors and
matrices have been designed to support efficient data structures for image processing
applications, they also differ from the STL equivalents in the memory management
approach.
STL vectors, for instance, increase their memory dynamically by allocating a
larger block and copying the old contents by calling the elements’ copy constructors.
This can be very useful when the final size of a vector is not known, but it is not very
efficient.
On the other hand, since most image processing algorithms require very efficient
ways to access the data in their memory, the LTI-L IB vector and matrix classes
optimize the mechanisms to share, transfer, copy and access the memory among
different instances, but are less efficient in reserving and deallocating memory. For
example, matrices are organized in a row-wise manner and each row can be accessed
as a vector. In order to avoid copying the data of the matrix to a vector when the in-
formation is required as such, a pool of row vectors is stored within the matrix which
share the data with the matrix. The matrix memory is also a single adjacent and com-
pact memory block, that can therefore be easily traversed. Iteration on the elements
of the matrices or vectors has been optimized to allow boundary check in a code de-
bugging stage, but also to be as fast as C pointers when the code is finally released.
Additionally, complete memory transfers and exchange of data among matrix and
vector instances is allowed, which usually saves valuable time, since unnecessary
copying is avoided.
We will not go into a deeper detail of the memory management in the LTI-L IB
containers here, but rather point out some of the main container features. Figure A.3
shows a small example for vector matrix multiplication.

#include "ltiVector.h"
#include "ltiMatrix.h"

double tmp[]={1., 2., 1., 3.}; // data for a vector


dmatrix mat(2,2,tmp); // create a 2x2 matrix
dvector vec(2,2.); // create a 2D vector
vec.at(1) = 1.; // second element to 1
dvector res;
mat.multiply(vec, res); // multiply mat with vec
mat.at(0,1)*=2.; // at 1st row, 2nd column
mat[0][1]/=2.; // at 1st row, 2nd column

std::cout << res << std::endl; // show the results


std::cout << vec.dot(res) << std::endl; // dot product
Fig. A.3. Vector matrix multiplication. The result of the product of the 2 × 2 matrix mat and
vec is stored in res. All arithmetic functions also return the result as can be seen with the
dot product between the two vectors.

The example uses the abbreviated form dvector instead of vector<double>
to enhance readability. Two different ways of initializing
a vector/matrix are shown: using a pointer to an array of elements of the proper
type or through a single initialization value. Single elements can be accessed via
at() or the operator[], whereby the at() version should be preferred, first,
because it is faster (especially with matrices), and second, to indicate that you access
an LTI-L IB container. All possibilities of vector and matrix multiplications as well
as sum, difference and division by a scalar are provided. Further, mathematical
functions such as transpose and sum of elements as well as informational functions
such as minimum are available.
For convenience both classes have an apply() method which takes a C-style
function as argument. The function is then applied to each element of the vec-
tor/matrix. The example in Fig. A.4 shows how to use this function along with some
more element manipulation features.

#include "ltiVector.h"
#include "ltiMatrix.h"

fmatrix mat(2,2,4.f); // 2x2 matrix filled with 4


mat.getRow(0).fill(9.f); // fill first row with 9
mat[0].fill(9.); // also fills the first row with 9
mat.apply(sqrt); // sqrt() to all elements of mat
std::cout << mat << std::endl;
Fig. A.4. Using matrix::apply(): The matrix mat is initialized with the value 4 in all
elements. Then all elements in the first row are set to 9. Using the apply() function the
square root of all elements is calculated and left in the matrix.

The function getRow() returns a reference to a vector which is a row of the
matrix, as does the operator[]. The values of this reference can be manipu-
lated. In the example all elements of the row are set to 9. The example writes a 2 × 2
matrix on the console whose first row elements are 3 and second row elements are 2.

A.3.3 Image Types

The LTI-L IB provides four container classes for images: image represents color im-
ages, channel and channel8 are utilized as grey value images, and channel32
is mostly used to hold so-called label masks. Using the template functionality of
matrix we define the image types as matrices of rgbPixel, float, ubyte,
and int32, respectively.
The two grey valued image types have quite different objectives: channel8
uses 8-bit per pixel which is a common value for monochrome cameras. Algorithms
that work directly on these are usually very fast due, on the one hand, to the single-
byte fixed point arithmetic and on the other hand to the possibility of using look-up
tables to speed up computations as only 256 values are available. However, many
image processing methods return results with a wider range of values or need higher
precision. For these, the channel is more appropriate as its pixels employ single
precision floating point numbers.
For a regular monochrome channel the value range for a pixel lies between
0.0 and 1.0. Quite obviously a channel can have values outside of the normal-
ized interval. This property is often necessary, for example, when the magnitude of
an image gradient, or the output of a high-pass filter is computed and stored in a
channel. The image class represents a 24(32)-bit RGB(A) color image, i. e. 24
bits are used for the three colors and an additional byte is used for the alpha channel.
Color images are often split into three channels for processing (cf. A.5.3).
The pixel representation as a 32-bit fixed point arithmetic value, as found in
channel32 objects, is mostly employed to store pixel classification results, where
each integer number denotes a specific class, and the total number of pixel classes
may become greater than the capabilities of a channel8. This is the case, for in-
stance, in image segmentation results or in multiple object recognition systems.
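Since all four image types are just matrices of different pixel types, they can be created like the matrix examples of the previous section; the constructor forms and sizes below are therefore an assumption based on those examples.

#include "ltiImage.h"

lti::image     img(240, 320);                 // color image of rgbPixel
lti::channel   ch(240, 320, 0.0f);            // float grey values, typically in [0,1]
lti::channel8  ch8(240, 320, lti::ubyte(0));  // 8-bit grey values
lti::channel32 labels(240, 320, 0);           // 32-bit label mask, e.g. a segmentation result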

The advantage of using matrix for image types is twofold: the individual pixels
can easily be accessed and the calculation power of matrix is available for images
as well. The only additional functions in the image types are conversion functions
from other image types via castFrom(), which maintain a certain semantic mean-
ing in the conversion: for example, the range of 0 to 255 of a channel8 is linearly
mapped into the interval 0.0 to 1.0 for a channel. The example in Fig. A.5 shows
how easy it is to implement a transform of grey values to the correct interval between
0 and 1 using matrix functions.

#include "ltiImage.h"

// let ch contain values outside of [0; 1]


channel ch;
// find min and max value
float cmin,cmax;
ch.getExtremes(cmin,cmax);
// calc factor and offset
const float fac=1.f/(cmax-cmin);
const float off=cmin*fac;
ch.multiply(fac);
ch.subtract(off);
Fig. A.5. Using matrix functions to normalize grey values. Assuming that ch contains val-
ues outside the range [0,1] or in a much smaller interval we stretch/shrink and shift the grey
values so that they cover the complete [0,1] interval.

The channel ch could be the result of some previous processing step. First the
minimum and maximum values are sought. From these the required stretching factor
and corresponding offset are calculated and each element of the image matrix is
transformed accordingly. Note that operations like the above can be achieved more
efficiently by defining a function and then using the apply() member function
supplied by matrix. For this case in particular channel also offers the member
function mapLinear().

A.4 Functional Objects


Functional objects in the LTI-L IB consist of three parts: the functionality itself (i. e.,
the algorithms), parameters that can tune the behavior of the algorithm to the require-
ments of a user, and an internal state that holds trained data or partial results. Two
large groups of functional objects exist: functors and classifiers. While the former
mostly behave like a function the latter need to be trained first to be useful. Since the
basic usage of functors and classifiers is very similar we’ll discuss functors in detail
first. Then only the differences of classifiers with respect to functors are elaborated.

Fig. A.6. A functor has three parts: functionality, parameters, and state. The diagram shows
their interaction.

A.4.1 Functor

A functor consists of three main parts: functionality, parameters, and state. The func-
tionality of the functor is accessed via the apply() methods. In many cases there
are on-copy and on-place versions of these methods, so the user can choose whether
the output of the algorithm should overwrite the data in the container objects used
as input, or whether the results should be stored in container objects passed specially
for this purpose as arguments of the apply() methods. The functionality can be
modified via the parameters of the functor. Finally, some functors require a state, for
example, for keeping track of previous input data. Figure A.6 shows the three parts
of a functor and their interaction.
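Both apply() flavours appear in the convolution example of Sect. A.5.1; the following fragment merely isolates them (the kernel and parameter setup are omitted, see Fig. A.8).

#include "ltiImage.h"
#include "ltiConvolution.h"

lti::convolution conv;      // kernel and parameters set as in Fig. A.8
lti::channel src, dest;

conv.apply(src);            // on-place: the result overwrites src
conv.apply(src, dest);      // on-copy: src stays untouched, the result goes to dest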
The parameters are implemented as an inner class of each functor. This allows the
type name parameters to be used consistently to designate this configuration object
in each functor, while the enclosing class name clearly indicates its scope. To en-
sure consistency along lines of inheritance, the parameters class of a derived functor
is derived from the parameters class of its base class. The current parameters of a
functor can be changed via the function setParameters(). A read-only refer-
ence to the current parameters instance is returned by getParameters(). The
parameters are never changed by the functor itself; only the user or another functor
may set parameters. This is in contrast to the state of a functor, which is manipulated
exclusively by the functor itself.
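As a small sketch of this interface (the functor and its kernelSize field are taken from the Harris corner example below; the chosen value is arbitrary):

#include "ltiHarrisCorners.h"

lti::harrisCorners harris;

// Copy the current configuration, adjust one value, and activate it again.
lti::harrisCorners::parameters param(harris.getParameters());
param.kernelSize = 7;
harris.setParameters(param);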
The parameters as well as the state of the functor are serializable through the
members read() and write(). Note that when saving the functor the parameters
are saved as well.
Setting parameters is one of the most important tasks when using the LTI-L IB.
Although most functors have useful default parameters, these need to be adapted for
special tasks quite frequently. In Fig. A.7 we take a look at configuring our imple-
mentation of the Harris corner detector [1]. As can be seen in the documentation
it uses two other functors internally, localMaxima and gradientFunctor.
To keep the code maintainable, the parameters of harrisCorners just contain
instances of the parameter classes of those two functors. In the example the param-
eter kernelType of the internal gradientFunctor is changed as well.


Note that most functors offer a constructor that takes the functor’s parameters as
an argument, which is more convenient than setting the parameters later. This ex-
ample also introduces another data structure: pointList. It can be used like a
std::list<point> but has some additional features.

#include "ltiImage.h"
#include "ltiHarrisCorners.h"

// assume this channel contains an image


channel src;

// change some parameters and create functor


harrisCorners::parameters param;
param.gradientFunctorParameters.kernelType
= gradientFunctor::parameters::OGD;
param.gradientFunctorParameters.gradientKernelSize = 5;
param.kernelSize = 7;
harrisCorners harris(param);

// list of lti::point
pointList corners;
channel cornerness;
float maxCornerness;
harris.apply(src,cornerness,maxCornerness,corners);

//print out how many corners were found


std::cout << corners.size() << std::endl;
Fig. A.7. Setting the parameters of a functor. In this example the parameters contain instances
of another functor’s parameters — gradientFunctor::parameters. These also de-
fine an enum which contains the value OGD. The harrisCorners functor is applied on a
channel and returns the positions of the corners.

A.4.2 Classifier

Classifiers are very similar to functors. The main difference is that instead of
apply() they have train() and classify() member functions. All classi-
fiers have a state which stores the results of the training and which is later used for
classification. The input data is strictly limited to dvector (or sequences thereof).
The LTI-L IB discerns three general types of classifiers: supervised and unsuper-
vised instance classifiers, and supervised sequence classifiers. Hidden Markov Mod-
els (cf. chapter 2.1.3.6) are currently the only representatives of the last category.
Supervised instance classifiers are trained with pairs of dvector and an integer,
i. e. feature vector and label. Unsupervised instance classifiers only receive feature
vectors as input and try to find natural groups within the data. This group contains
clustering algorithms as well as connectionist algorithms such as Self Organizing
Feature Maps. When used for classification the unsupervised version tries to give a
measure of how likely the given feature vector would have belonged to each cluster
had it been in the original data set.
The classifiers are not much used within the context of this book. The interested
reader is referred to the CD and online documentation for more information about
classifiers.

A.5 Handling Images


Most algorithms used in computer vision draw their information from images or
sequences of images. We have seen that they are mostly manipulated with functors
in the LTI-L IB. This section explains some of the fundamental tasks when working
with images. The first paragraph explains how to convolve an image with an arbitrary
filter kernel. The subject of the second paragraph is input/output for different image
formats. The next paragraph shows how to split a color image into different color
space representations. Finally, drawing simple geometric primitives and viewing of
images is explained.

A.5.1 Convolution

One of the most common operations in image processing is the convolution of an
image with a linear filter kernel such as a Gaussian kernel or its derivatives. The
LTI-LIB distinguishes between kernels that are separable and those that are not. A
kernel is separable if it can be written as the outer product of two one-dimensional
kernels:
k(x, y) = k_x(x) k_y^T(y)
with k a two-dimensional kernel and k_x, k_y one-dimensional column kernels. The
advantage of separable filters is that they can be applied successively to the image
rows and columns. Thus, far fewer multiplications are needed for the convolution of
the image and kernel: for an n × n kernel this reduces the number of multiplications
per pixel from n² to 2n. Even if the equation above does not hold, a two-dimensional
kernel can be approximated by a separable kernel or a set of separable kernels.
The classes kernel2D and sepKernel represent the two types of ker-
nels that occur when convolving an image with a kernel. The latter consists of
several kernel1D which can also be used for vector convolution. The class
convolution applies a kernel which is set in the parameters to an image. The
example in Fig. A.8 first shows how to manually create a kernel, here a sharp-
ening filter. It then uses a Gaussian kernel to smooth the image.
Note that convolution has a convenience function setKernel() which
sets the kernel in the parameters of the functor. The convolution detects which kind
of kernel is given — 2D or separable — and calls the appropriate algorithm. The size
of the Gauss kernel used in the example is 5 × 5 and has a variance of 1.7.

#include "ltiImage.h"
#include "ltiLinearKernels.h"
#include "ltiGaussKernels.h"
#include "ltiConvolution.h"

// assume this channel contains an image


channel src;

// a convolution functor
convolution conv;

// create a 3x3 sharpening filter which is not separable


kernel2D<float> sharpKernel(-1,-1,1,1,-0.1f);
sharpKernel.at(0,0) = 1.8f;
// use shortcut function to set kernel
conv.setKernel(sharpKernel);
// sharpen src and leave in sharpImg
channel sharpImg;
conv.apply(src,sharpImg);

// create a Gaussian kernel and smooth the sharp image


gaussKernel2D<float> gaussKern(5,1.7);
conv.setKernel(gaussKern);
conv.apply(sharpImg);
Fig. A.8. The example first creates a kernel2D to sharpen the channel and then uses a Gauss
kernel, which is a sepKernel, to smooth it again.

Another important parameter of the convolution is the boundary type, which
specifies how the data at the border of the image is treated. There are
five possibilities: a boundary of Zero assumes that all pixels outside the image are
zero valued. In cases where the derivative of the image has to be approximated, the
borders usually cause undesired effects. These can be reduced by indicating a Constant
boundary, which means a constant continuation of the pixel values at the boundary.
The image can also be considered as the result of a windowing process of a periodic
and infinite two-dimensional signal. If the window has the size of one period, then the
boundary Periodic has to be used. If the window is half of the period of an even
symmetrical signal then Mirror is the right choice. If you want to leave the border
untouched, for example because you want to take care of its computation in a spe-
cial way, then you have to set the boundaryType attribute to the NoBoundary
value.

A.5.2 Image IO

The LTI-L IB can read and write three common image formats: BMP, JPEG, and
PNG. While the former always uses an implementation which is part of the library,
the latter two use the standard libjpeg and libpng if they are available. These
libraries are mostly present on Unix/Linux systems. For Windows systems source
code is available to build the libraries. If the libraries are not available an alternative
implementation based on the Colosseum Builders C++ Image Library is available in
the extras package of the LTI-L IB.
Unless the LTI-LIB is used to write a specific application, it is most practical
to be able to load all supported image formats. The file ltiALLFunctor.h contains
two classes loadImage and saveImage that load/save an image depending on
its file extension. The filename can be set in the parameters or given directly in the
convenience functions load() and save(), respectively. The small example in
Fig. A.9 shows how a bitmap is loaded and then saved as a portable network graphic.

#include "ltiImage.h"
#include "ltiALLFunctor.h"

loadImage loader;
saveImage saver;
image img;

loader.load("/home/homer/bart.bmp",img);
saver.save("/home/homer/bart.png",img);
Fig. A.9. In this example a bitmap is loaded from a file and saved as a PNG.

Very often it is necessary to process not a single image but a large set
of images. For this purpose the LTI-LIB has the loadImageList functor. It can
be used to sequentially get all images in a directory, listed in a file, or supplied in
a std::vector of strings. In any case the filenames are sorted alphabetically to
ensure consistency across different systems. While the method hasNext() returns
true, there are more images available. The next image is loaded by calling apply().
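A hedged sketch of such a loop is shown below; only hasNext() and apply() are taken from the description above, the header name follows the usual naming convention, and the way the functor is pointed at a directory or file list is configured through its parameters and not spelled out here.

#include "ltiImage.h"
#include "ltiLoadImageList.h"   // assumed header name for the loadImageList functor

lti::loadImageList loader;
// ... set the directory, file list, or std::vector of names in the loader's parameters ...

lti::image img;
while (loader.hasNext()) {   // more images available?
  loader.apply(img);         // load the next image (alphabetical order)
  // ... process img ...
}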
It is also possible to load grey valued and indexed images with the image I/O
functors. If the loaded file actually contains grey value information this can be loaded
into a channel. For a 24 bit file the intensity of the image is left in the channel.
Indexed images can be loaded into a channel8 and a palette which stores up to 256
colors. If the file contains a 24 bit image the colors are quantized to up to 256 colors.

A.5.3 Color Spaces

Many image processing algorithms that work on color images first split them into the
different components in a particular color space (the so called color channels) and
then work on these separately. The most common color space is RGB (Red, Green,
Blue) which is also the native color space for LTI-L IB images, as it is frequently
used when dealing with industrial cameras and image file formats. Particular types
of images for additional color spaces are not provided, nor special pixel types, as
the number of available color spaces would have made the maintenance of all LTI-
L IB functors very difficult, due mainly to the typical singularities found in different
spaces (for example, the case of intensity zero in hue-saturation spaces). Therefore,
in this library just one color format is used, while all other cases have to be managed
as a collection of separate channel objects.
To extract all three color components from an image in a particular color space,
the functors splitImageToXXX are used, with XXX the desired color space abbre-
viation. The functors mergeXXXToImage form an RGB image from three sep-
arate components in the XXX color space. The code in Fig. A.10 shows how color
channels can be swapped to create a strangely colored image.

#include "ltiImage.h"
#include "ltiSplitImageToRGB.h"
#include "ltiMergeRGBToImage.h"

// this is the image we want to change


image src;
// RGB channel8
channel8 r,g,b;

splitImageToRGB split;
mergeRGBToImage merge;

split.apply(src,r,g,b);
merge.apply(g,b,r,src);
Fig. A.10. The image src is first split into the three RGB channels. Then an image is recon-
structed with swapped colors.

In this example we use channel8 rather than channel because no calcula-
tions are performed and we can avoid casting from ubyte to float and back again.
often needed functionality is to get an intensity channel from an image or create an
image from a channel (the image will, of course, be grey valued). These are imple-
mented via the castFrom() functions of the image types. Figure A.11 shows an
example.
These functions are especially helpful for visualization. After a channel is con-
verted to an image we can draw on it in color. Note that the above operation
can again be implemented via the apply() member function of matrix for faster
processing.
Besides the RGB color space the following conversions are implemented in the
LTI-L IB:
• CIELuv: CIE L∗ u∗ v ∗ is an attempt to linearize color perception. It is related to
XYZ.
• HLS: Hue, Luminance, and Saturation

#include "ltiImage.h"

// a color image
image img;
channel ch;

// compute the intensity defined as (R+G+B)/3


ch.castFrom(img);

// now contains same value in all colors


img.castFrom(ch);
Fig. A.11. The color image img is first converted into an intensity channel. Then the same
information is copied to all three channels of the image, resulting in a color image that contains
only grey tones.

• HSI: Hue, Saturation, and Intensity


• HSV: Hue, Saturation, and Value
• OCP: An opponent color system. The first two are opponent colors, the third is
the intensity.
• rgI: red and green chromaticity and Intensity.
• xyY: normed chromaticity channels
• XYZ: a set of three linear-light components that conform to the CIE color-
matching functions.
• YIQ: Luminance Inphase Quadrature channels used in the NTSC television stan-
dard.
• YUV: Luminance/chrominance representation according to ITU-R BT.601. Well
known from JPEG.
All of these color representations can be obtained from and reverted to an
image. The example in Fig. A.12 shows how to generate a false color image from a
channel with only a few lines of code.

A.5.4 Drawing and Viewing

In programs related to image processing it is often desired to visualize the results.


The LTI-LIB offers classes for drawing in images and for viewing them. While both 2D
and 3D drawing and viewing are available, this introduction only covers the former.
The class draw<T> draws on an image with pixel type T; e. g. draw<rgbPixel>,
which is equivalent to draw<image::value_type>, draws on an image. The
image that serves as canvas is set with the function use(). To change the color/value
for drawing call setColor(). The graphical representation of many simple types
of the LTI-L IB can be drawn via the set() method. This is also the fastest way
to draw a single pixel in the current color. The following list of functions gives an
overview of frequently used functions:

#include "ltiImage.h"
#include "ltiMergeHSIToImage.h"

// the source channel


channel ch;
// the false color image
image img;

// functor for merging an RGB image from


// hue, saturation, intensity information
mergeHSIToImage hsiMerger;

channel tmp(ch.size(), 0.8);


hsiMerger.apply(ch,tmp,tmp,img);
Fig. A.12. Using the given channel as hue information and a temporary channel of the same
size for saturation and intensity we receive a false color image of the original channel.

• Geometric shapes: line(), lineTo(), arrow(), rectangle(), arc(),
circle(), ellipse()
• Vectors: A vector can be drawn as a function of one discrete integer variable
with set(vector).
• Markers: With marker() a point can be marked by a small shape.
• Text: The functions text() and number() offer rudimentary text drawing
functionality.
The markers available are similar to those known from Matlab/Octave. Addition-
ally, they can be scaled via a parameter and selected to be filled. To set the marker
type either an enum value or the familiar Matlab or Octave string can be used. Ta-
ble A.1 gives the available types. When configuring the marker it can be combined
with a color and a factor stating the brightness of the color. See the example in
Fig. A.13 for a few sample configurations.
The viewer in the LTI-LIB is based on GTK (the Gimp Tool Kit, a multi-platform
GUI toolkit licensed under the LGPL; http://www.gtk.org). While it also has parameters
like a functor, most of these can also be set via the GUI. The only notable excep-
tion to this rule is the window title, which is most easily set with the constructor.
To view an image create a viewer and use show(). This function can also be used
on other types such as vectors and histograms. The viewer has been designed to
take control of the GUI in a separate thread, so that the user can manipulate the
visualization parameters as long as the viewer instance exists. This enormously
simplifies the task of displaying a window with an image, which otherwise would
involve some cumbersome widget programming. The viewer will not block the
execution of your program, unless you explicitly call one of the blocking methods
like waitButtonPressed(), which waits for a left click on the window, or
waitKey(), which waits for a keystroke. The function waitKeyOrButton()
combines both. To stop showing the image the member function hide() is sup-
plied; it is in any case called automatically when the viewer object is destroyed.

Table A.1. Marker types and color specifiers for draw.

Marker types:
  Marker shape     enum value       marker char
  Pixel            Pixel            .
  Circle           Circle           o
  Xmark            Xmark            x
  Plus             Plus             +
  Star             Star             *
  Square           Square           s
  Diamond          Diamond          d
  Triangle up      TriangleUp       ^
  Triangle down    TriangleDown     v
  Triangle left    TriangleLeft     <
  Triangle right   TriangleRight    >
  Dot              Dot              #
  LTI-Logo         LtiLogo

Color specifiers:
  Color      char
  white      w
  black      b
  red        r
  green      g
  blue       b
  cyan       c
  magenta    m
  yellow     y

The example code in Fig. A.14 shows how to display the canvas from the previ-
ous example. The last two lines of the example are needed because of a technical
limitation at the very end of the program: there is no other way to ensure that
GTK is stopped before all GUI handlers are destroyed.
The capabilities of draw in combination with the viewer are now demonstrated
on the data obtained with a Harris corner detector (cf. Fig. A.7). Figure A.15 shows
the cornerness after normalization as well as the corners on the original image and
the configuration dialog of the viewer. The additional source code can be found in
Fig. A.16.
The upper part of the configuration dialog offers sliders for zoom, contrast and
brightness of the displayed image. The left part of the middle section shows some
statistics about the image. These are exploited by the different scale
functions on the right. In the example Scale Min-Max was used to map the values of
the cornerness to the interval [0,1]. In the bottom section the possibility to save the
displayed image is of interest. It uses the functor defined in ltiALLFunctor.h.


#include "ltiImage.h"
#include "ltiDraw.h"

// the canvas is yellow


image canvas(30,30,Yellow);
// draw on that canvas
draw<rgbPixel> painter;
painter.use(canvas);

painter.setColor(Blue);
point p1(25,5),p2(25,25),p3(5,25);
painter.line(p1,p2);
painter.setColor(Green);
painter.lineTo(p3);
point pm;
pm.add(p1,p3).divide(2);
painter.setColor(Magenta);
painter.arc(pm,p1,p3);

//draw a filled dark cyan square, width 3


painter.marker(10,5,3,"c5sf");
//draw a very dark yellow circle, default width 5
painter.marker(5,10,"y2o");
Fig. A.13. Drawing example. A yellow canvas is used for drawing. The first block draws two
lines and an arc between three points; pm is the center of the arc. The second block shows
how to use markers. The third optional argument is the width of the marker, the string has the
format color with optional factor and shape with optional ’f’ for filled.

#include "ltiImage.h"
#include "ltiViewer.h"
#include "ltiTimer.h"

//The canvas from the previous example


image canvas;

viewer v("Draw example");


v.show(canvas);
v.waitButtonPressed();
v.hide();
gtkServer::shutdown(); // workaround
passiveWait(10000);
Fig. A.14. Using the viewer to display an image.

Fig. A.15. Using the viewer configuration dialog. (a) Cornerness values. (b) Corners marked
on the original image. (c) By using the button Scale Min-Max, the values in the cornerness (a)
are scaled to completely fill the range from 0 to 1.

A.6 Where to Go Next

This chapter has introduced a small set of functors usually employed in image
processing tasks. The LTI-LIB provides, however, many additional algorithms and
general classes for this and other related areas, which simplify the implementation
of complete computer vision systems. You will find functors for segmentation, mor-
phology, edge and corner detection, saliency analysis, color analysis, linear and non-
linear filtering, and much more. Images can not only be obtained from files but also
via some functors in the LTI-LIB that provide interfaces to some popular cameras
and framegrabbers. Furthermore, you will find functors for programming with mul-
tiple threads, for control of serial ports, as well as for system performance evalu-
ation. Since classification and processing tasks require a plethora of mathematical
functions, the LTI-LIB contains many classes for statistics and linear algebra. We
encourage you to browse the HTML documentation to get an idea of the great possi-
bilities that this library provides.

#include "ltiImage.h"
#include "ltiViewer.h"
#include "ltiTimer.h"

// data from the Harris example


channel ch, cornerness;
pointList corners;

image img;
img.castFrom(ch);
draw<rgbPixel> painter;
painter.use(img);
painter.marker(corners,7,"rx");

viewer vCornerness("Cornerness");
viewer vCorners("Corners");
vCornerness.show(cornerness);
vCorners.show(img);
vCorners.waitButtonPressed();
vCorners.hide();
vCornerness.hide();
passiveWait(10000);
gtkServer::shutdown();
Fig. A.16. The first viewer just displays the cornerness. For the second viewer the grey scale
image is first transformed to an image. Then the corners can be marked as red crosses.

References
1. Harris, C. and Stephens, M. A Combined Corner and Edge Detector. In Proc. 4th Alvey
Vision Conference, pages 147–151. 1988.
Appendix B

IMPRESARIO — A Graphical User Interface for Rapid
Prototyping of Image Processing Systems

Lars Libuda

During development of complex software systems like new human-machine in-
terfaces, some tasks always reappear: First, the whole system has to be separated into
modules which handle subtasks and communicate with each other. Afterwards, algo-
rithms for each module have to be developed, where each algorithm is described by
its input data, output data, and parameters. After implementation the algorithms have
to be optimized. Optimization includes the determination of the optimal parameter
settings for each algorithm in the context of the application. This is normally done by
comparing an algorithm's output data for different parameter settings applied to the
same input data. Therefore, the output data has to be visualized in an appropriate way.
The whole process is very time consuming if everything has to be coded manually,
especially the visualization of more complex data types like images or 3D data.
But the process may be sped up if a framework is available which supports
the development process. These kinds of frameworks are called Rapid Prototyping
Systems. Ideally, such a system has to meet the following requirements:
• Input data from different sources has to be supported, e. g. image data from still
images, video streams or cameras in case of image processing.
• It has to define a standardized but flexible and easy to use interface to integrate
any algorithm with any input data, output data, and parameters as a reusable
component in the framework.
• Processing chains consisting of several algorithms have to be composed and
changed easily.
• Parameter settings of each algorithm must be changeable at runtime.
• Output data of each algorithm has to be visualized on demand during execution
of the processing chain.
• The system has to provide standardized methods to analyze, compare, and eval-
uate the results obtained from different parameter settings.
• All components have to be integrated in a graphical user interface (GUI) for ease
of use.
Rapid Prototyping Systems exist for different types of applications like Borland’s
Delphi [5] for data base applications, Statemate [9] for industrial embedded systems,
or 3D Studio Max [11] for creating animated virtual worlds.
The remaining part of this chapter introduces a Rapid Prototyping System called
I MPRESARIO which was originally developed as a graphical user interface for image
processing with the LTI-LIB but is not necessarily limited to it. The software can be
found on the CD accompanying this book and is provided for a deeper understanding
of the examples in chapters 2, 5.1, and 5.3.

Table B.1. Requirements to run IMPRESARIO on a PC. The DirectX interface is needed if you
plan to use a capture device or if image data should be acquired from video streams.

                       Minimum          Recommended
  Operating system:    Windows 98 SE    Windows XP
  Memory:              256 MB           512 MB
  Hard disc space:     500 MB           -
  DirectX version:     8.1              -

The next paragraph describes the requirements, installation, and deinstallation
of IMPRESARIO. The following paragraphs contain a tutorial on the graphical user
interface and detailed references for the different steps needed to set up an image
processing system. The last three paragraphs provide details about the internals of
IMPRESARIO itself, summarize the IMPRESARIO examples which are part of the
chapters related to image processing, and describe how the user can extend
IMPRESARIO's functionality with her own algorithms.

B.1 Requirements, Installation, and Deinstallation


To run IMPRESARIO on a personal computer, the requirements listed in Tab. B.1
must be met. To install the software, copy the file impresario_setup.exe con-
tained in the directory impresario on the CD to the hard disc and execute it. The
setup program will prompt for a directory and then extract all necessary files to this
directory. Finally, make sure that the graphics card is set to 32-bit color display.
Now the software can be started by executing impresario.exe in the directory
you selected during installation.
For deinstallation delete the directory into which the software was copied.

B.2 Basic Components and Operations


After the first start, IMPRESARIO will show up as depicted in Fig. B.1. The user
interface is divided into five parts.
• The menu bar and the standard toolbar are located at the top of the application
window. They contain all application commands. Tab. B.2 lists all toolbar com-
mands which are shortcuts for some menu items.
• Beneath the toolbar and on the left side of the main window, the macro browser
is displayed. This is a database from which macros can be added to a document.
Each macro represents an image processing algorithm in the user interface.

Fig. B.1. IMPRESARIO after the first start. The GUI is divided into 5 parts: menu and toolbar,
macro browser, document view, output window, and statusbar.

Table B.2. Application commands located on the standard toolbar

  Command                  Triggered action
  Create new document      Creates a new document and displays its view.
  Open document            Displays a file selection dialog to open an existing document.
  Save document            Saves the current document to a file. If the document is
                           unnamed, a dialog will prompt for a filename.
  Enable/Disable output    Enables or disables the output window attached to the current
                           document.
  Start processing         Starts image processing.
  Pause processing         Pauses image processing.
  Stop processing          Stops image processing.
  Single processing step   Starts image processing for exactly one image and stops
                           automatically after this image has been processed.

• To the right of the macro browser, the main workspace displays a view of the current
document. At the moment, this document, named "ProcessGraph1", is empty. It
has to be filled with macros from the macro browser.

• Below the main workspace and the macro browser the output window displays
messages generated by the application (e. g. see the tab “System”) and by opened
documents. For each document a tab is added to the output window and named
after the document.
• At the bottom of the main window, the status bar shows help texts for each menu
or toolbar command when it is selected and the overall processing speed of the
current document in milliseconds and frames per second when the system exe-
cutes the active document.
To work with this interface it is necessary to understand the basic components
and operations which are described in the following subsections.

B.2.1 Document and Processing Graph

The core component of the system is the document. A document contains a process-
ing graph and the configuration of the user interface used with this graph. A process-
ing graph is a directed graph consisting of vertices and directed edges, where
macros represent the vertices and data links represent the edges. For more theoreti-
cal background on graphs see, e. g., [7]. Documents are stored in files with the suffix
“.ipg” which is the abbreviation for I MPRESARIO processing graph. The commands
to create, open, save, and close documents are located in the “File” menu. Shortcuts
for some file commands can also be found on the standard toolbar (see Tab. B.2). For
further discussion, we open a simple example:
1. Select the “File”→“open” command in the main menu or click the correspond-
ing toolbar button.
2. In the upcoming dialog select the document called “CannyEdgeDetection.ipg”.
3. Confirm the selection by pressing the “Open” button in the dialog.
A new view on the opened document will be created and a new tab will be added
to the output window, both named after the document. The view will show the con-
tents depicted in Fig. B.2. This is a very simple processing graph consisting of three
vertices and two edges. The vertices are formed by the three macros named "Image
Sequence", "SplitImage", and "CannyEdges", while two unnamed data links, visualized
by black arrows, represent the two directed edges.

Fig. B.2. View on the example document "CannyEdgeDetection.ipg" after opening it.

It is necessary to take a closer look at the macros because they encapsulate the
algorithms for image processing and thus the graph’s functionality while the links
determine the data flow in the graph.

B.2.2 Macros

In I MPRESARIO macros are used to integrate algorithms for image processing in the
user interface. A macro has two purposes: First, it contains an algorithm for im-
age processing which, in general, can be described by its input data, its output data,
and its parameters. This algorithm is executed by I MPRESARIO with the help of the
macro. Second, a macro provides a graphical representation to change the parameters
of the algorithm it contains. In the document view a macro is visualized as a beige
block with its name printed in black. The input data is visualized as yellow input
ports, the output data as red output ports (see the Canny edge example in Fig. B.2).
If the mouse is hovered over the area of an input or output port, a tooltip will pop
up showing the port’s data type and name separated by a colon. Macro parameters
are visualized in a separate macro property window which is opened by clicking the
button within the macro (see Fig. B.3). Furthermore, all macros display information
about the time needed to process the current input data and their state. Both pieces of
information are of importance in processing mode (see Sect. B.2.3).
Our example consists of three different macros. The first macro in the graph is
named “Image Sequence”. This macro acquires image data from files stored on hard
disk and therefore represents the source. Every document needs at least one source
but there can be more than one in each document. The version of I MPRESARIO in-
cluded in this book ships three different source macros. Source macros behave like
all other macros but do not display the standard interface for parameter changes in
their macro property window as shown for the “SplitImage” macro in Fig. B.3. Take
a look at Fig. B.4 to see the macro property window of the “Image Sequence” macro.
Because every document has to contain a source and source macros feature different
graphical user interfaces, they are described in detail in Sect. B.5.

Fig. B.3. Opened macro property window of the "SplitImage" macro from the Canny edge
example in Fig. B.2 and a tooltip showing a port's data type and name. The tooltip appears as
soon as the mouse cursor is hovered over a port.

All other macros coming with the book’s CD represent LTI-L IB functors (see
chapter A.4). A macro can represent a single LTI-L IB functor or an assembly of
several functors. This also applies to the two remaining macros in our example. The
“SplitImage” macro in the Canny edge example (Fig. B.3) is a macro assembling all LTI-Lib functors which split color images into the components of a certain color model. The macro takes a color image of type class lti::image as input and splits it into three grey value images of type class lti::channel8 representing the three components of a color model, which can be selected by the macro’s parameter named Output. In Fig. B.3 the RGB model is currently selected. The first grey value image of the “SplitImage” macro serves as input for the “CannyEdges” macro which contains a single functor (the Canny algorithm [3, 4]). This macro performs edge detection on the input data and outputs two grey value images of type class lti::channel8 named “Edge channel” and “Orientation channel”. The edge channel is a black and white image in which all pixels belonging to an edge in the source image are white. The orientation channel encodes the direction of the gradient used to detect the edges.
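For readers who want to relate these macros to LTI-Lib code, the following minimal C++ sketch performs roughly the same two processing steps outside of IMPRESARIO. The apply() overloads used here are assumed from the description above; please consult the LTI-Lib documentation (see chapter A.4) for the exact interfaces. The input image would normally be delivered by a source such as the Image Sequence macro.

#include "ltiImage.h"            // lti::image, lti::channel8
#include "ltiSplitImageToRGB.h"  // functor splitting a color image into R, G, B
#include "ltiCannyEdges.h"       // Canny edge detection functor

int main() {
  lti::image img(240, 320);      // in IMPRESARIO this image comes from a source macro

  // split the color image into its three RGB components (grey value images)
  lti::channel8 red, green, blue;
  lti::splitImageToRGB splitter;
  splitter.apply(img, red, green, blue);

  // detect edges on the first component; pixels belonging to an edge become white
  lti::channel8 edges;
  lti::cannyEdges canny;
  canny.apply(red, edges);

  return 0;
}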
A document may contain as many macros as necessary. They can be added us-
ing the macro browser, deleted, configured, and linked with each other. A detailed
description of handling macros and linking them together is provided in Sect. B.3
and B.4.
But now it is time to see the Canny edge example in action.

B.2.3 Processing Mode


Up to this point we have opened a document and taken a closer look at its contents.
But we haven’t seen any image processing so far. If it has not already happened,
open the macro property window of the “SplitImage” macro and the “Image Se-
quence” macro. To start image processing, select the “Control”→“Start” command
from the menu or press the corresponding toolbar button (see Tab. B.2). I MPRE -
SARIO switches from edit mode to processing mode and executes the processing
graph. There is one major difference between edit and processing mode: In process-
ing mode the structure of the graph cannot be changed. It’s not possible to add or
delete macros and links.
In the Canny edge example the view on the document in processing mode should
look similar to Fig. B.4 now. In the property window of the Image Sequence macro,
the currently processed image file is highlighted and the three macros display their
execution times and their current state. All possible states are color coded and sum-
marized in Tab. B.3. During processing, a macro’s state will normally change be-
tween Waiting and Processing. The macros in this example execute their algorithms
very fast so that a state change is hardly noticeable. The overall processing time for
the whole graph is displayed on the right side of the status bar.
The processing mode can be controlled with the commands located in the “Con-
trol” menu. Besides starting, processing can be paused and stopped. In the latter case,

Fig. B.4. View on the Canny edge example document in processing mode. Each macro dis-
plays the execution time and its current state. The property window of the Image Sequence
macro shows the name of the currently processed image file.

Table B.3. Macro states and their meaning

State Color Meaning


Ok green The macro is ready for processing mode
Waiting orange The macro is waiting for preceding macros to complete processing
Processing orange The macro is currently processing its input data
Pause blue Processing is paused in the processing graph
Error red An error occurred in the macro. Processing of the whole graph is
stopped immediately and an error message is displayed in the docu-
ment’s output window.

I MPRESARIO will switch back into edit mode. The “Snap” command demands a spe-
cial note. Using this command will set I MPRESARIO to processing mode for exactly
one image. After processing this image, I MPRESARIO stops processing and reacti-
vates the edit mode.
Now that the Canny edge example is running, the only thing missing is a visual-
ization of the macros’ results.

B.2.4 Viewer

Viewers visualize the output data of every macro. A viewer is opened by double-
clicking a red output port. Fig. B.5 shows two viewers from the Canny edge example.
They visualize the original image “hallway.jpg” which is accessible via the “Color
Image” output port of the “Image Sequence” macro and the edge channel of the
“CannyEdges” macro (the left output port of the “CannyEdges” macro). When a viewer is opened for the first time, it appears to the right of the macro in a standardized size.

Fig. B.5. Two viewers visualizing results of the Canny edge example. The left viewer shows the original image loaded by the “Image Sequence” macro, the right viewer displays the “Edge channel” of the “CannyEdges” macro.

Table B.4. Commands for image viewer

Icon Command Triggered action


Save image Saves the current image to a file. Prompts for a filename in a di-
alog. Supported file formats: Bitmap, Portable Network Graph-
ics, and JPEG.
Zoom in Zooms in the current image by factor 2
Zoom out Zooms out the current image by factor 2
Original size Displays the image in its original size

Most of the time, this size is too small to see anything because the image data is adjusted to the window size. The viewer can be resized like every other window by clicking on
its frame and moving the mouse until it reaches the desired size. Additionally, every
viewer provides a toolbar with commands for zooming the image and adjusting the
window size to the image size. Tab. B.4 lists all available viewer commands.
Besides the toolbar, a viewer also contains a status bar which is divided into four
panes. The first pane displays the image size in pixels (width x height). When the
mouse is hovered over the image the second pane shows the image coordinates the
mouse cursor is pointing to. The origin of the coordinate system is the upper left
corner of the image. The pixel value belonging to the current image coordinates can
be found in the third pane. The value depends on the image type which is shown in
the fourth pane. For an image of type lti::image the value is a vector consisting
of the pixel’s red, green, and blue fraction, each in the range of 0 to 255. For a
lti::channel8 the value is a single integer in the range of 0 to 255 and for a
lti::channel a floating point number. Take a look at chapter A.3.3 for more
information about the different image types.
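As a small, hedged illustration of these value ranges, the following C++ fragment reads one pixel from each of the three image types with LTI-Lib; the at() accessors and the rgbPixel getters are assumed to behave as summarized here and in chapter A.3.3, and the coordinates are arbitrary.

#include "ltiImage.h"  // lti::image, lti::channel8, lti::channel, lti::rgbPixel

void inspectPixels(const lti::image& img,
                   const lti::channel8& c8,
                   const lti::channel& c) {
  // lti::image: each pixel is an RGB vector with components in the range 0 to 255
  const lti::rgbPixel px = img.at(0, 0);
  const int r = px.getRed();
  const int g = px.getGreen();
  const int b = px.getBlue();

  // lti::channel8: a single integer value in the range 0 to 255
  const int grey = c8.at(0, 0);

  // lti::channel: a single floating point value
  const float value = c.at(0, 0);

  (void)r; (void)g; (void)b; (void)grey; (void)value;  // these values are shown in the panes
}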
Now that we know how to visualize results of the different macros it is easy to
see the influence of parameter changes in a macro on the final result. One last time, take the Canny edge example, start the execution of the processing graph and open the viewer for the edge channel of the “CannyEdges” macro. After that, open the
macro property window of the “SplitImage” macro and change the parameter named
Output from the value “RGB” to “CIE Luv” by opening the parameter’s drop down
box and selecting the item. You will notice changes in the edge channel. Other values will increase the changes even further, up to a point where the edge channel cannot be interpreted in a meaningful way any more. Experiment with the parameters
of both processing macros and open some more viewers to get a feeling for the user
interface.

B.2.5 Summary and further Reading

In this tutorial the basic components and operations were introduced, starting with the document as the core component containing the processing graph, which consists of macros and data links. Afterwards, macros were introduced that encapsulate image

acquisition and image processing algorithms in a common interface. The distinction between edit mode and processing mode allows the system to execute a processing
graph with a constant structure. Finally, viewers allow the user to observe the output
data of every macro in a comfortable way.
The remaining part of this chapter describes the creation of a new document with
a working processing graph and provides more detailed information on the topics
covered in this tutorial. The creation of a new processing graph can be separated into
three steps:
1. Adding the functionality with macros and rearranging macros in the document.
See Sect. B.3 for more information about this.
2. Defining the data flow with links (Sect. B.4).
3. Configuring macros. This topic is covered in Sect. B.5 and describes the usage
of the standard macro property window as depicted in Fig. B.3 as well as the
special user interfaces of the source macros.
For the interested reader, Sect. B.6 gives some insight into the internal structure
and concept of I MPRESARIO itself and describes the features which have not been
covered until this point. Last but not least, Sect. B.7 describes the examples contained
in the image processing chapters of this book, and Sect. B.8 addresses all readers who might like to extend IMPRESARIO with their own macros.

B.3 Adding Functionality to a Document


IMPRESARIO will create an empty document if the “New” command in the main toolbar is executed. In the first step toward building a working processing graph, macros have to be added to the document as its functional parts. This section
provides the necessary information on how macros can be added to a document us-
ing the macro browser and how they can be rearranged within and removed from a
document.

B.3.1 Using the Macro Browser

Fig. B.6 depicts two views on the macro browser which contains a database with all
available macros in the system. Tab. B.5 summarizes the commands available on the
macro browser’s toolbar.
The macro browser’s main component is the macro tree list. It provides four
different categories to view macros. The first category lists the macros alphabetically
by their name. The second category named “Group” sorts all macros according to
their function. The third category lists the macros by the DLL¹ they are located in,
and the fourth category by their creator. A category can be accessed by clicking
the corresponding tab beneath the macro tree list with the left mouse button. The
right part of Fig. B.6 shows the macro browser displaying the group category and all
¹ DLL – Dynamic Link Library, see Sect. B.6 for more information.

Fig. B.6. Macro browser with all available macros. Left: Default view with activated name
category. Right: View on the group category with activated filter “Source”.

Table B.5. Toolbar commands for the macro browser


Icon Command Triggered action
Collapse all tree nodes Collapses all tree nodes for a compact view in the filter cate-
gories
Expand all tree nodes Expands all tree nodes for a complete view of all macros in the
filter categories
Toggle help window Toggles the help window in the macro selection window and
macro property window

macros of the group “Source” which will be discussed in detail in Sect. B.5. In each
category macros can be selected by a left click on their name. As soon as a macro is
selected, a more detailed description is displayed in the help area beneath the tabs.
There are two ways to add macros from the macro selection window to a doc-
ument. The first one uses the “Search for” input field. Activate the field with a left
click. A blinking cursor appears in the field. Now it’s possible to enter characters.
The first macro name which matches the entered character sequence is selected in
the macro tree list. The search can be narrowed by using the “Filter” drop down
list box. The filter options vary depending on the selected filter category. The name

category does not provide any filter options, but the group category, for example, allows the search to be restricted to a single group (right part of Fig. B.6). As soon as the Enter key is pressed
in the “Search for” input field, the selected macro is added to the active document. It
appears in the document view’s upper left corner. The second and more comfortable
way to add macros works by drag and drop:
1. Click the desired macro with the left mouse button and hold the button down.
2. Move the mouse cursor to some free space in the document view while the left
mouse button is still held down.
3. Release the left mouse button in the document view and the selected macro will
be inserted at the mouse cursor’s position.
This operation can be cancelled any time by pressing the Escape key.

B.3.2 Rearranging and Deleting Macros

After macros have been added to a document, they can be rearranged or removed
from the document.
Macros can be rearranged by a simple drag and drop operation. Move the mouse
cursor onto the desired macro, press and hold down the left mouse button and move
the cursor to the desired location in the document view. A rectangle representing the
macro’s borders will move with the cursor. If the cursor is moved to the view’s right
or bottom edge, the view will start to scroll to reach invisible parts of the document.
As soon as the desired location is reached release the mouse button and the macro is
moved there.
To remove a macro from the document, select it by clicking it with the left mouse
button. The macro’s border will turn light green to indicate that it is selected. Exe-
cute the “Delete” command in the “Edit” menu to remove the macro. In the current
version of IMPRESARIO only one macro can be selected and removed at a time.

B.4 Defining the Data Flow with Links


After all necessary macros are added to the document, the process graph has to be
completed by defining the data flow with links. This section gives a brief overview
on how to create and remove links from a document.

B.4.1 Creating a Link

To realize data exchange between different macros, they have to be linked. All links are subject to the following rules:
• A link can only be established between an output port (colored red) and an input
port (colored yellow).
• The input port and the output port have to support the same data type. A port’s
data type is displayed as a tooltip when the mouse pointer is hovered over the
port and left there unmoved for a moment.

Table B.6. Cursor appearances during link creation

Cursor Name Description


Link allowed It is allowed to start link creation from the currently selected
port (source) or a link can be established to the selected port
(destination).
Link forbidden A link cannot be created starting from the currently selected
port or a link cannot be established between the source port and
the currently selected destination port.

• An output port may be connected to many input ports, but an input port accepts exactly one incoming link from an output port, as summarized in the sketch below.
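These rules can be condensed into a single check. The following C++ sketch is only a conceptual summary of the rules above and not IMPRESARIO's actual validation code; all names are invented for this example.

#include <string>

// Returns true if a link between the two given ports would be valid.
bool linkAllowed(bool sourceIsOutputPort,
                 bool targetIsInputPort,
                 const std::string& sourceDataType,
                 const std::string& targetDataType,
                 bool targetAlreadyConnected) {
  if (!sourceIsOutputPort || !targetIsInputPort) return false;  // rule 1: output to input only
  if (sourceDataType != targetDataType)          return false;  // rule 2: identical data types
  if (targetAlreadyConnected)                    return false;  // rule 3: one incoming link per input port
  return true;
}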
Links are created by drag and drop operations with the mouse by the following
steps:
1. The mouse cursor has to be placed on a red output port or a yellow input port.
The cursor appearance changes according to Tab. B.6. The “Link forbidden”
cursor will appear only if link creation is started from a yellow input port which
is already connected to an output port.
2. If it is allowed to start a link from the selected port (indicated by the “Link
allowed” cursor) press the left mouse button, hold it down and start to move the
mouse to the desired destination port. The system draws a gray line from the
starting point of the link to the current mouse position and the mouse cursor
changes to the “Link forbidden” cursor to indicate an incomplete link (see left
column in Fig. B.7).
3. To connect two macros which are not concurrently visible in the document view,
move the cursor to the view’s border. The view will start to scroll without abort-
ing the link operation. Scroll the view until the desired macro becomes visible.
4. As soon as the desired destination port is reached, the system indicates whether it is possible to establish the link or not. If the link is valid, the gray line will turn green and the cursor will change to “Link allowed” (middle column in Fig. B.7). In case of an invalid link the gray line turns red and the cursor appearance does not change (right column in Fig. B.7). Invalid links are indicated in
two cases: First, the source and destination port are both input or output ports.
Second, the data types of the source and destination port are unequal.
5. If a valid link is indicated, release the left mouse button to create the link. The
green line will be replaced by a black arrow. If the mouse button is released
when the system indicates an incomplete or invalid link, the operation will be
cancelled.
Link creation can also be cancelled at any time by pressing the Escape key.

Fig. B.7. Possible appearances of links during creation. Left column: Gray line indicating
incomplete link. Middle column: Green line indicating valid complete link. Right column:
Red line indicating invalid link.

B.4.2 Removing a Link

To remove an existing link indicated by a black arrow, it has to be selected. Click on the link’s black arrow with the left mouse button and the arrow turns light green. Now the link is selected. Execute the “Delete” command in the “Edit” menu to remove the link.

B.5 Configuring Macros


After adding macros to the document and connecting them by links the process
graph is nearly ready to be processed. The final step includes the configuration of
the macros. This especially applies to the three source macros coming with the version of IMPRESARIO on the book’s CD (see the group “Source” in the macro browser,
Fig. B.6):
• Image Sequence: This macro allows the processing of single still images and
sequences of still images.
• Video Capture Device: Here, image data can be acquired from video capture devices which support the DirectX 8.1 interface [6]. These are mainly USB or FireWire cameras [1, 10] like the Philips ToUCam or the Unibrain Fire-i camera.
• Video Stream: The Video Stream macro extracts image data from video files of type AVI and MPEG-1. For this, the DirectX 8.1 interface is also required.

The source macros have to be fed with information about where they can find the image data. For this purpose they provide customized user interfaces which are described in
detail in the following subsections. All other macros coming with this book provide
a standard user interface which will be discussed at the end of this section.

B.5.1 Image Sequence

The Image Sequence macro handles still images of the types Bitmap (BMP), Portable Network Graphics (PNG), and JPEG. These kinds of images may be used as input data to the processing graph either as single images or as a sequence of several images. To this end, the Image Sequence macro maintains so-called sequence files which are suffixed with “.seq”. Sequence files are simple text files containing the path to one
existing image file in each line. They can be generated manually with any text editor
or much more comfortably with the macro’s user interface depicted in Fig. B.8. On
top, the interface contains a menu bar and a toolbar for image sequence manipula-
tion. The commands are summarized in Tab. B.7. The drop down list box beneath the
toolbar contains the currently and recently used sequences for fast access. Beneath
the list box the image list displays the contents of the current sequence.
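A hypothetical sequence file might look as follows; the image paths are invented for this example and simply have to point to existing image files on disk, one path per line:

C:\Impresario\images\hallway.jpg
C:\Impresario\images\corridor01.bmp
C:\Impresario\images\corridor02.png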
Creating and Loading Sequences
By default, an empty and unnamed image sequence is generated when a new docu-
ment is created. A new sequence can be created by selecting the “Create new sequence” command (see Tab. B.7). To load an existing sequence, press the “Load sequence”
button in the toolbar. A standard open file dialog appears where the desired sequence
can be selected. An alternative way to open one of the recently used sequences is to
open the drop down list box and select one of its items. The corresponding sequence
will be opened immediately if it exists.

Fig. B.8. User interface of the Image Sequence macro. It consists of the command toolbar, the
drop down list box with the last opened sequences and the list containing the image files to be
processed.

Table B.7. Commands for image sequence manipulation

Icon Command Triggered action


Create new sequence Asks to save the current sequence if it has been modified and
clears the list to begin a new image sequence.
Load sequence Asks to save the current sequence if it has been modified and
displays a file selection dialog to open an existing sequence file.
Add sequence Adds the contents of a different sequence file to the currently
used sequence. Displays a dialog to select the sequence file to
add.
Save sequence Saves the current sequence to a file. If the sequence is unnamed,
a file dialog will prompt for a filename.
n. a. Save sequence as Saves the current sequence to a file and always prompts for a
filename. This command is only available in the menu.
Add images Displays a file dialog to select one or more image files which are
added to the image list.
Remove images Removes all selected image files from the list.
Play as sequence When the icon is pressed, all images in the list will be used as
input in the order they appear on the list. When the icon is not
pressed, the selected image will be used as input only. The input
image can be changed by selecting a different image file from
the list.
Play in loop When this option is selected, the image sequence is played re-
peatedly. Otherwise, processing stops as soon as the end of the
sequence is reached. Please note that this option is only avail-
able if “Play as sequence” is also selected.

Adding Images to a Sequence


There are two ways to add images to the image list: It is possible to add a whole
sequence by choosing the “Add sequence” command or image files with the “Add
images” command (see Tab. B.7).
Choosing the “Add sequence” command brings up a dialog for selecting an existing sequence file. After confirming the selected file by pressing the “Open” button in the dialog, all images referenced in the selected sequence file are appended to the image list.
To add image files, choose the “Add images” command. A dialog appears which allows one or more image files to be selected. Confirm the selection by pressing the “Add”
button in the dialog and the image files are added to the list (see Fig. B.9). The list
displays some information about each image. The first column contains the position
of the image in the list, the second column the file name, the third column the res-
olution of the image in the format (width x height) in pixels, the fourth column the
type of the image (“color image” or “grey value image”), and the fifth column the
absolute path to the image file.

Fig. B.9. Appearance of the Image Sequence user interface after adding new images. The asterisk appended to the sequence name indicates that the sequence was changed and needs to be saved to preserve the changes.

If errors occur during analysis of an image to be added, corresponding messages will be displayed in the output window of the document.
Changing the Position of an Image in the Sequence
It is possible to change the position of an image in the list by selecting it with the
left mouse button and dragging it to the new position while the left mouse button is
held down. After releasing the mouse button at the new position the image will be
moved there.
Removing Images
To remove images from the current sequence, they have to be selected in the list.
A single image is selected by simply clicking it with the left mouse button. Multi-
ple images may be selected by holding the Ctrl- or Shift-Key during selection with
the left mouse button. When all desired images are selected, choose the “Remove
images” command from the toolbar.
To clear the whole image list in one step, choose the “Create new sequence” command from the toolbar.
Storing Sequences
When images are added, moved, or removed, the sequence is modified. A modification is indicated by an asterisk (*) which is appended to the sequence name in the drop down list box (see Fig. B.9). To preserve the changes the sequence has to be
stored on disk.
A sequence can be saved by selecting the “Save sequence” command. If the se-
quence is unnamed, a dialog will prompt for a filename. Subsequent save commands
for the same sequence will be executed without further prompts. To save the current
sequence with a different name, choose the “Save sequence as” command. Here, you
will always be prompted for a file name.

The system automatically asks to save a sequence if changes to the current sequence may be lost due to a subsequent execution of a “Create new sequence” or “Load sequence” command.
Image Control During Processing
When the process graph is processed, the “Play as sequence” command allows switching between two different processing modes of the image sequence. In the first mode, every image is processed in the order it appears in the list. In this mode, the list is grayed out and any interaction with the list is impossible. Further, the “Play in loop” command toggles between one and infinite iterations of the sequence.
The second mode processes only the selected image. It is possible to change the
processed image by selecting a different one in the list. The “Play in loop” command
is not available in this mode.
During processing of the graph it is not possible to modify the image list in any way.
Image Sequence provides two output ports. The left one contains the currently
loaded image and the right one the path to the file where the image was loaded from.
To visualize any output double click the ports of the macro. A viewer window will
open and display the currently processed data.

B.5.2 Video Capture Device

With the Video Capture Device macro it is possible to acquire live image data from any capture device which supports Microsoft’s DirectX interface [6] in version 8.1 or above. Most USB cameras and FireWire cameras support this interface [1, 10]. The macro’s user interface is shown in Fig. B.10 and the meaning of the two buttons
is described in Tab. B.8.

Fig. B.10. Configuration page of the Video Capture Device macro. The list box shows all
detected capture devices and the combo box all possible capture modes supported by the
selected device. In this case a Philips ToUCam is detected and selected.

Table B.8. Commands for Video Capture Device macro

Icon Command Triggered action


Scan system Scans the system for connected capture devices with DirectX
8.1 support and updates the list box.
Show device properties Calls the properties dialog for the selected device if such a dia-
log is supported by the device driver software.

The main list box in the middle shows all available devices for image capturing
and is updated on program start. If a new capture device is connected, press the “Scan
system” button to update the list box. By default, the first found device is selected
for image acquisition. To change the device, select the corresponding list box entry
with the left mouse button.
The drop down list box beneath the main list box contains all capture modes
supported by the selected device. A capture mode consists of the video resolution
and the used compression method. By default, the first capture mode is selected by
the system. It can be changed by opening the drop down list box and selecting a
different entry with the left mouse button.
These three controls operate in edit mode and are disabled in processing mode.
The button “Show device properties” is available in processing mode only (see
Fig. B.11). Pressing this button brings up the properties dialog of the selected de-
vice. Pressing the button once again will close the dialog. The properties dialog is
normally provided by the device driver but it is possible that some drivers do not
offer such a dialog. In this case, there won’t be any system reaction if the button is pressed.

Fig. B.11. Configuration page of the Video Capture Device macro in processing mode. In this
mode only the “Show device properties” button is enabled to display and change the device
dependent property dialog.

To see the live image, double click the red output port of the Video Capture Device macro.

B.5.3 Video Stream

The Video Stream macro extracts image data from video files of type AVI (Audio Video
Interleaved [2]) and MPEG-1 (Motion Picture Experts Group). The image data of
videos may be compressed with different Codecs (Compressor – Decompressor).
To view a video on a computer, the required Codec for decompression has to be
installed. If this is not the case, the video will not play. Furthermore, depending on the Codec being used for decompression, some options may not be available during
video playback.
To configure the macro, the graphical interface depicted in Fig. B.12 is provided.
The interface consists of the video file selection area at the top, the video information
area in the middle, and the frame control at the bottom of the window.
Loading a Video Stream
To load a video file, click the “Open” button in the video file selection area. An
open file dialog appears where the desired video file can be selected. After press-
ing the “Open” button in the file dialog the selected video file is validated. If it can
be opened its file name is displayed in the box labeled “Video file” and information
about the video stream is shown in the video information area (Fig. B.12). This in-
cludes the resolution of the video in (width x height) in pixels, the total number of
frames contained in the video, the duration, and the frame rate. If the selected file
cannot be opened a corresponding message will appear in the output window of the
current document. The most probable reason for this failure is a missing Codec for decompression

Fig. B.12. User interface of the Video Stream macro consisting of the video file selection
area, the video information area, and the frame control. In this example a video file named
“Street.avi” has already been opened.

in case of AVI video files or an interlaced MPEG video file which is
currently not supported.
After a video is loaded, the macro is ready to be processed.
Video Stream Control During Processing
In the default settings, a video is processed frame by frame, and when the end is reached, it starts again from the beginning. The frame control allows this standard behavior to be changed. It consists of four controls:
• Manual frame control: If this checkbox is selected, the edit box and the slider
control for selecting a frame will be activated and the “Continuous Play” check-
box will be disabled. Only the selected frame will be processed until a different
one is chosen (see Fig. B.13).
Please note that the availability of manual frame control depends on the Codec
being used for decompressing the video. If it is not supported by the Codec,
the check box will be disabled and a corresponding warning message will be
displayed in the output window of the document.
• Continuous play: This checkbox is only available if “Manual frame control” is not selected and tells the system how to continue processing after the last frame of the video has been reached. If the checkbox is selected, the video
will be rewound and processing will continue with the first frame. If the checkbox
is not selected, processing will stop after the last frame.
• Frame edit box: If “Manual frame control” is not selected, this control is disabled
and just displays the number of the frame currently processed. During manual
control it is possible to enter the number of the desired frame. Invalid frame
numbers cannot be entered. The entered frame will be processed until a different
frame is selected.
• Frame slider control: The slider is coupled with the frame edit box and is also
disabled when “Manual frame control” is switched off. In this case, the slider

Fig. B.13. Video Stream macro with activated manual frame control. The “Continuous Play”
checkbox is disabled and the edit and slider control allow the selection of a desired frame.

indicates approximately the position of the currently processed frame in the video.
During manual control a frame can be selected by dragging the slider with the
mouse. The selected frame will be displayed in the edit control and processed
subsequently.
To view the image data of the processed video frame double click the red output
port of the Video Stream macro. This will bring up a viewer. Finally, it has to be noted that the video is not played at the frame rate stated in the video information area but as fast as possible, because the software is not intended as a pure video viewer. The main goal is to allow random access to every frame contained in the video. Due to interframe compression, this sometimes requires more time for decompression than playing the frames in sequential order.

B.5.4 Standard Macro Property Window

Except for the source macros introduced in the last three subsections, all other macros use a standardized user interface which allows the configuration of their parameters.
We already encountered a standard macro property window in the tutorial (Fig. B.3
in Sect. B.2.2). Fig. B.14 shows the standard property window of the CannyEdges
macro and Tab. B.9 summarizes the available toolbar commands.

Fig. B.14. Appearance of a standard macro property window.

Table B.9. Toolbar commands for the standard macro property window

Icon Command Triggered action


Restore default values Restores default values for all parameters in the macro property
window
Restore default value Restores default value for selected parameter only
Toggle help window Toggles the help window in the macro selection window and
macro property window

A macro property window is attached to each macro in the document view. It can be opened by clicking the button within the macro with the left mouse button. Parameter changes in these windows apply only to the macro the window is attached to. The macro property windows can be distinguished by their titles, which display the names of the macros they are attached to. However, all windows have the same layout and are operated in the same way.
The main component of a macro property window is a list holding a parameter-value pair in each line. It therefore consists of two columns: the first column contains the parameter’s name, the second column the current parameter value. A parameter can be selected with a left click in the corresponding line. The help window beneath the parameter list displays more information about the selected parameter. It can be hidden by selecting the “Toggle help window” command from the toolbar contained in the macro property window (see Tab. B.9). The value of the selected parameter can be changed in the value field. Depending on the parameter’s data type the following possibilities exist:
• String: To edit a parameter value of type String, click the parameter’s value
field with the left mouse button. A blinking cursor will appear in the field and
all printable characters can be entered. It is not necessary to confirm the text by pressing the “Enter” key; all changes are registered by the program.
• Integer: An integer value can be changed like the value of type String ex-
cept that only characters representing decimal digits can be entered and option-
ally a “+” or “−” sign as the first character.
• Floating point number: Floating point numbers can be changed like integer values. The field accepts the same characters as the integer field and additionally the period (“.”) separating the integer part and the fractional part of the number.
• Boolean: A boolean value takes one of the two states “true” and “false”. The
state can be changed by choosing one of these options from the drop down list
box in the parameter’s value field or by a double click in the field with the left
mouse button. The latter method will toggle the value from “true” to “false” or
vice versa.
• Option list: An option list represents a collection of predefined values. This
collection is presented in a drop down list box where it is possible to select one
of the values. The current value is displayed in the parameter’s value field.
• File: A parameter of type File references a physical file on hard drive. To
change this value, press the button in the value field. A file dialog will appear to
select a new file. It is not possible to enter a new file name into the value field
directly.
• Folder: This kind of parameter references a folder or directory on hard drive,
e. g. to load sequences of images from this folder or to store some data in this
place. It is changed like a file parameter value except that a folder dialog is pre-
sented instead of a file dialog.
• Color: Colors are used for visualization purposes of processing results. Some-
times it is necessary to change colors for better contrasts in the visualization.

Colors are changed like files and folders but a color selection dialog appears
instead of the file or folder dialog.
When experimenting with different parameter values, especially in processing mode, it may happen that the results produced by the macros look very strange and cannot be interpreted in any meaningful way. This problem can be solved by restoring the default values for all macro parameters. To do this, the macro property window’s toolbar (see Tab. B.9) provides two commands. The leftmost button restores all parameters to their default values whereas the middle button just restores the default value for the currently selected parameter.

B.6 Advanced Features and Internals of I MPRESARIO


For the interested reader, this section covers features of the user interface which have not been mentioned yet and are not strictly necessary for working with IMPRESARIO, and it also reveals some of the concepts used in this software.

B.6.1 Customizing the Graphical User Interface

Because IMPRESARIO follows the guidelines for standardized user interfaces, it is
highly customizable in its appearance. Every component of the interface can be re-
sized, hidden or docked at different locations using drag and drop operations. Ad-
ditionally the menus “Window” and “View” offer some commands for GUI cus-
tomization. The former may be used for the arrangement and management of the
opened document views. The latter allows recovering components which were hidden before and is shown in Fig. B.15. The check mark next to a component indicates its visibility. The three components “Process information”, “Console output”, and “Blackboard” are hidden and haven’t been mentioned so far. The first two components were developed to provide information during the development of IMPRESARIO itself. The console output window displays text messages which are normally sent to the console (DOS box) and was integrated for debugging purposes.

Fig. B.15. View menu. With this menu the parts of the graphical user interface can be hidden
or made visible again.

Fig. B.16. The three pages of the process information window displaying the running threads, the window hierarchy of IMPRESARIO, and the linked DLLs. The information has to be updated manually by pressing the “Update” command on the toolbar.

In the software distribution at hand, only the macro execution order is printed in this window. The
process information window shows the threads with their attached windows running
in I MPRESARIO, the window hierarchy as defined by the operating system, and the
used Dynamic Link Libraries on separate pages (see Fig. B.16). All information has to be updated manually by pressing the “Update” button on the toolbar.
A Dynamic Link Library (DLL) is a file with the suffix “.dll” which contains
executable program code and is loaded into memory by the operating system on
application demand [8]. Code within a DLL may be shared among several applica-
tions. DLLs play an important role in the concept of I MPRESARIO as we will see in
Sect. B.6.2.
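As a generic illustration of this mechanism (and not of IMPRESARIO's actual loading code), the following hedged C++ sketch shows how a Windows application can load a DLL at run time and resolve an exported function; the DLL name and the function name are invented for this example.

#include <windows.h>
#include <iostream>

// assumed signature of a function exported by the hypothetical DLL
typedef int (*CountMacrosFunc)();

int main() {
  // load the DLL into the address space of the application
  HMODULE dll = LoadLibraryA("SomeMacros.dll");
  if (dll == NULL) {
    std::cerr << "DLL could not be loaded" << std::endl;
    return 1;
  }

  // resolve an exported function by its name
  CountMacrosFunc countMacros =
      reinterpret_cast<CountMacrosFunc>(GetProcAddress(dll, "countMacros"));
  if (countMacros != NULL) {
    std::cout << "macros in DLL: " << countMacros() << std::endl;
  }

  FreeLibrary(dll);  // unload the DLL again
  return 0;
}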
The third component is the blackboard. Fig. B.17 shows a view on the black-
board when it is activated using the menu depicted in Fig. B.15. A blackboard can
be considered as a place where information can be exchanged globally. This enables
macros to define a kind of global variable. For a user of I MPRESARIO it is not very

Fig. B.17. View on the activated blackboard. The figure shows a filled blackboard from a
document which is not part of the distribution coming with this book.

important but for those readers who might like to develop their own macros it may
be of interest. In this case please refer to Sect. B.8 and the online documentation for
macro developers. However, the blackboard is not used in the examples included in
this book.
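Conceptually, such a blackboard can be pictured as a global key-value store shared by all macros. The following C++ sketch is purely illustrative and does not reflect IMPRESARIO's internal blackboard or its macro API; all names and the stored value type are invented for this example.

#include <iostream>
#include <map>
#include <string>

// a tiny conceptual blackboard: named entries globally visible to all "macros"
class Blackboard {
public:
  void write(const std::string& key, double value) { entries[key] = value; }
  bool read(const std::string& key, double& value) const {
    std::map<std::string, double>::const_iterator it = entries.find(key);
    if (it == entries.end()) return false;
    value = it->second;
    return true;
  }
private:
  std::map<std::string, double> entries;
};

int main() {
  Blackboard board;
  board.write("meanGreyValue", 127.5);   // one macro publishes a global value

  double v = 0.0;
  if (board.read("meanGreyValue", v)) {  // another macro picks it up later
    std::cout << "meanGreyValue = " << v << std::endl;
  }
  return 0;
}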

B.6.2 Macros and Dynamic Link Libraries

As mentioned in the tutorial, the most important components are the macros. Macros
are based on executable code which is stored in Dynamic Link Libraries (DLLs)
and loaded at program start. Therefore the libraries have to be located in a directory
known to the program. Take a look at the “System” tab in the output window. It re-
ports the loaded Macro-DLLs. Each Macro-DLL contains one or more macros which
are added to the macro browser. In the version of I MPRESARIO coming with this
book several Macro-DLLs are delivered: The three DLLs ImageSequence.dll,
VideoCapture.dll, and VideoStream.dll contain the three source macros.
The two remaining Macro-DLLs, namely macros.dll and ammi.dll, contain
all other macros. The former file is publicly available, the latter file was exclusively
programmed for the examples in this book. The DLL concept allows an easy exten-
sion of the system’s functionality without changing the application itself. An Application Programming Interface (API) exists for the development of one’s own Macro-DLLs.
For integration in I MPRESARIO the developed DLLs just have to be copied to the
directories where I MPRESARIO expects them. The directories where I MPRESARIO
searches for DLLs can be configured in the settings dialog (see Sect. B.6.3).

Fig. B.18. The two pages of the settings dialog accessible via the menu
“Extras”→“Settings...”. Left: Default settings for directories, macro property windows,
and viewers. Right: Fictitious declaration of dependent dynamic link libraries. Because no third
party products are shipped with the version of I MPRESARIO in this book, this page is empty
by default.

B.6.3 Settings Dialog and Directories

The settings dialog allows some aspects of IMPRESARIO to be customized. It is opened by selecting the “Settings...” command in the “Extras” menu. Fig. B.18 shows the
two pages of the dialog named “General Settings” and “DLL Dependencies”.
The “General Settings” page displays a listbox with four categories. The first category contains the directories where files of different types are searched for. A directory can be changed by selecting the corresponding line in the listbox with the left mouse button. A button will appear to the right of the directory name. Pushing this button brings up a dialog where the new directory can be selected. The directories and their meaning within IMPRESARIO are summarized in Tab. B.10. The remaining three categories allow some default options to be set for newly created macro property windows, viewers, and the blackboard.
The “DLL Dependencies” page is not used in the version of I MPRESARIO
shipped on the book’s CD but for the sake of completeness its purpose has to be
mentioned. As described in Sect. B.6.2, the functionality of I MPRESARIO can be ex-
tended by developing new Macro-DLLs. Sometimes it is necessary to incorporate
third party software into one’s own code which may also be delivered in DLLs for
further distribution. The DLL developed by oneself then depends on the existence of
the third party DLLs. These dependencies can be declared in the “DLL Dependen-
cies” page and allow more detailed checks and error messages at system start up. On
the right side of Fig. B.18, a fictitious example of a DLL dependency is shown.

Table B.10. Directories used in I MPRESARIO and their meaning

Directory Meaning
DLL-Directory for Macros Directory where I MPRESARIO searches for Macro-
DLLs at program start. This directory should not be
changed. Otherwise it may happen that no macros are
available next time I MPRESARIO is used. Check the
“System” tab in the output window for loaded Macro-
DLLs.
DLL-Directory for dependent files In this directory dependent files are stored which may
be shipped with third party software. In this version of
I MPRESARIO no third party software is contained and
this directory remains empty. So, changing it won’t have
any effect.
Data directory Macros will store their data in this directory by default.
Document directory Directory where the document files (*.ipg) are loaded
from and stored by default.
Image directory (Input) Still images in the format Bitmap, Portable Network
Graphics and JPEG are stored in this directory for use
in the Image Sequence source.
Image directory (Output) Images produced by macros and viewers will be stored
here.

Table B.11. Summary of examples using I MPRESARIO

Example Process graph file Section


Hand Gesture Commands (Chapter 2):
Static gesture classification StaticGestureClassifier.ipg 2.1.4
Dynamic gesture classification
from live video DynamicGestureClassifier.ipg 2.1.5
from images DynamicGestureClassifier FromDisk.ipg 2.1.5
Record gestures for own training RecordGestures.ipg 2.1.5
Facial Expression Commands (Chapter 2):
Eye detection FaceAnalysis Eye.ipg 2.2
Lip tracking FaceAnalysis Mouth.ipg 2.2
Face recognition (Chapter 5):
Eigenface PCA calculation Eigenfaces calcPCA.ipg 5.1.7.2
Eigenface classifier training Eigenfaces Train.ipg 5.1.7.2
Eigenface face recognition Eigenfaces Test.ipg 5.1.7.2
Component classifier training Components Train.ipg 5.1.7.3
Component based face recognition Components Test.ipg 5.1.7.3
People tracking (Chapter 5):
People tracking indoor PeopleTracking Indoor.ipg 5.3.5
People tracking outdoor PeopleTracking Parking.ipg 5.3.5
I MPRESARIO (Appendix B):
Canny edge detection CannyEdgeDetection.ipg B.2

B.7 I MPRESARIO Examples in this Book


This section summarizes all examples in this book using I MPRESARIO. For the
reader’s convenience all examples are stored in process graph files which can be
found in the documents folder in I MPRESARIO’s installation directory. Tab. B.11
names each example, the corresponding file, and the section where the example is discussed in detail. Please refer to the respective section before you start processing a graph.

B.8 Extending I MPRESARIO with New Macros


This section addresses all readers who might like to extend I MPRESARIO’s function-
ality with their own Macro-DLLs.

B.8.1 Requirements

The API for developing Macro-DLLs is written in C++ because the LTI-L IB has
also been developed using this language. To successfully develop new Macro-DLLs
the reader has to be familiar with

• concepts and usage of IMPRESARIO,
• concepts and usage of the LTI-Lib,
• object oriented programming in C++ including class inheritance, data encapsulation, polymorphism, and template usage,
• and DLL concepts and programming on Windows platforms.
To use the API an ANSI C++ compatible compiler is needed. The API contains
sample projects for Microsoft’s Visual Studio .NET 2003. This development envi-
ronment is therefore recommended. Please note that Microsoft’s Visual Studio 6 or prior versions of this product will not compile the source code because of insufficient template support. Other compilers have not been tested so far.
Additionally an installed Perl environment will be needed if the Perl scripts com-
ing with the API are used.
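To give a flavour of the last requirement listed above, the following hedged C++ sketch shows the general pattern of exporting a function from a Windows DLL. It is a generic example and deliberately does not use IMPRESARIO's actual macro API, whose interfaces are documented in the material described in Sect. B.8.2; the file and function names are invented.

// SomeMacros.cpp, compiled into a DLL, e.g. with Visual Studio .NET 2003
#include <windows.h>

// standard DLL entry point; no special initialization is needed in this example
BOOL APIENTRY DllMain(HANDLE, DWORD, LPVOID) {
  return TRUE;
}

// export a plain C function so that it can be resolved by name via GetProcAddress
extern "C" __declspec(dllexport) int countMacros() {
  return 1;  // a real Macro-DLL would report its registered macros here
}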

B.8.2 API Installation and further Reading

The necessary files containing the API for Macro-DLLs as well as further doc-
umentation and code examples can be found on the book’s CD in the file
impresario api setup.exe in the Impresario directory. Copy this file to
your hard drive and execute it. You will be prompted for a directory. Please provide
the directory I MPRESARIO is located in. The setup program will decompress several
files into a separate subdirectory named macrodev. This subdirectory contains a file called documentation.html which holds the complete documentation for the API. Please refer to this file for further reading.

B.8.3 I MPRESARIO Updates

IMPRESARIO is freeware and was not developed for this book only but for the community of LTI-Lib users in the first place. Updates and bug fixes for IMPRESARIO as well as new macros can be obtained from the World Wide Web at
http://www.techinfo.rwth-aachen.de/Software/Impresario/index.html

References
1. Anderson, D. FireWire System Architecture. Addison-Wesley, 1998. ISBN 0201485354.
2. Austerberry, D. Technology of Video and Audio Streaming. Focal Press, 2002. ISBN
024051694X.
3. Canny, J. Finding Edges and Lines in Images. Technical report, MIT, Artificial Intelli-
gence Laboratory, Cambridge, MA, 1983.
4. Canny, J. A Computational Approach to Edge Detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 8(6):679–698, 1986.
5. Cantu, M. Mastering Delphi 7. Sybex Inc, 2003. ISBN 078214201X.
6. Gray, K. The Microsoft DirectX 9 Programmable Graphics Pipeline. Microsoft Press International, 2003. ISBN 0735616531.

7. Gross, J. L. and Yellen, J., editors. Handbook of Graph Theory. CRC Press, 2003. ISBN
1584880902.
8. Hart, J. M. Windows System Programming. Addison-Wesley, 3rd edition, 2004. ISBN
0321256190.
9. i-Logix Inc., Andover, MA. Statemate Magnum Reference Manual, Version 1.2, 1997.
Part-No. D-1100-37-C.
10. Jaff, K. USB Handbook: Based on USB Specification Revision 1.0. Annabooks, 1997.
ISBN 0929392396.
11. Woods, C., Bicalho, A., and Murray, C. Mastering 3D Studio Max 4. Sybex Inc, 2001.
ISBN 0782129382.
Index

4-neighborhood, 20 beliefs, desires, and intentions model, 159


8-neighborhood, 20 binaural synthesis, 274, 275
black box models, 376
AAC, 393 border
accommodation, 266, 271 computation, 21
acoustic alert, 382 length, 23
acoustic interfaces, 272 points, 20
Active Appearance Model, 68, 186 braking augmentation, 383
active shape model, 87, 186
active steering, 383, 385 camera model, 251
active stereo, 268 cascaded classifier, 66
AdaBoost, 64, 183 CAVE, 266
adaptive filtering, 384 cheremes, 119
adjacency circle Hough transformation, 83
pixel, 20 class-based n-grams, 181
agent, 349 classification, 29, 110
ALBERT, 363 class, 29
alerts and commands, 381 concepts, 29
analytic approaches, 186 garbage class, 30
application knowledge, 386 Hidden Markov Models, 37
architecture Markov chain, 37
event-driven, 348 maximum likelihood, 34
ARMAR, 322 overfitting, 31
articulatory features, 178 phases, 29
assistance in manual control, 372 rule-based, 33
assisted interaction, 369 supervised, 31
auditory display, 273 training samples, 29
augmented reality, 266, 392 unsupervised, 31
auto completion, 394 color histogram, 226, 241
automatic speech recognition, 141, 154 color space
automation, 384 RGB, 11
autonomous systems, 358 color structure descriptor (CSD), 227
command and control, 153
background commanding actions, 325
Gaussian mixture model, 104 commenting actions, 325
median model, 103 compactness, 25
modeling/subtraction, 102 conformity with expectations, 391
background subtraction, 234 context of use, 5, 370
shadow reduction, 255 control input limiting, 384
bag-of-words, 182 control input management, 383
Bakis topology, 39 control input modifications, 383
Bayes’ rule, 148 controllability, 391
Bayesian Belief Networks, 388, 389 convergence, 266, 271
Bayesian nets, 182 conversational maxims, 162

coordinate transformation, 251 technical, 173


cross talk cancellation, 274, 276 tolerance, 391
crossover model, 373, 376 userspecific, 172
Euclidian distance, 78
DAMSL, 163 event-driven architecture, 348
data entry management, 372, 391, 393 eye blinks, 375
data glove, 336, 363
decoding, 147 face localization, 61
Dempster Shafer theory, 388 face recognition
design goals, 2 affecting factors, 194
dialog context, 391 component-based, 206
dialog design, 162 geometry based, 200
dialog management, 154 global/holistic, 200
dialog system state, 390 hybrid, 200
dialog systems, 1 local feature based, 200
dilation, 237 face tracking, 67
discrete fourier transform (DFT), 210 facial action coding system, 186
display codes and formats, 392 facial expression commands, 56
distance sensors, 378 facial expression recognition, 184
do what i mean, 394 facial expressions, 183
driver modeling, 375 feature
driver state, 374 area, 23
dual hand manipulation, 338 border length, 23
dynamic grasp classification, 346 center of gravity, 23
dynamic systems, 1 classification, 29, 110
compactness, 25
early feature fusion, 186 derivative, 27
ECAM, 392 eccentricity, 24
eigenfaces, 183, 185, 201 extraction, 12, 105
eigenspace, 78 geometric, 22
emotion model, 176 moments, 23
emotion recognition, 175 normalization, 26, 109
emotional databases, 177 orientation, 25
ensemble construction, 180 selection, 32
erosion, 237 stable, 22
error vector, 8
active handling, 172 feature extraction, 178
avoidance, 173 feature groups, 122, 124
identification, 393 feature selection, 179
isolation, 393 filter-based selection, 179
knowledge-based, 173 fine manipulation, 338
management, 393 finite state model, 156
multimodal, 172 force execution gain, 383
passive handling passive, 172 formant, 178
processing, 173 frame-based dialog, 157
recognition, 173 freedom of error, 2
rulebased, 173 function allocation, 385
skillbased, 172 functional link networks, 376
systemspecific, 173 functionality, 390

functionality shaping, 392 human activity, 322


fusion decomposition, 323
early signal, 168 direction, 322
late semantic, 169 operator selection, 323
multimodal integration/fusion, 168 human advice, 354
fuzzy logic, 388 human comments, 354
human transfer function, 374
Gabor Wavelet Transformation, 83 human-computer interaction, 164
garbage class, 30 humanoid robot, 316
Garbor-wavelet coefficients, 185
Gaussian mixture models, 241 image
gen locking, 268, 303 gray value, 11
geometric feature, see feature, geometric intensity, 11
gesture recognition, 7 mask, 11
goal directed processing, 159 object probability, 16
goal extraction, 342 preprocessing, 102
grasp, 324
representation, 10
classification, 344
skin probability, 19
dynamic, 324, 344
immersion, 264
segmentation, 343
immersive display, 266
static, 324, 343
immersive interface, 361, 362
taxonomy, 348
Impresario, 6
taxonomy of Cutkosky, 345
information management, 379, 391
gray value image, 11
input shaping, 384
guiding sleeves, 294
integral image, 62
intensity panning, 273
Haar-features, 62
intent recognition, 392
hand gesture commands, 7
intentions, 370
hand localization, 14
inter-/intra-gesture variance, 22
handover strategies, 385
interaction, 164
haptic alerts, 381
haptic interfaces, 279 active, 322
head pose, 375 passive, 322
head pose estimation, 78 robot assistants, 321
head related impulse response, 272 interaction activity
head related transfer function, 272 classification, 323
head-mounted display, 266, 361 commanding actions, 323
head-up display, 380, 382 commenting actions, 323
Hidden Markov Models, 37, 142 performative actions, 323
Bakis topology, 39 interaction knowledge, 386
multidimensional observations, 46 interactive learning, 328
parameter estimation, 44 task learning, 330
high inertia systems, 380 interactive programming
histogram example, 354
object/background color, 15 interaural time difference, 272
skin/non-skin color, 18 intervention, 372
holistic approaches, 184 intrusiveness, 374
homogeneous coordinates, 251
homogeneous texture descriptor (HTD), 229 Java3D, 6

kernel density estimation, 242 mental model, 386


keyword spotting, 180 menu hierarchy, 386
kNN classifier, 205 menu options, 386
MFCC, 178
language model, 148, 181 mobile robot, 316
late semantic fusion, 186 modality, 166
latency, 263 morphological operations, 236
LAURON, 361 motion parallax, 266
learning motor function, 165
interactive, 328 moving platform vibrations, 384
learning automaton, 350 multimodal interaction, 393
learning through demonstration multimodality, 4, 166
classification, 326 multiple hypotheses tracking, 107
sensors, 336 mutlimodal interaction, 164
skill, 326
task, 327 n-grams, 181
level of detail rendering, 300 natural frequency, 381
lexical predictions, 393 natural language generation, 155
Lidstone coefficient, 182 natural language understanding, 154
limits of maneuverability, 381 nearest neighbor classifier, 205
line-of-sight, 375 noise canceling, 384
linear regression tracking, 245 normalization
linguistic decoder, 148 features, 26, 109
local regression, 377 notation system, 118
movement-hold-model, 120
magic wand, 363
magnetic tracker, 336 object probability image, 16
man machine interaction, 1 observation, 29
man-machine dialogs, 386 oculomotor factors, 266
man-machine task allocation, 385 ODETE, 363
manipulation tasks, 338 on/off automation, 385
classes, 338 operator fatigue, 375
definition, 338 oriented Gaussian derivatives (OGD), 228
device handling, 339 overfitting, 31
tool handling, 339 overlap resolution, 106
transport operations, 339
manual control assistance, 379 Parallel Hidden Markov Models, 122
mapping tasks, 350 channel, 122
movements, 353 confluent states, 123
coupling fingers, 351 parametric models, 376
grasps, 351 particle dynamics, 286
power grasps, 352 passive stereo, 268
precision grasps, 352 PbD process, 333
Markov chain, 37 abstraction, 333
Markov decision process, 158 execution, 334
maximum a-posteriori, 182 interpretation, 333
maximum likelihood classification, 34, 85 mapping, 334
maximum likelihood estimation, 181 observation, 333
mean shift algorithm, 244 segmentation, 333
simulation, 334 description, 20
system structure, 335 geometric features, 22
perceptibility augmentation, 380 responsive workbench, 271
perception, 165 RGB color space, 11
performative actions, 323 rigid body dynamics, 288
person recognition robot assistants, 315
authentication, 191 interaction, 321
identification, 191 learning and teaching, 325
verification, 191 room-mounted display, 267
phrase spotting, 182
physically based modeling, 286 safety of use, 2
physiological data, 371 scene graph, 298
pinhole camera model, 251 scripting, 163
pitch, 178 segmentation, 16, 234
pixel adjacency, 20 threshold, 16
plan based classification, 389 self-descriptiveness, 391
plan library, 388, 391 sense, 165
plan recognition, 388 sensitive polygons, 296
predictive information, 380 sensors for demonstration, 336
predictive scanning, 394 data gloves, 336
predictive text, 394 force sensors, 337
predictor display, 381 magnetic tracker, 336
predictor transfer function, 380 motion capturing systems, 336
preferences, 370 training center, 337
principal component analysis, 179 sequential floating search method, 179
principal component analysis (PCA), 204 shutter glasses, 267
prioritizing, 392 shutter glasses, 267
programming sign language recognition, 5, 99
explicit, 358 sign-lexicon, 117
implicit, 358 signal
programming by demonstration, 330 transfer latency, 358
programming robots via observation, 329 signal processing, 343
prompting, 392 simplex, 59
prosodic features, 178 single wheel braking, 385
skill
question and answer, 153 elementary, 331
high-level, 331
Radial Basis Function (RBF) network, 345 innate, 327
rational conversational agents, 160 learning, 326
recognition skin probability image, 19
faces, 193 skin/non-skin color histogram, 18
gestures, 7 slot filling, 157
person, 224 snap-in mechanism, 297
sign language, 99 soft decision fusion, 169, 187
recording conditions sparse data handling, 158
laboratory, 9 spatial vision, 265
real world, 9 speech commands, 382
redundant coding, 382 speech communication, 141
region speech dialogs, 153
speech recognition, 4 expert programming device, 360
speech synthesis, 155 input devices, 359
spoken language dialog systems, 153 local reaction, 359
static grasp classification, 344 output devices, 359
steering gain, 383 planning, 359
stemming, 181 prediction, 360
stereopsis, 265 system components, 359
stick shakers, 381 template matching, 209
stochastic dialog modeling, 158 temporal texture templates, 240
stochastic language modeling, 127 texture features
bigram models, 127 homogeneous texture descriptor (HTD), 229
sign-groups, 128
stopping, 180 oriented Gaussian derivatives (OGD), 228
subgoal extraction, 342 threshold
subunits, 116, 117 automatic computation, 17
classification, 125 segmentation, 16
enlargement of vocabulary size, 133 tiled display, 271
estimation of model parameters, 131 timing, 392
training, 128 tracking, 238
suitability for individualization, 391 color-based tracking, 246
supervised classification, 31 combined tracking detection, 243
supervised training, 376 hand, 107
supervisory control, 1 hypotheses, 107
support of dialog tasks, 390 multiple hypotheses, 107
Support Vector Machines (SVM), 346 shape-based tracking, 244
state space, 107
tagging, 163 tracking following detection, 238
takeover by automation, 372 tracking on the ground plane, 250
task tracking system, 268
abstraction level, 331 traction control, 384
example methodology, 332 traffic telematics, 378
goal, 328 training, 144
hierarchical representation, 340 training center, 337
internal representation, 332 training phase, 29
learning, 327 training samples, 29
mapping, 333 trajectory segmentation, 343
mapping and execution, 348 transcription, 117, 118
model, 328 initial transcription, 129
subgoal, 328 linguistic-orientated, 118
task learning visually-orientated, 121
classification, 331 transparency, 391
taxonomy trellis diagram, 42, 146
multimodal, 167
telepresence, 356 tremor, 384
telerobotics, 356 tunnel vision, 373
concept, 356 tutoring, 392
controlling robot assistants, 362
error handling, 360 unsupervised classification, 31
exception handling, 360 usability, 2, 163, 390
user activities, 390 visual commands, 382
user assistance, 5, 370 visual interfaces, 265
user expectations, 371 visualization robot intention, 362
user identity, 370 Viterbi algorithm, 42, 146
user observation, 371 voice quality features, 178
user prompting, 371 VoiceXML, 163
user state, 370 VR toolkits, 301
user state identification, 388
user whereabouts, 390
warning, 393
variance wave field synthesis, 274
inter-/intra-gesture, 22 wearable computing, 390, 394
vehicle state, 377 whereabouts, 370
viewer centered projection, 268 window function, 178
virtual magnetism, 296 working memory, 386
virtual reality, 4, 263 workload assessment, 371
vision systems, 378 wrapper-based selection, 179
Karl-Friedrich Kraiss received the Diploma in Communications Engineering and the
Doctorate in Mechanical Engineering from Berlin Technical University in 1966 and
1970, respectively. In 1987 he earned the Habilitation for Aerospace Human Factors
Engineering from RWTH Aachen University. In 1988 he received the research award
"Technical Communication" from the Alcatel/SEL foundation for research in
man-machine communication. Since 1992 he has been a full Professor at RWTH
Aachen University, where he holds the Chair for Technical Computer Science in the
Faculty of Electrical Engineering and Information Technology. His research interests
include man-machine interaction, artificial intelligence, computer vision, assistive
technology, and mobile robotics.