Advanced Man-Machine
Interaction
Fundamentals and Implementation
Editor
Professor Dr.-Ing. Karl-Friedrich Kraiss
RWTH Aachen University
Chair of Technical Computer Science
Ahornstraße 55
52074 Aachen
Germany
Kraiss@techinfo.rwth-aachen.de
This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication
of this publication or parts thereof is permitted only under the provisions of the German Copyright
Law of September 9, 1965, in its current version, and permission for use must always be obtained from
Springer. Violations are liable for prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springer.com
© Springer-Verlag Berlin Heidelberg 2006
Printed in Germany
The use of general descriptive names, registered names, trademarks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
Typesetting: Digital data supplied by editor
Cover Design: design & production GmbH, Heidelberg
Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Printed on acid-free paper 7/3100/YL 543210
To: Eva,
Gregor,
Lucas,
Martin, and
Robert.
Preface
The way in which humans interact with machines has changed dramatically during
the last fifty years. Driven by developments in software and hardware technologies,
new appliances never thought of a few years ago appear on the market. There are
apparently no limits to the functionality of personal computers or mobile telecommunication
appliances. Intelligent traffic systems promise formerly unknown levels
of safety. Cooperation between factory workers and smart automation is under way
in industrial production. Telepresence and teleoperation in space, under water, or in
micro worlds are made possible. Mobile robots provide services on duty and at home.
As, unfortunately, human evolution does not keep pace with technology, we face
problems in dealing with this brave new world. Many individuals have experienced
minor or major calamities when using computers for business or in private. The
elderly and people with special needs often refrain altogether from capitalizing on
modern technology. Statistics indicate that in traffic systems at least two out of three
accidents are caused by human error.
In this situation man-machine interface design is widely recognized as a major
challenge. It represents a key technology that promises excellent pay-offs. As a
major marketing factor it also secures a competitive edge over competitors.
The purpose of this book is to provide deep insight into novel enabling technolo-
gies of man-machine interfaces. This includes multimodal interaction by mimics,
gesture, and speech, video based biometrics, interaction in virtual reality and with
robots, and the provision of user assistance. All these are innovative and still dynam-
ically developing research areas.
The concept of this book is based on my lectures on man-machine systems
held regularly for students in electrical engineering and computer science at RWTH
Aachen University and on research performed during the last years at this chair.
The compilation of the materials would not have been possible without contribu-
tions of numerous people who worked with me over the years on various topics of
man-machine interaction. Several chapters are authored by former or current doctoral
students. Some chapters build in part on earlier work of former coworkers such as
Suat Akyol, Pablo Alvarado, Ingo Elsen, Kirsti Grobel, Hermann Hienz, Thomas
Krüger, Dirk Krumbiegel, Roland Steffan, Peter Walter, and Jochen Wickel. Their
work therefore deserves mention as well.
To provide a well-balanced and complete view of the topic, the volume further
contains two chapters by Professor Gerhard Rigoll from Technical University of
Munich and by Professor Rüdiger Dillmann from Karlsruhe University. I am very
much indebted to both for their contributions. I also wish to express my gratitude to
Eva Hestermann-Beyerle and to Monika Lempe from Springer Verlag: to the former
for triggering the writing of this book, and to both for their helpful support
during its preparation.
Finally, special thanks go to my wife Cornelia for her unwavering support and, more
generally, for sharing life with me.
Contents
1 Introduction . . . . . . 1
References . . . . . . 92
Index . . . . . . 453
Contributors
Introduction
Karl-Friedrich Kraiss
Living in modern societies relies on smart systems and services that support
our collaborative and communicative needs as social beings. Developments in soft-
ware and hardware technologies as e.g. in microelectronics, mechatronics, speech
technology, computer linguistics, computer vision, and artificial intelligence are
continuously driving new applications for work, leisure, and mobility. Man-machine
interaction, which is ubiquitous in human life, is driven by technology as well. It
is the gateway providing access to functions and services which, due to increasing
complexity, threatens to turn into a bottleneck. Since for the user the interface is
the product, interface design is a key technology which accordingly is subject to
significant national and international research efforts.
Interfaces to smart systems exploit the same advanced technologies as those
systems themselves. This textbook therefore not only introduces researchers, students,
and practitioners to advanced concepts of interface design, but also gives insight
into the related interface technologies.
This introduction first outlines the broad spectrum of applications in which man-
machine interaction takes place. Then, following an overview of interface design
goals, major approaches to interaction augmentation are identified. Finally, the
individual chapters of this textbook are briefly characterized.
Modern vehicles, for instance, contain a growing number of assistance systems, many
of which require interaction with the driver. In the near future mobile service
robots are also expected on the marketplace as companions or home helpers. While
these will have built-in intelligence for largely autonomous operation, they nevertheless
will need human supervision and occasional intervention.
The characterization proposed in (Fig. 1.1) very seldom occurs in pure form.
Many systems are blended in one way or another, requiring discrete dialog actions,
manual control, and supervisory control at the same time. A prototypical example is
a commercial airliner, where the cockpit crew performs all these actions in sequence
or, at times, in parallel.
Fig. 1.1. Classification of man-machine interaction:
Stationary: electronic appliances in home and office; industrial plants; control stations.
Mobile: mobile phones; laptops; PDAs; Blackberries.
Manual control: land, air, and sea vehicles; master-slave systems; rehabilitation robotics.
Supervisory control: automated land, air, and sea vehicles; service robots; industrial plants.
Starting point for the discussion of interface augmentation is the basic interaction
scheme represented by the white blocks in (Fig. 1.2), where the term machine stands
for appliance, vehicle, robot, simulation or process. Man and machine are interlinked
via two unidirectional channels, one of which transmits machine generated informa-
tion to the user, while the other one feeds user inputs in opposite direction. Interaction
takes place with a real machine in the real world.
[Figure: man and the real machine in the real world are linked by two unidirectional channels. The input channel passes through multimodal fusion of speech, gesture, mimics, etc.; the display channel through multimodal fission into the same modalities. A user assistance block, informed by the context of use, acts on both channels, and the real machine may be substituted by operation in virtual reality.]
Fig. 1.2. Interaction augmented by multimodality, user assistance, and virtual reality.
There are three major approaches, which promise to significantly augment inter-
face quality. These are multimodal interaction, operation in virtual reality, and user
assistance, which have been integrated into the basic MMI scheme of (Fig. 1.2) as
grey rectangles.
Multimodality is ubiquitous in human communication. We gesture and mimic
while talking, even on the phone, when the addressee cannot see it. We nod, shake
our head, or change head pose to indicate agreement or disagreement. We also signal
attentiveness by suitable body language, e.g. by turning towards a dialog partner. In
so doing, conversation becomes comfortable, intuitive, and robust.
It is exactly this lack of multimodality that makes current interfaces fail so often. It
therefore is a goal of advanced interface design to make use of various modalities.
Modalities applicable to interfaces are mimics, gesture, body posture, movement, speech,
and haptics. They serve either information display or user input, sometimes even
both purposes. The combination of modalities for input is sometimes called multimodal
fusion, while the combination of different modalities for information display is
labeled multimodal fission. For fission, modalities must be generated, e.g. by synthesizing
speech or by providing gesturing avatars. This book focuses on fusion, which
requires automatic recognition of multimodal human actions.
Speech recognition has been around for almost fifty years and is available for
practical use. Further improvements of speech recognizers with respect to context-sensitive
continuous speech recognition and robustness against noise are under way.
Observation of lip movements during talking contributes to speech recognition under
noisy conditions. Advanced speech recognizers will also cope with incomplete
grammar and spontaneous utterances.
Interest in gesture and mimic recognition is more recent. In fact the first related
papers appeared only in the nineties. First efforts to record mimics and gesture in
real time in laboratory setups and in movie studios involved intrusive methods with
calibrated markers and multiple cameras, which are of no use for the applications at
hand. Only recently video based recognition has achieved an acceptable performance
level in out-of-the-laboratory settings. Gesture, mimics, head pose, line of sight and
body posture can now be recognized based on video recordings in real time, even
under adverse real world conditions. People may even be optically tracked or located
in public places. Emotions derived from a fusion of speech, gestures and mimics now
open the door for yet scarcely exploited emotional interaction.
The second pillar on which interface augmentation rests is virtual reality,
which provides comfortable means of interactive 3D simulation, thereby enabling
various novel approaches to interaction. Virtual reality allows, e.g., operating virtual
machines in virtual environments such as flight simulators. It may also be used to
generate virtual interfaces. By providing suitable multimodal information channels,
realistic telepresence is achieved that permits teleoperation in worlds which
are physically not accessible because of distance or scale (e.g. space applications or
micro- and nano-worlds).
In master-slave systems the operator usually acts on a local real machine, and the
actions are repeated by a remote machine. With virtual reality the local machine can
be substituted by simulation. Virtual reality also has a major impact on robot exploration,
testing, and training, since robot actions can be taught-in and tested within a
local system simulation before being transferred and executed in a remote and pos-
sibly dangerous and costly setting. Interaction may be further improved by making
use of augmented or mixed reality, where information is presented non-intrusively
on head mounted displays as an overlay to virtual or real worlds.
Apart from multimodality and virtual reality, the concept of user assistance is
a third ingredient of interpersonal communication lacking so far in man-machine
interaction. Envision, e.g., how business executives often call upon a staff of personal
assistants to get their work done. Based on familiarity with the context of the given
task, i.e. the purpose and constraints of a business transaction, a qualified assistant is
expected to provide correct information just in time, hereby reducing the workload
of the executive and improving the quality of the decisions made.
In man-machine interaction the context of task transfers to the context of use, which may
be characterized by the state of the user, the state of the system, and the situation. In
conventional man-machine interaction little of this is yet made explicit: neither user
characteristics nor the circumstances of operation are available to support interaction;
the knowledge about context resides with the user alone. However, if the context of
use were made explicit, assistance could be provided to the user, similar to that
offered to the executive by a personal assistant. Hence knowledge of the context of
use is the key to interface adaptivity. As indicated by vertical arrows in
(Fig. 1.2), assistance can augment interaction in three ways. First, information display
may be improved, e.g. by providing information just in time and coded in the most
suitable modality. Second, user inputs may be augmented, modified, or even suppressed
as appropriate. A third method of providing assistance exists for dynamic
systems, where system functions may be automated either permanently or temporarily.
An appliance assisted in this way is supposed to appear simpler to the user than
it actually is. It will also be easier to learn, more pleasant to handle, and less error-prone.
Most publications in the field of human factors engineering focus on interface design,
while leaving the technical implementation to engineers. While this book addresses
design aspects as well, it additionally puts special emphasis on advanced interface
implementation. A significant part of the book is concerned with the realization of
multimodal interaction via gesture, mimics, and speech. The remaining parts deal
with the identification and tracking of people, with interaction in virtual reality and
with robots, and with the implementation of user assistance.
The second chapter treats picture processing algorithms for the non-intrusive ac-
quisition of human action. In the first section techniques for identifying and classify-
ing hand gesture commands from video are described. These techniques are extended
to facial expression commands in the second section.
Sign language recognition, the topic of chapter 3, is the ultimate challenge in
multimodal interaction. It builds on the basic techniques presented in chapter 2,
but considers specialized requirements for the recognition of isolated and continuous
signs using manual as well as facial expression features. A method for the automatic
transcription of signs into subunits is presented which, comparable to phonemes in
speech recognition, opens the door to large vocabularies.
The role of speech communication in multimodal interfaces is the subject of
chapter 4, which in its first section presents basics of advanced speech recognition.
Speech dialogs are treated next, followed by a comprehensive description of mul-
timodal dialogs. A final section is concerned with the extraction of emotions from
speech and gesture.
The 5th chapter deals with video-based person recognition and tracking. A first
section is concerned with face recognition using global and local features. Then full-
body person recognition is dealt with based on the color and texture of the clothing
a subject wears. Finally various approaches for camera-based people tracking are
presented.
Chapter 6 is devoted to problems arising when interacting with virtual objects
in virtual reality. First technological aspects of implementing visual, acoustic, and
haptic virtual worlds are presented. Then procedures for virtual object modeling are
outlined. The third section is devoted to various methods to augment interacting with
virtual objects. Finally the chapter closes with a description of software tools ap-
plicable for the implementation of virtual environments.
The role of service robots in the future society is the subject of chapter 7. First
a description of available mobile and humanoid robots is given. Then actual challenges
arising from interaction with robot assistants are systematically tackled,
e.g. learning, teaching, programming by demonstration, commanding procedures,
task planning, and exception handling. As far as manipulation tasks are considered,
relevant models and representations are presented. Finally telepresence and telero-
botics are treated as special cases of man-robot interaction.
The 8th chapter addresses the concept of assistance in man-machine interaction,
which is systematically elaborated for manual control and dialog tasks. In both cases
various methods of providing assistance are identified based on user capabilities and
deficiencies. Since context of use is identified as the key element in this undertak-
ing, a variety of algorithms are introduced and discussed for real-time non-intrusive
context identification.
For readers with a background in engineering and computer science, chapters 2
and 5 present a collection of programming examples, which are implemented with
'Impresario'. This rapid prototyping tool offers an application programming interface
for comfortable access to the LTI-Lib, a large public domain software library
for computer vision and classification. A description of the LTI-Lib can be found in
Appendix A, a manual of Impresario in Appendix B.
For the examples a selection of head, face, eye, and lip templates, hand gestures,
sign language sequences, and people tracking clips is provided. The reader
is encouraged to use these materials to start experimenting on his own. Chapter 6,
finally, provides the Java3D code of some simple VR applications. All software,
programming code, and templates come on the CD accompanying the book.
Chapter 2: Non-Intrusive Acquisition of Human Action
Fig. 2.1. The processing chain found in many pattern recognition systems.
Many pattern recognition systems have a common processing chain that can
roughly be divided into three consecutive stages, each of which forwards its output
to the next (Fig. 2.1). First, a single frame is acquired from a camera, a file system,
a network stream, etc. The feature extraction stage then computes a fixed number of
scalar values that each describe an individual feature of the observed scene, such as
the coordinates of a target object (e.g. the hand), its size, or its color. Finally, the
classification stage maps the feature vector to one of a set of known classes.
Since the vocabulary is usually defined before the development of a recognition
system starts, both design and implementation are usually optimized accordingly.
This may later lead to problems if the vocabulary is to be changed or extended.
In case the system is completely operational before the constellation of the vocab-
ulary is decided upon, recognition accuracy can be optimized by simply avoiding
gestures that are frequently misclassified. In practice, a mixture of both approaches
is often applicable.
The vocabulary affects the difficulty of the recognition task in two ways. First,
with growing vocabulary size the number of possible misclassifications rises, render-
ing recognition increasingly difficult. Second, the degree of similarity between the
vocabulary’s elements needs to be considered. This cannot be quantified, and even a
qualitative estimate based on human perception cannot necessarily be transferred to
an automatic system. Thus, for evaluating a system’s performance, vocabulary size
and recognition accuracy alone are insufficient. Though frequently omitted, the ex-
act constellation of the vocabulary (or at least the degree of similarity between its
elements) needs to be known as well.
2.1.1.2 Recording Conditions
The conditions under which the input images are recorded are crucial for the design
of any recognition system. It is therefore important to specify them as precisely and
completely as possible before starting the actual implementation. However, unless
known algorithms and hardware are to be used, it is not always foreseeable how
the subsequent processing stages will react to certain properties of the input data.
For example, a small change in lighting direction may cause an object’s appearance
to change in a way that greatly affects a certain feature (even though this change
may not be immediately apparent to the human eye), resulting in poor recognition
accuracy. In this case, either the input data specification has to be revised not to
allow changes in lighting, or more robust algorithms have to be deployed for feature
extraction and/or classification that are not affected by such variations.
Problems like this can be avoided by keeping any kind of variance or noise in the
input images to a minimum. This leads to what is commonly known as “laboratory
recording conditions”, as exemplified by Tab. 2.1.
While these conditions can usually be met in the training phase (if the classifi-
cation stage requires training at all), they render many systems useless for any kind
of practical application. On the other hand, they greatly simplify system design and
might be the only environment in which certain demanding recognition tasks are
viable.
In contrast to the above enumeration of laboratory conditions, one or more of the
“real world recording conditions” shown in Tab. 2.2 often apply when a system is to
be put to practical use.
Designing a system that is not significantly affected by any of the above condi-
tions is, at the current state of research, impossible and should thus not be aimed for.
General statements regarding the feasibility of certain tasks cannot be made, but from
experience one will be able to judge what can be achieved in a specific application
scenario.
I(x, y)  with  x ∈ {0, 1, . . . , N − 1},  y ∈ {0, 1, . . . , M − 1}    (2.1)
The 2-tuple (x, y) denotes pixel coordinates with the origin (0, 0) in the upper-
left corner. The value of I(x, y) describes a property of the corresponding pixel. This
property may be the pixel’s color, its brightness, the probability of it representing
human skin, etc.
Color is described as an n-tuple of scalar values, depending on the chosen color
model. The most common color model in computer graphics is RGB, which uses a
3-tuple (r, g, b) specifying a color’s red, green, and blue components:
I(x, y) = (r(x, y), g(x, y), b(x, y))^T    (2.2)
Other frequently used color models are HSI (hue, saturation, intensity) or YCbCr
(luminance and chrominance). For color images, I(x, y) is thus a vector function.
For many other properties, such as brightness and probabilities, I(x, y) is a scalar
function and can be visualized as a gray value image (also called intensity image).
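The conventions of (2.1) and (2.2) can be made concrete with a small sketch (illustrative Python, not the book's LTI-Lib code). The image is stored row-major, so pixel I(x, y) is accessed as image[y][x] with the origin in the upper-left corner; averaging the three channels corresponds to the intensity component of the HSI model.

```python
# Sketch of the image conventions of (2.1)/(2.2): a color image is a vector
# function (an (r, g, b) tuple per pixel); deriving a scalar property such as
# brightness yields a gray value (intensity) image.

def intensity(image_rgb):
    """Reduce a vector-valued RGB image to a scalar intensity image."""
    return [[(r + g + b) / 3 for (r, g, b) in row] for row in image_rgb]

# Rows are indexed by y, columns by x, so pixel I(x, y) is image[y][x].
I_rgb = [[(255, 0, 0), (0, 0, 0)],
         [(0, 255, 0), (255, 255, 255)]]
I_gray = intensity(I_rgb)
# I(0, 0) is the upper-left pixel: (255 + 0 + 0) / 3 = 85.0
```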
Scalar integers can be used to encode a pixel's classification. For gesture recognition
an important classification is the discrimination between foreground (target) and
background. This yields a binary-valued image, commonly called a mask.
Fig. 2.2. Static hand gestures for "left", "right", and "stop" (a-c), as well as an example of an
idle position (d).
Dynamic Gestures
The dynamic vocabulary comprises six gestures, three of which are simply
backward executions of the other three. Tab. 2.3 describes these gestures, and
Fig. 2.3 shows several sample executions. Since texture will not be considered, it
does not matter whether the palm or the back of the hand faces the camera. There-
fore, the camera may be mounted atop a desk facing downwards (e.g. on a tripod or
affixed to the computer screen) to avoid background clutter.
Recording Conditions
As will be described in the following sections, skin color serves as the primary im-
age cue for both example systems. Assuming an office or other indoor environment,
Tab. 2.4 lists the recording conditions to be complied with for both static and
dynamic gestures. The enclosed CD contains several corresponding examples that can also
be used as input data in case a webcam is not available.
The transition from low-level image data to some higher-level description thereof,
represented as a vector of scalar values, is called feature extraction.
Fig. 2.3. The six dynamic gestures to be recognized. In each of the three image sequences,
the second gesture is a backwards execution of the first, i.e. corresponds to reading the image
sequence from right to left.
The feature extraction stage must cope with gestures such as those shown in Fig. 2.2 and 2.3.
This process, which occurs within the feature extraction module in Fig. 2.1, is visualized in Fig. 2.4.
Fig. 2.4. Visualization of the feature extraction stage. The top row indicates processing steps,
the bottom row shows examples for corresponding data.
Knowledge about the target object can be encoded explicitly (as a set of rules) or implicitly (in a histogram,
a neural network, etc.). Known properties of the target object, such as shape, size,
or color, can be exploited. In gesture recognition, color is the most frequently used
feature for hand localization since shape and size of the hand’s projection in the two-
dimensional image plane vary greatly. It is also the only feature explicitly stored in
the image.
Using the color attribute to localize an object in the image requires a definition
of the object’s color or colors. In the RGB color model (and most others), even
objects that one would call unicolored usually occupy a range of numerical values.
This range can be described statistically using a three-dimensional discrete histogram
hobject(r, g, b), with the dimensions corresponding to the red, green, and blue com-
ponents.
hobject is computed from a sufficiently large number of object pixels that are
usually marked manually in a set of source images that cover all setups in which the
system is intended to be used (e.g. multiple users, varying lighting conditions, etc.).
Its value at (r, g, b) indicates the number of pixels with the corresponding color. The
total sum of hobject over all colors is therefore equal to the number of considered
object pixels nobject, i.e.
Σr Σg Σb hobject(r, g, b) = nobject    (2.5)
Because the encoded knowledge is central to the localization task, the creation of
hobject is an important part of the system’s development. It is exemplified below for
a single frame.
Fig. 2.5 shows the source image of a hand (a) and a corresponding, manually
generated binary mask (b) that indicates object pixels (white) and background pixels
(black). In (c) the histogram computed from (a), considering only object pixels (those
that are white in (b)), is shown. The three-dimensional view uses points to indicate
colors that occurred with a certain minimum frequency. The three one-dimensional
graphs to the right show the projection onto the red, green, and blue axis. Not sur-
prisingly, red is the dominant skin color component.
(a) Source image (b) Object mask (c) Object color histogram
Fig. 2.5. From a source image (a) and a corresponding, manually generated object mask (b),
an object color histogram (c) is computed.
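The construction of hobject from a source image and its object mask can be sketched as follows (an illustrative Python fragment, not the book's LTI-Lib code; the function name and the 32-bins-per-channel quantization are this example's own choices):

```python
# Building an object color histogram h_object from a source image and a
# manually created binary object mask. The histogram is a dict that maps
# quantized (r, g, b) bins to pixel counts.

from collections import defaultdict

def object_histogram(image, mask, bins_per_channel=32):
    """Count object pixel colors, quantized to bins_per_channel^3 bins."""
    q = 256 // bins_per_channel          # quantization step, here 8
    h_object = defaultdict(int)
    n_object = 0
    for row, mask_row in zip(image, mask):
        for (r, g, b), is_object in zip(row, mask_row):
            if is_object:                # only white (object) mask pixels count
                h_object[(r // q, g // q, b // q)] += 1
                n_object += 1
    return h_object, n_object

# Tiny example: a 2x2 image in which only the two left pixels are object pixels.
image = [[(200, 120, 100), (10, 10, 10)],
         [(198, 118, 102), (10, 10, 10)]]
mask  = [[1, 0],
         [1, 0]]
h, n = object_histogram(image, mask)
# Per (2.5), the histogram entries sum to the number of object pixels.
```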
P(r, g, b | object) = hobject(r, g, b) / nobject    (2.6)
By creating a complementary histogram hbg of the background colors from a
total number of nbg background pixels, the corresponding probability for a background
pixel is obtained in the same way:
P(r, g, b | bg) = hbg(r, g, b) / nbg    (2.7)
Applying Bayes' rule, the probability of any pixel representing a part of the object
can be computed from its color (r, g, b) using (2.6) and (2.7):

P(object | r, g, b) = P(r, g, b | object) P(object) / [P(r, g, b | object) P(object) + P(r, g, b | bg) P(bg)]    (2.8)

P(object) and P(bg) denote the a priori object and background probabilities,
respectively, with P(object) + P(bg) = 1.
Using (2.8), the object probability image Iobj,prob is created from I as

Iobj,prob(x, y) = P(object | r(x, y), g(x, y), b(x, y))    (2.9)

and comparing it to a threshold Θ yields the object mask

Iobj,mask(x, y) = 1 if Iobj,prob(x, y) ≥ Θ (target), 0 otherwise (background)    (2.10)
Since the a priori probabilities are generally unknown, it is practical to choose
arbitrary values, e.g. P(object) = P(bg) = 0.5. Obviously, the
value computed for P(object | r, g, b) will then no longer reflect an actual probability,
but a belief value.3 This is usually easier to handle than the computation of exact
values for P(object) and P(bg).
P(object | r, g, b) = [1 + (P(r, g, b | bg) / P(r, g, b | object)) · (P(bg) / P(object))]^(−1)    (2.11)
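The classification of a single pixel color by (2.6) through (2.8) can be sketched as follows (illustrative Python; the histogram contents and all names are invented for the example):

```python
# Bayesian pixel classification: given object and background color histograms
# (dicts of bin counts), compute the belief that a pixel belongs to the object.

def object_probability(color_bin, h_object, n_object, h_bg, n_bg,
                       p_object=0.5, p_bg=0.5):
    """P(object | r, g, b) via Bayes' rule, cf. (2.8)."""
    p_color_obj = h_object.get(color_bin, 0) / n_object  # P(r,g,b | object), (2.6)
    p_color_bg = h_bg.get(color_bin, 0) / n_bg           # P(r,g,b | bg), (2.7)
    num = p_color_obj * p_object
    den = num + p_color_bg * p_bg
    return num / den if den > 0 else 0.0

# Hypothetical histograms over quantized color bins (100 pixels each):
h_object = {(25, 15, 12): 90, (24, 14, 12): 10}   # mostly skin-like bins
h_bg     = {(25, 15, 12): 5,  (1, 1, 1): 95}      # background rarely skin-like
p = object_probability((25, 15, 12), h_object, 100, h_bg, 100)
# 0.9 * 0.5 / (0.9 * 0.5 + 0.05 * 0.5) = 0.947...
```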
1. Arbitrarily define a set of background pixels (some pixels usually have a high a priori
background probability, e.g. the four corners of the image). All other pixels are defined as
foreground. This constitutes an initial classification.
2. Compute the mean values μobject and μbg for foreground and background, based on
the most recent classification. If the mean values are identical to those computed in the
previous iteration, halt.
3. Compute a new threshold Θ = (μobject + μbg)/2 and perform another classification of
all pixels, then go to step 2.
Fig. 2.6. Automatic computation of the object probability threshold Θ without the use of high-
level knowledge (presented in [24]).
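The iteration of Fig. 2.6 can be sketched for a probability image flattened into a list of values (illustrative Python; choosing only the first pixel as the initial background set is this example's own simplification):

```python
# Automatic threshold computation as in Fig. 2.6: iterate class means and
# threshold until the classification no longer changes.

def optimal_threshold(values, initial_bg_indices):
    bg = set(initial_bg_indices)                  # step 1: initial classification
    mu_obj = mu_bg = None
    while True:
        fg_vals = [v for i, v in enumerate(values) if i not in bg]
        bg_vals = [v for i, v in enumerate(values) if i in bg]
        new_mu_obj = sum(fg_vals) / len(fg_vals)  # step 2: class means
        new_mu_bg = sum(bg_vals) / len(bg_vals)
        if (new_mu_obj, new_mu_bg) == (mu_obj, mu_bg):
            return theta                          # means unchanged: halt
        mu_obj, mu_bg = new_mu_obj, new_mu_bg
        theta = (mu_obj + mu_bg) / 2              # step 3: new threshold
        bg = {i for i, v in enumerate(values) if v < theta}  # reclassify

# A hypothetical probability image with a clear object/background separation:
probs = [0.05, 0.1, 0.08, 0.9, 0.95, 0.85, 0.12, 0.88]
theta = optimal_threshold(probs, initial_bg_indices=[0])
```

The resulting threshold settles between the low background values and the high object values, just as Θ2 does in Fig. 2.7.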
In case the target’s approximate location and geometry are known, humans can
usually identify a suitable threshold just by observation of the corresponding object
mask. This approach can be implemented by defining an expected target shape, cre-
ating several object masks using different thresholds, and choosing the one which
yields the most similar shape. This requires a feature extraction as described in
Sect. 2.1.2 and a metric to quantify the deviation of a candidate shape’s features
from those of the expected target shape.
An example for hand localization can be seen in Fig. 2.7. Using the skin color
histogram shown in Fig. 2.5 and a generic background histogram hbg , a skin proba-
bility image (b) was computed from a source image (a) of the same hand (and some
clutter) as shown in Fig. 2.5, using P (object) = P (bg) = 0.5. For the purpose of
visualization, three different thresholds Θ1 < Θ2 < Θ3 were then applied, resulting
in three different binary masks (c, d, e). In this example none of the binary masks
3 For simplicity, the term "probability" will still be used in either case.
is entirely correct: For Θ1 , numerous background regions (such as the bottle cap in
the bottom right corner) are classified as foreground, while Θ3 leads to holes in the
object, especially at its borders. Θ2 , which was computed automatically using the
algorithm described in Fig. 2.6, is a compromise that might be considered optimal
for many applications, such as the example systems to be developed in this section.
Fig. 2.7. A source image (a) and the corresponding skin probability image (b). An ob-
ject/background classification was performed for three different thresholds Θ1 < Θ2 < Θ3
(c-e).
The histograms presented in [12] were computed from large image databases covering
a wide variety of skin hues, lighting conditions, and camera hardware. Thereby, these
histograms implicitly provide user, illumination, and camera independence, at the cost
of a comparably high probability of non-skin objects being classified as skin (false alarms). Fig. 2.8
shows the skin probability image (a) and binary masks (b, c, d) for the source im-
age shown in Fig. 2.7a, computed using the histograms from [12] and three different
thresholds Θ4 < Θ5 < Θ6 . As described above, P (object) = P (bg) = 0.5 was
chosen. Compared to Fig. 2.7b, the coins, the pens, the bottle cap, and the shadow around
the text marker have significantly higher skin probability, with the pens even exceeding
parts of the fingers and the thumb (d). This is a fundamental problem that cannot
be solved by a simple threshold classification, because there is no threshold Θ that
would achieve a correct result. Unless the histograms can be modified to reduce the
number of false alarms, the subsequent processing stages must be designed
to handle this problem.
Fig. 2.8. Skin probability image (a) for the source image in Fig. 2.7a and object/background classification for three thresholds Θ4 < Θ5 < Θ6 (b, c, d).
Implementational Aspects
cations, especially when consumer cameras with high noise levels are employed.
Usually the downsampling is even advantageous because it constitutes a general-
ization and therefore reduces the amount of data required to create a representative
histogram.
The LTI-Lib class skinProbabilityMap comes with the histograms presented in [12]. The algorithm described in Fig. 2.6 is implemented in the class optimalThresholding.
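The Bayes-rule computation behind the skin probability map can be sketched in Python/NumPy (the LTI-Lib classes themselves are C++). The histogram arrays and the bin count used here are illustrative placeholders, not the actual histograms of [12]:

```python
import numpy as np

def skin_probability_map(image_rgb, skin_hist, nonskin_hist,
                         p_obj=0.5, p_bg=0.5, bins=32):
    """Per-pixel object probability P(object|pixel) via Bayes' rule.

    skin_hist / nonskin_hist are normalized RGB color histograms of
    shape (bins, bins, bins); placeholders for the histograms of [12].
    """
    # Quantize each 8-bit channel into histogram bins.
    idx = (image_rgb // (256 // bins)).astype(int)
    p_skin = skin_hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    p_nonskin = nonskin_hist[idx[..., 0], idx[..., 1], idx[..., 2]]
    denom = p_skin * p_obj + p_nonskin * p_bg
    # Where neither histogram has mass, assign probability 0.
    out = np.zeros_like(denom)
    np.divide(p_skin * p_obj, denom, out=out, where=denom > 0)
    return out

def object_mask(prob_img, theta):
    """Binary object/background classification with threshold theta."""
    return (prob_img >= theta).astype(np.uint8)
```

Applying `object_mask` with three increasing thresholds would reproduce binary masks analogous to Fig. 2.7 (c–e).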
2.1.2.2 Region Description
On the basis of Iobj,mask the source image I can be partitioned into regions. A region
R is a contiguous set of pixels p for which Iobj,mask has the same value. The concept
of contiguity requires a definition of pixel adjacency. As depicted in Fig. 2.9, adja-
cency may be based on either a 4-neighborhood or an 8-neighborhood. In this sec-
tion the 8-neighborhood will be used. In general, regions may contain other regions
and/or holes, but this shall not be considered here because it is of minor importance
in most gesture recognition applications.
Fig. 2.9. Adjacent pixels (gray) in the 4-neighborhood (a) and 8-neighborhood (b) of a refer-
ence pixel (black).
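The two adjacency definitions of Fig. 2.9 translate directly into a region labeling routine. The following sketch (plain Python/NumPy, not LTI-Lib code) partitions a binary mask into contiguous regions by flood fill:

```python
import numpy as np
from collections import deque

# Offsets for the two adjacency definitions of Fig. 2.9.
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def label_regions(mask, neighborhood=N8):
    """Partition a binary mask into contiguous regions (flood fill).

    Returns a label image: 0 = background, 1..k = region indices."""
    labels = np.zeros_like(mask, dtype=int)
    h, w = mask.shape
    next_label = 0
    for y in range(h):
        for x in range(w):
            if mask[y, x] and labels[y, x] == 0:
                next_label += 1
                queue = deque([(y, x)])
                labels[y, x] = next_label
                while queue:
                    cy, cx = queue.popleft()
                    for dy, dx in neighborhood:
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = next_label
                            queue.append((ny, nx))
    return labels
```

Note that two diagonally touching pixels form one region under the 8-neighborhood but two separate regions under the 4-neighborhood.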
Under laboratory recording conditions, Iobj,mask will contain exactly one target
region. However, in many real world applications, other skin colored objects may
be visible as well, so that Iobj,mask typically contains multiple target regions (see
e.g. Fig. 2.8). Unless advanced reasoning strategies (such as those described in [23, 28]) are used, the feature extraction stage is responsible for identifying the region that represents the user’s hand, possibly among a multitude of candidates. This is done on the basis of the regions’ geometric features.
The calculation of these features is performed by first creating an explicit descrip-
tion of each region contained in Iobj,mask. Various algorithms exist that generate, for
every region, a list of all of its pixels. A more compact and computationally more
efficient representation of a region can be obtained by storing only its border points
(which are, in general, significantly fewer). A border point is conveniently defined
as a pixel p ∈ R that has at least one pixel q ∈ / R within its 4-neighborhood. An
example region (a) and its border points (b) are shown in Fig. 2.10.
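The border-point definition just given — a region pixel with at least one 4-neighbor outside the region — can be sketched as follows (image-boundary pixels are treated as outside, an assumption not spelled out in the text):

```python
import numpy as np

def border_points(mask):
    """Return the set of border points (x, y) of a binary region mask.

    A border point is a region pixel with at least one pixel outside
    the region in its 4-neighborhood."""
    h, w = mask.shape
    border = set()
    for y in range(h):
        for x in range(w):
            if not mask[y, x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                # Outside the image counts as outside the region.
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny, nx]:
                    border.add((x, y))
                    break
    return border
```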
In a counterclockwise traversal of the object’s border, every border point has a
predecessor and a successor within its 8-neighborhood. An efficient data structure
for the representation of a region is a sorted list of its border points. This can be
Hand Gesture Commands 21
Fig. 2.10. Description of an image region by the set of all of its pixels (a), a list of its bor-
der pixels (b), and a closed polygon (c). Light gray pixels in (b) and (c) are not part of the
respective representation, but are shown for comparison with (a).
interpreted as a closed polygon whose vertices are the centers of the border points
(Fig. 2.10c). In the following, the object’s border is defined to be this polygon. This
definition has the advantages of being sub-pixel accurate and facilitating efficient computation of various shape-based geometric features (as described in the next section).⁵
Finding the border points of a region is not as straightforward as identifying its
pixels. Fig. 2.11 shows an algorithm that processes Iobj,mask and computes, for every
region R, a list of its border points
⁵ It should be noted that the polygon border definition leads to the border pixels no longer being considered completely part of the object. Compared to other, non-sub-pixel accurate interpretations, this may lead to slight differences in object area that are noticeable only for very small objects.
⁶ m(x, y) = 1 indicates touching the border of a region that was already processed.
⁷ m(x, y) = 2 indicates crossing the border of a region that was already processed.
1. Create a helper matrix m with the same dimensions as Iobj,mask and initialize all entries to 0. Define (x, y) as the current coordinates and initialize them to (0, 0). Define (x′, y′) and (x″, y″) as temporary coordinates.
2. Iterate from left to right through all image rows successively, starting at y = 0. If m(x, y) = 0 ∧ Iobj,mask (x, y) = 1, goto step 3. If m(x, y) = 1, ignore this pixel and continue with the next pixel.⁶ If m(x, y) = 2, increase x until m(x, y) = 2 again, ignoring the whole range of pixels.⁷
3. Create a list B of border points and store (x, y) as the first element.
4. Set (x′, y′) = (x, y).
5. Scan the 8-neighborhood of (x′, y′) using (x″, y″), starting at the pixel that follows the second-to-last pixel stored in B in a counterclockwise orientation, or at (x − 1, y − 1) if B contains only one pixel. Proceed counterclockwise, skipping coordinates that lie outside of the image, until Iobj,mask (x″, y″) = 1. If (x″, y″) is identical with the first element of B, goto step 6. Else store (x″, y″) in B. Set (x′, y′) = (x″, y″) and goto step 5.
6. Iterate through B, considering, for every element (xi, yi), its predecessor (xi−1, yi−1) and its successor (xi+1, yi+1). The first element is the successor of the last, which is the predecessor of the first. If yi−1 = yi+1 ≠ yi, set m(xi, yi) = 1 to indicate that the border touches the line y = yi at xi. Otherwise, if yi−1 ≠ yi ∨ yi ≠ yi+1, set m(xi, yi) = 2 to indicate that the border intersects the line y = yi at xi.
7. Add B to the list of computed borders and proceed with step 2.
Fig. 2.11. Algorithm to find the border points of all regions in an image.
Fig. 2.12. a, b, c: Borders of the regions shown in Fig. 2.8 (d, e, f), computed by the algorithm described in Fig. 2.11. d, e, f: Additional smoothing with a Gaussian kernel before segmentation.
important factors in practical applications. In the image domain, changing the cam-
era’s perspective results in rotation and/or translation of the object, while resolution
and object distance affect the object’s scale (in terms of pixels). Since the error intro-
duced in the shape representation by the discretization of the image (the discretiza-
tion noise) is itself affected by image resolution, rotation, translation, and scaling of
the shape, features can, in a strict sense, be invariant of these transformations only in
continuous space. All statements regarding transformation invariance therefore refer
to continuous space. In discretized images, features declared “invariant” may still
show small variance. This can usually be ignored unless the shape’s size is on the
order of several pixels.
Border Length
The border length l can trivially be computed from B, considering that the distance between two successive border points is 1 if either their x or y coordinates are equal, and √2 otherwise. l depends on scale/resolution, and is translation and rotation invariant.
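This computation can be sketched directly over the sorted border list B (coordinate pairs are assumed to be (x, y) tuples, with the closing edge from the last point back to the first included):

```python
import math

def border_length(border):
    """Border length l from a closed, sorted list of border points.

    Successive points are 8-neighbors: axis-aligned steps contribute 1,
    diagonal steps contribute sqrt(2)."""
    l = 0.0
    for i in range(len(border)):
        x0, y0 = border[i - 1]   # border[-1] closes the polygon
        x1, y1 = border[i]
        l += 1.0 if (x0 == x1 or y0 == y1) else math.sqrt(2.0)
    return l
```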
In [25] an efficient algorithm for the computation of arbitrary moments νp,q of polygons is presented. The area a = ν0,0, as well as the normalized moments αp,q and central moments μp,q up to order 2, including the center of gravity (COG) (xcog, ycog) = (α1,0, α0,1), are obtained as shown in (2.15) to (2.23). xi and yi with i = 0, 1, . . . , n − 1 refer to the elements of BR as defined in (2.14). Since these equations require the polygon to be closed, xn = x0 and yn = y0 is defined for i = n.
ν0,0 = a = (1/2) ∑_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1})    (2.15)

α1,0 = xcog = 1/(6a) ∑_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1})(x_{i−1} + x_i)    (2.16)

α0,1 = ycog = 1/(6a) ∑_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1})(y_{i−1} + y_i)    (2.17)

α2,0 = 1/(12a) ∑_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1})(x_{i−1}² + x_{i−1} x_i + x_i²)    (2.18)

α1,1 = 1/(24a) ∑_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1})(2 x_{i−1} y_{i−1} + x_{i−1} y_i + x_i y_{i−1} + 2 x_i y_i)    (2.19)

α0,2 = 1/(12a) ∑_{i=1}^{n} (x_{i−1} y_i − x_i y_{i−1})(y_{i−1}² + y_{i−1} y_i + y_i²)    (2.20)

μ2,0 = α2,0 − α1,0²    (2.21)

μ1,1 = α1,1 − α1,0 α0,1    (2.22)

μ0,2 = α0,2 − α0,1²    (2.23)
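Equations (2.15) to (2.23) translate directly into code. The following sketch evaluates the sums over the closed border polygon (vertex list of (x, y) tuples; the closing edge is added internally):

```python
def polygon_moments(pts):
    """Area, center of gravity, and central moments up to order 2 of a
    closed polygon, following (2.15)-(2.23)."""
    n = len(pts)
    a = ax = ay = a20 = a11 = a02 = 0.0
    for i in range(n):
        x0, y0 = pts[i - 1]          # (x_{i-1}, y_{i-1}); i = 0 gives the closing edge
        x1, y1 = pts[i]
        cross = x0 * y1 - x1 * y0
        a += cross
        ax += cross * (x0 + x1)
        ay += cross * (y0 + y1)
        a20 += cross * (x0 * x0 + x0 * x1 + x1 * x1)
        a11 += cross * (2 * x0 * y0 + x0 * y1 + x1 * y0 + 2 * x1 * y1)
        a02 += cross * (y0 * y0 + y0 * y1 + y1 * y1)
    a *= 0.5
    ax /= 6 * a; ay /= 6 * a                      # (2.16), (2.17)
    a20 /= 12 * a; a11 /= 24 * a; a02 /= 12 * a   # (2.18)-(2.20)
    return {"a": a, "xcog": ax, "ycog": ay,
            "mu20": a20 - ax * ax,                # (2.21)
            "mu11": a11 - ax * ay,                # (2.22)
            "mu02": a02 - ay * ay}                # (2.23)
```

For a counterclockwise unit square the area is 1, the COG is (0.5, 0.5), and μ2,0 = μ0,2 = 1/12.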
Eccentricity
One possible measure for eccentricity e that is based on central moments is given
in [10]:
Another intuitive measure is based on the object’s inertia along and perpendicular
to its main axis, j1 and j2 , respectively:
e = j1 / j2    (2.25)

with

j1 = (μ2,0 + μ0,2)/(2a) + √( ((μ2,0 − μ0,2)/(2a))² + (μ1,1/a)² )    (2.26)

j2 = (μ2,0 + μ0,2)/(2a) − √( ((μ2,0 − μ0,2)/(2a))² + (μ1,1/a)² )    (2.27)
Here e starts at 1 for circular shapes, increases for elongated shapes, and is in-
variant of translation, rotation, and scale/resolution.
Orientation
The region’s main axis is defined as the axis of the least moment of inertia. Its orien-
tation α is given by
α = (1/2) arctan( 2μ1,1 / (μ2,0 − μ0,2) )    (2.28)
and corresponds to what one would intuitively call “the object’s orientation”. Orientation is translation and scale/resolution invariant. α carries most information for elongated shapes and becomes increasingly random for circular shapes. It is 180°-periodic, which necessitates special handling in most cases. For example, the angle by which an object with α1 = 10° has to be rotated to align with an object with α2 = 170° is 20° rather than |α1 − α2| = 160°. When using orientation as a feature for classification, a simple workaround to this problem is to ensure 0 ≤ α < 180° and, rather than use α directly as a feature, compute two new features f1(α) = sin α and f2(α) = cos 2α instead, so that f1(0) = f1(180°) and f2(0) = f2(180°).
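A sketch of (2.25)–(2.28) and the periodic encoding above; `atan2` is used in place of arctan to resolve the quadrant, which is an implementation choice not taken from the text:

```python
import math

def inertia_features(a, mu20, mu11, mu02):
    """Eccentricity e = j1/j2 per (2.25)-(2.27) and main-axis
    orientation alpha per (2.28), from area a and central moments."""
    common = (mu20 + mu02) / (2 * a)
    radius = math.sqrt(((mu20 - mu02) / (2 * a)) ** 2 + (mu11 / a) ** 2)
    j1, j2 = common + radius, common - radius
    e = j1 / j2
    alpha = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    return e, alpha

def orientation_features(alpha):
    """f1 = sin(alpha), f2 = cos(2*alpha): both agree at 0 and 180 deg,
    avoiding the discontinuity of the 180-degree-periodic angle."""
    return math.sin(alpha), math.cos(2 * alpha)
```

For a square (μ2,0 = μ0,2, μ1,1 = 0) the radius term vanishes and e = 1; a 2×1 rectangle yields e = 4.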
Compactness
Border Features
In addition to the border length l, the minimum and maximum pixel coordinates
of the object, xmin , xmax , ymin and ymax , as well as the minimum and maximum
distance from the center of gravity to the border, rmin and rmax , can be calculated
from B. Minimum and maximum coordinates are not invariant to any transformation.
rmin and rmax are invariant to translation and rotation, and variant to scale/resolution.
Normalization
Some of the above features can only be used in a real-life application if a suitable
normalization is performed to eliminate translation and scale variance. For example,
the user might perform gestures at different locations in the image and at different
distances from the camera.
Which normalization strategy yields the best results depends on the actual appli-
cation and recording conditions, and is commonly found empirically. Normalization
is frequently neglected, even though it is an essential part of feature extraction. The
following paragraphs present several suggestions. In the remainder of this section, the result of the normalization of a feature f is designated by f′.
For both static and dynamic gestures, the user’s face can serve as a reference
for hand size and location if it is visible and a sufficiently reliable face detection is
available.
In case different resolutions with identical aspect ratio are to be supported, reso-
lution independence can be achieved by specifying lengths and coordinates relative
to the image width N (x and y must be normalized by the same value to preserve
image geometry) and area relative to N 2 .
If a direct normalization is not feasible, invariance can also be achieved by computing a new feature from two or more unnormalized features. For example, from xcog, xmin, and xmax, a resolution and translation invariant feature xp can be computed as

xp = (xmax − xcog) / (xcog − xmin)    (2.30)

xp specifies the ratio of the longest horizontal protrusion, measured from the center of gravity, to the right and to the left.
For dynamic gestures, several additional methods can be used to compute a normalized feature f′(t) from an unnormalized feature f(t) without any additional information:
• Subtracting f(1) so that f′(1) = 0:
Fig. 2.13. Boundaries of the static hand gestures shown in Fig. 2.2. Corresponding features
are shown in Tab. 2.5.
For the dynamic example gestures “clockwise”, “open”, and “grab”, the features area a, compactness c, and x coordinate of the center of gravity xcog were computed. Since a depends on the hand’s anatomy and its distance to the camera, it was normalized by its maximum (see (2.35)) to eliminate this dependency. xcog was divided by N to eliminate resolution dependency, and additionally normalized according to (2.32) so that x̄′cog = 0. This allows the gestures to be performed anywhere in the image. The resulting plots for normalized area a′, compactness c (which does not require normalization), and normalized x coordinate of the center of gravity x′cog are visualized in Fig. 2.14.
The plots are scaled equally for each feature to allow a direct comparison. Since
“counterclockwise”, “close”, and “drop” are backwards executions of the above ges-
tures, their feature plots are simply temporally mirrored versions of these plots.
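The two normalizations applied above — subtracting the initial value and dividing by the maximum — can be sketched as follows. The exact forms of (2.32) and (2.35) are not reproduced in this excerpt, so these functions are an assumption about their content:

```python
import numpy as np

def normalize_subtract_first(f):
    """f'(t) = f(t) - f(1): the trajectory starts at 0."""
    f = np.asarray(f, dtype=float)
    return f - f[0]

def normalize_by_max(f):
    """f'(t) = f(t) / max_t f(t): eliminates e.g. scale differences.

    Assumes the feature is positive somewhere (max > 0)."""
    f = np.asarray(f, dtype=float)
    return f / f.max()
```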
Table 2.5. Features computed from the boundaries shown in Fig. 2.13.

                                                      Gesture
                                          “left”   “right”  “stop”   none
Feature                        Symbol     (2.13a)  (2.13b)  (2.13c)  (2.13d)
Normalized Border Length       l          1.958    1.705    3.306    1.405
Normalized Area                a          0.138    0.115    0.156    0.114
Normalized Center of Gravity   xcog       0.491    0.560    0.544    0.479
                               ycog       0.486    0.527    0.487    0.537
Eccentricity                   e          1.758    1.434    1.722    2.908
Orientation                    α          57.4°    147.4°   61.7°    58.7°
Compactness                    c          0.451    0.498    0.180    0.724
Normalized Min./Max.           xmin       0.128    0.359    0.241    0.284
Coordinates                    xmax       0.691    0.894    0.881    0.681
                               ymin       0.256    0.341    0.153    0.325
                               ymax       0.747    0.747    0.747    0.747
Protrusion Ratio               xp         0.550    1.664    1.110    1.036
Fig. 2.14. Normalized area a′, compactness c, and normalized x coordinate of the center of gravity x′cog plotted over time t = 1, 2, . . . , 60 for the dynamic gestures “clockwise” (left column), “open” (middle column), and “grab” (right column) from the example vocabulary shown in Fig. 2.3.
The task of feature classification occurs in all pattern recognition systems and has been the subject of considerable research effort. A large number of algorithms are available to build classifiers for various requirements. The classifiers considered in this
section operate in two phases: In the training phase the classifier “learns” the vo-
cabulary from a sufficiently large number of representative examples (the training
samples). This “knowledge” is then applied in the following classification phase.
Classifiers that continue to learn even in the classification phase, e.g. to automati-
cally adapt to the current user, shall not be considered here.
2.1.3.1 Classification Concepts
This section introduces several basic concepts and terminology commonly used in
pattern classification, with particular application to gesture recognition. The input to be classified is modeled as an element ω of a finite set Ω:

ω ∈ Ω = {ω1, ω2, . . . , ωn}    (2.36)
The elements of Ω are called classes. In gesture recognition Ω is the vocabulary,
and each class ωi (i = 1, 2, . . . , n) represents a gesture. Note that (2.36) restricts the
allowed inputs to elements of Ω. This means that the system is never presented with
an unknown gesture, which influences the classifier’s design and algorithms.
The classifier receives, from the feature extraction stage, an observation O,
which, for example, might be a single feature vector for static gestures, or a sequence
of feature vectors for dynamic gestures. Based on this observation it outputs a result ω̂. To support rejection of unknown inputs, the set of allowed results is extended by a rejection class ω0:

Ω̂ = Ω ∪ {ω0}    (2.38)

Thus, (2.37) becomes
Several types of classifiers can be designed to consider a cost or loss value L for
classification errors, including rejection. In general, the loss value depends on the
actual input class ω and the classifier’s output ω̂:
L(ω, ω) = 0 (2.41)
This makes it possible to model applications where certain misclassifications are more serious than others. For example, let us consider a gesture recognition application that
performs navigation of a menu structure. Misclassifying the gesture for “return to
main menu” as “move cursor down” requires the user to perform “return to main
menu” again. Misclassifying “move cursor down” as “return to main menu”, how-
ever, discards the currently selected menu item, which is a more serious error because
the user will have to navigate to this menu item all over again (assuming that there is
no “undo” or “back” functionality).
Simplifications
Classifiers that consider rejection and loss values are discussed in [22]. For the pur-
pose of this introduction, we will assume that the classifier never rejects its input (i.e.
(2.37) holds) and that the loss incurred by a classification error is a constant value L:
Garbage Classes
Modes of Classification
Overfitting
With many classification algorithms, optimizing performance for the training data at
hand bears the risk of reducing performance for other (generic) input data. This is
called overfitting and presents a common problem in machine learning, especially
for small sets of training samples. A classifier with a set of parameters p is said
to overfit the training samples T if there exists another set of parameters p that
yields lower performance on T , but higher performance in the actual “real-world”
application [15].
A strategy for avoiding overfitting is to use disjoint sets of samples for training and testing. This explicitly measures the classifier’s ability to generalize and allows it to be included in the optimization process. Obviously, the test samples need to be sufficiently distinct from the training samples for this approach to be effective.
2.1.3.2 Classification Algorithms
From the numerous strategies that exist for finding the best-matching class for an
observation, three will be presented here:
• A simple, easy-to-implement rule-based approach suitable for small vocabularies
of static gestures.
• The concept of maximum likelihood classification, which can be applied to a
multitude of problems.
• Hidden Markov Models (HMMs) for dynamic gestures. HMMs are frequently
used for the classification of various dynamic processes, including speech and
sign language.
Further algorithms, such as artificial neural networks, can be found in the literature on pattern recognition and machine learning. A good starting point is [15].
Stepwise backward elimination
1. Start with a feature selection containing all available features, and measure recognition accuracy.
2. For every feature in the current feature selection, measure recognition accuracy
on the current feature selection reduced by this feature. Remove the feature for
which the resulting recognition accuracy is highest from the current selection.
3. If computational cost or system complexity exceed the acceptable maximum
and the current selection contains more than one feature, goto step 2. Otherwise,
halt.
Exhaustive search
1. Let n be the total number of features and m the number of selected features.
2. For every m ∈ {1, 2, . . . , n}, measure recognition accuracy for all C(n, m) possible feature combinations (this equals a total of ∑_{i=1}^{n} C(n, i) combinations). Discard all cases where computational cost or system complexity exceed the acceptable maximum, then choose the selection that yields the highest recognition accuracy.
Fig. 2.15. Three algorithms for automatic feature selection. Stepwise forward selection and
stepwise backward elimination are fast heuristic approaches that are not guaranteed to find
the global optimum. This is possible with exhaustive search, albeit at a significantly higher
computational cost.
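Stepwise backward elimination from Fig. 2.15 can be sketched as follows; the accuracy and cost functions are placeholders to be supplied by the surrounding recognition system:

```python
def backward_elimination(features, accuracy, max_cost, cost):
    """Stepwise backward elimination (Fig. 2.15): repeatedly drop the
    feature whose removal hurts recognition accuracy least, until the
    cost constraint is satisfied.

    accuracy(selection) -> float and cost(selection) -> float are
    user-supplied callbacks (e.g. cross-validated recognition rate and
    computational cost)."""
    selection = list(features)
    while cost(selection) > max_cost and len(selection) > 1:
        # Try removing each feature in turn; keep the best reduced set.
        candidates = [[f for f in selection if f != r] for r in selection]
        selection = max(candidates, key=accuracy)
    return selection
```

As noted in the caption, this greedy heuristic is fast but not guaranteed to find the global optimum that exhaustive search would.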
The threshold of 0.25 for c was determined empirically from a data set of multiple productions of “stop” and all other gestures, performed by different users. On this data set, max c ≈ 0.23 and mean c̄ ≈ 0.18 were observed for “stop”, while c ≫ 0.23 held for all other gestures.
Obviously, with increasing vocabulary size, the number and complexity of the rules grow, and they can quickly become difficult to handle. The manual creation of rules,
as described in this section, is therefore only suitable for very small vocabularies
such as the static example vocabulary, where appropriate thresholds are obvious.
Algorithms for automatic learning of rules are described in [15].
For larger vocabularies, rules are often used as a pre-stage to another classifier.
For instance, a rule like

IF  max_{i≠j} √( (xcog(i) − xcog(j))² + (ycog(i) − ycog(j))² ) < Θmotion · N    (2.45)
THEN the hand is idle.

where i and j are frame indices, and Θmotion specifies the minimum dislocation for the hand to qualify as moving, relative to the image width.
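A minimal sketch of such rules — the empirically derived compactness threshold for “stop” from above, and the idle rule (2.45). The thresholds and feature names come from the text; everything else is illustrative:

```python
import math

def classify_static(features, c_threshold=0.25):
    """Toy rule-based pre-classifier: only the compactness rule for
    "stop" is shown; rules for further gestures would be added
    analogously."""
    if features["c"] < c_threshold:
        return "stop"
    return None   # defer to a subsequent classifier stage

def hand_is_idle(xs, ys, theta_motion, image_width):
    """Rule (2.45): the hand is idle if the maximum COG dislocation
    between any two frames stays below theta_motion * N."""
    n = len(xs)
    max_disp = max(math.hypot(xs[i] - xs[j], ys[i] - ys[j])
                   for i in range(n) for j in range(n) if i != j)
    return max_disp < theta_motion * image_width
```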
2.1.3.5 Maximum Likelihood Classification
Due to its simplicity and general applicability, the concept of maximum likelihood
classification is widely used for all kinds of pattern recognition problems. An exten-
sive discussion can be found in [22]. This section provides a short introduction. We
assume the observation O to be a vector of K scalar features fi :
O = (f1 f2 · · · fK)ᵀ    (2.46)
Stochastic Framework
The maximum likelihood classifier identifies the class ω ∈ Ω that is most likely to
have caused the observation O by maximizing the a posteriori probability P (ω|O):
P(ω|O) = P(O|ω) P(ω) / P(O)    (2.48)
For simplification it is commonly assumed that all n classes are equiprobable:
P(ω) = 1/n    (2.49)
Therefore, since neither P (ω) nor P (O) depend on ω, we can omit them when
substituting (2.48) in (2.47), leaving only the observation probability P (O|ω):
For reasons of efficiency and simplicity, the observation probability P (O|ω) is com-
monly computed from a class-specific parametric distribution of O. This distribution
is estimated from a set of training samples. Since the number of training samples is
often insufficient to identify an appropriate model, a normal distribution is assumed
in most cases. This is motivated by practical considerations and empirical success
rather than by statistical analysis. Usually good results can be achieved even if the
actual distribution significantly violates this assumption.
To increase modeling accuracy one may decide to use multimodal distributions.
This decision should be backed by sufficient data and motivated by the nature of the
regarded feature in order to avoid overfitting (see Sect. 2.1.3.1). Consider, for exam-
ple, the task of modeling the compactness c for the “stop” gesture shown in Fig. 2.2c.
Computing c for e.g. 20 different productions of this gesture would not provide suffi-
cient motivation for modeling it with a multimodal distribution. Furthermore, there is
no reason to assume that c actually obeys a multimodal distribution. Even if measure-
ments do suggest this, it is not backed by the biological properties of the human hand;
adding more measurements is likely to change the distribution to a single mode.
In this section only unimodal distributions will be used. The multidimensional
normal distribution is given by
N(O, μ, C) = 1/√( (2π)^K det C ) · exp( −(1/2) (O − μ)ᵀ C⁻¹ (O − μ) )    (2.51)
where μ and C are the mean and covariance of the stochastic process {O}:
For the one-dimensional case with a single feature f this reduces to the familiar equation

N(f, μ, σ²) = 1/(√(2π) σ) · exp( −(1/2) ((f − μ)/σ)² )    (2.54)
Computing μi and Ci from the training samples for each class ωi in Ω yields the desired n models that allow us to solve (2.50):

P(O|ωi) = N(O, μi, Ci),    i = 1, 2, . . . , n    (2.55)

Many applications require the classification stage to output only the class ωi for which P(O|ωi) is maximal, or possibly a list of all classes in Ω, sorted by P(O|ωi). The actual values of the conditional probabilities themselves are not of interest. In this case, a strictly monotonic mapping can be applied to (2.55) to yield a score S that imposes the same order on the elements of Ω, but is faster to compute. S(O|ωi) is obtained by multiplying P(O|ωi) by √((2π)^K) and taking the logarithm:

S(O|ωi) = −(1/2) ln(det Ci) − (1/2) (O − μi)ᵀ Ci⁻¹ (O − μi)    (2.56)
Thus, if the covariance matrices of all classes are identical,

Ci = C,    i = 1, 2, . . . , n    (2.58)

we can further simplify (2.56) by omitting the constant term −(1/2) ln(det C) and multiplying by −2:
C = σ² I    (2.62)

Replacing this in (2.59) and omitting the constant σ² leads to the Euclidean classifier:

S(O|ωi) = (O − μi)ᵀ (O − μi) = |O − μi|²    (2.63)
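A sketch of a maximum likelihood classifier using the score (2.56); the per-class means and covariances are assumed to have been estimated from training samples beforehand:

```python
import numpy as np

def gaussian_score(o, mu, cov):
    """Score S(O|omega_i) per (2.56): a strictly monotonic mapping of
    the Gaussian observation probability."""
    d = o - mu
    return (-0.5 * np.log(np.linalg.det(cov))
            - 0.5 * d @ np.linalg.inv(cov) @ d)

def classify_ml(o, models):
    """Maximum likelihood decision: pick the class whose model yields
    the highest score. models maps class name -> (mu, cov)."""
    return max(models, key=lambda w: gaussian_score(o, *models[w]))
```

With identical unit covariances the score reduces to −|O − μi|²/2, i.e. the Euclidean classifier (2.63) up to constants.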
2.1.3.6 Classification Using Hidden Markov Models
The classification of time-dependent processes such as gestures, sign language, or speech suggests explicitly modeling not only the samples’ distribution in the feature space, but also the dynamics of the process from which the samples originate. In other words, a classifier for dynamic events should consider variations in execution speed, which means that we need to identify and model these variations. This requirement is met by Hidden Markov Models (HMMs), a concept derived from Markov chains.⁸
A complete discussion of HMMs is beyond the scope of this chapter. We will therefore focus on basic algorithms and accept several simplifications, yet enough information will be provided for an implementation and practical application. Further details, as well as other types and variants of HMMs, can be found e.g. in [8, 11, 13, 19–21]. The LTI-Lib contains multiple classes for HMMs and HMM-based classifiers that offer several advanced features not discussed here.
In the following the observation O denotes a sequence of T individual observations ot:

O = (o1 o2 · · · oT)    (2.64)
Each individual observation corresponds to a K-dimensional feature vector:

ot = (f1(t) f2(t) · · · fK(t))ᵀ    (2.65)
For the purpose of this introduction we will at first assume K = 1 so that ot is scalar:

ot = f1(t)    (2.66)

⁸ Markov chains describe stochastic processes in which the conditional probability distribution of future states, given the present state, depends only on the present state.
This restriction will be removed later.
Concept
This section introduces the basic concepts of HMMs on the basis of an example.
In Fig. 2.16 the feature ẋcog (t) has been plotted for three samples of the gesture
“clockwise” (Fig. 2.3a). It is obvious that, while the three plots have similar shape,
they are temporally displaced. We will ignore this for now and divide the time axis
into segments of 5 observations each (one observation corresponds to one frame).
Since the samples have 60 frames this yields 12 segments, visualized by straight
vertical lines.9 For each segment i, i = 1, 2, . . . , 12, mean μi and variance σi2 were
computed from the corresponding observations of all three samples (i.e. from 15
values each), as visualized in Fig. 2.16 by gray bars of length 2σi , centered at μi .
Fig. 2.16. Overlay plot of the feature ẋcog (t) for three productions of the gesture “clockwise”,
divided into 12 segments of 5 observations (frames) each. The gray bars visualize mean and
variance for each segment.
This represents a first crude model (though not an HMM) of the time dependent
stochastic process ẋcog (t), as shown in Fig. 2.17. The model consists of 12 states
⁹ The segment size of 5 observations was empirically found to be a suitable temporal resolution.
Fig. 2.17. A simple model for the time dependent process ẋcog (t) that rigidly maps 60 obser-
vations to 12 states.
P(O|“clockwise”) = ∏_{t=1}^{60} N(ot, μ_{i(t)}, σ²_{i(t)})    with    i(t) = ⌊(t − 1)/5⌋ + 1    (2.67)
and
ai,j = 0    for j ∉ {i, i + 1, i + 2}    (2.69)
Fig. 2.18 shows the modified model for ẋcog(t), which now constitutes a complete Hidden Markov Model. Transitions from si to si allow modeling a “slower-than-normal” execution part, while those from si to si+2 model a “faster-than-normal” execution part.
Equation (2.69) is not a general requirement for HMMs; it is a property of the specific Bakis topology used in this introduction. One may choose other topologies, allowing e.g. a transition from si to si+3, in order to optimize classification performance. In this section we will stick to the Bakis topology because it is widely used
Fig. 2.18. A Hidden Markov Model in Bakis topology. Observations are mapped dynamically
to states.
in speech and sign language recognition, and because it is the simplest topology that models both delay and acceleration.¹⁰
Even though we have not changed the number of states compared to the rigid model, the stochastic transitions affect the distribution parameters μi and σi² of each state i. In the rigid model, the lack of temporal alignment among the training samples leads to μi and σi² not accurately representing the input data. Especially for the first six states σ is high and μ has a lower amplitude than the training samples (cf. Fig. 2.16). Since the HMM provides temporal alignment by means of the transition probabilities ai,j we expect σ to decrease and μ to better match the sample plots (this is proven in Fig. 2.21). The computation of ai,j, μi and σi² will be discussed later. For the moment we will assume these values to be given.
In the following a general HMM with N states in Bakis topology is considered. The sequence of states adopted by this model is denoted by the sequence of their indices

q = (q1 q2 · · · qT)    (2.70)

where qt indicates the state index at time t and T equals the number of observations. This notation makes it possible to specify the probabilities ai,j as
A = [ai,j ]N ×N (2.72)
One may want to loosen the restriction that q1 = s1 by introducing an initial state
probability vector π:
¹⁰ If the number of states N is significantly smaller than the number of observations T, acceleration can also be modeled solely by the transition from si to si+1.
π = (π1 π2 . . . πN)    with    πi = P(q1 = si)    and    ∑i πi = 1    (2.73)
B = (b1 b2 . . . bN ) (2.74)
where bi specifies the distribution of the observation o in state si . Since we use
unimodal normal distributions,
λ = (π, A, B) (2.76)
Obviously, to use HMMs for classification, we again need to compute the probability that a given observation O originates from a model λ. In contrast to (2.67) this is no longer simply a product of probabilities. Instead, we have to consider all possible state sequences of length T, denoted by the set QT, and sum up the accordant probabilities:

P(O|λ) = ∑_{q∈QT} P(O, q|λ) = ∑_{q∈QT} b1(o1) ∏_{t=2}^{T} a_{qt−1,qt} b_{qt}(ot)    (2.77)
with
q1 = 1 ∧ qT = N (2.78)
Since, in general, the distributions bi will overlap, it is not possible to identify
a single state sequence q associated with O. This property is reflected in the name
Hidden Markov Model.
The model λ̂ that best matches the observation O is found by maximizing (2.77)
on the set Λ of all HMMs:
Even though efficient algorithms exist to solve this equation, (2.79) is commonly
approximated by
P(q|O, λ) = P(O, q|λ) / P(O|λ)    (2.82)
Since P (O|λ) does not depend on q we can simplify (2.81) to
The optimal state sequence q∗ can efficiently be computed using the Viterbi algo-
rithm (see Fig. 2.20): Instead of performing a brute force search of QT , this method
exploits the fact that from all state sequences q that arrive at the same state si at time
index t, only the one for which the probability up to si is highest, q∗i , can possibly
solve (2.83):
q∗i = argmax_{q ∈ {q | qt = i}} b1(o1) ∏_{t′=2}^{t} a_{qt′−1,qt′} b_{qt′}(ot′)    (2.84)
All others need not be considered, because regardless of the subsequence of states
following si , the resulting total probability would be lower than that achieved by
following q∗i by the same subsequence. (A common application of this algorithm in
everyday life is the search for the shortest path from a location A to another location
B in a city: From all paths that have a common intermediate point C, only the one
for which the distance from A to C is shortest has a chance of being the shortest path
between A and B.)
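The Viterbi algorithm of Fig. 2.20 is not reproduced in this excerpt; the following is a generic log-domain sketch for a left-to-right HMM with scalar Gaussian emissions, starting in the first state and ending in the last per (2.78). It illustrates the technique, not the book’s exact formulation:

```python
import math

NEG_INF = float("-inf")

def log_gauss(o, mu, var):
    """Log of the scalar normal density (2.54)."""
    return -0.5 * (math.log(2 * math.pi * var) + (o - mu) ** 2 / var)

def viterbi(obs, log_a, means, variances):
    """Optimal state sequence q* for a left-to-right HMM.

    log_a[i][j] is the log transition probability; means/variances the
    per-state emission parameters. The model starts in state 0."""
    n, T = len(means), len(obs)
    score = [[NEG_INF] * n for _ in range(T)]
    back = [[0] * n for _ in range(T)]
    score[0][0] = log_gauss(obs[0], means[0], variances[0])  # q1 = s1
    for t in range(1, T):
        for j in range(n):
            # Only the best path into (t, j) can be part of the optimum.
            best_i = max(range(n), key=lambda i: score[t - 1][i] + log_a[i][j])
            score[t][j] = (score[t - 1][best_i] + log_a[best_i][j]
                           + log_gauss(obs[t], means[j], variances[j]))
            back[t][j] = best_i
    q = [n - 1]                       # backtrack from q_T = s_N
    for t in range(T - 1, 0, -1):
        q.append(back[t][q[-1]])
    return list(reversed(q))
```

Working with log probabilities (scores) here mirrors the numerical-stability argument made in the text.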
exp(·) operation may be omitted without affecting the argmax operation. In addition to increased numerical stability, the use of scores also reduces computational cost.¹¹
We will now discuss the iterative estimation of the transition probabilities A and the distribution functions B for a given number of states N and a set of M training samples O1, O2, . . . , OM. Tm denotes the length of Om (m ∈ {1, 2, . . . , M}). An initial estimate of B is obtained in the same way as shown in Fig. 2.16, i.e. by equally distributing the elements of all training samples to the N states.¹² A is initialized to
        ⎛ 0.3 0.3 0.3  0   0  ···  0  ⎞
        ⎜  0  0.3 0.3 0.3  0  ···  0  ⎟
        ⎜  ⋮            ⋱          ⋮  ⎟
    A = ⎜  0  ···  0  0.3 0.3 0.3     ⎟      (2.87)
        ⎜  0  ···  0   0  0.5 0.5     ⎟
        ⎝  0  ···  0   0   0   1      ⎠
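For illustration, the initialization (2.87) can be generated for arbitrary N. This is a sketch with an invented function name; note that the 0.3 entries of (2.87) are presumably rounded thirds, so the first rows sum to 0.9 exactly as printed.

```python
import numpy as np

def bakis_init(N):
    """Initial transition matrix of a Bakis (left-to-right) topology as
    in (2.87): probability 0.3 each for staying, advancing one state, or
    advancing two states (as printed, these rows sum to 0.9; the book's
    0.3 is presumably a rounded 1/3)."""
    A = np.zeros((N, N))
    for i in range(N - 2):
        A[i, i:i + 3] = 0.3
    A[N - 2, N - 2:] = 0.5   # next-to-last state: stay or advance one
    A[N - 1, N - 1] = 1.0    # final state is absorbing
    return A
```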
We now apply the Viterbi algorithm described above to find, for each train-
ing sample Om = (o1,m o2,m . . . oTm,m ), the optimal state sequence q∗m =
(q1,m q2,m . . . qTm,m ). Each q∗m constitutes a mapping of observations to states, and
the set of all q∗m , m = 1, 2, . . . , M , is used to update the model’s distributions B
and transition probabilities A accordingly: Mean μi and variance σi2 for each state si
are recomputed from all sample values ot,m for which qt,m = i holds. The transition
probabilities ai,j are set to the number of times that the subsequence {. . . , qi , qj , . . . }
is contained in the set of all q∗m and normalized to fulfill (2.68). Formally,
with
11 To avoid negative values, the score is sometimes defined as the negative logarithm of P.
12 If N is not a divisor of Tm, an equal distribution is approximated.
The training is continued on the same training samples until λ remains constant.
This is the case when the optimal state sequences q∗m remain constant for all M
samples.13 A proof of convergence can be found in [13].
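The update step just described — recompute each state's mean and variance from the samples assigned to it, and set the transition probabilities from the counted state successions — might be sketched as follows. Names are illustrative, and the state alignments q∗m are assumed to come from a preceding Viterbi pass.

```python
import numpy as np

def reestimate(samples, alignments, N):
    """One Viterbi-training update (sketch): given, for each training
    sample O_m, its optimal state sequence q*_m (`alignments`),
    recompute the per-state means/variances and the transition
    probabilities. Assumes every state received at least one sample."""
    obs_per_state = [[] for _ in range(N)]
    counts = np.zeros((N, N))
    for obs, q in zip(samples, alignments):
        for t, (o, i) in enumerate(zip(obs, q)):
            obs_per_state[i].append(o)
            if t + 1 < len(q):
                counts[i, q[t + 1]] += 1   # count transition i -> q[t+1]
    mu = np.array([np.mean(v) for v in obs_per_state])
    var = np.array([np.var(v) for v in obs_per_state])
    row_sums = counts.sum(axis=1, keepdims=True)
    A = np.where(row_sums > 0, counts / np.maximum(row_sums, 1), 0.0)
    A[N - 1, N - 1] = 1.0  # absorbing final state keeps rows normalized
    return mu, var, A
```

Iterating alignment and re-estimation until the q∗m no longer change corresponds to the stopping criterion stated in the text.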
It should be noted that the number of states N is not part of the parameter esti-
mation process and has to be specified manually. Lower values for N reduce compu-
tational cost at the risk of inadequate generalization, while higher values may lead to
overspecialization. Also, since the longest forward transition in the Bakis topology
is two states per observation,
N ≤ 2 min{T1 , T2 , . . . , TM } − 1 (2.93)
is required.
Revisiting the example introduced at the beginning of this section, we will now
create an HMM for ẋcog from the three productions of “clockwise” presented in
Fig. 2.16. To allow a direct comparison with the rigid state assignment shown in
Fig. 2.17 the number of states N is set to 12 as well. Fig. 2.21 overlays the parameters
of the resulting model’s distribution functions with the same three plots of ẋcog .
Fig. 2.21. Overlay plot of the feature ẋcog (t) for three productions of the gesture “clockwise”,
and mean μ and variance σ of the 12 states of an HMM created therefrom. The gray bars are
centered at μ and have a length of 2σ. Each represents a single state.
13 Alternatively, the training may also be aborted when the number of changes in all q∗m falls below a certain threshold.
46 Non-Intrusive Acquisition of Human Action
Compared to Fig. 2.16, the mean values μ now represent the shape of the plots
significantly better. This is especially noticeable for 10 ≤ t ≤ 30, where the differ-
ences in temporal offset previously led to μ having a much smaller amplitude than
ẋcog . Accordingly, the variances σ are considerably lower. This demonstrates the
effectiveness of the approach.
Multidimensional Observations
bi = N (o, μi , Ci ) (2.94)
where μi and Ci are mean and covariance matrix of state si estimated in the Viterbi
training:
These equations replace (2.88) and (2.89). In practical applications the elements
of o are frequently assumed uncorrelated so that C becomes a diagonal matrix, which
simplifies computation.
Extensions
14 For multimodal models, the considerations stated in Sect. 2.1.3.5 apply.
2.1.3.7 Example
Based on the features of the static and the dynamic vocabulary computed in
Sect. 2.1.2.4, this section describes the selection of suitable features to be used in
the two example applications.
We will manually analyze the values in Tab. 2.5 and the plots in Fig. 2.14 to identify
features with high inter-gesture variance. Since all values and plots describe only a
single gesture, they first have to be computed for a number of productions to get an
idea of the intra-gesture variance. For brevity, this process is not described explicitly.
The choice of classification algorithm is straightforward for both the static and the
dynamic case.
The data in Tab. 2.5 shows that a characteristic feature of the “stop” gesture
(Fig. 2.2a) is its low compactness: c is smaller than for any other shape. This was
already mentioned in Sect. 2.1.3.4, and found to be true for any production of “stop”
by any user.
Pointing gestures result in a roughly circular, compact shape with a single pro-
trusion from the pointing finger. For the “left” and “right” gestures (Fig. 2.2b,c), this
protrusion is along the x axis. Assuming that the protruding area does not signifi-
cantly affect xcog, the protrusion ratio xp introduced in (2.30) can be used: Shapes
with xp ≫ 1 are pointing to the right, while 1/xp ≫ 1 indicates pointing to the left.
Gestures for which xp ≈ 1 are not pointing along the x axis. This holds regardless
of whether the thumb or the index finger is used for pointing. Empirically, xp > 1.2
and 1/xp > 1.2 was found for pointing right and left, respectively.
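A minimal sketch of these decision rules: the 1.2 protrusion thresholds are from the text, while the compactness threshold `c_stop` is a hypothetical placeholder, since the excerpt only states that “stop” has a lower compactness than any other shape.

```python
def classify_static(c, x_p, c_stop=0.5):
    """Rule-based static gesture classification (sketch).
    c     -- compactness of the segmented hand shape
    x_p   -- protrusion ratio along the x axis, see (2.30)
    c_stop -- placeholder compactness threshold for "stop"."""
    if c < c_stop:
        return "stop"          # "stop" has the lowest compactness
    if x_p > 1.2:
        return "right"         # protrusion to the right
    if 1.0 / x_p > 1.2:
        return "left"          # protrusion to the left
    return "idle"              # xp ~ 1: not pointing along x
```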
Since the hand may not be the only skin-colored object in the image (see
Tab. 2.4), the segmentation might yield multiple skin colored regions. In our ap-
plication scenario it is safe to assume that the hand is the largest object; however, it
need not always be visible. The recording conditions in Tab. 2.4 demand a minimum
shape size amin of 10% of the image size, or
The feature plots in Fig. 2.14 show that normalized area a , compactness c, and nor-
malized x coordinate of the center of gravity xcog should be sufficient to discriminate
among the six dynamic gestures. “clockwise” and “counterclockwise” can be identi-
fied by their characteristic sinusoidal xcog feature. “open” and “grab” both exhibit a
drop in a around T/2 but can be distinguished by compactness, which simultaneously
drops for “open” but rises for “grab”. “close” and “drop” are discerned correspond-
ingly. We thus arrive at a feature vector of
        ⎛  a   ⎞
    O = ⎜  c   ⎟      (2.106)
        ⎝ xcog ⎠
An HMM-based classifier is the obvious choice for processing a time series of
such feature vectors. The resulting example application and its I MPRESARIO process
graph is presented in Sect. 2.1.5.
To run the example application for static gesture recognition, please install the
rapid prototyping system I MPRESARIO from the accompanying CD as described in
Sect. B.1. After starting I MPRESARIO, choose “File”→“Open...” to load the process
graph named StaticGestureClassifier.ipg (see Fig. 2.22). The data flow
is straightforward and corresponds to the structure of the sections in this chapter. All
six macros are described below.
Video Capture Device Acquires a stream of images from a camera connected e.g.
via USB. In the following, each image is processed individually and indepen-
dently. Other sources may be used by replacing this macro with the appropriate
source macro (for instance, Image Sequence (see Sect. B.5.1) for a set of image
files on disk).
Fig. 2.22. The process graph of the example application for static gesture recognition (file
StaticGestureClassifier.ipg). The image shows the classification result.
may have to be set manually. The remaining parameters control further aspects of
the segmentation process and should be left at their defaults.
StaticGestureClassifier Receives the list of regions from the preceding stage and
discards all but the largest one. The macro then computes the feature vector de-
scribed in (2.98), using the equations discussed in Sect. 2.1.2.3, and performs
the actual classification according to the rules derived in Sect. 2.1.3.7 (see (2.99)
to (2.105)). The thresholds used in the rules can be changed to modify the sys-
tem’s behavior. For the visualization of the classification result, the macro also re-
ceives the original input image from Video Capture Device (if the output image
is not shown, double-click on the macro’s red output port). It displays the border
of the hand and a symbol indicating the classification result (“left”, “right”, or
“stop”). If the hand is idle or not visible, the word IDLE is shown.
Press the “Play” button or the F1 key to start the example. By double-clicking on
the output port of GaussFilter the smoothed skin probability image can be displayed.
The hand should be clearly distinguishable from the background. Make sure that
the camera’s white balance is set correctly (otherwise the skin color detection will
fail), and that the image is never overexposed, but rather slightly underexposed. If
required, change the device properties as described in Sect. B.5.2.
The application recognizes the three example gestures “left”, “right”, and “stop”
shown in Fig. 2.2. If the classification result is always IDLE or behaves erratically, see
Sect. 2.1.6 for troubleshooting help.
Fig. 2.23. Process graph of the example application for dynamic gesture recognition (file
DynamicGestureClassifier.ipg). The image was recorded with a webcam mounted
atop the screen facing down upon the keyboard.
After the preparation phase, the recording phase starts automatically. Its duration
can be configured by Gesture Samples. A red progress bar is shown during recording.
The complete gesture must be made while this progress bar is running. The hand
motion should be smooth and take up the entire recording phase.
The parameter Hidden Markov Model specifies a file containing suitable Hidden
Markov Models.15 You can either use gestures.hmm, which is provided on the
CD and contains HMMs for the six example gestures shown in Fig. 2.3, or generate
your own HMMs using the external tool generateHMM.exe as described below.
Since the HMMs are identified by numerical IDs (0, 1, . . . ), the parameter Ges-
ture Names allows setting a name for each model that is used when reporting clas-
sification results. The names are specified as a list of space-separated words, where
the first word corresponds to ID 0, the second to ID 1, etc. Since the HMMs in
gestures.hmm are ordered alphabetically this parameter defaults to “clockwise
close counterclockwise drop grab open”.
15 The file actually contains an instance of lti::hmmClassifier, which includes instances of lti::hiddenMarkovModel.
The parameter Last Classified Gesture is actually not a parameter, but is set by
the macro itself to reflect the last classification result (which is also dumped to the
output pane).
Start the example application by pressing the “Play” button or the F1 key. Ad-
just white balance, shutter speed etc. as for StaticGestureRecognition (Sect. 2.1.4).
Additionally, make sure that the camera is set to the same frame rate used when
recording the training samples from which the HMMs were created (25 fps for
gestures.hmm).
When the segmentation of the hand is accurate, set State to “running” to start the
example. Perform one of the six example gestures (Fig. 2.3) while the red progress
bar is showing and check Last Classified Gesture for the result. If the example does
not work as expected, see Sect. 2.1.6 for help.
16 As above, this file actually contains a complete lti::hmmClassifier.
Fig. 2.24. Recognition of dynamic gestures read from disk frame by frame (process graph file
DynamicGestureClassifierFromDisk.ipg). Each frame corresponds to a single
file.
Fig. 2.25. Directory layout for training samples to be read by the generateHMM.exe com-
mand line utility.
structure automatically. Its process graph is presented in Fig. 2.26. Like Dynam-
icGestureRecognition, it has an “idle” and a “running” state, and precedes the actual
recording with a preparation phase. The parameters State, Delay Samples, and
Gesture Samples work as described above.
To record a set of training gestures, first set the parameter Destination Directory
to specify where the above described directory structure shall be created. Ensure that
Save Images to Disk is activated and State is set to “idle”, then press “Play” or F1.
For each gesture you wish to include in the vocabulary, perform the following steps:
1. Set the parameter Gesture Name (since the name is used as a directory name, it
should not contain special characters).
2. Set Sequence Index to 0.
3. Set State to “running”, position your hand appropriately in the preparation phase
(green progress bar) and perform the gesture during the recording phase (red
progress bar). After recording has finished, Sequence Index is automatically in-
creased by 1, and State is reset to “idle”.
4. Repeat step 3 several times to record multiple productions of the gesture.
5. Examine the results using an image viewer such as the free IrfanView17 or the
macro Image Sequence (see Sect. B.5.1). If a production was incorrect or otherwise
not representative, set Sequence Index to its index and repeat step 3 to
overwrite it.
17 http://www.irfanview.com/
Each time step 3 is performed, a directory with the name of Sequence Index
is created in a subdirectory of Destination Directory named according to Gesture
Name, containing as many image files (frames) as specified in Gesture Samples. This
results in a structure as shown in Fig. 2.25. All images are stored in BMP format.
The files created by RecordGestures can be used for both training and test-
ing. For instance, in order to observe the effects of changes to a certain pa-
rameter while keeping the input data constant, replace the source macro Video
Capture Device in the process graphs StaticGestureRecognition.ipg
or DynamicGestureRecognition.ipg with Image Sequence and use the
recorded gestures as image source.
To run generateHMM.exe on the recorded gestures, change to the direc-
tory where it resides (the I MPRESARIO data directory), ensure that the files
skin-32-32-32.hist and non-skin-32-32-32.hist are also present
there, and type
where the options -d, -e, and -f specify the input directory, the input file extension,
and the output filename, respectively. For use with Record Gestures, replace <dir>
with the directory specified as Destination Directory, <extension> with bmp18 ,
and <outfile> with the name of the .hmm file to be created. The directories are
then scanned and each presentation is processed (this may take several seconds). The
resulting .hmm file can now be used in DynamicGestureClassifier.
To control the segmentation process, additional options as listed in Tab. 2.6 may
be used. It is important that these values match the ones of the corresponding parame-
ters of the ObjectsFromMask macro in DynamicGestureClassifier.ipg
to ensure that the process of feature computation is consistent. The actual HMM
creation can be configured as shown in Tab. 2.7.
18 For the samples provided on CD, specify jpg instead.
2.1.6 Troubleshooting
In case the classification performance of either example is poor, verify that the
recording conditions in Tab. 2.4 are met. Wooden furniture may cause problems for
the skin color segmentation, so none should be visible in the image. Point the camera
to a white wall to be on the safe side. If the segmentation fails (i.e. the hand’s border
is inaccurate) even though no other objects are visible, recheck white balance, shutter
speed, and gain/brightness. The camera may perform an automatic calibration that
yields results unsuitable for the skin color histogram. Disable this feature and change
the corresponding device parameters manually. The hand should be clearly visible in
the output of the GaussFilter macro (double-click on its red output port if no output
image is shown).
Fig. 2.27. Hardware for facial expression recording. Sometimes the reflections of Styrofoam
balls are exploited (as in a. and b.) Other solutions use a head cam together with markers on
the face (c) or a mechanical device (d). In medical diagnostics line of sight is measured with
piezo elements or with cameras fixed to the spectacles to measure skin tension.
trinsic causes. Intrinsic variations result from individual face shape and from move-
ments while speaking, extrinsic effects result from camera position, illumination, and
possibly occlusions.
Here we propose a generic four stage facial feature analysis system that can cope
with such influence factors. The processing chain starts with video stream acquisition
and continues over image preprocessing and feature extraction to classification.
During image acquisition, a camera image is recorded while camera adjustments
are optimized continuously. During preprocessing, the face is localized and its ap-
pearance is improved in various respects. Subsequently feature extraction results in
Fig. 2.28. Information processing stages of the facial expression recognition system.
During image acquisition, the constancy of color and luminance must be guaranteed
even under variable illumination conditions. The commercial web cams used here
usually permit the adjustment of various parameters, and mostly there exists a fea-
ture for partial or full self-configuration. The hardware-implemented algorithms take
care of white balance, shutter speed, and back light compensation and yield, on
average, satisfactory results.
Camera parameter adjustment can however be further optimized if context
knowledge is available, as e.g. shadows from which the position of a light source
can be estimated, or the color of a facial region.
Since the factors lighting and reflection cannot be influenced, given the objective,
only the adjustment of the camera parameters remains. The most important para-
meters of a digital camera for optimizing the representation quality are the
shutter speed, the white balance, and the gain. In order to optimize these factors in
parallel, the extended simplex algorithm is utilized, similar to the approach given
in [16].
A simplex in n-dimensional space (here n = 3, for the three parameters) is a
geometrical object with n + 1 position vectors that are equidistant from one another.
A position vector is defined as follows:
        ⎛ s ⎞      s = shutter
    P = ⎜ w ⎟      w = white balance      (2.107)
        ⎝ g ⎠      g = gain
|P| is a scalar value representing the number of skin-colored pixels in the face
region, depending on the three parameters s, w and g. The worst (smallest) vector
is mirrored at the line given by the remaining vectors, which results in a new
simplex.
The simplex algorithm tries to iteratively find an optimal position vector which
comes closest to the desired target. In the present case, the aim is to approximate the
average color of the face region to an empirically determined reference value.
For an optimal adjustment of the parameters, however, a suitable initialization is
necessary, which can be achieved for example by first running the automatic white
balance of the camera.
Fig. 2.29 (left) shows an example for the two-dimensional case. Two additional
position vectors U and V accompany the position vector P. Since |P| produces too
few skin-colored areas, P is mirrored at the straight line UV onto P′. This results in
the new simplex S′. The procedure is repeated until it converges towards a local
maximum, Fig. 2.29 (right).
A disadvantage of the basic simplex algorithm is its rigid, stepwise, linear progres-
sion. The approach was therefore extended: by introducing a set of rules, the
mirroring step may additionally be compressed or stretched. Fig. 2.30
represents the extended simplex algorithm for a two-dimensional parameter space.
1. If |P′| > |U| ∧ |P′| > |V| then repeat the mirroring with stretching factor 2
from P to P′2
2. If |P′| > |U| ∧ |P′| < |V| ∨ |P′| < |U| ∧ |P′| < |V| then use P′
3. If |P′| < |U| ∧ |P′| < |V| ∧ |P′| > |P| then repeat the mirroring with stretching
factor 0.5 from P to P′0.5
4. If |P′| < |U| ∧ |P′| < |V| ∧ |P′| < |P| then repeat the mirroring with stretching
factor −0.5 from P to P′−0.5
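One reflection step of the (extended) simplex search might be sketched as follows. The `score` callable stands for the skin-colored-pixel count of a parameter vector (s, w, g), and `factor` corresponds to the stretching factors 2, 1, 0.5 and −0.5 of the rules above; all names are illustrative.

```python
import numpy as np

def reflect(worst, others, factor=1.0):
    """Mirror the worst simplex vertex at the centroid of the remaining
    vertices, optionally stretched or compressed by `factor`."""
    centroid = np.mean(others, axis=0)
    return centroid + factor * (centroid - worst)

def simplex_step(vertices, score):
    """One iteration: evaluate all vertices and replace the worst
    (smallest score) by its reflection."""
    values = [score(v) for v in vertices]
    worst = int(np.argmin(values))
    others = np.array([v for i, v in enumerate(vertices) if i != worst])
    vertices[worst] = reflect(vertices[worst], others)
    return vertices
```

Iterating `simplex_step` drives the parameter vector towards a local maximum of the score, as described in the text.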
The goal of image preprocessing is the robust localization of the facial region, which
corresponds to the rectangle bounded by the bottom lip and the brows. For the sake of
processing speed, only skin-colored regions of an image are analyzed. Additionally, a
general skin color model is adapted to each individual.
Face finding relies on a holistic approach which takes the orientation of facial
regions into consideration. Brows, e.g., are characterized by vertically alternating
Facial Expression Commands 61
Imotion(x, y, tk) = { 1.0 if ‖I(x, y, tk) − I(x, y, tk−1)‖² > Θ ; max(0.0, IB(x, y, tk−1) − T1) otherwise }   (2.108)
Fig. 2.31. Visualized movement patterns for movement of head and background (left), move-
ment of the background alone (middle), and movement of the head alone (right).
For a combined skin color and movement search the largest skin colored ob-
ject is selected and subsequently limited by contiguous, non skin colored regions
(Fig. 2.32).
Fig. 2.32. Search mask composed of skin color (left) and motion filter (right).
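The motion-filter half of this search mask follows (2.108). A sketch with illustrative values for Θ and T1:

```python
import numpy as np

def motion_image(frame, prev_frame, prev_motion, theta=100.0, t1=0.05):
    """Motion image per (2.108) (sketch): pixels whose squared frame
    difference exceeds Theta are set to 1.0; elsewhere the previous
    motion image I_B decays by T_1, clipped at 0. theta and t1 are
    illustrative, not values from the book."""
    diff = (frame.astype(float) - prev_frame.astype(float)) ** 2
    decayed = np.maximum(0.0, prev_motion - t1)
    return np.where(diff > theta, 1.0, decayed)
```

Intersecting the thresholded motion image with the skin-color mask yields the combined search mask of Fig. 2.32.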
Holistic face finding makes use of three separate modules. The first transforms
the input image into an integral image for the efficient calculation of features. The
second module supports the automatic selection of suitable features which describe
variations within a face, using Ada-Boosting. The third module is a cascade classifier
that sorts out insignificant regions and analyzes in detail the remaining regions. All
three modules are subsequently described in more detail.
Integral Image
For finding features as easily as possible, Haar-features are used, as presented
by Papageorgiou et al. [18]. A Haar-feature results from calculating the luminance
difference of equally sized neighboring regions. These combine into the five constella-
tions depicted in Fig. 2.33, where the feature M is determined by:
the side ratio of the regions may vary. To speed up the search in such a large search
space, an integral image is introduced, which provides the luminance integral in an
economical way.
The integral image Iint at the coordinate (x, y) is defined as the sum over all
intensities I(x′, y′):
Iint(x, y) = Σ_{x′≤x, y′≤y} I(x′, y′)   (2.110)
By introducing the cumulative column sum Srow (x, y) two recursive calculations
can be performed:
The application of these two equations enables the computation of a complete in-
tegral image by iterating just once through the original image. The integral image has
to be calculated only once. Subsequently, the intensity integrals Fi of any rectangle
are determined by executing a simple addition and two subtractions.
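Both the single-pass construction and the constant-time rectangle sum can be sketched as follows; cumulative sums realize the recursive computation, and the names are illustrative.

```python
import numpy as np

def integral_image(img):
    """Integral image per (2.110): I_int(x, y) is the sum of I over all
    pixels with x' <= x and y' <= y (built in a single pass)."""
    return np.cumsum(np.cumsum(np.asarray(img, dtype=np.int64), axis=0), axis=1)

def rect_sum(iimg, top, left, bottom, right):
    """Intensity integral of the inclusive rectangle, using one addition
    and two subtractions on four table lookups."""
    total = iimg[bottom, right]
    if top > 0:
        total -= iimg[top - 1, right]
    if left > 0:
        total -= iimg[bottom, left - 1]
    if top > 0 and left > 0:
        total += iimg[top - 1, left - 1]
    return int(total)
```

A Haar-feature value is then simply the difference of two (or more) such rectangle sums.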
Fig. 2.34. Calculation of the luminance of a region with the integral image. The luminance of
region FD is I(xD, yD) + I(xA, yA) − I(xB, yB) − I(xC, yC).
The approach is further exemplified when regarding Fig. 2.35. The dark/gray
constellation is the target feature. In a first step the integral image is constructed.
This, in a second step, enables the economic calculation of a difference image. Each
entry in this image represents the luminance difference of both regions with the upper
left corner being the reference point. In the example there is a maximum at position
(2, 3).
Fig. 2.35. Localizing a single feature in an integral image, here consisting of the luminance
difference of the dark and gray region.
Ada-Boosting
Now the n most significant features are selected with the Ada-Boosting algorithm,
a special kind of ensemble learning. The basic idea here is to construct a diverse
ensemble of “weak” learners and combine them into a “strong” classifier.
A : D → h   (2.120)
N = {(xi, yi) ∈ X × Y | i = 1 . . . m}   (2.121)
h : X → Y   (2.122)
X is the set of all features; Y is the set of correct classification results. The
number of training examples is m. Here A was selected to be a perceptron:
hj(x) = { 1 if pj fj(x) < pj Θj ; 0 otherwise }   (2.123)
The perceptron hj(x) depends on the feature fj(x) and a given threshold Θj,
whereby the parity pj ensures the correct direction of the inequality.
During the search for a weak classifier, a number of variants Dt with t = 1, 2, . . . , K
are generated from the training data D. This results in an ensemble of classifiers
ht = A(Dt). The global classifier H(x) is then implemented as a weighted sum:
H(x) = Σ_{k=1}^{K} αk hk(x)   (2.124)
Step 1: Initialization
• let m1 . . . mn be the set of training images
• let bi be a binary label, with bi = 0 for mi ∈ Mfaces and bi = 1 for mi ∉ Mfaces
• let m be the number of all faces and l the number of non-faces
• the initial weights are w1,i = 1/(2m) for bi = 0 and w1,i = 1/(2l) for bi = 1
Step 2: Iterative search for the T most significant features. For t = 1, . . . , T :
• Normalize the weights, wt,i ← wt,i / Σ_{j=1}^{n} wt,j, such that wt forms a probability
distribution
• For each feature j, train a classifier hj. The error depending on wt is
εj = Σ_i wi · |hj(xi) − yi|
• Select the classifier ht with the smallest error εt
• Update the weights: wt+1,i = wt,i βt^{1−ei}, where ei = 0 for correct and ei = 1 for
incorrect classification, and βt = εt / (1 − εt)
The final strong classifier is
h(x) = { 1 if Σ_{t=1}^{T} αt ht(x) ≥ (1/2) Σ_{t=1}^{T} αt, with αt = log(1/βt) ; 0 otherwise }   (2.125)
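Steps 1–2 and the strong classifier (2.125) can be sketched with threshold perceptrons as weak learners. This is an illustrative implementation, not the book's code: features are the columns of X, labels y ∈ {0, 1}, and a small ε guards against division by zero when a stump classifies perfectly.

```python
import numpy as np

def adaboost(X, y, T):
    """Discrete AdaBoost with threshold perceptrons (decision stumps)
    as weak learners, following Steps 1-2 above (sketch)."""
    n, d = X.shape
    m, l = np.sum(y == 1), np.sum(y == 0)
    w = np.where(y == 1, 1.0 / (2 * m), 1.0 / (2 * l))
    classifiers = []  # tuples (feature j, threshold, parity, alpha)
    for _ in range(T):
        w = w / w.sum()                      # normalize weights
        best = None
        for j in range(d):                   # exhaustive stump search
            for theta in np.unique(X[:, j]):
                for p in (1, -1):
                    h = (p * X[:, j] < p * theta).astype(int)
                    err = np.sum(w * np.abs(h - y))
                    if best is None or err < best[0]:
                        best = (err, j, theta, p)
        err, j, theta, p = best
        beta = max(err / (1 - err), 1e-10)   # epsilon guard for err == 0
        h = (p * X[:, j] < p * theta).astype(int)
        w = w * beta ** (1 - np.abs(h - y))  # downweight correct samples
        classifiers.append((j, theta, p, np.log(1 / beta)))
    return classifiers

def strong_classify(classifiers, x):
    """Weighted vote per (2.125)."""
    s = sum(a * (p * x[j] < p * theta) for j, theta, p, a in classifiers)
    return int(s >= 0.5 * sum(a for _, _, _, a in classifiers))
```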
Cascaded Classifier
This type of classifier is applied to reduce classification errors and simultaneously
shorten processing time. To this end, small efficient classifiers are built that reject
numerous parts of an image as non-face (avoiding false positives) without incorrectly
discarding faces (false negatives).
Subsequently, more complex classifiers are applied to further reduce the false-
positive rate. A simple classifier for a two-region feature, e.g., reaches in a first step
a false-positive rate of only 40% while at the same time recognizing 100% of the
faces correctly. The total system is equivalent to a truncated decision tree, with
several classifiers arranged in layers called cascades (Fig. 2.36).
Fig. 2.36. Schematic diagram of the cascaded classifier: several classifiers in line are applied
to each image region. Each classifier can reject that region. This approach strongly reduces
computational cost and at the same time provides very accurate results.
F = ∏_{i=1}^{K} fi   (2.126)
where fi is the false-positive rate of one of the K layers. The recognition rate D is,
correspondingly,
D = ∏_{i=1}^{K} di   (2.127)
The final goal for each layer is to compose a classifier from several features until
a required recognition rate D is reached while the number F of misclassifications
remains minimal. This is achieved by the following algorithm:
Step 1: Initialization
• The user defines f and FTarget
• Let Mfaces be the set of all faces
• Let M¬faces be the set of all non-faces
• F0 = 1.0; D0 = 1.0; i = 0
Step 2: Iterative Determination of all Classifiers of a Layer
• i = i + 1; ni = 0; Fi = Fi−1
• while Fi > FTarget
– ni = ni + 1
– Use Mfaces and M¬faces to generate, with Ada-Boosting, a classifier with ni
features.
– Test the classifier of the current layer on a separate test set to determine Fi and
Di.
– Reduce the threshold of the i-th classifier until the classifier of the current
layer reaches a recognition rate of d × Di−1.
• If Fi > FTarget, test the classifier of the current layer on a set of non-faces and
add each misclassified object to N.
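Evaluating the finished cascade on a candidate region is then just early-exit sequential testing. A sketch in which each layer is modeled as a boolean callable:

```python
def cascade_classify(region, layers):
    """Evaluate one image region against the cascade (sketch): the
    first layer that rejects the region discards it immediately, so
    most non-face regions cost only a few feature evaluations."""
    for layer in layers:
        if not layer(region):
            return False  # rejected as non-face; later layers never run
    return True           # survived all layers: classified as face
```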
2.2.2.2 Face Tracking
For side view images of faces the described localization procedure yields only uncer-
tain results. Therefore an additional module for tracking suitable points in the facial
area is applied, using the algorithm of Tomasi and Kanade [26].
The tracking of point pairs in subsequent images can be partitioned into two sub-
tasks: first, suitable features are extracted, which are then tracked in the following
frames of a sequence. In case features are lost, new ones must be identified. The
underlying assumption is that between immediately following frames only small
motions of the features to be tracked occur. Features are selected at those positions
where
λ = min(λ1, λ2)   (2.130)
is highest, λ1 and λ2 being the eigenvalues of the gradient matrix G. In addition,
further restrictions like the minimal distance between features or the minimal
distance from the image border determine the selection.
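The selection criterion (2.130) can be sketched as follows: build the 2×2 gradient matrix G of a window from its gradient images and return the smaller eigenvalue. Names are illustrative.

```python
import numpy as np

def corner_score(gx, gy):
    """Tomasi-Kanade feature quality (sketch): for a window with
    gradient images gx, gy, form G = [[sum gx^2, sum gx*gy],
    [sum gx*gy, sum gy^2]] and return lambda = min(lambda_1, lambda_2)
    as in (2.130)."""
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return float(np.min(np.linalg.eigvalsh(G)))
```

A pure edge yields one vanishing eigenvalue and thus a low score, while a corner-like window with gradients in both directions scores high.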
For the selected features, the shift in the next frame is determined based on trans-
lation models. If the intensity function is approximated linearly, the shift d follows
from the linear system
G · d = e   (2.132)
with
e = ∫∫_W (I(x, y) − I′(x, y)) g ω dW   (2.133)
Here the integration runs over the feature region W, with ω being a weighting func-
tion. Since neither the translation model nor the linear approximation is exact, the
equations are solved iteratively, and the admissible shift d is assigned an upper limit.
In addition, a motion model is applied which enables defining a measure of feature
dissimilarity by estimating the affine transform. In case this dissimilarity becomes
too large, the feature is substituted by a new one.
For the analysis of single facial features, the face is modeled by dynamic face graphs.
Most common are Elastic Bunch Graphs (EBG) [27] and Active Appearance Models
(AAM); the latter are treated here.
An Active Appearance Model (AAM) is a statistical model combining textural and
geometrical information into an appearance. The combination of outline and tex-
ture models is based on an eigenvalue approach which reduces the amount of data
needed, hereby enabling real-time calculation.
Outline Model
The contour of an object is represented by a set of points placed at characteristic
positions of the object. The point configuration must be stable with respect to shift,
rotation, and scaling. Cootes et al. [6] describe how marking points, i.e. landmarks,
are selected consistently on each image of the training set. Hence, the procedure is
restricted to object classes with a common basic structure.
In a training set of triangles, e.g., typically the corner points are selected since
they describe the essential property of a triangle and allow easy distinction of differ-
ences in the various images of a set. In general, however, objects have more compli-
cated outlines so that further points are needed to adequately represent the object in
question as, e.g., the landmarks for a face depicted in Fig. 2.37.
As soon as n points are selected in d dimensions, the outline (shape) is repre-
sented by a vector x which is composed of the single landmarks.
Fig. 2.38. Adjustment of a shape with respect to a reference shape by a procrustes analysis
using affine transformation. Shape B is translated such that its center of gravity matches that
of shape A. Subsequently B is rotated and scaled until the average distance of corresponding
points is minimized (procrustes distance).
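The alignment of Fig. 2.38 corresponds to a similarity-transform Procrustes fit. A sketch using the SVD-based solution for the optimal rotation, operating on n×2 landmark matrices (names are illustrative):

```python
import numpy as np

def procrustes_align(A, B):
    """Align shape B to shape A (both n x 2 landmark matrices) by
    translation, rotation and scaling, as in Fig. 2.38 (sketch)."""
    A0 = A - A.mean(axis=0)          # translate both centroids to origin
    B0 = B - B.mean(axis=0)
    # optimal rotation via SVD of the cross-covariance (orthogonal
    # Procrustes solution)
    U, _, Vt = np.linalg.svd(B0.T @ A0)
    R = U @ Vt
    s = np.sum(A0 * (B0 @ R)) / np.sum(B0 ** 2)   # optimal scale
    return s * (B0 @ R) + A.mean(axis=0)          # back to A's centroid
```

The remaining average point distance after this fit is the procrustes distance mentioned in the caption.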
Now the s point sets are adjusted in a common coordinate system. They describe
the distribution of the outline points of an object. This property is now used to con-
struct new examples which are similar to the original training set. This means that a
new example x can be generated from a parameterized model M in the form
x = M (p) (2.135)
where p is the model parameter vector. By a suitable selection of p, new examples
are kept similar to the training set. The Principal Component Analysis (PCA) is a
means for dimensionality reduction by first identifying the main axes of a cluster.
Therefore it involves a mathematical procedure that transforms a number of possi-
bly correlated parameters into a smaller number of uncorrelated parameters called
principal components. The first principal component accounts for as much of the
variability in the data as possible, and each succeeding component accounts for as
much of the remaining variability as possible.
For performing the PCA, first the mean vector x̄ of the s training set vectors xi
is determined by:
x̄ = (1/s) Σ_{i=1}^{s} xi   (2.136)
Let the outline points of the model be described by n parameters xi. Then the
variance of a parameter xi describes how far the values xij deviate from the expec-
tation value E(Xi). The covariance Cov then describes the linear coherence of the
various parameters xi:
Φk = (ϕ1 , ϕ2 , . . . ϕt ) (2.140)
Then these t vectors form a subspace basis which again is orthogonal. Each point
in the training set can now be approximated by
x ≈ x̄ + Φk · pk   (2.141)
where pk corresponds to the model parameter vector mentioned above, now with
a reduced number of parameters.
The number t of Eigenvectors may be selected so that the model explains the
total variance to a certain degree, e.g. 98%, or that each example in the training set is
approximated to a desired accuracy. Thus an approximation for the n marked contour
points with dimension d each is found for all s examples in the training set, as related
to the deviations from the average outline. With this approximation, and considering
the orthogonality of the eigenvector matrix, the parameter vector pk results in
pk = Φk^T (x − x̄) (2.142)
By varying the elements of pk the outline x may be varied as well. The average
outline and exemplary landmark variances are depicted in Fig. 2.39.
The eigenvalue λi is the variance of the i-th parameter pki , over all examples in
the training set. To make sure that a newly generated outline is similar to the training
patterns, limits are set. Empirically it was found that the maximum deviation of a
parameter pki should be no more than ±3·√λi (Fig. 2.40).
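Equations (2.141) and (2.142), together with the ±3√λi limits, can be illustrated with a short sketch (function names and interfaces are hypothetical):

```python
import numpy as np

def project_shape(x, x_mean, Phi):
    """Model parameters for a given outline: p_k = Phi^T (x - x_mean),
    cf. Eq. (2.142)."""
    return Phi.T @ (x - x_mean)

def synthesize_shape(x_mean, Phi, eigvals, p):
    """New outline x = x_mean + Phi p, cf. Eq. (2.141). Each parameter
    is clamped to +-3*sqrt(lambda_i) so the generated outline stays
    similar to the training patterns."""
    limit = 3.0 * np.sqrt(eigvals)
    p = np.clip(p, -limit, limit)
    return x_mean + Phi @ p
```

Clamping rather than rejecting out-of-range parameters is one common choice; the book only states the ±3√λi limit, not how violations are handled.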
Texture Model
With the calculated mean outline and the principal components it is possible to
reconstruct each example of the training data. This removes texture variances
resulting from the differing contours. Afterwards the RGB values within the
normalized contour are extracted into a texture vector t_im.
To minimize illumination effects, the texture vectors are normalized once more
using a scaling factor α and an offset β, resulting in:
Fig. 2.40. Outline models for variations of the first three eigenvalues φ1, φ2 and φ3 between
3√λ, 0 and −3√λ.
t = (tim − β · 1) / α (2.143)
For the identification of t the average of all texture vectors has to be determined
first. Additionally a scaling and an offset have to be estimated, because the sum of
the elements must be zero and the variance should be 1. The values for α and β for
normalizing tim are then
α = tim · t̄,   β = (tim · 1) / n (2.144)
where n is the number of elements in a vector. As soon as all texture vectors have
been assigned scaling and offset, the average is calculated again. This procedure is
iterated for the new average until the average converges to a stable value.
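One plausible reading of this iteration, sketched in Python (the reference texture is normalized here to zero element sum and unit norm, a common variant of the unit-variance condition stated above; names are illustrative):

```python
import numpy as np

def normalize_textures(T, max_iter=20, tol=1e-8):
    """Normalize texture vectors (rows of T) with a scaling alpha and
    offset beta per Eqs. (2.143)/(2.144), iterating until the average
    texture converges."""
    n = T.shape[1]
    t_ref = T.mean(axis=0)
    for _ in range(max_iter):
        # The reference must have zero element sum and unit norm.
        t_ref = t_ref - t_ref.mean()
        t_ref = t_ref / np.linalg.norm(t_ref)
        normed = []
        for t_im in T:
            beta = t_im.sum() / n        # offset, Eq. (2.144)
            alpha = t_im @ t_ref         # scaling, Eq. (2.144)
            normed.append((t_im - beta * np.ones(n)) / alpha)
        new_ref = np.mean(normed, axis=0)
        if np.linalg.norm(new_ref - t_ref) < tol:
            break                        # average has stabilized
        t_ref = new_ref
    return np.array(normed)
```

For textures that differ only in brightness and contrast (an affine change per vector), this normalization maps them all onto the same vector, which is exactly the illumination invariance the text aims for.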
From here the further processing is identical to that described in the outline sec-
tion. Again a PCA is applied to normalized data which results in the following ap-
proximation for a texture vector in the training set
t = t + Φt · pt (2.145)
where Φt is again the Eigenvector matrix of the covariance matrix, now for tex-
ture vectors. The texture parameters are represented by pt .
K1 = (R + G + B) / 3 (2.146)
K2 = (R − B) / 2 (2.147)
K3 = (2G − R − B) / 4 (2.148)
Fig. 2.41. An artificial 3-band image (a) composed of intensity channel (b), I3 channel (c),
and Y gradient (d).
pk = Φk^T (x − x̄) (2.150)
pt = Φt^T (t − t̄) (2.151)
pa = [ Wk Φk^T (x − x̄) ; Φt^T (t − t̄) ] (2.152)
pa ≈ Φa c (2.153)
Here Φa denotes the appearance Eigenvector matrix and c its parameter vector. The
latter controls the parameters of outline as well as of texture. The linearity of
the model now permits applying linear algebra to formulate outline and texture as
functions of c. Setting:
Φp = [ Φpk ; Φpt ] (2.154)
c = [ cpk ; cpt ] (2.155)
results in
pa = [ Wk Φk^T (x − x̄) ; Φt^T (t − t̄) ] = [ Wk Φk^T x − Wk Φk^T x̄ ; Φt^T t − Φt^T t̄ ] = [ Φpk cpk ; Φpt cpt ] (2.156)
⇔ [ Wk Φk^T x ; Φt^T t ] = [ Φpk cpk + Wk Φk^T x̄ ; Φpt cpt + Φt^T t̄ ] (2.157)
⇔ [ x ; t ] = [ Φk Wk⁻¹ Φpk cpk + x̄ ; Φt Φpt cpt + t̄ ] (2.158)
From this the following functions for the determination of an outline x and a
texture t can be read:
x = x̄ + Qk cpk (2.161)
t = t̄ + Qt cpt (2.162)
with
∂I = Ii − Im (2.165)
For minimizing this distance, the error Δ = |∂I|² is considered. Optimization
then concerns
• Learning the relation between image difference ∂I and deviations in the model
parameters.
• Minimizing the distance error Δ by applying the learned relation iteratively.
Experience shows that a linear relation between ∂I and ∂c is a good approxima-
tion:
∂c = A · ∂I (2.166)
To identify A a multiple multivariate linear regression is performed on a selection
of model variations and corresponding difference images ∂I. The selection could be
Fig. 2.42. Appearance models for variations of the first five Eigenvectors c1, c2, c3, c4 and
c5 between 3√λ, 0 and −3√λ.
augmented by adding examples with carefully randomized model parameters for the
given examples.
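The multivariate regression for A in Eq. (2.166) reduces to a linear least-squares problem. The following is a minimal sketch under the assumption that pairs of parameter displacements and difference images have already been collected (names are hypothetical):

```python
import numpy as np

def learn_update_matrix(dC, dI):
    """Multivariate linear regression for  dc = A * dI  (Eq. 2.166).

    dC : (k, m) rows of known model parameter displacements
    dI : (k, d) rows of the corresponding difference images
    Returns A of shape (m, d).
    """
    # Least-squares solution of dI @ X = dC over all k samples,
    # then A = X^T so that dc = A @ dI_vec for a single sample.
    X, *_ = np.linalg.lstsq(dI, dC, rcond=None)
    return X.T
```

At run time, A maps a measured texture difference directly to a parameter update, which is what makes the iterative AAM search fast.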
For the determination of error Δ the reference points have to be identical. There-
fore the image outlines to be compared are made identical and their texture is com-
pared. In addition to variations in the model parameters, small deviations in pose, i.e.
translation, scaling and orientation are introduced.
These additional parameters are made part of the regression by simply adding
them to the model parameter deviation vector ∂c. To maintain model linearity, pose
To localize areas of interest, a face graph is iteratively matched to the face using a
modified active appearance model (AAM) which exploits geometrical and textural
information within a triangulated face. The AAM must be trained with data of many
subjects. Facial features of interest which have to be identified in this processing
stage, are head pose, line of sight, and lip outline.
2.2.4.1 Head Pose Estimation
For the identification of the posture of the head, two approaches are pursued in
parallel. The first method is an analytical method based on the trapezoid described
by the outer corners of the eyes and the corners of the mouth. The four required
points are taken from a matched face graph.
The second, holistic method also makes use of a face graph. Here the face region
is derived from the convex hull that contains all nodes of the face-graph (Fig. 2.43).
Then a transformation of the face region into the eigenspace takes place. The
base of the eigenspace is created by the PCA transformation of 60 reference views
which represent different head poses. The Euclidean distance between the projection
of the current view and the reference views in the eigenspace then allows the head
pose to be estimated.
Finally the results of the analytic and the holistic approach are compared. In case
of significant differences the actual head pose is estimated by utilizing the last correct
result combined with a prediction involving the optical flow.
Analytic Approach
In a frontal view the area between eye and mouth corners appears to be symmetrical.
If, however, the view is not frontal, the face plane will be distorted. To identify roll σ
and pitch angle τ the distorted face plane is transformed into a frontal view (Fig. 2.44).
A point x on the distorted plane is transformed into an undistorted point x′ by
x′ = U x + b (2.167)
where U is a 2x2 transformation matrix and b the translation. It can be shown
that U can be decomposed to:
Fig. 2.43. Head pose is derived by a combined analytical and holistic approach.
U (a b) = [ 0  ½Re ; −1  0 ] (2.171)
⇔ U = [ 0  ½Re ; −1  0 ] (a b)⁻¹ (2.172)
where Re is the ratio of a and b. Since a and b are parallel and orthogonal with
respect to the symmetry axes, it follows:
Fig. 2.44. Back transformation of a distorted face plane (right) on an undistorted frontal view
of the face (left) which yields roll and pitch angle.
aT · U T · U · b = 0 (2.173)
Hence U must be a symmetrical matrix:
U^T · U = [ α  β ; β  γ ] (2.174)
Also it follows that:
tan(2τ) = 2β / (α − γ)   and   cos(σ) = 1 / (√μ + √(μ − 1)) (2.175)
with μ = (α + γ)² / (4(α·γ − β²)) (2.176)
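Given the entries α, β, γ of U^T U, Eqs. (2.175)–(2.176) yield the angles directly. A small sketch (the function name is hypothetical; atan2 is used so the roll angle stays well-defined when α = γ):

```python
import math

def pose_from_uTu(alpha, beta, gamma):
    """Roll tau and pitch sigma from the entries of the symmetric
    matrix U^T U, following Eqs. (2.175)-(2.176). Angles in radians."""
    tau = 0.5 * math.atan2(2.0 * beta, alpha - gamma)
    mu = (alpha + gamma) ** 2 / (4.0 * (alpha * gamma - beta ** 2))
    sigma = math.acos(1.0 / (math.sqrt(mu) + math.sqrt(mu - 1.0)))
    return tau, sigma
```

For an undistorted frontal view U^T U is a multiple of the identity (β = 0, α = γ), which gives μ = 1 and hence τ = σ = 0, as expected.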
Fig. 2.45. Pose reconstruction from facial geometry distortion is ambiguous as may be seen
by comparing the left and right head pose.
Holistic Approach
A second method for head pose estimation makes use of the Principal Component
Analysis (PCA). It transforms the face region into a data space of Eigenvectors that
results from different poses. The monochrome images used for reference are first
normalized by subtracting the average intensity from individual pixels and then di-
viding the result by the standard deviation. Hereby variations resulting from different
illuminations are averaged.
For n views of a rotating head a pose eigenspace (PES) can be generated by
performing the PCA. A reference image sequence distributed equally between -60
and 60 degrees yaw angle is collected, where each image has size m = w × h and
is represented as m-dimensional row vector x. This results in a set of such vectors
(x1 , x2 , . . . , xn ). The average μ and the covariance matrix S are calculated by:
μ = (1/n) · Σ_{i=1}^{n} xi (2.178)
S = (1/n) · Σ_{i=1}^{n} [xi − μ][xi − μ]^T (2.179)
The n Eigenvectors establish the basis for the PES. For each image of size w × h
a pattern vector Ω(x) = [ω1 , ω2 , . . . , ωn ] may be found by projection with the
Eigenvectors φj .
Fig. 2.46. top: Five of sixty synthesized views. middle: masked faces. bottom: cropped faces.
right: Projection into the PES.
Fig. 2.47. Variations in the PES due to different illumination of the faces.
Fig. 2.48. Here four Gabor wavelet transforms are used instead of image intensity. The
influence of illumination variations is much smaller than in Fig. 2.47.
Now, if the pose of a new view is to be determined, this image is projected into
the PES as well and subsequently the reference point with the smallest Euclidean
distance is identified.
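The projection-and-nearest-neighbor step can be summarized as follows (a sketch; it assumes the reference pattern vectors and their yaw angles have been precomputed, and all names are illustrative):

```python
import numpy as np

def estimate_yaw(view, mu, Phi, ref_patterns, ref_angles):
    """Project a normalized view into the pose eigenspace (PES) and
    return the yaw angle of the reference pattern with the smallest
    Euclidean distance."""
    omega = Phi.T @ (view - mu)                       # pattern vector
    d = np.linalg.norm(ref_patterns - omega, axis=1)  # distances to refs
    return ref_angles[int(np.argmin(d))]
```

With reference views spaced at equal yaw increments, the returned angle is a quantized pose estimate; interpolating between the two nearest references would refine it.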
2.2.4.2 Determination of Line of Sight
For iris localization the circle Hough transformation is used which supports the re-
liable detection of circular objects in an image. Since the iris contains little red hue,
only the red channel is extracted from the eye region which contains a high contrast
between skin and iris. By the computation of directed X and Y gradients in the red
channel image, the amplitude and phase differences between the iris and its
environment are emphasized.
The Hough space is created using an extended Hough transform. It is applied to a
search mask which is generated from a threshold segmentation of the red channel
image. Local maxima then point to circular objects in the original image.
The iris by its very nature represents a homogeneous area. Hence the local extrema
in the Hough space are validated by verifying the filling degree of the circular area
with the expected radius in the red channel. Finally the line of sight is
determined by comparing the intensity distributions around the iris with trained ref-
erence distributions using a Maximum-Likelihood-Classifier. In Fig. 2.49 the entire
concept for line of sight analysis is illustrated.
2.2.4.3 Circle Hough Transformation
The Hough transformation (HT) represents digital images in a special parameter
space called Hough Space. For a line, e.g., this space is two dimensional and has the
coordinates slope and shift. For representing circles and ellipses a Hough space of
higher dimensions is required.
Fig. 2.49. Line of sight is identified based on amplitude and phase information in the red
channel image. An extended Hough transformation applied to a search mask finds circular
objects in the eye region.
For the recognition of circles, the circle Hough transformation (CHT) is applied.
Here the Hough space H(x, y, r) corresponds to a three dimensional accumulator
that indicates the probability that there is a circle with radius r at position (x, y) in
the input image. Let Cr be a circle with radius r.
This means that for each position P of the input image where the gradient is
greater than a certain threshold, a circle with a given radius is created in the Hough
space. The contributions of the single circles are accumulated in Hough space. A
local maximum at position P then points to a circle center in the input image as
illustrated in Fig. 2.50 for 4 and 16 nodes, if a particular radius is searched for.
For the application at hand the procedure is modified such that the accumulator
in Hough Space is no longer incremented by circles but by circular arc segments
(Fig. 2.51). The sections of the circular arc are incremented orthogonally with
respect to the phase, to both sides.
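A minimal circle Hough transform for one fixed radius might look as follows. This is a sketch of the basic full-circle variant, not of the arc-segment modification described above; names and the number of sampled vote positions are illustrative.

```python
import numpy as np

def circle_hough(edges, r, n_points=64):
    """Accumulate a circle Hough space H(x, y) for one fixed radius r.
    Every edge pixel votes for all candidate centers at distance r
    from it; a local maximum in H marks a circle center."""
    h, w = edges.shape
    H = np.zeros((h, w))
    ys, xs = np.nonzero(edges)
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    for y, x in zip(ys, xs):
        cy = np.round(y + r * np.sin(angles)).astype(int)
        cx = np.round(x + r * np.cos(angles)).astype(int)
        ok = (cy >= 0) & (cy < h) & (cx >= 0) & (cx < w)
        np.add.at(H, (cy[ok], cx[ok]), 1.0)   # accumulate votes
    return H
```

Restricting the votes to short arc segments orthogonal to the local gradient phase, as the text describes, reduces the number of spurious votes and therefore the noise in H.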
Fig. 2.50. In the Hough transformation all points in an edge-coded image are interpreted as
centers of circles. In Hough space the contributions of single circles are accumulated. This
results in local maxima which correspond to circle centers in the input image. This is exem-
plarily shown for 4 and 16 nodes.
Fig. 2.51. Hough transformation for circular arc segments which facilitates the localization of
circles by reducing noise. This is exemplarily shown for 4 and 16 nodes.
For line of sight identification the eyes’ sclera is analyzed. Following iris local-
ization, a concentric circle is described around it with the double of its diameter. The
resulting annulus is divided into eight segments. The intensity distribution of these
segments then indicates line of sight. This is illustrated by Fig. 2.52 where distrib-
utions for two different lines of sight are presented which are similar for both eyes
but dissimilar for different lines of sight. A particular line of sight associated with
a distribution as shown in Fig. 2.52 is then identified with a maximum likelihood
classifier.
The maximum likelihood classifier assigns a feature vector g = m(x, y) to one
of t object classes. Here g consists of the eight intensities of the circular arc segments
mentioned above. For classifier training, a sample of 15 (5 × 3) different lines of
sight was collected from ten subjects. From these data the average vector μi (i = 0 . . . 14),
the covariance matrix Si , and the inverse covariance matrix Si−1 were calculated.
Based on this the statistical spread vector si could be calculated.
Fig. 2.52. Line of sight identification by analyzing intensities around the pupil.
From these instances the rejection threshold dzi = sTi · Si−1 · si for each ob-
ject class is determined. For classification the Mahalanobis distances di = (g −
μi )T Si−1 (g − μi ) of the actual feature vector to all classes is calculated. Afterwards
a comparison with pretrained thresholds takes place. The feature vector g is then
assigned to that class kj for which the Mahalanobis distance is minimal and for
which the distance is lower than the rejection threshold of that class (Fig. 2.53).
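The classification rule with per-class rejection thresholds can be sketched as follows (illustrative names; the book does not prescribe an implementation):

```python
import numpy as np

def classify_lineofsight(g, means, inv_covs, reject_thresholds):
    """Maximum likelihood classification via Mahalanobis distance with
    per-class rejection: returns the index of the class with minimal
    distance, or -1 if no distance falls below its class threshold."""
    best, best_d = -1, np.inf
    for i, (mu, S_inv, dz) in enumerate(zip(means, inv_covs,
                                            reject_thresholds)):
        diff = g - mu
        d = float(diff @ S_inv @ diff)   # squared Mahalanobis distance
        if d < dz and d < best_d:
            best, best_d = i, d
    return best
```

The rejection branch is what keeps the system from forcing an eye-gaze label onto feature vectors that match none of the trained distributions, e.g. during blinks.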
Fig. 2.53. A maximum likelihood classifier for line of sight identification. A feature vector is
assigned to that class to which the Mahalanobis distance is smallest.
Segmentation is based on four different feature maps which all emphasize the lip
region against its surroundings by color and gradient information (Fig. 2.54).
Fig. 2.54. Identification of lip outline with an active shape model. Four different features
contribute to this process.
The first feature represents the probability that a pixel belongs to the lips. The
required ground truth, i.e. lip color histograms, has been derived from 800 images
segmented by hand. The a posteriori probabilities for lips and background are then
calculated as already described in Sect. 2.1.2.1.
In the second feature the nonlinear LUX color space (Logarithmic hUe eXtension)
is exploited to emphasize the contrast between lips and surroundings [14], where
L is the luminance, and U and X incorporate chromaticity. Here the U channel is used
as a feature.
A third feature makes use of the I1 I2 I3 color space which also is suited to em-
phasize the contrast between lips and surrounding. The three channels are defined as
follows:
I1 = (R + G + B) / 3 (2.188)
I2 = (R − B) / 2 (2.189)
I3 = (2G − R − B) / 4 (2.190)
The fourth feature utilizes a Sobel operator that emphasizes the edges between
the lips and skin- or beard-region.
[ −1 0 1 ; −2 0 2 ; −1 0 1 ]   or rather   [ −1 −2 −1 ; 0 0 0 ; 1 2 1 ] (2.191)
The filter mask is convolved with the corresponding image region. The four
different features need to be combined for establishing an initialization mask. For
fusion, a logical OR without individual weighting was selected, as weighting did not
improve the results.
While the first three features support the segmentation between lips and
surroundings, the gradient map serves to unite single regions in cases where upper
and lower lips are segmented separately due to dark corners of the mouth or to teeth.
Nevertheless, segmentation errors still occur when the color transition between
lips and surroundings is gradual. This effect is observed if, e.g., shadow falls below
the lips, making that region dark. Experiments show that morphological operators
like erosion or dilation are inappropriate to smooth out such distortions.
Therefore an iterative procedure is used here which relates the center of gravity of
the rectangle bounding the lips, CogLip, to the center of gravity of the segmented
area, CogArea. If CogLip lies above CogArea, this is most likely due to line-formed
artifacts. These are then erased line by line until a recalculation results in
approximately identical centers of gravity (Fig. 2.55).
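One plausible implementation of this curtailing loop is sketched below. The erase direction and the stopping tolerance are assumptions (image coordinates are taken to grow downward, and rows are removed from the side where the bounding box extends beyond the mass of the segmented area):

```python
import numpy as np

def curtail_mask(mask, tol=0.5):
    """Erase rows of a binary lip mask until the vertical center of
    gravity of the bounding box and that of the segmented area
    approximately coincide (cf. Fig. 2.55)."""
    mask = mask.copy()
    while True:
        ys = np.nonzero(mask)[0]          # row indices of mask pixels
        if ys.size == 0:
            break
        cog_box = (ys.min() + ys.max()) / 2.0   # bounding-box center
        cog_area = ys.mean()                    # area center of gravity
        if abs(cog_box - cog_area) <= tol:
            break                               # centers coincide
        if cog_box > cog_area:                  # sparse rows at the bottom
            mask[ys.max(), :] = 0
        else:                                   # sparse rows at the top
            mask[ys.min(), :] = 0
    return mask
```

Thin, line-formed artifacts shift the bounding-box center much more than the area center, so they are stripped away while the dense lip blob survives.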
2.2.4.5 Lip Modeling
For the collection of ground truth data, mouth images were taken from 24 subjects
with the head filling the full image format. Each subject had to perform 15 visu-
ally distinguishable mouth forms under two different illumination conditions. Sub-
sequently upper and lower lip, mouth opening, and teeth were segmented manually
Fig. 2.55. The mask resulting from fused features (left) is iteratively curtailed until the distance
of the center of gravity from the bounding box to the center of gravity of the area is minimized
(right).
in these images. Then 44 points were equally spaced along the outline, with point 1
and point 23 being the mouth corners.
It is possible to use vertically symmetrical and asymmetrical PDMs. Research
into this shows that most lip outlines are asymmetrical, so asymmetrical models fit
better. Since the segmented lip outlines vary in size, orientation, and position in the
image, all points have to be normalized accordingly. The average form of training
lip outlines, their Eigenvector matrix and variance vector, then form the basis for the
PDMs which are similar to the outline models of the Active Appearance Models.
The resulting PDMs are depicted in Fig. 2.56, where the first four Eigenvectors have
been varied in a range between 3√λ, 0 and −3√λ.
Fig. 2.56. Point distribution models for variations of the first four Eigenvectors φ1, φ2, φ3,
φ4 between 3√λ, 0 and −3√λ.
To run the example application for eye localization, please install the rapid
prototyping system IMPRESARIO from the accompanying CD as described in Sect. B.1.
After starting IMPRESARIO, choose File → Open to load the process graph named
FaceAnalysis_Eye.ipg (Fig. 2.57). The data flow is straightforward and
corresponds to the structure of the sections in this chapter. All six macros are
described below.
Image Sequence Acquires a sequence of images from hard disk or other storage
media. In the following, each image is processed individually and independently.
Other sources may be used by replacing this macro with the appropriate source
macro (for instance, Video Capture Device for a video stream from an USB cam).
SplitImage Splits the image into the three basic channels Red, Green, and Blue. It is
possible to split all channels at the same time or to extract just one channel (here the
Red channel is used).
Gradient This is a simple wrapper for the convolution functor with some convenience
parameterization to choose between different common gradient kernels. Besides
the classical simple difference computation (right minus left for the x direction,
or bottom minus top for the y direction) and the classical Sobel, Prewitt, Robinson,
Roberts and Kirsch kernels, more sophisticated optimal kernels (see
lti::gradientKernelX) and the approximation using oriented Gaussian derivatives
can be used. (Output here: an absolute channel computed from the X and Y gradient
channels.)
ConvertChannel The output of the preceding stage is a floating point gray value
image I(x, y) ∈ [0, 1]. Since the next macro requires a fixed point gray value image
I(x, y) ∈ {0, 1, . . . , 255} a conversion is performed here.
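The conversion itself is a one-liner; the sketch below (illustrative name, and the rounding mode is an assumption) maps the floating point range to the fixed point range the next macro expects:

```python
import numpy as np

def to_fixed_point(channel):
    """Convert a floating point gray channel I(x,y) in [0,1] to the
    fixed point range {0,...,255}."""
    return np.clip(np.round(channel * 255.0), 0, 255).astype(np.uint8)
```

Clipping before the cast guards against values slightly outside [0,1] produced by earlier filtering stages.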
To run the example application for mouth localization, please install the rapid
prototyping system IMPRESARIO from the accompanying CD as described in Sect.
B.1. After starting IMPRESARIO, choose File → Open to load the process graph
named FaceAnalysis_Mouth.ipg (Fig. 2.58). The data flow is straightforward
and corresponds to the structure of the sections in this chapter. All six macros are
described below.
Image Sequence Acquires a sequence of images from hard disk or other storage
media. In the following, each image is processed individually and independently.
Other sources may be used by replacing this macro with the appropriate source
macro (for instance, Video Capture Device for a video stream from an USB cam).
Gradient This is a simple wrapper for the convolution functor with some convenience
parameterization to choose between different common gradient kernels. Besides
the classical simple difference computation (right minus left for the x direction,
or bottom minus top for the y direction) and the classical Sobel, Prewitt, Robinson,
Roberts and Kirsch kernels, more sophisticated optimal kernels (see
lti::gradientKernelX) and the approximation using oriented Gaussian derivatives
can be used. (Output here: an absolute channel computed from the X and Y gradient
channels.)
colors of a real world scene lie on the gray line of a color space. The most simple
approach uses only the mean color and maps it linearly into one specific point on
the gray line, assuming that a diagonal transform (a simple channel scaling) is valid.
Normalization is done under the assumption of Mondrian worlds, constant point
light sources, and delta-function camera sensors.
References
1. Akyol, S. and Canzler, U. An Information Terminal using Vision Based Sign Language
Recognition. In Büker, U., Eikerling, H.-J., and Müller, W., editors, ITEA Workshop
on Virtual Home Environments, VHE Middleware Consortium, volume 12, pages 61–68.
2002.
2. Akyol, S., Canzler, U., Bengler, K., and Hahn, W. Gesture Control for Use in Automo-
biles. In Proceedings of the IAPR MVA 2000 Workshop on Machine Vision Applications.
2000.
3. Bley, F., Rous, M., Canzler, U., and Kraiss, K.-F. Supervised Navigation and Manip-
ulation for Impaired Wheelchair Users. In Thissen, W., Wierings, P., Pantic, M., and
Ludema, M., editors, Proceedings of the IEEE International Conference on Systems, Man
and Cybernetics. Impacts of Emerging Cybernetics and Human-Machine Systems, pages
2790–2796. 2004.
4. Canzler, U. and Kraiss, K.-F. Person-Adaptive Facial Feature Analysis for an Advanced
Wheelchair User-Interface. In Drews, P., editor, Conference on Mechatronics & Robotics,
volume Part III, pages 871 – 876. Sascha Eysoldt Verlag, September 2004.
5. Canzler, U. and Wegener, B. Person-adaptive Facial Feature Analysis. The Faculty of
Electrical Engineering, page 62, May 2004.
6. Cootes, T., Edwards, G., and Taylor, C. Active Appearance Models. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 23(6):681 – 685, 2001.
7. Csink, L., Paulus, D., Ahlrichs, U., and Heigl, B. Color Normalization and Object Local-
ization. In 4. Workshop Farbbildverarbeitung, Koblenz, pages 49–55. 1998.
References 93
Fig. 2.58. Process graph of the example application for mouth localization.
18. Papageorgiou, C., Oren, M., and Poggio, T. A General Framework for Object Detection.
pages 555 – 562. 1998.
19. Rabiner, L. and Juang, B.-H. An Introduction to Hidden Markov Models. IEEE ASSP
Magazine, 3(1):4–16, 1986.
20. Rabiner, L. R. A Tutorial on Hidden Markov Models and Selected Applications in Speech
Recognition. Readings in Speech Recognition, 1990.
21. Schukat-Talamazzini, E. Automatische Spracherkennung. Vieweg Verlag, Braun-
schweig/Wiesbaden, 1995.
22. Schürmann, J. Pattern Classification. John Wiley & Sons, 1996.
23. Sherrah, J. and Gong, S. Tracking Discontinuous Motion Using Bayesian Inference. In
Proceedings of the European Conference on Computer Vision, pages 150–166. Dublin,
Ireland, 2000.
24. Sonka, M., Hlavac, V., and Boyle, R. Image Processing, Analysis and Machine Vision.
Brooks Cole, 1998.
25. Steger, C. On the Calculation of Arbitrary Moments of Polygons. Technical Report
FGBV–96–05, Forschungsgruppe Bildverstehen (FG BV), Informatik IX, Technische
Universität München, October 1996.
26. Tomasi, C. and Kanade, T. Detection and Tracking of Point Features. Technical Report
CS-91-132, CMU, 1991.
27. Wiskott, L., Fellous, J., Krüger, N., and von der Malsburg, C. Face Recognition by Elas-
tic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 19(7):775–779, 1997.
28. Zieren, J. and Kraiss, K.-F. Robust Person-Independent Visual Sign Language Recog-
nition. In Proceedings of the 2nd Iberian Conference on Pattern Recognition and
Image Analysis, IbPRIA 2005, volume Lecture Notes in Computer Science. 2005.
Chapter 3
96 Sign Language Recognition
thermore, it allows deaf people intuitive access to electronic media such as computers
or the internet (Fig. 3.1).
Fig. 3.1. Laptop and webcam based sign language recognition system.
flexion of the finger joints, optical or magnetic markers placed on face and hands fa-
cilitate a straightforward determination of mimic and manual configuration. For the
user, however, this is unnatural and restrictive. Furthermore, data gloves are unsuit-
able for practical applications due to high cost. Also most existing systems exploit
manual features only; so far facial features were rarely used [17]. A recognition sys-
tem’s usability is greatly influenced by the robustness of its image processing stage,
i.e. its ability to handle inhomogeneous, dynamic, or generally uncontrolled back-
grounds and suboptimal illumination. Many publications do not explicitly address
this issue, which – in connection with accordant illustration – suggests homoge-
neous backgrounds and strong diffuse lighting. Another common assumption is that
the signer wears long-sleeved clothing that differs in color from his skin, allowing
color-based detection of hands and face.
The majority of systems only support person-dependent operation, i.e. every user
is required to train the system himself before being able to use it. Person-independent
operation requires a suitable normalization of features early in the processing chain
to eliminate dependencies of the features on the signer’s position in the image, his
distance from the camera, and the camera’s resolution. This is rarely described in
publications; instead, areas and distances are measured in pixels, which even for
person-dependent operation would require an exact reproduction of the conditions
under which the training material was recorded.
Similar to the early days of speech recognition, most researchers focus on iso-
lated signs. While several systems exist that process continuous signing, their vo-
cabulary is very small. Recognition rates of 90% and higher are reported, but the
exploitation of context and grammar – which is sometimes rigidly fixed to a certain
sentence structure – aids considerably in classification. As in speech recognition,
coarticulation effects and the resulting ambiguities form the primary problem when
using large vocabularies.
Tab. 3.1 lists several important publications and the described systems’ features.
When comparing the indicated performances, it must be kept in mind that, in con-
trast to speech recognition, there is no standardized benchmark for sign language
recognition. Thus, recognition rates cannot be compared directly. The compilation
also shows that most systems’ vocabularies are in the range of 50 signs.
Larger vocabularies have only been realized with the use of data gloves. All sys-
tems are person-dependent. All recognition rates are valid only for the actual test
scenario. Information about robustness in real-life settings is not available for any
of the systems. Furthermore, the exact constellation of the vocabularies is unknown,
despite its significant influence on the difficulty of the recognition task. In summary
it can be stated that none of the systems currently found in literature meets the re-
quirements for a robust real world application.
Table 3.1. Classifiers characteristics for user dependent sign language recognition.
Author, Features Interface Vocabulary Language Recognition
Year Level Rate in %
Vamplew 1996 [24] manual data glove 52 word 94
Holden 2001 [10] manual optical markers 22 word 95.5
Yang 2002 [28] manual video 40 word 98.1
Murakami 1991 [15] manual data glove 10 sentence 96
Liang 1997 [14] manual data glove 250 sentence 89.4
Fang 2002 [8] manual data glove 203 sentence 92.1
Starner 1998 [20] manual video 40 sentence 97.8
Vogler 1999 [25] manual video 22 sentence 91.8
Parashar 2003 [17] manual and facial video 39 sentence 92
The input image sequence is forwarded to two parallel processing chains that
extract manual and facial features using a priori knowledge of the signing process.
Before the final sign classification is performed, a pre-classification module restricts
the active vocabulary to reduce processing time. Manual and facial features are then
classified separately, and both results are merged to yield a single recognition result.
In the following section of this chapter the person-independent recognition of
isolated signs in real world scenarios is dealt with. The second section explains how
facial features can be integrated in sign language recognition process. Both parts
build on related basic concepts presented already in chapter 2.
The third section describes how signs can be divided automatically in subunits by
a data driven procedure and how these subunits enable coding of large vocabularies
and accelerate training times. Finally some performance measures for the described
system are provided.
Recognition of Isolated Signs in Real-World Scenarios 99
Fig. 3.3. Difficulties in sign language recognition: Overlap of hands and face (a), tracking of
both hands in ambiguous situations (b), and similar signs with different meaning (c, d).
Just like in spoken language, there are dialects in sign language. In contrast to
the pronunciation of words, however, there is no standard for signs, and people may
use an altogether different sign for the same word (Fig. 3.4). Even when performing
identical signs, the variations between different signers are considerable (Fig. 3.5).
Finally, when using color and/or motion to detect the user’s hands and face, un-
controlled backgrounds as shown in Fig. 3.6 may give rise to numerous false alarms
4
The video material shown in this section was kindly provided by the British Deaf Asso-
ciation (http://www.signcommunity.org.uk). All signs are British Sign Language (BSL).
Fig. 3.4. Two different signs for “actor”: Circular motion of hands in front of upper body (a)
vs. dominant hand (here: right) atop non-dominant hand (b).
(a) “autumn” (b) “recruitment” (c) “tennis” (d) “distance” (e) “takeover”
Fig. 3.5. Traces of (xcog , ycog ) for both hands from different persons (white vs. black plots)
performing the same sign. The background shows one of the signers for reference.
that require high-level reasoning to tell targets from distractors. This process is com-
putationally expensive and error-prone, because the required knowledge and experi-
ence that comes natural to humans is difficult to encode in machine-readable form.
Fig. 3.6. Examples for uncontrolled backgrounds that may interfere with the tracking of hands
and face.
A multitude of algorithms has been published that aim to solve the above
problems. This section describes several concepts that were successfully combined
in a working sign language recognition system at RWTH Aachen University [29].
A comprehensive overview of recent approaches can be found in [6, 16].
This section focuses on the recognition of isolated signs based on manual pa-
rameters. The combination of manual and non-manual parameters such as facial
expression and mouth shape, which constitute an integral part of sign language, is
discussed in Sect. 3.2. Methods for the recognition of sign language sentences are
described in Sect. 3.3.
The following text builds upon Sect. 2.1. It follows a similar structure since
it is based on the same processing chain. However, the feature extraction stage is
extended as shown in Fig. 3.7. An image preprocessing stage is added that ap-
plies low-level algorithms for image enhancement, such as background modeling
described in Sect. 3.1.2.1. High-level knowledge is applied for the resolution of over-
laps (Sect. 3.1.3.1) and to support hand localization and tracking (Sect. 3.1.3.2).
Compared to Sect. 2.1, the presentation is less concerned with concrete algo-
rithms. Due to the complexity and diversity of the topic, the focus lies on general
considerations that should apply to most, though certainly not all, applications.
Fig. 3.7. Feature extraction extended by an image preprocessing stage. High-level knowledge
of the signing process is used for overlap resolution and hand localization/tracking.
With regard to the input data, switching from gestures to sign language input brings
about the following additional challenges:
1. While most gestures are one-handed, signs may be one- or two-handed. The
system must therefore not only be able to handle two moving objects, but also
to detect whether the sign is one- or two-handed, i.e. whether one of the hands
is idle.
2. Since the hands’ position relative to the face carries a significant amount of in-
formation in sign language, the face must be included in the image as a reference
point. This reduces image resolution available for the hands and poses an addi-
tional localization task.
3. Hands may occlude one another and/or the face. For occluded objects, the extrac-
tion of features through a threshold segmentation is no longer possible, resulting
in a loss of information.
4. Signs may be extremely similar (or even identical) in their manual features and
differ mainly (or exclusively) in non-manual features. This makes automatic
recognition based on manual features difficult or, for manually identical signs,
even infeasible.
5. Due to dialects and lack of standardization, it may not be possible to map a
given word to a single sign. As a result, either the vocabulary has to be extended
to include multiple variations of the affected signs, or the user needs to know
the signs in the vocabulary. The latter is problematic because native signers may
find it unnatural to be forced to use a certain sign where they would normally
use a different one.5
The last two problems cannot be solved by the approach described here. The
following text presents solution strategies for the former three.
The preprocessing stage improves the quality of the input data to increase the per-
formance (in terms of processing speed and/or accuracy) of the subsequent stages.
High-level information computed in those stages may be exploited, but this bears
the usual risks of a feedback loop, such as instability or the reinforcement of errors.
Low-level pixel-wise algorithms, on the other hand, do not have this problem and can
often be used in a wide range of applications. A prominent example is the modeling
of the image background.
3.1.2.1 Background Modeling
The detection of background (i.e. static) areas in dynamic scenes is an important step
in many pattern recognition systems. Background subtraction, the exclusion of these
areas from further processing, can significantly reduce clutter in the input images
and decrease the amount of data to be processed in subsequent stages.
If a view of the background without any foreground objects is available, a sta-
tistical model can be created in a calibration phase. This approach is described in
Sect. 5.3.1.1. In the following, however, we assume that the only available input data
is the signing user, and that the signing process takes about 1–10 seconds. Thus the
5 This problem also exists in speech recognition, e.g. for the words “zero” and “oh”.
Δ((r, g, b)^T, (r_bg, g_bg, b_bg)^T) = √((r − r_bg)² + (g − g_bg)² + (b − b_bg)²)   (3.6)
However, (3.6) does not distinguish between differences in brightness and differences
in hue. Separating the two is desirable since it makes it possible to ignore shadows
(which mainly affect brightness) cast on the background by moving (foreground)
objects. Shadows can be modeled as a multiplication of I_bg with a scalar. We define

Δ = I − I_bg   (3.7)
φ = ∠(I, I_bg)   (3.8)

where Δ is the difference of the color vectors and φ the angle between them.
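The brightness/hue separation above can be sketched as a per-pixel classifier. The thresholds below are illustrative assumptions, not values from the text:

```python
import math

def color_distance(c, c_bg):
    """Euclidean RGB distance between a pixel and its background model, cf. (3.6)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, c_bg)))

def hue_angle(c, c_bg):
    """Angle between the color vectors, cf. (3.8)."""
    dot = sum(a * b for a, b in zip(c, c_bg))
    n1 = math.sqrt(sum(a * a for a in c))
    n2 = math.sqrt(sum(b * b for b in c_bg))
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))

def classify_pixel(c, c_bg, d_thresh=30.0, phi_thresh=0.05):
    """Label a pixel as 'background', 'shadow', or 'foreground'.

    A shadow darkens the background (scales the background color by a
    scalar), so the angle stays small while the distance grows large.
    Both thresholds are assumed values for illustration.
    """
    if color_distance(c, c_bg) < d_thresh:
        return "background"
    if hue_angle(c, c_bg) < phi_thresh:
        return "shadow"
    return "foreground"
```

A pixel that is an exactly scaled-down copy of its background color is classified as shadow, while a pixel of genuinely different hue becomes foreground.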
Most background models, including the one presented above, assume a static back-
ground. However, backgrounds may also be dynamic, especially in outdoor scenar-
ios. For example, a tree with green leaves that move in front of a white wall results
in each background pixel alternating between green and white.
Approaches that model dynamic backgrounds are described in [13, 18, 21]. In-
stead of computing a single background color as in (3.1), background and foreground
are modeled together for each pixel as a mixture of Gaussian distributions, usually in
RGB color space. The parameters of these distributions are updated online with every
new frame. Simple rules are used to classify each individual Gaussian as background
or foreground. Handling of shadows can be incorporated as described above.
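As a rough illustration of the per-pixel online updating described in [13, 18, 21], the following sketch keeps a single running Gaussian per (grayscale) pixel rather than a full mixture; the learning rate alpha and the threshold factor k are assumed values:

```python
class RunningGaussianBackground:
    """Per-pixel running Gaussian background model (a simplified sketch of
    the mixture-of-Gaussians idea; one Gaussian per pixel only)."""

    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = [float(v) for v in first_frame]
        self.var = [400.0] * len(first_frame)  # initial variance (assumed)
        self.alpha, self.k = alpha, k

    def update(self, frame):
        """Classify each pixel as foreground, then update mean and
        variance online with the new observation."""
        mask = []
        for i, x in enumerate(frame):
            d = x - self.mean[i]
            # foreground if the deviation exceeds k standard deviations
            mask.append(d * d > (self.k ** 2) * self.var[i])
            # online update (a full MOG would update per-Gaussian weights)
            self.mean[i] += self.alpha * d
            self.var[i] += self.alpha * (d * d - self.var[i])
        return mask
```

With every new frame the model adapts, so slowly changing backgrounds are absorbed while fast-moving objects remain foreground.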
Fig. 3.8. Four example frames from the sign “peel”. The input video consists of 54 frames, all
featuring at least one person in the background.
Fig. 3.9. Background model computed for the input shown in Fig. 3.8 (a), its application to an
example frame (b), and the resulting foreground mask (c).
This contrasts with gesture recognition applications, where the hand typically fills the complete image. Background subtraction
alone cannot be expected to isolate foreground objects without errors or false alarms.
Also, the face – which is mostly static – has to be localized before background sub-
traction can be applied.
Since the hands’ extremely variable appearance prevents the use of shape or tex-
ture cues, it is common to exploit only color for hand localization. This leads to the
restriction that the signer must wear long-sleeved, non-skin-colored clothing to facil-
itate the separation of the hand from the arm by means of color. By using a generic
color model such as the one presented in [12], user and illumination independence
can be achieved at the cost of a high number of false alarms. This method therefore
requires high-level reasoning algorithms to handle these ambiguities. Alternatively,
one may choose to either explicitly or automatically calibrate the system to the cur-
rent user and illumination. This chapter describes the former approach.
3.1.3.1 Overlap Resolution
In numerous signs both hands overlap with each other and/or with the face. When
two or more objects overlap in the image, the skin color segmentation yields only a
single blob for all of them, rendering a direct extraction of meaningful features im-
possible. Low contrast, low resolution, and the hands’ variable appearance usually
Fig. 3.10. Skin color segmentation of Fig. 3.9b (a) and a subset of the corresponding hypothe-
ses (b, c, d; correct: d).
Fig. 3.11. Hypothesis space and probabilities for states and transitions.
• Keeping track of the hand’s mean or median color can prevent confusion of the
hand with nearby distractors of similar size but different color. This criterion
affects ptransition .
To search the hypothesis space, the Viterbi algorithm (see Sect. 2.1.3.6) is applied
in conjunction with pruning of unlikely paths at each step.
The MHT approach ensures that all available information is evaluated before the
final tracking result is determined. The tracking stage can thus exploit, at time t,
information that becomes available only at time t1 > t. Errors are corrected retro-
spectively as soon as they become apparent.
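The hypothesis search just described can be sketched as a Viterbi-style dynamic program with beam pruning. The scoring callbacks p_state and p_transition stand in for the probability terms of Fig. 3.11 and are assumptions of this sketch:

```python
def viterbi_beam(hypotheses_per_frame, p_state, p_transition, beam=3):
    """Return the best hypothesis sequence, pruning unlikely paths at
    each step.

    hypotheses_per_frame: list of lists of candidate hand positions.
    p_state, p_transition: callbacks returning log-probability scores
    (assumed interfaces, not part of the original system description).
    """
    # initialize one path per hypothesis of the first frame
    paths = [(p_state(h), [h]) for h in hypotheses_per_frame[0]]
    for frame in hypotheses_per_frame[1:]:
        new_paths = []
        for h in frame:
            # extend the best-scoring predecessor path to hypothesis h
            score, path = max(
                (s + p_transition(p[-1], h), p) for s, p in paths)
            new_paths.append((score + p_state(h), path + [h]))
        # beam pruning: keep only the most likely paths
        new_paths.sort(key=lambda sp: sp[0], reverse=True)
        paths = new_paths[:beam]
    return max(paths)[1]
```

Because complete paths compete until the last frame, an early ambiguity is resolved retrospectively once later frames make one path clearly more likely.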
After successful extraction, the features need to be normalized. Assuming that reso-
lution dependencies have been eliminated as described in Sect. 2.1.2.3, the area a still
depends on the signer’s distance to the camera. Hand coordinates (xcog , ycog ) addi-
tionally depend on the signer’s position in the image. Using the face position and
size as a reference for normalization can eliminate both dependencies. If a simple
threshold segmentation is used for the face, the neck is usually included. Different
necklines may therefore affect the detected face area and position (COG). In this case
one may use the topmost point of the face and its width instead to avoid this prob-
lem. If the shoulder positions can be detected or estimated, they may also provide a
suitable reference.
Fig. 3.12. Example frame from the test and training samples provided on CD.
As shown in Fig. 3.12 the signer is standing. The hands are always visible, and
their start/end positions are constant and identical, which simplifies tracking.
The directory slrdatabase has five subdirectories named var00 to var04,
each of which contains one production of all 80 signs. Each sign production is stored
in the form of individual JPG files in a directory named accordingly.
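The directory layout can be traversed, for instance, as follows (a sketch; only the layout described above is assumed):

```python
import os

def list_sign_productions(root="slrdatabase"):
    """Iterate over all sign productions in the layout described above:
    slrdatabase/var00 ... var04, one subdirectory per sign, each
    containing the individual JPG frames."""
    for var in sorted(os.listdir(root)):            # var00 ... var04
        var_dir = os.path.join(root, var)
        for sign in sorted(os.listdir(var_dir)):    # one dir per sign
            sign_dir = os.path.join(var_dir, sign)
            frames = sorted(f for f in os.listdir(sign_dir)
                            if f.lower().endswith(".jpg"))
            yield var, sign, frames
```

Each yielded triple identifies one production of one sign as an ordered list of frame files, ready to be fed into a feature extraction pipeline.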
6 We sincerely thank the British Deaf Association (http://www.signcommunity.org.uk) for these recordings.
Sign Recognition Using Nonmanual Features 111
This database can be used to test tracking and feature extraction algorithms, and
to train and test a sign language classifier based e.g. on Hidden Markov Models.
Fig. 3.13. In German Sign Language the signs “not” (A) and “to” (B) are identical with respect
to manual gesturing but vary in head movement.
Fig. 3.14. In British Sign Language the signs “now” (A) and “offer” (B) are identical with
respect to manual gesturing but vary in lip outline.
ing” or “enticing”, e.g., show a slight inclination of the torso backwards or
forwards. Likewise, grammatical aspects such as indirect speech can be
coded by torso posture.
Head pose
Head pose also supports the semantics of sign language. Questions, affirmations,
denials, and conditional clauses are, e.g., communicated with the help of head pose.
In addition, temporal information can be coded: signs referring to a short
time span are characterized by a minimal change of head pose, while signs
referring to a long time span are performed by turning the head clearly in the direction
opposite to the gesture.
Line of sight
Two communicating deaf persons usually establish close visual contact. However,
a brief change of the line of sight can be used to refer to the spatial meaning of a
gesture. Likewise, line of sight can be combined with torso posture to express
indirect speech, e.g. when a conversation between two absent persons is virtually
placed behind the signer.
Facial expression
Lip outline
The lip outline is the most pronounced nonmanual characteristic. It often differs
from the silently mouthed word in that part of the word is shortened. Lip outline
resolves ambiguities between signs (brother vs. sister) and specifies expressions (meat
vs. hamburger). It also provides information redundant to manual gesturing, which
supports the differentiation of similar signs.
The next step is the fitting of a user-adapted face graph by means of Active Ap-
pearance Models (Sect. 2.2.3). The face graph serves to extract the nonmanual
parameters such as lip outline, eyes, and brows (Fig. 3.16).
The nonmanual features used (Fig. 3.17) can be divided into three groups. The first
describes the lip outline by width, height, and shape features such as invariant
moments, eccentricity, and orientation (Sect. 2.1.2.3). The second group contains
the distances between eyes and mouth corners, and the third the distances
between eyes and eyebrows.
For each image of the sequence the extracted parameters are merged into a feature
vector, which in the next step is used for classification (Fig. 3.18).
Overlap Resolution
Fig. 3.16. Processing scheme of the face region cropping and the matching of an adaptive
face-graph.

In case of partial overlap between hands and face, an accurate fit of the Active
Appearance Model is usually no longer possible. A further problem is that the
face graph is sometimes fitted with seemingly high precision even though not enough
face texture is visible. In this case the features in the affected regions can no
longer be determined. To compensate for this effect, an additional module is
involved that evaluates each face region separately with regard to overlap by
the hands.
In Fig. 3.19 two cases are presented. In the first case one eye and the mouth are
hidden, so the feature vector of the facial parameters is not used for classification. In
the second case the overlap is not critical for classification, so the facial features
are used in the classification process.
Fig. 3.17. Representation of nonmanual parameters suited for sign language recognition.

The hand tracker indicates a crossing of one or both hands with the face as soon
as the skin-colored regions touch each other. In these cases it is necessary to decide
whether the hands affect the shape substantially. For this purpose, an ellipse for the hand
is computed from the manual parameters. In addition, an oriented bounding box
is drawn around the lip contour of the Active Appearance shape. If the hand ellipse
touches the bounding box, the Mahalanobis distance of the shape fit is the deciding
criterion: if it is too large, the shape is marked as invalid. Since the Mahalanobis distance
of the shapes depends substantially on the trained model, not an absolute value
but a relative worsening is used. Experiments showed that overlaps can be recognized
reliably even when 25% of the face is hidden.
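The decision rule described here can be sketched as follows; the 25 % relative-worsening threshold below is an assumed illustrative value, distinct from the 25 % face occlusion figure reported in the experiments:

```python
def lip_shape_valid(hand_touches_lip_box, d_mahal, d_mahal_ref,
                    max_rel_worsening=0.25):
    """Accept the fitted lip shape unless the hand ellipse touches the
    lip bounding box AND the Mahalanobis distance of the fit has
    worsened too much relative to an unoccluded reference fit.
    max_rel_worsening is an assumed threshold for illustration."""
    if not hand_touches_lip_box:
        return True  # no overlap indicated, shape is trusted
    rel_worsening = (d_mahal - d_mahal_ref) / d_mahal_ref
    return rel_worsening <= max_rel_worsening
```

Using a relative rather than an absolute threshold keeps the decision independent of the trained model, whose typical Mahalanobis distances vary.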
Fig. 3.19. During overlap of hands and face, several regions of the face are evaluated
separately. If, e.g., the mouth and one eye are hidden (left), no facial features are
considered. However, if eyes and mouth are located with sufficient reliability, the extracted
features can be used for classification (right).
It is still unclear in sign language recognition which part of a sign sentence serves
as a good underlying model. The theoretically best solution is a whole sign language
sentence as one model: due to their long duration, sentence models are robust against
misclassification. However, the main disadvantage is obvious: there are far too many com-
binations to train all possible sentences of a sign language. Consequently, most sign
language recognition systems are based on word models, where one sign represents
one model in the model database. This makes the recognition of sign language sentences
more flexible. Nevertheless, even this method has some drawbacks:
• The training complexity increases with vocabulary size.
• A future enlargement of the vocabulary is problematic, as new signs usually
have to be recorded in the context of other signs for training (embedded training).
Recognition of continuous Sign Language using Subunits 117
Fig. 3.20. The signed sentence TODAY I COOK (HEUTE ICH KOCHEN) in German Sign
Language (GSL). Top: the recorded sign sequence. Center: Vertical position of right hand
during signing. Bottom: Different possibilities to divide the sentence. For a better visualization
the signer wears simple colored cotton gloves.
in such a way that any sign can be composed of subunits. The advantages are:
• The training period ends when all subunit models are trained. The vocabulary
can be enlarged by composing a new sign through concatenation of suitable
subunit models.
• The general vocabulary size can be enlarged.
• Recognition may become more robust (depending on the vocabulary
size).
Modification of Training and Classification for Recognition with Subunits
A subunit-based recognition system needs an additional knowledge source in which the
coding (also called transcription) of each sign into subunits is itemized. This
knowledge source is called the sign-lexicon; it contains the transcriptions of all signs
and their correspondence with subunits. Both the training and the classification
process are based on the sign-lexicon.
The aim of the training process is the estimation of the subunit model parameters. The
example in figure 3.21 shows that sign 1 consists of the subunits (SU) SU4 , SU7 and SU3 .
The corresponding model parameters of these subunits are trained on the recorded data of
sign 1 by means of the Viterbi algorithm (see section 2.1.3.6).
Fig. 3.21. Training with subunits: recorded images are segmented and converted into
sequences of feature vectors, which train the subunit model database according to the
sign-lexicon (e.g. sign 1 = SU4 SU7 SU3 , sign 2 = SU6 SU1 SU5 , ..., sign M = SU2 SU7 SU5 ).
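In code, the sign-lexicon is essentially a mapping from signs to subunit sequences; the entries below follow the example of Fig. 3.21:

```python
# Sign-lexicon: transcription of each sign into subunits (cf. Fig. 3.21).
SIGN_LEXICON = {
    "sign_1": ["SU4", "SU7", "SU3"],
    "sign_2": ["SU6", "SU1", "SU5"],
    "sign_M": ["SU2", "SU7", "SU5"],
}

def training_material(lexicon):
    """Invert the lexicon: for every subunit, collect the signs whose
    recordings contribute training data for its model parameters."""
    material = {}
    for sign, subunits in lexicon.items():
        for su in subunits:
            material.setdefault(su, []).append(sign)
    return material
```

The inversion shows why training terminates once all subunit models are trained: a subunit shared by several signs (such as SU7 above) is estimated from all of their recordings at once.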
After completion of the training process, a database is filled with all subunit models,
which now serve as the basis for the classification process. However, the aim of the
classification is not the recognition of subunits but of signs. Hence, the infor-
mation which sign consists of which subunits is again needed; it is contained in the
sign-lexicon. The corresponding modification of the classification is depicted in figure 3.22.
So far we have assumed that a sign-lexicon is available, i.e. it is already known which
sign is composed of which subunits. This, however, is not yet the case. The subdivision of
a sign into suitable subunits still poses difficult problems. In addition, the semantics
of the subunits have yet to be determined. The following section gives an overview
of possible approaches to linguistic subunit formation.
3.3.2.1 Linguistics-orientated Transcription of Sign Language
Subunits for speech recognition are mostly linguistically motivated: typically
syllables, half-syllables, or phonemes. This breakdown of speech is closely tied to its
notation system, since written text with the accordant orthography is the
standard notation for speech. Nowadays, huge speech lexica exist consisting
Fig. 3.22. Classification with subunits: segmented input images are converted into
sequences of feature vectors and matched against the subunit model database; the
sign-lexicon (sign 1 = SU4 SU7 SU3 , ...) maps recognized subunit sequences back to signs.
of transcriptions of speech into subunits. These lexica usually form the basis of
today's speech recognizers.
Transferring this concept of linguistic breakdown to sign language recognition,
one is confronted with a variety of notation options, none of which is yet
standardized as is the case for speech. A notation system as universally accepted as
written text does not exist for sign language. However, some known notation systems
are examined in the following, especially with respect to their applicability in
a recognition system. Corresponding to phonemes, the term cheremes (derived from
the Greek word for ‘hand’) is used for subunits in sign language.
Stokoe was one of the first to do research in sign language linguistics, in the
sixties [22]. He defined three different types of cheremes. The first type describes
the handshape configuration and is called dez, for designator. The second type,
sig for signation, describes the kind of movement of the performed sign. The
third is the location of the performed sign and is called tab, for tabula. Stokoe
developed a lexicon for American Sign Language by means of these chereme
types. The lexicon consists of nearly 2500 entries, in which signs are coded
with altogether 55 different cheremes (12 ‘tab’, 19 ‘dez’, and 24 ‘sig’). An
example of a sign coded in the Stokoe system is depicted in figure 3.23 [22].
The employed cheremes seem to qualify as subunits for a recognition system;
their practical employment in a recognition system, however, turns out to be difficult.
Even though Stokoe's lexicon is still in use today and contains many entries, not all
signs are included – in particular, no German Sign Language signs. Moreover,
most of Stokoe's cheremes are performed in parallel, whereas a recognition system
expects subunits in sequential order. Furthermore, none of the signation cheremes
(encoding the movement of a performed sign) is necessary for a recognition system,
as movements are modeled by Hidden Markov Models (HMMs). Hence, Stokoe's
Fig. 3.23. Left: The sign THREE in American Sign Language (ASL). Right: notation of this
sign after Stokoe [23].
lexicon is a very good linguistic breakdown of signs into cheremes; without
manual alterations, however, it is not usable as the basis of a recognition system.
Another notation system was invented by Liddell and Johnson. They break signs
into cheremes by the so-called Movement-Hold model, which was introduced in 1984
and has been further developed since then. Here, signs are divided in sequential order into
segments. Two different kinds of segments are possible: in ‘movement’ segments
the configuration of a hand is still changing, whereas in ‘hold’ segments no
movement takes place, i.e. the configuration of the hands is fixed. Each sign can
thus be modeled as a sequence of movement and hold segments. In addition, each
hold segment consists of articulatory features [3]. These describe the handshape, the
position and orientation of the hand, movements of the fingers, and rotation and orienta-
tion of the wrist. Figure 3.24 depicts an example of the notation of a sign by means of
movement and hold segments.
Fig. 3.24. Notation of the ASL sign FATHER by means of the Movement-Hold model (from
[26]).
Fig. 3.25. Trajectories of two different signs (sign 1, sign 2) over time [frames]: the signs
start similarly and diverge later.
7 In speech recognition the corresponding term is acoustic subunits; for sign language recognition the term is adapted accordingly.
The performed signs are similar in the beginning. Consequently, both signs are
assigned to the same subunit (SU3 ). The further progression, however, differs signifi-
cantly: while the trajectory of sign 2 rises, that of sign 1 falls.
Hence, the further transcription of the two signs differs.
The example in Fig. 3.25 illustrates merely one aspect of the performed sign: the
horizontal progression of the right hand. In terms of feature extraction, this aspect
corresponds to the y-coordinate of the right hand's location. For a complete description
of a sign, one feature is, however, not sufficient. In fact, a recognition system must
handle many more features, which are merged into so-called feature groups. The
composition of these feature groups must take the linguistic sign language parameters
into account: location, handshape, and hand orientation.
The further details in this section refer to an example separation of the feature vector
into a feature group ‘pos’, in which all features regarding the position of both
hands are grouped. Another group comprises all features describing the size of
the visible part of the hands (‘size’), whereas a third group comprises all features
regarding the distances between the fingers (‘dist’). The latter two groups represent the
sign parameters handshape and hand orientation. Note that these are only examples of
how to model a feature vector and its accordant feature groups. Many other ways
of modeling sign language are conceivable, and the number of feature groups may
vary. To demonstrate the general approach, this example makes use of the three
feature groups ‘pos’, ‘size’, and ‘dist’ mentioned above.
Following the parallel breakdown of a sign, each resulting feature group is seg-
mented in sequential order into subunits. The identification of similar segments is
not carried out on the whole feature vector, but only within each of the three feature
groups. Similar segments then constitute the subunits of one feature group. Pooled
segments of the feature group ‘pos’, for example, represent a certain location independent
of any specific handshape and hand orientation. The parallel and sequential breakdown
of the signs finally yields three different sign lexica, which are combined into one (see
also figure 3.26). Figure 3.27 shows examples of similar segments of different signs
according to specific feature groups.
Conventional HMMs are suited to handle sequential signals (see section 2.1.3.6).
After the breakdown into feature groups, however, we now deal with parallel signals,
which can be handled by Parallel Hidden Markov Models (PaHMMs) [9]. Parallel
Hidden Markov Models are conventional HMMs used in parallel; each of the parallel
combined HMMs is called a channel. The channels are independent of each other:
the state probabilities of one channel do not influence any other channel.
Recognition of continuous Sign Language using Subunits 123
Fig. 3.26. Each feature group leads to its own sign-lexicon; these are finally combined into
one sign-lexicon.
Fig. 3.27. Different examples for the assignment to all different feature groups.
A corresponding PaHMM is depicted in figure 3.28. The last state, a so-called con-
fluent state, combines the probabilities of the different channels into one probability
valid for the whole sign. The combination of the probabilities is determined by the fol-
lowing equation:

P(O|λ) = ∏_{j=1}^{J} P(O^j | λ^j) .   (3.12)
O^j stands here for the relevant observation sequence of one channel, which is evalu-
ated for the accordant segment. All probabilities of a sign calculated in parallel are
finally combined:

P(O|λ) = ∏_{j=1}^{J} ∏_{k=1}^{K} P(O^{kj} | λ^{kj}) .   (3.13)
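Since the channels are independent, the products in (3.12) and (3.13) are simply sums in log-space:

```python
import math

def combine_channels(channel_logprobs):
    """Combined log-probability of one sign, cf. (3.12):
    log P(O|lambda) = sum over the channels of log P(O^j | lambda^j)."""
    return sum(channel_logprobs)

# three channels, e.g. 'pos', 'size', 'dist' (probabilities are toy values):
log_p = combine_channels([math.log(0.5), math.log(0.25), math.log(0.1)])
```

Working in log-space avoids numerical underflow, which matters in practice because the per-frame emission probabilities of an HMM are typically very small.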
Fig. 3.29. Modeling the GSL sign HOW MUCH (WIEVIEL) by means of PaHMMs.
Figure 3.29 also illustrates the specific topology of a subunit-based recognition
system. As the duration of one subunit is quite short, a subunit HMM consists of
merely two states. The connection of several subunits inside one sign, however,
follows the Bakis topology (see also equation 2.69 in section 2.1.3.6).
3.3.5 Classification
Determining the most likely sign sequence Ŝ which best fits the feature vector sequence
X results in a demanding search process. The recognition decision is carried
out by jointly considering the visual and linguistic knowledge sources. Following
[1], where the most likely sign sequence is approximated by the most likely state
sequence, a dynamic programming search algorithm can be used to compute the
probabilities P(X|S) · P(S); simultaneously, the optimization over the unknown
sign sequence is applied. Sign language recognition is then solved by matching the
input feature vector sequence against all possible state sequences and
finding the most likely sequence of signs using the visual and linguistic knowledge
sources. The different steps are described in more detail in the following subsections.
3.3.5.1 Classification of Single Signs by Means of Subunits and PaHMMs
Before dealing with continuous sign language recognition based on subunit models,
the general approach will be demonstrated by a simplified example of single sign
recognition with subunit models. The extension to continuous sign language recog-
nition is then given in the next section; the general approach is the same in both
cases. The classification example is depicted in figure 3.30. Here, the sign consists of
three subunits for feature group ‘size’, four subunits for ‘pos’, and finally two for
feature group ‘dist’. It is important to note that the depicted HMM is neither a random
sequence of subunits in each feature group nor a random parallel combination of
subunits. The combination of subunits – in parallel as well as in sequence – depends
on a trained sign, i.e. the sequence of subunits is transcribed in the sign-lexicon of the
accordant feature group. Furthermore, the parallel combination, i.e. the transcription
in all three sign lexica, codes the same sign. Hence, the recognition process does not
search for ‘any’ best sequence of subunits independently.
The signal of the example sign in figure 3.30 has a total length of 8 feature vectors.
In each channel, the assignment of feature vectors (more precisely, of the part of the
feature vector belonging to the accordant feature group) to the different states happens
entirely independently of the other channels by time alignment (Viterbi algorithm).
Only at the end of the sign, i.e. after the 8th feature vector has been assigned, are the
probabilities calculated so far in each channel combined. Here, the first and last states
are confluent states. They do not emit, as they serve as a common beginning and end
state for the three channels. The confluent end state can only be reached via the accordant
end states of all channels. In the depicted example this is the case only after feature
vector 8, even though the end state of channel ‘dist’ is already reached after 7 time
steps. The corresponding equation for calculating the combined probability for one
model is:
P (X|λ) = P (X|λP os ) · P (X|λSize ) · P (X|λDist ). (3.14)
The decision on the best model is reached by a maximization over all word models
of the vocabulary.
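Combining (3.14) with the maximization over the vocabulary gives a decision rule of the following shape (a sketch; the per-channel score callbacks, assumed here, would come from the Viterbi alignment of each channel):

```python
def classify_sign(x, word_models):
    """Pick the most likely sign for feature vector sequence x.

    word_models maps each sign name to its list of channel scorers; each
    scorer is assumed to return log P(X | lambda_channel), so summing
    over the channels corresponds to the product in (3.14)."""
    def log_p(item):
        _name, channel_scores = item
        return sum(score(x) for score in channel_scores)
    # maximization over all word models of the vocabulary
    return max(word_models.items(), key=log_p)[0]
```

With stub scorers for the channels ‘pos’, ‘size’ and ‘dist’, the function returns the sign whose combined channel log-probability is highest.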
After completion of the training process, a word model λi exists for each sign Si ,
consisting of the Hidden Markov Models of the accordant subunits. This word
model serves as the reference for recognition.
3.3.5.2 Classification of Continuous Sign Language by Means of Subunits and
PaHMMs
In principle, the classification process for continuous and isolated sign language
recognition is identical. In contrast to the recognition of isolated signs, however,
continuous sign language recognition faces a number of further difficul-
ties, such as:
• A sign may begin or end anywhere in a given feature vector sequence.
• It is ambiguous how many signs are contained in a sentence.
• There is no fixed order of the signs.
• All transitions between subsequent signs have to be detected automatically.
All these difficulties are strongly linked to the main problem: the detection of sign
boundaries. Since these cannot be detected accurately, all possible beginning and
end points of signs have to be considered.
Fig. 3.31. A Network of PaHMMs for Continuous Sign Language Recognition with Subunits.
At the time of classification, the number of signs in the sentence as well as the
transitions between these signs are unknown. To find the correct sign transitions, all
sign models are combined as depicted in figure 3.31. The resulting network constitutes
one comprehensive HMM. Inside a sign model there are still three transitions between
states (Bakis topology). The last confluent state of a sign model has transitions to all
other sign models. The Viterbi algorithm is employed to determine the best state
sequence of this three-channel PaHMM. This makes the assignment of feature vectors
to the individual sign models explicit, and with it the detection of sign boundaries.
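The three-channel decision can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: it presumes that a log-probability per feature-group channel has already been computed for each sign model (e.g. by a Viterbi pass per channel), so that the combined score is simply the sum of the logs; the sign glosses and score values are invented.

```python
def pahmm_score(channel_scores):
    """Combine per-channel Viterbi log-scores of a three-channel PaHMM.

    Because the feature groups are treated as independent, the combined
    probability is the product of the channel probabilities, i.e. the sum
    of their logarithms.
    """
    return sum(channel_scores.values())

def classify(sign_models):
    """Pick the sign whose combined channel score is maximal."""
    return max(sign_models, key=lambda sign: pahmm_score(sign_models[sign]))

# Hypothetical per-channel log-scores for two competing sign models:
models = {"EAT":   {"pos": -12.0, "size": -8.5, "dist": -9.0},
          "DRINK": {"pos": -11.0, "size": -9.5, "dist": -10.5}}
# classify(models) -> "EAT" (combined score -29.5 versus -31.0)
```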
3.3.5.3 Stochastic Language Modeling
The classification of sign language depends on two knowledge sources: the visual
model and the language model. Visual modeling is carried out by using HMMs as
described in the previous section; language modeling is discussed in this section.
Without a language model, the transition probabilities between two successive signs
are all equal, and knowledge about the typical order of signs in the training corpus is
not utilised during recognition.
In contrast, a statistical language model exploits the fact that some pairs of signs,
i.e. two successive signs, occur more often than others. The following equation gives
the probability of a so-called bigram model.
P(w) = P(w_1) · ∏_{i=2}^{m} P(w_i | w_{i−1})        (3.16)
The equation describes the probability of a sign sequence in terms of the sign pairs
w_{i−1} and w_i. During the classification process, the probability of the subsequent
sign changes depending on the classification result of the preceding sign; typical sign
pairs thereby receive a higher probability. The estimation of these probabilities,
however, requires a huge training corpus. Since such corpora do not exist for sign
language, a simple but efficient enhancement of the usual statistical language model
is introduced in the next subsection.
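Bigram estimation according to equation (3.16) can be sketched as follows; this is a minimal maximum-likelihood counting example on a toy corpus of gloss sentences, not the system's actual implementation, and the helper names are invented.

```python
from collections import Counter

def train_bigrams(corpus):
    """Estimate start and bigram probabilities from gloss sentences by counting."""
    starts, pairs, context = Counter(), Counter(), Counter()
    for sentence in corpus:
        starts[sentence[0]] += 1
        for prev, cur in zip(sentence, sentence[1:]):
            pairs[(prev, cur)] += 1
            context[prev] += 1
    n = len(corpus)
    p_start = {w: c / n for w, c in starts.items()}
    p_bigram = {(a, b): c / context[a] for (a, b), c in pairs.items()}
    return p_start, p_bigram

def sentence_probability(sentence, p_start, p_bigram):
    """P(w) = P(w1) * prod_{i=2..m} P(wi | wi-1), cf. equation (3.16)."""
    p = p_start.get(sentence[0], 0.0)
    for prev, cur in zip(sentence, sentence[1:]):
        p *= p_bigram.get((prev, cur), 0.0)
    return p

corpus = [["I", "EAT"], ["I", "DRINK"], ["YOU", "EAT"]]
p_start, p_bigram = train_bigrams(corpus)
# P("I EAT") = P(I) * P(EAT | I) = (2/3) * (1/2) = 1/3
```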
The enhanced statistical language model for sign language recognition is based on
the idea of dividing all signs of the vocabulary into different sign-groups (SG) [2].
Occurrence probabilities are then calculated between these sign-groups instead of
between specific signs. If a combination of sign-groups is not seen in the training
corpus, this is a hint that signs of these specific sign-groups do not follow each
other. This approach does not require that all combinations of signs occur in the
database.
If the sequence of two signs of two different sign-groups SGi and SGj is observed
in the training corpus, any sign of sign-group SGi followed by any other sign of
sign-group SGj is allowed for recognition. For instance, if the sign succession ’I
EAT’ is contained in the training corpus, the probability that a sign of group ’verb’
(SGverb ) occurs when a sign of sign-group ’personal pronoun’ (SGpersonalpronoun )
was already seen, is increased. Hereby, the occurrence of the signs ’YOU DRINK’
receives a high probability even though this succession does not occur in the training
corpus. On the other hand, if the training corpus does not contain a sample of two
succeeding ’personal pronoun’ signs (e.g. ’YOU WE’), it is a hint that this sequence
is not possible in sign language. Therefore, the recognition of these two succeeding
signs is excluded from the recognition process.
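The group-based generalisation can be sketched as follows; the `GROUP` assignment and the toy sentences are hypothetical stand-ins for the sign-groups described above, not the actual grouping used by the system.

```python
from collections import Counter

# Hypothetical group assignment; the real system derives its groups from
# German grammar categories plus sign-language-specific groups.
GROUP = {"I": "pers_pronoun", "YOU": "pers_pronoun", "WE": "pers_pronoun",
         "EAT": "verb", "DRINK": "verb"}

def train_group_bigrams(corpus):
    """Count successions of sign-groups instead of individual sign pairs."""
    pairs, context = Counter(), Counter()
    for sentence in corpus:
        for prev, cur in zip(sentence, sentence[1:]):
            pairs[(GROUP[prev], GROUP[cur])] += 1
            context[GROUP[prev]] += 1
    return {pair: c / context[pair[0]] for pair, c in pairs.items()}

def transition_allowed(prev_sign, cur_sign, group_bigrams):
    """A succession is allowed iff its group pair occurred in training."""
    return (GROUP[prev_sign], GROUP[cur_sign]) in group_bigrams

model = train_group_bigrams([["I", "EAT"]])
# 'I EAT' generalises to 'YOU DRINK' (same group pair pronoun -> verb),
# while 'YOU WE' (pronoun -> pronoun) stays excluded.
```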
By this modification, a good compromise between statistical and linguistic language
modeling is found. The assignment of signs to specific sign-groups is mainly
motivated by word categories known from German grammar. Sign-groups are ’nouns’,
’personal pronouns’, ’verbs’, ’adjectives’, ’adverbs’, ’conjunctions’, ’modal verbs’,
’prepositions’, and two additional groups which take specific sign language
characteristics into account.
3.3.6 Training
This section finally details the training process of the subunit models. The aim of the
training phase is the estimation of all subunit model parameters λ = (Π, A, B) for
each sign of the vocabulary. It is important to note that the subunit model parameters
are estimated from recordings of single signs. The signs are not signed in the context
of whole sign language sentences, which is usually the case for continuous sign
language recognition (so-called embedded training). As only single signs serve as
input for the training, this approach leads to a decreased training complexity.
The training of subunits proceeds in four iterative steps; an overview is depicted in
figure 3.32. In a first step, a transcription of all signs has to be determined for the
sign-lexicon. This step is called initial transcription (see section 3.3.6.1). Hereafter,
the model parameters are estimated (estimation step, see section 3.3.6.2), which
leads to preliminary models for the subunits. A classification step follows (see
section 3.3.6.3). Afterwards, the results have to be reviewed in an evaluation step.
The result of the whole process is also known as singleton fenonic base forms or
synthetic base forms [11]. However, before the training process starts, the feature
vector has to be split to consider the different feature groups.
Splitting the feature vector
The partition of the feature vector into three feature groups (pos, size, dist) has to be
considered during the training process. Hence, as a first step before the actual
training starts, the feature vector is split into the three feature groups. Each feature
group can be considered a complete feature vector in its own right and is trained
separately, since the groups are (mostly) independent of each other. Each training
step is performed for all three feature groups, eventually leading to the subunit
models λpos, λsize and λdist.
The following training steps are described for one of the three feature groups; the
same procedure applies to the two remaining groups. The training process eventually
leads to three different sign-lexica, in which the transcriptions of all signs in the
vocabulary are listed with respect to the corresponding feature group.
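The split itself can be sketched as below; the dimensions in `GROUP_SLICES` are invented for illustration, since the actual feature dimensions depend on the feature extraction stage described earlier.

```python
import numpy as np

# Hypothetical layout: components 0-3 are position features, 4-6 size
# features, 7-9 distance features; the real dimensions depend on the
# feature extraction stage.
GROUP_SLICES = {"pos": slice(0, 4), "size": slice(4, 7), "dist": slice(7, 10)}

def split_features(sequence):
    """Split a (T x 10) feature sequence into the three group sequences."""
    return {name: sequence[:, sl] for name, sl in GROUP_SLICES.items()}

sequence = np.zeros((25, 10))          # 25 frames of one isolated sign
groups = split_features(sequence)
# groups["pos"] has shape (25, 4); each group is then trained separately,
# yielding the models lambda_pos, lambda_size and lambda_dist.
```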
3.3.6.1 Initial Transcription
To start the training process, a sign-lexicon is needed, which does not yet exist.
Hence, an initial transcription of the signs for the sign-lexicon is required before the
training can start. The choice of this initial transcription is arbitrary and of little
importance for the success of the recognition system, as the transcription is updated
iteratively during training. Even with a poor initial transcription, the training process
may lead to good models for subunits of signs [11]. Two possible approaches for the
initial transcription are given in the following.
Again, this step is undertaken for each of the feature groups separately. The existing
data is divided into K different clusters, and each vector is assigned to one of the
K clusters. As a result, all feature vectors belonging to the same cluster are similar,
which is one of the goals of the initial transcription. A widespread clustering method
is the K-means algorithm [7]. After the K-means algorithm has been run, the
assignment of the feature vectors to the clusters is fixed. Each cluster is also associated
Fig. 3.32. The different processing steps for the identification of subunit models.
with a unique index number. For the initial transcription, these index numbers are
coded into the sign-lexicon for each feature vector. An example is depicted in
figure 3.33. Experiments show that an inventory of K = 200 clusters is reasonable
for sign recognition.
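The clustering-based initial transcription can be sketched as follows. The plain K-means below, with a deterministic spread-out initialisation, is a stand-in for the K-means algorithm of [7], and `initial_transcription` is a hypothetical helper; a real system would cluster each feature group separately and use K = 200.

```python
import numpy as np

def kmeans(vectors, k, iterations=20):
    """Plain K-means with a deterministic, spread-out initialisation."""
    idx = np.linspace(0, len(vectors) - 1, k).astype(int)
    centroids = vectors[idx].astype(float)
    for _ in range(iterations):
        # Distance of every vector to every centroid, then nearest assignment.
        dist = np.linalg.norm(vectors[:, None, :] - centroids[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return centroids, labels

def initial_transcription(sign_sequences, k=200):
    """Label every frame of every sign with its cluster index (= subunit)."""
    stacked = np.vstack(sign_sequences)
    _, labels = kmeans(stacked, k)
    lexicon, offset = {}, 0
    for sign_id, seq in enumerate(sign_sequences):
        lexicon[sign_id] = labels[offset:offset + len(seq)].tolist()
        offset += len(seq)
    return lexicon
```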
Fig. 3.33. Example for clustering of feature vectors for a GSL sign language sentence THIS
HOW MUCH COST (DAS KOSTEN WIEVIEL).
In the second approach, each sign is divided into a fixed number of subunits, e.g. six,
per sign and feature group. Accordingly, sign 1 is associated with subunits 1–6, sign 2
with subunits 7–12, and so forth. However, as it is not desirable to have six times
more subunits than signs in the vocabulary, not all signs of the vocabulary are used.
Instead, only a subgroup of approximately 30 signs is chosen at random for the initial
transcription, which finally leads to approximately 200 subunits.
3.3.6.2 Estimation of Model Parameters for Subunits
The estimation of model parameters is the core of the training process, as it
eventually leads to the parameters of an HMM λi for each subunit. The data of single
signs (each sign with m repetitions) serve as input, together with the transcription of
the signs coded in the sign-lexicon (which, in the first iteration, is the result of the
initial transcription step). In all following iterations, the transcription of the prior
iteration step is the input for the current one.
Before the training starts, the number of states per sign has to be determined. The
total number of states for a sign equals the number of subunits contained in the sign
times two, as each subunit consists of two states.
Viterbi training, which has already been introduced in section 2.1.3.6, is employed
to estimate the model parameters of the subunits. Thus, only the basic approach is
outlined here; for further details and equations see section 2.1.3.6. The final result
of this step are HMMs for subunits. However, as data for single signs are recorded,
we are again concerned with the problem of finding boundaries, here the boundaries
of subunits within signs. Figure 3.34 illustrates how these boundaries are found.
Fig. 3.34. Example for the detection of subunit boundaries. After time alignment the assign-
ment of feature vectors to states is obvious.
The figure details the situation after time alignment by the Viterbi algorithm,
performed on a single sign. The abscissa represents time and thereby the feature
vector sequence, whereas the concatenated models of the subunits contained in the
sign are depicted on the ordinate. After the alignment of the feature vector sequence
to the states, the detection of subunit boundaries is obvious. In this example the first
two feature vectors are assigned to the states of model SU5, the following two to
subunit SU19, one to SU1, and finally the last three to subunit SU7. Now the Viterbi
algorithm is performed again, this time on the feature vectors assigned to each
subunit, in order to identify the parameters of an HMM λi for each individual subunit.
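Recovering the subunit boundaries from an aligned state path can be sketched as follows, assuming states are numbered consecutively along the concatenated model with two states per subunit; the example path mirrors the situation of figure 3.34 and is invented for illustration.

```python
def subunit_segments(state_path, states_per_subunit=2):
    """Group an aligned state path into (subunit, start_frame, end_frame) runs.

    States are numbered consecutively along the concatenated model, so state
    s belongs to subunit s // states_per_subunit, an index into the sign's
    transcription.
    """
    segments, start = [], 0
    for t in range(1, len(state_path) + 1):
        done = t == len(state_path)
        if done or (state_path[t] // states_per_subunit
                    != state_path[start] // states_per_subunit):
            segments.append((state_path[start] // states_per_subunit,
                             start, t - 1))
            start = t
    return segments

# Mimicking figure 3.34: eight frames aligned to four subunits (SU5, SU19,
# SU1, SU7), whose states are numbered 0..7 along the concatenated model.
path = [0, 0, 2, 3, 4, 6, 6, 7]
transcription = ["SU5", "SU19", "SU1", "SU7"]
segments = [(transcription[su], start, end)
            for su, start, end in subunit_segments(path)]
# -> [('SU5', 0, 1), ('SU19', 2, 3), ('SU1', 4, 4), ('SU7', 5, 7)]
```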
3.3.6.3 Classification of Single Signs
The same data of single signs which already served as input to the training process
is now classified in order to verify the quality of the subunit models estimated in
the previous step. It is expected that most recognition results are similar to the
transcription coded in the sign-lexicon. However, not all results will comply with the
original transcription, especially if several repetitions (say m) of one sign exist.
As the recognition results of all repetitions of a single sign are most probably not
identical, the best representation for all repetitions of that specific sign has to be
found. To reach this goal, an HMM is constructed for each of the m repetitions,
based on the recognition result of the classification step just described: the HMMs of
the recognised subunit models are concatenated into a word model for that repetition
of the sign. For instance, if the recognition result for one repetition of a sign is SU4,
SU1, SU7 and SU9, the resulting word model for that repetition is the concatenation
SU4 ⊕ SU1 ⊕ SU7 ⊕ SU9. In the following, λsi refers to the word model of the i-th
repetition of a sign s.
We start with the evaluation of the word model λs1 of the first repetition. For this
purpose, the probability is determined that the feature vector sequence of a specific
repetition, say s2, is produced by the model λs1. This procedure eventually leads to
m different probabilities (one for each of the m repetitions) for the model λs1; the
product of all these probabilities indicates how well this model fits all m repetitions.
For the word models of the remaining repetitions s2, s3, . . . , sm, the procedure is
analogous. Finally, the model of the repetition that yields the highest probability is
chosen. This choice is determined by:
C∗(w) = argmax_{j=1,...,M} ∏_{i=1}^{M} P_{A_j(w)}(A_i(w))        (3.17)
Eventually, the transcription of the most probable repetition si is coded into the
sign-lexicon. If the transcription of this iteration step differs from that of the previous
iteration, another training iteration follows.
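The selection rule of equation (3.17) can be sketched as follows; the `score` matrix of production probabilities is assumed to be given (e.g. from scoring each repetition's feature sequence under each repetition's word model), and all numbers are invented for illustration.

```python
import math

def best_repetition(score):
    """Pick the repetition whose word model explains all repetitions best.

    score[j][i] is assumed to hold P(repetition i | word model of repetition
    j); following equation (3.17), the winner maximises the product over i,
    computed here as a sum of logs for numerical stability.
    """
    m = len(score)
    best_j, best_log = None, -math.inf
    for j in range(m):
        total = sum(math.log(score[j][i]) for i in range(m))
        if total > best_log:
            best_j, best_log = j, total
    return best_j

# Hypothetical scores for m = 3 repetitions of one sign:
scores = [[0.9, 0.2, 0.3],
          [0.5, 0.8, 0.6],
          [0.4, 0.3, 0.7]]
# -> repetition 1: its word model fits all three repetitions best
```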
With the determination of the subunit models, the basic training process is complete.
For continuous sign language recognition, ’legal’ sequences of subunits must be
composed into signs. It may, however, happen that an ’illegal’ sequence of subunits
is recognised, i.e. a sequence which is not coded in the sign-lexicon.
To avoid such misclassifications, the subunits are concatenated according to their
transcription in the sign-lexicon. For example, the model λS for a sign S with the
transcription SU6, SU1, SU4 is defined by λS = λSU6 ◦ λSU1 ◦ λSU4.
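Concatenating subunit HMMs into a word model can be sketched at the level of transition matrices, assuming two-state subunit blocks and a simple bridge-and-renormalise scheme; this is a simplification of the confluent-state construction used in the text, and the Bakis matrix values are illustrative.

```python
import numpy as np

def concatenate_subunits(blocks):
    """Chain two-state subunit transition matrices into one word model.

    blocks holds one 2x2 transition matrix per subunit in the sign's
    transcription. The last state of each block gets a bridge transition
    into the first state of the next block, and every row is renormalised
    so that it remains a probability distribution.
    """
    n = 2 * len(blocks)
    big = np.zeros((n, n))
    for i, a in enumerate(blocks):
        big[2 * i:2 * i + 2, 2 * i:2 * i + 2] = a
        if i + 1 < len(blocks):
            big[2 * i + 1, 2 * i + 2] = 1.0   # bridge to the next subunit
    return big / big.sum(axis=1, keepdims=True)

bakis = np.array([[0.5, 0.5],
                  [0.0, 1.0]])
word = concatenate_subunits([bakis, bakis, bakis])  # e.g. SU6, SU1, SU4
# word is 6x6; states 0-1 model SU6, states 2-3 SU1, states 4-5 SU4.
```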
[Figure: Extension of the sign-lexicon by new signs. Feature vector sequences of
new single signs are recognised with the trained subunit models; after evaluation of
the preliminary recognition results, the transcriptions of the new signs are added to
the sign-lexicon.]
The trained subunit models serve as reference for this classification. However, if
more than one repetition of the sign is available, it is most unlikely that all
repetitions lead to the same recognition result; in the worst case, n repetitions yield
n different results. To solve this problem, a further evaluation step, identical to the
one in the training process, is required: for all repetitions of the sign, the ’best’
recognition result (in terms of all repetitions) is chosen and added to the
sign-lexicon.
Person-dependent as well as person-independent recognition rates are given for
single signs under controlled laboratory conditions as well as under real-world
conditions. Recognition results are also presented regarding the applicability of
subunits which were automatically derived from signs.
As a first result, Tab. 3.2 shows the person-dependent recognition rates from a
leave-one-out test for four signers and various video resolutions under controlled
laboratory conditions. The training resolution was always 384 × 288. Since the
number of recorded signs varies slightly, the vocabulary size is specified separately
for each signer. Only manual features were considered. Interestingly, the COG
coordinates alone accounted for approx. 95% recognition on a vocabulary of around
230 signs. On a 2 GHz PC, processing took an average of 11.79 s/4.15 s/3.08 s/2.92 s
per sign, depending on resolution. Low resolutions caused only a slight decrease in
recognition rate but reduced processing time considerably. So far, a comparably high
performance has only been reported for intrusive systems.
Table 3.2. Person-dependent isolated sign recognition with manual features in controlled en-
vironments.
Video        Features      Ben          Michael      Paula        Sanchu       Average
Resolution                 (235 signs)  (232 signs)  (219 signs)  (230 signs)  (229 signs)
384 × 288    all           98.7%        99.3%        98.5%        99.1%        98.9%
192 × 144    all           98.5%        97.4%        98.5%        99.1%        98.4%
128 × 96     all           97.7%        96.5%        98.3%        98.6%        97.8%
96 × 72      all           93.1%        93.7%        97.1%        95.9%        94.1%
384 × 288    x, ẋ, y, ẏ    93.8%        93.9%        95.5%        96.1%        94.8%
Under the same conditions, head pose, eyebrow position, and lip outline were
employed as nonmanual features. The recognition rates achieved on the vocabularies
presented in Tab. 3.2 varied between 49.3% and 72.4% among the four signers, with
an average of 63.6%. Hence roughly two out of three signs were recognized from
nonmanual features alone, a result that emphasizes the importance of facial
expression for sign language recognition.
Tab. 3.3 shows results for person-independent recognition. Since the signers used
different signs for some words, the vocabulary has been chosen as the intersection of
the test signs with the union of all training signs. No selection has been performed
otherwise, and no minimal pairs have been removed. As expected, performance drops
significantly. This is caused by strong interpersonal variance in signing, as visualized
in Fig. 3.5 by COG traces for identical signs done by different signers. Recognition
rates are also affected by the exact constellation of training/test signers.
Table 3.3. Person-independent isolated sign recognition with manual features in controlled
environments. The n-best rate indicates the percentage of results for which the correct sign
was among the n signs deemed most similar to the input sign by the classifier. For n = 1 this
corresponds to what is commonly specified as the recognition rate.
Training Signer(s)      Test Signer   Vocabulary Size   n = 1    n = 5    n = 10
Michael                 Sanchu        205               36.0%    58.0%    64.9%
Paula, Sanchu           Michael       218               30.5%    53.6%    63.2%
Ben, Paula, Sanchu      Michael       224               44.5%    69.3%    77.1%
Ben, Michael, Paula     Sanchu        221               54.2%    79.4%    84.5%
Ben, Michael, Sanchu    Paula         212               37.0%    63.6%    72.8%
Michael, Sanchu         Ben           206               48.1%    70.0%    77.4%
The test sentences consisted of two to nine signs. After eliminating coarticulation
effects and applying bigrams according to the statistical probability of sign order, a
recognition rate of 87% was achieved. This finding is essential, as it addresses the
problem of vocabulary extension without additional training.
3.4.3 Discussion
In this chapter a concept for video-based sign language recognition was described.
Manual and non-manual features have been integrated into the recognizer to cover
all aspects of sign language expressions. Excellent recognition performance has been
achieved for person-dependent classification and medium-sized vocabularies. The
presented system is also suitable for person-independent real-world applications
where small vocabularies suffice, e.g. for controlling interactive devices. The results
also show that a reasonable automatic transcription of signs into subunits is feasible.
Hence an existing vocabulary can be extended without the need for large amounts of
training data, a key feature in the development of sign language recognition systems
supporting large vocabularies.
References
1. Bahl, L., Jelinek, F., and Mercer, R. A Maximum Likelihood Approach to Continuous
Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
5(2):179–190, March 1983.
2. Bauer, B. Erkennung kontinuierlicher Gebärdensprache mit Untereinheiten-Modellen.
Shaker Verlag, Aachen, 2003.
3. Becker, C. Zur Struktur der deutschen Gebärdensprache. WVT Wissenschaftlicher Ver-
lag, Trier (Germany), 1997.
4. Canzler, U. and Dziurzyk, T. Extraction of Non Manual Features for Videobased Sign
Language Recognition. In Proceedings of the IAPR Workshop on Machine Vision Applications,
pages 318–321. Nara, Japan, December 2002.
5. Canzler, U. and Ersayar, T. Manual and Facial Features Combination for Videobased Sign
Language Recognition. In 7th International Student Conference on Electrical Engineer-
ing, page IC8. Prague, May 2003.
6. Derpanis, K. G. A Review of Vision-Based Hand Gestures. Technical report, Department
of Computer Science, York University, 2004.
7. Duda, R., Hart, P., and Stork, D. Pattern Classification. Wiley-Interscience, New York,
2000.
8. Fang, G., Gao, W., Chen, X., Wang, C., and Ma, J. Signer-Independent Continuous Sign
Language Recognition Based on SRN/HMM. In Revised Papers from the International
Gesture Workshop on Gestures and Sign Languages in Human-Computer Interaction,
pages 76–85. Springer, 2002.
9. Hermansky, H., Timberwala, S., and Pavel, M. Towards ASR on Partially Corrupted
Speech. In Proc. ICSLP ’96, volume 1, pages 462–465. Philadelphia, PA, 1996.
10. Holden, E. J. and Owens, R. A. Visual Sign Language Recognition. In Proceedings of
the 10th International Workshop on Theoretical Foundations of Computer Vision, pages
270–288. Springer, 2001.
11. Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, 1998. ISBN 0-262-
10066-5.
12. Jones, M. and Rehg, J. Statistical Color Models with Application to Skin Detection.
Technical Report CRL 98/11, Compaq Cambridge Research Lab, December 1998.
13. KaewTraKulPong, P. and Bowden, R. An Improved Adaptive Background Mixture Model
for Realtime Tracking with Shadow Detection. In AVBS01. 2001.
14. Liang, R. H. and Ouhyoung, M. A Real-time Continuous Gesture Interface for Taiwanese
Sign Language. In UIST ‘97 Proceedings of the 10th Annual ACM Symposium on User
Interface Software and Technology. ACM, Banff, Alberta, Canada, October 14-17 1997.
15. Murakami, K. and Taguchi, H. Gesture Recognition Using Recurrent Neural Networks.
In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages
237–242. ACM Press, 1991.
16. Ong, S. C. W. and Ranganath, S. Automatic Sign Language Analysis: A Survey and the
Future Beyond Lexical Meaning. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 27(6):873–891, June 2005.
17. Parashar, A. S. Representation and Interpretation of Manual and Non-manual Information
for Automated American Sign Language Recognition. PhD thesis, Department of
Computer Science and Engineering, College of Engineering, University of South Florida,
2003.
18. Porikli, F. and Tuzel, O. Human Body Tracking by Adaptive Background Models and
Mean-Shift Analysis. Technical Report TR-2003-36, Mitsubishi Electric Research Labo-
ratory, July 2003.
19. Sonka, M., Hlavac, V., and Boyle, R. Image Processing, Analysis and Machine Vision.
Brooks Cole, 1998.
20. Starner, T., Weaver, J., and Pentland, A. Real-Time American Sign Language Recognition
Using Desk and Wearable Computer Based Video. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 20(12):1371–1375, December 1998.
21. Stauffer, C. and Grimson, W. E. L. Adaptive Background Mixture Models for Real-time
Tracking. In Computer Vision and Pattern Recognition 1999, volume 2. 1999.
22. Stokoe, W. Sign language structure: An Outline of the Visual Communication Systems of
the American Deaf. (Studies in Linguistics. Occasional Paper; 8) University of Buffalo,
1960.
23. Sutton, V. http://www.signwriting.org/, 2003.
24. Vamplew, P. and Adams, A. Recognition of Sign Language Gestures Using Neural
Networks. In European Conference on Disabilities, Virtual Reality and Associated
Technologies, 1996.
25. Vogler, C. and Metaxas, D. Parallel Hidden Markov Models for American Sign Language
Recognition. In Proceedings of the International Conference on Computer Vision. 1999.
26. Vogler, C. and Metaxas, D. Toward Scalability in ASL Recognition: Breaking Down
Signs into Phonemes. In Braffort, A., Gherbi, R., Gibet, S., Richardson, J., and Teil,
D., editors, The Third Gesture Workshop: Towards a Gesture-Based Communication in
Human-Computer Interaction, pages 193–204. Springer-Verlag Berlin, Gif-sur-Yvette
(France), 2000.
27. Welch, G. and Bishop, G. An Introduction to the Kalman Filter. Technical Report TR
95-041, Department of Computer Science, University of North Carolina at Chapel Hill,
2004.
28. Yang, M., Ahuja, N., and Tabb, M. Extraction of 2D Motion Trajectories and Its
Application to Hand Gesture Recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 24:1061–1074, 2002.
29. Zieren, J. and Kraiss, K.-F. Robust Person-Independent Visual Sign Language
Recognition. In Proceedings of the 2nd Iberian Conference on Pattern Recognition and
Image Analysis (IbPRIA 2005), Lecture Notes in Computer Science. Springer, 2005.
Chapter 4
Speech Communication and Multimodal Interfaces
Our goal here is neither to describe the entire history of this development, nor to
provide the reader with a detailed presentation of the complete state of the art of the
fundamental principles of speech recognition (which would probably require a
separate book). Instead, the aim of this section is to present a relatively compact
overview of the currently most popular and successful method for speech
recognition, which is based on a probabilistic approach to modeling the production
of speech using the technique of Hidden Markov Models (HMMs). Today almost
every speech recognition system, laboratory as well as commercial, is based on this
technology, and it is therefore useful to concentrate exclusively on this approach in
this book. Although this method rests on involved mathematical foundations in
probability theory, machine learning and information theory, it is possible to
describe its basic functionality using only a moderate amount of mathematical
expressions, which is the approach pursued in this presentation.
The fundamental assumption of this approach is that each speech sound of a given
language (more commonly called a phoneme) is represented as a Hidden Markov
Model, which is nothing other than a stochastic finite state machine.
Similarly to classical finite state machines, an HMM consists of a finite num-
ber of states, with possible transitions between those states. The speech production
process is considered as a sequence of discrete acoustic events, where typically each
event is characterized by the production of a vector of features that basically describe
the produced speech signal at the equivalent discrete time step of the speech produc-
tion process. At each of these discrete time steps, the HMM is assumed to be in one
of its states, where the next time step will let the HMM perform a transition into a
state that can be reached from its current state according to its topology (including
possible transitions back to its current state or formerly visited states).
Fig. 4.1 shows such an HMM, which has two major sets of parameters: The first set
is the matrix of transition probabilities describing the probability p(s(k − 1) → s(k)),
where k is the discrete time index and s the notation for a state. These probabilities
are represented by the parameters a in Fig. 4.1. The second set represents
the so-called emission probabilities p(x|s(k)), which is the probability that a certain
feature vector can occur while the HMM is currently in state s at time k. This proba-
bility is usually expressed by a continuous distribution function, denoted as function
b in Fig. 4.1, which is in many cases a mixture of Gaussian distributions. If a transi-
tion into another state is assumed at discrete time k with the observed acoustic vector
x(k), it is very likely that this transition will be into a state with a high emission prob-
ability for this observed vector, i.e. into a state that represents well the characteristics
of the observed feature vector.
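The two parameter sets can be captured in a minimal container as a sketch; the single Gaussian per state is a deliberate simplification of the Gaussian mixtures mentioned above, and all numerical values are illustrative assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class PhonemeHMM:
    """Minimal container for the two HMM parameter sets of Fig. 4.1.

    trans[i][j] holds the transition probability p(s_i -> s_j) (the
    parameters a); emissions holds one (mean, variance) pair per state as
    a stand-in for the Gaussian mixture densities b used in real systems.
    """
    trans: list
    emissions: list

    def emission_logprob(self, state, x):
        """Log of a one-dimensional Gaussian density for the given state."""
        mean, var = self.emissions[state]
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# A 2-state phoneme model, as used for the concatenated HMM of Fig. 4.2:
hmm = PhonemeHMM(trans=[[0.6, 0.4], [0.0, 1.0]],
                 emissions=[(0.0, 1.0), (3.0, 1.0)])
```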
With these basic assumptions, it is already possible to formulate the major prin-
ciples of HMM-based speech recognition, which can be best done by having a closer
look at Fig. 4.2: In the lower part of this figure, one can see the speech signal of
an utterance representing the word sequence ”this was”, also displayed in the upper
part of this figure. This signal has been subdivided into time windows of constant
length (typically around 10 ms), and for each window a vector of features has
been generated, e.g. by applying a frequency transformation to the signal, such as
a Fourier-transformation or similar. This results in a sequence of vectors displayed
just above the speech signal. This vector sequence is now ”observed” by the Markov-
Model in Fig. 4.2, which has been generated by representing each phoneme of the un-
derlying word sequence by a 2-state-HMM and concatenating the separate phoneme
HMMs into one larger single HMM. An important issue in Fig. 4.2 is represented
by the black lines that visualize the assignment of each feature vector to one specific
state of the HMM. We have thus as many assignments as there are vectors in the fea-
ture vector sequence X = [x(1), x(2), ..., x(K)], where K is the number of vectors,
i.e. the length of the feature vector sequence. This so-called state-alignment of the
feature vectors is one of the essential capabilities of HMMs and there are different
algorithms for computation of this alignment, which shall not be presented here in
detail. However, the basic principle of this alignment procedure can be made clear by
considering one single transition of the HMM at discrete time k from state s(k−1) to
state s(k), where at the same time the occurrence of feature vector x(k) is observed.
Obviously, the joint probability of this event, consisting of the mentioned transition
and the occurrence of vector x(k), can be expressed according to Bayes law as

p(x(k), s(k − 1) → s(k)) = p(s(k − 1) → s(k)) · p(x(k)|s(k))

and thus is composed of exactly the two parameter sets a and b mentioned before,
which describe the HMM and are shown in Fig. 4.1. As already stated, a transition
that results in a high joint probability according to the above formula will be likely.
One can thus imagine that an algorithm for computing the most probable state
sequence as alignment to the observed feature vector sequence must be based on an
optimal selection of a state sequence that eventually leads to the maximization of
the product of all probabilities according to the above equation for k = 1 to K. A
few more details on this algorithm will be provided later.
For now, let us assume that the parameters of the HMMs are all known and that
the optimal state sequence has been determined, as indicated in Fig. 4.2 by the black
lines that assign each vector to one most probable state. In that case, this approach
has produced two major results. The first one is a segmentation result that assigns
each vector to one state; with this, it is for instance possible to determine which
vectors – and thus which section of the speech signal – have been assigned to a
specific phoneme, e.g. to the sound /i/ in Fig. 4.2 (namely vectors no. 6–9). The
second result is the aforementioned overall probability that the feature vector
sequence representing the speech signal has been assigned to the optimal state
sequence. Since this probability is the maximum possible and no other state sequence
leads to a larger value, it can be considered the overall production probability that the
speech signal has been produced by the underlying Hidden Markov Model shown in
Fig. 4.2. Thus, an HMM is capable of processing a speech signal and producing two
important results: a segmentation of the signal into subordinate units (such as
phonemes) and the overall probability that such a signal can have been produced at
all by the underlying probabilistic model.
If initial values for the HMM parameters are chosen for the start of an iterative
procedure, then it will of course be possible to compute with these parameters the
optimal corresponding state sequence, as outlined before. This will certainly not be
an optimal sequence, since the initial HMM parameters might not have been selected
very well. However, it is possible to derive from that
state sequence a new estimation for the HMM parameters by simply exploiting its
statistics. This is quite obvious for the transition probabilities, since one only needs
to count the occurring transitions between the states of the calculated state sequence
and divide that by the total number of transitions. The same is possible for the
probabilities p(x(k)|s(k)), which can basically be derived (omitting the details here)
by observing which kind of vectors have been assigned to the different states and
calculating statistics of these vectors, e.g. their mean values and variances. It can be shown that
this procedure can be repeated iteratively: With the HMM parameters updated in the
way as just described, one can now compute again a new improved alignment of the
vectors to the states and from there again exploit the state sequence statistics for up-
dating the HMM parameters. Typically, this leads indeed to a useful estimation of the
HMM parameters after a few iterations. Moreover, at the end of this procedure, another
important final step can be carried out: by "cutting" the large concatenated
HMM in Fig. 4.2 back into the smaller phoneme-based HMMs, one obtains a single
HMM for each phoneme, which now carries estimated parameters that represent the
characteristics of the different sounds well. Thus, the single HMM for the
phoneme /i/ in Fig. 4.2 will certainly have different parameters than the HMM for
the phoneme /a/ in this figure and serves well as a probabilistic model for this sound.
This is basically the training procedure of HMMs, and it becomes obvious that
the HMM technology allows probabilistic models for each speech unit (typically the
phoneme) to be trained from the processing of entire long sentences, without the
necessity to phoneme-label or pre-segment these sentences manually, which is
an enormous advantage.
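The counting step of this iterative (Viterbi-style) training can be sketched as follows. This is a minimal illustration, not the full Baum-Welch procedure: the function name is invented, the "vectors" are scalars for brevity, and a real system would estimate full mean vectors and variances per state.

```python
from collections import Counter, defaultdict

def reestimate(vectors, alignment):
    """One re-estimation step from a given state alignment: transition
    probabilities by counting state pairs, emission statistics (here only
    the mean) from the vectors assigned to each state."""
    # count transitions (s(k), s(k+1)) and normalize per source state
    trans_counts = Counter(zip(alignment, alignment[1:]))
    out_totals = Counter(alignment[:-1])
    trans = {(i, j): c / out_totals[i] for (i, j), c in trans_counts.items()}
    # group the vectors by the state they were aligned to
    by_state = defaultdict(list)
    for x, s in zip(vectors, alignment):
        by_state[s].append(x)
    means = {s: sum(xs) / len(xs) for s, xs in by_state.items()}
    return trans, means
```

With the updated transition probabilities and means, a new alignment can be computed, and the two steps alternate until the parameters stabilize.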
Furthermore, Fig. 4.2 can also serve as a suitable visualization for demonstrating the
recognition procedure in HMM-based speech recognition. In this case – contrary to
the training phase – now the HMM parameters are given (from the training phase)
and the speech signal represents an unknown utterance, therefore the transcription
in Fig. 4.2 is now unknown and shall be reconstructed by the recognition procedure.
From the previous description, we have to recall once again that very efficient algo-
rithms exist for the computation of the state-alignment between the feature vector
sequence and the HMM-states, as depicted by the black lines in the lower part of
Fig. 4.2. So far we have not looked at the details of such an algorithm, but shall go
into a somewhat more detailed analysis of one of these algorithms now by looking
at Fig. 4.3, which displays in the horizontal direction the time axis with the feature
vectors x appearing at discrete time steps and the states of a 3-state HMM on the
vertical axis.
Assuming that the model starts in the first state, it is obvious that the first fea-
ture vector will be assigned to that initial state. Looking at the second feature vector,
146 Speech Communication and Multimodal Interfaces
according to the topology of the HMM, it is possible that the model stays in state 1
(i.e. makes a transition from state 1 to state 1) or moves with a transition from state 1
into state 2. For both options, the probability can be computed according to Eqn. 4.1,
and both options are shown as possible paths in Fig. 4.3. It is then obvious that from
both path end points for time step 2, the path can be extended, in order to compute whether
the model has made a transition from state 1 (into either state 1 or state 2) or a transi-
tion from state 2 (into either state 2 or state 3). Thus, for the 3rd time step, the model
can be in state 1, 2 or 3, and all these options can be computed with a certain prob-
ability, by multiplying the probabilities obtained for time step 2 by the appropriate
transition and emission probabilities for the occurrence of the third feature vector. In
this way it should be easily visible that the possible state sequence can be displayed
in a grid (also called trellis) as shown in Fig. 4.3, which displays all possible paths
that can be taken from state 1 to the final state 3 in this figure, by assuming the obser-
vation of five different feature vectors. The bold line in this grid shows one possible
path through this grid and it is clear that for each path, a probability can be computed
that this path has been taken, according to the previously described procedure of
multiplying the appropriate production probabilities of each feature vector according to
Eqn. 4.1. Thus, the optimal path can be computed with the help of the principle of
dynamic programming, which is well-known from the theory of optimization. The
above procedure describes the Viterbi algorithm, which can be considered as one
of the major algorithms for HMM-based speech recognition. As already mentioned
several times, the major outcome of this algorithm is the optimal state sequence as
well as the "production probability" for this sequence, which is likewise the
probability that the given HMM has produced the associated feature vector sequence X.
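The trellis search just described can be sketched as follows, assuming a left-to-right HMM that starts in its first state and ends in its last, with probabilities handled in the log domain to avoid numerical underflow; the function and variable names are illustrative.

```python
import math

def viterbi(obs_logprobs, trans_logprobs):
    """Most probable state path through the trellis (dynamic programming).

    obs_logprobs[t][s]    -- log p(x(t) | state s)  (emission log-probs)
    trans_logprobs[p][s]  -- log p(s | p), -inf for forbidden transitions
    Returns (log-probability of the best path, best state sequence)."""
    T, S = len(obs_logprobs), len(obs_logprobs[0])
    NEG = float("-inf")
    # delta[t][s]: log-prob of the best path ending in state s at time t
    delta = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    delta[0][0] = obs_logprobs[0][0]          # model starts in the first state
    for t in range(1, T):
        for s in range(S):
            best_prev, best_val = 0, NEG
            for p in range(S):                # best predecessor of s at time t
                v = delta[t - 1][p] + trans_logprobs[p][s]
                if v > best_val:
                    best_prev, best_val = p, v
            delta[t][s] = best_val + obs_logprobs[t][s]
            back[t][s] = best_prev
    path = [S - 1]                            # backtrack from the final state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return delta[T - 1][S - 1], path
```

The returned path is the segmentation (which vector belongs to which state), and the returned log-probability is the production probability of the best path, exactly the two results discussed above.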
Let us assume that the 3-state HMM in Fig. 4.3 represents a phoneme and that, as
described in the previous section on the training procedure, trained parameters are
available for the HMMs representing all the phonemes of a given language. Typically,
an unknown speech signal will either represent a spoken word or an entire spoken
sentence. How can such a sentence then be recognized by the Viterbi algorithm as
described above for the state-alignment procedure of an HMM representing a single
phoneme? This can be achieved by simply extending the algorithm so that it com-
putes the most likely sequence of phoneme-HMMs that maximize the probability of
emitting the observed feature vector sequence. That means that after the final state
of the HMM in Fig. 4.3 has been reached (assuming a rather long feature vector se-
quence of which the end is not yet reached) another HMM (and possibly more) will
be appended and the algorithm is continued until all feature vectors are processed
and a final state (representing the end of a word) is obtained. Since it is not known
which HMMs have to be appended and what will be the optimal HMM sequence, it
becomes obvious that this procedure implies a considerable search problem; this
process is therefore also called "decoding". There are, however, several ways to
support this decoding procedure, for instance by considering the fact that the phoneme
order within words is rather fixed and variation can basically only happen at
word boundaries. Thus the phoneme search procedure is really more of a word search
procedure, which is furthermore assisted by the so-called language model (discussed
later), which assigns probabilities for extending the search path into a new word model
once the search procedure has reached the final state of a preceding word model.
Therefore, the algorithm can finally compute the most likely word sequence
and the final result of the recognition procedure is then the recognized sentence. It
should be noted that due to the special capabilities of the HMM approach, this de-
coding result can be obtained additionally with the production probability of that
sentence, and with the segmentation result indicating which part of the speech signal
can be assigned to the word and even to the phoneme boundaries of the recognized
utterance. Some extensions of the algorithms make it even possible to compute the
N most likely sentences for the given utterance (for instance for further processing
of that result in a semantic module), indicating once again the power and elegance of
this approach.
In this probabilistic formulation of the recognition task, the decoder does not observe the
originally uttered word sequence, but instead sees the "encoded" version of it in
the form of the feature vector sequence X that has been derived from the acoustic
waveform, as output of the acoustic channel displayed in Fig. 4.4. There is a probabilistic
relation between the word sequence W and the observed feature vector sequence
X, and indeed this probabilistic relation is modeled by the Hidden Markov Models
that represent the phoneme sequence implied by the word sequence W . In fact, this
model yields the already mentioned "production probability" that the word sequence
W , represented by the appropriate sequence of phoneme HMMs, has generated the
observed feature vector sequence; this probability is denoted as p(X|W ).
The second part of Fig. 4.4 shows the so-called "linguistic decoder", a module that
is responsible for decoding the original information W from the observed acoustic
sequence X, by taking into account the knowledge about the model provided by the
HMMs expressed in p(X|W ). The decoding strategy of this module is to find the
best possible word sequence W from observing the acoustic feature string X and
thus to maximize p(W |X) according to Bayes’ rule as follows:
    \max_W p(W \mid X) = \max_W \Big[ p(X \mid W) \cdot \frac{p(W)}{p(X)} \Big]    (4.2)
And because finding the optimal word sequence W is independent of the probability
p(X), the final maximization rule is:

    \max_W \big[ p(X \mid W) \cdot p(W) \big]    (4.3)
Exactly this product of probabilities has to be maximized during the search pro-
cedure described before in the framework of the Viterbi algorithm. In this case, it
should be noted that p(X|W ) is nothing else but the probability of the feature vec-
tor sequence X under the assumption that it has been generated by the underlying
word sequence W , and exactly this probability is expressed by the HMMs resulting
from the concatenation of the phoneme-based HMMs into a model that represents
the resulting word string. In this way, the above formula indeed expresses the
previously mentioned decoding strategy, namely to find the combination of phoneme-based
HMMs that maximizes the corresponding emission probability. However, the above
maximization rule contains an extra term p(W ) that has to be taken into
account in addition to the maximization procedure known so far: this term p(W )
is in fact the "sentence probability" that the word sequence W will occur at all,
independent of the acoustic observation X. This probability is described by the
so-called language model, which simply expresses the occurrence probability of a
word in a sentence given its predecessors:
    p(W) = \prod_{n=1}^{N} p(w(n) \mid w(n-1), \ldots, w(n-m))    (4.5)
where N is the length of the sentence and m is the considered length of the
word history. As already mentioned, the above word probabilities are completely
independent of any acoustic observation and can be derived from statistics obtained
e.g. from the analysis of large written text corpora by counting the occurrence of all
words given a certain history of word predecessors.
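The counting just described can be sketched for the simplest case m = 1, a bigram model. This is a bare sketch: the sentence-start token `<s>` is an assumed convention, and real language models add smoothing so that unseen word pairs do not receive zero probability.

```python
from collections import Counter

def bigram_lm(corpus):
    """Estimate bigram probabilities p(w(n) | w(n-1)) by counting
    occurrences in a text corpus (list of sentences)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        unigrams.update(words[:-1])           # count each history word
        bigrams.update(zip(words, words[1:])) # count each word pair
    # relative frequency: count(h, w) / count(h)
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}
```

For larger histories m the same counting applies to (m+1)-tuples, at the price of the rapidly growing memory demands mentioned later for embedded systems.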
The acoustic model contains the parameters of the distribution functions that are used
to compute these state-conditional probabilities. This computation is integrated into
the already mentioned search procedure, which attempts to find the best possible string of concatenated
HMM phoneme models that maximize the emission probability of the feature vector
sequence. This search procedure is controlled by the gray-shaded table containing
the phonological word models (i.e. how a word is composed of phonemes) and the
table containing the language model that delivers probabilities for examining the next
likely word if the search procedure has reached the final HMM-state of a previous
word model. In this way, the above-mentioned computation of state-conditional
probabilities does not need to be carried out for all possible states, but only for the states
that are considered to be likely by the search procedure. The result of the search pro-
cedure is the most probable word model sequence that is displayed as transcription
representing the recognized sentence to the user of the speech recognition system.
This brief description of the basic functionality of HMM-based speech recognition
can of course not cover this complicated subject in sufficient detail, and it is
therefore not surprising that many interesting sub-topics in ASR have not been covered
in this introduction. These include e.g. the area of discriminative training techniques
for HMMs, different HMM architectures such as discrete, hybrid and tied-mixture
approaches, the field of context-dependent modeling where acoustic models are cre-
ated that model the coarticulation effects of phonemes in the context of neighboring
phonemes, as well as clustering techniques that are required for representing the
acoustic parameters of these extended models. Other examples include the use of
HMM multi-stream techniques to handle different acoustic feature streams or the
entire area of efficient decoding techniques, e.g. the inclusion of higher level seman-
tics in decoding, fast search techniques, efficient dictionary structures or different
decoder architectures such as e.g. stack decoders. The interested reader is advised
to study the available literature on these topics and the large number of conference
papers describing those approaches in more detail.
As already mentioned, the HMM-technology has become the major technique for
Automatic Speech Recognition and has nowadays reached such a level of maturity
that it is the dominating technology not only in laboratory systems but in commercial
systems as well. The HMM technology is also so flexible that it can be deployed for
almost every specialization in ASR. Basically, one can distinguish the following
technology lines: small-vocabulary ASR
systems with 10-50 word vocabulary, in speaker-independent mode, mainly used for
telephone applications. Here, the capability of HMMs to capture the statistics of large
training corpora obtained from many different speakers is exploited. The second line
is represented by speaker-independent systems with medium size vocabulary, often
used in automotive or multimedia application environments. Here, noise reduction
technologies are often combined with the HMM framework, and the efficiency of
HMMs in decoding entire sequences of phonemes and words is exploited for the
recognition of continuously spoken utterances in adverse environments. The last
major line consists of dictation systems with very large vocabularies (up to 100,000 words), which
often operate in speaker-dependent and/or speaker-adaptive mode. In this case, the
special capabilities of HMMs are in the area of efficient decoding techniques for
searching very large trellis spaces, and especially in the field of context-dependent
acoustic modeling, where coarticulation effects in continuous speech can be effi-
ciently modeled by so-called triphones and clustering techniques for their acoustic
parameters. One of the most recent trends is the development of so-called embedded
systems, where medium to large vocabulary ASR systems are implemented
on systems such as mobile phones, thin clients or other electronic devices. This has
become possible with the availability of appropriate memory cards with sufficiently
large storage capacity, so that the acoustic parameters and especially the memory-
intensive language model with millions of word sequence probabilities can be stored
directly on such devices. Motivated by these memory demands, another recent
trend is so-called Distributed Speech Recognition (DSR), where only the feature
extraction is computed on the local device; the features are then transferred by wireless
transmission to a large server where all remaining recognition steps are carried out,
i.e. computation of emission probabilities and the decoding into the recognized word
sequence. For these mentioned steps, an arbitrarily large server can be employed,
with sufficient computation power and memory for large language models and a
large number of Gaussian parameters for acoustic modeling.
It is not surprising that the above-mentioned algorithms and available technologies
have led to a large variety of interesting applications. Although ASR technology
is far from perfect, the currently achievable performance is in many cases good
enough to enable novel application scenarios or to revisit already established
application areas with improved technology. Although current speech recognition
applications are manifold, a certain structure can
be established by identifying several major application areas as follows:
Telecommunications: This is still probably the most important application area,
since telecommunications is very much concerned with telephony applications that
naturally involve speech technology. Here, speech recognition forms a natural bridge
between telecommunications and information technology by providing an interface
for entering data into information systems via a communication channel. Thus, most
application scenarios involve speech recognition over the telephone. Prominent
scenarios include inquiry systems, where
the user requests information by calling an automated system, e.g. for banking
information or querying train and flight schedules. Dialing assistance, such as
speaking the telephone number instead of dialing or typing it, also belongs to this
application area. More advanced applications include telephony interpretation, i.e. the automatic
translation of phone calls between partners from different countries, and all tech-
niques involving mobile communications, such as embedded speech recognition on
mobile clients and distributed speech recognition.
4.2.1 Introduction
It is widely believed that, after the command line interfaces popular in the years
1960-80 and the graphical user interfaces of the years 1980-2000, the future lies
in speech and multimodal user interfaces. A number of factors clearly speak for
speech as an interaction form:
• Speech is the most natural communication form between humans.
• Only limited space is required: a microphone and a loudspeaker can be placed
even in wearable devices. (This does not, however, account for the computational effort involved.)
• Hands and eyes are left free, which makes speech the number one interaction
modality in many control situations such as driving a car.
• Approximately 1.3 billion telephones exist worldwide, more than five times the
number of computers connected to the Internet at the time of writing. This
provides a big market for automatic dialog systems in the future.
Using speech as an input and output form leads us to dialog systems. In general
these are, ordered hierarchically by complexity, systems that allow for control (e.g.
of functions in the car), information retrieval (such as flight data), structured
transactions by voice (for example stock control), combined tasks of information
retrieval and transaction (such as booking a hotel according to flight availability),
and finally complex tasks such as the care of elderly persons. More specifically, a
dialog may be defined as an exchange of different aspects in a reciprocal conversation
between at least two instances, which may either be human or machine, with at least
one change of speaker.
Within this chapter we focus on spoken dialog, although dialog may also take other
forms, such as textual or combined modes. More concretely, we will deal with
so-called Spoken Language Dialog Systems (SLDS), which in general are a combination
of speech recognition, natural language understanding, dialog act generation,
and speech synthesis. It is not exactly defined which parts are mandatory for
an SLDS; this contributes to their interdisciplinary nature, uniting the fields of
speech processing, dialog design, usability engineering and process analysis within
the target area.
As an enhancement of Question and Answer (Q&A) systems, which allow natural
language access to data through direct answers without reference to past context
(e.g. by pronominal allusions), dialog systems also allow for anaphoric expressions.
This makes them significantly more natural, as anaphora are frequently used
linguistic elements in human communication. Furthermore, an SLDS may also take
over the initiative. What finally distinguishes them from sheer Q&A and Command
and Control (C&C) systems is that user modeling may be included. However, both
Q&A systems and SLDS already provide an important improvement over today's
predominant systems, where users are asked to adhere to a given complex and
artificial syntax. The following figure gives an overview of the typical circular
pipeline architecture of an SLDS [23]. On top, the user can be found, who controls an application
found at the bottom, by use of an SLDS as interaction medium. While the components
are mostly the same across systems, other architectures exist, such as organization
around a central process [15]. In detail, the single components are:
• Dialog Manager (DM): The central component, responsible among other things
for the analysis of context and dialog history, flow control (e.g. for active initiative
or barge-in handling), direction of the course of a conversation, answer production
in an abstract form, and database access or application control.
• Database (DB): Storage of the information that constitutes the dialog content.
• Natural Language Generation (NLG): Formulation of the abstract answer provided
by the DM. A variety of approaches exists for this task, ranging from
probabilistic approaches with grammatical post-processing to pre-formulated
utterances.
• Speech Synthesis (TTS): Audio production for the naturally formulated system
answer. Such modules are generally called Text-to-Speech engines. The two major
types are formant synthesizers, which produce audio in a genuinely artificial
way by modeling the vocal tract, and concatenative synthesis, in which recorded
audio clips of diverse lengths, ranging from phonemes to whole words, are
concatenated to produce new words or sentences. At the moment the latter tend to
sound more natural, depending on the type of modeling and on post-processing
such as pitch and loudness correction. More sophisticated approaches use
diphones or triphones to model phonemes in the context of their neighbors.
Recently, prosodic cues such as emotional speech have also gained interest in the
field of synthesis. However, the most natural form still remains prerecorded
speech, which, on the other hand, provides less flexibility and higher cost in
recording and storage space.
Still, as it is disputed which parts besides the DM belong to a strict definition of
an SLDS, we will focus on the DM in the following.
Before getting into dialog modeling, we want to classify dialog systems with respect
to the initiative:
• system-driven: the system keeps the initiative throughout the whole dialog
• user-driven: the user keeps the initiative
• mixed initiative: the initiative changes throughout the dialog
Usually, the kind of application, and thus the kind of voice interface, codetermines
this general dialog strategy. For example, systems with limited vocabulary
tend to employ rigid, system initiative dialogs. The system thereby asks very spe-
cific questions, and the user can do nothing else but answer them. Such a strategy is
required due to the highly limited set of inputs the system can cope with. C&C ap-
plications tend to have rigid, user initiative dialogs: the system has to wait for input
from the user before it can do anything. However, the ideal of many researchers and
developers is a natural, mixed initiative dialog: both counterparts have the possibility
of taking the initiative when this is opportune given the current state of the dialog,
and the user can converse with the system as (s)he would with another human. This is
difficult to obtain in general, for at least two reasons: Firstly, it is technically demand-
ing, as the user should have the freedom to basically say anything at any moment,
Within the DM, an incorporated dialog model is responsible for the structure of the
communication. We want to introduce the most important such models in the following.
They can mainly be divided into structural models, which will be introduced
first, and non-structural ones [23, 31]. In structural models, a more or less predefined
path is given. While these are quite practicable at first, their methodology is not
very principled, and the quality of dialogs based on them is arguable. Trying to
overcome this, the non-structural approaches rely instead on general principles of
dialog.
4.2.3.1 Finite State Model
In the very basic graph-based, or Finite State Automaton (FSA), model, the dialog
structure represents a sequence of predetermined steps or states in the form of a
state transition graph which models all legal dialogs. This graph's nodes represent
system questions, system outputs or system actions, while its edges, labeled with the
corresponding user expressions, show all possible paths through the network. As all
paths in a deterministic finite automaton are fixed, one cannot speak of explicit
dialog control. The dialog therefore tends to be rigid, and the reaction to each
user input is predetermined. We can denote such a model as the quintuple
{S, s0 , sf , A, τ }. Here S represents the set of states, among them s0 , the initial
state, and sf , the final state; A is a set of actions including an empty action ε; and
τ is a transition function τ : S × A → S that specifies to which state an action leads
from the current state. A dialog is then defined as a path from s0 to sf
within the space of possible states.
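The quintuple above can be sketched directly in code; the state and action names below are invented for illustration, and the re-prompt behavior for unknown input is an assumption of this sketch, not part of the formal model.

```python
def run_dialog(tau, s0, sf, user_inputs):
    """Minimal FSA dialog model {S, s0, sf, A, tau}: tau maps
    (state, action) to the next state; a dialog is a path from s0 to sf."""
    state, path = s0, [s0]
    for action in user_inputs:
        # unknown action: stay in the current state (i.e. re-prompt the user)
        state = tau.get((state, action), state)
        path.append(state)
        if state == sf:          # final state reached: dialog complete
            break
    return path
```

Every legal dialog is exactly one such path, which is why the model is easy to draw graphically but also why every deviation from the designed paths must be handled explicitly.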
Generally, deterministic FSA are the most natural kind to represent system-driven
dialogs. Furthermore they are well suitable for completely predictable information
exchange due to the fact that the entire dialog model is defined at the time of the
development. Often it is possible to represent the dialog flow graphically, which
makes for a very natural development process.
However, there is no natural order of the states, as each one can be followed
by any other, resulting in a combinatorial explosion. Furthermore, the model is
unsuited for different abstraction levels of the exchanged information, and for complex
dependencies between information units. Systems based on FSA models are also
inflexible, as they are completely system-led, and no path deviations are possible. The
user's input is restricted to single words or phrases that provide responses to carefully
designed system prompts. Tasks which require negotiation cannot be implemented
with deterministic automata due to the uncertainty of the result. Grounding and
repair are very rigid and must be applied after each turn. Later corrections by the user
are hardly possible, and the context of the statements cannot be used. Summed up, such
systems are appropriate for simple tasks with flat menu structure and short option
lists.
4.2.3.2 Slot Filling
In so-called frame-based systems, the user is asked questions that enable the system
to fill slots in a template in order to perform a task [41]. These systems are more or
less today's standard for database retrieval systems such as flight or cinema
information. In contrast to automaton techniques, the dialog flow is not predetermined
but depends on the content of the user's input and the information that the system
has to elicit. If the user provides more input than is requested at a time, the system
can accept this information and check whether any additional item is still required. A
necessary component is therefore a frame that keeps track of the already provided
information fragments. If multiple slots are yet to be filled, the question of the
optimal order of system questions remains. The idea is to constrain the originally large set of
a-priori possibilities of actions in order to speed up the dialog. One reasonable
approach to establishing a hierarchy of slots is to ask in order of highest information,
based on Shannon's entropy measure, as proposed in [24]. An item with high
information might e.g. be the one leaving the most alternatives open. Let ck be a
random variable over the items in the set C, and vj a value of attribute a. The
attribute a that minimizes the following conditional entropy H(c|a) will accordingly
be selected:
    H(c \mid a) = - \sum_{v_j \in a} \sum_{c_k \in C} P(c_k, a = v_j) \log_2 P(c_k \mid a = v_j)    (4.6)

    P(c_k \mid a = v_j) = \frac{P(c_k)}{\sum_{c_j \in C_{v_j}} P(c_j)}, \qquad P(c_k) = \frac{1}{|C|}    (4.7)

where C_{v_j} denotes the subset of items in C for which a = v_j.
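With the uniform item prior P(c_k) = 1/|C|, the double sum in Eqn. 4.6 collapses to a weighted sum over the value counts of the attribute, which makes the selection easy to sketch. The database items below are invented for illustration.

```python
import math

def conditional_entropy(items, attr):
    """H(c|a) under a uniform prior over the database items: the expected
    remaining uncertainty about the item once the user has answered for
    attribute `attr`.  Reduces to sum_v (n_v / n) * log2(n_v)."""
    n = len(items)
    counts = {}
    for item in items:
        v = item[attr]
        counts[v] = counts.get(v, 0) + 1
    return sum((nv / n) * math.log2(nv) for nv in counts.values())

def best_attribute(items, attrs):
    """Ask next for the attribute that minimizes H(c|a)."""
    return min(attrs, key=lambda a: conditional_entropy(items, a))
```

An attribute whose values distinguish all items yields H(c|a) = 0 (one question settles everything), whereas an attribute with the same value everywhere yields the maximum log2 |C| and is useless to ask.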
An advantage of dialog control with frames is higher flexibility for the user and
multiple slot filling. The duration of dialogs is shorter, and mixed-initiative control
is possible. However, an extended recognition grammar is required for the more
flexible user expressions, and the dialog control algorithm must specify the next
system actions on the basis of the available frames. The context that decides on the
next action (last user input, status of the slots, simple priority ranking) is limited,
and the knowledge level of the user, negotiation, or collaborative planning can be
modeled only to a very limited degree, if at all. No handling of communication
problems exists, and the form of the system questions is not specified. There is,
however, an extension [53] whereby complex questions are asked as long as
communication works well; in case of problems, the system can switch to lower-level
questions splitting up the high-level ones. Finally,
an overview of the system is hard to obtain due to the partially complex rules: which
rule fires when?
4.2.3.3 Stochastic Model
In order to move away from hand-crafted approaches that rely mainly on a designer's
input, data-driven methods successfully applied in machine learning have recently
been used in the field of dialog modeling. They aim at overcoming the shortcomings
of low portability to new domains and the limited predictability of potential
problems within a design process. As the aim is to optimize a dialog, let us first
define a cost function C, in which the numbers Nt of turns, Ne of errors, and Nm of
missing values are summed up with individual weights wt , we , and wm into per-turn
costs ct :
    C = \sum_{t=0}^{t_f} c_t    (4.9)
where tf denotes the instant at which the final state sf is reached. Consequently,
the optimal value V ∗ (s) of a state s is obtained by the action that minimizes the
sum of the incurred cost and the expected cost of the succeeding state, assuming
optimal actions there, too:
    V^*(s) = \min_a \Big[ c(s,a) + \sum_{s'} P_T(s' \mid s, a) \, V^*(s') \Big]    (4.10)
Given all model parameters and a finite state space, the unique function V ∗ (s)
can be computed by value iteration techniques. Finally, the optimal strategy ψ ∗
is the chain of actions that minimizes the overall costs. In our case of an SLDS, the
parameters are not known in advance but have to be learned from data. Normally
such data might be acquired from test users operating a simulated system, known as
Wizard-of-Oz experiments [33]. However, as large amounts of data are needed
("there is no data like more data"), one solution to the sparse-data problem is to
simulate the user by another stochastic process [27]. Based on annotated dialogs with
real users, probabilities are derived with which the simulated user responds to the
system. Next, reinforcement learning, e.g. Monte Carlo with exploring starts, is used
to obtain an optimal state-action value function Q∗ (s, a). It represents the expected
costs of a dialog that starts in state s, takes action a, and continues optimally to
the end:
    Q^*(s,a) = c(s,a) + \sum_{s'} P_T(s' \mid s, a) \min_{a'} Q^*(s', a')    (4.11)
Note that V*(s) = min_a Q*(s, a). Starting from an arbitrary initialization of Q*(s, a), the algorithm iteratively finds the costs for a dialog session. In [27] it is shown that it converges to an optimal solution, and that it is expensive to immediately cut a dialog short with "bye!".
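The value iteration of Eq. (4.10) can be sketched on a toy slot-filling dialog. All states, actions, costs, and transition probabilities below are invented purely for illustration; a real SLDS state space and cost model would be far richer.

```python
# Value iteration for a toy dialog MDP (illustrative numbers only).
# States: "start" (nothing filled), "half" (one slot filled), "done" (final).
# Actions: "ask" (one question per slot) and "confirm".

COSTS = {("start", "ask"): 1.0, ("start", "confirm"): 2.0,
         ("half", "ask"): 1.0, ("half", "confirm"): 1.5}

# TRANS[(s, a)] -> {s': transition probability}
TRANS = {
    ("start", "ask"):     {"half": 0.8, "start": 0.2},
    ("start", "confirm"): {"start": 1.0},
    ("half", "ask"):      {"done": 0.8, "half": 0.2},
    ("half", "confirm"):  {"done": 0.6, "half": 0.4},
}

def value_iteration(n_sweeps=100):
    V = {"start": 0.0, "half": 0.0, "done": 0.0}  # "done" is absorbing, cost 0
    for _ in range(n_sweeps):
        for s in ("start", "half"):
            # Bellman update: incurred cost plus expected cost of successor
            V[s] = min(COSTS[(s, a)] +
                       sum(p * V[s2] for s2, p in TRANS[(s, a)].items())
                       for a in ("ask", "confirm"))
    return V

V = value_iteration()
# With these numbers "ask" is optimal in both states:
# V("half") converges to 1/0.8 = 1.25 and V("start") to 2.5.
```

The sweep count is generous here; since the update is a contraction, the values converge geometrically to the fixed point of Eq. (4.10).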
The goals set at the beginning of this subsection can be accomplished by stochastic dialog modeling. The need for data seems a drawback, but very sparse data often suffices, as even low accuracies of the initial transition probabilities may already lead to satisfying results.
Still, the state space is hand-crafted, which strongly influences the learning process; ideally, the state space itself should be learned as well. Furthermore, determining the cost function is not trivial, especially assigning the right weights, which likewise strongly influences the result. Moreover, the suggested cost function takes no account of subjective costs such as user satisfaction.
4.2.3.4 Goal Directed Processing
The more or less finite-state and action based approaches with predefined paths introduced so far are well suited for C&C and information retrieval, but less suited for task-oriented dialogs, where the user and the system have to cooperate to solve a problem. Consider therefore a plan-based dialog between an expert system and a user who has only sparse knowledge about the problem to solve. The actual task thereby mostly consists of sub-tasks and influences the structure of the dialog [16]. A system now has to be able to reason about the problem at hand, the application, and the user in order to solve the task [50]. This leads to a theorem prover that derives conclusions by the laws of logic from a set of axioms considered true, contained in a knowledge base. The idea is to prove whether a subgoal or goal, as defined by a theorem, has been accomplished yet. User modeling is thereby also done in the form of stored axioms describing the user's competence and knowledge. Whenever an axiom is missing, interaction with the user becomes necessary, which calls for an interruptible theorem prover that is able to direct a question to the user. This is known as the missing axiom theory [50].
As soon as a subgoal is fulfilled, a new one can be selected according to the highest probability of success given the current dialog state. However, the
prover should be flexible enough to switch to different sub-goals, if the user actively
provides new information. This process is iteratively repeated until the main goal is
reached.
In the framework of Artificial Intelligence (AI) this can be described by the Be-
liefs, Desires, and Intentions (BDI) model [23], as shown in the following figure 4.7.
This model can be applied to conversational agents that have beliefs about the current state of the world, and desires about how they want it to be. Based on these they determine their intention, or goal, and build a plan consisting of actions to satisfy their desires. Note that utterances are thereby treated as (speech) actions.
To conclude, the advantages of goal directed processing are the ability to also
deal with task-oriented communication and informational dialogs, a more principled
approach to dialog based on a general theory of communication, and less domain-
dependency at least for the general model.
Yet, by applying AI methods, several problems are inherited: high computational cost; the frame problem, i.e. specifying which parts of the world are not influenced by actions; and the lack of proper formalizations of belief and desire as well as of independently motivated rationality principles for agent behavior.
4.2.3.5 Rational Conversational Agents
Rational conversational agents directly aim at imitating natural conversation behavior by trying to overcome the plan-based approaches' lack of human-like rationality. The latter is therefore reformulated in a formal framework, establishing an intelligent system that relies on a more general competence basis [43]. The beliefs, desires, and intentions are formulated logically, and a rational unit is constructed on this basis that decides upon the actions to be taken.
Let us denote an agent i believing in proposition p as B(i, p), and exemplary propositions as ϕ and ψ. We can then construct a set of logical rules known as KD45 or Weak S5 that apply to the B operator:
• Maxim of quality: be truthful. E.g. ’If I hear that song again I’ll kill myself’ (we
accept this as a hyperbole and do not immediately turn the radio off), or ’The boss
has lost his marbles’ (we imagine a mental problem and not actual marbles).
• Maxim of quantity: be informative, say neither too much nor too little. (Asked the date, we do not include the year).
• Maxim of manner: be clear and orderly.
In order to obtain high overall dialog quality, some aspects shall be further outlined: Firstly, consistency and transparency are important to enable the user to form a model of the system. Secondly, social competence in terms of modeling user behavior and providing the system with a personality seems very important. Thirdly, error handling plays a key role, as speech recognition is prone to errors. The precondition for this is clarification, demanding that problems with the user input (no input, truncated input, word-recognition errors, wrong interpretations, etc.) be made known to the system. In general it is said that users accept an error rate of at most five percent. Therefore trade-offs have to be made between the vocabulary size and the naturalness of the input. However, there is a chance to raise the accuracy by expert prompt modeling and allusion to the nature of the problem, or at least to cover errors and reduce user annoyance by careful design. Also, the danger of over-modeling a dialog with respect to world knowledge, social experience, or the general complexity of human communication should be mentioned. Finally, the main characteristic of a good voice interface is probably that it is usable. Usability is a widely discussed concept
in the field of interfaces, and various operationalizations have been proposed. In [33]
it is stated that usability is a multidimensional concept comprising learnability, effi-
ciency, memorability, errors and satisfaction, and ways are described, in which these
can be measured. Defining when a voice interface is usable is one thing, developing
one is quite another. It is by now received wisdom that usability design is an iterative
process which should be integrated in the general development process.
• Select: The first form item not yet filled, in top-down document order within the active VXML document, that has an open guard condition is selected.
• Collect: The selected form item is visited, the prompt algorithm of the form is applied to its prompts, the input grammar of the form is activated, and the algorithm waits for user input.
• Process: The input is evaluated against the form's fields according to the active grammar; filled elements trigger, for example, input validation. The process phase ends if no more items can be selected or a jump point is reached.
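The select-collect-process cycle above is the core of the VoiceXML Form Interpretation Algorithm. A heavily simplified sketch follows; the field names, prompts, and the trivial "grammar" check are invented, and a real interpreter additionally handles events, grammars, and document transitions.

```python
# Minimal sketch of the VoiceXML Form Interpretation Algorithm main loop.

def fia(form, get_user_input):
    """form: ordered dict field -> {"value": None, "prompt": str, "guard": callable}."""
    while True:
        # Select: first unfilled field (document order) whose guard condition is open.
        selected = next((name for name, item in form.items()
                         if item["value"] is None and item["guard"](form)), None)
        if selected is None:          # no item selectable -> form is complete
            return {name: item["value"] for name, item in form.items()}
        # Collect: play the prompt, activate the grammar, wait for input.
        utterance = get_user_input(form[selected]["prompt"])
        # Process: validate the input against the active "grammar".
        if utterance:                 # stand-in for real grammar matching
            form[selected]["value"] = utterance

always_open = lambda form: True
form = {"city": {"value": None, "prompt": "Which city?", "guard": always_open},
        "date": {"value": None, "prompt": "Which date?", "guard": always_open}}
answers = iter(["Berlin", "Friday"])
result = fia(form, lambda prompt: next(answers))
# result == {"city": "Berlin", "date": "Friday"}
```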
Typically, HTTP is used as the transport protocol for fetching VXML pages.
While simpler applications may use static VXML pages, nearly all rely on dynamic
VXML page generation using an application server. In a well-architected web appli-
cation, the voice interface and the visual interface share the same back-end business
logic.
Tagging of dialogs for machine learning algorithms, on the other hand, is mostly done using Dialogue Act Mark-up in Several Layers (DAMSL), an annotation scheme for communicative acts in dialog [9]. While different dialogs analyzed with different aims in mind will lead to diverse acts, it seems reasonable to agree on a common basis for annotation in order to enable database enlargement by integration of other corpora. The scheme has three layers: Forward Communicative Functions,
Backward Communicative Functions, and Information Level. Each layer allows mul-
tiple communicative functions of an utterance to be labeled. The Forward Commu-
nicative Functions consist of a taxonomy in a similar style as the actions of tradi-
tional speech act theory. The most important thereby are statement (assert, reassert, other statement), influencing addressee future action (open-option, directive (info-request, action-directive)), and committing speaker future action (offer, commit). The
Backward Communicative Functions indicate how the current utterance relates to
the previous dialog: agreement (accept, accept-part, maybe, reject-part, reject, hold),
understanding (signal non-understanding, signal understanding (acknowledge, re-
peat phrase, completion), correct misspeaking), and answer. Finally, the Information
Level annotation encodes whether an utterance is occupied with the dialog task, the
communication process or meta-level discussion about the task.
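A DAMSL-style annotation can be represented as plain data. The utterances and label choices below are invented examples; only the tag inventory follows the three layers just described.

```python
# Hedged illustration of DAMSL-style multi-layer annotation as plain data.

annotation = [
    {"utterance": "Can you open the window?",
     "forward":  ["info-request"],           # Forward Communicative Function
     "backward": [],                          # nothing to relate back to yet
     "info_level": "task"},
    {"utterance": "Sure, one moment.",
     "forward":  ["commit"],
     "backward": ["accept"],                  # relates to the previous request
     "info_level": "task"},
    {"utterance": "Sorry, what did you say?",
     "forward":  ["info-request"],
     "backward": ["signal-non-understanding"],
     "info_level": "communication"},          # about the communication process
]

# Each layer allows multiple labels per utterance, e.g. an answer that also accepts:
multi = {"utterance": "Yes, at three o'clock.",
         "forward": ["assert"], "backward": ["accept", "answer"],
         "info_level": "task"}
```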
way of interaction with computers. In general, the term interaction describes the mutual influence of several participants who exchange information, transported by the most diverse means, in a bilateral fashion. A major goal of HCI is therefore to make the interface converge increasingly towards the familiar and ordinary interpersonal way of interaction. In natural human communication several input and output channels are combined in a multimodal manner.
The human being is able to gather, process and express information through a number
of channels. The input channel can be described as sensor function or perception, the
processing of information as cognition, and the output channel as motor function (see
figure 4.8).
Humans are equipped with six senses, or sense organs, to gather stimuli. These senses are defined by physiology [14]:
different channels he can flexibly adapt to the specific abilities of the conversational
partner and the current surroundings.
However, HCI is far from transmitting information in this intuitive and natural manner. This is due to the limitations of technical systems with regard to the number and performance of their individual input and output modalities. The term modality refers to the type of communication channel used to convey or acquire information. One of the core difficulties of HCI is the divergent boundary conditions between computer and human. Nowadays, the user can transfer commands to the computer only by standard input devices, e.g. mouse or keyboard. The computer feedback is carried out via the visual or acoustic channel, e.g. monitor or speakers. Thus, today's technology uses only a few of the human interaction modalities. Whenever two or more of these modalities are involved, one speaks of multimodality.
The term "multimodal" is derived from "multi" (Lat.: several, numerous) and "modus" (Lat.: manner, method). The word "modal" may cover the notion of "modality" as well as that of "mode". Scientific research focuses on two central characteristics of
multimodal systems:
• the user is able to communicate with the machine by several input and output
modes
• the different information channels can interact in a sensible fashion
According to S. Oviatt [35] multimodal interfaces combine natural input modes
- such as speech, pen, touch, manual gestures, gaze and head and body movements
- in a coordinated manner with multimedia system output. They are a new class of
interfaces which aims to recognize naturally occurring forms of human language and
behavior, and which incorporate one or more recognition-based technologies (e.g.,
speech, pen, vision). Benoit [5] expanded the definition to a system which repre-
sents and manipulates information from different human communication channels at
multiple levels of abstraction. These systems are able to automatically extract mean-
ing from multimodal raw input data and conversely produce perceivable informa-
tion from symbolic abstract representations. In 1980, Bolt’s ”Put That There” [6]
demonstration showed a new direction for computing by processing speech in parallel with touch-pad pointing. Multimodal systems benefit from the progress in recognition-based technologies, which are capable of gathering naturally occurring forms of human language and behavior. The dominant theme in users' natural organization of multimodal input is actually complementarity of content, meaning that each input consistently contributes different semantic information. The partial information sequences must be fused and can only be interpreted together. Redundancy of information, however, is much less common in human communication. Sometimes different modalities can input conflicting content that has to be processed
independently. Multimodal applications range from map-based (e.g. tourist informa-
tion) and virtual reality systems, to person identification and verification systems, to
medical and web-based transaction systems. Recent systems integrate two or more
recognition-based technologies like speech and lip movements. A major aspect is the integration and synchronization of these multiple information streams, which is discussed in detail in Sect. 4.3.3.
4.3.2.1 Advantages
Multimodal interfaces are largely inspired by the goal of supporting more transpar-
ent, flexible, effective, efficient and robust interaction [29, 35]. The flexible use of
input modes is an important design issue. This includes the choice of the appropri-
ate modality for different types of information, the use of combined input modes,
or the alternate use between modes. A more detailed differentiation is made in Sect.
4.3.2.2. Input modalities can be selected according to context and task by the user or
system. Especially for complex tasks and environments, multimodal systems permit
the user to interact more effectively. Because there are large individual differences
in abilities and preferences, it is essential to support selection and control for diverse
user groups [1]. For this reason, multimodal interfaces are expected to be easier to learn and use. The continuously changing demands of mobile scenarios, e.g. in-vehicle applications, further require the user to shift between modalities. Many studies show that multimodal interfaces achieve higher levels of user preference. The main advantage is probably the efficiency gain that derives from the human ability to process input modes in parallel, as the human brain is structured to gather certain kinds of information with specific sensory inputs. Multimodal interface design also allows superior error handling, avoiding and recovering from errors with user-centered as well as system-centered causes. Consequently, such a system can function in a more robust and stable manner. In-depth information about error avoidance and graceful recovery from errors is given in Sect. 4.3.4. A future aim is to interpret continuous input from visual, auditory, and tactile input modes in everyday systems to support intelligent adaptation to user, task, and usage environment.
4.3.2.2 Taxonomy
A base taxonomy for the classification of multimodal systems was created in the
MIAMI-project [17]. As a basic principle there are three decisive degrees of freedom
for multimodal interaction:
• the degree of abstraction
• the manner (temporal) of application
• the fusion of the different modalities
Figure 4.9 shows the resulting classification space and the consequent four basic categories of multimodal applications, depending on the parameters data fusion (combined/independent) and temporal usage (sequential/parallel). The exclusive case is the simplest variant of a multimodal system: such a system supports two or more interaction channels, but there is no temporal or content-related connection. A sequential application of the modalities with functional cohesion is called alternative multimodality. Besides sequential use of the modalities there is the possibility of parallel operation of different modalities, as seen in figure 4.9. Thereby,
collective in one HMM. A great problem of early fusion is the large data amount
necessary for the training of the utilized HMMs.
• Late (semantic) fusion: Multimodal systems which use late fusion consist of
several single mode recognition devices as well as a downstream data fusion
device. This approach contains a separate preprocessing, feature extraction and
decision level for each separate modality. The results of the separate decision levels are fused into a total result. For each classification process, each discrete decision level delivers a probability or confidence result for the choice of a class n. These confidence results are afterwards fused, for example by an appropriate linear combination. The advantage of this approach is that the different recognition devices can be realized independently; the acquisition of multimodal data sets is therefore not necessary, and the separate recognition devices are trained with monomodal data sets. Because new recognizers are easy to integrate, systems that use late fusion scale up more easily than early fusion, both in the number of modalities and in the size of the command set [56].
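The weighted linear combination of per-modality confidences can be sketched as follows; the class names, confidence scores, and weights are invented for illustration.

```python
# Late (semantic) fusion sketch: each modality's classifier outputs a
# confidence per class; a weighted linear combination fuses them.

def late_fusion(scores_per_modality, weights):
    """scores_per_modality: list of {class: confidence}; weights: one per modality."""
    classes = set().union(*scores_per_modality)
    fused = {c: sum(w * scores.get(c, 0.0)
                    for w, scores in zip(weights, scores_per_modality))
             for c in classes}
    # Return the class with the highest fused confidence plus all fused scores.
    return max(fused, key=fused.get), fused

speech = {"open": 0.7, "close": 0.3}
gesture = {"open": 0.4, "close": 0.6}
best, fused = late_fusion([speech, gesture], weights=[0.6, 0.4])
# best == "open": 0.6*0.7 + 0.4*0.4 = 0.58 vs. 0.6*0.3 + 0.4*0.6 = 0.42
```

Because each recognizer is trained monomodally, swapping in a new modality only means adding one more dictionary and weight.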
• Soft decision fusion: A compromise between early and late fusion is the so-called soft decision fusion. In this method the confidence of each classifier is taken into account, as well as an N-best list from each classifier.
• Statistical Integration:
Every input device produces an N-best list with recognition results and probabilities. The statistical integrator produces a probability for every meaningful combination from the different N-best lists by multiplying the individual probabilities. The multimodal command with the highest probability is then chosen as the recognized command.
In a multimodal system with K input devices, we assume that R_{i,j}, j = 1, ..., N_i, with probabilities P_{i,j} = P_i(j), are the N_i possible recognition results from interface number i, i ∈ {1, ..., K}. The statistical integrator calculates the product of each combination of probabilities:
C_n , \quad n = 1, \dots, \prod_{i=1}^{K} N_i    (4.24)

P_n = \prod_{i=1}^{K} P_{i,j_i} , \quad j_i \in \{1, \dots, N_i\}    (4.25)
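Eqs. (4.24) and (4.25) can be illustrated with a small sketch. The commands, probabilities, and the validity predicate below are invented; "meaningful" combinations would come from the application's semantics.

```python
from itertools import product

# Statistical integration sketch: form every combination of the K recognizers'
# N-best entries, multiply the probabilities, keep the best meaningful one.

def integrate(nbest_lists, meaningful):
    best, best_p = None, -1.0
    for combo in product(*nbest_lists):           # cross product of N-best lists
        labels = tuple(label for label, _ in combo)
        if not meaningful(labels):                # skip semantically invalid combos
            continue
        p = 1.0
        for _, prob in combo:
            p *= prob                             # P_n = product of the P_{i,j_i}
        if p > best_p:
            best, best_p = labels, p
    return best, best_p

speech = [("put", 0.6), ("get", 0.4)]
pointing = [("there", 0.7), ("here", 0.3)]
# Suppose "get there" is not a valid multimodal command:
valid = lambda labels: labels != ("get", "there")
best, p = integrate([speech, pointing], valid)
# best == ("put", "there") with p = 0.6 * 0.7 = 0.42
```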
• Hybrid Processing:
In hybrid architectures, symbolic unification-based techniques that integrate feature structures are combined with statistical approaches. In contrast to purely symbolic approaches, these architectures function very robustly. Associative Mapping and Members-Teams-Committee (MTC) are two main techniques. They develop and optimize the following factors with a statistical approach: the mapping structure between multimodal commands and their respective constituents, and the manner of combining posterior probabilities.
The associative mapping defines all semantically meaningful relations between the different input modes. It supports a table lookup process that excludes from consideration those feature structures that cannot possibly be unified semantically. This table can be built by the user or automatically. The associative mapping approach basically helps to exclude conflicting inputs from the different recognizers and to quickly rule out impossible combinations.
Members-Teams-Committee is a hierarchical technique with multiple members, teams, and committees. Every recognizer is represented by a member, and every member reports its results to the team leader. The team leader weights the results with its own function and reports them to the committee. Finally, the committee chooses the best team and reports the recognition result to the system. The weightings at each level have to be trained.
• Rule-based Integration:
Rule-based approaches are well established and applied in many integration applications. They are similar to temporal approaches, but are not strictly bound to timing constraints. The combination of different inputs is given by rules or look-up tables. For example, in a look-up table every combination is rated with a score: redundant inputs are represented with high scores, conflicting information with negative scores. This makes it possible to profit from redundant and complementary information while excluding conflicting information.
A problem with rule-based integration is complexity: many of the rules have preconditions in other rules, and with an increasing number of commands these preconditions lead to exponential complexity. The rules are highly specialized and bound to the domains they were developed for. Moreover, application knowledge has to be integrated into the design process to a high degree. Thus, rule-based systems are hard to scale up.
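A minimal sketch of such a look-up table follows; the commands, scores, and rejection threshold are invented for illustration, and unknown combinations are rejected by default.

```python
# Rule-based integration sketch: a look-up table scores input combinations;
# redundant pairs get high scores, conflicting ones negative scores.

SCORES = {
    ("say:open", "point:window"): 2,    # complementary -> profit from both
    ("say:open", "gesture:open"): 3,    # redundant -> high score
    ("say:open", "gesture:close"): -5,  # conflicting -> exclude
}

def combine(a, b, threshold=0):
    # Look the pair up in either order; unknown pairs score 0.
    score = SCORES.get((a, b), SCORES.get((b, a), 0))
    # Accept only combinations scoring above the threshold.
    return (a, b) if score > threshold else None

accepted = combine("say:open", "gesture:open")   # redundant pair -> accepted
rejected = combine("say:open", "gesture:close")  # conflicting pair -> None
```

The exponential blow-up noted above shows directly: the table needs an entry (or a rule deriving one) for every command pair the domain allows.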
• Temporal Aspects:
Overlapped inputs, or inputs that fall within a specific period of time are com-
bined by the temporal integrator. It checks if two or more signals from different
input devices occur in a specific period of time. The main use for temporal in-
tegrators is to determine whether a signal should be interpreted on its own or in
combination with other signals. Programmed with results from user studies, the system can exploit knowledge of how users react in general. Temporal integration consists of two parts, microtemporal and macrotemporal integration:
Microtemporal integration is applied to combine related information from vari-
ous recognizers in parallel or in a pseudo-parallel manner. Overlapped redundant
or complementary inputs are fused and a system command or a sequence of com-
mands is generated.
Macrotemporal integration is used to combine related information from the
various recognizers in a sequential manner. Redundant or complementary inputs,
which do not directly overlap each other, but fall together in one timing window
are fused by macrotemporal integration. The macrotemporal integrator has to
be programmed with results from user studies to determine which inputs in a
specified timing window belong together.
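Micro- and macrotemporal integration reduce to a check on time spans. A sketch follows; the 1.5 s window is an invented placeholder for a value that would come from the user studies mentioned above.

```python
# Temporal integration sketch: fuse two inputs when their time spans overlap
# (microtemporal) or fall within one timing window (macrotemporal).

MACRO_WINDOW = 1.5  # seconds; placeholder for a value from user studies

def overlap(a, b):
    """a, b: (start, end) time spans in seconds."""
    return a[0] <= b[1] and b[0] <= a[1]

def fuse_decision(a, b):
    if overlap(a, b):
        return "microtemporal"            # parallel / overlapped inputs
    gap = max(a[0], b[0]) - min(a[1], b[1])   # time between the two spans
    if gap <= MACRO_WINDOW:
        return "macrotemporal"            # sequential but within the window
    return "separate"                     # interpret each input on its own

speech = (0.0, 1.0)       # e.g. "put that there"
pointing = (0.8, 1.2)     # overlaps the speech
late_tap = (2.2, 2.4)     # 1.2 s after the speech ends
micro = fuse_decision(speech, pointing)   # -> "microtemporal"
macro = fuse_decision(speech, late_tap)   # -> "macrotemporal"
```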
Generally, we speak of an error in HCI if the user does not reach his or her desired goal and mere chance cannot be held responsible for it. Independent of the domain, error robustness substantially influences the user acceptance of a technical system and is therefore indispensable. Basically, errors can never be avoided completely. Thus, both passive (a-priori) error avoidance and active (a-posteriori) error handling are of great importance in this field. Undesired system reactions due to faulty operation as well as system-internal errors must be avoided as far as possible, and error resolution must be efficient, transparent, and robust. The following scenario shows a familiar error-prone situation: A driver wants to change the current radio station using the speech command "listen to hot radio". The speech recognizer misinterprets the command as "stop radio" and the radio stops playing.
4.3.4.1 Error Classification
For a systematic classification of error types, we basically assume that either the user
or the system can cause an error in the human-machine communication process.
4.3.4.2 User Specific Errors
The user interacting with the system is one error source. According to J. Reason [42], user-specific errors can be categorized into three levels:
• Errors on the skill-based level (e.g., slipping from a button)
The skill-based level comprises smooth, automated, and highly integrated
routine actions that take place without conscious attention or control. Human
performance is governed by stored patterns of pre-programmed instructions
represented as analog structures in a time-space domain. Errors at this level
are related to the intrinsic variability of force, space, or time coordination.
Sporadically, the user checks whether the initiated action runs as planned, and whether the plan for reaching the intended goal is still adequate. Error
patterns on skill-based level are execution or memory errors that result from
inattention or overattention of the user.
• Errors on the rule-based level (e.g., using a valid speech command that is not permitted in this mode, but only in another one)
Concerning errors on the rule-based level, the user violates stored prioritized
rules (so-called productions). Errors are typically associated with the misclas-
sification of situations leading to the application of the wrong rule or with the
incorrect recall of procedures.
4.4.1 Background
Within this section we motivate emotion recognition and the modalities chosen here. We also introduce models of emotion and discuss databases.
4.4.1.1 Application Scenarios
Even though button pressing starts to be substituted by more natural communica-
tion forms such as talking and gesturing, human-computer communication still feels
somehow impersonal, insensitive, and mechanical. A comparative glance at human-human communication reveals that the extra information humans sense concerning the affective state of their counterpart is missing. This emotional information strongly modulates the explicit information recognized by today's human-machine communication systems, and as communication becomes increasingly natural, users will expect it to be respected. Throughout the design of next-generation man-machine interfaces, inclusion of this implicit channel therefore seems obligatory [10].
Automatic emotion recognition is nowadays already introduced experimentally
in call centers, where an annoyed customer is handed over from a robot to a human
call operator [11, 25, 39], and in first commercial lifestyle products as fun software
intended to detect lies, or the stress or love level of a caller. Beyond these, more general fields of application are improved comprehension of the user's intention, emotional accommodation in communication (e.g. adaptation of acoustic parameters for speech synthesis if a user seems sad), behavioral observation (e.g. whether an airplane passenger seems aggressive), objective emotional measurement (e.g. as a guideline for therapists), transmission of emotion (e.g. sending laughing or crying images within text-based emails), affect-related multimedia retrieval (e.g. highlight spotting in a sports event), and affect-sensitive lifestyle products (e.g. a trembling cross-hair in video games if the player seems nervous) [3, 10, 38].
4.4.1.2 Modalities
Human emotion is observable in a number of different modalities. First attempts at automatic recognition applied invasive measurements of, e.g., skin conductivity, heart rate, or temperature [40]. While exploiting this information source provides a reliable estimation of the underlying affect, it is often felt to be uncomfortable and unnatural, as the user needs to be wired or at least stay in touch with a sensor. Modern emotion recognition systems therefore focus on video- or audio-based non-invasive approaches in the style of human emotion recognition: it is claimed that we communicate 55% visually through body language, 38% through the tone of our voice, and 7% through the actual spoken words [32]. In this respect the most promising approach clearly seems to be a combination of these sources. However, in some systems and situations only one may be available.
Interestingly, contrary to most other modalities, speech allows the user to control the amount of emotion shown, which may play an important role if the user otherwise feels overly observed. Speech-based emotion recognition already provides reasonable results today; nevertheless, visual information certainly helps to enable a more robust estimation [38]. From an economic point of view, a microphone is standard hardware in many HCI systems today, and cameras are becoming more and more common, e.g. in today's cellular phones. In this respect we want to give an insight into acoustic, linguistic, and vision-based information analysis in search of affect within this chapter, and provide solutions for fusing them.
4.4.1.3 Emotion Model
Prior to recognizing emotion one needs to establish an underlying emotion model.
In order to obtain a robust recognition performance it seems reasonable to limit the
complexity of the model, e.g. kind and number of emotion labels used, in view of the
target application. This may be one of the reasons why no consensus about such a model exists in technical approaches yet. Two fundamentally different views dominate the scene. On the one hand, an emotion space is spanned by two or three orthogonal axes: firstly arousal or activation, reflecting the readiness to take some action; secondly valence or evaluation, considering a positive or negative attitude; and finally control or power, analyzing the speaker's dominance or submission [10]. While this approach provides a good basis for emotion synthesis, it is often too complex for concrete application scenarios. The better-known way therefore is to classify emotion by a limited set of discrete emotion tags. A first standard set of such labels exists within the MPEG-4 standard, comprising anger, disgust, fear, joy, sadness, and surprise [34]. In order to discriminate a non-emotional state, this set is often supplemented by neutrality. While this model is at odds with many psychological approaches, it provides a feasible basis from a technical point of view. However, further emotions such as boredom are also often used.
4.4.1.4 Emotional Databases
In order to train and test the intended recognition engines, a database of emotional samples is needed. Such a corpus should provide spontaneous and realistic emotional behavior from the field. The samples should have studio audio and video quality, but for analyzing robustness against noise, samples with known background noise conditions may also be desired. A database further has to contain a high number of ideally equally distributed samples for each emotion, both from the same person and from many different persons in total. These persons should form a balanced sample considering genders, age groups, and ethnic backgrounds, among other factors. For further variability, the uttered phrases should have different contents, lengths, or even languages. An unambiguous assignment of collected samples to emotion classes is especially hard in this discipline. Perception tests by human test persons are therefore very useful: as we know, it may be hard to rate someone's emotion for sure. In this respect it seems obvious that comparatively lower recognition rates are to be expected here than in related pattern recognition tasks. However, the named human performance provides a reasonable benchmark for the maximum expectation. Finally, a database should be made publicly available in view of international comparability, which is difficult considering the lacking consensus about the emotion classes used and the privacy of the test persons. A number of methods
exist to create a database, with arguably different strengths. The predominant ones are acting or eliciting emotions in test set-ups, hidden or conscious long-term observation, and the use of clips from public media content. Most databases use acted emotions, which allow all of the named requirements to be fulfilled except spontaneity; there is doubt whether acted emotions are capable of representing the true characteristics of affect. Still, they provide a reasonable starting point, considering that databases of real emotional speech are hard to obtain. An overview of existing speech databases can be found in [54]. Among the most popular ones are the Danish Emotional Speech database (DES), the Berlin Emotional Speech Database (EMO-DB), and the AIBO Emotional Speech Corpus (AEC) [4]. Audio-visual databases, however, are still sparse, especially in view of the named requirements.
Basically, two main information sources are exploited for emotion recognition from speech: the acoustic information, analyzing the prosodic structure, and the spoken content itself, namely the language information. The predominant aims, besides high reliability, are independence of the speaker, of the spoken language, of the spoken content (in the case of acoustic processing), and of the background noise. Besides the underlying emotion model and the database size and quality, a number of further parameters strongly influence performance in these respects and will be addressed in the following: the signal capturing, pre-processing, feature selection, classification method, and a reasonable integration into the interaction and application context.
4.4.2.1 Feature Extraction
In order to estimate a user's emotion from acoustic information, one has to carefully select suitable features. These have to carry information about the transmitted emotion, but they also need to fit the chosen modeling, i.e. the classification algorithms. Feature sets used in existing works differ greatly, but the feature types used in acoustic emotion recognition may be divided into prosodic features (e.g. intensity, intonation, durations), voice quality features (e.g. positions and bandwidths of formants 1–7, the harmonics-to-noise ratio (HNR), spectral features, 12–15 Mel Frequency Cepstral Coefficients (MFCC)), and articulatory ones (e.g. the spectral centroid, and harder-to-compute ones such as the centralization of vowels).
In order to calculate these, the speech signal is first weighted with a shifting soft window function (e.g. a Hamming window) of 10–30 ms length with a window overlap of around 50%. This is a common procedure in speech processing and is needed because the speech signal is only quasi-stationary, i.e. approximately stationary within such short frames. Next, a contour value is computed for every frame and every contour, leading to a multivariate time series. For intensity, mostly the simple logarithmic frame energy is computed; it should be mentioned, however, that this does not respect human loudness perception. Spectral analysis mostly relies on the Fast Fourier Transform or MFCC – a standard homomorphic spectral transformation in speech processing aiming at de-convolution of the vocal tract transfer function and at perceptual modeling by the Mel frequency scale. First problems arise here, as the remaining feature contours – pitch, HNR, or formants – can only be estimated. Pitch and HNR can either be derived from the spectrum or – more popularly – by peak search within the autocorrelation function of the speech signal. Formants may be obtained by analysis of the Linear Prediction Coefficients (LPC), which we will not go into. Often backtracking by means of dynamic programming is used to ensure smooth feature contours, minimizing global rather than local costs. Mostly, higher-order derivatives such as speed and acceleration are also included to better model temporal changes. In any case, low-pass filtering of the contours – e.g. with moving average or median filters – leads to a gain by noise reduction [34, 46].
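The framing and contour computation described above can be sketched in plain NumPy. This is a minimal illustration, not the chapter's implementation: the frame length, the autocorrelation search band, and the filter width are assumed example values.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25, overlap=0.5):
    """Split a signal into overlapping Hamming-windowed frames (10-30 ms typical)."""
    n = int(sr * frame_ms / 1000)             # samples per frame
    hop = int(n * (1 - overlap))              # ~50% overlap
    frames = [x[i:i + n] for i in range(0, len(x) - n + 1, hop)]
    return np.array(frames) * np.hamming(n)

def log_energy(frames):
    """Simple logarithmic frame energy (does not model loudness perception)."""
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def acf_pitch(frames, sr, fmin=50, fmax=500):
    """Pitch per frame via peak search in the autocorrelation function."""
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible pitch lag range
    f0 = []
    for f in frames:
        acf = np.correlate(f, f, mode='full')[len(f) - 1:]
        lag = lo + np.argmax(acf[lo:hi])
        f0.append(sr / lag)
    return np.array(f0)

def median_smooth(contour, k=5):
    """Low-pass (median) filtering of a feature contour to reduce noise."""
    pad = k // 2
    padded = np.pad(contour, pad, mode='edge')
    return np.array([np.median(padded[i:i + k]) for i in range(len(contour))])
```

For a pure 100 Hz tone sampled at 16 kHz, the autocorrelation of each frame peaks near a lag of one period, so `acf_pitch` recovers a contour close to 100 Hz.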
Next, two fundamentally different approaches exist for the further acoustic feature processing in view of the succeeding classification: dynamic and static modeling. Within the dynamic approach, the raw feature contours, e.g. the pitch or intensity contours, are directly analyzed frame-wise by methods capable of handling multivariate time series, such as dynamic programming approaches (e.g. Hidden Markov Models (HMM) [34] or Dynamic Bayesian Networks (DBN)).
Emotions from Speech and Facial Expressions 179
The second way, by far the more popular, is to systematically derive functionals from the time series by means of descriptive statistics. Mostly used are moments such as the mean and standard deviation, or extrema and their positions. Zero-crossing rates (ZCR), the number of turning points, and others are also often considered. As temporal information is thereby mostly lost, duration features are included, such as the mean length of pauses or voiced sounds; these, however, are more complex to estimate. All features should generally be normalized, either by mean subtraction and division by the standard deviation, or by the maximum, as some classifiers are susceptible to differing value ranges.
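A minimal sketch of such functional extraction and normalization follows; the functional set shown is a small illustrative subset of what is used in practice.

```python
import numpy as np

def functionals(contour):
    """Derive static features (descriptive statistics) from one feature contour."""
    c = np.asarray(contour, dtype=float)
    diffs = np.diff(c)
    return {
        'mean': c.mean(),
        'std': c.std(),
        'max': c.max(), 'max_pos': int(np.argmax(c)) / len(c),
        'min': c.min(), 'min_pos': int(np.argmin(c)) / len(c),
        # zero-crossing rate of the mean-free contour
        'zcr': np.mean(np.abs(np.diff(np.sign(c - c.mean()))) > 0),
        # turning points: sign changes of the first difference
        'turning_points': int(np.sum(np.diff(np.sign(diffs)) != 0)),
    }

def z_normalize(X):
    """Mean subtraction and division by the standard deviation, per feature."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-10)
```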
In a direct comparison under constant test conditions, the static features outperformed the dynamic approach in our studies [46]. This is largely due to the unsatisfactory independence of the overall contour from the spoken content. In the following we therefore focus on the static approach.
4.4.2.2 Feature Selection
Now that we have generated a high-order multivariate time series (approx. 30 dimensions), included delta regression coefficients (approx. 90 dimensions in total), and started to derive static features in a deterministic way, we end up with a dimensionality (>300 features) too high for most classifiers to handle, especially considering the typically sparse databases in this field. Also, we cannot expect every generated feature to actually carry important and non-redundant information about the underlying affect, while its extraction still costs effort. We therefore aim at a dimensionality reduction by feature selection (FS) methods.
An often chosen approach is to use Principal Component Analysis (PCA) to construct superposed features out of all features and select the ones with the highest eigenvalues, corresponding to the highest variance [8]. This, however, does not save the original extraction effort, as all features are still needed for the computation of the artificial ones. A genuine reduction of the original features should therefore be favored. It can be achieved, e.g., by single-feature relevance calculation (e.g. information gain ratio based on entropy), known as filter-based selection. Still, the best single features do not necessarily form the best set. This is why so-called wrapper-based selection methods usually deliver better overall results at lower dimensionality. The term wrapper alludes to the fact that the target classifier itself is used as the optimization target function, which optimizes not only the feature set but the compound of features and classifier as a whole. A search function is needed, as exhaustive search over all subsets is intractable (the subset selection problem is NP-hard). Mostly applied are hill-climbing search methods (e.g. Sequential Forward Search (SFS) or Sequential Backward Search (SBS)), which start from an empty or full feature set and step-wise add the most important or remove the least relevant feature. If this is done in a floating manner, we obtain the Sequential Floating Search Methods (SFSM) [48]. Several other powerful methods exist, among which especially genetic search proves effective [55].
After such reduction, the feature vector may shrink to approximately 100 features and is classified in the next step.
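Wrapper-based Sequential Forward Search can be sketched as follows. The nearest-centroid classifier serves only as a stand-in optimization target chosen for brevity, and scoring against a held-out set is an assumption of this illustration.

```python
import numpy as np

def nearest_centroid_accuracy(X, y, feats, Xtest, ytest):
    """Stand-in target classifier: nearest class centroid on the selected features."""
    cents = {c: X[y == c][:, feats].mean(axis=0) for c in np.unique(y)}
    pred = [min(cents, key=lambda c: np.linalg.norm(x[feats] - cents[c]))
            for x in Xtest]
    return np.mean(np.array(pred) == ytest)

def sfs(X, y, Xtest, ytest, k):
    """Sequential Forward Search: greedily add the feature that most
    improves the wrapped classifier's accuracy, until k features are chosen."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = [(nearest_centroid_accuracy(X, y, selected + [f], Xtest, ytest), f)
                  for f in remaining]
        best_score, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```

SBS is the mirror image: it starts from the full set and greedily removes the least relevant feature.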
Now, in order to calculate P(ei | wj), we can use simple Maximum Likelihood Estimation (MLE):

PMLE(ei | wj) = TF(wj, ei) / TF(wj, E)    (4.28)
Let TF(wj, ei) denote the frequency of occurrence of the term wj tagged with the emotion ei within the whole training corpus; TF(wj, E) then is the total frequency of occurrence of this particular term. One problem is that term/emotion pairs that never occur lead to a probability of zero. As this is critical within the overall calculation, we assume that every word has a general base count λ for every emotion (additive smoothing):
Pλ(ei | wj) = (TF(wj, ei) + λ) / (TF(wj, E) + λ · |E|),   λ ∈ [0, 1]    (4.29)
If λ equals one, this is also known as Maximum a-Posteriori (MAP) estimation.
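Equations (4.28) and (4.29) translate directly into a few lines; the training-corpus format (a list of word/emotion pairs) is an assumption made for this sketch.

```python
from collections import Counter

def emotion_given_word(corpus, lam=1.0):
    """Estimate P(e|w) from (word, emotion) training pairs with additive
    smoothing; lam=0 gives plain MLE (Eq. 4.28), lam=1 the smoothed
    estimate of Eq. 4.29."""
    tf_we = Counter(corpus)                   # TF(w_j, e_i)
    tf_w = Counter(w for w, _ in corpus)      # TF(w_j, E)
    emotions = sorted({e for _, e in corpus})
    return {
        w: {e: (tf_we[(w, e)] + lam) / (tf_w[w] + lam * len(emotions))
            for e in emotions}
        for w in tf_w
    }
```

With smoothing, a word/emotion pair that never occurred in training still receives a small nonzero probability, and the estimates per word sum to one.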
4.4.3.2 Bag-of-Words
The so-called Bag-of-Words method is a standard representation of text in automatic document categorization [21] that can also be applied to emotion recognition [45]. Each word wk ∈ V in the vocabulary V adds a dimension to a linguistic vector x, representing the logarithmic term frequency within the actual utterance U, known as logTF (other calculation methods exist, which we will not present here). Thus a component xlogTF,j of the vector xlogTF with dimension |V| can be calculated as:
xlogTF,j = log( TF(wj, U) / |U| )    (4.30)
As can be seen in the equation, the term frequency is normalized by the phrase length. Since a high dimensionality may decrease classifier performance, and since inflections of terms reduce performance especially on small databases, stopping, stemming, or further feature reduction methods are mandatory. Classification can now be carried out as with the acoustic features; preferably, one would choose SVM for this task, as they are well known to perform strongly here [21]. As with uni-grams, word order is not modeled by this approach. One advantage, however, is the possibility of directly including the linguistic features within the acoustic feature vector.
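Equation (4.30) as a sketch; since the logarithm is undefined for a zero term frequency, components of vocabulary terms absent from the utterance are set to zero here by convention (an assumption, as the text does not specify this case).

```python
import math

def log_tf_vector(utterance, vocabulary):
    """Bag-of-Words logTF vector of dimension |V| (Eq. 4.30); components of
    terms absent from the utterance are set to 0.0 by convention."""
    words = utterance.lower().split()
    return [math.log(words.count(w) / len(words)) if w in words else 0.0
            for w in vocabulary]
```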
4.4.3.3 Phrase Spotting
As mentioned, a key drawback of the approaches so far is the lacking view of the whole utterance. Consider the negation in the following example: ”I do not feel too good at all,” where the positively perceived term ”good” is negated. Therefore, Bayesian Networks (BN) may be used as a mathematical framework for the semantic analysis of spoken utterances, taking advantage of their capabilities in spotting and handling uncertain and incomplete information [47]. Each BN consists of a set of N nodes related to state variables Xi, each comprising a finite set of states. The nodes are connected by directed edges reaching from parent to child nodes and expressing quantitatively the conditional probabilities of nodes given their parent nodes. A complete representation of the network structure and conditional probabilities is provided by the joint probability distribution:
P(X1, ..., XN) = ∏i=1..N P(Xi | parents(Xi))    (4.31)
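The factorization in Eq. (4.31) can be evaluated generically over a table of conditional probabilities; the two-node negation/valence net used in the test is purely hypothetical.

```python
def joint_probability(assignment, cpts, parents):
    """Joint P(X1..XN) as the product of P(Xi | parents(Xi)) (Eq. 4.31).
    cpts[X] maps (value_of_X, tuple_of_parent_values) -> probability;
    parents[X] lists X's parent variables (absent means no parents)."""
    p = 1.0
    for var, cpt in cpts.items():
        parent_vals = tuple(assignment[q] for q in parents.get(var, ()))
        p *= cpt[(assignment[var], parent_vals)]
    return p
```

For instance, in a toy net Negation → Valence, the joint P(Negation=true, Valence=neg) is simply P(true) · P(neg | true).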
For a human observer, the visual appearance of a person provides rich information about his or her emotional state [32]. Several sources can be identified, such as body pose (upright, slouched), hand gestures (waving about, folded arms), head gestures (nodding, inclining), and – especially in direct close conversation – the variety of facial expressions (smiling, surprised, angry, sad, etc.). Very few approaches exist towards an affective analysis of body pose and gestures, while several works report considerable efforts in investigating various methods for facial expression recognition. We therefore concentrate on the latter in this section, provide background information, necessary pre-processing stages, and algorithms for affective face analysis, and give an outlook on forthcoming developments.
4.4.4.1 Prerequisites
Similar to the acoustic analysis of speech for emotion estimation, the task of Facial Expression Recognition can be regarded as a common pattern recognition problem with the familiar stages of preprocessing, feature extraction, and classification. The signal addressed, i.e. the camera view of a face, contains much information that is neither dependent on nor relevant to the expression: mainly ethnic and inter-cultural differences in the way feelings are expressed, inter-personal differences in the look of the face, gender, age, facial hair, hair cut, glasses, orientation of the face, and direction of gaze. All these influences act as disturbing noise with respect to the target source, the facial expression, and the aim is to reduce the impact of all noise sources while preserving the relevant information. This task constitutes a considerable challenge located in the preprocessing and feature extraction stages, as explained in the following.
From the technical point of view, another disturbing source should be minimized: the variation of the position of the face in the camera image. The preprocessing therefore comprises a number of modules for robust head localization and estimation of the face orientation. As a matter of fact, facial expression recognition algorithms perform best when the face is localized most accurately with respect to the person's eyes – where "best" addresses both quality and execution time of the method. However, eye localization is computationally even more complex than face localization. For this reason a layered approach is proposed, in which the search area for the eyes is limited to the output hypotheses of the preceding face localizer.
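The layered face-then-eyes strategy can be sketched independently of any concrete detector: the detectors are injected as callables (in practice, e.g., cascade classifiers from a vision library), and restricting the eye search to the upper half of the face hypothesis is an illustrative heuristic, not the chapter's exact rule.

```python
def eye_search_region(face_box):
    """Restrict the (more expensive) eye search to the upper half of a
    face hypothesis given as (x, y, w, h). The upper-half heuristic is
    an assumption made for illustration."""
    x, y, w, h = face_box
    return (x, y, w, h // 2)

def locate_eyes(image, face_detector, eye_detector):
    """Layered localization: run the face detector on the full image, then
    the eye detector only inside each face hypothesis. Both detectors are
    injected callables returning lists of (x, y, w, h) boxes."""
    hits = []
    for face in face_detector(image):
        roi = eye_search_region(face)
        for eye in eye_detector(image, roi):
            hits.append((face, eye))
    return hits
```

Because the eye detector only ever sees the reduced regions, the overall pipeline stays fast while the accurate eye positions refine the coarse face hypotheses.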
Automatic Facial Expression Recognition has still not arrived in real-world scenarios and applications. Existing systems postulate a number of restrictions regarding camera hardware, lighting conditions, facial properties (e.g. glasses, beard), allowed head movements of the person, and the view of the face (frontal, profile). Different approaches show different robustness with respect to these parameters. In the following we give an insight into the basic ideas applied to automatic facial expression analysis. Many of them were derived from the task of face recognition, i.e. the visual identification or authentication of a person based on his or her face. The applied methods can generally be categorized into holistic and non-holistic, or analytic, ones [37].
4.4.4.2 Holistic Approaches
Holistic methods (Greek: holon = the whole, all parts together) strive to process the entire face as is, without incorporating expert knowledge such as geometric properties or special regions of interest for expression analysis.
One approach is to extract a comprehensive and accordingly large set of features from the luminance representation or textures of the face image. For example, Gabor-
a set of faces expressing all categories of facial expressions. Subsequently, we apply the eigenvector analysis and extract representative sets of weights for each expression class to be addressed. During classification of an unknown face, the weights for its image are computed accordingly and then compared to the weight vectors of the training set.
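A compact eigenface-style sketch of this training and classification scheme, using an SVD for the eigenvector analysis; the data layout (one vectorized face image per row) and nearest-neighbor matching in weight space are assumptions of this illustration.

```python
import numpy as np

def train_eigenfaces(faces, k):
    """PCA on vectorized face images: returns the mean face, the top-k
    eigenvectors ("eigenfaces"), and the training weights (projections)."""
    X = np.asarray(faces, dtype=float)
    mean = X.mean(axis=0)
    A = X - mean
    # eigenvectors of the covariance matrix via SVD (rows of Vt)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    eigfaces = Vt[:k]
    return mean, eigfaces, A @ eigfaces.T

def classify(face, mean, eigfaces, train_weights, labels):
    """Project an unknown face into weight space and return the label of
    the nearest training weight vector."""
    w = (np.asarray(face, dtype=float) - mean) @ eigfaces.T
    d = np.linalg.norm(train_weights - w, axis=1)
    return labels[int(np.argmin(d))]
```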
4.4.4.3 Analytic Approaches
These methods concentrate on the analysis of dominant regions of interest. Pre-existing knowledge about geometry and facial movements is incorporated, so that subsequent statistical methods benefit [12]. Ekman and Friesen introduced the so-called Facial Action Coding System in 1978, which consists of 64 Action Units (AU). Presuming that all facial muscles are relaxed in the neutral state, each AU models the contraction of a certain set of them, leading to deformations in the face. The focus lies on the predominant facial features, such as the eyes, eyebrows, nose, and mouth; their shape and appearance contain most of the information about the facial expression. One approach for analyzing shape and appearance of facial features is known as the Point Distribution Model (PDM). Cootes and Taylor gave a comprehensive introduction to the theory and implementation aspects of PDM. Subclasses of PDM are Active Shape Models (ASM) and Active Appearance Models (AAM), which have shown their applicability to facial expression analysis [20].
Shape models are statistical descriptions of the two-dimensional relations between landmarks positioned on dominant edges in face images. These relations are freed of all similarity transformations, i.e. rotation, scaling, and translation. The shape variation that occurs due to varying inter-personal proportions is modeled by a PCA of the landmark displacements observed during the training phase. The landmark search starts with an initial estimate, where the shape is placed over a face manually, or automatically when the preprocessing stages allow for it. During the search, the landmarks are moved towards nearby edges in an iterative procedure that measures the similarity of the edges under the shape to the model. Appearance Models additionally investigate the textures, or gray-value distributions, over the face and combine this knowledge with the shape statistics.
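A reduced shape-model sketch: the alignment below removes only translation and scale (a full Procrustes alignment, as used for ASM, would also remove rotation), followed by a PCA of the aligned landmark coordinates.

```python
import numpy as np

def align(shape):
    """Remove translation and scale from a landmark shape (n x 2 array);
    rotation removal (full Procrustes) is omitted in this sketch."""
    s = np.asarray(shape, dtype=float)
    s = s - s.mean(axis=0)
    return s / (np.linalg.norm(s) + 1e-10)

def train_shape_model(shapes, k):
    """PCA of aligned landmark coordinates: mean shape plus k modes of
    variation, via SVD of the centered training matrix."""
    X = np.array([align(s).ravel() for s in shapes])
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def synthesize(mean, modes, b):
    """Generate a shape from mode parameters b (b = 0 gives the mean shape)."""
    return (mean + np.asarray(b) @ modes).reshape(-1, 2)
```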
As mentioned before, the research community investigates a broad range of approaches and combinations of different holistic and analytic methods in order to arrive at algorithms that are robust to the broad range of different persons and real-life situations. One of the major problems still lies in the immense computational effort; applications will possibly have to be distributed over multiple CPUs and include the power of GPUs to achieve real-time capability.
Finally, we aim to fuse the acoustic, linguistic, and vision information obtained. This integration (see also Sect. 4.3.3) is often done in a late semantic manner as a (weighted) majority vote [25]. More elegant, however, is the direct fusion of the streams within one feature vector [45], known as early feature fusion; its advantage is that less knowledge is lost prior to the final decision. A compromise between these two is the so-called soft decision fusion, whereby the confidence of each classifier is also respected. The integration of an N-best list of each classifier is possible as well. A problem, however, is the synchronization of the video, acoustic, and linguistic audio streams. Especially if audio is classified on a global word or utterance level, it may become difficult to find video segments that correspond to these units. Hence, dynamic modeling of the audio may be preferred in view of early fusion with a video stream.
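The fusion variants can be contrasted in a few lines; the confidence-weighted average shown for soft decision fusion is one simple realization, not the specific scheme of the works cited above.

```python
import numpy as np

def early_fusion(acoustic, linguistic):
    """Early feature fusion: concatenate the streams into one feature
    vector before a single classification."""
    return np.concatenate([acoustic, linguistic])

def soft_decision_fusion(posteriors, confidences):
    """Late soft-decision fusion: confidence-weighted average of the
    per-classifier class posteriors. Returns (winning class index,
    fused posterior). Weighted majority voting is the special case of
    one-hot posteriors."""
    P = np.asarray(posteriors, dtype=float)   # (n_classifiers, n_classes)
    w = np.asarray(confidences, dtype=float)
    fused = w @ P / w.sum()
    return int(np.argmax(fused)), fused
```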
4.4.6 Discussion
References
1. When Do We Interact Multimodally? Cognitive Load and Multimodal Communication Patterns. 2004.
2. Ang, J., Dhillon, R., Krupski, A., Shriberg, E., and Stolcke, A. Prosody-Based Automatic
Detection of Annoyance and Frustration in Human-Computer Dialog. In Proceedings of
the International Conference on Speech and Language Processing (ICSLP 2002). Denver,
CO, 2002.
3. Arsic, D., Wallhoff, F., Schuller, B., and Rigoll, G. Video Based Online Behavior Detec-
tion Using Probabilistic Multi-Stream Fusion. In Proceedings of the International IEEE
Conference on Image Processing (ICIP 2005). 2005.
4. Batliner, A., Hacker, C., Steidl, S., Nöth, E., D’Arcy, S., Russell, M., and Wong, M. ’You Stupid Tin Box’ - Children Interacting with the AIBO Robot: A Cross-linguistic Emotional Speech Corpus. In Proceedings of the LREC 2004. Lisboa, Portugal, 2004.
5. Benoit, C., Martin, J.-C., Pelachaud, C., Schomaker, L., and Suhm, B., editors. Audiovi-
sual and Multimodal Speech Systems. In: Handbook of Standards and Resources for Spo-
ken Language Systems-Supplement Volume. D. Gibbon, I. Mertins, R.K. Moore ,Kluwer
International Series in Engineering and Computer Science, 2000.
6. Bolt, R. A. “Put-That-There”: Voice and Gesture at the Graphics Interface. In Inter-
national Conference on Computer Graphics and Interactive Techniques, pages 262–270.
July 1980.
26. Lee, T. S. Image Representation Using 2D Gabor Wavelets. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 18(10):959–971, 1996.
27. Levin, E., Pieraccini, R., and Eckert, W. A stochastic Model of Human-Machine Interac-
tion for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing,
8(1):11–23, 2000.
28. Litman, D., Kearns, M., Singh, S., and Walker, M. Automatic Optimization of Dialogue
Management. In Proceedings of the 18th International Conference on Computational
Linguistics. Saarbrücken, Germany, 2000.
29. Maybury, M. T. and Stock, O. Multimedia Communication, including Text. In Hovy,
E., Ide, N., Frederking, R., Mariani, J., and Zampolli, A., editors, Multilingual Informa-
tion Management: Current Levels and Future Abilities. A study commissioned by the
US National Science Foundation and also delivered to European Commission Language
Engineering Office and the US Defense Advanced Research Projects Agency, 1999.
30. McGlaun, G., Althoff, F., Lang, M., and Rigoll, G. Development of a Generic Multi-
modal Framework for Handling Error Patterns during Human-Machine Interaction. In
SCI 2004, 8th World Multi-Conference on Systems, Cybernetics, and Informatics, Or-
lando, FL, USA. 2004.
31. McTear, M. F. Spoken Dialogue Technology: Toward the Conversational User Interface.
Springer Verlag, London, 2004. ISBN 1-85233-672-2.
32. Mehrabian, A. Communication without Words. Psychology Today, 2(4):53–56, 1968.
33. Nielsen, J. Usability Engineering. Academic Press, Inc., 1993. ISBN 0-12-518405-0.
34. Nogueiras, A., Moreno, A., Bonafonte, A., and Marino, J. Speech Emotion Recognition
Using Hidden Markov Models. In Eurospeech 2001 Poster Proceedings, pages 2679–
2682. Scandinavia, 2001.
35. Oviatt, S. Ten Myths of Multimodal Interaction. Communications of the ACM 42, 11:74–
81, 1999.
36. Oviatt, S., Cohen, P., Wu, L., Vergo, J., Duncan, L., Suhm, B., Bers, J., Holzman, T., Winograd, T., Landay, J., Larson, J., and Ferro, D. Designing the User Interface for Multimodal Speech and Pen-based Gesture Applications: State-of-the-Art Systems and Future Research Directions. Human Computer Interaction, 15(4):263–322, 2000.
37. Pantic, M. and Rothkrantz, L. Automatic Analysis of Facial Expressions: The State of
the Art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1424–
1445, 2000.
38. Pantic, M. and Rothkrantz, L. Toward an Affect-Sensitive Multimodal Human-Computer Interaction. Proceedings of the IEEE, 91:1370–1390, September 2003.
39. Petrushin, V. Emotion in Speech: Recognition and Application to Call Centers. In Pro-
ceedings of the Conference on Artificial Neural Networks in Engineering(ANNIE ’99).
1999.
40. Picard, R. W. Affective Computing. MIT Press, Massachusetts, 2nd edition, 1998. ISBN
0-262-16170-2.
41. Pieraccini, R., Levin, E., and Eckert, W. AMICA: The AT&T Mixed Initiative Conver-
sational Architecture. In Proceedings of the Eurospeech ’97, pages 1875–1878. Rhodes,
Greece, 1997.
42. Reason, J. Human Error. Cambridge University Press, 1990. ISBN 0521314194.
43. Sadek, D. and de Mori, R. Dialogue Systems. In de Mori, R., editor, Spoken Dialogues
with computers, pages 523–562. Academic Press, 1998.
44. Schomaker, L., Nijtmanns, J., Camurri, C., Morasso, P., and Benoit, C. A Taxonomy
of Multimodal Interaction in the Human Information Processing System. Report of ES-
PRIT III: Basic Research Project 8579, Multimodal Interface for Advanced Multimedia
Interfaces (MIAMI). Technical report, 1995.
45. Schuller, B., Müller, R., Lang, M., and Rigoll, G. Speaker Independent Emotion Recognition by Early Fusion of Acoustic and Linguistic Features within Ensembles. In Proceedings of the ISCA Interspeech 2005. Lisboa, Portugal, 2005.
46. Schuller, B., Rigoll, G., and Lang, M. Hidden Markov Model-Based Speech Emotion Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), volume II, pages 1–4. 2003.
47. Schuller, B., Rigoll, G., and Lang, M. Speech Emotion Recognition Combining Acoustic Features and Linguistic Information in a Hybrid Support Vector Machine - Belief Network Architecture. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2004), volume I, pages 577–580. Montreal, Quebec, 2004.
48. Schuller, B., Villar, R. J., Rigoll, G., and Lang, M. Meta-Classifiers in Acoustic and Linguistic Feature Fusion-Based Affect Recognition. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), volume 1, pages 325–329. Philadelphia, Pennsylvania, 2005.
49. Shneiderman, B. Designing the user interface: Strategies for effective human-computer
interaction (3rd ed.). Addison-Wesley Publishing, 1998. ISBN 0201694972.
50. Smith, W. and Hipp, D. Spoken Natural Language Dialog Systems: A Practical Approach.
Oxford University Press, 1994. ISBN 0-19-509187-6.
51. Tian, Y., Kanade, T., and Cohn, J. Evaluation of Gabor-wavelet-based Facial Action Unit Recognition in Image Sequences of Increasing Complexity. In Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition, pages 229–234. May 2002.
52. Turk, M. and Pentland, A. Face Recognition Using Eigenfaces. In Proc. of Conference
on Computer Vision and Pattern Recognition, pages 586–591. 1991.
53. van Zanten, G. V. User-modeling in Adaptive Dialogue Management. In Proceedings of
the Eurospeech ’99, pages 1183–1186. Budapest, Hungary, 1999.
54. Ververidis, D. and Kotropoulos, C. A State of the Art Review on Emotional Speech Databases. In Proceedings of the 1st Richmedia Conference, pages 109–119. Lausanne, Switzerland, 2003.
55. Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools with Java
Implementations. Morgan Kaufmann, San Francisco, CA, 2000. ISBN 1-558-60552-5.
56. Wu, L., Oviatt, S., and Cohen, P. Multimodal Integration - A Statistical View. IEEE Transactions on Multimedia, 1(4):334–341, 1999.
57. Young, S. Probabilistic Methods in Spoken Dialogue Systems. Philosophical Transac-
tions of the Royal Society, 358:1389–1402, 2000.
Chapter 5
Person Recognition and Tracking
192 Person Recognition and Tracking
access to a system is then not possible anymore or at least requires additional effort
and causes higher costs.
[Figure: recognition divided into identification and authentication; features divided into physiological and behavioral.]
Fig. 5.1. Biometric and non-biometric person recognition methods. In the following two sections face recognition and camera-based full-body recognition are discussed.
Hence, in recent years much attention has been paid to biometric systems, which are either based on physiological features (face, hand shape and hand vein pattern, fingerprint, retina pattern) or depend on behavioral patterns that are unique for each person to be recognized (gait, signature stroke, speech). These systems are mostly intrusive, because a minimum of active participation of the person to be identified is required: one has to place a thumb on the fingerprint sensor or look at a camera or scanner to get the necessary biometric features extracted properly. This condition is acceptable in many applications.
Non-intrusive recognition, in contrast, can be applied to surveillance tasks, where subjects are usually observed by cameras and it is not desired that they participate in the recognition process. A second advantage of non-intrusive over intrusive recognition is the increased comfort of such systems: if a person can be identified while heading towards a security door, putting a thumb on a fingerprint sensor or inserting a magnetic card into a slot becomes unnecessary.
Today, identification of a person from a distance is possible only by gait recognition, though this biometric method is itself still a research issue. Face recognition has the potential to fulfill this task as well, but further research is needed to develop
Face Recognition 193
reliable systems that recognize faces at a glance. This section cannot address all aspects of recognition; instead, selected topics of special interest are discussed.
Face recognition has evolved into commercial systems which control the access to restricted areas, e.g. in airports, high-security rooms, or even computers. Though far from working reliably under all real-world conditions, many approaches have been developed that at least guarantee safe and reliable functioning under well-defined conditions (Sect. 5.1).
A newer, non-biometric method for non-intrusive person recognition is based on images of the full body, for applications in which it can be assumed that clothing does not change. The large variety of colors and patterns makes clothing a valuable cue that adds information to an identification system, even if fashion styles and seasonal changes may limit the assumption of “constant clothing”. An approach to this token-based method is discussed in Sect. 5.2.
Like full-body recognition, gait recognition depends on a proper detection and segmentation of persons in a scene, which is a main issue of video-based person tracking. Hence, person tracking can also be considered a preprocessing step for these recognition methods. Beyond that, determining the position of persons in an observed scene is a main task for surveillance applications in buildings, shopping malls, airports, and other public places, and is hence discussed as well (Sect. 5.3).
environment. These factors can be categorized into human and technical factors
(Fig. 5.2).
[Figure: factor labels include number of sample images, resolution, aging (biological), and occlusion and head pose (behavioral).]
Fig. 5.2. Factors mostly addressed by the research community that affect face recognition performance, and their categorization. Additionally, fashion (behavioral) and twin faces (biological) must be mentioned.
Technical factors arise from the technology employed and the environment in which a recognition system is operated; any image-based object recognition task has to deal with them. Their influence can often easily be reduced by choosing suitable hardware, such as lighting (for illumination) and cameras (for illumination and resolution).
1. Image resolution: The choice of image resolution is a trade-off between the desired processing performance (⇒ low resolution) and the separability of the features used for recognition (⇒ high resolution). The highest resolution that still allows meeting the given time restrictions should be used.
2. One-sample problem: Many methods can extract good descriptions of a person's face from a set of images taken under different conditions, e.g. illuminations, facial expressions, and poses. However, several images of the same person are often not available. Sometimes a few, maybe even only one, sample image can
extraction if these textural changes are not considered by the detection algo-
rithms.
2. Biological factors
a) Aging: Changes of facial appearance due to aging, and their modeling, are not yet well studied. Changes of the skin surface and, over longer time spans, slight geometrical changes cause problems in the recognition process due to strong differences in the extracted features.
b) The twin problem: The natural “copy” of a face to be identified is the twin's face. Even humans have difficulty distinguishing two unknown twins, and so do automatic systems. However, humans are able to recognize twins whom they know well.
In contrast to a non-intrusive face recognition system, which has to deal with most of the factors mentioned above, an intrusive system can neglect many of them. If a so-called “friendly user” can be asked either to step into an area with controlled illumination or to look straight into the camera, strong variations in, e.g., illumination and pose need not be handled. Operational conditions can often reduce the need for more sophisticated algorithms if suitable hardware environments are set up. However, friendly users usually cannot be found in surveillance tasks.
This section discusses the basic processing steps of a face recognition system (Fig. 5.3). Referring to Chapter 2.1, the coarse structure of an image-based object recognition system consists of an (image) acquisition, a feature extraction, and a feature classification stage. These stages can be broken up into several steps: while acquisition, suitable preprocessing, face detection, and facial feature extraction are described in more detail in Chapter 2.2, feature extraction, feature transformation, classification, and combination are discussed in the following.
Acquisition
During acquisition the camera takes snapshots or video streams of a scene containing one or more faces. Already at this stage, the placement of the camera should be considered. If possible, more than one camera should be used to obtain several views of the same person, either to combine recognition results or to build a face model that handles different influencing factors like pose or illumination. Besides positioning the camera, the placement of suitable lighting should be considered as well. If a constant illumination situation can be provided in the enrollment phase (training) as well as in the test phase (classification), less effort needs to be put into modeling illumination changes, and the accuracy of the system will increase. If possible, camera parameters like shutter speed, amplification, and white balance should be adaptively controlled by a suitable
Fig. 5.3. General structure of a face recognition system: acquisition and preprocessing are
followed by a feature extraction stage (face detection, facial feature detection, feature
extraction) and a feature classification stage (feature transformation, classification,
combination) that yields the recognition result.
algorithm that evaluates the current illumination situation and adapts the parame-
ters accordingly. For a more detailed discussion of this issue, please refer to Sect. 2.2.
Preprocessing
In the preprocessing stage, the images are usually prepared at a low level for further
high-level processing in later stages. Preprocessing steps include, e.g., (geometrical)
image normalization (e.g. Sect. 5.1.4), suitable enhancement of brightness and
contrast (e.g. brightness planes or histogram equalization [45] [52]), or image
filtering.
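As a concrete illustration of the contrast enhancement step, the following sketch performs histogram equalization on an 8-bit grey-level image. The flat pixel vector and the function name are illustrative assumptions, not part of any framework mentioned in this chapter.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Histogram equalization for an 8-bit grey-level image stored as a flat
// vector of pixel intensities: spread the intensities so that the cumulative
// distribution becomes approximately linear.
std::vector<std::uint8_t> equalizeHistogram(const std::vector<std::uint8_t>& img) {
    std::array<std::size_t, 256> hist{};             // intensity histogram
    for (std::uint8_t p : img) ++hist[p];

    std::array<std::size_t, 256> cdf{};              // cumulative distribution
    std::size_t sum = 0;
    for (std::size_t i = 0; i < 256; ++i) { sum += hist[i]; cdf[i] = sum; }

    // Map each intensity through the normalized CDF onto [0, 255].
    std::vector<std::uint8_t> out(img.size());
    for (std::size_t i = 0; i < img.size(); ++i)
        out[i] = static_cast<std::uint8_t>((cdf[img[i]] * 255) / img.size());
    return out;
}
```

Such a step reduces global contrast differences between training and test images before any face-specific processing takes place.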
This preprocessing always depends strongly on the subsequent processing step.
For the face detection stage, it might be useful to determine face candidates, i.e.
image regions that might contain a face, on a low-level basis. This can, e.g., be
achieved using skin color histograms (Sect. 2.1.2.1); similar considerations apply
to the subsequent facial feature extraction.
Face Detection
Face detection is a wide field of research, and many different approaches with
creditable results have been developed [5] [58] [50]. However, it is difficult to get
these methods to work under real-world conditions. As all face detection algorithms
are controlled by several parameters, generally suitable parameter settings usually
cannot be found. Additional image preprocessing is needed to obtain a system that
performs fast (possibly even in real time) and reliably, even under different kinds of
influencing factors such as resolution, pose, and illumination (Sect. 5.1.1). Most
face detection algorithms return rectangles containing the face area. Usually a
good estimate of the face size, which is needed for normalization and facial feature
detection, can be derived from these rectangles.
Feature Extraction
Face recognition research usually concentrates on the development of feature ex-
traction algorithms. Besides proper preprocessing, feature extraction is the most
important step in face recognition because it builds a description of a face which
must be sufficient to separate the face images of one person from the images of all
other subjects.
Substantial surveys of face recognition are available, e.g. [60]. We therefore refrain
from giving an overview of face recognition methods and do not discuss all methods
in detail, because much literature can be found for the most promising approaches.
Some feature extraction methods are, however, briefly outlined in Sect. 5.1.5 for
face recognition based on local regions. These methods were originally applied
globally, i.e. on the whole face image.
Feature Transformation
Feature extraction and feature transformation are very often combined into one
single step. The term feature transformation, however, describes a set of mathematical
methods that try to decrease the within-class distance while increasing the between-
class distance of the features extracted from the image. Procedures like, e.g., PCA
(principal component analysis, Sect. 5.1.4) transform the feature space into a
representation better suited for classification.
Classification
The result of a recognition process is determined in the classification stage in
which a classification method compares a given feature vector to the previously
stored training data. Simple classifiers like the k-nearest-neighbor classifier (Sect.
5.1.4) are often surprisingly powerful, especially when only one sample per class is
available. More sophisticated methods like neural networks [6] or Support Vector
Machines [49] can determine more complex, non-linear feature space subdivisions
(class borders) which are often needed if more than one sample per class was trained.
Combination
A hybrid classification system can apply different combination strategies which in
general can be categorized in three groups:
• Voting:
The overall result is given by the ID which was rated best by the majority of all
single classifiers.
• Ranking:
If the classifiers can build a ranking for each test image then the ranks for each
class are added. Winner is the class with the lowest sum.
• Scoring:
If the rankings are annotated with a score that expresses how similar a test image
is to the closest trained image of each class, more sophisticated combination
methods can be applied that combine these scores into an overall score. The
winner class is the one with the best score. Recent findings can be found in [33].
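The voting and ranking strategies above can be sketched as follows; string class IDs and integer ranks (0 = best) are illustrative assumptions.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Voting: the ID rated best by the most single classifiers wins.
std::string combineByVoting(const std::vector<std::string>& winnerIds) {
    std::map<std::string, int> votes;
    for (const auto& id : winnerIds) ++votes[id];
    return std::max_element(votes.begin(), votes.end(),
                            [](const auto& a, const auto& b) {
                                return a.second < b.second;
                            })->first;
}

// Ranking: per-classifier ranks (0 = best) are summed over all classifiers;
// the class with the lowest rank sum wins.
std::string combineByRanking(const std::vector<std::map<std::string, int>>& ranks) {
    std::map<std::string, int> sums;
    for (const auto& ranking : ranks)
        for (const auto& [id, rank] : ranking) sums[id] += rank;
    return std::min_element(sums.begin(), sums.end(),
                            [](const auto& a, const auto& b) {
                                return a.second < b.second;
                            })->first;
}
```

Scoring works analogously to ranking, with the rank sums replaced by a (possibly weighted) combination of similarity scores.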
The input to a combination algorithm can be the result either of classifications
of different facial parts or the result of different types of recognition algorithms and
classifiers (e.g. local and hybrid recognition methods, Sect. 5.1.3). It was argued
in [39] that, e.g. PCA-based methods (global classifiers) should be combined with
methods extracting local information. PCA-based methods are robust against noise
because of estimating low-order modes with large eigenvalues, while local classifiers
better describe high-order modes that amplify noise but are more discriminant con-
cerning different persons. Fig. 5.4 shows some Eigenfaces (extracted by a PCA on
gray-level images) ordered by their eigenvalues. The upper Eigenfaces (high eigen-
values) seem to code only basic structure, e.g. the general shape of a face, while the
Eigenfaces in the lower part of Fig. 5.4, structures can be identified that are more
specific to certain persons. The results of the two methods (global and local) penal-
ize each other while amplifying the correct result. In general, a higher recognition
rate can be expected when different kinds of face recognition algorithms are com-
bined [60].
Fig. 5.4. Visualization of a mean face (upper left corner) and 15 Eigenfaces. The corresponding
eigenvalues are decreasing from left to right and top to bottom. (from [47]). Eigenfaces with
high eigenvalues only code general appearance information like the face outline and global
illumination (upper row).
• Global or holistic methods use the whole image information of the face at once.
Here geometrical information is not used at all, or only indirectly through the dis-
tribution of intensity values (e.g. eyes in different images). The most prominent
representative of this class of algorithms is the Eigenface approach [48], which
will be discussed in Sect. 5.1.4.
The Eigenface approach [44] [48] is one of the earliest face recognition methods
that use appearance information. It was originally applied globally and, although
it lacks stability and good performance under general conditions, it is still a
state-of-the-art method which is often used as a reference for new classification
methods.
The following sections will describe the eigenface approach considering the basic
structure of face recognition algorithms discussed in Sect. 5.1.2.
Preprocessing
The Eigenface approach performs badly if the training and input images do not
undergo a preprocessing that normalizes them in a way that keeps only the usable
information in the face area of the image.
For this normalization step, the face and the positions of different facial features
must be detected as described in chapter 2.2. The coordinates of the eye centers and
the nose tip are sufficient for a simple yet often applicable normalization (Fig. 5.5).
In a first step, the image must be rotated by an angle α to make sure the eye
centers have the same y-coordinate (Fig. 5.5b). The angle α is given by:

α = arctan( (y_lefteye − y_righteye) / (x_lefteye − x_righteye) )

where (x_lefteye, y_lefteye) and (x_righteye, y_righteye) are the coordinates of the left
and right eye, respectively.
Then the image is scaled in the directions of x and y independently (Fig. 5.5 (c)).
The scaling factors are determined by:
Fig. 5.5. Face image normalization for recognition using Eigenfaces: (a) original image, (b)
rotated image (eyes lie on a horizontal line), (c) scaled image (eyes have a constant distance
d_eyes; the nose has a constant length d_nose), (d) a mask with previously defined coordinates
for the eyes and the nose tip to which the image is shifted, (e) normalized image.
scale_x = (x_lefteye − x_righteye) / d_eyes

scale_y = (y_eye − y_nose) / d_nose
where x_lefteye and x_righteye are the x-coordinates of the eyes, and y_eye (y_nose)
is the y-coordinate of the eyes (nose tip) after the rotation step. The distance
between the eyes d_eyes and the nose length d_nose must be constant for all considered
images and must be chosen a priori. Cropping the image, shifting the eyes to a
previously determined position, and masking out the border regions of the face
image (using a mask like in Fig. 5.5d) leads to an image that contains only the
appearance information of the face (Fig. 5.5e). All training and test images must be
normalized to yield images that have corresponding appearance information at the
same positions.
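The rotation and scaling parameters derived above can be computed as in the following sketch; the Point type and function names are illustrative assumptions.

```cpp
#include <cmath>

// Image coordinates of a facial feature point (y grows downwards).
struct Point { double x, y; };

// Rotation angle that brings both eye centers onto one horizontal line
// (atan2 is equivalent to the arctan of the coordinate ratio here).
double rotationAngle(Point leftEye, Point rightEye) {
    return std::atan2(leftEye.y - rightEye.y, leftEye.x - rightEye.x);
}

// Scale factors relating the measured eye distance and nose length to the
// constants d_eyes and d_nose chosen a priori (coordinates after rotation).
double scaleX(Point leftEye, Point rightEye, double dEyes) {
    return (leftEye.x - rightEye.x) / dEyes;
}

double scaleY(double yEye, double yNose, double dNose) {
    return (yEye - yNose) / dNose;
}
```

With these parameters, rotating, scaling, and shifting the image onto the predefined eye/nose mask yields the normalized face image of Fig. 5.5e.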
Face images are not uniformly distributed over the whole image space but are
concentrated in a lower-dimensional “face space” (Fig. 5.6).
Fig. 5.6. Simplified illustration of an image space. The face images are located in a sub-space
of the whole image space and form the face space.
The basis of this face space can be used to code the images by their coordinates.
This coding is a unique representation of each image. Assuming that images of the
same person are grouped together makes it possible to apply a nearest-neighbor
classifier (Sect. 5.1.4) or more sophisticated classification methods.
All J_i have the dimension W · H. This is the simplest feature extraction possible:
just the grey-levels of the image pixels are considered. Modified approaches
preprocess the images, e.g. by filtering with Gabor functions³ [10].
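For illustration, the one-dimensional Gabor function from the footnote can be evaluated as follows; the function name and the scalar k are simplifying assumptions (the two-dimensional case uses a wave vector instead).

```cpp
#include <cmath>

// One-dimensional Gabor function: a cosine wave of frequency k under a
// Gaussian envelope of width sigma that limits the support of the wave.
double gabor(double x, double k, double sigma) {
    return std::cos(k * x) * std::exp(-(k * k * x * x) / (2.0 * sigma * sigma));
}
```

Filtering an image with a bank of such functions at several frequencies and orientations yields features that are less sensitive to illumination than raw grey-levels.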
³ A Gabor function is given by g(x) = cos(kx) · e^(−k²x²/(2σ²)). This is basically a cosine
wave combined with an enveloping exponential function limiting the support of the cosine
wave.
Fig. 5.4 (see page 200) shows a visualization of a set of these Eigenvectors. Be-
cause they have a face-like appearance, they were called “eigenpictures” [44] or
Eigenfaces [48]. Belhumeur [4] suggested discarding some of the most significant
Eigenfaces (high eigenvalues) because they mainly seem to code variations in illu-
mination but no significant differences between subjects, which are coded by Eigen-
faces with smaller eigenvalues. This improves stability, at least when global
illumination changes occur.
There are basically two strategies for determining the number of Eigenvectors
to be kept: (1) simply define a constant number Ñ of Eigenvectors (Ñ ≤ N)⁴, or (2)
define a minimum amount of information to be kept by determining an information
quality measure Q:

Q = ( Σ_{k=0}^{Ñ} λ_k ) / ( Σ_{k=0}^{N} λ_k )

Eigenvalues must be added to the numerator until Q is bigger than a previously
determined threshold Q_threshold; in this way, the number Ñ of dimensions is deter-
mined. Commonly used thresholds are Q_threshold = 0.95 or Q_threshold = 0.98.
k determines the direction (k/|k|) and the frequency (|k|) of the wave in the two-dimensional
space.
⁴ Turk and Pentland claim that about 40 Eigenfaces are sufficient [48]. This, however,
strongly depends on the variations (e.g. illumination, pose) in the images to be trained: the
more variation is expected, the more Eigenfaces should be kept. The optimal number of Eigen-
faces can only be determined heuristically.
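Strategy (2) can be sketched as follows, assuming the eigenvalues are already sorted in decreasing order; the function name is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Keep the smallest number of Eigenvectors whose eigenvalues (sorted in
// decreasing order) push the quality measure Q above the given threshold.
std::size_t dimensionsToKeep(const std::vector<double>& eigenvalues,
                             double qThreshold) {
    double total = 0.0;
    for (double l : eigenvalues) total += l;

    double partial = 0.0;
    for (std::size_t n = 0; n < eigenvalues.size(); ++n) {
        partial += eigenvalues[n];
        if (partial / total > qThreshold) return n + 1;  // Ñ dimensions kept
    }
    return eigenvalues.size();
}
```

For eigenvalues {4, 3, 2, 1} and a threshold of 0.6, the first two Eigenvectors already carry 70% of the total eigenvalue mass, so Ñ = 2.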
After calculating the PCA and performing dimension reduction, we get a ba-
sis B = {e_1, ..., e_Ñ} of Eigenvectors and corresponding eigenvalues B_λ =
{λ_1, ..., λ_Ñ}. A new image I(x, y) → J̃ is then projected into the face space
determined by the basis B. The projection onto each dimension of the face space
basis is given by:

w_k = e_k^T (J̃ − Ψ)    (5.5)

for k = 1, ..., Ñ. Each w_k describes the contribution of Eigenface e_k to the currently
examined image J̃. The vector W = {w_1, ..., w_Ñ} is hence a representation of
the image and can be used to distinguish it from other images.
The application of the PCA and the following determination of the wk can be
regarded as a feature transformation step which transforms the original feature
vector (i.e. the grey-level values) into a differently coded feature vector containing
the wk .
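Eq. (5.5) reduces to one dot product per Eigenface; a minimal sketch with illustrative names, where the mean face Ψ is passed in as a vector:

```cpp
#include <cstddef>
#include <vector>

// Projection of Eq. (5.5): w_k is the dot product of Eigenface e_k with the
// mean-subtracted image vector (J - Psi).
double projectOntoEigenface(const std::vector<double>& eigenface,
                            const std::vector<double>& image,
                            const std::vector<double>& meanFace) {
    double w = 0.0;
    for (std::size_t i = 0; i < eigenface.size(); ++i)
        w += eigenface[i] * (image[i] - meanFace[i]);
    return w;
}
```

Applying this to all Ñ Eigenfaces yields the transformed feature vector W used by the classifier.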
Classification
A quite simple yet often very effective algorithm is the k-nearest-neighbor classifier
(kNN), especially when just a few feature vectors of each class are available. In
this case, training routines of complex classification algorithms do not converge
or do not achieve better recognition results (than a kNN) but are computationally
more expensive. Introductions to more complex classification methods like Support
Vector Machines or neural networks can be found in the literature [49] [16].
⁵ The term “class ID” is usually used for a unique number that acts as a label for each class
(person) to be identified.
Fig. 5.7 exemplifies the functionality of a kNN classifier using the Euclidean dis-
tance. The trained vectors are drawn as points in the two-dimensional feature space,
while the vector to be classified is marked with a cross. For each of the training
samples the corresponding class is stored (only shown for classes A, B, and C).
The training sample that is closest to the examined feature vector determines the
class to which this vector is assigned (here class “A”, as d_A is the smallest
distance). Usually k = 1 is used because an exact class assignment is needed in
object recognition tasks. The class ID is then determined by:

class ID = argmin_i { d(W, T_i) }
For k > 1 an ordered list is built containing the class IDs of the k closest feature
vectors in increasing order of distance:

class IDs = {a, b, ...}

with d(W, T_a) ≤ d(W, T_b) ≤ · · · and ||class IDs|| = k. From these k results
the winner class must be determined. Usually the voting scheme is applied;
however, other combination strategies can be used as well (Sect. 5.1.2).
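A minimal 1NN classifier over such feature vectors might look as follows; the sample structure and integer class IDs are illustrative assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One trained feature vector together with its class ID.
struct TrainedSample {
    std::vector<double> features;
    int classId;
};

// Euclidean distance between two feature vectors of equal length.
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(d);
}

// 1NN classification: return the class ID of the closest training sample.
int classify1NN(const std::vector<TrainedSample>& training,
                const std::vector<double>& w) {
    double best = std::numeric_limits<double>::max();
    int winner = -1;
    for (const auto& t : training) {
        double d = euclidean(t.features, w);
        if (d < best) { best = d; winner = t.classId; }
    }
    return winner;
}
```

For k > 1, the loop would instead collect the k smallest distances and hand the associated class IDs to one of the combination strategies of Sect. 5.1.2.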
Fig. 5.7. Two-dimensional feature space. A 1NN classifier based on the Euclidean distance
measure will return the point A as it is the closest to the point to be classified (marked with a
cross). Class borders between A, B, and C are marked with dotted lines.
Fig. 5.8. Overview of a component-based face recognition system. For each component, a
feature extraction (F.E.) and a classification step (class.) is separately performed.
For each extracted component, feature extraction and classification are performed
separately. The single classification results are finally combined, leading to an
overall classification result.
Component Extraction
As a preprocessing step for component-based face recognition, facial regions must
be detected. Suitable approaches for the detection of distinctive facial points and
contours were already discussed in chapter 2.2.
As soon as facial contours are detected, facial components can be extracted by
determining their position according to distinctive points in the face (Fig. 5.9).
Fig. 5.9. Face component extraction. Starting from a mesh/shape (thin white line in (a)) that
is adapted to the input image, the centers of the facial components (filled white dots) can be
determined (b), and with pre-defined sizes the facial components can be extracted (white lines
in (c)). The image was taken from the FERET database [41].
The centers c_j(x, y) of the facial components are either determined by a single
contour point, i.e. c_j(x, y) = p_k(x, y) (e.g. for the corners of the mouth or the nose
root), or are calculated from the points of a certain facial contour (e.g. the center of
the points of the eye or nose contour) with:

c_j(x, y) = (1/ñ) Σ_{i∈C} p_i(x, y)

where C is a contour containing the ñ distinctive points p_i(x, y) of, e.g., the eyes.
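The averaging of contour points can be sketched as follows; the point type and function name are illustrative.

```cpp
#include <vector>

// A distinctive contour point in image coordinates.
struct Pt { double x, y; };

// Component center as the mean of the ñ distinctive points of one contour
// (e.g. the eye contour).
Pt componentCenter(const std::vector<Pt>& contour) {
    Pt c{0.0, 0.0};
    for (const Pt& p : contour) { c.x += p.x; c.y += p.y; }
    c.x /= contour.size();
    c.y /= contour.size();
    return c;
}
```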
Grey-level information can be used in two ways for object recognition purposes:
(1) the image is transformed row-wise into a feature vector, and the trained vector
having the smallest distance to the test vector is determined, or (2) a template
matching based on a similarity measure can be applied. Given an image I(x, y) and
a template T(x, y)⁶, the similarity can be calculated by:

S(u, v) = 1 / ( Σ_{(i,j)∈V} [I(i + u, j + v) − T(i, j)]² + 1 )

where V is the set of all coordinate pairs. The template with the highest similarity
score S_max(u, v) determines the classified person.
⁶ A template is an image pattern showing the object to be classified. A face component can
be regarded as a template as well.
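The similarity measure can be sketched as follows for row-major image buffers; the buffer layout and parameter names are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

// Similarity S(u, v): reciprocal of the sum of squared differences between
// the template and the image patch at offset (u, v), plus one. Images are
// row-major arrays; imgW is the image width, tmplW x tmplH the template size.
double similarity(const std::vector<double>& image, std::size_t imgW,
                  const std::vector<double>& tmpl, std::size_t tmplW,
                  std::size_t tmplH, std::size_t u, std::size_t v) {
    double ssd = 0.0;
    for (std::size_t j = 0; j < tmplH; ++j)
        for (std::size_t i = 0; i < tmplW; ++i) {
            double diff = image[(j + v) * imgW + (i + u)] - tmpl[j * tmplW + i];
            ssd += diff * diff;
        }
    return 1.0 / (ssd + 1.0);  // identical patch and template give S = 1
}
```

The "+1" in the denominator bounds S to (0, 1] and avoids division by zero for a perfect match.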
For each facial component j, a covariance matrix C_FCj over the training images
can be determined (i being the index of the training images, and j being the index
of the facial component, here j ∈ {1, . . . , 10}). The Eigenvectors of C_FCj are often
called “eigenfeatures”, referring to the term “Eigenfaces”, as they span a sub-space
for specific facial features [40].
The classification stage provides a classification result for each face component.
This is done as in any holistic method by determining a distance measure
(Sect. 5.1.4).
Fourier coefficients for image coding
The 2D-DFT (Discrete Fourier Transform) of an M × N image I(x, y) is defined as:

g(m, n) = Σ_x Σ_y I(x, y) · e^(−j2π(x·m/M + y·n/N))
Fig. 5.11 shows a face image and the magnitude of a typical DFT spectrum in
polar coordinates. Note that mainly coefficients representing low frequencies (close
to the center) can be found in face images. Therefore, only the r most relevant Fourier
coefficients g(m, n) are determined and combined into an r-dimensional feature
vector which is classified using, e.g., the Euclidean distance.
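A naive evaluation of a single DFT coefficient, for illustration only (a real system would use an FFT); the row-major buffer layout is an assumption.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Single 2D-DFT coefficient g(m, n) of an M x N grey-level image stored
// row-major (img[y * M + x]), evaluated directly from the definition.
std::complex<double> dftCoefficient(const std::vector<double>& img,
                                    std::size_t M, std::size_t N,
                                    std::size_t m, std::size_t n) {
    const double pi = std::acos(-1.0);
    std::complex<double> g(0.0, 0.0);
    for (std::size_t x = 0; x < M; ++x)
        for (std::size_t y = 0; y < N; ++y) {
            double phase = -2.0 * pi * (double(x * m) / M + double(y * n) / N);
            g += img[y * M + x] * std::complex<double>(std::cos(phase),
                                                       std::sin(phase));
        }
    return g;
}
```

Computing g(m, n) only for the r coefficient positions closest to the spectrum center directly yields the r-dimensional feature vector described above.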
Discussion
To show the distinctiveness of facial components, we ran experiments with different
combinations of facial components (Fig. 5.12) on a subset of the FERET database
(Sect. 5.1.6). Only grey-level information was used as feature, together with a 1NN
classifier. Frontal images of a set of 200 persons were manually annotated with dis-
tinctive points/shapes. One image of each person was trained; another one, showing
slight facial expression changes, was tested.
[Fig. 5.12. Combinations of face components used in the tests (a)-(d).]
In table 5.1 the recognition rate of each single component is given. The results
show that the eye region (eye and eyebrow components) is the most discriminating
part of the face. With a single eye component alone, a recognition rate of about 86%
could be achieved. The lower face part (mouth, nose) performs worse, with rates
of only up to 50%.
The results of the combined tests are shown in table 5.2. The combination of
all face components increased the recognition rate to 97%. Even though the mouth
components alone do not perform very well, they still add valuable information to the
classification, as the recognition rate drops by 1.5% when the mouth components are
not considered (test set (b)). Further reduction (test set (c)) shows, however, that the
nose tip could not add valuable information in our experiments; it actually worsened
the result. A recognition rate of about 90% can be achieved under strong occlusion,
simulated by considering only four components of the right face half. This shows that
occlusion can be compensated reasonably well by the component-based approach.
Table 5.2. Classification rates of different combinations of face components (Fig. 5.12)
test set                              recognition rate
(a) 10 components                     97.0%
(b) 8 components (no mouth)           95.5%
(c) 7 components (eye region only)    96.5%
(d) 4 components (right face half)    90.1%
The Color FERET Face Database [41] is the biggest database in terms of dif-
ferent individuals. It contains images of 994 persons, including frontal images with
a neutral and a randomly different facial expression (usually a smile) for all persons.
For about 990 persons there are images of profiles, half profiles, and quarter pro-
files. Images where the head was turned more or less randomly are also included.
The ground-truth information gives the year of birth (1924-1986), gender, race, and
whether the person was wearing glasses, a beard, or a mustache. For the frontal
images, the eye, mouth, and nose tip coordinates are also annotated. Older images
of the first version of the FERET database are included in the newer Color FERET
database; however, these images are supplied in grey-scale only.
There are two versions of the Yale Face Database. The original version [4] con-
tains 165 frontal face images of 15 persons under different illuminations, with
different facial expressions, and with and without glasses. The second version
(Yale Face Database B) [20] contains grey-scale images of 10 individuals (nine male
and one female) under 9 poses and 64 illumination conditions, with neutral facial
expressions. No subject wears glasses and no other kind of occlusion is present.
5.1.7 Exercises
The following exercises deal with face recognition using the Eigenface (Sect. 5.1.7.2)
and the component-based approach (Sect. 5.1.7.3). Image processing macros for the
IMPRESARIO software are explained and suitable process graphs are provided on the
book CD.
5.1.7.1 Provided Images and Image Sequences
A few face images and ground truth data files (directories
<Impresario>/imageinput/LTI-Faces-Eigenfaces and
<Impresario>/imageinput/LTI-Faces-Components)⁸ are provided on
the book CD, with which the process graphs of this exercise can be executed. This
image database is not extensive; however, it should be big enough to convey a basic
understanding of the face recognition methods.
Additionally, we suggest using the Yale face database [4], which
can be obtained easily through the internet⁹. Additional ground truth data
like coordinates of facial feature points are provided on the CD shipped
with this book and can be found after installation of IMPRESARIO in
the directories <Impresario>/imageinput/Yale DB-Eigenfaces and
<Impresario>/imageinput/Yale DB-Components, respectively.
To set up the image and ground truth data of the Yale face database follow these
steps:
1. Download the Yale face database from the internet9 and extract the image files
and their 15 sub-directories (one for each subject) to the following directories:
⁷ http://www-prima.inrialpes.fr/FGnet/data/05-ARFace/tarfd markup.html
⁸ <Impresario> denotes the installation directory on your hard disc.
⁹ http://cvc.yale.edu/projects/yalefaces/yalefaces.html
Fig. 5.14 shows the first process graph. The ImageSequence source provides an
image (left output pin) together with its file name (right output pin), which is
needed by the LoadFaceCoordinates macro. This macro loads the coordinates of
the eyes, mouth, and nose, which are stored in a file with the same name as the
image file but a different extension (e.g. subject05.normal.asf for image
subject05.normal.bmp).
The ExtractIntensityChannel macro will extract the intensity channel of the image
which is then passed to the FaceNormalizer macro.
¹⁰ The process graph files can be found on the book CD. The file names are given in brack-
ets.
Fig. 5.14. IMPRESARIO process graph for the calculation of the PCA transformation matrix.
The grey background marks the normalization steps which can be found in all process graphs
of this Eigenface exercise (the face image was taken from the Yale face database [4]).
After this normalization step, all face images have the same mean grey-level
avgnorm . Global linear illumination differences can be reduced by this step.
mean grey-value: The new mean value of all intensities in the image (avgnorm ).
As some parts of the image might still contain background, the normal-
ized face image is masked using a pre-defined mask image which is loaded by
LoadImageMask. The macro MaskImage actually performs the masking operation.
Finally, another ExtractIntensityChannel macro converts the image information
into a feature vector which is then forwarded to the calculatePCAMatrix macro.
calculatePCAMatrix This macro stores the feature vector of each image. After
the last image is processed (the ExitMacro function of the macros is called),
calculatePCAMatrix can calculate the mean feature vector, the covariance ma-
trix, and its eigenvalues, which are needed to determine the transformation ma-
trix (Sect. 5.1.4) that is then stored.
data file: Path to the file in which the transformation matrix is stored.
result dim: With this parameter the number of dimensions to which the feature
vectors are reduced can be set explicitly. If this parameter is set to −1,
a suitable dimension reduction is performed by dismissing all Eigenvectors
whose eigenvalues are smaller than λ_max/100000, where λ_max denotes the
highest eigenvalue.
Eigenface Training
The normalization steps described above (Fig. 5.14) can be found in the training
process graph as well (Fig. 5.15). Instead of calculating the PCA, the training
graph contains the EigenfaceFeature macro which loads the previously stored
transformation matrix and projects given feature vectors into the eigenspace (face
space). The transformed feature vectors are then forwarded to the kNNTrainer
macro which stores all feature vectors with their corresponding person IDs which
are determined by the macro ExtractIDFromFilename.
Fig. 5.16 shows the graph which can be used for running tests on the Eigen-
face approach. The kNNTrainer macro of the training graph is substituted by a
kNNClassifier.
kNNClassifier Classifies the given feature vector by loading the previously stored
classifier information (see kNNTrainer above) from hard disc.
Face Recognition 219
Fig. 5.16. IMPRESARIO process graph for testing the Eigenface approach using a nearest-
neighbor classifier.
training file: File in which the kNNTrainer macro stored the trained feature vec-
tors.
RecognitionStatistics Compares the classified ID (obtained by kNNClassifier) to
the target ID, which is again obtained by the macro ExtractIDFromFilename.
Additionally, it counts the correct and false classifications and determines the
classification rate, which is displayed in the output window of IMPRESARIO (Ap-
pendix B).
label: A label displayed in the output window of IMPRESARIO to mark
each message from the respective RecognitionStatistics macro in case
more than one such macro is used.
Annotations
For the normalization of the images, parameters must be set which affect the size
of the resulting images. FaceNormalizer crops the image to a certain size (determined
by the parameters width and height). These parameters must be set to the same values
as in LoadImageMask and EigenfaceFeature (Fig. 5.17). If this convention is not
followed, the macros will exit with an error because the dimensionalities of vectors
and matrices do not correspond and hence the calculations cannot be performed.
Suggested experiments
To understand how the different parameters influence the performance of the
Eigenface approach, the reader might want to try changing the following
parameters:
Fig. 5.17. Proper image size settings for the Eigenface exercise. The settings in all involved
macros must be set to the same values, e.g., 120 and 180 for the width and height, respectively.
One instructive experiment is to examine illumination sensitivity by training
face images (e.g. of the Yale face database) under one illumination situation
(e.g. the yale.normal.seq images) and testing under another illumination
(the yale.leftlight.seq images). The same applies to training and testing
with images showing different facial expressions.
Fig. 5.18. IMPRESARIO process graph for training a component-based face recognizer.
The training graph of the component-based classifier is quite simple (Fig. 5.18). The
macro LoadComponentPositions loads a list of facial feature points from a file on
disc that corresponds to each image file to be trained. The macro returns a list of
feature points which is forwarded to the ComponentExtractor.
222 Person Recognition and Tracking
component: One can take a look at the extracted components by setting the com-
ponent parameter to the ID of the desired component (e.g. 4 for the nose tip;
all IDs are documented in the macro help, shown on the left of Fig. 5.18).
The list of face component images is then forwarded to a feature extractor macro
that builds a list of feature vectors containing one vector per component image. In
Fig. 5.18 the IntensityFeatureExtractor is used, which only extracts the grey-level
information from the component images. This macro must be substituted if other
feature extraction methods are to be used.
The list of feature vectors is then used to train a set of 12 kNN classifiers (macro
ComponentskNNTrainer). For each classifier, a file is stored in the given directory
(parameter training directory) that contains information of all trained feature vectors.
ComponentskNNTrainer Trains 12 kNN classifiers, one for each component. The
ground truth ID is extracted by the macro ExtractIDFromFilename (Sect.
5.1.7.2).
training directory: Each trained classifier is stored in a file in the directory given
here.
training file base name: The names of each of the 12 classifier files contains
the base file name given here and the ID of the face component, e.g.
TrainedComponent4.knn for the nose tip.
For a test image the facial feature point positions are loaded and the face compo-
nent images are extracted from the face image. Then, the features for all compo-
nents are determined and forwarded to the classifier. The ComponentskNNTrainer
is substituted by a ComponentskNNClassifier that will load the previously stored
12 classifier files to classify each component separately (Fig. 5.19).
ComponentskNNClassifier For each component, this macro determines a win-
ner ID and an outputVector¹¹ containing information about the distances
(the Euclidean distance is used) and the ranking of the different classes, i.e. a list
containing a similarity value for each class. The winner ID can be used to
determine the classification rate of a single component by feeding it into a
RecognitionStatistics macro that compares the winner ID with the target ID. The
outputVector can be used to combine the results of the single components into
an overall result.
training directory: Directory where ComponentskNNTrainer stored the classi-
fication information for all components.
¹¹ outputVector is an LTILib class. Please refer to the documentation of the LTILib for
implementation details.
training file base name: The base name of the files in which the classification
information is stored.
The ResultCombiner macro performs this step by linearly combining the out-
puts and determining an overall winner ID, which again can be examined in a
RecognitionStatistics macro to obtain an overall classification rate.
Classification results of single components can be considered in the
overall recognition result by connecting the corresponding output of the
ComponentskNNClassifier macro to the input of the ResultCombiner. In Fig. 5.19
all components are considered as all are connected to the combination macro.
Suggested Experiments
void CIntensityFeatureExtractor::featureExtraction() {
  // set the size of the feature vector to the dimension of the face component
  if (m_Feature.size() != m_FaceComponent.rows() * m_FaceComponent.columns())
    m_Feature.resize(m_FaceComponent.rows() * m_FaceComponent.columns(),
                     0, false, false);
  // ... copy the grey-level values of m_FaceComponent into m_Feature here ...
}
Fig. 5.20. Code of the feature extractor interface for component-based face classification.
Reimplement this function to build your own feature extraction method.
Until today, the description of people's appearance for recognition purposes has been addressed only by the research community. Work on appearance-based full-body recognition was done by Yang et al. [57] and Nakajima et al. [37], whose work is briefly reviewed here.
Full-body Person Recognition 225

Fig. 5.21. Different persons whose appearance differs because of their size and especially in the texture and color of their clothing (a-d). However, sometimes similar clothing might not allow a proper separation of individuals (d + e).

Yang et al. developed a multimodal person identification system which, besides face and speaker recognition, also contains a module for recognition based on color appearance. Using the whole segmented image, they build up color histograms based
on the r-g color space and the tint-saturation color space (t-s-color space) to gain illu-
mination independence to some extent (Sect. 5.2.2.1). Depending on the histograms
which are built after the person is separated from the background, a smoothed proba-
bility distribution function is determined for each histogram. During training, for each color model and each person i, probability distribution functions Pi,model (r, g) and Pi,model (t, s) are calculated and stored. For classification, distributions Pi,test (r, g)
and Pi,test (t, s) are determined and compared to all previously trained Pi,model for
the respective color space. For comparison the following measure (Kullback-Leibler
divergence) is used:
D(\mathrm{model}_i \,\|\, \mathrm{test}) = \sum_{s \in S} P_{i,\mathrm{model}}(r, g) \cdot \log \frac{P_{i,\mathrm{model}}(r, g)}{P_{i,\mathrm{test}}(r, g)}
where S is the set of all histogram bins. For the t-s-color space models, this is
performed analogously. The model having the smallest divergence determines the
recognition result.
Tests on a database containing 5000 images of 16 persons showed that the t-s model performs slightly better than the r-g color model. Recognition is not heavily influenced by the size of the histograms in either of the two color spaces, and recognition rates of up to 70% are reached, depending on histogram size and color model.
In the system of Nakajima et al. [37], shape information was combined with a RGB histogram. The local shape features were determined by convolving the image with 25 shape pattern patches. The convolution is performed
on the following color channels: R - G, R + G and R + G - B. For classification a
Support Vector Machine [49] was used.
Tests were performed on a database of about 3500 images of 8 persons which
were taken during a period of 16 days. All persons wore different clothes during
the acquisition period so that different features were assigned to the same person.
Recognition rates of about 98% could be achieved. However, the ratio of training
images to test images was up to 9 : 1. It could be shown that the simple shape fea-
tures are not useful for this kind of recognition task, probably because of insufficient
segmentation. The local shape features performed quite well, though the normalized
color histogram produced the best results. This is in line with the results of Yang [57].
Histograms can be based on any color space that suits the specific application.
The RGBL-histogram is a four-dimensional histogram built on the three RGB channels of an image and the luminance channel, which is defined by:

L = \frac{\min(R, G, B) + \max(R, G, B)}{2}
Yang et al. [57] and Nakajima et al. [37] used the r-g color space for person
identification where a RGB image is transformed into a r-channel and a g-channel
based on the following equations applied to each pixel P (x, y) = (R, G, B):
r = \frac{R}{R + G + B} \qquad g = \frac{G}{R + G + B}
Each combination (r̃, g̃) of specific colors is represented by a bin in the histogram. Nakajima used 32 bins for each channel, i.e. K = 32 · 32 = 1024 bins for the whole r-g histogram. The maximum size of the histogram (assuming the commonly used color resolution of 256 per channel) is Kmax = 256 · 256 = 65536. So just a fraction of the maximum size was used, leading to more robust classification.
As the r-channel and the g-channel are coding the chrominance of an image, this
histogram is also called chrominance histogram.
5.2.2.2 Color Structure Descriptor
The Color Structure Descriptor (CSD) was defined by the MPEG-7 standard for im-
age retrieval applications and image-content-based database queries. It can be seen as an extension of a color histogram which not only represents the color distribution but also considers the local spatial distribution of the colors. With the so-called structuring element, not only each pixel of the image is considered when building the histogram but also its neighborhood.
Fig. 5.22. 3 × 3 structuring elements located in two images to consider the neighborhood of the pixel under the element's center. In both cases, the histogram bins representing the colors “black” and “white” are increased by exactly 1.
Fig. 5.22 illustrates the usage of the structuring element (here with a size of 3 × 3, shown in grey) with two simple two-colored images. Both images have a size of 10 × 10 pixels, of which 16 are black. For building the CSD histogram, the structuring element is moved row-wise over the whole image. A bin of the histogram is increased by exactly 1 if the corresponding color can be found within the structuring element window, even if more than one pixel has this color. At the position marked in Fig. 5.22 in the left image, one black pixel and eight white pixels are found. The values of the bins representing the black color and the white color are increased by 1 each, just like on the right side of Fig. 5.22, even though two black pixels and only seven white pixels are found within the structuring element.
228 Person Recognition and Tracking
By applying the structuring element, the histogram bins will have the values nBlack = 36 and nWhite = 60 for the left image, and nBlack = 62 and nWhite = 64 for the right. As can be seen for the black pixels, the more the color is distributed, the larger the value of the histogram bin. A normal histogram built on these two images would have the bin values nBlack = 16 and nWhite = 84 in both cases. The images could not be separated by these identical histograms, while the CSD histograms still show a difference.
The MPEG-7 standard suggests a structure element size of 8 × 8 pixels for images smaller than 256 × 256 pixels, which is the case for most segmented images in our experiments.
It must be noted that the MPEG-7 standard also introduced a new color space
[29]. Images are transformed into the HMMD color space (Hue, Min, Max, Diff)
and then the CSD is applied.
The hue value is the same as in the HSV color space, while Min and Max are the smallest and the biggest of the R, G, and B values, and Diff is the difference between these extremes:

Hue = \begin{cases} \left(0 + \frac{G - B}{Max - Min}\right) \times 60, & \text{if } R = Max \\ \left(2 + \frac{B - R}{Max - Min}\right) \times 60, & \text{if } G = Max \\ \left(4 + \frac{R - G}{Max - Min}\right) \times 60, & \text{if } B = Max \end{cases}

Min = \min(R, G, B)
Max = \max(R, G, B)
Diff = \max(R, G, B) - \min(R, G, B)
Clothing is not only characterized by color but also by texture. So far, texture features
have not been evaluated as much as color features though they might give valuable
information about a person’s appearance. The following two features turned out to
have good recognition performance, especially when combined with other features.
In general, most texture features filter an image, split it up into separate frequency bands, and extract energy measures that are merged into a feature vector. The applied filters have different specific properties, e.g. they extract either certain frequencies or frequencies in certain directions.
5.2.3.1 Oriented Gaussian Derivatives
For a stable rotation-invariant feature we considered a feature based on the Oriented Gaussian Derivatives (OGD), which showed promising results in other recognition tasks before [54]. The OGDs are steerable, i.e. it can be determined how the information in a certain direction influences the filter result and therefore the feature vector.
In general, a steerable filter g^θ(x, y) can be expressed as:

g^\theta(x, y) = \sum_{j=1}^{J} k_j(\theta)\, g_j^\theta(x, y)

where the g_j^\theta(x, y) denote basis filters and the k_j(\theta) interpolation functions. For the first-order OGDs the filter can be defined as:

g^\theta(x, y) = g^{0^\circ}(x', y') = \frac{\partial}{\partial x'}\, g(x', y')
             = -\frac{1}{2\pi\sigma^4}\, (x \cos\theta + y \sin\theta)\, \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right)
             = \cos\theta\, \frac{\partial}{\partial x} g(x, y) + \sin\theta\, \frac{\partial}{\partial y} g(x, y)
             = \cos\theta\, g^{0^\circ}(x, y) + \sin\theta\, g^{90^\circ}(x, y)

where (x', y') are the image coordinates rotated by θ.
A directed channel C^θ with its elements c^θ(x, y) can then be extracted by convolving the steerable filter g^θ(x, y) with the image I(x, y):

c^\theta(x, y) = (g^\theta * I)(x, y)

For an n-th order Gaussian derivative, the energy of a region R in a filtered channel c^θ(x, y) can then be expressed as [2]:

E^\theta = \iint_R p^\theta(x, y)\, dx\, dy
         = \sum_{(i,j)=1}^{J} k_i(\theta)\, k_j(\theta) \iint_R c_i^\theta(x, y)\, c_j^\theta(x, y)\, dx\, dy
         = A_0 + \sum_{i=1}^{n} A_i \cos(2i\theta - \Theta_i)
where A_0 is independent of the steering angle θ and the A_i are rotation-invariant, i.e. the energy E_R^θ is steerable as well. The feature vector is composed of all coefficients A_i. For the exact terms of the coefficients A_i and more details the interested reader is referred to [2].
5.2.3.2 Homogeneous Texture Descriptor
The Homogeneous Texture Descriptor (HTD) describes texture information in a region by determining the mean energy and the energy deviation of 30 frequency bands (Fig. 5.23).

Fig. 5.23. (Ideal) HTD frequency domain layout (radial frequency ω, angular index s = 1, ..., 6). The 30 ideal frequency bands are modelled with Gabor filters.
In the literature, an implementation of the HTD is proposed that applies the central slice theorem and the Radon transformation [42]. However, an implementation using a 2D Fast Fourier Transformation (FFT) turned out to be computationally much more efficient in our experiments.
The resulting frequency domain is symmetric, i.e. only N/2 points of the discrete transformation need to be calculated. These are afterwards filtered by Gabor filters, which are defined as:

G(\omega, \theta) = e^{-\frac{(\omega - \omega_s)^2}{2\sigma_{\omega_s}^2}} \cdot e^{-\frac{(\theta - \theta_r)^2}{2\sigma_{\theta_r}^2}}
where ω_s (θ_r) determines the center frequency (angle) of band i, with i = 6s + r + 1, r defining the radial index and s the angular index.
From each of the 30 bands the energies e_i and their deviations d_i are defined as:

e_i = \log(1 + p_i)
d_i = \log(1 + q_i)

with

p_i = \sum_{\omega} \sum_{\theta = 0^\circ}^{360^\circ} \left[ G(\omega, \theta) \cdot F(\omega, \theta) \right]^2

q_i = \sqrt{ \sum_{\omega} \sum_{\theta = 0^\circ}^{360^\circ} \left\{ \left[ G(\omega, \theta) \cdot F(\omega, \theta) \right]^2 - p_i \right\}^2 }

respectively, where G(ω, θ) denotes the Gabor function in the frequency domain and F(ω, θ) the 2D Fourier transformation in polar coordinates.
f_DC is the intensity mean value of the image (or color channel), f_SD the corresponding standard deviation. If f_DC is not used, the HTD can be thought of as an intensity-invariant feature [42].
This section shows some results that were achieved by applying the feature descriptors described above.
5.2.4.1 Experimental Setup
The tests were performed on a database of 53 persons [22]. For each individual, a video sequence was available showing the person standing upright. 10 images were chosen to train each person and 200 images were used for testing. Some examples of the images used are depicted in Fig. 5.21.
Note that the combination of the CSD and the HTD^12 still led to a slight performance improvement, though the performance of the HTD was worse than the

^12 Combination was performed by adding the distance measures of the single features, d(x) = d_CSD(x) + d_HTD(x), and then determining the class with the smallest overall distance.
[Plot: recognition rate (0.65 to 0.95) over the number of persons (10 to 50)]
Fig. 5.24. Recognition results of the examined features. Color features are marked with filled
symbols. Texture features have outlined symbols.
performance of the CSD. This proves that the HTD describes information that cannot be completely covered by the CSD. Therefore, the combination of texture and color features is advisable.
Fig. 5.25. Different situations require more or less complex approaches, depending on the
number of people in the scene as well as the camera perspective and distance.
Real situations are far too varied and complex (see Fig. 5.25) to be interpreted
without any prior knowledge. Furthermore, people have a wide variety of shape and
appearance. A human observer solves the image analysis task by three-dimensional understanding of the situation in its entirety, using a vast background knowledge about every single element in the scene, physical laws, and human behaviour. No present computer model is capable of representing and using that amount of world knowledge, so the task has to be specialized to allow for more abstract descriptions of the
desired observations, e.g. “Humans are the only objects that move in the scene” or
“People in the scene are walking upright and therefore have a typical shape”.
5.3.1 Segmentation
“Image segmentation” denotes the process of grouping pixels with a certain common property and can therefore be regarded as a classification task. In the case of people
tracking, the goal of this step is to find pixels in the image that belong to persons.
The segmentation result is represented by a matrix M(x, y) of Booleans that labels each image pixel I(x, y) as one of the two classes “background scene” or “probably person” (Fig. 5.27 b):

M(x, y) = \begin{cases} 0 & \text{if background scene} \\ 1 & \text{if moving foreground object (person)} \end{cases} \quad (5.6)
While this two-class segmentation suffices for numerous tracking tasks (Sect.
5.3.2.1), individual tracking of people during inter-person overlap requires further
segmentation of the foreground regions, i.e. to label each pixel as one of the classes
background scene, n-th person, or unrecognized foreground (probably a new person)
(Fig. 5.27 c). This segmentation result can be represented by individual foreground
masks Mn (x, y) for every tracked person n ∈ {1, . . . , N } and one mask M0 (x, y)
to represent the remaining foreground pixels. Sect. 5.3.3 presents several methods
to segment the silhouettes separately using additional knowledge about shape and
appearance.
A straightforward approach to detect moving objects would be to label all those pixels of the camera image I(x, y, t) at time t as foreground that differ by more than a certain threshold θ from the prerecorded background image:

M(x, y, t) = \begin{cases} 1 & \text{if } \| B(x, y) - I(x, y, t) \| > \theta \\ 0 & \text{otherwise} \end{cases} \quad (5.7)
θ has to be chosen greater than the average camera noise. In real camera setups,
however, the amount of noise is not only correlated to local color intensities (e.g. due
to saturation effects), but there are also several additional types of noise that affect the
segmentation result. Pixels at high contrast edges often have high noise variance due
to micro-movements of the camera and “blooming/smearing”-effects in the image
sensor. Non-static background like moving leaves on trees in outdoor scenes also
causes high local noise. Therefore it is more suitable to describe each background
pixel individually by its mean value μc (x, y) and its noise variance σ 2c (x, y). In the
following, it is assumed that there is no correlation between the noise of the separate
color channels Bc (x, y), c ∈ {r, g, b}. The background model is created from N
images B(x, y, i), i ∈ {1, . . . , N } of the empty scene:
\mu_c(x, y) = \frac{1}{N} \sum_{i=1}^{N} B_c(x, y, i) \quad (5.8)

\sigma_c^2(x, y) = \frac{1}{N} \sum_{i=1}^{N} \left( B_c(x, y, i) - \mu_c(x, y) \right)^2 \quad (5.9)
Assuming a Gaussian noise model, the Mahalanobis distance Δ_M provides a suitable measure to decide whether the difference between image and background is relevant or not:

\Delta_M(x, y, t) = \sum_{c \in \{r, g, b\}} \frac{\left( I_c(x, y, t) - \mu_c(x, y) \right)^2}{\sigma_c^2(x, y)} \quad (5.10)
Camera-based People Tracking 237
element positioned at (x, y), the formal description of the morphological operations
is given by:
F(x, y) = \frac{\sum_{\delta_x = -d}^{+d} \sum_{\delta_y = -d}^{+d} M(x + \delta_x, y + \delta_y)\, E(\delta_x, \delta_y)}{\sum_{\delta_x = -d}^{+d} \sum_{\delta_y = -d}^{+d} E(\delta_x, \delta_y)} \quad (5.14)

\text{Erosion:} \quad M_E(x, y) = \begin{cases} 1 & \text{if } F(x, y) = 1 \\ 0 & \text{otherwise} \end{cases} \quad (5.15)

\text{Dilation:} \quad M_D(x, y) = \begin{cases} 1 & \text{if } F(x, y) > 0 \\ 0 & \text{otherwise} \end{cases} \quad (5.16)
The effects of both operations are shown in Fig. 5.29. Since the erosion reduces the
Fig. 5.29. Morphological operations (original shape, erosion, dilation, opening, closing; structure element of size d). Black squares denote foreground pixels, grey lines mark the borderline of the original region.
size of the original shape by the radius of the structure element while the dilation
expands it, the operations are always applied pairwise. An erosion followed by a di-
lation is called morphological opening. It erases foreground regions that are smaller
than the structure element. Holes in the foreground are closed by morphological clos-
ing, i.e. a dilation followed by an erosion. Often opening and closing operations are
performed subsequently, since both effects are desired.
5.3.2 Tracking
The foreground mask M(x, y) resulting from the previous section marks each image pixel either as “potential foreground object” or as “image background”; however, so far no meaning has been assigned to the foreground regions. This is the task of the
tracking step. “Tracking” denotes the process of associating the identity of a moving
object (person) with the same object in the preceding frames of an image sequence,
thus following the trajectory of the object either in the image plane or in real world
coordinates. To this end, features of each tracked object, like the predicted position
or appearance descriptions, are used to classify and label the foreground regions.
There are two different types of tracking algorithms: In the first one (Sect.
5.3.2.1), all human candidates in the image are detected before the actual tracking
is done. Here, tracking is primarily a classification problem, using various features
to assign each person candidate to one of the internal descriptions of all observed
persons.
The second type of tracking systems combines the detection and tracking step
(Sect. 5.3.2.2). The predicted image area of each tracked person serves as initiali-
zation to search for the new position. This search can be performed either on the
segmented foreground mask or on the image data itself. In the latter case, the process
often also includes the segmentation step, using color descriptions of all persons to
improve the segmentation quality or even to separate people during occlusion. This
second, more advanced class of tracking algorithms is also preferred in model-based
tracking, where human body models are aligned to the person’s shape in the image
to detect the exact position or body posture.
In the following, both methods are presented and discussed. Inter-person overlap and occlusions by objects in the scene will be covered in Sect. 5.3.3.
Fig. 5.30. Tracking following detection: a) camera image, b) foreground segmentation and
person candidate detection, c) people assignment according to distance and appearance.
The prediction (x_n^p, y_n^p) is extrapolated from the history of positions in the preceding frames of the sequence, e.g. with a Kalman filter [53].
A similar distance metric is the overlap ratio between the predicted bounding box
of a tracked person and the bounding box of the compared foreground region. Here,
the region with the largest area inside the predicted bounding box is selected.
While assignment by spatial distance provides the basic framework of a tracking
system, more advanced features are necessary for reliable identification whenever
people come close to each other in the image plane or have to be re-identified after
occlusions (Sect. 5.3.3). To this end, appearance models of all persons are created
when they enter the scene and compared to the candidates in each new frame. There
are many different types of appearance models proposed in the literature, of which a
selection is presented below.
The most direct way to describe a person would be his or her full-body im-
age. Theoretically, it would be possible to take the image of a tracked person in
one frame and to compare it pixel-by-pixel with the foreground region in question in
the next frame. This, however, is not suitable for re-identification after more than a
few frames, e.g. after an occlusion, since the appearance and silhouette of a person
changes continuously during the movement through the scene. A good appearance
model has to be general enough to comprise the varying appearance of a person dur-
ing the observation time, as well as being detailed enough to distinguish the person
from others in the scene. To solve this problem for a large number of people un-
der varying lighting conditions requires elaborate methods. The appearance models
presented in the following sections assume that the lighting within the scene is tem-
porally and spatially constant and that the clothes of the tracked persons look similar
from all sides. Example applications using these models for people tracking can be
found in [24, 59] (Temporal Texture Templates), [3, 32] (color histograms), [34, 56]
(Gaussian Mixture Models) and [17] (Kernel Density Estimation). Other possible
descriptions, e.g. using texture features, are presented in Sect. 5.2.
W_n(\tilde{x}, \tilde{y}, t) = W_n(\tilde{x}, \tilde{y}, t-1) + M_n(x, y, t) \quad (5.18)

T_n(\tilde{x}, \tilde{y}, t) = \frac{W_n(\tilde{x}, \tilde{y}, t-1)\, T_n(\tilde{x}, \tilde{y}, t-1) + M_n(x, y, t)\, I(x, y, t)}{\max\{ W_n(\tilde{x}, \tilde{y}, t),\, 1 \}} \quad (5.19)
Figure 5.31 shows an example of how the template and the weight matrix evolve
over time. Depicting the weights as a grayscale image shows the average silhouette
of the observed person with higher values towards the body center and lower values
in the regions of moving arms and legs. Since these weights are also a measure for
the reliability of the respective average image value, they are used to calculate the
weighted sum of differences dTn,k between the region of a person candidate k and
the average appearance of the compared n-th person.
d_{T_{n,k}} = \frac{\sum_x \sum_y M_k(x, y, t)\, W_n(\tilde{x}, \tilde{y}, t-1)\, \left| T_n(\tilde{x}, \tilde{y}, t-1) - I(x, y, t) \right|}{\sum_x \sum_y M_k(x, y, t)\, W_n(\tilde{x}, \tilde{y}, t-1)} \quad (5.20)
Color Histograms
Instead of the (r,g,b)-space, other color spaces are often used that are less sensitive
regarding illumination changes (see appendix A.5.3). Another expansion of this
principle uses multiple local histograms instead of a global one to include spatial
information in the description.
using the Expectation Maximization (EM-) algorithm [15]. Each v-dimensional blob
i = 1 . . . N is described as a Gaussian distribution by its mean μi and covariance
matrix Ki either in three-dimensional color space (c = (r, g, b), v = 3) or in the
five-dimensional combined feature space of color and normalized image coordinates (c = (r, g, b, \tilde{x}, \tilde{y}), v = 5):

p_i(c) = \frac{1}{(2\pi)^{v/2} \sqrt{\det K_i}} \cdot \exp\left( -\frac{1}{2} (c - \mu_i)^T K_i^{-1} (c - \mu_i) \right) \quad (5.24)
The Mahalanobis distance between a pixel value c and the i-th blob is given as:

\Delta_M(c, i) = (c - \mu_i)^T K_i^{-1} (c - \mu_i) \quad (5.25)
For less calculation complexity, the uncorrelated Gaussian distribution is often used instead, see eqs. 5.8-5.10. The difference d_{G_{n,k}} between the description of the n-th person and an image region M_k(x, y) is calculated as the average minimal Mahalanobis distance between each image pixel c(x, y) inside the mask and the set of Gaussians:

d_{G_{n,k}} = \frac{\sum_x \sum_y M_k(x, y) \min_{i_n = 1 \ldots N_n} \Delta_M(c(x, y), i_n)}{\sum_x \sum_y M_k(x, y)} \quad (5.26)
Fig. 5.32. Comparison of probability density approximation by histograms (a) and kernel den-
sity estimation (b).
Shape-based Tracking
The silhouette adaptation process is initialized with the predicted image position
of a person’s silhouette, which can be represented either by the parameters of
the surrounding bounding box (center coordinates, width and height) or by the
parameters of a 2D model of the human shape. The simplest model is an average human silhouette that is shifted and scaled to approximate a person’s silhouette in
the image. If the framerate is not too low, the initial image area overlaps with the
segmented foreground region that represents the current silhouette of that person.
The task of the adaptation algorithm is to update the model parameters to find the
best match between the silhouette representation (bounding box, 2D model) and the
foreground region.
k(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \cdot \exp\left( -\frac{1}{2} \left( \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right) \right) \quad (5.28)
As a result, less importance is given to the outer points of a silhouette, which include
high variability and therefore unreliability due to arm and leg movements. Using the
segmented foreground mask M(x, y) as source data, the mean shift vector dx(i) is
calculated as
d\mathbf{x}(i) = \frac{\sum_{y = -h/2}^{h/2} \sum_{x = -w/2}^{w/2} \begin{pmatrix} x \\ y \end{pmatrix} M(x_m(i-1) + x,\; y_m(i-1) + y)\, k(x, y)}{\sum_{y = -h/2}^{h/2} \sum_{x = -w/2}^{w/2} M(x_m(i-1) + x,\; y_m(i-1) + y)\, k(x, y)} \quad (5.29)
The position of the search window is then updated with the mean shift vector:

\begin{pmatrix} x_m(i) \\ y_m(i) \end{pmatrix} = \begin{pmatrix} x_m(i-1) \\ y_m(i-1) \end{pmatrix} + d\mathbf{x}(i) \quad (5.30)
To this end, the rectangle is resized after the last iteration according to the size of
the enclosed foreground area (CAMShift - Continuously Adaptive Mean Shift Algo-
rithm [7]),
w \cdot h = C \cdot \sum_{y = -h/2}^{h/2} \sum_{x = -w/2}^{w/2} M(x_m(i) + x,\; y_m(i) + y)\, k(x, y) \quad (5.31)
Fig. 5.34. Iterative alignment (c) of a shape model (a) to the segmented foreground (b) using
linear regression.
One approach to solve this problem is to define a measure for the matching error
(e.g. the sum of all pixel differences) and to find the minimum of the error function
in parameter space using global search techniques or gradient-descent algorithms.
These approaches, however, disregard the fact that the high-dimensional vector of
pixel differences between the current model position and the foreground region
contains valuable information about how to adjust the parameters. For example, the
locations of pixel differences in Fig. 5.34 c between the average silhouette and the
foreground region correspond to downscaling and shifting the model to the right.
Expressed mathematically, the task is to find a transformation between two vector
spaces: the w ∗ h-dimensional space of image differences di (by concatenating all
rows of the difference image into one vector) and the space of model parameter
changes dp , which has 4 dimensions here. One solution to this task is provided by
linear regression, resulting in the regression matrix R as a linear approximation of the correspondence between the two vector spaces:
d_p = R\, d_i \quad (5.32)
The matching algorithm works iteratively. In each iteration, the pixel difference vec-
tor di is calculated using the current model parameters, which are then updated using
the parameter change resulting from eq. 5.32.
To calculate the regression matrix R, a set of training data has to be collected in
advance by randomly varying the model parameters and calculating the difference
vector between the original and the shifted and scaled shape. All training data is
written in two matrices D_i and D_p, of which the corresponding columns are the difference vectors d_i or the parameter changes d_p respectively. The regression matrix is then calculated from D_i and D_p by linear least-squares estimation.
Color-based Tracking
In this type of tracking algorithm, appearance models of all tracked persons in the
image are used for the position update. Therefore no previous segmentation of the
image in background and foreground regions is necessary. Instead, the segmentation
and tracking steps are combined. Using a background model (Sect. 5.3.1) and the
color descriptions of the persons at the predicted positions, each pixel is classified ei-
ther as background, as belonging to a specific person, or as unrecognized foreground.
An example is the Pfinder (“Person finder”) algorithm [56]. Here, each person n is
represented by a mixture of m multivariate Gaussians Gn,i (c), i = 1 . . . m, with
c = (r, g, b, x, y) being the combined feature space of color and image coordinates
(Sect. 5.3.2.1). For each image pixel c(x, y), the Mahalanobis distances ΔM,B to
the background model and ΔM,n,i to each Gaussian are calculated according to
equation 5.10 or 5.25 respectively. A pixel is then classified using the following
equation, where θ is an appropriate threshold:
Fig. 5.35. Image segmentation with fixed thresholds (b,c) or expected color distributions (d).
In the previous sections, several methods have been presented to track people as
isolated regions in the image plane. In real situations, however, people form groups or
pass each other at different distances from the camera. Depending on the monitored
scene and the camera perspective, it is more or less likely that the corresponding
silhouettes in the image plane merge into one foreground region. How to deal with
these situations is an elementary question in the development of a tracking system.
The answer depends largely on the application area and the system requirements.
In the surveillance of a parking lot from a high and distant camera position, it is
sufficient to detect when and where people merge into a group and, after the group
splits again, to continue tracking them independently (Sect. 5.3.3.1). Separate person
tracking during occlusion is neither necessary nor feasible here due to the small
image regions the persons occupy and the relatively small resulting position error. A
counter-example is the automatic video surveillance in public transport like a subway
wagon, where people partially occlude each other permanently. In complex situations
like that, elaborate methods that combine person separation by shape, color and 3D
information (Sect. 5.3.3.2 - 5.3.3.3) are required for robust tracking.
5.3.3.1 Occlusion Handling without Separation
As mentioned above, there are many surveillance tasks where inter-person occlusions
are not frequent and therefore a tracking logic that analyzes the merging and split-
ting of foreground regions suffices. This method is applied particularly in tracking
following detection systems as presented in Sect. 5.3.2.1 [24, 32].
Merging of two persons in the image plane is detected if both are assigned to the same foreground region. In the following frames, both are tracked as one group object until it splits up again or merges with other persons. The first case occurs if
two or more foreground regions are assigned to the group object. The separate image
regions are then identified by comparing them to the appearance models of the per-
sons in question. This basic principle can be extended by a number of heuristic rules
to cope with potential combinations of appearing/disappearing and splitting/merging
regions. The example in Sect. 5.3.5 describes a tracking system of this type in detail.
5.3.3.2 Separation by Shape
When the silhouettes of overlapping people merge into one foreground region, the
resulting shape contains information about how the people are arranged (Fig. 5.36).
There are multiple ways to take advantage of that information, depending on the type
of tracking algorithm used.
The people detection method in tracking following detection systems (Sect.
5.3.2.1) can be improved by further analyzing the shape of each connected fore-
ground region to derive a hypothesis about the number of persons it is composed of.
A common approach is the detection and counting of head shapes that extend from
the foreground blob (Fig. 5.36 b). Possible methods are the detection of local peaks
in combination with local maxima of the extension in y-direction [59] or the analysis
of the curvature of the foreground region’s top boundary [23]. Since these methods
Fig. 5.36. Shape-based silhouette separation: (b) by head detection, (c) by silhouette align-
ment.
will fail in situations where people are standing directly behind each other so that
not all of the heads are visible, the results are primarily used to provide additional
information to the tracking logic or to following processing steps.
A more advanced approach uses models of the human shape as introduced in
Sect. 5.3.2.2. Starting from the predicted positions, the goal is to find an arrangement
of the individual human silhouettes that minimizes the difference to the respective
foreground region. With the linear regression tracking algorithm presented above,
the following principle causes the silhouettes to align only with the valid outer
boundaries and to ignore the areas of overlap: in each iteration, the entries of the
difference vector di are set to zero at foreground points that lie inside the bounding
box of another person (gray region in Fig. 5.36 c). As a result, the parameter update
is calculated only from the remaining non-overlapping parts. If a person is completely
occluded, the difference vector and therefore also the translation is zero, so the
prediction is kept as the final position estimate.
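The zeroing of the difference vector can be sketched with a translation-only update (the full algorithm of Sect. 5.3.2.2 also adapts further shape parameters; the point correspondences are assumed to be given):

```python
import numpy as np

def masked_alignment_step(model_pts, target_pts, occluder_boxes):
    """One iteration of the masked silhouette alignment: the difference
    vectors d_i are zeroed for points lying inside the bounding box of
    another (closer) person, so the translation update is computed from
    the non-overlapping parts only.  model_pts/target_pts: (N, 2) arrays
    of corresponding points; occluder_boxes: list of (x0, y0, x1, y1)."""
    d = target_pts - model_pts                       # difference vectors d_i
    valid = np.ones(len(model_pts), dtype=bool)
    for (x0, y0, x1, y1) in occluder_boxes:
        inside = ((target_pts[:, 0] >= x0) & (target_pts[:, 0] <= x1) &
                  (target_pts[:, 1] >= y0) & (target_pts[:, 1] <= y1))
        valid &= ~inside                             # d_i := 0 in overlap areas
    if not valid.any():                              # person completely occluded:
        return np.zeros(2)                           # zero translation, keep prediction
    return d[valid].mean(axis=0)                     # update from visible parts only
```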
Tracking during occlusion using only the shape of the foreground region requires
a good initialization to converge to the correct positions. Furthermore, there is often
more than one plausible arrangement of the overlapping persons. Tracking robustness
can be increased by additionally separating the persons by color or 3D information.
5.3.3.3 Separation by Color
The best results in individual people tracking during occlusions are achieved if their
silhouettes are separated from each other using appearance models as described in
Sect. 5.3.2.2.
In contrast to the previous methods, which work entirely in the 2D image plane,
reasoning about the relative depth of the occluding people with respect to each other
is necessary here to cope with the fact that people in the back are only partially visible,
thus affecting the position calculation. Knowing the order of overlapping persons
from front to back in combination with their individually segmented image regions,
a more advanced version of the shape-based tracking algorithm from Sect. 5.3.3.2
can be implemented that adapts only to the visible parts of a person and ignores the
corresponding occluded areas, i.e. image areas assigned to persons closer towards
the camera.
The relative depth positions can be derived either by finding the person arrange-
ment that maximizes the overall color similarity [17], by using 3D information, or
by tracking people on the ground plane of the scene [18, 34].
5.3.3.4 Separation by 3D Information
While people occluding each other share the same 2D region in the image plane, they
are positioned at different depths in the scene. Additional depth information therefore
makes it possible to split the combined foreground region into areas of similar depth,
which can then be assigned to the people in question using one of the above methods
(e.g. color descriptions).
One way to measure the depth of each image pixel is to use a stereo camera [25],
which consists of a pair of identical cameras mounted side by side. An image
processing module searches the resulting image pair for corresponding feature
points, whose depth is then calculated from their horizontal offset. Drawbacks of this
method are that the depth resolution is often coarse and that feature point detection
is unstable in uniformly colored areas.
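The depth computation from the horizontal offset follows the standard triangulation formula Z = f·B/d; the focal length and baseline below are illustrative values, not taken from the text. The coarse depth resolution mentioned above shows up as the large depth step between neighbouring integer disparities:

```python
def stereo_depth(disparity_px, focal_len_px=800.0, baseline_m=0.12):
    """Depth of a matched feature point from its horizontal offset
    (disparity) between the left and right image: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("zero/negative disparity: point at infinity or bad match")
    return focal_len_px * baseline_m / disparity_px

# Depth quantization: one-pixel disparity steps map to coarse depth steps.
z8, z7 = stereo_depth(8), stereo_depth(7)   # 12.0 m vs. about 13.7 m
```

Between disparities of 8 and 7 pixels the measured depth jumps by roughly 1.7 m, which is why pixel-accurate stereo is often too coarse for precise positioning at long range.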
An extension of the stereo principle is the use of multiple cameras observing the
scene from different viewpoints. All cameras are calibrated so that image coordinates
can be transformed into 3D scene coordinates and vice versa, as explained in
the following section. People are tracked on the ground plane, taking advantage of
the least occluded view of each person. Such multi-camera systems are the most
favored approach to people tracking in narrow indoor scenes [3, 34].
In the preceding sections, various methods have been presented to track and describe
people as flat image regions in the 2D image plane. These systems provide the information
where the tracked persons are located in the camera image. However, many
applications need to know their location and trajectory in the real 3D scene instead,
e.g., to evaluate their behaviour in sensitive areas or to merge the views from multiple
cameras. Additionally, it has been shown that knowledge about the depth positions
in the scene improves the segmentation of overlapping persons (Sect. 5.3.3.3).
A mathematical model of the imaging process is used to transform 3D world
coordinates into the image plane and vice versa. To this end, the prior knowledge of
extrinsic (camera position, height, tilt angle) and intrinsic camera parameters (focal
length, opening angle, potential lens distortions) is necessary. The equations derived
in the following use so-called homogeneous coordinates to denote points in the world
or image space:

\[
\mathbf{x} = \begin{pmatrix} wx \\ wy \\ wz \\ w \end{pmatrix} \qquad (5.36)
\]
A general homogeneous transformation has the form

\[
\mathbf{A} = \begin{pmatrix}
r_{11} & r_{12} & r_{13} & x_\Delta \\
r_{21} & r_{22} & r_{23} & y_\Delta \\
r_{31} & r_{32} & r_{33} & z_\Delta \\
\frac{1}{d_x} & \frac{1}{d_y} & \frac{1}{d_z} & s
\end{pmatrix} \qquad (5.37)
\]

The coefficients r_ij denote the rotation of the coordinate system, (x_Δ, y_Δ, z_Δ) the
translation, (1/d_x, 1/d_y, 1/d_z) the perspective distortion, and s the scaling.
Fig. 5.37. Pinhole camera model: camera coordinate system (x_C, y_C, z_C), world coordinates (x, z), camera height H_C, and focal length D_C.
Most cameras can be approximated by the pinhole camera model (Fig. 5.37). In
case of pincushion or other image distortions caused by the optical system, additional
normalization is necessary. The projection of 3D scene coordinates into the image
plane of a pinhole camera located in the origin is given by the transformation matrix
M_H:

\[
\mathbf{M}_H = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & -\frac{1}{D_C} & 0
\end{pmatrix} \qquad (5.38)
\]
DC denotes the focal length of the camera model. Since the image coordinates are
measured in pixel units while other units are used for the scene positions, an additional
scaling is required; the camera pose is described by a translation matrix MT and a
rotation matrix MR.
The total transformation matrix M results from the concatenation of the three trans-
formations:
\[
\mathbf{M} = \mathbf{M}_H \mathbf{M}_T \mathbf{M}_R =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & \cos\alpha & \sin\alpha & -H_C \cos\alpha \\
0 & -\sin\alpha & \cos\alpha & H_C \sin\alpha \\
0 & \frac{\sin\alpha}{D_C} & -\frac{\cos\alpha}{D_C} & -\frac{H_C \sin\alpha}{D_C}
\end{pmatrix} \qquad (5.41)
\]
The camera coordinates (xC , yC ) can now be calculated from the world coordinates
(x, y, z) as follows:
\[
\begin{pmatrix} w_C x_C \\ w_C y_C \\ w_C z_C \\ w_C \end{pmatrix}
= \mathbf{M} \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
= \begin{pmatrix}
x \\
y \cos\alpha + z \sin\alpha - H_C \cos\alpha \\
-y \sin\alpha + z \cos\alpha + H_C \sin\alpha \\
\frac{1}{D_C}\left( y \sin\alpha - z \cos\alpha - H_C \sin\alpha \right)
\end{pmatrix} \qquad (5.42)
\]
By dividing the homogeneous coordinates by w_C, the final equations for the image
coordinates are derived:

\[
x_C = \frac{-D_C\, x}{z \cos\alpha + (H_C - y) \sin\alpha} \qquad (5.43)
\]
\[
y_C = \frac{D_C \left( (H_C - y) \cos\alpha - z \sin\alpha \right)}{z \cos\alpha + (H_C - y) \sin\alpha} \qquad (5.44)
\]
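The transformation can be checked numerically. The sketch below builds M according to eq. (5.41) for illustrative camera parameters (the values of H_C, α, and D_C are arbitrary assumptions, not from the text) and verifies the projected x-coordinate against the closed form of eq. (5.43):

```python
import numpy as np

HC, alpha, DC = 2.5, 0.3, 1.0          # hypothetical camera height, tilt, focal length
ca, sa = np.cos(alpha), np.sin(alpha)

# Total transformation matrix M = M_H M_T M_R, eq. (5.41)
M = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0,  ca,  sa, -HC * ca],
    [0.0, -sa,  ca,  HC * sa],
    [0.0, sa / DC, -ca / DC, -HC * sa / DC],
])

def project(x, y, z):
    """World point -> image coordinates via homogeneous coordinates:
    apply M (eq. 5.42), then divide by the fourth component w_C."""
    hom = M @ np.array([x, y, z, 1.0])
    return hom[0] / hom[3], hom[1] / hom[3]

x, y, z = 1.0, 1.8, 6.0                # e.g. the head of a person 6 m away
xC, yC = project(x, y, z)
xC_closed = -DC * x / (z * ca + (HC - y) * sa)   # eq. (5.43)
```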
help of body models. Therefore the height is either equal to zero or to the body height
HP of the person. The body height can be calculated if the y-positions of the head
yc,H and of the feet yc,F are detected simultaneously:
\[
H_P = H_C D_C \, \frac{y_{c,F} - y_{c,H}}{\left( y_{c,F} \cos\alpha + D_C \sin\alpha \right)\left( D_C \cos\alpha - y_{c,H} \sin\alpha \right)} \qquad (5.45)
\]
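Equation (5.45) can be verified by a round trip: project the feet (y = 0) and the head (y = H_P) of a person at depth z and recover H_P from the two image coordinates. The camera parameters are again illustrative assumptions; the image y-coordinate follows from eqs. (5.41)-(5.43):

```python
import numpy as np

HC, alpha, DC = 2.5, 0.35, 1.0         # hypothetical camera height, tilt, focal length
ca, sa = np.cos(alpha), np.sin(alpha)

def y_image(y, z):
    """Vertical image coordinate of a world point at height y and depth z
    (the y-counterpart of eq. (5.43); the x-coordinate drops out)."""
    return DC * ((HC - y) * ca - z * sa) / (z * ca + (HC - y) * sa)

def body_height(yc_F, yc_H):
    """Eq. (5.45): body height from the simultaneously detected image
    y-positions of the feet (yc_F) and the head (yc_H)."""
    return (HC * DC * (yc_F - yc_H) /
            ((yc_F * ca + DC * sa) * (DC * ca - yc_H * sa)))

HP, z = 1.8, 6.0                       # true body height and depth
yc_F, yc_H = y_image(0.0, z), y_image(HP, z)
```

Note that the recovered height does not depend on the person's depth in the scene, which is what makes eq. (5.45) useful for tracking.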
5.3.5 Example
In this example, a people tracking system of the type tracking following detection is
built and tested using IMPRESARIO and the LTI-Lib (see appendices A and B). As
explained in Sect. 5.3.2.1, systems of this type are more appropriate in scenes where
inter-person occlusion is rare, e.g. the surveillance of large outdoor areas from a
high camera perspective, than in narrow or crowded indoor scenes. This is because
stable tracking requires the tracked persons to be separated from each other most of
the time.
The general structure of the system as it appears in IMPRESARIO is shown in
Fig. 5.38. The image source can be either a live stream from a webcam or a prerecorded
image sequence. Included on the book CD are two example sequences, one of an
outdoor parking lot scene (in the IMPRESARIO subdirectory ./imageinput/ParkingLot)
and one of an indoor scene (directory ./imageinput/IndoorScene). The corresponding
IMPRESARIO projects are PeopleTracking Parking.ipg and PeopleTracking Indoor.ipg.
The first processing step is the segmentation of moving foreground objects by
background subtraction (see Sect. 5.3.1.1). The background model is calculated from
the first initialFrameNum frames of the image sequence, so it has to be ensured
that these frames show only the empty background scene without any persons or
other moving objects. Alternatively, a pre-built background model (created with the
IMPRESARIO macro trainBackgroundModel) can be loaded; it must have been recorded
under exactly the same lighting conditions. The automatic adjustment of the camera
parameters has to be deactivated to prevent the background colors from changing when
people with dark or bright clothes enter the scene.
After reducing the noise of the resulting foreground mask with a sequence of
morphological operations, person candidates are detected using the objectsFrom-
Mask macro (see Fig. 2.11 in Sect. 2.1.2.2). The detected image regions are passed
on to the peopleTracking macro, where they are used to update the internal list of
tracked objects. A tracking logic handles the appearing and disappearing of people in
the camera field of view as well as the merging and splitting of regions as people oc-
clude each other temporarily in the image plane. Each person is described by a one-
or two-dimensional Temporal Texture Template (Sect. 5.3.2.1) for re-identification
after an occlusion or, optionally, after re-entering the scene.
The tracking result is displayed using the drawTrackingResults macro (Fig.
5.39). The appearance models of individual persons can be visualized with the macro
extractTrackingTemplates.
Fig. 5.39. Example output of tracking system (a) before, (b) during and (c) after an overlap.
In the following, all macros are explained in detail and the possibilities to vary
their parameters are presented. The parameter values used in the example sequences
are given in parentheses after each parameter description.
Fig. 5.40. Shadow similarity scaling factor wS as a function of the decrease of intensity ΔI, ramping between the parameters minShadow and maxShadow.
threshold: Threshold for the Mahalanobis distance in color space to decide be-
tween foreground object and background. (30.0)
use adaptive model: True: Update background model in each frame at the back-
ground pixels, false: use static background model for entire sequence. (True)
adaptivity: Number of past images that define the background model. A new image
is weighted with 1/frame_number if the current frame number is lower than
adaptivity, or with 1/adaptivity otherwise. "0" stands for an infinite
number of frames, i.e. the background model is the average of all past frames.
(100)
apply shadow reduction: True: reduce the effect of classifying shadows as fore-
ground object. (True)
minShadow: Shadow reduction parameter, see explanation above. (0.02)
maxShadow: Shadow reduction parameter. (0.2)
shadowThreshold: Shadow reduction parameter. (0.1)
intensityWeight: Shadow reduction parameter. (3.0)
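Under the running-average reading of the adaptivity parameter (weight 1/frame_number while fewer than adaptivity frames have been seen, then the constant weight 1/adaptivity), the background update can be sketched per pixel as follows; scalar pixels are used for brevity, the same expression applies per pixel and channel:

```python
def bg_weight(frame_number, adaptivity=100):
    """Weight of the new image in the background model: 1/frame_number
    while fewer than `adaptivity` frames were seen (plain running
    average), then the constant 1/adaptivity (exponential forgetting).
    adaptivity == 0 means an infinite window, i.e. the model stays the
    average of all past frames.  frame_number starts at 1."""
    if adaptivity == 0 or frame_number < adaptivity:
        return 1.0 / frame_number
    return 1.0 / adaptivity

def update_background(model, pixel, frame_number, adaptivity=100):
    """Blend the new pixel value into the background model."""
    w = bg_weight(frame_number, adaptivity)
    return (1.0 - w) * model + w * pixel
```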
Morphological Operations This macro applies a sequence of morphological oper-
ations like dilation, erosion, opening and closing on a binary or grayscale image
as presented in Sect. 5.3.1.2. Input and output are always of the same data type,
either a lti::channel8 or a lti::channel. The parameters are:
Morph. Op. Sequence: std::string that defines the sequence of morphological
operations using the following abbreviations: "e" = erosion, "d" = dilation,
"o" = opening, "c" = closing. ("ddeeed")
Mode: process input channel in “binary” or “gray” mode. (“binary”)
Kernel type: Type of structure element used. (square)
Kernel size: Size of the structure element (3,5,7,...). (3)
apply clipping: Clip values that lie outside the valid range (only used in gray
mode). (False)
filter strength: Factor for kernel values (only used in gray mode). (1)
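The operation string can be interpreted as sketched below. This is a pure-NumPy binary sketch with a square structure element; the macro's grayscale mode, clipping, and other kernel types are omitted:

```python
import numpy as np

def _dilate(m, k):
    """Binary dilation with a k x k square structure element."""
    r = k // 2
    p = np.pad(m, r)                       # pad with False (background)
    out = np.zeros_like(m)
    for dy in range(k):
        for dx in range(k):
            out |= p[dy:dy + m.shape[0], dx:dx + m.shape[1]]
    return out

def _erode(m, k):
    """Binary erosion with a k x k square structure element."""
    r = k // 2
    p = np.pad(m, r)
    out = np.ones_like(m)
    for dy in range(k):
        for dx in range(k):
            out &= p[dy:dy + m.shape[0], dx:dx + m.shape[1]]
    return out

def morph_sequence(mask, sequence="ddeeed", kernel_size=3):
    """Apply a sequence of morphological operations encoded as a string:
    'e' = erosion, 'd' = dilation, 'o' = opening, 'c' = closing."""
    out = mask.astype(bool)
    for op in sequence:
        if op == "d":
            out = _dilate(out, kernel_size)
        elif op == "e":
            out = _erode(out, kernel_size)
        elif op == "o":
            out = _dilate(_erode(out, kernel_size), kernel_size)
        elif op == "c":
            out = _erode(_dilate(out, kernel_size), kernel_size)
        else:
            raise ValueError("unknown operation: " + op)
    return out
```

The default sequence "ddeeed" closes small gaps first and then removes isolated noise pixels while leaving large interior regions unchanged (two dilations, three erosions, one dilation: net growth zero).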
ObjectsFromMask This macro detects connected regions in the foreground mask.
Threshold: Threshold to separate foreground from background values.
(100)
Level: Iteration level to describe hierarchy of objects and holes. (0)
Assume labeled mask: Is input a labeled foreground mask (each region identified
by a number)? (False)
Melt holes: Close holes in object description (not necessary when Level=0).
(False)
Minimum size: Minimum size of detected objects in pixels. (60)
Sort Objects: Sort objects according to Sort by Area. (False)
Sort by Area: Sort objects by their area size (true) or the size of the data structure
(false). (False)
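The region detection can be sketched with a breadth-first flood fill over the 8-neighbourhood; hole melting, hierarchy levels, and the sorting options are left out:

```python
from collections import deque
import numpy as np

def objects_from_mask(mask, threshold=100, min_size=60):
    """Detect connected foreground regions (8-neighbourhood) in a
    grayscale mask and discard regions smaller than min_size pixels."""
    fg = mask >= threshold
    h, w = fg.shape
    visited = np.zeros((h, w), dtype=bool)
    regions = []
    for sy in range(h):
        for sx in range(w):
            if fg[sy, sx] and not visited[sy, sx]:
                visited[sy, sx] = True
                queue = deque([(sy, sx)])
                pixels = []
                while queue:                      # breadth-first flood fill
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and fg[ny, nx] and not visited[ny, nx]):
                                visited[ny, nx] = True
                                queue.append((ny, nx))
                if len(pixels) >= min_size:       # 'Minimum size' parameter
                    regions.append(pixels)
    return regions
```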
PeopleTracker The tracking macro is a tracking following detection system (see
Sect. 5.3.2.1) based on a merge/split logic (Sect. 5.3.3.1). The internal state of
the system, which is also the output of the macro, is represented by an instance
of the class trackedObjects (Fig. 5.41). This structure includes two lists: The first
one is a list of the individual objects (here: persons), each described by a one-
trackedObjects
std::list<individualObjects> std::list<visibleBlobs>
Fig. 5.41. Structure of the class trackedObjects. One or more individual objects (persons) are
assigned to each visible foreground blob. Gray, unassigned objects have left the scene but are
kept for re-identification.
or two-dimensional Temporal Texture Template. The second list includes all current
foreground regions ("blobs"), each represented by lti::areaPoints, its
bounding box, and a Kalman filter that predicts the movement from the past trajectory.
One or, in the case of inter-person overlap, multiple individual objects are
assigned to one visible blob. In each image frame, this structure is updated by
comparing the predicted positions of the foreground blobs in the visibleBlob list
with the newly detected foreground regions. The following cases are recognized
(see Fig. 5.42):
Fig. 5.42. Possible cases during tracking. Dashed line = predicted bounding box of a tracked
blob; solid line and gray rectangle = bounding box of a newly detected foreground region.
a) Old blob does not overlap with new blob: The blob is deleted from the
visibleBlob list, the assigned persons are deleted or deactivated in the indi-
vidualObjects list.
b) One old blob overlaps with one new blob: Clear tracking case, the position
of the blob is updated.
c) Multiple old blobs overlap with one new blob: Merge case, new blob is
added to visibleBlob list, old blobs are deleted, and all included persons are
assigned to the new blob.
d) One old blob overlaps with multiple new blobs: To resolve the ambiguity,
each person included in the blob is assigned to one of the new blobs using
the Temporal Texture Template description. Each assigned new blob is added
to the visibleBlob list, and the old blob is deleted. In the special case of only one
included person and multiple overlapping blobs that do not overlap with
other old blobs, a "false split" is assumed. This can occur if the segmented
silhouette of a person consists of separate regions due to similar colors in the
background. In this case, the separate regions are united to update the blob
position.
e) New blob does not overlap with any old blob: The blob is added to
the visibleBlob list if its width-to-height ratio lies inside the valid range for
a person. If the parameter memory is true, the system tries to re-identify a
previously tracked person; otherwise a new person is added to the
individualObjects list.
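The five cases can be detected from bounding-box overlaps between the predicted (old) blobs and the newly detected regions. A sketch of this classification; the false-split refinement of case d) and the actual person re-assignment are left out:

```python
def overlaps(a, b):
    """Axis-aligned bounding boxes given as (x0, y0, x1, y1)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def classify(old_blobs, new_blobs):
    """Label each tracked (old) blob and each newly detected blob with
    one of the cases a)-e) described above."""
    old_to_new = {i: [j for j, nb in enumerate(new_blobs) if overlaps(ob, nb)]
                  for i, ob in enumerate(old_blobs)}
    new_to_old = {j: [i for i, ob in enumerate(old_blobs) if overlaps(ob, nb)]
                  for j, nb in enumerate(new_blobs)}
    events = []
    for i, js in old_to_new.items():
        if not js:
            events.append(("a", i))               # a) blob disappeared
        elif len(js) > 1:
            events.append(("d", i))               # d) split into several new blobs
        elif len(new_to_old[js[0]]) == 1:
            events.append(("b", i))               # b) clear one-to-one track
    for j, is_ in new_to_old.items():
        if not is_:
            events.append(("e", j))               # e) new blob appeared
        elif len(is_) > 1:
            events.append(("c", j))               # c) merge of several old blobs
    return sorted(events)
```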
The macro uses the following parameters:
Memory: True: Persons that have left the field of view are not deleted but used
for re-identification whenever a new person is detected. (True)
Threshold: Similarity threshold to re-identify a person. (100)
Use Texture Templates: True: use Temporal Texture Templates to identify a per-
son, false: assign persons using only the distance between the predicted and
detected position (True)
Texture Template Weight: Update ratio for Temporal Texture Templates (0.1)
Template Size x/y: Size of Temporal Texture Template in pixels (x=50, y=100)
Average x/y Values: Use 2D Temporal Texture Template or use average values in
x- or y-direction. (x: True, y: False)
Intensity weight: Weight of intensity difference in contrast to color difference
when comparing templates (0.3)
UseKalmanPrediction: True: use a Kalman filter to predict the bounding box
positions in the next frame, false: use old positions in new frame. (True)
Min/Max width-to-height ratio: Valid range of width-to-height ratio of new de-
tected objects (Min = 0.25, Max = 0.5)
Text output: Write process information in console output window (False)
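The Temporal Texture Template parameters translate into a simple exponential update and a weighted distance. The (intensity, color, color) channel layout below is an assumption for illustration, not the LTI-Lib's internal representation:

```python
import numpy as np

def update_template(template, observation, rate=0.1):
    """Blend the new observation into the template; `rate` corresponds
    to the 'Texture Template Weight' (update ratio) parameter."""
    return (1.0 - rate) * template + rate * observation

def template_distance(t1, t2, intensity_weight=0.3):
    """Dissimilarity of two templates stored as (intensity, color, color)
    channel stacks; the 'Intensity weight' parameter balances the
    intensity difference against the color difference."""
    d_int = np.mean(np.abs(t1[0] - t2[0]))
    d_col = np.mean(np.abs(t1[1:] - t2[1:]))
    return intensity_weight * d_int + (1.0 - intensity_weight) * d_col
```

With the default rate of 0.1, a template needs many frames to adapt, which keeps the appearance description stable over brief occlusions.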
DrawTrackingResults This macro draws the tracking result into the source image.
Text: Label objects. (True)
Draw Prediction: Draw predicted bounding box additionally. (False)
Color: Choose clearly visible colors for each single object and merged objects.
ExtractTrackingTemplates This macro can be used to display the appearance
model of a chosen person from the trackedObjects class.
Object ID: ID number of tracked object (person) to display.
Display Numbers: True: Display ID number.
Color of the numbers: Color of displayed ID number.
References
1. Ahonen, T., Hadid, A., and Pietikäinen, M. Face Recognition with Local Binary Patterns.
In Proceedings of the 8th European Conference on Computer Vision, pages 469–481.
Springer, Prague, Czech Republic, May 11-14 2004.
2. Alvarado, P., Dörfler, P., and Wickel, J. AXON 2 - A Visual Object Recognition System
for Non-Rigid Objects. In Hamza, M. H., editor, IASTED International Conference-
Signal Processing, Pattern Recognition and Applications (SPPRA), pages 235–240.
Rhodes, Greece, July 3-6 2001.
3. Batista, J. Tracking Pedestrians Under Occlusion Using Multiple Cameras. In Int. Con-
ference on Image Analysis and Recognition, volume Lecture Notes in Computer Science
3212, pages 552–562. 2004.
4. Belhumeur, P., Hespanha, J., and Kriegman, D. Eigenfaces vs. Fisherfaces: Recognition
Using Class Specific Linear Projection. In IEEE Transactions on Pattern Analysis and
Machine Intelligence, pages 711–720. 1997.
5. Bileschi, S. and Heisele, B. Advances in Component-based Face Detection. In 2003 IEEE
International Workshop on Analysis and Modeling of Faces and Gestures, AMFG 2003,
pages 149–156. IEEE, October 2003.
6. Bishop, C. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
7. Bradski, G. Computer Vision Face Tracking for Use in a Perceptual User Interface. Intel
Technology Journal, Q2, 1998.
8. Brunelli, R. and Poggio, T. Face Recognition: Features versus Templates. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 15(10):1042–1052, 1993.
9. Cheng, Y. Mean Shift, Mode Seeking and Clustering. IEEE PAMI, 17:790–799, 1995.
10. Chung, K., Kee, S., and Kim, S. Face Recognition Using Principal Component Analysis
of Gabor Filter Responses. In International Workshop on Recognition, Analysis, and
Tracking of Faces and Gestures in Real-Time Systems. 1999.
11. Comaniciu, D. and Meer, P. Mean Shift: A Robust Approach toward Feature Space Analy-
sis. IEEE PAMI, 24(5):603–619, 2002.
12. Cootes, T. and Taylor, C. Statistical Models of Appearance for Computer Vision. Tech-
nical report, University of Manchester, September 1999.
13. Cootes, T. and Taylor, C. Statistical Models of Appearance for Computer Vision. Tech-
nical report, Imaging Science and Biomedical Engineering, University of Manchester,
March 2004.
14. Craw, I., Costen, N., Kato, T., and Akamatsu, S. How Should We Represent Faces for
Automatic Recognition? IEEE Transactions on Pattern Recognition and Machine Intel-
ligence, 21(8):725–736, 1999.
15. Dempster, A., Laird, N., and Rubin, D. Maximum Likelihood from Incomplete Data
Using the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38,
1977.
16. Duda, R. and Hart, P. Pattern Classification and Scene Analysis. John Wiley & Sons,
1973.
17. Elgammal, A. and Davis, L. Probabilistic Framework for Segmenting People Under Oc-
clusion. In IEEE ICCV, volume 2, pages 145–152. 2001.
18. Fillbrandt, H. and Kraiss, K.-F. Tracking People on the Ground Plane of a Cluttered Scene
with a Single Camera. WSEAS Transactions on Information Science and Applications,
9(2):1302–1311, 2005.
19. Gavrila, D. The Visual Analysis of Human Movement: A Survey. Computer Vision and
Image Understanding, 73(1):82–98, 1999.
20. Georghiades, A., Belhumeur, P., and Kriegman, D. From Few to Many: Illumination Cone
Models for Face Recognition under Variable Lighting and Pose. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 23(6):643–660, 2001.
21. Gutta, S., Huang, J., Singh, D., Shah, I., Takács, B., and Wechsler, H. Benchmark Studies
on Face Recognition. In Proceedings of the International Workshop on Automatic Face-
and Gesture-Recognition (IWAFGR). Zürich, Switzerland, 1995.
22. Hähnel, M., Klünder, D., and Kraiss, K.-F. Color and Texture Features for Person Recog-
nition. In International Joint Conference on Neural Networks (IJCNN 2004). Budapest,
Hungary, July 2004.
23. Haritaoglu, I., Harwood, D., and Davis, L. Hydra: Multiple People Detection and Track-
ing Using Silhouettes. In Proc. IEEE Workshop on Visual Surveillance, pages 6–13. 1999.
24. Haritaoglu, I., Harwood, D., and Davis, L. W4: Real-Time Surveillance of People and
Their Activities. IEEE PAMI, 22(8):809–830, 2000.
25. Harville, M. and Li, D. Fast, Integrated Person Tracking and Activity Recognition with
Plan-View Templates from a Single Stereo Camera. In IEEE CVPR, volume 2, pages
398–405. 2004.
26. Heisele, B., Ho, P., and Poggio, T. Face Recognition with Support Vector Machines:
Global versus Component-based Approach. In ICCV01, volume 2, pages 688–694. 2001.
27. Kanade, T. Computer Recognition of Human Faces. Birkhauser, Basel, Switzerland,
1973.
28. Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P.,
and Konen, W. Distortion Invariant Object Recognition in the Dynamic Link Architecture.
IEEE Transactions on Computers, 42(3):300–311, 1993.
29. Manjunath, B., Salembier, P., and Sikora, T. Introduction to MPEG-7. John Wiley and
Sons, Ltd., 2002.
30. Martinez, A. Recognizing Imprecisely Localized, Partially Occluded, and Expression
Variant Faces from a Single Sample per Class. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 24(6):748–763, June 2002.
31. Martinez, A. and Benavente, R. The AR face database. Technical Report 24, CVC, 1998.
32. McKenna, S., Jabri, S., Duric, Z., Wechsler, H., and Rosenfeld, A. Tracking Groups of
People. CVIU, 80(1):42–56, 2000.
33. Melnik, O., Vardi, Y., and Zhang, C.-H. Mixed Group Ranks: Preference and Confidence
in Classifier Combination. PAMI, 26(8):973–981, August 2004.
34. Mittal, A. and Davis, L. M2-Tracker: A Multi-View Approach to Segmenting and Track-
ing People in a Cluttered Scene. Int. Journal on Computer Vision, 51(3):189–203, 2003.
35. Moeslund, T. Summaries of 107 Computer Vision-Based Human Motion Capture Papers.
Technical Report LIA 99-01, University of Aalborg, March 1999.
36. Moeslund, T. and Granum, E. A Survey of Computer Vision-Based Human Motion Cap-
ture. Computer Vision and Image Understanding, 81(3):231–268, 2001.
37. Nakajima, C., Pontil, M., Heisele, B., and Poggio, T. Full-body Person Recognition Sys-
tem. Pattern Recognition, 36:1997–2006, 2003.
38. Ohm, J. R., Cieplinski, L., Kim, H. J., Krishnamachari, S., Manjunath, B. S., Messing,
D. S., and Yamada, A. Color Descriptors. In Manjunath, B., Salembier, P., and Sikora, T.,
editors, Introduction to MPEG-7, chapter 13, pages 187–212. Wiley & Sons, Inc., 2002.
39. Penev, P. and Atick, J. Local Feature Analysis: A General Statistical Theory for Object
Representation. Network: Computation in Neural Systems, 7(3):477–500, 1996.
40. Pentland, A., Moghaddam, B., and Starner, T. View-based and Modular Eigenspaces for
Face Recognition. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR’94). Seattle, WA, June 1994.
41. Phillips, P., Moon, H., Rizvi, S., and Rauss, P. The FERET Evaluation Methodology
for Face-Recognition Algorithms. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 22(10):1090–1104, October 2000.
42. Ro, Y., Kim, M., Kang, H., Manjunath, B., and Kim, J. MPEG-7 Homogeneous Texture
Descriptor. ETRI Journal, 23(2):41–51, June 2001.
43. Schwaninger, A., Mast, F., and Hecht, H. Mental Rotation of Facial Components and
Configurations. In Proceedings of the Psychonomic Society 41st Annual Meeting. New
Orleans, USA, 2000.
44. Sirovich, L. and Kirby, M. Low-Dimensional Procedure for the Characterization of Hu-
man Faces. Journal of the Optical Society of America, 4(3):519–524, 1987.
45. Sung, K. K. and Poggio, T. Example-Based Learning for View-Based Human Face De-
tection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–51,
1997.
46. Swain, M. and Ballard, D. Color Indexing. Int. Journal on Computer Vision, 7(1):11–32,
1991.
47. Turk, M. A Random Walk through Eigenspace. IEICE Transactions on Information and
Systems, E84-D(12):1586–1595, 2001.
48. Turk, M. and Pentland, A. Eigenfaces for Recognition. Journal of Cognitive Neuro-
science, 3(1):71–86, 1991.
49. Vapnik, V. Statistical Learning Theory. Wiley & Sons, Inc., 1998.
50. Viola, P. and Jones, M. Robust Real-time Object Detection. In ICCV01, 2, page 747.
2001.
51. Wallraven, C. and Bülthoff, H. View-based Recognition under Illumination Changes Us-
ing Local Features. In Proceedings of the International Conference on Computer Vision
and Pattern Recognition (CVPR). Kauai, Hawaii, December 8-14 2001.
52. Waring, C. A. and Liu, X. Face Detection Using Spectral Histograms and SVMs. IEEE
Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 35(3):467–476,
June 2005.
53. Welch, G. and Bishop, G. An Introduction to the Kalman Filter. Technical Report TR
95-041, Department of Computer Science, University of North Carolina at Chapel Hill,
2004.
54. Wickel, J., Alvarado, P., Dörfler, P., Krüger, T., and Kraiss, K.-F. Axiom - A Modular
Visual Object Retrieval System. In Jarke, M., Koehler, J., and Lakemeyer, G., editors,
Proceedings of 25th German Conference on Artificial Intelligence (KI 2002), Lecture
Notes in Artificial Intelligence, pages 253–267. Springer, Aachen, September 2002.
55. Wiskott, L., Fellous, J., Krüger, N., and von der Malsburg, C. Face Recognition by Elas-
tic Bunch Graph Matching. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 19(7):775–779, 1997.
56. Wren, C., Azerbayejani, A., Darell, T., and Pentland, A. Pfinder: Real-Time Tracking of
the Human Body. IEEE PAMI, 19(7):780–785, 1997.
57. Yang, J., Zhu, X., Gross, R., Kominek, J., Pan, Y., and Waibel, A. Multimodal People
ID for a Multimedia Meeting Browser. In Proceedings of the Seventh ACM International
Conference on Multimedia, pages 159 – 168. ACM, Orlando, Florida, United States, 1999.
58. Yang, M., Kriegman, D., and Ahuja, N. Detecting Faces in Images: A Survey. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24:34–58, 2002.
59. Zhao, T. and Nevatia, R. Tracking Multiple Humans in Complex Situations. IEEE PAMI,
26(9):1208–1221, 2004.
60. Zhao, W., Chellappa, R., Phillips, P. J., and Rosenfeld, A. Face recognition: A Literature
Survey. ACM Computing Surveys, 35(4):399–458, 2003.
Chapter 6
Interacting in Virtual Reality
The word "virtual" is widely used in computer science and in a variety of situations.
In general, it distinguishes something that is merely conceptual from something
that is physically real. In this respect, the term "Virtual Reality", coined
by the American artist Jaron Lanier in the late eighties, is a paradox in itself. To
this day there exists no standardized definition of Virtual Reality (VR). All definitions
have in common, however, that they specify VR as a computer-generated scenario of
objects (virtual world, virtual environment) a user can interact with in real time and
in all three dimensions. The characteristics of VR are most often described by means
of the I³: Interaction, Imagination, and Immersion:
• Interaction
Unlike an animation, which cannot be influenced by the user, real-time interaction
is a fundamental and imperative feature of VR. Thus, although movies like "Jurassic
Park" present computer-generated virtual worlds, strictly speaking they are not VR.
The interaction comprises navigation as well as the manipulation of the virtual
scene. In order to achieve real-time interaction, two important basic criteria have
to be fulfilled:
1. Minimum Frame Rate: A VR application must not run at a frame rate lower
than a threshold, which Bryson [6] sets to 10 Hz. A more restrictive limit is
given by Kreylos et al. [9], who recommend a minimum rate of 30 frames per
second.
2. Latency: Additionally, Bryson as well as Kreylos define the maximum
response time of a VR system after a user input. Both state that a
latency of at most 100 milliseconds can be tolerated before feedback must
take place.
I/O devices, and especially methods and algorithms, have to meet these requirements.
As a consequence, high-quality, photorealistic rendering algorithms including
global illumination methods, or complex algorithms like the Finite Element
Method for a physically authentic manipulation of deformable objects,
typically cannot be used in VR without significant optimizations or simplifica-
tions.
• Imagination
The interaction within a virtual environment should be as intuitive as possible. In
the ideal case, a user can perceive the virtual world and interact with it in exactly
the same way as with the real world, so that the difference between both becomes
blurred. Obviously, this goal has impact on the man-machine interface design.
264 Interacting in Virtual Reality
First of all, since humans interact with the real world in all three dimensions, the
interface must be three-dimensional as well. Furthermore, multiple senses should
be included into the interaction, i.e., besides the visual sense, the integration of
other senses like the auditory, the haptic/tactile, and the olfactory ones should
be considered in order to achieve a more natural, intuitive interaction with the
virtual world.
• Immersion
By special display technology, the user gets the impression of being part of the
virtual environment, standing in the midst of the virtual scene, fully surrounded
by it instead of looking at it from the outside.
In recent years, VR has proven its potential to provide an innovative human-computer
interface from which multiple application areas can profit. Whereas in the
United States VR research is mainly pushed by military applications, in Germany the
automotive industry has been significantly promoting VR during the last ten years,
prompting universities and research facilities like the Fraunhofer Institutes to make
significant advances in VR technology and methodology. The reason for this
precursor role is the goal of realizing the digital vehicle, where the whole development
from the initial design process up to the production planning is implemented in the
computer. With VR technology, so-called "virtual mockups" can to some extent replace
physical prototypes, with the potential to shorten time to market, to reduce costs,
and to lead to a higher quality of the final product. In particular, the following concrete
applications are relevant for the automotive industry and can profit from VR as
an advanced human-computer interface:
• Car body design
• Car body aerodynamics
• Design and ergonomic studies of car interiors
• Simulation of air conditioning technology in car interiors
• Crash simulations
• Car body vibrations
• Assembly simulations
• Motor development: Simulation of airflow and fuel injection in the cylinders of
combustion engines
• Prototyping of production lines and factories
Besides automotive applications, medicine has emerged as a very important field
for VR. Obviously, data resulting from 3D computer tomography or magnetic resonance
imaging suggest a visualization in virtual environments. Furthermore, the VR-based
simulation of surgical interventions is a significant step towards the training
of surgeons and the planning of complicated operations. The development of realistic
surgical procedures evokes a series of challenges, however, starting with the real-time
simulation of (deformable) organs and tissue, and ending with a realistic simulation of
liquids like blood and cerebrospinal fluid.
Obviously, VR covers a large variety of complex aspects which cannot all be covered within the scope of a single book chapter. Therefore, instead of trying to cover all topics, we decided to pick out single aspects and to describe them in adequate detail. Following the overall focus of this book, chapter 6 concentrates on the
interaction aspects of VR. Technological aspects are only described when they con-
tribute to an understanding of the algorithmic and mathematical issues. In particular,
in sections 6.1 to 6.3 we address the multimodality of VR with emphasis on visual,
acoustic, and haptic interfaces for virtual environments. Sect. 6.4 is about the model-
ing of a realistic behavior of virtual objects. The methods described are restricted to
rigid objects, since a treatment of deformable objects would go beyond the scope of
this chapter. In Sect. 6.5, support mechanisms are introduced as an add-on to haptics
and physically-based modeling, which facilitate the completion of elementary inter-
action tasks in virtual environments. Sect. 6.6 gives an introduction to algorithmic
aspects of VR toolkits, and scene graphs as an essential data structure for the im-
plementation of virtual environments. Finally, Sect. 6.7 explains the examples which
are provided on the CD that accompanies this book. For further reading, we recommend the book "3D User Interfaces – Theory and Practice" by D. Bowman et al. [5].
The capability of the human visual system to perceive the environment in three di-
mensions originates from psychological and physiological cues. The most important
psychological cues are:
• (Linear) perspective
• Relative size of objects
• Occlusion
• Shadows and lighting
• Texture gradients
Since these cues rely on monocular and static mechanisms only, they can be
covered by conventional visual displays and computer graphics algorithms. The main
difference between vision in VR and traditional computer graphics is that VR aims
at emulating all cues. Thus, in VR interfaces the following physiological cues must
be taken into account as well:
• Stereopsis
Due to the interocular distance, slightly different images are projected onto the
retina of the two eyes (binocular disparity). Within the visual cortex of the brain,
these two images are then fused into a three-dimensional scene. Stereopsis is considered the most powerful mechanism for seeing in 3D. However, it only works for objects located no more than a few meters from the viewer.
• Oculomotor factors
The oculomotor factors are based on muscle activities and can be further subdi-
vided into accommodation and convergence.
– Accommodation
In order to focus on an object and to produce a sharp image on the eye's retina, the ciliary muscles deform the eye lenses accordingly.
– Convergence
The viewer rotates her eyes towards the object in focus so that the two images
can be fused together.
• Motion parallax
When the viewer is moving from left to right while looking at two objects at different distances, the object which is closer to the viewer passes her field of view faster than the object which is farther away.
Fig. 6.2. Immersive, room-mounted displays: The CAVE-like display at RWTH Aachen Uni-
versity, Germany.
Fig. 6.3. Opto-electronical head tracking (Courtesy of A.R.T. GmbH, Weilheim, Germany).
space impossible. Instead, the projection has to be adapted to the new viewpoint as
shown in the right drawing of Fig. 6.4. By this adaptation called Viewer Centered
Projection (VCP), the user is also capable of looking around objects by just moving
her head. In the example, the right side of the virtual cube becomes visible when the
user moves to the right. Fig. 6.5 demonstrates the quasi-holographic effect of VCP by
means of a tracking sensor attached to the camera. Like real objects, virtual objects
are perceived as stationary in 3D space.
Fig. 6.5. The viewer centered projection evokes a pseudo-holographic effect. As with the
lower, real cuboid, the right side of the upper, virtual cube becomes visible when the viewpoint
moves to the right.
homogeneous coordinates (see, e.g., [10]). Assuming the origin of the coordinate
system is in the middle of the projection plane with its x- and y-axes parallel to the
plane, the following shearing operation has to be applied to the vertices of a virtual,
polygonal object:
$$
SH_{xy}\!\left(-\frac{e_x}{e_z},\,-\frac{e_y}{e_z}\right) =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
-\frac{e_x}{e_z} & -\frac{e_y}{e_z} & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\qquad (6.1)
$$
For an arbitrary point P = (x, y, z, 1) in the object space, applying the projection
matrix leads to the following new coordinates:
$$
(x, y, z, 1) \cdot SH_{xy}\!\left(-\frac{e_x}{e_z},\,-\frac{e_y}{e_z}\right) =
\left(x - \frac{e_x}{e_z}\,z,\; y - \frac{e_y}{e_z}\,z,\; z,\; 1\right)
\qquad (6.2)
$$
By this operation, the viewpoint is shifted onto the z-axis, i.e., the eye point (e_x, e_y, e_z, 1) is transformed to (0, 0, e_z, 1). The result is a simple, central perspective projection (Fig. 6.6), which is most common in computer graphics and which can, e.g., be further processed into a parallel projection in the unit cube and handled by computer graphics hardware.
Of course, shearing the view volume as shown above must leave the pixels (pic-
ture elements) on the projection plane unaffected, which is the case when applying
(6.1):
$$
(x, y, 0, 1) \cdot SH_{xy}\!\left(-\frac{e_x}{e_z},\,-\frac{e_y}{e_z}\right) = (x, y, 0, 1)
\qquad (6.3)
$$
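The shear of (6.1)–(6.3) can be checked numerically. The following is a minimal sketch (not taken from this book's CD), assuming numpy and the row-vector convention used above, with an arbitrary example eye position:

```python
import numpy as np

def shear_xy(eye):
    """Shearing matrix SH_xy(-e_x/e_z, -e_y/e_z) from (6.1),
    row-vector convention: a point p is transformed as p @ SH."""
    ex, ey, ez = eye
    sh = np.eye(4)
    sh[2, 0] = -ex / ez
    sh[2, 1] = -ey / ez
    return sh

# Tracked eye position relative to the center of the projection plane:
sh = shear_xy((0.2, 0.1, -2.0))

p = np.array([1.0, 2.0, 3.0, 1.0])
print(p @ sh)                # sheared depending on its depth z, cf. (6.2)

e = np.array([0.2, 0.1, -2.0, 1.0])
print(e @ sh)                # the eye lands on the z-axis: (0, 0, e_z, 1)

q = np.array([5.0, -4.0, 0.0, 1.0])
print(q @ sh)                # points on the plane (z = 0) stay fixed, (6.3)
```
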
In CAVE-like displays with several screens not parallel to each other, VCP is ab-
solutely mandatory. If the user is not tracked, perspective distortions arise, especially
at the boundaries of the screens. Since every screen has its own coordinate system,
for each eye the number of different projection matrices is equal to the number of
screens.
Experience has shown that not every VR application needs to run in a fully im-
mersive system. In many cases, semi-immersive displays work quite well, especially
in the scientific and technical field. In the simplest case such displays only consist
of a single, large wall (see Fig. 6.7 left) or a table-like horizontal screen known as
Responsive Workbench [12]. Workbenches are often complemented by an additional vertical screen in order to enlarge the view volume (see Fig. 6.7 right). Recent efforts
towards more natural visual interfaces refer to high-resolution, tiled displays built
from a matrix of multiple projectors [11].
In contrast to the human visual system, the human auditory system can perceive input
from all directions and has no limited field of view. As such, it provides valuable cues
for navigation and orientation in virtual environments. With respect to immersion,
the acoustic sense is a very important additional source of information for the user, since acoustic perception is especially precise for nearby auditory stimuli. This enhances the liveliness and credibility of the virtual environment. The ultimate goal thus would be the ability to place virtual sounds at any direction and distance around the user in real time. However, audio stimuli are not that common in today's VR applications.
For visual stimuli, the brain compares pictures from both eyes to determine the
placing of objects in a scene. With this information, it creates a three-dimensional
cognitive representation that humans perceive as a three-dimensional image. By direct analogy, the stimuli present at the eardrums are compared by the brain to determine the nature and the direction of a sound event [3]. In spatial
acoustics, it is most convenient to describe the position of a sound source relative
to the position and orientation of the listener’s head by means of a coordinate system
which has its origin in the head, as shown in Fig. 6.8. Depending on the horizontal
angle of incidence, different time delays and levels between both ears consequently
arise. In addition, frequency characteristics dependent on the angle of incidence are
influenced by the interference between the direct signal and the reflections off the head, shoulders, auricle, and other parts of the human body. The interaction of these three factors permits humans to assign a direction to acoustic events [15].
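To illustrate how the interaural time difference grows with the horizontal angle of incidence, the classic spherical-head (Woodworth) approximation can be sketched. Note that this formula and the head radius value are illustrative assumptions added here, not material from this chapter:

```python
import math

def itd_woodworth(azimuth_deg, head_radius=0.0875, c=343.0):
    """ITD for a spherical-head model (Woodworth's approximation):
    ITD = a/c * (theta + sin(theta)), azimuth 0 = frontal incidence."""
    theta = math.radians(azimuth_deg)
    return head_radius / c * (theta + math.sin(theta))

for az in (0, 45, 90):
    print(f"{az:3d} deg -> ITD = {itd_woodworth(az) * 1e6:.0f} us")
```

For a lateral source (90°) this yields roughly 0.65 ms, the order of magnitude of the maximum ITD observed for human heads.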
The characteristics of the sound pressure at the eardrum can be described in the
time domain by the Head Related Impulse Response (HRIR) and in the frequency
domain by the Head Related Transfer Function (HRTF). These transfer functions
can be measured individually with small in-ear microphones or with an artificial
head. Fig. 6.9 shows a measurement with an artificial head under 45◦ relating to the
frontal direction in the horizontal plane. The Interaural Time Difference (ITD) can
be assessed in the time domain plot. The Interaural Level Difference (ILD) is shown
in the frequency domain plot and clarifies the frequency dependent level increase at
the ear turned toward the sound source, and the decrease at the ear which is turned
away from the sound source.
Fig. 6.9. HRIR and HRTF of a sound source under 45◦ measured using an artificial head. The
upper diagram shows the Interaural Time Difference in the time domain plot. The Interaural
Level difference is depicted in the frequency domain plot of the lower diagram.
There are two approaches to reproducing a sound event with true spatial relation: wave field synthesis
and binaural synthesis. While wave field synthesis reproduces the whole sound field
with a large number of loudspeakers, binaural synthesis in combination with cross
talk cancellation (CTC) is able to provide a spatial auditory representation by using
only few loudspeakers. The following sections will briefly introduce the principles
and problems of these technologies with focus on the binaural approach.
Another problem can be found in the synchronization of the auditory and visual
VR subsystems. Enhanced immersion dictates that the coupling between the two
systems has to be tight and with small computational overhead, as each sub-system
introduces its own lag and latency problems due to different requirements on the
processing. If the visual and the auditory cues differ too much in time and space, this
is directly perceived as a presentation error, and the immersion of the user ceases.
The basic theory of the wave field synthesis is the Huygens principle. An array of
loudspeakers (ranging from just a few to a few hundreds in number) is placed in the
same position a microphone array was placed in at the time of recording the sound
event in order to reproduce an entire real sound field. Fig. 6.11 shows this principle
of recording and reproduction [2], [20]. In a VR environment the loudspeaker signal
will then be calculated for a given position of one or more virtual sources.
By reproducing a wave field in the whole listening area, it is possible to walk
around or turn the head while the spatial impression does not change. This is the big
advantage of this principle. The main drawback, beyond the high effort of processing
power, is the size of the loudspeaker array. Furthermore, mostly two-dimensional solutions have been presented so far. The placement of such arrays in semi-immersive, projection-based VR systems like a workbench or a single wall is only just possible, but in display systems with four to six surfaces, as in CAVE-like environments, it
is nearly impossible. A wave field synthesis approach is thus not applicable, as the
Fig. 6.11. Setup for recording a sound field and reproducing it by wave field synthesis.
loudspeaker arrays have to be positioned around the user and outside of the environ-
ment’s boundaries.
In contrast to wave field synthesis a binaural approach does not deal with the com-
plete reproduction of the sound field. It is convenient and sufficient to reproduce that
sound field only at two points, the ears of the listener. In this case, only two signals
have to be calculated for a complete three-dimensional sound scene.
The procedure of convolving a mono sound source with an appropriate pair of
HRTFs in order to obtain a synthetic binaural signal is called binaural synthesis.
The binaural synthesis transforms a sound source without position information into
a virtual source related to the listener’s head. The synthesized signals contain the
directional information of the source, which is provided by the information in the
HRTFs.
For the realization of a room related virtual source, the HRTF must be changed
when the listener turns or moves her head, otherwise the virtual source would move
with the listener. Since a typical VR system includes head tracking for the adaptation
of the visual perspective, the listener’s position is always known and can also be used
to realize a synthetic sound source with a fixed position corresponding to the room
coordinate system. The software calculates the relative position and orientation of
the listener’s head to the imaginary point where the source should be localized. By
knowing the relative position and orientation the appropriate HRTF can be chosen
from a database (see Fig. 6.12) [1].
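The binaural synthesis step, convolving the mono source with the HRIR pair selected for the current relative direction, can be sketched as follows. The four-sample HRIRs below are made-up placeholders standing in for measured responses from a database:

```python
import numpy as np

def binaural_synthesis(mono, hrir_left, hrir_right):
    """Convolve a mono source with the HRIR pair chosen for the current
    relative direction of the source (cf. Fig. 6.12)."""
    return np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)

# Toy HRIRs for a source on the listener's left: the right-ear response
# is delayed by two samples (ITD) and attenuated by 6 dB (ILD).
hrir_l = np.array([1.0, 0.0, 0.0, 0.0])
hrir_r = np.array([0.0, 0.0, 0.5, 0.0])

mono = np.array([1.0, -1.0, 0.5])
left, right = binaural_synthesis(mono, hrir_l, hrir_r)
print(left)
print(right)
```

In a real system the HRIR pair would be re-selected from the measured database whenever head tracking reports a new relative source direction.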
All in all, the binaural synthesis is a powerful and at the same time feasible
method for an exact spatial imaging of virtual sound sources, which in principle only
needs two loudspeakers. In contrast to panning systems where the virtual sources are
always on or behind the line spanned by the speakers, binaural synthesis can realize
a source at any distance from the head, and in particular one close to the listener's head.
The requirement for a correct binaural presentation is that the right channel of the
signal is audible only in the right ear and the left one is audible only in the left ear.
From a technical point of view, the presentation of binaural signals by headphones
is the easiest way since the acoustical separation between both channels is perfectly
solved. While headphones may be quite acceptable in combination with HMDs, in
room-mounted displays the user should wear as little active hardware as possible.
Furthermore, when the ears are covered by headphones, the impression of a source
located at a certain point and distance to the listener often does not match the im-
pression of a real sound field. For these reasons, a reproduction by loudspeakers
should be considered, especially in CAVE-like systems. Speakers can be mounted,
e.g., above the screens. The placing of virtual sound sources is not restricted due to
the binaural reproduction.
Using loudspeakers for binaural synthesis introduces the problem of cross talk.
As sound waves determined for the right ear arrive at the left ear and vice versa,
without an adequate CTC the three-dimensional cues of the binaural signal would be
destroyed. In order to realize a proper channel separation at any point of the listening
area, CTC filters have to be calculated on-line when the listener moves his head.
When XL and XR are the signals coming from the binaural synthesis (see
Fig. 6.12 and 6.13), and ZL and ZR are the signals arriving at the listener’s left
and right ear, filters must be found providing signals YL and YR , so that XL = ZL
and XR = ZR . Following the way of the binaural signals to the listener’s ears results
in the two equations
$$
Z_L = Y_L \cdot L \cdot H_{LL} + Y_R \cdot R \cdot H_{RL} \overset{!}{=} X_L \qquad (6.4)
$$
$$
Z_R = Y_R \cdot R \cdot H_{RR} + Y_L \cdot L \cdot H_{LR} \overset{!}{=} X_R, \qquad (6.5)
$$
where L, R are the transfer functions of the loudspeakers, and Hpq is the transfer
function from speaker p to ear q.
Fig. 6.13. Principle of (static) cross-talk cancellation. Presenting an impulse addressed to the
left ear (1) and the first steps of compensation (2, 3).
For a better understanding of the CTC filters, the first four iteration steps will be
shown by example for a signal from the right speaker. In an initial step, the influence
R of the right speaker and HRR are eliminated by
$$
Block_{RR} := R^{-1} \cdot H_{RR}^{-1} \qquad (6.7)
$$
1. Iteration Step The resulting signal XR · BlockRR · R from the right speaker also
arrives at the left ear via HRL and must be compensated for by an adequate filter
K1R , which is sent from the left speaker L to the left ear via HLL :
$$
X_R \cdot Block_{RR} \cdot R \cdot H_{RL} - X_R \cdot K_{1R} \cdot L \cdot H_{LL} \overset{!}{=} 0 \qquad (6.8)
$$
$$
\Rightarrow K_{1R} = L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}} =: Block_{RL} \qquad (6.9)
$$
2. Iteration Step The compensation signal −XR ·K1R ·L from the first iteration step
arrives at the right ear via HLR and is again compensated for by a filter K2R , sent
from the right speaker R to the right ear via HRR :
$$
-X_R \cdot \left[L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}}\right] \cdot L \cdot H_{LR} + X_R \cdot K_{2R} \cdot R \cdot H_{RR} \overset{!}{=} 0 \qquad (6.10)
$$
$$
\Rightarrow K_{2R} = \underbrace{R^{-1} \cdot H_{RR}^{-1}}_{Block_{RR}} \cdot \underbrace{\frac{H_{RL} \cdot H_{LR}}{H_{LL} \cdot H_{RR}}}_{=:K} \qquad (6.11)
$$
3. Iteration Step The cross talk produced by the preceding iteration step is again
eliminated by a compensation signal from the left speaker.
$$
X_R \cdot \left[R^{-1} \cdot H_{RR}^{-1} \cdot K\right] \cdot R \cdot H_{RL} - X_R \cdot K_{3R} \cdot L \cdot H_{LL} \overset{!}{=} 0 \qquad (6.12)
$$
$$
\Rightarrow K_{3R} = \underbrace{L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}}}_{Block_{RL}} \cdot K \qquad (6.13)
$$
4. Iteration Step The right speaker produces a compensation signal to eliminate the
cross talk of the third iteration step:
$$
-X_R \cdot \left[L^{-1} \cdot H_{RR}^{-1} \cdot \frac{H_{RL}}{H_{LL}} \cdot K\right] \cdot L \cdot H_{LR} + X_R \cdot K_{4R} \cdot R \cdot H_{RR} \overset{!}{=} 0 \qquad (6.14)
$$
$$
\Rightarrow K_{4R} = \underbrace{R^{-1} \cdot H_{RR}^{-1}}_{Block_{RR}} \cdot K^2 \qquad (6.15)
$$
With the fourth step, the regularity of the iterative CTC becomes clear. Even iter-
ations extend BlockRR and odd iterations extend BlockRL . All in all, the following
compensation filters can be identified for the right channel (compare to Fig. 6.13):
For an infinite number of iterations, the solution converges to the simple analytic
approach:
$$
\sum_{i=1}^{\infty} K_i = \frac{1}{1-K} \qquad \text{with} \qquad K = \frac{H_{RL} \cdot H_{LR}}{H_{LL} \cdot H_{RR}} \qquad (6.17)
$$
Be aware that the convergence is guaranteed, since |K| < 1 due to |H_RL| < |H_LL| and |H_LR| < |H_RR|.
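The convergence of the iteration towards the analytic limit can be verified numerically with the underlying geometric series. The per-frequency transfer factors below are arbitrary illustrative values chosen so that |K| < 1:

```python
# Toy per-frequency transfer factors with |H_RL| < |H_LL| and
# |H_LR| < |H_RR|, so |K| < 1 and the iteration converges.
H_LL, H_RL = 1.0, 0.4
H_RR, H_LR = 0.9, 0.3
K = (H_RL * H_LR) / (H_LL * H_RR)

partial = sum(K ** i for i in range(50))  # truncated iterative compensation
closed = 1.0 / (1.0 - K)                  # analytic limit, cf. (6.17)
print(K, partial, closed)
```

After a few dozen iterations the truncated sum is indistinguishable from 1/(1 - K), which is why the iterative CTC can be replaced by the closed-form filter.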
controller. The computed haptic models and algorithms that run on a computer host
belong to computer haptics. This includes the creation of virtual environments for
particular applications, general haptic rendering techniques and control issues.
The sensorimotor control guides our physical motion in coordination with our
touch sense. There is a different balance of position and force control when we are exploring an environment (e.g., lightly touching a surface) versus manipulating it (dominated by motor action), where we might rely on the touch sense only subconsciously.
As anyone knows intuitively, the amount of force you generate depends on the way
you hold something.
Many psychological studies have been done to determine human capabilities in
force sensing and control resolution. Frequency is an important sensory parameter.
We can sense tactile frequencies much higher than kinesthetic ones. This makes sense, since skin can be vibrated much more quickly than a limb or even a fingertip can be moved (10-10000 Hz vs. 20-30 Hz). The control bandwidth, i.e., how fast we can move our own limbs
(5-10 Hz), is much lower than the rate of motion we can perceive. As a reference,
table 6.1 provides more detail on what happens at different bandwidths.
A force feedback device is typically a robotic device interacting with the user,
who supplies physical input by imposing force or position onto the device. If the
measured state is different from the desired state, a controller translates the desired
signal (or the difference between the measured and the desired state) into a control
action. The controller can be a computer algorithm or an electronic circuit. Haptic
control systems can be difficult, mainly because the human user is a part of the sys-
tem, and it is quite hard to predict what the user is going to do. Fig. 6.15 shows the
haptic control loop schematically. The virtual environment takes as input the mea-
sured state, and supplies a desired state. The desired state depends on the particular
environment, and its computation is called haptic rendering.
The goal of haptic rendering is to enable a user to touch, feel, and manipulate
virtual objects through a haptic device. A rendering algorithm will generally include
a code loop that reads the position of the device, tests the virtual environment for
collisions, calculates collision responses and sends the desired device position to the
device. The loop must execute in about 1 millisecond (1 kHz rate).
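One pass of this rendering loop can be sketched for the simplest environment, a single wall. The function name, wall placement, and stiffness value are illustrative assumptions, not code from the chapter:

```python
def haptic_rendering_step(device_x, wall_x=0.0, k=500.0):
    """One pass of the ~1 kHz haptic loop for a wall occupying x < wall_x:
    read the device position, test for penetration, return the force."""
    penetration = wall_x - device_x   # > 0 once the device is inside
    if penetration > 0.0:             # contact mode
        return k * penetration        # F = k * delta_x, pushing outwards
    return 0.0                        # non-contact mode: no force

for x in (0.02, 0.005, -0.003):       # device approaching, then entering
    print(haptic_rendering_step(x))
```

In a real system this step would run inside a tight 1 kHz loop together with device I/O and collision detection against the full scene.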
A wall is the most common virtual object to render haptically. It involves two dif-
ferent modes (contact and non-contact) and a discontinuity between them. In non-
contact mode, your hand is free (although holding the haptic device) and no forces
are acting, until it encounters the virtual object. The force is proportional to the penetration depth determined by the collision detection, pushing the device out of the virtual wall (Fig. 6.16). The position of the haptic device in the virtual environment is
called the haptic interaction point (HIP).
Rendering a wall is actually one of the hardest tests for a haptic interface even
though it is conceptually so simple. This is because of the discontinuity between a
regime with zero stiffness and a regime with very high stiffness (usually the highest
stiffness the system can manage). This can lead to a computational and physical
instability and cause vibrations. In addition to that, the contact might not feel very hard.
Fig. 6.16. Rendering a wall. In non-contact mode (a) no forces are acting. Once the penetration into the virtual wall is detected (b), the force is proportional to the penetration depth, pushing the device out of the virtual wall.
When you first enter the wall, the force you are feeling is not very large: kΔx is close to zero when you are near the surface. Fig. 6.17 depicts changes in the force profile with respect to position for real and virtual walls of a given stiffness.
Fig. 6.17. The energy gain. Due to the finite sampling frequency of the haptic device a “staircase” effect is observed when rendering a virtual wall. The difference in the areas enclosed by the curves that correspond to penetrating into and moving out of the virtual wall is a manifestation of energy gain.
Since the
position of the probe tip is sampled with a certain frequency during the simulation, a
“staircase” effect is observed. The difference in the areas enclosed by the curves that
correspond to penetrating into and moving out of the virtual wall is a manifestation
of energy gain. This energy gain leads to instabilities as the stiffness coefficient is
increased (compare the energy gains for stiffness coefficients k1 and k2 ). On the
other hand, a low value of the stiffness coefficient generates a soft virtual wall, which
is not desirable either.
The hardness of a contact can be increased by applying damping when the user
enters the wall, i.e., by adding a term proportional to velocity v to the computation
of force:
F = kΔx + bv
This makes the force large right at entry if the HIP is moving with any speed. The term
has to be turned off when the HIP is moving out of the wall, otherwise the wall would
feel sticky.
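The spring-damper wall force, with the damping term switched off on the way out so the wall does not feel sticky, can be sketched as follows. Names, sign conventions (penetration depth and inward velocity positive), and coefficients are illustrative assumptions:

```python
def wall_force(depth, v_in, k=500.0, b=2.0):
    """F = k*dx + b*v inside the wall; the damping term is applied only
    while the HIP moves further into the wall (v_in > 0), otherwise
    the wall would feel sticky on the way out."""
    if depth <= 0.0:
        return 0.0                        # no contact
    return k * depth + b * max(v_in, 0.0)

print(wall_force(0.002, 0.1))    # entering: stiffness plus damping
print(wall_force(0.002, -0.1))   # leaving: stiffness only
print(wall_force(-0.001, 0.1))   # outside the wall
```
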
Adding either visual or acoustic reinforcement of the wall's entry can make the wall seem harder than relying on haptic feedback alone. Therefore, the wall penetration should never be displayed visually. The HIP is always displayed where it would be in the real world (on the object's surface). In the literature, this ideal position has been given different names, but all of them refer to the same concept (Fig. 6.18).
Fig. 6.18. The surface contact point (SCP) is also called the ideal haptic interaction point
(IHIP), the god-object or the proxy point. Basically, it corresponds to the position of the HIP
on the objects’s surface.
Penalty based methods represent a naive approach to the rendering of solid objects.
Similarly to the rendering of a wall, the force is proportional to the penetration depth.
Fig. 6.19 demonstrates the limitations of the penalty based methods.
The penalty based methods simply pull the HIP to the closest surface point without tracking the HIP history. This causes force discontinuities at the object's corners, where the probe is attracted to surfaces other than the one it originally penetrated. When the user touches a thin surface, he feels a small force. As he pushes harder, he penetrates deeper into the object until he passes more than halfway through it, where the force vector changes direction and shoots him out the
other side. Furthermore, when multiple primitives touch or are allowed to intersect,
it is often difficult to determine, which exterior surface should be associated with a
given internal volume. In the worst case, a global search of all primitives may be re-
quired to find the nearest exterior surface. To avoid these problems, constraint based
methods have been proposed by [21] (the god-object method) and [17] (the virtual
Fig. 6.19. Limitations of penalty based methods: (a) object corners, (b) thin objects, (c) multiple touching objects.
proxy method). These methods are similar in that they trace the position of a vir-
tual proxy, which is placed where the haptic interface point would be if the haptic
interface and the object were infinitely stiff (Fig. 6.20).
Fig. 6.20. The virtual proxy. As the probe (HIP) is moving, the virtual proxy (SCP) is trying
to keep the minimal distance to the probe by moving along the constrained surfaces.
The constraint based methods use the polygonal representation of the geometry
to create one-way penetration constraints. The constraint is a plane determined by
the polygon through which the geometry has been entered. For convex parts of ob-
jects, only one constraint is active at a time. Concave parts require up to three active
constraints at a time. The movement is then constrained to a plane, a line or a point
(Fig. 6.21).
Finally, we are going to show how the SCP can be computed if the HIP and the
constraint planes are known. Basically, we want to minimize the distance between
the SCP and HIP so that the SCP is placed on the active constraint plane(s). This is a
typical application case where the Lagrange multipliers can be used. Lagrange mul-
tipliers are a mathematical tool for constrained optimization of differentiable func-
tions. In the unconstrained situation, we have some (differentiable) function f that
we want to minimize (or maximize). This can be done by finding the points where
the gradient ∇f is zero, or, equivalently, each of the partial derivatives is zero. The
constraints have to be given by functions gi (one function for each constraint). The
ith constraint is fulfilled if gi (x) = 0. It can be shown, that for the constrained case,
the gradient of f has to be equal to the linear combination of the gradients of the
constraints. The coefficients of the linear combination are the Lagrange multipliers
Fig. 6.21. The principle of constraints. Convex objects require only one active constraint at a
time. Concave objects require up to three active constraints at a time. The movement is then
constrained to a plane, a line or a point.
λ_i:
$$
\nabla f(x) = \lambda_1 \nabla g_1(x) + \dots + \lambda_n \nabla g_n(x)
$$
The function we want to minimize is
$$
|HIP - SCP| = \sqrt{(x_{HIP} - x_{SCP})^2 + (y_{HIP} - y_{SCP})^2 + (z_{HIP} - z_{SCP})^2}
$$
The unknowns are x_SCP, y_SCP, z_SCP and the λ_i. Evaluating the partial derivatives and rearranging leads to the following system of 4, 5, or 6 linear equations (depending on the number of constraints):
$$
\begin{pmatrix}
1 & 0 & 0 & A_1 & A_2 & A_3 \\
0 & 1 & 0 & B_1 & B_2 & B_3 \\
0 & 0 & 1 & C_1 & C_2 & C_3 \\
A_1 & B_1 & C_1 & 0 & 0 & 0 \\
A_2 & B_2 & C_2 & 0 & 0 & 0 \\
A_3 & B_3 & C_3 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix}
x_{SCP} \\ y_{SCP} \\ z_{SCP} \\ \lambda_1 \\ \lambda_2 \\ \lambda_3
\end{pmatrix}
=
\begin{pmatrix}
x_{HIP} \\ y_{HIP} \\ z_{HIP} \\ D_1 \\ D_2 \\ D_3
\end{pmatrix}
$$
The first three rows are always present; each active constraint plane A_i x + B_i y + C_i z = D_i contributes one further row and one unknown λ_i, so that with one, two, or three active constraints the corresponding λ rows and columns are included.
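For a single active constraint, the system reduces to four equations and can be solved directly. The following is a minimal sketch with numpy; the function name and the test plane are assumptions added for illustration:

```python
import numpy as np

def scp_one_constraint(hip, plane):
    """Solve the Lagrange system for one active constraint plane
    A*x + B*y + C*z = D: the SCP minimizes |HIP - SCP| on the plane."""
    A, B, C, D = plane
    M = np.array([[1.0, 0.0, 0.0, A],
                  [0.0, 1.0, 0.0, B],
                  [0.0, 0.0, 1.0, C],
                  [A,   B,   C,   0.0]])
    rhs = np.array([*hip, D])
    return np.linalg.solve(M, rhs)[:3]    # x_SCP, y_SCP, z_SCP

# The HIP has penetrated below the plane z = 0; the SCP is its projection:
print(scp_one_constraint((0.3, -0.2, -0.05), (0.0, 0.0, 1.0, 0.0)))
```

As expected for a single plane, the solution is simply the orthogonal projection of the HIP onto that plane; the Lagrange formulation pays off when two or three constraints are active simultaneously.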
The equation 6.18 is thus an ordinary differential equation (ODE) of the second or-
der, and solving it means finding a function x that satisfies the relation F = mẍ.
Moreover, the initial value of x0 at some starting time t0 is given and we are inter-
ested in following x in the time thereafter.
There are various kinds of forces in the real world, e.g., the gravitational force is
causing a constant acceleration, a damping force is proportional to velocity, a (static)
spring force is proportional to the spring elongation. The left side of the equation
6.18 has to be replaced with a specific formula describing the acting force.
When solving the equation numerically it is easier to split it into two equations
of first order introducing velocity. The state of the particle is then described by its
position and velocity: [x, v]. At the beginning of the simulation, the particle is in the
initial state [x0 , v0 ]. A discrete time step Δt is used to evolve the state [x0 , v0 ] →
[x1 , v1 ] → [x2 , v2 ] → . . . corresponding to the next time steps.
The most basic numerical integration method is the Euler method.
$$
v = \frac{dx}{dt} \;\overset{\text{finite time step}}{\Longrightarrow}\; \Delta x = v\,\Delta t \;\overset{\text{explicit}}{\Longrightarrow}\; x_{t+\Delta t} = x_t + \Delta t\, v_t \qquad (6.21)
$$
$$
\overset{\text{implicit}}{\Longrightarrow}\; x_{t+\Delta t} = x_t + \Delta t\, v_{t+\Delta t} \qquad (6.22)
$$
$$
a = \frac{dv}{dt} \;\overset{\text{finite time step}}{\Longrightarrow}\; \Delta v = a\,\Delta t \;\overset{\text{explicit}}{\Longrightarrow}\; v_{t+\Delta t} = v_t + \Delta t\, a_t \qquad (6.23)
$$
$$
\overset{\text{implicit}}{\Longrightarrow}\; v_{t+\Delta t} = v_t + \Delta t\, a_{t+\Delta t} \qquad (6.24)
$$
$$
x_{t=0} = x_0, \qquad v_{t=0} = v_0
$$
The equations 6.21, 6.23 correspond to the explicit (forward) Euler method, equa-
tions 6.22, 6.24 correspond to the implicit (backward) Euler method.
For each time step t, the current state [xt , vt ] is known, and the acceleration can
be evaluated using equation 6.18:
$$
a_t = \frac{F_t}{m}
$$
Though simple, the explicit Euler method is not accurate, and it can also be unstable. For sufficiently small time steps we get reasonable behavior, but as the time step gets larger, the solution oscillates or even explodes toward infinity.
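This stability behavior can be demonstrated on a damped spring; the force constants and step sizes below are illustrative assumptions:

```python
def max_displacement(dt, steps, k=100.0, c=1.0, m=1.0):
    """Explicit (forward) Euler for a damped spring, a = (-k*x - c*v)/m.
    Returns the largest |x| reached, starting from x0 = 1, v0 = 0."""
    x, v = 1.0, 0.0
    peak = abs(x)
    for _ in range(steps):
        a = (-k * x - c * v) / m          # a_t = F_t / m
        x, v = x + dt * v, v + dt * a     # explicit updates (6.21), (6.23)
        peak = max(peak, abs(x))
    return peak

print(max_displacement(dt=0.001, steps=200))  # small step: stays bounded
print(max_displacement(dt=0.1, steps=200))    # large step: explodes
```

With the small step the oscillation stays near its physical amplitude, while the large step amplifies the state at every update and the solution grows by many orders of magnitude.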
The implicit Euler's method is unconditionally stable, yet in general not necessarily more accurate. However, the implicit method leads to a system of equations
that has to be solved in each time step, as xt+Δt and vt+Δt not only depend on the
previous time step but also on vt+Δt and at+Δt respectively. The first few terms
of an approximation of at+Δt using a Taylor’s series expansion can be used in the
equation 6.24.
$$
a_{t+\Delta t} = a(x_{t+\Delta t}, v_{t+\Delta t}) = a(x_t, v_t) + \frac{\partial a(x_t, v_t)}{\partial x}\,\Delta x + \frac{\partial a(x_t, v_t)}{\partial v}\,\Delta v + \dots
$$
$$
v_{t+\Delta t} = v_t + \Delta t\, a_{t+\Delta t} \;\Rightarrow\; \Delta v = \Delta t \left[ a(x_t, v_t) + \frac{\partial a(x_t, v_t)}{\partial x}\,\Delta x + \frac{\partial a(x_t, v_t)}{\partial v}\,\Delta v \right]
$$
The I3×3 on the left side of the equation is a 3 × 3 identity matrix. However, in many
practical cases, the matrices can be replaced by a scalar. When the equation 6.25 is
solved, vt+Δt and xt+Δt can be computed as
The choice of the integration method depends on the problem. The explicit Euler's method is very simple to implement and is sufficient for a number of problems. Other methods focus on improving accuracy, but of course, the price for this accuracy is a higher computational cost per time step. In fact, more sophisticated methods can turn out to be worse than the Euler method, because their higher cost per step outweighs the larger time step they allow. For more numerical integration methods (e.g., midpoint, Runge-Kutta, Verlet, multistep methods), stability analysis, and methods for solving systems of linear and nonlinear equations, see textbooks on numerical mathematics or [16].
Simulating the motion of a rigid body is almost the same as simulating the motion
of a particle. The location of a particle in space at time t can be described as a
vector xt , which describes the translation of the particle from the origin. Rigid bodies
are more complicated, in that in addition to translating them, we can also rotate
them. Moreover, a rigid body, unlike a particle, has a particular shape and occupies
a volume of space. The shape of a rigid body is defined in terms of a fixed and
unchanging space called body space. Given a geometric description of a body in
body space, we use a vector xt and a matrix Rt to translate and rotate the body in
the world space respectively (Fig. 6.22). In order to simplify some equations, it is
required that the origin of the body space lies in the center of mass of the body. If p0
is an arbitrary point on the rigid body in body space, then the world space location
p(t) of p0 is the result of rotating p0 about the origin and then translating it:
Fig. 6.22. Body space and world space: a body point p0 is mapped to its world space location by p(t) = R(t)p0 + x(t).
We have defined the position and orientation of the rigid body. As the next step
we need to determine how they change over time, i.e. we need expressions for ẋ(t)
and Ṙ(t). Since x(t) is the position of the center of mass in world space, ẋ(t) is the
velocity of the center of mass in world space v(t) (also called linear velocity). The
angular velocity ω(t) describes the spinning of a rigid body. The direction of ω(t)
gives the direction of the spinning axis that passes through the center of mass. The
magnitude of ω(t) tells how fast the body is spinning. Angular velocity ω(t) cannot
be expressed as the derivative of some orientation quantity; instead, the rotation
matrix changes as Ṙ(t) = ω(t)∗ R(t), where ω(t)∗ denotes the skew-symmetric cross
product matrix of ω(t). The velocity of an arbitrary body point is then

ṗ(t) = v(t) + ω(t) × r(t)

where r(t) is a vector in world space fixed to the rigid body, r(t) = p(t) − x(t), and
R(t) is the rotation matrix of the body.
To determine the mass of the rigid body, we can imagine that the rigid body is
made up of a large number of small particles. The particles are indexed from 1 to N .
The mass of the ith particle is mi . The total mass of the body M is the sum
M = Σ_{i=1}^{N} mi     (6.27)
Each particle has a constant location p0i in the body space. The location of the ith
particle in world space at time t, is therefore given by the formula

pi(t) = R(t) p0i + x(t)

The center of mass of the body in world space is

c(t) = ( Σ_{i=1}^{N} mi pi(t) ) / M = ( Σ_{i=1}^{N} mi (R(t)p0i + x(t)) ) / M = R(t) ( Σ_{i=1}^{N} mi p0i ) / M + x(t)
When we are using the center of mass coordinate system for the body space, it means
that:
( Σ_{i=1}^{N} mi p0i ) / M = 0
Therefore, the position of the center of mass in the world space is x(t).
Imagine Fi as an external force acting on the ith particle. The torque acting on the
ith particle is defined as

τi(t) = (pi(t) − x(t)) × Fi(t)

Torque differs from force in that it depends on the location of the particle relative to
the center of mass. We can think of the direction of torque as being the axis the body
would spin about due to Fi if the center of mass were held in place.
F(t) = Σ_{i=1}^{N} Fi(t)     (6.29)

τ(t) = Σ_{i=1}^{N} τi(t) = Σ_{i=1}^{N} (pi(t) − x(t)) × Fi(t)     (6.30)
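The accumulation of total force and torque over the particles, as in (6.29) and (6.30), can be sketched as follows; all names and test values are invented for illustration:

```java
/** Accumulating the total force F(t) and torque tau(t) over N particles
 *  (cf. Eqs. 6.29 and 6.30).  All names are invented for illustration. */
final class ForceTorque {

    static double[] cross(double[] a, double[] b) {
        return new double[] { a[1] * b[2] - a[2] * b[1],
                              a[2] * b[0] - a[0] * b[2],
                              a[0] * b[1] - a[1] * b[0] };
    }

    /** Returns {Fx, Fy, Fz, tx, ty, tz} for particle positions p[i] with
     *  forces f[i], taken relative to the center of mass x. */
    static double[] totals(double[][] p, double[][] f, double[] x) {
        double[] F = new double[3];
        double[] tau = new double[3];
        for (int i = 0; i < p.length; i++) {
            double[] r = { p[i][0] - x[0], p[i][1] - x[1], p[i][2] - x[2] };
            double[] t = cross(r, f[i]);            // tau_i = (p_i - x) x F_i
            for (int k = 0; k < 3; k++) {
                F[k] += f[i][k];                    // Eq. 6.29
                tau[k] += t[k];                     // Eq. 6.30
            }
        }
        return new double[] { F[0], F[1], F[2], tau[0], tau[1], tau[2] };
    }
}
```

For example, two opposite forces applied at opposite ends of a body yield zero net force but a non-zero torque, i.e., a pure couple that spins the body without translating it.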
The torque τ (t) tells us something about the distribution of forces over the body.
The total linear momentum P (t) of a rigid body is the sum of the products of
mass and velocity of each particle:
P(t) = Σ_{i=1}^{N} mi ṗi(t)
which leads to
P (t) = M v(t)
Thus, the linear momentum of a rigid body is the same as if the body was a particle
with mass M and velocity v(t). Because of this, we can also write:
Ṗ (t) = M v̇(t) = F (t)
The change in linear momentum is equivalent to the force acting on a body. Since
the relationship between P (t) and v(t) is simple, we can use P (t) as a state variable
of the rigid body instead of v(t). However, P (t) tells us nothing about the rotational
velocity of the body.
While the concept of linear momentum P (t) is rather intuitive (P (t) = M v(t)),
the concept of angular momentum L(t) for a rigid body is not. However, it will
let us write simpler equations than using angular velocity. Similarly to the linear
momentum, the angular momentum is defined as:
L(t) = I(t)ω(t)
where I(t) is a 3 × 3 matrix called the inertia tensor. The inertia tensor describes
how the mass in a body is distributed relative to the center of mass. It depends on the
orientation of the body, but does not depend on the body’s translation.
I(t) = Σ_{i=1}^{N} mi [ (riT(t) ri(t)) [1]3×3 − ri(t) riT(t) ]

ri(t) = pi(t) − x(t)
[1]3×3 is a 3 × 3 identity matrix. Using the body space coordinates, I(t) can be
efficiently computed for any orientation:
I(t) = R(t)Ibody RT (t)
Ibody is specified in the body space and is constant over the simulation.
Ibody = Σ_{i=1}^{N} mi [ (pT0i p0i) [1]3×3 − p0i pT0i ]
To calculate the inertia tensor, the sum can be treated as a volume integral.
Ibody = ρ ∫V [ (pT0 p0) [1]3×3 − p0 pT0 ] dV
ρ is the density of the rigid body (M = ρV ). For example, the inertia tensor of a
block a × b × c with origin in the center of mass is
        ⎡ y² + z²   −yx       −zx     ⎤
Ibody = ρ ∫V ⎢ −xy       x² + z²   −zy     ⎥ dx dy dz
        ⎣ −xz       −yz       x² + y² ⎦

        M  ⎡ b² + c²   0         0       ⎤
      = ── ⎢ 0         a² + c²   0       ⎥
        12 ⎣ 0         0         a² + b² ⎦
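The analytic block result can be cross-checked numerically by treating the block as a grid of equal-mass particles, as the particle-sum definition of Ibody suggests. This sketch (with invented dimensions and grid resolution) compares the diagonals of both versions:

```java
/** Cross-check of the analytic block inertia tensor
 *  I_body = M/12 * diag(b^2+c^2, a^2+c^2, a^2+b^2) against a direct sum
 *  over a grid of equal-mass particles.  Dimensions are example values. */
final class BlockInertia {

    static double[] analyticDiagonal(double M, double a, double b, double c) {
        return new double[] { M / 12 * (b * b + c * c),
                              M / 12 * (a * a + c * c),
                              M / 12 * (a * a + b * b) };
    }

    /** Diagonal of I_body from a sum over n^3 cell-centered sample particles. */
    static double[] numericDiagonal(double M, double a, double b, double c, int n) {
        double mi = M / ((double) n * n * n);          // equal particle masses
        double[] d = new double[3];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++) {
                    double x = a * ((i + 0.5) / n - 0.5);   // in [-a/2, a/2]
                    double y = b * ((j + 0.5) / n - 0.5);
                    double z = c * ((k + 0.5) / n - 0.5);
                    d[0] += mi * (y * y + z * z);           // I_xx
                    d[1] += mi * (x * x + z * z);           // I_yy
                    d[2] += mi * (x * x + y * y);           // I_zz
                }
        return d;
    }
}
```

With a 50 × 50 × 50 grid the numeric diagonal agrees with the analytic one up to a discretization error of order 1/n².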
The relationship between the angular momentum L(t) and the total torque τ is very
simple
L̇(t) = τ (t)
analogously to the relation Ṗ (t) = F (t). The simulation loop performs the following
steps:
• determine the force F (t) and torque τ (t) acting on a rigid body
• update the state derivatives ẋ(t), Ṙ(t), Ṗ (t), L̇(t)
• update the state, e.g., using Euler's method
Table 6.2 summarizes all important variables and formulas for particle dynamics and
rigid body dynamics.
Table 6.2. The summary of particle dynamics and rigid body dynamics.

                              Particles         Rigid bodies
constants                     m                 M, Ibody
state                         x(t), v(t)        x(t), R(t), P(t), L(t)
time derivatives of state     v(t), a(t)        v(t), ω(t)∗R(t), F(t), τ(t)
simulation input              F(t)              F(t), τ(t)
useful forms                  F(t) = m a(t)     P(t) = M v(t)
                                                ω(t) = I⁻¹(t) L(t)
                                                I⁻¹(t) = R(t) Ibody⁻¹ RT(t)
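The simulation loop can be sketched for a single step. For brevity the example assumes a diagonal inverse inertia tensor, uses invented numeric values, and omits the rotation update Ṙ(t) = ω(t)∗R(t) (after which R would have to be re-orthonormalized); the position update uses the freshly updated momentum, a common semi-implicit variant of Euler's method:

```java
/** One step of the rigid body simulation loop described above.  The inverse
 *  inertia tensor is assumed diagonal and all numbers are invented; the
 *  update of R(t) via Rdot = omega* R (plus re-orthonormalization) is
 *  omitted.  The position update uses the freshly updated momentum, i.e.,
 *  a semi-implicit variant of Euler's method. */
final class RigidBodyStep {
    double M = 2.0;                            // total mass (example)
    double[] x = {0, 0, 0};                    // center of mass position
    double[] P = {0, 0, 0};                    // linear momentum
    double[] L = {0, 0, 0};                    // angular momentum
    double[] Iinv = {1.0, 0.5, 0.25};          // diagonal of I^-1(t) (example)

    /** Applies force F and torque tau over time dt; returns omega(t). */
    double[] step(double[] F, double[] tau, double dt) {
        double[] omega = new double[3];
        for (int k = 0; k < 3; k++) {
            P[k] += dt * F[k];                 // Pdot = F
            L[k] += dt * tau[k];               // Ldot = tau
            x[k] += dt * P[k] / M;             // xdot = v = P / M
            omega[k] = Iinv[k] * L[k];         // omega = I^-1(t) L(t)
        }
        return omega;                          // would drive Rdot = omega* R
    }
}
```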
Fig. 6.24. Influence of haptics and PBM on the judgement of task difficulty (a) and user inter-
face (b) [18].
– Geometry
To achieve a graphical representation of the virtual scene in real time, the
shapes of virtual objects are typically tessellated, i.e., they are modeled as
polyhedrons. It is obvious that by such a polygonal representation, curved
surfaces in the virtual world can only be an approximation of those in a real
world. As a consequence, even very simple positioning tasks lead to undesired
effects. As an example, it is nearly impossible for a user to put a cylindrical
virtual bolt into a hole, since both bolt and hole are not perfectly round,
so that inadequate collisions inevitably occur.
– Physics
Due to the discrete structure of a digital computer, collisions between vir-
tual objects and the resulting, physically-based collision reaction can only be
computed at discrete time steps. As a consequence, not only the graphical,
but also the "physical" complexity of a virtual scene is limited by the performance
of the underlying computer system. As with the graphical representation,
the physical behavior of virtual objects can only be an approximation of
reality.
• Hardware Technology
The development of adequate interaction devices, especially force and tactile
feedback devices, remains a challenging task. Hardware solutions suitable for a
use in industrial and scientific applications, like the PHANToM Haptic Device,
typically only display a single force vector and cover a rather limited work vol-
ume. First attempts have also been made to add force feedback to instrumented
gloves (see Fig. 6.25), thus providing kinesthetic feedback to complete hand
movements. It seems questionable, however, whether such approaches will become
accepted in industrial and technical applications. With regard to tactile devices, it
seems infeasible at present to develop practical solutions at all, due to the large
number of tactile sensors in the human skin that would have to be addressed.
In order to compensate for the problems arising from the non-exact modeling of
geometry and behavior of virtual objects as well as from the inadequacy of avail-
able interaction devices, additional mechanisms should be made available to the user
which facilitate the execution of manipulation tasks in virtual environments.
A positioning task begins with the transport phase, i.e., the movement that brings
a selected virtual object towards the target position. Guiding sleeves are a sup-
port mechanism for the transport phase that makes use of a priori knowledge about
possible target positions and orientations of objects. In the configuration phase of
the virtual scenario, the guiding sleeves are generated in the form of a scaled copy or a
scaled bounding box of the objects that are to be assembled, and then placed as invis-
ible items at the corresponding target location. When an interactively guided object
collides with an adequate guiding sleeve, the target position and orientation is visual-
ized by means of a semi-transparent copy of the manipulated object. In addition, the
motion path necessary to complete the positioning task is animated as a wireframe
representation (see Fig. 6.27).
When the orientations of objects are described as quaternions (see [10]), the
animation path can be calculated on the basis of the position difference pdif f and the
orientation difference qdif f between the current position pcur and orientation qcur ,
and the target position pgoal and orientation qgoal .
Fig. 6.26. Assignment of support mechanisms to the single interaction phases of a positioning
task.
Fig. 6.27. Guiding Sleeves support the user during the transport phase of a positioning task by
means of a wireframe animation of the correct path. A video of this scenario is available on
the book CD.
The in-between representations are then computed according to the Slerp (Spher-
ical Linear intERPolation) algorithm [4].
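A sketch of the Slerp computation for two unit quaternions, stored here as {w, x, y, z}; the shorter-arc test and the linear-interpolation fallback for nearly parallel quaternions are common practical additions, not prescribed by the text:

```java
/** Sketch of Slerp between two unit quaternions q0, q1 (stored as
 *  {w, x, y, z}): slerp(q0, q1; t) = (sin((1-t)a) q0 + sin(ta) q1) / sin(a),
 *  with cos(a) = q0 . q1.  Shorter-arc test and lerp fallback added. */
final class Slerp {

    static double[] slerp(double[] q0, double[] q1, double t) {
        double dot = q0[0]*q1[0] + q0[1]*q1[1] + q0[2]*q1[2] + q0[3]*q1[3];
        if (dot < 0) {                       // take the shorter arc
            q1 = new double[] { -q1[0], -q1[1], -q1[2], -q1[3] };
            dot = -dot;
        }
        double[] r = new double[4];
        if (dot > 0.9995) {                  // nearly parallel: lerp fallback
            for (int i = 0; i < 4; i++) r[i] = q0[i] + t * (q1[i] - q0[i]);
        } else {
            double a = Math.acos(dot), s = Math.sin(a);
            double c0 = Math.sin((1 - t) * a) / s;
            double c1 = Math.sin(t * a) / s;
            for (int i = 0; i < 4; i++) r[i] = c0 * q0[i] + c1 * q1[i];
        }
        double n = Math.sqrt(r[0]*r[0] + r[1]*r[1] + r[2]*r[2] + r[3]*r[3]);
        for (int i = 0; i < 4; i++) r[i] /= n;   // renormalize
        return r;
    }
}
```

Interpolating halfway between the identity and a 90° rotation about the z axis yields a 45° rotation about z, as expected of a constant angular velocity path.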
Fig. 6.28. Sensitive polygons can be used to constrain the degrees of freedom for an interac-
tively guided object.
Fig. 6.29. Virtual magnetism to facilitate a correct alignment of virtual objects. A short video
on the book CD shows virtual magnetism in action.
6.5.4 Snap-In
The support mechanisms were evaluated in an experiment in which virtual nails had to be brought into the corresponding holes of a virtual block.
The experiment consisted of three phases: In phase 1, subjects had to complete the
task in a native virtual environment. In phases 2 and 3, they were supported by guid-
ing sleeves and sensitive polygons, respectively. Every phase was carried out twice.
During the first trial, objects snapped in as soon as the subject placed them into the
target position within a certain tolerance. Fig. 6.30 depicts the qualitative analysis of
this experiment. Subjects judged task difficulty as much easier and the quality of the
user interface as much better when the support mechanisms were activated.
Fig. 6.30. Average judgement of task difficulty and user interface with and without support
mechanisms [18].
The concept of scenes is derived from theatre plays and describes units of a play that
are smaller than an act. Usually, scenes mark the change of a place
or the exchange of characters on the set. In direct analogy, scenes in virtual environments
define the set of objects that make up the place where the user can act. This
definition of course includes the objects that can be interacted with by the user. From
a computer graphics point of view, several objects have to work together in order
to display a virtual environment. The most prominent data is the geometry of the
objects, usually defined as polygonal meshes. In addition to that, information about
(a) A naive scene graph example, where content nodes are modeled as leaf nodes. (b) A still simple but more elaborate scene graph for a car.
Fig. 6.31. Two very basic examples for a scene graph that describes a car with four tyres.
Each tyre of the car in Fig. 6.31(b) is affected by the transformation of the car node, as
are its siblings.
6.6.2 Toolkits
A number of toolkits for the implementation of virtual environments is currently
available, and their abilities and focuses vary. But some basic concepts have
evolved that are more or less equally followed by most environments. A list of some
available toolkits, together with URLs to download them, is given
in the README.txt on the book CD in the vr directory. The following section
concentrates on these common principles, without describing a special toolkit and
without enumerating the currently available frameworks.
The core of any VR toolkit is the application model that can be followed by the
programmer to create the virtual world. A common approach is to provide a complete
and semi-open infrastructure, that includes scene graphs, input/output facilities, de-
vice drivers and a tightly defined procedure for the application processing. Another
strategy is to provide independent layers that augment existing low level facilities,
e.g., scene graph APIs with VR functionality, such as interaction metaphors, hard-
ware abstraction, physics or artificial intelligence. While the first approach tends to
be a static all-in-one solution, the latter needs manual adaptations that can be te-
dious when using exotic setups. In either case, the application model can be event driven
or procedural: the application programmer either listens to events prop-
agated over an event bus or defines a number of callback routines that will be called
by the toolkit in a defined order.
A common model for interactive applications is the frame-loop. It is depicted
in Fig. 6.32. In this model, the application calculates its current state in between
the rendering of two consequent frames. After the calculation is done, the current
scene is rendered onto the screen. This is repeated in an endless loop until the
user breaks the loop and the application exits. A single iteration of the loop is called
a frame. It consists of a calculation step for the current application state and the
rendering of the resulting scene.
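The frame-loop can be sketched as follows; the interface and all names are invented for illustration:

```java
/** Skeleton of the frame-loop: one iteration (= one frame) computes the
 *  application state and then renders the scene.  All names are invented. */
final class FrameLoop {

    interface Application {
        void update(double dt);   // calculation step between two renderings
        void render();            // draw the resulting scene
        boolean running();        // false once the user breaks the loop
    }

    /** Runs the loop until the application stops; returns the frame count. */
    static int run(Application app, double dt) {
        int frames = 0;
        while (app.running()) {
            app.update(dt);
            app.render();
            frames++;
        }
        return frames;
    }
}
```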
User interaction is often dispatched after the rendering of the current scene, e.g.,
when no traversal of the scene graph is occurring. Any state change of an interaction
entity is propagated to the system and the application using events. An event indicates
that a certain state is present in the system. E.g., the press of a button on the keyboard
defines such a state. All events are propagated over an event-bus.
Events usually trigger a change of state in the world’s objects which results in a
changed rendering on the various displays. As a consequence, the task of implement-
ing and vitalizing a virtual environment deals with the representation of the world’s
objects for the various senses. For the visual sense, the graphical representation, that
is the polygonal meshes and descriptions that are needed for the graphics hardware
to render an object properly. For the auditory sense, this comprises of sound waves,
and radiation characteristics or attributes that are needed to calculate filters that drive
loudspeakers. Force feedback devices which appeal the haptic sense usually need
information about materials, such as stiffness and masses with forces and torques.
Fig. 6.32. The frame-loop: interaction update and application update are performed in between two renderings, with state changes propagated over the event bus.
kit should provide basic means to create geometry for 3D widgets and algorithms
for collision detection, deformation and topological changes of geometries. A very
obvious requirement for toolkits that provide an immersive virtual environment is
the automatic calculation of the user-centered projection that enables stereoscopic
views. Ideally, this can directly be bound to a specific tracking device sensor such
that it is transparent to the user of the toolkit. Fig. 6.33 depicts the building blocks
of modern virtual environments that are more or less covered by most available tool-
kits for implementing virtual environments today. It can be seen that the software
necessary for building vivid virtual worlds is more than just a scene graph, and even
simple applications can be rather complex in their set-up and runtime behavior.
As stated above, room mounted multi-screen projection displays are nowadays driven
by off-the-shelf PC clusters instead of multi-pipe, shared memory machines. The
topology and system layout of a PC cluster introduce fundamental differences in the
software design, as the application has to respect distributed computation and inde-
pendent graphics drawing. The first issue introduces the need for data sharing among
the different nodes of the cluster, while the latter raises the need for a synchro-
nization of frame drawing across the different graphics boards. Data locking deals
with the question of sharing the relevant data between nodes. Usually, nodes in a PC
cluster architecture do not share memory. Requirements on the type of data differ,
depending on whether a system distributes the scene graph or synchronizes copies of the same
application. Data locked applications calculate on the same data and, if the
algorithms are deterministic, compute the same results at the same granularity.
An important issue especially for VR applications is rendering. The frame draw-
ing on the individual projection screens has to be precisely timed. This is usually
achieved with specialized hardware, e.g., gen locking or frame locking features, that
are available on the graphics boards. However, this hardware is not common in off-
the-shelf graphics boards and usually expensive. Ideally, a software-based solution
to the swap synchronization issue would strengthen the idea of using non-specialized
hardware for VR rendering, thus making the technique more common.
In VR applications, a distinction is usually made between two types of knowledge:
the graphical setup of the application (the scene graph) and the values of the domain
models that define the state of the application. Distributing the scene graph results
in a simple application setup for cluster environments. A setup like this is called a
client-server setup, where the server cluster nodes provide the service of drawing the
scene, while the client dictates what is drawn by providing the initial scene graph
and subsequent modifications to it. This technique is usually embedded as a low
level infrastructure in the scene graph API that is used for rendering. Alternatives
to the distribution of the scene graph can be seen in the distribution of pixel based
information over high bandwidth networks, where all images are rendered in a high
performance graphics environment, or the distribution of graphics primitives.
A totally different approach builds on the idea that a Virtual Reality application
that has basically the same state of its domain objects will render the same scene.
Fig. 6.33. Building blocks and aspects that are covered by a generic toolkit that implements
virtual environments. Applications based on these toolkits need a large amount of infrastruc-
ture and advanced real time algorithms.
One can choose between the distribution of the domain objects or the distribution
of the influences that can alter the state of any domain object, e.g., user input. Do-
main objects and their interactions are usually defined on the application level by the
author of the VR application, so it seems more reasonable to distribute the entities of
influence to these domain objects and apply these influences to the domain objects
on the slave nodes of the PC cluster, see Fig. 6.34.
Every VR system constantly records user interaction and provides information
about this to the application which will change the state of the domain objects. For
a master-slave approach, as depicted above, it is reasonable to distribute this inter-
action and non-deterministic influences across several cluster nodes, which will in
turn calculate the same state of the application. This results in a very simple appli-
cation model at the cost of redundant calculations and synchronization costs across
different cluster nodes.
We can see that the distribution of user interaction in the form of events is a
key component to the master-slave approach. As a consequence, in a well designed
system it is sufficient to mirror application events to a number of cluster nodes to
transparently realize a master-slave approach of a clustered virtual environment. The
task for the framework is to listen to the events that run over the event bus during
the computational step in between the rendering steps of an application frame, and
distribute this information across a network.
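The event mirroring can be sketched with a trivial in-process event bus; a real system would serialize the events over the cluster's intranet, which is only simulated by direct delivery here, and all names are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of mirroring events from the master's event bus to slave nodes.
 *  A real system would serialize events over the cluster's intranet; here
 *  the "network" is simulated by direct delivery.  All names are invented. */
final class EventMirror {

    /** A trivial event bus: every observer is notified of every event. */
    static final class EventBus {
        private final List<Consumer<String>> observers = new ArrayList<>();
        void subscribe(Consumer<String> o) { observers.add(o); }
        void post(String event) { observers.forEach(o -> o.accept(event)); }
    }

    public static void main(String[] args) {
        EventBus master = new EventBus();
        List<String> slave1 = new ArrayList<>();
        List<String> slave2 = new ArrayList<>();
        // the observers forward each event to the slave nodes
        master.subscribe(slave1::add);
        master.subscribe(slave2::add);
        master.post("BUTTON_PRESSED");   // user interaction on the master
        master.post("WAND_MOVED");
        // both slaves now hold identical event streams and, given
        // deterministic algorithms, will compute the same application state
        System.out.println(slave1.equals(slave2));   // prints: true
    }
}
```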
6.7 Examples
The CD that accompanies this book contains some simple examples for the imple-
mentation of VR applications. It is, generally speaking, a hard task to provide some
real world examples for VR applications, as much of their fascination and expressiveness
comes from the hardware that is used, e.g., stereo capable projection devices,
haptic robots or precisely calibrated loudspeaker set-ups.

Fig. 6.34. Distribution of user interaction in a master-slave setup: an event observer on the master node serializes events from the event bus and sends them over the cluster's intranet to a net listener on each slave node, which feeds them into the local application update of the domain objects.

The examples provided
here are to introduce the reader to the very basic work with 3D graphical applica-
tions and provide a beginner’s set-up that can be used for one’s own extensions.
For a number of reasons, we have chosen the Java3D API, a product of SUN
Microsystems, for the implementation of the examples. Java3D needs a
properly installed Java environment in order to run. At the time of writing this book,
the current Java environment version is 1.5.0 (J2SE). Java3D is an add-on
package that has to be downloaded separately, and its interface is still under devel-
opment. The version used for this book is 1.3.2. The book CD contains a
README.txt file where you can find the URLs for downloading the Java environment,
the Java3D API, and some other utilities needed for the development. You can use
any environment that is suitable for the development of normal Java applications,
e.g., Eclipse, JBuilder or the like. Due to the licensing policies, you have to down-
load all the needed libraries yourself; they are not contained on the book CD. You can find
the contents of this chapter on the book CD in the directory vr. There you will find
the folders SimpleScene, Robot, SolarSystem and Tetris3D. Each folder
contains a README.txt file that contains some additional information about the
execution of the applications or the configuration.
The Java3D API provides well structured and intuitive access to working with
scene graphs. For a detailed introduction, please see the comprehensive tutorials on
the Java3D website. You can find the URLs for the tutorials and additional material on
the book CD. The following examples all use a common syntax when depicting the
scene graph layouts. Figure 6.35 shows the entities that are used in all the diagrams.
This syntax is close to the one in the Java3D tutorials that can be found at the URLs
given on the book CD.
Fig. 6.35. Syntax of the scene graph figures that describe the examples in this chapter. The
syntax is close to the one that can be found in the Java3D tutorials.
Fig. 6.36. The scene graph that is outlined by the SimpleScene example, together with
the augmented scene graph that is automatically created by the SimpleUniverse object of
the Java3D API.
The SimpleScene example consists of the root node, called objRoot, and a transform node that allows the interaction
with an input device on this node. Java3D provides predefined behavior classes that
can be used to apply rotation, translation and scaling operations on transform nodes.
Behaviors in Java3D can be active or inactive depending on the current spatial rela-
tions of objects. In this example, we define the active region of the mouse behavior
for the whole scene. This can be accomplished by the code fragment in the method
createBehavior(), see Fig. 6.37.
mouseRotate.setSchedulingBounds(boundingSphere);
mouseTranslate.setSchedulingBounds(boundingSphere);
mouseZoom.setSchedulingBounds(boundingSphere);
Fig. 6.37. Example code for mouse behavior. This code sets the influence region of the mouse
interaction to the complete scene.
A classic example of spatial relations and the usage of scene graphs is the vi-
sualization of an animated solar system containing earth, moon and the sun. The
earth rotates around the sun, and the moon orbits the earth. Additionally, the sun and
the earth rotate around their own axis. The folder SolarSystem on the book CD
contains the necessary code in order to see how to construct these spatial relations.
Figure 6.38 depicts the scene graph of the application. In addition to the branch group
and transform group nodes, the figure contains the auxiliary objects that are needed
for the animation. These are given, e.g., by the rotator objects that modify the
transformation of the transform group they are attached to over time.
Fig. 6.38. The scene graph that is outlined by the SolarSystem example. It contains the
branch, transform and content nodes as well as special objects needed for the animation of the
system, e.g., rotator objects.
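The effect of such a transform hierarchy can be illustrated without any scene graph API: a node's world position results from composing the transforms along the path from the root. The 2D sketch below (with invented values) shows how a rotation of the earth node moves the attached moon node as well:

```java
/** Minimal illustration of a transform hierarchy like the SolarSystem
 *  scene graph: a node's world position results from composing the
 *  transforms on the path from the root.  2D rotations keep it short;
 *  all numeric values are invented. */
final class TransformHierarchy {

    /** Rotates (x, y) by the given angle and then translates by (tx, ty). */
    static double[] apply(double angle, double tx, double ty, double[] p) {
        double c = Math.cos(angle), s = Math.sin(angle);
        return new double[] { c * p[0] - s * p[1] + tx,
                              s * p[0] + c * p[1] + ty };
    }

    public static void main(String[] args) {
        // the moon sits at offset (2, 0) from the earth in the earth's frame
        double[] moonOffset = {2, 0};
        // the earth node: a quarter-orbit rotation plus a translation to
        // radius 10; its transform affects the attached moon node, just as
        // the car node's transform affects the tyres in Fig. 6.31(b)
        double[] world = apply(Math.PI / 2, 10, 0, moonOffset);
        System.out.println("moon world position: ("
                + world[0] + ", " + world[1] + ")");
    }
}
```

The quarter-orbit rotation of the earth node carries the moon from offset (2, 0) to the world position (10, 2), without the moon node's own transform being touched.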
In the folder Robot you will find the file start.bat, which starts the Robot
application on your Microsoft Windows system. You can take the command line of
the batch file as an example to start the Robot application on a different operating
system that has Java and Java3D installed.
The Robot application example creates an animated and textured robot which
can be turned and zoomed with the mouse, using the left and middle mouse buttons re-
spectively. In addition to the more complex scene graph, it features light nodes. Light
is an important aspect, and in most scene graph APIs special light nodes provide
different types of lighting effects. These lights are themselves part of the scene
graph. The Robot application illustrates the use of spatial hierarchies rather clearly.
Every part that can move independently is part of a subtree of a transform group.
Figure 6.39 depicts the scene graph needed to animate the Robot. It omits the aux-
iliary objects necessary for the complete technical realization of the animation. See
the source code on the book CD for more details.
Fig. 6.39. The scene graph that is outlined by the Robot example.

6.8 Summary
In recent years, VR has profited considerably from dramatically increasing com-
puting power and especially from the growing computer graphics market. Virtual
scenarios can now be modeled and visualized at interactive frame rates and at a fi-
delity which could not nearly be achieved a few years ago. As a consequence,
VR technology promises to be a powerful and henceforth affordable man-machine
interface not only for the automotive industry and the military, but also for smaller com-
panies and medical applications. However, in practice most applications only address
the visual sense and are restricted to navigation through the virtual scenario. More
complex interaction functionality, which is, e.g., absolutely necessary for applica-
tions like assembly simulation, is still in its beginnings. The main reason for this
situation is that such interaction tasks heavily rely on the realistic behavior of ob-
jects and on the interplay of multiple senses. Therefore, we concentrated in this
chapter on the physically-based modeling of virtual object behavior as well as on the
multimodal aspects of VR, with a focus on acoustics and haptics, since for assembly
tasks these two modalities promise to be the most important complement to the
visual sense. Due to limitations in hardware technology and modeling, for which so-
lutions are not in sight in the near future, we introduced additional
support mechanisms as an example of how to cope with the completion of VR-based assem-
bly tasks.
The chapter closes with an introduction to the fundamental concept of scene
graphs that are used for the modeling of spatial hierarchies and play an important
role in the vitalization of virtual worlds. Scene graphs are often the most important
part of VR toolkits that provide general functionality for special VR hardware like
trackers and display systems. PC cluster based rendering was introduced as a
tool that can lower the total costs of a VR installation and ease the
utilization of VR techniques in the future.
On the book CD, videos are available which demonstrate how visual and haptic
interfaces, physically-based modeling, and additional support mechanisms have been
integrated into a comprehensive interaction tool, allowing for an intuitive completion
of fundamental assembly tasks in virtual environments.
6.9 Acknowledgements
We want to thank Tobias Lentz and Professor Michael Vorländer, our cooperation
partners in a joint research project funded by the German Research Foundation about
acoustics in virtual environments. Without their expertise it would have been im-
possible to write Sect. 6.2. Furthermore, we thank our former colleague and friend
Roland Steffan. The discourse on support mechanisms in Sect. 6.5 is based on stud-
ies in his Ph.D. thesis. We kindly thank Marc Schirski for the Robot example that is
contained on the book CD.
References
1. Assenmacher, I., Kuhlen, T., Lentz, T., and Vorländer, M. Integrating Real-time Binau-
ral Acoustics into VR Applications. In Virtual Environments 2004, Eurographics ACM
SIGGRAPH Symposium Proceedings, pages 129–136. June 2004.
2. Berkhout, A., Vogel, P., and de Vries, D. Use of Wave Field Synthesis for Natural Rein-
forced Sound. In Proceedings of the Audio Engineering Society Convention 92, volume
Preprint 3299. 1992.
3. Blauert, J. Spatial Hearing, Revised Edition. MIT Press, 1997.
4. Bobick, N. Rotating Objects Using Quaternions. Game Developer, 2(26):21–31, 1998.
5. Bowman, D. A., Kruijff, E., LaViola, J. J., and Poupyrev, I. 3D User Interfaces - Theory
and Practice. Addison Wesley, 2004.
6. Bryson, S. Virtual Reality in Scientific Visualization. Communications of the ACM,
39(5):62–71, 1996.
7. Cruz-Neira, C., Sandin, D. J., and DeFanti, T. A. Surround-Screen Projection-Based
Virtual Reality: The Design and Implementation of the CAVE. In Proceedings of SIGGRAPH 93, pages 135–142. 1993.
8. Delaney, B. Forget the Funny Glasses. IEEE Computer Graphics & Applications,
25(3):14–19, 1994.
9. Farin, G., Hamann, B., and Hagen, H., editors. Virtual-Reality-Based Interactive Explo-
ration of Multiresolution Data, pages 205–224. Springer, 2003.
10. Foley, J., van Dam, A., Feiner, S., and Hughes, J. Computer Graphics: Principles and
Practice. Addison Wesley, 1996.
11. Kresse, W., Reiners, D., and Knöpfle, C. Color Consistency for Digital Multi-Projector
Stereo Display Systems: The HEyeWall and The Digital CAVE. In Proceedings of the
Immersive Projection Technologies Workshop, pages 271–279. 2003.
12. Krüger, W. and Fröhlich, B. The Responsive Workbench. IEEE Computer Graphics &
Applications, 14(3):12–15, 1994.
13. Lentz, T. and Renner, C. A Four-Channel Dynamic Cross-Talk Cancellation System. In
Proceedings of the CFA/DAGA. 2004.
14. Lentz, T. and Schmitz, O. Realisation of an Adaptive Cross-talk Cancellation System for
a Moving Listener. In Proceedings of the 21st Audio Engineering Society Conference.
2002.
15. Møller, H. Reproduction of Artificial Head Recordings through Loudspeakers. Journal
of the Audio Engineering Society, 37(1/2):30–33, 1989.
16. Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. Numerical Recipes in
C: The Art of Scientific Computing. Cambridge University Press, second edition,
1992.
17. Ruspini, D. C., Kolarov, K., and Khatib, O. The Haptic Display of Complex Graphical
Environments. In Proceedings of the 24th Annual Conference on Computer Graphics and
Interactive Techniques. 1997.
18. Steffan, R. and Kuhlen, T. MAESTRO - A Tool for Interactive Assembly Simulation in
Virtual Environments. In Proceedings of the Immersive Projection Technologies Work-
shop, pages 141–152. 2001.
19. Sutherland, I. The Ultimate Display. In Proceedings of the IFIP Congress, pages 505–
508. 1965.
20. Theile, G. Potential Wavefield Synthesis Applications in the Multichannel Stereophonic
World. In Proceedings of the 24th AES International Conference on Multichannel Audio,
published on CD-ROM. 2003.
21. Zilles, C. B. and Salisbury, J. K. A Constraint-based God-object Method for Haptic Dis-
play. In Proceedings of the International Conference on Intelligent Robots and Systems.
1995.
Chapter 7
research areas of telepresence and telerobotics. Finally, section nine shows some ex-
amples of human-robot interaction and cooperation.
Often, robots are described according to their functionality, e.g. underwater robot, service robot, etc. In contrast, the term “humanoid robot” does not refer in particular to the functionality of a robot, but rather to its outer design and form. Compared
to other more specialized robot types, humanoid robots resemble humans not only
in their form, but also in the fact that they can cope with a variety of different tasks.
Their human-like form also makes them very suitable for working in environments designed for humans, such as households. Another important factor is that a more humanoid form and behavior is more appealing, especially to non-expert users in everyday situations, i.e. a humanoid robot will probably be more readily accepted as a robot assistant, e.g. in a household. On the other hand, users expect much more in terms of
communication, behavior and cognitive skills of these robots. It is important to note,
though, that there is no single established meaning of the term “humanoid robot” in research – as shown in this section, there currently exists a wide variety of so-called humanoid robots with very different shapes, sizes, sensors and actuators. What is
common among them is the goal of creating more and more human-like robots.
Having this goal in mind, humanoid robotics is a fast developing area of research
world-wide, and currently represents one of the main challenges for many robotics
researchers. This trend is particularly evident in Japan. Many humanoid robotic plat-
forms have been developed there in the last few years, most of them with an approach
very much focussed on mechanical, or more general on hardware problems, in the
attempt to replicate as closely as possible the appearance and the motions of human
beings.
Impressive results have been achieved by Waseda University in Tokyo since 1973 with the WABOT system and its later version WASUBOT, which was exhibited playing the piano with the NHK Orchestra at a public concert in 1985. The WASUBOT system could read sheet music with a vision system (Fig. 7.1) and could use its feet and five-finger hands to play the piano. The evolution of this research line at
the Waseda University has led to further humanoid robots, such as: Wabian (1997),
able to walk on two legs, to dance and to carry objects; Hadaly-2 (1997), focused
on human-robot interaction through voice and gesture communication; and Wendy
(1999) ( [29]).
With a great financial effort over more than ten years, Honda was able to transfer
the results of such research projects into a family of humanoid prototypes. Honda’s
main focus has been on developing a stable and robust mechatronic structure for their
humanoid robots (cf. Fig. 7.2). The result is the Honda Humanoid Robot, whose cur-
rent version is called ASIMO. It presents very advanced walking abilities and fully
functional arms and hands with very limited capabilities of hand-eye coordination
for grasping and manipulation ( [31]).
Mobile and Humanoid Robots 317
Fig. 7.1. Waseda humanoid robots: a) Wabot-2, the pioneering piano player, b) Hadaly-2, c) the walking Wabian, and d) Wendy.
Even if the robots presented in Fig. 7.2 seem to have about the same size, their real size was reduced considerably during the evolution: height went down from the P3’s 160 cm to ASIMO’s 120 cm, and weight from the P3’s 130 kg to ASIMO’s 43 kg.
Fig. 7.2. The Honda humanoid robots: a) P2, b) P3, and c) the current version, ASIMO.
Fig. 7.3. a) HRP-1S operating a backhoe, b) The humanoid robot HRP-2 is able to lie down
and stand up without any help.
35 kg. The right robot, 1 m tall and weighing 35 kg, is supposed to assist humans in
manufacturing processes ( [63]).
Fig. 7.4. Two of Toyota’s humanoid robot partners: a) a walking robot for assisting elderly
people, and b) a robot for manufacturing purposes.
The Japanese research activities which led to these humanoid robot systems con-
centrate especially on the mechanical, kinematic, dynamic, electronic, and control
problems, with minor concern for the robot’s behavior, their application perspectives,
and their acceptability. Most progress in humanoid robotics in Japan has been accomplished without focusing on specific applications. In Japan, the long-term perspective of humanoid robots concerns all those roles in society in which humans can benefit from being replaced by robots. This includes some factory jobs, emergency actions in hazardous environments, and other service tasks. Among them, in a society whose average age is increasing fast and steadily, personal assistance to humans is felt to be one of the most critical tasks.
In the past few years, Korea has developed into a second center for humanoid robotics research on the Asian continent. Having carried out several projects on walking machines and partly humanoid systems in the past, the Korea Advanced Institute of Science and Technology (KAIST) announced a new humanoid system called HUBO in 2005 (Fig. 7.5). HUBO has 41 DOF, stands 1.25 m tall and weighs about 55 kg. It can walk, talk and understand speech. As a second humanoid project, the robot NBH-1 (Smart-Bot, Fig. 7.5 b)) is being developed in Seoul. The research group claims their humanoid to be the first network-based humanoid, able to think and learn like a human. NBH-1 stands 1.5 m tall, weighs about 67 kg, and is also able to walk, talk and understand speech.
In the USA, research on humanoid robotics received its major impulse from studies related to Artificial Intelligence, mainly through the approach proposed by Rodney Brooks, who identified the need for a physical human-like structure as a prerequisite for achieving human-like intelligence in machines ( [12]). Brooks’ group at the AI Lab of MIT is developing human-like upper bodies which are able to learn how to interact with the environment and with humans. Their approach is much more focused on robot behavior, which is built up by experience in the world. In this framework, research on humanoids does not focus on any specific application. Nevertheless, it is accompanied by studies on human-robot interaction and sociability, which aim at favoring the introduction of humanoid robots into human society. Probably the most famous example of MIT’s research is COG, shown in Fig. 7.6.
320 Interactive and Cooperative Robot Assistants
Also at the MIT AI Lab, the Kismet robot has been developed as a platform for investigating human-robot interaction (Fig. 7.6). Kismet is a pet-like head with vision, audition, speech, and eye and neck motion capabilities. It can therefore perceive external stimuli, track faces and objects, and express its own feelings accordingly ( [11]).
Europe is entering the field of humanoid robotics more cautiously, but can rely on an approach that, based on its particular cultural background, integrates multidisciplinary knowledge from engineering, biology and the humanities. Generally speaking, research on robotic personal assistants has received greater attention in Europe, even without the implication of anthropomorphic solutions. On the other hand, in European humanoid robotics research the application as personal assistants has always been much more clearly and explicitly declared. Often, humanoid solutions are answers to the problem
of developing personal robots able to operate in human environments and to interact
with human beings. Personal assistance or, more generally, helpful services are the European key to introducing robots into society and, in fact, research and industrial activities on robotic assistants and tools (not necessarily humanoids) have received larger support than research on basic humanoid robotics. While robotic solutions for rehabilitation and personal care are now at a more advanced stage with respect to their market opportunities, humanoid projects are currently being carried out by
several European Universities.
Some joint European projects, like the Brite-Euram Syneragh and the IST-FET Paloma, are implementing biologically inspired sensory-motor coordination models on humanoid robotic systems for manipulation. In Italy, at the University of Genova, the Baby-Bot robot is being developed for studying the evolution of sensory-motor coordination as it happens in human babies (cf. Fig. 7.7). Starting development in 2001, a cooperation between the Russian company New Era and the St. Petersburg Polytechnic University led to the announcement of two humanoid systems called ARNE (male) and ARNEA (female) in 2003. Both robots have 28 DOF, stand about 1.23 m and weigh 61 kg (Fig. 7.7), i.e. their dimensions resemble those of ASIMO.
Interaction with Robot Assistants 321
Like ASIMO, they can walk on their own, avoid obstacles, distinguish and remember objects and colors, recognize 40 separate commands, and speak with a synthesized voice. The androids can run on batteries for 1 hour, compared to 1.5 hours for ASIMO. Located in Sofia, Bulgaria, the company Kibertron Inc. has a full-scale humanoid project called Kibertron. Their humanoid is 1.75 m tall and weighs 90 kg. Kibertron has 82 DOF (Fig. 7.7). Each robot hand has 20 DOF and each arm has another 8 DOF ( [45]).
Compared to the majority of electro-motor-driven robot systems, the German company Festo is following a different approach. Their robot, called Tron X (cf. Fig. 7.8 a)), is driven by servo-pneumatic muscles and uses over 200 controllers to emulate human-like muscle movement ( [23]).
ARMAR, developed at Karlsruhe University, Germany, is an autonomous mobile humanoid robot for supporting people in their daily life as a personal assistant (Fig. 7.8). Currently, two anthropomorphic arms have been constructed and mounted on a mobile base with a flexible torso, and studies on manipulation based on human arm movements are being carried out ( [7, 8]). The robot system is able to detect and track the user by vision and acoustics, and to talk and understand speech. Simple objects can be manipulated.
Thus, European humanoid robot systems in general are being developed in view
of their integration in our society and they are very likely to be employed in assis-
tance activities, meeting the need for robotic assistants already strongly pursued in
Europe.
Fig. 7.8. Humanoid robot projects in Germany: a) Festo’s Tron X and b) ARMAR by the
University of Karlsruhe.
individual’s needs and taste. So, in each household a different set of facilities, fur-
niture and tools is used. Also, the wide variety of tasks the robot encounters cannot
be foreseen in advance. Ready-made programs will not be able to cope with such
environments. So, flexibility will surely be the most important requirement for the
robot’s design. It must be able to navigate through a changing environment, to adapt
its recognition abilities to a particular scenery and to manipulate a wide range of ob-
jects. Furthermore, it is extremely important for the user to easily adapt the system
to his needs, i.e. to teach the system what to do and how to do it.
The basis of intelligent robots acting in close cooperation with humans and being
fully accepted by humans is the ability of the robot systems to understand and to
communicate. This process of interaction, which should be intuitive for humans, can be structured according to the role of the robot into:
• Passive interaction or Observation. The pure observation of humans together
with the understanding of seen actions is an important characteristic of robot
assistants, and it enables them to learn new skills in an unsupervised manner and
to decide by prediction when and how to get active.
• Active interaction or Multimodal Dialog. Performing a dialog using speech,
gesture and physical contact denotes a mechanism of controlling robots in an
intuitive way. The control process can either take place by activity selection and
parameterization or by selecting knowledge to be transferred to the robot system.
Regardless of the type and role of interaction with robot assistants, one necessary capability is to understand and to interpret the given situation. In order to
achieve this, robots must be able to recognize and to interpret human actions and
spatio-temporal situations. Consequently the cognitive capabilities of robot systems
in terms of understanding human activities are crucial. In cognitive psychology, hu-
man activity is characterized by three features [3]:
Direction: Human activity is purposeful and directed to a specific goal situation.
The set of supported elementary actions should be derived from human interaction mechanisms. Depending on the goal of the operator’s demonstration, a division into three categories seems appropriate [20]:
• Performative actions: One human shows a task solution to somebody else. The
interaction partner observes the activity as a teaching demonstration. The ele-
mentary actions change the environment. Such demonstrations are adapted to
the particular situation in order to be captured in an appropriate way by the ob-
server [36].
• Commenting actions: Humans refer to objects and processes by their name, they
label and qualify them. Primarily, this type of action serves to reduce complexity
and to aid the interpretation process after and during execution of performative
actions.
• Commanding actions: Giving orders falls into the last category. This could, e.g.,
be commands to move, stop, hand something over, or even complex sequences
of single commands that directly address robot activity.
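The three categories can be mirrored in a small dispatch sketch (a minimal illustration in Python; the event kinds and names are hypothetical, not taken from the cited taxonomy):

```python
from dataclasses import dataclass

# Hypothetical event type produced by a demonstration observer.
@dataclass
class DemoEvent:
    kind: str      # e.g. "grasp", "label", "command"
    payload: str   # object or utterance the event refers to

def categorize(event: DemoEvent) -> str:
    """Map an observed demonstration event to its action category."""
    if event.kind in ("grasp", "move", "place"):   # changes the environment
        return "performative"
    if event.kind in ("label", "qualify"):         # names or qualifies things
        return "commenting"
    if event.kind == "command":                    # directly triggers robot activity
        return "commanding"
    raise ValueError(f"unknown event kind: {event.kind}")
```

A real observer would of course derive the event kind from sensor data rather than receive it symbolically; the sketch only fixes the category boundaries.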
The identified categories may be further differentiated by the modality of their application (see figure 7.9). For teaching an action sequence to a robot system, the most important type of action is the performative one. But performative actions are usually accompanied by comments containing additional information about the performed task. So, for knowledge acquisition, it is very important to include this type of action in a robot assistant system. Intuitive handling of robot assistance through instruction is based on commanding actions, which trigger all function modes of the system. Performed actions are categorized by action type (i.e. “change position” or “move object”) and, closely related to this first categorization, also by the involved actuators or subsystems. Under this global viewpoint, user and robots are part of the environment. Manipulation, navigation and the utterance of verbal performative sentences are different forms of performative actions.
Fig. 7.9. Hierarchy and overview of possible demonstration actions – commenting and commanding actions realized by uttering (speech, with objective), gesticulating (iconic, metaphoric, deictic, emblematic) or guiding (tact). Most of them are taken from everyday tasks (see lower right images)
Manipulation: Manipulation is certainly one of the main functions of a robot assistant: tasks such as fetch-and-carry and the handling of tools or devices are vital parts of most applications in household- or office-like environments. Grasps and movements are relevant for the interpretation and representation of actions of the type “manipulation”.
Grasps: Established schemes can be drawn upon to classify grasps that involve one hand. Here, an underlying distinction is made between grasps during which the finger configuration does not change (“static grasps”) and grasps that require such configuration changes (“dynamic grasps”). While for static grasps exhaustive taxonomies exist, based on finger configurations and the geometrical structure of the carried object [14], dynamic grasps may be
categorized by movements of manipulated objects around the local hand co-
ordinate system [71]. Grasps being performed by two hands have to take
into account synchronicity and parallelism in addition to the requirements of
single-handed grasp recognition.
Movement: Here, the movement of both extremities and of objects has to be
discerned. The former may be partitioned further into movements that target a specific goal pose and movements whose position changes underlie certain conditions (e.g. force/torque, visibility or collision). On the other hand, the transfer of objects can be carried out with or without contact. It is very useful to check whether the object in the hand has tool quality, since a tool eases reasoning on the goal of the operator (e.g. tool type screwdriver → the operator turns a screw upwards or downwards). Today’s programming systems support very limited movement patterns only. Transportation movements have been investigated in block worlds [33, 37, 41, 54, 64], in simple assembly scenarios [9] and in dedicated observation environments [10, 27, 49, 61]. In contrast to grasp analysis, the issue of movement analysis is being addressed quite differently by many researchers: the representation of movement types varies strongly with respect to the action representation as well as with respect to the goal situation.
Learning and Teaching Robot Assistants 325
Navigation: In contrast to object manipulation, navigation means the movement
of the interaction partner himself. This includes position changes with a certain
destination in order to transport objects and movement strategies that may serve
for exploration [13, 55, 56].
Verbal performative utterance: In language theory, there are actions that are realized by speaking so-called performative sentences (such as “You’re welcome”). Performative utterance is currently not relevant in the context of robot programming. We treat it as a special case and will not discuss it further.
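The static/dynamic distinction for one-handed grasps can be sketched as a simple check on whether the finger configuration changes while the object is held (an illustrative simplification; the joint-angle representation and the tolerance value are assumptions, not taken from the cited taxonomies):

```python
def classify_grasp(finger_configs, tolerance=0.05):
    """Classify a one-handed grasp from a sequence of finger joint-angle
    vectors sampled while the object is held.

    Returns "static" if the finger configuration stays (almost) constant,
    "dynamic" if any joint angle changes beyond the tolerance (radians)."""
    first = finger_configs[0]
    for cfg in finger_configs[1:]:
        if any(abs(a - b) > tolerance for a, b in zip(first, cfg)):
            return "dynamic"
    return "static"
```

In a full system this check would feed into the taxonomy lookup for static grasps, or into an analysis of object movement in the local hand coordinate system for dynamic grasps.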
Commanding actions are used for instructing a robot either to fulfill a certain ac-
tion or to initiate adequate behaviors. Typically these kinds of actions imply a robot
activity in terms of manipulation, locomotion or communication. In contrast to com-
mands, commenting actions of humans usually result in a passive attitude of the robot
system, since they do not require any activity. Nevertheless, an intelligently behav-
ing robot assistant might decide to initiate a dialog or interaction after perceiving a
comment action.
Looking at robot assistants in human environments, the most intuitive way of transferring commands or actions is via the channels also used for human-human communication: speech, gestures or physical guidance. Speech and gesture are usually used for communication, and through these interaction channels a symbolic, high-level communication is possible. But for explaining manipulation tasks, which humans often perform mechanically, physical guidance seems more appropriate as a communication channel.
As can be seen from the complexity of grasp performance, navigation, and interaction, the observation of human actions requires extensive, dedicated sensors. Diverse information is vital for the analysis of an applied operation: a grasp type may have various rotation axes, a certain flow of force/torque exerted on the grasped object, special grasp points where the object is touched, etc.
the learning process on the one hand new skills have to be acquired and on the other
hand the knowledge of the system has to be adapted to new contexts and situations.
Robot assistants coping with these demands must be equipped with innate skills and should learn life-long from their users. Presumably, these users are not robot experts, and they require a system that adapts itself to their individual needs. For meeting these
requirements, a new paradigm for teaching robots is defined to solve the problems
of skill and task transfer from human (user) to robot, as a special way of knowledge
transfer between man and machine.
The only approach that satisfies the latter condition is programming systems that automatically acquire the information relevant for task execution by observation and
multimodal interaction. Obviously, systems providing such a functionality require:
1. powerful sensor systems to gather as much information as possible by observing
human behavior or processing explicit instructions like commands or comments,
2. a methodology to transform observed information for a specific task to a robot-
independent and flexible knowledge structure, and
3. actuator systems using this knowledge structure to generate actions that will
solve the acquired task in a specific target environment.
For the term “skill”, different definitions exist in the literature. In this chapter, it will be used as follows: A skill denotes an action (i.e. manipulation or navigation) which involves a close sensor-actuator coupling. In terms of representation, it denotes the smallest symbolic entity used for describing a task.
Examples of skills are grasps or moves of manipulated objects under constraints like “move along a table surface”. The classical “peg in hole” problem can be modeled either as a single skill, for example using a force controller which minimizes a cost function, or as a task consisting of a sequence of skills like “find the hole”, “set perpendicular position” and “insert”.
Due to the close coupling of sensors and actuators, skill learning is mainly done by direct programming using the robot system and, if necessary, some special control devices. Skill learning means finding the transfer function R which satisfies the equation U(t) = R(X(t), Z(t)). The model of the skill transfer process is visualized in Fig. 7.11.
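As a minimal sketch of what finding R means, assume (purely for illustration) scalar sensor input x, internal state z and actuator output u; a linear model u ≈ w1·x + w2·z can then be fitted to demonstration samples by least squares, standing in for the far more powerful statistical learners used in practice:

```python
def fit_skill_controller(samples):
    """Fit a linear approximation u ≈ w1*x + w2*z of the transfer function R
    in U(t) = R(X(t), Z(t)) from demonstration samples.

    samples: list of (x, z, u) tuples with scalar sensor value x, internal
    state z and actuator command u. Solves the 2x2 normal equations of the
    least-squares problem in closed form (illustrative only)."""
    Sxx = sum(x * x for x, _, _ in samples)
    Sxz = sum(x * z for x, z, _ in samples)
    Szz = sum(z * z for _, z, _ in samples)
    Sxu = sum(x * u for x, _, u in samples)
    Szu = sum(z * u for _, z, u in samples)
    det = Sxx * Szz - Sxz * Sxz
    w1 = (Sxu * Szz - Szu * Sxz) / det
    w2 = (Szu * Sxx - Sxu * Sxz) / det
    return w1, w2
```

Given samples generated by u = 2x + 3z, the fit recovers w1 = 2 and w2 = 3 exactly; real skills are nonlinear and high-dimensional, which is why neural or probabilistic models are used instead.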
Statistical methods such as neural networks or hidden Markov models are often used as learning methods for skill acquisition, in order to generalize a set of skill examples in terms of sensory input. Given a model of a skill, for example obtained through demonstration, reinforcement techniques can be applied to refine the model. Here, a lot of background knowledge, in the form of an evaluation function guiding the optimization process, has to be built into the system.
However, teaching skills to robot assistants is time-intensive and requires a lot of knowledge about the system. Therefore, for robot assistants whose users are no experts, it seems more feasible that the majority of the needed skills are pre-learned. The system would have a set of innate skills and would only have to learn a few new skills during its “lifetime”.
From the semantic point of view, a task denotes a more complex action type than a skill. From the representation and modeling side, a task is usually seen as a sequence of skills, and that should serve as the definition for the following sections. In terms of knowledge representation, a task denotes a more abstract view of actions, where the control mechanisms for actuators and sensors are encapsulated in modules represented by symbols.
Learning of tasks differs from skill learning since, apart from the control strategies of sensors and actuators, the goals and subgoals of actions have to be considered. As the action depends on the situation as well as on the related action domain, the learning methods used rely on a vast knowledge base. Furthermore, analytical learning seems more appropriate for coping with the symbolic environment descriptions used to represent states and effects of skills within a task.
For modeling a task T as a sequence of skills (or primitives) Si, the context, including environmental information or constraints Ei, and the internal states of the robot system Zi are introduced as pre- and post-conditions of the skills. Formally, a task can be described as:

T = (S1, ..., Sn) with (Ei, Zi) = Effect(Si, Ei-1, Zi-1),

where Effect() denotes the effect of a skill on the environment and the internal robot states. Consequently, the goals and subgoals of a task are represented by the effects of the task or of its subsequences.
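The task model can be illustrated with a small STRIPS-like execution sketch, in which a set of symbolic facts stands in for the pair (Ei, Zi) and each skill carries its pre-condition and effect; the skill and fact names are hypothetical, but the sequence follows the peg-in-hole example given earlier in this section:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    name: str
    pre: frozenset      # facts that must hold before the skill runs
    add: frozenset      # facts produced by Effect()
    delete: frozenset   # facts removed by Effect()

def execute_task(skills, state):
    """Execute a task T = (S_1, ..., S_n): verify each skill's pre-condition
    against the current symbolic state, then apply its effect to obtain the
    next state. The final state represents the goal of the task."""
    state = set(state)
    for s in skills:
        if not s.pre <= state:
            raise RuntimeError(f"pre-condition of {s.name} not satisfied")
        state = (state - set(s.delete)) | set(s.add)
    return state

# Peg-in-hole modeled as a sequence of three skills (fact names hypothetical).
peg_in_hole = [
    Skill("find_hole", frozenset({"peg_grasped"}),
          frozenset({"hole_located"}), frozenset()),
    Skill("set_perpendicular", frozenset({"hole_located"}),
          frozenset({"aligned"}), frozenset()),
    Skill("insert", frozenset({"aligned", "peg_grasped"}),
          frozenset({"peg_inserted"}), frozenset({"peg_grasped"})),
]
goal_state = execute_task(peg_in_hole, {"peg_grasped"})
```

Note that only the state transitions are modeled here; the sensor-actuator coupling inside each skill remains encapsulated, exactly as the symbolic task representation intends.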
A noticeable property of the presented task model is that tasks are state-dependent, but not time-dependent like skills, and therefore enable a higher generalization in terms of learning. Although the sensory-motor skills in this representation are encapsulated in symbols, learning and planning task execution for robot assistants becomes very complex, because the state descriptions of environments (including humans) and of internal states are very complex, and decision making is not always well-defined.
The process of interactive learning has the highest abstraction level because the information exchange takes place through speech and gestures. Through these interaction channels, various situation-dependent symbols are used to refer to objects in the environment or to select actions to be performed. Information like control strategies on the sensory-motor level can no longer be transferred; instead, global parameters can, which are used for triggering and adapting existing skills and tasks.
The difference between task learning and interactive learning consists, on the one hand, in the complexity of the actions that can be learned and, on the other hand, in the use of dialog with explicit system demands during the learning process. Besides communication issues, one of the main problems in developing robot assistants with such capabilities is the understanding and handling of “common sense” knowledge. In this context, the problem of establishing a shared view of situations and communicating it between robot and human is still a huge research area.
For learning, however, a state-driven representation of actions – which can be skills, tasks or behaviors – is required, as well as a powerful decision-making module which is able to disambiguate gathered information by initiating adequate dialogs. This process relies on a complex knowledge base including skill and task models, environment descriptions, and human-robot interaction mechanisms. Learning in this
• The first classification feature is the abstraction level. Here, the learning goal can be divided into learning of low-level (elementary) skills or of high-level skills (complex task knowledge).
Elementary skills represent human reflexes or actions applied unconsciously. For
them, a direct mapping between the actuator’s actions and the corresponding
sensor input is learned. In most cases neural networks are trained for a certain
sensor/actuator combination like grasping objects or peg-in-hole tasks [6,38,46].
Obviously, those systems must be trained explicitly and cannot easily be reused for a different sensor/actuator combination. Another drawback is the need for a
high number of demonstration examples, since otherwise the system might only
learn a specific sensor/actuator configuration.
Acquiring complex task knowledge from user demonstrations, on the other hand, requires a large amount of background knowledge and sometimes communication with the user. Most applications deal with assembly tasks, as for example [35, 43, 59].
• Another important classification feature is the methodology of giving examples to
the system. Within robotic applications, active, passive and implicit examples can
be provided for the learning process. Following the terminology, active examples
indicate those demonstrations where the user performs the task by himself, while
the system uses sensors like data-gloves, cameras and haptic devices for tracking
the environment and/or the user. Obviously, powerful sensor systems are required
to gather as much data as available [25, 32, 42, 44, 65, 66]. Normally, the process
of finding relevant actions and goals is a demanding challenge for those systems.
Passive examples can be used if the robot can be controlled by an external device, for example a space-mouse, a master-slave system or a graphical interface. During the demonstration process, the robot supervises its internal or external sensors and tries to directly map those actions to a well-known action or goal [38]. Obviously, this method requires less intelligence, since perception is restricted to well-known sensor sources and the actions for the robot system are directly available. However, the main disadvantage is that the demonstration example is restricted to the target system.
If implicit examples are given, the user directly specifies system goals to be
achieved by selecting a set of iconic commands. Although this methodology is
the easiest one to interpret, since the mapping to well known symbols is implic-
itly given, it both restricts the user to a certain set of actions and provides the
most uncomfortable way of giving examples.
• A third aspect is the system’s internal representation of the given examples. Today’s systems consider effects in the environment, trajectories, operations and object positions. While observing effects in the environment often requires highly
developed cognitive components, observing trajectories of the user’s hand and
fingers can be considered comparably simple. However, recording only trajecto-
ries is in most cases not enough for real world applications, since all informa-
tion is restricted to only one environment. Tracking positions of objects, either
statically at certain phases of the demonstration or dynamically by extracting
kinematic models, abstracts on the one hand from the demonstration devices, but
implies an execution environment similar to the demonstration environment.
If the user demonstration is transformed into a sequence of pre-defined actions, this is called an operation-based representation. If the target system supports the same set of actions, execution becomes very easy. However, this approach only works in very structured environments where a small number of actions suffices to solve most of the upcoming tasks.
From our point of view, a hybrid representation, where knowledge about the user demonstration is provided not only on the level of the representation class but also on different abstraction levels, is the most promising approach for obtaining the best results.
Interactive Task Learning from Human Demonstration 333
5. The next necessary phase is the mapping of the internal knowledge representation to the target system. Here, the task-solution knowledge generated in the previous phase serves as input, and additional background knowledge about the kinematic structure of the target system is required. Within this phase, as much information as possible from the user demonstration should be used to allow robot programming for new tasks.
6. In the simulation phase the generated robot program is tested for its applica-
bility in the execution environment. It is also desirable to let the user confirm
correctness in this phase to avoid dangerous situations in the execution phase.
7. During execution in the real world, success and failure can be used to modify the current mapping strategy (e.g. by selecting higher grasping forces when a picked workpiece slips out of the robot’s gripper).
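The execution-phase feedback of step 7 can be sketched as a retry loop that raises the grasping force after each detected slip (the force values and the try_grasp predicate are hypothetical stand-ins for the robot's gripper interface):

```python
def grasp_with_feedback(try_grasp, initial_force=5.0, step=2.0, max_force=15.0):
    """Attempt a grasp, increasing the commanded force after each slip.

    try_grasp(force) -> True if the workpiece is held, False if it slipped.
    Returns the force (in arbitrary units) that succeeded, or raises if
    max_force is exceeded."""
    force = initial_force
    while force <= max_force:
        if try_grasp(force):
            return force            # success: keep this mapping
        force += step               # failure modifies the mapping strategy
    raise RuntimeError("grasp failed even at maximum force")
```

This is the simplest possible form of execution-time adaptation; a full system would also feed the outcome back into the generalized task knowledge.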
The overall process with its identified phases is suitable for manipulation tasks
in general. However, in practice the working components are restricted to a certain
domain to reduce environment and action space.
In this section, the structure of a concrete PbD system developed at the University of Karlsruhe [17] is presented as an example of the necessary components of a programming by demonstration system. Since visual sensors cannot easily cope with occlusions and therefore cannot track dexterous manipulation demonstrations, a fixed robot training center has been set up in this project that integrates various sensors for observing user actions. The system is depicted schematically in figure 7.14. It uses data gloves, magnetic field tracking sensors supplied by Polhemus, and active stereo camera heads. Here, manipulation tasks can be taught that the robot is then able to execute in a different environment.
To meet the requirements for each phase of the PbD-process identified in sec-
tion 7.4.2 the software system consists of the components shown in figure 7.15. The
system has four basic processing modules which are connected to a set of databases
and a graphical user interface. The observation and segmentation module is respon-
sible for analysis and pre-segmentation of information channels, connecting the sys-
tem to the sensor data. Its output is a vector describing states in the environment
and user actions. This information is stored in a database including the world model.
The interpretation module operates on the gained observation vectors and associates
sequences of observation vectors with a set of predefined symbols. These parameteriz-
able symbols represent the elementary action set. During this interpretation phase
the symbols are chunked into hierarchical macro operators after replacing specific
task-dependent parameters by variables. The result is stored in a database as gener-
alized execution knowledge. This knowledge is used by the execution module which
uses specific kinematic robot data for processing. It calculates optimized movements
for the target system, taking into account the actual world model. Before sending
the generated program to the target system its validity is tested through simulation.
In case of unforeseen errors, movements of the robot have to be corrected and
optimized. All four components communicate with the user via a graphical user
interface. Additional information can be retrieved from the user, and hypotheses
can be accepted or rejected via this interface.
A remaining challenge is to extend the tracking algorithms to observe complex
realistic scenes like household or office environments.
[Figure 7.16 callouts: microphone and loudspeaker for speech interaction; data
gloves with attached tactile sensors and magnetic field trackers.]
Fig. 7.16. Training center for observation and learning of manipulation tasks at the University
of Karlsruhe.
Figure 7.16 shows the demonstration area ("Training Center") for learning and
acquisition of task knowledge built at the University of Karlsruhe (cf. section 7.4.3).
During the demonstration process, the user handles objects in the training center.
This is equipped with the following sensors: two data gloves with magnetic field
based tracking sensors and force sensors which are mounted on the finger tips and the
palm, as well as two active stereo camera heads. Object recognition is done by
computer vision, using the fast view-based approaches described in [18]. From
the data gloves, the system extracts and smoothes finger joint movements and hand
positions in 3D space. To reduce noise in trajectory information, the user’s hand
is additionally observed by the camera system. Both measurements are fused using
confidence factors, see [21]. This information is stored with discrete time stamps in
the world model database.
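The confidence-based fusion of the two hand-position estimates can be sketched as a weighted average. The confidence values and the weighting scheme are illustrative assumptions, not the method of [21]:

```python
# Sketch of confidence-factor fusion of the glove/tracker and camera
# estimates of the 3D hand position; confidences are illustrative.

def fuse(pos_glove, conf_glove, pos_cam, conf_cam):
    """Fuse two 3D position estimates by normalized confidence weights."""
    w = conf_glove + conf_cam
    return tuple((conf_glove * g + conf_cam * c) / w
                 for g, c in zip(pos_glove, pos_cam))

# High glove confidence: the fused point stays close to the glove estimate.
fused = fuse((0.10, 0.20, 0.30), 0.8, (0.14, 0.20, 0.26), 0.2)
```

A real system would derive the confidences from sensor models (e.g. occlusion state of the camera, tracker distortion), but the weighting principle stays the same.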
One of the basic capabilities of robot assistants is the manipulation of the environ-
ment. Under the assumption that robots will share the working or living space with
humans, learning manipulation tasks from scratch would not be adequate for many
reasons: first, in many cases it is not possible for a non-expert user to specify what
and how to learn, and second, teaching skills is a very time-consuming and possibly
annoying task. Therefore a model-based approach with a general task represen-
tation, where a lot of common (or innate) background knowledge is incorporated, is
more appropriate.
Modeling the manipulation tasks requires definitions of general structures used
for representing the goal of a task as well as a sequence of actions. Therefore the
next section will present some definitions the following task models will rely on.
7.5.1 Definitions
Considering the household and office domain under the hypothesis that the user
causes all changes in the environment through manipulation actions (closed world
assumption), a goal-related classification of manipulation tasks can be per-
formed. Here, manipulation actions are viewed generally, including dual-hand
and fine manipulations. Furthermore, the goal of manipulation tasks is closely
related to the functional view of actions and, besides Cartesian criteria, consequently
incorporates a semantic aspect, which is very important for interaction and commu-
nication with humans.
According to the functional role a manipulation action aims for, the following
classes of manipulation tasks can be distinguished (Figure 7.17):
Transport operations are one of the most widely executed action classes of robots in
the role of servants or assistants. Tasks like Pick & Place or Fetch & Carry, denot-
ing the transport of objects, are part of almost all manipulations. The change of
the Cartesian position of the manipulated object serves as a formal distinguishing
criterion of this kind of task. Consequently, for modeling transport actions, the
trajectory of the manipulated object has to be considered and modeled as well.
In terms of teaching transport actions to robots the acquisition and interpretation
of the performed trajectory is therefore crucial.
Device handling. Another class of manipulation tasks deals with changing the
internal state of objects like opening a drawer, pushing a button etc. Actions of
this class are typically applied when using devices, but many other objects also
have internal states that can be changed (e.g. a bottle can be open or closed,
filled or empty etc.). Every task changing an internal state of an object belongs to
this class. In terms of modeling device handling tasks, transition actions leading
to an internal state change have to be modeled. Additionally, the object models
need to integrate an adequate internal state description. Teaching this kind of task
requires observation routines able to detect internal state changes by continuously
tracking relevant parameters.
Tool handling. The most distinctive feature for actions belonging to the class
of tool handling is the interaction modality between two objects, typically a tool
and some workpiece. Hereby the grasped object interacts with the environment
or another grasped object in the case of dual arm manipulations. Like the class of
device handling tasks, this action type does not only include manipulations using
a tool, but rather all actions containing an interaction between objects. Examples
for such tasks are pouring a glass of water, screwing etc. The interaction modal-
ity is related to the functional role of objects used or the correlation between the
roles of all objects included in the manipulation, respectively. The model of tool
handling actions thus should contain a model of the interaction. According to dif-
ferent modalities of interaction, considering contact, movements etc., a diversity
of handling methods has to be modeled. In terms of teaching robots the observa-
tion and interpretation of the actions can be done using parameters corresponding
to the functional role and the movements of the involved objects.
These three distinguished classes are obviously not disjoint, since, for example, a
transport action is almost always part of the other two classes. However, identifying
the actions as part of these classes eases the teaching and interaction process with
robot assistants, since the inherent semantics behind these classes correlates with
human descriptions of manipulation tasks.
[Figure 7.17 panels: 1. transport actions (relative position and trajectory);
2. device handling (internal state, opening angle); 3. tool handling (object
interaction).]
Fig. 7.17. Examples for the manipulation classes distinguished in a PbD system.
Representing manipulation tasks as pure action sequences is neither flexible nor
scalable. Therefore a hierarchical representation is introduced in order to gene-
ralize an action sequence and to group elementary skills into more complex subtasks.
Considering only manipulation tasks, the assumption is that each manipulation con-
sists of a grasp and a release action, as defined before. To cope with the manipulation
classes specified above, pushing or touching an object is interpreted as a grasp. The
representation of grasping an object constitutes a "pick" action and consists of three
sub-actions: an approach, a grasp type and a dis-approach (Fig. 7.18). Each of these
sub-actions consists of a variable sequence of elementary skills represented by ele-
mentary operators (EOs) of certain types (i.e. for the approach and dis-approach the
EOs are "move" types and the grasp will be of type "grasp"). A "place" action is
treated analogously.
Between a pick and a place operation, depending on the manipulation class, seve-
ral basic manipulation operations can be placed (Fig. 7.19). A demonstration of the
task "pouring a glass of water" consists, e.g., of the basic operations "pick a bottle",
"transport the bottle", "pour in", "transport the bottle" and "place the bottle". A
sequence of basic manipulation operations starting with a pick and ending with a
place is abstracted to a manipulation segment.
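The hierarchy described above (elementary operators, sub-actions, basic operations, manipulation segments) might be represented roughly as follows; the class layout is an illustrative assumption:

```python
# Sketch of the hierarchical task representation: elementary operators (EOs)
# form basic operations such as "pick" and "place", and a pick...place
# sequence is abstracted to a manipulation segment.

from dataclasses import dataclass, field

@dataclass
class EO:                  # elementary operator, e.g. a "move" or "grasp"
    kind: str              # "move" or "grasp"
    params: dict = field(default_factory=dict)

@dataclass
class BasicOp:             # e.g. "pick": approach + grasp + dis-approach
    name: str
    eos: list

def manipulation_segment(ops):
    """A segment must start with a pick and end with a place."""
    assert ops[0].name == "pick" and ops[-1].name == "place"
    return ops

pick = BasicOp("pick", [EO("move"), EO("grasp"), EO("move")])
place = BasicOp("place", [EO("move"), EO("grasp", {"release": True}), EO("move")])
seg = manipulation_segment([pick, BasicOp("transport", [EO("move")]), place])
```

Because every level is just a list of the level below, generalization (replacing concrete parameters by variables) can be applied uniformly at any depth.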
When teaching robots through demonstration, independently of the sensors used and
of the abstraction level of the demonstration (e.g. interactive learning, task learning
and mostly also skill learning), a search for the goals and subgoals of the demonstra-
tion has to be performed. Therefore, a recorded or observed demonstration has to be
segmented and analyzed in order to understand the user's intention and to extract all
information necessary for enabling the robot to reproduce the learned action.
A feasible approach to subgoal extraction proceeds in several steps, in which the
gathered sensor data is successively abstracted. As a practical example of a
PbD system with subgoal extraction, this section presents a two-step approach based
on the manipulation models and representation and the sensor system mounted in
the training center (see Fig. 7.16). The example system outlined here is presented
in more detail in [71]. In the first step the sensor data is preprocessed to
extract reliable measurements and key points, which are used in a second step to
segment the demonstration. The following sections describe these steps.
Figure 7.20 only shows the performed preprocessing steps for the segmentation. A
more elaborate preprocessing and fusion system would, e.g., be necessary for de-
tecting gesture actions, as shown in [21]. The input signals jt, ht, and ft are gathered
from the data glove, the magnetic tracker and the tactile force sensors. All signals
are filtered, i.e. smoothed and high-pass filtered, in order to eliminate outliers. In the
next step, the joint angles are normalized and passed via a rule-based switch to a
static (SGC) and a dynamic (DGC) grasp classifier. The tracker values, representing
the absolute position of the hand, are differentiated in order to obtain the velocity of
the hand movement. As mentioned in [71], the tactile sensors have a 10-20%
hysteresis, which is eliminated by the function H(x). The normalized force values,
together with the velocity of the hand, are passed to the module R, which consists of
a rule set determining whether an object is potentially grasped, and triggers the SGC
and DGC. The output of these classifiers are grasp types according to the Cutkosky
hierarchy (static grasps) or dynamic grasps according to the hierarchy presented in
section 7.6.3.2.
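The individual preprocessing steps could look roughly like this; filter sizes, thresholds and the rule in module R are illustrative assumptions, not the parameters of the cited system:

```python
# Sketch of the preprocessing chain: smoothing, numerical differentiation
# of tracker positions, hysteresis compensation H(x), and the rule module R
# that triggers grasp classification.

def smooth(xs, k=3):
    """Moving-average filter to suppress outliers."""
    return [sum(xs[max(0, i - k + 1):i + 1]) / len(xs[max(0, i - k + 1):i + 1])
            for i in range(len(xs))]

def velocity(positions, dt):
    """Differentiate tracker positions to obtain hand velocity."""
    return [(b - a) / dt for a, b in zip(positions, positions[1:])]

def H(force, prev_contact, on=0.2, off=0.1):
    """Two-threshold (Schmitt trigger) contact decision that compensates
    the tactile sensors' hysteresis."""
    return force > (off if prev_contact else on)

def R(force_contact, speed, v_max=0.05):
    """Rule module: trigger grasp classification when an object is likely
    held, i.e. contact force present and the hand nearly at rest."""
    return force_contact and abs(speed) < v_max
```

The two thresholds in H(x) make the contact decision stable: once contact is detected, the force must drop well below the switch-on level before it is released again.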
Fig. 7.21. Analyzing segments of a demonstration: force values and finger joint velocity.
For mapping detected manipulative tasks and representations of grasps, the diversity
of human object handling has to be considered. In order to conserve as much infor-
mation as possible from the demonstration, two classes of grasps are distinguished.
Static grasps are used for Pick & Place tasks, while fine manipulative tasks require
dynamic handling of objects.
7.6.3.1 Static Grasp Classification
Static grasps are classified according to the Cutkosky hierarchy [14]. This hier-
archy subdivides static grasps into 16 different types, mainly with respect to their
posture (see figure 7.22). Thus, a classification approach based on the finger angles
measured by the data glove can be used. All these values are fed into a hierarchy
of neural networks, each consisting of a three-layer Radial Basis Function (RBF)
network. The hierarchy resembles the Cutkosky hierarchy, passing the finger values
from node to node like in a decision tree. Thus, each network has to distinguish very
few classes and can be trained on a subset of the whole set of learning examples.
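The idea of a classifier hierarchy that routes a feature vector down a decision tree can be sketched as follows; the toy decision functions and grasp labels stand in for the RBF networks and Cutkosky types of the described system:

```python
# Sketch of a classifier hierarchy: each inner node routes the finger-angle
# vector to one child, so every node only has to separate a few classes.

class Node:
    def __init__(self, decide, children=None, label=None):
        self.decide = decide        # maps feature vector -> child index
        self.children = children or []
        self.label = label          # set on leaves only

    def classify(self, x):
        if self.label is not None:
            return self.label
        return self.children[self.decide(x)].classify(x)

# Toy two-level tree: power vs. precision first, then two grasp types each.
leaves = [Node(None, label=l) for l in
          ("large-wrap", "small-wrap", "tripod", "pinch")]
tree = Node(lambda x: 0 if sum(x) > 2.0 else 1,     # power if fingers curled
            [Node(lambda x: 0 if x[0] > 0.8 else 1, leaves[0:2]),
             Node(lambda x: 0 if x[0] > 0.4 else 1, leaves[2:4])])

grasp = tree.classify([0.9, 0.9, 0.9])   # strongly curled fingers
```

The benefit mirrors the text: each node sees only the examples of its subtree, so the individual classifiers stay small and easy to train.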
The classification results of the system for all static grasp types are listed in
table 7.1. A training set of 1280 grasps recorded from 10 subjects and a test set of
640 examples were used for testing. The average classification rate is 83.28%, where
most confusions happen between grasps of similar posture (e.g. types 10 and 12).
(Fragment of table 7.1, classification rates for grasp types 9 to 16:)

Grasp type   9     10    11    12    13    14    15    16
Rate         0.88  0.95  0.85  0.53  0.35  0.88  1.00  0.45
Fig. 7.24. Hierarchy of dynamic grasps containing the results of the DGC.
In order to cope with high interaction demands, an event-driven architecture like the
one outlined in figure 7.25 can be used. The focus of the robot system's control
mechanism lies in enabling the system to react to interaction demands on the one
hand and to act autonomously on the other. All multimodal input, especially from
gesture and object recognition together with speech input, can be merged into
so-called "events", which form the key concept for representing this input.
As mentioned above, event transformers are also capable of fusing two or more
events. A typical problem is a verbal instruction by the user combined with some
gesture, e.g. "Take this cup." with a corresponding pointing gesture. In such cases, it
is the task of the appropriate event transformer to identify the two events which belong
together and to initiate the actions which are appropriate for this situation.
In each case, one or more actions can be started, as appropriate for a certain event.
In addition, it is easily possible to generate questions concerning the input: an event
transformer designed specifically for queries (to the user) can handle events which
are ambiguous or underspecified, start an action which asks an appropriate question
of the user, and then use the additional information to transform the event. For con-
trolling hardware resources, a scheduling system locking and unlocking execution of
the action performer is applied.
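An event transformer that fuses a verbal instruction with a temporally close pointing gesture might be sketched like this; the event format and the two-second fusion window are illustrative assumptions:

```python
# Sketch of multimodal event fusion: pair each underspecified speech event
# ("take THIS cup") with the nearest pointing gesture in a time window.

def fuse_events(events, window=2.0):
    """Return resolved commands and any still-unresolved speech events."""
    commands, pending = [], []
    for t, kind, payload in sorted(events, key=lambda e: e[0]):
        if kind == "speech" and "this" in payload:
            pending.append((t, payload))          # deictic reference, wait
        elif kind == "gesture":
            for ts, utterance in list(pending):
                if abs(t - ts) <= window:
                    commands.append((utterance, payload))  # resolved
                    pending.remove((ts, utterance))
                    break
    return commands, pending

cmds, open_refs = fuse_events([
    (0.0, "speech", "take this cup"),
    (0.4, "gesture", {"points_at": "cup_3"}),
])
```

Events left in the pending list are exactly the underspecified cases for which, as described above, a query transformer could start an action that asks the user a clarifying question.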
All possible actions (learned or innate) which can be executed by the robot sys-
tem are coded or wrapped as automata. These control the execution and generate
events which can start new automata or generate "actions" for the subsystems. The
mechanism of starting new automata, which carry out subtasks, leads to a
hierarchical execution of actions. Furthermore, the interaction can be initiated from
each automaton, which implies that every subtask can handle its interaction deli-
berately.
To ensure reliability of the system, a "Security and Exception Handler" is in-
troduced into the system's architecture. The main tasks of this module are handling ex-
ceptions from the agents, reacting to cancel commands by the user and internally
clearing up all automata after an exception. In accordance with the distributed control
through different automata, every automaton reacts to special events for "cleaning
up" its activities and ensures a stable state of all used and locked agents.
The presented architecture concept not only ensures a workable way of control-
ling robot assistants, but also represents a general way of coping with new know-
ledge and behaviors. These can easily be integrated as new automata in the system.
The integration process itself can be done via a learning automaton, which creates
new automata. The process of generating execution knowledge from general task
knowledge is part of the next section.
As an example of grasp mapping, consider the 16 different power and precision
grasps of Cutkosky's grasp hierarchy [14]. As input, the symbolic operator
description and the positions of the user's fingers and palm during the grasp
are used. For mapping the grasps, an optimal group of coupled fingers has to be
calculated; these are fingers having the same grasp direction. The finger groups are
associated with the chosen gripper type. This helps to orient the finger positions for
an adequate grasp pose.
Since the symbolic representation of performed grasps includes information about
the enclosed fingers, only those fingers are used for finding the correct pose. Let us
assume that the fingers used for the grasp are numbered H1, H2, . . . , Hn, n ≤ 5.
For coupling fingers with the same grasping direction it is necessary to calculate
forces affecting the object. In case of precision grasps the finger tips are projected on
a grasping plane E defined by the finger tip position during the grasp (figure 7.26).
Since the plane might be overdetermined if more than three fingers are used, it is
determined using the least squares method.
Fig. 7.26. Calculation of the plane E and the gravity point G defined by the finger tip
positions.
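A least-squares determination of the plane through the finger tip positions, together with their centroid G, can be sketched as follows. For simplicity the plane is fitted in the regression form z = ax + by + c and solved with Cramer's rule; a robust implementation would instead use the smallest-eigenvalue direction of the fingertip covariance matrix, which also handles near-vertical planes:

```python
# Sketch: centroid G and least-squares plane z = a*x + b*y + c through the
# fingertip points, via the 3x3 normal equations (plain Python, no libraries).

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def fit_plane(points):
    """Return (a, b, c) of the least-squares plane z = a*x + b*y + c."""
    sxx = sum(x * x for x, y, z in points); sxy = sum(x * y for x, y, z in points)
    syy = sum(y * y for x, y, z in points); sx = sum(x for x, y, z in points)
    sy = sum(y for x, y, z in points);      n = len(points)
    sxz = sum(x * z for x, y, z in points); syz = sum(y * z for x, y, z in points)
    sz = sum(z for x, y, z in points)
    M = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]   # normal equations
    v = [sxz, syz, sz]

    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
              - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
              + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det(M)
    sol = []
    for i in range(3):                                  # Cramer's rule
        Mi = [row[:] for row in M]
        for r in range(3):
            Mi[r][i] = v[r]
        sol.append(det(Mi) / d)
    return tuple(sol)

pts = [(0, 0, 3), (1, 0, 4), (0, 1, 5), (1, 1, 6)]      # coplanar test points
a, b, c = fit_plane(pts)
```

For four or more fingertips the normal equations average out measurement noise, which is exactly the overdetermined case mentioned in the text.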
Now the force-vectors can be calculated by using the geometric formation of the
fingers given by the grasp type and the force value measured by the force sensors on
the finger tips. Prismatic and sphere grasps are distinguished. For prismatic grasps all
finger forces are assumed to act against the thumb. This leads to a simple calculation
of the forces F1, . . . , Fn, n ≤ 5, according to figure 7.27.
For circular grasps, the forces are assumed to act towards the finger tips' center of
gravity G, as shown in Fig. 7.28.
To evaluate the coupling of fingers, the degree of force coupling Dc(i, j) is de-
fined:

    Dc(i, j) = (Fi · Fj) / (|Fi| |Fj|)                                    (7.2)
(Note that the number of enclosed fingers can be less than 5.)
This means |Dc(i, j)| ≤ 1, and in particular Dc(i, j) = 0 for Fi ⊥ Fj. Thus the
degree of force coupling is high for forces acting in the same direction and low for
forces acting in orthogonal directions. The degree of force coupling is used for finding
two or three optimal groups of fingers using Arbib's method [4].
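Equation (7.2) is simply the cosine of the angle between two finger force vectors. A sketch of computing it, plus a greedy grouping of fingers with similar force direction, follows; the 0.9 threshold and the greedy merge are illustrative simplifications (the text uses Arbib's method [4]):

```python
# Sketch of the degree of force coupling Dc(i, j) and a greedy grouping of
# fingers whose forces point in similar directions.

def dc(fi, fj):
    """Cosine of the angle between force vectors fi and fj, Eq. (7.2)."""
    dot = sum(a * b for a, b in zip(fi, fj))
    norm = (sum(a * a for a in fi) ** 0.5) * (sum(b * b for b in fj) ** 0.5)
    return dot / norm

def group_fingers(forces, threshold=0.9):
    """Merge fingers whose pairwise coupling exceeds the threshold."""
    groups = []
    for idx, f in enumerate(forces):
        for g in groups:
            if all(dc(f, forces[m]) >= threshold for m in g):
                g.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Two parallel fingers opposing a thumb yield two groups.
groups = group_fingers([(0, 1, 0), (0.05, 1, 0), (0, -1, 0)])
```

The resulting groups are what gets assigned to the (fewer) robot fingers in the mapping step described below.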
When the optimal groups of fingers are found, the robot fingers are assigned
to these finger groups. This process is rather easy for a three-finger hand, since the
three groups of fingers can be assigned directly to the robot fingers. In case there are
only two finger groups, two arbitrary robot fingers are selected to be treated as one
finger.
The grasping position depends on the grasp type as well as the grasp pose. Since
there is no knowledge about the object present, the grasp position must be obtained
directly from the performed grasp. Two strategies are feasible:
• Precision Grasps: The center of gravity G which has been used for calculat-
ing the grasp pose serves as reference point for grasping. The robot gripper is
positioned relative to this point performing the grasp pose.
• Power Grasps: The reference point for power grasping is the user’s palm rela-
tive to the object. Thus, the robot’s palm is positioned with a fixed transformation
with respect to this point.
Now, the correct grasp pose and position are determined and the system can
perform force controlled grasping by closing the fingers.
Figure 7.29 shows four different grasp types mapped from human grasp operations
to a robot equipped with a three-finger hand (Barrett) and a 7 DoF arm. It can be seen
that the robot's hand position depends strongly on the grasp type.
Fig. 7.29. Examples of different grasp types: 1. two-finger precision grasp, 2. three-finger
tripod grasp, 3. circular power grasp, 4. prismatic power grasp.
For mapping simple movements of the human demonstration to the robot, a set of
logical rules can be defined that selects sensor constraints depending on the execution
context, for example selecting a force threshold parallel to the movement when
approaching an object, or selecting zero-force control while grasping an object.
Context information thus serves for selecting intelligent sensor control and should
be stored in order to be available for processing.
Thus, a set of heuristic rules can be used for adapting the system's behavior
depending on the current context. As an overview, the following rules turned out
to be useful in specific contexts:
• Approach: The main approach direction vector is determined. Force control is
set to 0 in orthogonal direction to the approach vector. Along the grasp direction
a maximum force threshold is selected.
• Grasp: Set force control to 0 in all directions.
• Retract: The main approach direction vector is determined. Force control is set
to 0 in orthogonal direction to the approach vector.
• Transfer: Segments of basic movements are chunked into complex robot moves
depending on direction and speed.
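The rule set above can be sketched as a simple context-to-parameters mapping; all thresholds and dictionary keys are illustrative assumptions:

```python
# Sketch of the context-dependent rule set: each execution context selects
# force-control parameters for the robot.

def unit(v):
    n = sum(x * x for x in v) ** 0.5
    return tuple(x / n for x in v)

def control_params(context, approach_vector=None, f_max=5.0):
    """Select force-control settings for the current execution context."""
    if context == "grasp":
        return {"force_setpoint": (0.0, 0.0, 0.0)}      # zero-force control
    if context in ("approach", "retract"):
        d = unit(approach_vector)                       # main direction vector
        params = {"zero_force_orthogonal_to": d}        # compliant sideways
        if context == "approach":
            params["max_force_along"] = f_max           # contact threshold
        return params
    if context == "transfer":
        return {"chunk_moves": True}   # merge basic moves by direction/speed
    raise ValueError(f"unknown context: {context}")

p = control_params("approach", approach_vector=(0, 0, -2))
```

Keeping the rules in one function makes the heuristics explicit and easy to extend with further contexts.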
Summarizing, the mapping module generates a set of default parameters for the
target robot system itself as well as for movements and force control. These parame-
ters are directly used for controlling the robot.
Although it is desirable to make hypotheses about the user's wishes and intention
based on only one demonstration, this will not work in the general case. Therefore,
the system needs to offer the possibility to accept and reject its interpretations of the
human demonstration. Since the generation of a well-working robot program is the
ultimate goal of the programming process, the human user can interact with the
system in two ways:
1. Evaluation of hypotheses concerning the human demonstration, e.g. recognized
grasps, grasped objects, important effects in the environment.
2. Evaluation and correction of the robot program that will be generated from the
demonstration.
Interaction with the system to evaluate learned tasks can be done through speech
when only very abstract information has to be exchanged, like specifying the order
of actions or some parameters. Otherwise, especially if Cartesian points or trajecto-
ries have to be corrected, the interaction must be done in an adequate way, for
example using a graphical interface. In addition, a simulation system showing the
identified objects of the demonstration and the actuators (the human hand with its
observed trajectory and the robot manipulators) in a 3D environment would be helpful.
During the evaluation phase of the user demonstration all hypotheses about
grasps including types and grasped objects can be displayed in a replayed scenario.
With graphical interfaces the user is prompted for acceptance or rejection of actions
(see figure 7.30).
After the modified user demonstration has been mapped onto a robot program
the correctness of the program must be validated in a simulation phase. The modi-
fication of the environment and the gripper trajectories are displayed within the 3D
environment, giving the user the opportunity to interactively modify trajectories or
grasp points of the robot if desired (see figure 7.31).
Fig. 7.30. Example of a dialog mask for accepting and rejecting hypotheses generated by a
PbD system.
which instantiates an executable program from its database with the given environ-
mental information. It is also possible to advise the robot to grasp selected objects
and lay them down at desired positions.
In this experiment, off-the-shelf cups and plates in different forms and colors
were used. The examples were recorded in the training center described in sec-
tion 7.4.4 [21]. Here, the mentioned sensors are integrated into an environment al-
lowing free movements of the user without restrictions. In a single demonstration,
crockery is placed for one person (see left column of figure 7.33). The observed ac-
tions can then be replayed in simulation (cf. second column). Here, sensor errors can
be corrected interactively. The system displays its hypotheses on recognized actions
and their segments in time.

Fig. 7.32. Interaction with a robot assistant to teach how to lay a table. The instantiation of
manipulable objects is aided through pointing gestures.

After interpreting and abstracting this information,
learned task can be mapped to a specific robot. The third column of Fig. 7.33 shows
the simulation of a robot assistant performing the task. In the fourth column, the real
robot is performing the task in a household environment.
Although autonomous robots are the explicit goal of many research projects, there
are nonetheless tasks which are very hard to perform for an autonomous robot at the
moment, and which robots may never be able to perform completely on their own.
Examples are tasks where the processing of sensor data would be very complex or
where it is vital to react flexibly to unexpected incidents. On the other hand, some
of these tasks cannot be performed by humans either, since they are too dangerous
(e.g. in contaminated areas), too expensive (e.g. some underwater missions) or too
time-consuming (e.g. space missions). In all of these cases, it would be advantageous
to have a robot system which performs the given tasks, combined with the foresight
and knowledge of a human expert.
In this section, the basic ideas of telepresence and telerobotics are introduced, and
some example applications are presented. "Telepresence" denotes a system by
which all relevant environment information is presented to a human user as if he
or she were present on site. In "telerobotic" applications, a robot is located re-
mote from the human operator. Due to latencies in the signal transfer from and to the
robot, instantaneous intervention by the user is not possible.
Fig. 7.34 shows the range of potential systems with regard to both system com-
plexity and the abstraction level of programming which is necessary. As can be seen
Fig. 7.33. Experiment "Laying a table". 1st column: real demonstration; 2nd column: analysis
and interpretation of the observed task; 3rd column: simulated execution; 4th column: execution
with a real robot in a household environment.
Fig. 7.34. The characteristic of telerobotic systems in terms of system complexity and level of
programming abstraction.
from the figure, both the interaction with the user and the use of sensor informa-
tion increase heavily with increasing system complexity and programming abstrac-
tion. At one end of the scale are industrial robot systems with low complexity
and explicit programming. The scale extends towards implicit programming (as in
Programming by Demonstration systems) and via telerobots towards completely au-
tonomous systems. In terms of telepresence and telerobotics, this means a shift from
telemanipulation via teleoperation towards autonomy. "Telemanipulation" refers to
systems where the robot is controlled by the operator directly, i.e. without the use
of sensor systems or planning components. In "teleoperation", the planning of robot
movements is initiated by the operator and makes use of environment model
data which is permanently updated from sensor data processing.
Telerobotic systems are thus semi-autonomous robot systems which integrate the
human operator into their control loop because of his perceptive and reactive capa-
bilities, which are indispensable for some tasks. Fig. 7.35 shows the components of
such a system in more detail. The robot is monitored by sensor systems like cameras.
The sensor data is presented to the operator who decides about the next movements
of the robot and communicates them to the system by appropriate input devices. The
planning and control system integrates the instructions of the operator and the results
of the sensor data processing, plans the movements to be performed by the robot and
controls the robot remotely. The main problem of these systems is the data transfer
between the robot and the control system on the user’s side. Latencies in signal trans-
fer due to technical reasons make it impossible for the user or for the planning and
control system, respectively, to intervene immediately in the case of emergencies.
Besides the latencies in the signal transfer, the kind of information gathered from
the teleoperated robot system is often limited due to the limited sensors on board, and
the sensing or visualization of important sensor values like force feedback or
temperature is restricted as well.
These data need to be synchronized with the teleoperator's planning and control
system in order to determine the internal state of the robot.
The teleoperator is aided by a planning and control system which simulates the
robot. The simulation is updated with sensor values and internal robot states. Based
on the simulation, a prediction module is needed for estimating the actual robot state
and (if possible) the sensor values. Based on this, the planning module generates
a new plan, which is then sent to the remote system.
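The prediction module can be sketched as a dead-reckoning extrapolation across the known latency; constant-velocity motion is an illustrative assumption:

```python
# Sketch of state prediction across the communication latency: extrapolate
# the last reported robot state to the present, assuming constant velocity.

def predict_state(last_pos, last_vel, latency):
    """Dead-reckon the robot position across the round-trip latency."""
    return tuple(p + v * latency for p, v in zip(last_pos, last_vel))

# Robot reported position (1.0, 2.0) moving at (0.1, 0.0) m/s, 0.8 s latency
est = predict_state((1.0, 2.0), (0.1, 0.0), 0.8)   # approx. (1.08, 2.0)
```

Planning against the predicted state rather than the (stale) reported state is what lets the control loop tolerate the latencies discussed above.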
Besides the automatic planning module, the interaction modality between the teleop-
erator and the remote system represents a crucial component whenever the human
operator controls the robot directly. Here, apart from adequate input and output
modalities (depending on the sensors and devices used), the selection of informa-
tion and the form of presentation must be adapted to the actual situation and control
strategies.
Error and exception handling of a teleoperated system must ensure, on the one hand,
that the system does not destroy itself or collide with the environment and, on
the other hand, that it maintains the communication channel. Therefore security
strategies like stopping the robot's actions in dangerous situations and waiting for
manual control are often implemented.
Fig. 7.36. (a) Six-legged walking machine LAURON; (b) Head-mounted display with the head
pose sensor, (c) flow of image and head pose data (for details see [1]).
The interface marks the intersection of the laser distance sensor beam with the
scene and presents the distance obtained from the sensor at that point, cf. Fig. 7.37.
Fig. 7.37. Immersive interface (user view) and the remotely working teleoperator.
Fig. 7.38. Input setup with data gloves and magic wand as input devices.
Fig. 7.39. (left) Mobile platform ODETE, with emotional expression that humans can read;
(right) anthropomorphic robot ALBERT loading a dishwasher.
• Manipulating the environmental model. Since robots are excellent at refining ex-
isting geometrical object models from sensor data, but very bad at creating new
object models from unclassified data themselves, the human user is integrated
into the robot’s sensory loop. The user receives a view of the robot’s sensor data
and is thus able to use interactive devices to perform cutting and selection actions
on the sensor data to help the robot to create new object models from the data
itself.
Fig. 7.40. Views in the head-mounted display: (top) topological map, and (bottom) trajectory
of the manipulation "loading a dishwasher".
References
1. Albiez, J., Giesler, B., Lellmann, J., Zöllner, J., and Dillmann, R. Virtual Immersion for
Tele-Controlling a Hexapod Robot. In Proceedings of the 8th International Conference
on Climbing and Walking Robots (CLAWAR). September 2005.
2. Alissandrakis, A., Nehaniv, C., and Dautenhahn, K. Imitation with ALICE: Learning to
Imitate Corresponding Actions Across Dissimilar Embodiments. IEEE Transactions on
Systems, Man, and Cybernetics, 32(4):482–496, 2002.
3. Anderson, J. Kognitive Psychologie, 2nd edition. Spektrum der Wissenschaft Verlagsgesellschaft mbH, Heidelberg, 1989.
4. Arbib, M., Iberall, T., Lyons, D., and Linscheid, R. Hand Function and Neocortex, chapter Coordinated control programs for movement of the hand, pages 111–129. A. Goodwin and T. Darian-Smith (eds.), Springer-Verlag, 1985.
5. Archibald, C. and Petriu, E. Computational Paradigm for Creating and Executing Sensor-
based Robot Skills. 24th International Symposium on Industrial Robots, pages 401–406,
1993.
6. Asada, H. and Liu, S. Transfer of Human Skills to Neural Net Robot Controllers. In IEEE
International Conference on Robotics and Automation (ICRA), pages 2442–2448. 1991.
7. Asfour, T., Berns, K., and Dillmann, R. The Humanoid Robot ARMAR: Design and
Control. In IEEE-RAS International Conference on Humanoid Robots (Humanoids 2000),
volume 1, pages 897–904. MIT, Boston, USA, Sep. 7-8 2000.
8. Asfour, T. and Dillmann, R. Human-like Motion of a Humanoid Robot Arm Based on
a Closed-Form Solution of the Inverse Kinematics Problem. In IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS 2003), volume 1, pages 897–904.
Las Vegas, USA, Oct. 27-31 2003.
9. Bauckhage, C., Kummert, F., and Sagerer, G. Modeling and Recognition of Assembled
Objects. In IEEE 24th Annual Conference of the Industrial Electronics Society, volume 4,
pages 2051–2056. Aachen, September 1998.
10. Bergener, T., Bruckhoff, C., Dahm, P., Janßen, H., Joublin, F., Menzner, R., Steinhage,
A., and von Seelen, W. Complex Behavior by Means of Dynamical Systems for an An-
thropomorphic Robot. Neural Networks, 12(7-8):1087–1099, 1999.
11. Breazeal, C. Sociable Machines: Expressive Social Exchange Between Humans and Ro-
bots. Master’s thesis, Department of Electrical Engineering and Computer Science, MIT,
2000.
12. Brooks, R. The Cog Project. Journal of the Robotics Society of Japan Advanced Robotics,
15(7):968–970, 1997.
13. Bundesministerium für Bildung und Forschung. Leitprojekt Intelligente anthropomorphe
Assistenzsysteme, 2000. http://www.morpha.de.
14. Cutkosky, M. R. On Grasp Choice, Grasp Models, and the Design of Hands for Manu-
facturing Tasks. IEEE Transactions on Robotics and Automation, 5(3):269–279, 1989.
15. Cypher, A. Watch what I do - Programming by Demonstration. MIT Press, Cambridge,
1993.
16. Dillmann, R., Rogalla, O., Ehrenmann, M., Zöllner, R., and Bordegoni, M. Learning
Robot Behaviour and Skills based on Human Demonstration and Advice: the Machine
Learning Paradigm. In 9th International Symposium of Robotics Research (ISRR 1999),
pages 229–238. Snowbird, Utah, USA, October 9–12, 1999.
17. Dillmann, R., Zöllner, R., Ehrenmann, M., and Rogalla, O. Interactive Natural Programming of Robots: Introductory Overview. In Proceedings of the DREH 2002. Toulouse, France, October 2002.
18. Ehrenmann, M., Ambela, D., Steinhaus, P., and Dillmann, R. A Comparison of Four Fast Vision-Based Object Recognition Methods for Programming by Demonstration Applications. In Proceedings of the 2000 International Conference on Robotics and Automation (ICRA), volume 1, pages 1862–1867. San Francisco, California, USA, April 24–28, 2000.
19. Ehrenmann, M., Rogalla, O., Zöllner, R., and Dillmann, R. Teaching Service Robots Complex Tasks: Programming by Demonstration for Workshop and Household Environments. In Proceedings of the 2001 International Conference on Field and Service Robots (FSR), volume 1, pages 397–402. Helsinki, Finland, June 11–13, 2001.
20. Ehrenmann, M., Rogalla, O., Zöllner, R., and Dillmann, R. Analyse der Instrumentarien
zur Belehrung und Kommandierung von Robotern. In Human Centered Robotic Systems
(HCRS), pages 25–34. Karlsruhe, November 2002.
21. Ehrenmann, M., Zöllner, R., Knoop, S., and Dillmann, R. Sensor Fusion Approaches
for Observation of User Actions in Programming by Demonstration. In Proceedings of
the 2001 International Conference on Multi Sensor Fusion and Integration for Intelligent
Systems (MFI), volume 1, pages 227–232. Baden-Baden, August 19–22, 2001.
22. Elliot, J. and Connolly, K. A Classification of Hand Movements. In Developmental
Medicine and Child Neurology, volume 26, pages 283–296. 1984.
23. Festo. Tron X web page, 2000. http://www.festo.com.
24. Friedrich, H. Interaktive Programmierung von Manipulationssequenzen. Ph.D. thesis,
Universität Karlsruhe, 1998.
25. Friedrich, H., Holle, J., and Dillmann, R. Interactive Generation of Flexible Robot Pro-
grams. In IEEE International Conference on Robotics and Automation. Leuven, Belgium,
1998.
26. Friedrich, H., Münch, S., Dillmann, R., Bocionek, S., and Sassin, M. Robot Programming
by Demonstration: Supporting the Induction by Human Interaction. Machine Learning,
pages 163–189, May/June 1996.
27. Fritsch, J., Lomker, F., Wienecke, M., and Sagerer, G. Detecting Assembly Actions by
Scene Observation. In International Conference on Image Processing 2000, volume 1,
pages 212–215. Vancouver, BC, Canada, September 2000.
28. Giesler, B., Salb, T., Steinhaus, P., and Dillmann, R. Using Augmented Reality to Interact
with an Autonomous Mobile Platform. In IEEE International Conference on Robotics
and Automation (ICRA 2004), volume 1, pages 1009–1014. New Orleans, USA, Oct. 27-
31 2004.
29. Hashimoto. Humanoid Robots in Waseda University - Hadaly-2 and Wabian. In IEEE-
RAS International Conference on Humanoid Robots (Humanoids 2000), volume 1, pages
897–904. Cambridge, MA, USA, Sep. 7-8 2000.
30. Heise, R. Programming Robots by Example. Technical report, Department of Computer
Science, The university of Calgary, 1992.
31. Honda Motor Co., Ltd. Honda Debuts New Humanoid Robot "ASIMO". New Technology Allows Robot to Walk Like a Human. Press release of November 20, 2000. http://world.honda.com/news/2000/c001120.html.
32. Ikeuchi, K. and Suehiro, T. Towards an Assembly Plan from Observation. In International
Conference on Robotics and Automation, pages 2171–2177. 1992.
33. Ikeuchi, K. and Suehiro, T. Towards an Assembly Plan from Observation, Part I: Task
Recognition with Polyhedral Objects. IEEE Transactions on Robotics and Automation,
10(3):368–385, 1994.
34. Immersion. Cyberglove Specifications, 2004. http://www.immersion.com.
35. Inaba, M. and Inoue, H. Robotics Research 5, chapter Vision Based Robot Programming,
pages 129–136. MIT Press, Cambridge, Massachusetts, USA, 1990.
36. Jiar, Y., Wheeler, M., and Ikeuchi, K. Hand Action Perception and Robot Instruction.
Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh,
Pennsylvania 15213, 1996.
37. Jiar, Y., Wheeler, M., and Ikeuchi, K. Hand Action Perception and Robot Instruction. In
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems,
volume 3, pages 1586–93. 1996.
38. Kaiser, M. Interaktive Akquisition elementarer Roboterfähigkeiten. Ph.D. thesis, Universität Karlsruhe (TH), Fakultät für Informatik, 1996. Published by Infix-Verlag, St. Augustin (DISKI 153).
39. Kanehiro, F., Kaneko, K., Fujiwara, K., Harada, K., Kajita, S., Yokoi, K., Hirukawa, H.,
Akachi, K., and Isozumi, T. KANTRA – Human-Machine Interaction for Intelligent
Robots using Natural Language. In IEEE International Workshop on Robot and Human
Communication, volume 4, pages 106–110. 1994.
40. Kang, S. and Ikeuchi, K. Temporal Segmentation of Tasks from Human Hand Mo-
tion. Technical Report CMU-CS-93-150, Computer Science Department, Carnegie Mel-
lon University, Pittsburgh, PA, April 1993.
41. Kang, S. and Ikeuchi, K. Toward Automatic Robot Instruction from Perception: Mapping
Human Grasps to Manipulator Grasps. IEEE Transactions on Robotics and Automation, 13(1):81–95, February 1997.
42. Kang, S. and Ikeuchi, K. Toward Automatic Robot Instruction from Perception: Mapping
Human Grasps to Manipulator Grasps. IEEE Transactions on Robotics and Automation, 13(1):81–95, February 1997.
43. Kang, S. B. Robot Instruction by Human Demonstration. Ph.D. thesis, Carnegie Mellon
University, Pittsburgh, PA, 1994.
44. Kang, S. B. and Ikeuchi, K. Determination of Motion Breakpoints in a Task Sequence
from Human Hand Motion. In IEEE International Conference on Robotics and Automa-
tion (ICRA'94), volume 1, pages 551–556. San Diego, CA, USA, 1994.
45. Kibertron. Kibertron Web Page, 2000. Http://www.kibertron.com.
46. Koeppe, R., Breidenbach, A., and Hirzinger, G. Skill Representation and Acquisition
of Compliant Motions Using a Teach Device. In IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS'96), volume 1, pages 897–904. Pittsburgh, PA,
USA, Aug. 5-9 1996.
47. Kuniyoshi, Y., Inaba, M., and Inoue, H. Learning by Watching: Extracting Reusable
Task Knowledge from Visual Observation of Human Performance. IEEE Transactions
on Robotics and Automation, 10(6):799–822, 1994.
48. Lee, C. and Xu, Y. Online, Interactive Learning of Gestures for Human/Robot Inter-
faces, Minneapolis, Minnesota. In Proceedings of the IEEE International Conference on
Robotics and Automation, volume 4, pages 2982–2987. April 1996.
49. Menzner, R. and Steinhage, A. Control of an Autonomous Robot by Keyword Speech. Technical report, Ruhr-Universität Bochum, April 2001. Pages 49–51.
50. MetaMotion. Gypsy Specification. Meta Motion, 268 Bush St. 1, San Francisco, Califor-
nia 94104, USA, 2001. http://www.MetaMotion.com/motion-capture/magnetic-motion-capture-2.htm.
51. Mitchell, T. Explanation-based Generalization - a unifying View. Machine Learning,
1:47–80, 1986.
52. Newtson, D. The Objective Basis of Behaviour Units. Journal of Personality and Social
Psychology, 35(12):847–862, 1977.
53. Onda, H., Hirukawa, H., Tomita, F., Suehiro, T., and Takase, K. Assembly Motion
Teaching System using Position/Force Simulator—Generating Control Program. In 10th
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages
389–396. Grenoble, France, September 7–11, 1997.
54. Paul, G. and Ikeuchi, K. Modelling planar Assembly Tasks: Representation and Recogni-
tion. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), Pittsburgh, Pennsylvania, volume 1, pages 17–22. August 5–9, 1995.
55. Pomerleau, D. A. Efficient Training of Artificial Neural Networks for Autonomous Nav-
igation. Neural Computation, 3(1):88 – 97, 1991.
56. Pomerleau, D. A. Neural Network Based Vision for Precise Control of a Walking Robot.
Machine Learning, 15:125 – 135, 1994.
57. Rogalla, O., Ehrenmann, M., and Dillmann, R. A Sensor Fusion Approach for PbD. In
Proc. of the IEEE/RSJ Conference Intelligent Robots and Systems, IROS’98, volume 2,
pages 1040–1045. 1998.
58. Schaal, S. Is Imitation Learning the Route to Humanoid Robots? In Trends in Cognitive Sciences, volume 3, pages 233–242. 1999.
59. Segre, A. Machine Learning of Assembly Plans. Kluwer Academic Publishers, 1989.
60. Steinhage, A. and Bergener, T. Learning by Doing: A Dynamic Architecture for Gener-
ating Adaptive Behavioral Sequences. In Proceedings of the Second International ICSC
Symposium on Neural Computation (NC), pages 813–820. 2000.
61. Steinhage, A. and v. Seelen, W. Dynamische Systeme zur Verhaltensgenerierung eines
anthropomorphen Roboters. Technical report, Institut für Neuroinformatik, Ruhr-Universität Bochum, Bochum, Germany, 2000.
62. Takahashi, T. Time Normalization and Analysis Method in Robot Programming from
Human Demonstration Data. In Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA), Minneapolis, USA, volume 1, pages 37–42. April 1996.
63. Toyota. Toyota Humanoid Robot Partners Web Page, 2000. http://www.toyota.co.jp/en/special/robot/index.html.
64. Tung, C. and Kak, A. Automatic Learning of Assembly Tasks using a Dataglove System.
In Proceedings of the International Conference on Intelligent Robots and Systems (IROS),
pages 1–8. 1995.
65. Tung, C. P. and Kak, A. C. Automatic Learning of Assembly Tasks Using a Data-
Glove System. In IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS'95), volume 1, pages 1–8. Pittsburgh, PA, USA, Aug. 5-9 1995.
66. Ude, A. Rekonstruktion von Trajektorien aus Stereobildfolgen für die Programmierung
von Roboterbahnen. Ph.D. thesis, IPR, 1995.
67. Vapnik, V. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
68. Vicon. Vicon Web Page, 2000. http://www.vicon.com/products/viconmx.html.
69. Voyles, R. and Khosla, P. Gesture-Based Programming: A Preliminary Demonstration. In
Proceedings of the IEEE International Conference on Robotics and Automation, Detroit,
Michigan, pages 708–713. May 1999.
70. Zhang, J., von Collani, Y., and Knoll, A. Interactive Assembly by a Two-Arm-Robot
Agent. Robotics and Autonomous Systems, 1(29):91–100, 1999.
71. Zöllner, R., Rogalla, O., Zöllner, J., and Dillmann, R. Dynamic Grasp Recognition within
the Framework of Programming by Demonstration. In The 10th IEEE International Work-
shop on Robot and Human Interactive Communication (Roman), pages 418–423. September 18–21, 2001.
Chapter 8
Fig. 8.1. Interpersonal conversation (left) vs. conventional man-machine interaction (right).
[Figure: man-machine interaction loop with user inputs, information display, and supporting data bases]
Data Bases
In general, several application-specific data bases are needed for generating user support. For dialog systems these data bases may contain the menu hierarchy, nomenclature, and functionality descriptions. For dynamic systems a system data base must be provided that contains information on performance limits, available resources, functionality schematics of subsystems, maintenance data, and diagnostic aids.

For vehicle systems an environmental data base will contain navigational, geographical, and topological information, infrastructure data, as well as weather conditions. For air traffic the navigational data base may, e.g., contain airport approach maps, approach procedures, or radio beacon frequencies. In surface traffic, e.g., digital road maps and traffic signaling would be needed.

Based on the context of use and on the data base content there are five options for supporting the user in task execution:
372 Assisted Man-Machine Interaction
Manual control relies on perception, mental models of system dynamics, and manual dexterity. While human perception outperforms computer vision in many respects, e.g. in Gestalt recognition and visual scene analysis, there are also some serious shortcomings. It is, e.g., difficult for humans to keep attention high over extended periods of time. In case of fatigue there is a tendency for selective perception, i.e. the environment is partially neglected (tunnel vision). Night vision capabilities are rather poor; environmental conditions like fog or rain also restrict long range vision. Visual acuity deteriorates with age. Accommodation and adaptation processes result in degraded resolution, even with young eyes. Support needs arise also from cognitive deficiencies in establishing and maintaining mental models of system dynamics.

Human motor skills are on the one hand very elaborate, considering the 27 degrees of freedom of a hand. Unfortunately hand skills are limited by muscle tremor as far as movement precision is concerned and by fatigue as far as force exertion is concerned [7, 42]. Holding a position steadily over an extended period of time is beyond human abilities. In high precision tasks, e.g. during medical teleoperation, this cannot be tolerated. Vibrations also pose a problem: vibrations present in a system, originating e.g. from wind gusts or moving platforms, tend to cross talk into manual inputs and disturb control accuracy.
For the discussion of manual control limitations a simple compensatory control
loop is depicted in (Fig. 8.3a), where r(t) is the input reference variable, e(t) the
error variable, u(t) the control variable, and y(t) the output variable. The human
operator is characterized by the transfer function GH (s), and the system dynamics
by GS (s).
Fig. 8.3. Manual compensatory control loop (a) and equivalent crossover model (b).
The control loop can be kept stable manually, as long as the open loop transfer
function G0 (s) meets the crossover model of (8.1), where ωc is the crossover fre-
quency and τ the delay time accumulated in the system. Model parameters vary with
input bandwidth, subjects and subject practice. The corresponding crossover model
loop is depicted in (Fig. 8.3b).
G0(s) = GH(s) · GS(s) = (ωc / s) · e^(−sτ)     (8.1)
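Because |G0(jωc)| = 1 at the crossover frequency, the phase margin of the loop follows directly from ωc and τ. A minimal sketch of this check (the numerical values are illustrative, not taken from the text):

```python
import numpy as np

def crossover_phase_margin(omega_c, tau):
    """Phase margin in degrees for G0(s) = (omega_c / s) * exp(-s*tau).

    At the crossover frequency |G0(j*omega_c)| = 1; the phase there is
    -90 deg from the integrator minus the delay contribution omega_c*tau.
    """
    phase_deg = -90.0 - np.degrees(omega_c * tau)
    return 180.0 + phase_deg

# illustrative values: omega_c = 4.5 rad/s, tau = 0.2 s
margin = crossover_phase_margin(4.5, 0.2)   # roughly 38 degrees
```

A larger delay τ eats directly into the phase margin, which is why additional system lags quickly make manual stabilization impossible.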
GH(s) = ((1 + T1·s) / (1 + T2·s)) · (1 / (1 + Tn·s)) · e^(−sτ)     (8.2)
Stabilization requires adaptation of the lead/lag and gain terms in GH(s). In a second order system the operator must, e.g., produce a lead time T1 in equation (8.2). With reference to the crossover model it follows that stabilization of third order system dynamics is beyond human abilities (except after long training and the build-up of a mental model of the system dynamics).
The context of use for manual control tasks must take into account the operator state, the system state, and the environmental state. The selection of suitable data describing these states is largely application dependent and cannot be discussed in general.

Therefore the methods of state identification addressed here will focus on car driving. A very general and incomplete scheme characterizing the context of car driving has to take into account the driver, the vehicle state, and the traffic situation. For each of these, typical input data are given in (Fig. 8.4), which will subsequently be discussed in more detail.
Biosignals appear to be the first choice for driver state identification. Aviation medicine in particular has stimulated the development of manageable biosensors to collect data via electrocardiograms (ECG), electromyograms (EMG), electrooculograms (EOG), and electrodermal activity sensors (EDA). All of these have proved to be useful, but have the disadvantage of intrusiveness. Identification of the driver's state in everyday operations must in no case interfere with driver comfort. Therefore non-intrusive methods of biosignal acquisition have been tested, e.g. deriving skin resistance and heartbeat from the hands holding the steering wheel.
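As a sketch of such non-intrusive acquisition, the heartbeat can be estimated from a raw cardiac signal picked up at the steering wheel by simple R-peak detection. The threshold and sampling rate below are illustrative assumptions; a real system would band-pass filter the electrode signal first:

```python
import numpy as np

def heart_rate_bpm(signal, fs, threshold=0.6):
    """Estimate heart rate from a raw cardiac signal (illustrative sketch).

    signal: 1-D array, e.g. from steering-wheel electrodes
    fs:     sampling rate in Hz
    R peaks are taken as rising edges above a fixed threshold; the mean
    R-R interval then gives beats per minute.
    """
    above = signal >= threshold
    peaks = np.flatnonzero(above[1:] & ~above[:-1]) + 1   # rising edges
    if len(peaks) < 2:
        return None                     # not enough beats detected
    rr = np.diff(peaks) / fs            # R-R intervals in seconds
    return 60.0 / np.mean(rr)
```

A beat every 0.8 s at fs = 100 Hz, for instance, yields 75 beats per minute.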
Assistance in Manual Control 375
Fig. 8.5. Eye blink as an indicator of fatigue (left: drowsy; right: awake).
Driver control behavior is easily accessible from control inputs to steering wheel,
brakes, and accelerator pedal. The recorded signals may be used for driver modeling.
Individual and interindividual behavior may however vary significantly over time.
For a model to be useful, it must therefore be able to adapt to changing behavior in
real time.
Consider e.g. the development of a curve warning system for drivers on a winding road. Since the lateral acceleration preferred during curve driving differs between relaxed and dynamic drivers, the warning threshold must be individualized (Fig. 8.6). A warning must only be given if the expected lateral acceleration in the curve ahead exceeds the individual preference range. A fixed threshold set too high would be useless for the relaxed driver, while a threshold set too low would be annoying to the dynamic driver [32].
Fig. 8.6. Interpersonal differences in lateral acceleration preferences during curve driving [32].
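The individualized threshold logic can be sketched as follows: the expected lateral acceleration in the curve ahead is v²/r, and the preference range is estimated from the driver's own past curve driving. The mean-plus-k-sigma threshold is an illustrative choice, not taken from [32]:

```python
import numpy as np

def curve_warning(speed, curve_radius, preferred_lat_acc, k=2.0):
    """Warn only if the expected lateral acceleration in the curve ahead
    exceeds this driver's individual preference range.

    speed:             current speed in m/s
    curve_radius:      radius of the curve ahead in m (from the digital map)
    preferred_lat_acc: lateral accelerations [m/s^2] observed for this driver
    """
    expected = speed ** 2 / curve_radius                       # a = v^2 / r
    threshold = np.mean(preferred_lat_acc) + k * np.std(preferred_lat_acc)
    return expected > threshold
```

For a relaxed driver whose preferred accelerations cluster around 2 m/s², entering a 100 m curve at 20 m/s (expected 4 m/s²) triggers a warning, while 10 m/s does not.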
With respect to user models, parametric and black-box approaches must be distinguished. If a priori knowledge about behavior is available, a model structure can be developed where only some parameters remain to be identified; otherwise no a priori assumptions can be made [26]. An example of parametric modeling is the method for direct measurement of crossover model parameters proposed by [19]. Here a gradient search approach is used to identify ωc and τ in the frequency domain, based on partial derivatives of G0 as shown in (Fig. 8.7).
In contrast to parametric models, black box models use regression to match
model output to observed behavior (Fig. 8.8). The applied regression method must
be able to cope with inconsistent and uncertain data and follow behavioral changes
in real time.
One option for parametric driver modeling is functional link networks, which approximate a function y(x) by nonlinear basis functions Φj(x) [34]:

y(x) = Σ_{j=1}^{b} wj Φj(x)     (8.3)
Kraiss [26] used this approach to model driver following and passing behavior in a two-way traffic situation. The weights are learned by supervised training, and the weighted basis functions are summed up to yield the desired model output. Since global basis functions are used, the resulting model represents mean behavior averaged over subjects and time.
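Because the model (8.3) is linear in the weights wj, supervised training reduces to ordinary least squares once the basis functions are chosen. A sketch with a simple polynomial basis (the basis choice is illustrative, not from the text):

```python
import numpy as np

def fit_functional_link(x, y, basis):
    """Least-squares weights for y(x) = sum_j wj * Phi_j(x), cf. (8.3)."""
    Phi = np.column_stack([phi(x) for phi in basis])  # one column per basis fn
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# global polynomial basis functions as one simple choice
basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]
x = np.linspace(-1.0, 1.0, 50)
w = fit_functional_link(x, 1.0 + 2.0 * x + 3.0 * x ** 2, basis)  # ~ [1, 2, 3]
```

Training data generated by y = 1 + 2x + 3x² recovers exactly those weights, illustrating that the only "learning" involved is a linear fit.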
For individual driver support mean behavior is not sufficient, since rare events are also of interest and must not be concealed by frequently occurring events. For the identification of such details semi-parametric, local regression methods are better suited than global regression [15]. In a radial basis function network a function y(x) is accordingly approximated by weighted Gaussian functions Φj,

y(x) = Σ_{j=1}^{b} wj Φj(‖x − xj‖)     (8.4)

where ‖x − xj‖ denotes a norm, mostly the Euclidean distance, of a data point x from the center xj of Φj.
Garrel [13] used a modified version of basis function regression as proposed by [39] for driver modeling. The task was to learn individual driving behavior when following a leading car. Inputs were the own speed as well as distance and speed differences. Data were collected in a driving simulator during 20-minute training intervals. As documented by (Fig. 8.9), the model output emulates driver behavior almost exactly.
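In contrast to global basis functions, the Gaussian basis functions in (8.4) are local: each center xj only influences predictions in its neighborhood, so rare maneuvers are not averaged away. A minimal sketch of the network output; centers, width, and weights would in practice be fitted to the recorded driving data, and the values below are illustrative:

```python
import numpy as np

def rbf_output(x, centers, weights, width):
    """Evaluate y(x) = sum_j wj * Phi_j(||x - xj||), cf. (8.4),
    with Gaussian basis functions and the Euclidean distance."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances ||x - xj||^2
    phi = np.exp(-d2 / (2.0 * width ** 2))    # local Gaussian activations
    return phi @ weights

# hypothetical inputs (own speed, distance, speed difference), as in the
# car-following task; two centers with illustrative weights
centers = np.array([[25.0, 30.0, 0.0], [15.0, 20.0, -2.0]])
weights = np.array([0.5, -0.3])
```

At a center the corresponding Gaussian contributes its full weight, while far away from all centers the output decays toward zero, which is what makes the representation local.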
Traffic rules to be followed and alternative routes in case of traffic jams must also be known. Further factors of influence include the time of day, week, month, or year as well as weather conditions and the available infrastructure.
Distance sensors used for reconnaissance include far, mid, and short range radar, ultrasonic, and infrared sensors (Fig. 8.10). Recently video cameras have also become powerful enough for outdoor application. Especially CMOS high dynamic range cameras can cope with varying lighting conditions.
Advanced real-time vision systems try to emulate human vision by active gaze control and simultaneous processing of the foveal and peripheral fields of view. A summary of related machine vision developments for road vehicles in the last decade may be found in [11]. Visual scene analysis also draws a great deal upon digital map information for the disambiguation of a priori hypotheses [37].
In digital maps infrastructural data like the position of gas stations, restaurants, and workshops are associated with geographical information. In combination with traffic telematics, information can be provided beyond what the driver perceives with his eyes, e.g. looking around a corner. Telematics also enables communication between cars and drivers (a feature to be used e.g. for advance warnings), as well as between driver and infrastructure, e.g. via electronic beacons (Fig. 8.11).
Sensor data fusion is needed to take best advantage of all sensors. In (Fig. 8.12) a car is located on the street by a combination of digital map and differential global positioning system (DGPS) outputs. In parallel a camera tracks the road side-strip and mid-line to exactly identify the car position relative to the lane margins.

Following augmentation by radar, traffic signs and obstacles are extracted from the pictures, as well as other leading, trailing, passing, or cutting-in cars. Reference to the digital maps also yields the curve radius and the reference speed of the road ahead. Finally the braking distance and impending skidding are derived from yaw angle sensing, wheel slip detection, tire pressure, and road condition.
Fig. 8.11. Traffic telematics enables contact between drivers and between driver and infrastructure.
[Fig. 8.12: sensor data fusion combining the digital map, precision DGPS positioning, route planning (reference speed, curve radius), and the INS vehicle position]
Information Management
Information management may rely on the augmentation of environmental cue perceptibility, on the provision of additional information, or on the provision of commands and alerts. In this section the different measures are presented and discussed.
Environmental cue perceptibility augmentation
Compensation of sensory deficiencies by technical means applies in difficult environmental conditions. An example from ground traffic is active night vision, where the headlights are made to emit near-infrared radiation reaching farther than the car's own low beam. Thus a normally invisible pedestrian (Fig. 8.14, left) is illuminated and can be recorded by an IR-sensitive video camera. The infrared picture is then superimposed on the outside view with a head-up display (Fig. 8.14, right). Hereby no change of accommodation is needed when looking from the panel to the outside and vice versa.
Fig. 8.14. Left: Sensory augmentation by near infrared lighting and head-up display of pictures
recorded by an IR sensitive video camera [22].
[Figure: manual control loop with operator GH, system GS, and predictor GP; from r(t), e(t), u(t), and y(t) the predictor derives the predicted output variable yP(t)]
GP(s) = yP / y = 1 + τ·s + (τ²/2)·s² + remnant     (8.5)
Now assume the dynamics GS of the controlled process to be of third order (with ζ = degree of damping and ωn = natural frequency):
GS(s) = y / u = 1 / ((1 + (2ζ/ωn)·s + (1/ωn²)·s²) · s)     (8.6)
With τ = √2/ωn and ζ = 1/√2 the product GS(s) · GP(s) simplifies, with respect to the predicted state variable, to an easily controllable integrator (8.7):

GS(s) · GP(s) = yP / u = (1 + τ·s + (τ²/2)·s²) · ((1 + (2ζ/ωn)·s + (1/ωn²)·s²) · s)^(−1) = 1/s     (8.7)
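The cancellation in (8.7) can be checked numerically: with τ = √2/ωn and ζ = 1/√2 the predictor polynomial 1 + τs + (τ²/2)s² coincides term by term with the quadratic factor of the system denominator, leaving only the integrator. A short sketch, with ωn chosen arbitrarily for illustration:

```python
import numpy as np

omega_n = 3.0                          # illustrative natural frequency [rad/s]
tau = np.sqrt(2.0) / omega_n           # predictor lead time per (8.7)
zeta = 1.0 / np.sqrt(2.0)              # damping ratio per (8.7)

# polynomial coefficients of s^0, s^1, s^2
predictor = np.array([1.0, tau, tau ** 2 / 2.0])
system_quadratic = np.array([1.0, 2.0 * zeta / omega_n, 1.0 / omega_n ** 2])

# term-by-term equality -> GS(s) * GP(s) collapses to the integrator 1/s
assert np.allclose(predictor, system_quadratic)
```

The same check fails for any other choice of τ or ζ, which shows how specifically the predictor has to be tuned to the system dynamics.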
As an example, a submarine depth control predictor display is shown in Fig. 8.16. The boat is at 22 m actual depth and tilts downward. The predictor is depicted as a quadratic spline line. It shows that the boat will level off after about 28 s at a depth of about 60 m, provided the control input remains as it is. Dashed lines in (Fig. 8.16) indicate the limits of maneuverability if the depth rudders are set to the extreme upper and lower deflections, respectively. The helmsman can observe the effect of inputs well ahead of time and take corrective actions early. The predictor display thus visualizes system dynamics which otherwise would have to be trained into a mental model [25].
Predictors can be found in many traffic systems, in robotics, and in medicine.
Successful compensation of several seconds of time delay by prediction has e.g. been
demonstrated for telerobotics in space [16]. Reintsema et al. [35] summarize recent
developments in high fidelity telepresence in space and surgery robotics, which also
make use of predictors.
Provision of alerts and commands
Since reaction time is critical during manual control, assistance often takes the form of visual, acoustic, or haptic commands and alerts. Making best use of the human senses requires addressing the sensory modality that matches the task requirements best. Reaction to haptic stimulation is the fastest among human modalities (Fig. 8.17). Haptic stimulation is also perceived in parallel to seeing and hearing and thus establishes an independent sensory input channel. These characteristics have led to an extensive use of haptic alerts. In airplanes control stick shakers become active if the airspeed gets critically slow. More recently a tactile situation awareness system for helicopter pilots has been presented, where the pilot's waistcoat is equipped with 120 tactile actuators which provide haptic 3D warning information. In cars the driver receives haptic torque or vibration commands via the steering wheel or accelerator pedal as part of an adaptive distance and heading control system.
Redundant coding of a signal by more than one modality is also a proven means to guarantee perception. An alarm coded simultaneously via display and sound is less likely to be missed than a visual signal alone [5].
Fig. 8.17. Reaction times following haptic and acoustic stimulation [30].
already overloaded by other observation tasks. Therefore other modalities are often preferable.
In this section two measures for managing control inputs in dynamic systems are illustrated: control input modification and adaptive filtering of platform vibrations.
Control input modifications
Modifications of the manual control input signal may consist of shaping, limiting, or amplifying it. Nonlinear signal shaping is e.g. applied in active steering systems of premium cars, where it contributes to increased comfort of use. During parking at low speed, small turning angles at the steering wheel result in large wheel rotation angles, while during fast highway driving, wheel rotations corresponding to the same steering input are comparatively small. The amplitude of required steering actions is thus made adaptive to the traffic situation.
Similar to steering gain, nonlinear force execution gain can be a useful method of control input management. Applying the right level of force to a controlled element can be a difficult task. It is e.g. a common observation that people refrain from exerting the full power at their disposal on the brake pedal. In fact, even in emergency situations, only 25% of the physically possible pressure is applied. In consequence a car comes to a stop much later than brakes and tires would permit. In order to compensate for this effect, car manufacturers offer a braking assistant which intervenes by multiplying the applied braking force by a factor of about four. Hence even moderate braking pressure results in emergency braking (Fig. 8.18). This braking augmentation is however only triggered when the brake pedal is activated very fast, as is the case in emergency situations.
[Fig. 8.18: car deceleration over executed brake pedal force for normal braking and for emergency braking, in which the executed brake pedal force is amplified by a factor of 4]
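The trigger-and-amplify logic of such a braking assistant can be sketched as follows. Only the amplification factor of about four and the fast-activation condition come from the text; the pedal-rate trigger value is an illustrative assumption:

```python
def assisted_brake_force(pedal_force, pedal_rate, gain=4.0, rate_trigger=200.0):
    """Braking-assistant sketch (illustrative, not a production algorithm).

    pedal_force: executed pedal force in percent of maximum (0..100)
    pedal_rate:  how fast the pedal is applied, in percent per second;
                 a very fast application marks an emergency situation
    """
    if pedal_rate >= rate_trigger:              # emergency: amplify by ~4
        return min(100.0, gain * pedal_force)
    return float(pedal_force)                   # normal braking: unchanged
```

With this logic a moderate 25% pedal force applied very fast already commands full braking, while slow, deliberate braking is passed through unchanged.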
A special case of signal shaping is control input limiting. The structure of airplanes, e.g., is dimensioned for a range of permitted flight maneuvers. In order to prevent a pilot from executing illegal trajectories, like flying turns that are too narrow at excessive speed, steering inputs are confined with respect to the exerted accelerations (g-force limiters). Another example refers to traction control systems in cars, where inputs to the accelerator pedal are limited to prevent the wheels from slipping.
Input shaping by smoothing plays an important role in human-robot interaction and teleoperation. Consider, e.g., a surgeon using a robot actuator over an extended period of time. Under sustained strain his muscles will unavoidably show tremor that cross talks into the machine. This tremor can be smoothed out to guarantee steady movements of the robot tool.
Adaptive filtering of moving platform vibration cross talk into control input
In vehicle operations control inputs may be affected by moving platform vibrations, which must be disposed of by adaptive filtering. This method makes use of the fact that the noise n0 contained in the control signal s is correlated with a reference noise n1 acquired at a different location in the vehicle [14, 29, 40]. The functioning of adaptive filtering is illustrated by (Fig. 8.19), where s, n0, n1, and y are assumed to be stationary and to have zero means:
δ = s + n0 − y (8.8)
Squaring (8.8) and taking into account that s is correlated with neither n0 nor y yields the expectation:
E[δ²] = E[s²] + E[(n0 − y)²] + 2E[s(n0 − y)] = E[s²] + E[(n0 − y)²]     (8.9)
Adapting the filter to minimize E[δ²] will minimize E[(n0 − y)²] but not affect the signal power E[s²]. The filter output y is therefore the best least squares estimate of the noise n0 contained in the manual control input signal. Subtracting y according to (8.8) results in the desired noise canceling.
Automation
[Figure: multilayer menu hierarchy with hidden sublayer menu options; only the top layer is visible, while the desired menu option lies in the third sublayer]
Executing a dialog requires two kinds of knowledge. On the one hand the user needs application knowledge, which refers to knowing whether a function is available, which service it provides, and how it is accessed. On the other hand the user must be familiar with the menu hierarchy, the interface layout, and the labeling of functions. All this know-how is subsumed as interaction knowledge. Unfortunately only the top layer menu items are visible to the user, while sublayers are hidden until activated. Since a desired function may be buried in the depths of a multilayer dialog hierarchy, menu selection requires a mental model of the dialog tree structure that tells where a menu item is located in the menu tree and by which menu path it can be reached.
An additional difficulty arises from the fact that the labeling of menu items can be
misleading, since many synonyms exist. Occasionally even identical labels occur on
different menu branches and for different functions. Finally the geometric layout of
the interface must be known in order to localize the correct control buttons.
All of this must be kept ready in memory. Unfortunately human working memory
is limited to only 7 ± 2 chunks. Also mental models are restricted in accuracy and
completeness and subject to forgetting, which forces users to consult manuals or to
ask other people.
Assistance in Man-Machine Dialogs 387
(Figure: user categories spanned by application knowledge and interaction knowledge,
ranging from the least informed user and the novice to the application expert, the
interaction expert, and the trained specialist.)
The context of a dialog is determined by the state of the user, the state of the dialog
system itself, and the situation in which the dialog takes place, as depicted in Fig.
8.22. The figure lists the parameters available for identifying the context of a dialog,
which are discussed in more detail in the following sections.
(Fig. 8.22: Parameters for context of dialog identification. User identity and inputs
feed user state identification (skill level, preferences, plans/intentions); system
functions, the running application, dialog state and history, and the dialog hierarchy
feed dialog system state identification.)
Data sources for user state identification are the user’s identity and inputs. From these,
the user's skill level, preferences, and intentions are inferred. Skill level and preferences
are associated with the user’s identity. Entering a password is a simple method of
making oneself known to a system. Alternatively, biometric access, e.g. via fingerprint,
is an option. Both approaches require action by the user. Nonintrusive methods rely
on observed user actions. Preferences may e.g. be derived from the frequency with
which functions are used.
The timing of keystrokes is an indicator of skill level. This method exploits the
fact that people deliberately choose interaction intervals between 2 and 5 seconds
when not forced to keep pace. Longer intervals are indicative of deliberation, while
shorter intervals point to a trial-and-error approach [24].
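This timing heuristic is easy to operationalize. The sketch below classifies a sequence of observed inter-keystroke intervals using the 2 s and 5 s thresholds from the text; the labels and the function name are illustrative, not taken from [24].

```cpp
#include <numeric>
#include <string>
#include <vector>

// Classify the mean inter-keystroke interval: between 2 and 5 seconds
// indicates routine interaction, longer intervals deliberation, shorter
// ones a trial-and-error approach.
std::string skillIndicator(const std::vector<double>& intervalsSec) {
  double mean = std::accumulate(intervalsSec.begin(), intervalsSec.end(), 0.0)
                / intervalsSec.size();
  if (mean > 5.0) return "deliberating";
  if (mean < 2.0) return "trial-and-error";
  return "routine";
}
```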
Knowing what a user plans to do is a crucial asset in providing support. At the
same time it is difficult to find out the pursued plans, since prompting the user is
out of the question. The only applicable method then is keyhole intent recognition,
which is based on observing user inputs [9]. Techniques for plan identification from
logged data include Bayesian Belief Networks (BBNs), Dempster-Shafer theory,
and fuzzy logic [1, 2, 10, 18, 20]. Of these, BBNs turn out to be the most suitable
(Fig. 8.23).
Fig. 8.23. The concept of plan recognition from observed user actions [17].
As long as plans are complete and flawless they can be inferred easily from action
sequences. Unfortunately users also act imperfectly, by typing flawed inputs or by
applying operations that are incorrect or irrelevant to the set goal. Classification also
has to consider individual preferences among the various alternative strategies that
achieve identical goals. A plan recognizer architecture that can cope with these
problems was proposed by [17] and is depicted in Fig. 8.24.
The sequence of observations O in Fig. 8.24 comprises the logged user actions.
Feature extraction filters O in order to get rid of flawed inputs, resulting in a
feature vector M. The plan library P contains all intentions I a user may
possibly have while interacting with an application. Therefore the library content often
is identical with the list of functions a system offers. The a priori probability of
plans can be made dependent on context of use.

Fig. 8.24. Plan-based interpretation of user actions (adapted from [17]).
The plan execution models PM are implemented as three-layer Bayesian Belief
Networks, where nodes represent statistical variables, while links represent depen-
dencies (Fig. 8.25). User inputs are fed as features M into the net and combined into
syntactically and semantically meaningful operators R, which link to plan nodes and
serve to support or reject plan hypotheses.
The structure of and the conditional probabilities in a plan execution model PM
must cover all possible methods applicable to pursuing a particular plan. The condi-
tional probabilities of a BBN plan execution model are derived from training data,
which must be collected in preceding empirical tests.
For plan-based classification, feature vector components are fed into the PM feature
nodes. In case of a feature match the corresponding plan model state variable is set
to true, while the states of all other feature nodes remain false. Matching refers to the
feature as well as to the feature sequence (Fig. 8.26). For each plan hypothesis the
probability that the observed actions are part of that plan is then calculated, and the
most likely plan is adopted. This classification procedure is iterated after every new
user input.
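As an illustration of this iterated classification step, the following strongly simplified sketch replaces the three-layer BBN by a naive Bayes update: each plan hypothesis carries action likelihoods, the posterior over plans is renormalized after every observed action, and unknown (e.g. flawed) actions receive a small floor probability so that a single erroneous input cannot eliminate a plan. All names and numbers are invented for the example; this is not the model of [17].

```cpp
#include <map>
#include <string>

// Keyhole plan recognition as iterated Bayesian updating over a fixed set
// of plan hypotheses.
struct PlanRecognizer {
  std::map<std::string, std::map<std::string, double>> likelihood;
  std::map<std::string, double> posterior;  // starts out as the prior

  void observe(const std::string& action) {
    double sum = 0.0;
    for (auto& p : posterior) {
      auto it = likelihood[p.first].find(action);
      // flawed or unknown actions get a small floor probability instead
      // of zero, so one erroneous input cannot rule a plan out entirely
      p.second *= (it != likelihood[p.first].end()) ? it->second : 0.01;
      sum += p.second;
    }
    for (auto& p : posterior) p.second /= sum;  // renormalize
  }

  std::string best() const {                    // most likely plan so far
    std::string arg;
    double mx = -1.0;
    for (const auto& p : posterior)
      if (p.second > mx) { mx = p.second; arg = p.first; }
    return arg;
  }
};
```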
390 Assisted Man-Machine Interaction
Fig. 8.26. Mapping feature vector components to plan model feature nodes (adapted
from [17]).
Identification of the dialog system state and history poses no problem if data related to
GUI state, user interaction, and the running application can be obtained from the
operating system or the application software, respectively. Otherwise the expense
needed to collect these data may be prohibitive and prevent an assistance system from
being installed.
Essential attributes of a dialog situation are the user's whereabouts and concurrent user
activities. In the age of global positioning, ambient intelligence, and ubiquitous
computing, ample technical means exist to locate and track people's whereabouts.
People may also be equipped with various sensors (e.g. micro gyros) and wearable
computing facilities, which allow identification of their motor activities.
Sensors may alternatively be integrated into handheld appliances, as implemented
in the mobile phone described by [36]. This appliance is instrumented with
sensors for temperature, light, touch, acceleration, and tilt angle. Evaluation of such
sensor data enables recognition of user micro-activities like the phone being “held in
hand” or “located in pocket”, from which situations like the user “walking”, “boarding
a plane”, “flying”, or being “in a meeting” can be inferred.
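As a hedged illustration of such inference (not the actual classifier of [36]), a simple rule set can map raw readings to micro-activities; the sensor set, thresholds, and labels below are all invented for the example.

```cpp
#include <string>

// Toy rule-based inference of a phone's micro-state from sensor readings.
struct SensorReadings {
  double light;          // ambient light, 0 (dark) .. 1 (bright)
  double accelVariance;  // variance of recent acceleration samples
  bool touched;          // touch sensor active
};

std::string microActivity(const SensorReadings& r) {
  if (r.light < 0.1 && !r.touched) return "located in pocket";
  if (r.touched && r.accelVariance < 0.05) return "held in hand";
  if (r.accelVariance >= 0.05) return "moving";
  return "lying on table";
}
```

Real systems would replace such hand-written rules by trained classifiers and add a second stage that aggregates micro-activities over time into situations such as “in a meeting”.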
According to ISO/DIS 9241-110 (2004), the revision of ISO 9241 Part 10, the major
design principles for dialogs refer to functionality and usability. As far as system
functionality is concerned, a dialog must be tailored to the requirements of the task at
hand. No fewer, but also no more functions than needed should be provided. In this
way the complexity perceived by the user is kept as low as possible. The second design
goal, usability, is substantiated by the following attributes:
• Self-descriptiveness
• Conformity with expectations
• Error tolerance
• Transparency
• Controllability
• Suitability for individualization
Self-descriptiveness requires that the purpose of individual system functions or menu
options is readily understood by the user or can be explained by the system on
request. Conformity with expectations requires that system behavior complies
with user anticipations originating from previous experience or training. Error
tolerance suggests that a set plan can be achieved in spite of input errors, or with
only minor corrections. Transparency refers to GUI design and message formatting,
which must be matched to human perceptual, behavioral, and cognitive processes.
Controllability of a dialog allows the user to adjust the speed and sequence of in-
puts as individually preferred. Suitability for individualization finally describes the
adaptability of the user interface to individual preferences as well as to language and
culture.
Dialog assistance is expected to support the stated dialog properties. A system ar-
chitecture in accordance with this goal is proposed in Fig. 8.27. It takes dialog context,
dictionaries, and a plan library as inputs. The generated assistance concerns
various aspects of information management and data entry management, as detailed
in the following sections.
(Fig. 8.27: Man and dialog system coupled by an information display and control
inputs.)
Information Management
Information management is concerned with the question of which data should be pre-
sented to the user, when, and how. Different aspects of information management are
codes, formats, tutoring, functionality shaping, prompting, timing, and prioritizing.
Data entry often poses a problem, especially during mobile use of small appli-
ances. Also, people with handicaps need augmentative and alternative communication
(AAC). The goal of data entry management is to provide suitable input media, to ac-
celerate text input by minimizing the number of inputs needed, and to cope with input
errors. Recently multimodal interaction has received increasing attention, since input
by speech, gestures, facial expressions, and haptics can substitute for keyboards and
mice altogether (see related chapters in this book).
For spelling-based systems in English, the number of symbols is at least twenty-
seven (space included), and more when the system offers options to select from a
number of lexical predictions. A great variety of special input hardware has been
devised to enable selection of these symbols on small hand-held devices.
The simplest solution is to assign multiple letters to each key. The user then is
required to press the appropriate key a number of times for a particular letter to
be shown. This makes entering text a slow and cumbersome process. For wearable
computing, keyboards have also been developed that can be operated with one hand.
The thumb operates a track point and function keys on the back side, while the other
four fingers activate keys on the front side. Key combinations (chords) permit input
of all signs.
Complementary to such hardware, prediction, completion, and disambiguation
algorithms are used to curtail the needed number of inputs. Prediction is based on
statistics of “n-grams”, i.e. groups of n letters as they occur in sequence in words
[28]. Disambiguation of signs is based on letter-by-letter or word-level comparison
[3]. In either case reference is made to built-in dictionaries. Some applications are:
Predictive text
For keyboards where multiple letters are assigned to each key, predictive text re-
duces the number of key presses necessary to enter text. By restricting the available
words to those in a dictionary, each key needs to be pressed only once. This allows
e.g. the word “hello” to be typed with 5 key presses instead of 13.
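A sketch of this single-press scheme: each letter maps to its phone-key digit (so “hello” becomes the key sequence 43556), and a digit sequence is matched against a built-in dictionary. The dictionary and helper names are illustrative; note that several words can share one key sequence, which is exactly why disambiguation is needed.

```cpp
#include <string>
#include <vector>

// Digit of the standard phone keypad for a lowercase letter ('2'..'9').
char keyOf(char c) {
  static const std::string keys[8] = {"abc", "def", "ghi", "jkl",
                                      "mno", "pqrs", "tuv", "wxyz"};
  for (int k = 0; k < 8; ++k)
    if (keys[k].find(c) != std::string::npos)
      return static_cast<char>('2' + k);
  return '?';
}

// All dictionary words whose key sequence equals the typed digits.
std::vector<std::string> candidates(const std::string& digits,
                                    const std::vector<std::string>& dict) {
  std::vector<std::string> out;
  for (const auto& w : dict) {
    if (w.size() != digits.size()) continue;
    std::string code;
    for (char c : w) code += keyOf(c);
    if (code == digits) out.push_back(w);  // word matches the key sequence
  }
  return out;
}
```

For the ambiguous sequence 4663, for instance, a dictionary may yield “good”, “home”, and “gone”; the system then offers these predictions for selection.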
Auto completion
Auto-completion or auto-fill is a feature of text-entry fields that automatically
completes typed entries with the best guess of what the user may intend to enter, such
as pathnames, URLs, or long words, thus reducing the amount of typing necessary
to enter long strings of text. It uses techniques similar to predictive text.
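A minimal sketch of such a completion function, assuming the candidate list (e.g. previously entered URLs) is already ordered by preference; real implementations rank candidates by usage frequency. Names are our own.

```cpp
#include <string>
#include <vector>

// Return the first history entry starting with the typed prefix, or the
// prefix itself if no candidate matches.
std::string autoComplete(const std::string& prefix,
                         const std::vector<std::string>& history) {
  for (const auto& h : history)
    if (h.compare(0, prefix.size(), prefix) == 0) return h;
  return prefix;  // no guess: leave the input as typed
}
```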
Do what I mean
In do-what-I-mean functions an input string is compared with all dictionary en-
tries. An automatic correction is made if the two strings differ by one character, if one
character is inserted or deleted, or if two characters are transposed. Sometimes even
adjacent subwords are transposed (existsFile == fileExists).
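The check described above amounts to testing whether two strings are within Damerau-Levenshtein distance one of each other. A compact sketch follows (our own helper, not code from [3]); the subword transposition case (existsFile == fileExists) is not covered by this character-level test.

```cpp
#include <string>

// True if a and b differ by at most one substitution, one insertion or
// deletion, or one transposition of adjacent characters.
bool withinOneEdit(const std::string& a, const std::string& b) {
  if (a == b) return true;
  const std::string& s = a.size() <= b.size() ? a : b;  // shorter string
  const std::string& l = a.size() <= b.size() ? b : a;  // longer string
  if (l.size() - s.size() > 1) return false;
  std::size_t i = 0;
  while (i < s.size() && s[i] == l[i]) ++i;             // common prefix
  if (l.size() == s.size()) {
    if (s.substr(i + 1) == l.substr(i + 1)) return true;  // substitution
    return i + 1 < s.size() && s[i] == l[i + 1] && s[i + 1] == l[i]
           && s.substr(i + 2) == l.substr(i + 2);         // transposition
  }
  return s.substr(i) == l.substr(i + 1);  // single insertion/deletion
}
```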
Predictive scanning
For menu item selection, single-switch scanning can be used with a simple linear
scan, requiring only one switch activation per key selection. Predictive scanning
alters the scan such that keys are skipped when they do not correspond to any
letter that occurs in the database at the current point in the input key sequence.
This is a familiar feature for destination input in navigation systems.
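The key-skipping rule can be sketched as follows: given the input so far, only letters that continue some database entry are offered by the scanner, and all other keys are passed over. The dictionary entries and names are illustrative (think of city names in a navigation system).

```cpp
#include <set>
#include <string>
#include <vector>

// Letters that extend the current prefix to some dictionary entry; the
// scanner skips every key not in this set.
std::set<char> admissibleNextLetters(const std::string& prefix,
                                     const std::vector<std::string>& dict) {
  std::set<char> next;
  for (const auto& w : dict)
    if (w.size() > prefix.size() && w.compare(0, prefix.size(), prefix) == 0)
      next.insert(w[prefix.size()]);  // letter continuing this entry
  return next;
}
```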
8.4 Summary
Technically there are virtually no limits to the complexity of systems and appliances.
The only limits appear to be user performance and acceptance. Both threaten to en-
cumber innovation and to prevent success in the marketplace. In this chapter the con-
cept of assistance has been put forward as a remedy. This approach is based on the
hypothesis that man-machine system usability can be improved by providing user
assistance adapted to the actual context of use.
First the general concept of user assistance was described, with context
of use identification as the key issue. Subsequently this concept was elaborated for the
two main tasks facing the user in man-machine interaction, i.e. manual control and
dialogs.
Needs for assistance in manual control were identified. Then attributes of context
of use for manual control tasks were listed and specified for a car driving situation.
Also methods for context identification were discussed, which include parametric
and black-box approaches to driver modeling. Finally three methods of providing
support to a human controller were identified. These are information management,
control input management, and automation.
The second part of the chapter was devoted to assistance in man-machine dialogs.
First support needs in dialogs were identified. This was followed by a treatment of the
variables describing a dialog context, i.e. user state, dialog system state, and dialog
situation. Several methods for context identification were discussed with emphasis on
user intent recognition. Finally two methods of providing assistance in dialog tasks
were proposed. These are information management and data entry management.
In summary it was demonstrated how assistance based on context of use can
be a major asset in the design of man-machine systems. However, the potential of
this concept has yet to be fully exploited. Progress relies mainly on improved
methods for context of use identification.
References
1. Akyol, S., Libuda, L., and Kraiss, K.-F. Multimodale Benutzung adaptiver Kfz-
Bordsysteme. In Jürgensohn, T. and Timpe, K.-P., editors, Kraftfahrzeugführung, pages
137–154. Springer, 2001.
2. Albrecht, D. W., Zukerman, I., and Nicholson, A. Bayesian Models for Keyhole Plan
Recognition in an Adventure Game. User Modeling and User-Adapted Interaction, 8(1–
2):5–47, 1998.
3. Arnott, A. L. and Javed, M. Y. Probabilistic Character Disambiguation for Reduced
Keyboards Using Small Text Samples. Augmentative and Alternative Communication,
8(3):215–223, 1992.
4. Azuma, R. T. A Survey of Augmented Reality. Presence: Teleoperators and Virtual
Environments, 6(4):355–385, 1997.
5. Belz, S. M. A Simulator-Based Investigation of Visual, Auditory, and Mixed-Modality
Display of Vehicle Dynamic State Information to Commercial Motor Vehicle Operators.
Master’s thesis, Virginia Polytechnic Institute, Blacksburg, Virginia, USA, 1997.
6. Bien, Z. Z. and Stefanov, D. Advances in Rehabilitation Robotics. Springer, 2004.
7. Boff, K. R., Kaufmann, L., and Thomas, J. P. Handbook of Perception and Human Per-
formance. Wiley, 1986.
8. Breazeal, C. Recognition of Affective Communicative Intent in Robot-Directed Speech.
Autonomous Robots, 12(1):83–104, 2002.
9. Canberry, S. Techniques for Plan Recognition. User Modeling and User-Adapted Inter-
action, 11(1–2):31–48, 2001.
10. Charniak, E. and Goldmann, R. B. A Bayesian Model of Plan Recognition. Artificial
Intelligence, 64(1):53–79, 1992.
11. Dickmanns, E. D. The Development of Machine Vision for Road Vehicles in the last
Decade. In Proceedings of the IEEE Intelligent Vehicle Symposium, volume 1, pages
268–281. Versailles, June 17–21 2002.
12. Friedrich, W. ARVIKA Augmented Reality für Entwicklung, Produktion und Service. Pub-
licis Corporate Publishing, Erlangen, 2004.
13. Garrel, U., Otto, H.-J., and Onken, R. Adaptive Modeling of the Skill- and Rule-Based
Driver Behavior. In The Driver in the 21st Century, VDI-Berichte 1613, pages 239–261.
Berlin, 3.–4. Mai 2001.
14. Haykin, S. Adaptive Filter Theory. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition,
1991.
15. Haykin, S. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle
River, NJ, 2nd edition, 1999.
16. Hirzinger, G., Brunner, B., Dietrich, J., and Heindl, J. Sensor-Based Space Robotics -
ROTEX and its Telerobotic Features. IEEE Transactions on Robotics and Automation
(Special Issue on Space Robotics), 9(5):649–663, October 1993.
17. Hofmann, M. Intentionsbasierte maschinelle Interpretation von Benutzeraktionen. Dis-
sertation, Technische Universität München, 2003.
18. Hofmann, M. and Lang, M. User Appropriate Plan Recognition for Adaptive Interfaces.
In Smith, M. J., editor, Usability Evaluation and Interface Design: Cognitive Engineering,
Intelligent Agents and Virtual Reality. Proceedings of the 9th International Conference
on Human-Computer Interaction, volume 1, pages 1130–1134. New Orleans, August
5–10 2001.
19. Jackson, C. A Method for the Direct Measurement of Crossover Model Parameters. IEEE
MMS, 10(1):27–33, March 1969.
20. Jameson, A. Numerical Uncertainty Management in User and Student Modeling: An
Overview of Systems and Issues. User Modeling and User-Adapted Interaction. The
Journal of Personalization Research, 5(4):193–251, 1995.
21. Ji, Q. and Yang, X. Real-Time Eye, Gaze, and Face Pose Tracking for Monitoring Driver
Vigilance. Real-Time Imaging, 8(5):357–377, 2002.
22. Knoll, P. The Night Sensitive Vehicle. In VDI-Report 1768, pages 247–256. VDI, 2003.
23. Kragic, D. and Christensen, H. F. Robust Visual Servoing. The International Journal of
Robotics Research, 22(10–11):923 ff, 2003.
24. Kraiss, K.-F. Ergonomie, chapter Mensch-Maschine Dialog, pages 446–458. Carl Hanser,
München, 3rd edition, 1985.
25. Kraiss, K.-F. Fahrzeug- und Prozessführung: Kognitives Verhalten des Menschen und
Entscheidungshilfen. Springer Verlag, Berlin, 1985.
26. Kraiss, K.-F. Implementation of User-Adaptive Assistants with Neural Operator Models.
Control Engineering Practice, 3(2):249–256, 1995.
27. Kraiss, K.-F. and Hamacher, N. Concepts of User Centered Automation. Aerospace
Science Technology, 5(8):505–510, 2001.
28. Kushler, C. AAC Using a Reduced Keyboard. In Proceedings of the CSUN California
State University Conference, Technology and Persons with Disabilities Conference. Los
Angeles, March 1998.
29. Larimore, M. G., Johnson, C. R., and Treichler, J. R. Theory and Design of Adaptive
Filters. Prentice Hall, 1st edition, 2001.
30. Mann, M. and Popken, M. Auslegung einer fahreroptimierten Mensch-Maschine-
Schnittstelle am Beispiel eines Querführungsassistenten. In GZVB, editor, 5. Braun-
schweiger Symposium ”Automatisierungs- und Assistenzsysteme für Transportmittel”,
pages 82–108. 17.–18. Februar 2004.
31. Matsikis, A., Zoumpoulidis, T., Broicher, F., and Kraiss, K.-F. Learning Object-Specific
Vision-Based Manipulation in Virtual Environments. In IEEE Proceedings of the 11th
International Workshop on Robot and Human Interactive Communication ROMAN 2002,
pages 204–210. Berlin, September 25–27 2002.
32. Neumerkel, D., Rammelt, P., Reichardt, D., Stolzmann, W., and Vogler, A. Fahrermodelle
- Ein Schlüssel für unfallfreies Fahren? Künstliche Intelligenz, 3:34–36, 2002.
33. Nickerson, R. S. On Conversational Interaction with Computers. In Treu, S., editor,
Proceedings of the ACM/SIGGRAPH Workshop on User-Oriented Design of Interactive
Graphics Systems, pages 101–113. Pittsburgh, PA, October 14–15 1976.
34. Pao, Y. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, 1989.
35. Reintsema, D., Preusche, C., Ortmaier, T., and Hirzinger, G. Towards High Fidelity Telep-
resence in Space and Surgery Robotics. Presence: Teleoperators and Virtual Environ-
ments, 13(1):77–98, 2004.
36. Schmidt, A. and Gellersen, H. W. Nutzung von Kontext in ubiquitären Informationssyste-
men. it+ti - Informationstechnik und technische Informatik, Sonderheft: Ubiquitous
Computing - der allgegenwärtige Computer, 43(2):83–90, 2001.
37. Schraut, M. Umgebungserfassung auf Basis lernender digitaler Karten zur vorauss-
chauenden Konditionierung von Fahrerassistenzsystemen. Ph.D. thesis, Technische Uni-
versität München, 2000.
38. Shneiderman, B. and Plaisant, C. Designing the User Interface. Strategies for Effective
Human-Computer Interaction. Addison Wesley, 4th edition, 2004.
39. Specht, D. F. A General Regression Neural Network. IEEE Transactions on Neural
Networks, 2(6):568–576, 1991.
40. Velger, M., Grunwald, A., and Merhav, S. J. Adaptive Filtering of Biodynamic Stick
Feedthrough in Manipulation Tasks on Board Moving Platforms. AIAA Journal of Guid-
ance, Control, and Dynamics, 11(2):153–158, 1988.
41. Wahlster, W., Reithinger, N., and Blocher, A. SMARTKOM: Multimodal Communication
with a Life-Like Character. In Proceedings of the 7th European Conference on Speech
Communication and Technology, volume 3, pages 1547–1550. Aalborg, Denmark, Sep-
tember 3–7 2001.
42. Wickens, C. D. Engineering Psychology and Human Performance. Merrill, Columbus,
1984.
Appendix A
LTI-Lib —
a C++ Open Source Computer Vision Library
Peter Dörfler and José Pablo Alvarado Moya
The LTI-Lib is an open source software library that contains a large collection
of algorithms from the field of computer vision. Its roots lie in the need for more
cooperation between different research groups at the Chair of Technical Computer
Science at RWTH Aachen University, Germany. The name of the library stems from
the German name of the Chair: Lehrstuhl für Technische Informatik.
This chapter gives a short overview of the LTI-Lib that provides the reader with
sufficient background knowledge for understanding the examples throughout this
book that use this library. It can also serve as a starting point for further exploration
and use of the library. The reader is expected to have some experience in program-
ming with C++.
The main idea of the LTI-Lib is to facilitate research in different areas of com-
puter vision. This leads to the following design goals:
• Cross platform:
It should at least be possible to use the library on two main platforms: Linux and
Microsoft Windows© .
• Modularity:
To enhance code reusability and to facilitate exchanging small parts in a chain of
operations the library should be as modular as possible.
• Standardized interface:
All classes should have the same or very similar interfaces to increase interoper-
ability and flatten the user’s learning curve.
• Fast code:
Although not the main concern, the implementations should be efficient enough
to allow the use of the library in prototypes working in real time.
The LTI-Lib is licensed under the GNU Lesser General Public License (LGPL)1
to encourage collaboration with other researchers and allow its use in commercial
products. This has already led to many enhancements of the existing code base and
also to some contributions from outside the Chair of Technical Computer Science.
Judging from the number of downloads, the user base can be estimated at a few
thousand researchers.
This library differs from other open source software projects in its design: it
follows an object oriented paradigm that allows hiding the configuration of the al-
gorithms, making it easier for inexperienced users to start working with them. At
the same time it allows more experienced researchers to take complete control of
the software modules as well. The LTI-Lib also exploits the advantages of generic
programming concepts for the implementation of container and functional classes,
seeking a compromise between “pure” generic approaches and a limited size for
the final code.
This chapter only gives a short introduction to the fundamental concepts of the
LTI-Lib. We focus on the functionality needed for working with images and image
processing methods. For more detailed information please refer to one of the
following resources:
• project webpage: http://ltilib.sourceforge.net
• online manual: http://ltilib.sourceforge.net/doc/html/index.shtml
• wiki: http://ltilib.pdoerfler.com/wiki
• mailing list: ltilib-users@lists.sourceforge.net
The LTI-Lib is the result of the collaboration of many people. As the current
administrators of the project we would like to thank Suat Akyol, Ulrich Canzler,
Claudia Gönner, Lars Libuda, Jochen Wickel, and Jörg Zieren for the parts they
play(ed) in the design and maintenance of the library additionally to their coding.
We would also like to thank the numerous people who contributed their time and the
resulting code to make the LTI-Lib such a large collection of algorithms:
Daniel Beier, Axel Berner, Florian Bley, Thorsten Dick, Thomas Erger, Hel-
muth Euler, Holger Fillbrandt, Dorothee Finck, Birgit Gehrke, Peter Gerber, Ingo
Grothues, Xin Gu, Michael Haehnel, Arnd Hannemann, Christian Harte, Bastian
Ibach, Torsten Kaemper, Thomas Krueger, Frederik Lange, Henning Luepschen, Pe-
ter Mathes, Alexandros Matsikis, Ralf Miunske, Bernd Mussmann, Jens Pausten-
bach, Norman Pfeil, Vlad Popovici, Gustavo Quiros, Markus Radermacher, Jens Ri-
etzschel, Daniel Ruijters, Thomas Rusert, Volker Schmirgel, Stefan Syberichs, Guy
Wafo Moudhe, Ruediger Weiler, Benjamin Winkler, Xinghan Yu, Marius Wolf
1 http://www.gnu.org/licenses/lgpl.html

The LTI-Lib does not in general require additional software to be usable. However,
it becomes more useful if some complementary libraries are utilized. These are in-
cluded in most distributions of Linux and are easily installed. For Windows systems
the most important libraries are included in the installer found on the CD.
The scripting language Perl should be available on your computer for maintenance
and to build some commonly used scripts. It is included on the CD for Windows, and
is usually installed on Linux systems. The GIMP Toolkit (GTK)2 should be installed
as it is needed for almost all visualization tools. To read and write images using the
Portable Network Graphics (PNG) and the Joint Photographic Experts Group (JPEG)
image formats you either need the relevant files from the ltilib-extras package, which
are adaptations of the Colosseum Builders C++ Image Library, or you have to pro-
vide access to the freely available libraries libjpeg and libpng. Additionally, the data
compression library zlib is useful for some I/O functions3.
The LTI-Lib provides interfaces to some functions of the well-known LAPACK
(Linear Algebra PACKage)4 for optimized algorithms. On Linux systems
you need to install lapack, blas, and the Fortran-to-C translation tool f2c (includ-
ing the corresponding development packages, which in distributions like SuSE or
Debian are always denoted with an additional “-dev” or “-devel” postfix). The
installation of LAPACK on Windows systems is a bit more complicated. Refer to
http://ltilib.pdoerfler.com/wiki/Lapack for instructions.
The installation of the LTI-Lib itself is quite straightforward on both systems.
On Windows execute the installer program (ltilib/setup-ltilib-1.9.15-activeperl.exe).
This starts a guided graphical setup program where you can choose among some of
the packages mentioned above. If unsure we recommend installing all available files.
Installation on Linux follows the standard procedure for configuration and instal-
lation of source code packages established by GNU as a de facto standard:
cd $HOME
cp /media/cdrom/ltilib/*.gz .
tar -xzvf 051124_ltilib_1.9.15.tar.gz
cd ltilib
tar -xzvf ../051124_ltilib-extras_1.9.15.tar.gz
cd linux
make -f Makefile.cvs
./configure
make
Here it has been assumed that you want the sources in a subdirectory called
ltilib within your home directory, and that the CD has been mounted into
/media/cdrom. The next line unpacks the tarballs found on the CD in the direc-
tory ltilib. In the file name, the first number of six digits specifies the release date
of the library, where the first two digits relate to the year, the next two indicate the
month and the last two the day. You can check for newer versions on the library home-
page. The sources in the ltilib-extras package should be unpacked within the
directory ltilib created by the first tarball. The next line is optional, and creates all
configuration scripts. For it to work the auto-configure packages have to be installed.
You can however simply use the provided configure script as indicated above.

2 http://www.gtk.org
3 http://www.ijg.org, http://www.libpng.org, and http://www.zlib.net, respectively
4 http://netlib.org/lapack/
Additional information can be obtained from the file ltilib/linux/README.
You can of course customize your library via additional options of the configure
script (call ./configure --help for a list of all available options). To install the
libraries you will need write privileges on the --prefix directory (which defaults
to /usr/local). Switch to a user with the required privileges (for instance su
root) and execute make install. If you have doxygen and graphviz installed
on your machine, and you want to have local access to the API HTML documenta-
tion, you can generate it with make doxydoc. The documentation to the release
1.9.15 which comes with this book is already present on the CD.
A.2 Overview
Following the design goals, data structures and algorithms are mostly separated in
the LTI-Lib. The former can be divided into small types such as points and pix-
els and more complex types ranging from (algebraic) vectors and matrices to trees
and graphs. Data structures are discussed in section A.3. Algorithms and other func-
tionality are encapsulated in separate classes that can have parameters and a state and
manipulate the data given to them in a standardized way. These are the subject of sec-
tion A.4. Before dealing with those themes, we cover some general design concepts
used in the library.
Most classes used for data structures and functional objects support several mecha-
nisms for duplication and replication. The copy constructor is often used when ini-
tializing a new object with the contents of another one. The copy() method and
the operator= are equivalent and take over all data contained in another instance;
as a matter of style, the former is preferred within the library, since it makes it
immediately visible that the object belongs to an LTI-Lib class rather than being an
elementary data type (like integers, floating point values, etc.). As LTI-Lib instances
may represent or contain images or other relatively large data structures, it is
important to keep in mind that such a replication may take its time; it is therefore
useful to indicate explicitly where these expensive copies are made.
In class hierarchies it is often required to replicate instances of classes to which
you only have a pointer of a common parent class type. This is the case, for instance,
when implementing the factory design pattern. For these situations the LTI-Lib
provides a virtual clone() method.
A.2.2 Serialization
The data of almost all classes of the LTI-Lib can be written to and read from
a stream. For this purpose an ioHandler is used that formats and parses the
stream in a specific way. At this time, two such ioHandlers exist. The first one,
lispStreamHandler, writes the data of the class in a Lisp-like fashion, i. e. each
attribute is stored as a list enclosed in parentheses. This format is human readable
and can be edited manually. However, it is not useful for large data sets due to its
size and expensive parsing. For such applications the second ioHandler,
binaryStreamHandler, is better suited.
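As a rough illustration of the human-readable format, the following sketch writes a single named attribute as a parenthesized list; the exact output of lispStreamHandler differs in detail:

```cpp
#include <sstream>
#include <string>

// Writes one named attribute as a parenthesized list, roughly in the style
// produced by lispStreamHandler (illustrative helper, not LTI-Lib code).
std::string writeAttribute(const std::string& name, double value) {
  std::ostringstream os;
  os << "(" << name << " " << value << ")";
  return os.str();
}
```

A serialized parameter such as a threshold of 0.5 would then appear as the list `(threshold 0.5)`.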
All elementary types of C++ and some more complex types like the STL5
std::string, std::vector, std::map and std::list can be read from and
written to LTI-Lib streams using a set of global functions called read()
and write(). Global functions also make it possible to read and write almost all
LTI-Lib types from and to the stream handlers, although, as a matter of style, if a class
defines its own read() and write() methods they should be preferred over the global
ones to indicate that the given object belongs to the LTI-Lib.
The example in Fig. A.1 shows the principles of copying and serialization on an
LTI-Lib container vector<double> and an STL list container of strings.
A.2.3 Encapsulation
All classes of the LTI-Lib reside in the namespace lti to avoid ambiguity where
other libraries use the same class names. This allows, for instance, the class
name vector to be used even though the Standard Template Library already defines
this type. For the sake of readability the examples in this chapter omit the explicit
scoping of the namespace (i. e., the prefix lti:: before all class names is omitted).
Within the library, types that concern just one specific class are always defined
within that class. To use those types you have to indicate the class scope
too (for example lti::filter::parameters::borderType). Even if this
is cumbersome to type, it is definitely easier to maintain, as the place of declaration
and the context are directly visible.
A.2.4 Examples
If you want to try out the examples in this chapter, the easiest way is to copy them
directly into the definition of the operator() of the class tester found in the
file ltilib/tester/ltiTester.cpp. The necessary header files to be
included are listed at the beginning of each example. Standard header files (e.g. cmath,
cstdlib, etc.) are omitted for brevity.
5 The Standard Template Library (STL) is included with most compilers. It contains common container classes and algorithms.
LTI-Lib — A C++ Open Source Computer Vision Library
#include "ltiVector.h"
#include "ltiLispStreamHandler.h"
#include <list>

vector<double> v2(5,1.0);  // a vector with five elements, all set to 1.0
vector<double> v3;         // an empty vector
v3=v2;                     // v3 is equal to v2
v3.copy(v2);               // same as above, the style preferred in the library
When the size of the numeric representation is not important, the default
C/C++ types are used. The LTI-Lib follows a policy of avoiding unsigned integer types
unless the whole representation range is required (for example,
ubyte when values from 0 to 255 are to be used). This extends to the
values used to represent the sizes of container classes, which are usually of type
int, as opposed to the unsigned int of other class libraries like the STL.
Most math and classification modules employ double precision floating point
(double) types to allow better conditioning of the mathematical computations
involved. Single precision numbers (float) are used where memory requirements
are critical, for instance in grey valued images that require a more flexible numerical
representation than an 8-bit integer.
More involved data structures in the LTI-Lib (for instance, matrix and vector
containers) use generic programming concepts and are therefore implemented as
C++ template types. However, these are usually explicitly instantiated to reduce the
size of the final library code. Thus, you can only choose between the explicitly
instantiated types. Note that many modules are restricted to those types that are useful
for the algorithm they implement, i. e. linear algebra functors only work on float and
double data structures. This is a design decision in the LTI-Lib architecture, which
might seem rigid in some cases, but simplifies the maintenance and optimization of
code for the most frequently used cases. At this point we want to stress the difference
between the LTI-Lib and other “pure” generic programming oriented libraries. For
the “generic” types, like genericVector or genericMatrix, you can also create
explicit instantiations for your own types, for which you need to include
the header files with the template implementations, usually denoted with the postfix
template in the file name (for example, ltiGenericVector_template.h).
The next sections introduce the most important object types for image processing.
Many other data structures exist in the library. Please refer to the online documenta-
tion for help.
The template classes tpoint<T> and tpoint3D<T> define two- and three-dimensional
points with attributes x, y, and, if applicable, z, all of the template type T. For
T=int the type aliases point and point3D are available. Especially point is
often used to hold the coordinates of a pixel in an image.
Pixels are defined similarly: trgbPixel<T> has three attributes red, green,
and blue, which should be accessed via member functions like getRed().
However, the pixel class used for color images is rgbPixel, which is a 32-bit data
structure. It holds the three color values and an extra alpha value, all of type
ubyte. There are several reasons for a four-byte pixel representation of a typically
24-bit value. First, on 32-bit or 64-bit processor architectures this ensures the
memory alignment of the pixels with the processor’s word length, which improves
the performance of memory access. Second, it improves efficiency when reading
images from many 32-bit based frame grabbers. Third, some image formats store the
24-bit pixel information together with an alpha channel, resulting in 32-bit pixels.
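The padding argument can be illustrated with a plain C++ struct; pixel32 below is a hypothetical stand-in for the actual rgbPixel definition:

```cpp
#include <cstdint>

// Illustrative 32-bit pixel layout (the actual lti::rgbPixel differs in
// detail): three 8-bit color values plus an 8-bit alpha value pad the pixel
// to exactly one 32-bit word, so pixel arrays stay word-aligned in memory.
struct pixel32 {
  std::uint8_t red;
  std::uint8_t green;
  std::uint8_t blue;
  std::uint8_t alpha;
};

static_assert(sizeof(pixel32) == 4, "pixel occupies exactly one 32-bit word");
```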
All of these basic types support the fundamental arithmetic functions. The exam-
ple in Fig. A.2 shows the idea.
#include "ltiPoint.h"
#include "ltiRGBPixel.h"
// create two points and add the first one to the second
tpoint<float> p1(2.f,3.f); // create a 2D point
tpoint<float> p2(3.f,2.f); // and another one
p2.add(p1); // add both points and leave
// the result in p2
std::cout << p2 << std::endl; // show the result
Fig. A.2. This small code example shows how to create instances of some basic types and how
to do arithmetic operations on them.
Probably the most important types in the LTI-Lib are vector and matrix. Unlike
the well-known STL classes, these are vectors and matrices in the mathematical
sense and provide the appropriate arithmetic functions. Since LTI-Lib vectors and
matrices have been designed as efficient data structures for image processing
applications, they also differ from their STL counterparts in the memory management
approach.
STL vectors, for instance, grow dynamically by allocating a larger block and
copying the old contents via the elements’ copy constructors. This is very useful
when the final size of a vector is not known in advance, but it is not very efficient.
On the other hand, since most image processing algorithms require very efficient
access to the data in memory, the LTI-Lib vector and matrix classes
optimize the mechanisms to share, transfer, copy and access the memory among
different instances, but are less efficient in allocating and deallocating memory. For
example, matrices are organized row-wise and each row can be accessed as a vector.
To avoid copying the data of the matrix into a vector when the information is
required in that form, a pool of row vectors that share the data with the matrix is
stored within the matrix. The matrix memory is also a single adjacent and compact
memory block that can therefore be easily traversed. Iteration over the elements
of matrices or vectors has been optimized to allow boundary checks during debugging,
but to be as fast as C pointers when the code is finally released.
Additionally, complete memory transfers and exchange of data among matrix and
vector instances are allowed, which usually saves valuable time, since unnecessary
copying is avoided.
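The row-sharing idea can be sketched in plain C++; simpleMatrix below is an illustrative stand-in, not the actual lti::matrix implementation:

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: a matrix stored as one contiguous block whose rows can
// be handed out as views without copying (lti::matrix keeps an analogous pool
// of row vectors that share the matrix memory).
class simpleMatrix {
public:
  simpleMatrix(int rows, int cols)
    : cols_(cols), data_(static_cast<std::size_t>(rows) * cols, 0.0) {}

  // a row "view": a pointer into the shared block, no copying involved
  double* row(int r) { return &data_[static_cast<std::size_t>(r) * cols_]; }

  double& at(int r, int c) {
    return data_[static_cast<std::size_t>(r) * cols_ + c];
  }

private:
  int cols_;
  std::vector<double> data_;   // single adjacent memory block
};
```

Writing through a row view and reading through at() touch the same memory, which is exactly why no copy is needed when a row is required as a vector.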
We will not go into deeper detail of the memory management in the LTI-Lib
containers here, but rather point out some of the main container features. Figure A.3
shows a small example of vector-matrix multiplication.
#include "ltiVector.h"
#include "ltiMatrix.h"
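The arithmetic behind such a vector-matrix multiplication can be sketched in plain C++ (within the library itself, lti::matrix offers its own arithmetic members for this purpose):

```cpp
#include <cstddef>
#include <vector>

// Plain C++ sketch of the vector-matrix product illustrated in Fig. A.3:
// each result entry is the dot product of one matrix row with the vector.
std::vector<double> multiply(const std::vector<std::vector<double>>& m,
                             const std::vector<double>& v) {
  std::vector<double> result(m.size(), 0.0);
  for (std::size_t r = 0; r < m.size(); ++r)
    for (std::size_t c = 0; c < v.size(); ++c)
      result[r] += m[r][c] * v[c];
  return result;
}
```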
The LTI-Lib provides four container classes for images: image represents color
images, channel and channel8 are utilized as grey value images, and channel32
is mostly used to hold so-called label masks. Using the template functionality of
matrix we define the image types as matrices of rgbPixel, float, ubyte,
and int32, respectively.
The two grey valued image types have quite different objectives: channel8
uses 8 bits per pixel, which is a common value for monochrome cameras. Algorithms
that work directly on these are usually very fast due, on the one hand, to the single-
byte fixed point arithmetic and, on the other hand, to the possibility of using look-up
tables to speed up computations, as only 256 values are available. However, many
image processing methods return results with a wider range of values or need higher
precision. For these, the channel is more appropriate, as its pixels employ single
precision floating point numbers.
For a regular monochrome channel the pixel values lie between 0.0 and 1.0.
Quite obviously a channel can also have values outside this normalized interval.
This property is often necessary, for example, when the magnitude of
an image gradient or the output of a high-pass filter is computed and stored in a
channel. The image class represents a 24(32)-bit RGB(A) color image, i. e. 24
bits are used for the three colors and an additional byte for the alpha channel.
Color images are often split into three channels for processing (cf. A.5.3).
The 32-bit fixed point pixel representation found in channel32 objects is
mostly employed to store pixel classification results, where each integer number
denotes a specific class and the total number of pixel classes may exceed the capacity
of a channel8. This is the case, for instance, in image segmentation results or in
multiple object recognition systems.
The advantage of using matrix for image types is twofold: the individual pixels
can easily be accessed, and the computational power of matrix is available for images
as well. The only additional functions in the image types are conversion functions
from other image types via castFrom(), which maintain a certain semantic meaning
in the conversion: for example, the range of 0 to 255 of a channel8 is linearly
mapped onto the interval 0.0 to 1.0 for a channel. The example in Fig. A.5 shows
how easy it is to transform grey values to the correct interval between
0 and 1 using matrix functions.
#include "ltiImage.h"
The channel ch could be the result of some previous processing step. First the
minimum and maximum values are sought. From these the required stretching factor
and corresponding offset are calculated, and each element of the image matrix is
transformed accordingly. Note that operations like the above can be achieved more
efficiently by defining a function and then using the apply() member function
supplied by matrix. For this particular case channel also offers the member
function mapLinear().
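The described stretch can be sketched in plain C++; stretchToUnitRange is an illustrative helper operating on a flat pixel buffer, not an LTI-Lib function:

```cpp
#include <algorithm>
#include <vector>

// Plain C++ sketch of the grey-value stretch described for Fig. A.5:
// map the smallest pixel value to 0.0 and the largest to 1.0.
void stretchToUnitRange(std::vector<float>& pixels) {
  if (pixels.empty()) return;
  const auto mm = std::minmax_element(pixels.begin(), pixels.end());
  const float offset = *mm.first;           // value mapped to 0.0
  const float range = *mm.second - offset;
  const float scale = (range > 0.0f) ? 1.0f / range : 0.0f;
  for (float& p : pixels) p = (p - offset) * scale;
}
```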
Fig. A.6. A functor has three parts: functionality, parameters, and state. The diagram shows
their interaction
A.4.1 Functor
A functor consists of three main parts: functionality, parameters, and state. The
functionality of the functor is accessed via the apply() methods. In many cases there
are on-copy and on-place versions of these methods, so the user can choose whether
the output of the algorithm should overwrite the data in the container objects used
as input, or whether the results should be stored in container objects passed specially
for this purpose as arguments of the apply() methods. The functionality can be
modified via the parameters of the functor. Finally, some functors require a state, for
example, to keep track of previous input data. Figure A.6 shows the three parts
of a functor and their interaction.
The parameters are implemented as an inner class of each functor. This allows the
type name parameters to be used consistently to designate the configuration object
in each functor, while the enclosing class name clearly indicates its scope. To ensure
consistency along lines of inheritance, the parameters class of a derived functor
is derived from the parameters class of its base class. The current parameters of a
functor can be changed via the function setParameters(). A read-only reference
to the current parameters instance is returned by getParameters(). The
parameters are never changed by the functor itself; only the user or another functor
may set parameters. This is in contrast to the state of a functor, which is manipulated
exclusively by the functor itself.
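This division of labor can be sketched in plain C++; scaleFunctor below is a hypothetical minimal functor, not an LTI-Lib class:

```cpp
// Illustrative sketch of the functor layout: an inner parameters class set
// only from outside, a state manipulated only by the functor itself, and
// apply() carrying the functionality.
class scaleFunctor {
public:
  class parameters {
  public:
    double factor = 1.0;        // configuration, set by the user
  };

  void setParameters(const parameters& p) { params_ = p; }
  const parameters& getParameters() const { return params_; }

  // on-copy apply(): the source is left untouched, the result goes to dest
  bool apply(double src, double& dest) {
    dest = src * params_.factor;
    ++callCount_;               // state, updated only by the functor itself
    return true;
  }

private:
  parameters params_;
  int callCount_ = 0;           // internal state
};
```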
The parameters as well as the state of the functor are serializable through the
members read() and write(). Note that when saving the functor the parameters
are saved as well.
Setting parameters is one of the most important tasks when using the LTI-Lib.
Although most functors have useful default parameters, these quite frequently need
to be adapted for special tasks. In Fig. A.7 we take a look at configuring our
implementation of the Harris corner detector [1]. As can be seen in the documentation,
it uses two other functors internally, localMaxima and gradientFunctor.
To keep the code maintainable, the parameters of harrisCorners simply contain
instances of the parameter classes of those two functors.
#include "ltiImage.h"
#include "ltiHarrisCorners.h"

channel src;              // source channel, assumed to contain an image
harrisCorners harris;     // the corner detection functor

pointList corners;        // list of lti::point
channel cornerness;
float maxCornerness;
harris.apply(src,cornerness,maxCornerness,corners);
A.4.2 Classifier
Classifiers are very similar to functors. The main difference is that instead of
apply() they have train() and classify() member functions. All classi-
fiers have a state which stores the results of the training and which is later used for
classification. The input data is strictly limited to dvector (or sequences thereof).
The LTI-Lib discerns three general types of classifiers: supervised and unsupervised
instance classifiers, and supervised sequence classifiers. Hidden Markov Models
(cf. chapter 2.1.3.6) are currently the only representatives of the last category.
Supervised instance classifiers are trained with pairs of dvector and an integer,
i. e. feature vector and label. Unsupervised instance classifiers only receive feature
vectors as input and try to find natural groups within the data. This group contains
clustering algorithms as well as connectionist algorithms such as Self Organizing
Feature Maps. When used for classification, an unsupervised classifier tries to give a
measure of how likely the given feature vector would have belonged to each cluster
had it been in the original data set.
The classifiers are not much used within the context of this book. The interested
reader is referred to the CD and online documentation for more information about
classifiers.
A.5.1 Convolution
#include "ltiImage.h"
#include "ltiLinearKernels.h"
#include "ltiGaussKernels.h"
#include "ltiConvolution.h"

// a convolution functor
convolution conv;
// completion sketch: configure a Gaussian kernel and filter a channel
gaussKernel2D<float> kernel(5);      // kernel size 5 chosen arbitrarily here
convolution::parameters par;
par.setKernel(kernel);
conv.setParameters(par);
channel src,dest;
conv.apply(src,dest);                // filter src into dest
A.5.2 Image IO
The LTI-Lib can read and write three common image formats: BMP, JPEG, and
PNG. While the former always uses an implementation which is part of the library,
the latter two use the standard libjpeg and libpng if they are available. These
libraries are usually present on Unix/Linux systems; for Windows systems source
code is available to build them. If the libraries are not available, an alternative
implementation based on the Colosseum Builders C++ Image Library can be found in
the extras package of the LTI-Lib.
Unless the LTI-Lib is used to write a specific application, it is most practical
to be able to load all common image types. The file ltiALLFunctor.h contains
two classes, loadImage and saveImage, that load or save an image depending on
its file extension. The filename can be set in the parameters or given directly to the
convenience functions load() and save(), respectively. The small example in
Fig. A.9 shows how a bitmap is loaded and then saved as a portable network graphic.
#include "ltiImage.h"
#include "ltiALLFunctor.h"
loadImage loader;
saveImage saver;
image img;
loader.load("/home/homer/bart.bmp",img);
saver.save("/home/homer/bart.png",img);
Fig. A.9. In this example a bitmap is loaded from a file and saved as a PNG.
Very often the task is not to save single images to disk but to process a large set
of images. For this purpose the LTI-Lib has the loadImageList functor. It can
be used to sequentially fetch all images in a directory, listed in a file, or supplied in
a std::vector of strings. In any case the filenames are sorted alphabetically to
ensure consistency across different systems. While the method hasNext() returns
true, more images are available; the next image is loaded by calling apply().
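The iteration protocol can be sketched in plain C++; nameList is an illustrative helper that only handles the sorted names, whereas the real functor loads the corresponding image in apply():

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sketch of the loadImageList iteration protocol: filenames are sorted
// alphabetically first, then consumed one by one via hasNext()/next().
class nameList {
public:
  explicit nameList(std::vector<std::string> names) : names_(std::move(names)) {
    std::sort(names_.begin(), names_.end());  // same order on every system
  }
  bool hasNext() const { return pos_ < names_.size(); }
  const std::string& next() { return names_[pos_++]; }

private:
  std::vector<std::string> names_;
  std::size_t pos_ = 0;
};
```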
It is also possible to load grey valued and indexed images with the image I/O
functors. If the loaded file actually contains grey value information, it can be loaded
into a channel; for a 24-bit file the intensity of the image is left in the channel.
Indexed images can be loaded into a channel8 plus a palette which stores up to 256
colors. If the file contains a 24-bit image, the colors are quantized to at most 256 colors.
Many image processing algorithms that work on color images first split them into the
different components of a particular color space (the so-called color channels) and
then work on these separately. The most common color space is RGB (Red, Green,
Blue), which is also the native color space for LTI-Lib images, as it is frequently
used when dealing with industrial cameras and image file formats. Particular image
types for additional color spaces are not provided, nor special pixel types, as
the number of available color spaces would have made the maintenance of all
LTI-Lib functors very difficult, due mainly to the typical singularities found in different
spaces (for example, the case of zero intensity in hue-saturation spaces). Therefore,
in this library just one color format is used, while all other cases have to be managed
as a collection of separate channel objects.
To extract the three color components of an image in a particular color space,
the functor splitImageToXXX is used, with XXX the desired color space
abbreviation. The functors mergeXXXToImage form an RGB image from three
separate components in the XXX color space. The code in Fig. A.10 shows how color
channels can be swapped to create a strangely colored image.
#include "ltiImage.h"
#include "ltiSplitImageToRGB.h"
#include "ltiMergeRGBToImage.h"

image src;                    // assumed to contain a color image
channel r,g,b;                // one channel per color component

splitImageToRGB split;
mergeRGBToImage merge;

split.apply(src,r,g,b);
merge.apply(g,b,r,src);
Fig. A.10. The image src is first split into the three RGB channels. Then an image is recon-
structed with swapped colors.
#include "ltiImage.h"
// a color image
image img;
channel ch;
#include "ltiImage.h"
#include "ltiMergeHSIToImage.h"
6 The Gimp Tool Kit is a multi-platform GUI toolkit, licensed under the LGPL. http://www.gtk.org
#include "ltiImage.h"
#include "ltiDraw.h"

image img(31,31);               // canvas to draw on (assumed setup; the
draw<rgbPixel> painter;         // figure's original setup lines are lost)
painter.use(img);               // draw into img from now on

painter.setColor(Blue);
point p1(25,5),p2(25,25),p3(5,25);
painter.line(p1,p2);
painter.setColor(Green);
painter.lineTo(p3);
point pm;
pm.add(p1,p3).divide(2);
painter.setColor(Magenta);
painter.arc(pm,p1,p3);
#include "ltiImage.h"
#include "ltiViewer.h"
#include "ltiTimer.h"
Fig. A.15. Using the viewer configuration dialog. (c) By using the button Scale Min-Max the
values in the cornerness (a) are scaled to completely fill the range from 0 to 1.
via some functors in the LTI-Lib that provide interfaces to some popular cameras
and framegrabbers. Furthermore, you will find functors for programming with multiple
threads, for control of serial ports, and for system performance evaluation. Since
classification and processing tasks require a plethora of mathematical functions, the
LTI-Lib contains many classes for statistics and linear algebra. We encourage you to
browse the HTML documentation to get an idea of the great possibilities that this
library provides.
#include "ltiImage.h"
#include "ltiViewer.h"
#include "ltiTimer.h"
image img;
img.castFrom(ch);
draw<rgbPixel> painter;
painter.use(img);
painter.marker(corners,7,"rx");
viewer vCornerness("Cornerness");
viewer vCorners("Corners");
vCornerness.show(cornerness);
vCorners.show(img);
vCorners.waitButtonPressed();
vCorners.hide();
vCornerness.hide();
passiveWait(10000);
gtkServer::shutdown();
Fig. A.16. The first viewer just displays the cornerness. For the second viewer the grey scale
image is first transformed to an image. Then the corners can be marked as red crosses.
References
1. Harris, C. and Stephens, M. A Combined Corner and Edge Detector. In Proc. 4th Alvey
Vision Conference, pages 147–151. 1988.
Appendix B
During development of complex software systems like new human-machine
interfaces, some tasks always recur: First, the whole system has to be separated into
modules which handle subtasks and communicate with each other. Afterwards,
algorithms for each module have to be developed, where each algorithm is described by
its input data, output data, and parameters. After implementation the algorithms have
to be optimized. Optimization includes the determination of the optimal parameter
settings for each algorithm in the context of the application. This is normally done by
comparing an algorithm’s output data for different parameter settings applied to the
same input data. Therefore, the output data has to be visualized in an appropriate way.
The whole process is very time consuming if everything has to be coded manually,
especially the visualization of more complex data types like images or 3D data.
But the process can be sped up if a framework is available which supports
the development process. Such frameworks are called Rapid Prototyping
Systems. Ideally, such a system has to meet the following requirements:
• Input data from different sources has to be supported, e. g. image data from still
images, video streams or cameras in case of image processing.
• It has to define a standardized but flexible and easy-to-use interface to integrate
any algorithm, with any input data, output data, and parameters, as a reusable
component in the framework.
• Processing chains consisting of several algorithms have to be composed and
changed easily.
• It must be possible to change the parameter settings of each algorithm at run-
time.
• Output data of each algorithm has to be visualized on demand during execution
of the processing chain.
• The system has to provide standardized methods to analyze, compare, and eval-
uate the results obtained from different parameter settings.
• All components have to be integrated in a graphical user interface (GUI) for ease
of use.
Rapid Prototyping Systems exist for different types of applications like Borland’s
Delphi [5] for data base applications, Statemate [9] for industrial embedded systems,
or 3D Studio Max [11] for creating animated virtual worlds.
The remaining part of this chapter introduces a Rapid Prototyping System called
IMPRESARIO (a GUI for rapid prototyping of image processing systems), which was
originally developed as a graphical user interface for image processing with the
LTI-Lib but is not necessarily limited to it. The software can be
Table B.1. Requirements to run IMPRESARIO on a PC. The DirectX interface is needed if you
plan to use a capture device or if image data should be acquired from video streams.

                      Minimum         Recommended
Operating system:     Windows 98 SE   Windows XP
Memory:               256 MB          512 MB
Hard disc space:      500 MB          -
DirectX version:      8.1             -
found on the CD accompanying this book and is provided for a deeper understanding
of the examples in chapters 2, 5.1, and 5.3.
The next paragraph describes the requirements, installation, and deinstallation
of IMPRESARIO. The following paragraphs contain a tutorial on the graphical user
interface and detailed references for the different steps needed to set up an image
processing system. The last three paragraphs provide details about the internals of
IMPRESARIO itself, summarize the IMPRESARIO examples which are part of the
chapters related to image processing, and describe how the user can extend
IMPRESARIO’s functionality with her own algorithms.
Fig. B.1. IMPRESARIO after first start. The GUI is divided into five parts: menu and toolbar,
macro browser, document view, output window, and statusbar.
• To the right of the macro browser, the main workspace displays a view of the current
document. At the moment, this document named “ProcessGraph1” is empty. It
has to be filled with macros from the macro browser.
• Below the main workspace and the macro browser the output window displays
messages generated by the application (e. g. see the tab “System”) and by opened
documents. For each document a tab is added to the output window and named
after the document.
• At the bottom of the main window, the status bar shows help texts for each menu
or toolbar command when it is selected and the overall processing speed of the
current document in milliseconds and frames per second when the system exe-
cutes the active document.
To work with this interface it is necessary to understand the basic components
and operations which are described in the following subsections.
The core component of the system is the document. A document contains a processing
graph and the configuration of the user interface used with this graph. A processing
graph is a directed graph consisting of vertices and directed edges, where
macros represent the vertices and data links represent the edges. For more theoretical
background on graphs see, e. g., [7]. Documents are stored in files with the suffix
“.ipg”, which is an abbreviation for IMPRESARIO processing graph. The commands
to create, open, save, and close documents are located in the “File” menu. Shortcuts
for some file commands can also be found on the standard toolbar (see Tab. B.2). For
further discussion, we open a simple example:
1. Select the “File”→“open” command in the main menu or click the correspond-
ing toolbar button.
2. In the upcoming dialog select the document called “CannyEdgeDetection.ipg”.
3. Confirm the selection by pressing the “Open” button in the dialog.
A new view of the opened document will be created and a new tab will be added
to the output window, both named after the document. The view shows the contents
depicted in Fig. B.2. This is a very simple processing graph consisting of three
vertices and two edges. The vertices are formed by the three macros named “Image
Sequence”, “SplitImage”, and “CannyEdges”, while two unnamed data links, visualized
by black arrows, represent the two directed edges.
It is worth taking a closer look at the macros, because they encapsulate the
algorithms for image processing and thus the graph’s functionality, while the links
determine the data flow in the graph.
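The way data flows along the links of such a chain can be sketched in plain C++; runChain is an illustrative helper processing integers instead of images, and IMPRESARIO’s real graphs are more general than a linear chain:

```cpp
#include <functional>
#include <vector>

// Illustrative sketch of a linear processing chain as in the Canny example:
// each "macro" transforms the data, and each "link" passes one macro's output
// on as the next macro's input.
int runChain(const std::vector<std::function<int(int)>>& macros, int input) {
  int data = input;
  for (const auto& m : macros) data = m(data);  // follow the data links
  return data;
}
```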
B.2.2 Macros
In IMPRESARIO macros are used to integrate algorithms for image processing into the
user interface. A macro has two purposes: First, it contains an algorithm for image
processing which, in general, can be described by its input data, its output data,
and its parameters. This algorithm is executed by IMPRESARIO with the help of the
macro. Second, a macro provides a graphical representation to change the parameters
of the contained algorithm.
Fig. B.3. Opened macro property window of the “SplitImage” macro from the Canny edge
example in Fig. B.2 and a tooltip showing a port’s data type and name. The tooltip appears as
soon as the mouse cursor hovers over a port.
The “SplitImage” macro in the Canny edge example (Fig. B.3) assembles all
LTI-Lib functors which split color images into the components of a certain color
model. The macro takes a color image of type lti::image as input and
splits it into three grey value images of type lti::channel8 representing
the three components of a color model, which can be selected by the macro’s
parameter named Output. In Fig. B.3 the RGB model is currently selected. The first
grey value image of the “SplitImage” macro serves as input for the “CannyEdge”
macro, which contains a single functor (the Canny algorithm [3, 4]). This macro
performs edge detection on the input data and outputs two grey value images of type
lti::channel8 named “Edge channel” and “Orientation channel”. The
edge channel is a black and white image in which all pixels belonging to an edge in
the source image are white. The orientation channel encodes the direction of the
gradient used to detect the edges.
A document may contain as many macros as necessary. They can be added us-
ing the macro browser, deleted, configured, and linked with each other. A detailed
description of handling macros and linking them together is provided in Sect. B.3
and B.4.
But now it is time to see the Canny edge example in action.
Fig. B.4. View on the Canny edge example document in processing mode. Each macro dis-
plays the execution time and its current state. The property window of the Image Sequence
macro shows the name of the currently processed image file.
IMPRESARIO will switch back into edit mode. The “Snap” command deserves a special
note: using this command sets IMPRESARIO to processing mode for exactly
one image. After processing this image, IMPRESARIO stops processing and reactivates
the edit mode.
Now that the Canny edge example is running, the only thing missing is a visual-
ization of the macros’ results.
B.2.4 Viewer
Viewers visualize the output data of every macro. A viewer is opened by double-
clicking a red output port. Fig. B.5 shows two viewers from the Canny edge example.
They visualize the original image “hallway.jpg”, which is accessible via the “Color
Image” output port of the “Image Sequence” macro, and the edge channel of the
“CannyEdge” macro (the left output port of the “CannyEdge” macro).

Fig. B.5. Two viewers visualizing some results of the Canny edge example. The left
viewer shows the original image loaded by the “Image Sequence” macro; the right viewer
displays the “Edge channel” of the “CannyEdge” macro.

When a viewer is opened for the first time it appears to the right of the macro and in
a standardized size. Most times, this size is too small to see anything, because the
image data is adjusted to the window size. The viewer can be resized like every other
window by clicking on its frame and moving the mouse until it reaches the desired
size. Additionally, every viewer provides a toolbar with commands for zooming the
image and adjusting the window size to the image size. Tab. B.4 lists all available
viewer commands.
Besides the toolbar, a viewer also contains a statusbar which is divided into four
panes. The first pane displays the image size in pixels (width x height). When the
mouse is hovered over the image, the second pane shows the image coordinates the
mouse cursor is pointing to; the origin of the coordinate system is the upper left
corner of the image. The pixel value belonging to the current image coordinates can
be found in the third pane. The value depends on the image type, which is shown in
the fourth pane. For an image of type lti::image the value is a vector consisting
of the pixel’s red, green, and blue fractions, each in the range of 0 to 255. For an
lti::channel8 the value is a single integer in the range of 0 to 255, and for an
lti::channel a floating point number. Take a look at chapter A.3.3 for more
information about the different image types.
Now that we know how to visualize the results of the different macros, it is easy to see the influence of parameter changes in a macro on the final result. Once more, take the Canny edge example, start the execution of the processing graph, and open the viewer for the edge channel of the “CannyEdge” macro. After that, open the macro property window of the “SplitImage” macro and change the parameter named Output from the value “RGB” to “CIE Luv” by opening the parameter’s drop down box and selecting the item. You will notice changes in the edge channel. Other values will amplify the changes up to a point where the edge channel can no longer be interpreted in a meaningful way. Experiment with the parameters of both processing macros and open some more viewers to get a feeling for the user interface.
In this tutorial the basic components and operations were introduced, starting with the document as the core component containing the processing graph, which consists of macros and data links. Afterwards, macros encapsulating image processing functionality were introduced.
Fig. B.6 depicts two views of the macro browser, which contains a database with all available macros in the system. Tab. B.5 summarizes the commands available on the macro browser’s toolbar.
The macro browser’s main component is the macro tree list. It provides four different categories to view macros. The first category lists the macros alphabetically by their name. The second category, named “Group”, sorts all macros according to their function. The third category lists the macros by the DLL1 they are located in, and the fourth category by their creator. A category can be accessed by clicking the corresponding tab beneath the macro tree list with the left mouse button. The right part of Fig. B.6 shows the macro browser displaying the group category and all
1 DLL – Dynamic Link Library, see Sect. B.6 for more information.
Adding Functionality to a Document
Fig. B.6. Macro browser with all available macros. Left: Default view with activated name
category. Right: View on the group category with activated filter “Source”.
macros of the group “Source” which will be discussed in detail in Sect. B.5. In each
category macros can be selected by a left click on their name. As soon as a macro is
selected, a more detailed description is displayed in the help area beneath the tabs.
There are two ways to add macros from the macro browser to a document. The first one uses the “Search for” input field. Activate the field with a left click; a blinking cursor appears in the field, and it is now possible to enter characters. The first macro name that matches the entered character sequence is selected in the macro tree list. The search can be narrowed by using the “Filter” drop down list box. The filter options vary depending on the selected category. The name category does not provide any filter options, but, e.g., the group category allows restricting the search to a single group (right part of Fig. B.6). As soon as the Enter key is pressed in the “Search for” input field, the selected macro is added to the active document. It appears in the document view’s upper left corner. The second and more convenient way to add macros works by drag and drop:
1. Click the desired macro with the left mouse button and hold it down.
2. Move the mouse cursor to some free space in the document view while the left
mouse button is still held down.
3. Release the left mouse button in the document view and the selected macro will
be inserted at the mouse cursor’s position.
This operation can be cancelled any time by pressing the Escape key.
After macros have been added to a document, they can be rearranged or removed
from the document.
Macros can be rearranged by a simple drag and drop operation. Move the mouse
cursor onto the desired macro, press and hold down the left mouse button and move
the cursor to the desired location in the document view. A rectangle representing the
macro’s borders will move with the cursor. If the cursor is moved to the view’s right
or bottom edge, the view will start to scroll to reach invisible parts of the document.
As soon as the desired location is reached release the mouse button and the macro is
moved there.
To remove a macro from the document, select it by clicking it with the left mouse button. The macro’s border will turn light green to indicate that it is selected. Execute the “Delete” command in the “Edit” menu to remove the macro. In the current version of IMPRESARIO only one macro can be selected and removed at a time.
To realize data exchange between different macros, they have to be linked. All links are subject to the following rules:
• A link can only be established between an output port (colored red) and an input
port (colored yellow).
• The input port and the output port have to support the same data type. A port’s
data type is displayed as a tooltip when the mouse pointer is hovered over the
port and left there unmoved for a moment.
• An output port may be connected to many input ports, but one input port accepts exactly one incoming link from an output port.
Links are created by drag and drop operations with the mouse by the following
steps:
1. The mouse cursor has to be placed on a red output port or a yellow input port.
The cursor appearance changes according to Tab. B.6. The “Link forbidden”
cursor will appear only if link creation is started from a yellow input port which
is already connected to an output port.
2. If it is allowed to start a link from the selected port (indicated by the “Link
allowed” cursor) press the left mouse button, hold it down and start to move the
mouse to the desired destination port. The system draws a gray line from the
starting point of the link to the current mouse position and the mouse cursor
changes to the “Link forbidden” cursor to indicate an incomplete link (see left
column in Fig. B.7).
3. To connect two macros which are not concurrently visible in the document view, move the cursor to the view’s border. The view will start to scroll without aborting the link operation. Scroll the view until the desired macro becomes visible.
4. As soon as the desired destination port is reached, the system indicates whether it is possible to establish the link or not. If the possible link is valid, the gray line will turn green and the cursor will change to “Link allowed” (middle column in Fig. B.7). In case of an invalid link, the gray line turns red and the cursor appearance does not change (right column in Fig. B.7). Invalid links are indicated in two cases: First, the source and destination port are both input or both output ports. Second, the data types of the source and destination port differ.
5. If a valid link is indicated, release the left mouse button to create the link. The
green line will be replaced by a black arrow. If the mouse button is released
when the system indicates an incomplete or invalid link, the operation will be
cancelled.
Link creation can also be cancelled at any time by pressing the Escape key.
Fig. B.7. Possible appearances of links during creation. Left column: Gray line indicating
incomplete link. Middle column: Green line indicating valid complete link. Right column:
Red line indicating invalid link.
The source macros have to be told where to find the image data. Therefore, they provide customized user interfaces, which are described in detail in the following subsections. All other macros coming with this book provide a standard user interface, which will be discussed at the end of this section.
The Image Sequence macro handles still images of the types Bitmap (BMP), Portable Network Graphics (PNG), and JPEG. These kinds of images may be used as input data to the processing graph, either as single images or as a sequence of several images. For this purpose, the Image Sequence macro maintains so-called sequence files, which carry the suffix “.seq”. Sequence files are simple text files containing the path to one existing image file in each line. They can be generated manually with any text editor or, much more conveniently, with the macro’s user interface depicted in Fig. B.8. At the top, the interface contains a menu bar and a toolbar for image sequence manipulation. The commands are summarized in Tab. B.7. The drop down list box beneath the toolbar contains the currently and recently used sequences for fast access. Beneath the list box, the image list displays the contents of the current sequence.
Creating and Loading Sequences
By default, an empty and unnamed image sequence is generated when a new document is created. A new sequence can be created by selecting the “Create sequence”
command (see Tab. B.7). To load an existing sequence, press the “Load sequence”
button in the toolbar. A standard open file dialog appears where the desired sequence
can be selected. An alternative way to open one of the recently used sequences is to
open the drop down list box and select one of its items. The corresponding sequence
will be opened immediately if it exists.
Fig. B.8. User interface of the Image Sequence macro. It consists of the command toolbar, the
drop down list box with the last opened sequences and the list containing the image files to be
processed.
Fig. B.9. Appearance of the Image Sequence user interface after adding new images. The asterisk appended to the sequence name indicates that the sequence was changed and needs to be saved to preserve the changes.
The system automatically asks whether to save the current sequence if unsaved changes would be lost by a subsequent “Create sequence” or “Load sequence” command.
Image Control During Processing
While the process graph is being processed, the “Play as sequence” command allows switching between two different processing modes of the image sequence. In the first mode, every image is processed in the order it appears in the list. In this mode, the list is grayed out and any interaction with it is impossible. Further, the “Play in loop” command toggles between one and infinitely many iterations of the sequence. The second mode processes only the selected image. It is possible to change the processed image by selecting a different one in the list. The “Play in loop” command is not available in this mode.
While the image graph is being processed, it is not possible to modify the image list in any way.
Image Sequence provides two output ports. The left one contains the currently
loaded image and the right one the path to the file where the image was loaded from.
To visualize any output double click the ports of the macro. A viewer window will
open and display the currently processed data.
With the Video Capture Device macro it is possible to acquire live image data from any capture device which supports Microsoft’s DirectX interface [6] in version 8.1 or above. Most USB and FireWire cameras support this interface [1, 10]. The macro’s user interface is shown in Fig. B.10 and the meaning of the two buttons is described in Tab. B.8.
Fig. B.10. Configuration page of the Video Capture Device macro. The list box shows all detected capture devices and the combo box all capture modes supported by the selected device. In this case a Philips ToUCam has been detected and selected.
Configuring Macros
The main list box in the middle shows all available devices for image capturing
and is updated on program start. If a new capture device is connected, press the “Scan
system” button to update the list box. By default, the first found device is selected
for image acquisition. To change the device, select the corresponding list box entry
with the left mouse button.
The drop down list box beneath the main list box contains all capture modes
supported by the selected device. A capture mode consists of the video resolution
and the used compression method. By default, the first capture mode is selected by
the system. It can be changed by opening the drop down list box and selecting a
different entry with the left mouse button.
These three controls operate in edit mode and are disabled in processing mode. The button “Show device properties” is available in processing mode only (see Fig. B.11). Pressing this button brings up the properties dialog of the selected device; pressing it once again closes the dialog. The properties dialog is normally provided by the device driver, but some drivers do not offer such a dialog. In this case, there won’t be any system reaction when the button is pressed.
Fig. B.11. Configuration page of the Video Capture Device macro in processing mode. In this mode only the “Show device properties” button is enabled, to display and change the device-dependent property dialog.
To see the live image, double-click the red output port of the Video Capture Device macro.
The Video Stream macro obtains image data from video files of type AVI (Audio Video Interleaved [2]) and MPEG-1 (Motion Picture Experts Group). The image data of videos may be compressed with different Codecs (Compressor – Decompressor). To view a video on a computer, the Codec required for decompression has to be installed; if it is not, the video will not play. Furthermore, depending on the Codec being used for decompression, some options may not be available during video playback.
To configure the macro, the graphical interface depicted in Fig. B.12 is provided.
The interface consists of the video file selection area at the top, the video information
area in the middle, and the frame control at the bottom of the window.
Loading a video stream
To load a video file, click the “Open” button in the video file selection area. An open file dialog appears where the desired video file can be selected. After pressing the “Open” button in the file dialog, the selected video file is validated. If it can be opened, its file name is displayed in the box labeled “Video file” and information about the video stream is shown in the video information area (Fig. B.12). This includes the resolution of the video (width x height) in pixels, the total number of frames contained in the video, the duration, and the frame rate. If the selected file cannot be opened, a corresponding message will appear in the output window of the current document. The most probable reason for this failure is a missing Codec for
Fig. B.12. User interface of the Video Stream macro consisting of the video file selection
area, the video information area, and the frame control. In this example a video file named
“Street.avi” has already been opened.
decompression in the case of AVI video files, or an interlaced MPEG video file, which is currently not supported.
After a video is loaded, the macro is ready to be processed.
Video Stream Control During Processing
With the default settings, a video is processed frame by frame, and when the end is reached it starts again from the beginning. The frame control allows changing this standard behavior. It consists of four controls:
• Manual frame control: If this checkbox is selected, the edit box and the slider control for selecting a frame will be activated and the “Continuous Play” checkbox will be disabled. Only the selected frame will be processed until a different one is chosen (see Fig. B.13).
Please note that the availability of manual frame control depends on the Codec being used for decompressing the video. If it is not supported by the Codec, the check box will be disabled and a corresponding warning message will be displayed in the output window of the document.
• Continuous play: This checkbox is only available if “Manual frame control” is not selected and tells the system how to continue processing after the last frame of the video has been reached. If the checkbox is selected, the video will be rewound and processing will continue with the first frame. If the checkbox is not selected, processing will stop after the last frame.
• Frame edit box: If “Manual frame control” is not selected, this control is disabled
and just displays the number of the frame currently processed. During manual
control it is possible to enter the number of the desired frame. Invalid frame
numbers cannot be entered. The entered frame will be processed until a different
frame is selected.
• Frame slider control: The slider is coupled with the frame edit box and is also disabled when “Manual frame control” is switched off. In this case, the slider indicates the approximate position of the currently processed frame in the video. During manual control a frame can be selected by dragging the slider with the mouse. The selected frame will be displayed in the edit control and processed subsequently.
Fig. B.13. Video Stream macro with activated manual frame control. The “Continuous Play” checkbox is disabled and the edit and slider controls allow the selection of a desired frame.
To view the image data of the processed video frame, double-click the red output port of the Video Stream macro. This will bring up a viewer. Finally, it has to be noted that the video is not played at the frame rate denoted in the video information area but as fast as possible, because the software is not intended to be a pure video viewer. The main goal is to allow random access to every frame contained in the video. Because of interframe compression, this sometimes requires more time for decompression than playing the frames in sequential order.
Except for the source macros introduced in the last three subsections, all other macros use a standardized user interface which allows the configuration of their parameters. We already encountered a standard macro property window in the tutorial (Fig. B.3 in Sect. B.2.2). Fig. B.14 shows the standard property window of the “CannyEdge” macro and Tab. B.9 summarizes the available toolbar commands.
Table B.9. Toolbar commands for the standard macro property window
Colors are changed in the same way as files and folders, but a color selection dialog appears instead of the file or folder dialog.
When experimenting with different parameter values, especially in processing mode, it may happen that the results produced by the macros look very strange and cannot be interpreted in any sensible way. This problem can be solved by restoring the default values for all macro parameters. To do this, the macro property window’s toolbar (see Tab. B.9) provides two commands: the leftmost button restores all parameters to their default values, whereas the middle button restores only the default value of the currently selected parameter.
Fig. B.15. View menu. With this menu the parts of the graphical user interface can be hidden
or made visible again.
Advanced Features and Internals of IMPRESARIO
Fig. B.16. The three pages of the process information window displaying the running threads, the window hierarchy of IMPRESARIO, and the linked DLLs. The information has to be updated manually by pressing the “Update” command on the toolbar.
distribution at hand, only the macro execution order is printed in this window. The process information window shows, on separate pages, the threads with their attached windows running in IMPRESARIO, the window hierarchy as defined by the operating system, and the used Dynamic Link Libraries (see Fig. B.16). All information has to be updated manually by pressing the “Update” button on the toolbar.
A Dynamic Link Library (DLL) is a file with the suffix “.dll” which contains executable program code and is loaded into memory by the operating system on application demand [8]. Code within a DLL may be shared among several applications. DLLs play an important role in the concept of IMPRESARIO, as we will see in Sect. B.6.2.
The third component is the blackboard. Fig. B.17 shows a view of the blackboard when it is activated using the menu depicted in Fig. B.15. A blackboard can be considered a place where information can be exchanged globally. This enables macros to define a kind of global variable. For a user of IMPRESARIO it is not very important, but for those readers who might like to develop their own macros it may be of interest; in this case please refer to Sect. B.8 and the online documentation for macro developers. However, the blackboard is not used in the examples included in this book.
Fig. B.17. View of the activated blackboard. The figure shows a filled blackboard from a document which is not part of the distribution coming with this book.
As mentioned in the tutorial, the most important components are the macros. Macros are based on executable code which is stored in Dynamic Link Libraries (DLLs) and loaded at program start. Therefore, the libraries have to be located in a directory known to the program. Take a look at the “System” tab in the output window; it reports the loaded Macro-DLLs. Each Macro-DLL contains one or more macros which are added to the macro browser. The version of IMPRESARIO coming with this book is delivered with several Macro-DLLs: the three DLLs ImageSequence.dll, VideoCapture.dll, and VideoStream.dll contain the three source macros. The two remaining Macro-DLLs, namely macros.dll and ammi.dll, contain all other macros. The former file is publicly available; the latter was programmed exclusively for the examples in this book. The DLL concept allows an easy extension of the system’s functionality without changing the application itself. An Application Programming Interface (API) exists for the development of custom Macro-DLLs. For integration into IMPRESARIO, the developed DLLs just have to be copied to the directories where IMPRESARIO expects them. The directories where IMPRESARIO searches for DLLs can be configured in the settings dialog (see Sect. B.6.3).
Fig. B.18. The two pages of the settings dialog accessible via the menu “Extras”→“Settings...”. Left: Default settings for directories, macro property windows, and viewers. Right: Fictitious declaration of dependent dynamic link libraries. Because no third party products are shipped with the version of IMPRESARIO in this book, this page is empty by default.
Advanced Features and Internals of I MPRESARIO 449
DLL-Directory for Macros: Directory where IMPRESARIO searches for Macro-DLLs at program start. This directory should not be changed; otherwise it may happen that no macros are available the next time IMPRESARIO is used. Check the “System” tab in the output window for loaded Macro-DLLs.
DLL-Directory for dependent files: In this directory, dependent files are stored which may be shipped with third party software. This version of IMPRESARIO contains no third party software, so this directory remains empty and changing it has no effect.
Data directory: Macros store their data in this directory by default.
Document directory: Directory from which document files (*.ipg) are loaded and in which they are stored by default.
Image directory (Input): Still images in the formats Bitmap, Portable Network Graphics, and JPEG are stored in this directory for use in the Image Sequence source.
Image directory (Output): Images produced by macros and viewers are stored here.
B.8.1 Requirements
The API for developing Macro-DLLs is written in C++ because the LTI-Lib has also been developed in this language. To successfully develop new Macro-DLLs the reader has to be familiar with
The necessary files containing the API for Macro-DLLs, as well as further documentation and code examples, can be found on the book’s CD in the file impresario api setup.exe in the Impresario directory. Copy this file to your hard drive and execute it. You will be prompted for a directory; please provide the directory IMPRESARIO is located in. The setup program will decompress several files into a separate subdirectory named macrodev. This subdirectory contains a file called documentation.html with the complete documentation of the API. Please refer to this file for further reading.
IMPRESARIO is freeware and was not developed for this book alone, but in the first place for the community of LTI-Lib users. Updates and bug fixes for IMPRESARIO as well as new macros can be obtained from the World Wide Web at
http://www.techinfo.rwth-aachen.de/Software/Impresario/index.html
References
1. Anderson, D. FireWire System Architecture. Addison-Wesley, 1998. ISBN 0201485354.
2. Austerberry, D. Technology of Video and Audio Streaming. Focal Press, 2002. ISBN
024051694X.
3. Canny, J. Finding Edges and Lines in Images. Technical report, MIT, Artificial Intelli-
gence Laboratory, Cambridge, MA, 1983.
4. Canny, J. A Computational Approach to Edge Detection. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 8(6):679–698, 1986.
5. Cantu, M. Mastering Delphi 7. Sybex Inc, 2003. ISBN 078214201X.
6. Gray, K. The Microsoft DirectX 9 Programmable Graphics Pipeline. Microsoft Press International, 2003. ISBN 0735616531.
7. Gross, J. L. and Yellen, J., editors. Handbook of Graph Theory. CRC Press, 2003. ISBN
1584880902.
8. Hart, J. M. Windows System Programming. Addison-Wesley, 3rd edition, 2004. ISBN
0321256190.
9. i-Logix Inc., Andover, MA. Statemate Magnum Reference Manual, Version 1.2, 1997.
Part-No. D-1100-37-C.
10. Jaff, K. USB Handbook: Based on USB Specification Revision 1.0. Annabooks, 1997.
ISBN 0929392396.
11. Woods, C., Bicalho, A., and Murray, C. Mastering 3D Studio Max 4. Sybex Inc, 2001.
ISBN 0782129382.
Index